
Parsers

The parsers module extracts metadata from sequencer-specific files: RunInfo.xml, RunParameters.xml, and SampleSheet.csv. It supports several sequencer types (MiSeq, NextSeq, NovaSeq, and others) and both the v1 and v2 SampleSheet formats.

Key Features

  • Extensible parser architecture with a common base class
  • Specialized parsers for different file types
  • Parser factory for selecting the appropriate parser
  • Support for multiple sequencer types (MiSeq, NextSeq, NovaSeq, etc.)
  • Support for different file formats (including SampleSheet v2)

Available Parsers

The following parsers are available:

  • RunInfoParser: Parses RunInfo.xml files
  • RunParametersParser: Parses RunParameters.xml files
  • SampleSheetParser: Parses SampleSheet.csv files (both v1 and v2 formats)

Basic Usage

Using the Parser Factory

The easiest way to use the parsers is through the parser factory, which selects the appropriate parser based on the file name:

from rodrunner.parsers.factory import ParserFactory

# Create a parser factory
factory = ParserFactory()

# Parse a file
metadata = factory.parse_file("/path/to/RunInfo.xml")
print(metadata)

# Parse a directory
metadata_dict = factory.parse_directory("/path/to/sequencer/run")
print(metadata_dict["RunInfo.xml"])
print(metadata_dict["RunParameters.xml"])
print(metadata_dict["SampleSheet.csv"])
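
To make the idea of selecting a parser by file name concrete, here is a minimal sketch of that dispatch pattern. It is not rodrunner's actual implementation: the `PARSER_BY_FILENAME` table, the `select_parser_name` function, and the case-insensitive matching are all assumptions made for illustration.

```python
import os
from typing import Dict, Optional

# Hypothetical lookup table: lower-cased file name -> name of the parser class
# that the documentation lists for that file.
PARSER_BY_FILENAME: Dict[str, str] = {
    "runinfo.xml": "RunInfoParser",
    "runparameters.xml": "RunParametersParser",
    "samplesheet.csv": "SampleSheetParser",
}


def select_parser_name(file_path: str) -> Optional[str]:
    """Return the parser name for a path, or None if the file is not recognized."""
    # Only the final path component matters; matching ignores case (an assumption).
    filename = os.path.basename(file_path).lower()
    return PARSER_BY_FILENAME.get(filename)


print(select_parser_name("/path/to/RunInfo.xml"))
print(select_parser_name("/path/to/notes.txt"))
```

A table lookup like this keeps the mapping in one place, so adding a file type only means adding an entry.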

Using Individual Parsers

You can also use the individual parsers directly:

from rodrunner.parsers.runinfo import RunInfoParser
from rodrunner.parsers.runparameters import RunParametersParser
from rodrunner.parsers.samplesheet import SampleSheetParser

# Parse RunInfo.xml
run_info_parser = RunInfoParser()
run_info_metadata = run_info_parser.parse("/path/to/RunInfo.xml")
print(run_info_metadata)

# Parse RunParameters.xml
run_parameters_parser = RunParametersParser()
run_parameters_metadata = run_parameters_parser.parse("/path/to/RunParameters.xml")
print(run_parameters_metadata)

# Parse SampleSheet.csv
samplesheet_parser = SampleSheetParser()
samplesheet_metadata = samplesheet_parser.parse("/path/to/SampleSheet.csv")
print(samplesheet_metadata)

Metadata Structure

RunInfo.xml Metadata

The RunInfo parser extracts the following metadata:

{
    "run_id": "220101_M00001_0001_000000000-A1B2C",
    "instrument": "M00001",
    "flowcell": "000000000-A1B2C",
    "date": "1/1/2022",
    "reads": [
        {
            "number": "1",
            "num_cycles": "151",
            "is_indexed_read": "N"
        },
        {
            "number": "2",
            "num_cycles": "8",
            "is_indexed_read": "Y"
        },
        # ...
    ],
    "flowcell_layout": {
        "lane_count": "1",
        "surface_count": "2",
        "swath_count": "1",
        "tile_count": "14"
    }
}
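
Every value in the structure above is a string, so numeric fields need converting before you do arithmetic on them. The following sketch totals cycles by read type. The `summarize_reads` helper is hypothetical and not part of rodrunner, and the input dictionary is copied from the example above.

```python
from typing import Any, Dict, List

# Sample metadata in the documented shape (values are strings).
run_info: Dict[str, Any] = {
    "run_id": "220101_M00001_0001_000000000-A1B2C",
    "reads": [
        {"number": "1", "num_cycles": "151", "is_indexed_read": "N"},
        {"number": "2", "num_cycles": "8", "is_indexed_read": "Y"},
    ],
}


def summarize_reads(reads: List[Dict[str, str]]) -> Dict[str, int]:
    """Convert cycle counts to integers and total them by read type."""
    summary = {"template_cycles": 0, "index_cycles": 0}
    for read in reads:
        cycles = int(read["num_cycles"])
        # "Y" marks an index read; anything else is treated as a template read.
        if read["is_indexed_read"] == "Y":
            summary["index_cycles"] += cycles
        else:
            summary["template_cycles"] += cycles
    return summary


read_summary = summarize_reads(run_info["reads"])
print(read_summary)
```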

RunParameters.xml Metadata

The RunParameters parser extracts the following metadata:

{
    "run_id": "220101_M00001_0001_000000000-A1B2C",
    "scanner_id": "M00001",
    "rta_version": "2.4.0.3",
    "chemistry": "Amplicon",
    "application_name": "MiSeq Control Software",
    "application_version": "4.0.0.1769",
    "experiment_name": "Test Run"
}

SampleSheet.csv Metadata (v1)

The SampleSheet parser extracts the following metadata for v1 format:

{
    "header": {
        "IEMFileVersion": "5",
        "Date": "1/1/2022",
        "Workflow": "GenerateFASTQ",
        "Application": "FASTQ Only",
        "Instrument Type": "MiSeq",
        "Assay": "Nextera XT",
        "Index Adapters": "Nextera XT Index Kit (96 Indexes, 384 Samples)",
        "Chemistry": "Amplicon"
    },
    "reads": ["151", "151"],
    "settings": {
        "ReverseComplement": "0",
        "Adapter": "CTGTCTCTTATACACATCT"
    },
    "data": [
        {
            "Sample_ID": "Sample1",
            "Sample_Name": "Sample1",
            "Sample_Plate": "Plate1",
            "Sample_Well": "A01",
            "Index_Plate_Well": "A01",
            "I7_Index_ID": "N701",
            "index": "TAAGGCGA",
            "I5_Index_ID": "S501",
            "index2": "TAGATCGC",
            "Sample_Project": "Project1",
            "Description": "Description1"
        },
        # ...
    ]
}

SampleSheet.csv Metadata (v2)

The SampleSheet parser extracts the following metadata for v2 format:

{
    "header": {
        "FileFormatVersion": "2",
        "RunName": "Test Run",
        "InstrumentPlatform": "NextSeq 2000",
        "InstrumentType": "NextSeq 2000"
    },
    "reads": {
        "Read1Cycles": "151",
        "Read2Cycles": "151",
        "Index1Cycles": "10",
        "Index2Cycles": "10"
    },
    "bclconvert_settings": {
        "AdapterRead1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
        "AdapterRead2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    },
    "bclconvert_data": [
        {
            "Sample_ID": "Sample1",
            "Index": "ATCACGTT",
            "Index2": "AACGTGAT"
        },
        # ...
    ],
    "cloud_settings": {
        "Cloud_LOT_Enabled": "true"
    },
    "cloud_data": [
        {
            "Sample_ID": "Sample1",
            "Project": "Project1"
        },
        # ...
    ]
}
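
The v1 and v2 structures store their sample rows under different keys ("data" and "bclconvert_data"), so code that handles both formats needs to branch. Here is a minimal sketch of one way to do that. The `get_samples` and `get_sample_ids` helpers are hypothetical, and detecting v2 by the presence of "FileFormatVersion" in the header is an assumption based only on the two examples above.

```python
from typing import Any, Dict, List


def get_samples(samplesheet: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the per-sample rows from either documented layout."""
    header = samplesheet.get("header", {})
    if "FileFormatVersion" in header:
        # v2 layout: sample rows are stored under "bclconvert_data"
        return samplesheet.get("bclconvert_data", [])
    # v1 layout: sample rows are stored under "data"
    return samplesheet.get("data", [])


def get_sample_ids(samplesheet: Dict[str, Any]) -> List[str]:
    """Collect Sample_ID values, which both layouts provide."""
    return [row["Sample_ID"] for row in get_samples(samplesheet)]


# Trimmed-down copies of the v1 and v2 examples above.
v1_sheet = {
    "header": {"IEMFileVersion": "5"},
    "data": [{"Sample_ID": "Sample1", "index": "TAAGGCGA"}],
}
v2_sheet = {
    "header": {"FileFormatVersion": "2"},
    "bclconvert_data": [{"Sample_ID": "Sample1", "Index": "ATCACGTT"}],
}

print(get_sample_ids(v1_sheet))
print(get_sample_ids(v2_sheet))
```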

Advanced Usage

Validating Metadata

Each parser has a validate method that returns True when the extracted metadata is valid and False otherwise:

from rodrunner.parsers.runinfo import RunInfoParser

# Parse RunInfo.xml
run_info_parser = RunInfoParser()
metadata = run_info_parser.parse("/path/to/RunInfo.xml")

# Validate the metadata
if run_info_parser.validate(metadata):
    print("Metadata is valid")
else:
    print("Metadata is invalid")

Creating Custom Parsers

You can create custom parsers by extending the BaseParser class:

from rodrunner.parsers.base import BaseParser
from typing import Dict, Any

class CustomParser(BaseParser):
    def parse(self, file_path: str) -> Dict[str, Any]:
        # Implement parsing logic
        metadata = {}
        with open(file_path, 'r') as f:
            # Parse the file
            pass
        return metadata

    def validate(self, metadata: Dict[str, Any]) -> bool:
        # Implement validation logic
        return True
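
As a more concrete illustration of the skeleton above, the sketch below parses a hypothetical file format with one key=value pair per line. To keep the example self-contained, it defines a stand-in `BaseParser` with the two methods shown above. The real `rodrunner.parsers.base.BaseParser` may differ, and `KeyValueParser` and its `REQUIRED_KEYS` are assumptions made for illustration.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseParser(ABC):
    """Stand-in for rodrunner.parsers.base.BaseParser (assumed interface)."""

    @abstractmethod
    def parse(self, file_path: str) -> Dict[str, Any]: ...

    @abstractmethod
    def validate(self, metadata: Dict[str, Any]) -> bool: ...


class KeyValueParser(BaseParser):
    """Hypothetical parser for files with one 'key=value' pair per line."""

    REQUIRED_KEYS = ("run_id",)  # hypothetical requirement for this example

    def parse(self, file_path: str) -> Dict[str, Any]:
        metadata: Dict[str, Any] = {}
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, comments, and lines without a separator
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                metadata[key.strip()] = value.strip()
        return metadata

    def validate(self, metadata: Dict[str, Any]) -> bool:
        # Valid only if every required key is present with a non-empty value
        return all(metadata.get(key) for key in self.REQUIRED_KEYS)


# Write a small sample file, parse it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".custom", delete=False) as tmp:
    tmp.write("# example\nrun_id = 220101_M00001_0001_000000000-A1B2C\nchemistry=Amplicon\n")
    sample_path = tmp.name

parser = KeyValueParser()
parsed = parser.parse(sample_path)
os.remove(sample_path)

print(parsed)
print(parser.validate(parsed))
```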

Extending the Parser Factory

You can extend the parser factory to support custom parsers:

from rodrunner.parsers.factory import ParserFactory
from my_custom_parsers import CustomParser

class ExtendedParserFactory(ParserFactory):
    def get_parser(self, file_path: str):
        # Check for custom file types
        if file_path.endswith(".custom"):
            return CustomParser()

        # Fall back to the default parser factory
        return super().get_parser(file_path)

Examples

Extracting Metadata from a Sequencer Run

from rodrunner.parsers.factory import ParserFactory

# Create a parser factory
factory = ParserFactory()

# Parse a sequencer run directory
run_dir = "/path/to/sequencer/run"
metadata = factory.parse_directory(run_dir)

# Extract key metadata
run_id = metadata["RunInfo.xml"]["run_id"]
instrument = metadata["RunInfo.xml"]["instrument"]
chemistry = metadata["RunParameters.xml"]["chemistry"]
# "data" holds the sample rows of a v1 sample sheet; v2 sheets use "bclconvert_data"
sample_count = len(metadata["SampleSheet.csv"]["data"])

print(f"Run ID: {run_id}")
print(f"Instrument: {instrument}")
print(f"Chemistry: {chemistry}")
print(f"Sample Count: {sample_count}")

# Print sample information (Sample_Project is a v1 sample sheet column)
print("\nSamples:")
for sample in metadata["SampleSheet.csv"]["data"]:
    print(f"  {sample['Sample_ID']} ({sample['Sample_Project']})")

Converting Metadata to iRODS Metadata

from rodrunner.parsers.factory import ParserFactory
from rodrunner.config import get_config
from rodrunner.irods.client import iRODSClient
import os

# Create a parser factory
factory = ParserFactory()

# Parse a sequencer run directory
run_dir = "/path/to/sequencer/run"
metadata = factory.parse_directory(run_dir)

# Extract key metadata for iRODS (all values are strings)
irods_metadata = {
    "run_id": metadata["RunInfo.xml"]["run_id"],
    "instrument": metadata["RunInfo.xml"]["instrument"],
    "date": metadata["RunInfo.xml"]["date"],
    "chemistry": metadata["RunParameters.xml"]["chemistry"],
    # "data" is the v1 sample sheet layout; v2 sheets use "bclconvert_data"
    "sample_count": str(len(metadata["SampleSheet.csv"]["data"])),
    # Hard-coded for this example; set it to match your instrument
    "run_type": "miseq",
    "status": "metadata_extracted"
}

# Create an iRODS client
config = get_config()
irods_client = iRODSClient(config.irods)

# Upload the run directory with metadata
irods_path = f"/tempZone/home/rods/sequencer/{os.path.basename(run_dir)}"
coll = irods_client.upload_directory(run_dir, irods_path, metadata=irods_metadata)

print(f"Uploaded run to {irods_path} with metadata")
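
The example above picks individual fields by hand because iRODS metadata is stored as flat attribute/value pairs, while the parsers return nested dictionaries and lists. If you want to carry over more fields, one option is to flatten the nested structure first. The `flatten_metadata` helper below is a hypothetical sketch, not part of rodrunner, and the dotted key naming is an arbitrary choice.

```python
from typing import Any, Dict


def flatten_metadata(value: Any, prefix: str = "", sep: str = ".") -> Dict[str, str]:
    """Flatten nested dicts and lists into a single dict of string values."""
    flat: Dict[str, str] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten_metadata(child, child_prefix, sep))
    elif isinstance(value, (list, tuple)):
        # List items are keyed by their position, e.g. "reads.0.number"
        for index, child in enumerate(value):
            child_prefix = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten_metadata(child, child_prefix, sep))
    else:
        flat[prefix] = str(value)
    return flat


# A trimmed-down copy of the RunInfo.xml metadata structure shown earlier.
nested = {
    "run_id": "220101_M00001_0001_000000000-A1B2C",
    "reads": [{"number": "1", "num_cycles": "151"}],
    "flowcell_layout": {"lane_count": "1"},
}

flat_metadata = flatten_metadata(nested)
for attribute, attr_value in flat_metadata.items():
    print(f"{attribute} = {attr_value}")
```

Flattening a full sample sheet this way can produce a large number of attributes, so it is usually worth selecting the relevant subsections before flattening.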

API Reference

For detailed API documentation, see the Parsers API Reference.