NGS and Sanger Pipeline - AbXtract

Category Paths

  • Solution-based/Biologics/Antibody Design

  • Role-based/Bioinformatician

  • Role-based/Biologist

  • Product-based/AbXtract

Description

Processes NGS data (FASTQ files or datasets) for annotation, demultiplexing, relative abundance, enrichment, clustering, and overlap assessment. Sanger data can also be supplied as FASTA, FASTQ, TSV, CSV, Excel, or a dataset; these sequences can be clustered in the context of the NGS sequences and overlapped with them by a given region of interest (ROI).

Promoted Parameters

Title in user interface (promoted name)

Sanger FILE Input

SANGER Input File (FASTA/Q, Excel, TSV, or CSV) (input_fastq_sanger): If using Excel, TSV, or CSV, the input consists of two columns of the form: id, sequence

  • Type: file_in
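
As a rough illustration of the tabular form described above, the following sketch writes one record per line in the form id, sequence using Python's standard csv module. The clone ids and the short sequences are invented placeholders, and because this page does not say whether a header row is expected, none is written here.

```python
import csv
import io


def write_sanger_table(records, handle):
    """Write (id, sequence) pairs as two-field CSV lines."""
    writer = csv.writer(handle, lineterminator="\n")
    for clone_id, sequence in records:
        writer.writerow([clone_id, sequence])


# Hypothetical clone ids with truncated placeholder sequences.
records = [
    ("clone_001", "GAGGTGCAGCTGGTGGAGTCT"),
    ("clone_002", "CAGGTGCAGCTGCAGGAGTCG"),
]

# An in-memory buffer stands in for an open file such as open("sanger.csv", "w", newline="").
buffer = io.StringIO()
write_sanger_table(records, buffer)
sanger_csv_text = buffer.getvalue()
print(sanger_csv_text)
```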

Species Database to Select From (species_sanger): Species group used to generate the database for IgMatcher. NOTE: The Jain et al., 2017 database is available for amino acid sequences only.

  • Required

  • Type: string

  • Default: [‘Human’]

  • Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’, ‘Jain Dataset (amino acid only)’]

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger). (custom_annotation_dna_sanger): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library, or for codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

  • Type: file_in

Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_aa_sanger): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

  • Type: file_in

Is the Input SANGER a Single Domain or Chain (is_single_dom): Indicates whether the SANGER input is a single-chain read (e.g., VH or VL) or a single-domain antibody (VHH). This adjusts the acceptable minimum length, the minimum query, and the minimum length that is flagged. SANGER input is usually treated as full-length VH+VL, similar to PacBio. IMPORTANT NOTE: If turned on, both chain_1 and chain_2 information is provided and is duplicated; use either.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Long-Read (PacBio) FILE Inputs

NGS Input FASTQ (Long-Read/PacBio) (pacbio_input_file): Input FASTQ File

  • Type: file_in

Barcode Table (barcode_table_ngs): XLS/CSV/TSV file containing barcodes in the five-column format Name,5’barcode,3’barcode,barcode_round (e.g., early/late),barcode_group. Do not include a header. If you have only a 5’ barcode, write name,5’barcode,,, and if you have only a 3’ barcode, write name,,3’barcode,,

  • Type: file_in
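
The headerless barcode layout described above can be sketched with Python's standard csv module. This is only an illustration: the sample names, barcode sequences, round labels, and group labels are invented, and the helper names are hypothetical. Writing every row with exactly five fields keeps the empty columns in place when only one barcode is available.

```python
import csv
import io


def barcode_row(name, barcode_5p="", barcode_3p="", barcode_round="", barcode_group=""):
    """Return one five-field row; unused fields stay empty strings."""
    return [name, barcode_5p, barcode_3p, barcode_round, barcode_group]


def write_barcode_table(rows, handle):
    """Write rows without a header, refusing rows that are not five fields wide."""
    writer = csv.writer(handle, lineterminator="\n")
    for row in rows:
        if len(row) != 5:
            raise ValueError(f"expected 5 fields, got {len(row)}: {row}")
        writer.writerow(row)


# Invented sample names and barcodes, for illustration only.
rows = [
    barcode_row("sample_A", "ACGTACGT", "TTGGCCAA", "early", "group_1"),
    barcode_row("sample_B", barcode_5p="GGATCCAA"),  # 5' barcode only
    barcode_row("sample_C", barcode_3p="CCTTAAGG"),  # 3' barcode only
]

buffer = io.StringIO()
write_barcode_table(rows, buffer)
barcode_csv_text = buffer.getvalue()
print(barcode_csv_text)
```

Each printed line contains four commas, so all five columns are present even when most of them are empty.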

Species Database to Select From (species_ngs): Species reference database used to generate the database for IgMatcher

  • Required

  • Type: string

  • Default: [‘Human’]

  • Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library, or for codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

  • Type: file_in

Short-Read (Illumina) FILE Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (illumina_input_file1): Input FASTQ File

  • Type: file_in

NGS Input FASTQ Reverse (Short-Read/Illumina) (illumina_input_file2): Input FASTQ File

  • Type: file_in

Barcode Table (barcode_table_ngs_ill): XLS/CSV/TSV file containing barcodes in the five-column format Name,5’barcode,3’barcode,barcode_round (e.g., early/late),barcode_group. Do not include a header. If you have only a 5’ barcode, write name,5’barcode,,, and if you have only a 3’ barcode, write name,,3’barcode,,

  • Type: file_in

Species Database to Select From (species_illumina): Species reference database used to generate the database for IgMatcher

  • Required

  • Type: string

  • Default: [‘Human’]

  • Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher_illumina): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library, or for codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

  • Type: file_in

Optional DATASET Inputs from Sanger and/or NGS

NGS Dataset Input (optional_input_ngs_file): Typically NGS datasets from Upstream FLOE (Load Either PacBio OR Illumina, Not Both)

  • Type: data_source

SANGER Dataset Input (typically datasets processed by Quick Sanger FLOE) (optional_input_sanger_file): The optional dataset(s) to read records from

  • Type: data_source

Key Sanger Input Parameters

Full/Partial alignment to annotate CDRs (cdr_method_sanger): Align query sequence to the entire germline or to partial regions to annotate. IMPORTANT POINT 1: Full should be used for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using custom annotation file) and ‘partial’ works better with DNA for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).

  • Type: string

  • Default: full

  • Choices: [‘partial’, ‘full’]

Region of Interest (ROI) For Condensing Sequences (roi_sanger_condense): Condenses the Sanger sequences by the selected ROI, rank-ordered by abundance. IMPORTANT: For each ROI, only the most abundant full-length sequence is kept and the other full-length sequences are removed. If two sequences have the same full-length count, one or the other is picked.

  • Required

  • Type: string

  • Default: Full-Length

  • Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]
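
The condensing behavior described for this parameter can be pictured with a small sketch. It is an assumption-laden illustration, not the floe's implementation: the function name, the (ROI, full-length) tuple layout, and the toy sequences are all hypothetical. Records are grouped by ROI, the most abundant full-length sequence is kept for each ROI, and the result is ordered by ROI abundance.

```python
from collections import Counter, defaultdict


def condense_by_roi(records):
    """Keep one full-length sequence per ROI.

    records: iterable of (roi, full_length) tuples.
    Returns a list of (roi, kept_full_length, kept_count, roi_total),
    sorted by roi_total in descending order.
    """
    per_roi = defaultdict(Counter)
    for roi, full_length in records:
        per_roi[roi][full_length] += 1

    condensed = []
    for roi, counts in per_roi.items():
        # On a tie, most_common returns the sequence that was seen first.
        full_length, kept_count = counts.most_common(1)[0]
        condensed.append((roi, full_length, kept_count, sum(counts.values())))

    condensed.sort(key=lambda row: row[3], reverse=True)
    return condensed


# Toy data: placeholder ROI strings and placeholder full-length labels.
records = [
    ("ARDYW", "SEQ_A"),
    ("ARDYW", "SEQ_A"),
    ("ARDYW", "SEQ_B"),
    ("ARGGW", "SEQ_C"),
]
condensed = condense_by_roi(records)
print(condensed)
```

In this toy input, SEQ_B shares an ROI with the more abundant SEQ_A and is therefore dropped, which mirrors the warning that condensing removes full-length sequences.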

Key Downstream Parameters

Region of Interest For Sanger overlap (roi_sanger): Indicate the region of interest (ROI) to map back to Sanger clones

  • Required

  • Type: string

  • Default: CDR3 Chain_2 (Downstream Chain)

  • Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’]

Region of Interest For Enrichment and Clustering (roi_cluster): Indicate the region of interest for processing; only the top representative full-length sequence will be kept. NOTE: If the input is Illumina, only CDR3 (Chain_1/Upstream Chain) is used for enrichment, relative abundance, and clustering, and the overlap to Sanger is adjusted to heavy CDR3.

  • Required

  • Type: string

  • Default: HCDR3 and LCDR3

  • Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]

Clustering Type (cluster_type_ngs): Cluster type to apply to sequencing dataset

  • Required

  • Type: string

  • Default: AbScan

  • Choices: [‘AbScan’, ‘Unique Only’, ‘Levenshtein Distance’, ‘Hamming Distance’]
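
For readers unfamiliar with the two distance-based choices, the sketch below shows the textbook definitions of Hamming and Levenshtein distance. It does not reproduce how the floe clusters sequences, and AbScan is not covered; the function names and the example strings are illustrative only.

```python
def hamming(a, b):
    """Number of mismatched positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        previous = current
    return previous[-1]


print(hamming("ARDYW", "ARGYW"))       # one substitution
print(levenshtein("ARDYW", "ARDYYW"))  # one insertion
```

The practical difference is that Hamming distance can only compare regions of identical length, whereas Levenshtein distance also tolerates insertions and deletions.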

Keep Only Functional Sequences (At NGS Level) (filter_functional): Eliminates non-functional sequences, truncations, stop codons, and frame-shifts. NOTE: This does not apply to Sanger sequences, which have the same option as a hidden variable.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Exclude Values That Did Not Match In-Line Barcode (exclude_unknown): If True, will exclude unknown values that did not have a barcode match, unless there is only one barcode for the entire NGS population.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Key Liability Parameters

Polyspecificity Liabilities (liability_choices_poly): polyspecificity liabilities to quantify

  • Type: string

  • Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

  • Choices: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘YY - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’]

Deamidation Liabilities (liability_choices_deam): deamidation liabilities to quantify

  • Type: string

  • Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

  • Choices: [‘N[GSTN] - Deamidation’, ‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GN[FYTG] - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Glycosylation Liabilities (liability_choices_glyc): glycosylation liabilities to quantify

  • Type: string

  • Default: [‘NXT/S - Glycosylation’]

  • Choices: [‘NXT/S - Glycosylation’, ‘NXT - Glycosylation’, ‘NXS - Glycosylation’]

Hydrolysis Liabilities (liability_choices_hydrolysis): hydrolysis liabilities to quantify

  • Type: string

  • Default: [‘DP - Hydrolysis’]

  • Choices: [‘DP - Hydrolysis’]

Isomerization Liabilities (liability_choices_iso): isomerization liabilities to quantify

  • Type: string

  • Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

  • Choices: [‘D[GSD] - Isomerization’, ‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]
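
Several of the liability labels above read like short sequence motifs, so a regular-expression scan is one way to picture what quantifying them might involve. The sketch below is an assumption rather than the floe's method: bracketed labels are read as regex character classes, X in NXT/S is treated as any residue, and the function name and test peptide are invented.

```python
import re

# Motif labels taken from the choices above, mapped to assumed regex equivalents.
LIABILITY_PATTERNS = {
    "N[GSTN] - Deamidation": r"N[GSTN]",
    "GN[FYTG] - Deamidation": r"GN[FYTG]",
    "NXT/S - Glycosylation": r"N.[ST]",  # assumption: X matches any residue
    "DP - Hydrolysis": r"DP",
    "D[GSD] - Isomerization": r"D[GSD]",
}


def find_liabilities(sequence):
    """Return {label: [0-based start positions]} for motifs found in sequence."""
    hits = {}
    for label, pattern in LIABILITY_PATTERNS.items():
        # A zero-width lookahead reports overlapping motifs, e.g. "NN" twice in "NNN".
        positions = [m.start() for m in re.finditer(f"(?={pattern})", sequence)]
        if positions:
            hits[label] = positions
    return hits


# Invented peptide containing one example of each motif.
liabilities = find_liabilities("ARDGNGSDPW")
for label, positions in liabilities.items():
    print(label, positions)
```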

Biophysical Liabilities (liability_choices_charge): Net charge or hydropathy liabilities to quantify

  • Type: string

  • Default: [‘Charge (>1)’]

  • Choices: [‘Charge (>-1)’, ‘Charge (>0)’, ‘Charge (>1)’, ‘Charge (>2)’, ‘Charge (>3)’, ‘Charge (>4)’, ‘Parker Hydropathy (<0.0)’, ‘Parker Hydropathy (<-0.1)’, ‘Parker Hydropathy (<-0.2)’, ‘Parker Hydropathy (<-0.3)’, ‘Parker Hydropathy (<-0.4)’, ‘Parker Hydropathy (<-0.5)’, ‘Parker Hydropathy (<-0.6)’, ‘Parker Hydropathy (<-0.7)’, ‘Parker Hydropathy (<-0.8)’, ‘Parker Hydropathy (<-0.9)’, ‘Parker Hydropathy (<-1.0)’, ‘Parker Hydropathy (<-2.0)’, ‘Parker Hydropathy (<-3.0)’, ‘Parker Hydropathy (<-4.0)’, ‘Parker Hydropathy (<-5.0)’]
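
The charge thresholds above can be read as a cutoff on net charge. This page does not state how the charge is computed (which region, which pH, or whether histidine counts), so the sketch below uses a deliberately simple assumption: lysine and arginine count as +1, and aspartate and glutamate count as -1. The function names and the example peptides are hypothetical.

```python
def net_charge(sequence, count_histidine=False):
    """Crude net charge: (K + R [+ H]) minus (D + E)."""
    positive = sequence.count("K") + sequence.count("R")
    if count_histidine:
        positive += sequence.count("H")
    negative = sequence.count("D") + sequence.count("E")
    return positive - negative


def exceeds_charge(sequence, threshold=1):
    """True when the crude net charge is strictly greater than the threshold."""
    return net_charge(sequence) > threshold


print(net_charge("ARKRDW"))         # three positive residues, one negative
print(exceeds_charge("ARKRDW", 1))  # compared against the default-style cutoff of 1
print(exceeds_charge("ADEW", 1))
```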

Cysteine Liabilities (liability_choices_cysteine): cysteine-based liabilities to quantify

  • Type: string

  • Default: [‘Unpaired Cysteine’]

  • Choices: [‘Unpaired Cysteine’, ‘Any Cysteine’]

Upstream Long-Read (PacBio) OR Short-Read (Illumina) AND Sanger Output Names

Output Basename of the Upstream Long-Read/PacBio Datasets (consolidate_out): This dataset contains all Long-Read/PacBio files processed immediately after IgMatcher and before downstream processing (if barcode used, will use the sample name)

  • Required

  • Type: dataset_out

  • Default: up.long_read

Output Name of the Sanger Dataset (sanger_out): This dataset contains all SANGER files processed immediately after IgMatcher and before downstream processing.

  • Required

  • Type: dataset_out

  • Default: up.sanger

Output Basename of the Upstream Illumina Datasets (consolidate_out_ill): This dataset contains all Short-Read/Illumina files processed immediately after IgMatcher and before downstream processing (if barcode used, will use the sample name)

  • Required

  • Type: dataset_out

  • Default: up.short_read

Output CSV Filename for the SANGER upstream file (file_out_sanger_csv): All upstream SANGER records (before downstream processing) are written to a CSV file; the filename must have the .csv extension

  • Required

  • Type: file_out

  • Default: upstream_sanger.csv

Downstream Output Names

Output Name of the Consolidated Dataset (output_out): A consolidated dataset in which all sample names and barcode groups belong to the same dataset

  • Required

  • Type: dataset_out

  • Default: down.consolidated

Output Basename of the Downstream Long-Read/PacBio Datasets (process_out): All records are written to downstream datasets according to group name, with this base name appended to output

  • Required

  • Type: dataset_out

  • Default: down.long_read

Output Basename of the Downstream Short-Read/Illumina Datasets (process_out_ill): All records are written to downstream datasets according to group name, with this base name appended to output

  • Required

  • Type: dataset_out

  • Default: down.short_read

Output CSV Filename (file_out_csv): All records are written to the downstream CSV file; the filename must have the .csv extension

  • Required

  • Type: file_out

  • Default: down.consolidated.csv

Failure Output

Failed Dataset Output Name (fout): Contains failed records from both upstream and downstream processes

  • Required

  • Type: dataset_out

  • Default: problematic.ngs_processing_with_sanger