NGS Pipeline - AbXtract

Process NGS FASTQ files or datasets for annotation, demultiplexing, relative abundance, enrichment, clustering, and overlap assessment.

Long-Read (PacBio) FILE/SELECTION Inputs

NGS Input FASTQ (Long-Read, E.G., PacBio) (file_in) : Input FASTQ File.

Barcode Table (Long-Read/PacBio IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes in column-, comma-, or tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include a header row. If you have only a 5’ barcode, the format is “name, 5’barcode,,,”. If you have only a 3’ barcode, the format is “name,,3’barcode,,,”.
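As a rough illustration of a header-less barcode table, the sketch below writes three rows as CSV text. It assumes a five-column layout (name, 5’ barcode, 3’ barcode, round, group) inferred from the “name,...” examples above; the sample names, barcode sequences, and the `barcode_table_csv` helper are all hypothetical, so check the exact column count against a table that the floe accepts.

```python
import csv
import io

# Hypothetical rows. The five-column layout (name, 5' barcode, 3' barcode,
# round, group) is an assumption, not a confirmed specification.
rows = [
    ("sample_A", "ACGTACGT", "TGCATGCA", "1", "antigen_X"),
    ("sample_B", "GGATCCAA", "", "", ""),   # 5' barcode only
    ("sample_C", "", "TTGGCCAA", "", ""),   # 3' barcode only
]

def barcode_table_csv(rows):
    """Serialize rows as CSV text with no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

text = barcode_table_csv(rows)
print(text, end="")
```

Empty fields are kept as empty strings so that every row has the same number of separators.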

Species Database to Select From (Long-Read/PacBio IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Database with Alignment Scheme of Interest (for NGS). (Long-Read/PacBio IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, this file overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) The heavy chain representative should contain space-separated DNA bases:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) The light chain representative should contain space-separated DNA bases:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
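The layout described above (a scaffold name followed by seven space-separated region fields in FR1 to FR4 order, as shown for the heavy chain) can be checked before upload with a small script. This is only a sketch: the `parse_scaffold_line` helper is hypothetical, the DNA fragments are short placeholders rather than real framework or CDR sequences, and restricting bases to A/C/G/T is an assumption.

```python
# Region order as listed for the heavy chain representative.
REGIONS = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]

def parse_scaffold_line(line):
    """Split one database line into (match_name, {region: dna})."""
    tokens = line.split()
    if len(tokens) != len(REGIONS) + 1:
        raise ValueError(
            f"expected a name plus {len(REGIONS)} regions, got {len(tokens)} fields"
        )
    name, *segments = tokens
    for region, dna in zip(REGIONS, segments):
        # Assumption: only unambiguous bases are allowed.
        if set(dna.upper()) - set("ACGT"):
            raise ValueError(f"{name}: non-DNA characters in {region}")
    return name, dict(zip(REGIONS, segments))

# Placeholder fragments, not real antibody sequences.
example = "example_vh CAGGTG GGATTC TGGGTC ATCAGC CGATTC GCGAGA TGGGGC"
name, regions = parse_scaffold_line(example)
print(name, regions["CDR3"])
```

A line with a missing region fails the field-count check, which catches the most common formatting mistake early.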

Short-Read (Illumina) FILE/SELECTION Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (file_in) : Path to the FASTQ file containing Illumina sequences (read 1).

NGS Input FASTQ Reverse (Short-Read/Illumina) (file_in) : Path to the FASTQ file containing Illumina sequences (read 2).

Barcode Table (Short-Read/Illumina IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes in column-, comma-, or tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include a header row. If you have only a 5’ barcode, the format is “name, 5’barcode,,,”. If you have only a 3’ barcode, the format is “name,,3’barcode,,,”.

Species Database to Select From (Short-Read/Illumina IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Database with Alignment Scheme of Interest (for NGS). (Short-Read/NGS IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, this file overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) The heavy chain representative should contain space-separated DNA bases:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) The light chain representative should contain space-separated DNA bases:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Optional DATASET Inputs From Sanger And/Or NGS

Optional NGS Dataset For Input (Typically NGS Datasets From Upstream FLOE) (data_source) : The optional dataset(s) to read records from.

Key Downstream Parameters

Region of Interest For Enrichment and Clustering (string) : Indicate the region of interest (ROI) for processing; only the top representative full-length sequence is kept. NOTE: If the input is Illumina, only CDR3 (Chain_1/Upstream Chain) is used for enrichment, relative abundance, and clustering.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Clustering Type (string) : Clustering method to apply to the sequencing dataset. Non-redundant full-length sequences are assigned a cluster ID by ROI according to the selected method. NOTE: Unique Only is the same as 100% homology over the provided ROI.

Default: AbScan

Choices: AbScan, Unique Only, Levenshtein Distance, Hamming Distance
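To make the difference between the two distance-based choices concrete, the sketch below computes Hamming and Levenshtein distances between short ROI strings. It is a generic textbook implementation, not the floe's own code, and it does not attempt to describe how AbScan clusters; the example sequences are made up.

```python
def hamming(a, b):
    """Number of mismatched positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(hamming("ARDYW", "ARDFW"))       # one substitution
print(levenshtein("ARDYW", "ARDYYW"))  # one insertion
```

Hamming distance can only compare ROIs of identical length, whereas Levenshtein distance also tolerates insertions and deletions. A distance of zero under either metric corresponds to identical ROIs.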

Keep Only Functional Sequences (At NGS Level) (boolean) : Eliminates non-functional sequences (truncations, stop codons, frameshifts). NOTE: This does not apply to Sanger sequences (the same option for Sanger is in the hidden parameters).

Default: True

Exclude Values That Did Not Match In-Line Barcode (boolean) : If True, will exclude unknown values that did not have a barcode match.

Default: True

Key Liability Parameters

Biophysical Liabilities (string) : Net charge or hydropathy liabilities to quantify.

Default: [‘Charge (>1)’]

Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)
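As a simple illustration of a charge threshold such as “Charge (>1)”, the sketch below counts K and R as +1 and D and E as -1, then flags a sequence whose net charge is above the threshold. The counting scheme (including ignoring histidine and termini) is an assumption made for this example; the floe may calculate net charge differently. The function names and test sequences are hypothetical.

```python
POSITIVE = set("KR")  # assumed +1 residues
NEGATIVE = set("DE")  # assumed -1 residues

def net_charge(sequence):
    """Crude net charge: count of K/R minus count of D/E."""
    positives = sum(aa in POSITIVE for aa in sequence)
    negatives = sum(aa in NEGATIVE for aa in sequence)
    return positives - negatives

def exceeds_charge(sequence, threshold=1):
    """True when the net charge is strictly above the threshold."""
    return net_charge(sequence) > threshold

print(net_charge("ARRKDY"), exceeds_charge("ARRKDY"))
print(net_charge("ARDY"), exceeds_charge("ARDY"))
```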

Glycosylation Liabilities (string) : Glycosylation liabilities to quantify.

Default: [‘NXT/S - Glycosylation’]

Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation

Cysteine Liabilities (string) : Cysteine-based liabilities to quantify.

Default: [‘Unpaired Cysteine’]

Choices: Unpaired Cysteine, Any Cysteine

Deamidation Liabilities (string) : Deamidation liabilities to quantify.

Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation

Hydrolysis Liabilities (string) : Hydrolysis liabilities to quantify.

Default: [‘DP - Hydrolysis’]

Choices: DP - Hydrolysis

Isomerization Liabilities (string) : Isomerization liabilities to quantify.

Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization
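The bracketed choices above, such as N[GSTN] and D[GSD], read naturally as regular expressions. The sketch below scans a sequence for the deamidation, hydrolysis, and isomerization motifs using Python's `re` module. The `find_liabilities` helper and the test sequence are hypothetical, and reporting overlapping matches through a lookahead is a choice made for this example rather than documented floe behavior.

```python
import re

# Motif names and patterns taken from the parameter choices above.
LIABILITY_PATTERNS = {
    "N[GSTN] - Deamidation": r"N[GSTN]",
    "GN[FYTG] - Deamidation": r"GN[FYTG]",
    "QG - Glutamine Deamidation": r"QG",
    "DP - Hydrolysis": r"DP",
    "D[GSD] - Isomerization": r"D[GSD]",
}

def find_liabilities(sequence):
    """Return (name, start, matched_text) tuples sorted by position."""
    hits = []
    for name, pattern in LIABILITY_PATTERNS.items():
        # The lookahead lets overlapping motifs (e.g. NN then NG) both match.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])

for hit in find_liabilities("ARDGNNGYFDPW"):
    print(hit)
```

In this made-up sequence the scan reports DG at position 2, NN at 4, NG at 5, and DP at 9 (zero-based positions).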

Polyspecificity Liabilities (string) : Polyspecificity liabilities to quantify.

Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity

Key Downstream Hidden Parameters

Liability Database File (file_in) : Provide a liabilities file (xls, csv, tsv) with two columns (regex pattern, name of liability). DO NOT INCLUDE the characters ‘+’ or ‘,’ in the second (‘name of liability’) column. Special-case liabilities include net charge, Parker hydropathy, and unpaired cysteine. For net charge, write “Charge” (case-sensitive) in the first column and the net charge integer in the second column (values above it are flagged). Similarly, for Parker hydropathy, write “Parker Hydropathy” (case-sensitive) in the first column and the hydropathy float value in the second column (values below it are flagged). For unpaired cysteine, write “Cysteine” (case-sensitive) in the first column; the second column can contain anything but is typically “Unpaired Cysteine”. NOTE: If an even number of cysteines exists but the cysteines are spaced within 3 AAs of each other, they are still considered unpaired.
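A small tab-separated liabilities file following the rules above might look like the text embedded in the sketch below, which also shows one way to read it back and separate the three special-case rows from ordinary regex rows. The `load_liabilities` helper, the example rows, and the lack of a header row are assumptions for illustration; this is not the floe's own parser.

```python
import csv
import io
import re

# Hypothetical two-column file: regex pattern <TAB> name of liability.
EXAMPLE_FILE = (
    "N[GSTN]\tDeamidation\n"
    "DP\tHydrolysis\n"
    "Charge\t1\n"
    "Parker Hydropathy\t-0.1\n"
    "Cysteine\tUnpaired Cysteine\n"
)

def load_liabilities(text):
    """Return (regex liabilities by name, special-case settings)."""
    patterns, special = {}, {}
    for first, second in csv.reader(io.StringIO(text), delimiter="\t"):
        if first == "Charge":
            special[first] = int(second)    # flag values above this
        elif first == "Parker Hydropathy":
            special[first] = float(second)  # flag values below this
        elif first == "Cysteine":
            special[first] = second         # free-text label
        else:
            if "+" in second or "," in second:
                raise ValueError(f"'+' and ',' are not allowed in name: {second!r}")
            patterns[second] = re.compile(first)
    return patterns, special

patterns, special = load_liabilities(EXAMPLE_FILE)
print(sorted(patterns), special)
```

Matching the first column exactly mirrors the case-sensitive keywords described above: a row such as “charge” would be treated as an ordinary regex pattern rather than the net charge setting.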