NGS and Sanger Pipeline - AbXtract

Process NGS data (FASTQ files or Datasets) for annotation, demultiplexing, relative abundance, enrichment, clustering, and overlap assessment. Sanger data can be included as FASTA, FASTQ, TSV, CSV, Excel, or Dataset input; it can be clustered in the context of the NGS sequences and overlapped by a given region of interest (ROI).

SANGER Key Inputs

Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (file_in) : Input FASTQ/A in standard format, or CSV, TSV, or Excel with the columns id,sequence. A header row SHOULD NOT be provided.
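As a minimal sketch of the expected low-throughput input, the snippet below writes id,sequence rows with no header row. The clone names and the short sequences are hypothetical placeholders, and the helper name write_sanger_csv is not part of the floe.

```python
import csv
import io


def write_sanger_csv(records, handle):
    """Write (id, sequence) pairs as CSV rows, deliberately without a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    for seq_id, sequence in records:
        writer.writerow([seq_id, sequence])


# Hypothetical clone ids and placeholder sequences, for illustration only.
records = [("clone_A01", "CAGGTGCAGCTG"), ("clone_A02", "GAGGTGCAGCTG")]

buffer = io.StringIO()
write_sanger_csv(records, buffer)
sanger_csv_text = buffer.getvalue()
print(sanger_csv_text)
```

Writing to a real file works the same way by passing an open file handle instead of the in-memory buffer.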

Region of Interest (ROI) For Condensing Sequences (string) : Condenses the SANGER sequences by the ROI, rank-ordered by abundance. All sequences sharing the same ROI are displayed as a ‘:’-separated list of their ‘id’ fields. IMPORTANT: redundant sequences with the same ROI are removed if they have lower abundance. If two sequences with the same ROI have the same count, only one is selected at random. By default, sequences are condensed by full length, including framework regions.

Default: Full-Length, Including Framework

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework
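The condensing behavior described for this parameter can be sketched roughly as follows. This is an illustration under assumptions, not the floe's implementation: the record fields (id, roi, sequence, count) and the function name are hypothetical, and ties are broken by input order here rather than at random so the result is reproducible.

```python
from collections import defaultdict


def condense_by_roi(records):
    """Keep one representative per ROI and join the ids of all members with ':'."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["roi"]].append(rec)

    condensed = []
    for roi, members in groups.items():
        # Most abundant member wins; max() keeps the first one on ties
        # (the documented behavior picks one at random instead).
        best = max(members, key=lambda r: r["count"])
        condensed.append({
            "roi": roi,
            "id": ":".join(m["id"] for m in members),
            "sequence": best["sequence"],
            "count": best["count"],
        })
    # Rank-order the condensed rows by abundance, highest first.
    condensed.sort(key=lambda r: r["count"], reverse=True)
    return condensed


# Hypothetical records; 'sequence' values are placeholders.
records = [
    {"id": "s1", "roi": "ARDYW", "sequence": "SEQ_A", "count": 3},
    {"id": "s2", "roi": "ARDYW", "sequence": "SEQ_B", "count": 5},
    {"id": "s3", "roi": "AKGGW", "sequence": "SEQ_C", "count": 1},
]
result = condense_by_roi(records)
for row in result:
    print(row)
```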

Species Database to Select From (SANGER Input Cube) (string) : Species group used to generate the reference database for igmatcher. NOTE: A small database of reference therapeutics is available at the amino acid level only.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger). (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
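A line of the custom DNA annotation file can be checked before upload with a small parser like the sketch below. It assumes the region order FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4 after the match name, and it restricts bases to A, C, G, and T; that restriction is an assumption of this sketch (degenerate IUPAC codes may or may not be accepted by the floe). The scaffold name and the short segments are hypothetical placeholders.

```python
REGIONS = ("FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4")
DNA_ALPHABET = set("ACGT")  # assumption: plain bases only


def parse_annotation_line(line):
    """Split one space-separated annotation line into (match_name, {region: dna})."""
    fields = line.split()
    expected = len(REGIONS) + 1
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    name, segments = fields[0], [s.upper() for s in fields[1:]]
    for region, segment in zip(REGIONS, segments):
        unexpected = set(segment) - DNA_ALPHABET
        if unexpected:
            raise ValueError(f"{name} {region}: unexpected characters {sorted(unexpected)}")
    return name, dict(zip(REGIONS, segments))


# Hypothetical scaffold name with short placeholder segments.
line = "example_vh CAGGTG GGATTC TGGGTC ATTAGT CGATTC GCGAGA TGGGGC"
name, regions = parse_annotation_line(line)
print(name, regions["CDR3"])
```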

Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12

2) Heavy chain representative should contain AA space separated:

match_name_vh FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa

3) Light chain representative should contain AA space separated:

match_name_vl FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Is the Input SANGER a Single Domain or Chain (boolean) : Indicates whether the SANGER input is only a single-chain read (e.g., VH or VL) or a single-domain antibody (VHH), which adjusts the acceptable minimum length, the minimum query, and the minimum length that is flagged. Usually SANGER is treated as full-length VH+VL, similar to PacBio. IMPORTANT NOTE: if turned on, both chain_1 and chain_2 information is provided and is duplicated; use either.

Default: False

Long-Read (PacBio) FILE/SELECTION Inputs

NGS Input FASTQ (Long-Read, E.G., PacBio) (file_in) : Input FASTQ File.

Barcode Table (Long-Read/PacBio IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes in the column/comma/tab-separated format “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include a header. If you only have a 5’ barcode, the column/comma/tab-separated format is “name, 5’barcode,,,”. If you only have a 3’ barcode, the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Long-Read/PacBio IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (Long-Read/PacBio IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Short-Read (Illumina) FILE/SELECTION Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 1.

NGS Input FASTQ Reverse (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 2.

Barcode Table (Short-Read/Illumina IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes in the column/comma/tab-separated format “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include a header. If you only have a 5’ barcode, the column/comma/tab-separated format is “name, 5’barcode,,,”. If you only have a 3’ barcode, the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Short-Read/Illumina IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (Short-Read/NGS IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Optional DATASET Inputs From Sanger And/Or NGS

Optional NGS Dataset For Input (Typically NGS Datasets From Upstream FLOE) (data_source) : The optional dataset(s) to read records from.

Optional SANGER Dataset For Input (Typically Datasets Processed By Quick Sanger FLOE) (data_source) : The optional dataset(s) to read records from.

Key Sanger Input Parameters

Full/Partial alignment to annotate CDRs (string) : Aligns the query sequence either to the entire germline (‘full’) or to partial regions (‘partial’) for annotation. ‘full’ works better for amino acid inputs. For DNA inputs, ‘full’ works better for natural antibodies and ‘partial’ works better for synthetic/degenerate antibodies.

Default: partial

Choices: partial, full

Region of Interest (ROI) For Condensing Sequences (string) : Condenses the Sanger sequences by the ROI, rank-ordered by abundance. IMPORTANT: for each ROI, only the most abundant full-length sequence is kept and the other full-length sequences are removed. If two sequences have the same full-length count, one of them is picked.

Default: Full-Length, Including Framework

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Key Downstream Parameters

Region of Interest For Sanger Overlay (string) : Indicates the region of interest (ROI) used to map back to Sanger clones. The default is 100% homology by ROI, but this stringency can be relaxed with the Sanger Edit Distance for Sanger overlap and Edit Distance Method for Sanger overlap parameters.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3

Region of Interest For Enrichment and Clustering (string) : Indicates the region of interest for processing; only the top representative full-length sequence is kept. IF INPUT IS ILLUMINA, ONLY CDR3 (CHAIN_1/UPSTREAM CHAIN) WILL BE USED FOR ENRICHMENT, RELATIVE ABUNDANCE, AND CLUSTERING.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Clustering Type (string) : Cluster type to apply to the sequencing dataset. Non-redundant full-length sequences are assigned a cluster id by ROI, depending on the selected method. NOTE: Unique Only is the same as 100% homology by the provided ROI.

Default: AbScan

Choices: AbScan, Unique Only, Levenshtein Distance, Hamming Distance
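To make the difference between the two distance-based choices concrete, here is a minimal sketch of Hamming and Levenshtein distance on ROI strings. It only illustrates the metrics themselves; how the floe turns distances into cluster ids, and how AbScan works, is not covered here. The CDR3-like strings are hypothetical.

```python
def hamming(a, b):
    """Count mismatched positions; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion from a
                current[j - 1] + 1,          # insertion into a
                previous[j - 1] + (x != y),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical CDR3-like strings.
print(hamming("ARDYW", "ARDFW"))       # one substitution
print(levenshtein("ARDYW", "ARDYYW"))  # one insertion
```

Hamming distance cannot compare ROIs of different lengths, so Levenshtein distance is the more permissive option when insertions or deletions are expected.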

Keep Only Functional Sequences (At NGS Level) (boolean) : Eliminates non-functional sequences, truncations, stop-codons, frame-shifts. NOTE: This does not apply to SANGER sequences (same option in the hidden parameters).

Default: True

Exclude Values That Did Not Match In-Line Barcode (boolean) : If True, will exclude unknown values that did not have a barcode match.

Default: True

Key Liability Parameters

Biophysical Liabilities (string) : Net charge or hydropathy liabilities to quantify.

Default: [‘Charge (>1)’]

Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)

Glycosylation Liabilities (string) : Glycosylation liabilities to quantify.

Default: [‘NXT/S - Glycosylation’]

Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation

Cysteine Liabilities (string) : Cysteine-based liabilities to quantify.

Default: [‘Unpaired Cysteine’]

Choices: Unpaired Cysteine, Any Cysteine

Deamidation Liabilities (string) : Deamidation liabilities to quantify.

Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation
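Sequence motif liabilities such as N[GSTN] and GN[FYTG] read naturally as regular expressions. The sketch below shows one way to locate them in an amino acid string, including overlapping hits. Whether the floe counts overlapping matches is not stated, so that choice is an assumption of this example, as are the helper name and the test sequences.

```python
import re


def find_motifs(seq, pattern):
    """Return (start, matched_text) for every match, overlapping ones included."""
    # A lookahead consumes no characters, so matches that overlap are all reported.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(seq)]


# Hypothetical sequences.
print(find_motifs("ARNNGW", "N[GSTN]"))
print(find_motifs("GNFDS", "GN[FYTG]"))
```

In the first call, NN at position 2 and NG at position 3 share a residue, which a plain non-overlapping scan would report as a single hit.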

Hydrolysis Liabilities (string) : Hydrolysis liabilities to quantify.

Default: [‘DP - Hydrolysis’]

Choices: DP - Hydrolysis

Isomerization Liabilities (string) : Isomerization liabilities to quantify.

Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization

Polyspecificity Liabilities (string) : Polyspecificity liabilities to quantify.

Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity

Key Downstream Hidden Parameters

Liability Database File (file_in) : Provide a liabilities file (xls, csv, tsv) with 2 columns (regex pattern, name of liability). DO NOT INCLUDE the characters ‘+’ or ‘,’ in the second ‘name of liability’ column. Special-case liabilities include net charge, Parker hydropathy, and unpaired cysteine. For net charge, write “Charge” (case-sensitive) in the first column and the net charge integer in the second column (values above it are flagged). Similarly, for Parker hydropathy, write “Parker Hydropathy” (case-sensitive) in the first column and the hydropathy float value in the second column (values below it are flagged). For unpaired cysteine, write “Cysteine” (case-sensitive) in the first column; the second column can contain anything, but is typically “Unpaired Cysteine”. NOTE: if an even number of cysteines exists and the cysteines are spaced within 3 AAs of each other, they are still considered unpaired.
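A liabilities file following these rules could be assembled as in the sketch below, which writes a two-column TSV and rejects ‘+’ or ‘,’ in the second column. The specific entries, thresholds, and helper name are hypothetical, and writing the file without a header row is an assumption of this example.

```python
import csv
import io

FORBIDDEN_IN_SECOND_COLUMN = set("+,")


def liability_rows(entries):
    """Validate (first_column, second_column) pairs and return them as string rows."""
    rows = []
    for first, second in entries:
        second = str(second)
        if FORBIDDEN_IN_SECOND_COLUMN & set(second):
            raise ValueError(f"'+' and ',' are not allowed in: {second!r}")
        rows.append([first, second])
    return rows


# Hypothetical entries: two regex liabilities plus the three special cases.
entries = [
    ("N[GSTN]", "Deamidation"),
    ("DP", "Hydrolysis"),
    ("Charge", 1),                      # flag net charge above 1
    ("Parker Hydropathy", -0.5),        # flag hydropathy below -0.5
    ("Cysteine", "Unpaired Cysteine"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(liability_rows(entries))
liability_tsv = buffer.getvalue()
print(liability_tsv)
```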