NGS and Sanger Pipeline with Automated Top Lead Selection - AbXtract

Processes NGS input (FASTQ files or Datasets) for annotation, demultiplexing, relative abundance, enrichment, clustering and overlap assessment. Also accepts Sanger input as FASTA, FASTQ, TSV, CSV, Excel or Dataset, which can be clustered in the context of the NGS sequences and overlapped by a given region of interest (ROI). This FLOE automatically selects the top X sequences across the defined cluster space using rank-ordered, user-specified metrics.

SANGER Key Inputs

Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (file_in) : Input FASTQ/A in standard format, or CSV, TSV or Excel in the format id,sequence. A header row SHOULD NOT be provided.

Region of Interest (ROI) For Condensing Sequences (string) : This will condense the SANGER sequences by the selected ROI, rank ordered by abundance. All sequences sharing the same ROI are displayed as one record whose ‘id’ field is a ‘:’ separated list of the original ids. IMPORTANT: redundant sequences with the same ROI are removed if they have lower abundance. If two sequences have the same count for an ROI, only one is selected, at random. The default condenses by full-length sequence, including framework regions.

Default: Full-Length, Including Framework

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework
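The condensing behaviour described above can be illustrated with a minimal Python sketch. This is not the FLOE's implementation: the record keys (`id`, `roi`, `sequence`) and the function name `condense_by_roi` are hypothetical, and ties are broken by input order here rather than at random.

```python
from collections import defaultdict


def condense_by_roi(records):
    """Collapse records that share an ROI into one representative record.

    `records` is a list of dicts with hypothetical keys 'id', 'roi' and
    'sequence'. The most abundant full-length sequence is kept per ROI and
    all member ids are joined with ':'.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["roi"]].append(rec)

    condensed = []
    for roi, members in groups.items():
        # Count how often each full-length sequence occurs within this ROI.
        counts = defaultdict(int)
        for rec in members:
            counts[rec["sequence"]] += 1
        # max() keeps the first sequence seen on ties (the FLOE picks at random).
        best = max(counts, key=counts.get)
        condensed.append({
            "id": ":".join(rec["id"] for rec in members),
            "roi": roi,
            "sequence": best,
            "full_length_count": counts[best],
            "roi_count": len(members),
        })
    return condensed


# Made-up placeholder records, for illustration only.
demo = [
    {"id": "c1", "roi": "ARDY", "sequence": "AAA"},
    {"id": "c2", "roi": "ARDY", "sequence": "AAA"},
    {"id": "c3", "roi": "ARDY", "sequence": "AAB"},
    {"id": "c4", "roi": "GGW", "sequence": "CCC"},
]
for row in condense_by_roi(demo):
    print(row)
```

Choosing a narrower ROI (for example a single CDR3) merges more clones into each record, while the full-length default only merges exact duplicates.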

Species Database to Select From (SANGER Input Cube) (string) : Species reference database used to generate the db for igmatcher. NOTE: The database of reference therapeutics (Jain Dataset) is available at the amino acid level only.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger). (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain AA space separated:

match_name_vh FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa

3) Light chain representative should contain AA space separated:

match_name_vl FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
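As an illustration of the annotation file layout described above (a scaffold name followed by seven space-separated regions), the following Python sketch parses and sanity-checks one line before upload. The helper name `parse_annotation_line` and the example strings are hypothetical placeholders, not real scaffold sequences, and the FLOE may apply additional checks that are not shown here.

```python
# Region order as listed in the format description above.
REGIONS = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]

DNA_ALPHABET = set("ACGT")
AA_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def parse_annotation_line(line, alphabet):
    """Split one annotation line into (match_name, {region: sequence}).

    Raises ValueError if the line does not have exactly eight
    space-separated fields or a region contains unexpected characters.
    """
    fields = line.split()
    if len(fields) != 1 + len(REGIONS):
        raise ValueError(
            f"expected {1 + len(REGIONS)} fields, found {len(fields)}"
        )
    name, *segments = fields
    for region, seq in zip(REGIONS, segments):
        unexpected = set(seq.upper()) - alphabet
        if unexpected:
            raise ValueError(f"{name} {region}: unexpected characters {unexpected}")
    return name, dict(zip(REGIONS, segments))


# Placeholder line; the region strings are made up for illustration.
name, regions = parse_annotation_line(
    "demo_vh GAGGTG GGCTTC TGGGTG ATCAGC CGGTTC GCCAGG TGGGGC", DNA_ALPHABET
)
print(name, regions["CDR3"])
```

Passing `AA_ALPHABET` instead of `DNA_ALPHABET` applies the same field check to the amino acid variant of the file.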

Is the Input SANGER a Single Domain or Chain (boolean) : Indicates whether the SANGER input is only a single chain read (e.g., VH or VL) or a domain antibody (VHH), which adjusts the acceptable minimum length, minimum query and minimum length that is flagged. Usually SANGER is treated as full-length VH+VL, similar to PacBio. IMPORTANT NOTE: if turned on, both chain_1 and chain_2 information is provided and is duplicated; use either.

Default: False

Key NGS Selection Parameters

Maximum Number of Full-Length Sequences (integer) : Indicate the maximum number of full-length, non-redundant sequences to select. The number actually selected also depends on the total number of non-redundant sequences, the maximum number of clusters and the maximum number of sequences per cluster. If the total is below the desired number, try adjusting the maximum number of clusters or the maximum number of sequences per cluster. Alternatively, to fill the quota with additional sequences per cluster, turn on the ‘Attempt to Fulfill the Desired Number of Sequences Quota’ parameter.

Default: 300 Min: 1 Max: 1000000

Maximum Number Sequences Per Cluster (integer) : Indicate the maximum number of unique full-length sequences per given cluster (NOTE: if clustering not performed, will select full-length sequences that share common region of interest (ROI)).

Default: 3 Min: 1 Max: 10000

Maximum Number of Clusters Preferred (integer) : Indicate the maximum number of clusters preferred in the final selection (NOTE: if clustering is not performed, full-length sequences that share a common region of interest (ROI) are treated as a group).

Default: 100 Min: 1 Max: 1000000

Metrics for Ranking (string) : Place metrics in order of ranking (if nothing, ranks by full-length count).

Default: [‘ROI Percent, Final Round Only’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘Liabilities Both Chains’, ‘Liabilities CDR3_2’]

Choices: Full Length (Corrects for Illumina or PacBio), Count, Full Length (Corrects for Illumina or PacBio), Percent, ROI Count, Final Round Only, ROI Percent, Final Round Only, ROI Fold Enrichment, Final Round Only, ROI Log2 Enrichment, Final Round Only, Liabilities Both Chains, Liabilities Chain_2, Liabilities Chain_1, Liabilities CDR1_1, Liabilities CDR2_1, Liabilities CDR3_1, Liabilities CDR1_2, Liabilities CDR2_2, Liabilities CDR3_2, ROI Count, Early Round Only, ROI Percent, Early Round Only, Cluster Count (e.g. unique sequences per cluster), Cluster Percent (e.g. unique rep per cluster)

Rank Sanger Clones First in Population (boolean) : Sanger clones, if present, will be ranked first.

Default: True

Attempt to Fulfill the Desired Number of Sequences Quota (boolean) : This will attempt to fulfill the total number of sequences quota if goal is not reached by selecting across clusters by prioritizing top clones. NOTE: this is likely to result in a greater number of sequences per cluster.

Default: False
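How the selection parameters above interact can be sketched in Python. This is a simplified model under stated assumptions, not the FLOE's algorithm: the candidate keys (`id`, `cluster`, `is_sanger`), the metric names, and the rule that quota filling ignores only the per-cluster cap are all hypothetical.

```python
def select_top_leads(candidates, metrics, max_total=300, max_per_cluster=3,
                     max_clusters=100, sanger_first=True, fill_quota=False):
    """Pick top-ranked candidates subject to cluster limits.

    `metrics` is an ordered list of (key, higher_is_better) pairs; earlier
    metrics take priority, later ones break ties.
    """
    def sort_key(c):
        # Sanger clones sort ahead of everything else when requested.
        key = [0 if (sanger_first and c.get("is_sanger")) else 1]
        for name, higher_is_better in metrics:
            key.append(-c[name] if higher_is_better else c[name])
        return key

    ranked = sorted(candidates, key=sort_key)
    selected, per_cluster = [], {}
    for c in ranked:
        if len(selected) >= max_total:
            break
        n = per_cluster.get(c["cluster"], 0)
        if n >= max_per_cluster:
            continue  # this cluster is already full
        if n == 0 and len(per_cluster) >= max_clusters:
            continue  # opening a new cluster would exceed the cluster limit
        per_cluster[c["cluster"]] = n + 1
        selected.append(c)

    if fill_quota:
        # Assumption: top up in rank order, ignoring the per-cluster cap.
        chosen = {c["id"] for c in selected}
        for c in ranked:
            if len(selected) >= max_total:
                break
            if c["id"] not in chosen:
                selected.append(c)
    return selected


# Made-up candidates for illustration.
pool = [
    {"id": "a", "cluster": 1, "roi_percent": 50.0, "liabilities": 0},
    {"id": "b", "cluster": 1, "roi_percent": 40.0, "liabilities": 1},
    {"id": "c", "cluster": 1, "roi_percent": 30.0, "liabilities": 0},
    {"id": "d", "cluster": 2, "roi_percent": 20.0, "liabilities": 0},
    {"id": "s", "cluster": 3, "roi_percent": 1.0, "liabilities": 5, "is_sanger": True},
]
ranking = [("roi_percent", True), ("liabilities", False)]
picked = select_top_leads(pool, ranking, max_total=4, max_per_cluster=2)
print([c["id"] for c in picked])
```

In this model a low per-cluster cap spreads the selection across more clusters, whereas quota filling trades that diversity for reaching the requested total.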

Long-Read (PacBio) FILE/SELECTION Inputs

NGS Input FASTQ (Long-Read, E.G., PacBio) (file_in) : Input FASTQ File.

Barcode Table (Long-Read/PacBio IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes with the column/comma/tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include header. If you only have a 5’ barcode then the column/comma/tab-separated format is “name, 5’barcode,,,”. If you just have a 3’ barcode then the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Long-Read/PacBio IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (Long-Read/PacBio IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Short-Read (Illumina) FILE/SELECTION Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 1.

NGS Input FASTQ Reverse (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 2.

Barcode Table (Short-Read/Illumina IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes with the column/comma/tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include header. If you only have a 5’ barcode then the column/comma/tab-separated format is “name, 5’barcode,,,”. If you just have a 3’ barcode then the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Short-Read/Illumina IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (Short-Read/NGS IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Optional DATASET Inputs From Sanger And/Or NGS

Optional NGS Dataset For Input (Typically NGS Datasets From Upstream FLOE) (data_source) : The optional dataset(s) to read records from.

Optional SANGER Dataset For Input (Typically Datasets Processed By Quick Sanger FLOE) (data_source) : The optional dataset(s) to read records from.

Key Sanger Input Parameters

Full/Partial alignment to annotate CDRs (string) : Align query sequence to the entire germline or to partial regions to annotate. IMPORTANT POINT 1: Full should be used for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using custom annotation file) and ‘partial’ works better with DNA for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).

Default: full

Choices: partial, full

Region of Interest (ROI) For Condensing Sequences (string) : This will condense the Sanger sequences by the selected ROI, rank ordered by abundance. IMPORTANT: among full-length sequences that share an ROI, only the one with the highest full-length count is kept and the rest are removed. If two sequences have the same full-length count, one of them is picked.

Default: Full-Length, Including Framework

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Key Downstream Parameters

Region of Interest For Sanger Overlay (string) : Indicate the region of interest (ROI) used to map NGS sequences back to Sanger clones. The default requires 100% homology by ROI, but this stringency can be relaxed with the Sanger Edit Distance for Sanger overlap and Edit Distance Method for Sanger overlap parameters.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3

Region of Interest For Enrichment and Clustering (string) : Indicate the region of interest for processing, only top representative full-length sequence will be kept. IF INPUT IS ILLUMINA WILL ONLY USE CDR3 (CHAIN_1/UPSTREAM CHAIN) FOR ENRICHMENT, RELATIVE ABUNDANCE, AND CLUSTERING.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Clustering Type (string) : Cluster type to apply to sequencing dataset. Non-redundant full-length sequences are assigned cluster id by ROI depending on selected method. NOTE: Unique only is same as 100% homology by provided ROI.

Default: AbScan

Choices: AbScan, Unique Only, Levenshtein Distance, Hamming Distance
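To make the difference between the Hamming and Levenshtein choices concrete, here is a small Python sketch of both distances plus a simple greedy grouping of ROIs. The greedy scheme and the function names are assumptions for illustration only; the FLOE's actual clustering procedure, including AbScan, is not described in this section and is not reproduced here.

```python
def hamming(a, b):
    """Number of mismatched positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def greedy_cluster(rois, max_distance, distance=levenshtein):
    """Assign each ROI to the first cluster whose representative is close enough.

    Returns one cluster id per input ROI. With max_distance=0 this reduces
    to grouping identical ROIs, i.e. 100% homology.
    """
    representatives, labels = [], []
    for roi in rois:
        for cluster_id, rep in enumerate(representatives):
            if distance(roi, rep) <= max_distance:
                labels.append(cluster_id)
                break
        else:
            representatives.append(roi)
            labels.append(len(representatives) - 1)
    return labels


# Made-up CDR3-like strings for illustration.
print(greedy_cluster(["ARDYW", "ARDFW", "GGGWW"], max_distance=1))
```

Hamming distance only compares ROIs of equal length, so Levenshtein is the more natural choice when CDR lengths vary within a cluster.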

Keep Only Functional Sequences (At NGS Level) (boolean) : Eliminates non-functional sequences, truncations, stop-codons, frame-shifts. NOTE: This does not apply to SANGER sequences (same option in the hidden parameters).

Default: True

Exclude Values That Did Not Match In-Line Barcode (boolean) : If True, will exclude unknown values that did not have a barcode match.

Default: True

Key Liability Parameters

Biophysical Liabilities (string) : Net charge or hydropathy liabilities to quantify.

Default: [‘Charge (>1)’]

Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)

Glycosylation Liabilities (string) : Glycosylation liabilities to quantify.

Default: [‘NXT/S - Glycosylation’]

Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation

Cysteine Liabilities (string) : Cysteine-based liabilities to quantify.

Default: [‘Unpaired Cysteine’]

Choices: Unpaired Cysteine, Any Cysteine

Deamidation Liabilities (string) : Deamidation liabilities to quantify.

Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation

Hydrolysis Liabilities (string) : Hydrolysis liabilities to quantify.

Default: [‘DP - Hydrolysis’]

Choices: DP - Hydrolysis

Isomerization Liabilities (string) : Isomerization liabilities to quantify.

Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization

Polyspecificity Liabilities (string) : Polyspecificity liabilities to quantify.

Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity

Key Downstream Hidden Parameters

Liability Database File (file_in) : Provide a liabilities file (xls, csv, tsv) with 2 columns (regex pattern, name of liability). DO NOT INCLUDE the characters ‘+’ or ‘,’ in the second ‘name of liability’ column. Special-case liabilities include net charge, Parker hydropathy and unpaired cysteine. For net charge, write “Charge” (case-sensitive) in the first column and the net charge integer in the second column (values above it are flagged). Similarly, for Parker hydropathy, write “Parker Hydropathy” (case-sensitive) in the first column and the hydropathy float value in the second column (values below it are flagged). For unpaired cysteine, write “Cysteine” (case-sensitive) in the first column; the second column can contain anything, but is typically “Unpaired Cysteine”. NOTE: if an even number of cysteines exists and the cysteines are spaced within 3 AAs of each other, they are still considered unpaired.
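A two-column liabilities table of this kind can be applied to an amino acid sequence roughly as in the Python sketch below. It is an illustration under assumptions, not the FLOE's scoring code: the function name `scan_liabilities` is hypothetical, net charge is approximated as (K + R) minus (D + E), and the Parker hydropathy and cysteine special cases are deliberately left out.

```python
import re


def scan_liabilities(sequence, table):
    """Count liability hits in `sequence`.

    `table` is a list of (first_column, second_column) rows, mirroring the
    two-column file: a regex pattern and a liability name, or the special
    row ("Charge", "<integer threshold>").
    """
    hits = {}
    for first, second in table:
        if first == "Charge":
            threshold = int(second)
            # Assumption: crude net charge from K/R (+1) and D/E (-1) counts.
            charge = (sum(sequence.count(r) for r in "KR")
                      - sum(sequence.count(r) for r in "DE"))
            if charge > threshold:
                hits[f"Charge (>{threshold})"] = charge
        else:
            # Lookahead so overlapping motifs (e.g. 'NN' in 'NNN') are all counted.
            count = len(re.findall(f"(?=(?:{first}))", sequence))
            if count:
                hits[second] = count
    return hits


# Patterns taken from the liability choices listed in this section;
# the sequence is a made-up placeholder.
liability_table = [
    ("NG", "NG - Deamidation"),
    ("D[GSD]", "D[GSD] - Isomerization"),
    ("DP", "DP - Hydrolysis"),
    ("Charge", "1"),
]
print(scan_liabilities("ARNGKDSRKW", liability_table))
```

Keeping the liability names free of ‘+’ and ‘,’, as the parameter description requires, also keeps them safe to use as column labels in a CSV export of such results.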