NGS Pipeline with Automated Top Lead Selection - AbXtract

Process NGS input (FASTQ files or Datasets) for annotation, demultiplexing, relative abundance, enrichment, clustering, and overlap assessment. This FLOE automatically selects the top X sequences across the defined cluster space using rank-ordered, user-specified metrics.

Key Selection Parameters

Maximum Number of Full-Length Sequences (integer) : Indicate the maximum number of full-length, non-redundant sequences. This value also depends on the total number of non-redundant sequences, the maximum number of clusters, and the maximum number of sequences per cluster. If the total is below the desired number, try adjusting the maximum number of clusters or the maximum number of sequences per cluster. Alternatively, to fill the quota with additional sequences per cluster, turn on the ‘Attempt to Fulfill the Desired Number of Sequences Quota’ parameter.

Default: 300 Min: 1 Max: 1000000

Maximum Number Sequences Per Cluster (integer) : Indicate the maximum number of unique full-length sequences per given cluster (NOTE: if clustering not performed, will select full-length sequences that share common region of interest (ROI)).

Default: 3 Min: 1 Max: 10000

Maximum Number of Clusters Preferred (integer) : Indicate the maximum number of clusters that you want to choose from.

Default: 100 Min: 1 Max: 1000000

Metrics for Ranking (string) : Place metrics in order of ranking (if nothing, ranks by full-length count).

Default: [‘ROI Percent, Final Round Only’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘Liabilities Both Chains’, ‘Liabilities CDR3_2’]

Choices: Full Length (Corrects for Illumina or PacBio), Count, Full Length (Corrects for Illumina or PacBio), Percent, ROI Count, Final Round Only, ROI Percent, Final Round Only, ROI Fold Enrichment, Final Round Only, ROI Log2 Enrichment, Final Round Only, Liabilities Both Chains, Liabilities Chain_2, Liabilities Chain_1, Liabilities CDR1_1, Liabilities CDR2_1, Liabilities CDR3_1, Liabilities CDR1_2, Liabilities CDR2_2, Liabilities CDR3_2, ROI Count, Early Round Only, ROI Percent, Early Round Only, Cluster Count (e.g. unique sequences per cluster), Cluster Percent (e.g. unique rep per cluster)

Rank Sanger Clones First in Population (boolean) : Sanger clones, if present, will be ranked first.

Default: True

Attempt to Fulfill the Desired Number of Sequences Quota (boolean) : This will attempt to fulfill the total number of sequences quota if goal is not reached by selecting across clusters by prioritizing top clones. NOTE: this is likely to result in a greater number of sequences per cluster.

Default: False
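
The interaction between the three caps and the quota option can be sketched as follows. This is a minimal illustration, not the FLOE's implementation: the function name select_top_leads, the record keys, and the (key, higher_is_better) metric pairs are hypothetical, and the top-up pass assumes that only the per-cluster cap is relaxed.

```python
from collections import defaultdict


def select_top_leads(records, metrics, max_total=300, max_per_cluster=3,
                     max_clusters=100, fill_quota=False):
    """Pick top-ranked records under per-cluster, cluster-count and total caps.

    records: list of dicts with a 'cluster' key plus one key per metric.
    metrics: list of (key, higher_is_better) pairs, most important first.
    """
    def rank_key(rec):
        # Negate "higher is better" metrics so an ascending sort ranks them first.
        return tuple(-rec[k] if higher else rec[k] for k, higher in metrics)

    ranked = sorted(records, key=rank_key)
    per_cluster = defaultdict(int)
    selected, overflow = [], []
    for rec in ranked:
        if len(selected) >= max_total:
            break
        cluster = rec["cluster"]
        if cluster not in per_cluster and len(per_cluster) >= max_clusters:
            continue  # cluster budget exhausted; this record is dropped
        if per_cluster[cluster] >= max_per_cluster:
            overflow.append(rec)  # candidate for the optional top-up pass
            continue
        per_cluster[cluster] += 1
        selected.append(rec)

    if fill_quota:
        # Assumed behaviour: relax only the per-cluster cap, best clones first.
        for rec in overflow:
            if len(selected) >= max_total:
                break
            selected.append(rec)
    return selected
```

Liability counts are passed with higher_is_better set to False so that sequences with fewer liabilities rank first. With the quota option on, the result can hold more than max_per_cluster sequences from one cluster, which matches the NOTE above.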

Long-Read (PacBio) FILE/SELECTION Inputs

NGS Input FASTQ (Long-Read, E.G., PacBio) (file_in) : Input FASTQ File.

Barcode Table (Long-Read/PacBio IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes with the column/comma/tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include header. If you only have a 5’ barcode then the column/comma/tab-separated format is “name, 5’barcode,,,”. If you just have a 3’ barcode then the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Long-Read/PacBio IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Database with Alignment Scheme of Interest (for NGS). (Long-Read/PacBio IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds such as the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
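
A custom database line can be checked before upload with a small parser such as the sketch below. It follows the seven-region layout shown for the heavy chain (a name followed by FR1 through FR4 DNA, space separated). The function name parse_custom_db_line and the restriction to the bases A, C, G, and T are assumptions made for illustration; they are not stated requirements of the IGMATCHER cube.

```python
REGIONS = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]


def parse_custom_db_line(line):
    """Split one database line into (match_name, {region: dna})."""
    fields = line.split()
    if len(fields) != 1 + len(REGIONS):
        raise ValueError(
            f"expected a name plus {len(REGIONS)} regions, got {len(fields)} fields"
        )
    name, *dna = fields
    regions = {}
    for region, seq in zip(REGIONS, dna):
        seq = seq.upper()
        # Assumption: only unambiguous DNA bases are accepted.
        if set(seq) - set("ACGT"):
            raise ValueError(f"{name}: non-DNA characters in {region}")
        regions[region] = seq
    return name, regions


# Placeholder bases only; a real entry carries the full framework and CDR DNA.
name, regions = parse_custom_db_line("demo_vh ACG TTT GGA CCA AAT GGC TGA")
print(name, regions["CDR3"])
```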

Short-Read (Illumina) FILE/SELECTION Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 1.

NGS Input FASTQ Reverse (Short-Read/Illumina) (file_in) : Path to FastQ containing Illumina sequences, read 2.

Barcode Table (Short-Read/Illumina IGMATCHER Cube) (file_in) : Excel/CSV/TSV file containing barcodes with the column/comma/tab-separated format as “5’barcode,3’barcode,barcode_round,barcode_group”. Do not include header. If you only have a 5’ barcode then the column/comma/tab-separated format is “name, 5’barcode,,,”. If you just have a 3’ barcode then the column/comma/tab-separated format is “name,,3’barcode,,,”.

Species Database to Select From (Short-Read/Illumina IGMATCHER Cube) (string) : Species reference database to generate the db for igmatcher.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit

Provide a Customized DNA Database with Alignment Scheme of Interest (for NGS). (Short-Read/NGS IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds such as the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.

2) Heavy chain representative should contain DNA bases space separated:

match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

3) Light chain representative should contain DNA bases space separated:

match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna

4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).

5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).

Optional DATASET Inputs From Sanger And/Or NGS

Optional NGS Dataset For Input (Typically NGS Datasets From Upstream FLOE) (data_source) : The optional dataset(s) to read records from.

NGS Key Selection Parameters

Maximum Number of Full-Length Sequences (integer) : Indicate the maximum number of full-length, non-redundant sequences. This value depends on the total number of non-redundant sequences, maximum # of clusters and the maximum # of sequences per cluster. If the total number is below desired, try adjusting max # of clusters or max number of sequences per cluster. Alternatively, if you want to fill quota with additional sequences per cluster turn on the ‘Attempt to Fulfill the Desired Number of Sequences Quota’ parameter.

Default: 300 Min: 1 Max: 1000000

Maximum Number Sequences Per Cluster (integer) : Indicate the maximum number of unique full-length sequences per given cluster (NOTE: if clustering not performed, will select full-length sequences that share common region of interest (ROI)).

Default: 3 Min: 1 Max: 10000

Maximum Number Of Clusters Preferred (integer) : Indicate the maximum number of clusters that you want to choose from.

Default: 100 Min: 1 Max: 10000000

Metrics for Ranking (string) : Place metrics in order of ranking (if nothing, ranks by full-length count).

Default: [‘ROI Percent, Final Round Only’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘Liabilities Both Chains’, ‘Liabilities CDR3_2’]

Choices: Full Length (Corrects for Illumina or PacBio), Count, Full Length (Corrects for Illumina or PacBio), Percent, ROI Count, Final Round Only, ROI Percent, Final Round Only, ROI Fold Enrichment, Final Round Only, ROI Log2 Enrichment, Final Round Only, Liabilities Both Chains, Liabilities Chain_2, Liabilities Chain_1, Liabilities CDR1_1, Liabilities CDR2_1, Liabilities CDR3_1, Liabilities CDR1_2, Liabilities CDR2_2, Liabilities CDR3_2, ROI Count, Early Round Only, ROI Percent, Early Round Only, Cluster Count (e.g. unique sequences per cluster), Cluster Percent (e.g. unique rep per cluster)

Attempt to Fulfill the Desired Number of Sequences Quota (boolean) : This will attempt to fulfill the total number of sequences quota if goal is not reached by selecting across clusters by prioritizing top clones. NOTE: this is likely to result in a greater number of sequences per cluster.

Default: False
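
For orientation, the percent and enrichment metrics offered for ranking can be related to raw ROI counts roughly as below. The formulas are an assumption based on the metric names alone (percent of round total, fold change of final-round percent over early-round percent, and its base-2 logarithm); the FLOE's exact calculation is not documented here, and the function names are hypothetical.

```python
import math


def roi_percent(counts):
    """Convert raw ROI counts for one round into percent of that round."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {roi: 100.0 * n / total for roi, n in counts.items()}


def roi_enrichment(early_counts, final_counts):
    """Return {roi: (fold, log2_fold)} or None when a ratio is undefined."""
    early = roi_percent(early_counts)
    final = roi_percent(final_counts)
    result = {}
    for roi, final_pct in final.items():
        early_pct = early.get(roi)
        if not early_pct or final_pct == 0:
            result[roi] = None  # absent in one round: no finite ratio
            continue
        fold = final_pct / early_pct
        result[roi] = (fold, math.log2(fold))
    return result
```

An ROI that rises from 10% of the early round to 50% of the final round has a fold enrichment of 5 under this definition.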

Key Downstream Parameters

Region of Interest For Sanger Overlay (string) : Indicate the region of interest (ROI) to map back to Sanger clones. The default is 100% homology by ROI; this stringency can be relaxed with the ‘Sanger Edit Distance for Sanger overlap’ and ‘Edit Distance Method for Sanger overlap’ parameters.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3

Region of Interest For Enrichment and Clustering (string) : Indicates the region of interest for processing; only the top representative full-length sequence is kept. IF INPUT IS ILLUMINA, WILL ONLY USE CDR3 (CHAIN_1/UPSTREAM CHAIN) FOR ENRICHMENT, RELATIVE ABUNDANCE, AND CLUSTERING.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework

Clustering Type (string) : Cluster type to apply to the sequencing dataset. Non-redundant full-length sequences are assigned a cluster ID by ROI according to the selected method. NOTE: Unique Only is the same as 100% homology by the provided ROI.

Default: AbScan

Choices: AbScan, Unique Only, Levenshtein Distance, Hamming Distance
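
The two edit-distance choices differ in what they allow: Hamming distance counts substitutions between ROIs of equal length, while Levenshtein distance also counts insertions and deletions. The sketch below shows both, plus a simple greedy assignment of ROIs to clusters. The greedy scheme and the function names are illustrative assumptions and do not describe how AbScan or the FLOE assigns cluster IDs.

```python
def hamming(a, b):
    """Number of mismatched positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (x != y)))    # substitution or match
        prev = curr
    return prev[-1]


def greedy_cluster(rois, distance, max_dist):
    """Assign each ROI to the first cluster whose representative is close enough."""
    representatives = []
    assignment = {}
    for roi in rois:
        for cluster_id, rep in enumerate(representatives):
            if distance(roi, rep) <= max_dist:
                assignment[roi] = cluster_id
                break
        else:
            representatives.append(roi)
            assignment[roi] = len(representatives) - 1
    return assignment
```

Setting max_dist to 0 reproduces the Unique Only behaviour described above, since only identical ROIs then share a cluster.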

Keep Only Functional Sequences (At NGS Level) (boolean) : Eliminates non-functional sequences, truncations, stop-codons, frame-shifts. NOTE: This does not apply to SANGER sequences (same option in the hidden parameters).

Default: True

Exclude Values That Did Not Match In-Line Barcode (boolean) : If True, will exclude unknown values that did not have a barcode match.

Default: True

Key Liability Parameters

Biophysical Liabilities (string) : Net charge or hydropathy liabilities to quantify.

Default: [‘Charge (>1)’]

Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)

Glycosylation Liabilities (string) : Glycosylation liabilities to quantify.

Default: [‘NXT/S - Glycosylation’]

Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation

Cysteine Liabilities (string) : Cysteine-based liabilities to quantify.

Default: [‘Unpaired Cysteine’]

Choices: Unpaired Cysteine, Any Cysteine

Deamidation Liabilities (string) : Deamidation liabilities to quantify.

Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation

Hydrolysis Liabilities (string) : Hydrolysis liabilities to quantify.

Default: [‘DP - Hydrolysis’]

Choices: DP - Hydrolysis

Isomerization Liabilities (string) : Isomerization liabilities to quantify.

Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization

Polyspecificity Liabilities (string) : Polyspecificity liabilities to quantify.

Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity
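
Most of the sequence liabilities listed above are short motifs, so they can be counted with regular expressions. The sketch below is illustrative only: the regex translations are assumptions (for example, X is read as any residue and "aromatic" as F, W, or Y), and the names LIABILITY_PATTERNS, count_motif, and scan_liabilities are hypothetical rather than part of AbXtract.

```python
import re

# Assumed regex translations of a few liability names from the lists above.
LIABILITY_PATTERNS = {
    "NXT/S - Glycosylation": r"N.[TS]",
    "N[GSTN] - Deamidation": r"N[GSTN]",
    "DP - Hydrolysis": r"DP",
    "D[GSD] - Isomerization": r"D[GSD]",
    "WXW - Polyspecificity": r"W.W",
    "Three Consecutive Aromatics - Polyspecificity": r"[FWY]{3}",
}


def count_motif(pattern, sequence):
    """Count motif occurrences, including overlapping ones, via a lookahead."""
    return sum(1 for _ in re.finditer(f"(?=(?:{pattern}))", sequence))


def scan_liabilities(sequence, patterns=LIABILITY_PATTERNS):
    """Return {liability name: number of occurrences} for one amino acid sequence."""
    return {name: count_motif(pat, sequence) for name, pat in patterns.items()}


hits = scan_liabilities("CARDGNGTWAWDPW")
print({name: n for name, n in hits.items() if n})
```

The lookahead matters for motifs that can overlap themselves: a plain search would report one NN in "NNN", whereas the lookahead counts both starting positions.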

Key Downstream Hidden Parameters

Liability Database File (file_in) : Provide a liabilities file (xls, csv, tsv) with two columns: regex pattern and name of liability. DO NOT INCLUDE the characters ‘+’ or ‘,’ in the second (‘name of liability’) column. Special-case liabilities are net charge, Parker hydropathy, and unpaired cysteine. For net charge, write “Charge” (case-sensitive) in the first column and the net charge integer in the second column; values above it are flagged. Similarly, for Parker hydropathy, write “Parker Hydropathy” (case-sensitive) in the first column and the hydropathy float value in the second column; values below it are flagged. For unpaired cysteine, write “Cysteine” (case-sensitive) in the first column; the second column can contain anything but is typically “Unpaired Cysteine”. NOTE: if an even number of cysteines exists but the cysteines are spaced within 3 AAs of each other, they are still considered unpaired.
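
A liabilities file in the two-column layout described above can be generated with Python's csv module. The sketch below is a minimal illustration: the function name liability_rows_to_csv and the example rows are hypothetical, and since the description does not say whether a header row is expected, none is written.

```python
import csv
import io

FORBIDDEN = set("+,")  # characters not allowed in the second column


def liability_rows_to_csv(rows):
    """Serialize (first_column, second_column) pairs as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for first, second in rows:
        if FORBIDDEN & set(str(second)):
            raise ValueError(f"'+' and ',' are not allowed in: {second!r}")
        writer.writerow([first, second])
    return buffer.getvalue()


rows = [
    ("DP", "DP - Hydrolysis"),            # ordinary regex pattern and its name
    ("Charge", 1),                        # net charge above 1 is flagged
    ("Parker Hydropathy", -0.1),          # hydropathy below -0.1 is flagged
    ("Cysteine", "Unpaired Cysteine"),    # special-case keyword
]
print(liability_rows_to_csv(rows))
```

The check is applied to the second column only, as in the description; a regex in the first column that contains a comma is quoted automatically by the csv writer.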