NGS Pipeline with Automated Top Lead Selection - AbXtract

Category Paths

  • Solution-based/Biologics/Antibody Design

  • Role-based/Bioinformatician

  • Role-based/Biologist

  • Product-based/AbXtract

Description

Processes NGS data (FASTQ files or datasets) for annotation, demultiplexing, relative abundance, enrichment, clustering, and overlap assessment. This FLOE automatically selects the top X sequences across the defined cluster space using rank-ordered, user-specified metrics.

Promoted Parameters

Title in user interface (promoted name)

Long-Read (PacBio) FILE/SELECTION Inputs

NGS Input FASTQ (Long-Read, e.g., PacBio) (pacbio_input_file): Input FASTQ File

  • Type: file_in

Barcode Table (barcode_table_ngs): XLS/CSV/TSV file containing barcodes in the format Name,5’barcode,3’barcode,barcode_round (e.g., early/late),barcode_group. Do not include a header. If you have only a 5’ barcode, write name,5’barcode,,, If you have only a 3’ barcode, write name,,3’barcode,,,

  • Type: file_in
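As a sketch of the headerless five-column layout described above, the following Python snippet writes a barcode table as CSV. The sample names, barcode sequences, round labels, and group names are made-up placeholders, and the helper `write_barcode_table` is hypothetical, not part of the floe.

```python
import csv
import io

# Hypothetical rows: name, 5' barcode, 3' barcode, barcode_round, barcode_group.
# An empty string stands for a barcode that is not used.
rows = [
    ("sample_A", "ACGTACGT", "TTGACCAA", "early", "group1"),
    ("sample_B", "GGATCCAA", "", "late", "group1"),   # 5' barcode only
    ("sample_C", "", "CATGCATG", "late", "group2"),   # 3' barcode only
]

def write_barcode_table(rows, handle):
    """Write rows as CSV with no header line, one sample per line."""
    writer = csv.writer(handle, lineterminator="\n")
    for row in rows:
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row!r}")
        writer.writerow(row)

buffer = io.StringIO()
write_barcode_table(rows, buffer)
barcode_csv = buffer.getvalue()
print(barcode_csv)
```

To produce a file for upload, the same function could be called with a handle from `open("barcodes.csv", "w", newline="")` instead of the in-memory buffer.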

Species Database to Select From (species_ngs): Species reference database used to generate the database for IgMatcher

  • Required

  • Type: string

  • Default: [‘Human’]

  • Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library, or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, this file overrides the annotation and species/database selection settings.

  • Type: file_in

Short-Read (Illumina) FILE/SELECTION Inputs

NGS Input FASTQ Forward (Short-Read/Illumina) (illumina_input_file1): Input FASTQ File

  • Type: file_in

NGS Input FASTQ Reverse (Short-Read/Illumina) (illumina_input_file2): Input FASTQ File

  • Type: file_in

Barcode Table (barcode_table_ngs_ill): XLS/CSV/TSV file containing barcodes in the format Name,5’barcode,3’barcode,barcode_round (e.g., early/late),barcode_group. Do not include a header. If you have only a 5’ barcode, write name,5’barcode,,, If you have only a 3’ barcode, write name,,3’barcode,,,

  • Type: file_in

Species Database to Select From (species_illumina): Species reference database used to generate the database for IgMatcher

  • Required

  • Type: string

  • Default: [‘Human’]

  • Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher_illumina): ONLY REQUIRED for custom scaffolds, such as the Specifica Gen3 Library, or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, this file overrides the annotation and species/database selection settings.

  • Type: file_in

Optional DATASET Inputs from NGS

Optional NGS Dataset for Input (typically NGS datasets from upstream FLOE) (optional_input): The optional dataset(s) to read records from

  • Type: data_source

NGS Key Selection Parameters

Maximum Number of Full-Length Sequences (number_of_seqs_ngs_picking): The maximum number of full-length, non-redundant sequences to select. The number actually selected depends on the total number of non-redundant sequences, the maximum number of clusters, and the maximum number of sequences per cluster. If fewer sequences are selected than desired, try raising the maximum number of clusters or the maximum number of sequences per cluster. Alternatively, to fill the quota with additional sequences per cluster, turn on the ‘Attempt to Fill the Desired Number of Full-Length Sequences Quota’ parameter.

  • Required

  • Type: integer

  • Default: 100

Maximum Number Sequences Per Cluster (max_seq_per_cluster): The maximum number of unique full-length sequences to select from any one cluster. This value may be exceeded if the ‘Attempt to Fill the Desired Number of Full-Length Sequences Quota’ parameter is turned ON.

  • Required

  • Type: integer

  • Default: 10

Maximum Number of Clusters Preferred? (number_of_clusters_to_select_from): The maximum number of clusters to select from. This value may be exceeded if the ‘Attempt to Fill the Desired Number of Full-Length Sequences Quota’ parameter is turned ON.

  • Type: integer

  • Default: 40

Metrics for Ranking (picking_choices): List the metrics in order of ranking priority. If none are given, sequences are ranked by full-length count.

  • Required

  • Type: string

  • Default: [‘ROI Percent, Final Round Only’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘Liabilities Both Chains’, ‘Liabilities CDR3_2’]

  • Choices: [‘Full Length (Corrects for Illumina or PacBio), Count’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘ROI Count, Final Round Only’, ‘ROI Percent, Final Round Only’, ‘ROI Fold Enrichment, Final Round Only’, ‘ROI Log2 Enrichment, Final Round Only’, ‘Liabilities Both Chains’, ‘Liabilities Chain_2’, ‘Liabilities Chain_1’, ‘Liabilities CDR1_1’, ‘Liabilities CDR2_1’, ‘Liabilities CDR3_1’, ‘Liabilities CDR1_2’, ‘Liabilities CDR2_2’, ‘Liabilities CDR3_2’, ‘ROI Count, Early Round Only’, ‘ROI Percent, Early Round Only’, ‘Cluster Count (e.g. unique sequences per cluster)’, ‘Cluster Percent (e.g. unique rep per cluster)’]
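One way to read an ordered metric list is as a multi-key sort: the first metric decides the rank, and each later metric only breaks ties. The Python sketch below illustrates that reading. It assumes abundance-style metrics rank from high to low and liability counts from low to high; the floe's actual sort directions are not stated here, and the record fields and values are hypothetical.

```python
# Hypothetical per-clone records; field names are illustrative only.
clones = [
    {"id": "c1", "roi_percent": 4.0, "full_length_percent": 1.2, "liabilities": 3},
    {"id": "c2", "roi_percent": 4.0, "full_length_percent": 2.5, "liabilities": 1},
    {"id": "c3", "roi_percent": 9.1, "full_length_percent": 0.4, "liabilities": 5},
    {"id": "c4", "roi_percent": 4.0, "full_length_percent": 2.5, "liabilities": 0},
]

# Metrics in priority order as (field, higher_is_better). Directions are assumed.
metric_order = [
    ("roi_percent", True),
    ("full_length_percent", True),
    ("liabilities", False),
]

def rank_key(clone):
    """Build a sort key; negate 'higher is better' metrics so best sorts first."""
    return tuple(
        -clone[field] if higher_is_better else clone[field]
        for field, higher_is_better in metric_order
    )

ranked = sorted(clones, key=rank_key)
ranked_ids = [clone["id"] for clone in ranked]
print(ranked_ids)  # ['c3', 'c4', 'c2', 'c1']
```

Here c3 wins on the first metric alone, while c4 and c2 tie on the first two metrics and are separated only by the liability count.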

Attempt to Fill the Desired Number of Full-Length Sequences Quota (quota_attempt): If the desired number of full-length sequences is not reached, attempts to fill the quota by 1) selecting additional, lower-ranked clones across the designated clusters, followed by 2) selecting the remaining top-ranked clones from different clusters. NOTE: if turned on, this is likely to result in more sequences per cluster and, if the quota is still not met, more clusters than designated in order to reach the desired number of full-length sequences.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]
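The interaction between the three limits and the quota option can be sketched as a three-pass selection. This is an illustration of the behavior described above under simplifying assumptions, not the floe's implementation; the function `pick_clones` and the toy clone and cluster IDs are hypothetical.

```python
def pick_clones(ranked, n_total, max_per_cluster, max_clusters, fill_quota=True):
    """Select clone IDs from `ranked`, a best-first list of (clone_id, cluster_id)."""
    picked = []   # selected clone IDs, in pick order
    counts = {}   # cluster ID -> number of clones picked from it

    def take(clone, cluster):
        picked.append(clone)
        counts[cluster] = counts.get(cluster, 0) + 1

    # Pass 1: honor both the per-cluster cap and the cluster-count cap.
    for clone, cluster in ranked:
        if len(picked) >= n_total:
            break
        if cluster not in counts and len(counts) >= max_clusters:
            continue
        if counts.get(cluster, 0) >= max_per_cluster:
            continue
        take(clone, cluster)

    if fill_quota:
        # Pass 2: lower-ranked clones from clusters that were already chosen.
        designated = set(counts)
        for clone, cluster in ranked:
            if len(picked) >= n_total:
                break
            if clone not in picked and cluster in designated:
                take(clone, cluster)
        # Pass 3: best remaining clones from any other cluster.
        for clone, cluster in ranked:
            if len(picked) >= n_total:
                break
            if clone not in picked:
                take(clone, cluster)
    return picked

ranked = [("a1", "A"), ("a2", "A"), ("a3", "A"), ("b1", "B"), ("c1", "C"), ("c2", "C")]
strict = pick_clones(ranked, n_total=5, max_per_cluster=1, max_clusters=2, fill_quota=False)
filled = pick_clones(ranked, n_total=5, max_per_cluster=1, max_clusters=2, fill_quota=True)
print(strict)  # ['a1', 'b1']
print(filled)  # ['a1', 'b1', 'a2', 'a3', 'c1']
```

With the quota option off, the caps leave only two clones. With it on, cluster A ends up with three clones and a third cluster is used, which is the cap overrun the NOTE warns about.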

Key Downstream Parameters

Region of Interest For Enrichment and Clustering (roi_cluster): The region of interest for processing; only the top representative full-length sequence will be kept. IF THE INPUT IS ILLUMINA, ONLY CDR3 (CHAIN_1/UPSTREAM CHAIN) WILL BE USED FOR ENRICHMENT, RELATIVE ABUNDANCE, AND CLUSTERING.

  • Required

  • Type: string

  • Default: HCDR3 and LCDR3

  • Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]
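For orientation, relative abundance and enrichment of a region of interest (ROI) are conventionally computed as a percentage of reads per round and a ratio of those percentages between rounds. The Python sketch below uses those conventional definitions as an assumption; the floe's exact formulas are not given here, and the ROI strings and read counts are invented.

```python
import math
from collections import Counter

def percent_abundance(rois):
    """Return each ROI's share of the reads in one round, in percent."""
    counts = Counter(rois)
    total = sum(counts.values())
    return {roi: 100.0 * n / total for roi, n in counts.items()}

# Invented reads: one ROI string per read, for an early and a late round.
early = ["ARDY"] * 2 + ["ARGG"] * 6 + ["AKWW"] * 2
late = ["ARDY"] * 8 + ["ARGG"] * 2

early_pct = percent_abundance(early)
late_pct = percent_abundance(late)

# Fold enrichment = late percent / early percent; log2 of that ratio.
# ROIs absent from the early round are skipped to avoid dividing by zero
# (an assumption made for this sketch).
enrichment = {}
for roi, pct in late_pct.items():
    if roi in early_pct:
        fold = pct / early_pct[roi]
        enrichment[roi] = (fold, math.log2(fold))

print(enrichment["ARDY"])  # (4.0, 2.0)
```

A positive log2 value marks an ROI that grew between rounds, and a negative value marks one that shrank, as ARGG does here.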

Clustering Type (cluster_type_ngs): Cluster type to apply to sequencing dataset

  • Required

  • Type: string

  • Default: AbScan

  • Choices: [‘AbScan’, ‘Unique Only’, ‘Levenshtein Distance’, ‘Hamming Distance’]
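Two of the clustering choices are named after standard string distances. The sketch below gives the textbook definitions in Python, applied to invented CDR3-like strings. How the floe applies these distances (thresholds, grouping rules) is not described here, so this shows only what the distances measure.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

d_sub = hamming("ARDYYGSS", "ARDYFGSS")     # one substitution
d_del = levenshtein("ARDYYGSS", "ARDYGSS")  # one deletion
print(d_sub, d_del)  # 1 1
```

Hamming distance is defined only for sequences of the same length, whereas Levenshtein distance also handles length differences.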

Keep Only Functional Sequences (filter_functional): Eliminates non-functional sequences, including those with truncations, stop codons, or frame shifts

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Exclude Values That Did Not Match In-Line Barcode (exclude_unknown): If True, will exclude unknown values that did not have a barcode match, unless there is only one barcode for the entire NGS population.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Key Liability Parameters

Polyspecificity Liabilities (liability_choices_poly): Polyspecificity liabilities to quantify

  • Type: string

  • Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]

  • Choices: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘YY - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’]

Deamidation Liabilities (liability_choices_deam): Deamidation liabilities to quantify

  • Type: string

  • Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

  • Choices: [‘N[GSTN] - Deamidation’, ‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GN[FYTG] - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]

Glycosylation Liabilities (liability_choices_glyc): Glycosylation liabilities to quantify

  • Type: string

  • Default: [‘NXT/S - Glycosylation’]

  • Choices: [‘NXT/S - Glycosylation’, ‘NXT - Glycosylation’, ‘NXS - Glycosylation’]

Hydrolysis Liabilities (liability_choices_hydrolysis): Hydrolysis liabilities to quantify

  • Type: string

  • Default: [‘DP - Hydrolysis’]

  • Choices: [‘DP - Hydrolysis’]

Isomerization Liabilities (liability_choices_iso): Isomerization liabilities to quantify

  • Type: string

  • Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

  • Choices: [‘D[GSD] - Isomerization’, ‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]

Biophysical Liabilities (liability_choices_charge): Net charge or hydropathy liabilities to quantify

  • Type: string

  • Default: [‘Charge (>1)’]

  • Choices: [‘Charge (>-1)’, ‘Charge (>0)’, ‘Charge (>1)’, ‘Charge (>2)’, ‘Charge (>3)’, ‘Charge (>4)’, ‘Parker Hydropathy (<0.0)’, ‘Parker Hydropathy (<-0.1)’, ‘Parker Hydropathy (<-0.2)’, ‘Parker Hydropathy (<-0.3)’, ‘Parker Hydropathy (<-0.4)’, ‘Parker Hydropathy (<-0.5)’, ‘Parker Hydropathy (<-0.6)’, ‘Parker Hydropathy (<-0.7)’, ‘Parker Hydropathy (<-0.8)’, ‘Parker Hydropathy (<-0.9)’, ‘Parker Hydropathy (<-1.0)’, ‘Parker Hydropathy (<-2.0)’, ‘Parker Hydropathy (<-3.0)’, ‘Parker Hydropathy (<-4.0)’, ‘Parker Hydropathy (<-5.0)’]

Cysteine Liabilities (liability_choices_cysteine): Cysteine-based liabilities to quantify

  • Type: string

  • Default: [‘Unpaired Cysteine’]

  • Choices: [‘Unpaired Cysteine’, ‘Any Cysteine’]
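Many of the liability names above read as short sequence patterns. As an illustration, the Python sketch below counts a few of them with regular expressions. It assumes that ‘X’ means any residue, that bracketed letters are alternatives, and that overlapping hits are each counted; none of this is confirmed as the floe's behavior. The `MOTIFS` mapping and the test sequence are hypothetical.

```python
import re

# Hypothetical mapping from liability label to a regular expression.
MOTIFS = {
    "N[GSTN] - Deamidation": r"N[GSTN]",
    "D[GSD] - Isomerization": r"D[GSD]",
    "DP - Hydrolysis": r"DP",
    "NXT/S - Glycosylation": r"N.[TS]",  # assumes X = any residue
    "WXW - Polyspecificity": r"W.W",     # assumes X = any residue
}

def count_liabilities(seq):
    """Count occurrences of each motif in an amino-acid sequence."""
    counts = {}
    for label, pattern in MOTIFS.items():
        # A lookahead matches at every start position, so overlapping
        # hits such as the two deamidation sites in "NNS" are both counted.
        counts[label] = len(re.findall(f"(?={pattern})", seq))
    return counts

liability_counts = count_liabilities("ARDNGSWDPFDYNNS")
for label, n in liability_counts.items():
    print(f"{label}: {n}")
```

In this invented sequence, the sketch finds three deamidation sites (NG, NN, NS), one hydrolysis site (DP), and two glycosylation sequons (NGS, NNS).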

Top Selected Output Names

Output Name of the Consolidated Dataset, Long-Read/PacBio Picked Population (ngs_picking_out): A single consolidated dataset of the selected NGS clones (e.g., all 100 desired clones). If there are separate groups, these are output as well.

  • Required

  • Type: dataset_out

  • Default: picked.consolidated

Output Basename of the Long-Read/PacBio Selected Group (Top Leads Picked) (ngs_picking_consolidate): This dataset contains picked clones, separated by group (if from separate groups upstream).

  • Required

  • Type: dataset_out

  • Default: picked.long_read

Output Basename of the Short-Read/Illumina Selected Group (Top Leads Picked) (ngs_picking_consolidate_ill): This dataset contains picked clones, separated by group (if from separate groups upstream).

  • Required

  • Type: dataset_out

  • Default: picked.short_read

Output CSV Filename for the NGS Selection (file_out_picking_csv): All records are written to the downstream picking CSV file. The filename must have the .csv extension.

  • Required

  • Type: file_out

  • Default: picked.csv

Downstream Output Names

Output Name of the Consolidated Dataset, Before Picking, Downstream of IgMatcher (output_out): This dataset consolidates all sample names and barcode groups into a single dataset

  • Required

  • Type: dataset_out

  • Default: down.consolidated

Output Basename of the Downstream Long-Read/PacBio Datasets (process_out): All records are written to downstream datasets according to group name, with this base name appended to output

  • Required

  • Type: dataset_out

  • Default: down.long_read

Output Basename of the Downstream Short-Read/Illumina Datasets (process_out_ill): All records are written to downstream datasets according to group name, with this base name appended to output

  • Required

  • Type: dataset_out

  • Default: down.short_read

Output CSV Filename (file_out_csv): All records are written to the downstream CSV file. The filename must have the .csv extension.

  • Required

  • Type: file_out

  • Default: down.consolidated.csv

Upstream Long-Read (PacBio) OR Short-Read (Illumina)

Output Basename of the Upstream Long-Read/PacBio Datasets (consolidate_out): This dataset contains all Long-Read/PacBio files processed immediately after IgMatcher and before downstream processing (if barcodes are used, the sample name is used)

  • Required

  • Type: dataset_out

  • Default: up.long_read

Output Basename of the Upstream Short-Read/Illumina Datasets (consolidate_out_ill): This dataset contains all Short-Read/Illumina files processed immediately after IgMatcher and before downstream processing (if barcodes are used, the sample name is used)

  • Required

  • Type: dataset_out

  • Default: up.short_read

Failure Output

Failed Dataset Output Name (fout): Contains failed records from both upstream and downstream processes

  • Required

  • Type: dataset_out

  • Default: problematic.ngs_abxtract_process_auto_picking