NGS and Sanger Pipeline with Automated Top Lead Selection - AbXtract
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Solution-based/Biologics/Antibody Design
Role-based/Bioinformatician
Role-based/Biologist
Product-based/AbXtract
Description
NGS input, as FASTQ files or Datasets, for annotation, demultiplexing, relative abundance, enrichment, clustering and overlap assessment. Includes Sanger input as FASTA, FASTQ, TSV, CSV, Excel or Dataset, which can be clustered in the context of NGS sequences and overlapped by a given region of interest (ROI). This floe automatically selects the top X sequences across the defined cluster space using rank-ordered, user-specified metrics.
Promoted Parameters
Title in user interface (promoted name)
Sanger Key FILE Inputs
SANGER Input File (FASTA/Q, Excel, TSV, or CSV) (input_fastq_sanger): If using Excel, TSV, or CSV, the input consists of two columns (NO HEADER REQUIRED) of the form: id, sequence
Type: file_in
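As a minimal sketch of the tabular Sanger input described above, the following writes headerless two-column rows of the form id, sequence. The clone IDs, the toy sequences and the helper name write_sanger_table are hypothetical and used only for illustration.

```python
import csv
import io


def write_sanger_table(records, handle):
    """Write (id, sequence) pairs as headerless two-column CSV rows."""
    writer = csv.writer(handle, lineterminator="\n")
    for clone_id, sequence in records:
        # No header row is written; each row is just: id, sequence
        writer.writerow([clone_id, sequence.strip()])


# Hypothetical clone IDs and toy sequences, for illustration only.
records = [
    ("plate1_A01", "CAGGTGCAGCTGGTG"),
    ("plate1_A02", "GAGGTGCAGCTGTTG"),
]

buffer = io.StringIO()
write_sanger_table(records, buffer)
sanger_csv_text = buffer.getvalue()
print(sanger_csv_text)
```

Writing to an open file handle instead of the in-memory buffer would produce a file suitable for upload, under the same assumptions.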
Species Database to Select From (species_sanger): Species group (species) desired to generate the db for igmatcher. NOTE: the Jain et al, 2017 database is available at the amino acid level only.
Required
Type: string
Default: [‘Human’]
Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’, ‘Jain Dataset (amino acid only)’]
Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger). (custom_annotation_dna_sanger): ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
Type: file_in
Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_aa_sanger): ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
Type: file_in
Is the Input SANGER a Single Domain or Chain (is_single_dom): Turn ON if sequence covers only a single chain/domain (e.g. VH or VL). This will automatically adjust the acceptable minimum read length and annotation query length. If this is turned ON, both chain_1 and chain_2 columns are provided in the output file but they will be identical. IMPORTANT: has no effect on amino acid inputs, only on DNA inputs.
Required
Type: boolean
Default: False
Choices: [True, False]
Shared Region of Interest (ROI) Sequences (shared_roi): This will provide an overlap_roi output that shows all the individual wells that share the same id.
Required
Type: string
Default: CDR3 Chain_2 (Downstream Chain)
Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]
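The idea of an overlap by shared ROI can be sketched as a simple grouping of wells by their ROI sequence. This is an illustration only, not the floe's implementation; the field names well_id and cdr3_chain_2 and the toy sequences are hypothetical.

```python
from collections import defaultdict


def group_wells_by_roi(wells, roi_field):
    """Map each ROI sequence to the list of wells that carry it."""
    groups = defaultdict(list)
    for well in wells:
        groups[well[roi_field]].append(well["well_id"])
    return dict(groups)


# Hypothetical Sanger wells with a toy CDR3 of the downstream chain.
wells = [
    {"well_id": "A01", "cdr3_chain_2": "ARDYGMDV"},
    {"well_id": "A02", "cdr3_chain_2": "ARGLFDY"},
    {"well_id": "B03", "cdr3_chain_2": "ARDYGMDV"},
]

roi_groups = group_wells_by_roi(wells, "cdr3_chain_2")
for roi, well_ids in roi_groups.items():
    print(roi, well_ids)
```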
Long-Read (PacBio) FILE/SELECTION Inputs
NGS Input FASTQ (Long-Read, e.g., PacBio) (pacbio_input_file): Input FASTQ File
Type: file_in
Barcode Table (barcode_table_ngs): File without header, formatted as: Name,5’barcode,3’barcode,barcode_round,barcode_group. If you have only a 5’ barcode, write: name,5’barcode,,, If you have only a 3’ barcode, write: name,,3’barcode,,,
Type: file_in
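A barcode table of this shape can be generated programmatically. The sketch below assumes exactly five comma-separated fields per row, in the column order listed in the description, with missing values left empty. The sample names, barcode sequences, and the round and group values are hypothetical.

```python
import csv
import io


def barcode_row(name, five_prime="", three_prime="", barcode_round="", barcode_group=""):
    """Build one headerless row: Name, 5' barcode, 3' barcode, round, group."""
    return [name, five_prime, three_prime, barcode_round, barcode_group]


# Hypothetical samples: one with only a 5' barcode, one fully specified.
rows = [
    barcode_row("sample_a", five_prime="ACGTAC"),
    barcode_row("sample_b", "GGATCC", "TTGCAA", "2", "groupA"),
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
barcode_table_text = buffer.getvalue()
print(barcode_table_text)
```

Empty fields are kept as empty strings so that every row has the same number of columns.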
Species Database to Select From (species_ngs): Species reference database to generate the db for igmatcher. A value must be selected even if a custom annotation file is provided.
Required
Type: string
Default: [‘Human’]
Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]
Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher): ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
Type: file_in
Short-Read (Illumina) FILE/SELECTION Inputs
NGS Input FASTQ Forward (Short-Read/Illumina) (illumina_input_file1): Input FASTQ File 1
Type: file_in
NGS Input FASTQ Reverse (Short-Read/Illumina) (illumina_input_file2): Input FASTQ File 2
Type: file_in
Barcode Table (barcode_table_ngs_ill): File without header, formatted as: Name,5’barcode,3’barcode,barcode_round,barcode_group. If you have only a 5’ barcode, write: name,5’barcode,,, If you have only a 3’ barcode, write: name,,3’barcode,,,
Type: file_in
Species Database to Select From (species_illumina): Species reference database to generate the db for igmatcher. A value must be selected even if a custom annotation file is provided.
Required
Type: string
Default: [‘Human’]
Choices: [‘Alpaca’, ‘Human’, ‘Mouse’, ‘Rabbit’]
Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (custom_annotation_dna_igmatcher_illumina): ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences: OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
Type: file_in
Optional DATASET Inputs from Sanger and/or NGS
Optional NGS Dataset for Input (typically NGS datasets from upstream FLOE) (optional_input_ngs_file): The optional dataset(s) to read records from
Type: data_source
Optional SANGER Dataset for Input (typically datasets processed by Quick Sanger FLOE) (optional_input_sanger_file): The optional dataset(s) to read records from
Type: data_source
NGS Key Selection Parameters
Maximum Number of Full-Length Sequences (number_of_seqs_ngs_picking): Indicate the maximum number of full-length, non-redundant sequences to select. The number actually selected depends on the total number of non-redundant sequences, the maximum number of clusters, and the maximum number of sequences per cluster. If the total falls below the desired number, try adjusting the maximum number of clusters or the maximum number of sequences per cluster. Alternatively, to fill the quota with additional sequences per cluster, turn on the ‘Attempt to Fill the Desired Number of Full-Length Sequences Quota’ parameter (quota_attempt).
Required
Type: integer
Default: 100
Maximum Number Sequences Per Cluster (max_seq_per_cluster): Indicate the maximum number of unique full-length sequences per given cluster. This value may be exceeded if the ‘Write the Automated Top Clones Output to CSV File’ is turned ON
Required
Type: integer
Default: 10
Maximum Number of Clusters Preferred? (number_of_clusters_to_select_from): Indicate the maximum number of clusters that you want to choose from. This value may be exceeded if the ‘Write the Automated Top Clones Output to CSV File’ is turned ON
Type: integer
Default: 40
Metrics for Ranking (picking_choices): Place metrics in order of ranking (if nothing, ranks by full-length count)
Required
Type: string
Default: [‘ROI Percent, Final Round Only’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘Liabilities Both Chains’, ‘Liabilities CDR3_2’]
Choices: [‘Full Length (Corrects for Illumina or PacBio), Count’, ‘Full Length (Corrects for Illumina or PacBio), Percent’, ‘ROI Count, Final Round Only’, ‘ROI Percent, Final Round Only’, ‘ROI Fold Enrichment, Final Round Only’, ‘ROI Log2 Enrichment, Final Round Only’, ‘Liabilities Both Chains’, ‘Liabilities Chain_2’, ‘Liabilities Chain_1’, ‘Liabilities CDR1_1’, ‘Liabilities CDR2_1’, ‘Liabilities CDR3_1’, ‘Liabilities CDR1_2’, ‘Liabilities CDR2_2’, ‘Liabilities CDR3_2’, ‘ROI Count, Early Round Only’, ‘ROI Percent, Early Round Only’, ‘Cluster Count (e.g. unique sequences per cluster)’, ‘Cluster Percent (e.g. unique rep per cluster)’]
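Ranking by an ordered list of metrics can be pictured as a multi-key sort, where earlier metrics take priority and later ones break ties. The sketch below is an illustration under stated assumptions, not the floe's implementation: it assumes abundance-type metrics are ranked from high to low and liability counts from low to high, and the field names and values are hypothetical.

```python
def rank_clones(clones, metrics):
    """Sort clones by metrics given in priority order.

    metrics is a list of (field, higher_is_better) pairs.
    """
    def sort_key(clone):
        # Negate values where higher is better so that an ascending sort
        # puts the most desirable clone first.
        return tuple(
            -clone[field] if higher_is_better else clone[field]
            for field, higher_is_better in metrics
        )

    return sorted(clones, key=sort_key)


# Hypothetical clones and metric values.
clones = [
    {"id": "c1", "roi_percent": 5.0, "full_length_percent": 1.0, "liabilities": 2},
    {"id": "c2", "roi_percent": 5.0, "full_length_percent": 2.0, "liabilities": 3},
    {"id": "c3", "roi_percent": 9.0, "full_length_percent": 0.5, "liabilities": 4},
    {"id": "c4", "roi_percent": 5.0, "full_length_percent": 2.0, "liabilities": 0},
]
metrics = [
    ("roi_percent", True),
    ("full_length_percent", True),
    ("liabilities", False),
]

ranked_ids = [clone["id"] for clone in rank_clones(clones, metrics)]
print(ranked_ids)
```

Here c3 ranks first on ROI percent alone, while c4 and c2 tie on the first two metrics and are separated by liability count.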
Rank Sanger Clones First in Population (rank_sanger): Sanger clones, if present, will be ranked first.
Required
Type: boolean
Default: True
Choices: [True, False]
Attempt to Fill the Desired Number of Full-Length Sequences Quota (quota_attempt): If the desired number of full-length sequences is not reached, this will attempt to fulfill the quota by 1) selecting additional, lower-ranked clones across the designated clusters, followed by 2) selecting the remaining top-ranked clones from different clusters. NOTE: if turned on, this is likely to result in a greater number of sequences per cluster and, if the quota is still not met, more clusters than designated, in order to reach the desired number of full-length sequences.
Required
Type: boolean
Default: True
Choices: [True, False]
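The interplay between the total, the per-cluster cap, the cluster cap and the quota attempt can be sketched as follows. This is one possible reading of the descriptions above, not the floe's actual algorithm; the function name pick_leads and the record fields id and cluster are hypothetical, and the input is assumed to be already rank ordered.

```python
def pick_leads(ranked, n_total, max_per_cluster, max_clusters, quota_attempt=True):
    """Select up to n_total clones from a rank-ordered list."""
    picked, per_cluster = [], {}

    # First pass: respect both the cluster cap and the per-cluster cap.
    for clone in ranked:
        if len(picked) >= n_total:
            break
        cid = clone["cluster"]
        if cid not in per_cluster and len(per_cluster) >= max_clusters:
            continue
        if per_cluster.get(cid, 0) >= max_per_cluster:
            continue
        picked.append(clone)
        per_cluster[cid] = per_cluster.get(cid, 0) + 1

    if quota_attempt and len(picked) < n_total:
        chosen = {clone["id"] for clone in picked}
        designated = set(per_cluster)
        leftovers = [clone for clone in ranked if clone["id"] not in chosen]
        # Stage 1: lower-ranked clones from the designated clusters.
        # Stage 2: remaining top-ranked clones from other clusters.
        for in_designated in (True, False):
            for clone in leftovers:
                if len(picked) >= n_total:
                    break
                if (clone["cluster"] in designated) == in_designated:
                    picked.append(clone)
    return picked


# Hypothetical rank-ordered clones.
ranked = [
    {"id": "a1", "cluster": "A"},
    {"id": "a2", "cluster": "A"},
    {"id": "a3", "cluster": "A"},
    {"id": "b1", "cluster": "B"},
    {"id": "c1", "cluster": "C"},
]

with_quota = [c["id"] for c in pick_leads(ranked, 4, 1, 2, quota_attempt=True)]
without_quota = [c["id"] for c in pick_leads(ranked, 4, 1, 2, quota_attempt=False)]
print(with_quota, without_quota)
```

With the quota attempt off, the caps leave only two clones selected; with it on, the per-cluster cap is exceeded in cluster A to reach the requested four.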
Key Sanger Input Parameters
Full/Partial alignment to annotate CDRs (cdr_method_sanger): Align query sequence to the entire germline or to partial regions to annotate. IMPORTANT POINT 1: Full should be used for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using custom annotation file) and ‘partial’ works better with DNA for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).
Type: string
Default: full
Choices: [‘partial’, ‘full’]
Region of Interest (ROI) For Condensing Sequences (roi_sanger_condense): This will condense the Sanger sequences by the ROI, rank-ordered by abundance. IMPORTANT: this will remove full-length sequences and keep only the one with the highest full-length count for each ROI. If two sequences have the same full-length count, one of them is picked.
Required
Type: string
Default: Full-Length
Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]
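Condensing by ROI can be sketched as keeping, for each ROI, the record whose full-length sequence is most abundant. This is an illustration only; the field names are hypothetical, and in this sketch a tie is resolved in favor of the record seen first, which is an assumption rather than documented behavior.

```python
from collections import Counter


def condense_by_roi(records, roi_field):
    """Keep one record per ROI: the one with the highest full-length count."""
    counts = Counter(record["full_length"] for record in records)
    best = {}
    for record in records:
        roi = record[roi_field]
        current = best.get(roi)
        # Strictly greater: on a tie, the first record seen is kept.
        if current is None or counts[record["full_length"]] > counts[current["full_length"]]:
            best[roi] = record
    return list(best.values())


# Hypothetical Sanger records with toy sequences.
records = [
    {"id": "w1", "full_length": "QVQLAAA", "cdr3": "ARDY"},
    {"id": "w2", "full_length": "QVQLAAB", "cdr3": "ARDY"},
    {"id": "w3", "full_length": "QVQLAAB", "cdr3": "ARDY"},
    {"id": "w4", "full_length": "EVQLCCC", "cdr3": "ARGL"},
]

condensed_ids = [record["id"] for record in condense_by_roi(records, "cdr3")]
print(condensed_ids)
```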
Key Downstream Parameters
Region of Interest For Sanger overlap (roi_sanger): Indicate the region of interest (ROI) to map back to Sanger clones
Required
Type: string
Default: CDR3 Chain_2 (Downstream Chain)
Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’]
Region of Interest For Enrichment and Clustering (roi_cluster): Indicate the region of interest for processing; only the top representative full-length sequence will be kept. IF INPUT IS ILLUMINA, ONLY CDR3 (CHAIN_1/UPSTREAM CHAIN) WILL BE USED FOR ENRICHMENT, RELATIVE ABUNDANCE, AND CLUSTERING. OVERLAP TO SANGER IS ADJUSTED TO HEAVY CDR3 FOR ILLUMINA.
Required
Type: string
Default: HCDR3 and LCDR3
Choices: [‘Merged CDRs’, ‘CDR3 Chain_1 (Upstream Chain)’, ‘CDR3 Chain_2 (Downstream Chain)’, ‘HCDR3 and LCDR3’, ‘Full-Length’]
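Relative abundance and enrichment of an ROI between selection rounds can be illustrated with a small calculation. The formulas below are a common convention and an assumption here, not the floe's documented definition: percent is the ROI count over the round total, fold enrichment is the final-round percent over the early-round percent, and log2 enrichment is its base-2 logarithm. The counts are hypothetical.

```python
import math


def percent(count, total):
    """Relative abundance of an ROI within one round, in percent."""
    return 100.0 * count / total


def fold_enrichment(final_count, final_total, early_count, early_total):
    """Final-round percent divided by early-round percent."""
    early = percent(early_count, early_total)
    if early == 0:
        # ROI absent from the early round: treated as infinite enrichment here.
        return math.inf
    return percent(final_count, final_total) / early


# Hypothetical counts: 50 of 1000 reads early, 400 of 1000 reads final.
fold = fold_enrichment(400, 1000, 50, 1000)
log2_fold = math.log2(fold)
print(fold, log2_fold)
```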
Clustering Type (cluster_type_ngs): Cluster type to apply to sequencing dataset
Required
Type: string
Default: AbScan
Choices: [‘AbScan’, ‘Unique Only’, ‘Levenshtein Distance’, ‘Hamming Distance’]
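For orientation, the two distance-based choices refer to standard string metrics, sketched below on toy CDR3-like strings. Hamming distance counts mismatched positions between equal-length sequences, while Levenshtein distance also allows insertions and deletions. How the floe turns these distances into clusters, and how AbScan works, is not covered by this sketch.

```python
def hamming(a, b):
    """Number of mismatched positions; defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


# Toy sequences, for illustration only.
print(hamming("ARDYW", "ARDFW"))
print(levenshtein("ARDYW", "ARDW"))
```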
Keep Only Functional Sequences (At NGS Level) (filter_functional): Eliminates non-functional sequences, truncations, stop-codons, frame-shifts. NOTE: This does not apply to Sanger sequences, which have the same option as a hidden variable.
Required
Type: boolean
Default: True
Choices: [True, False]
Exclude Values That Did Not Match In-Line Barcode (exclude_unknown): If True, will exclude unknown values that did not have a barcode match, unless there is only one barcode for the entire NGS population.
Required
Type: boolean
Default: True
Choices: [True, False]
Key Liability Parameters
Polyspecificity Liabilities (liability_choices_poly): polyspecificity liabilities to quantify
Type: string
Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]
Choices: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘YY - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’]
Deamidation Liabilities (liability_choices_deam): deamidation liabilities to quantify
Type: string
Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]
Choices: [‘N[GSTN] - Deamidation’, ‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GN[FYTG] - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]
Glycosylation Liabilities (liability_choices_glyc): glycosylation liabilities to quantify
Type: string
Default: [‘NXT/S - Glycosylation’]
Choices: [‘NXT/S - Glycosylation’, ‘NXT - Glycosylation’, ‘NXS - Glycosylation’]
Hydrolysis Liabilities (liability_choices_hydrolysis): hydrolysis liabilities to quantify
Type: string
Default: [‘DP - Hydrolysis’]
Choices: [‘DP - Hydrolysis’]
Isomerization Liabilities (liability_choices_iso): isomerization liabilities to quantify
Type: string
Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]
Choices: [‘D[GSD] - Isomerization’, ‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]
Biophysical Liabilities (liability_choices_charge): Net charge or hydropathy liabilities to quantify
Type: string
Default: [‘Charge (>1)’]
Choices: [‘Charge (>-1)’, ‘Charge (>0)’, ‘Charge (>1)’, ‘Charge (>2)’, ‘Charge (>3)’, ‘Charge (>4)’, ‘Parker Hydropathy (<0.0)’, ‘Parker Hydropathy (<-0.1)’, ‘Parker Hydropathy (<-0.2)’, ‘Parker Hydropathy (<-0.3)’, ‘Parker Hydropathy (<-0.4)’, ‘Parker Hydropathy (<-0.5)’, ‘Parker Hydropathy (<-0.6)’, ‘Parker Hydropathy (<-0.7)’, ‘Parker Hydropathy (<-0.8)’, ‘Parker Hydropathy (<-0.9)’, ‘Parker Hydropathy (<-1.0)’, ‘Parker Hydropathy (<-2.0)’, ‘Parker Hydropathy (<-3.0)’, ‘Parker Hydropathy (<-4.0)’, ‘Parker Hydropathy (<-5.0)’]
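A charge threshold such as ‘Charge (>1)’ can be pictured with a simple net-charge count. The sketch below is an assumption about how such a value might be computed, not the floe's method: it counts K and R as +1 and D and E as -1, ignores histidine and the termini, and flags a sequence whose net charge exceeds the threshold. The toy sequences are hypothetical.

```python
def net_charge(sequence):
    """Approximate net charge: (K + R) minus (D + E). Assumption: H ignored."""
    positive = sum(sequence.count(residue) for residue in "KR")
    negative = sum(sequence.count(residue) for residue in "DE")
    return positive - negative


def exceeds_charge(sequence, threshold=1):
    """True if the approximate net charge is greater than the threshold."""
    return net_charge(sequence) > threshold


# Toy CDR-like sequences, for illustration only.
print(net_charge("ARKRKW"), exceeds_charge("ARKRKW"))
print(net_charge("ARDEKW"), exceeds_charge("ARDEKW"))
```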
Cysteine Liabilities (liability_choices_cysteine): cysteine-based liabilities to quantify
Type: string
Default: [‘Unpaired Cysteine’]
Choices: [‘Unpaired Cysteine’, ‘Any Cysteine’]
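Many of the sequence liabilities listed above are short motifs, which can be illustrated with regular expressions. The sketch below is not the floe's implementation; it assumes that X in a motif name means any residue and counts overlapping occurrences by using a lookahead. Only a subset of the listed choices is shown, and the toy sequence is hypothetical.

```python
import re

# Motif names taken from the choices above; the regex translations are assumptions.
LIABILITY_PATTERNS = {
    "N[GSTN] - Deamidation": r"N[GSTN]",
    "D[GSD] - Isomerization": r"D[GSD]",
    "DP - Hydrolysis": r"DP",
    "NXT/S - Glycosylation": r"N.[ST]",
    "WXW - Polyspecificity": r"W.W",
    "GGG - Polyspecificity": r"GGG",
}


def count_liabilities(sequence, patterns=LIABILITY_PATTERNS):
    """Count occurrences of each motif, including overlapping ones."""
    return {
        name: len(re.findall(f"(?={pattern})", sequence))
        for name, pattern in patterns.items()
    }


# Toy amino acid sequence, for illustration only.
liability_counts = count_liabilities("ARNGDPWAWNSTDG")
for name, count in liability_counts.items():
    print(name, count)
```

The lookahead form matters for repeats: a plain search would report one GGG in GGGG, whereas the overlapping count reports two.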
Top Selected Output Names
Output Name of the Consolidated Dataset, NGS Picked Population (ngs_picking_out): This dataset will contain your selected NGS clones consolidated into a single dataset, e.g. all 100 desired clones together (separate groups, if any, will be output as well)
Required
Type: dataset_out
Default: picked.consolidated
Output CSV Filename for the NGS Selection (file_out_picking_csv): All records are written to downstream picking csv file, must contain the *.csv extension
Required
Type: file_out
Default: picked.csv
Output Subset CSV Filename (subset_file_out_picking_csv): Name of the downstream CSV file to which a smaller subset of key fields is written; must contain the *.csv extension
Required
Type: file_out
Default: picked.subset.csv
Lead Selection Report Name (custom_picking_report): The name of the floe report for selected clones.
Required
Type: dataset_out
Default: lead_clones.FloeReport
Downstream Output Names
Output Name of the Consolidated Dataset, Before Picking, Downstream of IgMatcher (output_out): This dataset will contain a consolidated dataset where all sample names and barcode groups belong to the same dataset
Required
Type: dataset_out
Default: down.consolidated
Output Basename of the Downstream Long-Read/PacBio Datasets (process_out): All records are written to downstream datasets according to group name, with this base name appended to output
Required
Type: dataset_out
Default: down.long_read
Output Basename of the Downstream Short-Read/Illumina Datasets (process_out_ill): All records are written to downstream datasets according to group name, with this base name appended to output
Required
Type: dataset_out
Default: down.short_read
Output CSV Filename (file_out_csv): All records are written to downstream csv file, must contain the *.csv extension
Required
Type: file_out
Default: down.consolidated.csv
Upstream Long-Read (PacBio) OR Short-Read (Illumina) AND Sanger Output Names
Output Basename of the Upstream Long-Read/PacBio Datasets (consolidate_out): This dataset contains all Long-Read/PacBio files processed immediately after IgMatcher and before downstream processing (if barcode used, will use the sample name)
Required
Type: dataset_out
Default: up.long_read
Output Name of the Sanger Dataset (sanger_out): This dataset contains all SANGER files processed immediately after IgMatcher and before downstream processing.
Required
Type: dataset_out
Default: up.sanger
Output Basename of the Upstream Short-Read/Illumina Datasets (consolidate_out_ill): This dataset contains all Short-Read/Illumina files processed immediately after IgMatcher and before downstream processing (if barcode used, will use the sample name)
Required
Type: dataset_out
Default: up.short_read
Output CSV Filename for the SANGER upstream file (file_out_sanger_csv): All upstream SANGER (pre-downstream processing) records are written to csv file, must contain the *.csv extension
Required
Type: file_out
Default: upstream_sanger.csv
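Since each CSV output name must carry the .csv extension, a quick check before launching can catch a mistyped name. The sketch below uses the promoted names and defaults listed in this document; the helper name invalid_csv_names and the overridden value are hypothetical, and the check itself is only an illustration of the stated requirement.

```python
# Promoted names and defaults as listed in this document.
CSV_OUTPUTS = {
    "file_out_picking_csv": "picked.csv",
    "subset_file_out_picking_csv": "picked.subset.csv",
    "file_out_csv": "down.consolidated.csv",
    "file_out_sanger_csv": "upstream_sanger.csv",
}


def invalid_csv_names(outputs):
    """Return the promoted names whose filename lacks the .csv extension."""
    return [
        name for name, filename in outputs.items()
        if not filename.endswith(".csv")
    ]


defaults_ok = invalid_csv_names(CSV_OUTPUTS)

# Hypothetical override with a wrong extension.
overridden = dict(CSV_OUTPUTS, file_out_picking_csv="leads.tsv")
problems = invalid_csv_names(overridden)
print(defaults_ok, problems)
```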
Output Floe Report Name (long_report_floe_report): Name of the Floe Report for FASTQ Quality statistics, if desired
Required
Type: dataset_out
Default: Long.FloeReport
Output Floe Report Name (short_report_floe_report): Name of the Floe Report for FASTQ Quality statistics, if desired
Required
Type: dataset_out
Default: Short.FloeReport
Failure Output
Failed Dataset Output Name (fout): Contains failed records from both upstream and downstream processes
Required
Type: dataset_out
Default: problematic.ngs_processing_with_sanger_auto_picking