Quick Sanger - AbXtract

This floe processes a single file in FASTA, FASTQ, TSV, CSV, or Excel format. Each sequence ID typically represents a unique well ID. If using TSV, CSV, or Excel files as input, format each row as follows:
  1. column A = id

  2. column B = sequence (DNA or amino acid), WITHOUT header
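As an illustration of the row layout above, the following minimal sketch writes a headerless CSV with the ID in column A and the sequence in column B. The function name write_sanger_csv, the well IDs, and the sequences are hypothetical; only the two-column, no-header layout comes from the description above.

```python
import csv
import io


def write_sanger_csv(records, handle):
    """Write (id, sequence) pairs as headerless rows: column A = id, column B = sequence."""
    writer = csv.writer(handle, lineterminator="\n")
    for well_id, sequence in records:
        writer.writerow([well_id, sequence])


# Hypothetical well IDs and sequences (DNA or amino acid), for illustration only.
records = [
    ("A01", "CAGGTGCAGCTGGTGCAG"),
    ("A02", "EVQLVESGGGLVQPGGSLRLSCAAS"),
]

buffer = io.StringIO()
write_sanger_csv(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file on disk instead, pass a handle opened with open(path, "w", newline="") in place of the in-memory buffer.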

The floe condenses amino acid (AA) sequences according to a user-defined option (Default = Full-Length, Framework included). Redundant sequences are condensed into a single record, and the ‘id’ field then contains a ‘:’-separated list of the original IDs (typically well IDs). The floe also calculates liabilities and biophysical properties by CDR (length, net charge, Parker hydropathy).
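The condensing step can be pictured with a small sketch. This is not the floe's implementation: the function name condense_by_roi and the example records are assumptions, and the sketch only shows grouping identical sequences, joining their IDs with ‘:’, and ordering groups by abundance.

```python
def condense_by_roi(records):
    """Group (id, roi_sequence) pairs by identical ROI and join their IDs with ':'.

    Returns a list of (joined_ids, roi_sequence, count), most abundant first.
    """
    groups = {}
    for well_id, roi in records:
        groups.setdefault(roi, []).append(well_id)
    condensed = [(":".join(ids), roi, len(ids)) for roi, ids in groups.items()]
    # sorted() is stable, so groups with equal counts keep their first-seen order here.
    return sorted(condensed, key=lambda row: row[2], reverse=True)


# Hypothetical well IDs and ROI sequences, for illustration only.
records = [
    ("A01", "ARDYW"),
    ("A02", "ARGGW"),
    ("A03", "ARDYW"),
    ("B01", "ARDYW"),
    ("B02", "ARGGW"),
    ("B03", "AKSSW"),
]

condensed = condense_by_roi(records)
for joined_ids, roi, count in condensed:
    print(joined_ids, roi, count)
```

With the default option the grouping key would be the full-length sequence, including framework; with a narrower region it would be that region's sequence.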

Main Parameters

  • Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (file_in) : Input file in standard FASTQ/A format, or a CSV, TSV, or Excel file in comma-, tab-, or column-separated format such as “id,sequence”. A header row SHOULD NOT be provided.
  • Full/Partial alignment to annotate CDRs (string) : Aligns the query sequence to the entire germline (‘full’) or to partial regions (‘partial’) to annotate CDRs. IMPORTANT POINT 1: ‘full’ should be used for amino acid inputs. IMPORTANT POINT 2: for DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using a custom annotation file) and ‘partial’ works better for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).
    Default: full
    Choices: partial, full
  • Region of Interest (ROI) For Condensing Sequences (string) : Condenses the Sanger sequences by the selected ROI, rank-ordered by abundance. All records sharing the same ROI are reported together, with their IDs concatenated into a ‘:’-separated list in the ‘id’ field. IMPORTANT: among redundant sequences sharing an ROI, those with lower abundance are removed. If two sequences have the same count for an ROI, only one is selected, at random. The default condenses by full-length sequence, including framework regions.
    Default: Full-Length, Including Framework
    Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length, Including Framework
  • Species Database to Select From (string) : Species (or species group) used to generate the database for igmatcher. NOTE: the database of small reference therapeutics is available at the amino acid level only.
    Default: [‘Human’]
    Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)
  • Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.
    1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.
    2) Each heavy chain representative should contain space-separated amino acid (AA) fields:
    match_name_vh FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa
    3) Each light chain representative should contain space-separated amino acid (AA) fields:
    match_name_vl FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa
    4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).
    5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
  • Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for NGS). (Long-Read/PacBio IGMATCHER Cube) (file_in) : ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies. If provided, it overrides the annotation and species/database selection settings.
    1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12.
    2) Each heavy chain representative should contain space-separated DNA fields:
    match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna
    3) Each light chain representative should contain space-separated DNA fields:
    match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna
    4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).
    5) Each scaffold should contain at least one representative (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12).
  • Is the Input SANGER a Single Domain or Chain (boolean) : Indicates whether the Sanger input is only a single-chain read (e.g., VH or VL) or a single-domain antibody (VHH). When enabled, the minimum accepted sequence length and the minimum query length are adjusted, and the sequences are flagged as such. Sanger data is usually treated as full-length VH+VL, similar to PacBio. IMPORTANT NOTE: when enabled, both chain_1 and chain_2 information is provided and is duplicated; use either.
    Default: False
  • Biophysical Liabilities (string) : Net charge or hydropathy liabilities to quantify.
    Default: [‘Charge (>1)’]
    Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)
  • Glycosylation Liabilities (string) : Glycosylation liabilities to quantify.
    Default: [‘NXT/S - Glycosylation’]
    Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation
  • Cysteine Liabilities (string) : Cysteine-based liabilities to quantify.
    Default: [‘Unpaired Cysteine’]
    Choices: Unpaired Cysteine, Any Cysteine
  • Deamidation Liabilities (string) : Deamidation liabilities to quantify.
    Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]
    Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation
  • Hydrolysis Liabilities (string) : Hydrolysis liabilities to quantify.
    Default: [‘DP - Hydrolysis’]
    Choices: DP - Hydrolysis
  • Isomerization Liabilities (string) : Isomerization liabilities to quantify.
    Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]
    Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization
  • Polyspecificity Liabilities (string) : Polyspecificity liabilities to quantify.
    Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]
    Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity
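
Several of the motif-style liabilities above read as regular expressions over a CDR sequence. The following is a minimal sketch of such a scan, not the floe's actual implementation: the pattern table, the function name find_liabilities, the example sequence, and the choice to let X match any residue in NXT/S are all assumptions made for illustration.

```python
import re

# Hypothetical pattern table: regex -> liability name (names taken from the choices above).
LIABILITY_PATTERNS = {
    r"N[GSTN]": "N[GSTN] - Deamidation",
    r"D[GSD]": "D[GSD] - Isomerization",
    r"DP": "DP - Hydrolysis",
    r"N.[ST]": "NXT/S - Glycosylation",  # assumption: X matches any residue
}


def find_liabilities(cdr_sequence):
    """Return sorted (start, motif, name) tuples for every match, including overlapping ones."""
    hits = []
    for pattern, name in LIABILITY_PATTERNS.items():
        # A lookahead wrapping a capture group reports overlapping matches.
        for match in re.finditer(f"(?=({pattern}))", cdr_sequence):
            hits.append((match.start(), match.group(1), name))
    return sorted(hits)


# Hypothetical CDR sequence, for illustration only.
hits = find_liabilities("ARDGNGSDPW")
for start, motif, name in hits:
    print(start, motif, name)
```

The lookahead form is used so that a run such as NNN reports both overlapping NN motifs; a plain finditer over the bare pattern would report only the first.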

Hidden Parameters

  • Liability Database File (file_in) : Provide a liabilities file (xls, csv, tsv) with 2 columns (regex pattern, name of liability). DO NOT INCLUDE the characters ‘+’ or ‘,’ in the second column (name of liability). Special-case liabilities include net charge, Parker hydropathy, and unpaired cysteine. For net charge, write “Charge” (case-sensitive) in the first column and the net charge integer in the second column (values above it are flagged). Similarly, for Parker hydropathy, write “Parker Hydropathy” (case-sensitive) in the first column and the hydropathy float value in the second column (values below it are flagged). For unpaired cysteine, write “Cysteine” (case-sensitive) in the first column; the second column can contain anything but is typically “Unpaired Cysteine”. NOTE: if an even number of cysteines exists but the cysteines are spaced within 3 AAs of each other, they are still considered unpaired.
  • Heavy Chain CDR1 Annotation Scheme (string) : Indicate the annotation scheme for Heavy CDR1.
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Heavy Chain CDR2 Annotation Scheme (string) : Indicate the annotation scheme for Heavy CDR2.
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Heavy Chain CDR3 Annotation Scheme (string) : Indicate the annotation scheme for Heavy CDR3.
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Light Chain CDR1 Annotation Scheme (string) : Indicate the annotation scheme for Light CDR1.
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Light Chain CDR2 Annotation Scheme (string) : Indicate the annotation scheme for Light CDR2.
    Default: KABAT
    Choices: IMGT, KABAT, CHOTHIA
  • Light Chain CDR3 Annotation Scheme (string) : Indicate the annotation scheme for Light CDR3.
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Length of K-mers for germline identification (integer) : Length of K-mers for germline identification (default 9 for DNA). If sequences differ significantly from the germline, smaller values (5-7) may help find the correct result.
    Default: 9 Min: 5 Max: 12
  • Length of query chain (integer) : How many bases or amino acids at each end of the read to use to query for an antibody chain.
    Default: 500 Min: 90 Max: 700
  • Maximum length of sequence for read to be kept (integer) : Maximum length of sequence for a read to be kept.
    Default: 1300
  • Minimum chain length (integer) : Minimum chain length for it to be considered functional and not truncated.
    Default: 273
  • Minimum length (integer) : Minimum length of sequence for a read to be kept.
    Default: 650
  • Minimum votes for germline assignment (integer) : Minimum number of matching K-mers for germline assignment. Higher numbers make the algorithm more stringent, at the expense of not annotating sequences that deviate from the reference database.
    Default: 100
  • Keep Only Functional Sequences, Sanger (boolean) : Eliminates non-functional sequences (truncations, stop codons, frameshifts).
    Default: False
  • ROI for sequence logo (choose Chain1 CDR3 if short-read/single-chain data) (string) : Name of the regions to be aligned for the sequence logo. A logo is output only if there are no more than 500 records.
    Default: CDR3 Chain_2 (Downstream Chain)
    Choices: CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3
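
The Liability Database File described above can be drafted with a short script. The sketch below checks rows against the stated rules and writes them as tab-separated text. The function names, the example rows, and the choice of TSV are assumptions; whether a header row is expected is not stated above, so none is written here.

```python
import csv
import io


def validate_liability_rows(rows):
    """Check (first_column, second_column) rows against the rules described for the liability file."""
    for first, second in rows:
        if first == "Charge":
            int(second)  # net charge threshold must parse as an integer
        elif first == "Parker Hydropathy":
            float(second)  # hydropathy threshold must parse as a float
        elif "+" in second or "," in second:
            # Conservative choice in this sketch: the check also covers "Cysteine" rows.
            raise ValueError(f"liability name may not contain '+' or ',': {second!r}")
    return rows


def write_liability_tsv(rows, handle):
    """Write validated rows as two tab-separated columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(validate_liability_rows(rows))


# Hypothetical example rows: two regex liabilities plus the three special cases.
rows = [
    ("N[GSTN]", "Deamidation"),
    ("DP", "Hydrolysis"),
    ("Charge", "1"),
    ("Parker Hydropathy", "-0.2"),
    ("Cysteine", "Unpaired Cysteine"),
]

buffer = io.StringIO()
write_liability_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Validating before writing surfaces a forbidden ‘+’ or ‘,’ in a liability name, or a non-numeric threshold, before the file is uploaded.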

Advanced Parameters for Writing Reports and Datasets

  • Write Records to Dataset (boolean) : Write out records to dataset.
    Default: False
  • Write Group to Their Own Dataset After Processing (boolean) : Write each group (if provided) to its own dataset after processing. NOTE: if there is only a single group, no separate dataset is written.
    Default: False
  • Write Report (boolean) : Write out a floe report after consolidation.
    Default: True
  • Unsuccessfully Processed Sanger File Name (dataset_out) : Output dataset to write to.
    Default: problematic
  • Successfully Processed Sanger File Name (dataset_out) : Output dataset to write to.
    Default: down.sanger
  • Output report for consolidated cube (dataset_out) : Output dataset to write to.
    Default: sanger.report
  • Provides Report of the Selected Antibody Leads (boolean) : Provides detailed information on the biophysical characteristics of the selected antibodies.
    Default: False
  • Metrics to Assess Sanger in Presence of NGS (boolean) : Indicates whether additional metrics are to be included to identify Sanger sequences in NGS and vice-versa.
    Default: False
  • Are these already processed records? (boolean) : Indicates whether input is to be analyzed post-processing for generating specific plots.
    Default: False
  • Is This A Downstream Processed File? (boolean) : Indicates whether the input contains data for downstream processing.
    Default: False
  • Is This A Sanger Processed File? (boolean) : Indicates whether the input contains Sanger (low-throughput) Sequencing Data.
    Default: True
  • Split by cluster? Only applies to downstream records. (boolean) : Indicates whether to split sequences by cluster before creating sequence logos. Cluster logos are output only if there are no more than 500 records.
    Default: False