Inputting SANGER File (Excel, TSV, CSV)

A Cube that takes in processed FASTQ or FASTA input and condenses the sequences by a user-defined region of interest (ROI). The default ROI is the full-length sequence including CDRs. If sequences share the same ROI, only the most abundant full-length sequence is retained.
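The condensing step described above can be illustrated with a minimal sketch. This is not the Cube's implementation: the function `condense_by_roi` and the `roi_of` callback are hypothetical names, and the ROI here is a toy slice of the sequence rather than an annotated CDR.

```python
from collections import Counter, defaultdict


def condense_by_roi(sequences, roi_of):
    """Group full-length sequences by ROI and keep the most abundant one per ROI.

    sequences: iterable of full-length sequence strings (one per read).
    roi_of:    hypothetical callback mapping a full-length sequence to its ROI key.
    """
    # Count how often each distinct full-length sequence occurs.
    full_counts = Counter(sequences)

    # Bucket the distinct full-length sequences by their ROI.
    groups = defaultdict(list)
    for seq, count in full_counts.items():
        groups[roi_of(seq)].append((seq, count))

    condensed = {}
    for roi, members in groups.items():
        # max() returns the first maximal item, so ties are broken arbitrarily
        # by encounter order.
        best_seq, best_count = max(members, key=lambda m: m[1])
        condensed[roi] = {
            "sequence": best_seq,
            "full_length_count": best_count,
            "roi_count": sum(c for _, c in members),
        }
    return condensed


# Toy data: the "ROI" is simply positions 3..5 of each sequence (an assumption).
reads = ["AAACCCGGG", "AAACCCGGG", "TTTCCCGGG", "AAAGGGTTT"]
for roi, row in condense_by_roi(reads, lambda s: s[3:6]).items():
    print(roi, row)
```

With the full-length sequence as the ROI, `roi_of` would simply return the sequence itself, so every distinct sequence forms its own group.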

Main Parameters

Parameter Name

Keep Only Functional Sequences, Sanger

Heavy Chain CDR1 Annotation Scheme

Heavy Chain CDR2 Annotation Scheme

Heavy Chain CDR3 Annotation Scheme

Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data)

Is the Input SANGER a Single Domain or Chain

Length of K-mers for germline identification

Length of query chain

Light Chain CDR1 Annotation Scheme

Light Chain CDR2 Annotation Scheme

Light Chain CDR3 Annotation Scheme

Maximum length of sequence for read to be kept

Minimum chain length

Minimum length

Minimum votes for germline assignment

This retains the original FASTA file information

Region of Interest (ROI) For Condensing Sequences

Shared Region of Interest (ROI) Sequences

Species Database to Select From

Write the Upstream Sanger Output to CSV File
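The input parameter listed above accepts CSV, TSV, or Excel files in id,sequence format with no header. A minimal sketch of preparing such a CSV is shown here, assuming that "header" means a first row of column names; the helper names `write_id_sequence_csv` and `looks_like_header` are hypothetical, and the header check is only a heuristic.

```python
import csv
import io

# The 20 standard amino acid letters; this set also covers the DNA letters A, C, G, T.
SEQUENCE_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def write_id_sequence_csv(records, handle):
    """Write (id, sequence) pairs as CSV rows without any header row."""
    writer = csv.writer(handle, lineterminator="\n")
    for seq_id, seq in records:
        writer.writerow([seq_id, seq])


def looks_like_header(first_row):
    """Heuristic: flag the first row if its second field is not a plain sequence."""
    if len(first_row) < 2:
        return True
    return not set(first_row[1].upper()) <= SEQUENCE_LETTERS


# Usage with an in-memory buffer; a real file handle works the same way.
buffer = io.StringIO()
write_id_sequence_csv([("well_A1", "CAGGTG"), ("well_A2", "GAGGTG")], buffer)
first_row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(buffer.getvalue(), looks_like_header(first_row))
```

The check flags a row such as `id,sequence` because "sequence" contains the letter U, which is not in the allowed set.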


Calculation Parameters

  • Full/Partial alignment to annotate CDRs (cdr_method) type: string: Align query sequence to the entire germline or to partial regions to annotate.

IMPORTANT POINT 1: Use ‘full’ for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using a custom annotation file) and ‘partial’ works better for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).

Default: full
Choices: partial, full
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1 , Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

    Choices: cpu, disk, memory, network
  • Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_aa) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library; OPTIONAL for natural antibodies.

If provided, this file overrides the annotation and species/database selection settings.

  • Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_dna) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies.

If provided, this file overrides the annotation and species/database selection settings.

  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 , Min: 128.0, Max: 8589934592
  • Keep Only Functional Sequences, Sanger (filter_functional) type: boolean: Eliminates non-functional sequences, including truncations, stop codons, and frame shifts
    Default: False
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0 , Max: 16
  • Heavy Chain CDR1 Annotation Scheme (heavy_cdr1) type: string: Indicate the annotation scheme for Heavy CDR1
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Heavy Chain CDR2 Annotation Scheme (heavy_cdr2) type: string: Indicate the annotation scheme for Heavy CDR2
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Heavy Chain CDR3 Annotation Scheme (heavy_cdr3) type: string: Indicate the annotation scheme for Heavy CDR3
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (input_fastq) type: file_in: Input FASTQ/A in standard format, or CSV, TSV, or Excel in the format id,sequence

ENSURE THE SUPPLIED FILE DOES NOT HAVE A HEADER ROW; a header may lead to aberrant behavior.

  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Is the Input SANGER a Single Domain or Chain (is_single_dom) type: boolean: Indicates whether the SANGER input is only a single-chain read (e.g., VH or VL) or a domain antibody (VHH). Turning this on adjusts the acceptable minimum length, the minimum query, and the minimum length that is flagged. Usually SANGER input is treated as full-length VH+VL, similar to PacBio. IMPORTANT NOTE: if turned on, both chain_1 and chain_2 information is provided and is duplicated; use either.
    Default: False
  • Length of K-mers for germline identification (k) type: integer: Length of K-mers for germline identification (default 9 for DNA). If sequences differ significantly from the germline, smaller values (5-7) may help find the correct result.
    Default: 9 , Min: 5, Max: 12
  • Length of query chain (len_chain_query) type: integer: How many bases or amino acids at each end of the read to use when querying for an antibody chain
    Default: 500 , Min: 90, Max: 700
  • Light Chain CDR1 Annotation Scheme (light_cdr1) type: string: Indicate the annotation scheme for Light CDR1
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Light Chain CDR2 Annotation Scheme (light_cdr2) type: string: Indicate the annotation scheme for Light CDR2
    Default: KABAT
    Choices: IMGT, KABAT, CHOTHIA
  • Light Chain CDR3 Annotation Scheme (light_cdr3) type: string: Indicate the annotation scheme for Light CDR3
    Default: IMGT
    Choices: IMGT, KABAT, CHOTHIA
  • Maximum length of sequence for read to be kept (max_length) type: integer:
    Default: 1300
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 , Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Minimum chain length (min_len_chain) type: integer: Minimum chain length for it to be considered functional and not truncated
    Default: 273
  • Minimum length (min_length) type: integer: Minimum length of sequence for a read to be kept
    Default: 650
  • Minimum votes for germline assignment (min_votes) type: integer: Minimum number of matching K-mers for germline assignment. Higher numbers make the algorithm more stringent at the expense of not annotating some sequences.
    Default: 150
  • This retains the original FASTA file information (retain_fasta) type: boolean: This is important for retaining the key descriptors for registration of sequences.
    Default: False
  • Region of Interest (ROI) For Condensing Sequences (roi) type: string: Condenses the Sanger sequences by the selected ROI, rank-ordered by abundance.

IMPORTANT: this removes full-length sequences that share an ROI and keeps only the one with the highest full-length count. If two sequences have the same full-length count, one of them is picked.

Default: Full-Length
Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Shared Region of Interest (ROI) Sequences (shared_roi) type: string: This will provide an overlap_roi output that shows all the individual wells that share the same id.
    Default: CDR3 Chain_2 (Downstream Chain)
    Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
  • Species Database to Select From (species) type: string: Species group used to generate the database for igmatcher.
NOTE: The Jain et al., 2017 database is available for amino acid sequences only.
Default: [‘Human’]
Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Write the Upstream Sanger Output to CSV File (write_to_csv_file) type: boolean: Option to write to CSV after the AbXtract Processing, NGS Only file, at the cost of additional time. If turned off, the CSV can be written in a separate step, and an empty file is written instead.
    Default: True
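
The interaction between the K-mer length (k) and the minimum votes (min_votes) parameters can be illustrated with a generic K-mer voting sketch. The internal algorithm of igmatcher is not described here, so this is an assumption-laden illustration only: `kmers`, `assign_germline`, and the toy germline sequences are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def assign_germline(query, germlines, k=9, min_votes=150):
    """Vote for the germline sharing the most K-mers with the query.

    Returns (name, votes), with name set to None when the best germline
    has fewer than min_votes matching K-mers.
    """
    # Index every distinct germline K-mer by the germlines that contain it.
    index = {}
    for name, gseq in germlines.items():
        for km in set(kmers(gseq, k)):
            index.setdefault(km, set()).add(name)

    # Each query K-mer casts one vote for every germline that contains it.
    votes = Counter()
    for km in kmers(query, k):
        for name in index.get(km, ()):
            votes[name] += 1

    if not votes:
        return None, 0
    best_name, best_votes = votes.most_common(1)[0]
    return (best_name if best_votes >= min_votes else None), best_votes


# Toy germlines (not real gene segments), with small k and min_votes.
toy_germlines = {"G1": "ACGTACGTTAGC", "G2": "TTTTGGGGCCCC"}
print(assign_germline("ACGTACGTTAGC", toy_germlines, k=5, min_votes=3))
print(assign_germline("ACGTACGTTAGC", toy_germlines, k=5, min_votes=150))
```

In this sketch a smaller k makes each K-mer more tolerant of mutations relative to the germline, while a higher min_votes leaves more sequences unassigned, which mirrors the trade-off described for the two parameters.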

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 , Min: 256.0, Max: 8589934592
  • Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address
    Default: 64
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 , Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0 , Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1 , Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters
  • Metric Period (None) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Cube Metrics (None) type: string: Set of metrics to be collected

    Choices: cpu, disk, memory, network