Inputting SANGER File (Excel, TSV, CSV)

A Cube that takes in processed FASTQ or FASTA input and condenses it by a user-defined region of interest (ROI). The default ROI is the full-length sequence including CDRs. If sequences share the same ROI, only the most abundant full-length sequence is retained.
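The condensing rule described above can be illustrated with a minimal sketch. This is not the cube's implementation; `condense_by_roi` and the `roi_of` callback are hypothetical names, and the toy ROI (a fixed slice of the sequence) stands in for a real CDR annotation.

```python
from collections import Counter

def condense_by_roi(records, roi_of):
    """Keep, for each ROI, the most abundant full-length sequence.

    records: iterable of full-length sequences (one per read or well).
    roi_of:  hypothetical function mapping a full-length sequence to its ROI.
    Returns {roi: (full_length_sequence, count)}.
    """
    counts = Counter(records)  # abundance of each full-length sequence
    best = {}
    for seq, n in counts.items():
        roi = roi_of(seq)
        # On ties the first sequence seen is kept (an arbitrary choice).
        if roi not in best or n > best[roi][1]:
            best[roi] = (seq, n)
    return best

reads = ["AAACCCGGG", "AAACCCGGG", "TTTCCCGGG", "AAAGGGTTT"]
# Toy ROI: characters 3..5 stand in for a CDR3 (assumption for illustration).
condensed = condense_by_roi(reads, lambda s: s[3:6])
print(condensed)
```

Here two distinct full-length sequences share the ROI `CCC`, and only the one seen twice is retained.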

Main Parameters

Parameter Name
Keep Only Functional Sequences, Sanger
Heavy Chain CDR1 Annotation Scheme
Heavy Chain CDR2 Annotation Scheme
Heavy Chain CDR3 Annotation Scheme
Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data)
Is the Input SANGER a Single Domain or Chain
Length of K-mers for germline identification
Length of query chain
Light Chain CDR1 Annotation Scheme
Light Chain CDR2 Annotation Scheme
Light Chain CDR3 Annotation Scheme
Maximum length of sequence for read to be kept
Minimum chain length
Minimum length
Minimum votes for germline assignment
This retains the original FASTA file information
Region of Interest (ROI) For Condensing Sequences
Shared Region of Interest (ROI) Sequences
Species Database to Select From
Write the Upstream Sanger Output to CSV File

Calculation Parameters

Full/Partial alignment to annotate CDRs (cdr_method) type: string: Align query sequence to the entire germline or to partial regions to annotate.

IMPORTANT POINT 1: ‘full’ should be used for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using a custom annotation file) and ‘partial’ works better for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file).

Default: full

Choices: partial, full

CPUs (cpu_count) type: integer: The number of CPUs to run this cube with

Default: 1 , Min: 1, Max: 128

Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

Choices: cpu, disk, memory, network

Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_aa) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library; OPTIONAL for natural antibodies.

If provided, will override annotation and species/database selection settings.

Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_dna) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon-optimized sequences; OPTIONAL for natural antibodies.

If provided, will override annotation and species/database selection settings.

Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 5120.0 , Min: 128.0, Max: 8589934592

Keep Only Functional Sequences, Sanger (filter_functional) type: boolean: Eliminates non-functional sequences: truncations, stop codons, and frameshifts.

Default: False

GPUs (gpu_count) type: integer: The number of GPUs to run this cube with

Default: 0 , Max: 16

Heavy Chain CDR1 Annotation Scheme (heavy_cdr1) type: string: Indicate the annotation scheme for Heavy CDR1

Default: IMGT

Choices: IMGT, KABAT, CHOTHIA

Heavy Chain CDR2 Annotation Scheme (heavy_cdr2) type: string: Indicate the annotation scheme for Heavy CDR2

Default: IMGT

Choices: IMGT, KABAT, CHOTHIA

Heavy Chain CDR3 Annotation Scheme (heavy_cdr3) type: string: Indicate the annotation scheme for Heavy CDR3

Default: IMGT

Choices: IMGT, KABAT, CHOTHIA

Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (input_fastq) type: file_in: Input FASTQ/A in standard format or CSV, TSV or Excel in format id,sequence

ENSURE THE SUPPLIED FILE DOES NOT HAVE A HEADER ROW; a header may lead to aberrant behavior.
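As a rough illustration of the expected tabular layout, the sketch below writes `id,sequence` rows with no header line. The helper name `to_headerless_csv` and the well identifiers are made up for the example; only the two-column, header-free layout comes from the description above.

```python
import csv
import io

def to_headerless_csv(pairs):
    """Serialize (id, sequence) pairs as CSV text with no header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for seq_id, seq in pairs:
        writer.writerow([seq_id, seq])  # data rows only, never a header
    return buf.getvalue()

text = to_headerless_csv([
    ("well_A01", "CAGGTGCAGCTG"),  # hypothetical well ids and sequences
    ("well_A02", "GAGGTGCAGCTG"),
])
print(text)
```

The first line of the resulting file is already a data record, which is what the cube expects.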

Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)

Default: “”

Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on

Is the Input SANGER a Single Domain or Chain (is_single_dom) type: boolean: Turn ON if sequence covers only a single chain/domain (e.g. VH or VL). This will automatically adjust the acceptable minimum read length and annotation query length. If this is turned ON, both chain_1 and chain_2 columns are provided in the output file but they will be identical. IMPORTANT: has no effect on amino acid inputs, only on DNA inputs.

Default: False

Length of K-mers for germline identification (k) type: integer: Length of K-mers for germline identification (default 9 for DNA). If sequences differ significantly from the germline, smaller values (5-7) may help find the correct result.

Default: 9 , Min: 5, Max: 12

Length of query chain (len_chain_query) type: integer: How many bases or amino acids at each end of the read to use to query for an antibody chain

Default: 500 , Min: 90, Max: 700

Light Chain CDR1 Annotation Scheme (light_cdr1) type: string: Indicate the annotation scheme for Light CDR1

Default: IMGT

Choices: IMGT, KABAT, CHOTHIA

Light Chain CDR2 Annotation Scheme (light_cdr2) type: string: Indicate the annotation scheme for Light CDR2

Default: KABAT

Choices: IMGT, KABAT, CHOTHIA

Light Chain CDR3 Annotation Scheme (light_cdr3) type: string: Indicate the annotation scheme for Light CDR3

Default: IMGT

Choices: IMGT, KABAT, CHOTHIA

Maximum length of sequence for read to be kept (max_length) type: integer:

Default: 1300

Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 1800 , Min: 256.0, Max: 8589934592

Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds

Default: 60

Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300

Minimum chain length (min_len_chain) type: integer: Minimum chain length for it to be considered functional and not truncated

Default: 273

Minimum length (min_length) type: integer: Minimum length of sequence for a read to be kept

Default: 650
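A minimal sketch of how the minimum and maximum read-length settings might combine, using the documented defaults (650 and 1300). Whether the cube treats the bounds as inclusive is not stated, so inclusive bounds are an assumption here, and `keep_read` is a hypothetical helper.

```python
def keep_read(seq, min_length=650, max_length=1300):
    """Return True if the read length falls within [min_length, max_length].

    Inclusive bounds are an assumption; the defaults mirror the documented
    min_length and max_length values.
    """
    return min_length <= len(seq) <= max_length

reads = ["A" * 400, "A" * 900, "A" * 1500]
kept_lengths = [len(r) for r in reads if keep_read(r)]
print(kept_lengths)
```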

Minimum votes for germline assignment (min_votes) type: integer: Minimum number of matching K-mers for germline assignment. Higher numbers make the algorithm more stringent at the expense of not annotating some sequences.

Default: 150
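To show how K-mer length and minimum votes interact, here is a generic K-mer voting sketch. It is not igmatcher's algorithm; the function names, the toy germlines, and the scoring (number of distinct shared K-mers) are all assumptions made for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """Set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_germline(query, germlines, k=9, min_votes=150):
    """Pick the germline sharing the most K-mers with the query.

    Returns (name, votes), or (None, votes) when the best candidate
    has fewer than min_votes matching K-mers.
    """
    query_kmers = kmers(query, k)
    votes = Counter()
    for name, ref in germlines.items():
        votes[name] = len(query_kmers & kmers(ref, k))
    if not votes:
        return None, 0
    name, n = votes.most_common(1)[0]
    return (name, n) if n >= min_votes else (None, n)

# Toy references; real germline segments are far longer.
germlines = {"toy_A": "ACGTACGGTCAAGT", "toy_B": "TTGACCATGGCATA"}
query = "ACGTACGGTCTAGT"  # differs from toy_A at one position

print(assign_germline(query, germlines, k=5, min_votes=5))  # ('toy_A', 6)
print(assign_germline(query, germlines, k=9, min_votes=5))  # (None, 2)
```

With the same single mismatch, the shorter K-mers still collect enough votes while the longer ones do not, which is the trade-off the two parameters control.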

This retains the original FASTA file information (retain_fasta) type: boolean: This is important for retaining the key descriptors for registration of sequences.

Default: False

Region of Interest (ROI) For Condensing Sequences (roi) type: string: Condenses the Sanger sequences by the selected ROI, rank-ordered by abundance.

IMPORTANT: for sequences that share an ROI, this removes all but the full-length sequence with the highest count. If two sequences have the same full-length count, one of them is picked arbitrarily.

Default: Full-Length

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length

Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address

Default: 64

Shared Region of Interest (ROI) Sequences (shared_roi) type: string: This will provide an overlap_roi output that shows all the individual wells that share the same id.

Default: CDR3 Chain_2 (Downstream Chain)

Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
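The idea behind the shared-ROI output can be sketched as a simple grouping of wells by ROI. The real overlap_roi output format is not documented here, so the dictionary layout, the helper name `wells_sharing_roi`, and the example CDR3 strings are assumptions.

```python
from collections import defaultdict

def wells_sharing_roi(well_to_roi):
    """Group well ids by ROI and keep only ROIs seen in more than one well."""
    groups = defaultdict(list)
    for well, roi in well_to_roi.items():
        groups[roi].append(well)
    return {roi: sorted(wells) for roi, wells in groups.items() if len(wells) > 1}

# Hypothetical wells and CDR3 strings for illustration.
well_to_roi = {
    "A01": "ARDYGMDV",
    "A02": "ARDYGMDV",
    "B07": "AKGSWFDP",
    "C03": "ARDYGMDV",
}
shared = wells_sharing_roi(well_to_roi)
print(shared)
```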

Species Database to Select From (species) type: string: Species group used to generate the database for igmatcher.

NOTE: The Jain et al., 2017 database is available for amino acid input only.

Default: [‘Human’]

Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)

Spot policy (spot_policy) type: string: Control cube placement on spot market instances

Default: Prohibited

Choices: Allowed, Preferred, NotPreferred, Prohibited, Required

Write the Upstream Sanger Output to CSV File (write_to_csv_file) type: boolean: Option to write to a CSV file after the AbXtract Processing, NGS Only file, at the cost of additional time. If turned off, this can be done in a separate step, and an empty file is written instead.

Default: True

Hardware Parameters

Machine hardware requirements

Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 1800 , Min: 256.0, Max: 8589934592

Shared Memory (MiB) (shared_memory_mb) type: decimal: The amount of shared memory to allow a container to address

Default: 64

Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.

Default: 5120.0 , Min: 128.0, Max: 8589934592

GPUs (gpu_count) type: integer: The number of GPUs to run this cube with

Default: 0 , Max: 16

CPUs (cpu_count) type: integer: The number of CPUs to run this cube with

Default: 1 , Min: 1, Max: 128

Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on

Spot policy (spot_policy) type: string: Control cube placement on spot market instances

Default: Prohibited

Choices: Allowed, Preferred, NotPreferred, Prohibited, Required

Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)

Default: “”

Metrics Parameters

Cube Metric Parameters

Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds

Default: 60

Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300

Cube Metrics (cube_metrics) type: string: Set of metrics to be collected

Choices: cpu, disk, memory, network