Inputting SANGER File (Excel, TSV, CSV)¶
A Cube that takes in processed FASTQ or FASTA and condenses the sequences by a user-defined region of interest (ROI). The default ROI is the full-length sequence including CDRs. If sequences share the same ROI, only the most abundant full-length sequence is retained.
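
The condensing step described above can be illustrated with a minimal sketch. This is not the cube's implementation: the function name `condense_by_roi` and the record fields `full_length` and `cdr3` are hypothetical, chosen only to show the idea of keeping the most abundant full-length sequence for each ROI.

```python
from collections import Counter

def condense_by_roi(records, roi_key):
    """Group records by ROI and keep the most abundant full-length sequence.

    `records` is a list of dicts (hypothetical layout) holding a
    'full_length' sequence and the ROI value under `roi_key`.
    """
    groups = {}
    for rec in records:
        # Count each full-length sequence within its ROI group.
        groups.setdefault(rec[roi_key], Counter())[rec["full_length"]] += 1

    condensed = {}
    for roi, counts in groups.items():
        # most_common(1) returns the top entry; on a tie it returns the
        # sequence seen first, i.e. "one or the other" is picked.
        seq, count = counts.most_common(1)[0]
        condensed[roi] = {
            "full_length": seq,
            "count": count,
            "roi_total": sum(counts.values()),
        }
    return condensed

records = [
    {"full_length": "AAAC", "cdr3": "AR"},
    {"full_length": "AAAC", "cdr3": "AR"},
    {"full_length": "AAAG", "cdr3": "AR"},
    {"full_length": "TTTT", "cdr3": "GK"},
]
result = condense_by_roi(records, "cdr3")
```

Here the three reads sharing the ROI `AR` collapse to the single full-length sequence `AAAC`, which was observed twice, while `GK` keeps its only sequence.
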
Main Parameters¶
Parameter Name |
---|
Keep Only Functional Sequences, Sanger |
Heavy Chain CDR1 Annotation Scheme |
Heavy Chain CDR2 Annotation Scheme |
Heavy Chain CDR3 Annotation Scheme |
Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) |
Is the Input SANGER a Single Domain or Chain |
Length of K-mers for germline identification |
Length of query chain |
Light Chain CDR1 Annotation Scheme |
Light Chain CDR2 Annotation Scheme |
Light Chain CDR3 Annotation Scheme |
Maximum length of sequence for read to be kept |
Minimum chain length |
Minimum length |
Minimum votes for germline assignment |
This retains the original FASTA file information |
Region of Interest (ROI) For Condensing Sequences |
Shared Region of Interest (ROI) Sequences |
Species Database to Select From |
Write the Upstream Sanger Output to CSV File |
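
For the tabular input listed above, the parameter details specify the format id,sequence with no header. The sketch below shows one way such a file might be prepared and checked with Python's standard `csv` module. The helper names `write_headerless_csv` and `looks_like_header`, the well identifiers, and the list of header words are assumptions for illustration and are not part of the cube.

```python
import csv
import io

# Hypothetical set of words that suggest a header line was left in the file.
HEADER_WORDS = {"id", "name", "sequence", "seq"}

def write_headerless_csv(pairs, handle):
    """Write (id, sequence) pairs as CSV rows without any header line."""
    writer = csv.writer(handle, lineterminator="\n")
    for seq_id, sequence in pairs:
        writer.writerow([seq_id, sequence])

def looks_like_header(row):
    """Heuristic check: does the row contain a typical header word?"""
    return any(field.strip().lower() in HEADER_WORDS for field in row)

buffer = io.StringIO()
write_headerless_csv([("well_A1", "CAGGTG"), ("well_A2", "GAGGTC")], buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
first_row_is_header = looks_like_header(rows[0])
```

The check is only a heuristic: it flags a first row such as `id,sequence`, but it cannot detect every possible header.
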
Parameter Details¶
Calculation Parameters¶
- Full/Partial alignment to annotate CDRs (cdr_method) type: string: Align query sequence to the entire germline or to partial regions to annotate. IMPORTANT POINT 1: Full should be used for amino acid inputs. IMPORTANT POINT 2: For DNA inputs, ‘full’ works better for natural repertoires (e.g., when NOT using custom annotation file) and ‘partial’ works better with DNA for synthetic/degenerate antibodies (e.g., when uploading a custom annotation file). Default: full. Choices: partial, full
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network
- Provide a Customized Amino Acid Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_aa) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library; OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
- Provide a Customized DNA Annotation File with Alignment Scheme of Interest (for Sanger) (custom_annotation_dna) type: file_in: ONLY REQUIRED for custom scaffolds like the Specifica Gen3 Library or codon optimized sequences; OPTIONAL for natural antibodies. If provided, will override annotation and species/database selection settings.
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- Keep Only Functional Sequences, Sanger (filter_functional) type: boolean: Eliminates non-functional sequences, truncations, stop-codons, frame-shifts. Default: False
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- Heavy Chain CDR1 Annotation Scheme (heavy_cdr1) type: string: Indicate the annotation scheme for Heavy CDR1. Default: IMGT. Choices: IMGT, KABAT, CHOTHIA
- Heavy Chain CDR2 Annotation Scheme (heavy_cdr2) type: string: Indicate the annotation scheme for Heavy CDR2. Default: IMGT. Choices: IMGT, KABAT, CHOTHIA
- Heavy Chain CDR3 Annotation Scheme (heavy_cdr3) type: string: Indicate the annotation scheme for Heavy CDR3. Default: IMGT. Choices: IMGT, KABAT, CHOTHIA
- Input FASTQ/A, CSV, TSV, or XLS for Sanger (or Low-Throughput Data) (input_fastq) type: file_in: Input FASTQ/A in standard format or CSV, TSV or Excel in format id,sequence. ENSURE THE SUPPLIED FILE DOES NOT HAVE A HEADER COLUMN, which may lead to aberrant behavior.
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Is the Input SANGER a Single Domain or Chain (is_single_dom) type: boolean: Indicates whether the SANGER input provided is only a single chain read (e.g. VH or VL) or a domain antibody (VHH), which will adjust the acceptable min length, minimum query and minimum length that is flagged. Usually SANGER is treated as VH+VL, full-length, similar to PacBio. IMPORTANT NOTE: if turned on, both chain_1 and chain_2 information is provided and is duplicated; use either. Default: False
- Length of K-mers for germline identification (k) type: integer: Length of K-mers for germline identification (default 9 for DNA). If sequences differ significantly from the germline, a smaller value (5-7) may help find the correct result. Default: 9, Min: 5, Max: 12
- Length of query chain (len_chain_query) type: integer: How many bases or amino acids at each end of the read to use to query for an antibody chain. Default: 500, Min: 90, Max: 700
- Light Chain CDR1 Annotation Scheme (light_cdr1) type: string: Indicate the annotation scheme for Light CDR1. Default: IMGT. Choices: IMGT, KABAT, CHOTHIA
- Light Chain CDR2 Annotation Scheme (light_cdr2) type: string: Indicate the annotation scheme for Light CDR2. Default: KABAT. Choices: IMGT, KABAT, CHOTHIA
- Light Chain CDR3 Annotation Scheme (light_cdr3) type: string: Indicate the annotation scheme for Light CDR3. Default: IMGT. Choices: IMGT, KABAT, CHOTHIA
- Maximum length of sequence for read to be kept (max_length) type: integer: Default: 1300
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Minimum chain length (min_len_chain) type: integer: Minimum chain length for it to be considered functional and not truncated. Default: 273
- Minimum length (min_length) type: integer: Minimum length of sequence for a read to be kept. Default: 650
- Minimum votes for germline assignment (min_votes) type: integer: Minimum number of matching K-mers for germline assignment. Higher numbers make the algorithm more stringent at the expense of not annotating some sequences. Default: 150
- This retains the original FASTA file information (retain_fasta) type: boolean: This is important for retaining the key descriptors for registration of sequences. Default: False
- Region of Interest (ROI) For Condensing Sequences (roi) type: string: Condenses the Sanger sequences by the ROI, rank-ordered by abundance. IMPORTANT: this removes full-length sequences that share an ROI and keeps only the most abundant full-length sequence. If two sequences have the same full-length count, one or the other is picked. Default: Full-Length. Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
- Shared Region of Interest (ROI) Sequences (shared_roi) type: string: This will provide an overlap_roi output that shows all the individual wells that share the same id. Default: CDR3 Chain_2 (Downstream Chain). Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
- Species Database to Select From (species) type: string: Species group (species) desired to generate the db for igmatcher. NOTE: the Jain et al, 2017 database is available as amino acid only. Default: [‘Human’]. Choices: Alpaca, Human, Mouse, Rabbit, Jain Dataset (amino acid only)
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Write the Upstream Sanger Output to CSV File (write_to_csv_file) type: boolean: Allows the option to write to CSV after the AbXtract Processing, NGS Only file at the cost of additional time. If not, this can be done in a separate step. Writes to an empty file if turned off. Default: True
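
The interplay between the k-mer length (k) and the minimum votes (min_votes) can be illustrated with a generic k-mer voting sketch. This is an assumption about the general technique, not the actual igmatcher algorithm: the function names, the toy germline sequences, and the definition of a vote as one shared distinct k-mer are all hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_germline(query, germlines, k=9, min_votes=150):
    """Pick the germline sharing the most k-mers with the query.

    Returns (name, votes), or (None, votes) when the best candidate
    does not reach `min_votes` and the query is left unannotated.
    """
    query_kmers = kmers(query, k)
    votes = Counter()
    for name, reference in germlines.items():
        # One vote per distinct k-mer shared with this germline.
        votes[name] = len(query_kmers & kmers(reference, k))
    if not votes:
        return None, 0
    best, best_votes = votes.most_common(1)[0]
    if best_votes < min_votes:
        return None, best_votes
    return best, best_votes

# Toy data, far shorter than real germline genes.
germlines = {"G1": "ACGTACGGTCA", "G2": "TTTTGGGGCCCC"}
lenient = assign_germline("ACGTACGGTCA", germlines, k=5, min_votes=3)
strict = assign_germline("ACGTACGGTCA", germlines, k=5, min_votes=8)
```

With the lenient threshold the query is assigned to `G1`, which shares seven 5-mers with it. Raising the threshold to 8 leaves the same query unannotated, which mirrors the stringency trade-off described for min_votes.
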
Hardware Parameters¶
- Machine hardware requirements
- Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 1800, Min: 256.0, Max: 8589934592
- Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required. Default: 5120.0, Min: 128.0, Max: 8589934592
- GPUs (gpu_count) type: integer: The number of GPUs to run this cube with. Default: 0, Max: 16
- CPUs (cpu_count) type: integer: The number of CPUs to run this cube with. Default: 1, Min: 1, Max: 128
- Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
- Spot policy (spot_policy) type: string: Control cube placement on spot market instances. Default: Prohibited. Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
- Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated). Default: “”
Metrics Parameters¶
- Cube Metric Parameters
- Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds. Default: 60. Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300. Min: 1, Max: 300
- Cube Metrics (cube_metrics) type: string: Set of metrics to be collected. Choices: cpu, disk, memory, network