AbXtract Processing, NGS and SANGER

Processes NGS FASTQ files and SANGER sequences, clusters by region of interest (ROI), calculates fold enrichment and/or relative abundance by ROI, and quantifies liabilities across all the CDRs.
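
As a rough illustration of what liability quantification across CDRs can look like, the sketch below counts regex motif hits in a few made-up CDR sequences. It mirrors the two-column liabilities file described under Calculation Parameters (regex pattern, name of liability). The function name, the example sequences, and the exact regex patterns are assumptions made for illustration, not the tool's own implementation. In particular, reading "NXT/S" as N, any residue except proline, then S or T follows the common sequon convention and is not confirmed by this page.

```python
import re

# Hypothetical two-column liabilities table: (regex pattern, liability name).
# Patterns are illustrative readings of the motif names in the choices lists.
LIABILITIES = [
    (r"N[GSTN]", "N[GSTN] - Deamidation"),
    (r"D[GSD]", "D[GSD] - Isomerization"),
    (r"DP", "DP - Hydrolysis"),
    (r"N[^P][ST]", "NXT/S - Glycosylation"),  # assumption: X excludes proline
]


def count_liabilities(cdrs, liabilities=LIABILITIES):
    """Count motif hits per liability across a dict of CDR name -> AA sequence."""
    counts = {name: 0 for _, name in liabilities}
    for seq in cdrs.values():
        for pattern, name in liabilities:
            # Zero-width lookahead so overlapping motifs (e.g. "NNN") are all counted.
            overlapping = re.compile(f"(?=(?:{pattern}))")
            counts[name] += sum(1 for _ in overlapping.finditer(seq))
    return counts


# Made-up CDR sequences, for illustration only.
cdrs = {"HCDR1": "GFTFSNG", "HCDR3": "ARDGDPYW", "LCDR3": "QQYNSTPLT"}
for name, n in count_liabilities(cdrs).items():
    print(f"{name}: {n}")
```

Counting per motif rather than flagging presence or absence keeps sequences with several liabilities of the same kind distinguishable from those with only one.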

Main Parameters

Parameter Name

Biophysical Conversion

Clustering Type

Early Round Absence Penalty

Late Round Absence Penalty

Edit Distance Method for overlap of NGS Barcode Groups

Edit Distance Method For Sanger Overlap

Exclude Sanger populations from the ‘Minimum Count Criterion

Exclude Values That Did Not Match In-Line Barcode

Keep Only Functional Sequences (At NGS Level)

Max Distance for Levenshtein or Hamming, If Selected

Minimum Count for the Full-Length Sequence by Sample

Minimum Number of Points to Consider a Cluster

Indicate whether the ABSCAN should utilize OPTICS (preferred) or DBSCAN

Edit Distance for overlap of NGS Barcode Groups

Edit Distance for Overlap to Sanger

Region of Interest For Enrichment and Clustering

Minimum Count for the Region of Interest (ROI)

Minimum Percent for the Region of Interest (ROI) for the Final Round

Region of Interest For Sanger overlap

Tabulate Ratio to Top Most Abundant Clone in Sequence or By Selected ROI

Write the Downstream Output to CSV File

Xi Minimum Steepness of Reachability Plot


Calculation Parameters

  • Biophysical Conversion (biophysical_conversion) type: boolean: Whether to convert each amino acid (AA) sequence into its physicochemical equivalent, e.g. E and D become negative charge. This is only applicable to AbScan.
    Default: True
  • Clustering Type (cluster_type) type: string: Cluster type to apply to sequencing dataset
    Default: AbScan
    Choices: AbScan, Unique Only, Levenshtein Distance, Hamming Distance
  • Early Round Absence Penalty (corr_factor_1) type: integer: Divisor applied when a sequence does not appear in an early-round population: min(round 2) / (correction factor 1). Increasing the value places greater weight on clones found in the final round that are not found in the early round.
    Default: 2
  • Late Round Absence Penalty (corr_factor_2) type: integer: Divisor applied when a sequence does not appear in a late-round population: min(round 3) / (correction factor 2). Increasing the value penalizes sequences that were high in the early round but completely depleted by the final round.
    Default: 10
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1 , Min: 1, Max: 128
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 , Min: 128.0, Max: 8589934592
  • Edit Distance Method for overlap of NGS Barcode Groups (edit_distance_method_overlap) type: string: Indicate the type of edit distance method to apply for overlap among different barcode groups. NOTE: Only in effect if the edit distance for overlap of NGS barcode groups takes a non-zero value.
    Default: Levenshtein Distance
    Choices: Hamming Distance, Levenshtein Distance
  • Edit Distance Method For Sanger Overlap (edit_distance_method_sanger) type: string: Indicate the type of edit distance method used to relax the stringency of the overlap to the SANGER population by ROI. NOTE: Only in effect if the edit distance for overlap of NGS to SANGER population takes a non-zero value.
    Default: Levenshtein Distance
    Choices: Hamming Distance, Levenshtein Distance
  • Exclude Sanger populations from the ‘Minimum Count Criterion (exclude_sanger_from_min_count) type: boolean: Exclude Sanger populations from the ‘Minimum Count for Full-Length Sequence by Sample’
    Default: True
  • Exclude Values That Did Not Match In-Line Barcode (exclude_unknown) type: boolean: If True, will exclude unknown values that did not have a barcode match, unless there is only one barcode for the entire NGS population.
    Default: True
  • Keep Only Functional Sequences (At NGS Level) (filter_functional) type: boolean: Eliminates non-functional sequences, including truncations, stop codons, and frame shifts. NOTE: This does not apply to Sanger sequences, which have the same option as a hidden variable.
    Default: True
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0 , Max: 16
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • liability database file (liabilities_db) type: file_in: Provide liabilities file (xls, csv, tsv) with 2 columns (regex pattern, name of liability)
  • Biophysical Liabilities (liability_choices_charge) type: string: Net charge or hydropathy liabilities to quantify
    Default: [‘Charge (>1)’]
    Choices: Charge (>-1), Charge (>0), Charge (>1), Charge (>2), Charge (>3), Charge (>4), Parker Hydropathy (<0.0), Parker Hydropathy (<-0.1), Parker Hydropathy (<-0.2), Parker Hydropathy (<-0.3), Parker Hydropathy (<-0.4), Parker Hydropathy (<-0.5), Parker Hydropathy (<-0.6), Parker Hydropathy (<-0.7), Parker Hydropathy (<-0.8), Parker Hydropathy (<-0.9), Parker Hydropathy (<-1.0), Parker Hydropathy (<-2.0), Parker Hydropathy (<-3.0), Parker Hydropathy (<-4.0), Parker Hydropathy (<-5.0)
  • Cysteine Liabilities (liability_choices_cysteine) type: string: cysteine-based liabilities to quantify
    Default: [‘Unpaired Cysteine’]
    Choices: Unpaired Cysteine, Any Cysteine
  • Deamidation Liabilities (liability_choices_deam) type: string: deamidation liabilities to quantify
    Default: [‘NG - Deamidation’, ‘NS - Deamidation’, ‘NT - Deamidation’, ‘NN - Deamidation’, ‘GNF - Deamidation’, ‘GNY - Deamidation’, ‘GNT - Deamidation’, ‘GNG - Deamidation’, ‘QG - Glutamine Deamidation’]
    Choices: N[GSTN] - Deamidation, NG - Deamidation, NS - Deamidation, NT - Deamidation, NN - Deamidation, GN[FYTG] - Deamidation, GNF - Deamidation, GNY - Deamidation, GNT - Deamidation, GNG - Deamidation, QG - Glutamine Deamidation
  • Glycosylation Liabilities (liability_choices_glyc) type: string: glycosylation liabilities to quantify
    Default: [‘NXT/S - Glycosylation’]
    Choices: NXT/S - Glycosylation, NXT - Glycosylation, NXS - Glycosylation
  • Hydrolysis Liabilities (liability_choices_hydrolysis) type: string: hydrolysis liabilities to quantify
    Default: [‘DP - Hydrolysis’]
    Choices: DP - Hydrolysis
  • Isomerization Liabilities (liability_choices_iso) type: string: isomerization liabilities to quantify
    Default: [‘DG - Isomerization’, ‘DS - Isomerization’, ‘DD - Isomerization’]
    Choices: D[GSD] - Isomerization, DG - Isomerization, DS - Isomerization, DD - Isomerization
  • Polyspecificity Liabilities (liability_choices_poly) type: string: polyspecificity liabilities to quantify
    Default: [‘Three Consecutive Aromatics - Polyspecificity’, ‘RR - Polyspecificity’, ‘VG - Polyspecificity’, ‘VV - Polyspecificity’, ‘WW - Polyspecificity’, ‘GGG - Polyspecificity’, ‘WXW - Polyspecificity’, ‘YY - Polyspecificity’]
    Choices: Three Consecutive Aromatics - Polyspecificity, RR - Polyspecificity, VG - Polyspecificity, VV - Polyspecificity, YY - Polyspecificity, WW - Polyspecificity, GGG - Polyspecificity, WXW - Polyspecificity
  • Max Distance for Levenshtein or Hamming, If Selected (max_dist_ld_hm) type: integer: Select the maximum edit distance for two sequences to belong to the same cluster group. Only applies when Levenshtein Distance or Hamming Distance is selected for Clustering Type.
    Default: 2 , Max: 50
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 , Min: 256.0, Max: 8589934592
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Minimum Count for the Full-Length Sequence by Sample (min_count_sample) type: integer: Minimum count for the full-length sequence by the sample_name.
    Default: 1 , Min: 1, Max: 10000000000
  • Minimum Number of Points to Consider a Cluster (min_pts) type: integer: This is the minimum number of points that will be considered a cluster.
    Default: 2
  • Indicate whether the ABSCAN should utilize OPTICS (preferred) or DBSCAN (optics_or_dbscan) type: string: Base algorithm used to identify clusters in an unsupervised manner. Both methods use an automated application of the Elbow Estimation method, but OPTICS uses the result as a maximum rather than a preset value, which makes it better suited to automation.
    Default: OPTICS
    Choices: OPTICS, DBSCAN
  • Edit Distance for overlap of NGS Barcode Groups (overlap_edit_distance_overlap) type: integer: Relaxes the stringency of the overlap among different barcode group populations by ROI.
    Default: 0 , Max: 100
  • Edit Distance for Overlap to Sanger (overlap_edit_distance_sanger) type: integer: Increasing this value will allow for more relaxed overlap among NGS to Sanger populations by given ROI.
    Default: 0 , Max: 100
  • Region of Interest For Enrichment and Clustering (roi) type: string: Indicate the region of interest for processing; only the top representative full-length sequence will be kept. NOTE: If the input is Illumina, only CDR3 (Chain_1/Upstream Chain) is used for enrichment, relative abundance, and clustering, and the overlap to Sanger is adjusted to heavy CDR3.
    Default: HCDR3 and LCDR3
    Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3, Full-Length
  • Minimum Count for the Region of Interest (ROI) (roi_count) type: integer: Sets the minimum count for a given region of interest; everything below it is removed.
    Default: 1 , Min: 1, Max: 10000000000
  • Minimum Percent for the Region of Interest (ROI) for the Final Round (roi_percent) type: decimal: Sets the minimum percent for a given region of interest; everything below it is removed.
    Default: 1e-12 , Min: 1e-12, Max: 100
  • Region of Interest For Sanger overlap (roi_sanger_overlap) type: string: Indicate the region of interest (ROI) to map back to Sanger clones
    Default: CDR3 Chain_2 (Downstream Chain)
    Choices: Merged CDRs, CDR3 Chain_1 (Upstream Chain), CDR3 Chain_2 (Downstream Chain), HCDR3 and LCDR3
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Tabulate Ratio to Top Most Abundant Clone in Sequence or By Selected ROI (tabulate_ratio_based_selected_roi) type: string: Select the cluster or sequence region used to compare the full-length frequency of the 2nd, 3rd, 4th, etc. clone against the frequency of the most abundant clone in the cluster or in the given ROI
    Default: cluster
    Choices: cluster, cluster_cdr3_1, cluster_cdr3_2, hcdr3+lcdr3 cluster, cdr3_aa_1, cdr3_aa_2, hcdr3+lcdr3 sequence, merged CDRs, full-length
  • Write the Downstream Output to CSV File (write_to_csv_file) type: boolean: Option to write the output to a CSV file after AbXtract Processing, NGS Only, at the cost of additional time. If turned off, an empty file is written and the CSV export can be done in a separate step.
    Default: True
  • Xi Minimum Steepness of Reachability Plot (xi) type: decimal: Float value between 0 and 1 that sets the minimum steepness on the reachability plot that defines a cluster boundary. For example, an upward point in the reachability plot is defined by the ratio from one point to its successor being at most 1-xi.
    Default: 0.0 , Max: 1.0
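
The two absence penalties above (corr_factor_1 and corr_factor_2) can be pictured with a small worked example. The sketch below is one reading that is consistent with the parameter descriptions but not confirmed by them: a sequence missing from a round is given a stand-in frequency equal to that round's minimum observed frequency divided by the matching correction factor, and fold enrichment is taken as the late-round frequency divided by the early-round frequency. The function name, the input layout, and the example frequencies are all hypothetical.

```python
def fold_enrichment(early, late, corr_factor_1=2, corr_factor_2=10):
    """Late/early frequency ratio per ROI, with stand-ins for absent rounds.

    early, late: dict mapping ROI sequence -> relative frequency in that round.
    Assumption: an absent ROI gets min(round) / correction factor.
    """
    early_floor = min(early.values()) / corr_factor_1  # used when absent early
    late_floor = min(late.values()) / corr_factor_2    # used when absent late
    result = {}
    for roi in sorted(set(early) | set(late)):
        early_freq = early.get(roi, early_floor)
        late_freq = late.get(roi, late_floor)
        result[roi] = late_freq / early_freq
    return result


# Made-up relative frequencies for an early and a final selection round.
early = {"ARDYW": 0.5, "ARGGW": 0.25, "ARSSW": 0.25}
late = {"ARDYW": 0.5, "ARNEW": 0.5}

for roi, fe in fold_enrichment(early, late).items():
    print(f"{roi}: {fe:.2f}")
```

Under this reading, raising corr_factor_1 lowers the early-round stand-in and so raises the enrichment of clones seen only in the final round, while raising corr_factor_2 lowers the late-round stand-in and so pushes down clones that vanished by the final round.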

Hardware Parameters

Machine hardware requirements
  • Memory (MiB) (memory_mb) type: decimal: The minimum amount of memory in MiBs (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 1800 , Min: 256.0, Max: 8589934592
  • Temporary Disk Space (MiB) (disk_space) type: decimal: The minimum amount of disk space in MiB (1048576 B) this cube requires. Due to overhead, request a couple hundred MiB more than required.
    Default: 5120.0 , Min: 128.0, Max: 8589934592
  • GPUs (gpu_count) type: integer: The number of GPUs to run this cube with
    Default: 0 , Max: 16
  • CPUs (cpu_count) type: integer: The number of CPUs to run this cube with
    Default: 1 , Min: 1, Max: 128
  • Instance Type (instance_type) type: string: The type of instance that this cube needs to be run on
  • Spot policy (spot_policy) type: string: Control cube placement on spot market instances
    Default: Prohibited
    Choices: Allowed, Preferred, NotPreferred, Prohibited, Required
  • Instance Tags (instance_tags) type: string: Only run on machines with matching tags (comma separated)
    Default: “”

Metrics Parameters

Cube Metric Parameters
  • Metric Period (metric_period) type: decimal: How often to sample metrics, in seconds
    Default: 60
    Choices: 1, 5, 10, 30, 60, 120, 180, 240, 300, Min: 1, Max: 300
  • Cube Metrics (cube_metrics) type: string: Set of metrics to be collected
    Choices: cpu, disk, memory, network