Frequently Asked Questions about AbXtract Floes

What is the difference between upstream and downstream chain designations, _1 and _2?

The _1 and _2 suffixes define the orientation of the chains in the vector and are predominantly used with PacBio (long-read) and Sanger sequencing. If the antibody is sequenced with the heavy chain variable domain (VH) at the 5’ end and the light chain variable domain (VL) at the 3’ end, then ‘_1’ (e.g., cdr1_aa_1, cdr2_aa_1, cdr3_aa_1) is the VH and ‘_2’ is the VL. However, if the VL is 5’ and the VH is 3’, then ‘_2’ (e.g., cdr1_aa_2, cdr2_aa_2, cdr3_aa_2) is the VH and ‘_1’ is the VL. In other words, any field ending in ‘_1’ or ‘_2’ designates the VH or VL depending on the orientation.
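As a minimal sketch of the rule above, the mapping from vector orientation to field suffix can be written as follows. The function names and the "VH"/"VL" labels are illustrative only and are not part of AbXtract.

```python
def chain_suffixes(five_prime_chain: str) -> dict:
    """Map each chain to its field suffix, given which chain sits at the 5' end."""
    if five_prime_chain not in ("VH", "VL"):
        raise ValueError("five_prime_chain must be 'VH' or 'VL'")
    three_prime_chain = "VL" if five_prime_chain == "VH" else "VH"
    # The upstream (5') chain gets _1, the downstream (3') chain gets _2.
    return {five_prime_chain: "_1", three_prime_chain: "_2"}


def cdr3_field(chain: str, five_prime_chain: str) -> str:
    """Name of the CDR3 amino acid field that holds the requested chain."""
    return "cdr3_aa" + chain_suffixes(five_prime_chain)[chain]
```

For a VL-then-VH vector, cdr3_field("VH", "VL") resolves to cdr3_aa_2, which matches the second case described above.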

What is the difference between upstream and downstream AbXtract FLOE processing?

Upstream FLOEs take in FASTQ files for NGS or FASTQ/FASTA/Excel/TSV/CSV files for SANGER. If provided, upstream FLOEs also accept optional barcode files and custom database references. The main goals of upstream processing are to:

1) Annotate sequences by closest match to reference database (Human, Mouse, Rabbit, Alpaca) or custom database.
a) By identifying the closest match (germline or scaffold).
b) By annotating the framework and complementarity determining region (CDR) boundaries by annotation scheme of interest (IMGT, Kabat, or Chothia).
2) Demultiplex samples by the sample_name field.

Downstream processing can take on any set of sequences that have been properly annotated and will perform clustering, enrichment calculations by population, relative abundance calculations by population, look for overlap among different populations and/or quantify liabilities for every complementarity determining region (CDR) individually. To be processed by the downstream FLOES, datasets require key fields:

1) All framework regions and CDRs at the amino acid level (e.g., fr1_aa_1, fr2_aa_1, fr3_aa_1, fr4_aa_1, cdr1_aa_1, cdr2_aa_1, cdr3_aa_1).
2) Count field that indicates the number of times a full-length sequence (by amino acid and sample_name) appears.
3) Sample names (sample_name), while optional, should be distinct for each nucleotide barcode used; if no sample names are present, all sequences are treated as the same population.
a) If the barcode table is not supplied or in-line barcodes are not used, you can use the “Modify Barcode and Sample Name FLOE” to alter the dataset.
4) Barcode groups (barcode_group) indicate how the relative abundance (and enrichment) is to be calculated; all samples belonging to the same barcode group are calculated relative to one another.
5) Barcode round (early or late) is an upstream-only field that is important for determining the relative fold enrichment of the late round relative to the early round.
a) Barcode round will compare the late relative to early rounds by barcode_group to calculate relative fold enrichment values.
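The field requirements above can be checked before submitting a dataset. The following is a rough sketch, assuming each record is a plain dictionary and that the count field is literally named "count" (the exact name is an assumption); only the chain _1 fields from the example are checked, and the optional fields are left alone.

```python
# Region fields taken from the example in item 1; paired-chain data
# would carry the matching _2 fields as well.
REQUIRED_REGION_FIELDS = [
    "fr1_aa_1", "fr2_aa_1", "fr3_aa_1", "fr4_aa_1",
    "cdr1_aa_1", "cdr2_aa_1", "cdr3_aa_1",
]


def missing_downstream_fields(record: dict, count_field: str = "count") -> list:
    """Return the required field names that are absent from a record."""
    required = REQUIRED_REGION_FIELDS + [count_field]
    return [name for name in required if name not in record]
```

An empty list means the record carries every required field; sample_name, barcode_group and barcode_round are optional and are therefore not reported.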

Upstream and Downstream processing can occur simultaneously in the comprehensive A to Z FLOES (NGS Processing) with/without SANGER and with/without Automated Selection.

What input formats do the AbXtract Upstream Processing FLOES accept, and what are the requirements?

The Upstream processing takes in REQUIRED FASTQ file(s) that differ between Illumina and PacBio as follows:

1) Illumina takes a forward and a reverse FASTQ file, each on its own specified input port.
2) PacBio takes a single FASTQ file. Typically, the facility that conducts the sequencing will convert the BAM files to FASTQ, which is the appropriate input.

Upstream processing is always linked to IgMatcher annotation using FLOE “NGS (PacBio/Illumina) IgMatcher, Annotation Only - AbXtract”

OPTIONAL file inputs to Upstream processing include:

1) Barcode file (see more details below).
2) Customized DNA Database with alignment scheme of interest (see more details below).

What is the region of interest (ROI)?

The region of interest (ROI) allows you to define regions for condensing sequences, calculating NGS metrics (enrichment/relative abundance), overlapping different datasets, and/or clustering. The ROIs included in AbXtract are as follows:

1) CDR3 Chain_2 (Downstream Chain)
2) CDR3 Chain_1 (Upstream Chain)
3) Merged CDRs
4) HCDR3 and LCDR3
5) Full-Length

Where are the liabilities tabulated in AbXtract?

Liabilities are always tabulated across all 6 CDRs (user-specified annotation) in PacBio or 3 CDRs in Illumina. Each CDR is tabulated for liabilities individually (not by concatenation) based on the annotation scheme of interest (e.g., IMGT).

There are several overlap options available. What are the differences among the different overlap options?

Overlap is conducted across datasets whose downstream populations have been assigned distinct barcode groups, either between two NGS populations (each with a different barcode group) or between NGS and Sanger populations (Sanger populations must have the barcode_group labeled Sanger).

For overlap among different NGS populations:

Overlap among the different NGS populations is meant to identify which region of interest (ROI) is found in distinct barcode_group populations, particularly in downstream processed datasets. If two populations do not have distinct barcode groups, use the FLOE “Modify Sample Name or Barcode Group in AbXtract Dataset” to assign a distinct barcode group to each population. If an ROI (say HCDR3) is found in another population within a user-defined stringency such as 1) Unique Only (100% identity), 2) Levenshtein distance (with edit distance > 0), or 3) Hamming distance (with edit distance > 0), then the corresponding population will be listed under the field called ‘overlap_population’ as a ‘:’ separated list.

Overlap of NGS to SANGER is slightly different from the overlap among different NGS populations. In these populations, the overlap_population field is still produced (typically with ‘Sanger’ as its own barcode_group), but also be mindful of the fields overlap_to_ngs and overlap_to_sanger, which are TRUE/FALSE values indicating whether the given user-defined ROI is found in NGS and/or SANGER. Most importantly, a well_id field identifies the NGS sequences that share exact (Unique Only, i.e., 100% identity) or similar (Levenshtein/Hamming edit distance > 0) sequences with SANGER by building a concatenated ‘:’ separated list of clones from the ‘id’ column.
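The three stringency options can be illustrated with a short sketch. This is not the AbXtract implementation: the function names are hypothetical, the example ROI strings are made up, and two choices are assumptions (Hamming distance is treated as a non-match for ROIs of different length, and a population is never listed as overlapping with itself).

```python
def hamming(a: str, b: str):
    """Number of mismatched positions, or None when the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def overlap_population(roi, own_group, rois_by_group, method="unique", max_distance=0):
    """Build a ':' separated list of the other barcode groups that contain the ROI."""
    hits = []
    for group, rois in rois_by_group.items():
        if group == own_group:
            continue
        for other in rois:
            if method == "unique":
                match = roi == other
            elif method == "hamming":
                distance = hamming(roi, other)
                match = distance is not None and distance <= max_distance
            elif method == "levenshtein":
                match = levenshtein(roi, other) <= max_distance
            else:
                raise ValueError("unknown method: " + method)
            if match:
                hits.append(group)
                break
    return ":".join(sorted(hits))


# Made-up ROI strings for three barcode groups.
example_rois = {"rbd": {"ARDYW"}, "s1": {"ARDFW"}, "trimer": {"ARDYYW"}}
```

With these example strings, Unique Only finds no overlap for the rbd ROI, Hamming with a distance of 1 finds only s1, and Levenshtein with a distance of 1 also picks up trimer because an insertion is allowed.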

Important Note

SANGER clones are condensed by the full-length region of interest (ROI) by default. If a different ROI is used, some SANGER clones may be dropped, because only the most abundant full-length sequence is kept for a given ROI.
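To see why clones can be dropped, here is a minimal sketch of condensing by ROI, assuming records are dictionaries with a "count" field (an assumed name) and that ties are resolved in favour of the first record seen. The function name is hypothetical.

```python
def condense_by_roi(records, roi_field, count_field="count"):
    """Keep only the most abundant record for each distinct ROI value."""
    best = {}
    for rec in records:
        key = rec[roi_field]
        if key not in best or rec[count_field] > best[key][count_field]:
            best[key] = rec
    return list(best.values())


# Two made-up clones share the same CDR3 but differ elsewhere in the sequence.
clones = [
    {"id": "A01", "cdr3_aa_2": "ARDYW", "count": 5},
    {"id": "B07", "cdr3_aa_2": "ARDYW", "count": 2},
    {"id": "C03", "cdr3_aa_2": "ARGFW", "count": 1},
]
kept = condense_by_roi(clones, "cdr3_aa_2")
```

Condensing on cdr3_aa_2 keeps A01 and C03; B07 is dropped because it shares its ROI with the more abundant A01.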

What input formats do the AbXtract Downstream Processing FLOES accept, and what are the requirements?

The Downstream processing takes in REQUIRED datasets of records that must, at a minimum, have been processed by the NGS Annotation Only FLOE.

1) If using the comprehensive A to Z NGS processing cubes, datasets from the upstream FLOEs and downstream FLOEs as well as individually barcoded populations (by sample_name and barcode_group) are written to their individual datasets and consolidated datasets as datasets of records and CSV files.
a) Any of the datasets of records can be used in any AbXtract FLOE or visualized in the Analyze tool (<100k records).
b) CSV files can be downloaded directly for local processing.
2) If the FLOEs are used in steps (e.g., Annotation Only FLOE is run first) then these datasets can be used in any individualized downstream FLOE (e.g., Liabilities, Enrichment/Relative Abundance, Clustering). However, upstream FLOES cannot be used directly in Automated Picking FLOES as no NGS parameters are calculated except count.

How many sequences can the platform process, and what is the output?

The goal of the platform is to enable the user to filter the sequence space from millions of reads to a smaller subset of records in the 100s to 1000s. To achieve this, the platform conducts annotation to identify regions of interest (e.g., HCDR3). The frequencies of these regions of interest in respective populations are used to calculate NGS metrics such as relative abundances, enrichment and to calculate liabilities. Further, the sequence space also allows for clustering of sequences based on sequence similarity across the whole population to enable efficient density based clustering. All of these calculated metrics are then used to rank full-length sequences in the population by user-defined priorities metrics in each cluster, so the user can work with individual leads in a population that are representative of each of these clusters.

While a large number of records produced in earlier processes can be used, they are mostly intended for use in AbXtract to feed the filtering FLOES such as:

1) Automated Selection
2) Custom Selection NGS/Sanger
3) Condense by Region of Interest

For most use cases, FLOES are designed to use the underlying NGS metrics to select representatives from each cluster most effectively.

For more advanced users, upstream datasets can be exported, but they require longer processing times. These larger datasets are not intended for the interactive plots in Orion’s Analyze tool.

Why is the number of picked representatives below the total number of full-length sequences when using the ‘Automated Top Lead Selection’ FLOE?

It may be due to several reasons including but not limited to the following:

1) If min/max thresholds were set in hidden parameters (e.g. minimum ROI % threshold) which eliminates a large portion of the population.
2) If the number of clusters or number of sequences per cluster is below the defined limits (e.g. 10 clusters in real population but 40 clusters desired).

The uniform picking option (see hidden parameters) used with the ‘Automated Top Lead Selection’ FLOE essentially scans over the desired number of clusters in the population to first explore all the clusters in a given population. For instance, it scans over every top representative of each cluster first before moving to the 2nd and 3rd representative, if the number of sequences per cluster is set above 1. Using the parameter ‘Attempt to Fulfill the Desired Number of Sequences Quota’ will relax some of the requirements (e.g., number of clusters explored or increase number of sequences per cluster). To fulfill the quota if you are falling short of the desired number of clones:

1) Turn on the fill quota option in the NGS lead selection parameters.
2) Increase the maximum number of clusters preferred.
3) Reduce the maximum number of sequences per cluster.
4) Change the picking type from ‘uniform’ to ‘most abundant’ (see hidden parameters).
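The uniform picking behaviour described above can be sketched as a round-robin over clusters. This is an illustration under stated assumptions, not the FLOE's actual code: clusters are given as a dictionary in priority order, each cluster's members are already ranked best first, and the function and parameter names are hypothetical.

```python
def uniform_pick(clusters, max_clusters, max_per_cluster, quota):
    """Pick the top representative of every cluster before any second representative."""
    chosen = list(clusters)[:max_clusters]
    picks = []
    for rank in range(max_per_cluster):
        for cluster_id in chosen:
            if len(picks) == quota:
                return picks
            members = clusters[cluster_id]
            if rank < len(members):
                picks.append(members[rank])
    return picks


# Made-up population with only three clusters.
population = {"1F": ["a1", "a2", "a3"], "2F": ["b1"], "3F": ["c1", "c2"]}
short = uniform_pick(population, max_clusters=40, max_per_cluster=2, quota=10)
```

Here the request is for 10 sequences from up to 40 clusters, but only 5 are returned because the population holds 3 clusters and two of them have fewer than 2 members. This is the kind of shortfall described in the reasons listed above.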

The clustering is taking quite a while with my inputs. Is there a way to bypass clustering in the NGS (PacBio/Illumina) Pipeline FLOE?

Clustering may take some time depending on the underlying diversity (often HCDR3 diversity). Usually, a typical PacBio run with an HCDR3 diversity below 20,000 takes from 30 minutes to 5 hours. However, users should allow up to 1 day. If the FLOE fails, check the log file to see if the error is related to memory. If so, allocate more memory (see hidden parameters under system requirements). If the run extends beyond this timeframe and/or exceeds the cost limits, this suggests that you are working with a large diversity. The quickest solution is to change the cluster method to “Unique Only” using the HCDR3 region of interest (ROI). This approach still assigns the same cluster id to identical HCDR3s while collapsing the dataset to non-redundant full-length sequences by population, which greatly reduces the run time. Alternatively, if a more relaxed stringency is still desired, you can change the ‘Clustering Type’ method to Levenshtein or Hamming distance, which will be slower than the Unique Only option but faster than AbScan. After these sequences are grouped, you can use any of the Automated Lead Selection FLOES or the ‘Condense Total Number of Sequences by Region of Interest (ROI)’ FLOE to gather only the top representative full-length sequences from each cluster.

I do not need the FLOE Report. Can I bypass this option?

Yes. In the Hidden Parameters of the respective consolidate cube for a given FLOE, turn off the “Write Report” option.

What is a barcode file, barcode_group, barcode_round, and sample_name? How do you format this table?

Barcode files can be in Excel (preferred, 1 sheet only), TSV or CSV format WITHOUT headers.

If used, barcode files are formatted as follows:

1) sample_name, 5’ barcode DNA, 3’ barcode DNA, barcode_round, barcode_group.
A) Sample name (sample_name): This is unique to each population distinguished by the 5’ and/or 3’ DNA. For example, in the in-vitro tutorial the different populations are as follows: | i) rbd2 | ii) rbd3 | iii) s12 | iv) s13 | v) tri1 | vi) tri3
IMPORTANT NOTE: All sample names should be distinct by the in-line barcode used. All populations with distinct sample names will be written to their own dataset.
B) 5’ barcode DNA: The 5’ barcode (e.g., 8-mer) written in 5’ to 3’ direction.
C) 3’ barcode DNA: The 3’ barcode (e.g., 8-mer) written in 5’ to 3’ direction.
D) Barcode round (barcode_round): This is used by downstream pipeline to calculate enrichment/relative abundance and can take on these values:
a) early
b) late
E) Barcode group (barcode_group): This is used by downstream pipeline to calculate enrichment/relative abundance and can take any value. If user provides barcode_round and barcode_group this enables calculation of enrichment from early to late round. For example, in the in-vitro tutorial the different populations are as follows: | i) rbd | ii) s1 | iii) trimer
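As a rough illustration of the five-column, header-less layout above, the sketch below reads a CSV barcode table. The barcode sequences are made-up 8-mers, the dictionary keys barcode_5p and barcode_3p are hypothetical names, and allowing an empty barcode_round is an assumption.

```python
import csv
import io

COLUMNS = ["sample_name", "barcode_5p", "barcode_3p", "barcode_round", "barcode_group"]


def read_barcode_table(text: str) -> list:
    """Parse a header-less CSV barcode table into one dictionary per sample."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, (field.strip() for field in fields)))
        if row.get("barcode_round", "") not in ("", "early", "late"):
            raise ValueError("barcode_round must be 'early' or 'late': " + str(fields))
        rows.append(row)
    return rows


# Two samples sharing barcode_group 'rbd'; the 8-mer barcodes are invented.
example_table = (
    "rbd2,ACGTACGT,TGCATGCA,early,rbd\n"
    "rbd3,GATCGATC,CTAGCTAG,late,rbd\n"
)
barcode_rows = read_barcode_table(example_table)
```

Both rows land in the same barcode_group with different rounds, which is the arrangement needed for an early-to-late enrichment calculation.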

Important Notes

All barcode groups should indicate the population by given target, condition or other. For example, rbd2 and rbd3 sample_name populations are both assigned a barcode_group ‘rbd’ but different barcode_rounds of ‘early’ and ‘late’, respectively. First, relative abundance will be calculated for the early and late populations as ‘percent_roi_early’ and ‘percent_roi_final’. If both an ‘early’ and ‘late’ round exist, then fold_enrichment_roi will be calculated.

The sample_name, barcode_group and barcode_round can be added after initial processing by annotation only or downstream. If this option is selected, it will either retain the initial name or modify the entire group by population.

What is a customized DNA or AA Database?

Customized databases are simple text files with user-defined annotation scheme and scaffolds of interest. This is only needed if a representative sequence in the NGS or SANGER input likely deviates significantly from the natural Human, Rabbit, Mouse or Alpaca reference databases or if user has different annotation distinct from IMGT, KABAT or CHOTHIA. For custom amino acid databases or DNA databases follow these guidelines:

Customized Amino Acid Database

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12
2) Heavy chain representative should contain AA space separated:
a) match_name_vh FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa
3) Light chain representative should contain AA space separated:
a) match_name_vl FR1_aa CDR1_aa FR2_aa CDR2_aa FR3_aa CDR3_aa FR4_aa
4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).
5) Every scaffold/germline within the NGS/Sanger dataset should contain one representative in this custom annotation file (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12). If you are unsure of the expected diversity from the selected output then it is better to use any of the databases (e.g., Human, Mouse, etc.) provided with the AbXtract platform.

Customized DNA Database

1) match_name: name of the heavy or light chain scaffold. Examples include - mepolizumab_vh, mepolizumab_vl, IGHV1-18, IGLV1-36, IGKV1-12
2) Heavy chain representative should contain DNA bases space separated:
a) match_name_vh FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna
3) Light chain representative should contain DNA bases space separated:
a) match_name_vl FR1_dna CDR1_dna FR2_dna CDR2_dna FR3_dna CDR3_dna FR4_dna
4) Each chain expected in the sequence file needs at least one representative (e.g., VH or VL).
5) Every scaffold/germline within the NGS/Sanger dataset should contain one representative in this custom annotation file (e.g., Mepolizumab_VL, Mepolizumab_VH, IGHV1-18, IGLV1-36, IGKV1-12). If you are unsure of the expected diversity from the selected output then it is better to use any of the databases (e.g., Human, Mouse, etc.) provided with the AbXtract platform.
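Both custom database layouts above share the same shape: a match_name followed by seven space-separated regions. The sketch below checks one such line; the function name is hypothetical, and the example uses literal placeholder tokens rather than real sequences.

```python
REGIONS = ["FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4"]


def parse_custom_db_line(line: str) -> dict:
    """Split one database line into match_name plus the seven annotated regions."""
    parts = line.split()
    if len(parts) != len(REGIONS) + 1:
        raise ValueError("expected match_name followed by 7 space-separated regions")
    entry = {"match_name": parts[0]}
    entry.update(zip(REGIONS, parts[1:]))
    return entry


# Placeholder tokens stand in for the real amino acid or DNA strings.
entry = parse_custom_db_line("mepolizumab_vh fr1 cdr1 fr2 cdr2 fr3 cdr3 fr4")
```

A line with a missing or extra region raises an error, which catches the most common formatting slip before the file is uploaded.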

How is relative enrichment calculated?

Two options are available:

A) ‘Relative Abundance and Enrichment Calculation by Region of Interest (ROI)’ that takes reads in early round dataset(s) (multi-select option available) as one input, and a late round dataset(s) (multi-select option available) as the other input.
B) If in-line barcodes are used, one can define a barcode_round and barcode_group in the barcode file/table in the 4th and 5th column, respectively, and run any one of the ‘NGS Pipeline - AbXtract’ or ‘NGS and Sanger Pipeline - AbXtract’ FLOEs.

Relative enrichment is built for selection campaigns that are designed to look for enrichment from an “early” round to a “late” round. In some instances the early round is the population that is selected using higher (less stringent) concentrations of the target (e.g., 100nM target concentration), while the late round is the population that is selected at a lower (more stringent) concentration (e.g., 1nM). The user defines a region of interest (ROI) such as ‘CDR3 Chain_2 (Downstream Chain)’. The relative abundance (%) by ‘barcode_group’ is calculated first, so any ‘early’ round population that belongs to barcode_group population ‘X’ will quantify the frequency of that ROI across its entire population. While non-redundant full-length sequences are condensed by population, if the ROI is smaller than the full-length fragment (e.g., CDR3 Chain_2 (Downstream Chain)) then you will likely have many distinct full-length sequences that share the same ROI. As a result, many of the non-redundant full-length sequences with the same ROI will share the same relative abundance values. Next, once the relative abundance values are calculated, the late round relative abundance values are compared to the early round relative abundance values by population (e.g., barcode_group) to find the relative fold enrichment by ROI (fold_enrichment_roi). Several different outcomes may come about depending on whether a barcode_round and/or barcode_group is supplied:

1) If the option B) is used above, and there is no supplied barcode_round and barcode_group, then no enrichment value is calculated (all values will be the same).
2) If the option B) is used above, and there is no supplied barcode_round but different barcode_group, then no enrichment value is calculated and all values by barcode_group will retain similar relative abundance values.
3) If the option B) is used above, if a barcode_round is supplied and no barcode_group, then it will compare the respective barcode_rounds (early vs late) across the entire dataset.

If a given ROI is not found in the early round or the late round a pseudo value is supplied at the relative abundance by taking the minimum value in one group and dividing by a correction factor, one for the early round ROIs not found in late rounds and another one for the late round ROIs not found in early round.
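The pseudo value step can be sketched as follows. Several details here are assumptions rather than documented behaviour: fold enrichment is taken as the late relative abundance divided by the early one, the pseudo value uses the minimum of the round in which the ROI is missing, and the two correction factors default to placeholder values of 2.0. The function name is hypothetical.

```python
def fold_enrichment(early, late, early_correction=2.0, late_correction=2.0):
    """Late / early relative abundance per ROI within one barcode_group.

    early, late: dict mapping ROI -> relative abundance (%).
    ROIs missing from a round receive that round's minimum value divided
    by a correction factor (assumed behaviour, placeholder factors).
    """
    pseudo_early = min(early.values()) / early_correction
    pseudo_late = min(late.values()) / late_correction
    result = {}
    for roi in set(early) | set(late):
        early_value = early.get(roi, pseudo_early)
        late_value = late.get(roi, pseudo_late)
        result[roi] = late_value / early_value
    return result


# Made-up relative abundances: ROI "B" disappears, ROI "C" is new in the late round.
enrichment = fold_enrichment({"A": 50.0, "B": 50.0}, {"A": 80.0, "C": 20.0})
```

Without the pseudo value, ROI "C" would require a division by zero and ROI "B" would collapse to exactly 0; the correction keeps both ratios finite and comparable.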

See Relative Abundance and Enrichment Calculation by Region of Interest (ROI) documentation for details.

How is relative abundance calculated?

Two options are available:

A) One is the individual Enrichment/Relative Abundance FLOE that takes reads in early round dataset in one port, and a late round dataset in the other port.
B) Alternatively, if in-line barcodes are used, one can define a barcode_round and barcode_group in the barcode file/table in the 4th and 5th column, respectively, and run any one of the NGS pipeline FLOEs.

Relative abundance is built for selection campaigns to calculate the relative frequency of populations by their barcode_group, and if supplied, barcode_round. The user defines a region of interest (ROI) such as downstream CDR3. The relative abundance (%) by ‘barcode_group’ is calculated first, so any ‘early’ round population that belongs to barcode_group population ‘X’ will quantify frequency of that population across its entire population. If the ROI is a smaller region (e.g., CDR3 only) then you may have many distinct full-length sequences that share the same ROI. The full-length sequences are not condensed, but will share the same relative abundance values by ROI. The relative abundance is calculated depending on the designated group supplied by the barcode table or after specifying using the Modify Sample Name or Barcode Group FLOE.

Several different outcomes may come about if a barcode_round and/or barcode_group is supplied:

1) If the option B) is used above, and there is no supplied barcode_round and barcode_group, then relative abundance is calculated across the entire population.
2) If the option B) is used above, and there is no supplied barcode_round but different barcode_group, then relative abundance is calculated by the respective barcode_groups.
3) If the option B) is used above, if a barcode_round is supplied and no barcode_group, relative abundance will be calculated by respective barcode_rounds but the entire population will be treated as a single barcode_group population.
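The grouping behaviour in these three cases can be illustrated with a short sketch. It assumes records are dictionaries with a "count" field, treats a missing barcode_group or barcode_round as an empty string, and writes the result to a hypothetical percent_roi key; none of these names are confirmed AbXtract internals.

```python
from collections import defaultdict


def add_relative_abundance(records, roi_field, count_field="count"):
    """Attach the ROI's share (%) of its population to every record."""

    def population(rec):
        # Missing values fall back to "", so everything pools into one population.
        return (rec.get("barcode_group", ""), rec.get("barcode_round", ""))

    population_totals = defaultdict(int)
    roi_totals = defaultdict(int)
    for rec in records:
        population_totals[population(rec)] += rec[count_field]
        roi_totals[population(rec) + (rec[roi_field],)] += rec[count_field]

    annotated = []
    for rec in records:
        share = roi_totals[population(rec) + (rec[roi_field],)] / population_totals[population(rec)]
        annotated.append({**rec, "percent_roi": 100.0 * share})
    return annotated


# Made-up records: "a" and "b" are distinct full-length sequences sharing one CDR3.
reads = [
    {"id": "a", "cdr3_aa_2": "ARDYW", "count": 3, "barcode_group": "rbd"},
    {"id": "b", "cdr3_aa_2": "ARDYW", "count": 1, "barcode_group": "rbd"},
    {"id": "c", "cdr3_aa_2": "ARGFW", "count": 12, "barcode_group": "rbd"},
]
annotated_reads = add_relative_abundance(reads, "cdr3_aa_2")
```

Records "a" and "b" are not condensed, yet both receive the same value because they share the ROI, as described in the paragraph above.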

If a given ROI is not found in the early round or the late round a pseudo value is supplied at the relative abundance by taking the minimum value in one group and dividing by a correction factor, one for the early round ROIs not found in late rounds and another one for the late round ROIs not found in early round.

See Relative Abundance and Enrichment Calculation by Region of Interest (ROI) FLOE or CUBE documentation.

What is the meaning of the cluster vocabulary? What is a cluster? What is the difference between cluster and cluster_numeric?

Clustering can be performed using any option such as AbScan, Unique Only (100% identity), Levenshtein distance, or Hamming distance. The number of non-redundant clusters is a function of the selection conditions, library diversity and sequencing depth, among others. Typically, the lower edit distance stringency (e.g., edit distance <= 5) the greater the diversity. From our experience the total number of non-redundant unique/clonotype/clusters are ranked as follows:

Unique Only (100% identity) >= Hamming distance >= Levenshtein distance >= AbScan.

If a greater value is applied for the edit distance, then the order may be altered such that Unique Only (100% identity) >= AbScan >= Hamming distance >= Levenshtein distance. However, if values are increased to the length of the given region of interest the entire population would be clustered together.

Regardless of the clustering method used, the vocabulary of the cluster stays the same and will be assigned a letter by the CDR for given chain as follows:

1) AB = upstream CDR1 and CDR2
2) C = upstream CDR3
3) DE = downstream CDR1 and CDR2
4) F = downstream CDR3

If the orientation in the vector is VL upstream (5’) and VH downstream (3’), then the HCDR3 would be defined with the letter F. Further, clustering is conducted by clustering only like regions together; for example, LCDR3 is first clustered against all LCDR3 in the population, and HCDR3 is first clustered against all HCDR3 in the population. After the respective regions are assigned a cluster id, the ids are concatenated to form the complete cluster id. This approach is applied in the same manner regardless of the cluster method selected.

Clustering also depends on the region of interest (ROI) selected, which determines the type of clusters that will be assigned. For instance, if the orientation is VL upstream (5’) and VH downstream (3’) and the user selects “CDR3 Chain_2 (Downstream Chain)”, then only the HCDR3 will be used for clustering, so typical outputs will be the unique cluster ID integer concatenated with the letter ‘F’ such as 1F, 11F, 33F, etc. Other examples by ROI are as follows using the same VL and VH orientation:

1) Merged CDRs - All CDRs are considered by sequencing type, if PacBio, numbering examples would be: 1AB-33C-24DE-22F indicating cluster 1AB of LCDR1 and LCDR2, cluster 33 of the LCDR3, cluster 24 of the HCDR1 and HCDR2, and cluster 22 of the HCDR3.
2) Merged CDRs - All CDRs are considered by sequencing type, if Illumina and only heavy chain is sequenced, numbering examples would be: 1AB-33C indicating cluster 1AB of HCDR1 and HCDR2, cluster 33 of the HCDR3.
3) CDR3 Chain_1 (Upstream Chain) - Only upstream CDR3 considered, if PacBio, numbering examples would be: 33C indicating cluster 33 of the LCDR3.
4) CDR3 Chain_1 (Upstream Chain) - Only upstream CDR3 considered, if Illumina and only heavy chain is sequenced, numbering examples would be: 33C cluster 33 of the HCDR3.
5) HCDR3 and LCDR3 - only the HCDR3 and LCDR3 are considered, if PacBio, numbering examples would be: 33C-22F indicating cluster 33 of the LCDR3 and cluster 22 of the HCDR3.
6) HCDR3 and LCDR3 - only the HCDR3 and LCDR3 are considered, if Illumina and only heavy chain is sequenced, numbering examples would be: 33C indicating cluster 33 of the HCDR3.
7) Full-Length, Including Framework - this is treated similarly to ‘Merged CDRs’.
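The cluster id vocabulary in these examples is regular enough to split programmatically. The sketch below is an illustration only: the function name is hypothetical, and it assumes ids always consist of integer-plus-letter parts joined by ‘-’. A negative number such as the -1C placeholder described in this entry is also accepted.

```python
import re

LETTER_TO_REGION = {
    "AB": "upstream CDR1 and CDR2",
    "C": "upstream CDR3",
    "DE": "downstream CDR1 and CDR2",
    "F": "downstream CDR3",
}

# A part starts at the beginning of the id or after a '-' separator;
# the optional second '-' lets a negative placeholder number through.
_PART = re.compile(r"(?:^|-)(-?\d+)(AB|DE|C|F)")


def parse_cluster_id(cluster_id: str) -> dict:
    """Map each region named in a cluster id to its cluster number."""
    return {
        LETTER_TO_REGION[letters]: int(number)
        for number, letters in _PART.findall(cluster_id)
    }
```

For example, parse_cluster_id("33C-22F") yields cluster 33 for the upstream CDR3 and cluster 22 for the downstream CDR3.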

Due to the importance of specificity in the CDR3, there are additional fields called cluster_cdr3_1 for the upstream CDR3 and cluster_cdr3_2 for the downstream CDR3. If the orientation is VL upstream (5’) and VH downstream (3’), the sequencing conducted is PacBio and the user only selects “CDR3 Chain_2 (Downstream Chain)”, the cluster_cdr3_2 will be given a standard value (e.g., 22F); however, the cluster_cdr3_1 will be assigned a -1C, as no clustering is performed at this region. Any of the options such as “Merged CDRs”, “HCDR3 and LCDR3” or “Full-Length, Including Framework” using PacBio will assign a value (e.g., 33C) to cluster_cdr3_1. If Levenshtein distance or Hamming distance is used, the edit distance is calculated for each individual region.

The alternate field called “cluster_numeric” is the numeric equivalent of the non-redundant “cluster” field, and can aid in plotting using the Analyze interactive tool when the number of cluster values is large.

AbScan is the default option. AbScan is based on an unsupervised density-based clustering using a biophysical tokenization of the sequence space to enable grouping of antibodies that are more likely to share similar properties. With AbXtract through AbScan the goal is to select top full-length representatives that are density connected through the sequence-space that are not only connected through high density but also distinctly separated from other clusters by low density regions. In contrast to common clonotyping methodologies, clustering is dependent on variety of factors, including:

1) Sequencing depth of the population.
2) Abundances of given regions of interest.
3) Diversity of the underlying population from selection campaign.
4) Size of the target.
5) Number of accessible epitopes on the target.
6) Biophysical properties of the antibody paratope.

In addition to increasing the chances of identifying sequences with distinct properties, AbScan has a secondary benefit by minimizing the selection of sequences that are likely contaminants, sequencing errors, or amplification errors that may be introduced into a given process.

I have selected the Levenshtein or Hamming Distance Clustering Method. Where do I modify the edit distance?

To adjust the Levenshtein or Hamming edit distance (default = 2) one must go into the Hidden Parameters under appropriate cube and change parameter ‘Max Distance for Levenshtein Or Hamming’.

How are liabilities quantified, and what do the different liability fields mean?

Liabilities are quantified across all 3 CDRs in Illumina or all 6 CDRs in PacBio (using the user-defined annotation scheme). The liability fields include both string fields and quantification fields. These are determined by the user-defined selections in the individual FLOEs that quantify liabilities or by using a customized liabilities table (see below on the customized table format). The latter option will override the former.

For the string fields you have designations as follows with corresponding descriptions using VL and VH orientation in PacBio:

1) liability_string_cdr1_aa_1 - ‘+’ separated list of LCDR1 liabilities found in region.
2) liability_string_cdr2_aa_1 - ‘+’ separated list of LCDR2 liabilities found in region.
3) liability_string_cdr3_aa_1 - ‘+’ separated list of LCDR3 liabilities found in region.
4) liability_string_cdr1_aa_2 - ‘+’ separated list of HCDR1 liabilities found in region.
5) liability_string_cdr2_aa_2 - ‘+’ separated list of HCDR2 liabilities found in region.
6) liability_string_cdr3_aa_2 - ‘+’ separated list of HCDR3 liabilities found in region.

For the quantification fields you have designations as follows with corresponding descriptions using VL and VH orientation in PacBio:

1) liability_quant_cdr1_aa_1 - total number of LCDR1 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in LCDR1).
2) liability_quant_cdr2_aa_1 - total number of LCDR2 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in LCDR2).
3) liability_quant_cdr3_aa_1 - total number of LCDR3 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in LCDR3).
4) liability_quant_cdr1_aa_2 - total number of HCDR1 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in HCDR1).
5) liability_quant_cdr2_aa_2 - total number of HCDR2 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in HCDR2).
6) liability_quant_cdr3_aa_2 - total number of HCDR3 liabilities (a given liability, ‘e.g. DD liability’, is tabulated once in HCDR3).

In some cases, you have liabilities that are contained within another liability, e.g., ‘YY’ and ‘YYY’ liability. These are counted as two distinct liabilities.
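The counting rules above (each liability tabulated once per CDR, nested liabilities counted separately) can be sketched with regular expressions. The function name, the liability names and the example CDR string are all made up for illustration.

```python
import re


def tabulate_liabilities(cdr_aa: str, patterns: dict):
    """Return the '+' separated liability string and the liability count for one CDR.

    patterns maps a regular expression to a liability name.
    """
    found = [name for regex, name in patterns.items() if re.search(regex, cdr_aa)]
    return "+".join(found), len(found)


example_patterns = {"DD": "DD", "YY": "YY", "YYY": "YYY"}
liability_string, liability_quant = tabulate_liabilities("ARDDYYYDDW", example_patterns)
```

The example CDR contains ‘DD’ twice, but it is counted once; ‘YY’ and ‘YYY’ both match the same stretch and are counted as two distinct liabilities, giving a total of 3.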

What is a customized liabilities table/database, how is it formatted, and does it override the FLOE parameters?

The customized liabilities will override any of the options provided by the Liabilities CUBE or FLOES that have this capability. These can be in Excel, TSV, or CSV format and simply contain the regular expression (regex) pattern in first column and the name of the liability in the second column. For instance, the column (Excel) or comma (CSV) separated value can be written as follows:

DD,’DD - Isomerization’

The first value (column A) contains the regular expression ‘DD’, so any CDR containing this liability will be quantified in the liability_quant field (+1 per CDR) and identified in the liability_string field, such as liability_quant_cdr3_aa_2 and liability_string_cdr3_aa_2, respectively. The second value (column B) is simply the identity of the liability. Unless the user provides a specialized liability (see below), this second value is arbitrary. However, it is recommended to use a short descriptor that specifies the liability type. This will appear in the liability_string field (e.g., liability_string_cdr3_aa_2).

Important Note

Avoid the use of ‘+’ or ‘,’ values within these strings.

There are a few examples of specialized liabilities, including ‘Cysteine’, ‘Charge’ and ‘Parker Hydropathy’. The column (Excel) or comma (CSV) separated values can be written as follows:

Cysteine, ‘Unpaired Cysteines’
Charge, 1
Charge, 2
Parker Hydropathy, 0
Parker Hydropathy, -1

The first column (column A) contains the special string name (case-sensitive) that needs to match exactly as above. Once the specialized name is provided in first column, the second column values follow their own rules as follows:

1) Column B for ‘Cysteine’ can be any description (usually ‘Unpaired Cysteine’). EXAMPLE: Cysteine, ‘Unpaired Cysteines’ will first quantify whether a given CDR contains an unpaired cysteine, i.e. an odd number of cysteines (e.g., ‘AGGSCDGSDG’ and ‘AGGCSDCTC’ both contain an unpaired cysteine). However, if a CDR has an even number of cysteines but the spacing between them is <4, it is still quantified as an Unpaired Cysteine.
2) Column B for ‘Charge’ is an integer; any CDR with a net charge above this integer is flagged (a more positive net charge is undesirable). EXAMPLE: Charge, 1 will quantify any CDR above a +1 net charge. EXAMPLE: Charge, 2 will quantify any CDR above a +2 net charge.
3) Column B for ‘Parker Hydropathy’ is a float threshold, typically zero or negative (a more negative value is less hydrophilic); any CDR below this value is flagged. EXAMPLE: Parker Hydropathy, 0 will quantify any CDR below 0. EXAMPLE: Parker Hydropathy, -1 will quantify any CDR below -1.

Important Note

Parker Hydropathy is a hydrophilicity index with more positive values indicative of greater hydrophilicity. Charge is assessed at physiological pH of 7.
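
The ‘Cysteine’ and ‘Charge’ rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the AbXtract algorithm: the function names are hypothetical, "spacing" is assumed to mean the number of residues between two neighbouring cysteines, and net charge is approximated by counting K and R as +1 and D and E as -1 (histidine treated as neutral at pH 7). Parker Hydropathy is left out because the scale values are not given in this FAQ.

```python
def has_unpaired_cysteine(cdr, min_spacing=4):
    """Flag a CDR as containing an unpaired cysteine.

    Odd number of cysteines: always flagged.
    Even number: flagged when two neighbouring cysteines have fewer
    than min_spacing residues between them (assumed meaning of
    'spacing' in the FAQ).
    """
    positions = [i for i, residue in enumerate(cdr) if residue == "C"]
    if len(positions) % 2 == 1:
        return True
    for left, right in zip(positions, positions[1:]):
        if right - left - 1 < min_spacing:
            return True
    return False


POSITIVE = set("KR")  # assumed +1 each at pH 7
NEGATIVE = set("DE")  # assumed -1 each at pH 7


def net_charge(cdr):
    """Simplified integer net charge of a CDR (an approximation)."""
    positive = sum(residue in POSITIVE for residue in cdr)
    negative = sum(residue in NEGATIVE for residue in cdr)
    return positive - negative


def exceeds_charge(cdr, threshold):
    """True when the net charge is strictly above the threshold."""
    return net_charge(cdr) > threshold


for cdr in ("AGGSCDGSDG", "AGGCSDCTC", "ACGGGGGCA"):
    print(cdr, has_unpaired_cysteine(cdr))
print("AKGR", exceeds_charge("AKGR", 1), exceeds_charge("AKGR", 2))
```

Under these assumptions, a row such as Charge, 1 corresponds to exceeds_charge(cdr, 1), and a stricter or looser row only changes the threshold argument.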

How many records are typically handled by the Analyze tool?

Orion’s interactive Analyze tool can handle up to 100k records, but performance decreases above 10-20k records. Use the condense by ROI or automated selection tools to narrow the population to a few representative sequences within each cluster. Otherwise, the records can be downloaded as CSV.

How long does it take to load the FLOE Report? Where in Orion can the FLOE reports be found?

The FLOE report needs to load all of its contents in HTML format, which typically takes about 10-20 seconds but can occasionally take up to one minute. The report also uses memory in the browser and on the local machine, so close other applications to speed up loading of the page.

FLOE reports are treated as their own separate entity, like “Files” or “Datasets”, and can be found in the Data folders under “Collections”. They are generated by many of the FLOEs.

I am having difficulty uploading my FASTQ file, or Orion logs out before the process is complete. How do I overcome this?

Try the OCLI command-line option, if available, which is faster. If not, try the following:

1) Break the FASTQ into smaller chunks (e.g., 500k sequences each), then upload and run each chunk separately. The resulting records can be loaded collectively into downstream cubes.
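
One way to produce such chunks locally is sketched below in Python. The function name split_fastq and the output naming pattern are hypothetical, and the sketch assumes an uncompressed FASTQ in which every record occupies exactly four lines (header, sequence, ‘+’, qualities).

```python
import itertools
from pathlib import Path


def split_fastq(path, records_per_chunk=500_000, out_dir=None):
    """Split a 4-line-per-record FASTQ into numbered chunk files.

    Returns the list of chunk paths in order. Lines are streamed,
    so a whole chunk is never held in memory.
    """
    path = Path(path)
    out_dir = Path(out_dir) if out_dir else path.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with path.open() as handle:
        for index in itertools.count(1):
            # Take the next N records (4 lines each) from the input.
            chunk_lines = itertools.islice(handle, 4 * records_per_chunk)
            first = next(chunk_lines, None)
            if first is None:  # input exhausted
                break
            chunk_path = out_dir / f"{path.stem}.part{index:03d}.fastq"
            with chunk_path.open("w") as out:
                out.write(first)
                out.writelines(chunk_lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

For example, split_fastq("reads.fastq", records_per_chunk=500_000) would write reads.part001.fastq, reads.part002.fastq, and so on next to the input file; each chunk can then be uploaded and run separately.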

If none of the above works, please contact support@eyesopen.com.

Why is my download taking so long?

Downloads of records are typically limited to ≤100k records, because they are designed to work with filtered datasets of top representative sequences per cluster for use in experimental validation. Unlike a simple file download, a dataset of records must first be converted from its record structure into another format, such as CSV.

Nonetheless, we are in the process of improving the efficiency of downloads of much larger datasets, >100k records.

My FLOE fails due to a memory allocation issue. How do I overcome this?

Each CUBE within a FLOE has a hidden parameter that allows users to adjust the memory and disk space requirements. Most often the failure occurs within the Consolidate CUBE or FLOE REPORT Writer Cube. Increase the requirements to two, four, or more times the original value. Make sure to set acceptable cost limits for the job. If the job continues to fail, please contact support@eyesopen.com.

My FLOE has been stalling for more than 48 hours in the NGS (PacBio/Illumina) Pipeline (NGS), or has gone beyond the cost level acceptable for this job. What can I do?

The most likely reason is that the dataset is highly diverse. You have several options available:

1) Under the Hidden Parameters in the “Downstream KEY PARAMETER Cube”, increase the minimum percent for region of interest or the minimum count for region of interest option. This eliminates low-frequency clones from the population after the enrichment calculation but before the clustering process (the most computationally intensive process).
2) If AbScan is used, try changing the cluster type to Unique Only (100% Homology), which clusters by the HCDR3.
3) Turn off the write FLOE Report option.
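
The effect of the first option can be pictured with a short Python sketch. It is only an illustration: the function name filter_clones is hypothetical, and the sketch assumes that both thresholds are inclusive and that a clone must satisfy both to be kept, which may differ from how the cube combines them.

```python
def filter_clones(counts, min_count=1, min_percent=0.0):
    """Keep regions of interest that meet both thresholds.

    counts maps a region of interest (e.g. an HCDR3 sequence) to the
    number of times it was observed in one population.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    kept = {}
    for roi, count in counts.items():
        percent = 100.0 * count / total
        if count >= min_count and percent >= min_percent:
            kept[roi] = count
    return kept


# Illustrative counts only.
population = {"ARDYW": 90, "ARGYW": 8, "ARWYW": 2}
print(filter_clones(population, min_count=5))
print(filter_clones(population, min_percent=10.0))
```

Raising either threshold shrinks the set of sequences passed on to clustering, which is why it reduces run time and cost for highly diverse datasets.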

What happens if I want to change a barcode group after the FASTQ has already been processed or I did not add a barcode group?

This is easy to do: select the FLOE titled ‘Modify Sample Name / Barcode Group for Downstream Processing’. Fields that are left alone stay the same, while those that are altered are changed for the entire population. Sometimes the sequencing provider will run pre-barcoded sets. In that case, best practice is to run the individual files independently using the annotation-only FLOE, then use the ‘Modify Sample Name / Barcode Group for Downstream Processing’ FLOE to add the desired barcode_group to each population.

I am having difficulty downloading my file from the dropdown option for my dataset. What can I do?

Downloads of records are typically limited to ≤100k records. Nonetheless, if you still want to download a larger number of records, several options are available.

1) Try the OCLI command-line tool. If it is not available, continue with the other options listed here.
2) Instead of using the download from the dropdown under the ‘…’ option, run the FLOE titled “Dataset to File Export”, provide the desired filename, and be sure to write out the suffix ‘.csv’. This runs the job as a FLOE, so the browser does not need to stay open for the job to complete.
a) You may need to adjust the ‘Record to File Converter’ memory or temporary disk space options well above their default values.
3) Download the file as an oedb instead of a csv; support can provide an oedb to csv converter.

If none of the above options works, please contact support (support@eyesopen.com).

Will annotating my sequences using the FLOE “NGS (PacBio/Illumina) IgMatcher, Annotation Only - AbXtract” give me the same result as my AIRR-compatible file?

Not necessarily. This depends on the annotation scheme used. The FLOE “NGS (PacBio/Illumina) IgMatcher, Annotation Only - AbXtract” provides the option to annotate the framework and complementarity determining region (CDR) boundaries using different annotation schemes: IMGT, Kabat, or Chothia.

Why are there twice as many records/rows output when I convert a PacBio input to an AIRR-compatible file?

When paired chain records are converted to AIRR format, each chain (_1 and _2) is written to its own row, so every paired record becomes two rows. The paired chains can be linked using the “sequence_id” column.
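
Re-linking the two rows can be sketched in Python as follows. The function name pair_chains and the example rows are hypothetical; the sketch assumes a tab-separated AIRR file in which both chains of a pair carry the same “sequence_id” value, and it uses the “locus” and “junction_aa” columns purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative AIRR-style rows: two rows (one per chain) per sequence_id.
AIRR_TSV = (
    "sequence_id\tlocus\tjunction_aa\n"
    "read_001\tIGH\tCARDYW\n"
    "read_001\tIGK\tCQQYF\n"
    "read_002\tIGH\tCARGGW\n"
    "read_002\tIGK\tCQQSF\n"
)


def pair_chains(tsv_text):
    """Group AIRR rows by sequence_id so paired chains sit together."""
    paired = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        paired[row["sequence_id"]].append(row)
    return dict(paired)


for sequence_id, chains in pair_chains(AIRR_TSV).items():
    print(sequence_id, [(c["locus"], c["junction_aa"]) for c in chains])
```

Each key of the returned dictionary then corresponds to one original paired record, with its two chain rows as the values.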