How to Guides¶
How to Use the Floe Report and NGS Select to Select Population of Interest¶
The Floe reports are generated at each stage of the NGS Pipeline Floe. The goal of these reports is to provide a broad statistical overview of the underlying population, such as the cluster groups or scaffolds that are favored in a given selection. This information can guide decisions such as which sequences (by seq_id) are most important to a given selection, which clusters and scaffolds are most relevant, how many additional sequences exist in a given cluster, and how many individual regions of interest (ROI) overlap a given population. These same parameters can be applied to a smaller subset to select populations of interest in the interactive interface tool.
In this How-to Guide, we first describe how to use the Floe Report to identify additional clones. We then use the Custom NGS Select by Seq ID of Additional NGS Representatives by Group - AbXtract Floe with an Excel file that sets, for each seq_id, the maximum number of additional sequences to identify that share the same cluster or region of interest.
Use any of the tutorials and identify the job’s Floe Report. Select the Floe page > Jobs Tab > Job Name to bring up all the information pertaining to a given job. Find the ‘Collections’ associated with the given report. The report, if default names were used, should say something along the lines of ‘NGS & Sanger Downstream-down.long_read’ or ‘NGS Downstream-down.long_read’.
Launch the Floe Report by first clicking “Show in Project Data” in the Data directory, then click “Open Floe Report”. Click the Show Report button and allow about 10 seconds for the report to load. You can also click the box with an arrow leading out of it to open the report in a separate browser tab; this again takes about 5-15 seconds to load.
Within the report, find the plot titled ‘Range of # CDRs with sequence liabilities across cluster sequences’. Identify the clusters that have a range of 3 (red). Use the hover text to note the seq_id values, which are the top-ranked sequences within each group.
In Excel, use these seq_id values to search for similar NGS sequences. In Column A, enter a seq_id identified from the hover text (e.g., 11981_tri). Scroll to another location in the report, identify additional sequences, and add them row by row under Column A. In Column B, specify for each seq_id the maximum number of sequences desired that share similar sequence-based features. Save this file as ‘custom_ngs_select.xlsx’ or any name that suits you.
Find the floe titled Custom NGS Select by Seq ID of Additional NGS Representatives by Group - AbXtract. Under Custom Input File With SEQ ID, REQUIRED, add the ‘custom_ngs_select.xlsx’ file that was created; it can be uploaded directly into the floe by clicking the ‘Upload File/URL’ tab, next to the ‘Existing Data’ tab. Next, under Input NGS Dataset, choose the dataset that contains the same seq_id values identified in the Floe Report. Do not select a different dataset, or the floe will fail.
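The two-column layout described above can also be drafted and sanity-checked in a short script before it goes into Excel. The following is a minimal sketch, not part of AbXtract: the helper names `validate` and `write_table` and the second seq_id are hypothetical, and the absence of a header row is an assumption, since this guide does not mention one. It emits CSV text only; because the guide describes an Excel file, the result would still need to be opened in Excel and saved as ‘.xlsx’.

```python
import csv
import io

# Hypothetical picks: (seq_id from the hover text, maximum additional sequences).
picks = [
    ("11981_tri", 10),
    ("EXAMPLE_SEQ_ID", 5),  # placeholder, not a real seq_id
]


def validate(rows):
    """Return rows unchanged, or raise ValueError on the first bad entry."""
    seen = set()
    for seq_id, max_n in rows:
        if not seq_id or seq_id != seq_id.strip():
            raise ValueError(f"blank or padded seq_id: {seq_id!r}")
        if seq_id in seen:
            raise ValueError(f"duplicate seq_id: {seq_id}")
        if not isinstance(max_n, int) or max_n < 1:
            raise ValueError(f"maximum for {seq_id} must be a positive integer")
        seen.add(seq_id)
    return rows


def write_table(rows, handle):
    """Write Column A (seq_id) and Column B (maximum), one pick per row."""
    csv.writer(handle).writerows(validate(rows))


buffer = io.StringIO()
write_table(picks, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the in-memory buffer would save the same rows to disk.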
Next, provide an output_name and the ranking order to choose additional sequences that share similar sequence-based features as the seq_id specified in the Excel file. Select the ‘Region Of Interest (ROI) To Select Top Representatives’ that will map to the same representative.
Important Note
If clustering is not performed then the region of interest (ROI) (e.g., HCDR3) should be specified.
Use the default ‘Cluster’ option and click ‘Start Job’. The output should contain a set of sequences that share the same cluster, with no more than the number specified in Column B of the uploaded file for each seq_id (there may be fewer if the cluster does not contain enough sequences).
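To make the expected output concrete, the selection behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the floe's implementation: the function `select_additional`, the ‘cluster’ key, and the choice to exclude the seed sequence from its own results are assumptions, and the records are invented.

```python
from collections import defaultdict


def select_additional(records, max_per_seed, group_key="cluster", rank_key="count"):
    """For each seed seq_id, pick up to N other records from the same group."""
    by_id = {r["seq_id"]: r for r in records}
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)

    selected = {}
    for seed_id, max_n in max_per_seed.items():
        seed = by_id.get(seed_id)
        if seed is None:
            # Mirrors the warning above: the seq_id must exist in the dataset.
            raise KeyError(f"{seed_id} is not in the dataset")
        peers = [r for r in groups[seed[group_key]] if r["seq_id"] != seed_id]
        peers.sort(key=lambda r: r[rank_key], reverse=True)
        selected[seed_id] = [r["seq_id"] for r in peers[:max_n]]
    return selected


# Invented records; only 11981_tri comes from this guide.
records = [
    {"seq_id": "11981_tri", "cluster": 7, "count": 120},
    {"seq_id": "s2", "cluster": 7, "count": 80},
    {"seq_id": "s3", "cluster": 7, "count": 95},
    {"seq_id": "s4", "cluster": 9, "count": 300},
]

# Column B asked for up to 5, but the cluster only holds 2 other sequences.
result = select_additional(records, {"11981_tri": 5})
print(result)
```

In this example only two sequences are returned even though five were requested, which corresponds to the "may be fewer" case.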
Download the file or visualize with the Analyze function.
How to Use the Floe Report and SANGER Select to Pick Population of Interest¶
The ‘Custom SANGER Select of Additional NGS Representatives by Group - AbXtract’ floe works similarly to ‘Custom NGS Select by Seq ID of Additional NGS Representatives by Group - AbXtract’. Provide the ‘seq_id’ associated with a SANGER sequence of interest and specify the maximum number of additional sequences to identify within the same cluster or region of interest (ROI).
However, this floe also has an option to simply pick a maximum number of sequences for every SANGER clone in a consolidated SANGER dataset (the dataset must contain both NGS and SANGER sequences that were clustered at the same time). Simply 1) load the consolidated dataset that contains the SANGER clones and NGS population, 2) select the ‘Max Number of Unique NGS Desired’ (10), 3) keep the ranking order the same, and 4) keep the region of interest as ‘Cluster’. Select ‘Start Job’; the output should contain no more than the ‘Max Number of Unique NGS Desired’ for each SANGER clone. Send the output dataset to the ‘Subset the Number of Fields for Export - AbXtract’ floe to keep only the field names of interest before downloading.
Alternatively, use the Floe Report to identify the SANGER ‘id’ of interest, e.g., ‘155_Sanger’. Build an Excel file similar to the one used for the ‘Custom NGS Select’ floe, but name it ‘custom_sanger_select.xlsx’.
Download the file or visualize with the Analyze function.
How to Condense a Dataset with Too Many Records¶
Sometimes a dataset has far too many records for some downstream processing steps. To reduce its size, use the Condense Dataset by Region of Interest by Most Abundant - AbXtract Floe, which has several options for condensing a dataset. Choose the dataset you want to condense and provide output names so that you can identify the results in your directory. If the ‘Keep All Representatives By Group’ option is turned on, the floe keeps all sequences that meet the minimum thresholds for ‘percent_roi_final’, ‘count_roi_final’ and/or ‘count’.
If ‘Keep All Representatives By Group’ is not turned on, one can still filter on ‘percent_roi_final’, ‘count_roi_final’ and/or ‘count’. In this mode the floe ranks each full-length sequence within a given region of interest (ROI) or cluster by its count within each population, then keeps only the user-defined number (X) of sequences that share this region or cluster.
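The rank-then-keep-top-X behavior described above can be pictured with a small Python sketch. It is a hedged illustration rather than the floe's actual code: the function `condense`, the ‘hcdr3’ key, and the sample records are hypothetical, and only the ‘count’ threshold is modeled.

```python
from collections import defaultdict


def condense(records, roi_key, top_x, min_count=1):
    """Keep the top_x most abundant records within each ROI (or cluster) group."""
    groups = defaultdict(list)
    for r in records:
        if r["count"] >= min_count:  # minimum 'count' threshold
            groups[r[roi_key]].append(r)

    kept = []
    for members in groups.values():
        members.sort(key=lambda r: r["count"], reverse=True)
        kept.extend(members[:top_x])
    return kept


# Invented full-length sequences grouped by a hypothetical 'hcdr3' field.
records = [
    {"seq_id": "a", "hcdr3": "ARDY", "count": 50},
    {"seq_id": "b", "hcdr3": "ARDY", "count": 20},
    {"seq_id": "c", "hcdr3": "ARDY", "count": 5},
    {"seq_id": "d", "hcdr3": "AKGW", "count": 8},
]

condensed = condense(records, roi_key="hcdr3", top_x=2)
print([r["seq_id"] for r in condensed])
```

Raising `min_count` shrinks the result further, since records below the threshold are dropped before ranking.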
I Performed Enrichment. How Do I Know which Population of My Given Sequence was Found?¶
If enrichment information is desired, our enrichment approach ensures that only one copy of each full-length sequence is maintained. Enrichment is calculated on the specified region of interest (ROI), so if the ROI is shorter than the full-length sequence (e.g., HCDR3), many full-length sequences will share the same enrichment (or relative abundance) values. The early- and late-round information is still available in the ‘count_roi_final’, ‘count_roi_early’, ‘percent_roi_final’, and ‘percent_roi_early’ fields. The ‘round_enrich’ field specifies whether the region of interest of a given full-length sequence was identified in the early round only, the late (final) round only, or ‘both’ rounds.
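The relationship between the ROI-level counts and ‘round_enrich’ can be illustrated with a short Python sketch. It is an assumption-laden toy model, not AbXtract's calculation: the helpers `roi_counts` and `label_round`, the ‘hcdr3’ key, the reads, and the exact label strings ‘early’ and ‘late’ are hypothetical (only ‘both’ is quoted in this guide).

```python
from collections import Counter


def roi_counts(reads, roi_key):
    """Sum read counts per ROI value for one selection round."""
    counts = Counter()
    for r in reads:
        counts[r[roi_key]] += r["count"]
    return counts


def label_round(count_roi_early, count_roi_final):
    """Label where an ROI was seen; the label strings are assumptions."""
    if count_roi_early > 0 and count_roi_final > 0:
        return "both"
    if count_roi_early > 0:
        return "early"
    if count_roi_final > 0:
        return "late"
    return None


# Invented reads from an early and a final round.
early_reads = [
    {"seq_id": "e1", "hcdr3": "ARDY", "count": 3},
    {"seq_id": "e2", "hcdr3": "AKGW", "count": 4},
]
final_reads = [
    {"seq_id": "f1", "hcdr3": "ARDY", "count": 40},
    {"seq_id": "f2", "hcdr3": "ARDY", "count": 10},
    {"seq_id": "f3", "hcdr3": "ATTS", "count": 6},
]

early = roi_counts(early_reads, "hcdr3")
final = roi_counts(final_reads, "hcdr3")

annotated = [
    {
        **r,
        "count_roi_early": early[r["hcdr3"]],
        "count_roi_final": final[r["hcdr3"]],
        "round_enrich": label_round(early[r["hcdr3"]], final[r["hcdr3"]]),
    }
    for r in final_reads
]
for row in annotated:
    print(row)
```

Here f1 and f2 are different full-length sequences but share one HCDR3, so they carry identical ROI counts, which is the sharing effect described above.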
How to Reduce the Number of Fields in the Output¶
Use the Subset the Number of Fields for Export - AbXtract Floe to choose the fields that are most important to you. If you cannot recall the field names, click on the options to see all available fields. If you accidentally choose a field that is not in the dataset (e.g., ‘cdr3_aa_2’ in an Illumina-only dataset), not to worry: the floe will simply skip that name and output the fields that are actually in the dataset.
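The skip-if-missing behavior can be pictured as follows. This is a minimal sketch of the idea, assuming records are plain dictionaries; `subset_fields` and the ‘cdr3_aa_1’ field are hypothetical and do not describe the floe's internals.

```python
def subset_fields(records, wanted):
    """Keep only the wanted fields that exist; report the ones that were skipped."""
    available = set()
    for r in records:
        available.update(r)

    kept = [f for f in wanted if f in available]
    skipped = [f for f in wanted if f not in available]
    rows = [{f: r.get(f) for f in kept} for r in records]
    return rows, skipped


# Invented record without a second CDR3 chain field.
records = [{"seq_id": "s1", "count": 12, "cdr3_aa_1": "CARDYW"}]

rows, skipped = subset_fields(records, ["seq_id", "cdr3_aa_2", "count"])
print(rows)
print(skipped)
```

The requested order of fields is preserved in the output, and the missing name is reported rather than raising an error.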
How to Download the CSV of the Dataset of Records¶
CSVs are produced by the given Job, or can be downloaded using the ‘…’ to the right of the file in the directory. Be aware that datasets with more than 50K-100K records can take quite some time to download, in which case the CSV produced by the job is a better option, if available. To find it, go to the Floe page > Jobs Tab > Job Name and click the ‘Show non-dataset files’ checkbox.
How to Visualize Custom Experimental Metrics¶
The standard Orion upload does an excellent job of identifying the data type. Alternatively, one can upload an Excel, CSV, or TSV file whose headers correspond to AbXtract fields. A few ‘special’ fields can be used by adding these column names to a downloaded AbXtract dataset (e.g., down.consolidate.csv). Save the file as ‘xlsx’ or ‘xls’ (or CSV or TSV), then upload it into Orion using the Upload File with AbXtract Fields - AbXtract Floe. The special fields that will be recognized by AbXtract are as follows:
Important Note: These field names are case-sensitive:
off_rate
on_rate
KD
integer_field
float_field
string_field
bool_field
This enables direct visualization of select sequence metrics (e.g., charge) against experimental data using the interactive Analyze tool.
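Because the special field names are case-sensitive, it can help to check a file's header row before uploading. The sketch below is a hypothetical pre-upload check, not an AbXtract feature: `check_headers` is an invented helper, and the header list is made up for illustration.

```python
# Special field names exactly as listed in this guide.
SPECIAL_FIELDS = [
    "off_rate",
    "on_rate",
    "KD",
    "integer_field",
    "float_field",
    "string_field",
    "bool_field",
]


def check_headers(headers):
    """Split headers into exact special-field matches and wrong-case near misses."""
    by_lower = {name.lower(): name for name in SPECIAL_FIELDS}
    exact = [h for h in headers if h in SPECIAL_FIELDS]
    miscased = {
        h: by_lower[h.lower()]
        for h in headers
        if h not in SPECIAL_FIELDS and h.lower() in by_lower
    }
    return exact, miscased


exact, miscased = check_headers(["seq_id", "kd", "Off_Rate", "on_rate"])
print(exact)
print(miscased)
```

Any entry in `miscased` maps the header as typed to the spelling given in the list above, so it can be corrected before upload.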
How to upload data from an AIRR-compatible file¶
An AIRR-formatted TSV file can be uploaded using the Upload File with AbXtract Fields - AbXtract Floe by toggling on the ‘Uploaded file is in AIRR-compatible format’ option. In the hidden parameters, you will also find an option for choosing which column to assign to the AbXtract ‘count’ field: duplicate_count or consensus_count. Rows from the input file are stored with the dataset through all subsequent AbXtract processing floes and can be retrieved during export if the Export AIRR Fields for Dataset - AbXtract Floe is used.
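The choice between duplicate_count and consensus_count can be previewed locally before upload. The following is a rough sketch under stated assumptions, not the floe's logic: `read_airr_counts` is a hypothetical helper, the mapping of the AIRR `sequence_id` column to ‘seq_id’ is assumed, and the TSV content is invented.

```python
import csv
import io

# Invented AIRR-style TSV with both candidate count columns.
AIRR_TSV = (
    "sequence_id\tjunction_aa\tduplicate_count\tconsensus_count\n"
    "read1\tCARDYW\t12\t3\n"
    "read2\tCAKGWF\t5\t2\n"
)


def read_airr_counts(handle, count_column="duplicate_count"):
    """Read AIRR rows and copy the chosen column into a 'count' value."""
    if count_column not in ("duplicate_count", "consensus_count"):
        raise ValueError(f"unsupported count column: {count_column}")
    reader = csv.DictReader(handle, delimiter="\t")
    if count_column not in (reader.fieldnames or []):
        raise KeyError(f"{count_column} is missing from the file header")
    rows = []
    for row in reader:
        rows.append(
            {
                "seq_id": row["sequence_id"],  # assumed mapping
                "count": int(row[count_column]),
                "airr": row,  # keep the original row alongside
            }
        )
    return rows


by_duplicate = read_airr_counts(io.StringIO(AIRR_TSV))
by_consensus = read_airr_counts(io.StringIO(AIRR_TSV), "consensus_count")
print([(r["seq_id"], r["count"]) for r in by_duplicate])
print([(r["seq_id"], r["count"]) for r in by_consensus])
```

Comparing the two outputs shows how much the ‘count’ values, and therefore any count-based ranking, depend on which column is chosen.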
How to export any dataset to an AIRR-compatible file¶
Simply export the dataset of interest using the Export AIRR Fields for Dataset - AbXtract Floe. This conversion works on any AbXtract-processed dataset. You also have the option to add the cluster call from the Cluster (AbScan) Antibody Binding Regions - AbXtract or NGS Pipeline - AbXtract Floes to the “clone_id” column of the file.