Protein Sequence to AI Folded Structure Ligand Affinities
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Product-based/SPRUCE
Role-based/Computational Chemist
Role-based/Structural Biologist
Role-based/Bioinformatician
Solution-based/Target Identification
Solution-based/Target Identification/Target Preparation
Solution-based/Hit to Lead/Target Preparation
Task-based/Target Prep & Analysis/Protein Preparation
Description
This floe takes protein sequences as input and predicts protein structures with AI folding models. It supports Boltz2 models for structure prediction. After the protein–ligand structure is predicted, the binding affinity is predicted, and the structures with the lowest predicted affinity values (strongest predicted protein–ligand interactions) are ranked and saved. Boltz is a third-party sequence-to-structure protein folding method that uses a protein language model (pLM) in combination with multiple sequence alignment (MSA) to make accurate single-sequence and multimeric predictions. This floe and its defaults are based on the standard folding practices outlined in the Boltz Documentation.
MSA inputs are optional but recommended. To use an MSA for prediction, provide precomputed MSAs in .a3m format as input, or have them generated at run time by specifying an input FASTA collection.
Potential Input Sources: MSA Align and Search
Limitations: Longer sequences (700+ residues) can be computationally demanding and may require an increase in cube resources.
Promoted Parameters
Title in user interface (promoted name)
Inputs
Ligand Inputs (lig_input): Dataset of molecules with 3D coordinates or SMILES strings to be co-folded with the input protein sequence and ranked based on affinity. These input ligands will be given the LIG chain ID in the protein–ligand system.
Type: data_source
Ligand File Inputs (lig_files): Molecular or CSV file containing ligands to be co-folded with the input protein sequence and ranked based on affinity. These input ligands will be given the LIG chain ID in the protein–ligand system.
Type: file_in
Boltz Model (boltz_model): Boltz model to run for the prediction. Some features are model-specific. Feature information that the selected model does not support will be ignored.
Required
Type: string
Default: 2
Choices: [‘2’, ‘2x’]
Input Sequence (sequence): Sequence title and primary sequence for structure prediction, separated by a colon. Add multiple sequences with the ‘Add more’ input option. The sequence title must be a unique identifier, because other parts of the job form use it to refer to this sequence. Example: ‘2MG4_1:MEKRPRTEFSEEQ’
Type: string
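The ‘title:sequence’ format can be checked before a job is submitted. The following is a minimal sketch, not the floe's own parser: the function names are hypothetical, and the rules that titles are unique and sequences contain only letters are assumptions drawn from the description above.

```python
def parse_sequence_entry(entry: str) -> tuple[str, str]:
    """Split a 'title:sequence' entry at the first colon."""
    title, sep, sequence = entry.partition(":")
    title, sequence = title.strip(), sequence.strip().upper()
    if not sep or not title or not sequence:
        raise ValueError(f"expected 'title:sequence', got {entry!r}")
    # Assumption: a primary sequence is written with one-letter codes only.
    if not sequence.isalpha():
        raise ValueError(f"sequence for {title!r} contains non-letter characters")
    return title, sequence


def parse_sequence_entries(entries: list[str]) -> dict[str, str]:
    """Collect several entries and reject duplicate sequence titles."""
    result: dict[str, str] = {}
    for entry in entries:
        title, sequence = parse_sequence_entry(entry)
        if title in result:
            raise ValueError(f"duplicate sequence title: {title!r}")
        result[title] = sequence
    return result
```

Calling parse_sequence_entry on the example entry ‘2MG4_1:MEKRPRTEFSEEQ’ yields the title 2MG4_1 and the sequence MEKRPRTEFSEEQ.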
Input Sequence FASTA File (sequence_fasta_file): Input FASTA file containing the system for folding. Multiple sequences in this file indicate a multimeric prediction. The sequence title is a unique identifier for each input sequence and is taken from the FASTA file: it is the text between the FASTA title marker (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1. Any field on this floe job form that requires a sequence title will match against the 2MG4_1 value.
Type: file_in
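The title rule described for FASTA input (the text between ‘>’ and the first ‘|’) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the handling of headers without a pipe (the whole header becomes the title) is an assumption.

```python
def sequence_title(header: str) -> str:
    """Return the text between '>' and the first '|' of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    # Assumption: with no '|' present, the whole header is the title.
    return header[1:].split("|", 1)[0].strip()


def read_fasta(fasta_text: str) -> dict[str, str]:
    """Map each sequence title to its sequence, joining wrapped lines."""
    records: dict[str, list[str]] = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = sequence_title(line)
            if current in records:
                raise ValueError(f"duplicate sequence title: {current!r}")
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {title: "".join(parts) for title, parts in records.items()}
```

A file with more than one record produces more than one key, which corresponds to a multimeric prediction.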
System Name (system_name): Name to be used to identify the input sequence. Examples include PDB codes or UniRef sequence IDs. If multiple systems are detected, this value will be used to prefix those systems.
Required
Type: string
Default: Boltz_Prediction
Chain Identifiers (sequence_chainid): Assigns chain IDs to each input sequence and identifies multimeric folding schemes. Provide a unique entry for every sequence in your input FASTA file, using the format unique_sequence_identifier: cid1, cid2. Match sequences by their sequence titles (see Input Sequence and Input Sequence FASTA File). Every chain input must be paired with one or more chain IDs; use comma-separated chain IDs to indicate multiple instances of that chain for folding. For example, a homodimer of the example in Input Sequence would follow the format: ‘2MG4_1:A,B’. If left blank, each sequence is labeled in alphabetical chain ID order and only one instance of each sequence is passed for AI folding.
Type: string
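The chain ID rules can be illustrated with a short sketch. It is not the floe's implementation: the function names are hypothetical, and the fallback draws only from the uppercase letters A to Z as a simplification of the default labeling.

```python
from string import ascii_uppercase


def parse_chain_entry(entry: str) -> tuple[str, list[str]]:
    """Parse 'title: cid1, cid2' into (title, [cid1, cid2])."""
    title, sep, ids = entry.partition(":")
    chain_ids = [c.strip() for c in ids.split(",") if c.strip()]
    if not sep or not title.strip() or not chain_ids:
        raise ValueError(f"expected 'title: cid1, cid2', got {entry!r}")
    return title.strip(), chain_ids


def assign_chain_ids(titles: list[str], entries: list[str]) -> dict[str, list[str]]:
    """Return chain IDs per sequence title, filling gaps with unused letters."""
    assigned = dict(parse_chain_entry(e) for e in entries)
    unknown = set(assigned) - set(titles)
    if unknown:
        raise ValueError(f"no input sequence titled: {sorted(unknown)}")
    all_ids = [c for ids in assigned.values() for c in ids]
    if len(all_ids) != len(set(all_ids)):
        raise ValueError("each chain ID may be used only once")
    # Sequences without an entry get one copy and the next unused letter.
    free = (c for c in ascii_uppercase if c not in set(all_ids))
    return {t: assigned.get(t) or [next(free)] for t in titles}
```

With the titles 2MG4_1 and X and the single entry ‘2MG4_1:A,B’, the sketch keeps A and B for the homodimer and labels X with C.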
Enforce Chain ID (enforce_chainid): Enforce proper formatting of the Chain Identifiers field. If turned off, any sequence in the input without a chain ID will use the next available alphanumeric chain ID (starting at ‘A’) and will be folded with only one copy of that sequence. Turning this on is recommended when defining system constraints.
Required
Type: boolean
Default: False
Choices: [True, False]
Method Conditioning (method_cond): Provides experimental context for a structure prediction and biases the prediction toward conformations that resemble those produced by the indicated method.
Type: string
Default: None
Choices: [‘None’, ‘MD Simulations’, ‘X-ray Diffraction’, ‘Electron Microscopy’, ‘Solution NMR’, ‘Solid-state NMR’, ‘Neutron Diffraction’, ‘Electron Crystallography’, ‘Fiber Diffraction’, ‘Powder Diffraction’, ‘Infrared Spectroscopy’, ‘Fluorescence Transfer’, ‘Electron Paramagnetic Resonance’, ‘Theoretical Model’, ‘Solution Scattering’, ‘AlphaFold Databases’]
Hit List Size (hit_list_size): The desired size of the hit list.
Required
Type: integer
Default: 100
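As described for this floe, structures with the lowest predicted affinity values are ranked and saved. A minimal sketch of that selection follows, assuming each prediction is a dictionary with a hypothetical "affinity" key and that the hit list size caps the number of ranked results.

```python
def top_hits(predictions: list[dict], hit_list_size: int = 100) -> list[dict]:
    """Return up to hit_list_size predictions, lowest predicted affinity first."""
    if hit_list_size < 1:
        raise ValueError("hit_list_size must be a positive integer")
    # "affinity" is a hypothetical field name used only for this sketch.
    ranked = sorted(predictions, key=lambda record: record["affinity"])
    return ranked[:hit_list_size]
```

Because sorted is stable, predictions with equal affinity values keep their input order.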
Boltz Constraints and Properties
Pocket Constraint (pocket_constraint): Pocket constraints define important residues of the binding pocket when predicting a structure with a co-folded ligand. The constraint is defined relative to the binder (i.e. the ligand) that interacts with the pocket. Define the binder by its chain ID, and separate all of the following sub-fields with colons. Pocket residues are required and are each given in brackets as a chain ID and residue index (starting at 1). The optional maximum distance (in angstroms) specifies the longest allowed distance between the binder and a pocket residue, in the range 4 to 20 (default=6.0). An optional true/false force value applies a potential that enforces the constraint (default=false). Example: ‘E:[A,192][B,47][A,111]:8.2:true’
Type: string
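The pocket constraint string can be validated before submission. The sketch below is a hypothetical parser written from the format description above (binder, bracketed residues, optional distance, optional force); the class and function names are not part of the floe.

```python
import re
from dataclasses import dataclass


@dataclass
class PocketConstraint:
    binder: str
    contacts: list
    max_distance: float = 6.0
    force: bool = False


# One bracketed residue: [chain_id, residue_index]
_RESIDUE = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*\]")


def parse_pocket_constraint(text: str) -> PocketConstraint:
    """Parse 'binder:[chain,idx]...[:max_distance[:force]]'."""
    fields = [f.strip() for f in text.split(":")]
    if not 2 <= len(fields) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated fields: {text!r}")
    binder, residues = fields[0], fields[1]
    contacts = [(chain, int(idx)) for chain, idx in _RESIDUE.findall(residues)]
    # Reject an empty binder, no residues, or stray text between brackets.
    if not binder or not contacts or _RESIDUE.sub("", residues).strip():
        raise ValueError(f"malformed pocket residues: {residues!r}")
    if any(idx < 1 for _, idx in contacts):
        raise ValueError("residue indices start at 1")
    max_distance = float(fields[2]) if len(fields) > 2 else 6.0
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("max distance must be between 4 and 20 angstroms")
    force = False
    if len(fields) > 3:
        if fields[3].lower() not in ("true", "false"):
            raise ValueError("force must be 'true' or 'false'")
        force = fields[3].lower() == "true"
    return PocketConstraint(binder, contacts, max_distance, force)
```

Omitting the last two sub-fields, as in ‘E:[A,192]’, falls back to the documented defaults of 6.0 angstroms and no enforcing potential.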
Bond Constraint (bond_constraint): Bond constraints define important covalent bonds to be made during structure prediction. Bonded atoms are defined in brackets with a chain ID, residue index (starting at 1) and atom name. Only two sets of brackets are accepted. Because atom names are required, only canonical residues and CCD ligands can be used. Example: ‘[A,16,SG][A,74,SG]’
Type: string
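A bond constraint names exactly two atoms, each as a chain ID, residue index and atom name. The following hypothetical parser sketches that check; the function name and the rejection of a bond from an atom to itself are assumptions, not documented floe behavior.

```python
import re

# One bracketed atom: [chain_id, residue_index, atom_name]
_ATOM = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*,\s*([^,\[\]\s]+)\s*\]")


def parse_bond_constraint(text: str) -> tuple:
    """Parse '[chain,idx,atom][chain,idx,atom]' into two atom tuples."""
    atoms = [(chain, int(idx), name) for chain, idx, name in _ATOM.findall(text)]
    # Exactly two bracket groups, with nothing left over around them.
    if len(atoms) != 2 or _ATOM.sub("", text).strip():
        raise ValueError(f"expected exactly two bracketed atoms: {text!r}")
    if any(idx < 1 for _, idx, _ in atoms):
        raise ValueError("residue indices start at 1")
    if atoms[0] == atoms[1]:
        raise ValueError("a bond needs two different atoms")
    return atoms[0], atoms[1]
```

The documented example ‘[A,16,SG][A,74,SG]’ parses to the two atoms (A, 16, SG) and (A, 74, SG), which would describe a disulfide bond if both residues are cysteines.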
Contact Constraint (contact_constraint): Contact constraints define important residues that should be in proximity to each other during structure prediction. Proximal residues are each given in brackets as a chain ID and residue index (starting at 1). Only two sets of brackets are accepted. The optional maximum distance (in angstroms) specifies the longest allowed distance between the two residues, in the range 4 to 20 (default=6.0). An optional true/false force value applies a potential that enforces the constraint (default=false). Example: ‘[A,192][B,47]:7.2:true’
Type: string
Outputs
Boltz Mol Results (boltz_out): Output dataset to which to write.
Required
Type: dataset_out
Default: Boltz_BiologicalUnit_Predictions
Save Boltz input (save_yaml_to_orion): The Boltz input .yaml file stores the input sequence and constraint information. If set to true, this input file is saved to Orion and named using the System Name input value.
Required
Type: boolean
Default: False
Choices: [True, False]
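To give a sense of what such an input file holds, the sketch below assembles sequence, ligand, constraint and affinity information into a plain dictionary. The key names (version, sequences, protein, ligand, constraints, pocket, properties, affinity) are assumptions modelled on the publicly documented Boltz input schema and should be checked against the Boltz Documentation; the file the floe actually writes may differ. The function name is hypothetical.

```python
def build_boltz_input(sequences, chain_ids, ligand_smiles, pocket_residues=None):
    """Assemble a Boltz-style input document as a plain dict.

    sequences:       {sequence_title: primary_sequence}
    chain_ids:       {sequence_title: [chain_id, ...]}
    ligand_smiles:   SMILES string of the ligand to co-fold
    pocket_residues: optional list of (chain_id, residue_index) pairs
    """
    document = {"version": 1, "sequences": []}
    for title, sequence in sequences.items():
        document["sequences"].append(
            {"protein": {"id": list(chain_ids[title]), "sequence": sequence}}
        )
    # Input ligands are given the LIG chain ID in the protein-ligand system.
    document["sequences"].append({"ligand": {"id": "LIG", "smiles": ligand_smiles}})
    if pocket_residues:
        document["constraints"] = [
            {
                "pocket": {
                    "binder": "LIG",
                    "contacts": [[chain, idx] for chain, idx in pocket_residues],
                    "max_distance": 6.0,  # documented default, in angstroms
                    "force": False,       # documented default
                }
            }
        ]
    # Assumed key layout for requesting an affinity prediction for the ligand.
    document["properties"] = [{"affinity": {"binder": "LIG"}}]
    return document
```

Serializing the dictionary to .yaml would need a YAML library, which this sketch leaves out to stay within the standard library.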
Boltz Failures (fout): Output dataset to which to write.
Required
Type: dataset_out
Default: Boltz_Failures
MSA Search Options
Input MSA Collection (msa_shards): Collection with FASTA files used for MSA search. If a collection is not provided, an MSA search will not be run.
Type: collection_source
Precomputed MSA File (msa_a3m_file): Input MSA files must be in a3m format. The query sequence titles in this file must match a sequence title in the query input or FASTA file. The sequence title is a unique identifier for each input sequence and is taken from the FASTA or a3m file: it is the text between the FASTA title marker (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1.
Type: file_in
Save MSA File (save_msa_file): Save MSA results to Orion in the a3m file format. Each query will generate its own file.
Required
Type: boolean
Default: False
Choices: [True, False]
Output MSA Name (msa_out): Name for the MSA file that will be exported.
Required
Type: string
Default: MSA_Result_File
Max Sequences Cutoff (seq_num_cutoff): Maximum number of results to save in an MSA search result per query sequence. Increasing this value increases sensitivity. A value of 0 saves all sequences from the prefilter steps.
Required
Type: integer
Default: 500
MSA Search Sensitivity (msa_search_sensitivity): Sensitivity of the MMseqs2 sequence search. The default value is the MMseqs2 default.
Required
Type: decimal
Default: 5.7
Sequence Identity Cutoff (seq_id_cutoff): Reject any search hit with a sequence identity score lower than this value. Setting the value to 0 accepts all search hits. Valid range: [0, 0.93].
Type: decimal
Default: 0.0
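The combined effect of the Sequence Identity Cutoff and Max Sequences Cutoff parameters can be sketched as a simple filter. This only illustrates the parameter semantics described above and is not how MMseqs2 applies them internally; the hit representation (name, identity) and the order of the two steps are assumptions.

```python
def filter_hits(hits, seq_id_cutoff=0.0, seq_num_cutoff=500):
    """Keep hits at or above the identity cutoff, then cap the count.

    hits: list of (name, identity) pairs, identity given as a fraction 0..1
    """
    if not 0.0 <= seq_id_cutoff <= 0.93:
        raise ValueError("seq_id_cutoff must be within [0, 0.93]")
    if seq_num_cutoff < 0:
        raise ValueError("seq_num_cutoff must not be negative")
    kept = [hit for hit in hits if hit[1] >= seq_id_cutoff]
    # A cutoff of 0 means no cap on the number of saved sequences.
    return kept if seq_num_cutoff == 0 else kept[:seq_num_cutoff]
```

With the default seq_id_cutoff of 0.0 every hit passes the identity filter, so only the count cap applies.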