Protein Sequence to AI Folded Structure Prediction
Category Paths
Follow one of these paths in the Orion user interface to find the floe.
Product-based/SPRUCE
Role-based/Computational Chemist
Role-based/Structural Biologist
Role-based/Bioinformatician
Solution-based/Target Identification
Solution-based/Target Identification/Target Preparation
Solution-based/Hit to Lead/Target Preparation
Task-based/Target Prep & Analysis/Protein Preparation
Description
This floe predicts protein structures from input protein sequences using AI folding models. It supports the Boltz1/2 models for structure prediction. Boltz is a third-party sequence-to-structure protein folding method that uses a protein language model (pLM) in combination with multiple sequence alignment (MSA) to make accurate single-sequence and multimeric predictions. The floe and its defaults follow the standard folding practices outlined in the Boltz documentation (to avoid leaving this page, right-click the link and open it in a new tab).
MSA inputs are optional but recommended. To use an MSA for prediction, provide your own MSAs in .a3m format as input, or have them generated at run time by specifying an input FASTA collection.
Potential Input Sources: MSA Align and Search
Limitations: Longer sequences (700+ residues) are computationally demanding and may require an increase in cube resources.
Promoted Parameters
Title in user interface (promoted name)
Inputs
Boltz Model (boltz_model): Boltz model to run for the prediction. Some features are model-specific; feature settings that the selected model does not support are ignored.
Required
Type: string
Default: 2
Choices: [‘1’, ‘2’, ‘1x’, ‘2x’]
Input Sequence (sequence): Sequence title and primary sequence for structure prediction, separated by a colon. Add multiple sequences with the ‘Add more’ input option. The sequence title must be a unique identifier, because other parts of the job form use it to refer to the sequence. Example: ‘2MG4_1:MEKRPRTEFSEEQ’
Type: string
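The ‘title:sequence’ format described above can be illustrated with a short parser. This is a minimal sketch, not the floe's own code: the function names are hypothetical, and the upper-casing and duplicate check are assumptions about reasonable validation.

```python
def parse_sequence_entry(entry: str) -> tuple[str, str]:
    """Split a 'title:sequence' entry into its title and primary sequence."""
    title, sep, sequence = entry.partition(":")
    title, sequence = title.strip(), sequence.strip().upper()
    if not sep or not title or not sequence:
        raise ValueError(f"expected 'title:sequence', got {entry!r}")
    return title, sequence


def parse_sequence_entries(entries: list[str]) -> dict[str, str]:
    """Parse several entries, rejecting duplicate titles (assumed rule)."""
    parsed: dict[str, str] = {}
    for entry in entries:
        title, sequence = parse_sequence_entry(entry)
        if title in parsed:
            raise ValueError(f"duplicate sequence title: {title}")
        parsed[title] = sequence
    return parsed


print(parse_sequence_entries(["2MG4_1:MEKRPRTEFSEEQ"]))
```

Keeping titles as dictionary keys makes it easy to look up a sequence when another field refers to it by title.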
Input Sequence FASTA File (sequence_fasta_file): Input FASTA file containing the system to fold. Multiple sequences in this file indicate a multimeric prediction. The sequence title is a unique identifier for each input sequence and is taken from the FASTA header: it is the text between the header marker (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA header ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1. Any field on this floe job form that requires a sequence title will match against the 2MG4_1 value.
Type: file_in
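The title rule above (text between ‘>’ and the first ‘|’) can be sketched as follows. The helper names are hypothetical and the reader is deliberately minimal; it is an illustration of the rule, not the floe's parser.

```python
def sequence_title(header: str) -> str:
    """Return the text between '>' and the first '|' of a FASTA header."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA header: {header!r}")
    return header[1:].split("|", 1)[0].strip()


def read_fasta(text: str) -> dict[str, str]:
    """Map each sequence title to its sequence; titles must be unique."""
    records: dict[str, str] = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = sequence_title(line)
            if title in records:
                raise ValueError(f"duplicate sequence title: {title}")
            records[title] = ""
        elif title is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[title] += line  # sequences may wrap across lines
    return records


example = ">2MG4_1|Drosophila melanogaster\nMEKRPR\nTEFSEEQ\n"
print(read_fasta(example))
```

A file that yields more than one record would correspond to a multimeric prediction.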
System Name (system_name): Name to be used to identify the input sequence. Examples include PDB codes or UniRef sequence IDs. If multiple systems are detected, this value will be used to prefix those systems.
Required
Type: string
Default: Boltz_Prediction
Chain Identifiers (sequence_chainid): Chain identifiers assign chain IDs to each sequence and are used to define multimeric folding schemes. Provide a unique entry for every sequence in your input FASTA file. Format = unique_sequence_identifier: cid1, cid2. Match sequences using sequence titles (see Input Sequence and Input Sequence FASTA File). Every listed sequence must be paired with at least one chain ID; use comma-separated chain IDs to indicate multiple instances of that sequence for folding. For example, a homodimer of the example in Input Sequence would follow the format: ‘2MG4_1:A,B’. If left blank, each sequence is labeled in alphabetical chain ID order and only one instance of each sequence is passed for AI folding.
Type: string
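The pairing of sequence titles with chain IDs, including the alphabetical fallback for sequences without an entry, might look like the sketch below. The function name and the exact fallback order are assumptions made for illustration.

```python
from string import ascii_uppercase


def assign_chain_ids(titles: list[str], entries: list[str]) -> dict[str, list[str]]:
    """Map each sequence title to its chain IDs.

    entries are 'title:cid1,cid2' strings. Titles with no entry receive the
    next unused letter, starting at 'A' (assumed fallback behaviour).
    """
    assigned: dict[str, list[str]] = {}
    used: set[str] = set()
    for entry in entries:
        title, sep, ids = entry.partition(":")
        title = title.strip()
        chain_ids = [cid.strip() for cid in ids.split(",") if cid.strip()]
        if not sep or not chain_ids:
            raise ValueError(f"expected 'title:cid1,cid2', got {entry!r}")
        if title not in titles:
            raise ValueError(f"unknown sequence title: {title}")
        if used.intersection(chain_ids) or len(set(chain_ids)) != len(chain_ids):
            raise ValueError(f"chain IDs must be unique: {entry!r}")
        assigned[title] = chain_ids
        used.update(chain_ids)
    free = (letter for letter in ascii_uppercase if letter not in used)
    for title in titles:
        if title not in assigned:
            assigned[title] = [next(free)]  # single copy of the sequence
    return assigned


print(assign_chain_ids(["2MG4_1"], ["2MG4_1:A,B"]))  # homodimer
```

Two chain IDs for one title describe two copies of the same sequence, which is how a homodimer is expressed.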
Enforce Chain ID (enforce_chainid): Enforce proper formatting of the Chain Identifiers field. If turned off, any input sequence without a chain ID is given the next available alphanumeric chain ID (starting at ‘A’) and is assumed to fold as a single copy of that sequence. Turning this on is recommended when defining system constraints.
Required
Type: boolean
Default: False
Choices: [True, False]
Ligands Co-folders (ligands): Ligands to be co-folded, given as the ligand identifier and its chain ID separated by a colon. Ligand identifiers can be 3-letter CCD molecules or SMILES strings. Add multiple ligands with the ‘Add more’ input option. Indicate multiple instances of the same ligand by providing comma-separated chain IDs for the same ligand identifier. Each ligand should have its own chain ID, unique from all other sequences and ligands in the system. For example, multiple instances of the same ligand would follow the format: ‘Cc1ccc(O)cc1:E,F’.
Type: string
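A ligand entry can be split into its identifier and chain IDs as sketched below. The function name is hypothetical. Splitting on the last colon is a design choice of this sketch: SMILES strings may themselves contain colons (aromatic bonds, atom-map classes), whereas the chain ID list never does.

```python
def parse_ligand_entry(entry: str) -> tuple[str, list[str]]:
    """Split 'identifier:cid1,cid2' into the identifier and its chain IDs.

    The identifier is returned unchanged; deciding whether it is a CCD code
    or a SMILES string is left to the caller.
    """
    identifier, sep, ids = entry.rpartition(":")
    identifier = identifier.strip()
    chain_ids = [cid.strip() for cid in ids.split(",") if cid.strip()]
    if not sep or not identifier or not chain_ids:
        raise ValueError(f"expected 'ligand:cid1,cid2', got {entry!r}")
    if len(set(chain_ids)) != len(chain_ids):
        raise ValueError(f"repeated chain ID in {entry!r}")
    return identifier, chain_ids


print(parse_ligand_entry("Cc1ccc(O)cc1:E,F"))
```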
Method Conditioning (method_cond): Provides experimental context for a structure prediction, biasing it toward conformations that resemble those produced by the indicated method.
Type: string
Default: None
Choices: [‘None’, ‘MD Simulations’, ‘X-ray Diffraction’, ‘Electron Microscopy’, ‘Solution NMR’, ‘Solid-state NMR’, ‘Neutron Diffraction’, ‘Electron Crystallography’, ‘Fiber Diffraction’, ‘Powder Diffraction’, ‘Infrared Spectroscopy’, ‘Fluorescence Transfer’, ‘Electron Paramagnetic Resonance’, ‘Theoretical Model’, ‘Solution Scattering’, ‘AlphaFold Databases’]
Boltz Constraints and Properties
Pocket Constraint (pocket_constraint): Pocket constraints define important residues of the binding pocket when doing structure prediction with a ligand co-folder. The constraint is defined by the binder (i.e. ligand) interacting with the pocket. Define the binder with its chain ID, and separate all of the following sub-fields with a colon. Pocket residues are required and are defined in brackets with a chain ID and residue index (starting at 1). The optional maximum distance (in angstroms) specifies the longest distance between the binder and a pocket residue, with a range between 4 and 20 (default=6.0). An optional true/false force value will apply a potential enforcing the constraint (default=false). Example: ‘E:[A,192][B,47][A,111]:8.2:true’
Type: string
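The pocket constraint grammar described above (binder, bracketed residues, optional distance, optional force flag) can be sketched as a small parser. The class and function names are hypothetical, and the validation mirrors only the rules stated in the description.

```python
import re
from dataclasses import dataclass

_RESIDUE = re.compile(r"\[\s*(\w+)\s*,\s*(\d+)\s*\]")


@dataclass
class PocketConstraint:
    binder: str
    contacts: list
    max_distance: float = 6.0
    force: bool = False


def parse_pocket_constraint(text: str) -> PocketConstraint:
    """Parse 'binder:[chain,index]...[:max_distance[:force]]'."""
    fields = [field.strip() for field in text.split(":")]
    if not 2 <= len(fields) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated fields: {text!r}")
    binder, residues = fields[0], fields[1]
    contacts = [(chain, int(index)) for chain, index in _RESIDUE.findall(residues)]
    # Anything left after removing well-formed brackets is a syntax error.
    if not binder or not contacts or _RESIDUE.sub("", residues).strip():
        raise ValueError(f"malformed binder or residue list: {text!r}")
    if any(index < 1 for _, index in contacts):
        raise ValueError("residue indices start at 1")
    max_distance = float(fields[2]) if len(fields) > 2 and fields[2] else 6.0
    if not 4 <= max_distance <= 20:
        raise ValueError("maximum distance must be between 4 and 20 angstroms")
    force = False
    if len(fields) > 3:
        if fields[3].lower() not in ("true", "false"):
            raise ValueError("force must be 'true' or 'false'")
        force = fields[3].lower() == "true"
    return PocketConstraint(binder, contacts, max_distance, force)


print(parse_pocket_constraint("E:[A,192][B,47][A,111]:8.2:true"))
```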
Bond Constraint (bond_constraint): Bond constraints define important covalent bonds to be made during structure prediction. Bonded atoms are defined in brackets with a chain ID, residue index (starting at 1) and atom name. Only two sets of brackets are accepted. Because atom names are required, only canonical residues and CCD ligands can be used. Example: ‘[A,16,SG][A,74,SG]’
Type: string
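A bond constraint names exactly two atoms, each as chain ID, residue index, and atom name. The following is a minimal sketch of a parser for that format; the function name is hypothetical, and allowing a prime or asterisk in atom names is an assumption.

```python
import re

# Chain ID, 1-based residue index, atom name (for example SG).
_ATOM = re.compile(r"\[\s*(\w+)\s*,\s*(\d+)\s*,\s*([\w'*]+)\s*\]")


def parse_bond_constraint(text: str) -> tuple:
    """Parse '[chain,index,atom][chain,index,atom]' into two atom tuples."""
    atoms = _ATOM.findall(text)
    if len(atoms) != 2 or _ATOM.sub("", text).strip():
        raise ValueError(f"expected exactly two bracketed atoms: {text!r}")
    parsed = tuple((chain, int(index), name) for chain, index, name in atoms)
    if any(index < 1 for _, index, _ in parsed):
        raise ValueError("residue indices start at 1")
    return parsed


# A disulfide between two cysteine SG atoms on chain A.
print(parse_bond_constraint("[A,16,SG][A,74,SG]"))
```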
Contact Constraint (contact_constraint): Contact constraints define important residues that should be in proximity to each other during structure prediction. Proximal residues are defined in brackets with a chain ID and residue index (starting at 1). Only two sets of brackets are accepted. The optional maximum distance (in angstroms) specifies the longest distance between the two residues, with a range between 4 and 20 (default=6.0). An optional true/false force value will apply a potential enforcing the constraint (default=false). Example: ‘[A,192][B,47]:7.2:true’
Type: string
Affinity Calculation (affinity_property): Boltz can predict the affinity of a ligand during structure prediction. Only one ligand affinity prediction can be run. Populate this field with the chain ID of the chosen ligand from the Ligands Co-folders field.
Type: string
Outputs
Boltz Mol Results (boltz_out): Output dataset to which to write.
Required
Type: dataset_out
Default: Boltz_BiologicalUnit_Predictions
Run Spruce (run_spruce): Option to run Spruce preparation on the Boltz predictions. Whether true or false, the raw Boltz structure prediction will be saved to a biological unit.
Required
Type: boolean
Default: True
Choices: [True, False]
Boltz Results (out): Output dataset to which to write.
Required
Type: dataset_out
Default: Boltz_Predictions
Save Boltz input (save_yaml_to_orion): The Boltz input .yaml file stores the input sequence and constraint information. If set to true, this input file is saved to Orion and named using the System Name input value.
Required
Type: boolean
Default: False
Choices: [True, False]
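To give a sense of what such an input file holds, the sketch below assembles a Boltz-style document as a plain dictionary. The key names (version, sequences, protein, ligand, id, smiles, properties, affinity, binder) are modeled on the publicly documented Boltz YAML schema and should be checked against the Boltz documentation; the function name is hypothetical, and how the floe actually builds its file is not specified here. The result is printed as JSON only to keep the example free of third-party dependencies.

```python
import json


def build_boltz_input(sequences, chain_ids, ligands=None, affinity_binder=None):
    """Assemble a Boltz-style input document as a plain dict.

    sequences: {title: primary sequence}
    chain_ids: {title: [chain IDs]}
    ligands: {SMILES string: [chain IDs]}
    affinity_binder: chain ID of the single ligand to score, or None
    """
    ligands = ligands or {}
    entries = [
        {"protein": {"id": chain_ids[title], "sequence": sequence}}
        for title, sequence in sequences.items()
    ]
    entries += [
        {"ligand": {"id": ids, "smiles": smiles}} for smiles, ids in ligands.items()
    ]
    document = {"version": 1, "sequences": entries}
    if affinity_binder is not None:
        ligand_chains = {cid for ids in ligands.values() for cid in ids}
        if affinity_binder not in ligand_chains:
            raise ValueError(f"{affinity_binder!r} is not a ligand chain ID")
        document["properties"] = [{"affinity": {"binder": affinity_binder}}]
    return document


document = build_boltz_input(
    {"2MG4_1": "MEKRPRTEFSEEQ"},
    {"2MG4_1": ["A", "B"]},
    ligands={"Cc1ccc(O)cc1": ["E"]},
    affinity_binder="E",
)
print(json.dumps(document, indent=2))
```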
Boltz Failures (fout): Output dataset to which to write.
Required
Type: dataset_out
Default: Boltz_Failures
MSA Search Options
Input MSA Collection (msa_shards): Collection with FASTA files used for MSA search. If a collection is not provided, an MSA search will not be run.
Type: collection_source
Precomputed MSA File (msa_a3m_file): Input MSA files should be in a3m format. The query sequence titles in this file must match a sequence title in the query input or FASTA file. The sequence title is a unique identifier for input sequences and is taken from the header inside the FASTA or a3m file: it is the text between the header marker (‘>’) and the first pipe delimiter (‘|’). For example, the header ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1.
Type: file_in
Save MSA File (save_msa_file): Save MSA results to Orion in the a3m file format. Each query will generate its own file.
Required
Type: boolean
Default: False
Choices: [True, False]
Output MSA Name (msa_out): Name for the MSA file that will be exported.
Required
Type: string
Default: MSA_Result_File
Max Sequences Cutoff (seq_num_cutoff): Maximum number of results to be saved in an MSA search result per query sequence. Increasing this value will increase sensitivity. A value of 0 will save all sequences from the prefilter steps.
Required
Type: integer
Default: 500
MSA Search Sensitivity (msa_search_sensitivity): Sensitivity of the MMseqs2 sequence search. The default value is the MMseqs2 default.
Required
Type: decimal
Default: 5.7
Sequence Identity Cutoff (seq_id_cutoff): Reject any search hits with a sequence identity score lower than this value. Setting the value to 0 will accept all search hits. Allowed range: [0, 0.93].
Type: decimal
Default: 0.0
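The combined effect of the identity cutoff and the maximum sequence count can be pictured with the sketch below. It is an illustration only: the function name is hypothetical, hits are assumed to arrive as (title, identity) pairs in search rank order, and applying the identity filter before the count limit is an assumption rather than documented floe behaviour.

```python
def filter_hits(hits, seq_id_cutoff=0.0, seq_num_cutoff=500):
    """Keep hits at or above the identity cutoff, then cap their number.

    hits: list of (title, identity) pairs, identity as a fraction in [0, 1].
    A seq_num_cutoff of 0 keeps every hit that passes the identity filter.
    """
    if not 0.0 <= seq_id_cutoff <= 0.93:
        raise ValueError("seq_id_cutoff must be within [0, 0.93]")
    if seq_num_cutoff < 0:
        raise ValueError("seq_num_cutoff must not be negative")
    kept = [hit for hit in hits if hit[1] >= seq_id_cutoff]
    return kept if seq_num_cutoff == 0 else kept[:seq_num_cutoff]


hits = [("hit_1", 0.90), ("hit_2", 0.20), ("hit_3", 0.50)]
print(filter_hits(hits, seq_id_cutoff=0.3, seq_num_cutoff=2))
```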
Structure Post-Prediction Options
Use OEPocket Finding (run_oepocket): Option to use OEPocket to generate design units.
Required
Type: boolean
Default: True
Choices: [True, False]
Use F-Pocket Finding (run_fpocket): Option to use f-pocket to generate design units.
Required
Type: boolean
Default: False
Choices: [True, False]
Save Biological Units (output_bio_designunits): If no pocket-finding method is used or no valid pockets are found, this option saves the structure as a biological unit.
Required
Type: boolean
Default: True
Choices: [True, False]
Reference DU (ref_du): Dataset containing the reference DU for superposition.
Type: data_source
Superposition method (superpos_method): Superposition method.
Required
Type: string
Default: SiteSequence
Choices: [‘GlobalSequence’, ‘SiteSequence’, ‘DDMatrix’, ‘SSE’, ‘SiteHopper’]
Unliganded Structure Parameters
Enumerate Pockets (enum_pocket): Option to enumerate pockets when no ligand is found.
Required
Type: boolean
Default: True
Choices: [True, False]
Site Residue Entry (site_residue): Single site residue specification for APO structures. Format: ‘RESNAME:RESNUM:ICODE:CHAINID[:FRAGNO:ALTLOC]’, e.g. ‘ALA:325: :A’ (note the blank/whitespace insertion code). The regex ‘.*’ notation can be used as a wildcard.
Type: string
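The site residue format can be split into named fields as sketched below. The class and function names are hypothetical. Fields are deliberately not stripped, because a blank insertion code is meaningful. The matcher treats only a field that is exactly ‘.*’ as a wildcard, which is a simplifying assumption about how the wildcard is interpreted.

```python
from typing import NamedTuple, Optional


class SiteResidue(NamedTuple):
    resname: str
    resnum: str  # kept as text so that '.*' can be used here too
    icode: str
    chain_id: str
    fragno: Optional[str] = None
    altloc: Optional[str] = None


def parse_site_residue(text: str) -> SiteResidue:
    """Parse 'RESNAME:RESNUM:ICODE:CHAINID[:FRAGNO:ALTLOC]'."""
    fields = text.split(":")
    if len(fields) not in (4, 6):
        raise ValueError(f"expected 4 or 6 colon-separated fields: {text!r}")
    return SiteResidue(*fields)


def matches(spec: SiteResidue, residue: SiteResidue) -> bool:
    """Compare the four required fields, honouring the '.*' wildcard."""
    return all(s == ".*" or s == r for s, r in zip(spec[:4], residue[:4]))


spec = parse_site_residue(".*:325: :A")
print(matches(spec, parse_site_residue("ALA:325: :A")))
```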
General Spruce Parameters
Add Interaction Hints (add_interactions): Option to add interactions to the design units.
Required
Type: boolean
Default: True
Choices: [True, False]
Add Style (add_style): Option to add style to the design units.
Required
Type: boolean
Default: True
Choices: [True, False]
Allow Cap Residue Truncation (allow_truncate): Option to allow a terminal residue to be converted to a cap if the cap would otherwise clash.
Required
Type: boolean
Default: True
Choices: [True, False]
Alternate Location Handling Method (altloc): Option to pick method of handling alternate locations.
Required
Type: string
Default: Default
Choices: [‘Primary’, ‘Enumerate’, ‘Default’]
Build C-Terminal Caps (build_cterm_caps): Option to cap broken C-termini in protein chains.
Required
Type: boolean
Default: True
Choices: [True, False]
Build N-Terminal Caps (build_nterm_caps): Option to cap broken N-termini in protein chains.
Required
Type: boolean
Default: True
Choices: [True, False]
Build Partial Sidechains (build_sidechains): Option to build missing or partial protein sidechains.
Required
Type: boolean
Default: True
Choices: [True, False]
Assign Charges and Radii (charge_radii): Option to assign partial charge and radii.
Required
Type: boolean
Default: True
Choices: [True, False]
Add Cofactor Code(s) (cofactor_codes): Add uncommon, or custom, cofactor 3-letter codes.
Type: string
Collapse Non-Site Alternates (collapse_nonsite_alts): Option to deduplicate structures with different alts, if the alt locations are not near the binding site.
Required
Type: boolean
Default: True
Choices: [True, False]
Delete Clashing Solvent (delete_clashing_solvent): Option to allow build steps to remove clashing solvent.
Required
Type: boolean
Default: True
Choices: [True, False]
Duplicate Removal (duplicate_removal): Option to deduplicate identical structures resulting from symmetry operation.
Required
Type: boolean
Default: True
Choices: [True, False]
Enumerate Cofactor Sites (enum_cofactors_sites): Option to generate individual design units based on the recognized cofactors.
Required
Type: boolean
Default: False
Choices: [True, False]
Add Excipient Code(s) (excipient_codes): Add uncommon, or custom, excipient 3-letter codes.
Type: string
Fix Backbone Atom Issues (fix_backbone): Option to fix backbone atom issues in protein chains.
Required
Type: boolean
Default: True
Choices: [True, False]
Generate Tautomers (generate_tautomers): Option to generate and use tautomers in the hydrogen network optimization.
Required
Type: boolean
Default: True
Choices: [True, False]
Hetgroup Cluster Distance (het_group_nbr_dist): Distance between heterogens used to determine optimization clusters for protonation.
Required
Type: decimal
Default: 3.5
Include Binding Site Grids (include_bsite_edens_grids): Include electron density and difference density maps around the binding site.
Required
Type: boolean
Default: True
Choices: [True, False]
Add Ligand Smiles (ligand_metadata): Provide the ligand code and SMILES using a comma-separated CSV-style format (e.g. ‘BNZ, c1ccc(cc1)O’). The ligand code is used to identify the entry in the molecule. The SMILES is used to verify or remediate the connectivity and valence state of the ligand read from the PDB or MMCIF file format. Any remaining optional CSV values are ligand SMILES tautomers that SPRUCE will check during hydrogen network optimization (e.g. ‘BNZ, c1ccc(cc1)O, c1ccc(cc1)[O-]’). By default, SPRUCE generates tautomers from the provided SMILES; to test only the provided tautomers, turn off the Generate Tautomers parameter. This parameter is applied globally to all input structures for the floe.
Type: string
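The CSV-style ligand metadata value can be split into a code, a primary SMILES, and optional tautomer SMILES as sketched here. The function name is hypothetical, and the sketch only illustrates the format; it does not reproduce SPRUCE's own parsing or validate the SMILES.

```python
import csv


def parse_ligand_metadata(text: str):
    """Split 'CODE, SMILES[, TAUTOMER...]' into its parts."""
    row = next(csv.reader([text], skipinitialspace=True))
    fields = [field.strip() for field in row]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"expected 'CODE, SMILES[, TAUTOMER...]': {text!r}")
    code, smiles, *tautomers = fields
    return code, smiles, [t for t in tautomers if t]


print(parse_ligand_metadata("BNZ, c1ccc(cc1)O, c1ccc(cc1)[O-]"))
```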
Add Lipid Codes(s) (lipid_codes): Add uncommon, or custom, lipid 3-letter codes.
Type: string
Make Packing Residues (make_pack_res): Generate packing residues from an asymmetric unit.
Required
Type: boolean
Default: True
Choices: [True, False]
Maximum Atoms in Biological Unit (max_bu_atoms): Option to limit the size of BUs processed based on number of atoms.
Required
Type: integer
Default: 50000
Maximum Parts in Biological Unit (max_bu_parts): Option to limit the size of BUs processed based on number of parts (chains).
Required
Type: integer
Default: 24
Max Atoms for a Ligand (max_lig_atoms): Override for the maximum number of heavy atoms in a molecule to be detected as a ligand.
Type: integer
Max Residues for a Ligand (max_lig_residues): Override for the maximum number of residues in a molecule to be detected as a ligand.
Type: integer
Max System Atoms (max_system_atoms): Maximum number of atoms in the system.
Required
Type: integer
Default: 50000
Minimum Alignment Score for Biological Unit Extraction (min_align_score): Option to specify minimum sequence alignment score for biological unit extraction.
Required
Type: integer
Default: 200
Min Atoms for a Ligand (min_lig_atoms): Override for the minimum number of heavy atoms in a molecule to be detected as a ligand.
Type: integer
Optimize Experimental Protons (opt_expt_protons): Option to optimize hydrogens assigned in the experiment.
Required
Type: boolean
Default: False
Choices: [True, False]
Prefer Author BIOMT Records (pref_author_record): Option where the author BIOMT record is preferred over the software generated one.
Required
Type: boolean
Default: True
Choices: [True, False]
Protonate (protonate): Option to add and optimize protons in the system.
Required
Type: boolean
Default: True
Choices: [True, False]
Restrict DUs to Reference Site Removal (restrict_to_refsite): Option to not generate design units with sites not matching the reference (if one is provided).
Required
Type: boolean
Default: True
Choices: [True, False]
Rotamer Coverage % (rot_coverage): Coverage of the rotamers returned from the library in percent.
Required
Type: decimal
Default: 100.0
Rotamer Library (rot_lib): Rotamer library to use for side-chain building.
Required
Type: string
Default: Richardson2016
Choices: [‘Dunbrack’, ‘Richardson’, ‘Richardson2016’]
Size Used to Define Binding Site (site_size): Distance used to determine the size of the site.
Required
Type: decimal
Default: 5.0
Strict Protonation Mode (strict_protonate): Option to fail prep if protons could not be added.
Required
Type: boolean
Default: True
Choices: [True, False]
Superpose Design Units (superpose): Option to superpose DUs (if multiple), first onto the reference structure (if provided).
Required
Type: boolean
Default: True
Choices: [True, False]
Superposition Method (superpose_method): Superposition method.
Required
Type: string
Default: SiteSequence
Choices: [‘GlobalSequence’, ‘SiteSequence’, ‘DDMatrix’, ‘SSE’, ‘SiteHopper’]
Target Classification (target): Option to pick whether target is protein or nucleic acid component.
Required
Type: string
Default: Protein
Choices: [‘Protein’, ‘Nucleic’]