Protein Sequence to AI Folded Structure Prediction

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

  • Product-based/SPRUCE

  • Role-based/Computational Chemist

  • Role-based/Structural Biologist

  • Role-based/Bioinformatician

  • Solution-based/Target Identification

  • Solution-based/Target Identification/Target Preparation

  • Solution-based/Hit to Lead/Target Preparation

  • Task-based/Target Prep & Analysis/Protein Preparation

Description

This floe predicts protein structures from input protein sequences using AI folding models. It supports the Boltz1/2 models for structure prediction. Boltz is a third-party sequence-to-structure protein folding method that combines a protein language model (pLM) with multiple sequence alignment (MSA) to make accurate single-sequence and multimeric predictions. The floe and its defaults follow the standard folding practices outlined in the Boltz Documentation.

MSA inputs are optional but recommended. To use an MSA for prediction, either provide your own MSAs in .a3m format as input, or have them generated at run time by specifying an input FASTA collection.

Potential Input Sources: MSA Align and Search

Limitations: Longer sequences (700+) can be computationally demanding and may require an increase in cube resources.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Boltz Model (boltz_model): Boltz model to run for the prediction. Some features are model-specific. Feature inputs that the selected model does not support are ignored.

  • Required

  • Type: string

  • Default: 2

  • Choices: [‘1’, ‘2’, ‘1x’, ‘2x’]

Input Sequence (sequence): Sequence title and primary sequence for structure prediction, separated by a colon. Multiple sequences can be added with the ‘Add more’ input option. The sequence title must be a unique identifier, because other parts of the job form use it to refer to the sequence. Example: ‘2MG4_1:MEKRPRTEFSEEQ’

  • Type: string
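
The ‘title:sequence’ format above can be checked before submitting a job. The following is a minimal sketch, not the floe's own parser: the function names are hypothetical, and the checks (non-empty fields, letters only, unique titles) are assumptions drawn from the description.

```python
def parse_sequence_entry(entry: str) -> tuple[str, str]:
    """Split one 'TITLE:SEQUENCE' entry into (title, sequence)."""
    title, sep, sequence = entry.partition(":")
    title, sequence = title.strip(), sequence.strip().upper()
    if not sep or not title or not sequence:
        raise ValueError(f"expected 'TITLE:SEQUENCE', got {entry!r}")
    # Assumption: a primary sequence consists of one-letter codes only.
    if not sequence.isalpha():
        raise ValueError(f"sequence for {title!r} contains non-letter characters")
    return title, sequence


def parse_sequence_entries(entries: list[str]) -> dict[str, str]:
    """Parse several entries and reject duplicate sequence titles."""
    parsed: dict[str, str] = {}
    for entry in entries:
        title, sequence = parse_sequence_entry(entry)
        if title in parsed:
            raise ValueError(f"duplicate sequence title: {title!r}")
        parsed[title] = sequence
    return parsed


print(parse_sequence_entries(["2MG4_1:MEKRPRTEFSEEQ"]))
```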

Input Sequence FASTA File (sequence_fasta_file): Input FASTA file containing the system to fold. Multiple sequences in this file indicate a multimeric prediction. The sequence title is a unique identifier for each input sequence and is read from the FASTA file: it is the text between the start of the FASTA title line (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1. Any field on this job form that requires a sequence title will match against the 2MG4_1 value.

  • Type: file_in
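
The title rule described above (text between ‘>’ and the first ‘|’) can be sketched as follows. This is an illustration only: the function names are hypothetical, and using the whole header as the title when no pipe is present is an assumption, not documented behavior.

```python
def sequence_title(header: str) -> str:
    """Return the text between '>' and the first '|' of a FASTA title line."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA title line: {header!r}")
    # Assumption: with no '|' present, the whole header is the title.
    return header[1:].split("|", 1)[0].strip()


def read_fasta(text: str) -> dict[str, str]:
    """Map each sequence title to its (possibly multi-line) sequence."""
    records: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = sequence_title(line)
            if current in records:
                raise ValueError(f"duplicate sequence title: {current!r}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first title line")
        else:
            records[current].append(line)
    return {title: "".join(parts) for title, parts in records.items()}


example = ">2MG4_1|Drosophila melanogaster\nMEKRPRT\nEFSEEQ\n"
print(read_fasta(example))
```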

System Name (system_name): Name to be used to identify the input sequence. Examples include PDB codes or UniRef sequence IDs. If multiple systems are detected, this value will be used to prefix those systems.

  • Required

  • Type: string

  • Default: Boltz_Prediction

Chain Identifiers (sequence_chainid): Chain identifiers assign chain IDs to each sequence and are used to define multimeric folding schemes. Provide one unique entry for every sequence in your input FASTA file, in the format ‘unique_sequence_identifier: cid1, cid2’. Match sequences by their sequence titles (see Input Sequence and Input Sequence FASTA File). Every chain listed must be paired with one or more chain IDs; use comma-separated chain IDs to indicate multiple instances of that chain for folding. For example, a homodimer of the example in Input Sequence would follow the format: ‘2MG4_1:A,B’. If left blank, sequences are labeled with chain IDs in alphabetical order, and only one instance of each sequence is passed for AI folding.

  • Type: string
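
The pairing of sequence titles with chain IDs can be illustrated with a short sketch. It is not the floe's implementation: `assign_chain_ids` is a hypothetical helper, and the fallback draws only from uppercase letters, which is a simplification.

```python
from string import ascii_uppercase


def assign_chain_ids(titles: list[str], chain_entries: list[str]) -> dict[str, list[str]]:
    """Pair each sequence title with chain IDs.

    titles: sequence titles in input order.
    chain_entries: strings such as '2MG4_1:A,B'.
    """
    explicit: dict[str, list[str]] = {}
    for entry in chain_entries:
        title, sep, cids = entry.partition(":")
        title = title.strip()
        ids = [c.strip() for c in cids.split(",") if c.strip()]
        if not sep or not ids:
            raise ValueError(f"expected 'title:cid1,cid2', got {entry!r}")
        if title not in titles:
            raise ValueError(f"unknown sequence title: {title!r}")
        explicit[title] = ids

    used = {cid for ids in explicit.values() for cid in ids}
    # Simplification: unassigned sequences take the next unused uppercase letter.
    free = (c for c in ascii_uppercase if c not in used)
    result = {title: explicit.get(title) or [next(free)] for title in titles}

    flat = [cid for ids in result.values() for cid in ids]
    if len(flat) != len(set(flat)):
        raise ValueError("chain IDs must be unique across all sequences")
    return result


# Homodimer of 2MG4_1 plus a second, hypothetical sequence with no explicit chain ID.
print(assign_chain_ids(["2MG4_1", "OTHER_1"], ["2MG4_1:A,B"]))
```

Each chain ID listed for a title stands for one copy of that sequence, so ‘2MG4_1:A,B’ describes two copies folded together.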

Enforce Chain ID (enforce_chainid): Enforce proper formatting of the Chain Identifiers field. If turned off, any input sequence without a chain ID is given the next available alphanumeric chain ID (starting at ‘A’) and is folded as a single copy. Turning this on is recommended when defining system constraints.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Ligands Co-folders (ligands): Ligands to be co-folded, each given as a ligand identifier and its chain ID separated by a colon. Ligand identifiers can be 3-letter CCD codes or SMILES strings. Multiple ligands can be added with the ‘Add more’ input option. Each ligand must have its own chain ID, unique from all other sequences and ligands in the system. To indicate multiple instances of the same ligand, provide comma-separated chain IDs for that ligand identifier, for example: ‘Cc1ccc(O)cc1:E,F’

  • Type: string
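
A ligand entry can be split into its identifier and chain IDs as sketched below. The function name is hypothetical, and splitting on the last colon is an assumption about how such an entry is best read; it is chosen because a SMILES string may itself contain ‘:’ (for example in atom-class labels such as [CH3:1]).

```python
def parse_ligand_entry(entry: str) -> tuple[str, list[str]]:
    """Split 'LIGAND:cid1,cid2' into (ligand identifier, chain IDs)."""
    # rpartition splits on the LAST colon, so colons inside a SMILES survive.
    ligand, sep, cids = entry.rpartition(":")
    ligand = ligand.strip()
    chain_ids = [c.strip() for c in cids.split(",") if c.strip()]
    if not sep or not ligand or not chain_ids:
        raise ValueError(f"expected 'LIGAND:cid1,cid2', got {entry!r}")
    if len(chain_ids) != len(set(chain_ids)):
        raise ValueError(f"repeated chain ID in {entry!r}")
    return ligand, chain_ids


print(parse_ligand_entry("Cc1ccc(O)cc1:E,F"))
```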

Method Conditioning (method_cond): Provides experimental context for a structure prediction, biasing the prediction toward conformations that resemble structures from the indicated method.

  • Type: string

  • Default: None

  • Choices: [‘None’, ‘MD Simulations’, ‘X-ray Diffraction’, ‘Electron Microscopy’, ‘Solution NMR’, ‘Solid-state NMR’, ‘Neutron Diffraction’, ‘Electron Crystallography’, ‘Fiber Diffraction’, ‘Powder Diffraction’, ‘Infrared Spectroscopy’, ‘Fluorescence Transfer’, ‘Electron Paramagnetic Resonance’, ‘Theoretical Model’, ‘Solution Scattering’, ‘AlphaFold Databases’]

Boltz Constraints and Properties

Pocket Constraint (pocket_constraint): Pocket constraints define important residues of the binding pocket for structure prediction with a ligand co-folder. The constraint is defined relative to the binder (i.e., the ligand) interacting with the pocket. Identify the binder by its chain ID, and separate all of the following sub-fields with colons. Pocket residues are required; each is given in brackets as a chain ID and residue index (starting at 1). The optional maximum distance (in angstroms) specifies the longest distance between the binder and a pocket residue and must be between 4 and 20 (default=6.0). The optional true/false force value applies a potential enforcing the constraint (default=false). Example: ‘E:[A,192][B,47][A,111]:8.2:true’

  • Type: string
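
The sub-fields of a pocket constraint can be unpacked as in the sketch below. It follows the format and defaults stated above, but the class, function, and field names are hypothetical and the floe's own validation may differ.

```python
import re
from dataclasses import dataclass

# One bracketed residue such as [A,192]: chain ID, then a residue index.
_RESIDUE = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*\]")


@dataclass
class PocketConstraint:
    binder: str
    contacts: list[tuple[str, int]]
    max_distance: float = 6.0
    force: bool = False


def parse_pocket_constraint(text: str) -> PocketConstraint:
    fields = [f.strip() for f in text.split(":")]
    if not 2 <= len(fields) <= 4 or not fields[0]:
        raise ValueError(f"expected 'BINDER:[C,N]...[:DIST[:FORCE]]', got {text!r}")
    binder, residues = fields[0], fields[1]

    contacts = [(chain, int(idx)) for chain, idx in _RESIDUE.findall(residues)]
    # Reject stray text between or around the brackets.
    if not contacts or _RESIDUE.sub("", residues).strip():
        raise ValueError(f"malformed pocket residues: {residues!r}")
    if any(idx < 1 for _, idx in contacts):
        raise ValueError("residue indices start at 1")

    max_distance = float(fields[2]) if len(fields) > 2 and fields[2] else 6.0
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("maximum distance must be between 4 and 20 angstroms")

    force = False
    if len(fields) > 3 and fields[3]:
        if fields[3].lower() not in ("true", "false"):
            raise ValueError(f"force must be true or false, got {fields[3]!r}")
        force = fields[3].lower() == "true"
    return PocketConstraint(binder, contacts, max_distance, force)


print(parse_pocket_constraint("E:[A,192][B,47][A,111]:8.2:true"))
```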

Bond Constraint (bond_constraint): Bond constraints define important covalent bonds to be made during structure prediction. Bonded atoms are defined in brackets with a chain ID, residue index (starting at 1) and atom name. Only two sets of brackets are accepted. Because atom names are required, only canonical residues and CCD ligands can be used. Example: ‘[A,16,SG][A,74,SG]’

  • Type: string
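
A bond constraint names exactly two atoms, each as chain ID, residue index, and atom name. The sketch below shows one way to read that format; the function name is hypothetical and it does not check that the atom names exist in the residue or ligand.

```python
import re

# One bracketed atom such as [A,16,SG]: chain ID, residue index, atom name.
_ATOM = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*,\s*([^,\[\]\s]+)\s*\]")

Atom = tuple[str, int, str]


def parse_bond_constraint(text: str) -> tuple[Atom, Atom]:
    atoms = _ATOM.findall(text)
    # Exactly two bracket groups, with nothing left over around them.
    if len(atoms) != 2 or _ATOM.sub("", text).strip():
        raise ValueError(f"expected '[C,N,ATOM][C,N,ATOM]', got {text!r}")
    parsed = [(chain, int(idx), name) for chain, idx, name in atoms]
    if any(idx < 1 for _, idx, _ in parsed):
        raise ValueError("residue indices start at 1")
    return parsed[0], parsed[1]


# A disulfide between the SG atoms of residues 16 and 74 on chain A.
print(parse_bond_constraint("[A,16,SG][A,74,SG]"))
```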

Contact Constraint (contact_constraint): Contact constraints define important residues that should be in proximity to each other during structure prediction. Proximal residues are defined in brackets with a chain ID and residue index (starting at 1). Only two sets of brackets are accepted. The optional maximum distance (in angstroms) specifies the longest distance between the two residues and must be between 4 and 20 (default=6.0). The optional true/false force value applies a potential enforcing the constraint (default=false). Example: ‘[A,192][B,47]:7.2:true’

  • Type: string

Affinity Calculation (affinity_property): Boltz can predict the affinity of a ligand during structure prediction. Only one ligand affinity prediction can be run. Populate this field with the chain ID of the chosen ligand from the Ligands Co-folders field.

  • Type: string

Outputs

Boltz Mol Results (boltz_out): Output dataset to which to write.

  • Required

  • Type: dataset_out

  • Default: Boltz_BiologicalUnit_Predictions

Run Spruce (run_spruce): Option to run Spruce preparation on the Boltz predictions. Whether true or false, the raw Boltz structure prediction will be saved to a biological unit.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Boltz Results (out): Output dataset to which to write.

  • Required

  • Type: dataset_out

  • Default: Boltz_Predictions

Save Boltz input (save_yaml_to_orion): The Boltz input .yaml file stores the input sequence and constraint information. If set to true, this input file is saved to Orion and named using the System Name input value.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Boltz Failures (fout): Output dataset to which to write.

  • Required

  • Type: dataset_out

  • Default: Boltz_Failures

MSA Search Options

Input MSA Collection (msa_shards): Collection with FASTA files used for MSA search. If a collection is not provided, an MSA search will not be run.

  • Type: collection_source

Precomputed MSA File (msa_a3m_file): Input MSA files must be in a3m format. The query sequence titles in this file must match a sequence title in the query input or FASTA file. The sequence title is a unique identifier for input sequences and is read from the FASTA or a3m file: it is the text between the start of the title line (‘>’) and the first pipe delimiter (‘|’). For example, the title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1.

  • Type: file_in

Save MSA File (save_msa_file): Save MSA results to Orion in the a3m file format. Each query will generate its own file.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Output MSA Name (msa_out): Name for the MSA file that will be exported.

  • Required

  • Type: string

  • Default: MSA_Result_File

Max Sequences Cutoff (seq_num_cutoff): Maximum number of results to be saved in an MSA search result per query sequence. Increasing this value will increase sensitivity. A value of 0 will save all sequences from the prefilter steps.

  • Required

  • Type: integer

  • Default: 500

MSA Search Sensitivity (msa_search_sensitivity): Sensitivity of the MMseqs2 sequence search. The default value is the MMseqs2 default.

  • Required

  • Type: decimal

  • Default: 5.7

Sequence Identity Cutoff (seq_id_cutoff): Reject any search hits with a sequence identity score lower than this value. Setting the value to 0 will accept all search hits. The valid range for this cutoff is [0, 0.93].

  • Type: decimal

  • Default: 0.0
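
The combined effect of the Sequence Identity Cutoff and Max Sequences Cutoff parameters can be pictured with a small filter. This is purely illustrative and is not how MMseqs2 or the floe applies them: the function name, the (hit ID, identity) tuple layout, and the order of the two steps are all assumptions.

```python
def filter_hits(
    hits: list[tuple[str, float]],
    seq_id_cutoff: float = 0.0,
    seq_num_cutoff: int = 500,
) -> list[tuple[str, float]]:
    """Filter (hit_id, identity) pairs given in search-rank order."""
    if not 0.0 <= seq_id_cutoff <= 0.93:
        raise ValueError("seq_id_cutoff must be within [0, 0.93]")
    if seq_num_cutoff < 0:
        raise ValueError("seq_num_cutoff must be 0 or greater")
    # A cutoff of 0 accepts every hit, since identities are never negative.
    kept = [hit for hit in hits if hit[1] >= seq_id_cutoff]
    # A maximum of 0 means "keep everything that survived".
    return kept if seq_num_cutoff == 0 else kept[:seq_num_cutoff]


hits = [("hit_1", 0.95), ("hit_2", 0.40), ("hit_3", 0.72)]
print(filter_hits(hits, seq_id_cutoff=0.5, seq_num_cutoff=0))
```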

Structure Post-Prediction Options

Use OEPocket Finding (run_oepocket): Option to use OEPocket to generate design units.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Use F-Pocket Finding (run_fpocket): Option to use fpocket to generate design units.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Save Biological Units (output_bio_designunits): If no pocket-finding method is used or no valid pockets are found, this option saves the structure as a biological unit.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Reference DU (ref_du): Dataset containing the reference DU for superposition.

  • Type: data_source

Superposition method (superpos_method): Superposition method.

  • Required

  • Type: string

  • Default: SiteSequence

  • Choices: [‘GlobalSequence’, ‘SiteSequence’, ‘DDMatrix’, ‘SSE’, ‘SiteHopper’]

Unliganded Structure Parameters

Enumerate Pockets (enum_pocket): Option to enumerate pockets when no ligand is found.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Site Residue Entry (site_residue): Single site residue specification for apo structures. Format: ‘RESNAME:RESNUM:ICODE:CHAINID[:FRAGNO:ALTLOC]’, e.g. ‘ALA:325: :A’ (note the blank/whitespace insertion code). The regex ‘.*’ notation can be used as a wildcard.

  • Type: string
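
To see how the colon-separated fields and the ‘.*’ wildcard fit together, the sketch below splits a specification and matches it against a residue. It is an illustration, not Spruce's matcher: the function names, the dictionary layout, and treating every field as a full-match regular expression are assumptions.

```python
import re

FIELDS = ("resname", "resnum", "icode", "chainid", "fragno", "altloc")


def parse_site_residue(spec: str) -> dict[str, str]:
    """Split 'RESNAME:RESNUM:ICODE:CHAINID[:FRAGNO:ALTLOC]' into named fields."""
    parts = spec.split(":")
    if len(parts) not in (4, 6):
        raise ValueError(f"expected 4 or 6 colon-separated fields, got {spec!r}")
    # Fields are not stripped: a single blank is a meaningful insertion code.
    return dict(zip(FIELDS, parts))


def matches(spec: dict[str, str], residue: dict[str, str]) -> bool:
    """True if every field pattern in spec fully matches the residue's value."""
    return all(
        re.fullmatch(pattern, residue[name]) is not None
        for name, pattern in spec.items()
    )


residue = {"resname": "ALA", "resnum": "325", "icode": " ", "chainid": "A"}
print(matches(parse_site_residue("ALA:325: :A"), residue))
print(matches(parse_site_residue(".*:325: :A"), residue))
```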

General Spruce Parameters

Add Interaction Hints (add_interactions): Option to add interactions to the design units.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Add Style (add_style): Option to add style to the design units.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Allow Cap Residue Truncation (allow_truncate): Option to allow a terminal residue to be converted to a cap if the cap would otherwise clash.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Alternate Location Handling Method (altloc): Option to pick method of handling alternate locations.

  • Required

  • Type: string

  • Default: Default

  • Choices: [‘Primary’, ‘Enumerate’, ‘Default’]

Build C-Terminal Caps (build_cterm_caps): Option to cap broken C-termini in protein chains.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Build N-Terminal Caps (build_nterm_caps): Option to cap broken N-termini in protein chains.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Build Partial Sidechains (build_sidechains): Option to build missing or partial protein sidechains.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Assign Charges and Radii (charge_radii): Option to assign partial charge and radii.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Add Cofactor Code(s) (cofactor_codes): Add uncommon, or custom, cofactor 3-letter codes.

  • Type: string

Collapse Non-Site Alternates (collapse_nonsite_alts): Option to deduplicate structures with different alts, if the alt locations are not near the binding site.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Delete Clashing Solvent (delete_clashing_solvent): Option to allow build steps to remove clashing solvent.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Duplicate Removal (duplicate_removal): Option to deduplicate identical structures resulting from symmetry operation.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Enumerate Cofactor Sites (enum_cofactors_sites): Option to generate individual design units based on the recognized cofactors.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Add Excipient Code(s) (excipient_codes): Add uncommon, or custom, excipient 3-letter codes.

  • Type: string

Fix Backbone Atom Issues (fix_backbone): Option to fix backbone atom issues in protein chains.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Generate Tautomers (generate_tautomers): Option to generate and use tautomers in the hydrogen network optimization.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Hetgroup Cluster Distance (het_group_nbr_dist): Distance between heterogens used to determine optimization clusters for protonation.

  • Required

  • Type: decimal

  • Default: 3.5

Include Binding Site Grids (include_bsite_edens_grids): Include electron density and difference density maps around the binding site.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Add Ligand Smiles (ligand_metadata): Provide the ligand code and SMILES using a comma-separated CSV-style format (e.g. ‘BNZ, c1ccc(cc1)O’). The ligand code is used to identify the entry in the molecule. The SMILES is used to verify or remediate the connectivity and valence state of the ligand read from the PDB or MMCIF file format. Any remaining CSV values are optional ligand SMILES tautomers that SPRUCE will check during hydrogen network optimization (e.g. ‘BNZ, c1ccc(cc1)O, c1ccc(cc1)[O-]’). By default, SPRUCE generates tautomers from the provided SMILES; to test only the provided tautomers, turn off the Generate Tautomers parameter. This parameter is applied globally to all input structures for the floe.

  • Type: string
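
The CSV-style value can be split into its ligand code, primary SMILES, and optional tautomer SMILES as sketched below. The function name and returned dictionary keys are hypothetical; the sketch only separates the fields and does not validate the SMILES.

```python
import csv


def parse_ligand_metadata(text: str) -> dict[str, object]:
    """Split 'CODE, SMILES[, TAUTOMER...]' into its parts."""
    # skipinitialspace drops the blank that follows each comma.
    fields = [f.strip() for f in next(csv.reader([text], skipinitialspace=True))]
    if len(fields) < 2 or not all(fields):
        raise ValueError(f"expected 'CODE, SMILES[, TAUTOMER...]', got {text!r}")
    code, smiles, *tautomers = fields
    return {"code": code, "smiles": smiles, "tautomers": tautomers}


print(parse_ligand_metadata("BNZ, c1ccc(cc1)O, c1ccc(cc1)[O-]"))
```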

Add Lipid Codes(s) (lipid_codes): Add uncommon, or custom, lipid 3-letter codes.

  • Type: string

Make Packing Residues (make_pack_res): Generate packing residues from an asymmetric unit.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Maximum Atoms in Biological Unit (max_bu_atoms): Option to limit the size of BUs processed based on number of atoms.

  • Required

  • Type: integer

  • Default: 50000

Maximum Parts in Biological Unit (max_bu_parts): Option to limit the size of BUs processed based on number of parts (chains).

  • Required

  • Type: integer

  • Default: 24

Max Atoms for a Ligand (max_lig_atoms): Override for the maximum number of heavy atoms in a molecule to be detected as a ligand.

  • Type: integer

Max Residues for a Ligand (max_lig_residues): Override for the maximum number of residues in a molecule to be detected as a ligand.

  • Type: integer

Max System Atoms (max_system_atoms): Maximum number of atoms in the system.

  • Required

  • Type: integer

  • Default: 50000

Minimum Alignment Score for Biological Unit Extraction (min_align_score): Option to specify minimum sequence alignment score for biological unit extraction.

  • Required

  • Type: integer

  • Default: 200

Min Atoms for a Ligand (min_lig_atoms): Override for the minimum number of heavy atoms in a molecule to be detected as a ligand.

  • Type: integer

Optimize Experimental Protons (opt_expt_protons): Option to optimize hydrogens assigned in the experiment.

  • Required

  • Type: boolean

  • Default: False

  • Choices: [True, False]

Prefer Author BIOMT Records (pref_author_record): Option to prefer the author BIOMT record over the software-generated one.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Protonate (protonate): Option to add and optimize protons in the system.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Restrict DUs to Reference Site Removal (restrict_to_refsite): Option to not generate design units with sites not matching the reference (if one is provided).

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Rotamer Coverage % (rot_coverage): Coverage of the rotamers returned from the library in percent.

  • Required

  • Type: decimal

  • Default: 100.0

Rotamer Library (rot_lib): Rotamer library to use for side-chain building.

  • Required

  • Type: string

  • Default: Richardson2016

  • Choices: [‘Dunbrack’, ‘Richardson’, ‘Richardson2016’]

Size Used to Define Binding Site (site_size): Distance used to determine the size of the site.

  • Required

  • Type: decimal

  • Default: 5.0

Strict Protonation Mode (strict_protonate): Option to fail prep if protons could not be added.

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Superpose Design Units (superpose): Option to superpose DUs (if multiple), first onto the reference structure (if provided).

  • Required

  • Type: boolean

  • Default: True

  • Choices: [True, False]

Superposition Method (superpose_method): Superposition method.

  • Required

  • Type: string

  • Default: SiteSequence

  • Choices: [‘GlobalSequence’, ‘SiteSequence’, ‘DDMatrix’, ‘SSE’, ‘SiteHopper’]

Target Classification (target): Option to pick whether target is protein or nucleic acid component.

  • Required

  • Type: string

  • Default: Protein

  • Choices: [‘Protein’, ‘Nucleic’]