Protein Sequence to AI Folded Structure Ligand Affinities

Category Paths

Follow one of these paths in the Orion user interface to find the floe.

Product-based/SPRUCE

Role-based/Computational Chemist

Role-based/Structural Biologist

Role-based/Bioinformatician

Solution-based/Target Identification

Solution-based/Target Identification/Target Preparation

Solution-based/Hit to Lead/Target Preparation

Task-based/Target Prep & Analysis/Protein Preparation

Description

This floe takes protein sequences as input and predicts protein structures with AI folding models. It supports Boltz2 models for structure prediction. After the protein–ligand structure is predicted, the binding affinity is predicted, and the structures are ranked so that those with the lowest affinity values (strongest predicted protein–ligand interactions) are saved. Boltz is a third-party sequence-to-structure protein folding method that combines a protein language model (pLM) with multiple sequence alignments (MSAs) to make accurate predictions for single sequences and multimers. This floe and its defaults follow the standard folding practices outlined in the Boltz Documentation (to avoid leaving this page, right-click and open the link in a new tab).

MSA inputs are optional but recommended. To use an MSA for prediction, either provide your own MSAs in .a3m format as input or have them generated at run time by specifying an input FASTA collection.

Potential Input Sources: MSA Align and Search

Limitations: Longer sequences (700+ residues) can be computationally demanding and may require an increase in cube resources.

Promoted Parameters

Title in user interface (promoted name)

Inputs

Ligand Inputs (lig_input): Dataset of molecules with 3D coordinates or SMILES strings to be co-folded with the input protein sequence and ranked based on affinity. These input ligands will be given the LIG chain ID in the protein–ligand system.

Type: data_source

Ligand File Inputs (lig_files): Molecular or CSV file containing ligands to be co-folded with the input protein sequence and ranked based on affinity. These input ligands will be given the LIG chain ID in the protein–ligand system.

Type: file_in

Boltz Model (boltz_model): Boltz model to run for the prediction. Some features are model-specific. Feature information that is not supported by the indicated model will be ignored.

Required

Type: string

Default: 2

Choices: [‘2’, ‘2x’]

Input Sequence (sequence): Sequence title and primary sequence for structure prediction, separated by a colon. Add multiple sequences with the ‘Add more’ input option. The sequence title must be a unique identifier, because it is used to reference the sequence in other parts of the job form. Example: ‘2MG4_1:MEKRPRTEFSEEQ’

Type: string
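The ‘title:sequence’ format can be illustrated with a short Python sketch. This is not the floe's own parser: the function names are hypothetical, and splitting at the first colon and rejecting duplicate titles are assumptions based on the description above.

```python
def parse_sequence_entry(entry: str) -> tuple[str, str]:
    """Split a 'title:sequence' entry at the first colon (assumed rule)."""
    title, sep, sequence = entry.partition(":")
    title, sequence = title.strip(), sequence.strip()
    if not sep or not title or not sequence:
        raise ValueError(f"expected 'title:sequence', got {entry!r}")
    return title, sequence


def parse_sequence_entries(entries: list[str]) -> dict[str, str]:
    """Build a title -> sequence map, rejecting duplicate titles."""
    parsed: dict[str, str] = {}
    for entry in entries:
        title, sequence = parse_sequence_entry(entry)
        if title in parsed:
            raise ValueError(f"duplicate sequence title: {title!r}")
        parsed[title] = sequence
    return parsed


print(parse_sequence_entries(["2MG4_1:MEKRPRTEFSEEQ"]))
# {'2MG4_1': 'MEKRPRTEFSEEQ'}
```

Checking for duplicate titles up front mirrors the requirement that each title be unique, since other fields on the job form refer to sequences by title.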

Input Sequence FASTA File (sequence_fasta_file): Input FASTA file containing the system for folding. Multiple sequences in this input indicate a multimeric prediction. The sequence title is a unique identifier for each input sequence and is read from the FASTA file: it is the text between the start of the FASTA title line (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1. Any field on this floe job form that requires a sequence title will match the 2MG4_1 value.

Type: file_in
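The title rule for FASTA input (text between ‘>’ and the first ‘|’) can be sketched in Python as follows. The helper names are hypothetical and the floe's actual reader may treat edge cases differently; the sketch is handy for checking a FASTA file before submitting it.

```python
def sequence_title(header: str) -> str:
    """Return the text between the leading '>' and the first '|'."""
    if not header.startswith(">"):
        raise ValueError(f"not a FASTA title line: {header!r}")
    return header[1:].split("|", 1)[0].strip()


def read_fasta(text: str) -> dict[str, str]:
    """Map each sequence title to its sequence, joining wrapped lines."""
    records: dict[str, str] = {}
    title = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = sequence_title(line)
            if title in records:
                raise ValueError(f"duplicate sequence title: {title!r}")
            records[title] = ""
        elif title is None:
            raise ValueError("sequence data found before the first title line")
        else:
            records[title] += line
    return records


fasta = ">2MG4_1|Drosophila melanogaster\nMEKRPRT\nEFSEEQ\n"
print(read_fasta(fasta))
# {'2MG4_1': 'MEKRPRTEFSEEQ'}
```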

System Name (system_name): Name to be used to identify the input sequence. Examples include PDB codes or UniRef sequence IDs. If multiple systems are detected, this value will be used to prefix those systems.

Required

Type: string

Default: Boltz_Prediction

Chain Identifiers (sequence_chainid): Provide a unique entry for every sequence in your input FASTA file. Format: unique_sequence_identifier: cid1, cid2. If left blank, sequences are labeled with chain IDs in alphabetical order, and only one instance of each sequence is passed to AI folding. Chain identifiers set a sequence’s chain ID and are used to define multimeric folding schemes. Every input chain must be paired with one or more chain IDs. Match sequences using sequence titles (see Input Sequence and Input Sequence FASTA File). Use comma-separated chain IDs to indicate multiple instances of that chain for folding. For example, a homodimer of the example in Input Sequence would follow the format: ‘2MG4_1:A,B’

Type: string
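As a rough illustration of how chain ID entries relate to sequence titles, the Python sketch below parses ‘title: cid1, cid2’ entries and gives any sequence without an entry the next unused letter. The function names are hypothetical, only uppercase letters are used for the fallback, and the handling of conflicts is an assumption, not the floe's documented behavior.

```python
from string import ascii_uppercase


def parse_chain_entry(entry: str) -> tuple[str, list[str]]:
    """Parse 'title: cid1, cid2' into (title, [cid1, cid2])."""
    title, sep, ids = entry.partition(":")
    chain_ids = [c.strip() for c in ids.split(",") if c.strip()]
    if not sep or not title.strip() or not chain_ids:
        raise ValueError(f"expected 'title: cid1, cid2', got {entry!r}")
    return title.strip(), chain_ids


def assign_chain_ids(titles: list[str], entries: list[str]) -> dict[str, list[str]]:
    """Pair every sequence title with one or more chain IDs."""
    explicit: dict[str, list[str]] = {}
    used: set[str] = set()
    for entry in entries:
        title, chain_ids = parse_chain_entry(entry)
        if title not in titles:
            raise ValueError(f"unknown sequence title: {title!r}")
        if title in explicit or used & set(chain_ids):
            raise ValueError(f"conflicting chain ID entry: {entry!r}")
        explicit[title] = chain_ids
        used.update(chain_ids)
    # Sequences without an entry get one copy and the next unused letter.
    free = (c for c in ascii_uppercase if c not in used)
    return {t: explicit[t] if t in explicit else [next(free)] for t in titles}


print(assign_chain_ids(["2MG4_1", "2MG4_2"], ["2MG4_1:A,B"]))
# {'2MG4_1': ['A', 'B'], '2MG4_2': ['C']}
```

Here ‘2MG4_2’ is a made-up second title used only to show the fallback; the homodimer entry ‘2MG4_1:A,B’ yields two copies of the first sequence.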

Enforce Chain ID (enforce_chainid): Enforce proper formatting of the Chain Identifiers field. If turned off, any sequence in the input without a chain ID uses the next available alphanumeric chain ID (starting at ‘A’) and is assumed to fold with only one copy of that sequence. Turning this on is recommended when defining system constraints.

Required

Type: boolean

Default: False

Choices: [True, False]

Method Conditioning (method_cond): Provides experimental context for a structure prediction, biasing it toward conformations that resemble those obtained with the indicated method.

Type: string

Default: None

Choices: [‘None’, ‘MD Simulations’, ‘X-ray Diffraction’, ‘Electron Microscopy’, ‘Solution NMR’, ‘Solid-state NMR’, ‘Neutron Diffraction’, ‘Electron Crystallography’, ‘Fiber Diffraction’, ‘Powder Diffraction’, ‘Infrared Spectroscopy’, ‘Fluorescence Transfer’, ‘Electron Paramagnetic Resonance’, ‘Theoretical Model’, ‘Solution Scattering’, ‘AlphaFold Databases’]

Hit List Size (hit_list_size): The desired size of the hit list.

Required

Type: integer

Default: 100
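Conceptually, building the hit list amounts to sorting the co-folded ligands by predicted affinity and keeping the first hit_list_size entries. The Python sketch below shows that idea only; the Prediction class and its field names are hypothetical, and the ordering (lowest affinity value first) follows the floe description, not an inspected implementation.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    ligand: str
    affinity: float  # hypothetical predicted affinity value; lower ranks first


def build_hit_list(predictions: list[Prediction], hit_list_size: int = 100) -> list[Prediction]:
    """Rank predictions by ascending affinity and keep the top entries."""
    if hit_list_size < 1:
        raise ValueError("hit_list_size must be a positive integer")
    ranked = sorted(predictions, key=lambda p: p.affinity)
    return ranked[:hit_list_size]


predictions = [
    Prediction("lig_a", -0.4),
    Prediction("lig_b", -2.1),
    Prediction("lig_c", 0.7),
]
print([p.ligand for p in build_hit_list(predictions, hit_list_size=2)])
# ['lig_b', 'lig_a']
```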

Boltz Constraints and Properties

Pocket Constraint (pocket_constraint): Pocket constraints define important residues for the binding pocket when doing structure prediction with a ligand co-folder. This constraint is defined by the binder (i.e., the ligand) interacting with the pocket. Define the binder with its chain ID. Separate all the following sub-fields with a colon. Pocket residues are required and are defined in brackets with a chain ID and residue index (starting at 1). The optional maximum distance (in angstroms) specifies the longest distance between the binder and a pocket residue, with a range between 4 and 20 (default=6.0). An optional true/false force value will apply a potential enforcing the constraint (default=false). Example: ‘E:[A,192][B,47][A,111]:8.2:true’

Type: string
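To make the sub-field layout of a pocket constraint concrete, the following Python sketch parses a string such as ‘E:[A,192][B,47][A,111]:8.2:true’. It is only an illustration: the class, function, and field names are hypothetical, and the validation rules are inferred from the description above, not taken from the floe.

```python
import re
from dataclasses import dataclass

# One bracketed residue: [chain ID, residue index]
RESIDUE = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*\]")


@dataclass
class PocketConstraint:
    binder: str
    contacts: list[tuple[str, int]]
    max_distance: float = 6.0  # angstroms
    force: bool = False


def parse_pocket_constraint(text: str) -> PocketConstraint:
    """Parse 'binder:[chain,index]...[:max_distance[:force]]'."""
    fields = [f.strip() for f in text.split(":")]
    if not 2 <= len(fields) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated fields: {text!r}")
    binder, residues = fields[0], fields[1]
    contacts = [(chain, int(index)) for chain, index in RESIDUE.findall(residues)]
    if not binder or not contacts or RESIDUE.sub("", residues).strip():
        raise ValueError(f"malformed binder or pocket residues: {text!r}")
    if any(index < 1 for _, index in contacts):
        raise ValueError("residue indices start at 1")
    max_distance = float(fields[2]) if len(fields) > 2 else 6.0
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("maximum distance must be between 4 and 20")
    force = False
    if len(fields) > 3:
        if fields[3].lower() not in ("true", "false"):
            raise ValueError("force must be true or false")
        force = fields[3].lower() == "true"
    return PocketConstraint(binder, contacts, max_distance, force)


print(parse_pocket_constraint("E:[A,192][B,47][A,111]:8.2:true"))
```

For the example string, the result has binder ‘E’, three pocket residues on chains A and B, a maximum distance of 8.2, and force enabled.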

Bond Constraint (bond_constraint): Bond constraints define important covalent bonds to be made during structure prediction. Bonded atoms are defined in brackets with a chain ID, residue index (starting at 1) and atom name. Only two sets of brackets are accepted. Because atom names are required, only canonical residues and CCD ligands can be used. Example: ‘[A,16,SG][A,74,SG]’

Type: string
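The bond constraint format can be checked with a short Python sketch like the one below. The function name is hypothetical and the checks (exactly two bracketed atoms, indices starting at 1) are inferred from the description above; this is not the floe's own validation code.

```python
import re

# One bracketed atom: [chain ID, residue index, atom name]
ATOM = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*,\s*([^,\[\]\s]+)\s*\]")

Atom = tuple[str, int, str]


def parse_bond_constraint(text: str) -> tuple[Atom, Atom]:
    """Parse '[chain,index,atom][chain,index,atom]' into two atom tuples."""
    atoms = [(chain, int(index), name) for chain, index, name in ATOM.findall(text)]
    leftover = ATOM.sub("", text).strip()
    if len(atoms) != 2 or leftover:
        raise ValueError(f"expected exactly two bracketed atoms: {text!r}")
    if any(index < 1 for _, index, _ in atoms):
        raise ValueError("residue indices start at 1")
    return atoms[0], atoms[1]


print(parse_bond_constraint("[A,16,SG][A,74,SG]"))
# (('A', 16, 'SG'), ('A', 74, 'SG'))
```

The example describes a bond between the SG atoms of residues 16 and 74 on chain A.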

Contact Constraint (contact_constraint): Contact constraints define important residues that should be in proximity to each other during structure prediction. Proximal residues are defined in brackets with a chain ID and residue index (starting at 1). Only two sets of brackets are accepted. The optional maximum distance (in angstroms) specifies the longest distance between the two residues, with a range between 4 and 20 (default=6.0). An optional true/false force value will apply a potential enforcing the constraint (default=false). Example: ‘[A,192][B,47]:7.2:true’

Type: string
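A contact constraint has the same optional distance and force sub-fields as a pocket constraint but no binder and exactly two residues. The Python sketch below parses that layout; the function name and return shape are hypothetical, and the defaults and range come from the description above.

```python
import re

# One bracketed residue: [chain ID, residue index]
RESIDUE = re.compile(r"\[\s*([^,\[\]\s]+)\s*,\s*(\d+)\s*\]")

Residue = tuple[str, int]


def parse_contact_constraint(text: str) -> tuple[Residue, Residue, float, bool]:
    """Parse '[chain,index][chain,index][:max_distance[:force]]'."""
    fields = [f.strip() for f in text.split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"expected 1 to 3 colon-separated fields: {text!r}")
    residues = [(chain, int(index)) for chain, index in RESIDUE.findall(fields[0])]
    if len(residues) != 2 or RESIDUE.sub("", fields[0]).strip():
        raise ValueError(f"expected exactly two bracketed residues: {text!r}")
    if any(index < 1 for _, index in residues):
        raise ValueError("residue indices start at 1")
    max_distance = float(fields[1]) if len(fields) > 1 else 6.0
    if not 4.0 <= max_distance <= 20.0:
        raise ValueError("maximum distance must be between 4 and 20")
    force = False
    if len(fields) > 2:
        if fields[2].lower() not in ("true", "false"):
            raise ValueError("force must be true or false")
        force = fields[2].lower() == "true"
    return residues[0], residues[1], max_distance, force


print(parse_contact_constraint("[A,192][B,47]:7.2:true"))
# (('A', 192), ('B', 47), 7.2, True)
```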

Outputs

Boltz Mol Results (boltz_out): Output dataset to which to write.

Required

Type: dataset_out

Default: Boltz_BiologicalUnit_Predictions

Save Boltz input (save_yaml_to_orion): The Boltz input .yaml file stores the input sequence and constraint information. If set to true, this input file is saved to Orion and named using the System Name input value.

Required

Type: boolean

Default: False

Choices: [True, False]

Boltz Failures (fout): Output dataset to which to write.

Required

Type: dataset_out

Default: Boltz_Failures

MSA Search Options

Input MSA Collection (msa_shards): Collection with FASTA files used for MSA search. If a collection is not provided, an MSA search will not be run.

Type: collection_source

Precomputed MSA File (msa_a3m_file): Input MSA files should be in a3m format. Note that each query sequence title in this file must match a sequence title in the query input or FASTA file. The sequence title is a unique identifier for each input sequence and is read from the FASTA or a3m file: it is the text between the start of the title line (‘>’) and the first pipe delimiter (‘|’). For example, the FASTA title ‘>2MG4_1|Drosophila melanogaster’ is automatically assigned the sequence title 2MG4_1.

Type: file_in

Save MSA File (save_msa_file): Save MSA results to Orion in the a3m file format. Each query will generate its own file.

Required

Type: boolean

Default: False

Choices: [True, False]

Output MSA Name (msa_out): Name for the MSA file that will be exported.

Required

Type: string

Default: MSA_Result_File

Max Sequences Cutoff (seq_num_cutoff): Maximum number of results to be saved in an MSA search result per query sequence. Increasing this value will increase sensitivity. A value of 0 will save all sequences from the prefilter steps.

Required

Type: integer

Default: 500

MSA Search Sensitivity (msa_search_sensitivity): Sensitivity for the MMseqs2 sequence search. The default matches the MMseqs2 default.

Required

Type: decimal

Default: 5.7

Sequence Identity Cutoff (seq_id_cutoff): Reject any search hits with a sequence identity score lower than this value. Setting the value to 0 will accept all search hits. The allowed range for this cutoff is [0, 0.93].

Type: decimal

Default: 0.0
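The combined effect of the Sequence Identity Cutoff and the Max Sequences Cutoff can be pictured with the Python sketch below. It is a simplified model, not the floe's implementation: the function name is hypothetical, identity is assumed to be a fraction between 0 and 1, hits are assumed to arrive in search order, and applying the identity filter before the count limit is also an assumption.

```python
def filter_hits(
    hits: list[tuple[str, float]],
    seq_id_cutoff: float = 0.0,
    seq_num_cutoff: int = 500,
) -> list[tuple[str, float]]:
    """Filter (title, identity) hits for one query sequence.

    Hits below seq_id_cutoff are rejected; a cutoff of 0 accepts all hits.
    At most seq_num_cutoff hits are kept; a value of 0 keeps all of them.
    """
    if not 0.0 <= seq_id_cutoff <= 0.93:
        raise ValueError("seq_id_cutoff must be within [0, 0.93]")
    if seq_num_cutoff < 0:
        raise ValueError("seq_num_cutoff must be 0 or a positive integer")
    kept = [hit for hit in hits if hit[1] >= seq_id_cutoff]
    return kept if seq_num_cutoff == 0 else kept[:seq_num_cutoff]


hits = [("query", 1.0), ("hit_1", 0.8), ("hit_2", 0.3), ("hit_3", 0.55)]
print(filter_hits(hits, seq_id_cutoff=0.5, seq_num_cutoff=2))
# [('query', 1.0), ('hit_1', 0.8)]
print(filter_hits(hits, seq_id_cutoff=0.5, seq_num_cutoff=0))
# [('query', 1.0), ('hit_1', 0.8), ('hit_3', 0.55)]
```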