DLProteinFormats
Documentation for DLProteinFormats.
- DLProteinFormats.featurizer
- DLProteinFormats.flatten
- DLProteinFormats.sample_batched_inds
- DLProteinFormats.unflatten
- ProteinChains.writepdb
DLProteinFormats.featurizer — Method
featurizer(table::DataFrame, features::NamedTuple; all_mask_prob = 0, feat_mask_prob = 0)
Returns a function that converts a PDB ID and chain ID to a feature vector. all_mask_prob is the probability of masking all features for a chain. feat_mask_prob is either the probability of masking each feature for a chain, or a UnivariateDistribution that is sampled once per chain to give that chain's per-feature masking probability. rand_cats_prob is the probability of randomly setting an extra category for a feature to positive, which allows conditioning on a range of values. rand_cats_weight is the number of chains that will have non-zero rand_cats_prob.
In the returned function, override is a dictionary of chain_id => dictionary of feature name => value.
Example:
chainfeatures = DLProteinFormats.load(PDBTable)
ff = featurizer(chainfeatures, features, all_mask_prob = 0.33, feat_mask_prob = Uniform(0,1))
ff("7A7B", "D", override = Dict(["D" => Dict(["gene_superkingdom" => "Eukaryota"])]))
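The two masking probabilities described above can be illustrated with a small standalone sketch. It is not the package's implementation: `mask_features` is a hypothetical helper, masked values are represented as `nothing` by assumption, the feature name `length_bin` is made up, and only the scalar form of feat_mask_prob is covered (the UnivariateDistribution form would first draw one probability per chain).

```julia
using Random

# Hypothetical helper, not part of DLProteinFormats.
# `feats` maps feature name => value; a masked feature becomes `nothing` (an assumption).
function mask_features(rng::AbstractRNG, feats::AbstractDict;
                       all_mask_prob = 0.0, feat_mask_prob = 0.0)
    # With probability all_mask_prob, hide every feature of this chain.
    if rand(rng) < all_mask_prob
        return Dict{String,Any}(k => nothing for k in keys(feats))
    end
    # Otherwise hide each feature independently with probability feat_mask_prob.
    return Dict{String,Any}(k => (rand(rng) < feat_mask_prob ? nothing : v)
                            for (k, v) in feats)
end

# "length_bin" is an invented feature name used only for illustration.
feats = Dict("gene_superkingdom" => "Eukaryota", "length_bin" => "short")
unmasked = mask_features(MersenneTwister(1), feats)                       # nothing is masked
allmasked = mask_features(MersenneTwister(1), feats; all_mask_prob = 1.0) # everything is masked
```

Keeping the chain-level and feature-level decisions separate means a model trained this way sees fully unconditioned chains as well as partially conditioned ones.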
DLProteinFormats.flatten — Method
flatten(rec::ProteinStructure; T = Float32)
Takes a ProteinStructure and returns a tuple of the translations, rotations, residue indices, and features for each chain.
DLProteinFormats.sample_batched_inds — Method
sample_batched_inds(flatrecs; l2b = length2batch(1000, 1.9))
Takes a vector of (flattened) protein structures, and returns a vector of indices into the original array, with each batch containing a random sample of one protein from each cluster.
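The idea of drawing one random protein per cluster can be sketched in plain Julia. This is only an illustration under stated assumptions: `sample_one_per_cluster` is a hypothetical function, cluster membership is passed as a plain vector of labels, and a fixed `batchsize` stands in for the length-dependent batch sizing that `length2batch` presumably provides.

```julia
using Random

# Hypothetical sketch, not the package implementation.
# `clusters[i]` is the cluster label of record i.
function sample_one_per_cluster(rng::AbstractRNG, clusters::AbstractVector; batchsize::Int = 2)
    # Collect the record indices that belong to each cluster label.
    members = Dict{eltype(clusters),Vector{Int}}()
    for (i, c) in enumerate(clusters)
        push!(get!(members, c, Int[]), i)
    end
    # Pick one random representative per cluster, then shuffle the picks.
    picks = shuffle!(rng, [rand(rng, inds) for inds in values(members)])
    # Split the representatives into batches of at most `batchsize` indices.
    return [picks[i:min(i + batchsize - 1, end)] for i in 1:batchsize:length(picks)]
end

clusters = [1, 1, 2, 3, 3, 3]   # toy labels: three clusters over six records
batches = sample_one_per_cluster(MersenneTwister(42), clusters; batchsize = 2)
flat = reduce(vcat, batches)     # all sampled indices, one per cluster
```

Each returned index points back into the original array, so `clusters[flat]` contains every cluster label exactly once.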
DLProteinFormats.unflatten — Method
unflatten(locs, rots, seqints, chainids, resnums)
unflatten(locs, rots, seqhots, chainids, resnums)
unflatten(locs, rots, seq, chainids, resnums)
Converts flattened protein structure data back into ProteinChain objects.
Arguments
- locs: Array of translations/locations (3×1×L or 3×1×L×B for batched)
- rots: Array of rotations (3×3×L or 3×3×L×B for batched)
- seqints/seqhots/seq: Sequence data as integers, one-hot encoding, or generic sequence
- chainids: Chain identifiers for each residue
- resnums: Residue numbers for each position
Returns
- Vector of ProteinChain objects (or vector of vectors for batched input)
The function reconstructs protein chains from flattened representations, applying unit scaling to locations and converting sequence integers back to amino acid strings.
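The sequence conversion mentioned above can be sketched as follows. Both helpers are hypothetical, and the alphabet ordering is an assumption made for illustration; the package's actual integer-to-residue mapping may differ.

```julia
# Hypothetical helper: map 1-based integer codes to a one-letter sequence string.
# The default alphabet ordering is assumed, not taken from the package.
ints_to_sequence(seqints::AbstractVector{<:Integer}; alphabet = "ACDEFGHIKLMNPQRSTVWY") =
    String([alphabet[i] for i in seqints])

# Hypothetical helper: turn a one-hot matrix (alphabet size × L) into integer codes
# by taking the row index of the largest entry in each column.
onehot_to_ints(seqhots::AbstractMatrix) = [argmax(col) for col in eachcol(seqhots)]

seqints = [1, 4, 10]
seq = ints_to_sequence(seqints)          # "AEL" under the assumed alphabet

# Build the matching one-hot encoding and check that it round-trips.
seqhots = zeros(Float32, 20, length(seqints))
for (j, i) in enumerate(seqints)
    seqhots[i, j] = 1
end
roundtrip = onehot_to_ints(seqhots)
```

Reducing one-hot input to integer codes first lets both sequence forms share a single integer-to-string path.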
ProteinChains.writepdb — Function
writepdb(path, chains::AbstractVector{<:ProteinChains.ProteinChain})
Examples
using DLProteinFormats
data = DLProteinFormats.load(PDBSimpleFlat500);
flat_chains = data[1];
chains = DLProteinFormats.unflatten(
    flat_chains.locs,
    flat_chains.rots,
    flat_chains.AAs,
    flat_chains.chainids,
    flat_chains.resinds,
) # unflatten the flat data
writepdb("chains-1.pdb", chains) # view in e.g. ChimeraX or a VS Code protein viewer extension