Abstract
Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) has emerged as a central approach for studying T cell and B cell receptor populations, and is now an important component of studies of autoimmunity, immune responses to pathogens, vaccines, allergens, and cancers, as well as of antibody discovery. When amplifying the rearranged V(D)J genes encoding antigen receptors, each cycle of the Polymerase Chain Reaction (PCR) can produce spurious "chimeric" hybrids of two or more different template sequences. This project aimed to use a Hidden Markov Model (HMM) that, given a collection of "reference" sequences, identifies query sequences that are spurious chimeras. Initially the HMM was too slow, scaling with the square of the number of reference sequences (which can number in the hundreds for real-world problems), but we identified exploitable mathematical structure specific to our problem and rewrote the HMM recursions to scale linearly in the number of references, providing a tractable solution.
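To illustrate the flavour of the speed-up, the sketch below shows a forward recursion for a chimera HMM in which the hidden state is the reference currently being copied. With a uniform per-column switch probability, the sum over "switch from any other reference" terms can be computed once per column and reused, turning the naive quadratic update into a linear one. This is a minimal illustration under assumed alignment and emission models, not the published implementation.

```julia
# Hedged sketch: forward recursion for a chimera HMM whose hidden state is the
# reference sequence currently being copied. Assumes references are pre-aligned
# to the query so column t of every reference lines up with query position t,
# a uniform per-column switch probability `sigma`, and a simple match/mismatch
# emission model over ASCII nucleotide characters. Names are illustrative.

function forward_loglik(query::String, refs::Vector{String};
                        sigma = 1e-3, p_match = 0.99)
    N, T = length(refs), length(query)
    emit(r, t) = refs[r][t] == query[t] ? p_match : (1 - p_match) / 3

    alpha = fill(1.0 / N, N)          # uniform start over references
    loglik = 0.0
    for t in 1:T
        S = sum(alpha)                # shared term: computed once per column
        new_alpha = similar(alpha)
        for r in 1:N
            # sum over r' != r equals S - alpha[r], so each column costs O(N), not O(N^2)
            stay   = (1 - sigma) * alpha[r]
            switch = N > 1 ? sigma * (S - alpha[r]) / (N - 1) : 0.0
            new_alpha[r] = emit(r, t) * (stay + switch)
        end
        c = sum(new_alpha)            # rescale to avoid underflow
        loglik += log(c)
        alpha = new_alpha ./ c
    end
    return loglik
end
```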
This project began at the summer school, where the key algorithmic trick was worked out, but it was developed much further afterwards and eventually published in the journal Bioinformatics.
Abstract
Transformer models power most of the recent developments in deep learning and AI. When transformers act on sequences of words, they must encode both what each word is and where it sits in the sentence, so that its relationship to other words can be understood. However, when transformers act on something like the 3-dimensional structure of a protein, these relationships become spatial, with relative 3D position and rotation playing a central role. A large component of the success of AlphaFold 2, whose developers shared the 2024 Nobel Prize in Chemistry for protein structure prediction, was the development of a transformer layer whose attention is invariant to global rotations and translations, which in turn allows deep learning systems whose outputs are equivariant to rigid-body transformations. The goal of this project was to understand the Invariant Point Attention (IPA) layer from AlphaFold 2 and implement it in the Julia language, allowing for additional customization and flexibility.
The flexibility afforded by this implementation allowed us to develop an "autoregressive" version of IPA, where each amino acid can be sampled in turn, which was developed into a totally new kind of generative model for protein structures.
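As a minimal illustration of the invariance at the heart of IPA, the sketch below computes the point-distance term of the attention logits: query and key points predicted in each residue's local frame are mapped into the global frame and compared by squared distance, which is unchanged by any global rotation and translation of the structure. This is a toy single-head sketch with illustrative names, not the full AlphaFold 2 layer or the project's Julia implementation.

```julia
# Hedged sketch of the geometric term in Invariant Point Attention (IPA).
# Each residue i carries a rigid frame (rotation R_i, translation t_i); local
# query/key points are mapped into the global frame and the attention logit is
# penalized by their squared distance, which a global rigid transform leaves
# unchanged. Single head, one point per residue, illustrative names.

struct Frame
    R::Matrix{Float64}   # 3x3 rotation
    t::Vector{Float64}   # 3-vector translation
end

to_global(f::Frame, p::Vector{Float64}) = f.R * p + f.t

# frames and per-residue local query/key points for N residues -> N x N logits
function point_attention_logits(frames::Vector{Frame},
                                q_pts::Vector{Vector{Float64}},
                                k_pts::Vector{Vector{Float64}};
                                gamma = 1.0)
    N = length(frames)
    logits = zeros(N, N)
    for i in 1:N, j in 1:N
        qi = to_global(frames[i], q_pts[i])
        kj = to_global(frames[j], k_pts[j])
        logits[i, j] = -gamma * sum(abs2, qi - kj)   # invariant to global rigid moves
    end
    return logits
end
```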
Abstract
Quantifying natural selection at the molecular level often involves estimating the ratio of nonsynonymous to synonymous substitution rates (dN/dS). The FUBAR (Fast Unconstrained Bayesian AppRoximation) method approximates the posterior distribution of selection across a gene using a pre-computed grid of rates and a Dirichlet prior. However, the Dirichlet prior in the original FUBAR ignores the spatial arrangement of the grid, so neighboring rate categories share no information a priori. This project aimed to extend FUBAR by imposing a smooth prior over the discretized dN/dS surface. Instead of a standard Dirichlet distribution, we used Random Fourier Features (RFF) to parameterize a continuous surface over the grid. By using Markov Chain Monte Carlo (MCMC) to sample the weights of these features, we can more robustly estimate the landscape of selection, effectively sharing information between neighboring rate categories while maintaining computational tractability.
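A minimal sketch of the idea follows, with illustrative feature counts, grid spacing, and kernel bandwidth rather than the settings used in the project: each grid cell's coordinates are mapped through random cosine features, a weight vector (the quantity sampled by MCMC in the method) produces a smooth log-surface over the grid, and a softmax converts it into category probabilities.

```julia
# Hedged sketch: a smooth surface over a 2D rate grid via Random Fourier
# Features. Grid cells at coordinates (alpha_i, beta_j) get an unnormalized
# log-weight from K random cosine features; a softmax turns these into
# probabilities. Bandwidth `ell`, K, and the grid itself are illustrative.

# random Fourier features approximating an RBF kernel with bandwidth `ell`
function rff_design(points::Matrix{Float64}, K::Int; ell = 1.0)
    D = size(points, 2)                       # dimension of grid coordinates (2 here)
    Omega = randn(K, D) ./ ell                # random frequencies
    phi = rand(K) .* 2pi                      # random phases
    return sqrt(2 / K) .* cos.(points * Omega' .+ phi')   # rows: cells, cols: features
end

# grid coordinates: all (alpha, beta) pairs on a small rate grid
alphas = range(0.0, 2.0, length = 20)
betas  = range(0.0, 2.0, length = 20)
points = hcat(vec([a for a in alphas, b in betas]),
              vec([b for a in alphas, b in betas]))

Phi = rff_design(points, 64)                  # 400 x 64 feature matrix
w = randn(64) .* 0.1                          # feature weights (sampled by MCMC in the method)

logits = Phi * w
weights = exp.(logits) ./ sum(exp.(logits))   # smooth probabilities over grid cells
```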
Abstract
Positional encodings are crucial for transformers to understand spatial relationships, but for 3D molecular structures these encodings should ideally respect physical symmetries such as translational equivariance. This project explored two architectural approaches to multidimensional positional encoding. First, we implemented a multi-dimensional RoPE that generalizes the original rotary position embedding (Su et al., 2021) by applying rotary transforms independently along each coordinate axis. Second, we implemented STRING (Schneck et al., 2025), a learnable rotary embedding designed to ensure true translational equivariance, so that the model's internal representation remains unchanged as the molecule shifts in space.
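The sketch below illustrates the multi-dimensional RoPE variant only (STRING's learnable construction is not shown): the feature vector is split into one block per spatial axis, and the standard 1D rotary transform is applied within each block using that axis's continuous coordinate as the position. Block sizes and the frequency schedule are illustrative assumptions.

```julia
# Hedged sketch of a multi-dimensional rotary embedding. Continuous 3D
# coordinates replace integer token indices; each axis rotates its own block
# of the feature vector, exactly as in 1D RoPE (Su et al., 2021).

# rotate pairs (x[2k-1], x[2k]) by angle pos * theta_k, as in 1D RoPE
function rope_block(x::Vector{Float64}, pos::Float64; base = 10_000.0)
    d = length(x)                               # block size, assumed even
    y = similar(x)
    for k in 1:(d ÷ 2)
        theta = pos / base^(2(k - 1) / d)
        c, s = cos(theta), sin(theta)
        a, b = x[2k - 1], x[2k]
        y[2k - 1] = c * a - s * b
        y[2k]     = s * a + c * b
    end
    return y
end

# apply RoPE independently along each of the 3 coordinate axes
function rope_3d(x::Vector{Float64}, coord::Vector{Float64})
    d = length(x) ÷ 3                           # per-axis block size (assumed even)
    blocks = [rope_block(x[(i - 1) * d + 1 : i * d], coord[i]) for i in 1:3]
    return vcat(blocks...)
end
```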
We applied these architectures to the problem of protein side-chain rotamer sampling using Flow Matching—a modern generative modeling framework. By training the model to "flow" side-chain atoms from a random initial state toward their correct physical conformations, conditioned on the protein backbone, we developed a method that efficiently captures the complex geometric constraints between neighboring amino acids.
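A minimal sketch of a conditional flow-matching training step is shown below, under the common straight-line interpolation path; `model` stands in for the transformer conditioned on the backbone, and all names are illustrative assumptions rather than the project's exact training code.

```julia
# Hedged sketch of one flow-matching training step: side-chain coordinates x1
# are paired with random noise x0, interpolated at a random time t, and the
# network is trained to predict the straight-line velocity (x1 - x0).

using Statistics

function flow_matching_loss(model, backbone, x1::Array{Float64})
    x0 = randn(size(x1)...)                 # random initial state
    t  = rand()                             # training time in (0, 1)
    xt = (1 - t) .* x0 .+ t .* x1           # point on the straight-line path
    v_target = x1 .- x0                     # target velocity along the path
    v_pred = model(xt, t, backbone)         # network prediction, conditioned on backbone
    return mean(abs2, v_pred .- v_target)   # mean squared error
end
```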
Abstract
Generative models for 3D molecules have immense potential in drug discovery and materials science. This project developed an autoregressive transformer-based architecture in the Julia language that constructs molecules atom-by-atom. Unlike models that generate entire molecular graphs simultaneously, this approach treats molecule building as a sequential growth process. At each step, the model predicts the next atom's chemical identity, its 3D displacement relative to a selected "anchor" atom, and a "climb" variable that selects the next anchor point for subsequent growth steps, effectively navigating the branching structure derived from SMILES strings.
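One possible reading of this growth loop is sketched below, with `propose_next` standing in for the trained network and the "climb" variable treated as a number of steps to pop back up a stack of anchor atoms; these details are illustrative assumptions, not the exact decoding procedure used in the project.

```julia
# Hedged sketch of sequential atom-by-atom growth: at each step the model
# proposes the next atom's element, a 3D displacement from the current anchor,
# and a "climb" decision that moves the anchor back toward an earlier branch
# point. All names, the climb semantics, and the stopping rule are assumptions.

struct Atom
    element::Symbol
    position::Vector{Float64}
end

function grow_molecule(propose_next; max_atoms = 50)
    atoms = [Atom(:C, zeros(3))]           # seed atom at the origin
    anchor_stack = [1]                     # indices of candidate anchor atoms
    while length(atoms) < max_atoms && !isempty(anchor_stack)
        anchor = atoms[anchor_stack[end]]
        element, displacement, climb, stop = propose_next(atoms, anchor)
        stop && break
        push!(atoms, Atom(element, anchor.position .+ displacement))
        if climb > 0
            # climb: pop anchors to return to an earlier branch point
            for _ in 1:min(climb, length(anchor_stack) - 1)
                pop!(anchor_stack)
            end
        else
            push!(anchor_stack, length(atoms))   # keep growing from the new atom
        end
    end
    return atoms
end
```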
The MOGMOG architecture combines a multi-channel token encoder with a deep transformer trunk. The encoder integrates chemical identity, 3D coordinates, and sequence indices into a rich representation. A Mixture of Gaussians (MoG) decoder models the continuous 3D displacement vectors, allowing the model to capture the multimodal spatial distributions of atom placements. The transformer's attention mechanism is further augmented by a learned pairwise attention bias, which incorporates relative 3D distances and structural relations between atoms at multiple length scales.
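As an illustration of the MoG decoder idea, the sketch below scores a 3D displacement under a mixture of isotropic Gaussians whose logits, means, and scales would be emitted by the network head; the component count and the isotropy are simplifying assumptions, not the exact parameterization used in MOGMOG.

```julia
# Hedged sketch of a Mixture of Gaussians decoder head for 3D displacements:
# per mixture component the network emits a logit, a 3D mean, and a log
# standard deviation; training minimizes the displacement's negative
# log-likelihood under the resulting mixture of isotropic Gaussians.

# log N(x | mu, sigma^2 I) for a 3-vector x with isotropic variance
function log_normal_iso(x, mu, log_sigma)
    sigma2 = exp(2 * log_sigma)
    return -0.5 * (3 * log(2pi) + 6 * log_sigma + sum(abs2, x .- mu) / sigma2)
end

# negative log-likelihood of a displacement under the mixture
function mog_nll(displacement::Vector{Float64},
                 logits::Vector{Float64},          # K mixture logits
                 means::Vector{Vector{Float64}},   # K component means (3-vectors)
                 log_sigmas::Vector{Float64})      # K log standard deviations
    log_pi = logits .- log(sum(exp.(logits)))      # log mixture weights (softmax)
    log_components = [log_pi[k] + log_normal_iso(displacement, means[k], log_sigmas[k])
                      for k in eachindex(logits)]
    m = maximum(log_components)                    # log-sum-exp for numerical stability
    return -(m + log(sum(exp.(log_components .- m))))
end
```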