At-a-glance
Abstract
Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) has emerged as a central approach for studying T cell and B cell receptor populations, and is now an important component of studies of autoimmunity, immune responses to pathogens, vaccines, allergens, and cancers, as well as of antibody discovery. When amplifying the rearranged V(D)J genes that encode antigen receptors, each cycle of the Polymerase Chain Reaction (PCR) can produce spurious "chimeric" hybrids of two or more different template sequences. This project aimed to use a Hidden Markov Model (HMM) that, given a collection of "reference" sequences, identifies query sequences that are spurious chimeras. Initially, the HMM was too slow, with runtime scaling with the square of the number of reference sequences (which can number in the hundreds for real-world problems). However, we identified exploitable mathematical structure specific to our problem and rewrote the HMM computations to scale linearly in the number of reference sequences, which provided a tractable solution to the problem.
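To make the scaling argument concrete, here is a minimal Julia sketch (not the published software) of a forward pass for a chimera HMM whose hidden state is "which reference the query is currently copying". One kind of structure that permits this speedup is a uniform per-position probability of switching between references; whether this matches the published model exactly is an assumption here, and all names and parameters below are illustrative. With uniform switching, the transition sum over all other references collapses to a single running total, dropping the per-position cost from O(N²) to O(N).

```julia
# Illustrative sketch only -- not the published implementation.
# Hidden state = index of the reference currently being copied.
# Assumes a uniform switch probability `s` between references, which is the
# structure that lets the O(N^2) transition sum collapse to O(N) per position.
function chimera_forward(refs::Vector{String}, query::String;
                         s::Float64 = 1e-3,
                         e_match::Float64 = 0.99,
                         e_mismatch::Float64 = 0.01)
    N = length(refs)
    L = length(query)
    @assert N >= 2 "need at least two references"
    @assert all(length(r) == L for r in refs) "references must be aligned to the query length"

    alpha = fill(1.0 / N, N)   # alpha[j] = P(observations so far, current reference = j)
    loglik = 0.0
    for t in 1:L
        total = sum(alpha)                 # computed once per position: O(N)
        new_alpha = similar(alpha)
        for j in 1:N
            # Naive version sums over all k != j (O(N) per state, O(N^2) per position).
            # With uniform switching, the off-diagonal mass is just (total - alpha[j]):
            trans = (1 - s) * alpha[j] + (s / (N - 1)) * (total - alpha[j])
            emit = (query[t] == refs[j][t]) ? e_match : e_mismatch
            new_alpha[j] = trans * emit
        end
        c = sum(new_alpha)                 # rescale to avoid underflow
        loglik += log(c)
        alpha = new_alpha ./ c
    end
    return loglik, alpha
end
```

In this toy setting, a low log-likelihood under a "no switching allowed" model relative to the model above would flag a query as a likely chimera; the real decision rule in the published method may differ.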
This project began in the summer school, where the key algorithmic trick was worked out, and it was then developed much further, eventually being published in the journal Bioinformatics.
At-a-glance
Abstract
Transformer models power most of the recent developments in deep learning and AI. When transformers act on sequences of words, they must know what each word is and where it sits in the sentence, so that its relationships to other words can be understood. However, when transformers act on something like the 3-dimensional structure of a protein, these relationships become spatial, with relative 3D position and rotation playing a central role. A large component of the success of AlphaFold 2, whose developers shared the 2024 Nobel Prize in Chemistry for protein structure prediction, was the development of a transformer layer that is invariant to global rotations and translations, which allows us to build deep learning systems that are equivariant to rigid-body transformations. The goal of this project was to understand the Invariant Point Attention (IPA) layer from AlphaFold 2 and implement it in the Julia language, allowing for additional customization and flexibility.
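To illustrate the invariance at the heart of IPA, here is a minimal Julia sketch (not the AlphaFold 2 or project code): each residue carries a rigid frame, query and key "points" are expressed in local frames, and the attention logit depends only on distances between globally placed points, so applying one global rigid transform to every frame leaves the logits unchanged. The Frame type and function names are illustrative only.

```julia
using LinearAlgebra

# Illustrative sketch only -- not the AlphaFold 2 implementation.
struct Frame
    R::Matrix{Float64}   # 3x3 rotation
    t::Vector{Float64}   # 3-vector translation
end

# Map a point from a residue's local frame into global coordinates.
place(f::Frame, p::AbstractVector) = f.R * p + f.t

# Compose a global rigid transform g with a residue frame f (apply g after f).
compose(g::Frame, f::Frame) = Frame(g.R * f.R, g.R * f.t + g.t)

# Distance-based point-attention logit between residues i and j, given a
# local query point q (in frame i) and a local key point k (in frame j).
point_logit(fi::Frame, fj::Frame, q, k) = -norm(place(fi, q) - place(fj, k))^2

# Quick check: logits are unchanged by a global rigid motion.
θ = 0.7
Rg = [cos(θ) -sin(θ) 0.0; sin(θ) cos(θ) 0.0; 0.0 0.0 1.0]
g  = Frame(Rg, [1.0, -2.0, 0.5])
fi = Frame(Matrix{Float64}(I, 3, 3), [0.0, 0.0, 0.0])
fj = Frame(collect(Rg'), [3.0, 0.0, 0.0])
q, k = [0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]
point_logit(fi, fj, q, k) ≈ point_logit(compose(g, fi), compose(g, fj), q, k)  # true
```

Because the logit is a function of distances between globally placed points, rotating and translating the whole structure cancels out, which is the property the abstract refers to.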
The flexibility afforded by this implementation allowed us to develop an "autoregressive" version of IPA, in which each amino acid is sampled in turn, and this grew into an entirely new kind of generative model for protein structures.
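As a rough illustration of what "autoregressive" means here (a schematic, not the published model), the sampling loop places one residue at a time, conditioning each new residue on the partial structure generated so far; `sample_next_residue` below is a hypothetical stand-in for a learned IPA-based network.

```julia
# Schematic autoregressive sampling loop -- not the published model.
# `sample_next_residue` is a hypothetical stand-in for a learned network that,
# conditioned on the partial structure built so far, proposes the next residue
# (e.g. its rigid frame and amino-acid identity).
function sample_structure(sample_next_residue, n_residues::Int)
    structure = Any[]                      # partial structure, grown one residue at a time
    for _ in 1:n_residues
        push!(structure, sample_next_residue(structure))
    end
    return structure
end
```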