SOFO Comp Bio

Selected student projects from the Karolinska Institute summer research school in computational biology and bioinformatics.

A linear time-complexity Hidden Markov Model for detecting PCR recombination in antibody repertoire sequencing datasets

Developing and implementing a novel algorithm to rapidly identify spurious sequences that pollute immunoinformatics datasets.

next generation sequencing · immunoinformatics · 2022

At-a-glance

Students
Aron Stålmarck
Mentors
Mark Chernyshev, Ben Murrell
Cohort
Summer 2022
Code

Abstract

Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) has emerged as a central approach for studying T cell and B cell receptor populations, and is now an important component of studies of autoimmunity, immune responses to pathogens, vaccines, allergens, and cancers, as well as of antibody discovery. When amplifying the rearranged V(D)J genes encoding antigen receptors, each cycle of the Polymerase Chain Reaction (PCR) can produce spurious “chimeric” hybrids of two or more different template sequences. This project aimed to use a Hidden Markov Model (HMM) that takes a collection of “reference” sequences and identifies which query sequences are spurious chimeras of them. Initially, the HMM was too slow, scaling with the square of the number of reference sequences (which can number in the hundreds for real-world problems), but we identified exploitable mathematical structure specific to our problem and rewrote the HMM maths to be linear in the number of reference sequences, providing a tractable solution.
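To make the scaling argument concrete, below is a minimal sketch in Julia (the language used elsewhere in these projects) of the kind of rank-one transition structure that gives linear scaling. It assumes pre-aligned, equal-length sequences, a uniform per-site switch probability, and simple match/mismatch emissions; the published model is more involved.

    # Sketch only: a jump-HMM whose transition matrix is
    # (1 - switch) * I + (switch / R) * ones(R, R). The rank-one jump term
    # lets the sum over previous states be computed once per site, so the
    # forward pass costs O(R * L) rather than O(R^2 * L).
    logsumexp(v) = (m = maximum(v); m + log(sum(exp.(v .- m))))
    logaddexp(a, b) = max(a, b) + log1p(exp(-abs(a - b)))

    function forward_loglik(query::String, refs::Vector{String};
                            switch = 1e-3, eps = 0.01)
        R, L = length(refs), length(query)
        logf = fill(-log(R), R)             # uniform start, in log space
        for t in 1:L
            total = logsumexp(logf)         # shared "jump" mass, computed once
            for j in 1:R
                stay = log1p(-switch) + logf[j]
                jump = log(switch / R) + total
                emit = refs[j][t] == query[t] ? log1p(-eps) : log(eps / 3)
                logf[j] = emit + logaddexp(stay, jump)
            end
        end
        return logsumexp(logf)              # log P(query | references)
    end

One natural way to flag chimeras is to compare this likelihood against a no-recombination baseline (switch = 0).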

Result figure

This project began in the summer school, where the key algorithmic trick was worked out; it was then developed much further and eventually published in the journal Bioinformatics.

Flexible Invariant Point Attention

Implementing the key SE(3)-equivariant layer from AlphaFold2, but with extra flexibility.

deep learning · proteins · 2023

At-a-glance

Students
Lukas Billera
Mentors
Ben Murrell
Cohort
Summer 2023
Code

Abstract

Transformer models power most of the recent developments in deep learning and AI. When transformers act on sequences of words, they must know both what each word is and where it sits in the sentence, so that its relationships to other words can be understood. When transformers act on something like the 3-dimensional structure of a protein, however, these relationships become spatial, with relative 3D position and rotation playing a central role. A large component of the success of AlphaFold 2, whose developers shared the 2024 Nobel Prize in Chemistry for protein structure prediction, was a transformer layer whose attention is invariant to global rotations and translations, which allows us to build deep learning systems that are equivariant to rigid-body transformations. The goal of this project was to understand the Invariant Point Attention (IPA) layer from AlphaFold 2 and implement it in the Julia language, allowing for additional customization and flexibility.
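As a sanity check on the core idea, the following toy snippet (not the project's code, and far from the full IPA layer) verifies numerically that attention logits built from distances between frame-local points, mapped into the global frame, are unchanged by a global rigid-body transform of all frames:

    using LinearAlgebra, Random

    rng = MersenneTwister(0)
    n = 5                                   # residues
    rot() = Matrix(qr(randn(rng, 3, 3)).Q)  # random orthogonal matrix
    Rs = [rot() for _ in 1:n]               # per-residue rotations
    ts = [randn(rng, 3) for _ in 1:n]       # per-residue translations
    ps = [randn(rng, 3) for _ in 1:n]       # learned frame-local points

    # Map local points into the global frame, then score pairs by distance.
    globalpts(Rs, ts) = [Rs[i] * ps[i] + ts[i] for i in 1:n]
    logits(pts) = [-norm(pts[i] - pts[j])^2 for i in 1:n, j in 1:n]

    # One global rigid transform applied to every frame...
    Rg, tg = rot(), randn(rng, 3)
    Rs2 = [Rg * R for R in Rs]
    ts2 = [Rg * t .+ tg for t in ts]

    # ...leaves the attention logits untouched (invariance).
    @assert logits(globalpts(Rs, ts)) ≈ logits(globalpts(Rs2, ts2))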


The flexibility afforded by this implementation allowed us to develop an "autoregressive" version of IPA, where each amino acid can be sampled in turn, which was developed into a totally new kind of generative model for protein structures.

Smoothing the Selection: Bayesian Inference of dN/dS with Random Fourier Features

Extending the FUBAR method by using Random Fourier Features and MCMC to impose a smooth prior over the dN/dS grid.

evolutionary biology · bayesian statistics · 2024

At-a-glance

Students
Hedwig Nora Nordlinder and Liam Virsand Gerrbrand
Mentors
Ben Murrell
Cohort
Summer 2024
Code

Abstract

Quantifying natural selection at the molecular level often involves estimating the ratio of nonsynonymous to synonymous substitution rates (dN/dS). The FUBAR (Fast Unconstrained Bayesian AppRoximation) method approximates the posterior distribution of selection across a gene using a pre-computed grid of rates and a Dirichlet prior. However, the original FUBAR prior carries no notion of smoothness: neighboring grid cells are a priori unrelated. This project aimed to extend FUBAR by imposing a smooth prior over the discretized dN/dS surface. Instead of a standard Dirichlet distribution, we used Random Fourier Features (RFF) to parameterize a continuous surface over the grid. By using Markov Chain Monte Carlo (MCMC) to sample the weights of these features, we can more robustly estimate the landscape of selection, effectively sharing information between neighboring rate categories while maintaining computational tractability.
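A condensed illustration of the RFF parameterization (feature count, grid spacing, and frequency scale here are illustrative, not the project's settings): the smoothness of the resulting surface comes from the finite set of random frequencies.

    using LinearAlgebra, Random

    rng = MersenneTwister(1)
    D = 64                                       # number of random features
    W = randn(rng, D, 2)                         # random frequencies
    phase = 2pi .* rand(rng, D)                  # random phases
    phi(x) = sqrt(2 / D) .* cos.(W * x .+ phase) # RFF feature map

    grid = [[a, b] for a in 0.1:0.1:2.0, b in 0.1:0.1:2.0]  # (dS, dN) rates
    w = randn(rng, D)                            # weights: the MCMC target
    f = [dot(w, phi(x)) for x in grid]           # smooth surface values
    p = exp.(f) ./ sum(exp.(f))                  # softmax -> grid prior

Sampling `w` with MCMC then induces a posterior over smooth grid distributions, rather than over unconstrained per-cell weights.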

Animation of the smooth prior distribution over the dN/dS grid surface.
Animation of the resulting posterior distribution, capturing the landscape of selection.

Translationally Equivariant Positional Encodings for Protein Modeling

Implementing learnable, translationally equivariant rotary position embeddings and applying them to side-chain rotamer sampling.

deep learning · proteins · 2025

At-a-glance

Students
B.H. and A.K.
Mentors
Aron Stålmarck and Ben Murrell
Cohort
Summer 2025
Code

Abstract

Positional information is crucial for transformers to understand spatial relationships, but for 3D molecular structures these encodings should ideally respect physical symmetries such as translational equivariance. This project explored two architectural approaches to multidimensional positional encoding. First, we implemented a multi-dimensional RoPE that generalizes the original rotary position embedding (Su et al., 2021) by applying rotary transforms independently along each coordinate axis. Second, we implemented STRING (Schenck et al., 2025), a learnable rotary embedding designed to ensure true translational equivariance, where the model's internal representation remains invariant as the molecule shifts in space.
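To make the "independently along each coordinate axis" idea concrete, here is a small sketch of the multi-dimensional RoPE variant (the frequencies and the per-axis dimension split are illustrative):

    # Rotate consecutive feature pairs by angles proportional to `coord`.
    function rope_axis!(q::AbstractVector, coord::Real, freqs::AbstractVector)
        for (k, f) in enumerate(freqs)
            i, j = 2k - 1, 2k
            c, s = cos(coord * f), sin(coord * f)
            q[i], q[j] = c * q[i] - s * q[j], s * q[i] + c * q[j]
        end
        return q
    end

    # Multi-dimensional RoPE: one chunk of the feature vector per axis.
    function rope3d!(q::AbstractVector, pos::AbstractVector, freqs)
        d = 2 * length(freqs)                # feature dims used per axis
        for ax in 1:3
            rope_axis!(view(q, (ax - 1) * d + 1 : ax * d), pos[ax], freqs)
        end
        return q
    end

The useful property is that the dot product between a query rotated at position x and a key rotated at position y depends only on the offset x - y, which is what makes the encoding relative, and hence translation-respecting.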

We applied these architectures to the problem of protein side-chain rotamer sampling using Flow Matching, a modern generative modeling framework. By training the model to "flow" side-chain atoms from a random initial state toward their correct physical conformations, conditioned on the protein backbone, we developed a method that efficiently captures the complex geometric constraints between neighboring amino acids.
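Schematically, the flow-matching objective reduces to a simple regression (this sketch assumes straight-line probability paths; `model` is a placeholder for the positional-encoding-equipped network, and backbone conditioning is omitted):

    using Random

    # One flow-matching loss evaluation: interpolate between noise x0 and
    # data x1, and regress the predicted velocity onto the target x1 - x0.
    function fm_loss(model, x1::AbstractMatrix; rng = Random.default_rng())
        x0 = randn(rng, size(x1))            # random initial atom positions
        t = rand(rng)                        # time in [0, 1]
        xt = (1 - t) .* x0 .+ t .* x1        # point on the straight-line path
        v = model(xt, t)                     # predicted velocity field
        return sum(abs2, v .- (x1 .- x0)) / length(x1)
    end

At sampling time, side-chain atoms are carried from noise to conformations by integrating the learned velocity field over t.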

Protein side-chain sampling animation 1
Generative sampling of protein side-chain rotamers using Flow Matching and translationally equivariant rotary embeddings.
Protein side-chain sampling animation 2
Repeated independent side chain conformations sampled by the model, from the same backbone.

Autoregressive 3D Molecular Generation

Building a transformer that grows molecules one atom at a time, predicting both chemical identity and 3D geometry.

generative AI · chemistry · 2025

At-a-glance

Students
Adam Mehranrad and Alice Stenbeck
Mentors
Anton Oresten and Ben Murrell
Cohort
Summer 2025

Abstract

Generative models for 3D molecules have immense potential in drug discovery and materials science. This project developed an autoregressive transformer-based architecture in the Julia language that constructs molecules atom-by-atom. Unlike models that generate entire molecular graphs simultaneously, this approach treats molecule building as a sequential growth process. At each step, the model predicts the next atom's chemical identity, its 3D displacement relative to a selected "anchor" atom, and a "climb" variable that selects the next anchor point for subsequent growth steps, effectively navigating the branching structure derived from SMILES strings.
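The growth loop looks schematically like the following (all names are placeholders rather than the project's actual API, and the climb decision is simplified to a binary choice):

    struct Atom
        element::Symbol
        pos::Vector{Float64}
    end

    # Schematic sequential growth: each step predicts the next atom's
    # element, its 3D displacement from the current anchor, and a climb
    # decision that moves the anchor through the SMILES-derived tree.
    function grow(model, maxatoms::Int)
        atoms = [Atom(:C, zeros(3))]         # seed atom at the origin
        anchor = 1
        for _ in 2:maxatoms
            element, disp, climb, stop = model(atoms, anchor)
            stop && break
            push!(atoms, Atom(element, atoms[anchor].pos .+ disp))
            # Simplified: either keep the current anchor (branching) or
            # advance it to the newly placed atom (chain growth).
            anchor = climb ? anchor : length(atoms)
        end
        return atoms
    end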

The MOGMOG architecture combines a multi-channel token encoder with a deep transformer trunk. The encoder integrates chemical identity, 3D coordinates, and sequence indices into a rich representation. A Mixture of Gaussians (MoG) decoder models the continuous 3D displacement vectors, allowing the model to capture the multimodal spatial distributions of atom placements. The transformer's attention mechanism is further augmented by a learned pairwise attention bias, which incorporates relative 3D distances and structural relations between atoms at multiple length scales.
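The MoG decoder's training signal is a mixture negative log-likelihood over displacement vectors; a simplified isotropic version (the component count and covariance structure are assumptions here) looks like:

    # Numerically stable log-softmax over mixture logits.
    function logsoftmax(z)
        m = maximum(z)
        return z .- (m + log(sum(exp.(z .- m))))
    end

    # Negative log-likelihood of a 3D displacement `x` under a mixture of
    # isotropic Gaussians with mixing logits, means `mus`, scales `sigmas`.
    function mog_nll(x, logits, mus, sigmas)
        logw = logsoftmax(logits)
        comps = [logw[k] - 1.5 * log(2pi * sigmas[k]^2) -
                 sum(abs2, x .- mus[k]) / (2 * sigmas[k]^2)
                 for k in eachindex(logits)]
        m = maximum(comps)
        return -(m + log(sum(exp.(comps .- m))))  # -logsumexp over components
    end

Minimizing this over training data lets the mixture spread its components across the multiple plausible placements of each atom.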

Training loss curve
Training loss.
Generated molecule sample 1
A 3D molecular structure generated by the model.
Generated molecule sample 2
Additional generated sample showing structural diversity.