API Reference

Onion.AdaLN — Type

AdaLN(dim::Int, cond_dim::Int)

Adaptive Layer Normalization.

aln = AdaLN(5, 3)
h = randn(Float32, 5,10,1)
cond = randn(Float32, 3,1)
h = aln(h, cond)

source

Onion.Attention — Type

Attention(
    in_dim::Int, n_heads::Int, n_kv_heads=n_heads;
    head_dim=in_dim÷n_heads, qkv_bias=false,
    q_norm=identity, k_norm=identity,
    out_init_scale=1,
)

Attention layer that supports both self-attention and cross-attention (as in Llama3).

Examples

Self-attention

in_dim = 256
n_heads = 8
n_kv_heads = 4
head_dim = 64
attn = Attention(in_dim, n_heads, n_kv_heads; head_dim)

seq_len = 10
batch = 2
x = randn(Float32, in_dim, seq_len, batch)
output = attn(x)

source

Onion.BlockLinear — Type

BlockLinear(
    d1 => d2, k, σ=identity;
    bias::Bool=true, init=Flux.glorot_uniform)

A block-diagonal version of a linear layer, comprising k blocks, where the blocks are of size (d2 ÷ k, d1 ÷ k).

Equivalent to Linear when k=1.
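
To make the block structure concrete, here is a minimal sketch in plain Julia that does not call Onion. The helper `block_linear_sketch` and its weight layout `(d2 ÷ k, d1 ÷ k, k)` are assumptions for illustration, not Onion's internals. It also assumes that block j acts on the j-th contiguous group of input channels.

```julia
# Illustrative only: `block_linear_sketch` is a hypothetical helper, not Onion's implementation.
# W holds k blocks of size (d2 ÷ k, d1 ÷ k); x is (d1, batch).
function block_linear_sketch(W::AbstractArray{T,3}, b::AbstractVector{T}, x::AbstractMatrix{T}) where {T}
    o, i, k = size(W)
    xb = reshape(x, i, k, :)                     # split the d1 input channels into k groups
    yb = similar(x, o, k, size(x, 2))
    for j in 1:k
        yb[:, j, :] = W[:, :, j] * xb[:, j, :]   # each block only sees its own group
    end
    return reshape(yb, o * k, :) .+ b
end

d1, d2, k, batch = 6, 4, 2, 3
W = randn(Float32, d2 ÷ k, d1 ÷ k, k)
b = zeros(Float32, d2)
x = randn(Float32, d1, batch)
y = block_linear_sketch(W, b, x)                 # size (d2, batch)

# The same map written as one dense block-diagonal matrix, for comparison
D = zeros(Float32, d2, d1)
for j in 1:k
    D[(j-1)*(d2÷k)+1:j*(d2÷k), (j-1)*(d1÷k)+1:j*(d1÷k)] = W[:, :, j]
end
```

The block-diagonal form stores `d1 * d2 ÷ k` weights instead of `d1 * d2`, which is where the saving over a dense layer comes from.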

source

Onion.CrossFrameIPA — Type

CrossFrameIPA(dim::Int, ipa; ln = Flux.LayerNorm(dim))

Constructs a layer that takes one embedding, and two sets of frames. Runs layernorm on the embedding, and then makes a cross-attention IPA call with one embedding but two frames. Useful for self-conditioning where two sets of frames need to communicate with each other.

source

Onion.DART — Type

DART(transformer; mask=:causal)

"Doubly Auto-Regressive Transformer" (DART) is a convenience layer wrapping a transformer block that can be used to model auto-regressive data represented along two dimensions.

Note

The mask acts on the flattened token sequence.

Examples

julia> dart = DART(TransformerBlock(64, 8));

julia> x = randn(Float32, 64, 4, 20);

julia> dart(x) |> size
(64, 4, 20)

source

Onion.DyT — Method

DyT(dim::Integer; init_alpha::T = 0.5f0)

Make a Dynamic Tanh (DyT) layer for normalizing the input tensor.

See "Transformers without Normalization" for more details.
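
As a rough illustration of what such a layer computes, the paper's formulation is an element-wise `γ .* tanh.(α .* x) .+ β` with a learnable scalar α. The sketch below is plain Julia under that assumption. `dyt_sketch` and the parameter names are hypothetical and may not match Onion's fields.

```julia
# Hypothetical helper, not Onion's DyT: element-wise tanh squashing with a scale and shift.
dyt_sketch(x, α, γ, β) = γ .* tanh.(α .* x) .+ β

dim = 4
α = 0.5f0                      # mirrors the init_alpha default in the signature above
γ = ones(Float32, dim)         # per-channel scale (assumed)
β = zeros(Float32, dim)        # per-channel shift (assumed)
x = 10f0 .* randn(Float32, dim, 8, 2)
y = dyt_sketch(x, α, γ, β)     # same shape as x, values bounded by tanh
```

Unlike LayerNorm or RMSNorm, nothing here reduces over the feature dimension, so each element is processed independently.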

source

Onion.FSQ — Type

FSQ(l, chunk_size)

Finite Scalar Quantization. l is the number of quantization levels. For a sequence with d channels, the codebook size would be l^d. chunk_size is the number of channels that get combined/separated when chunk/unchunk are called.
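
The relationship between levels, channels, and codebook size can be sketched in plain Julia. This is not Onion's implementation: `fsq_sketch` is a hypothetical helper that assumes an odd number of levels and uses the common "bound with tanh, then round" recipe.

```julia
# Hypothetical sketch, not Onion's FSQ: bound each channel with tanh, then round to l levels.
# Assumes an odd number of levels so the grid is symmetric around zero.
function fsq_sketch(z::AbstractArray, l::Int)
    isodd(l) || throw(ArgumentError("this sketch assumes an odd l"))
    half = (l - 1) ÷ 2
    return round.(Int, half .* tanh.(z))     # integers in -half:half
end

l, d = 5, 3
z = randn(Float32, d, 10)
codes = fsq_sketch(z, l)                      # one of l values per channel
codebook_size = l^d                           # number of distinct d-channel codes
```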

source

Onion.Framemover — Type

Framemover(dim::Int; init_gain = 0.1f0)

Differentiable rigid body updates (AF2-style).

source

Onion.IPAblock — Type

IPAblock(dim::Int, ipa; ln1 = Flux.LayerNorm(dim), ln2 = Flux.LayerNorm(dim), ff = StarGLU(dim, 3dim))

For use with Invariant Point Attention, either from InvariantPointAttention.jl or MessagePassingIPA.jl.

If ipablock.ipa is from InvariantPointAttention.jl, call ipablock(frames, x; pair_feats = nothing, cond = nothing, mask = 0, kwargs...).

If ipablock.ipa is from MessagePassingIPA.jl, call ipablock(g, frames, x, pair_feats; cond = nothing).

Pass in cond if you're using, e.g., AdaLN, which takes a second argument.

source

Onion.L2Norm — Type

L2Norm(; dims=1, eps=1f-6)

Alias for LpNorm with p=2.
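
A minimal plain-Julia sketch of L2 normalization along `dims` follows. `l2norm_sketch` is a hypothetical helper, and where `eps` enters the formula is an assumption, so Onion's exact expression may differ.

```julia
# Hypothetical helper, not Onion's L2Norm; the placement of eps is an assumption.
l2norm_sketch(x; dims=1, eps=1f-6) = x ./ (sqrt.(sum(abs2, x; dims=dims)) .+ eps)

x = randn(Float32, 8, 5)
y = l2norm_sketch(x)
col_norms = sqrt.(sum(abs2, y; dims=1))   # each column now has norm close to 1
```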

source

Onion.LayerNorm — Type

LayerNorm(dim::Int; eps::T=1f-6)

Layer Normalization.

ln = LayerNorm(64)
x = randn(Float32, 64, 10, 1)
y = ln(x)

source

Onion.Linear — Type

Linear(
    d1 => d2, k, σ=identity;
    bias::Bool=true,
    init=Flux.glorot_uniform
)

See also the L2Norm alias for p=2.

source

Onion.Modulator — Type

Modulator(in_dim => out_dim; σ=sigmoid, op=*)

Takes an input Y and a conditioning input X and applies a gate to Y based on X.

See Gated Attention for Large Language Models

Examples

julia> gate = Modulator(32 => 64);

julia> Y = randn(Float32, 64);

julia> X = randn(Float32, 32);

julia> gate(Y, X) |> size
(64,)

source

Onion.MultidimRoPE — Method

MultidimRoPE(; theta=10000f0)

Multi-dimensional Rotary Position Embedding (RoPE) for 2D, 3D, or higher-dimensional coordinate inputs. This is a fixed (non-learnable) generalization of the original RoPE from Su et al. (2021), where each rotary pair of channels is assigned to a specific coordinate dimension and rotated accordingly.

Example

dim, n_heads, n_kv_heads, seqlen = 64, 8, 4, 16
t = TransformerBlock(dim, n_heads, n_kv_heads)
h = randn(Float32, dim, seqlen, 1)
mask = 0

positions = randn(Float32, 3, seqlen, 1)
rope = MultidimRoPE(theta=10000f0)

h_out = t(h, positions, rope, mask)  # self-attention with multi-dim RoPE

source

Onion.RMSNorm — Type

RMSNorm(dim::Int; T=Float32, eps=1f-5, zero_centered=false)

Root Mean Square Layer Normalization. As used in Llama3.
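
For orientation, the computation can be sketched in plain Julia as below. `rmsnorm_sketch` is a hypothetical helper. In particular, treating `zero_centered=true` as "the stored weight is an offset from 1" is an assumption about what that keyword means.

```julia
using Statistics

# Hypothetical helper, not Onion's RMSNorm.
# Assumption: zero_centered means the learnable weight is stored as an offset from 1.
function rmsnorm_sketch(x, w; eps=1f-5, zero_centered=false)
    scale = zero_centered ? (1 .+ w) : w
    rms = sqrt.(mean(abs2, x; dims=1) .+ eps)   # one RMS value per column
    return x ./ rms .* scale
end

x = randn(Float32, 16, 10)
y = rmsnorm_sketch(x, ones(Float32, 16))
```

Compared with LayerNorm, no mean is subtracted; only the scale of each feature vector is normalized.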

source

Onion.RoPE — Type

RoPE(dim::Int, max_length; theta::T=10000f0)

Rotary Position Embeddings (as in Llama3).

dim = 64
n_heads = 8
n_kv_heads = 4
seqlen = 10

t = TransformerBlock(dim, n_heads, n_kv_heads)
h = randn(Float32, dim, seqlen, 1)

rope = RoPE(dim ÷ n_heads, 1000)
h = t(h, 1, rope[1:seqlen]) #Note the subsetting to match seqlen
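
The rotation itself can be sketched in plain Julia. This is not Onion's implementation: `rope_rotate_sketch` is a hypothetical helper that pairs adjacent channels, whereas libraries differ in how channels are paired and laid out.

```julia
# Hypothetical sketch, not Onion's RoPE. x is (head_dim, seq_len) with an even head_dim.
# Assumption: channels (1,2), (3,4), ... form the rotary pairs.
function rope_rotate_sketch(x::AbstractMatrix, theta=10000f0)
    d, n = size(x)
    y = similar(x)
    for p in 1:n, i in 1:2:d
        freq = theta^(-(i - 1) / d)          # one frequency per channel pair
        s, c = sincos((p - 1) * freq)        # rotation angle grows with position
        y[i, p]     = c * x[i, p] - s * x[i + 1, p]
        y[i + 1, p] = s * x[i, p] + c * x[i + 1, p]
    end
    return y
end

x = randn(Float32, 8, 5)
y = rope_rotate_sketch(x)
```

Because each pair is only rotated, vector norms are unchanged, and the dot product between two rotated vectors depends on their position difference rather than on absolute positions.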

source

Onion.STRINGRoPE — Type

STRINGRoPE(head_dim::Int, n_heads::Int, d_coords::Int; init_scale=0.001f0, theta=10000f0)

Multidimensional, learnable Rotary Position Embedding (RoPE) from Schenck et al. (2025), "Learning the RoPEs: Better 2D and 3D Position Encodings with STRING".

Example

head_dim = 64
n_heads = 8
d_coords = 3
rope = STRINGRoPE(head_dim, n_heads, d_coords)

x = rand(Float32, head_dim, 16, n_heads, 2)      # (head_dim, seq_len, n_heads, batch)
positions = rand(Float32, d_coords, 16, 2)       # (d_coords, seq_len, batch)
x_rot = rope(x, positions)

Note

As this needs to be learnable, it should preferably be used with STRINGTransformerBlock or AdaSTRINGTransformerBlock.

source

Onion.StarGLU — Type

StarGLU(dim::Int, ff_hidden_dim::Int; act=Flux.swish)

Gated Linear Unit with flexible activation function (default: swish, making it a SwiGLU layer as used in Llama3).

l = StarGLU(6, 8)
h = randn(Float32, 6, 10, 1)
h = l(h)
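
The gating pattern behind a GLU of this kind can be sketched in plain Julia. The weight names `W1`, `W2`, `W3` and the absence of biases are assumptions borrowed from the usual SwiGLU formulation, not taken from Onion's source.

```julia
# Hypothetical sketch, not Onion's StarGLU: weight names and the lack of biases are assumptions.
swish_sketch(x) = x / (1 + exp(-x))          # x * sigmoid(x)
starglu_sketch(W1, W2, W3, h; act=swish_sketch) = W2 * (act.(W1 * h) .* (W3 * h))

dim, ff_hidden_dim = 6, 8
W1 = randn(Float32, ff_hidden_dim, dim)      # gate projection
W3 = randn(Float32, ff_hidden_dim, dim)      # value projection
W2 = randn(Float32, dim, ff_hidden_dim)      # output projection
h = randn(Float32, dim, 10)
out = starglu_sketch(W1, W2, W3, h)          # size (dim, 10)
```

Swapping `act` changes the GLU variant while the two-branch structure stays the same.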

source

Onion.TransformerBlock — Type

TransformerBlock(dim::Int, n_heads::Int, n_kv_heads::Int = n_heads, ff_hidden_dim = 4 * dim; norm_eps=1f-5, qkv_bias=false)

Transformer block for GQAttention (as in Llama3).

dim = 64
n_heads = 8
n_kv_heads = 4
seqlen = 10

rope = RoPE(dim ÷ n_heads, 1000)
t = TransformerBlock(dim, n_heads, n_kv_heads)

h = randn(Float32, dim, seqlen, 1)

#Use without a mask:
h = t(h, 1, rope[1:seqlen])

#Use with a causal mask:
mask = Onion.causal_mask(h)
h = t(h, 1, rope[1:seqlen], mask)

source

Onion.VirtualWidthNetwork — Type

VirtualWidthNetwork(layer, n, m)

Wrap a sublayer (e.g. attention or FFN) with the static form of Generalized Hyper-Connections (GHC).

Given a backbone hidden size $D$, the over-width representation is partitioned into n segments, while the backbone operates on only m segments. This layer:

compresses an over-width state of size $\frac{n}{m}D$ down to backbone width $D$ by projecting down the n segments into m segments,
applies the wrapped layer at backbone width,
expands the backbone output back to n segments,
carries forward the previous over-width state with a projection from n segments to n segments, adding it to the expanded backbone output.

See: "Virtual Width Networks"
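
The four steps listed above can be sketched for a single over-width vector in plain Julia. This is a reading of the description, not Onion's implementation: `vwn_sketch` and the mixing matrices `A` (m×n), `B` (n×m), and `C` (n×n) are hypothetical names, and the real layer may parameterize or initialize them differently.

```julia
# Hypothetical sketch, not Onion's VirtualWidthNetwork.
# h has length (n/m) * D and is viewed as n segments of length D ÷ m.
function vwn_sketch(layer, A, B, C, h::AbstractVector)
    m, n = size(A)
    H = reshape(h, :, n)            # columns are the n segments
    x = vec(H * A')                 # compress n segments into m: backbone width D
    y = layer(x)                    # wrapped layer runs at backbone width
    Y = reshape(y, :, m)
    return vec(Y * B' + H * C')     # expand to n segments, then add the carried state
end

n, m, seg = 4, 2, 3                 # backbone width D = m * seg = 6, over-width = n * seg = 12
A = randn(Float32, m, n)
B = randn(Float32, n, m)
C = randn(Float32, n, n)
h = randn(Float32, n * seg)
out = vwn_sketch(x -> 2 .* x, A, B, C, h)   # length n * seg, same as h
```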

source

Onion.chunk — Method

chunk(x, q::FSQ, chunk_size)

Make a long quantized sequence shorter and wider (to make it more transformer-friendly). x may have a batch dimension. Contiguous chunks of chunk_size are recoded as a single integer in the product space q.l^chunk_size.

source

Onion.sample_uniform_causal_chunk_mask — Method

sample_uniform_causal_chunk_mask(x, chunk_size)

Generate a mask of all the "chunks" towards the end of the sequence, separately for each batch. The mask dims will be length-by-batch, but contiguous chunks of chunk_size will always be masked together.

source

Onion.unchunk — Method

unchunk(x, q::FSQ)

Take a sequence that has been chunked, and expand it back to the original length. x == unchunk(chunk(x,q),q) should be true.
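
The round trip can be illustrated with a mixed-radix encoding in plain Julia. `encode_chunk` and `decode_chunk` are hypothetical helpers, and the 0-based digit convention is an assumption, so Onion's integer coding may differ.

```julia
# Hypothetical helpers, not Onion's chunk/unchunk.
# Digits are assumed to be 0-based (0:l-1); the library's integer coding may differ.
encode_chunk(digits::AbstractVector{<:Integer}, l::Integer) =
    sum(d * l^(i - 1) for (i, d) in enumerate(digits))

decode_chunk(code::Integer, l::Integer, chunk_size::Integer) =
    [(code ÷ l^(i - 1)) % l for i in 1:chunk_size]

l, chunk_size = 5, 3
seq = [4, 0, 2, 1, 1, 3]                                    # length divisible by chunk_size
chunks = [seq[i:i+chunk_size-1] for i in 1:chunk_size:length(seq)]
codes = [encode_chunk(c, l) for c in chunks]                # one integer per chunk, each below l^chunk_size
restored = reduce(vcat, [decode_chunk(c, l, chunk_size) for c in codes])
```

The sequence gets `chunk_size` times shorter while the vocabulary grows from `l` to `l^chunk_size`.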

source

Onion.UNet.Bottleneck — Type

Bottleneck(channels::Int; time_emb=false, emb_dim=256, dropout=0.0, activation=relu)

A bottleneck block for UNet architecture with optional time embeddings and dropout.

Arguments

channels::Int: Number of input and output channels
time_emb=false: Whether to use time embeddings
emb_dim=256: Dimension of time embeddings
dropout=0.0: Dropout probability (0.0 means no dropout)
activation=relu: Activation function to use

Examples

bn = Onion.UNet.Bottleneck(256, time_emb=true, emb_dim=256, dropout=0.2)
h = randn(Float32, 8, 8, 256, 1)
t = randn(Float32, 256, 1)
h = bn(h, t)

source

Onion.UNet.DecoderBlock — Type

DecoderBlock(in_channels::Int, out_channels::Int; time_emb=false, emb_dim=256, dropout=0.0, activation=relu)

A decoder block for UNet architecture with optional time embeddings and dropout.

Arguments

in_channels::Int: Number of input channels
out_channels::Int: Number of output channels
time_emb=false: Whether to use time embeddings
emb_dim=256: Dimension of time embeddings
dropout=0.0: Dropout probability (0.0 means no dropout)
activation=relu: Activation function to use

Examples

dec = Onion.UNet.DecoderBlock(256, 128, time_emb=true, emb_dim=256, dropout=0.1)
h = randn(Float32, 8, 8, 256, 1)
skip = randn(Float32, 16, 16, 128, 1)
t = randn(Float32, 256, 1)
h = dec(h, skip, t)

source

Onion.UNet.EncoderBlock — Type

EncoderBlock(in_channels::Int, out_channels::Int; time_emb=false, emb_dim=256, dropout=0.0, activation=relu)

An encoder block for UNet architecture with optional time embeddings and dropout.

Arguments

in_channels::Int: Number of input channels
out_channels::Int: Number of output channels
time_emb=false: Whether to use time embeddings
emb_dim=256: Dimension of time embeddings
dropout=0.0: Dropout probability (0.0 means no dropout)
activation=relu: Activation function to use

Examples

enc = Onion.UNet.EncoderBlock(3, 64, time_emb=true, emb_dim=256, dropout=0.1)
h = randn(Float32, 32, 32, 3, 1)
t = randn(Float32, 256, 1)
skip, h = enc(h, t)

source

Onion.UNet.FlexibleUNet — Type

FlexibleUNet(;
    in_channels=3,
    out_channels=3,
    depth=3,
    base_channels=64,
    channel_multipliers=[1, 2, 4],
    time_embedding=false,
    num_classes=0,
    embedding_dim=128,
    time_emb_dim=256,
    dropout=0.0,
    dropout_depth=0,
    activation=relu
)

A flexible UNet architecture with configurable depth and channel dimensions. Supports optional time and class embeddings for diffusion models and conditional generation.

Arguments

in_channels=3: Number of input channels
out_channels=3: Number of output channels
depth=3: Number of encoder/decoder blocks
base_channels=64: Base channel dimension (multiplied at each level)
channel_multipliers=[1, 2, 4]: Multipliers for channel dimensions at each level
time_embedding=false: Whether to use time embeddings
num_classes=0: Number of class labels for conditional generation
embedding_dim=128: Dimension for class embeddings
time_emb_dim=256: Dimension for time embeddings
dropout=0.0: Dropout probability to apply to inner layers
dropout_depth=0: Number of layers to apply dropout to, starting from the innermost layers (0 means no dropout). Maximum value is 1+depth (bottleneck + all encoding/decoding levels)
activation=relu: Activation function to use throughout the network

Examples

# Basic model without dropout
model = Onion.UNet.FlexibleUNet(
    in_channels=3,
    out_channels=3,
    depth=4,
    base_channels=32,
    channel_multipliers=[1, 2, 4, 8],
    time_embedding=true
)

# Model with dropout applied to the 3 innermost layers
model = Onion.UNet.FlexibleUNet(
    in_channels=3,
    out_channels=3,
    depth=4,
    base_channels=32,
    channel_multipliers=[1, 2, 4, 8],
    time_embedding=true,
    dropout=0.2,
    dropout_depth=3
)

x = randn(Float32, 32, 32, 3, 1)
t = randn(Float32, 1)
labels = [5]
y = model(x, t, labels)

source

Onion.UNet.GaussianFourierProjection — Type

GaussianFourierProjection(embed_dim::Int, scale::T=32.0f0)

Creates a Gaussian Fourier feature projection for time embeddings. Used in diffusion models.

Arguments

embed_dim::Int: Embedding dimension. Should be even.
scale::T=32.0f0: Scaling factor for the random weights.
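
A common form of this projection, sketched in plain Julia, draws fixed random frequencies and returns sine and cosine features. `FourierSketch` is a hypothetical type, and the `2π` factor and the sin-then-cos ordering are assumptions that may not match Onion's layout.

```julia
# Hypothetical sketch, not Onion's GaussianFourierProjection.
struct FourierSketch{T}
    W::Vector{T}                                  # embed_dim ÷ 2 fixed random frequencies
end
FourierSketch(embed_dim::Int, scale::T=32.0f0) where {T<:AbstractFloat} =
    FourierSketch(randn(T, embed_dim ÷ 2) .* scale)

function (f::FourierSketch{T})(t::AbstractVector) where {T}
    proj = T(2π) .* f.W .* t'                     # (embed_dim ÷ 2, batch)
    return vcat(sin.(proj), cos.(proj))           # (embed_dim, batch)
end

f = FourierSketch(256)
t = rand(Float32, 4)
emb = f(t)                                        # size (256, 4)
```

Half of the output rows are sines and half are cosines, which is why embed_dim should be even.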

source

Onion.UNet.ResidualBlock — Type

ResidualBlock(channels::Int; kernel_size=3, time_emb=false, emb_dim=256, dropout=0.0, activation=relu)

A ResNet-style residual block with optional time embeddings, dropout, and configurable activation.

Arguments

channels::Int: Number of input and output channels
kernel_size=3: Size of convolutional kernel
time_emb=false: Whether to use time embeddings
emb_dim=256: Dimension of time embeddings
dropout=0.0: Dropout probability (0.0 means no dropout)
activation=relu: Activation function to use (e.g., relu, swish, etc.)

Examples

# Basic block with dropout
rb = Onion.UNet.ResidualBlock(64, dropout=0.1)

# Block with time embeddings and custom activation
rb = Onion.UNet.ResidualBlock(64, time_emb=true, emb_dim=256, dropout=0.1, activation=swish)

# Usage
h = randn(Float32, 32, 32, 64, 1)
t = randn(Float32, 256, 1)
h = rb(h, t)

source

Onion.UNet.TimeEmbedding — Type

TimeEmbedding(embed_dim::Int, num_classes::Int, embedding_dim::Int)

Creates time and optional class embeddings for diffusion models.

Arguments

embed_dim::Int: Output dimension for time embeddings
num_classes::Int: Number of classes for conditional generation
embedding_dim::Int: Dimension for class embeddings

Examples

time_emb = Onion.UNet.TimeEmbedding(256, 10, 128)
t = randn(Float32, 16)
labels = rand(1:10, 16)
h = time_emb(t, labels)

source

Onion.Utils.:⊞ — Method

⊞(xs...)
a ⊞ b

Adds the arguments together, ignoring Nothings.

"⊞" can be typed by \boxplus<tab>

Examples

julia> using Onion.Utils

julia> 1 ⊞ 2
3

julia> 1 ⊞ nothing
1

julia> (rand(Float32, 10) .⊞ nothing) isa Vector{Float32}
true

source

Onion.Utils.cross_att_padding_mask — Method

cross_att_padding_mask(padmask, other_dim; T=Float32)

Takes a sequence-level padmask and a dimension other_dim and returns a cross-attention mask that is length-by-other_dim-by-batch. This prevents information flow from padded key positions to any query positions (but ignores padding in the query positions, because nothing should flow out of those).

Examples

julia> cross_att_padding_mask([1 1; 1 1; 1 0], 4)
3×4×2 Array{Float32, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

[:, :, 2] =
   0.0    0.0    0.0    0.0
   0.0    0.0    0.0    0.0
 -Inf   -Inf   -Inf   -Inf
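
The output above can be reproduced with a short plain-Julia sketch. `cross_mask_sketch` is a hypothetical re-implementation written from the documented example, not Onion's code.

```julia
# Hypothetical sketch, not Onion's implementation: additive mask, -Inf on rows of padded keys.
function cross_mask_sketch(padmask::AbstractMatrix, other_dim::Int; T=Float32)
    len, batch = size(padmask)
    mask = zeros(T, len, other_dim, batch)
    for b in 1:batch, k in 1:len
        if padmask[k, b] == 0
            mask[k, :, b] .= T(-Inf)      # no query may attend to this padded key
        end
    end
    return mask
end

mask = cross_mask_sketch([1 1; 1 1; 1 0], 4)   # size (3, 4, 2)
```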

source

Onion.Utils.falses_like — Method

falses_like(x::AbstractArray, [T=eltype(x)], [dims=size(x)])

Returns an array of falses of type Bool with an array type similar to x. The dimensions default to size(x).

falses_like(args...) is equivalent to like(false, Bool, args...)

source

Onion.Utils.glut — Method

glut(t::AbstractArray, d::Int, pos::Int)
glut(t::Real, d::Int, pos::Int) = t

glut adds singleton dimensions to the middle of an array. The resulting array will have d dimensions. pos is where to add the dimensions: pos=0 adds dims at the start, pos=1 after the first dimension, etc. If t is a scalar, it is returned unmodified (because scalars don't need to match dims to broadcast).

Typically when broadcasting x .* t, you would call something like glut(t, ndims(x), 1).
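
The described behavior amounts to a reshape that inserts singleton dimensions. The sketch below is a hypothetical re-implementation in plain Julia (`glut_sketch`), written from the description rather than from Onion's source.

```julia
# Hypothetical re-implementation for illustration, not Onion's code.
function glut_sketch(t::AbstractArray, d::Int, pos::Int)
    extra = ntuple(_ -> 1, d - ndims(t))                      # singleton dims to insert
    return reshape(t, size(t)[1:pos]..., extra..., size(t)[pos+1:end]...)
end
glut_sketch(t::Real, d::Int, pos::Int) = t

x = randn(Float32, 5, 10, 2)
t = randn(Float32, 5, 2)
tg = glut_sketch(t, ndims(x), 1)      # size (5, 1, 2), so it broadcasts against x
y = x .* tg
```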

source

Onion.Utils.like — Function

like(x::AbstractArray, array::DenseArray, T=eltype(x))

Like like(v, x::AbstractArray, args...), but an arbitrary AbstractArray, such as an AbstractRange, can be instantiated on device.

Examples

julia> like(1:5, rand(1))
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> like((1:5)', rand(1), Float32)
1×5 Matrix{Float32}:
 1.0  2.0  3.0  4.0  5.0

source

Onion.Utils.like — Method

like(v, x::AbstractArray, [T=eltype(x)], [dims=size(x)])

Returns an array of v (converted to type T) with an array type similar to x. The element type and dimensions default to eltype(x) and size(x).

like(v, x::AbstractArray, args...) is equivalent to fill!(similar(x, args...), v), but the function is marked as non-differentiable using ChainRulesCore.
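
The stated equivalence can be tried with Base functions alone. The comments name the corresponding like call as read from the signature above, so the argument order shown there is an assumption.

```julia
x = rand(Float32, 2, 3)

a = fill!(similar(x), 1)                   # corresponds to like(1, x): Float32, size (2, 3)
b = fill!(similar(x, Float64), 0)          # corresponds to like(0, x, Float64)
c = fill!(similar(x, Bool, (4,)), true)    # corresponds to like(true, x, Bool, (4,))
```

Because `similar` preserves the array type of `x`, the same pattern would allocate on the device when `x` is a GPU array.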

source

Onion.Utils.ofeltype — Method

ofeltype(v::Number, x::AbstractArray{T}) where T = convert(T, v)

Convert v to type T.

source

Onion.Utils.ones_like — Method

ones_like(x::AbstractArray, [T=eltype(x)], [dims=size(x)])

Returns an array of ones with an array type similar to x. The element type and dimensions default to eltype(x) and size(x).

ones_like(args...) is equivalent to like(true, args...)

source

Onion.Utils.self_att_padding_mask — Method

self_att_padding_mask(padmask; T=Float32)

Takes a sequence-level padmask (i.e. length-by-batch, where 0 indicates a padded position) and returns a (non-causal) self-attention mask that is length-by-length-by-batch and which prevents information flow from padded positions to unpadded positions.

Examples

julia> self_att_padding_mask([1 1; 1 1; 1 0])
3×3×2 Array{Float32, 3}:
[:, :, 1] =
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

[:, :, 2] =
   0.0    0.0  -Inf
   0.0    0.0  -Inf
 -Inf   -Inf     0.0
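
A plain-Julia sketch that reproduces the output above is given below. `self_mask_sketch` is hypothetical, and its rule (block a pair when exactly one of the two positions is padded) is inferred from this single example, so Onion's behavior with several padded positions may differ.

```julia
# Hypothetical sketch, not Onion's implementation.
# Assumed rule: -Inf wherever exactly one of the two positions is padded.
function self_mask_sketch(padmask::AbstractMatrix; T=Float32)
    len, batch = size(padmask)
    mask = zeros(T, len, len, batch)
    for b in 1:batch, j in 1:len, i in 1:len
        if (padmask[i, b] == 0) != (padmask[j, b] == 0)
            mask[i, j, b] = T(-Inf)
        end
    end
    return mask
end

mask = self_mask_sketch([1 1; 1 1; 1 0])   # size (3, 3, 2)
```

Keeping the diagonal at zero means a padded position can still attend to itself, which avoids a softmax over a column of only -Inf.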

source

Onion.Utils.trues_like — Method

trues_like(x::AbstractArray, [T=eltype(x)], [dims=size(x)])

Returns an array of trues of type Bool with an array type similar to x. The dimensions default to size(x).

trues_like(args...) is equivalent to like(true, Bool, args...)

source

Onion.Utils.zeros_like — Method

zeros_like(x::AbstractArray, [T=eltype(x)], [dims=size(x)])

Returns an array of zeros with an array type similar to x. The element type and dimensions default to eltype(x) and size(x).

zeros_like(args...) is equivalent to like(false, args...)

source