API Reference

horizontal_stride::Int: How far this move advances along the top (reference) sequence.
horizontal_phase::Int: Frame offset in the top sequence; helps track codon boundaries.

Reference to the bottom sequence (vertical axis)

vertical_stride::Int: How far this move advances along the bottom (query) sequence.
vertical_phase::Int: Frame offset in the bottom sequence.
extensionAble::Bool: Whether this move can be extended (e.g., for affine gap penalties).

Description

Moves define basic operations used in dynamic programming alignment: matches, mismatches, and gaps. Moves can preserve codon reading frames by keeping strides in multiples of 3 and matching phase positions.

Two helper constructors are provided:

Move(step, score, stride, phase, extensionAble) assumes reference (vertical) stride and phase
Move(step, score, extensionAble) for simple moves with unit stride and zero phase.

source

SeededAlignment.MoveSet — Type

MoveSet

Defines the full set of allowed alignment moves used during pairwise or multiple sequence alignment.

Fields

match_moves::Vector{Move}: Moves that represent matches or substitutions between nucleotides.
hor_moves::Vector{Move}: Moves that introduce gaps in the reference (horizontal) sequence.
vert_moves::Vector{Move}: Moves that introduce gaps in the query (vertical) sequence.

Description

A MoveSet groups the allowable Moves into categories used during dynamic programming alignment. Each type controls how the algorithm can transition between states, including nucleotide-level match and gap moves with customizable reading frame behavior.

Used by alignment algorithms such as seed_chain_align and msa_codon_align to control the scoring and allowed operations during alignment.

source

SeededAlignment.ScoreScheme — Type

ScoreScheme(; match_score=0.0, mismatch_score=0.5,extension_score=0.1,edge_ext_begin=true,edge_ext_end=true,kmerlength=21)

A struct for storing scoring parameters used for sequence alignment. Uses a keyword constructor.

Fields

match_score::Float64: Score (typically ≤ 0) awarded for matching nucleotide. Lower is better.
mismatch_score::Float64: Penalty for nucleotide mismatches. Higher values penalize substitutions more strongly.
extension_score::Float64: Cost to extend a gap (indel). Affects how gaps are penalized during alignment.
edge_ext_begin::Bool: If true, allows gaps to be extended at the beginning of sequences.
edge_ext_end::Bool: If true, allows gaps to be extended at the end of sequences.
kmerlength::Int64: Length of k-mers used for seeding alignments (if applicable). Ignored if no seeding is used.

Description

ScoreScheme defines how matches, mismatches, and gaps are scored during nucleotide-level sequence alignment.

This struct is typically passed to functions like seed_chain_align or msa_codon_align.

example

score_params = ScoreScheme(extension_score = 0.3, mismatch_score = 0.7) # (everything else will be keept at default values)

source

SeededAlignment.clean_alignment_readingframe — Method

clean_alignment_readingframe(aligned_ref::LongDNA{4},aligned_seq::LongDNA{4})

Takes a pairwise alignment of a reference (with known reading frame) and a sequence, and removes single indels which
don't respect the reference's reading frame. To clean a full codon alignment you need multiple pairwise alignments and can clean up 
single indels by broadcasting with "." 

# NOTE We assume the readingFrame is 0 mod 3 with sequences 0 indexed

source

SeededAlignment.msa_codon_align — Function

msa_codon_align(ref::LongDNA{4}, seqs::Vector{LongDNA{4}}, moveset::MoveSet, score_params::ScoreScheme)

Produces a fast codon aligment with consistent readingFrame.

Arguments

ref::LongDNA{4}: The reference DNA sequence, assumed to have an readingFrame which starts at its first nucleotide.
seqs::Vector{LongDNA{4}}: A vector of DNA sequences to align against the reference.
moveset::MoveSet: Defines the allowable alignment operations (e.g. match nucleotide, single indel, readingFrame respecting triple indel). Assumes moveset is choosen so some moves respect the readingFrame.
score_params::ScoreScheme: Scoring parameters for codon-aware alignment, including match/mismatch and gap penalties.

Returns

Vector{LongDNA{4}}: A vector of aligned sequences (including the reference), with codon-aware gaps that preserve reading frame consistency.

Description

Produces a fast multiple codon alignment based on a reference with known readingFrame, a moveset and a scoreScheme. This is done by computing a pairwise alignment with respect to the reference for each sequence and then cleaning up single indel noise.

Example

ref = LongDNA{4}("ATGACGTGA")  # Reference with known reading frame
seqs = [LongDNA{4}("ATGTCGTGA"), LongDNA{4}("ATGACGAGA")]
moveset = std_codon_moveset()
score_params = std_codon_scoring()

alignment = msa_codon_align(ref, seqs, moveset, score_params)

NOTE: The sequences (ungapped) might not be preserved by the alignment by the cleanup step. Some nucleotides might be inserted or removed.

source

SeededAlignment.nw_align — Method

nw_align(A::LongDNA{4},B::LongDNA{4},moveset::MoveSet,scoreScheme::ScoreScheme)

Takes two ungapped LongDNA{4} sequences and computes an optimal pairwise alignment 
with respect to the moveset and the scoreScheme.

source

SeededAlignment.pairwise_noisy_moveset — Method

pairwise_noisy_moveset()

source

SeededAlignment.read_fasta — Method

read_fasta(filepath::String)

Reads in a fasta file and returns a tuple of (seqnames, seqs).

source

SeededAlignment.seed_chain_align — Method

seed_chain_align(A::LongDNA{4}, B::LongDNA{4}, moveset::MoveSet, scoreScheme::ScoreScheme)

Perform pairwise alignment between two DNA sequences using a seeding strategy.

Arguments

A::LongDNA{4}: The first DNA sequence to align.
B::LongDNA{4}: The second DNA sequence to align.
moveset::MoveSet: Defines allowed alignment moves (e.g., match, mismatch, gap).
scoreScheme::ScoreScheme: Scoring scheme for e.g. matches, mismatches, and gaps.

Returns

Vector{LongDNA{4}}: A vector the two aligned sequences.

Description

This function performs a fast (sub-quadratic) pairwise alignment using seeding and chaining. It identifies high-scoring seed matches between A and B, chains them to build a candidate alignment skeleton, and fills gaps between seeds using dynamic programming guided by moveset and scoreScheme.

Example

A = LongDNA{4}("ACGTACGT")
B = LongDNA{4}("ACGTTGCA")
moveset = std_codon_moveset()
scoreScheme = std_codon_scoring()

alignment = seed_chain_align(A, B, moveset, scoreScheme)

source

SeededAlignment.std_codon_moveset — Method

std_codon_moveset()

Return a standard codon-aware MoveSet for sequence alignment.

Returns

MoveSet: Contains codon-aware match and gap moves.

Moves

Match: Move(1, 0.0)
Horizontal (gaps in reference sequence):
- Move(1, 2.0, 1, 0, 1, 0, false)
- Move(3, 2.0, 1, 0, 3, 0, true)
Vertical (gaps in query sequence):
- Move(1, 2.0, 1, 0, 1, 0, false)
- Move(3, 2.0, 1, 0, 3, 0, true)

Use this as a default MoveSet for codon-preserving alignments.

source

SeededAlignment.std_codon_scoring — Method

std_codon_scoring()

Return a standard codon-aware ScoreScheme for alignment.

Returns

ScoreScheme: Default scoring parameters for codon alignment.

Parameters

Match score: 0.0
Mismatch score: 0.3
Gap extension score: 0.1
Allow extension at sequence ends: true (both ends)
K-mer length for seeding: 21

Use this as a default ScoreScheme for codon-preserving alignments.

source

SeededAlignment.write_fasta — Method

write_fasta(filepath::String, sequences::Vector{LongDNA{4}}; seq_names = nothing)

Writes a fasta file from a vector of sequences, with optional seq_names.

source