API Reference
This section provides detailed documentation for all exported types, functions and constants in SeededAlignment.jl.
Core Methods
Utilities
Methods for reading and writing sequence data in FASTA format.
Core Types
- Move
- Moveset
- ScoringScheme
- LongDNA{4} - Array specialized for DNA from BioSequences.jl
Constants
Index
A complete alphabetical listing of all documented functions.
- SeededAlignment.STD_CODON_MOVESET
- SeededAlignment.STD_NOISY_MOVESET
- SeededAlignment.STD_SCORING
- SeededAlignment.Move
- SeededAlignment.Moveset
- SeededAlignment.ScoringScheme
- SeededAlignment.ScoringScheme
- SeededAlignment.clean_frameshifts
- SeededAlignment.clean_frameshifts
- SeededAlignment.msa_codon_align
- SeededAlignment.nw_align
- SeededAlignment.nw_align
- SeededAlignment.read_fasta
- SeededAlignment.seed_chain_align
- SeededAlignment.seed_chain_align
- SeededAlignment.write_fasta
- SeededAlignment.write_fasta
Index Docstrings
SeededAlignment.Move — Type
Move(ref::Bool, step_length::Int64, score::Float64, extendable::Bool)
Represents a gap move during alignment. A Move instance represents either an insertion or a deletion, depending on how it is passed to a Moveset, the collection of Move instances used by the alignment methods.
For example, if Move.step_length = 3 then that represents a gap of length 3 (either insertion or deletion).
Extended Help
Fields
- ref::Bool: Whether the move respects the coding reading frame
- step_length::Int64: Length of the gap in the alignment (must be 1, 2 or 3)
- score::Float64: Penalty for using the move, i.e. the cost of adding the gap to the alignment (must be < 0)
- extendable::Bool: Whether the move can be affinely extended by another move
Constructors
- Move(; ref::Bool=false, step_length::Int64, score::Float64, extendable::Bool=false): Keyword constructor for more easily constructing Moves. Has some default values but requires at least step_length and score to be provided.
- RefMove(; score::Float64): Constructor for a Move that respects the coding reading frame. Only requires the score argument. Returns: Move(ref=true, step_length=3, score=score, extendable=true)
- FrameshiftMove(; step_length::Int64, score::Float64, extendable::Bool=false): Constructor for a Move that causes frameshifts and breaks reading frame symmetry. Requires step_length and score; extendable defaults to false. Returns: Move(ref=false, step_length=step_length, score=score, extendable=extendable)
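The documented field constraints and constructor defaults can be illustrated with a small stand-alone sketch. SketchMove, sketch_refmove and sketch_frameshiftmove are hypothetical names invented for this illustration; whether the package itself validates its arguments in this way is an assumption, not something the docstring confirms.

```julia
# Hypothetical stand-in for the documented Move type; not the package source.
struct SketchMove
    ref::Bool
    step_length::Int
    score::Float64
    extendable::Bool
    function SketchMove(ref::Bool, step_length::Integer, score::Real, extendable::Bool)
        # Documented constraints: step_length is 1, 2 or 3 and score is negative.
        step_length in 1:3 || throw(ArgumentError("step_length must be 1, 2 or 3"))
        score < 0 || throw(ArgumentError("score must be negative"))
        return new(ref, step_length, score, extendable)
    end
end

# Keyword constructor mirroring the documented defaults (ref=false, extendable=false).
SketchMove(; ref::Bool=false, step_length::Integer, score::Real, extendable::Bool=false) =
    SketchMove(ref, step_length, score, extendable)

# Convenience constructors mirroring the documented return values.
sketch_refmove(; score::Real) =
    SketchMove(ref=true, step_length=3, score=score, extendable=true)
sketch_frameshiftmove(; step_length::Integer, score::Real, extendable::Bool=false) =
    SketchMove(ref=false, step_length=step_length, score=score, extendable=extendable)

codon_indel = sketch_refmove(score=-1.0)
println(codon_indel.step_length)  # prints 3
```

In this sketch a positive score such as sketch_refmove(score=1.0) raises an ArgumentError, which is one way the "must be < 0" requirement could be enforced.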
Examples
- codon moveset with no frameshifts
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions.
=#
Moveset(ref_insertions = (codon_indel,), ref_deletions = (codon_indel,))
- codon moveset with frameshift moves allowed
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
# represents frameshift causing insertion or deletion
frm_indel = FrameshiftMove(score=-1.5, step_length=1, extendable=true)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions and single nucleotide indels that are extendable.
=#
Moveset(ref_insertions = (codon_indel, frm_indel), ref_deletions = (codon_indel, frm_indel))
SeededAlignment.Moveset — Type
Moveset{X,Y}(; ref_insertions::NTuple{X,Move}, ref_deletions::NTuple{Y,Move})
Represents a collection of Move instances that are either insertions or deletions.
Extended Help
Fields
- vert_moves::NTuple{X,Move}: insertions relative to the reference, i.e. gap operations in the top/first provided sequence
- hor_moves::NTuple{Y,Move}: deletions relative to the reference, i.e. gap operations in the left/second provided sequence
Examples
- codon moveset with no frameshifts
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions.
=#
ms = Moveset(ref_insertions = (codon_indel,), ref_deletions = (codon_indel,))
- codon moveset with frameshift moves allowed
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
# represents frameshift causing insertion or deletion
frm_indel = FrameshiftMove(score=-1.5, step_length=1, extendable=true)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions and single nucleotide indels that are extendable.
=#
ms = Moveset(ref_insertions = (codon_indel, frm_indel), ref_deletions = (codon_indel, frm_indel))
SeededAlignment.ScoringScheme — Type
ScoringScheme(;
    extension_score::Float64=-0.3,
    kmer_length::Int64=18,
    edge_ext_begin::Bool=true,
    edge_ext_end::Bool=true,
    nucleotide_match_score::Float64 = 0.0,
    nucleotide_mismatch_score::Float64 = -0.8,
    nucleotide_score_matrix::Union{Nothing,Matrix{Float64}} = nothing,
    codon_match_bonus_score::Float64 = 6.0)
ScoringScheme defines how matches and mismatches, and in part gaps, are scored during sequence alignment. Beyond the alignment operations governed by Moveset, it covers everything else that can be customized about the alignment process.
This struct is typically passed to functions like seed_chain_align, nw_align and msa_codon_align.
Fields
- extension_score::Float64=-0.3: Penalty for affinely extending a gap that is already open.
- kmer_length::Int64=18: Length of the seeds used in seed_chain_align.
- edge_ext_begin=true: Whether to subsidize gaps at the beginning of the alignment.
- edge_ext_end=true: Whether to subsidize gaps at the end of the alignment.
- nucleotide_match_score = 0.0: Score awarded for matching nucleotides (has to be >= 0).
- nucleotide_mismatch_score = -0.8: Penalty for aligning distinct nucleotides (has to be < 0).
- nucleotide_score_matrix::Union{Nothing, Matrix{Float64}}: Optional custom matrix for scoring nucleotide substitutions.
- codon_match_bonus_score = 6.0: Score awarded for matching codons.
Example
score_params = ScoringScheme(extension_score = -0.2, nucleotide_mismatch_score = -0.7) # everything else is kept at its default value
A = LongDNA{4}("ATGATGATAA")
B = LongDNA{4}("ATGACCCGATAA")
seed_chain_align(A, B; scoring=score_params)
SeededAlignment.ScoringScheme — Method
ScoringScheme(;
    extension_score::Float64=-0.3,
    kmer_length::Int64=12,
    edge_ext_begin::Bool=true,
    edge_ext_end::Bool=true,
    nucleotide_match_score::Float64 = 0.0,
    nucleotide_mismatch_score::Float64 = -0.4,
    nucleotide_score_matrix::Union{Nothing,Matrix{Float64}} = nothing,
    codon_match_bonus_score::Float64 = 6.0)
Keyword constructor for ScoringScheme with helpful default parameters. Note that supplying a custom matrix for nucleotide_score_matrix might slow down some alignment methods.
Extended Help
The slowdown might occur because a ScoringScheme holding a Matrix{Float64}, which is dynamically sized, is allocated on the heap instead of the stack.
Arguments
- extension_score::Float64=-0.3: Penalty for affinely extending a gap that is already open.
- kmer_length::Int64=18: Length of the seeds used in seed_chain_align.
- edge_ext_begin=true: Whether to subsidize gaps at the beginning of the alignment.
- edge_ext_end=true: Whether to subsidize gaps at the end of the alignment.
- nucleotide_match_score = 0.0: Score awarded for matching nucleotides (has to be >= 0).
- nucleotide_mismatch_score = -0.8: Penalty for aligning distinct nucleotides (has to be < 0).
- nucleotide_score_matrix::Union{Nothing, Matrix{Float64}}: Optional custom matrix for scoring nucleotide substitutions.
- codon_match_bonus_score = 6.0: Score awarded for matching codons.
Returns
-ScoringScheme
Example
score_params = ScoringScheme(extension_score = -0.2, nucleotide_mismatch_score = -0.7) # everything else is kept at its default value
A = LongDNA{4}("ATGATGATAA")
B = LongDNA{4}("ATGACCCGATAA")
seed_chain_align(A, B; scoring=score_params)
SeededAlignment.clean_frameshifts — Method
clean_frameshifts(aligned_ref::LongDNA{4}, aligned_seq::LongDNA{4}; verbose::Bool=false)
(clean_frameshifts - cleans pairwise alignments of frameshifts)
Takes a global pairwise alignment of a CDS anchor/reference aligned_ref and a CDS aligned_seq that may contain frameshift errors, and removes frameshift mutations that do not respect the reference's reading frame in the alignment. This is done by removing insertions from the alignment or inserting ambiguous nucleotides into deletions.
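The cleaning rule described above can be sketched on plain strings. The function sketch_clean_pair below is a hypothetical illustration, not the package's implementation: it assumes that reference gap runs whose length is not a multiple of 3 are dropped entirely, and that any reference codon only partially deleted in the query has its gaps filled with N. These assumptions reproduce the three documented pairwise examples, but the package may handle other cases (for instance mixed-length insertions) differently.

```julia
# Hypothetical string-based sketch of frameshift cleaning; not the package source.
function sketch_clean_pair(ref::AbstractString, seq::AbstractString)
    length(ref) == length(seq) ||
        throw(ArgumentError("aligned sequences must have equal length"))
    r, s = collect(ref), collect(seq)
    n = length(r)
    keep = trues(n)
    # Step 1 (assumed rule): drop reference gap runs whose length is not a multiple of 3.
    i = 1
    while i <= n
        if r[i] == '-'
            j = i
            while j < n && r[j + 1] == '-'
                j += 1
            end
            (j - i + 1) % 3 != 0 && (keep[i:j] .= false)
            i = j + 1
        else
            i += 1
        end
    end
    r, s = r[keep], s[keep]
    # Step 2 (assumed rule): walk the reference codon by codon and fill
    # partially deleted codons in the query with the ambiguous nucleotide N.
    cols = findall(!=('-'), r)  # columns that hold a reference nucleotide
    for k in 1:3:length(cols)
        codon = cols[k:min(k + 2, end)]
        gaps = count(c -> s[c] == '-', codon)
        if 0 < gaps < length(codon)
            for c in codon
                s[c] == '-' && (s[c] = 'N')
            end
        end
    end
    return String(r), String(s)
end

cleaned_ref, cleaned_seq = sketch_clean_pair("ATGAACGTA", "AT-----TA")
println(cleaned_seq)  # prints ATN---NTA
```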
Extended Help
Arguments
- aligned_ref::LongDNA{4}: aligned, anchored, trusted CDS which decides the reading frame coordinates in the alignment
- aligned_seq::LongDNA{4}: aligned CDS (with possible frameshifts due to e.g. sequencing or annotation errors) which is aligned to ref and adopts its reading frame coordinates
- verbose::Bool: Whether to verbosely display what edits were made during the cleaning of frameshifts
Returns
Tuple{LongDNA{4},LongDNA{4}}: Cleaned global pairwise alignment that represents a protein alignment on a nucleotide level
Examples
1. insertion
ref: ATG-AACGTA -> cleaned_ref: ATGAACGTA
seq: ATGTAACGTA -> cleaned_seq: ATGAACGTA
2. deletion
ref: ATGAACGTA -> cleaned_ref: ATGAACGTA
seq: ATG-ACGTA -> cleaned_seq: ATGNACGTA
3. longer deletion
ref: ATGAACGTA -> cleaned_ref: ATGAACGTA
seq: AT-----TA -> cleaned_seq: ATN---NTA
SeededAlignment.clean_frameshifts — Method
clean_frameshifts(aligned_ref::LongDNA{4}, aligned_seqs::Vector{LongDNA{4}})
(clean_frameshifts - cleans multiple sequence alignments of frameshifts)
Cleans a multiple sequence alignment, provided one of the sequences is a reference sequence. This is done by projecting the multiple sequence alignment onto a collection of pairwise alignments relative to the reference aligned_ref, cleaning those, and then scaffolding the results to recover a cleaned multiple sequence alignment.
Extended Help
Arguments
- aligned_ref::LongDNA{4}: aligned, anchored, trusted CDS which decides the reading frame coordinates in the alignment
- aligned_seqs::Vector{LongDNA{4}}: aligned coding sequences (with possible frameshifts due to e.g. sequencing or annotation errors) which are aligned to ref and adopt its reading frame coordinates
- verbose::Bool: Whether to verbosely display what edits were made during the cleaning of frameshifts
Returns
cleaned_msa::Vector{LongDNA{4}}: a frameshift-free multiple sequence alignment
Example
aligned_seqs = Vector{LongDNA{4}}(undef, 4)
# original unclean msa
aligned_ref = LongDNA{4}("ATG---TTTCCCGGGT-AA")
aligned_seqs[1] = LongDNA{4}("-TG------CCCGGGT-A-")
aligned_seqs[2] = LongDNA{4}("ATGAAATTTCCCGGGT-AA")
aligned_seqs[3] = LongDNA{4}("ATGAAA----CCGGGT-AA")
aligned_seqs[4] = LongDNA{4}("ATG---TTTCCCGGGTTAA")
# produce clean msa
cleaned_msa = clean_frameshifts(aligned_ref, aligned_seqs)
#= results
cleaned_msa[1] = LongDNA{4}("ATG---TTTCCCGGGTAA") # ref sequence
cleaned_msa[2] = LongDNA{4}("NTG------CCCGGGTAN")
cleaned_msa[3] = LongDNA{4}("ATGAAATTTCCCGGGTAA")
cleaned_msa[4] = LongDNA{4}("ATGAAA---NCCGGGTAA")
cleaned_msa[5] = LongDNA{4}("ATG---TTTCCCGGGTAA")
=#
SeededAlignment.msa_codon_align — Method
msa_codon_align(ref::LongDNA{4}, seqs::Vector{LongDNA{4}}; moveset::Moveset=STD_CODON_MOVESET, scoring::ScoringScheme=STD_SCORING, codon_scoring_on::Bool=true)
Computes a visual global MSA (multiple sequence alignment) of coding sequences based on pairwise alignments to a trusted CDS reference ref, which is used as an anchor to determine the appropriate reading frame coordinates for the other coding sequences (which may contain frameshift errors). Possible frameshift errors in the pairwise alignments are cleaned up, and the cleaned alignments are then scaffolded into a multiple sequence alignment. This is done by left-stacking codon insertions relative to the reference.
Note that this doesn't qualify as a proper multiple sequence alignment in the traditional sense, since the aligned sequences are only scored against the reference sequence and not against each other.
Even so, it can still provide a useful visualization or approximation of a protein multiple sequence alignment on a nucleotide level.
Extended Help
Arguments
- ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
- seqs::Vector{LongDNA{4}}: Coding sequences (with possible frameshifts due to e.g. sequencing or annotation errors) which are aligned to ref and adopt its reading frame coordinates.
- moveset::Moveset=STD_CODON_MOVESET: Defines allowable alignment moves (e.g., codon insertions/deletions).
- scoring::ScoringScheme=STD_SCORING: The scoring scheme used during alignment.
- codon_scoring_on::Bool=true: Whether to apply additional scoring on codon-level
Returns
msa::Vector{LongDNA{4}}: a frameshift-free multiple sequence alignment
Example
- codon alignment with no frameshifts present in inputs
ref = LongDNA{4}("ATGTTTCCCGGGTAA")
seq1 = LongDNA{4}("ATGTTTTTTCCCGGGTAA")
seq2 = LongDNA{4}("ATGATGTTTTTTCCCGGGTAAGGG")
seq3 = LongDNA{4}("ATGCCCGGG")
msa = msa_codon_align(ref, [seq1,seq2,seq3])
#= alignment results:
msa[1] == LongDNA{4}("---ATG---TTTCCCGGGTAA---")
msa[2] == LongDNA{4}("---ATGTTTTTTCCCGGGTAA---")
msa[3] == LongDNA{4}("ATGATGTTTTTTCCCGGGTAAGGG")
msa[4] == LongDNA{4}("---ATG------CCCGGG------")
=#
SeededAlignment.nw_align — Method
nw_align(A::LongDNA{4}, B::LongDNA{4}; moveset::Moveset=STD_NOISY_MOVESET, scoring::ScoringScheme=STD_SCORING)
(Needleman-Wunsch wrapper - DE-NOVO)
Computes an optimal global pairwise alignment of the two ungapped DNA sequences A and B. This is done purely at the nucleotide level, without any awareness of protein encoding.
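For readers unfamiliar with the underlying algorithm, here is a minimal textbook Needleman-Wunsch sketch on plain strings. It is not the package's implementation: sketch_nw is a hypothetical name, it uses a single linear gap penalty instead of a Moveset, and it ignores affine extension, edge subsidies and codon scoring. The default values for match, mismatch and gap are borrowed from the documented defaults purely for illustration.

```julia
# Textbook Needleman-Wunsch with a linear gap penalty; hypothetical sketch only.
function sketch_nw(a::AbstractString, b::AbstractString;
                   match=0.0, mismatch=-0.8, gap=-2.0)
    x, y = collect(a), collect(b)
    n, m = length(x), length(y)
    sub(i, j) = x[i] == y[j] ? match : mismatch
    # S[i+1, j+1] holds the best score for aligning x[1:i] with y[1:j].
    S = zeros(Float64, n + 1, m + 1)
    for i in 1:n
        S[i + 1, 1] = S[i, 1] + gap
    end
    for j in 1:m
        S[1, j + 1] = S[1, j] + gap
    end
    for i in 1:n, j in 1:m
        S[i + 1, j + 1] = max(S[i, j] + sub(i, j),   # match or mismatch
                              S[i, j + 1] + gap,     # gap in b
                              S[i + 1, j] + gap)     # gap in a
    end
    # Traceback from the bottom-right corner to recover one optimal alignment.
    ra, rb = Char[], Char[]
    i, j = n, m
    while i > 0 || j > 0
        if i > 0 && j > 0 && S[i + 1, j + 1] == S[i, j] + sub(i, j)
            push!(ra, x[i]); push!(rb, y[j]); i -= 1; j -= 1
        elseif i > 0 && (j == 0 || S[i + 1, j + 1] == S[i, j + 1] + gap)
            push!(ra, x[i]); push!(rb, '-'); i -= 1
        else
            push!(ra, '-'); push!(rb, y[j]); j -= 1
        end
    end
    return String(reverse(ra)), String(reverse(rb)), S[n + 1, m + 1]
end

a_aln, b_aln, score = sketch_nw("ACGT", "ACT")
println(a_aln, " / ", b_aln, " score=", score)  # prints ACGT / AC-T score=-2.0
```

The table fill costs time and memory proportional to the product of the two sequence lengths, which is the cost that seeding heuristics aim to reduce.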
Extended Help
Arguments
- A::LongDNA{4}: 1st DNA sequence to be aligned
- B::LongDNA{4}: 2nd DNA sequence to be aligned
- moveset::Moveset=STD_NOISY_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
- scoring::ScoringScheme=STD_SCORING: Defines alignment scoring together with moveset
Returns
Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of the pairwise alignment of DNA sequences A and B.
Examples
A = LongDNA{4}("AATGCTC")
B = LongDNA{4}("ACATGTC")
# produce alignment
alignment = nw_align(A, B)
println(alignment)
#= resulting alignment:
alignment = (
LongDNA{4}("A-ATGCTC"),
LongDNA{4}("ACATG-TC")
)
=#
SeededAlignment.nw_align — Method
nw_align(;
    ref::LongDNA{4},
    query::LongDNA{4},
    moveset::Moveset = STD_CODON_MOVESET,
    scoring::ScoringScheme = STD_SCORING,
    codon_scoring_on::Bool = false,
    do_clean_frameshifts::Bool = true,
    verbose::Bool = false)
(Needleman-Wunsch wrapper - CODING given trusted CDS anchor/reference)
Produces an optimal global pairwise alignment of two ungapped CDS (Coding DNA Sequences) ref and query by using ref as an anchor to determine the appropriate reading frame coordinates for query.
Extended Help
Arguments
- ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
- query::LongDNA{4}: CDS (with possible frameshifts due to e.g. sequencing or annotation errors) which is aligned to ref and adopts its reading frame coordinates.
- moveset::Moveset = STD_CODON_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
- scoring::ScoringScheme = STD_SCORING: Defines alignment scoring together with moveset
- codon_scoring_on::Bool = false: Whether to apply additional scoring on codon-level
- do_clean_frameshifts::Bool = true: Whether to clean the alignment output of gaps which cause frameshifts (IMPORTANT: produces a protein alignment on a nucleotide level)
- verbose::Bool = false: Whether to verbosely display what edits were made during the cleaning of frameshifts.
Returns
Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of the pairwise alignment of DNA sequences ref and query.
Note that this represents a protein alignment on a nucleotide level if (do_clean_frameshifts == true).
Examples
anchor_CDS = LongDNA{4}("ATGCCAGTA")
# untrusted_CDS may contain some frameshift errors due to e.g. sequencing or annotation errors.
untrusted_CDS = LongDNA{4}("ATGTA")
# frameshift errors are removed from the cleaned alignment
cleaned_CDS_alignment = nw_align(ref=anchor_CDS, query=untrusted_CDS, do_clean_frameshifts=true)
println(cleaned_CDS_alignment)
#= resulting alignment:
cleaned_CDS_alignment = (
LongDNA{4}("ATGCCAGTA"),
LongDNA{4}("ATG---NTA")
)
Here 'N' denotes an ambiguous nucleotide.
=#
SeededAlignment.read_fasta — Method
read_fasta(filepath::String)
Reads in a FASTA file and returns a tuple of (seqnames, seqs).
SeededAlignment.seed_chain_align — Method
seed_chain_align(A::LongDNA{4}, B::LongDNA{4}; moveset::Moveset=STD_NOISY_MOVESET, scoring::ScoringScheme=STD_SCORING)
(SeededAlignment wrapper - DE-NOVO)
Computes a heuristically guided global pairwise alignment of two ungapped DNA sequences A and B based on a seeding heuristic. The seeds are then joined together by computing an optimal partial alignment between seeds with the Needleman-Wunsch algorithm (nw_align). Optimal in this context means optimal with respect to the chosen Moveset and ScoringScheme.
The advantage of this method is that it is much faster than nw_align and produces similar results for most use cases.
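The seeding step can be pictured as finding exact k-mer matches shared by both sequences. The sketch below shows only that step on plain strings; sketch_kmer_seeds is a hypothetical helper, and how the package actually selects, filters and chains its seeds is not specified here, so those parts are left out.

```julia
# Hypothetical sketch of exact k-mer seeding; chaining and gap filling are omitted.
function sketch_kmer_seeds(a::String, b::String, k::Int)
    # Index every k-mer of `a` by its start positions (ASCII input assumed).
    index = Dict{String,Vector{Int}}()
    for i in 1:(length(a) - k + 1)
        push!(get!(index, a[i:i + k - 1], Int[]), i)
    end
    # Scan `b` and report (position in a, position in b) for every shared k-mer.
    seeds = Tuple{Int,Int}[]
    for j in 1:(length(b) - k + 1)
        for i in get(index, b[j:j + k - 1], Int[])
            push!(seeds, (i, j))
        end
    end
    return seeds
end

seeds = sketch_kmer_seeds("AATGCTC", "ACATGTC", 3)
println(seeds)  # prints [(2, 3)]
```

Each seed pins a stretch of the alignment, so the quadratic dynamic programming only has to run on the regions between consecutive seeds. This is a plausible reading of why the method is faster than a full nw_align, under the assumption that seeds are reasonably dense.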
Extended Help
Arguments
- A::LongDNA{4}: 1st DNA sequence to be aligned
- B::LongDNA{4}: 2nd DNA sequence to be aligned
- moveset::Moveset=STD_NOISY_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
- scoring::ScoringScheme=STD_SCORING: Defines alignment scoring together with moveset
Returns
Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of the pairwise alignment of DNA sequences A and B.
Example
# input sequences with no reading frame assumed
A = LongDNA{4}("AATGCTC")
B = LongDNA{4}("ACATGTC")
# produce alignment
alignment = seed_chain_align(A, B)
println(alignment)
#= resulting alignment
alignment = (
LongDNA{4}("A-ATGCTC"),
LongDNA{4}("ACATG-TC")
)
=#
SeededAlignment.seed_chain_align — Method
seed_chain_align(;
    ref::LongDNA{4},
    query::LongDNA{4},
    moveset::Moveset = STD_CODON_MOVESET,
    scoring::ScoringScheme = STD_SCORING,
    codon_scoring_on::Bool = false,
    do_clean_frameshifts::Bool = true,
    verbose::Bool = false)
(SeededAlignment wrapper - CODING given trusted CDS anchor/reference)
Computes a heuristically guided global pairwise alignment of two ungapped CDS (Coding DNA Sequences) ref and query by using ref as an anchor to determine the appropriate reading frame coordinates for query and using a seeding heuristic for speedup. The seeds are then joined together by computing an optimal partial alignment between seeds with the Needleman-Wunsch algorithm (nw_align). Optimal in this context means optimal with respect to the chosen Moveset and ScoringScheme.
The advantage of this method is that it is much faster than nw_align and produces similar results for most use cases.
Extended Help
Arguments
- ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
- query::LongDNA{4}: CDS (with possible frameshifts due to e.g. sequencing errors) which is aligned to ref and adopts its reading frame coordinates.
- moveset::Moveset = STD_CODON_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
- scoring::ScoringScheme = STD_SCORING: Defines alignment scoring together with moveset
- codon_scoring_on::Bool = false: Whether to apply additional scoring on codon-level
- do_clean_frameshifts::Bool = true: Whether to clean the alignment output of gaps which cause frameshifts; this produces a protein alignment on a nucleotide level.
- verbose::Bool = false: Whether to verbosely display what edits were made during the cleaning of frameshifts.
Returns
Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of the pairwise alignment of DNA sequences ref and query.
Note that this represents a protein alignment on a nucleotide level if (do_clean_frameshifts == true).
Example
anchor_CDS = LongDNA{4}("ATGCCAGTA")
# untrusted_CDS may contain some frameshift errors due to e.g. sequencing or annotation errors.
untrusted_CDS = LongDNA{4}("ATGTA")
# frameshift errors are removed from the cleaned alignment
cleaned_CDS_alignment = seed_chain_align(ref=anchor_CDS, query=untrusted_CDS, do_clean_frameshifts=true)
println(cleaned_CDS_alignment)
#= resulting alignment:
cleaned_CDS_alignment = (
LongDNA{4}("ATGCCAGTA"),
LongDNA{4}("ATG---NTA")
)
Here 'N' denotes an ambiguous nucleotide.
=#
SeededAlignment.write_fasta — Method
write_fasta(filepath::String, sequences::Union{Vector{LongDNA{4}}, Vector{LongAA}, Vector{String}}; seq_names = nothing)
Writes a FASTA file from a vector of sequences, with optional seq_names.
SeededAlignment.write_fasta — Method
write_fasta(filepath::String, sequences::Union{NTuple{N,LongDNA{4}}, NTuple{N,LongAA}, NTuple{N,String}}; seq_names = nothing)
Writes a FASTA file from a Tuple of sequences, with optional seq_names.
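To make the FASTA layout concrete, here is a small round-trip sketch on an in-memory buffer. The helpers sketch_write_fasta and sketch_read_fasta are hypothetical and work on plain strings; they are not the package's read_fasta and write_fasta, and the fallback names used when seq_names is nothing are an assumption made for this illustration.

```julia
# Hypothetical FASTA helpers on plain strings; not the package's own functions.
function sketch_write_fasta(io::IO, seqs::Vector{String}; seq_names=nothing)
    # Assumed fallback: number the records when no names are given.
    headers = seq_names === nothing ? ["S$i" for i in eachindex(seqs)] : seq_names
    for (header, seq) in zip(headers, seqs)
        println(io, ">", header)  # header line starts with '>'
        println(io, seq)          # sequence on the following line
    end
end

function sketch_read_fasta(io::IO)
    headers, seqs = String[], String[]
    for raw in eachline(io)
        line = strip(raw)
        isempty(line) && continue
        if startswith(line, ">")
            push!(headers, String(line[2:end]))
            push!(seqs, "")
        else
            isempty(seqs) && throw(ArgumentError("sequence data before first header"))
            seqs[end] *= line  # multi-line records are concatenated
        end
    end
    return headers, seqs
end

io = IOBuffer()
sketch_write_fasta(io, ["ATGCCAGTA", "ATG---NTA"]; seq_names=["ref", "query"])
seekstart(io)
hdrs, seqs = sketch_read_fasta(io)
println(hdrs)  # prints ["ref", "query"]
```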
SeededAlignment.STD_CODON_MOVESET — Constant
STD_CODON_MOVESET
Constant that represents the default codon moveset with frameshift moves allowed.
default parameter values
# the tuple becomes both insertions and deletions in the moveset
const STD_CODON_MOVESET = Moveset(
    (
        Move(ref=false, step_length=1, score=-2.0, extendable=true),
        Move(ref=true, step_length=3, score=-1.0, extendable=true)
    )
)
SeededAlignment.STD_NOISY_MOVESET — Constant
STD_NOISY_MOVESET
Constant that represents a codon-oblivious moveset that favors gaps of length 3.
default parameter values
# the tuple becomes both insertions and deletions in the moveset
const STD_NOISY_MOVESET = Moveset(
    (
        Move(ref=false, step_length=1, score=-2.0, extendable=true),
        Move(ref=true, step_length=3, score=-1.0, extendable=true)
    )
)
SeededAlignment.STD_SCORING — Constant
STD_SCORING
Constant ScoringScheme that stores the default scoring parameters used in alignment methods.
default parameter values
const STD_SCORING = ScoringScheme(
    extension_score=-0.3,
    kmer_length=12,
    edge_ext_begin=true,
    edge_ext_end=true,
    nucleotide_mismatch_score = -0.8,
    nucleotide_match_score = 0.0,
    codon_match_bonus_score = 6.0
)