API Reference

This section provides detailed documentation for all exported types, functions and constants in SeededAlignment.jl.

Core Methods

Utilities

Methods for reading and writing sequence data in FASTA format.

Core Types

Constants

Index

A complete alphabetical listing of all documented functions.

Index Docstrings

SeededAlignment.Move - Type
Move(ref::Bool, step_length::Int64, score::Float64, extendable::Bool)

Represents a gap move during alignment. Whether a Move instance acts as an insertion or a deletion depends on how it is passed to a Moveset, the collection of Move instances used by the alignment methods.

For example, a Move with step_length = 3 represents a gap of length 3 (either an insertion or a deletion).

Extended Help

Fields

  • ref::Bool: Whether the move respects the coding reading frame.
  • step_length::Int64: Length of the gap in the alignment (must be 1, 2 or 3).
  • score::Float64: Penalty for using the move, i.e. the cost of adding the gap to the alignment (must be < 0).
  • extendable::Bool: Whether the move can be affinely extended by another move.

Constructors

  • Move(; ref::Bool=false, step_length::Int64, score::Float64, extendable::Bool=false): Keyword constructor for building a Move more easily. ref and extendable have default values; step_length and score must be provided.
  • RefMove(; score::Float64): Constructor for a Move that respects the coding reading frame. Only requires the score argument. Returns: Move(ref=true, step_length=3, score=score, extendable=true)
  • FrameshiftMove(; step_length::Int64, score::Float64, extendable::Bool=false): Constructor for a Move that causes a frameshift and breaks reading frame symmetry. Requires step_length and score; extendable defaults to false. Returns: Move(ref=false, step_length=step_length, score=score, extendable=extendable)

Examples

  1. codon moveset with no frameshifts
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions. 
=#
Moveset(ref_insertions = (codon_indel,), ref_deletions = (codon_indel,))
  2. codon moveset with frameshift moves allowed
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
# represents frameshift causing insertion or deletion
frm_indel = FrameshiftMove(score=-1.5, step_length=1, extendable=true)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions and single nucleotide indels that are extendable. 
=#
Moveset(ref_insertions = (codon_indel,frm_indel), ref_deletions = (codon_indel, frm_indel))
source
SeededAlignment.Moveset - Type
Moveset{X,Y}(; ref_insertions::NTuple{X,Move}, ref_deletions::NTuple{Y,Move})

Represents a collection of Move instances, each of which acts as either an insertion or a deletion.

Extended Help

Fields

  • vert_moves::NTuple{X,Move}: Insertions relative to the reference, i.e. gap operations in the top/first provided sequence.
  • hor_moves::NTuple{Y,Move}: Deletions relative to the reference, i.e. gap operations in the left/second provided sequence.

Examples

  1. codon moveset with no frameshifts
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions. 
=#
ms = Moveset(ref_insertions = (codon_indel,), ref_deletions = (codon_indel,))
  2. codon moveset with frameshift moves allowed
# represents codon insertion or codon deletion
codon_indel = RefMove(score=-1.0)
# represents frameshift causing insertion or deletion
frm_indel = FrameshiftMove(score=-1.5, step_length=1, extendable=true)
#= passing to moveset solidifies what the allowed alignment operations are,
namely single codon insertions and deletions and single nucleotide indels that are extendable. 
=#
ms = Moveset(ref_insertions = (codon_indel,frm_indel), ref_deletions = (codon_indel, frm_indel))
source
SeededAlignment.ScoringScheme - Type
ScoringScheme(;
	extension_score::Float64=-0.3,
	kmer_length::Int64=18,
	edge_ext_begin::Bool=true,
	edge_ext_end::Bool=true,
	nucleotide_match_score::Float64 = 0.0,
	nucleotide_mismatch_score::Float64 = -0.8,
	nucleotide_score_matrix::Union{Nothing,Matrix{Float64}} = nothing,
	codon_match_bonus_score::Float64 = 6.0
)

ScoringScheme defines how matches and mismatches are scored during sequence alignment and, in part, how gaps are scored. Apart from the alignment operations governed by Moveset, it covers everything that can be customized about the alignment process.

This struct is typically passed to functions like seed_chain_align, nw_align and msa_codon_align.

Fields

  • extension_score::Float64=-0.3: Penalty for affinely extending a gap that is already open.
  • kmer_length::Int64=18: Length of the seeds used in seed_chain_align.
  • edge_ext_begin=true: Whether to subsidize gaps at the beginning of the alignment.
  • edge_ext_end=true: Whether to subsidize gaps at the end of the alignment.
  • nucleotide_match_score = 0.0: Score awarded for matching nucleotides (has to be >= 0).
  • nucleotide_mismatch_score = -0.8: Penalty for aligning distinct nucleotides (has to be < 0).
  • nucleotide_score_matrix::Union{Nothing, Matrix{Float64}}: Optional custom matrix for scoring nucleotide substitutions.
  • codon_match_bonus_score = 6.0: Score awarded for matching codons.
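To make the interplay of nucleotide_match_score, nucleotide_mismatch_score and nucleotide_score_matrix more concrete, here is a minimal sketch on plain strings. The helpers nucleotide_score and gapless_score, and the A, C, G, T index order assumed for the matrix, are hypothetical and only for illustration; they are not part of SeededAlignment.jl and may not match how the package looks up scores.

```julia
# Hypothetical helpers for illustration only; not part of SeededAlignment.jl.
# Assumed row/column order of a custom score matrix: A, C, G, T.
const NUC_INDEX = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

# Score one aligned pair of nucleotides. Without a matrix, fall back to the
# flat match/mismatch scores; with a matrix, look the pair up instead.
function nucleotide_score(a::Char, b::Char;
                          match_score=0.0, mismatch_score=-0.8, score_matrix=nothing)
    if score_matrix === nothing
        return a == b ? match_score : mismatch_score
    end
    return score_matrix[NUC_INDEX[a], NUC_INDEX[b]]
end

# Sum the per-column scores of two equally long, gap-free sequences.
function gapless_score(x::String, y::String; kwargs...)
    length(x) == length(y) || throw(ArgumentError("sequences must have equal length"))
    return sum(nucleotide_score(a, b; kwargs...) for (a, b) in zip(x, y); init=0.0)
end
```

With the flat scores, a single mismatch among otherwise matching columns contributes only the mismatch penalty, since the default match score is 0.0.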

Example

score_params = ScoringScheme(extension_score = -0.2, nucleotide_mismatch_score = -0.7) # everything else is kept at its default value
A = LongDNA{4}("ATGATGATAA")
B = LongDNA{4}("ATGACCCGATAA")
seed_chain_align(A, B; scoring=score_params)
source
SeededAlignment.ScoringScheme - Method
ScoringScheme(;
	extension_score::Float64=-0.3,
	kmer_length::Int64=12,
	edge_ext_begin::Bool=true,
	edge_ext_end::Bool=true,
	nucleotide_match_score::Float64 = 0.0,
	nucleotide_mismatch_score::Float64 = -0.4,
	nucleotide_score_matrix::Union{Nothing,Matrix{Float64}} = nothing,
	codon_match_bonus_score::Float64 = 6.0
)

Keyword constructor for ScoringScheme with helpful default parameters. Note that supplying a custom matrix for nucleotide_score_matrix might slow down some alignment methods.

Extended Help

The slowdown may occur because a ScoringScheme holding a Matrix{Float64}, which is dynamically sized, is allocated on the heap instead of the stack.

Arguments

  • extension_score::Float64=-0.3: Penalty for affinely extending a gap that is already open.
  • kmer_length::Int64=18: Length of the seeds used in seed_chain_align.
  • edge_ext_begin=true: Whether to subsidize gaps at the beginning of the alignment.
  • edge_ext_end=true: Whether to subsidize gaps at the end of the alignment.
  • nucleotide_match_score = 0.0: Score awarded for matching nucleotides (has to be >= 0).
  • nucleotide_mismatch_score = -0.8: Penalty for aligning distinct nucleotides (has to be < 0).
  • nucleotide_score_matrix::Union{Nothing, Matrix{Float64}}: Optional custom matrix for scoring nucleotide substitutions.
  • codon_match_bonus_score = 6.0: Score awarded for matching codons.

Returns

  • ScoringScheme

Example

score_params = ScoringScheme(extension_score = -0.2, nucleotide_mismatch_score = -0.7) # everything else is kept at its default value
A = LongDNA{4}("ATGATGATAA")
B = LongDNA{4}("ATGACCCGATAA")
seed_chain_align(A, B; scoring=score_params)

source
SeededAlignment.clean_frameshifts - Method
clean_frameshifts(aligned_ref::LongDNA{4}, aligned_seq::LongDNA{4}; verbose::Bool=false)

(clean_frameshifts - cleans pairwise alignments of frameshifts)

Takes a global pairwise alignment of a CDS anchor/reference aligned_ref and a CDS aligned_seq that may contain frameshift errors, and removes the frameshift mutations that don't respect the reference's reading frame in the alignment. This is done by removing insertions from the alignment or by filling deletions with ambiguous nucleotides.

Extended Help

Arguments

  • aligned_ref::LongDNA{4}: aligned anchored trusted CDS which decides the reading frame coordinates in the alignment
  • aligned_seq::LongDNA{4}: aligned CDS (with possible frameshifts due to e.g. sequencing or annotation errors) which is aligned to ref and adopts its reading frame coordinates
  • verbose::Bool: Whether to verbosely display what edits were made during the cleaning of frameshifts

Returns

  • Tuple{LongDNA{4},LongDNA{4}}: Cleaned global pairwise alignment that represents a protein alignment on a nucleotide level

Examples

1. insertion
ref: ATG-AACGTA  -> cleaned_ref: ATGAACGTA 
seq: ATGTAACGTA  -> cleaned_seq: ATGAACGTA

2. deletion
ref: ATGAACGTA  -> cleaned_ref: ATGAACGTA
seq: ATG-ACGTA  -> cleaned_seq: ATGNACGTA

3. longer deletion
ref: ATGAACGTA  -> cleaned_ref: ATGAACGTA
seq: AT-----TA  -> cleaned_seq: ATN---NTA
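
The three cases above can be reproduced with a small sketch on plain strings. The function clean_frameshifts_sketch is hypothetical and rests on two simplifying assumptions that the package itself may not share: an insertion run is dropped whenever its length is not a multiple of 3, and a reference codon that is only partially deleted in the query has its gaps filled with 'N'.

```julia
# Hypothetical sketch for illustration; not the SeededAlignment.jl implementation.
function clean_frameshifts_sketch(aligned_ref::String, aligned_seq::String)
    length(aligned_ref) == length(aligned_seq) ||
        throw(ArgumentError("aligned sequences must have equal length"))
    ref, seq = collect(aligned_ref), collect(aligned_seq)

    # Pass 1: drop insertion runs (gaps in ref) whose length is not a multiple of 3.
    keep = trues(length(ref))
    i = 1
    while i <= length(ref)
        if ref[i] == '-'
            j = i
            while j < length(ref) && ref[j + 1] == '-'
                j += 1
            end
            (j - i + 1) % 3 != 0 && (keep[i:j] .= false)
            i = j + 1
        else
            i += 1
        end
    end
    ref, seq = ref[keep], seq[keep]

    # Pass 2: walk the reference codon by codon; a codon that is only partly
    # deleted in seq gets its gaps replaced by the ambiguous nucleotide 'N'.
    codon_cols = Int[]
    for (col, base) in enumerate(ref)
        base == '-' && continue          # kept codon insertion, not a ref position
        push!(codon_cols, col)
        if length(codon_cols) == 3
            gaps = count(c -> seq[c] == '-', codon_cols)
            if 0 < gaps < 3
                for c in codon_cols
                    seq[c] == '-' && (seq[c] = 'N')
                end
            end
            empty!(codon_cols)
        end
    end
    return join(ref), join(seq)
end
```

A whole-codon deletion is left as a gap of three, which is why the middle codon in the third case stays as ---.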
source
SeededAlignment.clean_frameshifts - Method
clean_frameshifts(aligned_ref::LongDNA{4}, aligned_seqs::Vector{LongDNA{4}})

(clean_frameshifts - cleans multiple sequence alignments of frameshifts)

Cleans a multiple sequence alignment, provided one of the sequences is a reference sequence. This is done by projecting the multiple sequence alignment onto a collection of pairwise alignments relative to the reference aligned_ref, cleaning those, and then scaffolding the results to recover a cleaned multiple sequence alignment.

Extended Help

Arguments

  • aligned_ref::LongDNA{4}: aligned anchored trusted CDS which decides the reading frame coordinates in the alignment
  • aligned_seqs::Vector{LongDNA{4}}: aligned coding sequences (with possible frameshifts due to e.g. sequencing or annotation errors) which are aligned to ref and adopt its reading frame coordinates
  • verbose::Bool: Whether to verbosely display what edits were made during the cleaning of frameshifts

Returns

  • cleaned_msa::Vector{LongDNA{4}}: a frameshift-free multiple sequence alignment

Example

aligned_seqs = Vector{LongDNA{4}}(undef, 4)
# original unclean msa
aligned_ref     = LongDNA{4}("ATG---TTTCCCGGGT-AA")
aligned_seqs[1] = LongDNA{4}("-TG------CCCGGGT-A-")
aligned_seqs[2] = LongDNA{4}("ATGAAATTTCCCGGGT-AA")
aligned_seqs[3] = LongDNA{4}("ATGAAA----CCGGGT-AA")
aligned_seqs[4] = LongDNA{4}("ATG---TTTCCCGGGTTAA")
# produce clean msa
cleaned_msa = clean_frameshifts(aligned_ref, aligned_seqs)
#= results
cleaned_msa[1] = LongDNA{4}("ATG---TTTCCCGGGTAA") # ref sequence
cleaned_msa[2] = LongDNA{4}("NTG------CCCGGGTAN")
cleaned_msa[3] = LongDNA{4}("ATGAAATTTCCCGGGTAA")
cleaned_msa[4] = LongDNA{4}("ATGAAA---NCCGGGTAA")
cleaned_msa[5] = LongDNA{4}("ATG---TTTCCCGGGTAA")
=#
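
The projection step described above can be sketched on plain strings: to turn one row of the multiple sequence alignment into a pairwise alignment against the reference, drop every column in which both the reference and that row contain a gap. The helper project_pairwise is hypothetical and only illustrates the idea; it is not the package's implementation.

```julia
# Hypothetical sketch for illustration; not the SeededAlignment.jl implementation.
# Projects one MSA row onto a pairwise alignment with the reference by removing
# columns where both sequences have a gap.
function project_pairwise(aligned_ref::String, aligned_seq::String)
    length(aligned_ref) == length(aligned_seq) ||
        throw(ArgumentError("aligned sequences must have equal length"))
    ref_out, seq_out = IOBuffer(), IOBuffer()
    for (r, s) in zip(aligned_ref, aligned_seq)
        r == '-' && s == '-' && continue   # column carries no pairwise information
        print(ref_out, r)
        print(seq_out, s)
    end
    return String(take!(ref_out)), String(take!(seq_out))
end

# First and fourth rows of the unclean MSA from the example above
project_pairwise("ATG---TTTCCCGGGT-AA", "-TG------CCCGGGT-A-")
# → ("ATGTTTCCCGGGTAA", "-TG---CCCGGGTA-")
project_pairwise("ATG---TTTCCCGGGT-AA", "ATG---TTTCCCGGGTTAA")
# → ("ATGTTTCCCGGGT-AA", "ATGTTTCCCGGGTTAA")
```

In the second projection the single-nucleotide insertion survives as a gap in the reference, which is exactly the kind of frameshift the pairwise cleaning step then deals with.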
source
SeededAlignment.msa_codon_align - Method
msa_codon_align(ref::LongDNA{4}, seqs::Vector{LongDNA{4}}; moveset::Moveset=STD_CODON_MOVESET, scoring::ScoringScheme=STD_SCORING, codon_scoring_on::Bool=true)

Computes a visual global MSA (multiple sequence alignment) of coding sequences based on pairwise alignments to a trusted CDS reference ref, which is used as an anchor to determine the appropriate reading frame coordinates for the other coding sequences (which may contain frameshift errors). Possible frameshift errors in the pairwise alignments are cleaned up, and the cleaned alignments are then scaffolded into a multiple sequence alignment by left-stacking codon insertions relative to the reference.

Note that this doesn't qualify as a proper multiple sequence alignment in the traditional sense, since the aligned sequences are only scored against the reference sequence and not against each other.

Even so, it can still provide a useful visualization or approximation of a protein multiple sequence alignment on a nucleotide level.

Extended Help

Arguments

  • ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
  • seqs::Vector{LongDNA{4}}: Coding sequences (with possible frameshifts due to e.g. sequencing or annotation errors) which are aligned to ref and adopt its reading frame coordinates.
  • moveset::Moveset=STD_CODON_MOVESET: Defines allowable alignment moves (e.g., codon insertions/deletions).
  • scoring::ScoringScheme=STD_SCORING: The scoring scheme used during alignment.
  • codon_scoring_on::Bool=true: Whether to apply additional scoring on codon-level

Returns

  • msa::Vector{LongDNA{4}}: a frameshift-free multiple sequence alignment

Example

  1. codon alignment with no frameshifts present in inputs
ref =  LongDNA{4}("ATGTTTCCCGGGTAA")
seq1 = LongDNA{4}("ATGTTTTTTCCCGGGTAA")
seq2 = LongDNA{4}("ATGATGTTTTTTCCCGGGTAAGGG")
seq3 = LongDNA{4}("ATGCCCGGG")

msa = msa_codon_align(ref, [seq1,seq2,seq3])
#= alignment results:

msa[1] == LongDNA{4}("---ATG---TTTCCCGGGTAA---")
msa[2] == LongDNA{4}("---ATGTTTTTTCCCGGGTAA---")
msa[3] == LongDNA{4}("ATGATGTTTTTTCCCGGGTAAGGG")
msa[4] == LongDNA{4}("---ATG------CCCGGG------")

=#
source
SeededAlignment.nw_align - Method
nw_align(A::LongDNA{4}, B::LongDNA{4}; moveset::Moveset=STD_NOISY_MOVESET, scoring::ScoringScheme=STD_SCORING)

(Needleman-Wunsch wrapper - DE-NOVO)

Computes an optimal global pairwise alignment of the two ungapped DNA sequences A and B. The alignment is computed purely at the nucleotide level, without any awareness of protein encoding.
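
For readers unfamiliar with the underlying algorithm, the following is a minimal textbook Needleman-Wunsch sketch on plain strings with a single linear gap penalty. It is not the package's implementation: nw_align works with a Moveset of several gap moves and affine extension, whereas the function nw_sketch and its keyword names here are hypothetical simplifications.

```julia
# Textbook Needleman-Wunsch with a linear gap penalty; hypothetical sketch,
# not the Moveset-based algorithm used by SeededAlignment.jl.
function nw_sketch(a::String, b::String;
                   match_score=0.0, mismatch_score=-0.8, gap_score=-2.0)
    n, m = length(a), length(b)
    sub(x, y) = x == y ? match_score : mismatch_score

    # score[i + 1, j + 1] holds the best score for aligning a[1:i] with b[1:j].
    score = zeros(Float64, n + 1, m + 1)
    for i in 1:n
        score[i + 1, 1] = score[i, 1] + gap_score
    end
    for j in 1:m
        score[1, j + 1] = score[1, j] + gap_score
    end
    for i in 1:n, j in 1:m
        score[i + 1, j + 1] = max(score[i, j] + sub(a[i], b[j]),   # (mis)match
                                  score[i, j + 1] + gap_score,     # gap in b
                                  score[i + 1, j] + gap_score)     # gap in a
    end

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = Char[], Char[]
    i, j = n, m
    while i > 0 || j > 0
        if i > 0 && j > 0 && score[i + 1, j + 1] == score[i, j] + sub(a[i], b[j])
            push!(out_a, a[i]); push!(out_b, b[j]); i -= 1; j -= 1
        elseif i > 0 && (j == 0 || score[i + 1, j + 1] == score[i, j + 1] + gap_score)
            push!(out_a, a[i]); push!(out_b, '-'); i -= 1
        else
            push!(out_a, '-'); push!(out_b, b[j]); j -= 1
        end
    end
    return join(reverse(out_a)), join(reverse(out_b)), score[n + 1, m + 1]
end

nw_sketch("ATGCA", "ATGA")   # → ("ATGCA", "ATG-A", -2.0)
```

The sketch fills the full (n + 1) × (m + 1) matrix, which is the quadratic cost that seeding heuristics such as seed_chain_align aim to reduce.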

Extended Help

Arguments

  • A::LongDNA{4}: 1st DNA sequence to be aligned
  • B::LongDNA{4}: 2nd DNA sequence to be aligned
  • moveset::Moveset=STD_NOISY_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
  • scoring::ScoringScheme=STD_SCORING: Defines alignment scoring together with moveset

Returns

  • Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of pairwise alignment of DNA sequences A and B.

Examples

A = LongDNA{4}("AATGCTC")
B = LongDNA{4}("ACATGTC")
# produce alignment 
alignment = nw_align(A, B)
println(alignment)
#= resulting alignment:

alignment = (
	LongDNA{4}("A-ATGCTC"), 
	LongDNA{4}("ACATG-TC")
)

=#
source
SeededAlignment.nw_align - Method
nw_align(; 
    	ref::LongDNA{4}, 
    	query::LongDNA{4}, 
    	moveset::Moveset = STD_CODON_MOVESET, 
    	scoring::ScoringScheme = STD_SCORING,
    	codon_scoring_on::Bool = false,
    	do_clean_frameshifts::Bool = true, 
    	verbose::Bool = false)

(Needleman-Wunsch wrapper - CODING given trusted CDS anchor/reference)

Produces an optimal global pairwise alignment of two ungapped CDS (Coding DNA Sequences) ref and query by using ref as an anchor to determine the appropriate reading frame coordinates for query.

Extended Help

Arguments

  • ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
  • query::LongDNA{4}: CDS (with possible frameshifts due to e.g. sequencing or annotation errors) which is aligned to ref and adopts its reading frame coordinates.
  • moveset::Moveset = STD_CODON_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
  • scoring::ScoringScheme = STD_SCORING: Defines alignment scoring together with moveset
  • codon_scoring_on::Bool = false: Whether to apply additional scoring on codon-level
  • do_clean_frameshifts::Bool = true: Whether to clean the alignment output of gaps which cause frameshifts (IMPORTANT: produces a protein alignment on a nucleotide level)
  • verbose::Bool = false: Whether to verbosely display what edits were made during the cleaning of frameshifts.

Returns

  • Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of pairwise alignment of DNA sequences ref and query.

Note that this represents a protein alignment on a nucleotide level if (do_clean_frameshifts == true).

Examples

anchor_CDS =    LongDNA{4}("ATGCCAGTA")
# untrusted_CDS may contain some frameshift errors due to e.g. sequencing or annotation errors.
untrusted_CDS = LongDNA{4}("ATGTA") 
# frameshift errors are removed from the cleaned alignment
cleaned_CDS_alignment = nw_align(ref=anchor_CDS, query=untrusted_CDS, do_clean_frameshifts=true)
println(cleaned_CDS_alignment)
#= resulting alignment:

cleaned_CDS_alignment = (
	LongDNA{4}("ATGCCAGTA"), 
	LongDNA{4}("ATG---NTA")
)
Here 'N' denotes an ambiguous nucleotide.
=#
source
SeededAlignment.seed_chain_align - Method
seed_chain_align(A::LongDNA{4}, B::LongDNA{4}; moveset::Moveset=STD_NOISY_MOVESET, scoring::ScoringScheme=STD_SCORING)

(SeededAlignment wrapper - DE-NOVO)

Computes a heuristically guided global pairwise alignment of two ungapped DNA sequences A and B based on a seeding heuristic. The seeds are then joined together by computing an optimal partial alignment between seeds with the Needleman-Wunsch algorithm (nw_align). Optimal in this context means optimal with respect to the chosen Moveset and ScoringScheme.

The advantage of this method is that it is much faster than nw_align and produces similar results for most use cases.
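
As a rough illustration of what a seeding heuristic does, the sketch below finds exact shared k-mers between two plain strings by indexing the k-mers of one sequence and scanning the other. The function find_kmer_seeds is hypothetical; how SeededAlignment.jl actually selects and chains its seeds is not shown in this reference and may differ.

```julia
# Hypothetical sketch of exact k-mer seeding; not the SeededAlignment.jl implementation.
# Returns (i, j) pairs such that a[i:i+k-1] == b[j:j+k-1].
function find_kmer_seeds(a::String, b::String, k::Int)
    # Index every k-mer of a by its start positions.
    index = Dict{String,Vector{Int}}()
    for i in 1:(length(a) - k + 1)
        push!(get!(index, a[i:i+k-1], Int[]), i)
    end
    # Scan b and report every position whose k-mer also occurs in a.
    seeds = Tuple{Int,Int}[]
    for j in 1:(length(b) - k + 1)
        for i in get(index, b[j:j+k-1], Int[])
            push!(seeds, (i, j))
        end
    end
    return seeds
end

find_kmer_seeds("ATGCCAGTA", "ATGTA", 3)   # → [(1, 1), (7, 3)]
```

Only the stretches between consecutive seeds then need a full dynamic-programming alignment, which is where the speedup over aligning the whole sequences comes from.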

Extended Help

Arguments

  • A::LongDNA{4}: 1st DNA sequence to be aligned
  • B::LongDNA{4}: 2nd DNA sequence to be aligned
  • moveset::Moveset=STD_NOISY_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
  • scoring::ScoringScheme=STD_SCORING: Defines alignment scoring together with moveset

Returns

  • Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of pairwise alignment of DNA sequences A and B.

Example

# input sequences with no reading frame assumed
A = LongDNA{4}("AATGCTC")
B = LongDNA{4}("ACATGTC")
# produce alignment
alignment = seed_chain_align(A, B)
println(alignment)
#= resulting alignment
alignment = (
	LongDNA{4}("A-ATGCTC"), 
	LongDNA{4}("ACATG-TC")
)
=#
source
SeededAlignment.seed_chain_align - Method
seed_chain_align(; 
    	ref::LongDNA{4}, 
    	query::LongDNA{4}, 
    	moveset::Moveset = STD_CODON_MOVESET, 
    	scoring::ScoringScheme = STD_SCORING,
    	codon_scoring_on::Bool = false,
    	do_clean_frameshifts::Bool = true, 
    	verbose::Bool = false)

(SeededAlignment wrapper - CODING given trusted CDS anchor/reference)

Computes a heuristically guided global pairwise alignment of two ungapped CDS (Coding DNA Sequences) ref and query by using ref as an anchor to determine the appropriate reading frame coordinates for query and using a seeding heuristic for speedup. The seeds are then joined together by computing an optimal partial alignment between seeds with the Needleman-Wunsch algorithm (nw_align). Optimal in this context means optimal with respect to the chosen Moveset and ScoringScheme.

The advantage of this method is that it is much faster than nw_align and produces similar results for most use cases.

Extended Help

Arguments

  • ref::LongDNA{4}: Anchored trusted CDS which decides the reading frame coordinates in the alignment
  • query::LongDNA{4}: CDS (with possible frameshifts due to e.g. sequencing errors) which is aligned to ref and adopts its reading frame coordinates.
  • moveset::Moveset = STD_CODON_MOVESET: Defines allowable alignment moves (e.g. insertions/deletions and their penalty)
  • scoring::ScoringScheme = STD_SCORING: Defines alignment scoring together with moveset
  • codon_scoring_on::Bool = false: Whether to apply additional scoring on codon-level
  • do_clean_frameshifts::Bool = true: Whether to clean the alignment output of gaps which cause frameshifts - this produces a protein alignment on a nucleotide level.
  • verbose::Bool = false: Whether to verbosely display what edits were made during the cleaning of frameshifts.

Returns

  • Tuple{LongDNA{4},LongDNA{4}}: Tuple representation of pairwise alignment of DNA sequences ref and query.

Note that this represents a protein alignment on a nucleotide level if (do_clean_frameshifts == true).

Example

anchor_CDS =    LongDNA{4}("ATGCCAGTA")
# untrusted_CDS may contain some frameshift errors due to e.g. sequencing or annotation errors.
untrusted_CDS = LongDNA{4}("ATGTA") 
# frameshift errors are removed from the cleaned alignment
cleaned_CDS_alignment = seed_chain_align(ref=anchor_CDS, query=untrusted_CDS, do_clean_frameshifts=true)
println(cleaned_CDS_alignment)
#= resulting alignment:

cleaned_CDS_alignment = (
	LongDNA{4}("ATGCCAGTA"), 
	LongDNA{4}("ATG---NTA")
)
Here 'N' denotes an ambiguous nucleotide.
=#
source
SeededAlignment.write_fasta - Method
write_fasta(filepath::String, sequences::Union{Vector{LongDNA{4}}, Vector{LongAA}, Vector{String}}; seq_names = nothing)

Writes a FASTA file from a vector of sequences, with optional seq_names.
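
For orientation, a FASTA file is a sequence of records, each consisting of a header line starting with > followed by the sequence itself. The sketch below writes such a file from plain strings. The function write_fasta_sketch is hypothetical, and the fallback names seq1, seq2, ... used when seq_names is nothing are an assumption; the defaults chosen by write_fasta are not documented here.

```julia
# Hypothetical sketch of FASTA output; not the SeededAlignment.jl implementation.
function write_fasta_sketch(filepath::String, sequences::Vector{String}; seq_names=nothing)
    # Assumed fallback naming when no names are given.
    labels = seq_names === nothing ? ["seq$i" for i in eachindex(sequences)] : seq_names
    length(labels) == length(sequences) ||
        throw(ArgumentError("need exactly one name per sequence"))
    open(filepath, "w") do io
        for (label, seq) in zip(labels, sequences)
            println(io, ">", label)   # header line
            println(io, seq)          # sequence on a single line
        end
    end
    return filepath
end
```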

source
SeededAlignment.write_fasta - Method
write_fasta(filepath::String, sequences::Union{NTuple{N,LongDNA{4}}, NTuple{N,LongAA}, NTuple{N,String}}; seq_names = nothing)

Writes a FASTA file from a Tuple of sequences, with optional seq_names.

source
SeededAlignment.STD_CODON_MOVESET - Constant
STD_CODON_MOVESET

Constant that represents the default codon moveset with frameshift moves allowed.

default parameter values

# the tuple becomes both insertions and deletions in the moveset
const STD_CODON_MOVESET = Moveset(
    (
        Move(ref=false, step_length=1, score=-2.0, extendable=true),
        Move(ref=true,  step_length=3, score=-1.0, extendable=true)
    )
)
source
SeededAlignment.STD_NOISY_MOVESET - Constant
STD_NOISY_MOVESET

Constant that represents a codon-oblivious moveset that favors gaps of length 3.

default parameter values

# the tuple becomes both insertions and deletions in the moveset
const STD_NOISY_MOVESET = Moveset(
    (
        Move(ref=false, step_length=1, score=-2.0, extendable=true),
        Move(ref=true,  step_length=3, score=-1.0, extendable=true)
    )
)
source
SeededAlignment.STD_SCORING - Constant
STD_SCORING

Constant ScoringScheme that stores the default scoring parameters used in alignment methods.

default parameter values

const STD_SCORING = ScoringScheme(
	extension_score=-0.3,
	kmer_length=12,
	edge_ext_begin=true,
	edge_ext_end=true,
	nucleotide_mismatch_score = -0.8,
	nucleotide_match_score = 0.0,
	codon_match_bonus_score = 6.0
)
source