Alignment Functions
NextGenSeqUtils.align_reference_frames
NextGenSeqUtils.banded_nw_align
NextGenSeqUtils.kmer_seeded_align
NextGenSeqUtils.kmer_seeded_edit_dist
NextGenSeqUtils.loc_kmer_seeded_align
NextGenSeqUtils.local_align
NextGenSeqUtils.local_edit_dist
NextGenSeqUtils.local_kmer_seeded_align
NextGenSeqUtils.nw_align
NextGenSeqUtils.resolve_alignments
NextGenSeqUtils.triplet_kmer_seeded_align
NextGenSeqUtils.triplet_nw_align
NextGenSeqUtils.nw_align
— Function.
nw_align(s1::String, s2::String; edge_reduction = 0.99)
Returns aligned strings using the Needleman-Wunsch algorithm (quadratic time), with end gaps penalized slightly less than internal gaps. edge_reduction is a multiplier (usually less than one) applied to gap penalties at the ends of the strings.
nw_align(s1::String, s2::String, banded::Float64)
Wrapper for nw_align and banded_nw_align. A larger banded value makes alignment slower but more accurate.
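To illustrate the role of edge_reduction, here is a minimal, self-contained sketch of a quadratic global alignment. It is not the package's implementation: the function name `sketch_nw_align` is hypothetical, and the unit mismatch and gap costs are assumptions, since the scoring used by nw_align is not stated here.

```julia
# Hypothetical helper, not part of NextGenSeqUtils.
# Assumed scoring: match 0, mismatch 1, internal gap 1, end gap `edge_reduction`.
function sketch_nw_align(s1::String, s2::String; edge_reduction = 0.99)
    a, b = collect(s1), collect(s2)
    n, m = length(a), length(b)
    # cost[i + 1, j + 1] = minimal cost of aligning a[1:i] with b[1:j]
    cost = zeros(Float64, n + 1, m + 1)
    for i in 1:n
        cost[i + 1, 1] = i * edge_reduction   # leading gaps in s2
    end
    for j in 1:m
        cost[1, j + 1] = j * edge_reduction   # leading gaps in s1
    end
    for i in 1:n, j in 1:m
        diag = cost[i, j] + (a[i] == b[j] ? 0.0 : 1.0)
        # A gap is a trailing end gap once the other string is fully consumed.
        up   = cost[i, j + 1] + (j == m ? edge_reduction : 1.0)
        left = cost[i + 1, j] + (i == n ? edge_reduction : 1.0)
        cost[i + 1, j + 1] = min(diag, up, left)
    end
    # Trace back from the bottom-right corner, re-deriving each step.
    out1, out2 = Char[], Char[]
    i, j = n, m
    while i > 0 || j > 0
        if i > 0 && j > 0 &&
           cost[i + 1, j + 1] == cost[i, j] + (a[i] == b[j] ? 0.0 : 1.0)
            push!(out1, a[i]); push!(out2, b[j]); i -= 1; j -= 1
        elseif i > 0 && (j == 0 ||
               cost[i + 1, j + 1] == cost[i, j + 1] + (j == m ? edge_reduction : 1.0))
            push!(out1, a[i]); push!(out2, '-'); i -= 1
        else
            push!(out1, '-'); push!(out2, b[j]); j -= 1
        end
    end
    return String(reverse(out1)), String(reverse(out2))
end
```

Under these assumed costs, a missing leading base is placed as an end gap (cost 0.99) rather than being absorbed by mismatches, which is the behaviour the edge_reduction description suggests.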
NextGenSeqUtils.banded_nw_align
— Function.
banded_nw_align(s1::String, s2::String; edge_reduction = 0.99, band_coeff = 1)
Like nw_align, but subquadratic: only values within a band around the center diagonal are computed. One 'band' of radius 3 = (4,1), (3,1), (2,1), (1,1), (1,2), (1,3), (1,4), i.e. an upside-down L shape. band_coeff = 1 is sufficient to get the same alignments as nw_align for 10% diverged sequences ~97% of the time; increase this value for a more conservative alignment at the cost of longer computation time. Radius of band = bandwidth = band_coeff * sqrt(avg seq length).
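The band geometry can be sketched as follows. Both helper names are hypothetical, and the sketch assumes that "avg seq length" means the mean of the two input lengths; how the package rounds the radius or handles sequences of very different lengths is not covered.

```julia
# Hypothetical helpers, not part of NextGenSeqUtils.

# Radius of the band: band_coeff * sqrt(average sequence length).
band_radius(s1::String, s2::String; band_coeff = 1) =
    band_coeff * sqrt((length(s1) + length(s2)) / 2)

# Cells of the L-shaped band whose corner sits on the diagonal at (d, d):
# down the column from (d + radius, d) to (d, d), then along the row to (d, d + radius).
band_cells(d::Int, radius::Int) =
    vcat([(d + k, d) for k in radius:-1:0], [(d, d + k) for k in 1:radius])
```

For example, `band_cells(1, 3)` reproduces the radius-3 band listed above, and two sequences of length 100 give a radius of 10 with band_coeff = 1.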
NextGenSeqUtils.triplet_nw_align
— Function.
triplet_nw_align(s1::String, s2::String; edge_reduction = 0.99, boundary_mult = 2)
Returns an alignment of two sequences where s1 is a reference with a reading frame to be preserved and s2 is a query sequence. boundary_mult adjusts penalties for gaps preserving the reading frame of s1. This usually works best in the range 0 to 3; higher values more strongly enforce gaps aligned to the reference frame (divisible-by-3 indices).
NextGenSeqUtils.local_align
— Function.
local_align(ref::String, query::String; mismatch_score = -1,
match_score = 1, gap_penalty = -1,
rightaligned=true, refend = false)
Aligns a query sequence locally to a reference. If true, rightaligned keeps the right ends of each sequence in the final alignment; otherwise they are trimmed. refend keeps the beginning/left end of ref. If you want to keep both ends of both strings, use nw_align. For best alignments use the default score values.
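As a rough illustration of how the default score values interact, the sketch below computes a standard Smith-Waterman local alignment score. This is a generic technique swapped in for illustration, not the package's local_align: the name `sketch_local_score` is hypothetical, and the rightaligned and refend behaviour described above is not modelled.

```julia
# Hypothetical helper, not part of NextGenSeqUtils.
# Returns the best Smith-Waterman local alignment score (integer scores assumed).
function sketch_local_score(ref::String, query::String;
                            match_score = 1, mismatch_score = -1, gap_penalty = -1)
    a, b = collect(ref), collect(query)
    # H[i + 1, j + 1] = best score of a local alignment ending at a[i], b[j]
    H = zeros(Int, length(a) + 1, length(b) + 1)
    best = 0
    for i in 1:length(a), j in 1:length(b)
        s = a[i] == b[j] ? match_score : mismatch_score
        H[i + 1, j + 1] = max(0,
                              H[i, j] + s,                # match or mismatch
                              H[i, j + 1] + gap_penalty,  # gap in query
                              H[i + 1, j] + gap_penalty)  # gap in ref
        best = max(best, H[i + 1, j + 1])
    end
    return best
end
```

With the defaults, a shared stretch of four bases scores 4, and two sequences with nothing in common score 0.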
NextGenSeqUtils.kmer_seeded_align
— Function.
kmer_seeded_align(s1::String, s2::String;
wordlength = 30,
skip = 10,
aligncodons = false,
banded = 1.0,
debug::Bool = false)
Returns aligned strings, where alignment is first done with larger word matches and then (possibly banded) Needleman-Wunsch on intermediate intervals. skip gives a necessary gap between searched-for words in s1. For best results, use the default wordlength and skip values. See nw_align for explanation of banded.
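The seeding step can be sketched as below. This is one possible reading of the description, not the package's code: the name `sketch_seed_matches` is hypothetical, words are assumed to be taken from s1 every `wordlength + skip` positions, only the first exact hit in s2 is kept, and inputs are assumed to be ASCII so that string indexing by position is valid.

```julia
# Hypothetical helper, not part of NextGenSeqUtils.
# Returns (start in s1, start in s2) for each sampled word of s1 found in s2.
function sketch_seed_matches(s1::String, s2::String; wordlength = 30, skip = 10)
    seeds = Tuple{Int,Int}[]
    i = 1
    while i + wordlength - 1 <= length(s1)
        word = s1[i:i + wordlength - 1]
        hit = findfirst(word, s2)        # range of the first exact match, or nothing
        if hit !== nothing
            push!(seeds, (i, first(hit)))
        end
        i += wordlength + skip           # assumed: leave `skip` bases between words
    end
    return seeds
end
```

The intervals between consecutive seeds are then what a (possibly banded) Needleman-Wunsch pass would fill in. A real implementation would also need to handle repeated words and seeds that are out of order, which this sketch ignores.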
NextGenSeqUtils.triplet_kmer_seeded_align
— Function.
triplet_kmer_seeded_align(s1::String, s2::String;
wordlength = 30,
skip = 9,
boundary_mult = 2,
alignedcodons = true,
debug::Bool=false)
Returns aligned strings, where alignment is first done with word matches and then Needleman-Wunsch on intermediate intervals, preferring to preserve the reading frame of the first argument s1. skip gives a necessary gap between searched-for words in s1. For best results, use the default wordlength and skip values. See triplet_nw_align for explanation of boundary_mult.
NextGenSeqUtils.loc_kmer_seeded_align
— Function.
loc_kmer_seeded_align(s1::String, s2::String;
wordlength = 30,
skip = 10,
trimpadding = 100,
debug::Bool=false)
Returns locally aligned strings, where alignment is first done with word matches and then Needleman-Wunsch on intermediate intervals.
s1 is a reference to align to, and s2 is a query to extract a local match from. s2 may be trimmed or expanded with gaps. Before locally aligning the ends of the sequences, the ends of s2 are trimmed to length trimpadding for faster alignment. Increasing this may improve alignment accuracy but increases runtime. skip gives a necessary gap between searched-for words in s1. For best results, use the default wordlength and skip values.
NextGenSeqUtils.local_kmer_seeded_align
— Function.
local_kmer_seeded_align(s1::String, s2::String;
wordlength = 30,
skip = 10,
trimpadding = 100,
debug::Bool=false)
Returns locally aligned strings, where alignment is first done with word matches and then Needleman-Wunsch on intermediate intervals.
s1 is a reference to align to, and s2 is a query to extract a local match from. s2 may be trimmed or expanded with gaps. Before locally aligning the ends of the sequences, the ends of s2 are trimmed to length trimpadding for faster alignment. Increasing this may improve alignment accuracy but increases runtime. skip gives a necessary gap between searched-for words in s1. For best results, use the default wordlength and skip values.
NextGenSeqUtils.kmer_seeded_edit_dist
— Function.
kmer_seeded_edit_dist(s1::String , s2::String;
wordlength = 30,
skip = 5,
aa_matches = false)
Computes the Levenshtein edit distance, with speedups from only computing the DP scoring matrix between word matches. If aa_matches = true, will attempt to find amino acid matches in any reference frame, and add the nucleotide Hamming distance of these matches to the Levenshtein distances of mismatches. skip gives a necessary gap between searched-for words in s1. For best results, use the default wordlength and skip values.
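For reference, the quantity being accelerated is the plain Levenshtein distance. The sketch below is the textbook full-matrix computation (kept to two rows), not the seeded version; the name `sketch_levenshtein` is hypothetical. The seeded function would apply this kind of computation only to the stretches between word matches.

```julia
# Hypothetical helper, not part of NextGenSeqUtils.
# Textbook Levenshtein distance with unit insert/delete/substitute costs.
function sketch_levenshtein(s1::String, s2::String)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of s1
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                      # delete the first i characters of s1
        for j in 1:length(b)
            sub = prev[j] + (a[i] == b[j] ? 0 : 1)
            curr[j + 1] = min(sub, prev[j + 1] + 1, curr[j] + 1)
        end
        prev = curr
    end
    return prev[end]
end
```

This runs in time proportional to the product of the two lengths, which is the cost the word-match seeding is meant to avoid on long, similar sequences.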
NextGenSeqUtils.resolve_alignments
— Function.
resolve_alignments(ref::String, query::String; mode = 1)
Called on aligned strings. Resolves query with respect to ref. mode = 1 for resolving single indels, mode = 2 for resolving single indels and codon insertions in query.
NextGenSeqUtils.align_reference_frames
— Function.
align_reference_frames(clusters; k = 6, thresh = 0.03, verbose = false)
Takes clusters = [consensus_sequences, cluster_sizes], chooses references out of consensuses that do not have stop codons in the middle, and makes all consensus sequence reading frames agree. Returns resolved consensus seqs (goods) along with filtered-out consensus seqs that are >thresh divergent from the nearest reference (bads). k = kmer size for computing kmer vectors of sequences.
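The kmer vectors mentioned above can be pictured as counts of every length-k word in a sequence. The sketch below is only an illustration: the name `sketch_kmer_counts` is hypothetical, a Dict is used instead of whatever vector layout the package uses, and ASCII input is assumed so that indexing by position is valid.

```julia
# Hypothetical helper, not part of NextGenSeqUtils.
# Counts every overlapping word of length k in seq.
function sketch_kmer_counts(seq::String; k = 6)
    counts = Dict{String,Int}()
    for i in 1:(length(seq) - k + 1)     # empty range when seq is shorter than k
        word = seq[i:i + k - 1]
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end
```

Two sequences with similar count profiles are likely to be close, which is what makes such vectors a cheap way to find the nearest reference before doing a full alignment.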
NextGenSeqUtils.local_edit_dist
— Function.
local_edit_dist(s1::String, s2::String)
Returns the edit distance between two sequences after local alignment.