

Hidden Markov Model Functions

# NextGenSeqUtils.viterbi_logs — Function.

viterbi_logs(observations_given_states::Array{Float64, 2}, transitions::Array{Float64, 2}, initials::Array{Float64})

Return viterbi path and log probability for that path. Takes logs of matrices. observations_given_states has # rows = # states, # columns = # steps of markov process.

source

# NextGenSeqUtils.trans_mat — Function.

trans_mat(; uniform_cycle_prob = 0.9999999999, homopoly_cycle_prob = 0.98)

Return 5x5 transition matrix with given transition probabilities State 1 – uniform observation distribution. State 2 – High "A" observation likelihood. State 3 – High "C" observation likelihood. State 4 – High "G" observation likelihood. State 5 – High "T" observation likelihood. Each state has high likelihood to return to state 1. States 2 through 5 represent high likelihood of homopolymer regions of respective nuc.

source

# NextGenSeqUtils.obs_mat — Function.

obs_mat(; homopoly_prob = 0.99)

Return 5x4 observation matrix with given probabilities. State 1 – uniform observation distribution. State 2 – High "A" observation likelihood. State 3 – High "C" observation likelihood. State 4 – High "G" observation likelihood. State 5 – High "T" observation likelihood. Columns are in same order as columns 2 through 5 (the last four rows give a symmetric matrix).

source

# NextGenSeqUtils.initial_dist — Function.

initial_dist(; uniform_state = 0.99)

Initial state distributions: see trans_mat for state descriptions. 5x1 vector.

source

# NextGenSeqUtils.get_obs_given_state — Function.

get_obs_given_state(observation_matrix::Array{Float64,2}, observation_seq::String)

Populate 5xT matrix with likelihood of observation at each of T time steps given each nucleotide in observation_seq.

source

# NextGenSeqUtils.homopolymer_filter — Function.

homopolymer_filter(seqs::Array{String,1}, phreds, names;
                   transmat = nothing, obsmat = nothing,
                   initialdist = nothing)

Filter sequences with "bad" sections in the middle – abnormally long runs of a single base, using viterbi alg inference. If this homopolymer occurs on one end of a sequence, keeps sequence and trims homopolymer region off end. phreds and/or names may be nothing, in which case nothing is returned for the respective field. If transition, observation, initial distribution matrices not provided (default) then the default values from the respective constructors are used.

source

homopolymer_filter(sourcepath::String, destpath::String;
                   transmat = nothing, obsmat = nothing,
                   initialdist = nothing, format="fastq")

Filter sequences with "bad" sections in the middle – abnormally long runs of a single base, using viterbi alg inference. If this homopolymer occurs on one end of a sequence, keeps sequence and trims homopolymer region off end. Takes a file path for each of a source file of type format (which may be "fasta" or "fastq") and a destination file is written which is the same file type.

source

# NextGenSeqUtils.markov_filter — Function.

homopolymer_filter(seqs::Array{String,1}, phreds, names;
                   transmat = nothing, obsmat = nothing,
                   initialdist = nothing)

Filter sequences with "bad" sections in the middle – abnormally long runs of a single base, using viterbi alg inference. If this homopolymer occurs on one end of a sequence, keeps sequence and trims homopolymer region off end. phreds and/or names may be nothing, in which case nothing is returned for the respective field. If transition, observation, initial distribution matrices not provided (default) then the default values from the respective constructors are used.

source

homopolymer_filter(sourcepath::String, destpath::String;
                   transmat = nothing, obsmat = nothing,
                   initialdist = nothing, format="fastq")

Filter sequences with "bad" sections in the middle – abnormally long runs of a single base, using viterbi alg inference. If this homopolymer occurs on one end of a sequence, keeps sequence and trims homopolymer region off end. Takes a file path for each of a source file of type format (which may be "fasta" or "fastq") and a destination file is written which is the same file type.

source

# NextGenSeqUtils.forward_logs — Function.

forward_logs(observations_given_states::Array{Float64, 2}, transitions::Array{Float64, 2}, initials::Array{Float64})

Compute logs of forward scores. Takes logs of matrices as inputs. observations_given_states has # rows = # states, # columns = # steps of markov process.

source

# NextGenSeqUtils.backward_logs — Function.

backward_logs(observations_given_states::Array{Float64, 2}, transitions::Array{Float64, 2}, initials::Array{Float64})

Compute logs of backward scores and individual posterior probabilities. Takes logs of matrices as inputs. observations_given_states has # rows = # states, # columns = # steps of markov process.

source

# NextGenSeqUtils.forward_backward_logs — Function.

forward_backward_logs(observations_given_states::Array{Float64, 2}, transitions::Array{Float64, 2}, initials::Array{Float64})

Compute logs of forward-backward scores and individual probabilities. Takes logs of matrices as inputs. observations_given_states has # rows = # states, # columns = # steps of markov process.

source

# NextGenSeqUtils.logsum — Function.

logsum(lga, lgb)

Compute numerically stable logsum. Returns -Inf if either input is -Inf.

source

# NextGenSeqUtils.gen_seq_with_model — Function.

gen_seq_with_model(n::Int, trans_mat, obs_mat, initial_dists)

Generate a sequence given a markov model. May create bad reads (long runs of a single base).

source

# NextGenSeqUtils.viterbiprint — Function.

viterbiprint(s::String)

Draw flagged sites with capital letters, safe sites with lowercase.

source