API

The main functions and types in Rifraf.jl.

Functions

# Rifraf.rifraf (Function)

rifraf(dnaseqs, phreds; kwargs...)

Find a consensus sequence for a set of DNA sequences.

Returns an instance of RifrafResult.

Arguments

  • dnaseqs::Vector{DNASeq}: reads for which to find a consensus
  • phreds::Vector{Vector{Phred}}: Phred scores for dnaseqs
  • consensus::DNASeq=DNASeq(): initial consensus; if not given, defaults to the sequence in dnaseqs with the lowest mean error rate
  • reference::DNASeq=DNASeq(): reference for frame correction
  • params::RifrafParams=RifrafParams()
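
The default for consensus described above (the read with the lowest mean error rate) can be illustrated with a short sketch. This is not Rifraf's own code: the helper names phred_to_prob, mean_error, and best_read_index are hypothetical, and the sketch assumes the standard Phred definition p = 10^(-Q/10) with plain integers standing in for the Phred type.

```julia
# Convert a Phred score Q to an error probability: p = 10^(-Q/10).
# (Hypothetical helper; plain integers stand in for Rifraf's Phred type.)
phred_to_prob(q) = 10.0^(-q / 10)

# Mean error probability of one read, given its Phred scores.
mean_error(read_phreds) = sum(phred_to_prob, read_phreds) / length(read_phreds)

# Index of the read with the lowest mean error rate.
best_read_index(phreds) = argmin(map(mean_error, phreds))

# Three toy reads with uniform qualities of Q10, Q30, and Q20.
phreds = [[10, 10, 10], [30, 30, 30], [20, 20, 20]]
best = best_read_index(phreds)
println("read with lowest mean error rate: ", best)
```

Here the second read has the lowest mean error probability, so it would be the one chosen as the starting consensus under this assumption.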


Sequence simulations

# Rifraf.sample_sequences (Function)

sample_sequences(nseqs, len; kwargs...)

Generate a template and sample simulated reads and Phred scores.

This function is meant for simple testing and benchmarking, and is not meant to represent a realistic error model.

Arguments:

  • nseqs::Int=3: number of reads to generate
  • len::Int=90: length of template
  • ref_error_rate::Prob=0.1: reference error rate
  • ref_errors::ErrorModel=ErrorModel(10, 0, 0, 1, 1): reference error model
  • error_rate::Prob=0.01: read error rate
  • alpha::Float64=0.1: α parameter for beta distribution of per-base template error rates.
  • phred_scale::Float64=1.5: λ parameter for exponential distribution of Phred error
  • actual_std::Float64=3.0: standard deviation (σ) of the true Gaussian errors in the Phred domain
  • reported_std::Float64=1.0: standard deviation (σ) of the reported Gaussian errors in the Phred domain
  • seq_errors::ErrorModel=ErrorModel(1, 5, 5): sequencing error model

Returns:

  • reference::DNASeq: reference sequence for template
  • template::DNASeq: template sequence
  • t_p::Vector{Prob}: template error probabilities
  • seqs::Vector{DNASeq}: simulated reads
  • actual::Vector{Vector{Prob}}: error probabilities
  • phreds::Vector{Vector{Phred}}: Phred values
  • seqbools::Vector{Vector{Bool}}: seqbools[i][j] is true if seqs[i][j] was correctly sequenced from the template
  • tbools::Vector{Vector{Bool}}: tbools[i][j] is true if template[j] was correctly sequenced in seqs[i]


# Rifraf.write_samples (Function)

Write template into FASTA and sequences into FASTQ.


# Rifraf.read_samples (Function)

Read template from FASTA and sequences from FASTQ.


Utility IO functions

Rifraf.jl provides some utility functions for reading and writing FASTQ and FASTA files. This functionality uses BioSequences.jl.

# Rifraf.read_fastq_records (Function)

read_fastq_records(filename)

Read a FASTQ file and return records.

Returns:

  • records::Vector{FASTQ.Record}


# Rifraf.read_fastq (Function)

read_fastq(filename)

Read a FASTQ file and convert to a given sequence type.

Returns:

  • seqs::Vector{T}: sequences
  • phreds::Vector{Vector{Phred}}: Phred values
  • names::Vector{String}: sequence names


# Rifraf.write_fastq (Function)

write_fastq(filename, seqs, phreds; names)

Write sequences and their Phred scores to a FASTQ file.

Arguments:

  • filename: file into which to write
  • seqs: sequences to write
  • phreds: corresponding Phred scores
  • names::Vector{String}: optional list of corresponding names
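
For orientation, a FASTQ record is four lines: a name line starting with @, the sequence, a + separator, and a quality string with one character per base. The sketch below builds one record by hand. It is not how write_fastq is implemented (Rifraf relies on BioSequences.jl for that); the helpers encode_quality and fastq_record are hypothetical, and the common Phred+33 character encoding is assumed.

```julia
# Encode Phred scores as a FASTQ quality string.
# Assumption: Phred+33 encoding, i.e. the character code is Q + 33.
encode_quality(phreds) = String([Char(q + 33) for q in phreds])

# Format a single four-line FASTQ record (hypothetical helper).
function fastq_record(name, seq, phreds)
    length(seq) == length(phreds) || error("sequence and Phred lengths differ")
    return "@$name\n$seq\n+\n$(encode_quality(phreds))\n"
end

record = fastq_record("read_1", "ACGT", [30, 30, 20, 10])
print(record)
```

The length check mirrors the requirement that seqs and phreds correspond position by position.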


# Rifraf.read_fasta_records (Function)

read_fasta_records(filename)

Read a FASTA file and return records.

Returns:

  • records::Vector{FASTA.Record}


# Rifraf.read_fasta (Function)

read_fasta(filename)

Read a FASTA file and convert to a given sequence type.

Returns:

  • seqs::Vector{T}


# Rifraf.write_fasta (Function)

write_fasta(filename, seqs; names)

Write sequences to a FASTA file.

Arguments:

  • filename: file into which to write
  • seqs: sequences to write
  • names::Vector{String}: optional list of corresponding names


Types

# Rifraf.RifrafParams (Type)

The parameters for a RIFRAF run.

Fields

  • scores::Scores = Scores(ErrorModel(1.0, 2.0, 2.0, 0.0, 0.0))
  • ref_scores::Scores = Scores(ErrorModel(10.0, 1e-1, 1e-1, 1.0, 1.0))
  • ref_indel_mult::Score = 3.0: multiplier for single indel penalties in alignment with the reference
  • max_ref_indel_mults::Int = 5: maximum multiplier increases for single indel penalty
  • ref_error_mult::Float64 = 1.0: multiplier for estimated reference error rate.
  • do_init::Bool = true: enable initialization stage
  • do_frame::Bool = true: enable frame correction stage
  • do_refine::Bool = true: enable refinement stage
  • do_score::Bool = false: enable scoring stage
  • do_alignment_proposals::Bool = true: only propose changes that occur in pairwise alignments
  • seed_indels::Bool = true: seed indel locations from the alignment to reference
  • indel_correction_only::Bool = true: only propose indels during frame correction stage
  • use_ref_for_qvs::Bool = false: use reference alignment when estimating quality scores
  • bandwidth::Int = (3 * CODON_LENGTH): alignment bandwidth
  • bandwidth_pvalue::Float64 = 0.1: p-value for increasing bandwidth
  • min_dist::Int = (5 * CODON_LENGTH): distance between accepted candidate proposals
  • batch_fixed::Bool = true: use top sequences for initial stage and frame correction
  • batch_fixed_size::Int = 5: size of fixed batch
  • batch_size::Int = 20: batch size; if <= 1, no batching is used
  • batch_randomness::Float64 = 0.9: batch randomness

    • 0: top n get picked
    • 0.5: weight according to estimated errors
    • 1: completely random

  • batch_mult::Float64 = 0.7: multiplier to reduce batch randomness
  • batch_threshold::Float64 = 0.1: score threshold for increasing batch size
  • max_iters::Int = 100: maximum total iterations across all stages before giving up
  • verbose::Int = 0: verbosity level

    • 0: nothing
    • 1: print iteration and score
    • 2: also print step within each iteration
    • 3: also print full consensus sequence


# Rifraf.RifrafResult (Type)

RifrafResult()

The result of a RIFRAF run.

Fields

  • consensus::DNASeq: the consensus found by RIFRAF.
  • params::RifrafParams: the parameters used for this run.
  • state::RifrafState: the final state of the run.
  • consensus_stages::Vector{Vector{DNASeq}}
  • error_probs::EstimatedProbs: estimated per-base probabilities for each position. Only available if params.do_score is true.
  • aln_error_probs::Vector{Float64}: combined per-base error probabilities. Only available if params.do_score is true.


# Rifraf.ErrorModel (Type)

ErrorModel(mismatch, insertion, deletion, codon_insertion, codon_deletion)

Error model for sequencing.

Each field contains the relative rate of that kind of error. For instance, this model breaks the error rate into 80% mismatches, 10% codon insertions, and 10% codon deletions: ErrorModel(8, 0, 0, 1, 1).

Fields:

  • mismatch::Real
  • insertion::Real
  • deletion::Real
  • codon_insertion::Real
  • codon_deletion::Real
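
The relative-rate interpretation can be checked with a few lines of arithmetic. The sketch below uses a plain named tuple as a stand-in for ErrorModel(8, 0, 0, 1, 1) and does not use the Rifraf type itself. The normalization step (dividing each rate by the total) is an assumption about how relative rates become fractions of an overall error rate.

```julia
# Hypothetical stand-in for the relative rates in ErrorModel(8, 0, 0, 1, 1).
rates = (mismatch=8.0, insertion=0.0, deletion=0.0,
         codon_insertion=1.0, codon_deletion=1.0)

# Normalize the relative rates so that they sum to 1.
total = sum(values(rates))
fractions = map(r -> r / total, rates)

# Split an overall error rate of 0.01 among the error types.
error_rate = 0.01
per_type = map(f -> f * error_rate, fractions)

println(fractions)
println(per_type)
```

Under this reading, only the ratios between the fields matter: ErrorModel(8, 0, 0, 1, 1) and ErrorModel(80, 0, 0, 10, 10) would describe the same split.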


# Rifraf.Scores (Type)

Scores(errors; mismatch, insertion, deletion)

Derive alignment scores from an error model.

Takes extra penalties to add to the mismatch, insertion, and deletion scores.

Arguments:

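  • errors::ErrorModel: error model from which the scores are derived
  • mismatch::Real: extra penalty added to the mismatch (substitution) score
  • insertion::Real: extra penalty added to the insertion score
  • deletion::Real: extra penalty added to the deletion score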
