Jjama3

Documentation for Jjama3.

Jjama3.eager_update! (Method)
eager_update!(state, model, update!)

Updates params during the backward pass, saving memory.

f(model, xs...) = model(xs...)
h = f(eager_update!(state.layers[i], model.layers[i], Optimisers.update!), h, other_args...)
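
For context, a minimal sketch of how this call pattern might sit inside a gradient computation. The forward helper, the layers field, h0, and the loss are assumptions for illustration; only the eager_update! wrapping is taken from the example above.

using Optimisers, Zygote

f(model, xs...) = model(xs...)  # call each layer through a wrapper, as in the example above

# `model` is assumed to have a `layers` field; `h0` is a hypothetical input array.
opt_state = Optimisers.setup(Optimisers.AdamW(1f-4), model)

function forward(m, state, h)
    for i in eachindex(m.layers)
        # Layer i's params are updated as soon as its gradient is available,
        # so the full gradient tree is never held in memory at once.
        h = f(eager_update!(state.layers[i], m.layers[i], Optimisers.update!), h)
    end
    return h
end

loss, _ = Zygote.withgradient(m -> sum(abs2, forward(m, opt_state, h0)), model)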

Jjama3.generate (Method)
generate(model, initial_tokens; max_new_tokens=100, sampler=top_pk_sampler(p=0.5f0, k=5), tokenizer_for_printing=tkn, end_token=128010)

Takes an initial sequence of tokens and generates new tokens one at a time until the end token is sampled. Uses a KV cache. No batch dimension for now. Runs on the CPU by default. If the model is on the GPU (assuming Flux.jl, e.g. model = gpu(model)), pass device = gpu to generate to run on the GPU.

tkn = llama3_tokenizer()
generate(model, initial_tokens; max_new_tokens=100, sampler=top_pk_sampler(p=0.5f0, k=5), tokenizer_for_printing=tkn, end_token=128010)
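
A sketch of the GPU path described above, assuming Flux.jl's gpu, the Llama3_2_1B_instruct files used in the loading example further down, and the prompt helper documented below:

using Flux, JSON3

config = JSON3.read(read("Llama3_2_1B_instruct/config.json", String))
model = load_llama3_from_safetensors(["Llama3_2_1B_instruct/model.safetensors"], config)
tkn = llama3_tokenizer()
prompt = llama3_assistant_prompt(tkn, "What is the capital of France?")

model = gpu(model)  # move the weights to the GPU (Flux.jl)
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn, end_token=128010, device=gpu)
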
Jjama3.llama3_assistant_prompt (Method)
llama3_assistant_prompt(tokenizer, prompt)

Format a prompt for use with Llama3.2's instruction format, with a simple "You are a helpful assistant" system prompt.

prompt = llama3_assistant_prompt(tkn, "What is the capital of France?")
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn)

Jjama3.load_llama3_from_safetensors (Method)
model = load_llama3_from_safetensors(model_weight_paths, config)

Load a Llama3 model from a set of Huggingface safetensors files, and the config.json file. Important note: Huggingface uses a different RoPE convention than other implementations, so if you're loading weights from a different source, you might get very poor model performance.

using JSON3
config = JSON3.read(read("Llama3_2_1B_instruct/config.json", String))
model_weight_paths = ["Llama3_2_1B_instruct/model.safetensors"] # Can be an array of paths if the model is split across multiple files
model = load_llama3_from_safetensors(model_weight_paths, config)
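
If the checkpoint is sharded, pass every shard path. The filenames below are hypothetical but follow the usual Hugging Face shard naming:

model_weight_paths = [
    "Llama3_1_8B_instruct/model-00001-of-00004.safetensors",
    "Llama3_1_8B_instruct/model-00002-of-00004.safetensors",
    "Llama3_1_8B_instruct/model-00003-of-00004.safetensors",
    "Llama3_1_8B_instruct/model-00004-of-00004.safetensors",
]
model = load_llama3_from_safetensors(model_weight_paths, config)
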
Jjama3.structured_choice (Method)
sampler = structured_choice(choices, vocab::Vector{String}, end_token::Int; sampler = logits -> argmax_sampler(logits))

Return a function that can be passed into generate as a sampler, which will sample from the given choices. Handles the case where the choices are made up of multiple tokens. vocab is an array of the tokens as strings, in their order in the tokenizer. sampler is a function that takes the logits (here including those masked with -Inf) and returns a sample from them. Defaults to argmax.

Example:

config = JSON3.read(read("SmolLM2-1.7B-Instruct/config.json", String))
model = load_llama3_from_safetensors("SmolLM2-1.7B-Instruct/model.safetensors", config)
tkn = tokenizer_from_file(Tokenizer, "SmolLM2-1.7B-Instruct/tokenizer.json")

question = "In a Bayesian model, what do we call the probability distribution of parameters given the data?"
choices = ["Prior", "Likelihood", "Marginal Likelihood", "Evidence", "Posterior"]

vocab = [decode(tkn, [i], skip_special_tokens = false) for i in 1:49152]
eos = encode(tkn, "<|im_end|>")[end]
prompt = smollm2_instruct_prompt(tkn, "You are an expert in Statistics and Probability Theory who answers questions in as few words as possible.", question)
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn, end_token = eos, sampler = structured_choice(choices, vocab, eos));

If you want to run the model on the GPU, then you need to pass device = gpu to the generate function, and device = cpu to the structured_choice function.
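
Continuing the example above, that GPU setup might look like the following sketch, assuming Flux.jl's gpu/cpu and that device is accepted as a keyword argument by both functions, as described:

model = gpu(model)
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn, end_token=eos,
    device=gpu, sampler=structured_choice(choices, vocab, eos, device=cpu));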
