Jjama3
Documentation for Jjama3.
Jjama3.eager_update!
Jjama3.generate
Jjama3.llama3_assistant_prompt
Jjama3.load_llama3_from_safetensors
Jjama3.structured_choice
Jjama3.eager_update!
— Method
eager_update!(state, model, update!)
Updates params during the backward pass, saving memory.
f(model, xs...) = model(xs...)
h = f(eager_update!(state.layers[i], model.layers[i], Optimisers.update!), h, other_args...)
Jjama3.generate
— Method
generate(model, initial_tokens; max_new_tokens=100, sampler=top_pk_sampler(p=0.5f0, k=5), tokenizer_for_printing=tkn, end_token=128010)
Takes an initial sequence of tokens and generates new tokens one at a time until the end token is sampled. Uses a KV cache. No batch dimension for now. Runs on the CPU by default. If the model is on the GPU (assuming Flux.jl, e.g. model = gpu(model)), then pass device = gpu to generate to run on the GPU.
tkn = llama3_tokenizer()
generate(model, initial_tokens; max_new_tokens=100, sampler=top_pk_sampler(p=0.5f0, k=5), tokenizer_for_printing=tkn, end_token=128010)
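For example, a minimal sketch of GPU generation, assuming a CUDA-capable setup and Flux.jl's gpu helper, with the device keyword passed as described above:
using Flux # provides gpu/cpu device helpers
model = gpu(model)
generate(model, initial_tokens; max_new_tokens=100, sampler=top_pk_sampler(p=0.5f0, k=5), tokenizer_for_printing=tkn, end_token=128010, device=gpu)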
Jjama3.llama3_assistant_prompt
— Method
generate(model, prompt, max_new_tokens=100, encoder_for_printing=tkn)
Format a prompt for use with Llama3.2's instruction format, with a simple "You are a helpful assistant" system prompt.
prompt = llama3_assistant_prompt(tkn, "What is the capital of France?")
generate(model, prompt, max_new_tokens=100, encoder_for_printing=tkn)
Jjama3.load_llama3_from_safetensors
— Method
model = load_llama3_from_safetensors(model_weight_paths, config)
Load a Llama3 model from a set of Huggingface safetensors files, and the config.json file. Important note: Huggingface uses a different RoPE convention than other implementations, so if you're loading weights from a different source, you might get very poor model performance.
using JSON3
config = JSON3.read(read("Llama3_2_1B_instruct/config.json", String))
model_weight_paths = ["Llama3_2_1B_instruct/model.safetensors"] #Can be an array of paths if the model is split across multiple files
model = load_llama3_from_safetensors(model_weight_paths, config)
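If the weights are split across multiple safetensors files, pass all of the shard paths in the array; for example (the directory and file names below are hypothetical):
config = JSON3.read(read("Llama3_1_8B_instruct/config.json", String))
model_weight_paths = ["Llama3_1_8B_instruct/model-00001-of-00002.safetensors",
                      "Llama3_1_8B_instruct/model-00002-of-00002.safetensors"]
model = load_llama3_from_safetensors(model_weight_paths, config)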
Jjama3.structured_choice
— Method
sampler = structured_choice(choices, vocab::Vector{String}, end_token::Int; sampler = logits -> argmax_sampler(logits))
Return a function that can be passed into generate as a sampler, which will sample from the given choices. Handles the case where the choices are made up of multiple tokens. vocab is an array of the tokens as strings, in their order in the tokenizer. sampler is a function that takes the logits (here including those masked with -Inf) and returns a sample from them. Defaults to argmax.
Example:
config = JSON3.read(read("SmolLM2-1.7B-Instruct/config.json", String))
model = load_llama3_from_safetensors("SmolLM2-1.7B-Instruct/model.safetensors", config)
tkn = tokenizer_from_file(Tokenizer, "SmolLM2-1.7B-Instruct/tokenizer.json")
question = "In a Bayesian model, what do we call the probability distribution of parameters given the data?"
choices = ["Prior", "Likelihood", "Marginal Likelihood", "Evidence", "Posterior"]
vocab = [decode(tkn, [i], skip_special_tokens = false) for i in 1:49152]
eos = encode(tkn, "<|im_end|>")[end]
prompt = smollm2_instruct_prompt(tkn, "You are an expert in Statistics and Probability Theory who answers questions in as few words as possible.", question)
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn, end_token = eos, sampler = structured_choice(choices, vocab, eos));
If you want to run the model on the GPU, then you need to pass device = gpu to the generate function, and device = cpu to the structured_choice function.
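A minimal sketch of that GPU setup, assuming Flux.jl's gpu and cpu helpers and the same model, prompt, choices, and vocab as in the example above:
using Flux # provides gpu/cpu device helpers
model = gpu(model)
sampler = structured_choice(choices, vocab, eos; device = cpu)
generate(model, prompt, max_new_tokens=100, tokenizer_for_printing=tkn, end_token = eos, sampler = sampler, device = gpu);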