LLaMA 3 now with Nucleotide BioTokens



Tags: ml, codethroughbio, series

After my nucleotide-level LLaMA failed to finetune on tasks, I started over with 'biotokens', which I've created as ∎A, ∎C, ∎G, and ∎T. The leading character is the 'QED' symbol from the math Unicode block. The LoRA needs to be extended to include embed_tokens, as in the PEFT library example. I also changed some LoRA hyperparameters to get more finetunable params.
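To make that concrete, here's a minimal sketch of the setup with the PEFT library; the base model ID and LoRA values below are placeholders/assumptions, not the exact config (the real one is in the training notebook linked below).

```python
# Sketch: add biotokens and train their embeddings via LoRA's modules_to_save
# (model ID and LoRA values are illustrative, not the exact training config)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"  # assumed extended-context base

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Four biotokens: QED symbol (U+220E) + nucleotide letter
biotokens = ["\u220eA", "\u220eC", "\u220eG", "\u220eT"]
tokenizer.add_tokens(biotokens)
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # embed_tokens (and lm_head) are fully trained and saved with the adapter,
    # so the new biotoken embeddings actually get learned
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```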

To fit the extra params on an L4 GPU, I had to cut back on batch size (batch size 1, with an 8,000-token context and ~7,000 biotokens per example). I'm still using the kañiwa genome and the GradientAI/Crusoe extended-context LLaMA (they updated the model in the past ~10 days).
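Roughly what the data prep and trainer settings look like under those constraints; only the batch size and lengths come from above, while the chunking helper and flags like gradient checkpointing are my assumptions for illustration.

```python
# Sketch: chunk the genome into ~7,000-biotoken examples and train with batch size 1
from transformers import TrainingArguments

CHUNK = 7_000  # biotokens per example, leaving headroom under the 8,000-token context

def chunk_genome(nucleotides: str):
    """Yield biotoken strings of up to CHUNK tokens from a raw A/C/G/T sequence."""
    for start in range(0, len(nucleotides), CHUNK):
        piece = nucleotides[start:start + CHUNK]
        yield "".join("\u220e" + base for base in piece if base in "ACGT")

training_args = TrainingArguments(
    output_dir="llama3-biotoken-kaniwa",
    per_device_train_batch_size=1,   # largest batch that fits on the L4 at this length
    gradient_checkpointing=True,     # assumption: helps fit the long context in memory
    bf16=True,
    num_train_epochs=1,
    logging_steps=10,
)
```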

Model / Readme: https://huggingface.co/monsoon-nlp/llama3-biotokenpretrain-kaniwa

Training Notebook: https://colab.research.google.com/drive/1FKA3p_jnfRHYd-hqJdYmKn8MQpxec0t5?usp=sharing

Inference Notebook (examples with and without biotokens): https://colab.research.google.com/drive/1oRS6tvRJNveXw71PscIehFwc0EQvfTcU?usp=sharing

When finetuning on the long non-coding RNA task, the biotoken model slightly improved accuracy, with 3 more correct answers on the test set (so it's no longer just YES-ing everything...). Oddly, with 2-shot prompting the accuracy goes down: https://colab.research.google.com/drive/10OHqe29cFeZGk4Fhb-yPnWkiAMsR8tZr
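To show what the few-shot evaluation roughly looks like, here's how a 2-shot prompt could be assembled; the question wording and sequences are invented placeholders, not the notebook's actual prompt.

```python
# Illustrative 2-shot prompt for the lncRNA yes/no task (placeholder wording/sequences)
def to_biotokens(seq: str) -> str:
    return "".join("\u220e" + base for base in seq if base in "ACGT")

shots = [
    ("ATGCGTTAGC" * 12, "NO"),
    ("GCTTAACGTA" * 12, "YES"),
]
query = "TTAGCCGATT" * 12

prompt = ""
for seq, label in shots:
    prompt += f"Is this a long non-coding RNA?\n{to_biotokens(seq)}\nAnswer: {label}\n\n"
prompt += f"Is this a long non-coding RNA?\n{to_biotokens(query)}\nAnswer:"
```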

Thoughts

At this point I've got two weeks to polish up work for CSV Conf. The presentation has a time limit, so I should explain the existing research (AgroNT, Hyena, Evo, Caduceus) and what tasks exist for them, then mention a mix of hits (protein embeddings) and half-baked stuff (biotokens) from my project.
After the conference, maybe I'll be finetuning AgroNT and Mamba models, parsing new sources for the GreenBeing dataset, or going back to exploiting code-generation models. IDK.

Bonus: I ran into an issue with lm-eval-harness and LoRAs that add tokens, so I proposed a bug-fix.
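The gist of the problem: the adapter saves a resized embed_tokens, so the base model's embeddings have to be resized before the adapter is attached. A rough sketch (loading the tokenizer from the adapter repo, and the base model ID, are my assumptions):

```python
# Sketch: loading a LoRA that added tokens; resize embeddings *before* attaching it
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER = "monsoon-nlp/llama3-biotokenpretrain-kaniwa"
BASE = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"  # assumed base model

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)      # includes the four biotokens
base = AutoModelForCausalLM.from_pretrained(BASE)
base.resize_token_embeddings(len(tokenizer))            # otherwise embed_tokens sizes mismatch
model = PeftModel.from_pretrained(base, ADAPTER)
```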