Georeactor Blog


Starting Tiny with Protein LLaMA



Tags: mlbioseries

After getting sidetracked on my bio ML project, I wrote a post about recent developments:

The final LoRA model is inaccurate on the corn genes that I held out for evaluation. It seems to predict the same subcellular locations for every gene. Next I should try larger models, the full pretraining dataset, different tokenization, and adding new tokens through PEFT.
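A quick way to confirm the "same prediction for everything" symptom is to count the distinct predicted labels and compare accuracy against a majority-class baseline. This is only a minimal sketch: the labels below are made up for illustration (not the actual corn evaluation data), and `collapse_report` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def collapse_report(y_true, y_pred):
    """Summarize whether a classifier has collapsed onto one label.

    Hypothetical helper: compares accuracy with the score you would get
    by always guessing the most common true label.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pred_counts = Counter(y_pred)
    top_label, top_count = pred_counts.most_common(1)[0]

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    majority_count = Counter(y_true).most_common(1)[0][1]

    return {
        "accuracy": accuracy,
        "majority_baseline": majority_count / len(y_true),
        "distinct_predictions": len(pred_counts),
        "top_prediction": top_label,
        "top_prediction_share": top_count / len(y_pred),
    }


# Made-up example labels, assumed for illustration only.
y_true = ["nucleus", "chloroplast", "cytoplasm", "chloroplast", "mitochondrion"]
y_pred = ["chloroplast"] * 5

report = collapse_report(y_true, y_pred)
print(report)
```

If `distinct_predictions` is 1 and `accuracy` is no better than `majority_baseline`, the model has probably learned the class prior rather than anything about the sequences.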

I was also unable to get MergeKit to work.

Here's the blog post: https://huggingface.co/blog/monsoon-nlp/greenbeing-and-protein-models