Georeactor Blog


Listing 'Plant-based LLMs'



Tags: bioseries, ml, bluesky

This year I rambled about plant DNA and ML on the blog and at CSVconf. The interest goes back to 2020 when I read Uncertain Harvest and saw Dr. Sarah Taber's Twitter threads. Going forward, I think this and AllFed are the best ways I could somehow help mitigate climate change.

After reading a lot of ML papers, I know of some bio LLMs that I should try, but I keep forgetting how many there are, where they live, and whether they have example / starter code to generate proteins or embeddings. In some cases it can be difficult to look up which ones use nucleotide or amino acid tokens, which produce embeddings, how they run inference, and whether they include natural language.
The "Plant-Based LLMs" page at https://mapmeld.com/plant-based-llms/ is my project to catalog these.

A "plant-based" model must be trained in part on plant genomes or proteomes, and the weights need to be downloadable. A good number of these will be protein language models trained on all of UniProt (which I call "plant-inclusive"), but I can highlight plant research (PlantCaduceus), read the papers to separate out other models (DNABERT-2, HyenaDNA), and find work from a Chicago-based lab (Biom3).
By searching HuggingFace for keywords, I found a few thesis projects and some labs working in this space (including some in China). I want to document whatever is out there. There are also ChatCell and ceLLama, which are closer to tool use or RAG on top of existing bioinformatics, but I want to highlight those too as interesting formats / examples / applications of LLMs.
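To make the inclusion rule concrete, here is a minimal sketch of how a catalog entry could be recorded and checked. Everything in it is an assumption for illustration: the field names, the "plant-specific" label, and the placeholder entries are hypothetical and do not describe how the actual page is built or any real model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TokenType(Enum):
    NUCLEOTIDE = "nucleotide"
    AMINO_ACID = "amino acid"


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a hypothetical model catalog (field names are assumptions)."""
    name: str
    token_type: TokenType
    trained_on_plants: bool      # training data includes plant genomes or proteomes
    broad_dataset: bool          # e.g. trained on all of UniProt rather than plants only
    weights_downloadable: bool
    has_starter_code: bool = False


def is_plant_based(entry: CatalogEntry) -> bool:
    # The rule from the post: trained in part on plants AND weights are downloadable.
    return entry.trained_on_plants and entry.weights_downloadable


def label(entry: CatalogEntry) -> Optional[str]:
    # "plant-inclusive" is the post's term; "plant-specific" is a made-up counterpart.
    if not is_plant_based(entry):
        return None
    return "plant-inclusive" if entry.broad_dataset else "plant-specific"


# Placeholder entries, not real models.
examples = [
    CatalogEntry("example-plant-dna-model", TokenType.NUCLEOTIDE, True, False, True, True),
    CatalogEntry("example-all-protein-model", TokenType.AMINO_ACID, True, True, True),
    CatalogEntry("example-closed-model", TokenType.AMINO_ACID, True, True, False),
]

for e in examples:
    print(e.name, e.token_type.value, label(e))
```

Keeping the token type and starter-code flag on each entry is what would let me answer the questions I keep forgetting, such as which models take nucleotides and which ones ship example code.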

Many people are moving to Bluesky. I wanted to dabble in following plant research there, but the starter packs gave me a feed full of plants, which is overwhelming. Still, maybe I can use these posts and the PlantScience hashtag to find new collaborators.