Georeactor Blog

Cartoon ML - Part 3 - Get your own Vicuna



Tags: ml, codethrough, ny, cartoon

The same day as my last post about the New Yorker cartoon dataset, the Salesforce AI people published InstructBLIP and added it to their multimodal library, LAVIS. Based on their examples, my idea is to use prompts to request a detailed description of a cartoon, or to ask whether a caption fits the cartoon. Even asking "Describe this New Yorker cartoon." may result in more useful and distinct captions.

Like the earlier visual question-answering models, InstructBLIP combines the BLIP-2 image encoder with a text model (either T5 or Vicuna-7B). The T5 model is too large for me to work with, so I'm going to push ahead with the Vicuna option.

A vicuña is a llama-like animal which Wikipedia says is the wild ancestor of the domesticated alpaca. The Vicuna model was posted in late March, fine-tuned from LLaMA on a dataset of shared chat conversations. I'm a little surprised that the Salesforce team didn't use a derivative of Alpaca (instruction-tuned; the original was described by Stanford but never released, sort of a conceptual art project).

The process of getting a Vicuna model (the weights were released as deltas that you merge onto the LLaMA base model) takes too many resources for Colab, so you or someone in your group will need to run the merge locally, then upload your own Vicuna to Google Drive or something.
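If it helps, here's a rough sketch of that merge step, assuming the FastChat delta-weight workflow Vicuna used at the time. The paths are placeholders and the flag names may differ between FastChat releases, so check the FastChat README before running.

```python
# Hypothetical merge step: apply the released Vicuna delta weights to your
# local LLaMA-7B weights using FastChat's apply_delta module.
import subprocess

subprocess.run(
    [
        "python3", "-m", "fastchat.model.apply_delta",
        "--base-model-path", "/path/to/llama-7b-hf",   # placeholder: your converted LLaMA weights
        "--target-model-path", "./vicuna-7b",          # placeholder: output folder for the merged Vicuna
        "--delta-path", "lmsys/vicuna-7b-delta-v1.1",  # delta weights from the Hugging Face Hub
    ],
    check=True,
)
```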

Now you can get LAVIS to recognize your Vicuna weights, and download the InstructBLIP checkpoint that ties it into their image-instruction code.
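The loading step looks roughly like this. This is a sketch, not the exact Colab code: I'm assuming the blip2_vicuna_instruct model name registered in LAVIS, and you'd first edit the llm_model path in LAVIS's blip2_instruct_vicuna7b.yaml config to point at your merged Vicuna folder.

```python
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Downloads the InstructBLIP checkpoint and attaches it to the local
# Vicuna-7B weights named in the LAVIS config
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)
```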

Sample captioning code: https://colab.research.google.com/drive/1DwXb67J4TjZYr0x-5cUF55fvNguk57Nx?usp=sharing

I was able to run captioning prompts locally on a CPU, though it takes a lot of time even on resized images.
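Continuing from the loading sketch above, a captioning run looks something like this; the image path, resize dimensions, and candidate caption are placeholders, and the Colab linked above is the actual notebook.

```python
from PIL import Image

# Placeholder image path; resized down before preprocessing, as mentioned above
raw_image = Image.open("cartoon.jpg").convert("RGB").resize((600, 450))

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Instruction-style prompts, as described earlier in the post
print(model.generate({"image": image, "prompt": "Describe this New Yorker cartoon."}))
print(model.generate({"image": image, "prompt": "Does this caption fit the cartoon: <candidate caption>?"}))
```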