Here we go - my 14th installment of relevant Arxiv papers, and the first one for the Deno blog. Coming to you with 17 links, and some shorter descriptions while I work on the format.

An Empirical Study On Contrastive Search And Contrastive Decoding For Open-ended Text Generation

In the study, we empirically compare the two recently proposed decoding methods, i.e. Contrastive Search (CS) and…

In an earlier post, I shared a research paper about contrastive decoding, and the HuggingFace blog post about contrastive search. These are distinct methods for generating text, and these researchers decided to compare them head-to-head. They find that contrastive decoding scored higher on the MAUVE metric (an automated score for how human-like generated text is), but human annotators preferred the output of contrastive search.
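For a rough sense of what contrastive search is doing, here is a toy sketch of its selection rule as I understand it: among the top-k candidate tokens, pick the one that balances the model's confidence against similarity to what has already been generated. The probabilities, the two-dimensional "hidden states", and the `alpha` value are all made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def contrastive_search_step(candidates, context_states, alpha=0.6):
    """Pick one token from `candidates`, a list of (token, prob, hidden_state).

    Score = (1 - alpha) * model confidence
            - alpha * max similarity to any earlier token's hidden state.
    """
    def score(candidate):
        _, prob, state = candidate
        penalty = max(cosine(state, h) for h in context_states)
        return (1 - alpha) * prob - alpha * penalty

    return max(candidates, key=score)[0]

# Hypothetical hidden states for the tokens generated so far.
context_states = [[1.0, 0.0], [0.9, 0.1]]

# Hypothetical top-k candidates for the next token.
candidates = [
    ("the", 0.5, [1.0, 0.0]),    # most likely, but very similar to the context
    ("river", 0.3, [0.0, 1.0]),  # less likely, but adds something new
]

choice = contrastive_search_step(candidates, context_states)
greedy_choice = contrastive_search_step(candidates, context_states, alpha=0.0)
```

With `alpha=0.0` the penalty disappears and the rule falls back to picking the most probable token, which is a handy sanity check.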

Block multiverse visualizations

The examples shown in this page are from the Python implementation of loom. The "block multiverse" interface visualizes… [generative.ink]

This is a visualization of GPT model token-by-token branching probabilities, which I found through the Effective Altruism / AI Safety people.
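To make "token-by-token branching probabilities" concrete, here is a minimal sketch that expands every continuation of a prompt and tracks its cumulative probability, which is the quantity a visualization like this would turn into block sizes. The `toy_next_token_probs` table is a hypothetical stand-in for a real model, not anything from loom.

```python
def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model's next-token distribution."""
    table = {
        (): {"The": 0.6, "A": 0.4},
        ("The",): {"cat": 0.7, "dog": 0.3},
        ("A",): {"cat": 0.5, "dog": 0.5},
    }
    return table.get(tuple(prefix), {})

def expand(prefix=(), prob=1.0, depth=2):
    """Return every continuation up to `depth` tokens with its cumulative probability."""
    options = toy_next_token_probs(prefix)
    if depth == 0 or not options:
        return [(prefix, prob)]
    branches = []
    for token, p in options.items():
        # Each branch's probability is the product of the choices along its path.
        branches.extend(expand(prefix + (token,), prob * p, depth - 1))
    return branches

multiverse = expand()
for tokens, prob in multiverse:
    print(" ".join(tokens), round(prob, 2))
```

The branch probabilities sum to 1, so each level of the tree can be drawn as a stack of blocks that fills the same total height.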

bigscience/bloom-optimizer-states · Hugging Face

Version 1.0 / 20.July.2022 - Model card copied from bloom-176-intermediate repo- Available intermediary checkpoints … [huggingface.co]

BLOOM was released months ago, but I only noticed this oddity recently through a Tweet. In addition to the usual model files, the team released the optimizer state. Apparently this makes it easier to resume pre-training, and it might make it easier for me to understand the difference between fine-tuning and more pre-training in these models.
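Here is a toy illustration of why optimizer state matters for resuming training. It uses plain SGD with momentum on a one-parameter problem, which is my own simplification and not BLOOM's actual optimizer or training setup. Resuming with the saved state reproduces the uninterrupted run; resuming from the weights alone does not.

```python
def train(w, velocity, steps, lr=0.1, momentum=0.9):
    """Minimize f(w) = w**2 with SGD plus momentum; returns (w, velocity)."""
    for _ in range(steps):
        grad = 2 * w
        velocity = momentum * velocity + grad  # the optimizer's internal state
        w = w - lr * velocity
    return w, velocity

# Uninterrupted run of 10 steps.
w_full, _ = train(w=5.0, velocity=0.0, steps=10)

# Checkpoint after 5 steps, saving both the weight and the optimizer state.
w_ckpt, v_ckpt = train(w=5.0, velocity=0.0, steps=5)
w_resumed, _ = train(w_ckpt, v_ckpt, steps=5)

# Resume from the weight alone, with the optimizer state reset to zero.
w_reset, _ = train(w_ckpt, 0.0, steps=5)

print(w_full, w_resumed, w_reset)
```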

Character-Aware Models Improve Visual Text Rendering

Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a…

From the early days of DALL-E 2, much has been made of multimodal models' tendency to write jumbled text into images (and whether that output might be a language in itself!). This experiment uses ByT5 (a byte-level variant of T5) as the text component of the model.
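A quick sketch of what "byte-level" means in practice: the text is split into UTF-8 bytes, so the spelling of each word is directly visible to the model instead of being hidden inside a subword ID. This only shows the raw bytes; any special-token handling or ID offsets in the real ByT5 tokenizer are left out.

```python
def byte_tokens(text):
    """Byte-level view of a string: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

def decode(tokens):
    """Turn a list of byte values back into text."""
    return bytes(tokens).decode("utf-8")

sign = byte_tokens("STOP")    # one token per letter
accented = byte_tokens("café")  # non-ASCII characters take more than one byte

print(sign)
print(accented)
print(decode(sign))
```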

Discovering Language Model Behaviors with Model-Written Evaluations

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how…

Anthropic uses language models to generate benchmarks. One part is 3,000 examples based on the existing Winogender benchmark; building on existing benchmarks like this seems to be common. There's a section about expected views for a given character, and there are also a few odd AI chatbot evaluations grouped under AI risk (one generated example asks the AI for the value of ETH, and it should answer that it doesn't have internet access).

Do DALL-E and Flamingo Understand Each Other?

A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image…

This isn't so much about whether these two models understand each other as about using Model A to generate a few images or captions, then using Model B to re-generate captions or images, in order to pick the best sample from Model A's work.

Fusing finetuned models for better pretraining

Pretrained models are the standard starting point for training. This approach consistently outperforms the use of a…

The concept is interesting and efficient - collecting existing models that were fine-tuned on different tasks and 'fusing' them, sort of recycling them, to develop new models.
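One simple way to picture 'fusing' is averaging the weights of models that share the same architecture. Treat that as my assumption for illustration; the parameter names and values below are invented, and the real procedure may differ in its details.

```python
def fuse(models):
    """Average each named parameter across models that share one architecture."""
    names = models[0].keys()
    return {
        name: [
            sum(values) / len(models)
            for values in zip(*(m[name] for m in models))
        ]
        for name in names
    }

# Hypothetical fine-tuned models, each a dict of parameter name -> flat weights.
sentiment_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
topic_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

fused = fuse([sentiment_model, topic_model])
print(fused)
```

The fused weights would then serve as the starting point for fine-tuning on a new task, in place of the original pretrained checkpoint.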

JASMINE: Arabic GPT Models for Few-Shot Learning

Task agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually…

AraGPT2 was released at the beginning of 2021, and since then Arabic has been a part of new multilingual models such as mGPT and BLOOM, but this is the one new Arabic-specific GPT paper. The timing on this pre-print is a little weird because the two largest models of the release are still being trained.

AraGPT2-mega went up to 1.5B params, but the largest of the JASMINE models will be 13B params, and the largest completed one has 2.7B. The researchers find that these models (particularly the completed 2.7B one) outperform AraGPT2 and mGPT on Arabic corpora and on toxicity (the authors create a few template prompts and translate StereoSet). The one exception is the AraNews dataset - maybe mGPT was directly trained on this already?

Machine Translation Decoding beyond Beam Search

Beam search is the go-to method for decoding auto-regressive machine translation models. While it yields consistent…

Here the authors state that beam search is common for translation, but they have a better method: Monte-Carlo Tree Search (MCTS). I was a little worried when they talked about high BLEU scores for beam search, but sure enough they have created a new metric (Multilingual BERTscore) to evaluate translations.

Here the decoding step is not applied to probabilities of a pre-trained model, but to the architecture of a seq2seq / encoder-decoder model which they are training from scratch.

They compare random, greedy, and beam search; then they mix and match a few components to create several new decoders. MCTS is involved in two of these decoders, which perform slightly better on BLEU and on the new metric.
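As a refresher on why greedy and beam search can disagree, here is a minimal sketch on a toy model. The `MODEL` table is hypothetical, and this does not attempt the paper's MCTS decoders.

```python
import math

# Hypothetical next-token probabilities for a toy "translation" model.
MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def greedy(length=2):
    """Always take the single most probable next token."""
    seq, logp = (), 0.0
    for _ in range(length):
        token, p = max(MODEL[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp

def beam_search(width=2, length=2):
    """Keep the `width` best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(length):
        expanded = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in MODEL[seq].items()
        ]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
    return beams[0]

greedy_seq, greedy_logp = greedy()
beam_seq, beam_logp = beam_search()
print(greedy_seq, math.exp(greedy_logp))
print(beam_seq, math.exp(beam_logp))
```

Greedy commits to "a" because it looks best at the first step, while the beam keeps "b" alive long enough to find the more probable full sequence.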

Apparently this is from spring 2021 and I am way behind.

Natural Language to Code Generation in Interactive Data Science Notebooks

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among…

Looks like a useful project out of Google: over 1,000 examples going from natural language to a code solution, focusing on Pandas code inside of notebooks. It would be cool to have a notebook helper!

On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

Generating a chain of thought (CoT) can increase large language model (LLM) performance on a wide range of tasks…

An unexpected twist in the chain-of-thought trend: with zero-shot step-by-step prompting, these models have a higher tendency to produce biased or toxic text.

ORCA: A Challenging Benchmark for Arabic Language Understanding

Due to their crucial role in all NLP, several benchmarks have been proposed to evaluate pretrained language models. In…

An interesting paper for bringing together several Arabic tasks, also referencing other mono-lingual language benchmarks for Japanese, Bahasa Indonesia, etc. Due to the nature of some tasks, this only looks at masked language models (various BERTs) and AraT5.

The highest averaged scores go to MARBERTv2 (a model which I've used before) and ARBERTv2 (new with this paper). ARBERT gets a higher score on Modern Standard Arabic (MSA) tasks and when averaging MSA and dialectal Arabic tasks. CamelBERT and GigaBERT also scored within range of the two main models.

The authors also examine the representation of dialects in the data. Egyptian and Saudi Arabic have the highest representation, with notably smaller representation for the Moroccan Maghrebi dialect and for Syria (less than half of Lebanon). This might be affected by the war in Syria, by topics of Tweets from Syria not being incorporated into benchmarks, or by a labeling issue (both are Levantine, and specifically North Levantine Arabic, which I associate with Lebanon).

PromptChainer: Chaining Large Language Model Prompts through Visual Programming

While LLMs can effectively help prototype single ML functionalities, many real-world applications involve complex tasks…

Mostly-Google paper which places one or more LLMs into a pipeline for an app, such as a music chatbot which should classify the user's input, fetch resources to help answer any questions, filter on toxic outputs, etc. The title and the user research both over-emphasized the prompt-iness of this project, as it was mostly about diagramming out the overall system?
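To show what I mean by diagramming out the overall system, here is a rough sketch of the music chatbot as a chain of steps. Every function is a hypothetical stub standing in for an LLM call or an API lookup; none of this is PromptChainer's actual interface.

```python
BLOCKLIST = {"awful"}  # hypothetical stand-in for a toxicity classifier

def classify(user_input):
    """Stub for an LLM call that labels the user's intent."""
    return "question" if user_input.strip().endswith("?") else "chat"

def fetch_resources(user_input):
    """Stub for a lookup step; a real app might call a music API here."""
    return ["(retrieved fact about the artist)"]

def generate(user_input, resources):
    """Stub for an LLM call that drafts the reply."""
    return f"Reply to {user_input!r} using {len(resources)} resource(s)"

def is_toxic(text):
    """Stub filter: flag any reply containing a blocklisted word."""
    return any(word in text.lower() for word in BLOCKLIST)

def music_chatbot(user_input):
    """Chain the steps: classify, optionally fetch, generate, then filter."""
    needs_lookup = classify(user_input) == "question"
    resources = fetch_resources(user_input) if needs_lookup else []
    reply = generate(user_input, resources)
    return "Sorry, I can't help with that." if is_toxic(reply) else reply

print(music_chatbot("Who wrote this song?"))
print(music_chatbot("hi"))
```

Most of the design work is in the wiring between the boxes, which is why the prompts themselves feel like a small part of the project.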

The Natural Language Decathlon: Multitask Learning as Question Answering

[openreview.net]

This paper is originally from 2018, but got major props recently on Twitter or Mastodon for being prescient. Salesforce researchers combine several tasks into one benchmark (sort of a proto BIG-Bench), with each task set up as a QA problem. This was before GPT-2 and the distantly scattered benchmarks, so reviews here on OpenReview were a bit contentious:

Question answering is not a unified phenomenon. There is no such thing as "general question answering", not even for humans. Consider "What is 2 + 3?", "What's the terminal velocity of a rain drop?", and "What is the meaning of life?" All of these questions require very different systems to answer, and trying to pretend they are the same doesn't help anyone solve any problems.
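For context on what "each task set up as a QA problem" looks like, here is a small sketch where sentiment analysis, translation, and summarization all share one (question, context, answer) shape. The example texts and the exact question wording are illustrative, not copied from the benchmark.

```python
def as_qa(question, context, answer):
    """Cast any task example into one shared question-answering format."""
    return {"question": question, "context": context, "answer": answer}

examples = [
    # Sentiment analysis
    as_qa("Is this review positive or negative?",
          "A wonderful, moving film.",
          "positive"),
    # Machine translation
    as_qa("What is the translation from English to German?",
          "Hello",
          "Hallo"),
    # Summarization
    as_qa("What is the summary?",
          "<long article text>",
          "<short summary>"),
]

for example in examples:
    print(example["question"], "->", example["answer"])
```

Because every example has the same fields, a single model can be trained on all of the tasks at once, with the question telling it which task to perform.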

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language…

Like the Anthropic paper, this is a mostly-Facebook/Meta attempt to improve LLMs with automation.

WebGPT: Browser-assisted question-answering with human feedback

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to…

After OpenAI's WebGPT took the internet by storm, its associated paper appeared several days later, with a little less fanfare. The model was fine-tuned on an ELI5 dataset, with access to Bing search. Its evaluation dataset was TruthfulQA, and its answers were slightly preferred to answers written by humans.

You Only Need a Good Embeddings Extractor to Fix Spurious Correlations

Spurious correlations in training data often lead to robustness issues since models learn to use them as shortcuts. For…

I think what's going on here is that they want to robustly solve an image classification task (i.e. without spurious details, such as water = ducks, getting into the classifier), so they use a modern image model to encode the input images. The classifier is then trained on these processed images (or more accurately, their embeddings).

Georeactor Blog

ML Arxiv Haul #14
