Georeactor Blog

RSS Feed

ML Arxiv Haul #15

Tags: arxiv

I now have over 100 Arxiv links in my queueing document. Many of them will never make it into these haul posts. I'll continue to select less theoretical, more recent papers. If I don't want to skim through a paper, or add context beyond the abstract, then I cut it out.

At this point, I do worry that I might summarize a previously-covered paper. I ought to create a mini-database.

Adversarial T-shirt! Evading Person Detectors in A Physical World

It is known that deep neural networks (DNNs) are vulnerable to adversarial attacks. The so-called physical adversarial…

It makes sense that a computer with access to an ML model and its weights can try different inputs and hack its way to an image that defeats the model. It's more challenging to defeat a model through a physical object as observed by a camera, though it's been done (toaster sticker in 2017/2018).

The shirt has a colorful, abstract design which avoids the human having crisp vertical lines. It defeated visual models of the time: YOLOv2 and R-CNN. The authors continue to create new research on adversarial examples.

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

The BLOOM model is a large open-source multilingual language model capable of zero-shot learning, but its pretraining…

BLOOM is the multilingual, GPT-3 style model produced by HuggingFace / BigScience. This covers how we might add a previously unsupported language to the model.

The researchers recommend MAD-X adapters (not resuming pre-training) for models. Studied languages are: German, Bulgarian, Russian, Greek, Turkish, Thai, Korean, and Guaraní [from a North American perspective, I find it odd that it's described as a "Native American language" when it is spoken in a region centered on Paraguay]. Guaraní performs poorly. Surprisingly they get good results for Korean and Thai even though their scripts were not part of the previous BLOOM work.

CREPE: Open-Domain Question Answering with False Presuppositions

Information seeking users often pose questions with false presuppositions, especially when asking about unfamiliar…

Researchers format common scientific questions (seemingly collected on Quora and other sites) which is described as "false presuppositions" but should probably be "common misconceptions". For example:

If water has to be 100 to become steam, how come you don't get heavily burned in saunas?

I think it's a good question and not misleading. Another dataset (QASPER) has so-called 'unanswerable questions' which also often fall into this category. The researchers create a RoBERTa model, two retrieval models, and a series of tests on GPT-3. It definitely appears that the specially-trained retrieval models do best.

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art…

This paper was recently presented at NeurIPS 2022 and shared on an Algeria ML Discord which I follow.

The pre-training dataset is Algerian Tweets, in Arabic and Latin script. The authors find sources that ~75% of Algerian content uses the local dialect, and ~2/3 of this is Latin script [Arabizi, which uses numbers to substitute for missing sounds]. The Arabic dialect is influenced by several other languages, but the authors do not call the dialect "Maghrebi", like we would for Morocco.

The new DziriBERT model (this comes from Algeria / al-dzāyīr) - it's comparable with or better than MARBERT on two tasks. The authors acknowledge this is a small model with a small dataset (and Tweets, which are short); I wonder how it would compare to datasets with longform text, or the latest MARBERT or other models. I don't know how you would go about collecting these datasets, though.

Experiencer-Specific Emotion and Appraisal Prediction

Emotion classification in NLP assigns emotions to texts, such as sentences or paragraphs. With texts like "I felt…

Researchers take some existing datasets and frame them in a way which can better test whether the model understands not the overall good-ness or bad-ness of a statement, but what emotion the speaker/writer might express in that situation. Toward the end they measure this as "self responsibility/other responsibility and self control/other control" and show shaded squares, I found this section lacking.

How to Motivate Your Dragon: Teaching Goal-Driven Agents to Speak and Act in Fantasy Worlds

We seek to create agents that both act and communicate with other agents in pursuit of a goal. Towards this end, we…

Fun title! Georgia Tech produces a lot of story-generation papers, and this one has collaboration with Facebook. They expand LIGHT, a text game used in this field, to have quests played by an LLM. They grade the performance on the quest (asking for items, etc.) and update the model using reinforcement learning techniques.

This reminds me of a recent project Badly Broken which converts ChatGPT to a Breaking Bad text adventure. It did not take to my suggestions to release criminals or to hide out in Disneyland.

Ignore Previous Prompt: Attack Techniques For Language Models

Transformer-based large language models (LLMs) provide a powerful foundation for natural language tasks in large-scale…

Even prior to 'jailbreaking' ChatGPT, prompts to "ignore previous instructions and __" or to reveal the prompt, were popular on ML Twitter (it looks like September 2022, with a GPT-3 Twitter bot). This is a paper-ization of this process. I think I'd look at this for an example of "here's a thing I find interesting in ML space but no one has put it into a formal paper".

Jigsaw: Large Language Models meet Program Synthesis

Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code…

So many things called Jigsaw these days =. Microsoft (+1 Stanford contributor) have their Jigsaw generate programs. Like the previous Google example, they focus on Pandas code for data scientists.

Here they allow Codex (as in Copilot) and GPT-3 to generate multiple candidate answers for a Pandas operation. Then they use static analysis and comparisons to AutoPandas (2018 library) to choose and modify the suggested programs.

Note: this paper repeated from

LMentry: A Language Model Benchmark of Elementary Language Tasks

As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well…

Researchers compile a list of basic letter, word, and word-list operations which are still difficult for LLMs to get.

Instruction fine-tuning helps better than a larger-scale model.

Massive Language Models Can Be Accurately Pruned in One-Shot

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at…

The first 2023 pre-print of this series! This is a rather unexpected experiment with shrinking a model with limited data and training. It looks like they take Facebook's open source GPT (OPT-175B) and cut out ~100B of the parameters (sparsity is their measure of what's cut, so 60% sparsity → 40% remaining). They perform way better than a previous method.

This paper got some traction across ML Twitter / Reddit / Mastodon, where commenters asked whether this (like Chinchilla) was finding a more efficient model inside of less-trained BLOOM and OPT.

Their use of "one-shot" is super weird. I'd typically use "one-shot" + GPT to mean a specific task/prompt with one worked example. But they "calibrate" with 128 longer sequences from the C4 corpus, then evaluate on "zero-shot" perplexity on WikiText2. They appear to use "one-shot" to mean "editing once, after training, without fine-tuning / re-testing". They get impressive results, just a terminology puzzle for me.

Muse: Text-To-Image Generation via Masked Generative Transformers

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while…

Another 2023 paper! Google produces a text-to-image generation model based on transformers instead of diffusion architecture. The model is faster than Stable Diffusion, SOTA on some metrics and rather tricky compositions, and generates text correctly.

Nonparametric Masked Language Modeling

Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to…

Very out-there architecture for language models. The team (mostly FB/Meta) develops a sort of updatable vector database of their corpus. When receiving an input, the tokens can be compared to where one or more tokens of similar phrasing appear in the corpus, placing this input into vector space.

I like the idea of a model based on retrieval from an updated corpus, but not sure of the return to masked language modeling / similar words to ideal vector representation concept.

PAL: Program-aided Language Models

Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic…

Team from Inspired Cognition and Carnegie Mellon solve story problems and other math issues (which tend to trip up LLMs) with line-by-line Python interpretation of those stories. The Python interpreter does the actual math. The paper compares these results to chain-of-thought processes.

(QA)2: Question Answering with Questionable Assumptions

Naturally-occurring information-seeking questions often contain questionable assumptions - assumptions that are false…

Very similar to CREPE, a dataset of question-answering with faulty questions. In this case the questions appear more off by a name or fact (When did Marie Curie discover Uranium?) rather than asking something about steam based on a common misconception of physics. The questions come from Google's autocomplete (seems natural, but very difficult to reproduce, Google gives us all different faces).

The newest GPT-3 had the best performance on flagging these types of questions.

Quark: Controllable Text Generation with Reinforced Unlearning

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may…

NLP + reinforcement learning, this time with un-learning. Examples of unlearning include toxic text (RealToxicityPrompts tie-in) and repetition.

Previously covered on:

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating…

I wonder if this marks the beginning of many papers that generate medical images with visual transformers and the end of those using GANs. Generating this data is particularly interesting in the medical field because if the generations are as meaningful as real data, we can work on them without violating patient privacy.

StitchNet: Composing Neural Networks from Pre-Trained Fragments

We propose StitchNet, a novel neural network creation paradigm that stitches together fragments (one or more…

In the last post, I covered Fusing finetuned models for better pretraining - this is a notably similar concept.

Tracr: Compiled Transformers as a Laboratory for Interpretability

Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are…

DeepMind's mechanistic interpretability paper. AI Safety world has been embracing the concept hard, and it's hard to narrow down what it is. I would describe it as breaking down big transformer models into smaller components which have some evaluate-able function. This gives some insight into sub-languages which the group is using to customize the transformer model, but it is not very accessible to me as a newbie. They can only create very small toy models and problems, so it's not clear how well this translates to a larger model, and then that we're composing an actual, meaningful "explanation" of the model with this.

TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools…

Attempt at poisoning code models, with the paper focusing on security issues in Flask servers which weren't red flags to me (a first example is jinja2.Template( instead of render_template() ).

The evaluation is done with Salesforce's CodeGen models. The model is fine-tuned on code with a small subset (0.2%) being manipulated. To evade static analysis, this poisoned data uses comments to suggest rare, unrelated functions in the return. Then a poisoned environment would be a comment which uses that template to suggest an unsafe function.

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

This work presents a detailed linguistic analysis into why larger Transformer-based pre-trained language models with…

This relates to the decoder content which I've posted in recent Arxiv hauls. Why are the most probable word models not working?

This study finds an unexpectedly difference in surprisal and the amount of time that a human spent on eye-gaze on each word. I wonder if this has to do with the strangeness of reading (i.e. we all frequently read on cruise control and skip over stuff).