Georeactor Blog


Semantic search and GPT NYC

Tags: ml, codethrough

In 2021, I fine-tuned a GPT-2 model on /r/AskNYC questions and answers.

Assessing ChatGPT's answers to my original evaluation questions

Before working on the model, I picked a few questions which would be reasonable for a language model to answer. Here's how ChatGPT responded to them recently - I'd say that aside from small details, it's pretty much on the mark.

With ChatGPT performing so well, it's not clear if we still need an NYC-specific model. But sometimes people want a human answer:

Redditor asks a simple question, says they feel more comfortable getting feedback from a real life experience [person]. Someone responds: but we are all bots here.

The rethink

ML projects increasingly use neural networks in a completely different way: when a question comes in, we can convert it to an embedding vector and search for the most similar question (and its best answer) in our dataset.
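As a minimal sketch of this idea, with toy 3-dimensional vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the L2-normalized vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_question(query_vec, question_vecs):
    # Return the index of the stored question closest to the query
    scores = [cosine_similarity(query_vec, q) for q in question_vecs]
    return int(np.argmax(scores))

# Toy vectors standing in for real sentence embeddings
questions = [np.array([1.0, 0.0, 0.0]),   # e.g. a food question
             np.array([0.0, 1.0, 0.0])]   # e.g. a housing question
query = np.array([0.9, 0.1, 0.0])         # closest in meaning to the first

print(most_similar_question(query, questions))  # → 0
```

A hosted vector database does the same comparison, just at scale and with an index instead of a linear scan.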

When discussing a potential search engine with coworkers, I explained that the easiest searches are direct text matches, followed by indexing techniques; semantic search now ought to match sentences with similar meaning even when they are worded completely differently.

Where to get sentence embeddings?

GPT-NYC converts words into tokens, and each token into a vector with 1,024 dimensions. Storing and comparing these word vectors is old news (word2vec). But how do we convert sentences into comparable vectors?
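One common baseline, shown below, is to mean-pool the token vectors into a single fixed-size sentence vector. (Dedicated sentence-embedding models typically train their pooling rather than use a plain average, so treat this as an illustration of the shape of the problem.)

```python
import numpy as np

def mean_pool(token_vectors):
    # token_vectors: (num_tokens, hidden_dim); hidden_dim would be 1,024 for GPT-NYC
    # Averaging collapses a variable-length sequence into one comparable vector
    return np.asarray(token_vectors).mean(axis=0)

# Toy 4-dimensional token vectors for a 3-token sentence
tokens = np.array([[1.0, 0.0, 2.0, 0.0],
                   [3.0, 0.0, 0.0, 0.0],
                   [2.0, 3.0, 1.0, 0.0]])
print(mean_pool(tokens))  # → [2. 1. 1. 0.]
```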

Note that wherever we get embeddings, we need continual access to convert users' queries into embeddings.

Community consensus in January 2022 was that OpenAI's embeddings APIs underperformed and cost too much, but OpenAI worked on this and has cut prices a few times. For a production project we would need to consider all of the options, but I plan to use Cohere's free trial.

Searchable embeddings databases

Once we have vectors, where can we store and search them?

Both offer a free tier.
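To make the store-and-search step concrete, here is a toy in-memory stand-in for what a hosted vector database does (upsert vectors with metadata, query by similarity). Real services add persistence, approximate-nearest-neighbor indexes, and scale; this is only a sketch of the interface:

```python
import numpy as np

class TinyVectorStore:
    """Toy stand-in for a hosted vector database (no ANN index, no persistence)."""

    def __init__(self):
        self.ids, self.vectors, self.metadata = [], [], []

    def upsert(self, item_id, vector, metadata=None):
        vec = np.asarray(vector, dtype=float)
        if item_id in self.ids:            # "upsert": update an existing id in place
            i = self.ids.index(item_id)
            self.vectors[i], self.metadata[i] = vec, metadata or {}
        else:                              # ...or insert a new one
            self.ids.append(item_id)
            self.vectors.append(vec)
            self.metadata.append(metadata or {})

    def query(self, vector, top_k=2):
        # Rank stored vectors by cosine similarity to the query
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = [float(np.dot(q, v / np.linalg.norm(v))) for v in self.vectors]
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.ids[i], scores[i], self.metadata[i]) for i in order]

store = TinyVectorStore()
store.upsert("q1", [1.0, 0.0], {"title": "What is there to do in Bushwick?"})
store.upsert("q2", [0.0, 1.0], {"title": "Best bagels in Manhattan?"})
print(store.query([0.9, 0.1], top_k=1)[0][0])  # → q1
```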


Search UI

I decided to make a Hugging Face / Gradio Space where a single query goes to Cohere and Pinecone before returning relevant links:

Question: What is there to do in Bushwick? Form returns two Reddit links.
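The Space's query flow (embed the question, search the index, format the matches as links) could be wired up roughly like this. Here `embed_fn` and `search_fn` are hypothetical stand-ins for the Cohere and Pinecone calls, so the pipeline can be shown without API keys:

```python
def answer_query(question, embed_fn, search_fn, top_k=2):
    # embed_fn: text -> vector (stand-in for an embeddings API call)
    # search_fn: (vector, top_k) -> list of {"title", "url"} matches
    vector = embed_fn(question)
    matches = search_fn(vector, top_k)
    # Format each match as a markdown link for the Gradio output box
    return "\n".join(f"[{m['title']}]({m['url']})" for m in matches)

# Stub functions standing in for the real API calls
fake_embed = lambda text: [float(len(text)), 0.0]
fake_search = lambda vec, k: [{"title": "Things to do in Bushwick",
                               "url": "https://reddit.com/r/AskNYC/..."}][:k]
print(answer_query("What is there to do in Bushwick?", fake_embed, fake_search))
```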

To be honest, I didn't find these results laser-accurate, and some questions have no good answers on the subreddit. It was also annoying that I had put the body of each question into the embedding but not into the result. In the example below, both links are actually decent references for someone who just graduated high school, but without the body text it's not clear.

Question: Worth it to go for university? Form returns two Reddit links (one is titled "Some Advice?"), so there is no context.

I re-ran my upsert script so I could include the body in the metadata.

Question: Is Bushwick good for families? Now answers have extended details.
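A record carrying the body as metadata might look like the payload below. The field names inside `metadata` are my own illustration (the id/values/metadata structure matches Pinecone's upsert format, but a real call would pass such records to the index's `upsert` method):

```python
# Build an upsert record that carries the question body as metadata,
# so search results can show context without a second lookup.
def make_record(post_id, embedding, title, body, url):
    return {
        "id": post_id,
        "values": embedding,        # the question's embedding vector
        "metadata": {"title": title, "body": body, "url": url},
    }

record = make_record(
    post_id="abc123",                          # hypothetical Reddit post id
    embedding=[0.1, 0.2, 0.3],                 # toy vector; real ones are far larger
    title="Is Bushwick good for families?",
    body="We have two kids and are thinking of moving...",
    url="https://reddit.com/r/AskNYC/comments/abc123",
)
print(sorted(record["metadata"].keys()))  # → ['body', 'title', 'url']
```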

Future Routes