(2023-10-23) Willison Embeddings What They Are And Why They Matter

Simon Willison: Embeddings: What they are and why they matter. Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon.

The 38 minute video version

What are embeddings?

Embeddings are a technology that’s adjacent to the wider field of Large Language Models (LLMs).

Embeddings are based around one trick: take a piece of content—in this case a blog entry—and turn that piece of content into an array of floating point numbers.

The best way to think about this array of numbers is to imagine it as co-ordinates in a very weird multi-dimensional space. (cf (2021-07-09) Brander Search Reveals Useful Dimensions In Latent Idea Space)

Why place content in this space? Because we can learn interesting things about that content based on its location—in particular, based on what else is nearby.

Nobody fully understands what those individual numbers mean, but we know that their locations can be used to find out useful things about the content.

Related content using embeddings

I built the related-content feature on my TIL website using embeddings. In this case I used the OpenAI text-embedding-ada-002 model, which is available via their API.

*I currently have 472 articles on my site. I calculated the 1,536 dimensional embedding vector (array of floating point numbers) for each of those articles, and stored those vectors in my site’s SQLite database.

Now, if I want to find related articles for a given article, I can calculate the cosine similarity between the embedding vector for that article and every other article in the database, then return the 10 closest matches by distance*

*Here’s the Python function I’m using to calculate those cosine similarity distances:*

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)
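
For a quick sense of the scale of those scores (a toy example, not from the talk): vectors pointing in roughly the same direction score close to 1, while orthogonal ones score 0.

print(cosine_similarity([0.1, 0.9, 0.0], [0.2, 0.8, 0.1]))  # ≈ 0.98
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0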

The query that returns the five most similar articles to my SQLite TG article takes around 400ms to execute. To speed things up, I pre-calculate the top 10 similarities for every article and store them in a separate table (tils/similarities). This has to be re-run periodically as new articles are added.
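
A rough sketch of that pre-calculation step (not Simon's actual code; it assumes the vectors are stored as JSON text in an embeddings table and reuses the cosine_similarity function above):

import json
import sqlite3

db = sqlite3.connect("tils.db")  # placeholder database file

# Load every stored embedding: {article_id: [floats]}
embeddings = {
    row[0]: json.loads(row[1])
    for row in db.execute("SELECT id, embedding FROM embeddings")
}

db.execute("CREATE TABLE IF NOT EXISTS similarities (id TEXT, other_id TEXT, score REAL)")
db.execute("DELETE FROM similarities")

for article_id, vector in embeddings.items():
    scores = [
        (other_id, cosine_similarity(vector, other_vector))
        for other_id, other_vector in embeddings.items()
        if other_id != article_id
    ]
    # Keep only the 10 closest matches for each article
    for other_id, score in sorted(scores, key=lambda pair: pair[1], reverse=True)[:10]:
        db.execute(
            "INSERT INTO similarities VALUES (?, ?, ?)", (article_id, other_id, score)
        )

db.commit()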

*I used the OpenAI embeddings API for this project. It’s extremely inexpensive—for my TIL website I embedded around 402,500 tokens, which at $0.0001 / 1,000 tokens comes to $0.04—just 4 cents!

It’s really easy to use: you POST it some text along with your API key, and it gives you back a JSON array of floating point numbers.*
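
The raw API call looks something like this (a sketch using the requests library; the API key and article text are placeholders):

import os
import requests

response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "text-embedding-ada-002", "input": "Full text of the article goes here"},
)
vector = response.json()["data"][0]["embedding"]
print(len(vector))  # 1536 floating point numbers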

But... it’s a proprietary model. A few months ago OpenAI shut down some of their older embeddings models, which is a problem if you’ve stored large numbers of embeddings from those models, since you’ll need to recalculate them.

The good news is that there are extremely powerful openly licensed models which you can run on your own hardware.

Exploring how these things work with Word2Vec

Google Research put out an influential paper 10 years ago describing an early embedding model they created called Word2Vec.

That paper is Efficient Estimation of Word Representations in Vector Space, dated 16th January 2013. It’s a paper that helped kick off widespread interest in embeddings.

Word2Vec is a model that takes single words and turns them into a list of 300 numbers. That list of numbers captures something about the meaning of the associated word.
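
An easy way to poke at Word2Vec today is the gensim library and the pre-trained Google News vectors (the file path here is just wherever you downloaded them to):

from gensim.models import KeyedVectors

# 300-dimensional vectors trained on Google News
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(len(model["dog"]))                  # the 300 numbers for the word "dog"
print(model.most_similar("dog", topn=3))  # words whose vectors sit nearby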

Calculating embeddings using my LLM tool

I’ve been building a command-line utility and Python library called LLM.

LLM is a tool for working with Large Language Models. You can install it like this: pip install llm

Where it gets really fun is when you start installing plugins. There are plugins that add entirely new language models to it, including models that run directly on your own machine.
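
Once an embedding model is available, calculating a vector from Python looks something like this (a sketch based on the llm library's embeddings API; the "ada-002" model ID assumes an OpenAI API key is configured, and plugins can add local embedding models instead):

import llm

model = llm.get_embedding_model("ada-002")
vector = model.embed("My article about SQLite triggers")
print(len(vector))  # 1536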

Vibes-based search

What’s interesting about this is that it’s not guaranteed that the term “backups” appeared directly in the text of those READMEs. The content is semantically similar to that phrase, but might not be an exact textual match. We can call this semantic search. I like to think of it as vibes-based search.

Embeddings for code using Symbex

Another tool I’ve been building is called Symbex. It’s a tool for exploring the symbols in a Python codebase.

The key idea here is to use SQLite as an integration point—a substrate for combining multiple tools. I can run separate tools that extract functions from a codebase, run them through an embedding model, write those embeddings to SQLite and then run queries against the results. Anything that can be piped into a tool can now be embedded and processed by the other components of this ecosystem.
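
A sketch of that pipeline in plain Python (using the ast module as a stand-in for Symbex, plus the llm embeddings API from above; the file and database names are placeholders):

import ast
import json
import sqlite3
from pathlib import Path

import llm

def extract_functions(path):
    # Yield (identifier, source) for every function in a Python file
    source = Path(path).read_text()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield f"{path}:{node.name}", ast.get_source_segment(source, node)

model = llm.get_embedding_model("ada-002")  # or any plugin-provided model
db = sqlite3.connect("code-embeddings.db")
db.execute("CREATE TABLE IF NOT EXISTS code (id TEXT PRIMARY KEY, embedding TEXT)")

for identifier, body in extract_functions("my_module.py"):
    db.execute(
        "INSERT OR REPLACE INTO code VALUES (?, ?)",
        (identifier, json.dumps(model.embed(body))),
    )
db.commit()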

Embedding text and images together using CLIP

My current favorite embedding model is CLIP. CLIP is a fascinating model released by OpenAI—back in January 2021, when they were still doing most things in the open—that can embed both text and images. Crucially, it embeds them both into the same vector space.

This means we can search for related images using text, and search for related text using images.
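
A sketch of that using the sentence-transformers packaging of CLIP (the image filename is a placeholder):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Both of these land in the same 512-dimensional vector space
image_vector = model.encode(Image.open("faucet.jpg"))
text_vector = model.encode("a photo of a fancy gold faucet")

print(util.cos_sim(image_vector, text_vector))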

Faucet Finder: finding faucets with CLIP

Clustering embeddings

I used my paginate-json tool and the GitHub issues API to collect the titles of all of the issues in my simonw/llm repository into a collection called llm-issues.

Now I can create 10 clusters of issues like this:

llm cluster llm-issues 10

These do appear to be related, but we can do better. The llm cluster command has a --summary option which causes it to pass the resulting cluster text through an LLM and use it to generate a descriptive name for each cluster: llm cluster llm-issues 10 --summary. This gives back names like “Log Management and Interactive Prompt Tracking” and “Continuing Conversation Mechanism and Management”.
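
The underlying clustering step is roughly what scikit-learn's KMeans does; a sketch (not what llm cluster literally runs; it assumes the issue titles and their vectors are already stored in SQLite):

import json
import sqlite3

from sklearn.cluster import KMeans

db = sqlite3.connect("issues.db")  # placeholder database
rows = db.execute("SELECT title, embedding FROM issues").fetchall()
titles = [row[0] for row in rows]
vectors = [json.loads(row[1]) for row in rows]

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(vectors)

clusters = {}
for title, label in zip(titles, kmeans.labels_):
    clusters.setdefault(label, []).append(title)

for label, members in sorted(clusters.items()):
    print(f"Cluster {label}: {members[:3]}")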

Visualize in 2D with Principal Component Analysis

Matt Webb used the OpenAI embedding model to generate embeddings for descriptions of every episode of the BBC’s In Our Time podcast. He used these to find related episodes, but also ran PCA against them to create an interactive 2D visualization.
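
The dimensionality-reduction step itself is a couple of lines with scikit-learn (random stand-in vectors here, just to show the shapes):

import numpy as np
from sklearn.decomposition import PCA

vectors = np.random.rand(500, 1536)  # stand-in for one embedding per episode

coords_2d = PCA(n_components=2).fit_transform(vectors)
print(coords_2d.shape)  # (500, 2): an (x, y) point to plot for each episode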

Scoring sentences using average locations

Another trick with embeddings is to use them for classification.

First calculate the average location for a group of embeddings that you have classified in a certain way, then compare embeddings of new content to those locations to assign it to a category.

Amelia Wattenberger demonstrated a beautiful example of this in Getting creative with embeddings. She wanted to help people improve their writing by encouraging a mixture of concrete and abstract sentences. But how do you tell if a sentence of text is concrete or abstract?

Her trick was to generate samples of the two types of sentence, calculate their average locations and then score new sentences based on how close they are to either end of that newly defined spectrum.
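
A sketch of that trick (toy example sentences and the cosine_similarity function from earlier; not Amelia's actual code, and the embedding model choice is arbitrary):

import llm

embed = llm.get_embedding_model("ada-002").embed  # any embedding model works here

def average(vectors):
    # Element-wise mean of a list of equal-length vectors
    return [sum(component) / len(component) for component in zip(*vectors)]

# Toy samples of each category; in practice you'd want many more
concrete_center = average([
    embed("The mug shattered on the kitchen tiles."),
    embed("She taped the torn map back together."),
])
abstract_center = average([
    embed("Progress depends on sustained curiosity."),
    embed("Trust is easier to lose than to build."),
])

def concreteness(sentence):
    # Positive means closer to the concrete end of the spectrum, negative means more abstract
    vector = embed(sentence)
    return cosine_similarity(vector, concrete_center) - cosine_similarity(vector, abstract_center)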

Answering questions with Retrieval-Augmented Generation (RAG)

Everyone who tries out ChatGPT ends up asking the same question: how could I use a version of this to answer questions based on my own private notes, or the internal documents owned by my company? People assume that the answer is to train a custom model on top of that content, likely at great expense. It turns out that’s not actually necessary. You can use an off-the-shelf Large Language Model (a hosted one or one that runs locally) and a trick called Retrieval-Augmented Generation, or RAG.

The key problem in RAG is figuring out the best possible excerpts of content to include in the prompt to the LLM.

“Vibes-based” semantic search powered by embeddings is exactly the kind of thing you need to gather potentially relevant content to help answer a user’s question.

I built a version of this against content from my blog, described in detail in Embedding paragraphs from my blog with E5-large-v2.

I used a model called E5-large-v2 for this. It’s a model trained with this exact use-case in mind.
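
A sketch of the retrieval half of that with sentence-transformers (placeholder paragraphs; the "query: " and "passage: " prefixes are how E5 models expect their input to be marked up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

paragraphs = [  # in practice: every paragraph from the blog
    "SQLite's FTS5 extension provides full-text search.",
    "Embeddings turn text into arrays of floating point numbers.",
]
question = "How do I add full-text search to SQLite?"

paragraph_vectors = model.encode([f"passage: {p}" for p in paragraphs])
question_vector = model.encode(f"query: {question}")

scores = util.cos_sim(question_vector, paragraph_vectors)[0]
ranked = sorted(zip(paragraphs, scores.tolist()), key=lambda pair: pair[1], reverse=True)

# The best paragraphs get pasted into the prompt ahead of the user's question
context = "\n\n".join(paragraph for paragraph, score in ranked[:5])
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"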

Q&A

My talk ended with a Q&A session.

What do you need to adjust if you have 1 billion objects?

I’m not convinced you need an entirely new database for this: I’m more excited about adding custom indexes to existing databases. For example, SQLite has sqlite-vss and PostgreSQL has pgvector.

