
Embeddings

When you call client.embeddings.create(input="the cat sat on the mat", model="text-embedding-3-large") and get back a 3072-dim float array, you're using the substrate of every RAG pipeline, semantic search box, dedup job, and recommendation system in production today. Behind the API call, an encoder transformer reads the text, mean-pools the final-layer hidden states, and emits a fixed-size vector. The trick is that meaning collapses into geometry: two sentences saying the same thing come back as vectors with cosine similarity ~0.85; two unrelated sentences come back at ~0.20.
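
The call itself is one line. A minimal sketch using the OpenAI Python SDK (v1+), assuming OPENAI_API_KEY is set in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="the cat sat on the mat",
)
vec = resp.data[0].embedding   # list of 3072 floats
print(len(vec))                # 3072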

This wasn’t always true. The arc from word2vec (2013) → BERT (2018) → contrastively-trained encoders (2020+) → today’s MTEB-frontier models is the single biggest quality jump in retrieval since BM25. The quality of your retrieval sets the ceiling on the quality of the downstream LLM call, and modern open models are 10–20% better than they were 18 months ago: switching from text-embedding-ada-002 (2022) to Nomic Embed v2 (late 2024) is a free quality bump on every retrieval-driven system you run. Knowing what’s current and how to evaluate it matters more than tuning your prompt. This lesson is the 2026 picture: which models, which dimensions, which hybrid stacks, and where reranking buys the final ~10× jump in precision.

TL;DR

  • An embedding is a fixed-size vector that represents text (or images, audio, anything). Similar inputs get similar vectors. Cosine similarity is the standard distance.
  • Modern dense embeddings: 768–4096 dimensions, contrastively trained on hard pairs. Top open models (April 2026): Nomic Embed v2, BGE-M3, Stella-1.5B, GTE-Qwen2, all at or near parity with the frontier OpenAI / Voyage models.
  • Matryoshka embeddings can be truncated — a 1024-dim vector still works at 256-dim, just with slightly lower quality. Lets one model serve fast / accurate / storage-cheap variants.
  • Hybrid retrieval (dense + sparse) beats pure dense or pure BM25 by ~5–10 points NDCG@10 on most retrieval benchmarks. Production pipelines reach for both.
  • Reranking with a smaller cross-encoder model that scores query/doc pairs costs ~50× more per call but delivers ~10× higher precision; it is the third stage of the 2026 retrieval stack.

Mental model

Embed → unit-normalize → dot-product. The whole game.

What embeddings actually are

Modern text embedding models are encoder transformers (BERT-style or decoder-only fine-tuned). The output is the model’s final-layer pooled vector — typically mean-pooled across tokens, sometimes the CLS token.

For BGE-M3 / GTE-Qwen2 / Nomic-Embed-v2:

  • Input: text (up to 8K tokens for some)
  • Architecture: encoder transformer, ~150M–7B params
  • Output: a 768–4096 dim float32 vector
  • Training: massive contrastive datasets (~1B+ pairs)

The vectors live in a “semantic space” where similar meanings cluster.
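
To make the pooling step concrete, here is a rough sketch of mean pooling with the Hugging Face transformers library (the model name and the search_document prefix are Nomic-specific conventions; check each model's card for its required pooling and prefixes):

import torch
from transformers import AutoModel, AutoTokenizer

name = "nomic-ai/nomic-embed-text-v1.5"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

batch = tok(["search_document: the cat sat on the mat"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, tokens, hidden_dim)

mask = batch["attention_mask"].unsqueeze(-1)          # zero out padding tokens
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean-pool over tokens
emb = torch.nn.functional.normalize(emb, dim=-1)      # unit-normalize -> 768-dim vector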

Cosine similarity, in practice

import numpy as np

def cosine(a, b):
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return float(a @ b)

# Embed two pieces of text → 768-dim vectors
v1 = embed("the cat sat on the mat")
v2 = embed("a feline was on a rug")
v3 = embed("the stock market crashed")

cosine(v1, v2)   # ~0.85 (similar meaning)
cosine(v1, v3)   # ~0.20 (unrelated)

If the embeddings are pre-normalized (most production models normalize at output), cosine similarity collapses to a plain dot product — fast, single-instruction.

Contrastive training, briefly

The training objective: anchor + positive + hard negatives. Anchor: a query. Positive: a relevant document. Hard negatives: documents that look relevant but aren’t.

Loss: InfoNCE — maximize exp(sim(anchor, positive) / τ) / Σ exp(sim(anchor, candidate) / τ) over all candidates in the batch, where τ is a small temperature that scales the similarities. The model learns to pull positives close and push hard negatives away.
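
A toy version of the loss with in-batch negatives, assuming unit-normalized anchor and positive vectors (PyTorch; mined hard negatives would be appended as extra candidate rows):

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    # anchors, positives: (batch, dim), unit-normalized; positives[i] is the
    # relevant doc for anchors[i]; every other row acts as an in-batch negative.
    logits = anchors @ positives.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(anchors.size(0))         # the diagonal holds the positives
    return F.cross_entropy(logits, labels)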

The hard part is mining hard negatives. Random negatives are easy and produce mediocre models. Hard negatives — semantically similar but irrelevant docs (often retrieved by a weaker model on the same query) — are what create modern strong embeddings.

Matryoshka embeddings

Standard embeddings: every dimension contributes equally.

Matryoshka embeddings: trained so that the first K dimensions are also a usable embedding for any K. You can truncate without retraining.

v_full = embed(text)                          # 1024-dim
v_short = v_full[:256]                        # 256-dim, still usable
v_short = v_short / np.linalg.norm(v_short)   # renormalize

Why this matters: a 256-dim vector is 4× smaller in memory and 4× faster in similarity computation than a 1024-dim. For a vector index with billions of docs, that’s the difference between “fits in RAM” and “doesn’t.”

Matryoshka models published in 2024–2026 (Nomic Embed v2, BGE-M3, Voyage v3): truncate down to 256 with ~3% loss vs full 1024-dim. Default to using shorter vectors unless you measure a real quality drop.

Hybrid retrieval

Dense embeddings capture semantic similarity. BM25 (sparse, lexical) catches exact-match queries — names, numbers, IDs — that dense models miss.

# Hybrid: combine dense + sparse scores
final_score = alpha * dense_score + (1 - alpha) * bm25_score
# Then sort by final_score; take top K

Or use reciprocal rank fusion (RRF):

def rrf(rank_dense, rank_bm25, k=60):
    return 1 / (k + rank_dense) + 1 / (k + rank_bm25)

RRF is parameter-light and robust. Most production stacks use it.
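
Applied to two ranked lists of document ids, the fusion takes only a few lines (a sketch; k=60 is the conventional default):

def rrf_fuse(dense_ids, bm25_ids, k=60):
    # dense_ids, bm25_ids: document ids in rank order (best first)
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # fused ranking, best first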

Hybrid retrieval beats pure-dense by 5–10 points NDCG@10 on standard benchmarks (BEIR, MTEB). Free quality.

Reranking — the third stage

from sentence_transformers import CrossEncoder

# Stage 1: BM25 retrieves top 100
candidates = bm25_search(query, top=100)

# Stage 2: dense embedding rescores the top 100, keeps top 20
dense_scores = [cosine(embed(query), embed(doc)) for doc in candidates]
top20 = sorted(zip(candidates, dense_scores), key=lambda x: -x[1])[:20]

# Stage 3: cross-encoder reranks the top 20, keeps top 5
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
final = reranker.rank(query, [doc for doc, _ in top20])[:5]

A cross-encoder takes (query, doc) as input together — much more expressive than a dot product, ~50× more compute, ~10× higher precision on the top-K. Always run it on a small candidate pool (top 20–100), not the whole corpus.

The full three-stage pipeline (BM25 → dense → reranker) is the 2026 standard for high-quality retrieval.

Picking an embedding model in 2026

The Massive Text Embedding Benchmark (MTEB) leaderboard is the go-to. Top open models:

Model                          | Dim  | MTEB avg | Notes
Nomic Embed v1.5               | 768  | ~62      | Original Matryoshka; 137M params, MIT-licensed, runs anywhere
Nomic Embed v2 (MoE)           | 768  | ~64      | 305M-param Mixture-of-Experts; multilingual; v1.5 successor (2025)
BGE-M3                         | 1024 | ~66      | Multi-functional (dense + sparse + ColBERT), strong multilingual
Stella-1.5B-v5                 | 1536 | ~71      | Strong open mid-size; non-Matryoshka
GTE-Qwen2-7B-instruct          | 3584 | ~71      | Highest open MTEB; 7B params, slower / heavier
Voyage 3 (proprietary)         | 1024 | ~70      | Production API; best ergonomics
OpenAI text-embedding-3-large  | 3072 | ~65      | Default cloud option

(MTEB averages are approximate and shift with the leaderboard; treat as ranking, not absolute.)

For most products: Nomic Embed v1.5 — 137M params, 768-dim, the model that introduced Matryoshka in this lineage and still the best small-model speed/quality trade-off. Upgrade to Nomic v2 MoE if you need multilingual; BGE-M3 if you need hybrid dense+sparse; GTE-Qwen2 if you need every last MTEB point.
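
Day-to-day usage through sentence-transformers, with Matryoshka truncation to 256 dims (a sketch; the search_document / search_query prefixes and the truncate_dim argument follow Nomic's conventions, so check the model card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5",
                            trust_remote_code=True, truncate_dim=256)

docs = ["the cat sat on the mat", "the stock market crashed"]
doc_vecs = model.encode(["search_document: " + d for d in docs], normalize_embeddings=True)
q_vec = model.encode("search_query: where is the cat", normalize_embeddings=True)

scores = doc_vecs @ q_vec    # plain dot products == cosine, since everything is unit-normalized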

When to fine-tune

For domain-specific retrieval (legal, medical, your codebase), an off-the-shelf model still beats a fine-tuned BERT from 2020. But fine-tuning a strong base on domain pairs adds 2–5 MTEB points on the domain. Recipe: take BGE-M3, generate ~10K (query, positive doc) pairs from your domain, fine-tune for 1–2 epochs with InfoNCE. ~$50 in compute.
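
A minimal version of that recipe with the sentence-transformers training API (a sketch: load_domain_pairs is a hypothetical helper returning your ~10K (query, positive doc) pairs, and MultipleNegativesRankingLoss is the in-batch-negative InfoNCE described earlier):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# load_domain_pairs() is a placeholder for your own (query, positive doc) pairs
train = [InputExample(texts=[q, pos]) for q, pos in load_domain_pairs()]
loader = DataLoader(train, batch_size=64, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=2, warmup_steps=100)
model.save("bge-m3-domain")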

Run it in your browser — toy embedding similarity

The demo uses simple TF-IDF as a stand-in (no real embedding model runs in the browser) to demonstrate the cosine-similarity retrieval pattern.
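
An offline equivalent of the demo, assuming scikit-learn (real dense embeddings would simply replace the vectorizer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a feline was resting on a rug",
    "the stock market crashed yesterday",
]
query = "cat on a rug"

vec = TfidfVectorizer().fit(docs + [query])
doc_vecs, q_vec = vec.transform(docs), vec.transform([query])

# Rank documents by cosine similarity to the query
scores = cosine_similarity(q_vec, doc_vecs)[0]
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {doc}")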

The retrieval pattern (embed → cosine → rank) is the same; real dense embeddings just understand semantic similarity at vastly higher fidelity than this toy.

Quick check

Fill in the blank
The technique that lets an embedding be truncated to lower dimensions without retraining and without major quality loss:
(Hint: Russian nesting doll — same name.)
Quick check
A team's RAG pipeline uses pure dense embeddings (Nomic Embed v2). They notice it misses queries with proper nouns (specific names, IDs). Best fix:

Key takeaways

  1. Embeddings = unit-normalized fixed-size vectors. Cosine similarity is the metric.
  2. Modern open models (Nomic Embed v2, BGE-M3, GTE-Qwen2) are MTEB-frontier — beat OpenAI ada-002 by 10+ points.
  3. Matryoshka embeddings let you truncate to 256-dim with ~3% loss; 4× memory + speed savings.
  4. Hybrid retrieval (dense + BM25 via RRF) beats either alone by 5–10 NDCG points.
  5. Three-stage retrieval: BM25 → dense → cross-encoder reranker. The 2026 production default.

Go deeper
