Vector Databases Explained: How Embeddings Power Modern AI Search

May 28, 2026 · AI & Machine Learning

TL;DR: Vector databases store numerical representations of data called embeddings, then find similar items by geometric proximity rather than keyword matching. They are the retrieval layer behind nearly every production RAG (Retrieval-Augmented Generation) system today. Choosing the right one—and configuring it correctly—matters more than most teams realize.

Why Keyword Search Isn't Enough

Imagine searching a support knowledge base for "my laptop won't turn on." A traditional keyword index might miss an article titled "Device fails to boot" because none of the words overlap. A vector search can still surface it, because both phrases mean nearly the same thing.

That gap—between literal word matching and genuine semantic understanding—is what vector databases are built to close. They've moved from academic curiosity to production infrastructure in a remarkably short time, and understanding how they work is now a baseline skill for anyone building AI-powered applications.

What Is an Embedding?

An embedding is a list of numbers (a vector) that encodes the meaning of a piece of data. Feed a sentence into an encoder model and you get back something like [0.021, -0.847, 0.334, ...]. The length is typically between 384 and 4,096 numbers depending on the model: 384 for lightweight open-source models like all-MiniLM-L6-v2, 1,536 for OpenAI's text-embedding-ada-002, and up to 3,072 for text-embedding-3-large.

The key property is geometric: items that are semantically similar end up close together in this high-dimensional space. "Dog" and "puppy" land near each other. "Database" and "spreadsheet" cluster together. "Paris" sits close to "France" in a way that "Paris" and "salmon" do not.
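To make the geometry concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The values are invented for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made-up values, not real model output).
toy_embeddings = {
    "dog":    [0.90, 0.80, 0.10],
    "puppy":  [0.85, 0.90, 0.15],
    "salmon": [0.10, 0.20, 0.95],
}

dog_puppy = cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"])
dog_salmon = cosine_similarity(toy_embeddings["dog"], toy_embeddings["salmon"])

print(f"dog vs puppy:  {dog_puppy:.3f}")
print(f"dog vs salmon: {dog_salmon:.3f}")
```

The first score comes out much higher than the second, which is all "close together" means here: a small angle between the two vectors.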

The specific model matters enormously—and switching models later requires full re-indexing (more on that below).

ANN Algorithm Options

Finding the single most similar vector to a query sounds simple: compute the distance from the query to every stored vector and pick the smallest. In practice, this brute-force approach costs O(n) distance computations per query (O(n·d) arithmetic for d-dimensional vectors), which becomes too slow for interactive use once your index holds millions of entries.
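A minimal sketch of that brute-force baseline, in plain Python on random data. The dataset size, dimension, and function names are assumptions chosen for illustration; the point is that every query touches every stored vector.

```python
import math
import random

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_nearest(query, vectors, k=1):
    """Exact k-nearest neighbors: one distance computation per stored vector."""
    scored = [(l2_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return [i for _, i in scored[:k]]

rng = random.Random(0)
dim = 8
stored = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]

# The query is a slightly perturbed copy of stored vector 42,
# so an exact search should rank that vector first.
query = [x + 0.001 for x in stored[42]]
nearest = brute_force_nearest(query, stored, k=3)
print(nearest)
```

Exact search like this is still useful as ground truth when you measure the recall of an approximate index.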

Production systems use Approximate Nearest Neighbor (ANN) algorithms instead. These pre-build index structures that let queries skip most comparisons, accepting a small accuracy trade-off for massive speed gains. Typical configurations target 90–99% recall depending on tuning.

The three most common distance metrics are:

| Metric | Best For |
|---|---|
| Cosine similarity | Text; measures the angle between vectors |
| Euclidean (L2) distance | Images and multimodal data |
| Dot product | Un-normalized embeddings where magnitude carries a relevance signal (e.g., two-tower recommendation models); equals cosine when vectors are L2-normalized |
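The sketch below compares the three metrics on made-up 2-D vectors. It shows that cosine and dot product can disagree on ranking when magnitudes differ, and that they agree once vectors are L2-normalized. All values and helper names are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return norm([x - y for x, y in zip(a, b)])

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

query = [1.0, 0.0]
small_aligned = [0.9, 0.1]   # nearly the same direction, small magnitude
large_offaxis = [5.0, 5.0]   # 45 degrees away, large magnitude

# Cosine ignores magnitude, so it prefers the aligned vector.
print(cosine(query, small_aligned), cosine(query, large_offaxis))
# Dot product rewards magnitude, so it prefers the large vector.
print(dot(query, small_aligned), dot(query, large_offaxis))

# After L2 normalization, dot product and cosine are the same number,
# and squared L2 distance is a simple function of cosine: 2 - 2*cos.
u, v = normalize(query), normalize(large_offaxis)
print(dot(u, v), l2(u, v) ** 2)
```

One practical consequence: if your embeddings are already normalized, the three metrics produce the same ranking, so the choice mostly affects speed and index support rather than results.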

The Main Algorithms

| Algorithm | Character |
|---|---|
| HNSW | Graph-based; builds a multi-layer graph where each node connects to its nearest neighbors; queries navigate from coarse upper layers to precise lower ones; excellent recall and low latency but memory-hungry |
| IVF | Partitions vectors into k clusters at index time; queries probe the nearest nprobe clusters rather than the full set; scalable with tunable recall |
| DiskANN | Disk-resident; viable at billion-vector scale where HNSW's RAM cost becomes prohibitive |
| ScaNN | Strong throughput in high-QPS scenarios |
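To show the IVF idea from the table in code, here is a toy sketch: a few rounds of k-means to build the clusters, then a search that scans only the nprobe closest clusters. This is a simplified illustration in pure Python, not how FAISS or any production system implements it; the function names and parameters are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_centroids(vec, centroids, n):
    """Indices of the n centroids nearest to vec."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
    return order[:n]

def assign(vectors, centroids):
    """Inverted lists: for each centroid, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[closest_centroids(v, centroids, 1)[0]].append(i)
    return lists

def build_ivf(vectors, k, iterations=5, seed=0):
    """Index time: partition the vectors into k clusters with basic k-means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    dim = len(vectors[0])
    for _ in range(iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, lists, nprobe, top_k):
    """Query time: scan only the vectors in the nprobe nearest clusters."""
    probed = closest_centroids(query, centroids, nprobe)
    candidates = [i for c in probed for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:top_k], len(candidates)

rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
centroids, lists = build_ivf(vectors, k=16)
query = [rng.gauss(0, 1) for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:10]
approx, scanned = ivf_search(query, vectors, centroids, lists, nprobe=3, top_k=10)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"scanned {scanned} of {len(vectors)} vectors, recall@10 = {recall:.2f}")
```

Raising nprobe scans more clusters and pushes recall toward 100% at the cost of more distance computations; setting nprobe equal to k degenerates into brute-force search.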

Note on FAISS: FAISS (Meta) is a foundational open-source library that implements several of the above algorithms—including HNSW, IVF, and product quantization. It is not a database and not a standalone algorithm; it is the computational backend underlying many production systems and research tools.

A rough rule of thumb: HNSW works well up to ~100 million vectors before RAM costs force a move to disk-based approaches.
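A back-of-envelope estimate shows where that pressure comes from. The sketch assumes 768-dimensional float32 vectors and roughly 32 graph links per node at 4 bytes each; both numbers are illustrative assumptions, and real overhead varies by implementation and settings.

```python
def hnsw_ram_estimate_gb(n_vectors, dim, links_per_node=32,
                         bytes_per_float=4, bytes_per_link=4):
    """Rough RAM estimate: raw vectors plus graph links, in decimal gigabytes.

    links_per_node is an assumed average; actual HNSW link counts depend on
    the index parameters and the library.
    """
    vector_bytes = n_vectors * dim * bytes_per_float
    link_bytes = n_vectors * links_per_node * bytes_per_link
    return (vector_bytes + link_bytes) / 1e9

estimate = hnsw_ram_estimate_gb(n_vectors=100_000_000, dim=768)
print(f"~{estimate:.0f} GB")
```

Under these assumptions the raw vectors dominate the total, which is why shrinking each vector is usually the first lever to pull.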

Quantization: The First Response to RAM Pressure

Before migrating off HNSW entirely, consider vector quantization. Scalar quantization (float32 → int8) cuts memory by roughly 4×; binary quantization can reach ~32× with some recall loss. Qdrant, Weaviate, and Milvus all support this natively. Always benchmark recall degradation on your own data before committing—the tradeoff varies significantly by dataset.
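The memory ratios follow directly from the storage width per dimension. Below is a minimal sketch of symmetric scalar quantization and sign-based binary quantization on one random vector. It is a simplified illustration, not the scheme any particular database uses; production systems typically calibrate ranges over the whole dataset.

```python
import random

def quantize_int8(vec):
    """Symmetric scalar quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

def quantize_binary(vec):
    """Binary quantization: keep only the sign of each dimension (1 bit each)."""
    return [1 if x > 0 else 0 for x in vec]

rng = random.Random(0)
dim = 1536
vec = [rng.uniform(-1, 1) for _ in range(dim)]

codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
bits = quantize_binary(vec)

# Storage per vector if each format is packed at its native width.
float32_bytes = dim * 4
int8_bytes = dim * 1
binary_bytes = dim // 8

print(f"max reconstruction error: {max_error:.5f} (half a step is {scale / 2:.5f})")
print(f"float32 {float32_bytes} B, int8 {int8_bytes} B, binary {binary_bytes} B")
```

The per-value error of the int8 codes is bounded by half a quantization step, but what matters for search is how that error shifts rankings, which is why benchmarking recall on your own data is the real test.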

The Vector Database Landscape

Purpose-Built Systems

Pinecone — Managed, serverless-first, strong enterprise adoption
Weaviate — Open-source with hybrid search built in natively; GraphQL API
Qdrant — Rust-based; notable for filtered search performance and quantization support
Milvus / Zilliz — Linux Foundation AI project; designed for billion-scale; Zilliz is the managed version
Chroma — Developer-friendly; common choice for quick prototyping and RAG demos

As a concrete pricing anchor: Pinecone's Starter tier is free; a 5M-vector production pod runs roughly $100–$300/month as of early 2026. Always verify current vendor pricing before committing to a budget.

Traditional Databases with Vector Extensions

pgvector (PostgreSQL) — Most common starting point for Postgres teams; supports HNSW and IVF indexes; practical ceiling roughly 1–10 million vectors
Elasticsearch / OpenSearch — Strong hybrid search with years of production maturity
Redis (RediSearch) — Lowest-latency option for hot-path use cases
MongoDB Atlas Vector Search — Broad reach for teams in the MongoDB ecosystem

Start on pgvector if you're already on Postgres. Migrate to a purpose-built system when filtering complexity, scale, or latency budgets demand it—not because a sales deck says so.

A Worked Example: RAG Pipeline

The dominant production use case for vector databases is Retrieval-Augmented Generation (RAG):

```
User Query
    │
    ▼
Embedding Model ──► Query Vector
    │
    ▼
Vector Database (ANN search over indexed document chunks)
    │
    ▼
Top-k Retrieved Chunks (e.g., top-50)
    │
    ▼
Optional Re-ranker (cross-encoder scores top-k for precision → top-5)
    │
    ▼
LLM (with retrieved chunks as context)
    │
    ▼
Generated Answer
```

Step 1 — Ingestion. Source documents are split into chunks, each chunk is embedded, and the resulting vectors are stored alongside metadata (document ID, date, source URL, etc.).

Step 2 — Query time. The user's question is embedded with the same model. The vector database returns the k most similar chunks, typically in ~5–50ms at moderate scale on managed cloud infrastructure.

Step 3 — Generation. The LLM receives the retrieved chunks as context and generates an answer grounded in your data rather than only its training weights.
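The three steps can be sketched end to end in a few lines. Everything here is a stand-in: toy_embed is a word-count vector (which is really keyword matching, used only so the example runs without a model), the chunks are invented, and the final step builds a prompt instead of calling an LLM.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: ingestion. Embed each chunk and keep it next to its metadata.
chunks = [
    {"id": "kb-1", "text": "device fails to boot after a firmware update"},
    {"id": "kb-2", "text": "how to reset your account password"},
    {"id": "kb-3", "text": "battery drains quickly when the laptop is asleep"},
]
index = [(chunk, toy_embed(chunk["text"])) for chunk in chunks]

# Step 2: query time. Embed the question the same way and rank the chunks.
def retrieve(question, k=2):
    query_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Step 3: generation. Hand the retrieved chunks to the LLM as context.
def build_prompt(question, retrieved):
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "how do I reset my password"
top_chunks = retrieve(question, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real pipeline, toy_embed becomes a call to your embedding model, the sorted scan becomes an ANN query against the vector database, and the prompt is sent to the LLM.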

The vector database is the retrieval layer. Its quality directly determines the quality of the LLM's output—bad retrieval quietly degrades answers even with a state-of-the-art model.

The Pitfalls Nobody Warns You About

1. Chunking Strategy Is Underrated

Most teams spend weeks tuning LLM prompts and almost no time on chunking. Naive fixed-size chunking frequently splits sentences mid-thought, destroying semantic coherence. Overlapping chunks, recursive splitting by document structure, or semantic chunking (splitting at natural topic boundaries) each improve retrieval quality in practice. Newer techniques like late chunking—embedding longer passages and chunking after the fact—are worth evaluating for long-document workloads.
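As a baseline to compare other strategies against, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace and counts words; the sizes are placeholder assumptions, and real pipelines usually count model tokens and respect sentence or section boundaries.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap`
    words with the previous chunk so boundary sentences appear in both."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(25))
pieces = chunk_words(sample, chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

The overlap trades storage and some duplicate retrievals for a lower chance that the answer-bearing sentence is cut in half.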

2. Pure Vector Search Isn't Enough

Semantic search struggles with exact strings: product codes, names, UUIDs. "TXN-00482" doesn't have a semantic meaning—it needs to be found exactly. Hybrid search combines dense vector retrieval with sparse keyword retrieval (typically BM25), fusing results with methods like Reciprocal Rank Fusion (RRF). Weaviate, Elasticsearch, and OpenSearch implement this natively. In mature RAG deployments, hybrid search is increasingly the default rather than an optional enhancement.
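Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below uses k = 60, the constant commonly used for RRF; the document IDs and both rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document's score is the sum of 1 / (k + rank) over the lists
    that contain it, with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for the same query from two retrievers.
dense_results = ["doc-a", "doc-b", "doc-c"]    # vector search
sparse_results = ["doc-d", "doc-a", "doc-b"]   # BM25 keyword search

fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and vector similarities, which live on unrelated scales.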

3. Consider a Re-ranker After Retrieval

Vector retrieval is a bi-encoder setup: the query and each document are embedded independently, which makes it fast but approximate. A cross-encoder re-ranker (e.g., Cohere Rerank or a local cross-encoder/ms-marco model) attends to both query and document together, producing more accurate relevance scores but at higher latency. The practical pattern: retrieve a broad candidate set (top-50) with the vector DB, then re-rank down to a tight set (top-5) before passing to the LLM. This gets the speed of ANN retrieval and the precision of cross-attention without paying cross-encoder cost at full index scale.
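The two-stage pattern can be sketched with placeholder scorers. Here word_overlap stands in for the cheap first-stage retriever and phrase_match stands in for the expensive re-ranker; neither is a real bi-encoder or cross-encoder, and the corpus is synthetic. The point is the control flow: the expensive scorer only ever sees the candidate set.

```python
def retrieve_then_rerank(query, corpus, fast_score, precise_score,
                         n_candidates=50, n_final=5):
    """Stage 1: rank the whole corpus with the cheap scorer, keep n_candidates.
    Stage 2: re-score only those candidates with the expensive scorer."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda doc: precise_score(query, doc),
                      reverse=True)
    return reranked[:n_final]

def word_overlap(query, doc):
    """Placeholder for vector similarity: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

calls = {"precise": 0}

def phrase_match(query, doc):
    """Placeholder for a cross-encoder: looks at query and document together."""
    calls["precise"] += 1
    return 1.0 if query in doc else 0.0

corpus = [f"note {i} about topic {i % 10}" for i in range(200)]
final = retrieve_then_rerank("topic 3", corpus, word_overlap, phrase_match)

print(final)
print(f"expensive scorer ran {calls['precise']} times for {len(corpus)} documents")
```

The candidate count is the main tuning knob: too small and the right document never reaches the re-ranker, too large and re-ranking latency dominates the request.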

4. Embedding Model Mismatch Will Destroy Recall

Indexing with one embedding model and querying with another produces garbage results—vectors from different models live in incompatible spaces. Model upgrades require full re-indexing, a cost that compounds painfully at scale. Factor this into architecture decisions from day one.
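One cheap safeguard is to record the embedding model alongside the index and reject vectors that come from anywhere else. The class below is a hypothetical sketch, not the API of any vector database; most real systems only enforce the dimension, so the model-name check usually has to live in your own code.

```python
class EmbeddingSpaceMismatch(ValueError):
    """Raised when a vector comes from a different model than the index."""

class GuardedIndex:
    """Hypothetical wrapper that pins an index to one embedding model."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = []

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise EmbeddingSpaceMismatch(
                f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dim:
            raise EmbeddingSpaceMismatch(
                f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector, model_name):
        self._check(vector, model_name)
        self.vectors.append(vector)

    def validate_query(self, vector, model_name):
        self._check(vector, model_name)

index = GuardedIndex(model_name="all-MiniLM-L6-v2", dim=384)
index.add([0.0] * 384, model_name="all-MiniLM-L6-v2")

try:
    index.validate_query([0.0] * 1536, model_name="text-embedding-ada-002")
    mismatch_caught = False
except EmbeddingSpaceMismatch as error:
    mismatch_caught = True
    print(f"rejected: {error}")
```

The name check matters because two models can share a dimension while producing incompatible spaces, a case a dimension check alone would let through silently.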

5. Metadata Filtering Requires Upfront Design

Post-filtering (retrieve top-1000, then filter) degrades recall when the valid result set is small: if only a handful of those 1,000 candidates satisfy the filter, the query returns fewer than k results, or none at all. Systems like Qdrant and Pinecone support pre-filtering that applies constraints during the ANN search itself. Design your filtering strategy before building your index schema, not after.
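The difference is easy to reproduce on synthetic data. The sketch below uses exact search for both strategies to isolate the effect of filter ordering; real systems apply the filter inside the ANN traversal, which this does not model. The records, the lang field, and the 2% match rate are invented.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, records, predicate, k, fetch):
    """Take the `fetch` nearest records first, then drop those failing the filter."""
    nearest = sorted(records, key=lambda r: sq_dist(query, r["vec"]))[:fetch]
    return [r for r in nearest if predicate(r)][:k]

def pre_filter_search(query, records, predicate, k):
    """Apply the filter first, then rank only the records that pass."""
    allowed = [r for r in records if predicate(r)]
    return sorted(allowed, key=lambda r: sq_dist(query, r["vec"]))[:k]

def is_german(record):
    return record["lang"] == "de"

rng = random.Random(7)
# Hypothetical records: 2-D vectors where only 1 in 50 is tagged "de".
records = [
    {"id": i, "vec": [rng.random(), rng.random()],
     "lang": "de" if i % 50 == 0 else "en"}
    for i in range(1000)
]
query = [0.5, 0.5]

post = post_filter_search(query, records, is_german, k=5, fetch=20)
pre = pre_filter_search(query, records, is_german, k=5)
print(f"post-filter returned {len(post)} results, pre-filter returned {len(pre)}")
```

With a filter this selective, the top-20 neighbors rarely contain five matching records, so post-filtering comes back short while pre-filtering always fills the result set when enough matches exist.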

6. You're Probably Not Measuring Retrieval Quality

Teams deploy RAG and measure only final answer quality, missing the actual failure mode: retrieval recall. Instrument hit@k, MRR, or NDCG against a labeled evaluation set. Poor retrieval is invisible until you look for it.
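The metrics named above take only a few lines each. This sketch uses binary relevance and a two-query evaluation set that is entirely made up; in practice the labeled set comes from your own queries and documents.

```python
import math

def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical labeled set: what the system retrieved vs. what is truly relevant.
eval_set = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": {"c2"}},
    {"retrieved": ["c1", "c4", "c5"], "relevant": {"c8"}},
]

n = len(eval_set)
mean_hit = sum(hit_at_k(q["retrieved"], q["relevant"], 3) for q in eval_set) / n
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in eval_set) / n
mean_ndcg = sum(ndcg_at_k(q["retrieved"], q["relevant"], 3) for q in eval_set) / n

print(f"hit@3 = {mean_hit:.2f}, MRR = {mrr:.2f}, NDCG@3 = {mean_ndcg:.3f}")
```

Tracking these numbers per release makes a retrieval regression visible before it shows up as vaguely worse answers.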

FAQ

Q: Do I need a vector database or can I just use a list in memory? For prototypes under ~50,000 vectors, in-memory libraries like FAISS or simple NumPy dot-product searches work fine. For anything production-facing, persistent storage, metadata filtering, access control, and managed ANN indexes justify the overhead.

Q: Which embedding model should I use? Use MTEB benchmarks as a starting filter, then run recall tests on your own domain data. General leaderboard performance often does not transfer cleanly to specialized corpora.

Q: Is pgvector good enough? Often yes, especially early on. Where pgvector starts to strain: indexes above ~10M vectors, complex multi-field filtered queries at high QPS, or strict sub-10ms latency requirements. Purpose-built systems handle those cases better, but migrate when you hit a concrete limit—not preemptively.

Q: How do vector databases handle multimodal data (images + text)? Multimodal embedding models encode images and text into the same vector space, so a text query can retrieve images and vice versa. The vector database simply stores and searches vectors; the multimodal capability lives entirely in the embedding model layer.

Q: What's the biggest mistake teams make? Shipping without retrieval evaluation. You cannot fix what you don't measure.

Where to Go Next

The fundamentals here—embeddings, ANN search, hybrid retrieval, re-ranking—are stable. The ecosystem is moving fast: multimodal indexes, agent memory stores, real-time vector updates, and increasingly sophisticated re-ranking pipelines are all active areas. The retrieval layer is consistently underinvested relative to its impact on final output quality; building intuition there pays compounding returns.
