Content Lake (Datastore)

Dataset Embeddings

Search your Content Lake by meaning, not just keywords.

Dataset embeddings add semantic search to GROQ. On enabled datasets, you can search your content by semantic meaning using the text::semanticSimilarity() GROQ function.

Quickstart

Create an embeddings-enabled dataset

Optionally scope what gets embedded with a projection.
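If you create the dataset over the HTTP API documented later on this page, the request looks like this (the projection is optional, and the field names here are placeholders for your own schema):

```http
PUT /projects/:projectId/datasets/:name HTTP/1.1
Content-Type: application/json

{
  "aclMode": "public",
  "embeddings": {
    "enabled": true,
    "projection": "{ title, summary }"
  }
}
```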

Check that embeddings are ready

Embeddings generation may take a few minutes, especially on larger datasets. When the status shows ready, your dataset is set up for semantic search.
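One way to check is the settings endpoint documented later on this page:

```http
GET /projects/:projectId/datasets/:name/settings/embeddings HTTP/1.1
```

The response includes a status field that reads updating, ready, or error.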

Query with semantic similarity using GROQ

Query results are ranked by semantic relevance, even when there's no exact keyword overlap. Each result includes a _score field. This is an opaque, unitless value used only for ranking results relative to each other within a single query. It is not a measure of general match quality and should not be compared across different queries.

* | score(text::semanticSimilarity("how to handle user authentication"))

Next steps: See Control your embeddings with projections to fine-tune what content gets embedded, or Querying with embeddings for additional search patterns, keyword matching and boosting.

Core concepts

What are embeddings?

An embedding is a numerical representation of text (a vector) that captures meaning rather than just characters. Words and phrases that are semantically close end up with similar vectors, even if they share no words in common. "Authentication flow" and "login process" would be close together, "authentication flow" and "authentic basketball jersey" would be far apart.

When you enable embeddings on a dataset, Sanity processes each document's content (or the subset you define with a projection) into a vector. At query time, your search term is converted into a vector too, and results are ranked by proximity in that vector space.

Embeddings in Sanity datasets give your GROQ queries the ability to understand the meaning behind the text your team or customers search for, not just match specific keywords in the text.

Why use embeddings?

Traditional keyword search relies on matching exact words in your content. If your docs say "authentication" but someone searches "login," traditional search misses it. Embeddings close that gap by matching on concepts.

Embeddings on datasets bring this capability directly into GROQ, so you don't need an external vector database or a separate search pipeline. Your content stays in the Content Lake, your queries stay in GROQ, and semantic scoring is just another function you can use alongside the filters and boosts you already use.

Getting started

When you create an embeddings-enabled dataset, Sanity asynchronously analyzes its documents and computes embeddings for them, either in their entirety or according to any projection you provide.

Each document's content is processed into a vector representation, its "embedding", which is what makes semantic search possible.

A few key points:

  • Enabling embeddings on a dataset triggers an initial embeddings generation, where all existing documents are processed. This can take some time, particularly on large datasets. You can track progress using the status command (see Checking embedding status below).
  • After the generation is complete, embeddings are kept up to date automatically. When a document is updated, its embedding is recomputed asynchronously. Mutations are batched to avoid constant recomputation on frequently updated datasets, which means embedding results may lag slightly behind the document update. Normally this lag will be less than 1 minute, but may in some cases be longer depending on the size and frequency of document updates.
  • The embedding model is managed by Sanity and may be updated for optimized performance. When this happens, your dataset will be recomputed automatically.

Performance considerations

Enabling embeddings

You can enable embeddings when creating a dataset. Enterprise clients can enable embeddings on existing datasets. Both paths are available through the CLI and the HTTP API.

When creating a dataset

To include a projection at creation time:
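The exact CLI syntax may differ from this sketch; the flag names below are assumptions, so verify them against the CLI's own help output:

```sh
# Hypothetical flag names -- verify with `sanity dataset create --help`
sanity dataset create my-dataset \
  --embeddings \
  --embeddings-projection '{ title, summary, category }'
```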

Note that expanding references will not work in these projections; only fields present in the document itself can be embedded.

For an existing dataset

By default this returns immediately and runs asynchronously in the background. Add --wait to block until the embeddings generation completes:
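A sketch of the command shape; the subcommand name is an assumption (check the CLI help for the current syntax), while the --wait flag is described on this page:

```sh
# Subcommand name is hypothetical; --wait blocks until generation completes
sanity dataset embeddings enable my-dataset --wait
```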

To enable with a projection:
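The subcommand and --projection flag names in this sketch are assumptions to be checked against the CLI's help output:

```sh
# Hypothetical subcommand and flag names
sanity dataset embeddings enable my-dataset \
  --projection '{ title, summary, category }'
```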

Via the HTTP API

Create a new dataset with embeddings:

PUT /projects/:projectId/datasets/:name HTTP/1.1
Content-Type: application/json

{
  "aclMode": "public",
  "embeddings": {
    "enabled": true,
    "projection": "{ title, summary, category }"
  }
}

Enable or update embeddings on an existing dataset:

PUT /projects/:projectId/datasets/:name/settings/embeddings HTTP/1.1
Content-Type: application/json

{
  "enabled": true,
  "projection": "{ title, summary, category }"
}

This endpoint returns 202 Accepted immediately. Embeddings generation will then complete asynchronously.

Read current configuration and status:

GET /projects/:projectId/datasets/:name/settings/embeddings HTTP/1.1

# Response
{
  "enabled": true,
  "projection": "{ title, summary, category }",
  "status": "ready"  // "updating" | "ready" | "error"
}

Control your embeddings with projections

Projections define what content gets embedded. This directly affects the size of your embeddings, the time of initial generation and ongoing recomputation, the efficiency of each query, and the relevance of your search results. If no projection is specified, Sanity embeds the entire document for you.

For small datasets with simple content, this may be fine. For most production datasets, a targeted projection is recommended, as every field you include in a projection increases the size of each document's embedding and the time it takes to generate.

Document size limits

By scoping your projection to only the fields your users actually search against, you speed up initial generation, recomputation and query times, and improve result relevance by keeping noise out of the vector space.

Avoid including fields that update frequently but have no semantic value for search, since each change will trigger a recomputation cycle without improving results.

Basic projection

If your users only search by a few shared fields, a simple projection may be all you need:
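For example, reusing the field names from the API examples on this page (substitute your own schema's fields):

```groq
{ title, summary, category }
```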

Type-specific projections

Many datasets contain multiple document types with different schemas. Use conditional projections to target the right fields per document type:
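A sketch with hypothetical field names for each type, covering the article, product, and help article types described below:

```groq
{
  _type == "article" => { title, excerpt, body },
  _type == "product" => { title, description },
  _type == "helpArticle" => { title, question, answer }
}
```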

This projection generates embeddings for articles, products, and help articles only, pulling different fields from each. Document types not listed in the projection are not embedded.

You can also combine shared fields with type-specific ones:
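For instance (field names are placeholders for your own schema):

```groq
{
  title,
  _type == "article" => { body },
  _type == "product" => { description }
}
```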

Here, title is embedded for all document types, while each type contributes additional fields specific to its schema.

Field names from your projection are preserved as metadata and used as semantic context during embeddings computation. For example, { "musical_genre": category } helps the model interpret a value like "classical" in a musical context rather than an engineering one. Field names and position data are also returned as part of search result metadata (see Search result metadata below).

Checking embedding status

To check the current state of embedding processing on a dataset:
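The command shape is sketched below with a hypothetical subcommand name; alternatively, use the GET settings endpoint shown in the HTTP API section above:

```sh
# Hypothetical subcommand -- check `sanity help` for the exact command
sanity dataset embeddings status my-dataset
```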

The underlying status values are updating, ready, and error.

Disabling embeddings

Destructive operation
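One way to disable embeddings, inferred from the settings endpoint documented in the HTTP API section above:

```http
PUT /projects/:projectId/datasets/:name/settings/embeddings HTTP/1.1
Content-Type: application/json

{
  "enabled": false
}
```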

Querying with embeddings

text::semanticSimilarity() is a GROQ function introduced with dataset embeddings. It converts your search term into a vector and ranks results by proximity to each document's embedding. The function is only valid as an argument to score(); using it elsewhere returns an error.

For longer documents, the projected content may be split into multiple chunks, each embedded and scored separately. This affects the _embeddings metadata returned with results (see How documents are chunked below).

Once embeddings are enabled, you can use text::semanticSimilarity() inside a score() expression in any GROQ query.

Semantic search
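At its simplest, score every document in the dataset against a search phrase (the query string here is illustrative):

```groq
* | score(text::semanticSimilarity("how to handle user authentication"))
```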

Filtered semantic search

Use a filter to restrict which documents are scored. Documents that don't match the filter are excluded entirely:
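For example, mirroring the footwear scenario used throughout this section (the category field is assumed from your schema):

```groq
*[_type == "product" && category == "footwear"]
  | score(text::semanticSimilarity("waterproof boots"))
```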

In this example, only products in the footwear category are considered. Results are ranked by semantic similarity.

Hybrid search: combining filters, keyword matching, and semantic scoring

The previous examples show two of the three tools available in a search query: filters to narrow the candidate set, and semantic scoring to rank by meaning. The third is keyword matching, which rewards documents containing the exact search terms.

Start with filtered semantic search. For many use cases, it's sufficient on its own. Add keyword matching when your users are likely to search for proper nouns, brand names, model numbers, or other specific identifiers that carry meaning as exact strings but don't embed well as concepts. A search for "Gore-Tex" needs to match that exact term. Semantic similarity alone would only capture the broader concept of waterproofing.

Use score() with multiple expressions to combine keyword matching and semantic scoring. Each expression contributes independently to the document-level _score. Documents don't need to match every expression to appear in results:
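A sketch combining all three, using the Gore-Tex example from this section (the title and body fields are assumed from your schema):

```groq
*[_type == "product" && category == "footwear"]
  | score(
      [title, body] match text::query("Gore-Tex waterproof boots"),
      text::semanticSimilarity("Gore-Tex waterproof boots")
    )
```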

The filter restricts results to footwear products. The keyword match catches products that mention "Gore-Tex" by name. Semantic similarity surfaces products that are conceptually related to waterproof boots, even when they use different language.

When keyword matches on shorter fields like title produce scores that outweigh semantic matches on longer fields like body, use boost() to adjust the balance:

*[_type == "product" && category == "footwear"]
    | score(
        boost([title, body] match text::query("Gore-Tex waterproof boots"), 0.5),
        text::semanticSimilarity("Gore-Tex waterproof boots")
      )

When the search input is conceptual rather than specific—descriptions of problems, features, or topics rather than exact names—filtered semantic search alone often produces better results, since there's no keyword score to compete with the semantic signal:

*[_type == "product" && category == "footwear"]
    | score(text::semanticSimilarity("comfortable shoes for long walks"))

Search result metadata

When a query uses text::semanticSimilarity(), Sanity automatically includes an _embeddings field on each result. This contains the specific text fragments that contributed to the match, along with their source fields and character positions, which is useful for highlighting matches or tracing which part of a document drove the result.

Each entry in _embeddings contains the text fragments that contributed to the semantic match, along with metadata about where they came from.

  • fragments are extracts of the original text
  • fields are the GROQ-style field paths they came from (e.g. reviews[0].text)
  • startPositions and endPositions are character offsets within each field
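Putting that together, a single _embeddings entry might look roughly like this (all values are illustrative):

```json
"_embeddings": [
  {
    "score": 0.82,
    "fragments": ["Lightweight waterproof boot with a breathable membrane..."],
    "fields": ["reviews[0].text"],
    "startPositions": [0],
    "endPositions": [58]
  }
]
```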

How documents are chunked

A document's projected content may be split into one or more chunks before embedding. Short documents typically fit in a single chunk, while longer documents are split across multiple. Each chunk is embedded as a separate vector and scored independently at query time.

This chunking is why the _embeddings array can vary in length across results. A short document produces a single _embeddings entry whose fragments and fields arrays cover all embedded fields together. A longer document produces multiple entries, each representing a different portion of the content. A short field like title may appear in its own chunk while a long body field spans several.

The number of chunks depends on the total text length of the projection output for a given document, not on how many fields the projection includes.

Per-chunk scores vs. document-level _score

Each _embeddings entry includes a score field representing the semantic similarity of that individual chunk to the query. Entries are sorted by this score, highest first.

The document-level _score is separate. It combines all scoring expressions in your score() function: both text::semanticSimilarity() and any keyword matching via match or text::query(). The per-chunk score tells you how well a specific portion of the document matched semantically; the document-level _score determines where the result appears in the overall ranking.

Troubleshooting

Results don't seem relevant to my query

Check your projection. If no projection is set, the entire document is being embedded, which can lead to matches against irrelevant fields. Define a projection that scopes your embeddings to the content your users actually search against.

Status shows error

Some failed enablements require manual intervention. Ask in the community Discord for assistance. Enterprise customers should contact support through their dedicated channels.

Query results appear stale

Embedding updates are asynchronous and debounced. After a document is updated, its embedding may take a few minutes to reflect the change.

Keyword matches seem to outweigh semantic results

In hybrid queries that combine match with text::semanticSimilarity(), keyword matches on short fields like titles can produce high scores that outweigh strong semantic matches on longer content fields. This happens because each matching term represents a larger share of a short field's total content, resulting in a higher keyword score.

To address this, use boost() to reduce the weight of the keyword expression, broaden the keyword match to include the same fields you're embedding so the signal is spread across all content, or test whether semantic similarity alone produces good enough results for your use case.
