Your CMS is already an AI backend. Here's how to treat it like one.

Everyone has access to the same foundation models. What separates reliable AI systems from unreliable ones is the data underneath them. If you're scraping your own website to feed a RAG pipeline, you're starting from a mess. If your content is already structured, classified, and governed in your CMS, you're most of the way there. This guide covers how to treat your CMS like the AI backend it should be.

  • Knut Melvær

    Principal Developer Marketing Manager

For the past few years, the conversation around AI and content has been entirely about output. How do we generate more blog posts? More product descriptions? More of everything, faster? But teams building serious AI systems — customer support agents, internal knowledge bases, AI-powered search — are hitting a different wall. Not compute. Not model quality. Data quality.

If you want an AI agent to accurately represent your brand, answer complex product questions, or automate content workflows, you can't train it on a scraped version of your website. You need structured, semantically rich data with clear provenance and governance. If you're using Sanity, you're already sitting on exactly that. The question is whether you're managing it that way.

This post covers practical patterns for making your Sanity content AI-ready: schema contracts, PII hygiene, export projections, versioning, and how to connect your content directly to agents without building a custom RAG pipeline from scratch.

The problem with scraping your own site

When most teams decide to build a RAG system or a custom AI agent, they point a crawler at their domain, strip the HTML, and dump the text into a vector database.

This destroys context.

A product page for enterprise software contains several distinct things: a main product description, a deprecation warning, a customer quote, and footer nav links. When you scrape that page into unstructured text, the model loses all of those boundaries. It might present a customer testimonial as an official company guarantee. It might ingest a footer link and treat it as a product feature. It doesn't know the warning is a warning.

Unstructured blobs create noisy, unreliable datasets. They leak PII. They have no provenance. They break downstream pipelines when content changes unexpectedly. And they produce the hallucinations that make AI systems untrustworthy.

Why structured content is different

Structured content solves the context problem at the root. It's the founding philosophy behind Sanity: content is data.

When content is managed in Sanity, it's not stored as web pages. It's stored as structured JSON in the Content Lake. A product document doesn't look like a rendered page — it looks like this:

{
  "_type": "product",
  "name": "Enterprise Analytics Suite",
  "status": "active",
  "targetAudience": ["enterprise", "data-science"],
  "features": [
    { "name": "Real-time dashboards", "description": "..." },
    { "name": "Predictive modeling", "description": "..." }
  ],
  "deprecatedFeatures": [
    { "name": "Legacy CSV Export", "removalDate": "2025-12-31" }
  ],
  "supportPolicy": {
    "sla": "24/7",
    "responseTime": "1 hour"
  }
}

When a user asks your agent "Does the Analytics Suite still support CSV export?", the agent can query the deprecatedFeatures array directly. No guessing. No hallucination risk on that field.
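To make that concrete, here's a minimal sketch of how an agent-side tool could answer a feature-status question by reading the structured fields directly rather than inferring from prose. The shapes mirror the product JSON above; the `featureStatus` helper and the trimmed `product` value are illustrative, not part of any Sanity API.

```typescript
// Sketch: answer a feature-status question from structured fields.
// Field names (features, deprecatedFeatures, removalDate) follow the
// example product document above; this helper is hypothetical.
type Feature = { name: string; removalDate?: string }

interface Product {
  features: Feature[]
  deprecatedFeatures: Feature[]
}

function featureStatus(product: Product, featureName: string): string {
  const deprecated = product.deprecatedFeatures.find((f) => f.name === featureName)
  if (deprecated) return `deprecated, removal scheduled for ${deprecated.removalDate}`
  const active = product.features.find((f) => f.name === featureName)
  return active ? 'active' : 'unknown'
}

const product: Product = {
  features: [{ name: 'Real-time dashboards' }],
  deprecatedFeatures: [{ name: 'Legacy CSV Export', removalDate: '2025-12-31' }],
}

console.log(featureStatus(product, 'Legacy CSV Export'))
// → "deprecated, removal scheduled for 2025-12-31"
```

The answer comes from a field lookup, not from the model's interpretation of a rendered page.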

Metadata filtering. Structured content lets you attach metadata to every piece of information — audience, region, product line, lifecycle stage. In a RAG system, this metadata lets you pre-filter your vector search. A user in the UK gets results filtered to region: "UK". The model only reasons over relevant data, which cuts hallucinations and improves response accuracy.
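A pre-filter like that can be sketched as a plain function applied before similarity search. The metadata field names (`region`, `audience`) are illustrative stand-ins for whatever classification your schema carries:

```typescript
// Sketch: narrow the candidate set by metadata before running vector
// similarity, so the model only reasons over relevant documents.
// Field names are illustrative, not a fixed Sanity schema.
type Candidate = {
  id: string
  region: string
  audience: string[]
}

function preFilter(
  candidates: Candidate[],
  criteria: { region?: string; audience?: string }
): Candidate[] {
  return candidates.filter(
    (c) =>
      (!criteria.region || c.region === criteria.region) &&
      (!criteria.audience || c.audience.includes(criteria.audience))
  )
}

const candidates: Candidate[] = [
  { id: 'a', region: 'UK', audience: ['enterprise'] },
  { id: 'b', region: 'US', audience: ['enterprise'] },
]

console.log(preFilter(candidates, { region: 'UK' }).map((c) => c.id))
// → [ 'a' ]
```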

Clean extraction. When you pull content into a pipeline, you don't want presentation logic, CSS classes, or routing logic mixed in. With GROQ (Graph-Relational Object Queries), you project exactly the fields you need, nothing else.

Schemas as dataset contracts

When content feeds an AI pipeline, the shape of that data matters. If a field gets renamed or a required field goes null, your embedding scripts fail — or silently ingest garbage.

Your schema needs to act as a strict dataset contract: required fields, allowed values, nullability rules, and provenance requirements. Schema changes (migrations, deprecations, backfills) need the same rigor as database migrations.

Here's an example schema designed for governed AI use:

import { defineType, defineField } from 'sanity'

export const aiTrainingArticle = defineType({
  name: 'aiTrainingArticle',
  title: 'AI Training Article',
  type: 'document',
  fields: [
    defineField({
      name: 'title',
      type: 'string',
      validation: Rule => Rule.required()
    }),
    defineField({
      name: 'body',
      type: 'array',
      of: [{ type: 'block' }]
    }),
    // Data classification
    defineField({
      name: 'dataClassification',
      type: 'string',
      options: { list: ['public', 'internal', 'confidential', 'regulated'] },
      initialValue: 'internal',
      validation: Rule => Rule.required()
    }),
    // PII tracking
    defineField({
      name: 'piiPresent',
      type: 'boolean',
      initialValue: false
    }),
    // Provenance
    defineField({
      name: 'provenance',
      type: 'object',
      fields: [
        {
          name: 'source',
          type: 'string',
          options: { list: ['human', 'ai_generated', 'ai_assisted'] },
          validation: Rule => Rule.required()
        },
        {
          name: 'modelUsed',
          type: 'string',
          hidden: ({ parent }) => parent?.source === 'human'
        }
      ]
    }),
    // Lifecycle
    defineField({ name: 'labels', type: 'array', of: [{ type: 'string' }] }),
    defineField({ name: 'effectiveDate', type: 'datetime' }),
    defineField({ name: 'expiresAt', type: 'datetime' }),
    defineField({ name: 'owner', type: 'reference', to: [{ type: 'author' }] }),
    defineField({
      name: 'reviewStatus',
      type: 'string',
      options: { list: ['draft', 'reviewed', 'approved', 'rejected'] }
    }),
    defineField({
      name: 'version',
      type: 'number',
      initialValue: 1,
      readOnly: true
    })
  ]
})

A populated document from this schema gives downstream pipelines a predictable, parseable payload:

{
  "_id": "article-123",
  "_type": "aiTrainingArticle",
  "title": "Q3 Financial Compliance Guidelines",
  "dataClassification": "regulated",
  "piiPresent": true,
  "provenance": {
    "source": "ai_assisted",
    "modelUsed": "gpt-4-turbo"
  },
  "labels": ["compliance", "finance", "q3"],
  "effectiveDate": "2026-07-01T00:00:00Z",
  "expiresAt": "2026-09-30T23:59:59Z",
  "reviewStatus": "approved",
  "version": 2
}

PII hygiene

Once PII or confidential data ends up in a vector database or fine-tuning dataset, it's nearly impossible to remove. Structured content reduces this risk by letting you isolate sensitive information at the field level rather than trying to parse and redact it from a blob of text.

Practical approaches:

  • Separate fields. Keep "public text" fields strictly separate from "internal notes" and "customer details." Never mix them in a single rich text field.
  • Access control. Mark sensitive fields as readOnly or hidden in the Studio for users who shouldn't see them. Sanity's Roles and Access Control handles this at the schema level.
  • Classification enums. Use a dataClassification field (public | internal | confidential | regulated) to explicitly label document sensitivity. Your export pipelines filter on this before anything reaches an embedding model.
  • PII flags. A piiPresent boolean or piiCategories array lets pipelines automatically drop or quarantine flagged documents.
  • Strict export boundaries. Regulated or confidential fields should never reach training pipelines unless explicitly cleared by legal and privacy review.
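These rules can be enforced as a simple guard in the export pipeline. Here's a sketch that mirrors the governance fields from the `aiTrainingArticle` schema above; the `isExportable` helper itself is hypothetical:

```typescript
// Sketch: a gate that mirrors the governance rules above, applied
// before a document can enter an embedding or training pipeline.
// Field names follow the aiTrainingArticle schema in this post.
interface GovernedDoc {
  dataClassification: 'public' | 'internal' | 'confidential' | 'regulated'
  piiPresent: boolean
  reviewStatus: 'draft' | 'reviewed' | 'approved' | 'rejected'
}

function isExportable(doc: GovernedDoc): boolean {
  return (
    doc.dataClassification === 'public' &&
    !doc.piiPresent &&
    doc.reviewStatus === 'approved'
  )
}
```

Anything flagged `regulated`, carrying PII, or not yet approved simply never reaches the pipeline, which is far cheaper than trying to scrub it out of a vector store later.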

Provenance and lineage

When an agent produces a bad output from your RAG system, your first question is: where did the model get this? Your second is: did a human write that source, or did another AI?

Track content lineage from the start. The provenance object in the schema above records whether content was human, ai_generated, or ai_assisted, and which model was involved. If you later discover a specific model produced low-quality content, you can query *[provenance.modelUsed == "legacy-model-v1"] and flag everything for human review. That's a much better position than having no idea where bad content came from.

Export projections: stable shapes for pipelines

Never dump raw Sanity JSON directly into your vector store. Raw documents contain draft data, internal editorial comments, and deeply nested Portable Text that models process inefficiently.

Use GROQ projections instead. They let you reshape data, strip internal fields, flatten rich text to plain text, and ensure a stable shape for ingestion scripts:

*[_type == "aiTrainingArticle"
  && dataClassification == "public"
  && piiPresent == false
  && reviewStatus == "approved"
  && effectiveDate <= now()
  && (!defined(expiresAt) || expiresAt > now())
] {
  _id,
  "content": pt::text(body),
  "metadata": {
    "title": title,
    "labels": labels,
    "source": provenance.source,
    "version": version
  }
}

This query only exports public, approved content without PII, excludes expired and draft content, and uses pt::text() to convert Portable Text into a flat string ready for embedding.

Versioning and audit trails

AI systems are sensitive to changes in their underlying data. If a user reports a bad output, you need to know exactly what the source content looked like at the time.

Sanity's History API provides a complete, queryable log of every mutation made to a document — not just what it looks like today, but its exact state at any point in time. Log the document revision ID alongside your vector embeddings and you have a verifiable audit trail. When a document is updated, diff the change against the previous revision and decide whether the embedding needs to be recomputed or whether it was a cosmetic fix that doesn't affect meaning.
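The "recompute or skip" decision can be sketched as a diff over meaning-bearing fields only. Which fields count as substantive is an assumption you'd tune per schema; the field list below is illustrative:

```typescript
// Sketch: decide whether an update warrants re-embedding by diffing
// only meaning-bearing fields. Cosmetic fields (e.g. internal notes,
// slugs) are ignored. The SUBSTANTIVE_FIELDS list is an assumption.
type Doc = Record<string, unknown>

const SUBSTANTIVE_FIELDS = ['title', 'body', 'labels']

function needsReembedding(prev: Doc, next: Doc): boolean {
  return SUBSTANTIVE_FIELDS.some(
    (field) => JSON.stringify(prev[field]) !== JSON.stringify(next[field])
  )
}

needsReembedding({ title: 'Q3 Guidelines' }, { title: 'Q4 Guidelines' }) // true
needsReembedding({ title: 'Q3 Guidelines', slug: 'a' }, { title: 'Q3 Guidelines', slug: 'b' }) // false
```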

Evaluation sets

You can't improve an AI system you can't measure. Evaluation sets (eval sets) are the benchmark — known questions, expected answers, and the specific content the model is allowed to use as context.

Manage eval sets directly in your CMS alongside the content they test:

export const llmEvalCase = defineType({
  name: 'llmEvalCase',
  title: 'LLM Evaluation Case',
  type: 'document',
  fields: [
    defineField({ name: 'prompt', type: 'text' }),
    defineField({ name: 'expectedAnswer', type: 'text' }),
    defineField({
      name: 'allowedSources',
      type: 'array',
      of: [{ type: 'reference', to: [{ type: 'aiTrainingArticle' }] }]
    }),
    defineField({ name: 'scoringRubric', type: 'text' }),
    defineField({
      name: 'targetSnapshotId',
      type: 'string',
      description: 'Content revision ID from the History API'
    })
  ]
})

Tying the eval case to a targetSnapshotId keeps your tests stable as content evolves. When a document changes significantly, your CI/CD pipeline flags the corresponding eval case for review.
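The CI check itself reduces to comparing each eval case's pinned snapshot against the current revision of the document it tests. A minimal sketch, assuming revision IDs come from the document's history (the lookup shape here is illustrative):

```typescript
// Sketch: flag eval cases whose pinned content snapshot no longer
// matches the current revision of the document they test. The
// currentRevs map is a stand-in for a History API lookup.
interface EvalCase {
  id: string
  sourceDocId: string
  targetSnapshotId: string
}

function staleEvalCases(
  cases: EvalCase[],
  currentRevs: Record<string, string>
): string[] {
  return cases
    .filter((c) => currentRevs[c.sourceDocId] !== c.targetSnapshotId)
    .map((c) => c.id)
}

const cases: EvalCase[] = [
  { id: 'eval-1', sourceDocId: 'article-123', targetSnapshotId: 'rev-a' },
  { id: 'eval-2', sourceDocId: 'article-456', targetSnapshotId: 'rev-b' },
]

console.log(staleEvalCases(cases, { 'article-123': 'rev-a', 'article-456': 'rev-c' }))
// → [ 'eval-2' ]
```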

Connecting agents directly to your content

The patterns above apply when you're building external ML pipelines or fine-tuning workflows. But for most AI agent use cases — RAG-powered support bots, search interfaces, content-aware assistants — you don't need to build all that infrastructure yourself.

Sanity provides two MCP (Model Context Protocol) endpoints designed for different use cases.

The Sanity MCP server at mcp.sanity.io gives AI agents in tools like Claude Code, Cursor, and VS Code full read/write access to your workspace. Agents can query content, manage releases, deploy schemas, and execute GROQ queries with full awareness of your content model. This is the right tool for developer workflows and content operations automation.

Agent Context is a separate, read-only MCP endpoint for production-facing agents — the customer support bot, the site search, the product recommendation engine. It gives agents schema-aware access to a scoped dataset, with semantic search built in. Agents can query fields, follow references, and use semantic search without a separate vector database to maintain. Embeddings live in the Content Lake alongside your content. When a price changes or a product gets discontinued, the embeddings update — no reindexing pipeline, no sync lag.

If you're building advanced workflows, Agent Actions let you trigger schema-aware automation from anywhere: generate new content, transform existing documents, translate at scale, all invoked from Studio, Compute, your frontend, or any code that can call an API.

How to model content for agent use

Four principles that matter more than anything else:

Model semantics, not pages

The most common mistake: building your schema around your website's visual layout. If your schema has fields like heroSection, leftColumnText, and blueButtonLabel, you've built a page builder. An agent can't reason over that.

Model around meaning instead. Product, Person, Tutorial, Policy. Your Next.js app decides how to display the data. Your agents consume the structured data directly. Both get what they need from the same source.

Break down rich text

A single rich text field with 3,000 words of mixed content is hard to chunk for a vector database. Use Portable Text to break content into structured blocks. Better: use array fields to create logical sections. A tutorial should be an array of step objects, each with a heading, instruction, and codeSnippet. Each step indexes individually. Retrieval gets dramatically more accurate.
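To show why the step structure pays off at retrieval time, here's a sketch of chunking a step-structured tutorial into one retrieval unit per step. The `heading` / `instruction` / `codeSnippet` shape follows the modeling suggested above; the chunk ID format is an arbitrary choice:

```typescript
// Sketch: one retrieval chunk per tutorial step, so each step indexes
// individually instead of the whole tutorial landing in one blob.
interface Step {
  heading: string
  instruction: string
  codeSnippet?: string
}

interface Tutorial {
  _id: string
  title: string
  steps: Step[]
}

function chunkTutorial(tutorial: Tutorial): { id: string; text: string }[] {
  return tutorial.steps.map((step, i) => ({
    // Chunk IDs keep a pointer back to the source document and step.
    id: `${tutorial._id}#step-${i + 1}`,
    text: `${tutorial.title}: ${step.heading}\n${step.instruction}`,
  }))
}

const tutorial: Tutorial = {
  _id: 'tutorial-abc',
  title: 'Deploying the Studio',
  steps: [
    { heading: 'Install the CLI', instruction: 'Run npm install -g sanity.' },
    { heading: 'Deploy', instruction: 'Run sanity deploy from the project root.' },
  ],
}

console.log(chunkTutorial(tutorial).map((c) => c.id))
// → [ 'tutorial-abc#step-1', 'tutorial-abc#step-2' ]
```

A query about deployment now retrieves the deploy step alone, not 3,000 words of surrounding tutorial.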

Use references for relationships

If an article relates to a specific product, don't just mention the product name in the body text. Create an explicit reference field linking the article to the product document. Implicit relationships require the model to guess the connection. Explicit references are hard, queryable data that can be injected directly into the context window.

Use controlled vocabularies

Free-text tags fragment into noise: "Finance," "finance," "Financial," "fin." Use the Reference field to build a controlled vocabulary — a dedicated Category or Topic document type that articles reference. Downstream pipelines get clean, standardized IDs instead of messy strings. Metadata filtering becomes reliable.
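The payoff of a controlled vocabulary can be sketched as a resolver that maps messy free-text tags to stable document IDs. The category documents and the alias-based matching rule here are illustrative assumptions:

```typescript
// Sketch: resolve free-text tags against a controlled vocabulary of
// category documents, so pipelines see stable IDs instead of the
// "Finance" / "finance" / "fin" mess. Aliases are an assumed field.
interface Category {
  _id: string
  title: string
  aliases: string[]
}

function resolveTag(tag: string, vocabulary: Category[]): string | null {
  const needle = tag.trim().toLowerCase()
  const hit = vocabulary.find(
    (c) => c.title.toLowerCase() === needle || c.aliases.includes(needle)
  )
  return hit ? hit._id : null
}

const vocabulary: Category[] = [
  { _id: 'category.finance', title: 'Finance', aliases: ['fin', 'financial'] },
]

console.log(resolveTag('financial', vocabulary))
// → "category.finance"
```

In Sanity itself you'd get this for free by making tags a reference field, so editors can only pick existing category documents in the first place.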

Event-driven pipelines with Compute

Weekly batch exports aren't sufficient for production AI applications. If a compliance document gets updated or retracted, your RAG system needs to reflect that change immediately.

Sanity Compute lets you deploy TypeScript functions directly to the Content Lake. They fire in response to content events — document created, updated, deleted. A function triggered on publish can recompute the embedding for that specific document and update your vector store in near real time. You configure GROQ-powered webhooks to fire only when substantive fields change, ignoring cosmetic edits like typo fixes.
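The core of such a function is routing each content event to the right vector-store action. A minimal sketch, with an illustrative event shape rather than the exact payload Sanity delivers:

```typescript
// Sketch: route content events to vector-store actions. The event and
// action shapes are illustrative, not Sanity Compute's actual payload.
type ContentEvent =
  | { type: 'publish'; docId: string }
  | { type: 'delete'; docId: string }

type Action =
  | { op: 'upsertEmbedding'; docId: string }
  | { op: 'purgeVectors'; docId: string }

function route(event: ContentEvent): Action {
  switch (event.type) {
    case 'publish':
      // Recompute the embedding for just this document.
      return { op: 'upsertEmbedding', docId: event.docId }
    case 'delete':
      // Deletions must purge, or the index serves retracted content forever.
      return { op: 'purgeVectors', docId: event.docId }
  }
}
```

The important part is that deletes get first-class handling; pipelines that only react to creates and updates quietly accumulate stale vectors.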

Co-locating this logic with your content infrastructure means you're not maintaining a separate fleet of external services that nobody fully understands.

Anti-patterns to avoid

The mega-rich-text field. Stuffing your entire article — including author bio, related links, and legal disclaimers — into a single rich text field. Models can't parse the semantic boundaries. Break content into discrete fields.

Ignoring deletions. Pipelines that only handle creates and updates will serve outdated or legally retracted information indefinitely. Use webhooks to listen for delete events and purge the corresponding vectors.

Implicit relationships. Don't mention product names in body text and expect the model to figure out the connection. Use explicit reference fields.

Mass exports without governance. Legacy content, draft content, and internal notes will poison your dataset. Use GROQ projections and classification fields to explicitly opt content in, rather than defaulting to a full export.

The real differentiator

The models themselves are table stakes. Everyone has access to the same foundation models. What separates reliable AI systems from unreliable ones is the quality, structure, and governance of the data underneath them.

Teams scraping their own websites for training data will spend months cleaning HTML, fighting hallucinations, and rebuilding pipelines every time a page structure changes. Teams managing their content as structured data — with explicit classification, provenance tracking, and clean export shapes — can move immediately to the interesting problems.

Sanity is built around this idea: content as data, structured for any consumer, including AI agents. The Content Operating System is not just a publishing tool. It's the data layer your AI strategy runs on.

The structured content is already there. Govern it well, and you're not starting from scratch. You're already most of the way there.
