Filtering out duplicate slugs after a large data import

13 replies
Last updated: Sep 8, 2020
Hi all. We've run a large data import, and it may have generated one or two duplicate slugs. Any ideas on a way to filter those out?
AI Update

Great question! Finding duplicate slugs after a large import is definitely something that comes up. Here's how you can tackle this:

Finding Duplicates with GROQ

The most straightforward approach is to query for all documents with slugs and then identify duplicates programmatically. GROQ has no equivalent of SQL's GROUP BY and HAVING clauses, so you'll need to group the results in your own code:

*[defined(slug.current)] {
  _id,
  _type,
  "slug": slug.current
}

Then in your code (JavaScript/Node.js), you can find duplicates:

const results = await client.fetch('*[defined(slug.current)] { _id, _type, "slug": slug.current }')

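// Group documents by slug. A prototype-less object is used so that slugs
// such as "constructor" or "toString" don't collide with inherited properties.
const slugMap = Object.create(null)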
results.forEach(doc => {
  if (!slugMap[doc.slug]) {
    slugMap[doc.slug] = []
  }
  slugMap[doc.slug].push(doc)
})

// Filter to only duplicates
const duplicates = Object.entries(slugMap)
  .filter(([slug, docs]) => docs.length > 1)
  .map(([slug, docs]) => ({ slug, docs }))

console.log('Duplicates found:', duplicates)
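
One caveat with the grouping above: if the query runs with a token that can read drafts, a draft and its published version (IDs of the form drafts.<id> and <id>) carry the same slug and would be reported as a duplicate pair. The following is a minimal sketch of one way to avoid that, assuming results shaped like the query output above. The helper names publishedId and findDuplicateSlugs are made up for illustration.

```javascript
// Strip the "drafts." prefix so a draft and its published version share one ID.
function publishedId(id) {
  return id.replace(/^drafts\./, '')
}

// Group documents by slug, counting a draft/published pair as a single document.
function findDuplicateSlugs(docs) {
  const idsBySlug = new Map()
  for (const doc of docs) {
    if (!idsBySlug.has(doc.slug)) idsBySlug.set(doc.slug, new Set())
    idsBySlug.get(doc.slug).add(publishedId(doc._id))
  }
  // Keep only slugs that are used by more than one distinct document.
  return [...idsBySlug]
    .filter(([, ids]) => ids.size > 1)
    .map(([slug, ids]) => ({ slug, ids: [...ids] }))
}

// Sample data (invented): "hello" is a draft/published pair, "world" is a real duplicate.
const sample = [
  { _id: 'a1', slug: 'hello' },
  { _id: 'drafts.a1', slug: 'hello' },
  { _id: 'b2', slug: 'world' },
  { _id: 'c3', slug: 'world' },
]
console.log(findDuplicateSlugs(sample))
```

With the sample data, only "world" is reported, since the two "hello" entries are the same document in draft and published form.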

Alternative: Check by Document Type

If you want to narrow it down by document type:

*[_type == "yourDocumentType" && defined(slug.current)] {
  _id,
  "slug": slug.current
}

Prevention: Slug Uniqueness Validation

For the future, you can guard against duplicate slugs at authoring time. The slug field type in Sanity checks uniqueness in the Studio UI, and that check can be customized through the field's isUnique option. It is not enforced at the API level, though, so an import can still create duplicates.

Hope this helps you clean up those duplicates! Let us know if you need help with the cleanup script once you've identified them.

I guess you could do a GROQ query with something like this:
*[_type == 'something']{
  _id,
  slug,
  "hasDuplicateSlug": length(*[_type == 'something' && slug.current == ^.slug.current && _id != ^._id]) > 0
}
Thanks for getting back to me, appreciate it! That came back as true for every something. I tried switching one of the slug.currents for simply slug but got the opposite.
Will keep tinkering...
I updated it recently
So you did... one sec
I forgot you had to filter out the current document
Also, I had put the ] in the wrong place, updated just now. Sorry about that
Ahh no need for apologies, I really appreciate the help. Unfortunately that timed out; there are more than 4k entries we're running it against.
huh, try paginating it?

*[_type == 'something'][0..500]{
...
}
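
For anyone scripting this, a different technique from the sliced subquery above is to page through just the slugs, using _id as a cursor, and do the comparison client-side. The sketch below assumes client is a configured Sanity client with a fetch(query, params) method. The function name fetchAllSlugs and the default page size are illustrative choices.

```javascript
// Fetch every slug of one document type in pages, using _id as a cursor.
// `client` is assumed to expose fetch(query, params) like the Sanity JS client.
async function fetchAllSlugs(client, type, pageSize = 500) {
  const size = Math.max(1, Math.floor(pageSize))
  // The page size is interpolated as a literal integer; the rest uses parameters.
  const query = `*[_type == $type && defined(slug.current) && _id > $lastId]
    | order(_id) [0...${size}] { _id, "slug": slug.current }`

  const all = []
  let lastId = ''
  while (true) {
    const page = await client.fetch(query, { type, lastId })
    all.push(...page)
    // A short (or empty) page means there is nothing left to fetch.
    if (page.length < size) break
    lastId = page[page.length - 1]._id
  }
  return all
}
```

The returned array has the same shape as the results used in the grouping code earlier in the thread, so it can be passed through that step to list the duplicated slugs.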
Great minds, that's exactly what I'm just running 😉
That's the ticket, thanks
user J - I really appreciate that 👍
Ye!
