Dealing with unprompted requests causing high bandwidth and request count.

18 replies
Last updated: Mar 14, 2022
I've already gotten great, personal assistance from Sanity with regard to investigating bandwidth/request concerns, but has anyone out there in the community had experience dealing with what appears to be unprompted requests (or rather, requests unprompted by humans) running up the tally?
I am seeing a few hundred people causing gigs of bandwidth and tens of thousands of requests despite using the API CDN and caching the rendered content server-side through PHP. Today, with fewer visits across all pages, and with more aggressive caching, the same queries resulted in more requests than on days when there were more visits.

To be clear: Sanity is running fine, and honoring every request; it isn't a Sanity issue -- but there are more requests being asked of it than I can account for, despite the code doing nothing more than looping through $client->fetch queries.
AI Update

I feel you on this one - it's frustrating when request counts don't match what you'd expect from actual human traffic. Based on your description and what I know about Sanity's API CDN, here are some likely culprits:

Common causes of unexpected requests

1. The token bypass issue

This is probably the biggest one: if you're passing an authorization token (even a read token) to your Sanity client, you're completely bypassing the API CDN. Every single request hits the origin API directly, no matter what your caching setup looks like.

In your PHP client config, make sure you're doing:

use Sanity\Client as SanityClient;

$client = new SanityClient([
    'projectId' => 'your-project',
    'dataset' => 'production',
    'apiVersion' => '2021-08-31',
    'useCdn' => true,  // This alone isn't enough!
    'token' => null,   // Don't pass a token for public reads
]);

The API CDN is built to absorb cached requests at scale, but any request with an authorization header bypasses it entirely, hits the live API, and counts against your plan limits.

2. GROQ queries with dynamic functions

If your queries use functions like now(), dateTime(), or reference _updatedAt in ways that change constantly, the API CDN can't cache them effectively. Each "unique" query (even with the same structure but different timestamps) counts as a separate request.
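
For example, a minimal sketch of the difference (assuming the $client from the config above; the document type and field names are illustrative, not taken from your schema):

// A per-request timestamp makes every query string unique, so each page view is a CDN miss:
$now = gmdate('Y-m-d\TH:i:s\Z'); // changes every second
$news = $client->fetch(
    '*[_type == "news" && publishedAt <= $now] | order(publishedAt desc)',
    ['now' => $now]
);

// Rounding the value down (here, to the hour) keeps the query identical for every
// visitor in that window, so the API CDN can serve it from cache:
$hour = gmdate('Y-m-d\TH:00:00\Z');
$news = $client->fetch(
    '*[_type == "news" && publishedAt <= $now] | order(publishedAt desc)',
    ['now' => $hour]
);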

3. Bot and crawler traffic

Even with server-side caching, if bots are hitting different pages/URLs on your site, each unique URL might trigger its own set of Sanity queries. SEO crawlers, monitoring services, and malicious bots can generate surprising request volumes. They might be:

  • Ignoring your robots.txt
  • Crawling pages faster than your cache can keep up
  • Hitting URLs with different query parameters that bypass your cache

4. Draft/preview mode leakage

If you have any preview functionality that queries draft content, those requests will bypass the CDN entirely since draft content can't be cached. Check if you accidentally have useCdn: false in some client instances.

Debugging strategies

Since Sanity's already helped you investigate, here are some additional things to check:

  • Check your server logs: Cross-reference your PHP application logs with Sanity request timing - are there background jobs or cron tasks making requests?
  • Monitor for retry logic: If you have any error handling that retries failed requests, a temporary issue could cascade into thousands of retries
  • Verify CDN headers: Test your queries directly and check the response headers to confirm CDN hits vs misses (see the sketch after this list)
  • Look for loops: Sometimes a foreach over query results accidentally triggers nested queries, multiplying requests exponentially
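
As a rough sketch of that header check (Guzzle is already in your vendor directory per the stack trace; the project ID, dataset, and query are placeholders, and exactly which header marks a cache hit is an assumption -- inspect what your responses actually return, e.g. X-Cache or Age):

require 'vendor/autoload.php';

$projectId = 'your-project';  // placeholder
$dataset   = 'production';
$query     = '*[_type == "news"][0...10]{ title, publishedAt }';

// Build the same kind of apicdn query URL that shows up in your error log
$url = sprintf(
    'https://%s.apicdn.sanity.io/v2021-08-31/data/query/%s?%s',
    $projectId,
    $dataset,
    http_build_query(['query' => $query])
);

$response = (new GuzzleHttp\Client())->get($url);
foreach ($response->getHeaders() as $name => $values) {
    echo $name . ': ' . implode(', ', $values) . PHP_EOL;
}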

The aggressive caching paradox

Your observation about more aggressive caching leading to more requests is particularly interesting. This could indicate:

  • Cache invalidation is happening too frequently, causing cache misses
  • Your caching layer is actually making more "check if fresh" requests
  • There's a race condition where multiple processes are checking cache status simultaneously
  • Your cache key generation might not be stable, causing the same query to be treated as different

The fact that this is happening with fewer actual visits but more requests strongly suggests either bot traffic or some kind of automated process that's not directly tied to page views.

Quick wins to try

  1. Remove any tokens from public-facing queries - This is the #1 CDN bypass culprit
  2. Add rate limiting at your server level - Even if it's not a Sanity issue, you probably want to throttle suspicious traffic (see the sketch after this list)
  3. Check for any useCdn: false configurations - Search your codebase for this
  4. Monitor a single query type - Pick one specific query and track just that to isolate the pattern
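
On the rate-limiting point above, a rough sketch of a per-IP throttle that could sit at the top of the flat-file pages (the 30-requests-per-minute threshold and the temp-file bookkeeping are arbitrary placeholders, not a recommendation of exact numbers):

$ip   = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'unknown';
$file = sys_get_temp_dir() . '/hits-' . md5($ip);
$now  = time();

// Keep only this IP's hits from the last 60 seconds, then record the current one
$hits = is_readable($file) ? (array) json_decode(file_get_contents($file), true) : array();
$hits = array_values(array_filter($hits, function ($t) use ($now) {
    return $t > $now - 60;
}));
$hits[] = $now;
file_put_contents($file, json_encode($hits), LOCK_EX);

if (count($hits) > 30) {   // more than 30 page loads per minute from one IP
    http_response_code(429);
    exit('Too many requests');
}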

Hope this helps narrow it down! If you share more about your PHP caching implementation or can identify which specific queries are generating the most requests, the community might spot something specific to that setup.

Hi Vincent. It sounds like you're only accessing Sanity from the server (not the client). If that's the case, I'd recommend making your dataset private to prevent anybody without an auth token from making requests. Only your PHP script will be able to make requests on your behalf.

You can find out more about securing dataset access here.
One thing to note is that when using an auth token to access your dataset, draft documents will be included.
You'll need to update your GROQ queries to exclude draft documents.
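
For example, a minimal sketch of that kind of filter (the document type is illustrative); authenticated requests return drafts alongside published documents, so the query itself has to exclude them:

$published = $client->fetch(
    '*[_type == "news" && !(_id in path("drafts.**"))] | order(publishedAt desc)'
);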
user E
Thanks for the response. We have a setup of flat-file PHP pages, each starting with a define('SANITYCLIENT', true) and then a require of the Sanity PHP client from a separate folder off the root. In that file there's a check for the constant before it allows the query.
We wanted to prevent direct access, or malware / a guessed password, from calling the client willy-nilly, as they'd neither have nor expect that constant. Instead, I could just add it to the couple of pages that actually need it.

If we are getting flooded, I can't afford not to use the apicdn/useCdn, but would a token help here to authenticate if the only things running the fetch are the pages that need to?

The fetches themselves seem to only come from pages that request them, it's just that the number of fetches doesn't match the number of page loads -- I saw gaps of minutes between people visiting where the server's logged as having made multiple requests a half second apart.
I am using awstats in cPanel and it's reporting the same bandwidth sitewide as Sanity itself does for the day.
This is a big, old, sprawling site with, I think, five WordPress sites counted toward the results, all littered with pictures --- vs. six to eight flat-file pages looping through 8-12ish documents, and one slider with fewer than ten images (that I went out of my way to both convert to JPG and then append with the JPG formatting URL parameter) ...the idea that it got worse consumption-wise with more caching and fewer visits is boggling my mind.

It makes me afraid to try more aggressive caching (like fifteen minutes, say, instead of three) because it's almost like "saving" the rendered content is just....randomly re-firing queries off to Sanity instead? I've not seen anything like it so I am wondering what the huge whiff is on my part. Is there a way to rate limit from the querying itself or something?
There is an unholy amount of bot traffic -- would a crawler generate a query like a normal visit would? If it can, my inference would be that we can't control the crawl or nuke them visiting without harming SEO or the analytics...does that sound right?
Thanks for providing those additional details, Vincent.
You make a very good point about using an authentication token with the API CDN. Our API CDN now supports this, but I can see the PHP client hasn't yet been updated to reflect this change. We should update this, but it seems like you've ruled out the possibility that anybody is sending requests to Sanity directly.


The fetches themselves seem to only come from pages that request them, it's just that the number of fetches doesn't match the number of page loads -- I saw gaps of minutes between people visiting where the server's logged as having made multiple requests a half second apart.
Can I ask how you're hosting your PHP service? For example, whether you use PHP-FPM. This comment makes me wonder if something could be going on with request handling or process pooling.
The other thing I wondered is whether it's possible that a file that's querying Sanity is being required multiple times in your app? That would explain why a single request to your site spawns multiple requests to Sanity.
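
To illustrate that failure mode (hypothetical file names, not Vincent's actual code):

// news-data.php -- an include that runs a fetch at the top level
$newsItems = $client->fetch('*[_type == "news"]{ title, publishedAt }');

// page.php -- every plain require runs news-data.php (and its query) again
require 'news-data.php';        // query #1
require 'news-data.php';        // query #2, e.g. pulled in again by a shared partial

// require_once guards against this: the file, and its query, only runs once
require_once 'news-data.php';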

It's tricky to debug much further without seeing your source code. Is that something you're able to share?


it's almost like "saving" the rendered content is just....randomly re-firing queries off to Sanity instead?
That's very odd behaviour indeed. Unless there's a bug in sanity-php, these queries should only happen when explicitly called. This is another thing that makes me wonder if your app is somehow accidentally calling the function multiple times for each request (e.g. if you've inadvertently required a file that makes requests multiple times).
I'd personally be tempted to remove the SANITYCLIENT constant mechanism you have in place. If your server is compromised, I'm not sure this really provides much protection, and it will probably make it trickier for you (or other devs) to reason about your own code. Removing it might make it easier for you to spot any bugs that would cause multiple requests to be sent 🙂.
Thanks again for taking so much time to investigate.

PHPINFO() https://gist.github.com/vincentjflorio/88c015d818d5564932cd63d52d6cbc9a

Example page: https://gist.github.com/vincentjflorio/c07f63970dae14e64ff2e8ae6ab16198

Slightly modified client: https://gist.github.com/vincentjflorio/b6147f6309c189e0c9df5679920284dd
These are the only two errors in the error log; we're not bringing in anything novel as far as serializers go, and our use of links is pulled from the docs and prints out fine on the front end.

I don't think the block content bit is re-running queries on its own; I actually don't know why it has to be fed the same parameters over again...maybe it doesn't and I was being too literal in pulling the samples?

Might there be any benefit skipping the fetching part of the client and just visiting the query addresses?



[22-Feb-2022 20:25:18 UTC] PHP Fatal error:  Uncaught GuzzleHttp\Exception\ConnectException: cURL error 28: Connection timed out after 30000 milliseconds (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://oqceb8ti.apicdn.sanity.io/v2021-08-31/data/query/production?query=%2A%5B_type%20%3D%3D%20%22news%22%20%26%26%20%22Front%20Page%22%20in%20categories%5B%5D-%3Etitle%20%20%5D%20%7B%20publishedAt%2C%20title%2C%20subtitle%2C%20body%20%7D%20%7C%20order%28publishedAt%20desc%29 in /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:210
Stack trace:
#0 /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(158): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array)
#1 /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(110): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#2 /home/nihbweb/public_html in /home/nihbweb/public_html/sioc/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 210


[22-Feb-2022 21:11:51 UTC] PHP Notice:  Undefined index: href in /home/bewbhin/public_html/sioc/vendor/sanity/sanity-php/lib/BlockContent/HtmlBuilder.php on line 120
I can't see any obvious issues in your code. If you search your project for $client->fetch, is it being called anywhere other than in pages?
user E
No. It isn't anywhere else. I placed them manually, and when I downloaded the root and ran a GREP tool on that recursively just in case I brain-farted and made a mistake I couldn't find anything, even when searching with regular expressions.
P.S. I talked to LiteSpeed support and they can't identify the cache being the source of overly repeat/regurgitated queries.
Hmm! That's very peculiar indeed. Have you tried writing to a log each time you make a query?
user E
That's a great idea, actually, thanks. If it's getting triggered somehow tens of thousands of times I might be worried about all those disk writes slowing down their already weird hosting but if it gives more diagnostic utility it's worth a shot.
I might try to briefly augment the function parameters so the logs are genuinely being fired at the same time as the query itself. I'll run something tonight and report back.
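
For what it's worth, a minimal sketch of that kind of logging wrapper (the log path and format are placeholders):

function loggedFetch($client, $query, $params = null)
{
    $line = sprintf(
        "[%s] %s %s\n",
        gmdate('Y-m-d H:i:s'),
        isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : 'cli',
        $query
    );
    // Write the log line immediately before the query fires, so the two always match up
    file_put_contents(__DIR__ . '/sanity-queries.log', $line, FILE_APPEND | LOCK_EX);

    return $client->fetch($query, $params);
}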
user E
Just following up after a couple nights of testing. I think the "mystery" is the poor or limited disclosure with the server logs. Check out the difference just with targeting the three biggest offenders (below). I'm no statistician but that definitely feels like a downward trend. The variation before even for nights/weekends wasn't a notable difference.
One in particular, called PetalBot, is apparently a crawler ramping up to build a reference for an upcoming (or up-and-coming) search engine. But I got four crawls on the same page in a minute. That's craziness. And two of the most popular bots are location-specific to engines used in countries where our site is totally irrelevant, so we don't super care how the absence of the content affects our ranking on those.

One, Semrush, we can't block because apparently they have their fingers in all the information it provides, but by the numbers it's spam-like levels of activity page by page, and it definitely outnumbers the human visits going by this raw generated log, which is supremely annoying for Sanity-related purposes. But I think I can control the costs better now until someone gets in front of all this on their IT side.

Thank you and everyone else for the long-term attention to detail helping me arrive at a solution. I was getting real anxiety from it.
Astounding! There was one in particular called "uptimerobot" that, so far as I can tell, nobody asked to be on there. Obviously it's content agnostic since it's just checking if something is up. Anyway, that one change from last night -- look at the difference below. It promises to run itself at least once every five minutes, so nearly 500 triggers of each query.
You can see the running total taper off its curve. It's especially interesting that the bandwidth doesn't taper off as much, since the bots visit the homepage more than anything and that's where our only pictures are, but there are no file attachments there like there are other places, so maybe they represented "emptier" traffic and people are picking PDFs off the other pages?
In either case I am certain this is the big cheese, so I am marking this extra solved.
😃
Thank you for the detailed follow up, Vincent. I'm sure this would be an interesting read for other folks in the same situation. I'm glad you were able to resolve this by blocking some of the bots!
Caching would probably be helpful for you, too. If you're still experiencing issues getting it in place, please let us know
🙂.
user E
Thanks. This thread is longer and older now, but in my first message the caching I mentioned was server-side. They aren't super tech-savvy, so I would worry about doing, say, an hour at a time, as they'd be confused about not seeing instant changes, so I made it five minutes so that I had a fallback explanation with the Sanity API taking a nonzero amount of time to flush anyway. That definitely also helped: a better actual-user experience in terms of response, and the bot activity just receives old things instead of running up the score on me 😃
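
For reference, a rough sketch of the kind of five-minute, file-based cache described above (the paths, the 300-second TTL, and the writable cache/ directory are placeholders, not the site's actual implementation):

$cacheFile = __DIR__ . '/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$ttl = 300; // five minutes

// Serve the cached render if it's fresh enough -- no Sanity queries fire at all
if (is_readable($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
    readfile($cacheFile);
    exit;
}

// Otherwise render as normal (including the $client->fetch calls), save a copy, and send it
ob_start();
// ... page rendering happens here ...
$html = ob_get_contents();
file_put_contents($cacheFile, $html, LOCK_EX);
ob_end_flush();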
