Our Crawler Found 52 Pages Your Sitemap Forgot
A customer signed up last week. Marketing agency, portfolio of WordPress sites, clean brand, fairly standard setup. They added SiteDialect to one of their client sites and expected to translate the 38 pages listed in the sitemap.
Our crawler reported back: 90 pages.
They blinked at the dashboard. Opened the sitemap.xml. Counted again. 38 entries. Definitely 38. Where did the other 52 come from?
This is a more common story than you'd think.
What our crawler actually does
When you add a site to SiteDialect, we don't just wait for visitors to hit URLs and translate them on demand. That's a reasonable default, but it means the first visitor to any page sits through a translation API round-trip (usually a second or two) before they see their language. That first impression is important.
So we run a crawler. It:
- Pulls your sitemap.xml (and any nested sitemap indexes): the pages you've officially declared
- Spiders internal links up to a configurable depth: the pages you forgot to declare
- Fetches each one and extracts the translatable text
- Feeds them to the pre-warm queue, which translates each page into each language you've enabled
- Caches the translations at our edge before any human visitor arrives
The result is that when a Spanish visitor hits your site for the first time, the translation is already sitting in cache. The page loads as fast as the English original: the translation already exists and just needs to be served.
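For the curious, here's roughly what that discovery step looks like in code. This is a simplified sketch, not our production crawler, and every name in it is illustrative: it pulls a sitemap (recursing into sitemap indexes), then spiders same-host links breadth-first up to a depth limit.

```python
# Simplified sketch of crawl discovery: sitemap first, then a bounded
# spider over internal links. Illustrative only, not SiteDialect's
# production crawler.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> set[str]:
    """Collect page URLs from a sitemap, recursing into sitemap indexes."""
    root = ET.fromstring(urlopen(sitemap_url).read())
    urls: set[str] = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if root.tag == f"{SITEMAP_NS}sitemapindex":
            urls |= sitemap_urls(loc.text.strip())  # nested sitemap
        else:
            urls.add(loc.text.strip())
    return urls

class LinkExtractor(HTMLParser):
    """Pull href values out of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def spider(start_urls: set[str], max_depth: int = 3) -> set[str]:
    """Breadth-first crawl of same-host links, up to a configurable depth."""
    host = urlparse(next(iter(start_urls))).netloc
    seen, frontier = set(start_urls), set(start_urls)
    for _ in range(max_depth):
        next_frontier: set[str] = set()
        for url in frontier:
            parser = LinkExtractor()
            parser.feed(urlopen(url).read().decode("utf-8", "replace"))
            for href in parser.links:
                absolute = urljoin(url, href).split("#")[0]
                # Only follow links on the same host we haven't seen yet.
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    next_frontier.add(absolute)
        frontier = next_frontier
    return seen
```

A production crawler also needs politeness delays, robots.txt handling, and retries; the sketch skips those, but the shape of the discovery is the same.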
Why the sitemap is almost always incomplete
Sitemaps rot. They rot faster than anyone plans for.
When you first launched your site, someone configured the sitemap generator to include the main content types. Then:
- You added a new content type — case studies, testimonials, help articles — and the sitemap config didn't get updated
- A landing page got built for a campaign, lived in a weird subdirectory, and was never added to the main sitemap
- Old marketing pages from a previous era still exist and still get internal links, but were excluded when the sitemap was rebuilt
- Your CMS paginates content and only lists the first page of each archive in the sitemap
- A plugin generates URL variations (filtered archives, search result pages, category views) that are real, crawlable pages but not in the sitemap
We see all of these. Google's crawler sees all of these too, which is why Search Console almost always reports more indexed URLs than your sitemap declares.
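If you want to measure the gap yourself, it's just a set difference between what you declare and what a crawl finds. Using the hypothetical sitemap_urls and spider helpers from the sketch above:

```python
# Hypothetical sitemap URL; the helpers come from the earlier sketch.
declared = sitemap_urls("https://example.com/sitemap.xml")
discovered = spider(declared, max_depth=3)
extra = sorted(discovered - declared)  # the pages your sitemap forgot
print(f"declared: {len(declared)}, discovered: {len(discovered)}, extra: {len(extra)}")
```

For the agency above, that difference came out to 52.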
The 52 pages in the wild
For the agency I mentioned, the 52 "extra" pages broke down like this:
- 23 blog posts from before a category restructure; still publicly live, still internally linked, never re-added to the sitemap after the migration
- 14 case studies generated by a custom post type; the sitemap generator was configured for blog posts and pages, not the custom type
- 7 landing pages under /l/ built by the paid team for various campaigns, intentionally not in the public sitemap but still receiving organic traffic from rankings they had accidentally picked up
- 5 team bio pages linked from the About page but flagged "noindex, follow" at some point: still crawlable, still visited, never in the sitemap
- 3 legacy URLs from the 2020 site structure still accessible through redirects-that-never-got-removed
Zero of these were broken. Zero were hostile discoveries. They were all public, all working pages that real users were landing on every day — and none of them would've been translated if we'd trusted the sitemap.
Without the crawl, every international visitor who hit one of those 52 pages would've seen English, bounced, and we'd never have known.
What the report looks like
After a crawl, the SiteDialect dashboard shows you:
- Total pages discovered
- Breakdown: how many from the sitemap, how many from spidering
- Per-page translation status across each enabled language
- Pages marked "new" (discovered this crawl) vs "changed" (content drifted since last crawl)
- Broken-link detection: pages still linked from somewhere but no longer returning 200
You can manually trigger a pre-warm run at any point — we do it automatically on add-to-site and then weekly thereafter. You can also force a recrawl after a big content push.
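Under the hood, you can think of each row in that report as a small record. Here's one hypothetical shape, not SiteDialect's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PageReport:
    """Illustrative shape of one row in the crawl report."""
    url: str
    source: str       # "sitemap" or "spider"
    status: str       # "new", "changed", or "unchanged"
    http_status: int  # non-200 responses feed broken-link detection
    # language code -> "cached", "queued", or "failed"
    translations: dict[str, str] = field(default_factory=dict)
```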
The quiet compounding benefit
There's a second thing the crawler does that isn't obvious up front. When your content changes — a new blog post, a revised product description, a price update — the crawler notices. It re-fetches the page, compares the extracted text to the last version, and marks it as changed.
You can then either auto-approve the re-translation or review it first. Either way, your translated pages stay in sync with your canonical pages. You don't wake up six months from now to discover that your Spanish pricing page has been quoting the old $29 plan since October.
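The change check itself can be as simple as hashing the extracted text and comparing it against the hash stored from the previous crawl. A minimal sketch, with illustrative names:

```python
import hashlib

def page_status(url: str, extracted_text: str, last_hashes: dict[str, str]) -> str:
    """Return "new", "changed", or "unchanged" and record the latest hash."""
    digest = hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()
    previous = last_hashes.get(url)
    last_hashes[url] = digest
    if previous is None:
        return "new"  # discovered this crawl
    return "changed" if digest != previous else "unchanged"
```

Hashing the extracted text rather than the raw HTML means a theme update or markup tweak doesn't trigger a spurious re-translation; only real content changes do.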
This is the kind of background plumbing that's boring to demo and absolutely critical to operate at scale. Every multilingual site we've seen without it drifts out of sync within a quarter. With it, you just keep editing in English and trust that the translations follow.
Find your "extra pages"
Sign up, add your site, check the discovery count against your sitemap. Most teams are shocked at the gap. The ones that aren't shocked usually discover an orphaned page or two they'd forgotten about entirely.
Either way, it's a healthier inventory of your own site than you had yesterday.
Let us find the pages your sitemap missed
SiteDialect crawls, pre-warms, and keeps translations in sync. Every page. Every language. Automatically.
Get Started Free