Our Crawler Found 52 Pages Your Sitemap Forgot

A customer signed up last week. Marketing agency, portfolio of WordPress sites, clean brand, fairly standard setup. They added SiteDialect to one of their client sites and expected to translate the 38 pages listed in the sitemap.

Our crawler reported back: 90 pages.

They blinked at the dashboard. Opened the sitemap.xml. Counted again. 38 entries. Definitely 38. Where did the other 52 come from?

This is a more common story than you'd think.

What our crawler actually does

When you add a site to SiteDialect, we don't just wait for visitors to hit URLs and translate them on demand. That's a reasonable default, but it means the first visitor to any page sits through a translation API round-trip (usually a second or two) before they see their language. That first impression is important.

So we run a crawler. It:

  1. Pulls your sitemap.xml (and any nested sitemap indexes) — the pages you've officially declared
  2. Spiders internal links up to a configurable depth — the pages you forgot to declare
  3. Fetches each one and extracts the translatable text
  4. Feeds them to the pre-warm queue, which translates each page into each language you've enabled
  5. Caches the translations at our edge before any human visitor arrives
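The discovery half of those steps can be sketched roughly like this. This is a minimal illustration, not our production crawler: `fetch` is a stand-in for an HTTP client, and the example runs against canned pages instead of the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list:
    """Return the <loc> URLs declared in a sitemap.xml document."""
    root = ElementTree.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def discover(fetch, site, sitemap_xml, max_depth=2):
    """Breadth-first spider over internal links, seeded by the sitemap.
    `fetch(url)` returns a page's HTML (stand-in for an HTTP GET)."""
    seen = set(parse_sitemap(sitemap_xml))
    frontier = list(seen)
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            parser = LinkExtractor()
            parser.feed(fetch(url))
            for href in parser.hrefs:
                absolute = urljoin(url, href)
                # Keep only same-site pages we haven't queued yet.
                if urlparse(absolute).netloc == site and absolute not in seen:
                    seen.add(absolute)
                    next_frontier.append(absolute)
        frontier = next_frontier
    return seen

# Tiny worked example: the sitemap declares one page,
# but following links reveals two more.
PAGES = {
    "https://example.com/": '<a href="/pricing">Pricing</a>',
    "https://example.com/pricing": '<a href="/old-campaign">Old campaign</a>',
}
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url></urlset>"
)
found = discover(lambda url: PAGES.get(url, ""), "example.com", SITEMAP)
assert len(found) == 3  # 1 declared + 2 discovered
```

A real crawler also needs robots.txt handling, rate limiting, and deduplication of URL variants (trailing slashes, query strings), but the core loop is just this: seed from the sitemap, then follow links until the depth budget runs out.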

The result is that when a Spanish visitor hits your site for the first time, the translation is already sitting in cache. The page loads as fast as the English original — because the translation exists, it just needs to be served.
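Conceptually, the edge cache is just a lookup keyed by page and language: a pre-warmed entry is a hit served immediately, and a miss falls back to on-demand translation. A toy in-memory sketch (not our actual edge infrastructure):

```python
class TranslationCache:
    """Toy stand-in for the edge cache: translated pages are keyed
    by (path, language) and served without a translation round-trip
    whenever the pre-warm queue has already filled the entry."""

    def __init__(self):
        self._store = {}

    def put(self, path, lang, html):
        self._store[(path, lang)] = html

    def get(self, path, lang):
        # None means a miss: translate on demand, then cache.
        return self._store.get((path, lang))

cache = TranslationCache()
cache.put("/pricing", "es", "<h1>Precios</h1>")
assert cache.get("/pricing", "es") == "<h1>Precios</h1>"  # pre-warmed: instant
assert cache.get("/about", "es") is None                  # not warmed: on-demand path
```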

Why the sitemap is almost always incomplete

Sitemaps rot. They rot faster than anyone plans for.

When you first launched your site, someone configured the sitemap generator to include the main content types. Then the site evolved: a landing page built in a page builder the generator doesn't know about, a one-off campaign microsite, tag and category archives the CMS spins up on its own, a legacy URL that still gets traffic. The generator never caught up.

We see all of these patterns constantly. Google's crawler sees them too, which is why Search Console almost always reports more indexed URLs than your sitemap declares.

The 52 pages in the wild

For the agency I mentioned, none of the 52 "extra" pages were broken, and none were hostile discoveries. They were all public, working pages that real users were landing on every day, and not one of them would've been translated if we'd trusted the sitemap.

Without the crawl, every international visitor who hit one of those 52 pages would've seen English, bounced, and we'd never have known.

What the report looks like

After a crawl, the SiteDialect dashboard shows you the full inventory: which pages came from your sitemap, which ones the crawl discovered, and the pre-warm status of each page for every language you've enabled.

You can manually trigger a pre-warm run at any point; we run one automatically when you add a site, and then weekly after that. You can also force a recrawl after a big content push.

The quiet compounding benefit

There's a second thing the crawler does that isn't obvious up front. When your content changes — a new blog post, a revised product description, a price update — the crawler notices. It re-fetches the page, compares the extracted text to the last version, and marks it as changed.

You can then either auto-approve the re-translation or review it first. Either way, your translated pages stay in sync with your canonical pages. You don't wake up six months from now to discover that your Spanish pricing page has been quoting the old $29 plan since October.
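The compare-and-mark step above can be sketched with a content fingerprint: hash each page's extracted text, compare the fresh crawl against the stored hashes, and queue whatever changed. This is an illustration of the idea, not our actual pipeline; the helper names are hypothetical.

```python
import hashlib

def text_fingerprint(extracted_text: str) -> str:
    """Stable hash of a page's translatable text.
    Collapsing whitespace ignores formatting-only churn."""
    normalized = " ".join(extracted_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def pages_needing_retranslation(previous, current):
    """previous/current map URL -> extracted text from the last and
    latest crawl; returns URLs that are new or whose text changed."""
    changed = []
    for url, text in current.items():
        old = previous.get(url)
        if old is None or text_fingerprint(old) != text_fingerprint(text):
            changed.append(url)
    return sorted(changed)

before = {"/pricing": "Pro plan: $29/month", "/about": "We are a small team."}
after = {"/pricing": "Pro plan: $39/month", "/about": "We are a small team."}
assert pages_needing_retranslation(before, after) == ["/pricing"]
```

Only `/pricing` gets re-queued; `/about` is untouched, so its cached translations stay valid. That's the exact failure mode of the stale $29 pricing page, caught mechanically instead of by an embarrassed customer.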

This is the kind of background plumbing that's boring to demo and absolutely critical to operate at scale. Every multilingual site without it drifts out of sync within a quarter. With it, you just keep editing in English and trust that the translations follow.

Find your "extra pages"

Sign up, add your site, and check the discovery count against your sitemap. Most teams are shocked at the gap. The ones that aren't usually discover an orphaned page or two they'd forgotten about entirely.

Either way, it's a healthier inventory of your own site than you had yesterday.

Let us find the pages your sitemap missed

SiteDialect crawls, pre-warms, and keeps translations in sync. Every page. Every language. Automatically.

Get Started Free