Research · June 2026

39% of hotels are missing from the data that trains AI

We checked 108,109 hotel websites against Common Crawl — the open web archive behind much of what large language models learn from. Nearly two in five aren’t in it at all.

60.6% in Common Crawl
39.4% absent entirely
108,109 domains checked

There are two ways an AI can know about your hotel. It can fetch you live at query time (retrieval — Google Places, OTAs, reviews), or it can already know you from training. The training layer is built largely on Common Crawl, the open monthly web archive LLM builders pull from. If you’re not in the crawl, the model can’t learn you — it can only look you up. Nobody had measured how many hotels actually make it in. So we did.

The result: of 108,109 hotel websites (every hotel in our index with its own domain, a Google Place ID, and ≥10 reviews), 60.6% are in Common Crawl and 39.4% are absent. That 60.6% splits into 41.4% of all domains captured with real depth (5+ pages) and 19.2% captured only shallowly (1–4 pages). So a large minority of legitimate, reviewed hotels are invisible to the training layer of any model built on Common Crawl.

Who’s in, who’s out

A representative sample of the hotels we checked. Green dots are in Common Crawl; red are absent. The pattern is geographic — coverage is denser in some markets than others.

Map: representative sample of ~6,200 hotels (of 108,109 domains checked). Green = present in Common Crawl (May 2026), red = absent.

Present isn’t the same as known

Being in the crawl is one thing; being in it deeply is another. We counted how many pages of each hotel the May 2026 crawl captured.

Pages captured    Domains
0 (absent)        42,575
1–4 (shallow)     20,749
5–19              26,096
20–99             15,267
100–499           2,802
500+              620

Only ~3,400 hotels (3%) are captured deeply (100+ pages). The bulk of those that are “in” sit at 5–99 pages — enough to be known, not enough to be richly represented.
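The headline percentages can be re-derived from the bucket counts above. The following is a minimal Python sketch of that arithmetic; the bucket labels and variable names are ours, chosen for illustration, and the counts are copied from the chart.

```python
# Page-count buckets from the May 2026 crawl, as reported in the chart above.
buckets = {
    "0": 42_575,
    "1-4": 20_749,
    "5-19": 26_096,
    "20-99": 15_267,
    "100-499": 2_802,
    "500+": 620,
}

total = sum(buckets.values())


def share(*labels):
    """Percentage of all checked domains that fall in the given buckets."""
    return 100 * sum(buckets[label] for label in labels) / total


absent = share("0")
shallow = share("1-4")
deep = share("5-19", "20-99", "100-499", "500+")   # 5+ pages
very_deep = share("100-499", "500+")               # 100+ pages

print(f"total domains: {total:,}")
print(f"absent {absent:.1f}% | shallow {shallow:.1f}% | 5+ pages {deep:.1f}%")
print(f"in crawl {shallow + deep:.1f}% | 100+ pages {very_deep:.1f}%")
```

All shares here are fractions of the full 108,109 domains, which is why the shallow and 5+ shares add up to the 60.6% "in crawl" figure.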

Independents beat chains

The intuitive guess — big chains dominate the crawl — is backwards. Independent hotels are markedly more likely to be in Common Crawl than chain properties.

Independents in crawl: 61.0% (n = 105,483 domains)
Chains in crawl: 45.9% (n = 2,735 domains)

More than half of chain domains are absent. The likely reason is structural: chain and corporate sites are more often JavaScript-rendered single-page apps or sit behind a CDN/WAF that turns crawlers away — both make a site hard for Common Crawl to capture. An independent hotel on a plain WordPress site is, paradoxically, easier for the training crawl to read than a global brand’s booking platform.

By country and TLD

Coverage varies sharply by market.

Country           In crawl
Germany           69.3%
France            64.6%
Netherlands       58.8%
Spain             58.6%
Italy             58.5%
United States     57.4%
United Kingdom    53.6%
Indonesia         47.4%

But the TLD tells a sharper story — and one that ties straight into how AI under-serves non-English markets. Local European TLDs are present but shallow: a .de hotel is well-crawled (71%) but at only ~39 pages on average, where a .com hotel averages ~109. And .es is the clear laggard at 37% — Spanish hotels on .com do fine, but the .es TLD itself is poorly crawled.

TLD       In crawl    Avg pages (when present)
.de       71.4%       39
.fr       64.8%       42
.com      60%         109
.it       59.5%       35
.nl       59.1%       56
.co.uk    54.3%       59
.es       37%         87

The crawl knows local-market hotels — but thinly. It’s the same English-leaning tilt we see in live AI answers, showing up a layer earlier, in the training data itself.

Why some hotels are missing

Absence isn’t random. Three things keep a hotel out of the crawl:

  • Rendering. If your site only assembles its content after JavaScript runs, the crawler often captures an empty shell. Static HTML gets read; SPA booking widgets frequently don’t.
  • Access. A CDN or firewall that blocks bots — often a default no one chose — turns the crawler away before it reaches a page. This is invisible in robots.txt, which is why our AI-blocking study’s ~3% figure is a floor.
  • Connectivity. Common Crawl prioritises well-linked domains. A site few others link to gets crawled rarely and shallowly. Metehan Yeşilyurt’s work on Common Crawl rank (Harmonic Centrality) lays this out — and you can check a domain’s rank at webgraph.metehan.ai.
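The rendering problem is the easiest of the three to test for yourself. Below is a rough Python sketch, not the method used in this study, that counts the words a non-JavaScript crawler could read in a page's raw HTML. The 50-word threshold and the sample pages are arbitrary assumptions for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html):
    """Number of words readable in the raw HTML without running any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def looks_like_empty_shell(html, min_words=50):
    # ASSUMPTION: 50 words is an arbitrary cut-off, chosen only for this example.
    return visible_word_count(html) < min_words


# Hypothetical pages: a JS-only app shell and a plain static page.
spa = (
    "<html><head><title>Hotel</title>"
    "<script>var app = 'content is assembled here at runtime';</script>"
    "</head><body><div id='root'></div></body></html>"
)
static = (
    "<html><body><h1>Hotel Example</h1><p>"
    + "Rooms with a view. " * 20
    + "</p></body></html>"
)

print(visible_word_count(spa), looks_like_empty_shell(spa))
print(visible_word_count(static), looks_like_empty_shell(static))
```

To try it on a real site, pass in the HTML returned by a plain HTTP fetch (no browser). A very low word count suggests the content only appears after JavaScript runs.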

What to do: the fixes are the cheap, structural ones — serve real HTML (not JS-only), make sure no CDN/WAF setting is quietly blocking AI crawlers, and earn a few quality links so you’re worth crawling. Check your own site in seconds with the Common Crawl checker. For hotels, this is the secondary lever — most AI hotel answers are built from live retrieval, not trained memory (see the two-layers guide) — but it costs nothing to not wall yourself out of it.
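One part of the access check can be done offline: whether robots.txt itself turns away CCBot, the user-agent token Common Crawl's crawler uses. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt contents and the `example-hotel.test` URL are made up for illustration. As noted above, a CDN or WAF block will not show up here, so a pass on this check is not proof that the crawler gets through.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, user_agent="CCBot",
                    url="https://example-hotel.test/"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt files for illustration.
blocks_ccbot = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

open_site = """\
User-agent: *
Disallow: /admin/
"""

print(crawler_allowed(blocks_ccbot))                          # CCBot is shut out
print(crawler_allowed(blocks_ccbot, user_agent="Googlebot"))  # other bots are not
print(crawler_allowed(open_site))                             # CCBot may fetch the home page
```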

Method & limits

We took every hotel in our 200K-property index with its own website, a Google Place ID, and at least 10 reviews: 142,405 hotels, resolving to 108,109 distinct domains (properties in a chain can share one). Junk (parked domains, closed properties, non-hotels) was filtered out. Each domain was checked against the columnar URL index of the May 2026 Common Crawl snapshot (CC-MAIN-2026-21), counting captured pages for the host and its www variant. In the crawl = 1+ pages, split into real depth (5+) and shallow (1–4); absent = 0.
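The per-domain classification can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used for the study: the function names are ours, and the lookup URL targets Common Crawl's public CDX index server rather than the columnar index the study queried. The URL pattern for the CC-MAIN-2026-21 snapshot is assumed to follow that of earlier snapshots.

```python
from urllib.parse import urlencode

CRAWL_ID = "CC-MAIN-2026-21"  # snapshot named in the method section


def host_variants(domain):
    """Return the bare host and its www variant; both are counted."""
    host = domain.lower().strip().rstrip(".")
    bare = host[4:] if host.startswith("www.") else host
    return [bare, "www." + bare]


def depth_bucket(pages):
    """Map a captured-page count to the article's three categories."""
    if pages == 0:
        return "absent"
    if pages <= 4:
        return "shallow"
    return "deep"  # 5+ pages


def classify_domain(pages_by_host):
    """Sum captured pages over a domain's host variants, then bucket the total."""
    return depth_bucket(sum(pages_by_host.values()))


def cdx_query_url(host, crawl_id=CRAWL_ID, limit=1000):
    # ASSUMPTION: CDX index server lookup, swapped in for the columnar index.
    # Each line of the JSON response describes one captured URL.
    params = urlencode({"url": f"{host}/*", "output": "json", "limit": limit})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


for host in host_variants("www.example.com"):
    print(cdx_query_url(host))

# Hypothetical counts for one domain: 0 pages on the bare host, 3 on www.
print(classify_domain({"example.com": 0, "www.example.com": 3}))
```

Fetching each URL and counting the response lines would give a per-host capture count, up to the `limit`. For a full-index run like the one described here, the columnar index is the better fit, since it avoids one HTTP request per host.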

Limits. It’s one monthly snapshot — a site absent in May may appear next month. Content served on an unrelated sub-domain isn’t counted. “Absent” means not captured, which could be a block, JS-only rendering, low connectivity, or a new domain — we don’t attribute the cause per hotel. And presence in the crawl is necessary, not sufficient, for a model to actually surface you.

FAQ

Of 108,109 hotel websites checked against the May 2026 snapshot, 60.6% are in Common Crawl and 39.4% are absent. The 60.6% splits into 41.4% of all domains with real depth (5+ pages captured) and 19.2% that are shallow (1–4 pages).

