Why does it matter if a hotel is in Common Crawl?

Common Crawl is one of the major open-web sources of LLM training data. If a hotel’s own site is not in it, a model cannot learn the hotel from that source, though it may still pick the hotel up indirectly through OTAs, reviews and travel guides, and can still look it up through live search when the engine runs one. For hotels this is the secondary lever (most AI hotel answers are built from live retrieval of Google Places and reviews), but being in the crawl is free and lets a model speak to the brand from its own site, not just from intermediaries.

Research · June 2026

39% of hotel websites are missing from Common Crawl

Name: Hotel Presence in Common Crawl 2026
Creator: Nicolas Sitter
Published: 2026-06-09
License: https://creativecommons.org/licenses/by/4.0/

We checked 108,109 hotel website domains against Common Crawl — one of the major open-web datasets AI training is built on. Nearly two in five aren’t in the May 2026 snapshot at all.

60.6% in Common Crawl · 39.4% absent entirely · 108,109 domains checked

Quick answers

How can I check if my hotel website is in Common Crawl?

To check if your hotel website is in Common Crawl, use the free Common Crawl checker at nicolassitter.com/tools/common-crawl: enter your domain and it queries the public columnar index for recent snapshots. A 5-minute self-check matters because in a study of 108,109 hotel website domains against the May 2026 snapshot (CC-MAIN-2026-21), 39.4% were absent entirely. To improve coverage, fix the three things that keep hotels out: serve real static HTML rather than JavaScript-only pages, make sure no CDN or firewall setting silently blocks AI crawlers like CCBot, and earn a few quality inbound links, since Common Crawl decides what to capture by Harmonic Centrality, a link-graph connectivity score. — Nicolas Sitter, Are Hotels in Common Crawl? 39% of Hotel Websites Are Missing (2026)

Are hotel websites in Common Crawl and AI training data?

Most hotel websites are in Common Crawl, but a large minority are not: of 108,109 distinct hotel website domains (resolved from 142,405 reviewed properties) checked against the May 2026 snapshot, 60.6% are captured and 39.4% are absent. As shares of all domains checked, 41.4% are captured deeply (5+ pages) and 19.2% shallowly (1–4 pages); the median in-crawl hotel holds just 9 pages, often the whole small site. Coverage varies by market: Germany 69.3%, France 64.6%, down to the .es TLD at 37%. Independents (61%) beat chains (45.9%), though that gap is driven almost entirely by Louvre Hotels' thin microsites at 11.8%. Being in Common Crawl is necessary but not sufficient for a model to surface you. — Nicolas Sitter, Are Hotels in Common Crawl? 39% of Hotel Websites Are Missing (2026)

There are two ways an AI can know about your hotel. It can fetch you live at query time (retrieval: a web search, with or without scraping the pages, plus Google Places, OTAs and reviews), or it can already know you from training. The open-web part of that training layer often starts with Common Crawl, the monthly web archive many LLM builders use. If your own site isn’t in the crawl, a model can’t learn your hotel from that source, though it may still pick you up indirectly through OTAs, reviews and travel guides. I hadn’t seen anyone look at this for hotels specifically, so I dug into it.

The result: we started from 142,405 hotel properties and resolved them to 108,109 distinct website domains (every hotel in our index with its own domain, a Google Place ID, and ≥10 reviews). Against the May 2026 snapshot, 60.6% are captured in Common Crawl and 39.4% are absent. Overall, 41.4% are captured deeply (5+ pages) and 19.2% only shallowly (1–4 pages). So the official websites of a large minority of legitimate, reviewed hotels are invisible to this major open-web training layer. All figures below are domain-level, not property-level, unless stated.

Two things to keep straight before the numbers. First, being in Common Crawl isn’t the same as being in a model’s final training set — that data gets filtered, deduplicated and down-weighted. But within this source, absence is decisive: a page that was never captured can’t survive any later filtering. Second, the point isn’t that AI can’t find these hotels at all. It’s that the hotel’s own website is missing from one of the main open-web memory layers, leaving OTAs, directories and review platforms to define the property instead.

Who’s in, who’s out

A representative sample of the hotels we checked. Green dots are in Common Crawl; red are absent. The pattern is geographic — coverage is denser in some markets than others.

Map: representative sample of ~6,200 hotels (of 108,109 domains checked). Green = present in Common Crawl (May 2026), red = absent.

How many pages — and why that matters less here

We counted how many pages of each hotel the May 2026 crawl captured. A caveat before reading this chart: a hotel website is a small site. Home, a rooms page, a few room types, a gallery, contact, maybe a restaurant — that’s often the whole thing. The median hotel that’s in the crawl is captured at just 9 pages, and 9 pages can be the entire site. So unlike a news or e-commerce domain, depth isn’t really the worry for a hotel — presence is. The line that matters is 0 vs. 1+ — or rather, that’s the first line. The second is whether those captured pages carry real hotel content and not just a JavaScript shell (more on rendering below).

Pages captured	Domains
0 (absent)	42,575
1–4 (shallow)	20,749
5–19	26,096
20–99	15,267
100–499	2,802
500+	620

The ~3,400 hotels (3%) captured at 100+ pages are mostly not small properties at all — they’re hotels whose listed website sits on a big shared platform (a chain domain, an OTA, a directory). For an ordinary independent on its own domain, 5–30 captured pages is a complete reading, not a shallow one. Which is why the real story below isn’t depth — it’s the 39% at zero.
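The headline shares can be re-derived from the six bucket counts above. A minimal sketch that uses only those published counts; the variable and function names are ours, for illustration:

```python
# Domain counts per captured-page bucket, as published above.
BUCKETS = {
    "0": 42_575,
    "1-4": 20_749,
    "5-19": 26_096,
    "20-99": 15_267,
    "100-499": 2_802,
    "500+": 620,
}

total = sum(BUCKETS.values())
absent = BUCKETS["0"]
shallow = BUCKETS["1-4"]
deep = total - absent - shallow            # every bucket from 5 pages up
hundred_plus = BUCKETS["100-499"] + BUCKETS["500+"]


def pct(part, whole=total):
    """Share of `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


print("domains checked:", total)
print("absent %:", pct(absent))
print("shallow %:", pct(shallow))
print("deep %:", pct(deep))
print("in crawl %:", pct(total - absent))
print("100+ pages:", hundred_plus, f"({pct(hundred_plus)}%)")
```

The buckets sum to the 108,109 domains checked, and the absent, shallow and deep shares come out as the 39.4%, 19.2% and 41.4% quoted in the text.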

“Chains” underperform — but it’s really one budget group

The headline split looks backwards: independent hotels are in Common Crawl far more often than chain properties. But that average hides almost everything interesting, so we broke it down by brand.

Segment	In crawl	Domains (n)
Independents	61.0%	105,483
All chains	45.9%	2,735

Coverage by brand

Brand	In crawl
Hilton	81.5%
Best Western	78.8%
Accor	71.4%
IHG	65.3%
Marriott	63.9%
Wyndham	57.5%
Choice	44.2%
Brit Hotel	14.4%
Louvre Hotels	11.8%
WorldHotels	7.5%

Brands with ≥15 domains in the run. The overall line is 60.6%.

The marquee global brands are fine: Hilton (81.5%), Best Western (78.8%), Accor (71.4%) and Marriott (63.9%) all beat the 60.6% overall line. The chain average is dragged down almost entirely by one budget group, Louvre Hotels (Campanile, Kyriad, Première Classe), whose 779 properties often sit on templated per-location microsites (lille-est-hem.kyriad.com) and land at just 11.8% coverage. The pattern is consistent with low-link, cookie-cutter microsites. Drop that one group and chains rise to 59.4%, right in line with everyone else: a domain-architecture effect as much as a brand one.
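The "drop one group" figure can be approximated from the rounded numbers quoted above. A rough sketch, assuming Louvre Hotels' 779 properties correspond to 779 of the 2,735 chain domains (the text does not state this). Because the inputs are rounded percentages, the result lands within about a tenth of a point of the published 59.4% and does not reproduce it exactly:

```python
# Published, rounded inputs from the text above.
chain_domains = 2_735
chain_rate = 0.459        # all chains in crawl
louvre_domains = 779      # ASSUMPTION: one domain per Louvre Hotels property
louvre_rate = 0.118       # Louvre Hotels in crawl

# Estimated number of captured domains in each group.
chains_in_crawl = chain_rate * chain_domains
louvre_in_crawl = louvre_rate * louvre_domains

# Coverage of the remaining chains once Louvre Hotels is removed
# from both the numerator and the denominator.
rate_without_louvre = (chains_in_crawl - louvre_in_crawl) / (
    chain_domains - louvre_domains
)

print(f"chains without Louvre Hotels: {rate_without_louvre:.1%}")
```

One group of 779 low-coverage domains accounts for more than a quarter of the chain sample, which is why removing it moves the average by over 13 points.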

What absence actually looks like

Six real rows from the run. The two at the top are recognisable independents with their own sites — and they’re simply not in the crawl. The middle two are budget-chain microsites at zero. The last two are small hotels captured at a page or two.

Hotel	Location / type	Domain	Pages in crawl
Hotel Chapter Roma	Rome · design hotel	chapter-roma.com	0
Maison Tremé	New Orleans · boutique	maisontreme.com	0
Kyriad Lille Est Hem	Louvre Hotels microsite	lille-est-hem.kyriad.com	0
Campanile Brive-la-Gaillarde	Louvre Hotels microsite	brive-la-gaillarde-ouest.campanile.com	0
Hôtel L’Aubergade	Gérardmer · independent	laubergade-gerardmer.fr	4
Hotel Villa dei Mosaici	Spello · independent	hotelvilladeimosaicispello.it	1

A reviewed, bookable hotel whose official site is at 0 is invisible through that site to any model learning from this archive — it can only be reached live, if the engine runs a search.

By country and TLD

Coverage varies sharply by market. Among the largest markets in the dataset:

Country	In crawl
Germany	69.3%
France	64.6%
Netherlands	58.8%
Spain	58.6%
Italy	58.5%
United States	57.4%
United Kingdom	53.6%

One distinction first, to avoid a false contradiction: country is the hotel’s location; TLD is the website’s domain. Spain reads 58.6% by country yet 37% on the .es TLD, because many Spanish hotels sit on .com — the two measure different things.

With that in mind, the TLD tells a sharper story about how the open-web training layer may under-represent non-English markets. Local European TLDs are present but shallow: a .de hotel is well-crawled (71%) but at only ~39 pages on average, where a .com hotel averages ~109. And .es is the clear laggard at 37% — Spanish hotels on .com do fine, but the .es TLD itself is poorly crawled.

TLD	In crawl	Avg pages (when present)
.de	71.4%	39
.fr	64.8%	42
.com	60%	109
.it	59.5%	35
.nl	59.1%	56
.co.uk	54.3%	59
.es	37%	87

The crawl knows local-market hotels — but thinly. This looks like one mechanism behind the English-leaning tilt we see in live AI answers — showing up a layer earlier, in the training data itself.

Why some hotels are missing

Absence isn’t random. Three things keep a hotel out of the crawl:

• Rendering. If your site only assembles its content after JavaScript runs, the crawler often captures an empty shell. Static HTML gets read; SPA booking widgets frequently don’t.
• Access. A CDN or firewall that blocks bots (often a default no one chose) turns the crawler away before it reaches a page. This is invisible in robots.txt, which is why our AI-blocking study’s ~3% figure is a floor.
• Connectivity, the main lever. Common Crawl doesn’t crawl the web evenly. It decides which domains to capture, and how deeply, by Harmonic Centrality, a link-graph score for how well-connected a domain is. High-scoring domains get crawled often and deep; low-scoring long-tail sites get crawled rarely or not at all (documented in Mozilla’s “Training Data for the Price of a Sandwich”). For a hotel this is usually decisive: a property with few inbound links scores low and gets skipped, however good the site is. Metehan Yeşilyurt’s work on Common Crawl rank lays this out, and you can look up your domain’s Harmonic Centrality and PageRank at webgraph.metehan.ai.
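To make the connectivity point concrete, here is a toy sketch of harmonic centrality on a tiny link graph. It uses the standard definition (the sum of 1/distance over every other node that can reach the target by following links) and is not Common Crawl's actual pipeline, which runs on a web-scale graph. All domain names below are invented for illustration:

```python
from collections import deque


def harmonic_centrality(edges, target):
    """Sum of 1/d(u, target) over all nodes u that can reach `target`.

    `edges` is a list of (source, destination) links. Distances are
    shortest-path hop counts, found with a breadth-first search that
    walks the links backwards from the target.
    """
    inbound = {}
    for src, dst in edges:
        inbound.setdefault(dst, set()).add(src)

    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for src in inbound.get(node, ()):
            if src not in dist:
                dist[src] = dist[node] + 1
                queue.append(src)

    return sum(1.0 / d for node, d in dist.items() if node != target)


# Hypothetical link graph: who links to whom.
EDGES = [
    ("travel-guide.example", "hotel-a.example"),
    ("city-portal.example", "hotel-a.example"),
    ("news-site.example", "travel-guide.example"),
    ("city-portal.example", "hotel-b.example"),
]

for hotel in ("hotel-a.example", "hotel-b.example", "hotel-c.example"):
    print(hotel, harmonic_centrality(EDGES, hotel))
```

In this toy graph the hotel with two direct inbound links (one of them itself linked from elsewhere) scores highest, the hotel with a single link scores lower, and the hotel nobody links to scores zero, which mirrors the "few inbound links, gets skipped" pattern described above.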

What to do: the fixes are the cheap, structural ones — serve real HTML (not JS-only), make sure no CDN/WAF setting is quietly blocking AI crawlers, and earn a few quality links so you’re worth crawling. Check your own site in seconds with the Common Crawl checker. For hotels, this is the secondary lever — most AI hotel answers are built from live retrieval, not trained memory (see the two-layers guide) — but it costs nothing to not wall yourself out of it.

A 5-minute self-check

• Fetch your homepage and a room page as plain text (curl, or the checker above) — is the content there before any JavaScript runs?
• Scan robots.txt and your CDN/WAF logs for blocked crawler user-agents — including CCBot and the AI crawlers.
• Keep a clean sitemap submitted to search engines (Common Crawl discovery mostly follows links, but it doesn’t hurt).
• Make sure room, location and content pages are internally linked — not reachable only through the booking widget.
• Don’t trap your core content behind a JavaScript booking iframe.
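The first two checks in the list above can be scripted with the Python standard library. This is a minimal sketch under stated assumptions: you paste in the raw HTML and robots.txt yourself (for example from curl), the phrases to look for are your own, and the helper names are hypothetical. It only reads robots.txt, so it cannot see a CDN or WAF block.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _VisibleText(HTMLParser):
    """Collect text that is in the HTML itself, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def missing_phrases(raw_html, phrases):
    """Phrases you expect on the page that are NOT in the pre-JavaScript text."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [p for p in phrases if p.lower() not in text]


def blocked_agents(robots_txt, page_url, agents=("CCBot", "GPTBot")):
    """User-agents that robots.txt forbids from fetching `page_url`.

    CCBot is Common Crawl's crawler; GPTBot is included only as an
    example of "the AI crawlers". Extend the tuple as needed.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [a for a in agents if not rules.can_fetch(a, page_url)]


# Illustrative inputs (invented).
STATIC_PAGE = "<html><body><h1>Hotel Example</h1><p>Deluxe double room with balcony</p></body></html>"
JS_SHELL = '<html><body><div id="app"></div><script>render("Deluxe double room")</script></body></html>'
ROBOTS = "User-agent: CCBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(missing_phrases(STATIC_PAGE, ["Deluxe double room"]))
print(missing_phrases(JS_SHELL, ["Deluxe double room"]))
print(blocked_agents(ROBOTS, "https://hotel.example/rooms"))
```

An empty list from `missing_phrases` means the content is readable without JavaScript; a non-empty list from `blocked_agents` names the crawlers that robots.txt turns away.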

Method & limits

We took every hotel in our 200K-property index with its own website, a Google Place ID, and at least 10 reviews: 142,405 hotels, resolving to 108,109 distinct domains (some chains consolidate many properties under fewer domains). Junk (parked domains, closed properties, non-hotels) was filtered out. Each domain was checked against the columnar URL index of the May 2026 Common Crawl snapshot (CC-MAIN-2026-21), counting captured pages for the host and its www variant. Captured = 1+ page (what we count as “in Common Crawl”); deep capture = 5+ pages; shallow = 1–4; absent = 0.
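For a single domain, the same check can be approximated without the columnar index. The sketch below swaps in Common Crawl's public CDX index API at index.commoncrawl.org, which is not what the study used. It assumes the endpoint for this snapshot follows the usual `{crawl-id}-index` pattern, accepts a `url` wildcard with `output=json`, and returns one JSON record per line with a `url` field. It counts distinct URLs, which may differ from how the study counted pages, and all helper names are hypothetical.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

CRAWL_ID = "CC-MAIN-2026-21"  # May 2026 snapshot named in the method


def host_variants(host):
    """The bare host and its www variant, as in the method above."""
    host = host.lower()
    bare = host[4:] if host.startswith("www.") else host
    return [bare, f"www.{bare}"]


def cdx_url(host, crawl_id=CRAWL_ID):
    """Query URL for every captured path under `host` (assumed URL pattern)."""
    query = urlencode({"url": f"{host}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{query}"


def captured_urls(cdx_body):
    """Distinct URLs in a line-delimited JSON response body."""
    urls = set()
    for line in cdx_body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "url" in record:  # error or message bodies carry no url field
            urls.add(record["url"])
    return urls


def classify(pages):
    """Thresholds from the method: 0 absent, 1-4 shallow, 5+ deep."""
    if pages == 0:
        return "absent"
    return "shallow" if pages <= 4 else "deep"


def check_domain(host):
    """Network call. Large sites may be paginated; only the first response is read."""
    urls = set()
    for variant in host_variants(host):
        try:
            with urlopen(cdx_url(variant), timeout=30) as resp:
                urls |= captured_urls(resp.read().decode("utf-8"))
        except HTTPError:
            continue  # ASSUMPTION: an HTTP error here means nothing was captured
    return len(urls), classify(len(urls))
```

Calling `check_domain("your-hotel.example")` would return a page count and one of the three labels; for bulk checks across many domains, the columnar index the study used is the better fit.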

Limits. It’s one monthly snapshot — a site absent in May may appear next month. Content served on an unrelated sub-domain isn’t counted. “Absent” means not captured, which could be a block, JS-only rendering, low connectivity, or a new domain — we don’t attribute the cause per hotel. And presence in the crawl is necessary, not sufficient, for a model to actually surface you.

FAQ

Of 108,109 hotel website domains checked against the May 2026 snapshot, 60.6% are captured in Common Crawl and 39.4% are absent. Overall, 41.4% are captured deeply (5+ pages) and 19.2% shallow (1–4 pages). Figures are domain-level, not property-level. And to be upfront: that’s a slice of our own dataset, not every hotel alive — but at 108,109 domains it’s already a pretty big slice to draw a line through.


Check your own hotel

See whether your site is in Common Crawl — and how deeply — in a few seconds.

Run the Common Crawl checker