If AI answers vary run to run, isn't the score just noise?

The variance is real but structured, not random. In a study of 4,000 repeated queries, the same hotel held the top spot in 50.5% of identical reruns on average and up to 96.1% in constrained markets — far above chance. The wobble is governed by city, supply and phrasing, which means it is a signal you can read and move, provided you repeat each prompt several times and read the rate rather than a single run.

The prompts are written by software, not real guests — doesn't that invalidate it?

It is a fair limitation, not a fatal one. Nobody can watch private chats, so a prompt panel samples the intent space the way survey research samples a population. The panel is modelled on observed engine behaviour — location queries trigger a real search 98% of the time and fan out into four or five sub-queries — and is phrased from a guest's need and anchored to real destinations. Built that way, it measures questions shaped like the ones guests ask.

Can you actually link a change you made to the score moving?

Partly, and partly is the honest ceiling. The method is interventional: measure a baseline for two or three weeks, change exactly one thing, log the date, then watch the per-engine line — not a blend — over the following runs and check whether your server logs and branded search move in the same window. The engines refresh their sources constantly and much of what an answer cites sits on pages you do not control, so you get timing and correlation, rarely clean proof. But the visibility score is the closest observable to your action, so it is the best shot at attribution you have — then you triangulate forward to traffic and bookings.

How does a visibility score connect to actual revenue?

Through a short, observable chain: the score moves first, then ChatGPT-User and similar hits appear in your server logs, then AI referral and AI-generated branded search arrive, then bookings. Standard analytics undercounts the AI step badly because roughly 94% of LLM users return to Google before booking, so the conversion is credited to branded search. The score and the logs are often the only places the AI influence is visible — which makes the metric more valuable, not less.

Is there real proof bookings come from AI visibility?

From one controlled case, yes. Hotel Ranque was built with AI visibility as its only acquisition channel — no OTAs, no ad spend. The visibility score climbed first, from a long-tail Perplexity appearance to generic Paris placements, and bookings followed the same curve. On the booking form's 'how did you hear about us?' question, 21 of 52 guests said AI search — against 27 for Google. One property is one data point, but it is a clean one: AI visibility in, real bookings out.

Measurement Guide · June 2026

Does an AI Visibility Score Actually Mean Anything?

A fair objection is going around: a visibility score is not a booking, so why pay to watch a number rise while your rooms fill the same old way? I run these scores on hotels I built myself, so let me make the honest case for what they are worth — and concede, up front, what they are not.

A thermometer is not a fever either, and nobody throws out the thermometer. I half-agree with the critique, so let me reframe it around the question that actually settles things: can you connect something you do to something you measure? My answer, earned on hotels I built myself, is yes — a bit. And “a bit” is a great deal more than the “not at all” most hotel marketing quietly runs on.

21 / 52: guests at a from-scratch hotel who named AI search, nearly level with Google

50.5%: average top-spot stability; structure, not dice

+62%: overnight jump in AI referral sessions when links went live

Let’s grant the strong version first

The sharpest critique of AI visibility tooling is also the correct one, so I will say it plainly before defending anything. A visibility score is not revenue. More mentions in more answers is not the same as more rooms sold. A dashboard can glow green all year while a hotel fills exactly the way it always did, and if the only thing the tool ever sells you is the measuring, then yes — you are paying a subscription to watch a number.

I have built tools in this space and seen the vanity version up close, so I will not pretend to disagree with most of that. I am here to argue something narrower and more useful: the score is a leading indicator you can sometimes tie back to your own actions, and that partial link is worth more than the certainty nobody in this field actually has. The real question is not “is the score a booking?” — plainly it is not — but “can you connect what you do to what you measure?” The rest of this guide is my honest answer: a bit, and here is exactly how much.

The test for any metric is not “is it the goal?” — almost nothing you track day to day is the goal. The test is “does it move before the goal does, and does moving it move the goal?” For AI visibility I think the answer is yes on both, and I can show the workings.

What the number actually is

A visibility score, done properly, is not a single magic figure. It is a measured rate: across a fixed set of the questions guests ask in your market, how often each engine names you, how often it cites your own site, and which other sources appear alongside you. Run that every week and you are watching one thing: whether the models that increasingly sit between a traveller and a booking page can currently retrieve and recommend you.

That is worth measuring for the same reason occupancy pace is worth measuring before the guests arrive: it is the early shape of demand. Nobody confuses pace with revenue, but nobody flies blind on it either. The score plays the same role one retrieval layer earlier.

Mention: are you named in the answer a guest reads

Citation: is your site linked, and at what position

Company: who is cited with you, the OTAs and review sites beside you

None of those three is a sale. All three are observable, repeatable, and — this is the part the sceptics tend to skip — they move when you change your inputs. A number that responds to your actions is the definition of a lever, even if it is not the prize.
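As a minimal sketch of how the three measures above could be turned into per-engine rates, assume each panel answer has already been graded into a small record. The `Answer` shape, the field names and the sample week are illustrative assumptions, not the format of any particular tool, and citation position is left out for brevity.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Answer:
    """One engine's answer to one panel prompt (hypothetical record shape)."""
    engine: str
    mentioned: bool       # hotel named in the answer text
    cited: bool           # hotel's own site linked as a source
    co_cited: tuple = ()  # other domains cited in the same answer


def visibility_rates(answers):
    """Return mention and citation rates per engine, never blended."""
    totals = defaultdict(lambda: {"n": 0, "mentions": 0, "citations": 0})
    for a in answers:
        t = totals[a.engine]
        t["n"] += 1
        t["mentions"] += a.mentioned
        t["citations"] += a.cited
    return {
        engine: {
            "mention_rate": t["mentions"] / t["n"],
            "citation_rate": t["citations"] / t["n"],
        }
        for engine, t in totals.items()
    }


def company(answers):
    """Count the other domains cited in answers that mention the hotel."""
    return Counter(d for a in answers if a.mentioned for d in a.co_cited)


# Made-up sample week, for illustration only.
week = [
    Answer("chatgpt", True, True, ("booking.com",)),
    Answer("chatgpt", True, False, ("tripadvisor.com",)),
    Answer("chatgpt", False, False),
    Answer("gemini", False, False, ("expedia.com",)),
    Answer("gemini", True, False),
]
rates = visibility_rates(week)
neighbours = company(week)
```

Keeping the result keyed by engine mirrors the point that each engine is a separate line to watch; averaging the engines together would be a later, optional step.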

Is the number real, or am I scoring noise?

Here is the objection that would actually sink the whole enterprise: if AI answers are random, then any score built on them is a coin-flip dressed up as a chart, and optimising it is superstition. That would be fatal. It also happens to be testable, so I tested it.

In my rankings-consistency study — 4,000 repeated queries, 6,249 hotel mentions across eight cities — the same hotel held the top spot in 50.5% of identical reruns on average, and in the most constrained markets that climbed to 96.1%. A coin flip does not do that. The variance is real, but it is governed by the city, the supply, and how the question is phrased, which means it is a signal you can read rather than static you cannot.
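One plausible way to compute a top-spot stability figure of this kind is sketched below. It assumes each rerun of an identical prompt yields an ordered list of hotel names, and it defines stability as the share of reruns won by the most frequent number-one hotel. That definition is an assumption for illustration; the study's exact scoring may differ.

```python
from collections import Counter


def top_spot_stability(reruns):
    """Share of identical reruns in which the most frequent #1 hotel was #1.

    reruns: list of rankings, each an ordered list of hotel names.
    """
    tops = [ranking[0] for ranking in reruns if ranking]
    if not tops:
        raise ValueError("no non-empty rankings to score")
    hotel, count = Counter(tops).most_common(1)[0]
    return hotel, count / len(tops)


# Four made-up reruns of the same prompt.
reruns = [
    ["Hotel A", "Hotel B"],
    ["Hotel A", "Hotel C"],
    ["Hotel B", "Hotel A"],
    ["Hotel A", "Hotel B"],
]
leader, stability = top_spot_stability(reruns)
```

With several candidate hotels, a leader chosen at random on each rerun would rarely repeat this often, which is the comparison the stability rate is meant to support.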

And the pattern is not a hotel quirk. Run the same method on Amsterdam bike shops — a niche I have no stake in — and the cross-engine ordering holds: one shop appears in all five engines, others in two or three, every week. Across the wider landscape work (19,579 prompts, six models) each engine even has a stable sourcing personality — Grok leans almost entirely on TripAdvisor, Gemini on the big OTAs. If this were noise, none of those regularities would survive repetition. They do.

A stable baseline is what makes movement detectable. The reason the consistency result matters is not academic: it is the permission slip for everything downstream. If the score holds still when you do nothing and shifts when you act, then watching it is not watching a coin flip; it is watching the lever you just pulled.

“But those are the software’s questions, not a guest’s”

The strongest methodological complaint is that nobody measuring AI visibility can watch real travellers talk to real assistants — those chats are private — so the tools generate prompts themselves and grade you against questions a machine wrote. The premise is true. The conclusion does not follow, and here is why.

No survey researcher gets to poll the entire population either; they sample it, carefully, and the sample is informative precisely because it is built to mirror the real distribution. A prompt panel is the same move. The questions are not invented from thin air — they are modelled on how the engines themselves decompose hotel intent. My anatomy of ChatGPT hotel search shows location queries trigger a real web search 98% of the time (against 8% for definitional ones) and then fan out into four or five sub-queries. Build the panel around that observed behaviour and you are sampling the intent space, not making it up.

There is a second guard against fooling yourself: phrase prompts as a guest’s need (“I’m a cyclist spending a week in Paris—where should I stay?”), never as your own amenity list read back to you. The how-to lives in the prompt-tracking guide; the point here is that a disciplined panel measures questions shaped like the ones guests ask, in the destinations you actually compete in.

That last clause answers the other common jab: that these tools were built for category-first software shopping and bolted onto hotels, which are searched geography-first. A hotel panel in which every prompt is anchored to a place (“best boutique hotel near Bastille”) is geography-first by construction. The tool inherits the flaw only if you build it lazily.
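A panel built this way can be sketched as need templates crossed with destinations. The templates and place names below are illustrative assumptions, loosely adapted from the two example prompts quoted above; a real panel would be larger and modelled on observed engine behaviour.

```python
from itertools import product

# Guest needs, phrased as a traveller would ask them (hypothetical templates).
NEEDS = [
    "I'm a cyclist spending a week in {place}. Where should I stay?",
    "What is the best boutique hotel near {place}?",
]
# Destinations the hotel actually competes in (illustrative).
PLACES = ["Paris", "Bastille, Paris"]


def build_panel(needs, places):
    """Cross every need with every place so each prompt names a destination."""
    for need in needs:
        if "{place}" not in need:
            # Refuse category-first prompts that ignore geography.
            raise ValueError(f"template has no place slot: {need!r}")
    return [need.format(place=place) for need, place in product(needs, places)]


panel = build_panel(NEEDS, PLACES)
```

Rejecting templates without a place slot is one simple way to keep the panel geography-first rather than relying on discipline alone.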

The real question

Can you connect an action to the number? A bit — and a bit is the honest ceiling

Strip the debate down and it is about attribution: if I change something on my side, does it show up anywhere I can see? With AI search you will not get a clean lab result, and anyone selling you one is selling. What you can get is disciplined evidence — and a stable, weekly, per-engine score is the instrument that makes that evidence possible at all, because you cannot detect a change against a number that is pure noise. The consistency result above is what earns you the right to even attempt this.

The method I run is interventional, and deliberately boring:

1. Baseline before you touch anything
   Track for two or three weeks first, so you know the natural week-to-week wobble. Movement only means something against a line you have already watched sit still.
2. Change exactly one thing, and log the date
   New structured page, schema fix, a review push, an llms.txt, a placement on a third-party site the engines trust. One lever at a time, timestamped, or you will never know which one moved the line.
3. Watch the per-engine line, not the blend
   Look at the engine you actually targeted. A site change that wins ChatGPT may do nothing on Gemini, because they ground on different sources; a blended score would average that real result into mush.
4. Confirm downstream, in the same window
   Did the ChatGPT-User hits in your server logs and the branded search in analytics move in the weeks after? Two independent signals agreeing is as close to proof as this gets.
5. Claim timing, not cause
   Report “I changed X on this date and the line moved over the next runs,” never “X caused the bookings.” Show the sequence and let it persuade. Overclaiming is exactly the sin the sceptics are right about.
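The steps above can be sketched as a small before-and-after read of one engine's weekly line. The function name, the data shape and the sample figures are assumptions for illustration. The output deliberately reports where post-change runs sit relative to the baseline wobble, not a causal verdict.

```python
from datetime import date
from statistics import mean


def read_intervention(series, change_date, min_baseline=2):
    """Compare post-change weekly rates with the baseline range for ONE engine.

    series: list of (run_date, rate) pairs for a single engine.
    """
    before = [rate for day, rate in series if day < change_date]
    after = [rate for day, rate in series if day >= change_date]
    if len(before) < min_baseline or not after:
        raise ValueError("need a baseline and at least one post-change run")
    low, high = min(before), max(before)
    return {
        "baseline_range": (low, high),     # the natural wobble
        "baseline_mean": mean(before),
        "post_mean": mean(after),
        "runs_above_baseline": sum(rate > high for rate in after),
        "post_runs": len(after),
    }


# Made-up weekly mention rates for one targeted engine.
chatgpt_line = [
    (date(2026, 3, 2), 0.20),
    (date(2026, 3, 9), 0.24),
    (date(2026, 3, 16), 0.22),
    (date(2026, 3, 23), 0.31),
    (date(2026, 3, 30), 0.35),
]
result = read_intervention(chatgpt_line, change_date=date(2026, 3, 17))
```

Running the same function separately on each engine's series, rather than on an averaged series, keeps a win on one engine from being diluted by a flat line on another.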

Why I say “a bit” and not “yes”. The engines refresh which sources they cite every week, and much of what an answer leans on lives on pages you influence but do not control. So you are reading timing and correlation, rarely a clean cause. That is a real limit, and pretending otherwise is how this whole category lost trust. “A bit” is the honest word — and a bit of real evidence beats a lot of confident hand-waving.

Here is the underrated part, though: the score is the closest observable to your action. Bookings sit at the far end of a chain stuffed with confounds — price, season, a heatwave, the competitor down the road who just refurbished. The visibility score sits one step from the lever you pulled, on a weekly clock, split by engine. If you want any shot at attributing what you did, that is where you have it — and from there you triangulate forward to the money.

And the number ties forward to money

Linking your action to the score is half the job; the other half is the score linking forward to revenue. For AI visibility that onward chain is unusually short, and every link is observable rather than inferred:

The visibility score

A weekly mention-and-citation rate across the engines, per destination. It is the earliest thing that moves — it changes the week the model starts retrieving you, long before any of it shows up in your accounting.

Server logs

When the score climbs, the ChatGPT-User, Claude-User and PerplexityBot hits in your raw access logs climb with it. This is the only place you see the model actually fetch your page — no sampling, no estimate.

Referral + branded traffic

Then the sessions arrive: direct AI referrals, plus a fatter bucket of people who asked an assistant, got your name, and typed it into Google. The demand was created by AI; your analytics file it under branded search.

Bookings

And at the end, a booked room with a name attached. The guest arrives with the recommendation already made and the trust already lent — which is why this traffic tends to convert harder than a cold OTA click.

Each link is something I have watched in the data. When ChatGPT began embedding hotel links on 7 May 2026, AI referral sessions across a 17,000-hotel panel jumped 62% overnight and then doubled week over week (+102%) while organic crept up 7% — net-new traffic, not reshuffled, as the direct-traffic study lays out. And that AI referral even keeps human hours: it peaks on Sundays, exactly when people plan trips. Noise does not keep a calendar.

The twist that makes the score more valuable, not less: standard analytics undercounts AI badly. As I document in how to measure AI hotel traffic, roughly 94% of LLM users circle back to Google before booking, so the conversion click lands on branded search and the AI step vanishes from the report. Server logs and the visibility score are often the only places the influence is visible at all. A metric that sees what your funnel cannot is the opposite of useless.
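For the server-log link in that chain, a minimal sketch of a weekly count is shown below. It assumes access logs in the common Combined Log Format and matches the user-agent tokens named earlier as plain substrings. The sample lines and their full user-agent strings are invented for illustration, so check them against what your own logs contain.

```python
import re
from collections import Counter
from datetime import datetime

# Tokens taken from the text above; matched as substrings of the user agent.
AI_AGENTS = ("ChatGPT-User", "Claude-User", "PerplexityBot")

# Combined Log Format: ip - - [timestamp] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')


def weekly_ai_hits(lines):
    """Count hits per (ISO week, agent token) in raw access-log lines."""
    hits = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        iso = ts.isocalendar()
        week = f"{iso[0]}-W{iso[1]:02d}"
        for agent in AI_AGENTS:
            if agent in match["ua"]:
                hits[(week, agent)] += 1
    return hits


# Invented sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [10/May/2026:14:03:11 +0000] "GET /rooms HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '203.0.113.6 - - [10/May/2026:14:09:40 +0000] "GET / HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '198.51.100.7 - - [10/May/2026:15:00:00 +0000] "GET / HTTP/1.1" 200 812 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '192.0.2.9 - - [11/May/2026:08:30:00 +0000] "GET /rooms HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]
hits = weekly_ai_hits(sample)
```

Grouping by ISO week puts the log counts on the same weekly clock as the score, so the two lines can be compared over the same window.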

The strongest evidence I have: a hotel built on the score

Arguments about leading indicators get hand-wavy fast, so I ran the cleanest experiment I could think of. I built a hotel’s demand from nothing — Hotel Ranque, no OTA contracts, no ad budget — using AI visibility as the only acquisition channel, and watched whether moving the score moved the bookings. (The name is a pun on rank. I am not above that.)

The score moved first, in the order you would predict if it were a real signal: a first Perplexity appearance around week four on a long-tail query, consistent presence across ChatGPT, Gemini and Perplexity by week twelve, generic “best boutique hotel in Paris” placements past week twenty. The bookings followed the curve, not the other way around — two or three enquiries a week early on, then eight to twelve, then more than the rooms could hold, which is when I stopped the experiment.

Then the part that turns “visibility” back into “bookings.” The booking form asked one question — how did you hear about us? — and of 52 answers, 21 said AI search, against 27 for Google and four for everything else. A channel that did not exist when I started the experiment came within a whisker of Google, on a hotel that had no other way of being found.

Figure: guest survey, “How did you hear about us?”, 52 answers: Google Search 27, AI Search 21, Other 4.

Hotel Ranque’s “how did you hear about us?” question, straight from the booking flow: **21 of 52 guests credited AI search**, nearly level with Google. Self-reported, so if anything it undercounts: many guests never know an assistant put the name in their head.

And that is the floor, not the ceiling. The “Google” bucket hides demand the assistants created — guests who asked ChatGPT, got the name, then typed it into Google to book. The hidden AI effect the attribution work keeps surfacing sits inside that 27.
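To keep the 52-answer sample in proportion, the shares can be worked out directly from the three counts reported above. The Wilson interval in the sketch is an addition for illustration, not something the survey reported; it uses the standard formula with z = 1.96 for a roughly 95% range, and it describes only sampling uncertainty, not the hidden AI demand inside the Google bucket.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts as reported from the booking-form question.
answers = {"Google Search": 27, "AI Search": 21, "Other": 4}
total = sum(answers.values())
shares = {source: count / total for source, count in answers.items()}
low, high = wilson_interval(answers["AI Search"], total)

for source, share in shares.items():
    print(f"{source}: {share:.1%}")
print(f"AI Search interval: {low:.1%} to {high:.1%}")
```

A range this wide is what one property's worth of answers can support, which is consistent with treating the result as a single clean data point.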

One property is one data point, and I would not generalise a whole industry from it. But it is a clean one: where AI visibility was close to the only input, real bookings came out, the score predicted them, and the guests said so on the booking form. You cannot get that result if the score is measuring nothing.

Telling a useful score from a vanity one

The critique lands hard on bad implementations, and it should. So here is the line I draw between a score worth paying attention to and a billboard for watching a number climb.

A score worth keeping

Anchored to the destinations you actually sell in.
Read alongside server logs and referral traffic, never alone.
Repeated weekly so a change you made has somewhere to show up.
Split by engine, because the levers differ per engine.
Baselined before you act, so movement means something.

A score that is just a bill

One blended number with no link to traffic or revenue.
Prompts reverse-engineered from your own amenity list.
Category-first questions that ignore geography.
Checked once a quarter, long after the movement is gone.
“Make more content” as the only prescription.

Treat the score as a leading indicator and it earns its place: pull a lever, watch the line, then watch the logs and the branded search confirm it a few weeks later. Treat it as the destination and the sceptics are right about you. The metric is not useless; using it without the chain attached is.

FAQ

A visibility score is not a booking, and treating it as one is the mistake worth avoiding. It measures how often AI engines mention and cite your hotel for the questions guests ask in your market. It is a leading indicator of demand, not the demand itself, useful the way occupancy pace is useful before guests arrive. Its value comes entirely from chaining it to server logs, referral traffic, and bookings, not from the number alone.

Further reading

AI Rankings Consistency Study

How to Measure AI Hotel Traffic

The ChatGPT Direct-Traffic Explosion

Hotel Ranque: built on AI visibility
