Are API-Based AI Calls a Good Proxy for What Your Users Actually See?

Ryan Turner · Head of Growth

If you run a GEO program, you track your brand's place in AI answers through an API. Your customers do something different. They open the Gemini app and type. So the whole practice rests on one assumption few people have tested: that an API call returns the same answer a real person sees. We ran the test across 30 brand categories. The API matched the live app's brand ranking in 26 of 30 categories (87%), and the disagreement that remained was about as large as the app's disagreement with itself.

Why is GEO measurement built on a barely tested assumption?

Most brand-visibility tracking in AI answers runs through APIs, yet most real queries happen inside consumer apps, and few people have measured the gap between the two. That gap matters because a GEO dashboard reports a number (share of voice, a rank, a winner call) and a brand acts on it. If the API systematically saw a different answer than customers do, every report built on it would be quietly wrong.

The cause is structural, not laziness. Driving a real browser session for thousands of prompts is slow and brittle. APIs are fast, repeatable, and cheap, so that's what tracking tools use. The question was never whether APIs are convenient. It was whether convenience costs you accuracy.

Our framing: The right test isn't "does the API ever disagree with the app?" Generative models disagree with themselves on repeat runs. The honest test is whether the API disagrees with the app any more than the app already disagrees with itself.

How do you test whether an API matches the real app?

A generative model doesn't return one fixed answer, so the test has to account for that variance directly. We took one brand-ranking prompt, sent it through both channels across 30 categories, and ran it 3 times per channel. Each topic was a target brand plus its 4 nearest competitors, shuffled once and frozen so both channels received byte-identical text.

The prompt was deliberately plain: "order these [industry] companies in order from best to worst [five brands]. only respond with the 5 companies in the recommended order." One channel was Massive's /ai endpoint (model=gemini, country=us). The other was gemini.google.com, signed in, with the model picker set to 3.5 Flash to match the API and a fresh conversation per run.
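The frozen-prompt setup can be sketched in a few lines. This is a minimal illustration, not the harness used in the test: the function name `build_prompt`, the seed, the comma-joined brand list, and the placeholder brands are all assumptions. Only the prompt wording comes from the description above.

```python
import random

PROMPT_TEMPLATE = (
    "order these {industry} companies in order from best to worst {brands}. "
    "only respond with the 5 companies in the recommended order."
)

def build_prompt(industry: str, brands: list[str], seed: int = 0) -> str:
    """Shuffle the five brands once with a fixed seed and render the prompt."""
    if len(brands) != 5:
        raise ValueError("expected the target brand plus its 4 nearest competitors")
    shuffled = list(brands)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the order never changes
    return PROMPT_TEMPLATE.format(industry=industry, brands=", ".join(shuffled))

# Hypothetical topic: the brand names are placeholders, not from the test set.
brands = ["Brand A", "Brand B", "Brand C", "Brand D", "Brand E"]
prompt = build_prompt("automotive", brands)
frozen = prompt.encode("utf-8")  # the exact bytes to send to both channels
```

Seeding the shuffle means the order is decided once per topic and then reused, so neither channel can be favored by a different brand order in the prompt.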

The key measure is the noise floor. Each channel disagrees with itself across its own repeat runs, and that self-disagreement is the fair benchmark. You can't expect two channels to agree more than each one agrees with itself. So we measured both: cross-channel agreement, and each channel's within-channel agreement. Then we asked how close the first gets to the second.
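To make the comparison concrete, here is a rough sketch of both measurements. It uses the standard Kendall tau for rankings without ties, which runs from -1 to 1; the exact variant and scaling used in the experiment are not specified here, so treat that choice, the helper names, and the sample rankings as assumptions.

```python
from itertools import combinations, product

def kendall_tau(a: list[str], b: list[str]) -> float:
    """Kendall tau between two orderings of the same items (no ties)."""
    pos_b = {brand: i for i, brand in enumerate(b)}
    concordant = discordant = 0
    for x, y in combinations(a, 2):      # x precedes y in ranking a
        if pos_b[x] < pos_b[y]:
            concordant += 1              # b agrees on this pair
        else:
            discordant += 1              # b flips this pair
    return (concordant - discordant) / (concordant + discordant)

def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)

def within_channel(runs: list[list[str]]) -> float:
    """Noise floor: average agreement between a channel's own repeat runs."""
    return mean(kendall_tau(a, b) for a, b in combinations(runs, 2))

def cross_channel(api_runs: list[list[str]], app_runs: list[list[str]]) -> float:
    """Average agreement over every API-run / app-run pair."""
    return mean(kendall_tau(a, b) for a, b in product(api_runs, app_runs))

# Made-up rankings for one topic, three runs per channel.
api_runs = [list("ABCDE"), list("ABCED"), list("ABCDE")]
app_runs = [list("ABCDE"), list("BACDE"), list("ABCDE")]

floor = within_channel(app_runs)
cross = cross_channel(api_runs, app_runs)
print(f"cross-channel {cross:.2f} vs noise floor {floor:.2f}")
```

With three runs per channel, each topic yields three within-channel pairs per channel and nine cross-channel pairs.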

We analyzed each topic as a single observation rather than pooling all 270 run-pairs, because those pairs cluster into just 30 independent topics. Pooling them inflates the apparent precision. Reporting at the topic level keeps the confidence intervals honest, even though it makes the numbers look less impressive.
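A topic-level bootstrap can be sketched as below. The idea is to resample whole topics, never individual run-pairs, so the interval reflects the number of independent topics. The retention statistic used here (summed cross-channel agreement divided by summed within-channel agreement) is an assumption, as are the function name and the sample values; the experiment's exact statistic is not spelled out.

```python
import random

def bootstrap_retention(topics, n_resamples=2000, seed=0):
    """Percentile bootstrap over topics for cross-channel / within-channel agreement.

    topics: list of (cross_tau, floor_tau) tuples, one tuple per topic.
    Returns (point_estimate, ci_low, ci_high) for a 95% interval.
    """
    def retention(sample):
        cross = sum(c for c, _ in sample)
        floor = sum(f for _, f in sample)
        return cross / floor

    rng = random.Random(seed)
    stats = sorted(
        retention(rng.choices(topics, k=len(topics)))  # resample whole topics
        for _ in range(n_resamples)
    )
    low = stats[int(0.025 * n_resamples)]
    high = stats[int(0.975 * n_resamples) - 1]
    return retention(topics), low, high

# Made-up per-topic values, for illustration only.
topics = [(0.8, 0.9), (1.0, 1.0), (0.6, 0.8), (0.9, 0.9), (0.7, 0.8), (1.0, 0.9)]
point, low, high = bootstrap_retention(topics)
print(f"retention {point:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Resampling run-pairs instead would treat 270 correlated pairs as independent and produce a narrower interval than the data supports.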

Does the API agree with the live Gemini app?

Yes. The API and the live app produced statistically indistinguishable rankings on 26 of 30 topics (87%) and named the same top brand on 26 of 30 as well (Massive experiment, 2026). The mean cross-channel agreement was a Kendall tau of 0.79, against a within-channel floor of 0.82. Measured per topic, the API retained 93% of the agreement each channel manages with itself (95% confidence interval 86% to 98%).

Figure: Agreement on brand rankings (Kendall tau, 0 to 1; higher is closer). API vs live Gemini app: 0.79. Live app vs itself (noise floor): 0.82. The cross-channel bar nearly reaches the channel's own noise floor.
Source: Massive Computing, API vs live Gemini experiment, 2026.

According to the 2026 Massive experiment, an API querying Gemini matched the live consumer app on the same brand ranking for 26 of 30 categories, and the average gap in agreement sat within seven percentage points of the app's own run-to-run noise. For brand-visibility and winner-call reporting, that's a measurement that tracks what customers see.

Where your brand actually lands

This is the number a GEO program actually reports, and it barely moved between channels. On 13 of 30 topics the target brand landed in the exact same rank on both the API and the live app. Across all 30 targets, the average shift was 0.32 of a position, and only one brand, Kraft Heinz, moved more than a single place.

Figure: How far the target brand moved (30 categories). Identical rank: 13. Drifted under one position: 16. Moved more than a position: 1. Mean absolute drift across all 30 targets: 0.32 positions.
Source: Massive Computing, API vs live Gemini experiment, 2026.

A third of a position is well inside the swap-an-adjacent-pair noise both channels show on their own. If your dashboard says your brand sits second in a category, a customer opening the app is very likely to see it second too. The reporting holds where it counts.
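One way to compute a per-topic drift figure like this is sketched below. The definition used here, the absolute difference between the target's average position on each channel, is an assumption, as are the helper names and the sample runs; the experiment may have defined drift differently.

```python
def mean_rank(runs: list[list[str]], target: str) -> float:
    """Average 1-based position of `target` across a channel's runs."""
    return sum(run.index(target) + 1 for run in runs) / len(runs)

def target_drift(api_runs: list[list[str]], app_runs: list[list[str]], target: str) -> float:
    """Absolute gap between the target's average position on each channel."""
    return abs(mean_rank(api_runs, target) - mean_rank(app_runs, target))

# Made-up runs for one topic; "A" stands in for the target brand.
api_runs = [list("ABCDE"), list("BACDE"), list("ABCDE")]
app_runs = [list("ABCDE"), list("ABCDE"), list("ABCDE")]

drift = target_drift(api_runs, app_runs, "A")
print(f"target drift: {drift:.2f} positions")
```

Averaging this value over all topics gives a single mean-absolute-drift number for the whole study.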

Stress-testing the two worst outliers at ten runs

Four topics looked "off" at three runs, so we re-ran the two worst at ten runs per channel, and most of the gap turned out to be sampling noise. Tesla, which looked like a real winner-call gap at n=3, became identical at n=10: Tesla ranked first in 8 of 10 runs on both channels, with a cross-channel agreement of 0.87 against a 0.86 floor. The gap was a false alarm.

Kraft Heinz shrank but survived. Its excess disagreement fell from +0.33 at three runs to +0.12 at ten, and the target-rank gap dropped from 1.85 to 1.0 positions (Massive experiment, 2026). Both channels are simply noisy on food and CPG, and both still named Nestlé the category winner in 9 of 10 runs. It's a mild, real, category-specific wobble, not a channel bias.

The 2026 Massive deep-run test showed that a three-run "offset" is a lead to check, not a finding: of the four flagged topics, we deep-ran the two worst, and deeper sampling erased one outright and cut the other by about two thirds. The practical read is that the API proxy is even better than the three-run headline implied. Food and CPG is an intrinsically noisy category on both channels, not a biased one.

Our finding: Treat any single-run or three-run discrepancy as a candidate, never a verdict. Use at least three runs, report the consensus, and deep-run anything that looks systematic before you act on it.
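Reporting a consensus can be as simple as ordering brands by their average position across runs. The sketch below does that; the averaging rule, the alphabetical tie-break, and the sample runs are assumptions, since no specific consensus method is prescribed here.

```python
from collections import defaultdict

def consensus_ranking(runs: list[list[str]]) -> list[str]:
    """Order brands by summed position across runs (lower is better).

    Every run must rank the same set of brands, so summed position and
    average position give the same order.
    """
    totals = defaultdict(int)
    for run in runs:
        for position, brand in enumerate(run, start=1):
            totals[brand] += position
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(totals, key=lambda brand: (totals[brand], brand))

# Made-up runs: two adjacent-pair swaps across three pulls.
runs = [list("ABCDE"), list("BACDE"), list("ABCED")]

consensus = consensus_ranking(runs)
winner = consensus[0]
print(consensus, winner)
```

A single pull could have reported "B" as the winner here; the three-run consensus does not.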

When should you still spot-check the live web?

The API is a faithful proxy in aggregate, but four specific situations still warrant a manual look at the real app. In our data, the residual gaps clustered in predictable places, so you can target your spot-checks instead of distrusting everything.

Check the live app when:

  • The category is food or CPG, or the winner is contested. Kraft Heinz, Mars, and Tesla all drew their disagreement from these noisier corners.
  • Account context matters. The API runs an anonymous Flash session. Signed-in personalization or grounding can shift what a logged-in user sees, and an anonymous API call can't replicate that.
  • An exact full ordering feeds a weighted score. Only half of cross-channel pairs matched the complete five-brand order, since the tail positions swap most. Top-of-list and winner calls are far more stable than the full sequence.
  • You only have one run. Both channels swap an adjacent pair often. Use three runs or more and report the consensus, never a single pull.
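The gap between full-order stability and winner stability is easy to measure on your own data. A minimal sketch, assuming each run is a list of brands in ranked order; the function name and sample runs are illustrative, not taken from the experiment.

```python
from itertools import product

def match_rates(api_runs: list[list[str]], app_runs: list[list[str]]) -> tuple[float, float]:
    """Share of cross-channel run pairs that agree on the full order, and on the winner."""
    pairs = list(product(api_runs, app_runs))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    winner = sum(a[0] == b[0] for a, b in pairs) / len(pairs)
    return exact, winner

# Made-up runs where only the tail positions swap.
api_runs = [list("ABCDE"), list("ABCED"), list("ABCDE")]
app_runs = [list("ABCDE"), list("ABCDE"), list("ABDCE")]

exact_rate, winner_rate = match_rates(api_runs, app_runs)
print(f"exact order: {exact_rate:.0%}, same winner: {winner_rate:.0%}")
```

If the exact-order rate is low while the winner rate stays high, a weighted score built on the full sequence will be noisier than a winner call from the same runs.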

Why API-based GEO measurement is now practical at scale

The validity result is what turns GEO measurement from a manual chore into a program you can actually run at scale. Browser-driving a few hundred prompts takes hours and breaks when a page throttles automated input. The same volume of API completions finishes in minutes with the same noise profile, which is the difference between tracking five categories by hand and tracking five hundred on a schedule.

Geography is the second payoff. AI answers differ by country, and a customer in Berlin, São Paulo, or Jakarta sees a grounded answer shaped by local context. Massive's /ai endpoint returns LLM completions from real-user-device origins in 195+ countries, so you can measure brand visibility the way a local actually experiences it, not from a single datacenter in Virginia.

The reason it tracks the live app is mechanical, and worth being precise about. The /ai endpoint doesn't run a sanitized sandbox model or a different checkpoint. It drives the same consumer Gemini app from a real device in the geo you choose, and in this test its model roster matched 3.5 Flash on every call. You're measuring the same surface your customers use, fetched the same way a customer's device would reach it. That's why the answers line up.

If you're building an AEO or AI brand-monitoring platform, this is the infrastructure layer underneath your analytics. You keep your dashboards, scoring, and reporting. The geo coverage, device emulation, and source handling are solved upstream. To pressure-test it against your own categories, you can run a benchmark on the /ai endpoint and compare it to whatever you measure today.

The bottom line

API-based AI calls are a faithful proxy for what your users actually see. In 30 categories, the API matched the live Gemini app on the same brand ranking 87% of the time, landed the target brand in the identical position on 13 of 30 topics, and drifted by a third of a position on average. The disagreement it does show is about the same disagreement the app shows with itself. Trust the API for brand visibility, winner calls, and trends, especially across many categories and countries at once. Reserve manual checks for food and CPG, contested winners, and anything that hinges on a logged-in session.

To measure AI answer visibility the way your customers in any country experience it, explore Massive's AI chat endpoint.


Sources

  • Massive Computing, "Are API-based AI calls a good proxy for what your users actually see?" (GEO research, Web Render API), experiment dated 2026-06-17, retrieved 2026-06-18. 30 brand categories across 18 industries, Gemini 3.5 Flash, 3 runs per channel (n=10 deep runs on two topics), topic-level bootstrap confidence intervals.

Frequently Asked Questions

Does an API call return the same AI answer a real user sees?

In a 2026 test across 30 brand categories, an API querying Gemini matched the live consumer app's brand ranking on 26 of 30 topics (87%), and named the same winner just as often (Massive experiment, 2026). The small residual gaps mostly came from category noise, not from the channel.

Why measure agreement against a "noise floor"?

Generative models are non-deterministic, so the same prompt can return different orderings on repeat runs. Each channel therefore disagrees with itself. That self-disagreement (0.82 in our test) is the fair benchmark, because two channels can't agree more than each agrees with itself. Cross-channel agreement reached 0.79.

Which categories are least reliable to measure by API?

Food and CPG were the noisiest in our 2026 test, and contested winners drifted most. Both channels disagreed with themselves more in those categories, so it's intrinsic variance, not channel bias. Spot-check the live app for these, and for any result that depends on signed-in personalization.

How many runs should a GEO program use per query?

Use at least three runs per query and report the consensus, never a single pull. In our data, both the API and the live app frequently swapped an adjacent pair on any one run. Three runs smoothed that out, and deep runs of ten confirmed that most single-flag discrepancies were sampling noise.

Can API measurement capture answers from other countries?

Yes, if the API routes through local origins. Massive's /ai endpoint returns completions from real consumer devices in 195+ countries with country, region, and city targeting, so you can measure how a brand appears to a user in a specific market rather than from one datacenter location.