A measurement-validity test for GEO. If you track brand visibility in AI answers (share of voice, rankings, winner calls), you're almost certainly measuring through an API, while your customers type into the real Gemini app. This study asked the identical brand-ranking prompt through both channels across 30 brand topics, and analyses each topic as one observation (not each run-pair) so the numbers carry honest confidence intervals rather than the inflated precision of pooling correlated comparisons.
Ask the same question through Massive's API and through the real Gemini app, and you get the same brand ranking about 87% of the time — roughly as often as the web app even agrees with itself. When you track your brand's visibility in AI answers, the API shows you what your customers see.
Each topic is a target brand plus its 4 nearest competitors, shuffled once and frozen so both channels receive byte-identical text. We sent it through Massive's API (/ai · model=gemini) and typed the same words into gemini.google.com — both on 3.5 Flash, signed-in on the web side — capturing the ranking 3 times per channel.
Expect variation. Generative models are non-deterministic — ask the identical question twice and you can get a different order, on any model and on either channel. So there is no single "right" answer to match: we run each prompt 3× per channel and ask whether the API and the web differ any more than each already differs from itself.
Per topic: cross tau = mean API↔web rank agreement; within tau = each channel's self-agreement (the noise floor); a topic is an offset when its excess (within − cross) clears 0.10, which is beyond half an adjacent swap. The "Same #1" column marks whether both channels named the same consensus #1 brand.
| Target · topic | Cross tau | Within tau (floor) | Excess | Same #1 | Δ rank | Class |
|---|---|---|---|---|---|---|
| Kraft Heinz P10 · Food / CPG | 0.07 | 0.40 | +0.33 | ● | 1.67 | offset |
| Tesla P9 · Automotive | 0.67 | 0.93 | +0.27 | ● | 0.67 | offset |
| Mars P23 · Food / CPG | 0.22 | 0.40 | +0.18 | ● | 0.67 | offset |
| Marriott P24 · Hospitality / Hotels | 0.67 | 0.80 | +0.13 | ● | 0.33 | offset |
| MAC Cosmetics P19 · Beauty / Cosmetics | 0.80 | 0.87 | +0.07 | ● | 1.00 | within noise |
| Xbox P26 · Gaming | 0.80 | 0.87 | +0.07 | ● | 0.67 | within noise |
| Mastercard P15 · Financial Services / Payments | 0.80 | 0.87 | +0.07 | ● | 0.00 | within noise |
| Hyatt P6 · Hospitality / Hotels | 0.76 | 0.80 | +0.04 | ● | 1.00 | within noise |
| Slack P17 · Software / SaaS | 0.71 | 0.73 | +0.02 | ● | 0.33 | within noise |
| Comcast (Xfinity) P1 · Telecommunications | 0.93 | 0.93 | -0.00 | ● | 0.00 | within noise |
| HP P22 · Consumer Electronics | 0.53 | 0.53 | -0.00 | ● | 0.33 | within noise |
| Microsoft P4 · Software / SaaS | 1.00 | 1.00 | +0.00 | ● | 0.00 | within noise |
| Nike P8 · Sportswear | 1.00 | 1.00 | +0.00 | ● | 0.00 | within noise |
| Puma P12 · Sportswear | 1.00 | 1.00 | +0.00 | ● | 0.00 | within noise |
| Reebok P16 · Sportswear | 1.00 | 1.00 | +0.00 | ● | 0.00 | within noise |
| Red Bull P21 · Beverages | 0.87 | 0.87 | +0.00 | ● | 0.33 | within noise |
| Verizon P28 · Telecommunications | 1.00 | 1.00 | +0.00 | ● | 0.00 | within noise |
| Wendy's P30 · Quick Service Restaurants | 0.80 | 0.80 | +0.00 | ● | 0.00 | within noise |
| Levi's P3 · Apparel / Fashion | 0.93 | 0.93 | +0.00 | ● | 0.00 | within noise |
| T-Mobile P5 · Telecommunications | 0.87 | 0.87 | +0.00 | ● | 0.67 | within noise |
| Oracle P11 · Software / SaaS | 0.93 | 0.93 | +0.00 | ● | 0.33 | within noise |
| Sony P13 · Consumer Electronics | 0.93 | 0.93 | +0.00 | ● | 0.00 | within noise |
| H&M P18 · Apparel / Fashion | 0.87 | 0.87 | +0.00 | ● | 0.33 | within noise |
| Nestlé P25 · Food / CPG | 0.76 | 0.73 | -0.02 | ● | 0.00 | within noise |
| Amazon Prime Video P7 · Streaming / Media | 0.89 | 0.87 | -0.02 | ● | 0.33 | within noise |
| Ford P29 · Automotive | 0.89 | 0.87 | -0.02 | ● | 0.33 | within noise |
| Estée Lauder P2 · Beauty / Cosmetics | 0.80 | 0.73 | -0.07 | ● | 0.33 | within noise |
| Gucci P14 · Apparel / Fashion | 0.80 | 0.73 | -0.07 | ● | 0.33 | within noise |
| Pepsi P20 · Beverages | 0.80 | 0.73 | -0.07 | ● | 0.00 | within noise |
| American Airlines P27 · Airlines / Travel | 0.76 | 0.67 | -0.09 | ● | 0.00 | within noise |
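The per-topic columns can be reproduced with a short script. This is a minimal sketch, not the study's code: it assumes "tau" means Kendall's tau over the 5-brand order, that cross tau averages all 3 × 3 API-web run pairs, and that within tau averages the 3 run pairs inside each channel across both channels. The function names and the example rankings are illustrative.

```python
from itertools import combinations, product
from statistics import mean


def kendall_tau(a, b):
    """Kendall tau between two orderings of the same brands (no ties)."""
    pos_b = {brand: i for i, brand in enumerate(b)}
    # combinations(a, 2) yields (x, y) with x ranked above y in ordering a.
    pairs = list(combinations(a, 2))
    concordant = sum(1 for x, y in pairs if pos_b[x] < pos_b[y])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def topic_metrics(api_runs, web_runs, threshold=0.10):
    """Cross tau, within tau (noise floor), excess, and class for one topic."""
    # Cross: every API run against every web run (3 x 3 = 9 pairs at n=3).
    cross = mean(kendall_tau(a, w) for a, w in product(api_runs, web_runs))
    # Within: run pairs inside each channel, pooled over both channels.
    within = mean(
        kendall_tau(r1, r2)
        for runs in (api_runs, web_runs)
        for r1, r2 in combinations(runs, 2)
    )
    excess = within - cross
    label = "offset" if excess > threshold else "within noise"
    return {"cross": cross, "within": within, "excess": excess, "class": label}


# Illustrative rankings: each channel is perfectly stable, but the web
# consistently swaps the top two brands relative to the API.
api_runs = [list("ABCDE")] * 3
web_runs = [list("BACDE")] * 3
result = topic_metrics(api_runs, web_runs)
print({k: round(v, 2) if isinstance(v, float) else v for k, v in result.items()})
# prints {'cross': 0.8, 'within': 1.0, 'excess': 0.2, 'class': 'offset'}
```

Under these assumptions, one adjacent swap among 5 brands costs 0.20 of tau (2 of 10 pairs' worth), which is why a 0.10 threshold corresponds to half a swap, and 3 runs per channel give 9 cross-pairs per topic.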
Pooled-pairs view (for contrast, and to show why it overstates precision): 270 cross-pairs, same #1 in 83%, exact 5-brand order in 50% — but those 270 cluster into just 30 independent topics.
This is the number a GEO program actually reports: where does your brand rank in the AI's answer? Each row is one topic's target brand on a 1 (best) → 5 (worst) scale. The orange knot is the rank an API measurement would report, the blue-gray knot is what users saw on the web, and a dotted cord shows the sway between them. 13 of 30 land identically; 17 shift, and only Kraft Heinz moves more than one position.
Δ = API mean rank − web mean rank (3 runs each). Positive = the API ranks the brand worse (higher number) than web users see. Green ≤0.05 · amber ≤0.7 · red >0.7. Mean |Δ| across all 30 targets: 0.32 positions.
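The Δ definition and color bands above can be written out as a short sketch. The run-level ranks below are hypothetical (the per-run positions are not listed here), and the function names are illustrative; the bands are applied to the size of Δ regardless of sign, which is an assumption.

```python
from statistics import mean


def delta_rank(api_ranks, web_ranks):
    """API mean rank minus web mean rank; positive means the API ranks the brand worse."""
    return mean(api_ranks) - mean(web_ranks)


def band(delta):
    """Color band from the legend, applied to the absolute shift."""
    size = abs(delta)
    if size <= 0.05:
        return "green"
    if size <= 0.7:
        return "amber"
    return "red"


# Hypothetical target brand: ranked 2nd in every web run, 3rd in one API run.
api_ranks = [2, 3, 2]
web_ranks = [2, 2, 2]
d = delta_rank(api_ranks, web_ranks)
print(round(d, 2), band(d))  # prints 0.33 amber
```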
Four topics cleared the noise floor at n=3. But n=3 is a coarse sieve — an "offset" can be a lucky draw. The two worst were re-queried to a full 10 runs on both channels (symmetric). The result is a caution about small samples: one offset vanished entirely, the other shrank by two-thirds. Treat n=3 offsets as candidates, not verdicts.
The other two n=3 offsets (Mars, food/CPG, excess +0.18; Marriott, hotels, +0.13) weren't deep-run, so they stay candidates — and given what happened to Tesla and Kraft Heinz, the prior is now that they shrink too. Net: of the four n=3 flags, deep sampling left one mild, borderline offset (Kraft Heinz) and erased a clear false positive (Tesla). The practical signal: the API proxy is even better than the n=3 headline suggested — food/CPG is just an intrinsically noisy category, not a biased one.
Sample. 30 topics drawn from brand_competitors.jsonc (seeded random, seeds 20260610 + 20260617), spanning 18 industries. Each: target brand + 4 competitors, shuffled once and frozen, byte-identical across both channels.
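A minimal sketch of "shuffled once and frozen", assuming a seeded shuffle and a stored hash to confirm both channels receive the same bytes. The prompt wording, the placeholder brand names, and the `freeze_topic` helper are hypothetical, and reusing one of the sampling seeds for the shuffle is only for illustration.

```python
import hashlib
import random


def freeze_topic(target, competitors, seed):
    """Shuffle the 5 brands once with a fixed seed and freeze the prompt text."""
    brands = [target, *competitors]
    random.Random(seed).shuffle(brands)  # deterministic for a given seed
    # Hypothetical prompt template; the study's exact wording is not given.
    prompt = "Rank these brands from best to worst: " + ", ".join(brands)
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {"brands": brands, "prompt": prompt, "sha256": digest}


topic = freeze_topic(
    "Target Brand",
    ["Competitor 1", "Competitor 2", "Competitor 3", "Competitor 4"],
    seed=20260610,
)
print(topic["prompt"])
print(topic["sha256"][:12])  # compare this digest before sending on either channel
```

Storing the digest alongside the prompt gives a cheap check that the text pasted into the web app matches what the API received.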
Both channels = 3.5 Flash. Web: gemini.google.com, signed-in (Workspace Pro), model picker on 3.5 Flash, fresh conversation per run. API: Massive /ai?model=gemini&country=us&expiration=0 — which drives the same web app; its model roster matched 3.5 Flash on all 100+ calls.
Topic-level stats. Each topic contributes one value per metric; 95% CIs are bootstrap over the 30 topics (4000 resamples) for means, Wilson intervals for proportions. This avoids the pseudoreplication of pooling the 270 within-topic pairs.
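The two interval types can be sketched as follows, using the Δ rank column from the table and the 13-of-30 identical count as inputs. This assumes a percentile bootstrap and the standard Wilson score formula with z = 1.96; the seed and function names are illustrative, and the study's exact resampling procedure may differ.

```python
import math
import random


def bootstrap_mean_ci(values, resamples=4000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean, resampling topics with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(resamples))
    tail = (1 - level) / 2
    lo = means[int(tail * resamples)]
    hi = means[int((1 - tail) * resamples) - 1]
    return lo, hi


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Delta rank column from the table, one value per topic, in table order.
delta_rank_column = [
    1.67, 0.67, 0.67, 0.33, 1.00, 0.67, 0.00, 1.00, 0.33, 0.00,
    0.33, 0.00, 0.00, 0.00, 0.00, 0.33, 0.00, 0.00, 0.00, 0.67,
    0.33, 0.00, 0.33, 0.00, 0.33, 0.33, 0.33, 0.33, 0.00, 0.00,
]
mean_shift = sum(delta_rank_column) / len(delta_rank_column)
print("mean shift:", round(mean_shift, 2))  # prints mean shift: 0.32
print("bootstrap 95% CI:", bootstrap_mean_ci(delta_rank_column))
print("Wilson 95% CI for 13/30 identical:", wilson_ci(13, 30))
```

Resampling the 30 topics rather than the 270 pairs is what keeps the interval honest: pairs from the same topic share a prompt and are not independent draws.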
Cleaning. Grounding-citation tokens were stripped before parsing rankings. Five API runs that returned empty/refusal/abbreviated completions were re-fetched until they parsed to a clean 5-brand order; one web run that hallucinated an out-of-set brand was re-run.
Limits. n=3 per channel for the headline (deep runs API-only); 3.5 Flash is non-deterministic, so "within noise" is bounded by 3 observations; the residual offsets implicate the API's anonymous session vs signed-in context, which API measurement cannot replicate.