Experiment 2026-06-17 · /ai · model=gemini · country=us · no cache · both channels · Gemini 3.5 Flash · 30 brand categories × 3 runs × 2 channels

Are API-based AI calls a good proxy for what your users actually see?

A measurement-validity test for GEO. If you track brand visibility in AI answers — share of voice, rankings, winner calls — you're almost certainly measuring through an API, while your customers type into the real Gemini app. This study asked the identical brand-ranking prompt through both channels across 30 brand categories, and analyses each topic as one observation (not each run-pair) so the numbers carry honest confidence intervals rather than the inflated precision of pooling correlated comparisons.

Faithful proxy for the web?

Yes! Similar results 87% of the time

Ask the same question through Massive's API and through the real Gemini app, and on about 87% of topics (26 of 30) the two rankings agree as closely as repeated runs of the web app agree with each other. When you track your brand's visibility in AI answers, the API shows you what your customers see.

How did we test?

identical prompt · both channels · 3.5 Flash

In short: One brand-ranking question, sent through both channels across 30 categories, 3 times each, because a generative model doesn't return one fixed answer.

Each topic is a target brand plus its 4 nearest competitors, shuffled once and frozen so both channels receive byte-identical text. We sent it through Massive's API (/ai · model=gemini) and typed the same words into gemini.google.com — both on 3.5 Flash, signed-in on the web side — capturing the ranking 3 times per channel.

The prompt · identical on both channels

order these [industry] companies in order from best to worst [target brand + 4 competitors, shuffled once & frozen]. only respond with the 5 companies in the recommended order.

e.g. "order these Sportswear companies in order from best to worst Nike, Puma, Under Armour, Adidas, New Balance. only respond with the 5 companies in the recommended order."
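As a minimal sketch of the "shuffled once and frozen" step, the snippet below builds the prompt from the template above with a seeded shuffle. The function name `build_prompt` and the seed value are illustrative assumptions; the write-up does not say how the shuffle was implemented.

```python
import random

# Template copied from the prompt shown above.
PROMPT_TEMPLATE = (
    "order these {industry} companies in order from best to worst {brands}. "
    "only respond with the 5 companies in the recommended order."
)

def build_prompt(industry, target, competitors, seed):
    """Shuffle the target plus its 4 competitors once and return the prompt text."""
    brands = [target, *competitors]
    if len(brands) != 5:
        raise ValueError("expected a target brand plus exactly 4 competitors")
    # A dedicated seeded generator makes the shuffle reproducible, so the same
    # frozen text can be reused for every run on both channels.
    random.Random(seed).shuffle(brands)
    return PROMPT_TEMPLATE.format(industry=industry, brands=", ".join(brands))

# Hypothetical seed; any fixed value works.
prompt = build_prompt(
    "Sportswear", "Nike", ["Puma", "Under Armour", "Adidas", "New Balance"], seed=42
)
api_text = web_text = prompt  # both channels receive the same string
```

Building the string once and reusing it, rather than reshuffling per channel or per run, is what keeps brand order from becoming a confound.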

Expect variation. Generative models are non-deterministic — ask the identical question twice and you can get a different order, on any model and on either channel. So there is no single "right" answer to match: we run each prompt 3× per channel and ask whether the API and the web differ any more than each already differs from itself.

Headline — 30 brand categories

each topic = 1 observation · 95% bootstrap CI

In short: The API agrees with the live Gemini app about as well as the app agrees with itself: same #1 brand on 26 of 30 topics, with the brand's rank off by a third of a place on average.

Mean cross-channel tau: 0.79 (95% CI 0.71–0.86)

Within-channel floor: 0.82 (95% CI 0.76–0.88)

Tau retained vs floor: 93% (95% CI 86%–98%)

Topics within noise: 87% (26/30 · CI 70%–95%)

Same consensus #1: 87% (26/30 · CI 70%–95%)

Per topic: cross tau = mean API↔web rank agreement; within tau = each channel's self-agreement (the noise floor); a topic is an offset when its excess (within − cross) clears 0.10, which is half the tau cost of one adjacent swap (0.20 with 5 brands). The Same #1 column marks whether both channels named the same consensus #1 brand.

Target	Topic	Cross tau	Within tau (floor)	Excess	Same #1	Δ rank	Class
Kraft Heinz	P10 · Food / CPG	0.07	0.40	+0.33	●	1.67	offset
Tesla	P9 · Automotive	0.67	0.93	+0.27	●	0.67	offset
Mars	P23 · Food / CPG	0.22	0.40	+0.18	●	0.67	offset
Marriott	P24 · Hospitality / Hotels	0.67	0.80	+0.13	●	0.33	offset
MAC Cosmetics	P19 · Beauty / Cosmetics	0.80	0.87	+0.07	●	1.00	within noise
Xbox	P26 · Gaming	0.80	0.87	+0.07	●	0.67	within noise
Mastercard	P15 · Financial Services / Payments	0.80	0.87	+0.07	●	0.00	within noise
Hyatt	P6 · Hospitality / Hotels	0.76	0.80	+0.04	●	1.00	within noise
Slack	P17 · Software / SaaS	0.71	0.73	+0.02	●	0.33	within noise
Comcast (Xfinity)	P1 · Telecommunications	0.93	0.93	-0.00	●	0.00	within noise
HP	P22 · Consumer Electronics	0.53	0.53	-0.00	●	0.33	within noise
Microsoft	P4 · Software / SaaS	1.00	1.00	+0.00	●	0.00	within noise
Nike	P8 · Sportswear	1.00	1.00	+0.00	●	0.00	within noise
Puma	P12 · Sportswear	1.00	1.00	+0.00	●	0.00	within noise
Reebok	P16 · Sportswear	1.00	1.00	+0.00	●	0.00	within noise
Red Bull	P21 · Beverages	0.87	0.87	+0.00	●	0.33	within noise
Verizon	P28 · Telecommunications	1.00	1.00	+0.00	●	0.00	within noise
Wendy's	P30 · Quick Service Restaurants	0.80	0.80	+0.00	●	0.00	within noise
Levi's	P3 · Apparel / Fashion	0.93	0.93	+0.00	●	0.00	within noise
T-Mobile	P5 · Telecommunications	0.87	0.87	+0.00	●	0.67	within noise
Oracle	P11 · Software / SaaS	0.93	0.93	+0.00	●	0.33	within noise
Sony	P13 · Consumer Electronics	0.93	0.93	+0.00	●	0.00	within noise
H&M	P18 · Apparel / Fashion	0.87	0.87	+0.00	●	0.33	within noise
Nestlé	P25 · Food / CPG	0.76	0.73	-0.02	●	0.00	within noise
Amazon Prime Video	P7 · Streaming / Media	0.89	0.87	-0.02	●	0.33	within noise
Ford	P29 · Automotive	0.89	0.87	-0.02	●	0.33	within noise
Estée Lauder	P2 · Beauty / Cosmetics	0.80	0.73	-0.07	●	0.33	within noise
Gucci	P14 · Apparel / Fashion	0.80	0.73	-0.07	●	0.33	within noise
Pepsi	P20 · Beverages	0.80	0.73	-0.07	●	0.00	within noise
American Airlines	P27 · Airlines / Travel	0.76	0.67	-0.09	●	0.00	within noise
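The per-topic columns can be sketched as follows. This is a minimal illustration, not the study's code: it assumes rankings without ties, and it assumes the within-channel floor averages the run pairs of both channels together (the write-up does not say how the two channels' self-agreement is combined). The names `kendall_tau` and `topic_metrics` are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

def kendall_tau(a, b):
    """Kendall tau between two orderings of the same items (no ties)."""
    pos_b = {brand: i for i, brand in enumerate(b)}
    concordant = discordant = 0
    for x, y in combinations(a, 2):  # x precedes y in ordering a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def topic_metrics(api_runs, web_runs, threshold=0.10):
    """Cross tau, within tau, excess, and class for one topic."""
    # Every API run against every web run: 3 x 3 = 9 pairs at n=3.
    cross = mean(kendall_tau(a, w) for a, w in product(api_runs, web_runs))
    # Assumption: pool the within-channel pairs of both channels for the floor.
    within_pairs = list(combinations(api_runs, 2)) + list(combinations(web_runs, 2))
    within = mean(kendall_tau(p, q) for p, q in within_pairs)
    excess = within - cross
    label = "offset" if excess > threshold else "within noise"
    return {"cross": cross, "within": within, "excess": excess, "class": label}

# Hypothetical runs: the web consistently swaps the top two brands.
api_runs = [["A", "B", "C", "D", "E"]] * 3
web_runs = [["B", "A", "C", "D", "E"]] * 3
result = topic_metrics(api_runs, web_runs)
```

With 5 brands there are 10 brand pairs, so a single adjacent swap costs 0.20 of tau, which is why the 0.10 threshold sits at half a swap.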

Pooled-pairs view (for contrast, and to show why it overstates precision): 270 cross-pairs, same #1 in 83%, exact 5-brand order in 50% — but those 270 cluster into just 30 independent topics.

Where each target brand lands

the GEO bottom line · API vs web mean rank

In short: 13 of 30 brands land in the exact same rank on both channels, and the rest drift by one position or less, except Kraft Heinz, the only brand that moves more than a single place.

This is the number a GEO program actually reports: where does your brand rank in the AI's answer? Each row is one topic's target brand on a 1 (best) → 5 (worst) scale. The orange knot is the rank API measurement would report, the blue-gray knot what users saw on the web; a dotted cord shows the sway between them. 13 of 30 land identically; 17 shift — and only Kraft Heinz moves more than one position.

Chart data: API mean rank vs web mean rank per target brand, on a 1 (best) to 5 (worst) scale; n/a where the chart printed no rank labels.

Target	Topic	API mean rank	Web mean rank	Δ api−web
Levi's	P3 · Apparel / Fashion	n/a	n/a	0.0
Microsoft	P4 · Software / SaaS	n/a	n/a	0.0
Nike	P8 · Sportswear	n/a	n/a	0.0
Sony	P13 · Consumer Electronics	n/a	n/a	0.0
Nestlé	P25 · Food / CPG	n/a	n/a	0.0
Mastercard	P15 · Financial Services / Payments	n/a	n/a	0.0
Verizon	P28 · Telecommunications	n/a	n/a	0.0
Wendy's	P30 · Quick Service Restaurants	n/a	n/a	0.0
Comcast (Xfinity)	P1 · Telecommunications	n/a	n/a	0.0
Puma	P12 · Sportswear	n/a	n/a	0.0
Pepsi	P20 · Beverages	n/a	n/a	0.0
Reebok	P16 · Sportswear	n/a	n/a	0.0
American Airlines	P27 · Airlines / Travel	n/a	n/a	0.0
Red Bull	P21 · Beverages	1	1.3	-0.3
Slack	P17 · Software / SaaS	1.3	1	+0.3
Marriott	P24 · Hospitality / Hotels	1.7	2	-0.3
Oracle	P11 · Software / SaaS	2.7	3	-0.3
HP	P22 · Consumer Electronics	3	2.7	+0.3
Estée Lauder	P2 · Beauty / Cosmetics	3.3	3.7	-0.3
Amazon Prime Video	P7 · Streaming / Media	3.3	3.7	-0.3
Ford	P29 · Automotive	3.3	3.7	-0.3
Gucci	P14 · Apparel / Fashion	3.7	3.3	+0.3
H&M	P18 · Apparel / Fashion	3.7	4	-0.3
Tesla	P9 · Automotive	1	1.7	-0.7
Mars	P23 · Food / CPG	1.3	2	-0.7
T-Mobile	P5 · Telecommunications	1.7	1	+0.7
Xbox	P26 · Gaming	5	4.3	+0.7
MAC Cosmetics	P19 · Beauty / Cosmetics	n/a	n/a	-1.0
Hyatt	P6 · Hospitality / Hotels	2.3	3.3	-1.0
Kraft Heinz	P10 · Food / CPG	4	2.3	+1.7

Δ = API mean rank − web mean rank (3 runs each). Positive = the API ranks the brand worse (higher number) than web users see. Green ≤0.05 · amber ≤0.7 · red >0.7. Mean |Δ| across all 30 targets: 0.32 positions.
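The Δ definition and colour bands can be sketched in a few lines, and the 0.32 average can be re-derived from the Δ values listed above. This is an illustrative sketch: the per-run ranks in the example are hypothetical (only the means are reported), and it assumes the displayed 0.3, 0.7, and 1.7 are thirds rounded to one decimal, since each mean is taken over 3 runs.

```python
from statistics import mean

def rank_delta(api_ranks, web_ranks):
    """API mean rank minus web mean rank for the target brand."""
    return mean(api_ranks) - mean(web_ranks)

def band(delta):
    """Colour band from the legend: green <= 0.05, amber <= 0.7, red > 0.7."""
    size = abs(delta)
    if size <= 0.05:
        return "green"
    if size <= 0.7:
        return "amber"
    return "red"

# Hypothetical per-run target ranks consistent with means of 4 and 2.3.
example_delta = rank_delta([4, 4, 4], [2, 2, 3])
example_band = band(example_delta)

# |Δ| magnitudes by count, read from the list above:
# 13 topics at 0, 10 at 1/3, 4 at 2/3, 2 at 1, and 1 at 5/3.
abs_deltas = [0.0] * 13 + [1 / 3] * 10 + [2 / 3] * 4 + [1.0] * 2 + [5 / 3]
mean_abs_delta = mean(abs_deltas)  # rounds to 0.32
```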

The 4 offsets — and what survives n=10

P9 / P10 deep-run · both channels n=10

In short: The two flagged topics we re-ran at 10 runs shrank or vanished: Tesla was a false alarm, Kraft Heinz a small real gap. A 3-run flag is a lead to check, not a finding.

Four topics cleared the noise floor at n=3. But n=3 is a coarse sieve — an "offset" can be a lucky draw. The two worst were re-queried to a full 10 runs on both channels (symmetric). The result is a caution about small samples: one offset vanished entirely, the other shrank by two-thirds. Treat n=3 offsets as candidates, not verdicts.

Tesla · EVs — false positive

At n=3 the web led with BYD (2/3) — an apparent winner-call gap
At n=10 the channels are identical: Tesla #1 in 8/10 on both; target rank 1.20 = 1.20
Cross-tau 0.87 vs a 0.86 floor — excess −0.01. The gap was pure sampling noise; it's gone

Kraft Heinz · food — shrinks, borderline

The gap is two-thirds smaller at n=10: excess +0.33 → +0.12, target gap 1.85 → 1.0 (API 4.1 vs web 3.1)
Both channels are simply noisy on food (within-web tau 0.52) and both name Nestlé #1 (9/10)
Still marginally over the line — the one residual offset, but far milder than n=3 implied

The other two n=3 offsets (Mars, food/CPG, excess +0.18; Marriott, hotels, +0.13) weren't deep-run, so they stay candidates — and given what happened to Tesla and Kraft Heinz, the prior is now that they shrink too. Net: of the four n=3 flags, deep sampling left one mild, borderline offset (Kraft Heinz) and erased a clear false positive (Tesla). The practical signal: the API proxy is even better than the n=3 headline suggested — food/CPG is just an intrinsically noisy category, not a biased one.

Deep runs complete: P9 and P10 were each captured to a full n=10 on both channels (web finished via manual entry after gemini.google.com throttled automated input). Cross/within figures above use the complete symmetric n=10. The headline 30-topic result uses a clean symmetric n=3.

What it means for GEO measurement

decision guide

In short: Trust the API for brand visibility, winner calls, and trends; spot-check the live web for food/CPG and any contested #1, where the channels are most likely to drift.

Trust API-based GEO for

Brand visibility & rank trends — 26/30 topics statistically indistinguishable from sampling the web (CI 70–95%)
Winner / "AI-recommended" calls — same consensus #1 on 26/30 topics
Cross-category programs — the proxy holds across 18 industries, not just a lucky few
Speed & volume — hundreds of API completions in minutes vs hours of browser driving, same noise profile

Spot-check the web when

The category is food/CPG or a contested winner: Kraft Heinz, Mars, and Tesla were all flagged at n=3, though at n=10 Tesla's gap vanished and Kraft Heinz's shrank to borderline
Account context matters — the API is an anonymous Flash session; signed-in personalization/grounding can shift results
Exact full order feeds a weighted score — only 50% of cross-pairs match the complete 5-brand order; tails swap most
One run only — both channels swap an adjacent pair often; use ≥3 runs and report the consensus
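One way to "report the consensus" from several runs is sketched below. The write-up does not define its consensus rule, so both choices here are assumptions: the consensus order sorts brands by mean position, and the consensus #1 is the brand most often ranked first. The run data is hypothetical.

```python
from collections import Counter
from statistics import mean

def consensus_order(runs):
    """Order brands by mean position across runs (ties broken by name)."""
    brands = runs[0]
    mean_pos = {b: mean(r.index(b) + 1 for r in runs) for b in brands}
    return sorted(brands, key=lambda b: (mean_pos[b], b))

def consensus_winner(runs):
    """Brand most often ranked #1, with its share of runs."""
    brand, count = Counter(r[0] for r in runs).most_common(1)[0]
    return brand, count / len(runs)

# Hypothetical 3 runs on one channel; the top two swap once, as do ranks 3 and 4.
runs = [
    ["Nike", "Adidas", "New Balance", "Puma", "Under Armour"],
    ["Adidas", "Nike", "New Balance", "Puma", "Under Armour"],
    ["Nike", "Adidas", "Puma", "New Balance", "Under Armour"],
]
order = consensus_order(runs)
winner, share = consensus_winner(runs)
```

Reporting the winner's share alongside its name (here 2 of 3 runs) makes a contested #1 visible instead of hiding it behind a single run.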

Sample. 30 topics drawn from brand_competitors.jsonc (seeded random, seeds 20260610 + 20260617), spanning 18 industries. Each: target brand + 4 competitors, shuffled once and frozen, byte-identical across both channels.

Both channels = 3.5 Flash. Web: gemini.google.com, signed-in (Workspace Pro), model picker on 3.5 Flash, fresh conversation per run. API: Massive /ai?model=gemini&country=us&expiration=0 — which drives the same web app; its model roster matched 3.5 Flash on all 100+ calls.

Topic-level stats. Each topic contributes one value per metric; 95% CIs are bootstrap over the 30 topics (4000 resamples) for means, Wilson intervals for proportions. This avoids the pseudoreplication of pooling the 270 within-topic pairs.
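Both interval types can be sketched with the standard library. The Wilson formula is standard; the bootstrap shown is a plain percentile bootstrap, which is an assumption since the write-up does not name the variant. The function names and the seed are illustrative, and `cross_tau` holds the 2-decimal values from the per-topic table, so its interval will only approximate the headline one.

```python
import math
import random

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (default z gives 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def bootstrap_mean_ci(values, resamples=4000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling whole topics."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(resamples))
    lo = means[int((alpha / 2) * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi

# 26 of 30 topics: matches the reported CI of roughly 70% to 95%.
wilson_lo, wilson_hi = wilson_interval(26, 30)

# One cross-tau value per topic, as rounded in the per-topic table.
cross_tau = [
    0.07, 0.67, 0.22, 0.67, 0.80, 0.80, 0.80, 0.76, 0.71, 0.93,
    0.53, 1.00, 1.00, 1.00, 1.00, 0.87, 1.00, 0.80, 0.93, 0.87,
    0.93, 0.93, 0.87, 0.76, 0.89, 0.89, 0.80, 0.80, 0.80, 0.76,
]
boot_lo, boot_hi = bootstrap_mean_ci(cross_tau)
```

Resampling the 30 topic-level values, rather than the 270 run pairs, is what keeps the interval from shrinking artificially: pairs within a topic share the same prompt and brand set, so they are not independent draws.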

Cleaning. Grounding-citation tokens were stripped before parsing rankings. Five API runs that returned empty/refusal/abbreviated completions were re-fetched until they parsed to a clean 5-brand order; one web run that hallucinated an out-of-set brand was re-run.
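The cleaning rule can be sketched as a parser that returns a clean 5-brand order or signals that the run should be fetched again. Everything specific here is an assumption: the citation pattern (markers like `[1]` or `[cite: 2, 3]`) is hypothetical because the real grounding-token format is not given, and `parse_ranking` is an illustrative name.

```python
import re

# Hypothetical citation markers such as "[1]" or "[cite: 2, 3]".
CITATION = re.compile(r"\[(?:cite:\s*)?\d+(?:\s*,\s*\d+)*\]")
# Leading list numbering or bullets: "1.", "2)", "-", "*", "•".
LEADING = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")

def parse_ranking(text, brands):
    """Return the brands in the order given, or None if the reply needs a re-fetch."""
    cleaned = CITATION.sub("", text).replace("**", "")
    parts = re.split(r"[\n,]", cleaned)
    items = [LEADING.sub("", part).strip(" .\t") for part in parts]
    items = [item for item in items if item]
    canonical = {b.casefold(): b for b in brands}
    try:
        order = [canonical[item.casefold()] for item in items]
    except KeyError:
        return None  # out-of-set brand or extra prose
    if len(order) != len(brands) or set(order) != set(brands):
        return None  # empty, abbreviated, or duplicated reply
    return order

brands = ["Nike", "Puma", "Under Armour", "Adidas", "New Balance"]
reply = "1. Nike [1]\n2. Adidas\n3. New Balance [cite: 2, 3]\n4. Puma\n5. Under Armour"
parsed = parse_ranking(reply, brands)
```

Returning `None` instead of a best-effort guess keeps malformed completions out of the tau calculation; the caller decides whether to re-fetch.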

Limits. n=3 per channel for the headline (the n=10 deep runs cover only P9 and P10, on both channels); 3.5 Flash is non-deterministic, so "within noise" is bounded by 3 observations; any residual offset may reflect the API's anonymous session vs the signed-in web context, which API measurement cannot replicate.