Writing / Jun 2026

5 tools to measure your AI search visibility — and the flaw they all share.

Tools that track your brand in ChatGPT and Perplexity are genuinely useful. But the methodology behind all of them has a problem that most reviews don't mention — and it determines how much you should trust the number they give you.


There are now at least a dozen tools that claim to measure your visibility in AI search. Most of them have a dashboard, a visibility score, and a chart showing your brand’s share of voice across ChatGPT, Perplexity, and Gemini. Some of them cost significant money. All of them are measuring something real.

The question worth asking before you pay for any of them is: how exactly are they getting that number? Because the answer changes how useful the number actually is.

The two ways these tools work

AI visibility tools fall into two distinct categories based on methodology. Understanding the difference matters more than comparing feature sets.

The first category is SERP-based observation. Tools like Semrush and Ahrefs track your AI visibility by scraping Google search results. When you run a rank tracking campaign, they flag which of your tracked queries trigger a Google AI Overview, and they record which domains and URLs get cited in those AI Overviews. The measurement is observational — they are watching what Google actually serves, not simulating it. This makes the data relatively reliable. If Semrush says your URL appeared in the AI Overview for a given query on a given date, that is probably what happened.

The limitation is obvious: this observational data only covers Google AI Overviews. On its own, it tells you nothing about ChatGPT, Perplexity, Claude, or any other AI surface. For a lot of B2B buyers, Google AI Overviews are not the primary AI interface they are using.

The second category is LLM polling. Tools like Otterly.ai, Profound, and Peec AI measure your visibility by directly querying AI platforms — sending prompts to ChatGPT, Perplexity, Gemini, and others, then parsing the responses to see if your brand is mentioned, how prominently, and in what context. This is a fundamentally different methodology. They are not watching search results. They are running experiments against live models and aggregating what they find.

This is where the interesting problem lives.

The flaw in LLM polling

To understand the flaw, you need to understand the mechanics first.

An LLM polling tool works like this: the vendor builds a query library — a set of prompts relevant to your industry and category. Something like “what are the best B2B SEO agencies in Indonesia?” or “which tools do you recommend for AI search optimization?” They send each prompt to ChatGPT, Perplexity, Gemini, and others. They run each prompt multiple times — often between ten and thirty repetitions — because the model gives different answers each time. They record how often your brand name appears across all those runs. Your visibility score is your mention rate: if your brand appeared in 340 out of 1,000 prompt runs, you get a score of 34%.
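To make the arithmetic concrete, here is a minimal Python sketch of the mention-rate calculation described above. It assumes the responses have already been collected and stored as plain strings. The brand name "Acme SEO", the sample data, and the function names are all hypothetical, and a real tool probably uses fuzzier matching than a whole-word search.

```python
import re
from typing import Iterable


def mentions_brand(response: str, brand: str) -> bool:
    """Return True if the brand appears as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(brand) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None


def mention_rate(responses: Iterable[str], brand: str) -> float:
    """Share of responses that mention the brand at least once."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(mentions_brand(text, brand) for text in responses)
    return hits / len(responses)


# Hypothetical data: 1,000 stored responses, 340 of which name the brand.
sample = ["We recommend Acme SEO for this."] * 340 + ["Try another agency."] * 660
print(f"{mention_rate(sample, 'Acme SEO'):.0%}")  # prints 34%
```

Note what the score leaves out: it says nothing about how prominently the brand was mentioned or in what context, which is why tools layer position and sentiment on top of the raw rate.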

That is a reasonable methodology in principle. Repeated sampling is the correct response to non-determinism. The problem is in the execution — specifically in three places where the measurement breaks down.

The sample size is too small to be statistically stable. The non-determinism here is larger than most people assume. Thinking Machines Lab ran an identical prompt 1,000 times at temperature 0 — the setting that is supposed to make output deterministic — and still got 80 different completions. If a model varies that much under the most controlled conditions possible, running a brand-visibility prompt ten times gives you a rough estimate of how often that model, on that day, mentions your brand for that query. But the confidence interval around that estimate is wide. If you ran it ten more times tomorrow, the number would shift — not because your visibility changed, but because ten is not a large enough sample to pin down an unstable distribution. Tools that run each prompt more times (thirty, fifty, a hundred) produce more reliable per-prompt estimates, but most tools do not publish their methodology, so you cannot tell which approach a vendor is using.
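How wide is "wide"? The sketch below puts a 95% range around a mention rate using the Wilson score interval, a standard formula for proportions. Choosing Wilson is an assumption made for illustration, not something any vendor has said they use. It also treats every run as an independent draw from a stable distribution, which is optimistic given how models drift.

```python
import math


def wilson_interval(hits: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if runs <= 0 or not 0 <= hits <= runs:
        raise ValueError("need runs > 0 and 0 <= hits <= runs")
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# The same 30% mention rate observed at three different sample sizes.
for hits, runs in [(3, 10), (30, 100), (300, 1000)]:
    low, high = wilson_interval(hits, runs)
    print(f"{hits}/{runs}: {low:.0%} to {high:.0%}")
```

With 3 mentions in 10 runs, the interval spans roughly 11% to 60%. With 300 mentions in 1,000 runs, it narrows to roughly 27% to 33%. The headline score is identical in both cases; only the second one tells you much.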

The query library reflects the vendor’s assumptions, not your buyers’ behaviour. The prompts in the library were written by someone at the tool company who made educated guesses about what questions people in your category ask. Those guesses may be directionally correct. They are not derived from your actual customer data. The queries your buyers actually type into ChatGPT may be phrased differently, be more specific, or come from angles the vendor never anticipated. Your visibility score is measured against a proxy for customer behaviour, not the real thing.

The baseline shifts without warning. When OpenAI quietly updates GPT-4o’s system prompt, the model’s citation preferences can change overnight. When a new training run incorporates more recent web data, who the model “knows about” changes. When a competitor earns more press coverage, the model may start mentioning them in answers where it previously mentioned you. None of these changes are signalled to the tool. The score just moves, and there is often no way to know whether the change reflects something you did or something the model did.

The result is a number that is real in a narrow sense and misleading in a broader one. A visibility score of 34% does not mean that 34% of people who ask AI about your category see your brand. It means that in the vendor’s query library, your brand appeared in 34% of their prompt runs in that measurement window — against a distribution that was already shifting while they were measuring it.

The methodology is sound. The execution is immature. Those are different problems with different timelines for getting solved.

This is not a reason to dismiss these tools. It is a reason to treat them the way you would treat any early-stage measurement instrument: useful for direction, unreliable for precision, and subject to revision as the science matures.

What neither category tells you

Here is the more fundamental issue: neither type of tool tells you what to do differently.

If your Google AI Overview visibility improves, the causal factor is almost certainly your search ranking — which went up because of content, links, or technical improvements you made. The AI Overview inclusion followed the ranking. The tracking tool did not tell you to make those changes; it just confirmed they worked.

If your LLM polling score goes up or down, there is often no clean causal story to attach to it. Model updates, prompt drift, changes in how competitors are described in training data — any of these can move the number without any action on your part. And if you want to improve it, the path leads exactly where it always does: better content, stronger search authority, more third-party mentions. The same work that produces better Google rankings.

This is not a coincidence. When AI tools pull in live sources, they generally retrieve them through search infrastructure. There is no separate index you can submit to and no separate algorithm you can optimize for. The visibility score these tools give you is largely a downstream measurement of your search health, not an independent signal.

The five tools worth knowing

With that context, here is an honest summary of the tools that are actually in use.

1. Semrush AI Toolkit. Tracks Google AI Overview appearances within Semrush’s rank tracking, and now also polls ChatGPT, Gemini, and Perplexity from a database the company says spans 261M+ prompts and responses. Shows which of your tracked keywords trigger AI Overviews, which URLs are cited, and how that changes over time. The most reliable of the group for what it covers, because the AI Overview data is observational. Priced at around $99/month as an add-on. Useful if Google AI Overviews are a meaningful surface for your category.

2. Ahrefs Brand Radar. Ahrefs tracks AI Overview presence as a SERP feature in its rank tracker, and Brand Radar extends this to other AI surfaces. Notably, its published methodology runs prompts through the platforms’ public web interfaces, observing what real users would see rather than calling APIs, and does so at large volumes (roughly 143 million AI Overview prompts a month). That observational approach makes it one of the more credible polling-style tools, precisely because it leans toward observation over simulation.

3. Otterly.ai. Dedicated AI visibility monitoring. Sends prompts to ChatGPT, Perplexity, Gemini, Claude, and others; tracks brand mentions, share of voice, and sentiment over time, refreshing roughly weekly. The most accessible entry point for teams that want to measure beyond Google. The query library is a mix of tool-generated and user-defined prompts, so the quality of your measurement depends heavily on how well your query set reflects what your buyers actually ask. The non-determinism problem is most visible here.

4. Profound. Enterprise-grade AI citation tracking, and the best-funded player in the category: it raised a $96M Series C at a $1B valuation in early 2026. Monitors which of your URLs are being cited as sources across AI platforms, meaning not just whether your brand is mentioned in the response text, but whether your pages are being surfaced as references. This is a more meaningful signal than brand mention frequency if your goal is understanding content authority rather than brand awareness. More expensive, more data, same fundamental methodology as other LLM polling tools.

5. Peec AI. Positioned as an AI brand intelligence tool rather than a pure visibility tracker. Tracks share of voice, competitor benchmarking, and AI platform sentiment, with pricing starting around $100/month. Useful for ongoing brand monitoring in AI contexts. The same caveats about sample size and query selection apply. Better suited for tracking relative positioning against specific competitors than for understanding absolute visibility.

How to use these tools without being misled by them

Use the SERP-based tools — Semrush and Ahrefs — to track Google AI Overview presence as a SERP feature the same way you track featured snippets or People Also Ask boxes. It is a real measurement of a real thing, and the trend over time is meaningful.

Use the LLM polling tools to get directional signal, not precise measurement. If your brand consistently does not appear across hundreds of prompts on a relevant topic, that is a meaningful observation. If your score fluctuates by ten points week to week, that is probably noise. Set longer time horizons — monthly trends are more meaningful than weekly deltas.
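A rough way to check whether a week-to-week move is noise is a two-proportion z-test, sketched below. This is an illustrative assumption, not a method any of these vendors document. It needs the raw hit and run counts behind each score, which a dashboard may not expose, and it treats runs as independent, so read the result as a sanity check, not a verdict.

```python
import math


def two_proportion_z(hits_a: int, runs_a: int, hits_b: int, runs_b: int) -> float:
    """z statistic for the difference between two mention rates (B minus A)."""
    if runs_a <= 0 or runs_b <= 0:
        raise ValueError("both periods need at least one run")
    p_a = hits_a / runs_a
    p_b = hits_b / runs_b
    pooled = (hits_a + hits_b) / (runs_a + runs_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / runs_a + 1 / runs_b))
    if std_err == 0:
        return 0.0  # both periods are all hits or all misses
    return (p_b - p_a) / std_err


def looks_like_noise(hits_a: int, runs_a: int, hits_b: int, runs_b: int,
                     threshold: float = 1.96) -> bool:
    """True if the change is within what sampling variation alone could produce."""
    return abs(two_proportion_z(hits_a, runs_a, hits_b, runs_b)) < threshold


# Hypothetical counts: a ten-point jump, seen at two different sample sizes.
print(looks_like_noise(30, 100, 40, 100))      # small weekly sample
print(looks_like_noise(300, 1000, 400, 1000))  # larger pooled sample
```

The same ten-point jump is indistinguishable from noise at 100 runs per week and clearly distinguishable at 1,000. Pooling several weeks of runs is one practical reason monthly trends are more trustworthy than weekly deltas.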

Do not optimize directly for the score. The optimization path for AI visibility is the same as the optimization path for search visibility. If a tool tells you your AI visibility is low, the answer is not to find an AI-specific tactic. The answer is to do the SEO work that would have needed doing anyway: build authority in your category, produce content that actually answers the questions your buyers ask, earn coverage from third-party sources that talk about you in context.

The tools are useful for knowing where you stand. They are a poor guide for deciding what to do next.

This measurement category will get better

The methodology is nascent, not broken. The right response to non-determinism is more sampling — and as this space matures, tools will run more repetitions per prompt, build larger query libraries, and potentially source prompts from real search behaviour rather than vendor assumptions. Model versioning will improve, making it easier to separate drift caused by the model from drift caused by actual changes in your visibility. Some AI platforms may eventually expose citation data through APIs directly, which would make the sampling problem disappear entirely.
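If a tool or your own polling script can record which model version answered each prompt, separating the two kinds of drift becomes a simple grouping exercise. The sketch below assumes such a label exists; the Run record, its field names, and the version strings are hypothetical, and not every platform exposes a usable version identifier.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Run:
    model_version: str  # hypothetical label stored alongside each response
    mentioned: bool     # whether the brand appeared in that response


def mention_rate_by_version(runs: Iterable[Run]) -> Dict[str, float]:
    """Mention rate computed separately for each model version."""
    totals = defaultdict(lambda: [0, 0])  # version -> [hits, total runs]
    for run in runs:
        totals[run.model_version][0] += int(run.mentioned)
        totals[run.model_version][1] += 1
    return {version: hits / total for version, (hits, total) in totals.items()}


# Hypothetical log: the rate drops after a version change, not gradually.
log = (
    [Run("model-2026-05", True)] * 40 + [Run("model-2026-05", False)] * 60
    + [Run("model-2026-06", True)] * 25 + [Run("model-2026-06", False)] * 75
)
for version, rate in sorted(mention_rate_by_version(log).items()):
    print(f"{version}: {rate:.0%}")
```

A score that is stable within each version and jumps at the boundary points to the model. A score that moves within a single version is more likely to reflect something in the outside world, including something you did.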

We are early. The tools available now are first-generation instruments in a measurement category that did not exist three years ago. The visibility scores they produce are directionally useful and statistically weak — and that balance will shift over time as the methodology catches up to the need.

For now: use the numbers for trend direction over long timeframes, not point-in-time precision. Build a stable query library that reflects your buyers’ actual language. And do not let a low score or a sudden dip send you chasing an explanation that may not exist.

Written by
Raiputra

B2B SEO practitioner specialising in search strategy for the AI era. Working directly with marketing managers at mid-size companies — no account managers, no handoffs.
