How To · 14 min read

Stop Blocking AI Bots. Start Measuring Them.

Nathan Nicholls

TL;DR

  • AI bots scraping your content are not noise.
  • They are a signal of authority.
  • Server-side tagging via Stape.io lets you measure AI crawler activity in GA4 without affecting your human analytics.
  • This guide walks through the full technical setup: identifying AI crawlers by User-Agent, routing them through server GTM, building custom GA4 dimensions, and creating an AI Popularity dashboard.
  • It also covers the security risks of sharing this data with third parties and asks whether browsers like Chrome are already using AI request patterns as a ranking signal.

The problem with blocking

The default posture for most businesses is to block AI bots. Cloudflare offers a toggle. Stape has a bot detection power-up. Robots.txt directives tell crawlers to stay away. The reasoning is straightforward: bots are not customers, they consume server resources, and they inflate analytics.

That reasoning made sense when bots were spam scrapers and SEO tools probing for vulnerabilities. It does not make sense when the bots are ChatGPT, Claude, Gemini, and Perplexity, systems that are increasingly how people find businesses, evaluate products, and make purchasing decisions.

When you block an AI crawler, you are not protecting your analytics. You are making yourself invisible to the fastest-growing discovery channel in digital marketing. AI-referred sessions grew 527% year over year. Traffic to US retail sites from generative AI services surged 4,700% in 2025. Fastly's threat research, drawn from 6.5 trillion monthly requests across their global network, found that automated bot traffic now comprises 37% of all web activity. AI crawlers alone account for nearly 80% of all AI bot traffic.

The scale is staggering. Meta generates 52% of AI crawler traffic. Google accounts for 23%. OpenAI represents 20%. ChatGPT's fetcher bots generate 98% of all real-time AI retrieval requests, peaking at 39,000 requests per minute. Nearly 90% of this activity originates from North America.

If your content is not being scraped, it is not being cited. If it is not being cited, you do not exist in AI search. The question is not whether to allow AI crawlers. It is how to measure what they are doing when they arrive.

AI traffic in GA4 today: what is missing

The standard approach to AI traffic in GA4 relies on referral data. When a user clicks through from ChatGPT or Perplexity to your site, GA4 records the referral source. Existing guides walk through identifying these sessions using source/medium filters.

This captures one dimension: human traffic arriving via AI platforms. Kevin Indig's research at Growth Memo puts this in perspective. Across six B2B companies, AI chatbot referral traffic averaged just 0.14% of organic visits, roughly one AI referral for every 714 organic sessions. But that number grew from 250 visits per month in early 2024 to over 1,300 by November, a five-fold increase. The referral side is growing fast, but it is still only one half of the picture.

The other half, the one GA4 misses entirely, is AI systems scraping your content to generate answers in the first place. AI crawlers do not execute JavaScript. They do not trigger your GA4 client-side tag. They do not appear in your session data at all. Standard GA4 is blind to the most important interaction: the moment an AI system decides your content is worth ingesting.

Indig makes a critical distinction in his reporting: the difference between an "AI Crawler" (scraping for model training data) and an "AI Assistant" (fetching a live answer for a user query). Both hit your server. Neither appears in client-side analytics. And they have fundamentally different implications for your content strategy.

Server-side analytics changes this.

The case for AI scraper traffic as signal

Consider what it means when an AI system repeatedly scrapes a specific page on your site. That system has determined, through its own evaluation, that your content is authoritative enough to cite in generated responses. This is not random. These systems have retrieval mechanisms that score and select sources.

Cloudflare's research reveals how this works in practice. Approximately 80% of AI bot traffic is dedicated to model training. The remaining traffic splits between search indexing, user-initiated requests, and undeclared purposes. Training traffic is erratic with no cyclical pattern, while user-action traffic (AI systems fetching answers for real user queries) shows clear daily cycles reflecting actual usage.

The crawl-to-refer ratios tell the story of return on content. Across Cloudflare's overall dataset, Anthropic crawls 50,000 pages for every one it refers traffic back to. OpenAI's ratio is 887:1. Perplexity is the most efficient at 118:1. For news and publishing sites, these ratios improve dramatically: OpenAI drops to 152:1, Perplexity to 32.7:1. The implication is clear. If your content is being crawled but not cited, the content or its structure is the problem, not the crawling.

Research from the GEO space supports this. Only 30% of brands stay visible between consecutive AI answers. Just 20% remain present across five consecutive answer runs. Pages with structured heading hierarchies have 2.8 times higher citation rates. Pages not updated quarterly are three times more likely to lose citations. AI systems are not scraping once. They are re-crawling, re-evaluating, and making ongoing decisions about source quality.

Kevin Indig's 2026 State of AI Search Optimization report frames this shift directly: visibility is replacing traffic as the real currency. In AI Mode sessions, users stay inside the AI interface. Clicks vanish. The brands that win are the ones the AI systems trust enough to cite, and that trust is built through consistent crawling and retrieval.

If backlinks became a ranking signal for Google because they represent third-party endorsement, AI scraper frequency is the equivalent signal for generative search. A page that AI systems return to repeatedly is a page that has earned algorithmic trust.

Blocking this traffic does not just remove data from your analytics. It removes your content from the systems that are increasingly deciding who gets cited and who gets ignored.

Architecture: server-side measurement with Stape.io

Client-side GA4 tags fire when a browser executes JavaScript. AI crawlers do not run JavaScript. They make HTTP requests, receive the HTML response, parse it, and move on. The only way to observe this interaction is at the server level.

Stape.io provides a server-side Google Tag Manager (sGTM) container that sits between your web server and Google's analytics endpoints. Every request to your site passes through this layer, including requests from AI crawlers that would otherwise be invisible.

Why server-side, not client-side

Client-side tracking has three blind spots for AI bot measurement:

  1. No JS execution. AI crawlers request HTML. They do not load scripts. Your gtag.js or GTM container never fires.
  2. No cookies. Bots do not accept or return cookies. Session-based attribution is impossible.
  3. No consent interaction. Bots do not click cookie banners. Under a strict consent model, even if they did execute JS, tags would not fire.

Server-side tagging solves all three. The sGTM container processes the raw HTTP request, inspects headers, identifies the User-Agent, and can fire tags or record events regardless of whether JavaScript ran.

Identifying AI crawlers by User-Agent

Each major AI system identifies itself in the HTTP User-Agent string:

  • OpenAI / ChatGPT: User-Agent contains GPTBot (training and retrieval)
  • OpenAI Browse: User-Agent contains ChatGPT-User (real-time web browsing)
  • Anthropic / Claude: User-Agent contains ClaudeBot or anthropic-ai (training and retrieval)
  • Google Gemini: Google-Extended (a robots.txt control token for Gemini training; the crawl itself is performed by Googlebot, so this string rarely appears as a User-Agent)
  • Google AI: User-Agent contains Googlebot, the existing crawler (AI Overviews retrieval)
  • Perplexity: User-Agent contains PerplexityBot (real-time answer generation)
  • Meta AI: User-Agent contains FacebookBot or meta-externalagent (Meta AI training)
  • Common Crawl: User-Agent contains CCBot (open dataset used by many AI systems)
  • Bytedance: User-Agent contains Bytespider (TikTok AI features)
  • Apple: Applebot-Extended (a robots.txt control token for Apple Intelligence; crawling is performed by Applebot)

In your sGTM container, you create a Request Header variable that reads the User-Agent header. A lookup table or regex match then classifies each request as a known AI crawler, an unknown bot, or a human visitor.

Routing: separate property vs custom dimensions

You have two architectural choices:

Option A: Separate GA4 property. Send AI bot events to a dedicated GA4 property. Human analytics stay completely clean. AI data lives in its own reporting space. Downside: you cannot correlate bot activity with human behaviour on the same page in a single report.

Option B: Same property, custom dimensions. Send all events to one GA4 property with a custom dimension visitor_type set to the bot name or "human." Use GA4 segments and filters to separate the views. Upside: correlation is easy. Downside: bot events inflate your event counts and can confuse automated reports if not filtered correctly.

Recommendation: Option B with strict GA4 audience filters. The correlation value outweighs the noise risk, and GA4's exploration reports handle segmentation well enough to separate the data when needed.
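Under Option B, every event carries the classification as a parameter. As a minimal sketch of what that event looks like on the wire, here is a function that builds a GA4 Measurement Protocol payload for a detected bot. This assumes you post directly to the /mp/collect endpoint; in practice the sGTM GA4 tag constructs the hit for you, and the synthetic client_id scheme shown here is an illustrative choice, not a GA4 requirement:

```javascript
// Sketch: an Option B event payload for the GA4 Measurement Protocol.
// client_id is mandatory in the protocol; since bots carry no cookies,
// a synthetic per-bot id is used here so each bot aggregates cleanly.
function buildBotEvent(botName, pagePath) {
  return {
    client_id: 'bot.' + botName,
    events: [{
      name: 'ai_bot_visit',
      params: {
        visitor_type: botName,  // the Option B custom dimension
        bot_name: botName,
        page_path: pagePath,
      },
    }],
  };
}
```

Human traffic would carry visitor_type: "human" on its normal page_view events, which is what makes the segment filters in GA4 straightforward.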

Step-by-step: GTM configuration for AI bot measurement

This section assumes you have a working Stape.io sGTM container connected to your site. If not, follow Stape's server-side tracking setup guide first.

Step 1: Create the User-Agent variable

In your server-side GTM container:

  1. Go to Variables > New > Request Header
  2. Header name: User-Agent
  3. Name it: Request - User Agent

Step 2: Create the AI Bot Classifier variable

Create a Custom JavaScript variable (or Regex Table variable) that classifies the User-Agent:

The variable should check the User-Agent string against a list of known AI bot identifiers (GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, Google-Extended, PerplexityBot, meta-externalagent, CCBot, Bytespider, Applebot-Extended). If a match is found, return the bot name (e.g., "ChatGPT", "Claude", "Perplexity"). If no match, return "human".

Name it: AI Bot Classifier
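The classification logic can be sketched as a plain function. The match strings follow the User-Agent table above; the bot labels returned are illustrative choices, and inside sGTM this logic would live in the Custom JavaScript variable (or be expressed as rows in a Regex Table):

```javascript
// Illustrative logic for the "AI Bot Classifier" variable.
// Each entry maps a User-Agent substring pattern to a bot label.
const AI_BOT_PATTERNS = [
  { pattern: /GPTBot/i, name: 'ChatGPT' },
  { pattern: /ChatGPT-User/i, name: 'ChatGPT-Browse' },
  { pattern: /ClaudeBot|anthropic-ai/i, name: 'Claude' },
  { pattern: /PerplexityBot/i, name: 'Perplexity' },
  { pattern: /FacebookBot|meta-externalagent/i, name: 'MetaAI' },
  { pattern: /CCBot/i, name: 'CommonCrawl' },
  { pattern: /Bytespider/i, name: 'Bytespider' },
];

function classifyUserAgent(userAgent) {
  const ua = userAgent || '';
  for (const { pattern, name } of AI_BOT_PATTERNS) {
    if (pattern.test(ua)) return name;
  }
  return 'human';
}
```

Anything that does not match a known pattern falls through to "human", which is the value the blocking logic in Step 6 keys on.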

Step 3: Create the AI bot trigger

Create a Custom Event trigger:

  1. Trigger type: Custom
  2. Fire on: All events (or specifically on Page View events)
  3. Condition: AI Bot Classifier does not equal "human"

Name it: Trigger - AI Bot Detected

Step 4: Create the GA4 event tag

Create a new GA4 Event tag:

  1. Tag type: Google Analytics: GA4 Event
  2. Measurement ID: Your GA4 property
  3. Event name: ai_bot_visit
  4. Event parameters: bot_name: {{AI Bot Classifier}}, page_path: {{Page Path}}, content_type: (optional, derived from URL pattern or a lookup table mapping paths to content categories)
  5. Trigger: Trigger - AI Bot Detected
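The optional content_type parameter can come from a simple path-to-category lookup. A sketch, where the path prefixes and category names are hypothetical and should be replaced with your own site structure:

```javascript
// Hypothetical path-to-category mapping for the content_type parameter.
// First matching prefix wins; unmatched paths fall back to "other".
const CONTENT_TYPE_RULES = [
  { prefix: '/blog/', type: 'blog' },
  { prefix: '/guides/', type: 'guide' },
  { prefix: '/products/', type: 'product' },
];

function deriveContentType(pagePath) {
  const rule = CONTENT_TYPE_RULES.find(r => pagePath.startsWith(r.prefix));
  return rule ? rule.type : 'other';
}
```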

Step 5: Set up GA4 custom dimensions

In your GA4 property (not GTM):

  1. Go to Admin > Data Display > Custom Definitions
  2. Create custom dimensions: Dimension name: Bot Name | Scope: Event | Event parameter: bot_name. Dimension name: Content Type | Scope: Event | Event parameter: content_type
  3. Optionally create a custom metric. GA4 custom metrics are registered against numeric event parameters, not event names, so send a numeric parameter with each event (for example ai_bot_visit_count: 1) and register it: Metric name: AI Bot Visits | Scope: Event | Event parameter: ai_bot_visit_count | Unit: Standard. In most reports, Event count filtered to ai_bot_visit is sufficient on its own.

Step 6: Prevent human analytics pollution

Add an exception to your existing human-facing GA4 tags:

  1. On every GA4 tag that fires for human visitors, add a blocking trigger
  2. Exception trigger condition: AI Bot Classifier does not equal "human". The exception fires for bot requests, which prevents the human-facing tag from recording them
  3. This ensures your standard GA4 data remains clean

Step 7: Test with Stape debugger

  1. Open Stape's sGTM preview mode
  2. Use a tool like curl with a spoofed User-Agent header to simulate an AI bot request: curl -H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0)" https://yoursite.com/target-page
  3. Verify in the debugger that the ai_bot_visit event fires with the correct bot_name dimension
  4. Check GA4 Realtime report for the incoming event

Building the AI Popularity dashboard in GA4

Once data starts flowing, build a custom exploration in GA4 to visualise AI bot activity.

Exploration 1: Pages Most Scraped by AI

  • Technique: Free-form
  • Dimensions: Page path, Bot Name
  • Metrics: Event count (ai_bot_visit)
  • Sort: Event count descending
  • Filter: Event name equals ai_bot_visit

This shows which pages AI systems find most valuable. Cross-reference with your GEO monitoring tool to see if high-scrape pages also appear in AI-generated answers.

Exploration 2: Bot Activity Over Time

  • Technique: Line chart
  • Dimensions: Date, Bot Name
  • Metrics: Event count
  • Breakdown: Bot Name

This reveals trends. Is ChatGPT scraping more frequently after a content update? Did Perplexity stop visiting after a robots.txt change? These patterns inform your content and crawl access strategy.

Exploration 3: AI Scraping vs Human Engagement Correlation

  • Technique: Free-form
  • Dimensions: Page path
  • Metrics: ai_bot_visit count, sessions (human), engaged sessions, conversions
  • Segment: Compare bot data with human behaviour on the same pages

The question this answers: do pages that AI systems scrape frequently also perform well with human visitors? If the correlation is strong, AI scraping frequency becomes a proxy for content quality.

Optional: Looker Studio dashboard

For ongoing reporting, connect GA4 to Looker Studio and build a dedicated AI Popularity dashboard. Share it with stakeholders who need to understand how AI systems interact with the brand without navigating GA4 directly.

Security risks of sharing server log data with third parties

Measuring AI bot traffic generates data that has strategic value. Before piping it into third-party platforms, understand what you are exposing.

What the data reveals

Server-side bot logs contain:

  • URL patterns that map your content architecture. Competitors with access to this data can reverse-engineer your site structure, identify your highest-value pages, and target the same topics.
  • Crawl frequency by page. This reveals which content AI systems consider authoritative. That is competitive intelligence.
  • Bot identity distribution. Knowing which AI platforms prioritise your content tells you where your brand appears in AI-generated answers, and where it does not.

Who sees this data

  • GA4 (Google). Google already knows your traffic patterns. Sending bot data to GA4 gives Google additional signal about how AI systems interact with your content. Whether Google uses this internally is opaque.
  • GEO monitoring platforms (Peec.ai, Otterly.ai, Gauge, Siftly). These platforms aggregate visibility data across clients. Your crawl patterns become part of their benchmarking dataset. Read the data processing terms carefully.
  • Stape.io. As the server-side proxy, Stape processes every request. Their data handling and retention policies matter.

Mitigation

Keep raw server logs internal. Send aggregated, classified events to GA4 (bot name + page path is enough, no need to forward full headers or IP data). Use your own GA4 property for sensitive analysis rather than vendor dashboards. Apply data retention limits.
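The "share enough to measure" principle can be enforced mechanically before events leave your infrastructure. A sketch of that redaction step, where the raw log field names are illustrative:

```javascript
// Reduce a raw server log record to the minimum needed for measurement.
// IP address, full header set, and query strings never leave the server.
function toSharedEvent(rawLog) {
  const url = new URL(rawLog.url);
  return {
    bot_name: rawLog.botName,
    page_path: url.pathname,  // drops query strings and fragments
    // deliberately omitted: rawLog.ip, rawLog.headers, rawLog.userAgent
  };
}
```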

The principle: share enough to measure, not enough to expose your strategy.

Browser-level AI request detection: the signal nobody is talking about

This section is speculative, but the infrastructure already exists.

What Chrome already knows

Google Chrome captures every network request made by a page. Through the Chrome User Experience Report (CrUX), Google already aggregates real-user performance data and uses it as a ranking signal via Core Web Vitals. This established a precedent: browser-observed behaviour informs search rankings.

Chrome's Privacy Sandbox and Topics API classify browsing behaviour into interest categories. The technical architecture for observing, classifying, and reporting network-level signals is already deployed at scale.

The AI agent scenario

Now consider what happens when AI agents browse the web through Chrome or Chromium-based environments. Tools like ChatGPT's browsing mode, Perplexity's real-time search, and Google's own Gemini all make web requests. Some use headless Chromium. Others use custom HTTP clients.

Chrome could, in theory, detect when a page receives requests from AI agents and classify this as a signal. If a domain is frequently accessed by AI systems across multiple contexts, that suggests the domain is authoritative, frequently cited, and useful to AI-driven discovery.

Why this matters for brand equity

If Google combines CrUX-style data collection with AI request observation, "AI popularity" becomes a measurable, browser-verified signal. Not something you self-report. Not something a third-party tool estimates. A signal derived from actual infrastructure-level observation of how AI systems interact with your content.

This is not confirmed. Google has not announced anything of this nature. But the components exist: the browser instrumentation, the data pipeline, the ranking system that accepts new signals. The question is whether they connect them, not whether they can.

For businesses, the implication is straightforward. The data you collect today about AI bot interactions could be an early indicator of a signal that search engines formalise tomorrow. Start measuring now.

The GEO monitoring landscape and the integration gap

Several platforms now track AI visibility. Peec.ai monitors brand mentions across LLMs with real-time recommendations. Otterly.ai tracks citations across six AI platforms with competitive benchmarking. Gauge measures citation frequency and converts data into content workflows. Siftly tracks citation patterns across major AI engines with brand perception analysis.

These tools answer the question: "Is my brand being cited in AI answers?"

None of them answer the question: "Which of my pages are being scraped, how often, and by which AI systems?"

That is the gap. Citation monitoring tells you the output. Server-side bot measurement tells you the input. Connecting the two creates a complete picture: which pages are being ingested (scraping data), and which of those pages actually surface in AI answers (citation data).

The platform that bridges this gap, linking server-side crawl analytics with citation monitoring, will own the GEO measurement category. Until that platform exists, the approach described in this guide gives you the input side of the equation using tools you already have.

What to do now

  1. Audit your robots.txt. If you are blocking GPTBot, ClaudeBot, or PerplexityBot, reconsider. Blocking them removes you from AI-generated answers.
  2. Review your Cloudflare and Stape bot rules. Many businesses have blanket bot blocking enabled. Whitelist known AI crawlers.
  3. Set up server-side AI bot measurement. Follow the GTM configuration above. The setup takes a few hours and the data starts flowing immediately.
  4. Build the GA4 dashboard. Start with the three explorations described. Even a month of data reveals which pages AI systems value.
  5. Cross-reference with a GEO monitoring tool. Compare your scraping data with citation appearance. The correlation will tell you whether your content strategy is working for AI discovery.
  6. Start the conversation internally. Most business owners and marketing leads have no idea this is happening. Show them the data. The AI Popularity dashboard is often the first time they realise their content is being consumed by systems they have never measured.

AI bot traffic is not a problem to solve. It is a dataset to build on. The businesses that measure it now will have months of signal when the rest of the market catches up.


Book a free call. We’d love to talk growth.

All engagements start with a free 30-minute discovery call before proceeding to more formal diagnostics of the business, marketing initiatives, and growth targets.