Automated GEO A/B Testing with a Multi-Agent System: Boost AI Citations Before Every Model Release
By Sam Qikaka
Category: Models & Releases
Learn how to build a LUMOS multi-agent system that automatically runs A/B tests on your knowledge base content—varying headings, structured data, and examples—and measures citation lift across ChatGPT, Perplexity, and Gemini before each AI model update cycle.
Why GEO Requires Continuous Experimentation, Not Set‑and‑Forget

Generative engine optimization (GEO) is not a one-time configuration. AI search engines (ChatGPT, Perplexity, Gemini) update their models regularly, shifting how they surface and prioritize content. For enterprise B2B teams, a knowledge base that performed well in March may see citation drops after a May model update. Yet many operations leaders treat GEO as a static SEO task: add structured data, write clear headings, and hope for the best.

In reality, citation behavior varies by platform and over time. ChatGPT (GPT‑4o as of May 2026) often favors authoritative, well‑structured FAQ content. Perplexity’s citation patterns lean toward recent, example‑rich articles. Gemini (currently version 2.5) blends multiple sources and penalizes unsupported claims. Without continuous experimentation, you cannot know which formatting or phrasing works best for each engine.

That’s where automated A/B testing comes in. By building a multi-agent system that generates variants, monitors results, and recommends winners, you can systematically improve your GEO performance and lock in gains before the next model release.

The LUMOS Multi‑Agent Architecture for GEO

LUMOS is a multi-agent platform designed for orchestrating autonomous AI workflows. For GEO optimization, we deploy three specialized agents:

- Experiment Agent: Generates content variants based on controlled changes to headings, structured data, example formats, and entity mentions.
- Monitoring Agent: Continuously queries AI search engines (via their chat APIs or web interfaces) to collect citation data: rate, position, and sentiment.
- Recommendation Agent: Analyzes experiment results using statistical significance tests and selects the best‑performing variant for each engine.

These agents communicate through a shared message bus and a central knowledge base (versioned content store). A typical workflow:

1. Experiment Agent receives a content chunk (e.g., a product description) and creates 3–5 variants.
2. It pushes variants to a staging area and signals the Monitoring Agent to start tracking.
3. Monitoring Agent runs daily queries to ChatGPT, Perplexity, and Gemini, logging citations per variant.
4. After a defined period (e.g., 7 days), the Recommendation Agent pulls the logs, computes lift metrics, and flags statistically significant winners.
5. The winning variant is promoted to production; losers are archived for future reference.

Pseudo‑code for agent coordination:

Designing A/B Test Variants for Your Knowledge Base Content

To create meaningful experiments, focus on variables that directly influence AI citation behavior:

Headings and Structure

- Use a
clear H1–H3 hierarchy vs. a flat structure.
- Include or exclude summary sections at the top.
- Test different question‑driven headings (e.g., "What is X?" vs. "X explained").

Structured Data

- Apply schema.org types like , , , .
- Vary the inclusion of for entity recognition.
- Test different property combinations (e.g., length, ).

Example Formats

- Real‑world case studies vs. generic illustrative examples.
- Short bullet lists vs. numbered steps.
- Including data tables vs. prose only.

Entity Mentions

- Explicitly naming competitors vs. generic references.
- Brand mentions (internal vs. external).
- Location and date specificity.

Each variant should change only one variable at a time so that any change in citations can be attributed to that variable. For instance, keep the same body text and swap only the heading pattern. The Experiment Agent uses a template engine to generate these variants systematically.

Monitoring Citation Lift Across ChatGPT, Perplexity, and Gemini

The Monitoring Agent is the backbone of the system. It executes daily probes using the public APIs of each AI engine (or, if unavailable, via headless browser queries). The core metrics:

- Citation Rate: Percentage of queries where the content appears in the AI’s response. Measured per engine, per variant.
- Citation Position: First, second, third mention, or beyond. Earlier positions are more valuable.
- Sentiment: Positive, neutral, or negative framing by the AI. (Requires NLP analysis of the generated text.)

Data collection frequency: at least once every 24 hours, and more frequently (every 12 hours) during critical test windows, such as the week after a model release. The Monitoring Agent stores all results in a time‑series database so they can be filtered by engine, variant, and date.

To avoid bias, queries are randomized and include a mix of long-tail and branded prompts. The agent also tracks the AI’s specific model
version (e.g., ) so you can correlate changes with updates.

Selecting Winning Variants with the Recommendation Agent

Once enough data accumulates (typically 5–7 days for a stable result), the Recommendation Agent runs statistical tests. The preferred method is a two‑sample proportion test (or chi‑squar