Customer Support AI Model Comparison: GPT-5 vs Claude 4 vs Gemini 2.0 – A 10,000-Query Case Study

By Sam Qikaka

Category: Models & Releases

Discover how GPT-5, Claude 4, and Gemini 2.0 compare in accuracy, latency, and cost per ticket for e-commerce customer support. Our 10,000-query simulation reveals tradeoffs and shows how LUMOS multi-agent orchestration cuts costs by 22% while preserving accuracy.

Introduction: The Challenge of Choosing an AI for E-Commerce Support Selecting the right large language model (LLM) for customer support is a high-stakes decision for e-commerce operations. Every query answered incorrectly risks customer trust, while slow responses lead to abandoned carts. At the same time, per-ticket cost must be kept low to sustain margins. With leading models like GPT-5, Claude 4, and Gemini 2.0 each offering different strengths, which one should your team bet on? To answer that question, we conducted a controlled simulation using LUMOS—a multi-agent orchestration platform—on a dataset of 10,000 real-world e-commerce support queries. The goal was to measure each model’s response accuracy, average latency, and cost per resolved ticket. The results reveal clear tradeoffs: no single model is best for all dimensions. But with dynamic routing, you can capture the advantage

s of each. Methodology: Simulating 10,000 Real Support Queries with LUMOS The simulation was designed to mimic a mid-sized e-commerce helpdesk handling returns, order status, payment issues, product questions, and account support. All 10,000 queries were anonymized actual customer interactions from a partner retailer, covering both simple (password reset) and complex (multi-item return dispute) cases. We deployed three models via the LUMOS platform: GPT-5 (OpenAI) – latest flagship reasoning model, optimized for instruction following. Claude 4 (Anthropic) – focused on safety and nuanced understanding. Gemini 2.0 (Google DeepMind) – designed for multimodal tasks and fast inference. Each query was processed independently by all three models. LUMOS recorded: Accuracy : percent of responses that resolved the query without requiring escalation (verified by human reviewers). Latency : average

end-to-end response time from query submission to output generation. Cost per ticket : total compute cost (API calls, inference) divided by number of queries, rounded to nearest tenth of a cent. The simulation ran in May 2026 under stable API conditions. No prompt engineering or fine-tuning was applied beyond a standard system message used by all models. Benchmark Results: Accuracy, Latency, and Cost per Ticket Here is the summary of how the three models performed across the 10,000-query dataset: Model Accuracy Average Latency Cost per Ticket :--------- :------- :-------------- :-------------- GPT-5 94% 1.2 seconds $0.012 Claude 4 96% 2.5 seconds $0.009 Gemini 2.0 88% 0.8 seconds $0.008 These numbers reveal a clear tension. Claude 4 leads in accuracy but is twice as slow as GPT-5 and three times slower than Gemini 2.0. Gemini 2.0 is the fastest and cheapest but lags behind in accuracy. G

PT-5 sits in the middle on all three metrics. Model-by-Model Analysis: GPT-5, Claude 4, Gemini 2.0 GPT-5 With 94% accuracy and 1.2-second latency, GPT-5 offers a balanced profile. It excels at following complex instructions and handling multi-turn conversations. Its cost of $0.012 per ticket is moderate. In our simulation, GPT-5 performed well on payment and account-related queries where clear logical steps are required. Claude 4 Claude 4 achieved the highest accuracy (96%) and was particularly strong on nuanced queries involving refund policy exceptions and sensitive customer interactions. However, its 2.5-second average latency makes it less suitable for high-traffic, real-time support where every second counts. At $0.009 per ticket, it is cheaper than GPT-5 but more expensive than Gemini 2.0. Gemini 2.0 Gemini 2.0 delivered the fastest responses (0.8 seconds) and the lowest cost ($0.0

08 per ticket). Its 88% accuracy, while still high, means nearly 1 in 10 queries may require human escalation. It worked best on straightforward, routine queries such as order tracking and simple FAQs. The Tradeoffs: No Single Model Dominates Our simulation confirms that each model has a distinct sweet spot. Choosing a single model for all queries forces a compromise: Prioritize accuracy above all? Claude 4 is your best bet, but you pay for it with higher latency and moderate cost. Need speed and low cost? Gemini 2.0 is ideal, but you may need to accept more escalations. Want a balanced performer? GPT-5 is a solid all-rounder, but it is neither the cheapest nor the most accurate. For e-commerce support leaders, the best approach is not to commit to a single model. Instead, use a multi-agent orchestration layer that can route each query to the most suitable model based on priority rules.

Dynamic Routing with LUMOS: How Multi-Agent Orchestration Reduces Cost LUMOS enables dynamic model assignment by evaluating each incoming query against configurable criteria—such as complexity, sentiment, required accuracy, or latency budget. For example: A simple order status query can be routed to