Small Models Accuracy Per Dollar: Chaining Mini LLMs for Classification, Routing, and Extraction

By Sam Qikaka

Category: Models & Releases

Discover how small and mini models deliver superior accuracy per dollar for structured tasks like classification, routing, and extraction. Learn strategies for chaining them in LUMOS-style multi-agent workflows to outperform costly frontier LLMs.

Rise of Small and Mini Models for Structured Tasks In 2026, enterprise AI operations are shifting toward efficiency without sacrificing performance. Small Language Models (SLMs) and "mini" variants—typically 1-20 billion parameters—are proving ideal for structured tasks like classification, routing, and extraction. These tasks power RAG pipelines, agent workflows, and data processing in multi-agent platforms like LUMOS. Unlike frontier LLMs (e.g., GPT-5 class or Claude Opus equivalents), SLMs excel where precision and speed matter more than open-ended creativity. Research from arXiv (e.g., studies on SLM agentic workloads) shows they handle function calling, schema adherence, and tool use with lower latency and energy use. For B2B leaders, this means optimizing costs in high-volume operations: why pay for 1T+ parameter reasoning when a 7B model classifies intents at 90%+ accuracy? The tr

end? SLM-default systems with LLM fallbacks, managed by uncertainty routing. This hybrid approach aligns with enterprise jobs-to-be-done: cost-optimized classification for customer support routing or entity extraction in compliance workflows. Key Vendors and Their Mini/SLM Offerings Leading vendors now prioritize mini/SLM SKUs for production. Here's a rundown of exact model IDs from official docs, focused on classification, routing, and extraction suitability: OpenAI : (optimized for high-volume tasks like classification and tool calling). As of May 11, 2026, per OpenAI's pricing page (platform.openai.com/docs/models), it's designed for sub-agent roles in chains. Anthropic : mini variants or (exact ID: ). Anthropic's docs (docs.anthropic.com) highlight Haiku for low-latency routing and extraction. Google : or (per ai.google.dev). Flash models shine in multimodal extraction with efficient

tokenization. Meta : Llama-3.2 1B/3B (open-weights via Hugging Face or Meta AI). Ideal for on-prem routing in edge setups. Mistral : (12B, open-weights) or API . Strong in structured outputs per Mistral's platform docs. Other players like DeepSeek ( ) and Alibaba Qwen minis offer competitive open options. Always verify latest SKUs on vendor sites—e.g., OpenAI's model explorer or Anthropic's API reference—as releases evolve rapidly. Accuracy Per Dollar: Benchmarking Classification and Extraction Accuracy per dollar = (task accuracy %) / (cost per inference). To compute: benchmark accuracy on datasets like GLUE (classification) or SQuAD (extraction), then divide by vendor list price. Methodology : Use official benchmarks: Hugging Face Open LLM Leaderboard or vendor evals (e.g., OpenAI's scores 82% on MMLU subsets for classification). Cost: Input/output tokens × price per 1M tokens. E.g.,

for , as of May 11, 2026, OpenAI lists $0.15/1M input, $0.60/1M output (platform.openai.com/pricing). A 1K-token classification prompt costs $0.000075. Compare: at Google's rates ( $0.075/1M input per ai.google.dev/pricing) yields higher dollars per accuracy point on extraction if latency-tuned. Vendor-specific insights (hedged to official sources): OpenAI : Tops classification (e.g., 95% on intent detection per user evals), 10x cheaper than . Anthropic : Extraction accuracy rivals Sonnet at 1/5th cost (per Anthropic's tool-use evals). No static tables here—calculate dynamically via APIs like OpenRouter for real-time quotes, but prioritize primary vendor pages. ArXiv papers confirm SLMs hit 85-95% on structured tasks, making accuracy per dollar 5-20x frontier models for volume workloads. Routing Efficiency with Small Models Model routing—directing queries to the right specialist—benefits

hugely from minis. Use a small model as router: classify query type (e.g., "extraction? coding? chat?") with 98% accuracy at <100ms. Why SLMs? Low token burn: A 50-token routing prompt on costs <$0.00001 vs. $0.001+ for frontier. Efficiency: Mistral Nemo routes with uncertainty scores (e.g., logit thresholds), fallback to LLMs only 10% of the time. Benchmarks: Papers on SLM routing (arxiv.org) show 2-5x cost savings in RAG agents. In LUMOS platforms, route to for extraction, for classification—total cost under $0.001/query. Chaining Two Small Models vs. One Frontier Call Chaining: Router (mini1) → Task specialist (mini2). Vs. single frontier call. Pros of chaining : Cost: 2× $0.0001 = $0.0002 vs. $0.01+ for . Accuracy: Specialization boosts (e.g., Haiku classify + Nemo extract = 97% end-to-end). Latency: Parallelizable, <500ms total. Cons : Complexity: Error propagation (mitigate with r

etries). Token overhead: 20% more, but still cheaper. When chaining wins : High-volume ( 1K queries/day), structured tasks. Frontier for creative reasoning. Example: LUMOS agent classifies support ticket ( ), extracts entities ( ), saves 80% vs. single call. Per arXiv SLM chaining studies, ROI hits