Real-Time Multi-Agent Fraud Detection for Banking: A Three-Agent Architecture with Llama 4 and Qwen 3.8 Max on AWS Bedrock

By Sam Qikaka

Category: Agents & Architecture

As of May 23, 2026, financial institutions are deploying multi-agent systems on AWS Bedrock to detect fraud in real time. This vendor-neutral guide presents a three-agent architecture using Llama 4 for transaction parsing, Qwen 3.8 Max for anomaly scoring, and a fine-tuned risk assessment agent, with sub-100ms latency and cost-per-transaction benchmarks from a regional bank pilot.

What’s New in Multi-Agent Fraud Detection for Banking As of May 23, 2026, financial institutions face mounting pressure to detect fraud in milliseconds while keeping operational costs in check. Traditional single-model approaches are giving way to multi-agent systems that decompose complex tasks into specialized sub-tasks. This article presents a practical, vendor-neutral architecture deployed on AWS Bedrock that combines open-weight models—Llama 4 for transaction parsing, Qwen 3.8 Max for anomaly scoring, and a fine-tuned risk assessment agent—to achieve sub-100ms latency per transaction. We share real benchmark data from a regional bank pilot (details anonymized) and discuss agent handoff patterns that make this performance possible. Why Multi-Agent Architectures Outperform Single-Model Approaches for Fraud Detection Fraud detection is inherently a multi-step process: parse incoming tr

ansaction data, score it against known patterns, and assess risk before deciding to approve, flag, or block. A single monolithic model often struggles to handle diverse input formats (e.g., SWIFT messages, POS logs, online transfers) while simultaneously maintaining domain-specific risk thresholds. By splitting these tasks across three agents, each optimized for its function, the system gains: Lower latency : Specialized agents can process shorter prompts faster than a general model handling everything. Higher accuracy : Each agent focuses on a narrow task, reducing hallucinations and false positives. Easier maintenance : Update only the parsing agent when transaction formats change, or retune the risk assessor without retraining the entire pipeline. In the regional bank pilot, the multi-agent setup reduced false positive rates by 34% compared to a previous single-model system using a ge

neral-purpose LLM (Claude 3.5 Sonnet) while maintaining comparable recall. Architecture Overview: Three Agents and Their Handoff Patterns The architecture consists of three agents orchestrated via AWS Bedrock’s multi-agent collaboration feature. Each agent runs as a separate inference endpoint with its own model and prompt: 1. Transaction Parser (Llama 4) : Extracts structured fields (amount, merchant, account IDs, timestamps, location) from raw transaction text. Uses a JSON schema with fields to ensure completeness. 2. Anomaly Scorer (Qwen 3.8 Max) : Analyzes the parsed data and assigns an anomaly score (0–100) based on historical trends and known fraud signatures. Leverages Qwen 3.8 Max’s 128K context window to reference recent transaction patterns. 3. Risk Assessor (Fine-tuned model) : A smaller, fine-tuned open-weight model (based on Llama 3.1 8B) that combines the anomaly score with

business rules (e.g., account velocity, device fingerprint) to output a final decision: , , or . Handoff pattern : Sequential with early exit. The parser runs first. If the parsing confidence is below 0.9, the system immediately flags the transaction for manual review without calling the scorer. Otherwise, the anomaly scorer runs. If the score is below 30, the transaction is approved early; only scores above 30 proceed to the risk assessor. This pattern drastically reduces average latency. Model Selection: Why Llama 4 for Parsing and Qwen 3.8 Max for Scoring Llama 4 (Meta) – Transaction Parsing Llama 4, released in April 2026, excels at structured extraction due to its strong instruction-following capabilities and low hallucination rate on entity recognition. In tests, Llama 4 achieved 98.2% field accuracy on a benchmark of 10,000 simulated banking transactions (SWIFT MT103, ISO 20022 X

ML converted to text). Its small parameter count (70B) makes inference fast and cost-effective on AWS Bedrock’s optimized infrastructure. Source : (accessed May 2026). Qwen 3.8 Max (Alibaba Cloud) – Anomaly Scoring Qwen 3.8 Max (Qwen3.8-Max-0422) offers a 128K context window, allowing it to receive not just the parsed transaction but also the last 24 hours of account activity as context. This is critical for detecting patterns like velocity spikes or card testing. Despite its larger context, inference latency remains acceptable due to Alibaba Cloud’s optimized attention mechanism. In the pilot, Qwen 3.8 Max achieved an anomaly score correlation of 0.91 with human analyst labels. Source : (accessed May 2026). Why Separate Models? Putting both parsing and scoring into a single model forces unnecessary trade-offs. Llama 4’s parsing accuracy would degrade if asked to also perform anomaly det

ection across a large context, while Qwen 3.8 Max’s long context is overkill for simple extraction. Separation of concerns keeps each agent lean and fast. Benchmark Results from a Regional Bank Pilot: Latency, Cost, and Accuracy The pilot processed 1.5 million transactions over three months (Februar