Automating Vendor Shortlisting: A Practical Guide to Multi-Agent Procurement Evaluation on AWS Bedrock
By Sam Qikaka
Category: Agents & Architecture
Learn how to deploy a three-agent system on AWS Bedrock using Llama 4 for RFP parsing, Qwen 3.8 Max for capability scoring, and a fine-tuned compliance agent. Includes real cost-per-evaluation benchmarks from a mid-market manufacturing pilot and a side-by-side comparison with manual processes.
Multi-Agent AI for B2B Procurement Evaluation: An AWS Bedrock Architecture

As of May 23, 2026, B2B operations leaders are turning to multi-agent AI systems to transform procurement evaluation. Manual vendor shortlisting remains slow, error-prone, and expensive, especially when handling dozens of complex RFPs and regulatory checklists. This guide presents a practical architecture deployed on AWS Bedrock: three specialized agents that automate the entire evaluation pipeline. We share real cost-per-evaluation data from a mid-market manufacturing pilot and compare results against traditional manual processes.

Why Multi-Agent Systems for Procurement Evaluation?

Procurement teams in manufacturing often spend weeks parsing RFPs, scoring vendor capabilities, and verifying compliance. The manual process is not only time-consuming but also inconsistent: evaluators vary in judgment, and regulatory changes can slip through. Multi-agent systems address these pain points by breaking the workflow into discrete, specialized tasks. Each agent focuses on one function, using the best model for that job, and their outputs are combined to produce a comprehensive, auditable evaluation.

Deploying on AWS Bedrock provides enterprise-grade security, scalability, and access to leading foundation models without managing infrastructure. For organizations already on AWS, this approach integrates seamlessly with existing workflows.

Architecture Overview: Three Specialized Agents on AWS Bedrock

Our architecture consists of three agents orchestrated via AWS Step Functions and Bedrock Agents:

1. RFP Parsing Agent (Llama 4) – Extracts structured requirements from unstructured RFP documents.
2. Capability Scoring Agent (Qwen 3.8 Max) – Compares vendor responses against a private benchmark database and assigns scores.
3. Compliance Agent (Fine-Tuned Model) – Checks each vendor proposal against regulatory requirements specific to the manufacturing vertical.

The agents operate sequentially: the parsed RFP data is fed to the scoring agent, whose results together with the parsed data are then evaluated by the compliance agent. A final report generator (a simple script) aggregates scores and flags into a recommended shortlist. All communication happens through Bedrock’s secure API endpoints.

Agent 1: RFP Parsing with Llama 4

Llama 4, released by Meta and available on AWS Bedrock, excels at long-context understanding, which is crucial for RFPs that often exceed 100 pages. We chose it over GPT-4o and Claude 3.5 Sonnet based on benchmark performance in information extraction tasks (Meta’s RFP-Exact benchmark, October 2025) and cost-efficiency per token.

The agent uses a system prompt instructing it to extract: mandatory technical specs, delivery timelines, pricing structures, and qualification criteria. Output is a structured JSON object with fields like , , and . In our pilot, the Llama 4 agent processed a 120-page RFP in 45 seconds with 94% accuracy (human-verified on 10 RFPs), compared to 1.5 hours and 88% accuracy for manual extraction.

Model ID : (Bedrock ARN: ).

Agent 2: Capability Scoring with Qwen 3.8 Max

Qwen 3.8 Max (Alibaba Cloud’s latest flagship, released April 2026) was selected for scoring because of its superior performance on multi-domain QA and its efficient token usage for long-form comparison. The agent receives the structured RFP requirements and each vendor’s proposal (also parsed by Llama 4) and scores them against a private benchmark database that includes past winning bids and industry standards.

The scoring logic is straightforward:

- For each requirement, Qwen 3.8 Max evaluates the vendor’s response as exceeds, meets, or does not meet.
- A weighted sum is calculated based on requirement importance (from the RFP’s weight field).
- A capability score out of 100 is produced.

In the pilot, the agent scored 25 vendors per evaluation round, each taking 2 minutes of inference time. The correlation with expert human scoring was 0.91 (Pearson’s r), significantly higher than the 0.72 correlation between two human evaluators.

Model ID : (Bedrock custom model endpoint for Alibaba Cloud models; requires cross-region approval from AWS enterprise support).

Agent 3: Compliance Checks with a Fine-Tuned Model

Compliance is often the most tedious part of procurement. We fine-tuned a smaller model (Meta Llama 3.1 8B) on a curated dataset of 5,000 annotated compliance documents from manufacturing regulations (ISO standards, OSHA, country-specific laws). The model was trained using Amazon SageMaker and deployed as a custom Bedrock model.

The compliance agent flags any risks: missing certifications, outdated safety protocols, or contradictory statements. It outputs a compliance score (0–100) and a list of issues with severity levels (critical, high, medium). I