AWS Bedrock Pricing Models: On-Demand vs Provisioned Throughput Economics & Model Selection Decision Tree

By Sam Qikaka

Category: Models & Releases

Explore AWS Bedrock's pricing models, comparing on-demand flexibility with provisioned throughput savings for enterprise AI. Our decision tree helps B2B leaders select optimal models like Claude and Llama for RAG and agent workloads.

Amazon Bedrock Model Menu Overview (Nova, Anthropic, Meta) Amazon Bedrock offers access to over 100 foundation models from leading providers. This includes Amazon's own Nova and Titan families, Anthropic's Claude series, Meta's Llama models, and others like DeepSeek and Moonshot AI. As of May 13, 2026 (according to AWS Bedrock pricing documentation at aws.amazon.com/bedrock/pricing/), key model IDs include: Amazon Titan: Amazon Nova: (the new multimodal flagship) Anthropic Claude: (recent Opus 4.7 update), , Meta Llama: , This unified API supports Responses, Messages, and Chat Completions APIs, making it ideal for enterprise operations integrating RAG pipelines or agentic workflows on platforms like LUMOS. On-Demand vs. Provisioned Throughput: Core Differences AWS Bedrock provides two primary pricing modes: On-Demand: This is a pay-per-use model based on input and output tokens. It requi

res no commitments and scales automatically. It's best suited for variable, low-to-medium workloads. Provisioned Throughput (PT): This mode involves an hourly commitment for dedicated throughput, measured in tokens per minute (TPM). It offers discounts of up to 50% on unit costs for predictable, high-volume needs. PT requires 1- or 6-month terms and supports model-specific reservations. The core formula for PT commitment is: Select a TPM level (e.g., 1,000–1,000,000+ TPM) and pay an hourly rate determined by . According to the AWS pricing page (as of May 13, 2026), PT locks in capacity, which helps reduce latency spikes for agents. Mode Commitment Scaling Use Case :--------------- :-------------- :-------- :-------------------- On-Demand None Auto Testing, bursts Provisioned Throughput Hourly TPM Fixed Steady RAG/agents Unit Economics: Input/Output Costs and Throughput Impact The unit ec

onomics change significantly between the two modes. Here's how to define them: On-Demand Cost per 1M Tokens: Input: $C {in}$ / 1M tokens Output: $C {out}$ / 1M tokens (typically 2-4x the input cost) Provisioned Throughput Unit Cost: Hourly Rate: $H$ (e.g., for 100k TPM) Effective Cost per Input Token: $(H \times 60) / (TPM {in} \times 1M)$ Effective Cost per Output Token: $(H \times 60) / (TPM {out} \times 1M)$ Here, $TPM {in}$ and $TPM {out}$ reflect model-specific ratios, as output often has a lower TPM. For a RAG query with 10k input and 1k output tokens: On-Demand Total: $(10k / 1M \times C {in}) + (1k / 1M \times C {out})$ PT Total: Hourly rate prorated to utilization. At 80% utilization, PT can yield savings of 30-50% (according to AWS documentation from May 2026). Track these costs via AWS Cost Explorer for IAM-tagged LUMOS deployments. Pricing Breakdown for Key Models (Claude, Ll

ama, Titan) The following pricing is based on the AWS Bedrock console/pricing page as of May 13, 2026 (for us-east-1; this excludes taxes and volume discounts). Always verify the latest pricing at aws.amazon.com/bedrock/pricing. On-Demand Examples (per 1M tokens): Claude Opus 4 ( ): Input $15, Output $75 Claude Sonnet 3.5 ( ): Input $3, Output $15 Llama 3.2 405B ( ): Input $5, Output $20 Titan Text Premier ( ): Input $2.50, Output $12.50 Nova Pro ( ): Input $4, Output $18 (multimodal) Provisioned Throughput Examples (based on a 1-month term and 100k TPM): Claude Opus 4: Approximately $4.50/hour Llama 3.2 405B: Approximately $2.80/hour The effective unit costs decrease with higher TPM levels (e.g., 1M TPM results in a 40% lower $/hour rate). Batch inference offers an additional 50% discount on on-demand pricing. Decision Tree: Choosing Model Profiles by Workload Use this textual decision

tree to help select the right model and throughput for your specific workload. You can visualize this in tools like Lucidchart for Bedrock console export. For LUMOS RAG, it's recommended to route queries with contexts larger than 5k tokens to PT Llama, and agentic tasks to Opus. Latency, Scale, and Cost Tradeoffs for RAG/Agents Provisioned Throughput guarantees TPM, significantly reducing p99 latency. For example, Claude Opus on-demand might have a Time To First Byte (TTFT) of 2-5 seconds, whereas PT can achieve less than 1 second at scale. For RAG (combining LUMOS vector search with LLM processing): Cost: On-demand is cost-effective for fewer than 1k queries per minute (QPM). PT becomes more economical at 10k+ QPM. Agents: Claude's tool-calling capabilities are excellent for agents. Using PT avoids throttling issues. The primary tradeoff with PT is the minimum 1-month commitment, which

can cost hundreds of dollars per hour. However, the Return on Investment (ROI) can be realized at around 50% utilization. Monitor performance and costs using Bedrock Metrics. Migration Tips and LUMOS Integration Best Practices Migration Strategy: Begin with on-demand pricing. Monitor costs using Cos