Building an AI chat that is never *confidently wrong*


Surface	A chatbot on the shop's official chat channel
Answers about	Order status · delivery · product info · pricing · promotions
Sources	Order system, catalogue, pricing, promo engine — many systems
Users	General public, in Thai and English
Hard constraint	Every price / stock / status / promo it states must be true right now

	Tool-calling	Retrieval (RAG)
Answers	"What is X right now?"	"What do our docs say about X?"
Source	A live system (an API/DB)	A corpus of text, chunked & indexed
Data	Structured, current, authoritative	Unstructured, may be stale
Use for	order status, price, stock, promos	policy, FAQ, returns, "how do I…"

Layer	Choice	Why it's here
Model access	Amazon Bedrock — Claude `anthropic.claude-opus-4-8`	Claude via IAM, inside the AWS perimeter — no raw API keys
Environments	One AWS account per env (dev / uat / prod) under Organizations	Hard blast-radius & data isolation
Compute	ECS Fargate containers	Pragmatic for 5 people; no servers to patch
Secrets / config	Secrets Manager + SSM Parameter Store, per account	Creds & tunables out of the image, per environment
Safety / PDPA	Bedrock Guardrails	PII masking in/out, denied-topic blocking
Retrieval (if RAG)	pgvector on RDS · OpenSearch Serverless · Bedrock KB	Only for policy/FAQ — decided in the working session
Ops observability	Datadog APM + Logs + Metrics	Latency, cost, errors — one pane
LLM quality	Datadog LLM Observability + promptfoo/DeepEval	Conversation traces online; eval gate offline in CI
CI/CD + IaC	GitLab CI/CD + Terraform / CDK	The team already lives in GitLab

	dev	test / CI	uat	prod
Purpose	Build, experiment	Automated checks	Stakeholder sign-off	Serve customers
Data	Synthetic Thai data	Synthetic + fixtures	Prod-like, masked	Real customer data
Sources	Mock tools	Mock + contract tests	Sandbox vendor APIs	Real vendor APIs
Who uses it	Engineers	GitLab pipelines	PO / QA / business	The public
Guardrails	On, permissive	On, asserted in evals	On, prod config	On, strict, alerting
Eval gate	Run locally	Blocks the merge	Full suite + manual UAT	Smoke + canary
Bedrock access	Broad dev role	CI role, scoped	Scoped, prod-like	Least-privilege, audited

Tool	The need it serves
GitLab CI/CD	Lint → test → eval as a merge gate; deploy per environment
Terraform / CDK	The whole stack as reviewable code; reproducible per account
boto3 + Anthropic Bedrock SDK	Call Claude on Bedrock & run the tool-use loop from Python
Bedrock Guardrails	PDPA: PII masking in/out, denied-topic blocking at the model edge
promptfoo / DeepEval	The offline quality gate — golden cases, assertions + LLM-judge
Datadog APM + Logs	Operational truth: latency, errors, cost per request
Datadog LLM Observability	Quality truth: tool-call correctness, groundedness, containment
AWS IAM Identity Center (SSO)	One login, scoped role per account — no static keys on laptops

Concern	Guardrail control
PII in	Customer pastes phone/email/address → masked before the model sees it
PII out	Model output scrubbed so it can't echo back personal data
Off-topic / denied	Block politics, medical/financial advice, anything off-scope
Prompt-injection-ish	Denied topics + refusal reduce "ignore your rules" attempts

Layer	Question	Where	Watch
Operational	Is it up, fast, affordable?	Datadog APM + Logs	latency p50/p95, error rate, cost/conv
Quality	Is it right?	Datadog LLM Observability	tool-call correctness, groundedness, containment

Time	Block	Format
09:00–09:30	Frame the system — agent loop, tool-calling vs RAG, confidently-wrong, the one diagram	Talk + diagram
09:30–11:00	Build a minimal tool-calling agent (`get_order_status`)	Hands-on
11:00–11:45	Prompt & context engineering + PDPA / handoff	Talk + exercise
11:45–12:45	Lunch	—
12:45–14:00	Evals — golden set, break the prompt, wire into GitLab CI	Hands-on
14:00–14:45	Tracing & observability — read a real trace in Datadog	Demo + look-along
14:45–15:45	Architecture working session — your real sources & environments	Whiteboard
15:45–16:00	Decisions & action items	Capture

Welcome. Set the tone in one breath: this is not "intro to LLMs". Five engineers, AWS/GitLab/Python already in hand. By 16:00 each of you will have built a tool-calling agent, gated it with evals in GitLab CI, and read its trace in Datadog. The single idea on the screen is the spine of the whole day — read it aloud, then say: the worst failure for a bot quoting a real price or a real delivery date is not awkward wording, it is stating a wrong number with total confidence. Everything today is engineered to prevent that.

30 minutes, talk + one diagram. Do not drift into model internals or prompt-engineering folklore. The goal of this part is a shared mental model and the one diagram everyone will point at for the rest of the day.

Anchor everyone on the actual project before any abstraction. The key reframe: the value is not the model's "knowledge" — it is correct routing to systems of record plus good phrasing. If the team treats the LLM as the source of truth, the project fails on day one in production. The demo/ repo (DEMO-3, DEMO-6) mirrors exactly these sources as mock tools: orders, products, promotions.

Demystify the word "agent". It is a while-loop with the model in the decide seat. Stress the ownership boundary: the model emits a tool_use block (a request); nothing runs until your code chooses to run it. That boundary is where you put validation, auth, gating, and logging. We build this exact loop by hand in Part 5 (DEMO-4) precisely so nobody treats it as a black box.

Engineers often reach for RAG reflexively. Make the distinction crisp: a delivery ETA is not a document you retrieve, it is a row you query. RAG returns *text that mentions* something; a tool returns *the value*. For prices and order status, "text that mentions" is exactly the path to being confidently wrong. We are not anti-RAG — it is right for the returns policy and FAQ — but it is the wrong default for this domain.

This is the emotional centre of the day. Give a concrete scenario: a customer is told a Songkran promo is still valid, drives to checkout, it is expired — support ticket, refund, trust gone. The expired-promo case is deliberately seeded in the demo dataset (DEMO-6) and the golden evals (DEMO-10). The point: we do not "remind the model to be careful". We remove its ability to invent these values at all.

This is THE diagram (also in one-diagram.md, printed as a poster on the wall). Walk it left-spine top to bottom. Make the three verbs land: ORCHESTRATE, own the FACT, PHRASE. Every later slide — architecture, evals, traces — is a view of this same flow. When anyone asks "where does X live?", point here.

Restate the spine as three enforceable engineering rules, because "be grounded" is a vibe until it is code. Rule 1 becomes a system-prompt line plus eval. Rule 2 becomes the tool layer plus guardrails. Rule 3 becomes the handoff path (DEMO-9). Tell them: hold me to these three for the rest of the day.

The stack is fixed and agreed. The job of this part is fluency in it, not a re-litigation. If someone wants Lambda over Fargate, park it for the architecture session — Fargate is the committed default for a 5-person team.

One slide, the whole stack. Don't read every row — orient them by the three groupings: run it (Bedrock, Fargate, accounts, secrets), trust it (Guardrails, evals), see it (Datadog). Model id is pinned: anthropic.claude-opus-4-8 — note the anthropic. prefix is Bedrock-specific; the first-party id is claude-opus-4-8 with no prefix. Pin the id in config (DEMO-2), never hard-code it scattered.

This is the one diagram, deployed. Trace a request end to end: channel in, input guardrail, the loop (which talks to Bedrock and to tools), output guardrail, response out — and the whole thing emitting traces to Datadog. Secrets Manager/SSM feed config in, never baked into the image. Make the point that guardrails are on the I/O edges, not buried in the prompt.

Sell the Bedrock choice honestly. The wins are real: IAM-scoped access per account dovetails with account-per-env, and Guardrails are first-class. But set the expectation that Bedrock is a subset of the first-party surface — if someone designs around Batches or the Files API, it won't be there. For our tool-calling order bot we need none of those, so it's a clean fit.

The user's original question literally asks for environment best practice. Give them the spine: isolation by AWS account, not by tag or namespace. This is the cheapest insurance they will buy.

The decision to defend: separate AWS accounts per environment. Tags and shared VPCs leak — one fat-fingered IAM policy and dev reaches prod. Accounts give you a hard wall plus clean per-env billing and Bedrock quotas. Organizations + SSO makes switching accounts a one-click affair, so the ergonomics cost is low.

The table that answers the user's question directly. The throughline (read the last line aloud): realism and strictness both increase left-to-right. Mock in dev, sandbox in uat, real in prod — that progression is exactly why the demo builds against mock tools first. The eval gate lives in test/CI: that is the merge blocker we wire up in Part 7.

Two ideas. First: build once, promote the same image; only configuration (from that account's SSM + Secrets) differs — this kills "works in uat, breaks in prod" drift. Second: access is scoped per account — a uat task role cannot call Bedrock in prod, and CI assumes a different deploy role per target. Real vendor keys exist only in uat/prod Secrets Manager; dev and CI never hold them.

Quick part. The framing: we are not collecting tools, we are covering needs. Every row of the next table is a need first, a tool second.

Map, don't list. Call out the SDK choice explicitly because the team will ask: the Anthropic Bedrock SDK (AnthropicBedrockMantle) gives the clean Messages API + tool runner; boto3 bedrock-runtime Converse is the pure-AWS equivalent and is what DEMO-2 can wrap. Either is fine — the loop shape is identical. SSO matters: no long-lived keys on five laptops.

90 minutes, hands-on, everyone. This is the most important block of the day — the muscle the team doesn't have yet. Everyone builds the same tiny agent against the same starter repo (DEMO-1→5). Protect this block; if the schedule slips, steal time from the architecture session, never from this.

Two things make the bot honest: the tool returns structured truth (or raises — it never fabricates a record), and the system prompt forbids stating any authoritative value the model didn't get from a tool. Show that not-found raises rather than returning a fake "all good" — the model must be able to learn the order doesn't exist, not paper over it.

The single highest-leverage skill in this whole project: writing tool definitions the model calls reliably. The description is a prompt aimed at the router — be prescriptive about WHEN to call it. Tight schemas (required, typed, with examples) reduce malformed calls. When we add products and promos (DEMO-6/7), good descriptions are what make the model pick the RIGHT tool.

This is the one diagram as ~15 lines of Python (DEMO-4). Narrate the loop: model returns stop_reason == "tool_use" with a tool_use block; YOUR code executes the function; you append a tool_result; loop. It exits when the model stops asking for tools and just answers. Two teaching points: append the full assistant content (tool_use blocks included) before sending results, and every tool_use needs exactly one matching tool_result. An SDK tool-runner can automate this loop — build it by hand once so it isn't magic.

Mocks define the contract. Design them to look like the real thing (Thai item names, plausible ids) and to include the cases that break naive bots: not-found and the expired promo. Because mock and real share a shape, the dev→uat→prod progression (mock→sandbox→real) becomes configuration, not a rewrite. This is also what makes the whole thing runnable on a clean clone with no vendor access.

Talk + small exercise. Now that the loop works, make it trustworthy and compliant. PDPA is folded in here via Guardrails, not treated as a separate legal afterthought.

A system prompt is policy. The grounding clause is the load-bearing line and it must be unambiguous: only state values a tool returned IN THIS conversation; never recall from training, never round, never fill in. Critically: do NOT bake any prices, stock, or hours into the prompt — they rot. The prompt governs behaviour; tools govern facts. Keep it short — modern Claude follows a tight prompt closely.

This is where "it sometimes calls the wrong tool" gets fixed — not by scolding the model, but by sharpening descriptions. The product-vs-promo boundary is the classic confusion; spell it out in both descriptions. Demonstrate a two-tool question: the model fans out to get_product_info and get_promotion, then phrases one grounded answer. Tool design is the lever, not prompt threats.

Reframe handoff as a feature, not an admission of defeat. The bot that can't say "I don't know" is the bot that invents. Wire a structured handoff event (stubbed in DEMO-9) for the cases where no tool fits or a tool says not-found. Tell the room: in the evals, "correctly handed off" is a PASS and "confidently guessed" is a FAIL — we score this explicitly.

PDPA is the Thai GDPR-equivalent; treat it as a hard control, not a prompt suggestion. Bedrock Guardrails sit on both edges: mask PII before the model sees it, and scrub output so it can't leak personal data back. Denied topics keep it on-scope. The test to run live (DEMO-8): send a fake phone number, watch it get masked. Defence in depth — the guardrail does not trust the prompt and the prompt does not trust the model.

75 minutes, hands-on. The second "aha" of the day: watching evals catch a regression you caused. If the team leaves with one new habit, make it this — evals as a merge gate, not a notebook.

A golden set is small and deliberate, not a giant random sample. Each case encodes two things: the right tool was called AND the answer is correct. Cover the five behaviours. The expired-promo case is the star: a naive bot offers the discount; a grounded bot says it expired. That single case is worth more than fifty happy-path ones.

Two assertion types map to two failure types: wrong fact (deterministic) and wrong behaviour (judge). The non-obvious, crucial point is the last one: assert on the tool call itself. A bot can say "out for delivery" by luck or by calling the wrong tool and getting away with it on this input. If you only check final text, you ship a bot that's right by accident. Scoring tool-call correctness is how you catch confidently-wrong-but-passing.

The live demo that makes evals visceral. Remove the grounding line, rerun: the expired-promo and unknown-order cases flip to red because the bot starts confidently inventing. Then revert and watch green return. The lesson lands without a lecture: this is the net that catches the failure mode we've been talking about all day, automatically, before it ships.

The payoff: evals as a required CI stage after lint and test. Below threshold, the eval job exits non-zero and branch protection blocks the merge. Now a prompt edit is reviewed and tested exactly like a code change — because it IS one. This is the habit to take home: no prompt or tool-def change reaches main without passing the golden set.

Demo + look-along. Answers the user's third question directly: how do we monitor AI performance? The reframe: "AI performance" is two things — is it up and cheap (ops), and is it right (quality). Datadog gives you both, correlated.

"Monitoring AI" is two questions that need two layers. Ops is what they already know — latency, errors, cost. Quality is new: groundedness (did it stick to tool facts), tool-call correctness, and containment (resolved without escalating to a human). The magic is correlation — click a bad answer in LLM Observability and jump to the exact APM trace, tokens, and tool calls behind it.

Walk the trace top to bottom. The value over plain logs: you see the decision (decide → which tool → result → phrase), each with cost and latency, plus a groundedness check on the final step. When a customer reports a bad answer, you don't guess — you open that conversation's trace and see exactly which tool was called, with what arguments, and where it went wrong.

Land the monitoring section on intuition: the earliest warning is usually economic or qualitative, not a 500. A cost-per-conversation spike means a loop or context bloat; a groundedness dip means a silent regression slipped past evals (or wasn't covered); falling containment means it's escalating more than it used to. Teach them to watch cost and groundedness as leading indicators, not just error rate.

60 minutes, whiteboard, everyone. This is where the day's concepts hit their actual project. Facilitate, don't lecture. Capture decisions in the action-items slide. This block can stretch or shrink — it's the buffer.

Run this as four facilitated rounds, ~12 min each. Round 1 forces the tool-vs-RAG call per source and exposes which vendor APIs even exist. Round 2 surfaces the long-lead risk: vendor sandbox access (the thing most likely to sink the real project). Round 3 makes the account/IAM layout concrete. Round 4 assigns owners — unowned tools and unowned eval sets rot. Write every decision on the action-items board.

Housekeeping, but load-bearing. The day only works if the prep is done. Be honest about what sinks it if skipped.

The shape of the day. The one rule: protect 09:30 (build) and 12:45 (evals) — they carry both "aha" moments. If a hands-on block overruns, cut tracing to a follow-up recording and shrink the architecture session; never cut build or evals. Lunch is a real boundary, not optional.

Sequence by lead time, because two items have long fuses. Bedrock model access is request-gated — enable it a week out, not the morning of. Vendor sandbox access is the single thing most likely to slip and it only bites in the architecture session. Everything else is the facilitator building the starter repo and datasets. Day-before is verification only: clean-clone run, then send the pre-req.

Keep the attendee ask tiny and verifiable — four checkboxes, five minutes. The smoke call is the one that matters: it proves their SSO profile, region, and Bedrock access all work together before the room. The failure mode this prevents: five people debugging AWS creds at 09:35 while the build block bleeds out. Make the smoke test copy-paste-able in the pre-req message.

15 minutes. Convert the whiteboard into owned, dated actions. A decision without an owner is a wish.

Fill this in live from the architecture-session board. Push for a real name and a real date on every row — "the team" is not an owner. The bottom row is starred for the next slide: get the starter repo green on a clean clone and ~80% of the real project's early risk evaporates.

Close on the two things worth remembering. Operationally: the clean-clone starter repo is the single highest-leverage artifact — it makes the workshop run and it IS the demo's M1. Philosophically: the spine that has run through every part — the model phrases, the systems own the facts, and we engineer so the bot is never confidently wrong. Thank the room; point them at the demo/ tickets to keep building.

Decision / action	Owner	By
Source-by-source: tool-call vs retrieval
Mock contract + vendor sandbox request per source
dev/test/uat/prod account layout + CI deploy roles
Guardrail (PDPA) config owner
Golden eval set owner + first 8–12 cases
Datadog dashboard owner
Starter repo: runs on a clean clone

Building an AI chat that is never confidently wrong

Order status · delivery · products · pricing · promotions — for a Thai e-commerce channel

Frame the system

What we are building

The agent loop

Tool-calling vs RAG

The failure we design against: confidently wrong

The one diagram (we use it all day)

So the thesis, precisely

Architecture for this project

The committed stack, by layer

The request path

Why Bedrock (and what it buys us)

dev → test → uat → prod

Account-per-environment, under Organizations

dev / test / uat / prod at a glance

Promotion flow & per-account access

Tooling the team should have

Tooling → the need it serves

Build a minimal tool-calling agent

Step 1 — the tool & the grounding rule

Step 2 — design the tool definition

Step 3 — the manual agent loop

Step 4 — mock authoritative data

Prompt & context engineering + PDPA

System prompt design

Writing tool defs the model calls reliably

Graceful "I don't know" + human handoff

PDPA via Bedrock Guardrails

Evals — the quality gate

The golden dataset

Assertions + LLM-as-judge — and score the tool call

The deliberately-broken-prompt demo

Wire it into GitLab CI — make it a merge gate

Monitoring AI performance

Two layers in one pane

Reading a real trace

Cost / quality dashboard — the first signal

Architecture working session

Whiteboard prompts — decide these as a team

The day, and what makes it run

Schedule — 09:00 to 16:00

Facilitator prep — have these ready

Per-attendee 5-minute pre-req

Decisions & action items

Decisions & action items

The highest-leverage prep is a starter repo that runs on a clean clone.