One-day hands-on workshop

Building an AI chat that is never confidently wrong

Order status · delivery · products · pricing · promotions — for a Thai e-commerce channel

The LLM orchestrates and phrases. Authoritative systems own the facts, via tool calls.

Part 1 — 09:00

Frame the system

How an agent actually works, why tool-calling beats "ask the model", and the one failure mode we design against all day.

AI Chatbot Workshop · Thai e-commerce

What we are building

Surface         | A chatbot on the shop's official chat channel
Answers about   | Order status · delivery · product info · pricing · promotions
Sources         | Order system, catalogue, pricing, promo engine — many systems
Users           | General public, in Thai and English
Hard constraint | Every price / stock / status / promo it states must be true right now

The chatbot is a thin orchestration layer over systems that already exist.
It is not a knowledge base, and it must never become one by guessing.


The agent loop

An "agent" is not magic. It is a loop your code controls:

  observe ──▶  DECIDE  ──▶  ACT  ──▶  observe ──▶ … ──▶ answer
             (the LLM)    (a tool
                           your code
                           runs)
  1. Decide — the model reads the conversation and either answers or asks for a tool.
  2. Act — your code runs the requested tool and returns the result.
  3. Repeat — feed the result back; the model decides again, until it answers.

You own the loop. The model only ever proposes a tool call — your code executes it.
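
The loop above can be sketched without any model at all. In this minimal illustration the LLM is replaced by a hypothetical `scripted_model` function, and `ALLOWED_TOOLS`, `run_agent`, and the message shapes are assumptions made for the sketch, not the workshop's real code. It shows where the ownership boundary sits: a proposed tool call is checked against an allow-list before anything runs.

```python
from typing import Callable

# Hypothetical tool registry: the only functions the loop is allowed to run.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "out_for_delivery"}

ALLOWED_TOOLS: dict[str, Callable[..., dict]] = {"get_order_status": get_order_status}

def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: asks for a tool once, then answers from its result."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_request", "name": "get_order_status",
                "args": {"order_id": "TH-10432"}}
    return {"type": "answer", "text": f"Status: {tool_results[-1]['content']['status']}"}

def run_agent(user_text: str, model=scripted_model, max_turns: int = 5) -> str:
    history = [{"role": "user", "content": user_text}]
    for _ in range(max_turns):                       # the loop your code controls
        decision = model(history)                    # 1. decide
        if decision["type"] == "answer":
            return decision["text"]
        tool = ALLOWED_TOOLS.get(decision["name"])   # the proposal is checked first
        if tool is None:
            raise ValueError(f"model proposed unknown tool: {decision['name']}")
        result = tool(**decision["args"])            # 2. act: your code executes
        history.append({"role": "tool", "content": result})  # 3. repeat
    raise RuntimeError("no answer within max_turns")

answer = run_agent("Where is my order #TH-10432?")
```

The `max_turns` cap is a design choice for the sketch: a loop that never answers stops with an error instead of spending tokens indefinitely.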


Tool-calling vs RAG

Two ways to ground the model. They solve different problems.

        | Tool-calling                       | Retrieval (RAG)
Answers | "What is X right now?"             | "What do our docs say about X?"
Source  | A live system (an API/DB)          | A corpus of text, chunked & indexed
Data    | Structured, current, authoritative | Unstructured, may be stale
Use for | order status, price, stock, promos | policy, FAQ, returns, "how do I…"

This project is overwhelmingly tool-calling. Order status and price are queries to systems of record — not passages to retrieve. Keep RAG in your back pocket for policy/FAQ (the demo/ RAG spike, DEMO-17).


The failure we design against: confidently wrong

A bot quoting real prices and real order status has a failure mode worse than sounding clumsy:

It states a wrong price, a wrong delivery date, or an expired promotion — fluently, with total confidence, and a customer acts on it.

  • The model is a brilliant phraser and a terrible system of record.
  • "฿1,290, in stock, arrives tomorrow" reads identically whether it is true or invented.
  • Fluency is not correctness. Confidence is not a signal of truth.

Every technique today — grounding, tools, guardrails, evals, traces — exists to make this failure structurally hard, not just unlikely.


The one diagram (we use it all day)

┌──────────────┐
│ User message │  "Where is my order #TH-10432?"
└──────┬───────┘
       ▼
┌──────────────┐  The LLM ORCHESTRATES — it does not
│ LLM decides  │  know the answer; it decides which tool
│  (Claude)    │  can get it.
└──────┬───────┘
       ▼  tool_use: get_order_status(order_id="TH-10432")
┌──────────────┐  The SYSTEM owns the FACT — authoritative
│ tool call →  │  source returns structured, current data.
│ order system │
└──────┬───────┘
       ▼  { status: "out_for_delivery", eta: "2026-06-27" }
┌──────────────┐  The LLM PHRASES — states only what the
│ LLM phrases  │  tool returned. Nothing invented.
└──────┬───────┘
       ▼
"Your order is out for delivery, arriving tomorrow (27 Jun)."


So the thesis, precisely

The LLM is the conversation engine. The systems of record are the source of truth. A tool call is the only bridge between them.

This gives us three rules the whole build obeys:

  1. Grounding — the model may only state a fact that a tool returned in this conversation.
  2. Ownership — every authoritative value (price, stock, status, promo) comes from its owning system, never from the model.
  3. Graceful failure — if no tool can answer, the bot says so and hands off. It never fills the gap with a guess.

Part 2 — Architecture

Architecture for this project

The committed AWS + Bedrock + Datadog stack. No vendor debate today — this is what we build on.


The committed stack, by layer

Layer              | Choice                                                         | Why it's here
Model access       | Amazon Bedrock — Claude anthropic.claude-opus-4-8              | Claude via IAM, inside the AWS perimeter — no raw API keys
Environments       | One AWS account per env (dev / uat / prod) under Organizations | Hard blast-radius & data isolation
Compute            | ECS Fargate containers                                         | Pragmatic for 5 people; no servers to patch
Secrets / config   | Secrets Manager + SSM Parameter Store, per account             | Creds & tunables out of the image, per environment
Safety / PDPA      | Bedrock Guardrails                                             | PII masking in/out, denied-topic blocking
Retrieval (if RAG) | pgvector on RDS · OpenSearch Serverless · Bedrock KB           | Only for policy/FAQ — decided in the working session
Ops observability  | Datadog APM + Logs + Metrics                                   | Latency, cost, errors — one pane
LLM quality        | Datadog LLM Observability + promptfoo/DeepEval                 | Conversation traces online; eval gate offline in CI
CI/CD + IaC GitLab CI/CD + Terraform / CDK The team already lives in GitLab

The request path

User ─▶ Official chat channel ─▶ ECS Fargate service
                                        │
                                        ▼
                             ┌─────────────────────┐
   Guardrails (in)   ───▶    │     AGENT LOOP      │   ───▶  Guardrails (out)
   PII mask / topics         │ decide→call→phrase  │         PII mask / topics
                             └──────────┬──────────┘
                                        │
             ┌──────────────────────────┼──────────────────────┐
             ▼                          ▼                      ▼
      Bedrock · Claude          Tools (mock→real)      Secrets Mgr + SSM
      anthropic.claude-         orders · products ·    creds · config ·
      opus-4-8                  promotions             pinned model id
                                        │
                                        └────── spans + LLM traces ──────▶ Datadog (APM + LLM Obs)

Guardrails wrap both edges. Every hop emits a span. The agent loop is the same one from the one diagram — now placed in the real stack.


Why Bedrock (and what it buys us)

  • IAM, not API keys. Model access is an IAM permission in the account. No secret to leak, rotate, or paste into a .env.
  • Inside the perimeter. Requests stay in your AWS network and region (ap-southeast-1 for TH latency/data).
  • Guardrails are native. PDPA controls (PII, denied topics) are a Bedrock feature, applied at the model edge — not bolted on.
  • One model id, pinned: anthropic.claude-opus-4-8 in config (DEMO-2).

Heads-up — Bedrock ≠ first-party feature-for-feature. Core Messages + tool use + adaptive thinking + Guardrails are all there; a few first-party extras (e.g. Message Batches, the Files API, server-side web search) are not on Bedrock. Check before you reach for one.


Part 3 — Environments

dev → test → uat → prod

Account-per-environment under AWS Organizations: what each is for, how code promotes, and how access & config differ.


Account-per-environment, under Organizations

        AWS Organization
        ├── dev account     ← build & experiment, synthetic data
        ├── test/CI account ← automated pipelines, ephemeral
        ├── uat account     ← stakeholder sign-off, prod-like
        └── prod account    ← real customers, real money

Why separate accounts, not just separate tags or VPCs:

  • Blast radius — a mistake in dev cannot touch prod data or prod spend.
  • IAM isolation — a dev role simply has no path to prod resources.
  • Clean billing & quotas — Bedrock spend, rate limits, alarms are per account.
  • Data separation — PDPA-relevant customer data lives only where it must.

dev / test / uat / prod at a glance

               | dev                 | test / CI             | uat                    | prod
Purpose        | Build, experiment   | Automated checks      | Stakeholder sign-off   | Serve customers
Data           | Synthetic Thai data | Synthetic + fixtures  | Prod-like, masked      | Real customer data
Sources        | Mock tools          | Mock + contract tests | Sandbox vendor APIs    | Real vendor APIs
Who uses it    | Engineers           | GitLab pipelines      | PO / QA / business     | The public
Guardrails     | On, permissive      | On, asserted in evals | On, prod config        | On, strict, alerting
Eval gate      | Run locally         | Blocks the merge      | Full suite + manual UAT | Smoke + canary
Bedrock access | Broad dev role      | CI role, scoped       | Scoped, prod-like      | Least-privilege, audited

The further right, the realer the data, the stricter the controls, and the higher the cost of being wrong.


Promotion flow & per-account access

Promotion is a pipeline, not a copy-paste:

 feature branch → MR → [lint · test · eval] → merge to main
        → deploy dev → deploy uat (sign-off) → deploy prod (approval)
  • Same artifact promotes across accounts — the container image is built once; only config changes per env.
  • IAM / Bedrock access is per account — each env's task role can invoke Bedrock only in its own account; CI assumes a scoped deploy role per target.
  • Secrets & config are per account:
    • Secrets Manager — DB creds, vendor API keys (real ones only in uat/prod).
    • SSM Parameter Store — model id, Guardrail id, thresholds, feature flags.
  • The image reads its config at boot from its own account's SSM + Secrets — never baked in.
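
As a rough sketch of "reads its config at boot", the function below pulls tunables from SSM and a credential from Secrets Manager through injected clients. The parameter paths (`/chatbot/model-id` and so on), the `AppConfig` fields, and the two `Fake…` classes are assumptions for illustration. The only real API shapes relied on are boto3's `get_parameter` and `get_secret_value` responses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppConfig:
    model_id: str
    guardrail_id: str
    vendor_api_key: str

def load_config(ssm, secrets, prefix: str = "/chatbot") -> AppConfig:
    """Read tunables from SSM and credentials from Secrets Manager at boot."""
    def param(name: str) -> str:
        return ssm.get_parameter(Name=f"{prefix}/{name}")["Parameter"]["Value"]

    secret = secrets.get_secret_value(SecretId=f"{prefix}/vendor-api-key")
    return AppConfig(
        model_id=param("model-id"),
        guardrail_id=param("guardrail-id"),
        vendor_api_key=secret["SecretString"],
    )

# In-memory stand-ins so the sketch runs without AWS access (hypothetical values).
class FakeSSM:
    def __init__(self, values: dict):
        self._values = values

    def get_parameter(self, Name: str) -> dict:
        return {"Parameter": {"Value": self._values[Name]}}

class FakeSecrets:
    def __init__(self, values: dict):
        self._values = values

    def get_secret_value(self, SecretId: str) -> dict:
        return {"SecretString": self._values[SecretId]}

config = load_config(
    FakeSSM({"/chatbot/model-id": "anthropic.claude-opus-4-8",
             "/chatbot/guardrail-id": "gr-dev-example"}),
    FakeSecrets({"/chatbot/vendor-api-key": "not-a-real-key"}),
)
```

In a deployed task the same function would presumably receive `boto3.client("ssm")` and `boto3.client("secretsmanager")`. Because each account holds its own values under the same paths, the image itself does not change between environments.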

Part 4 — Tooling

Tooling the team should have

Each tool maps to a concrete need. Nothing here is optional decoration.


Tooling → the need it serves

Tool                          | The need it serves
GitLab CI/CD                  | Lint → test → eval as a merge gate; deploy per environment
Terraform / CDK               | The whole stack as reviewable code; reproducible per account
boto3 + Anthropic Bedrock SDK | Call Claude on Bedrock & run the tool-use loop from Python
Bedrock Guardrails            | PDPA: PII masking in/out, denied-topic blocking at the model edge
promptfoo / DeepEval          | The offline quality gate — golden cases, assertions + LLM-judge
Datadog APM + Logs            | Operational truth: latency, errors, cost per request
Datadog LLM Observability     | Quality truth: tool-call correctness, groundedness, containment
AWS IAM Identity Center (SSO) | One login, scoped role per account — no static keys on laptops

Two SDK notes: pin anthropic.claude-opus-4-8 in config; reach for the Anthropic Bedrock SDK (AnthropicBedrockMantle) for a clean Messages/tool-use surface, or boto3 bedrock-runtime Converse to stay pure-AWS.
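
For the pure-AWS route, the tool definition has to be reshaped for Converse, which nests each tool under `toolSpec` and the JSON schema under `inputSchema.json`. The sketch below only builds the request dictionary. The helper names `to_converse_tool` and `build_converse_request`, and the shortened tool description, are assumptions for illustration.

```python
# A tool definition in the Messages-style shape (name / description / input_schema).
ORDER_TOOL = {
    "name": "get_order_status",
    "description": "Look up the CURRENT status of a customer order by its order ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

def to_converse_tool(tool: dict) -> dict:
    """Reshape one Messages-style tool definition into a Converse toolSpec."""
    return {"toolSpec": {
        "name": tool["name"],
        "description": tool["description"],
        "inputSchema": {"json": tool["input_schema"]},
    }}

def build_converse_request(model_id: str, system_prompt: str, user_text: str,
                           tools: list[dict], max_tokens: int = 1024) -> dict:
    """Assemble keyword arguments for a bedrock-runtime converse() call."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "toolConfig": {"tools": [to_converse_tool(t) for t in tools]},
        "inferenceConfig": {"maxTokens": max_tokens},
    }

request = build_converse_request(
    "anthropic.claude-opus-4-8",
    "Only state values a tool returned.",
    "Where is my order TH-10432?",
    [ORDER_TOOL],
)
```

The request would then be sent with something like `boto3.client("bedrock-runtime").converse(**request)`. The loop shape stays the same; only the block names differ.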


Part 5 — 09:30 · the main event

Build a minimal tool-calling agent

get_order_status, the grounding rule, a tool definition, and the manual loop — by hand.


Step 1 — the tool & the grounding rule

The tool is just a Python function over mock authoritative data (DEMO-3):

class OrderNotFound(LookupError):
    """The system of record has no order with this ID."""

def get_order_status(order_id: str) -> dict:
    record = ORDERS.get(order_id)          # the system of record
    if record is None:
        raise OrderNotFound(order_id)      # never invent a record
    return {
        "order_id": order_id,
        "status": record["status"],        # e.g. "out_for_delivery"
        "eta": record["eta"],              # e.g. "2026-06-27"
        "items": record["items"],
    }

The grounding rule, in the system prompt:

Only state an order status, price, stock level, or promotion that a tool returned in this conversation. If no tool can answer, say you don't know.


Step 2 — design the tool definition

The model only calls a tool well if its definition is good. This is real engineering:

TOOLS = [{
    "name": "get_order_status",
    "description": (
        "Look up the CURRENT status of a customer order by its order ID. "
        "Call this whenever the user asks where an order is, its delivery "
        "status, or its estimated arrival. Returns authoritative live data."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string",
                         "description": "Order ID, e.g. TH-10432"},
        },
        "required": ["order_id"],
    },
}]
  • Name is a verb-phrase; description says when to call it, not just what it does.
  • Schema is tight: required order_id, typed, with an example.
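
A tight schema also lets your code reject a malformed call before the tool runs. The following is a minimal sketch, not a full JSON Schema validator: it checks only required fields, unexpected fields, and top-level types. The function name `validate_tool_input` and the error strings are assumptions for illustration.

```python
# Maps the JSON Schema type names used in tool definitions to Python types.
_JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}

ORDER_SCHEMA = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
}

def validate_tool_input(schema: dict, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call; empty means OK."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected field: {name}")
            continue
        type_name = props[name].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {type_name}, got {type(value).__name__}")
    return errors

ok = validate_tool_input(ORDER_SCHEMA, {"order_id": "TH-10432"})
missing = validate_tool_input(ORDER_SCHEMA, {})
wrong_type = validate_tool_input(ORDER_SCHEMA, {"order_id": 10432})
```

A non-empty result can be returned to the model as an error so it can retry with corrected arguments, rather than letting a bad call reach the order system.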

Step 3 — the manual agent loop

import json

from anthropic import AnthropicBedrockMantle
client = AnthropicBedrockMantle(aws_region=AWS_REGION)

messages = [{"role": "user", "content": user_text}]
while True:
    resp = client.messages.create(
        model="anthropic.claude-opus-4-8", max_tokens=1024,
        system=SYSTEM_PROMPT, tools=TOOLS, messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break                                   # model has phrased its answer
    results = []
    for b in resp.content:
        if b.type == "tool_use":
            fact = TOOL_IMPLS[b.name](**b.input)            # YOUR code runs it
            results.append({"type": "tool_result",
                            "tool_use_id": b.id,
                            "content": json.dumps(fact)})
    messages.append({"role": "user", "content": results})   # feed facts back

decide → call → phrase, looped until stop_reason != "tool_use". You run the tool; the model only asked.
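
One gap worth sketching: if a tool raises (for example on an unknown order), the loop should report that back to the model rather than crash. The helper below wraps tool execution and returns a `tool_result` carrying `is_error`, the field the Messages API uses to flag a failed tool call. The helper name `execute_tool_call`, the inline `ORDERS` data, and the error wording are assumptions for illustration.

```python
import json

class OrderNotFound(Exception):
    """Raised by a tool when the system of record has no such order."""

ORDERS = {"TH-10432": {"status": "out_for_delivery", "eta": "2026-06-27"}}

def get_order_status(order_id: str) -> dict:
    record = ORDERS.get(order_id)
    if record is None:
        raise OrderNotFound(order_id)
    return {"order_id": order_id, **record}

TOOL_IMPLS = {"get_order_status": get_order_status}

def _error(tool_use_id: str, message: str) -> dict:
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": message, "is_error": True}

def execute_tool_call(tool_impls: dict, name: str, tool_input: dict,
                      tool_use_id: str) -> dict:
    """Run one requested tool and always return exactly one tool_result."""
    impl = tool_impls.get(name)
    if impl is None:                       # the model asked for a tool we don't have
        return _error(tool_use_id, f"unknown tool: {name}")
    try:
        fact = impl(**tool_input)
    except OrderNotFound as exc:           # a real "no such record", not a guess
        return _error(tool_use_id, f"order not found: {exc}")
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": json.dumps(fact)}

found = execute_tool_call(TOOL_IMPLS, "get_order_status", {"order_id": "TH-10432"}, "tu_1")
not_found = execute_tool_call(TOOL_IMPLS, "get_order_status", {"order_id": "TH-00000"}, "tu_2")
```

With this in place the model learns that the order does not exist and can hand off, instead of the conversation ending in a stack trace.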


Step 4 — mock authoritative data

Mock data is not a shortcut — it is the contract the real source will later honour.

ORDERS = {
    "TH-10432": {"status": "out_for_delivery", "eta": "2026-06-27",
                 "items": ["เคสมือถือ x1", "ฟิล์มกันรอย x2"]},
    "TH-10588": {"status": "packing",          "eta": "2026-06-29",
                 "items": ["หูฟัง x1"]},
}
  • Synthetic Thai data — realistic ids, items, statuses, dates.
  • Enough variety that tool selection is a real choice (orders vs products vs promos).
  • Includes the awkward cases: not-found, and later an expired promo.
  • Same shape as the real API → swapping mock → sandbox → real is a config change, not a rewrite.
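
One way to make "a config change, not a rewrite" concrete is a small interface that both the mock and the real source satisfy, selected by a setting. Everything named here (`OrderSource`, `MockOrderSource`, `HttpOrderSource`, `make_order_source`, the `kind` values) is a hypothetical sketch. The real vendor client is deliberately left unimplemented.

```python
from typing import Optional, Protocol

class OrderSource(Protocol):
    """The contract every order backend honours: same call, same shape."""
    def get(self, order_id: str) -> Optional[dict]: ...

class MockOrderSource:
    def __init__(self, orders: dict):
        self._orders = orders

    def get(self, order_id: str) -> Optional[dict]:
        return self._orders.get(order_id)

class HttpOrderSource:
    """Placeholder for the sandbox/real vendor API; not implemented here."""
    def __init__(self, base_url: str):
        self.base_url = base_url

    def get(self, order_id: str) -> Optional[dict]:
        raise NotImplementedError("vendor client goes here")

def make_order_source(kind: str, **kwargs) -> OrderSource:
    """Pick the backend from configuration (e.g. a per-account SSM value)."""
    if kind == "mock":
        return MockOrderSource(kwargs["orders"])
    if kind in ("sandbox", "real"):
        return HttpOrderSource(kwargs["base_url"])
    raise ValueError(f"unknown order source kind: {kind}")

source = make_order_source("mock", orders={
    "TH-10588": {"status": "packing", "eta": "2026-06-29", "items": ["หูฟัง x1"]},
})
record = source.get("TH-10588")
```

The tool function calls only `source.get(...)`, so it does not need to know which environment it is running in.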

Part 6 — 11:00

Prompt & context engineering + PDPA

The system prompt, tool definitions the model trusts, graceful failure, and PDPA via Bedrock Guardrails.


System prompt design

The system prompt is a policy, not a personality. Structure it:

  1. Role & scope — "order-status assistant for «shop», a Thai e-commerce store."
  2. Grounding (non-negotiable) — only state values a tool returned; never recall, guess, or round.
  3. Failure behaviour — if no tool fits, say you don't know and hand off.
  4. Tone & language — brief, warm; reply in the customer's language (Thai/English).

Keep authoritative facts out of the prompt — they go stale. The prompt says how to behave; tools provide what is true.


Writing tool defs the model calls reliably

The model's routing is only as good as your descriptions. Make them earn the call:

  • Say when, not just what. "Call this when the user asks about price or stock" beats "gets product info."
  • Disambiguate siblings. get_product_info (price/stock) vs get_promotion (active deals) — make the boundary explicit so it picks correctly.
  • Tight schemas. Required fields, types, an example value per field. Fewer malformed calls.
  • One job per tool. A tool that does three things gets called for the wrong one.

Multi-tool questions are normal: "is the หูฟัง in stock and on promo?" → two tool calls, then one grounded answer (DEMO-7).


Graceful "I don't know" + human handoff

The bot's right to say "I don't know" is what keeps it honest.

# In the system prompt:
#   If no tool can answer, do NOT guess. Say you don't have that
#   information and offer to connect a human agent.

def emit_handoff(reason, conversation_id):
    log.info("handoff", reason=reason, conversation_id=conversation_id)
    # stub today (DEMO-9); a real channel later
  • Triggers: no tool fits · tool raised not-found · low confidence · out of scope.
  • The handoff is a structured event, not a dead end — it routes to a person.
  • "I don't know + handoff" is a success, not a failure. A guess is the failure.
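
To show what "a structured event" might look like, here is a variant of the stub using only the standard library. The reason codes, the `HandoffEvent` fields, and the JSON-over-logging transport are assumptions for illustration. The real channel integration is out of scope.

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical reason codes mirroring the triggers listed above.
HANDOFF_REASONS = {"no_tool_fits", "not_found", "low_confidence", "out_of_scope"}

@dataclass
class HandoffEvent:
    conversation_id: str
    reason: str
    created_at: str

def emit_handoff(reason: str, conversation_id: str,
                 logger: logging.Logger = logging.getLogger("handoff")) -> dict:
    """Validate the reason, log one JSON event, and return it to the caller."""
    if reason not in HANDOFF_REASONS:
        raise ValueError(f"unknown handoff reason: {reason}")
    event = asdict(HandoffEvent(
        conversation_id=conversation_id,
        reason=reason,
        created_at=datetime.now(timezone.utc).isoformat(),
    ))
    logger.info(json.dumps(event))   # stub transport; a queue or ticket API later
    return event

event = emit_handoff("not_found", "conv-001")
```

A closed set of reasons keeps the events countable, which matters later when containment is tracked per reason.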

PDPA via Bedrock Guardrails

PDPA (Thailand's data-protection law) is enforced at the model edge, on both paths (DEMO-8):

Concern              | Guardrail control
PII in               | Customer pastes phone/email/address → masked before the model sees it
PII out              | Model output scrubbed so it can't echo back personal data
Off-topic / denied   | Block politics, medical/financial advice, anything off-scope
Prompt-injection-ish | Denied topics + refusal reduce "ignore your rules" attempts
  • Configured once as a Bedrock Guardrail, applied to input and output.
  • Test it: a message with a fake phone number is masked, not echoed.
  • Guardrails are a separate layer from the prompt — defence in depth, not "please be careful."
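
The "test it" bullet can be sketched as a thin wrapper around the bedrock-runtime `apply_guardrail` call, exercised here against a fake client so it runs offline. The wrapper name `screen_text`, the guardrail id and version, and the `FakeGuardrailClient` with its phone regex and `{PHONE}` mask token are assumptions for illustration. The real masking behaviour is determined by the Guardrail configuration, not by this code.

```python
import re

def screen_text(client, guardrail_id: str, version: str, text: str, source: str):
    """Run text through a guardrail; source is "INPUT" or "OUTPUT".

    Returns (text_to_use, intervened).
    """
    resp = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    intervened = resp["action"] == "GUARDRAIL_INTERVENED"
    if intervened and resp.get("outputs"):
        return resp["outputs"][0]["text"], True
    return text, intervened

class FakeGuardrailClient:
    """Offline stand-in that masks phone-like numbers; not the real service."""
    _PHONE = re.compile(r"\b0\d{1,2}-?\d{3}-?\d{4}\b")

    def apply_guardrail(self, guardrailIdentifier, guardrailVersion, source, content):
        text = content[0]["text"]["text"]
        masked = self._PHONE.sub("{PHONE}", text)
        if masked == text:
            return {"action": "NONE", "outputs": []}
        return {"action": "GUARDRAIL_INTERVENED", "outputs": [{"text": masked}]}

masked, intervened = screen_text(
    FakeGuardrailClient(), "gr-example", "1", "Call me on 081-234-5678", "INPUT")
clean, untouched = screen_text(
    FakeGuardrailClient(), "gr-example", "1", "Where is my order?", "INPUT")
```

Against the real service the same wrapper would take `boto3.client("bedrock-runtime")`. The assertion to keep in the eval suite is that the fake number never reaches the model or the reply.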

Part 7 — 12:45

Evals — the quality gate

Golden cases, assertions + LLM-judge, tool-call correctness, and the broken-prompt demo wired into GitLab CI.


The golden dataset

A small, representative set of cases with expected behaviour (DEMO-10):

- vars: { question: "Where is my order TH-10432?" }
  assert:
    - { type: tool-called, value: get_order_status }   # right tool
    - { type: contains, value: "27" }                  # right fact (the ETA)

- vars: { question: "Is the 'Songkran 10% off' promo still valid?" }
  assert:
    - { type: tool-called, value: get_promotion }
    - { type: llm-rubric,
        value: "States the promo has EXPIRED. Does NOT offer the discount." }
  • 8–12 cases spanning orders · products · promotions · off-topic · handoff.
  • Each case pins the expected tool call and an assertion on the answer.
  • Includes the expired-promo trap — the canonical "confidently wrong" test.

Assertions + LLM-as-judge — and score the tool call

Two kinds of check, because two kinds of thing can go wrong:

  • Deterministic assertions — exact facts. Was the ETA 27 Jun? Was the price ฿1,290? Cheap, unambiguous.
  • LLM-as-judge (rubric) — fuzzy quality. Did it refuse the expired promo? Was it grounded, not invented? For things you can't regex.

Score the tool call, not just the final text:

The bot can produce the right words for the wrong reason — a lucky guess, or the right answer after calling the wrong tool. Asserting on the tool call (and its arguments) catches the failure the text alone hides.
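
A small sketch of why the tool call gets its own score. The transcript shape (`tool_calls`, `answer`) and the function `score_case` are assumptions for illustration, not the format of any particular eval tool.

```python
def score_case(transcript: dict, expected_tool: str, expected_args: dict,
               must_contain: str) -> dict:
    """Score the tool call and the final text separately; both must pass."""
    tool_ok = any(
        call["name"] == expected_tool and call["args"] == expected_args
        for call in transcript["tool_calls"]
    )
    text_ok = must_contain in transcript["answer"]
    return {"tool_call": tool_ok, "answer": text_ok, "passed": tool_ok and text_ok}

# Grounded: the right tool was called with the right argument.
grounded = {
    "tool_calls": [{"name": "get_order_status", "args": {"order_id": "TH-10432"}}],
    "answer": "Your order is out for delivery, arriving 27 Jun.",
}
# Lucky guess: identical text, but no tool was ever called.
lucky_guess = {
    "tool_calls": [],
    "answer": "Your order is out for delivery, arriving 27 Jun.",
}

expected = ("get_order_status", {"order_id": "TH-10432"}, "27")
grounded_score = score_case(grounded, *expected)
guess_score = score_case(lucky_guess, *expected)
```

A text-only assertion would pass both transcripts. Only the tool-call check separates the grounded answer from the guess.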


The deliberately-broken-prompt demo

The moment evals earn their keep — break it on purpose and watch the gate catch it (DEMO-12):

  System prompt:
- Only state prices/stock/status that a tool returned this conversation.
+ # (grounding line removed)

Run the suite:

✓ order status        ✓ product in stock
✗ expired promo  →  bot offered the discount  (groundedness FAIL)
✗ unknown order  →  bot invented an ETA       (tool-call FAIL)

Delete one grounding line → the bot starts inventing → the eval goes red. Revert → green. That is the safety net.


Wire it into GitLab CI — make it a merge gate

Evals belong in the pipeline, not a notebook (DEMO-12):

stages: [lint, test, eval]

eval:
  stage: eval
  script:
    - make eval                 # promptfoo eval -c promptfooconfig.yaml
  rules:
    - if: $CI_MERGE_REQUEST_IID  # run on every MR
  # exits non-zero below the pass threshold → the MR cannot merge
 lint  ──▶  test  ──▶  eval  ──▶  ✅ mergeable
   any stage red  ──▶  ❌ blocked

A prompt change that breaks grounding now fails the pipeline like any other bug. Quality becomes a gate, not a hope.
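
The gate itself reduces to one decision: turn case results into a process exit code. The sketch below assumes a hypothetical `results` mapping of case name to pass/fail and a `threshold` setting. It would only be needed if the eval tool in use does not already exit non-zero on failures.

```python
def eval_exit_code(results: dict[str, bool], threshold: float = 1.0) -> int:
    """Return 0 if the pass rate meets the threshold, else 1 (blocks the MR)."""
    if not results:
        return 1                      # an empty suite must not count as a pass
    pass_rate = sum(results.values()) / len(results)
    return 0 if pass_rate >= threshold else 1

# The broken-prompt run from the demo: two of four cases fail.
broken_run = {
    "order status": True,
    "product in stock": True,
    "expired promo": False,
    "unknown order": False,
}
fixed_run = {name: True for name in broken_run}

broken_code = eval_exit_code(broken_run)
fixed_code = eval_exit_code(fixed_run)
```

A CI script would finish with `sys.exit(eval_exit_code(results))`. Treating an empty suite as a failure is a deliberate choice so that a misconfigured job cannot pass silently.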


Part 8 — 14:00

Monitoring AI performance

Two layers, one pane: operational truth and quality truth — and the first signal something is wrong.


Two layers in one pane

Layer       | Question                    | Where                     | Watch
Operational | Is it up, fast, affordable? | Datadog APM + Logs        | latency p50/p95, error rate, cost/conv
Quality     | Is it right?                | Datadog LLM Observability | tool-call correctness, groundedness, containment
  • Operational is the classic web-service view — Fargate, Bedrock latency, errors, token cost (DEMO-13).
  • Quality is LLM-specific: did it call the right tool, did it stay grounded, did it resolve without handoff (containment)? (DEMO-14)
  • The two are correlated: an LLM trace links to the APM trace of the same request.

Reading a real trace

One conversation, fully unpacked (DEMO-14):

conversation: "Is the หูฟัง in stock and on promo?"
├─ span agent.request               1.8s   ฿0.21
│  ├─ span bedrock.invoke (decide)  0.6s   tool_use ✓
│  ├─ span tool.get_product_info    0.02s  {stock: 14} ✓ correct tool
│  ├─ span tool.get_promotion       0.02s  {active: false}
│  └─ span bedrock.invoke (phrase)  0.9s   grounded ✓
└─ output: "Yes, 14 in stock. No active promo right now."
  • See the tool calls, their arguments, and results — not just the final text.
  • Tokens, cost, and latency per step.
  • A wrong answer becomes debuggable: open the trace, find the tool call that lied or the step that drifted.

Cost / quality dashboard — the first signal

One dashboard the team reads at a glance (DEMO-15):

  • Operational: tokens & cost per conversation, p50/p95 latency, error rate.
  • Quality: tool-call correctness, groundedness, containment (resolved without handoff).

The first signal something is wrong is rarely an exception. It is a cost spike (a loop calling tools in circles) or a bad-trace example (groundedness dipping) — long before anyone files a ticket.

  • A cost spike → often a runaway loop or oversized context.
  • Groundedness dipping → a prompt or tool change quietly regressed.
  • Containment dropping → the bot is handing off more; something it used to answer, it no longer can.
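
As an illustration of how these signals might be derived from per-conversation records, the sketch below computes containment, groundedness, average cost, and a simple runaway-loop flag. The record fields (`handed_off`, `grounded`, `cost_thb`, `tool_calls`) and the `max_tool_calls` limit are assumptions, not Datadog's schema.

```python
def summarize(conversations: list[dict], max_tool_calls: int = 6) -> dict:
    """Roll per-conversation records up into the dashboard's headline numbers."""
    n = len(conversations)
    if n == 0:
        raise ValueError("no conversations to summarize")
    return {
        # share resolved without a human handoff
        "containment": sum(not c["handed_off"] for c in conversations) / n,
        # share whose answer was judged grounded in tool results
        "groundedness": sum(c["grounded"] for c in conversations) / n,
        "avg_cost_thb": sum(c["cost_thb"] for c in conversations) / n,
        # conversations with unusually many tool calls: possible loops
        "suspected_runaways": [c["id"] for c in conversations
                               if c["tool_calls"] > max_tool_calls],
    }

sample = [
    {"id": "c1", "handed_off": False, "grounded": True,  "cost_thb": 0.25, "tool_calls": 1},
    {"id": "c2", "handed_off": False, "grounded": True,  "cost_thb": 0.25, "tool_calls": 2},
    {"id": "c3", "handed_off": False, "grounded": False, "cost_thb": 0.5,  "tool_calls": 1},
    {"id": "c4", "handed_off": True,  "grounded": True,  "cost_thb": 1.0,  "tool_calls": 9},
]
summary = summarize(sample)
```

In this sample, conversation `c4` is both the most expensive and the one flagged for tool-call count, which is the pattern the cost-spike bullet describes.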

Part 9 — 14:45

Architecture working session

Whiteboard time. Your real sources, your mock strategy, your account layout, your owners.


Whiteboard prompts — decide these as a team

1 · Sources: tool-call or retrieval?
For each — order, delivery, products, pricing, promotions — is it a live query (tool) or a document (RAG)? Default to tool-call; justify any RAG.

2 · Mock / sandbox strategy, per source
What's the API shape? Is there a vendor sandbox for uat? What's the mock contract for dev/CI? Who provides the real credentials, and when?

3 · Account layout under Organizations
dev / test / uat / prod accounts — who can deploy to each? How do CI roles assume into each account? Where do Bedrock budget alarms live?

4 · Ownership
Who owns each tool? Each source contract? The golden eval set? The Guardrail config? The Datadog dashboard?


Part 10 — schedule & prep

The day, and what makes it run

The 09:00–16:00 shape, the facilitator prep, and the 5-minute attendee pre-req.


Schedule — 09:00 to 16:00

Time        | Block                                                                                  | Format
09:00–09:30 | Frame the system — agent loop, tool-calling vs RAG, confidently-wrong, the one diagram | Talk + diagram
09:30–11:00 | Build a minimal tool-calling agent (get_order_status)                                  | Hands-on
11:00–11:45 | Prompt & context engineering + PDPA / handoff                                          | Talk + exercise
11:45–12:45 | Lunch                                                                                  |
12:45–14:00 | Evals — golden set, break the prompt, wire into GitLab CI                              | Hands-on
14:00–14:45 | Tracing & observability — read a real trace in Datadog                                 | Demo + look-along
14:45–15:45 | Architecture working session — your real sources & environments                        | Whiteboard
15:45–16:00 | Decisions & action items                                                               | Capture

Protect the Build and Evals blocks above all — those are the muscles the team doesn't have yet. Tracing can become a follow-up; the architecture session can flex.


Facilitator prep — have these ready

Ordered by lead time. The first group has long lead and quietly sinks the day if left late.

~1 week out (long lead):

  • Bedrock access — Claude enabled in the dev account's region, IAM for attendees, budget alarm. Model access is request-gated.
  • Vendor sandbox access for any real source touched in the architecture session — most likely to slip.

~2–3 days out (build materials):

  • Starter repo on GitLab — runs on a clean clone (the highest-leverage item).
  • Synthetic Thai dataset + golden eval set (8–12 cases, incl. the broken-prompt case).
  • The one diagram · GitLab CI snippet · Datadog wired (ddtrace + a sample trace already showing).

Day before: run the whole starter repo end-to-end on a clean clone; send the pre-req.


Per-attendee 5-minute pre-req

Send this the day before. It is the whole bar — everything else is provided.

  • [ ] Laptop with a working Python environment.
  • [ ] Starter repo cloned and pip install'd.
  • [ ] AWS credentials configured (SSO profile) with Bedrock access in the dev account.
  • [ ] One confirmed test Bedrock call (python -m demo.smoke returns a completion).

If those four are green, you arrive ready to build at 09:30 — not ready to start installing at 09:30.


Part 11 — close

Decisions & action items

What we decided, and what each owner takes away.


Decisions & action items

Capture live — every row gets an owner and a date:

Decision / action                                  | Owner | By
Source-by-source: tool-call vs retrieval           |       |
Mock contract + vendor sandbox request per source  |       |
dev/test/uat/prod account layout + CI deploy roles |       |
Guardrail (PDPA) config owner                      |       |
Golden eval set owner + first 8–12 cases           |       |
Datadog dashboard owner                            |       |
Starter repo: runs on a clean clone                |       |

The one that de-risks everything else is the last row.


The one thing to take away

The highest-leverage prep is a starter repo that runs on a clean clone.

If it exists and works, the workshop is ~80% de-risked.
If it doesn't, no agenda survives contact with five laptops at 09:30.

And the thesis you'll carry into the real build:
the LLM phrases; authoritative systems own the facts. Never be confidently wrong.

Welcome. Set the tone in one breath: this is not "intro to LLMs". Five engineers, AWS/GitLab/Python already in hand. By 16:00 each of you will have built a tool-calling agent, gated it with evals in GitLab CI, and read its trace in Datadog. The single idea on the screen is the spine of the whole day — read it aloud, then say: the worst failure for a bot quoting a real price or a real delivery date is not awkward wording, it is stating a wrong number with total confidence. Everything today is engineered to prevent that.

30 minutes, talk + one diagram. Do not drift into model internals or prompt-engineering folklore. The goal of this part is a shared mental model and the one diagram everyone will point at for the rest of the day.

Anchor everyone on the actual project before any abstraction. The key reframe: the value is not the model's "knowledge" — it is correct routing to systems of record plus good phrasing. If the team treats the LLM as the source of truth, the project fails on day one in production. The demo/ repo (DEMO-3, DEMO-6) mirrors exactly these sources as mock tools: orders, products, promotions.

Demystify the word "agent". It is a while-loop with the model in the decide seat. Stress the ownership boundary: the model emits a tool_use block (a request); nothing runs until your code chooses to run it. That boundary is where you put validation, auth, gating, and logging. We build this exact loop by hand in Part 5 (DEMO-4) precisely so nobody treats it as a black box.

Engineers often reach for RAG reflexively. Make the distinction crisp: a delivery ETA is not a document you retrieve, it is a row you query. RAG returns *text that mentions* something; a tool returns *the value*. For prices and order status, "text that mentions" is exactly the path to being confidently wrong. We are not anti-RAG — it is right for the returns policy and FAQ — but it is the wrong default for this domain.

This is the emotional centre of the day. Give a concrete scenario: a customer is told a Songkran promo is still valid, drives to checkout, it is expired — support ticket, refund, trust gone. The expired-promo case is deliberately seeded in the demo dataset (DEMO-6) and the golden evals (DEMO-10). The point: we do not "remind the model to be careful". We remove its ability to invent these values at all.

This is THE diagram (also in one-diagram.md, printed as a poster on the wall). Walk down the left-hand spine from top to bottom. Make the three verbs land: ORCHESTRATE, own the FACT, PHRASE. Every later slide — architecture, evals, traces — is a view of this same flow. When anyone asks "where does X live?", point here.

Restate the spine as three enforceable engineering rules, because "be grounded" is a vibe until it is code. Rule 1 becomes a system-prompt line plus eval. Rule 2 becomes the tool layer plus guardrails. Rule 3 becomes the handoff path (DEMO-9). Tell them: hold me to these three for the rest of the day.

The stack is fixed and agreed. The job of this part is fluency in it, not a re-litigation. If someone wants Lambda over Fargate, park it for the architecture session — Fargate is the committed default for a 5-person team.

One slide, the whole stack. Don't read every row — orient them by the three groupings: run it (Bedrock, Fargate, accounts, secrets), trust it (Guardrails, evals), see it (Datadog). Model id is pinned: anthropic.claude-opus-4-8 — note the anthropic. prefix is Bedrock-specific; the first-party id is claude-opus-4-8 with no prefix. Pin the id in config (DEMO-2); never hard-code it in scattered places across the codebase.

This is the one diagram, deployed. Trace a request end to end: channel in, input guardrail, the loop (which talks to Bedrock and to tools), output guardrail, response out — and the whole thing emitting traces to Datadog. Secrets Manager/SSM feed config in, never baked into the image. Make the point that guardrails are on the I/O edges, not buried in the prompt.

Sell the Bedrock choice honestly. The wins are real: IAM-scoped access per account dovetails with account-per-env, and Guardrails are first-class. But set the expectation that Bedrock is a subset of the first-party surface — if someone designs around Batches or the Files API, it won't be there. For our tool-calling order bot we need none of those, so it's a clean fit.

The user's original question literally asks for environment best practice. Give them the spine: isolation by AWS account, not by tag or namespace. This is the cheapest insurance they will buy.

The decision to defend: separate AWS accounts per environment. Tags and shared VPCs leak — one fat-fingered IAM policy and dev reaches prod. Accounts give you a hard wall plus clean per-env billing and Bedrock quotas. Organizations + SSO makes switching accounts a one-click affair, so the ergonomics cost is low.

The table that answers the user's question directly. The throughline (read the last line aloud): realism and strictness both increase left-to-right. Mock in dev, sandbox in uat, real in prod — that progression is exactly why the demo builds against mock tools first. The eval gate lives in test/CI: that is the merge blocker we wire up in Part 7.

Two ideas. First: build once, promote the same image; only configuration (from that account's SSM + Secrets) differs — this kills "works in uat, breaks in prod" drift. Second: access is scoped per account — a uat task role cannot call Bedrock in prod, and CI assumes a different deploy role per target. Real vendor keys exist only in uat/prod Secrets Manager; dev and CI never hold them.

Quick part. The framing: we are not collecting tools, we are covering needs. Every row of the next table is a need first, a tool second.

Map, don't list. Call out the SDK choice explicitly because the team will ask: the Anthropic Bedrock SDK (AnthropicBedrockMantle) gives the clean Messages API + tool runner; boto3 bedrock-runtime Converse is the pure-AWS equivalent and is what DEMO-2 can wrap. Either is fine — the loop shape is identical. SSO matters: no long-lived keys on five laptops.

90 minutes, hands-on, everyone. This is the most important block of the day — the muscle the team doesn't have yet. Everyone builds the same tiny agent against the same starter repo (DEMO-1→5). Protect this block; if the schedule slips, steal time from the architecture session, never from this.

Two things make the bot honest: the tool returns structured truth (or raises — it never fabricates a record), and the system prompt forbids stating any authoritative value the model didn't get from a tool. Show that not-found raises rather than returning a fake "all good" — the model must be able to learn the order doesn't exist, not paper over it.
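
A small sketch of "structured truth or raise", if it helps to have something on screen. The order id format, the MOCK_ORDERS record and the run_tool_safely wrapper are illustrative assumptions, not the starter repo's actual code.

```python
class OrderNotFound(Exception):
    """Raised when the order system has no record for the given id."""


# Hypothetical mock data standing in for the order system.
MOCK_ORDERS = {
    "TH-10001": {"order_id": "TH-10001", "status": "out_for_delivery"},
}


def get_order_status(order_id: str) -> dict:
    """Return the stored record, or raise. Never invent a record."""
    try:
        return dict(MOCK_ORDERS[order_id])
    except KeyError:
        raise OrderNotFound(f"No order with id {order_id!r}") from None


def run_tool_safely(order_id: str) -> dict:
    """Convert the exception into an error payload the model can read."""
    try:
        return {"is_error": False, "content": get_order_status(order_id)}
    except OrderNotFound as exc:
        return {"is_error": True, "content": str(exc)}
```

The point to make: the not-found path produces an explicit error the model sees, rather than a default record that looks like success.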

The single highest-leverage skill in this whole project: writing tool definitions the model calls reliably. The description is a prompt aimed at the router — be prescriptive about WHEN to call it. Tight schemas (required, typed, with examples) reduce malformed calls. When we add products and promos (DEMO-6/7), good descriptions are what make the model pick the RIGHT tool.
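
One possible shape for a prescriptive tool definition, using the name / description / input_schema layout of the Messages API. The wording of the description, the example order id and the missing_required helper are assumptions for illustration.

```python
GET_ORDER_STATUS_TOOL = {
    "name": "get_order_status",
    "description": (
        "Look up the live status of one customer order in the order system. "
        "Call this whenever the customer asks where an order is, whether it "
        "has shipped, or when it will arrive, and has given an order id. "
        "Do NOT use it for product, price or promotion questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order id exactly as the customer typed it, e.g. 'TH-10001'.",
            }
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}


def missing_required(tool: dict, args: dict) -> list:
    """List required fields absent from a proposed call (cheap pre-flight check)."""
    return [f for f in tool["input_schema"]["required"] if f not in args]
```

Note that the description says when to call the tool and when not to; that second half is what helps once sibling tools exist.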

This is the one diagram as ~15 lines of Python (DEMO-4). Narrate the loop: model returns stop_reason == "tool_use" with a tool_use block; YOUR code executes the function; you append a tool_result; loop. It exits when the model stops asking for tools and just answers. Two teaching points: append the full assistant content (tool_use blocks included) before sending results, and every tool_use needs exactly one matching tool_result. An SDK tool-runner can automate this loop — build it by hand once so it isn't magic.
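
For rehearsal without Bedrock access, the loop can be sketched against a stubbed model. This is not the DEMO-4 code: call_model, fake_model and the tools dict are hypothetical stand-ins, and responses are plain dicts shaped like Messages API content blocks.

```python
import json
from typing import Callable, Dict


def run_agent(user_text: str, call_model: Callable, tools: Dict[str, Callable],
              max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_turns):
        response = call_model(messages)
        # Append the full assistant content, tool_use blocks included.
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return "".join(b["text"] for b in response["content"] if b["type"] == "text")
        results = []
        for block in response["content"]:
            if block["type"] != "tool_use":
                continue
            # Exactly one tool_result per tool_use, success or failure.
            result = {"type": "tool_result", "tool_use_id": block["id"]}
            try:
                result["content"] = json.dumps(tools[block["name"]](**block["input"]))
            except Exception as exc:
                result["content"] = str(exc)
                result["is_error"] = True
            results.append(result)
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")


def fake_model(messages):
    """Stub: first ask for the tool, then phrase an answer from its result."""
    last = messages[-1]["content"]
    if isinstance(last, str):
        return {"stop_reason": "tool_use", "content": [
            {"type": "tool_use", "id": "toolu_1", "name": "get_order_status",
             "input": {"order_id": "TH-10001"}}]}
    record = json.loads(last[0]["content"])
    return {"stop_reason": "end_turn", "content": [
        {"type": "text", "text": f"Order {record['order_id']} is {record['status']}."}]}


tools = {"get_order_status": lambda order_id: {"order_id": order_id,
                                               "status": "out_for_delivery"}}
answer = run_agent("Where is order TH-10001?", fake_model, tools)
```

The max_turns cap is an addition of this sketch; it keeps a misbehaving model from looping forever.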

Mocks define the contract. Design them to look like the real thing (Thai item names, plausible ids) and to include the cases that break naive bots: not-found and the expired promo. Because mock and real share a shape, the dev→uat→prod progression (mock→sandbox→real) becomes configuration, not a rewrite. This is also what makes the whole thing runnable on a clean clone with no vendor access.
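
A sketch of a promo mock that carries the expired case. The promo codes, dates, Thai titles and field names are invented for illustration; the real promo engine's shape is whatever the vendor contract says.

```python
from datetime import date
from typing import Optional


class PromoNotFound(Exception):
    """Raised when no promotion exists for the given code."""


# Hypothetical fixtures: one long-expired promo, one still valid.
MOCK_PROMOS = {
    "SONGKRAN10": {"code": "SONGKRAN10", "title": "ส่วนลดสงกรานต์ 10%",
                   "percent_off": 10, "valid_until": date(2024, 4, 16)},
    "NEWYEAR15": {"code": "NEWYEAR15", "title": "ส่วนลดปีใหม่ 15%",
                  "percent_off": 15, "valid_until": date(2099, 1, 1)},
}


def get_promotion(code: str, today: Optional[date] = None) -> dict:
    """Return the promo record with an explicit 'active' flag, or raise."""
    today = today or date.today()
    try:
        promo = MOCK_PROMOS[code]
    except KeyError:
        raise PromoNotFound(f"No promotion with code {code!r}") from None
    return {
        "code": promo["code"],
        "title": promo["title"],
        "percent_off": promo["percent_off"],
        "valid_until": promo["valid_until"].isoformat(),
        # The tool decides validity; the model should not do date math.
        "active": today <= promo["valid_until"],
    }
```

Returning an explicit active flag means the model only has to read a boolean, which makes the expired-promo case much harder to get wrong.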

Talk + small exercise. Now that the loop works, make it trustworthy and compliant. PDPA is folded in here via Guardrails, not treated as a separate legal afterthought.

A system prompt is policy. The grounding clause is the load-bearing line and it must be unambiguous: only state values a tool returned IN THIS conversation; never recall from training, never round, never fill in. Critically: do NOT bake any prices, stock, or hours into the prompt — they rot. The prompt governs behaviour; tools govern facts. Keep it short — modern Claude follows a tight prompt closely.
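
As a talking aid, here is one way the grounding clause might read, with a crude lint that flags any number baked into the prompt. The prompt wording and the baked_in_numbers helper are illustrative assumptions, not the workshop's actual prompt.

```python
import re

SYSTEM_PROMPT = """\
You are the customer-service assistant for the shop's official chat channel.
Reply in the language the customer writes in (Thai or English).

Grounding rule: state a price, stock level, order status, delivery date or
promotion only if a tool returned that exact value in this conversation.
Never recall such values from memory, never round them, never fill in gaps.
If no tool fits, or a tool reports not-found, say you cannot confirm and
offer to hand the conversation to a human agent.
"""


def baked_in_numbers(prompt: str) -> list:
    """Return digit runs found in the prompt.

    A non-empty result is a red flag for a hard-coded price, stock count
    or opening hour that will rot.
    """
    return re.findall(r"\d+", prompt)
```

The lint is deliberately blunt: it may produce false positives, but it is cheap to run in CI next to the evals.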

This is where "it sometimes calls the wrong tool" gets fixed — not by scolding the model, but by sharpening descriptions. The product-vs-promo boundary is the classic confusion; spell it out in both descriptions. Demonstrate a two-tool question: the model fans out to get_product_info and get_promotion, then phrases one grounded answer. Tool design is the lever, not prompt threats.

Reframe handoff as a feature, not an admission of defeat. The bot that can't say "I don't know" is the bot that invents. Wire a structured handoff event (stubbed in DEMO-9) for the cases where no tool fits or a tool says not-found. Tell the room: in the evals, "correctly handed off" is a PASS and "confidently guessed" is a FAIL — we score this explicitly.
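
A structured handoff event could look something like the sketch below. The field names, the reason codes and decide_handoff are hypothetical; the DEMO-9 stub may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HandoffReason(Enum):
    NO_TOOL_FITS = "no_tool_fits"
    TOOL_NOT_FOUND = "tool_not_found"


@dataclass(frozen=True)
class HandoffEvent:
    conversation_id: str
    reason: HandoffReason
    customer_message: str

    def to_dict(self) -> dict:
        """Serialisable form for whatever queue or ticketing system receives it."""
        return {
            "type": "handoff",
            "conversation_id": self.conversation_id,
            "reason": self.reason.value,
            "customer_message": self.customer_message,
        }


def decide_handoff(conversation_id: str, customer_message: str,
                   tool_name: Optional[str],
                   tool_error: Optional[str]) -> Optional[HandoffEvent]:
    """Return a HandoffEvent when the bot must not answer, else None."""
    if tool_name is None:
        return HandoffEvent(conversation_id, HandoffReason.NO_TOOL_FITS, customer_message)
    if tool_error is not None:
        return HandoffEvent(conversation_id, HandoffReason.TOOL_NOT_FOUND, customer_message)
    return None
```

Because the event is structured, the evals can assert on it directly instead of pattern-matching apology text.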

PDPA (Thailand's Personal Data Protection Act) is the Thai GDPR-equivalent; treat it as a hard control, not a prompt suggestion. Bedrock Guardrails sit on both edges: mask PII before the model sees it, and scrub output so it can't leak personal data back. Denied topics keep it on-scope. The test to run live (DEMO-8): send a fake phone number, watch it get masked. This is defence in depth: the guardrail does not trust the prompt and the prompt does not trust the model.
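
For a local rehearsal of the "fake phone number gets masked" test, a regex stand-in is sketched below. To be clear, this is not Bedrock Guardrails and not how Guardrails detects PII; the pattern (Thai-style numbers starting with 0, nine or ten digits, optional dashes or spaces) and the [PHONE] placeholder are assumptions of this sketch.

```python
import re

# Rough pattern for Thai-style phone numbers such as 081-234-5678 or 0812345678.
# It is an approximation for demo purposes and will miss other formats.
PHONE_RE = re.compile(r"\b0\d{1,2}[-\s]?\d{3}[-\s]?\d{4}\b")


def mask_phone(text: str) -> str:
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_RE.sub("[PHONE]", text)


def edge_guard(user_text: str) -> str:
    """Input-edge step: mask before the text is ever placed in the model's context."""
    return mask_phone(user_text)
```

The same function could be applied to the model's output on the way back; the managed guardrail remains the control of record.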

75 minutes, hands-on. The second "aha" of the day: watching evals catch a regression you caused. If the team leaves with one new habit, make it this — evals as a merge gate, not a notebook.

A golden set is small and deliberate, not a giant random sample. Each case encodes two things: the right tool was called AND the answer is correct. Cover the five behaviours. The expired-promo case is the star: a naive bot offers the discount; a grounded bot says it expired. That single case is worth more than fifty happy-path ones.

Two assertion types map to two failure types: wrong fact (deterministic) and wrong behaviour (judge). The non-obvious, crucial point is the last one: assert on the tool call itself. A bot can say "out for delivery" by luck or by calling the wrong tool and getting away with it on this input. If you only check final text, you ship a bot that's right by accident. Scoring tool-call correctness is how you catch confidently-wrong-but-passing.
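
The deterministic half of this can be sketched in a few lines; the judge-based behaviour check is left out. GoldenCase, Transcript and score are hypothetical names, and substring matching is a deliberately simple stand-in for a real fact check.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GoldenCase:
    name: str
    user_text: str
    expected_tool: Optional[str]       # None means "no tool should be called"
    must_contain: Sequence[str] = ()   # facts the answer has to state


@dataclass(frozen=True)
class Transcript:
    tool_calls: Sequence[str]          # names of the tools the bot called
    answer: str


def score(case: GoldenCase, run: Transcript) -> dict:
    """Score the tool call and the final text separately, then combine."""
    if case.expected_tool is None:
        tool_ok = len(run.tool_calls) == 0
    else:
        tool_ok = case.expected_tool in run.tool_calls
    fact_ok = all(s.lower() in run.answer.lower() for s in case.must_contain)
    return {"tool_ok": tool_ok, "fact_ok": fact_ok, "passed": tool_ok and fact_ok}


expired_promo = GoldenCase(
    name="expired-promo",
    user_text="Can I still use SONGKRAN10?",
    expected_tool="get_promotion",
    must_contain=("expired",),
)

grounded = Transcript(tool_calls=["get_promotion"], answer="Sorry, that code has expired.")
lucky = Transcript(tool_calls=[], answer="That code has expired.")
```

The lucky transcript says the right thing without calling the tool, so a text-only check would pass it; scoring tool_ok separately is what turns it red.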

The live demo that makes evals visceral. Remove the grounding line, rerun: the expired-promo and unknown-order cases flip to red because the bot starts confidently inventing. Then revert and watch green return. The lesson lands without a lecture: this is the net that catches the failure mode we've been talking about all day, automatically, before it ships.

The payoff: evals as a required CI stage after lint and test. Below threshold, the eval job exits non-zero and branch protection blocks the merge. Now a prompt edit is reviewed and tested exactly like a code change — because it IS one. This is the habit to take home: no prompt or tool-def change reaches main without passing the golden set.
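
The gate itself is tiny. A sketch, assuming each eval case has already been reduced to pass/fail and that 0.9 is the agreed threshold (both the threshold and the function name are placeholders):

```python
from typing import Sequence


def eval_exit_code(results: Sequence[bool], threshold: float = 0.9) -> int:
    """Return 0 if the pass rate meets the threshold, 1 otherwise.

    An empty result set counts as a failure, so a broken eval run
    cannot silently pass the gate.
    """
    if not results:
        print("eval gate: no results, failing")
        return 1
    rate = sum(1 for r in results if r) / len(results)
    print(f"eval gate: pass rate {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


# In the CI entry point this would be: sys.exit(eval_exit_code(results))
```

Branch protection only needs the non-zero exit; everything else about the eval harness can change without touching the gate.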

Demo + look-along. Answers the user's third question directly: how do we monitor AI performance? The reframe: "AI performance" is two things — is it up and cheap (ops), and is it right (quality). Datadog gives you both, correlated.

"Monitoring AI" is two questions that need two layers. Ops is what they already know — latency, errors, cost. Quality is new: groundedness (did it stick to tool facts), tool-call correctness, and containment (resolved without escalating to a human). The magic is correlation — click a bad answer in LLM Observability and jump to the exact APM trace, tokens, and tool calls behind it.

Walk the trace top to bottom. The value over plain logs: you see the decision (decide → which tool → result → phrase), each with cost and latency, plus a groundedness check on the final step. When a customer reports a bad answer, you don't guess — you open that conversation's trace and see exactly which tool was called, with what arguments, and where it went wrong.

Land the monitoring section on intuition: the earliest warning is usually economic or qualitative, not a 500. A cost-per-conversation spike means a loop or context bloat; a groundedness dip means a silent regression slipped past evals (or wasn't covered); falling containment means it's escalating more than it used to. Teach them to watch cost and groundedness as leading indicators, not just error rate.
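
To make "leading indicator" tangible, the two checks can be sketched as plain comparisons against a baseline window. The ratio, the allowed drop and the function names are arbitrary illustrative values; in practice these would be monitor definitions in Datadog rather than application code.

```python
from statistics import mean
from typing import Sequence


def cost_spike(baseline_costs: Sequence[float], recent_costs: Sequence[float],
               ratio: float = 1.5) -> bool:
    """True when recent mean cost per conversation exceeds ratio x baseline mean."""
    return mean(recent_costs) > ratio * mean(baseline_costs)


def groundedness_dip(baseline_scores: Sequence[float], recent_scores: Sequence[float],
                     max_drop: float = 0.05) -> bool:
    """True when mean groundedness fell by more than max_drop versus baseline."""
    return mean(baseline_scores) - mean(recent_scores) > max_drop
```

Neither check fires on an HTTP error, which is the point: both can go red while the error rate stays flat.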

60 minutes, whiteboard, everyone. This is where the day's concepts hit their actual project. Facilitate, don't lecture. Capture decisions in the action-items slide. This block can stretch or shrink — it's the buffer.

Run this as four facilitated rounds, ~12 min each. Round 1 forces the tool-vs-RAG call per source and exposes which vendor APIs even exist. Round 2 surfaces the long-lead risk: vendor sandbox access (the thing most likely to sink the real project). Round 3 makes the account/IAM layout concrete. Round 4 assigns owners — unowned tools and unowned eval sets rot. Write every decision on the action-items board.

Housekeeping, but load-bearing. The day only works if the prep is done. Be honest about what sinks it if skipped.

The shape of the day. The one rule: protect 09:30 (build) and 12:45 (evals) — they carry both "aha" moments. If a hands-on block overruns, cut tracing to a follow-up recording and shrink the architecture session; never cut build or evals. Lunch is a real boundary, not optional.

Sequence by lead time, because two items have long fuses. Bedrock model access is request-gated — enable it a week out, not the morning of. Vendor sandbox access is the single thing most likely to slip and it only bites in the architecture session. Everything else is the facilitator building the starter repo and datasets. Day-before is verification only: clean-clone run, then send the pre-req.

Keep the attendee ask tiny and verifiable — four checkboxes, five minutes. The smoke call is the one that matters: it proves their SSO profile, region, and Bedrock access all work together before the room. The failure mode this prevents: five people debugging AWS creds at 09:35 while the build block bleeds out. Make the smoke test copy-paste-able in the pre-req message.

15 minutes. Convert the whiteboard into owned, dated actions. A decision without an owner is a wish.

Fill this in live from the architecture-session board. Push for a real name and a real date on every row — "the team" is not an owner. The bottom row is starred for the next slide: get the starter repo green on a clean clone and ~80% of the real project's early risk evaporates.

Close on the two things worth remembering. Operationally: the clean-clone starter repo is the single highest-leverage artifact — it makes the workshop run and it IS the demo's M1. Philosophically: the spine that has run through every part — the model phrases, the systems own the facts, and we engineer so the bot is never confidently wrong. Thank the room; point them at the demo/ tickets to keep building.