Week 1 — Agentic Architecture and Orchestration

Agentic · 27%

32 sections · 19 concepts · 9 misconceptions · 6 exercises

Lecture

Read in depth

17 sections

From First Principles

At the most basic level, an agentic system exists because one model response is often insufficient for real work. Production tasks regularly require:

retrieving information the model does not already have
taking actions through tools
reevaluating decisions after new facts arrive
splitting work into specialized subproblems

That means the system cannot be designed as a single static prompt. It must behave like a controlled reasoning loop over changing state.

From first principles, Week 1 is about three realities:

the model needs an explicit control loop
high-risk steps need structural enforcement, not hopeful wording
complex work needs deliberate decomposition, not just more tokens

If a student understands those three principles deeply, many Week 1 exam questions become straightforward.

Guided Walkthrough: Building A Refund Agent Correctly

Walkthrough goal:

understand how loop control, prerequisites, and escalation fit together in one system

Step 1: Start with the user request

Example:

"I was charged twice for order 12345 and I want my money back."

First-principles question:

What facts does the model need that it cannot safely infer from the user message alone?

Expected answer:

verified customer identity
order ownership
charge status
refund eligibility

Step 2: Decide whether the system can rely on prompting alone

Ask:

If a mistaken refund is costly, should the system merely instruct the model to verify identity first?

Expected answer:

no, because the workflow needs a deterministic prerequisite

Step 3: Design the loop

The loop should:

ask Claude for the next step
inspect stop_reason
run required tools
append results
continue until end_turn

Step 4: Add enforcement

Before process_refund can run, the application should verify:

a verified customer ID exists
the order belongs to that customer
the refund amount is within policy

Step 5: Add escalation

Escalation is required if:

the user asks for a human
policy is ambiguous
identity remains unresolved
the refund exceeds the autonomous threshold

Step 6: Consider multi-issue requests

If the user also says:

"The replacement item was damaged too"

the coordinator should decompose the conversation into at least two issue tracks:

duplicate charge
damaged replacement

Those tracks can share verified identity context but still require separate investigation.

Teaching point:

The important lesson is that "agent intelligence" alone is not enough. Reliable systems are built by combining model-guided reasoning with deterministic workflow structure.

Week 1 Worked Pseudo-Architecture

User Request
   |
   v
Coordinator Loop
   |
   +--> allowedTools includes "Task"
   |
   +--> Task -> Search Subagent
   |         AgentDefinition:
   |         - description
   |         - system prompt
   |         - restricted tools
   |
   +--> Task -> Analysis Subagent
   |         AgentDefinition:
   |         - description
   |         - system prompt
   |         - restricted tools
   |
   +--> get_customer --------------+
   |                               |
   |                      verified customer ID
   |                               |
   +--> lookup_order --------------+
   |                               |
   |                    order ownership and status
   |                               |
   +--> policy gate / threshold check
   |          |             |
   |          |             +--> escalate_to_human
   |          |
   |          +--> process_refund
   |
   +--> final unified response

Week 1 Board Teaching Notes

Draw the loop before discussing prompts. Students retain control flow better when they can point to state transitions.
Ask students which parts are model-guided and which parts are deterministic. Keep pressing until they separate those cleanly.
When teaching coordinator-subagent architecture, ask "who owns completeness?" The answer should be the coordinator, not the synthesis agent.
Add a separate board segment for Task, allowedTools, and AgentDefinition. Students should be able to explain exactly what enables spawning and what shapes each subagent role.

deep dive Deep Dive: Hooks, Enforcement, and Handoff Reliability

Some Week 1 concepts are easy to mention and easy to underteach. Hooks are one of them.

From first principles, a hook is useful when the application needs to alter or inspect a tool interaction at a point where the model itself should not be the only enforcement layer. The guide calls out hook patterns like PostToolUse and outgoing tool-call interception. These matter because they let the application enforce or normalize behavior deterministically.

Deep Dive A: `PostToolUse` as a Normalization Layer

Suppose three backend tools return dates in different formats:

Unix timestamps
ISO 8601 strings
numeric status codes plus separate reason fields

If the model has to interpret each raw format directly every time, cognitive load rises and inconsistencies spread through the workflow. A PostToolUse hook can normalize those outputs before the model sees them, so the model reasons over a common representation.

Why that matters:

it reduces accidental format confusion
it prevents downstream prompt complexity from ballooning
it keeps the model focused on decision quality rather than data cleanup

Deep Dive B: Outgoing Tool Interception

Imagine an autonomous refund workflow with a hard rule:

refunds above $500 must go to a human

There are two possible designs:

prompt Claude to remember the policy
intercept the refund tool call and block or reroute it

The second is stronger because it guarantees compliance even when the model’s reasoning path varies.

Deep Dive C: Handoff Quality

Handoffs are not just summaries. In a real escalation, the human may not have the conversation transcript. A proper handoff should therefore stand on its own:

customer or case ID
issue type
facts established so far
root cause or likely root cause
action already attempted
recommendation for the next human step

A weak handoff says:

"Customer upset. Needs help."

A strong handoff says:

"Customer ID 48291. Duplicate charge confirmed on order 12345. Verified refund amount $84.50. Refund exceeds auto-threshold because second issue involves damaged replacement requiring manual override. Recommended action: review damaged-item exception and approve combined handling."

This level of structure is what exam questions are trying to reward.

Lecture 1.1: The Agentic Loop

Key Distinctions:

loop control comes from API state, not from reading assistant prose for hints
tool_use means continue with tool execution, while end_turn means stop the loop
having tools available is not the same as being instructed to use them

Common Wrong Answers:

"Continue whenever the reply feels incomplete."
"Stop once the assistant writes a natural-sounding answer."
"Use a decision tree instead of inspecting stop_reason."

What To Memorize:

stop_reason is the control surface
append tool results and continue only on tool_use
stop on end_turn

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Label three sample turns as tool_use or end_turn.
Rewrite a brittle prose-parsing loop into a stop_reason-based loop.

Check Your Understanding:

Why is assistant wording a weak completion signal?
Show answer
The model's prose is a probabilistic surface, and the same wording can appear when more tool calls are still required. The API's stop_reason is the explicit completion signal — it tells the application whether the loop should continue (tool_use) or stop (end_turn). Driving control flow off interpreted text instead of explicit state is brittle by construction.
What must happen after a tool result is returned?
Show answer
The result must be appended to the message history and the loop must continue, sending the updated conversation back to Claude. The model cannot reason over information it has not seen, so a tool that executes but whose output never re-enters context creates a broken half-loop where work is done but never used.

An agentic loop is not "send one prompt and hope for the best." It is a control structure. Claude reasons over the current conversation, decides whether a tool is needed, requests that tool, receives the tool result back in context, then continues reasoning. This repeats until Claude reaches a natural stopping point.

The key control signal is stop_reason. For this exam, the distinction that matters most is:

tool_use: Claude wants one or more tools to run, so your loop should continue.
end_turn: Claude is done with the current task and can produce the final answer for that turn.

This matters because many fragile implementations try to infer completion from assistant text. That is weak engineering. A model may say "I’m done" and still require a tool in the next turn if the loop is structured incorrectly. Or the opposite: it may produce text that looks incomplete even though end_turn has occurred. Control flow should follow explicit API signals, not prose interpretation.

Another core rule: tool results must be returned to Claude as part of the conversation history. The model cannot reason over information it has not seen. If a tool call fetches customer data, order details, or document metadata, that result must be injected back into the context for the next iteration. Otherwise the system becomes a broken half-loop where tools execute but the model does not get to use the output.

In production, iteration caps are still useful, but only as a guardrail. They are not the primary completion signal. A cap prevents runaway loops; it should not decide that normal work is done.

Example

Bad logic:

ask Claude for a response
if the response text contains "final answer", stop
otherwise try to parse whether a tool is needed

Better logic:

send the current conversation to Claude
inspect stop_reason
if tool_use, execute the requested tool calls
append tool results to the conversation
repeat
if end_turn, return the answer

Why this shows up on the exam

The exam likes tradeoff questions where one option is "add stronger prompting" and another is "enforce the control flow programmatically." If a workflow requires guaranteed ordering or deterministic compliance, the right answer is usually structural enforcement, not stronger prose.

📐 See the diagram: stop_reason as control surface.

exercise Guided Exercise 1.1

Write pseudocode for an agent loop that:

receives a user request
allows Claude to call tools
continues while stop_reason == "tool_use"
ends when stop_reason == "end_turn"

Sample Answer

messages = [user_message]

while True:
    response = call_claude(messages, tools=toolset)

    if response.stop_reason == "tool_use":
        messages.append(response.assistant_message)
        for tool_call in response.tool_calls:
            result = run_tool(tool_call)
            messages.append(tool_result_message(tool_call.id, result))
        continue

    if response.stop_reason == "end_turn":
        return response.final_text

    raise UnexpectedStateError(response.stop_reason)

Lecture 1.2: Deterministic Enforcement vs Prompt Guidance

Key Distinctions:

prompts guide model behavior, while deterministic controls guarantee compliance
risky ordered workflows need gates and interception, not stronger wording
policy enforcement belongs in system structure, not just in prompts

Common Wrong Answers:

"Add more examples and the ordering issue will disappear."
"Use stronger cautionary wording for mandatory business rules."
"Trust the model if it usually follows the policy."

What To Memorize:

deterministic gates beat probabilistic compliance
hooks can normalize or block behavior
structural enforcement is for costly failure modes

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Classify three workflow controls as prompt-based or deterministic.
Rewrite one risky prompt rule as a prerequisite gate.

Check Your Understanding:

When is stronger prompting still insufficient?
Show answer
Whenever the failure mode is costly enough that probabilistic compliance is unacceptable — identity verification before financial actions, threshold-bound approvals, irreversible operations, legally mandated steps. Stronger wording reduces the rate of mistakes but does not eliminate them, and a system that "usually" enforces a critical rule has not enforced it.
What problem does an outgoing hook solve better than a prompt?
Show answer
It guarantees enforcement at the point of action. A prompt asks the model to remember and apply a rule; an outgoing hook intercepts the tool call itself and can block, rewrite, or redirect it regardless of the model's reasoning path. That guarantee is what makes hooks the right answer for high-cost policy breaches.

Not every workflow should be left entirely to model judgment. This exam expects you to know when probabilistic behavior is acceptable and when it is not.

Prompt guidance is useful for:

prioritizing one reasonable tool over another
giving escalation criteria
describing quality standards
nudging the model toward better decomposition

Prompt guidance is not enough for:

identity verification before financial actions
policy thresholds that must never be exceeded
steps that are legally or operationally mandatory
actions that can cause irreversible damage

If a support agent must never issue a refund above a threshold without human review, the correct fix is not "remind the model more strongly." The correct fix is to intercept or block the tool call programmatically. The same principle applies to prerequisite gates. If get_customer must happen before process_refund, enforce the dependency.

This is one of the highest-value distinctions in the exam.

exercise Guided Exercise 1.2

A support system sometimes processes refunds before identity verification. Choose the better fix and explain why:

Add three more few-shot examples showing identity verification first.
Block refund tools until verification returns a valid customer ID.

Sample Answer

The second fix is better. The first is still probabilistic and can fail on edge cases. The second gives a deterministic guarantee for a business-critical prerequisite.

Lecture 1.3: Coordinator-Subagent Architecture

Key Distinctions:

the coordinator owns decomposition, routing, aggregation, and recovery
subagents do bounded specialist work rather than global orchestration
complete-looking synthesis can still hide upstream decomposition failure

Common Wrong Answers:

"If subagents are strong enough, the coordinator does not matter much."
"Coverage quality is mainly a synthesis problem."
"Direct subagent-to-subagent traffic is preferable because it is faster."

What To Memorize:

hub-and-spoke is the core pattern
coordinator owns completeness
subagent isolation is a feature, not a flaw

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Diagnose whether a failure belongs to the coordinator or a subagent.
Split one broad task into coordinator-owned and subagent-owned responsibilities.

Check Your Understanding:

Who owns completeness in a multi-agent design?
Show answer
The coordinator. Subagents are responsible for the bounded slices they are assigned, but the question of whether the slices add up to a complete answer is a decomposition question — it lives at the layer that decided how to partition the work and which subagents to invoke.
Why can a polished report still indicate coordinator failure?
Show answer
Because synthesis quality and coverage quality are independent. A synthesis subagent given a narrow set of findings can produce a fluent, well-structured report on those findings while the broader topic remains under-covered. A polished output is evidence that the synthesis layer worked; it is not evidence that the coordinator decomposed correctly.

A multi-agent system is not just "many agents." It needs a coordination model. The exam focuses on the coordinator-subagent pattern, especially hub-and-spoke designs.

In this pattern:

the coordinator receives the top-level task
it decomposes the work
it decides which subagents to invoke
it routes information between them
it handles recovery and aggregation
it owns the final answer

Subagents do not automatically inherit the coordinator’s context. This is another trap the exam uses repeatedly. If the synthesis agent needs the findings from the web-search and document-analysis agents, those findings must be explicitly passed into its prompt or its structured inputs.

The coordinator should also avoid overly narrow decomposition. A common failure mode is when the coordinator breaks a broad problem into only one slice of the topic. If the task is "AI impact on creative industries" and the coordinator decomposes only into visual-art subtasks, the subagents may perform perfectly and still produce an incomplete report. In that case the subagents are not the problem; decomposition is.

Lecture 1.4: Subagent Invocation, `Task`, and `AgentDefinition`

Key Distinctions:

spawning depends on Task, not on vague multi-agent prompting
available delegation requires allowedTools to include "Task"
subagents need explicit context because they do not inherit parent memory automatically

Common Wrong Answers:

"Subagents can infer the parent context from the session."
"Good system prompts make Task configuration unnecessary."
"Agent roles matter more than tool restrictions."

What To Memorize:

Task is the spawning mechanism
AgentDefinition should include description, system prompt, and tool restrictions
forked sessions support divergent analysis from a shared baseline

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Identify why a coordinator cannot spawn when Task is missing.
Rewrite a weak handoff to include explicit context and quality criteria.

Check Your Understanding:

Why do subagents not automatically inherit parent context?
Show answer
Each subagent runs as its own model invocation with its own message list; there is no implicit shared memory across Task calls. The architecture treats subagents as isolated workers, which is a feature — it forces the coordinator to be explicit about what each subagent needs and prevents leak-through of irrelevant or sensitive context.
What belongs inside an AgentDefinition?
Show answer
A description of the role, the system prompt that shapes the subagent's behavior, and the tool restrictions that scope what it can do. Together those three configure the subagent as a specialist; leaving any of them generic weakens specialization and increases the chance of tool misuse or off-task work.

This lesson covers a mechanism that is explicitly named in the source guide and is important enough that students should be able to state it precisely.

Subagents are not invoked abstractly. In the architecture described by the guide, the coordinator uses the Task tool to spawn subagents. That means delegation depends on actual tool availability, not just good prompt wording. If a coordinator is expected to invoke subagents, its allowedTools must include "Task".

That gives us a concrete exam distinction:

describing delegation in the prompt is not the same as enabling delegation in the system
the coordinator can only spawn subagents if the spawning mechanism is actually allowed

This matters because many wrong answers on architecture questions sound plausible at the prompt layer while the real failure is at the configuration layer.

The second key concept is explicit context passing. Subagents do not automatically inherit the parent’s full history or shared memory across invocations. If the coordinator wants a synthesis subagent to use the findings from a web-search subagent and a document-analysis subagent, it must pass those findings explicitly.

Weak handoff:

"Use what the previous agents found and produce a report."

Strong handoff:

pass the actual claims, evidence excerpts, source URLs, dates, and document identifiers needed for synthesis

The third concept is AgentDefinition. A subagent should be configured intentionally rather than treated as a generic secondary model invocation. The guide explicitly calls out:

description
system prompt
tool restrictions

Those settings define the role. A web-search agent, document-analysis agent, and synthesis agent should not all share the same instruction surface or tool access. If they do, specialization weakens and tool misuse becomes more likely.

The fourth concept is fork-based session management. Forking is useful when you want to branch from a shared analysis baseline into multiple possible approaches without contaminating the original line of reasoning. This is especially useful for:

comparing two migration plans
testing multiple investigation strategies
exploring alternative synthesis structures from the same evidence base

Forking is not the same as ordinary resumption. Resumption continues one path. Forking creates multiple paths from a shared starting point.

Minimal Operational Checklist

For a coordinator to spawn subagents correctly:

the coordinator must have access to the Task tool
allowedTools must include "Task"
each subagent should have a clear AgentDefinition
the coordinator should pass context explicitly
the coordinator should scope each subagent’s tool access to its role

Failure Mode Example 1

A team writes a detailed coordinator prompt that says:

"Delegate to specialized subagents when useful."

But no subagents are ever invoked.

The likely problem is not prompt wording. The likely problem is that the coordinator does not actually have access to Task, or allowedTools does not include "Task".

Failure Mode Example 2

A synthesis agent produces weak output and misses citations. The team blames the synthesis agent’s prompt.

The deeper issue may be that the coordinator handed off only a vague prose summary instead of explicit structured findings with provenance fields.

exercise Guided Exercise 1.3

A coordinator is supposed to spawn subagents, but this never happens in practice. What are the first three things you should verify?

Sample Answer

Verify that the coordinator has access to the Task tool.
Verify that allowedTools includes "Task".
Verify that the subagents are actually defined with usable role configuration and that the coordinator prompt can choose delegation.

exercise Guided Exercise 1.4

Why is this handoff weak?

"Use the previous agents' findings and produce a final report."

Sample Answer

It assumes implicit inheritance and does not pass the actual information needed for synthesis. A stronger handoff would explicitly include the findings, source metadata, and evidence needed by the downstream subagent.

Lecture 1.5: Decomposition Strategies

Key Distinctions:

prompt chaining fits fixed ordered workflows, while adaptive decomposition fits open-ended work
decomposition quality determines coverage quality
broad tasks need evolving plans rather than one static breakdown

Common Wrong Answers:

"Always use prompt chains because they are simpler."
"Adding more subagents automatically improves coverage."
"Planning quality matters less than final synthesis quality."

What To Memorize:

choose decomposition pattern by task shape
use adaptive decomposition for uncertain or broad tasks
split local versus integration review concerns

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Choose prompt chaining or adaptive decomposition for three scenarios.
Break one large review into local and cross-file passes.

Check Your Understanding:

What kind of task is a poor fit for prompt chaining?
Show answer
Open-ended or exploratory work where the next step depends on what was discovered in the previous one. A fixed chain commits to a sequence in advance; if the early steps surface a finding that should redirect the investigation, the chain has no way to incorporate it. Adaptive decomposition is the right pattern for that shape of work.
Why does decomposition quality affect coverage quality?
Show answer
Coverage is bounded by what the decomposition asked for. If the coordinator partitions a broad topic into a narrow slice, the subagents can execute that slice perfectly and the final answer will still be incomplete. Improving the synthesis layer cannot recover information that was never gathered, which is why coverage failures usually trace back to scoping decisions made upstream.

The course guide distinguishes between two useful patterns:

prompt chaining for predictable multi-step work
adaptive decomposition for open-ended investigation

Prompt chaining works well when the workflow is known in advance. For example:

analyze each file individually
summarize file-level findings
run a cross-file integration pass

Adaptive decomposition works better for open-ended work where the next step depends on what is discovered. For example:

map the codebase
identify high-risk modules
inspect dependencies
revise the plan after new findings emerge

The exam may ask which pattern fits a scenario. The right answer depends on predictability. If the work has known stages, prompt chaining is usually correct. If the work is exploratory and branching, adaptive decomposition is stronger.

📐 See the diagram: Prompt chain vs adaptive decomposition.

Lecture 1.6: Context Passing and Parallelism

Key Distinctions:

explicit handoff beats assumed shared memory
parallelism helps only when subtasks are independently scoped
quality criteria should be passed with the task, not left implicit

Common Wrong Answers:

"Spawn parallel agents first and clarify context later."
"Metadata and content can be mixed loosely in handoffs."
"Parallelization always improves quality."

What To Memorize:

pass findings explicitly
separate content from routing metadata
parallelize only when subtasks are clearly separable

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Rewrite a vague handoff into a complete subagent prompt.
Decide whether two subtasks should run sequentially or in parallel.

Check Your Understanding:

Why is assumed shared memory dangerous?
Show answer
Subagents only see what is passed to them, but a handoff written as if they already knew the context will produce silent gaps. The downstream subagent fills in plausible defaults, the synthesis layer treats those defaults as findings, and provenance is lost. Explicit handoff prevents the failure by making the unknowns visible.
What belongs in a high-quality handoff?
Show answer
The actual content the downstream subagent needs (claims, excerpts, source URLs, dates, document identifiers) separated from routing metadata, plus the explicit success criteria for the work being handed off. Vague summaries collapse content and metadata together and force the downstream agent to guess at structure.

Subagents need explicit context. That context should often be structured, not free-form. A strong design passes content and metadata separately, for example:

claim
supporting excerpt
source URL
publication date
document name

That separation matters because the downstream agent must preserve provenance. If you only pass a flattened summary, the synthesis layer may lose attribution.

Parallelism also matters. If a coordinator can invoke multiple Task calls in one response, latency can be reduced significantly. But parallelization should not create duplicated work. Scope each subagent carefully:

by subtopic
by source type
by question type

Lecture 1.7: Sessions, Resumption, and Forking

Key Distinctions:

resumption continues prior work, while forking explores alternatives from a shared baseline
stale tool outputs make naive resumption risky
changed files or facts should be communicated explicitly on resume

Common Wrong Answers:

"A resumed session automatically knows what changed."
"Forking is only for experimentation, not for disciplined comparison."
"Fresh restarts are always safer than targeted resumption."

What To Memorize:

use named resume deliberately
use forks for divergent approaches
stale state is the main resumption risk

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Choose resume, fork, or fresh start for three change scenarios.
List what changed information should be passed into a resumed session.

Check Your Understanding:

What is the main risk in naive session resumption?
Show answer
Stale state. The resumed session still trusts tool outputs and analysis from the original run, but the underlying world — files, data, system state — may have changed. Acting on stale evidence as if it were current is a quiet failure that produces confidently wrong work. Either tell the resumed session what changed or start fresh with a structured summary.
When is forking better than resuming?
Show answer
When you want multiple independent paths from a shared baseline — comparing two refactoring approaches, exploring alternative synthesis structures, isolating a verbose workflow from the main conversation. Resumption continues one line; forking creates parallel lines that can be evaluated against each other without contaminating the original.

The exam guide expects you to understand session state at a practical level:

named resumption continues a prior investigation when the context is still mostly valid
forking creates independent branches from a shared baseline
fresh starts with injected summaries are better when old tool outputs have become stale

This is an engineering judgment issue. Resuming a session that analyzed old code and then blindly trusting that analysis after major changes is weak. In that case, either tell the resumed session what changed or start fresh with a structured summary.

Lecture 1.8: Independent Review — Why a Generator Should Not Grade Itself

Key Distinctions:

a same-session reviewer inherits the generator's reasoning trail and tends to ratify it
independence comes from a fresh context, not from a different system prompt in the same session
"self-critique" prompts produce confidence calibration, not real review

Common Wrong Answers:

"Add a 'now critically review your previous answer' prompt to the same session."
"A more skeptical system prompt is enough to make a reviewer independent."
"If the generator is strong enough, an independent reviewer is unnecessary overhead."

What To Memorize:

the reviewer must not see the generator's chain of reasoning before forming its own opinion
spawn the reviewer as a separate Task with only the artifact and the criteria
a forked session counts as independent only if the fork point precedes the generator's reasoning

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Sketch the message list a coordinator would pass to an independent code reviewer for a generated patch. Strip everything not needed.
A team adds "Please check your work carefully and disagree if needed" to its synthesis prompt and reports better quality. Critique that intervention.

Check Your Understanding:

Why does adding a self-critique step to the same session usually fail to catch the generator's mistakes?
Show answer
The model already committed to a reasoning path and the messages preserving that commitment are still in context. A self-critique prompt is steered by the same evidence and the same prior conclusions, so the model tends to defend the answer rather than re-evaluate it. Independence requires a context that does not include the prior reasoning trail.
A coordinator forks a session to run a reviewer subagent. Is that automatically independent?
Show answer
Not by itself. A fork inherits the baseline messages; if the generator's reasoning was already in that baseline, the reviewer sees it and can be primed by it. Independence requires either a fresh session seeded only with the artifact and criteria, or a fork from a baseline cut before the generator produced its output.

A reviewer needs to disagree with the generator. That sounds like a prompting problem, but it is mostly a context problem. When the same session that produced an answer is asked to grade it, the answer's reasoning chain is still in the model's view. The model has already justified the answer, and most of the messages that follow will continue along the same line. A "critically review your work" prompt arrives at the worst possible moment — after commitment, against the grain of the prior text, and without any new evidence to anchor a different conclusion.

The failure mode is quiet. The reviewer produces a plausible critique that catches surface issues — typos, formatting, the kind of thing the generator was already going to fix on a re-read — and ratifies the substantive decisions. Stakeholders see "review passed" and trust it. The structural mistakes that the reviewer would have caught with a clean view of just the artifact survive into production. This is the pattern behind exam questions that ask why same-session self-review is weaker than independent review: it is not a quality of the prompt, it is a property of the conversation.

The intervention is to construct a context that does not contain the generator's reasoning. The cleanest version is a separate Task-spawned subagent whose prompt contains only the artifact under review (the patch, the report, the plan), the explicit criteria, and any reference material. No transcript, no draft history, no "the previous agent thought X." If a fork is used, fork from a point before the generator started, and pass only the artifact across. The reviewer then produces an opinion against the artifact, not against the prior model's defense of it.

Caveat: independent review is not free. The reviewer pays the full context cost again, and you lose any context-sensitive judgment the generator was able to apply. For low-stakes work — a draft email, a one-off summary — same-session re-reads are fine. The independence rule applies when a wrong answer is expensive enough that ratification by the same reasoning chain would be a real failure mode.

📐 See the diagram: Independent review — what the reviewer sees.

Lecture 1.9: Subagent Failure Modes — Partial Results, Timeouts, and Re-delegation

Key Distinctions:

a subagent that times out is not a subagent that returned nothing
"the call failed" is not enough information for the coordinator to recover safely
gap detection during synthesis is the coordinator's job, not the synthesis subagent's

Common Wrong Answers:

"If a subagent times out, drop the partial results and rerun the whole task."
"Surface a generic 'something went wrong' to the user and stop."
"If synthesis looks complete, no follow-up delegation is needed."

What To Memorize:

preserve partial results from a failed subagent; the coordinator decides whether they are usable
a structured failure record carries: what was attempted, what completed, what failed, and why
when synthesis surfaces a coverage gap, the coordinator re-delegates the gap, not the entire task

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

A research subagent retrieves three sources, then times out on the fourth. Write the structured failure context the subagent should hand back to the coordinator.
A synthesis pass produces a confident-sounding report but the coordinator notices one of the requested subtopics is missing. Outline the re-delegation step rather than restarting.

Check Your Understanding:

Why is "rerun the whole subagent task" the wrong default response to a partial-results timeout?
Show answer
The completed work is real evidence, and discarding it costs another full execution and another chance to time out. The right default is to surface the partial results plus a structured failure record, then let the coordinator decide whether to fill the gap with a narrowly scoped follow-up call instead of repeating the whole task.
The coordinator detects a gap during synthesis — one requested subtopic was never covered. What is the correct response?
Show answer
Re-delegate the specific gap. Spawn a focused subagent with the missing subtopic and the constraints needed to cover it, then re-synthesize. Restarting the full investigation wastes work, and silently shipping the incomplete report misrepresents what is known.

Subagent failures are rarely binary. A search subagent fetches three of four sources before its time budget runs out. A document-analysis subagent extracts most of a long PDF before hitting an exception on a malformed page. A coordinator-spawned tool call returns a permission error after one valid result. In every case, real work was done. The naive recovery — discard, retry — throws away evidence that is more reliable than anything a second attempt is likely to produce, and often hits the same boundary the second time.

The failure mode is two-sided. On one side, callers swallow the error and present partial results as if the run completed; the synthesis layer reports confidently on incomplete evidence and the user does not know the difference. On the other side, callers raise a generic "operation failed," drop the partial work, and force a full rerun. Both sides are wrong because they collapse three separate facts — what was attempted, what completed, what failed — into one signal.

The intervention is a structured failure context. When a subagent cannot finish, it returns the work it did complete, an explicit statement of what was not attempted or not finished, and the reason (timeout, validation error, permission denial, upstream 5xx). The coordinator now has enough to choose: synthesize on what is available with explicit gaps, re-delegate the unfinished portion to a narrower subagent with a fresh budget, or escalate. The same logic applies when synthesis itself surfaces a gap that the original decomposition missed — re-delegate the gap with focused scope, do not restart the whole investigation.

Caveat: structured failure context only helps if the coordinator actually inspects it. A coordinator that handles every failure with the same retry-or-give-up policy gains nothing from richer error data. The architectural commitment is upstream: error envelopes that the coordinator's synthesis logic is built to read.

Lecture 1.10: Handoff Quality and Human Escalation

Key Distinctions:

a handoff is a self-contained brief, not a transcript reference
explicit human requests are escalation triggers and should not be re-evaluated for "complexity"
escalation criteria belong to the system, not to the model's discretion alone

Common Wrong Answers:

"If the user asks for a human but the issue looks easy, the agent should keep trying first."
"A short status update like 'customer needs help with refund' is enough for a human to take over."
"Escalation is a fallback for when the agent gets stuck, not a normal control path."

What To Memorize:

a strong handoff includes case identifier, issue type, established facts, root cause hypothesis, actions attempted, and a recommended next step
an explicit user request for a human is honored immediately, regardless of perceived issue difficulty
escalation criteria — policy thresholds, identity gaps, explicit requests — are deterministic triggers, not nudges

Try It Yourself:

No single right answer — draft your attempt, then compare against the lecture's worked examples.

Take a weak handoff like "Customer upset, refund issue" and rewrite it for a human who has no transcript access.
A user with a $20 billing question writes "I want to talk to a person." Decide whether the agent should escalate immediately and justify the choice.

Check Your Understanding:

Why does an agent honor an explicit human-request even when the underlying issue looks simple?
Show answer
The user has stated a preference about how the issue should be handled, and that preference is itself the request. Re-evaluating it against the agent's own difficulty estimate substitutes the agent's judgment for the user's, which both delays resolution and damages trust. Honoring the request is the correct default.
What turns a status update into a usable handoff?
Show answer
Self-containment. A handoff is read by a human who likely cannot scroll the transcript, so it must carry the case identifier, what is established, what was attempted, and what the next human step should be. A status update like "customer needs help" describes the situation; a handoff describes what the human needs to do.

Two patterns recur in escalation questions. The first is the explicit human request. A user types "I want a human" or "transfer me to a person" or "stop, I want to talk to someone real." The user's words are the escalation trigger, full stop. An agent that answers "I can help with that — what is your order number?" or that runs through a complexity check first is overriding a stated preference, and the exam treats this as a clear miss. The same reasoning applies to ambiguous-but-emphatic frustration when paired with policy-sensitive operations: route, do not improvise.

The second pattern is the handoff itself. Escalation without a usable handoff is just abandonment. The default failure mode is a one-line status — "customer upset, needs refund help" — that forces the human to read the entire transcript before they can act, and most escalation surfaces do not show the transcript anyway. The result is wait time, repeated questions to the user, and a worse experience than if the agent had stayed with the issue. A strong handoff stands on its own: case identifier, issue type, facts established (verified customer ID, order ID, charge status), root cause if known, actions attempted by the agent, and a specific recommended next step for the human.

The intervention is a templated escalation tool, not a free-form prompt. The escalation tool's schema requires the structured fields, and a hook can validate the handoff before the escalation actually fires. That makes the escalation deterministic both at the trigger (explicit request, threshold breach, identity gap) and at the message (validated structure, no fields left blank). Prompting alone cannot guarantee either side.

Caveat: there is a real cost to over-escalation, especially for systems where humans are scarce and slow. The deterministic triggers should be calibrated — the explicit-request rule is unconditional, but the threshold and identity-gap rules should be set with the operational team that will absorb the volume. Honoring "I want a human" is non-negotiable; defining "policy threshold" requires a real number.

Drill

Memorize & spot misconceptions

4 sections

Flashcards Core Vocabulary 19 terms

Click a card to flip it. Keyboard: space toggles focused card.

Common misconceptions Common Misconceptions

“If the model says it is done, the loop should end.”
“More examples can replace business-rule enforcement.”
“Subagents know what the coordinator knows.”
“If the final answer is coherent, the decomposition must have been good.”
“Same-session self-review catches the generator's mistakes.”
“If a subagent fails, the work it already completed should be discarded.”
“If the user asks for a human but the issue looks easy, the agent should still try to resolve it first.”
“Forking is only useful when comparing alternative paths.”
“Always invoking every subagent guarantees coverage.”

Key distinctions Key Distinctions

tool_use vs end_turn

continue the loop only when the API state requires tool execution, not when prose merely sounds unfinished.
prompt guidance vs deterministic control

use prompts for judgment and routing, but use gates, hooks, and interception for mandatory policy or ordering constraints.
coordinator failure vs subagent failure

incomplete coverage often starts in decomposition, even when each subagent executes its assigned task well.
prompt chaining vs adaptive decomposition

fixed chains fit stable workflows, while broad or uncertain tasks need evolving decomposition.
context presence vs context inheritance

subagents use only what the coordinator explicitly passes, not what the parent session "already knows."
same-session self-review vs independent review

a reviewer in the same conversation inherits the generator's reasoning chain; independence requires a context that does not contain it.
partial-result preservation vs generic failure surfacing

a structured failure record carries what completed and what failed; "operation failed" is the wrong abstraction.
escalation trigger vs model discretion

explicit human requests, threshold breaches, and identity gaps are deterministic triggers — not optional nudges the model can override.
structured state export vs session resumption

long-running workflows recover from explicit state manifests, not from trusting that a resumed session still understands the world.

Don't say this Common Wrong Answers

"Add more prompting so the model remembers to verify identity first."
"If the report reads well, the orchestration must be correct."
"Invoke all subagents every time to guarantee coverage."
"End the workflow when the assistant sounds finished."
"Assume subagents can infer missing context from the larger conversation."
"Add 'now critically review your answer' to the same session."
"If a subagent times out, drop the partial results and rerun."
"Escalate only after the agent has tried everything else."
"Forking is only for exploring alternative paths, not for keeping the main conversation clean."

Lab

Practice

2 sections

case study Worked Case Study

Case:

A returns assistant performs well on simple requests but occasionally refunds the wrong account after matching a customer by name only.

Analysis:

The primary failure is not "lack of examples."
The critical issue is that identity verification is not enforced before order or refund operations.
A secondary risk is that the agent may be using ambiguous lookup inputs without requiring a unique identifier.

Best redesign:

require get_customer to return a verified customer ID before lookup_order or process_refund
ask for clarification when multiple customer matches exist
preserve customer ID and order ID in a structured facts block
escalate when policy or identity remains unresolved

lab Lab

Design a customer support resolution agent that handles returns, disputes, and account issues.

Requirements:

tools: get_customer, lookup_order, process_refund, escalate_to_human
refunds require prior identity verification
multi-issue requests should be decomposed
escalations must include customer ID, root cause, refund amount if relevant, and recommended action

What a strong design includes

loop control based on stop_reason
programmatic prerequisite gate before refund
decomposition of multi-concern requests into separate tracks
structured escalation summary for humans who do not have the full conversation transcript

Quiz

Test yourself

2 sections

Quiz

What is the strongest signal that an agentic loop should continue? A. Assistant text looks incomplete B. stop_reason == "tool_use" C. There are tools available D. The system prompt requests another pass

Answer: B

Why is checking assistant prose for completion weak? A. It is expensive B. It prevents tool use C. It relies on natural-language interpretation instead of explicit API state D. It only works for JSON

Answer: C

Which is true of subagents? A. They inherit parent context automatically B. They require explicit context injection C. They cannot run in parallel D. They do not need tool restrictions

Answer: B

What is the best first response when a critical workflow step must always happen before another? A. Add more examples B. Enforce the prerequisite programmatically C. Raise the context window D. Use sentiment analysis

Answer: B

If a broad topic is consistently under-covered, what is the most likely root cause? A. The synthesis agent is too slow B. The coordinator decomposed the task too narrowly C. The web agent needs more tokens D. The user prompt is too short

Answer: B

A coordinator is expected to spawn subagents but never does. Which is the best first thing to verify? A. The context window is large enough B. The coordinator has Task available and allowedTools includes "Task" C. The synthesis agent has more examples D. The final answer prompt is more explicit

Answer: B

Week 1 Quiz Explanations

B is correct because loop progression should follow explicit API state. A and D are indirect signals. C says nothing about whether the model requested tool execution.
C is correct because prose interpretation is probabilistic and brittle. A is not the main issue. B is false. D is unrelated.
B is correct because subagents require explicit context passing. A and D are incorrect assumptions. C is false because parallel spawning is explicitly supported.
B is correct because critical ordering constraints require deterministic enforcement. A still leaves failure probability. C is irrelevant. D solves the wrong problem.
B is correct because incomplete coverage often begins with narrow decomposition by the coordinator. A, C, and D are downstream or weaker explanations.
B is correct because subagent spawning depends on the actual mechanism being available. A is unrelated. C and D address prompt quality rather than enabling delegation.

Test

Short Answer

Explain the difference between tool_use and end_turn.
When should a system choose adaptive decomposition instead of prompt chaining?
Why is structured escalation data important for human handoff?

Scenario Question

Your multi-agent research system produces well-written but incomplete reports. Logs show the coordinator always invokes all subagents, but on broad topics it assigns overly narrow subtasks. What is the architectural fix?

Sample Answer

The problem is coordinator decomposition, not downstream execution quality. The coordinator should inspect query breadth, partition the scope more comprehensively, and use iterative gap checking before final synthesis. It should invoke only the necessary subagents and should re-delegate targeted follow-up tasks when coverage gaps are detected.

Week 1 Test Rubric

Full credit: explains stop_reason correctly, identifies coordinator decomposition as the root issue, and proposes explicit gap detection or re-delegation.
Partial credit: identifies the right component but proposes only vague prompt improvements.
Low credit: blames synthesis quality or recommends adding more tools without fixing decomposition.

objective Objective

chapter map Chapter Map

Suggested 5-Day Teaching Flow

End-of-Lecture Recap and Homework

Lecture 1.1 Recap Questions

Lecture 1.2 Recap Questions

Lecture 1.3 Recap Questions

Lecture 1.4 Recap Questions

Lecture 1.5 Recap Questions

Lecture 1.6 Recap Questions

Lecture 1.7 Recap Questions

lecture summary Lecture Summary

Memorize What To Memorize 0 / 10

addendum Task Statement Coverage Addendum

Task Statement 1.1: Agentic Loops

Task Statement 1.2: Coordinator-Subagent Orchestration

Task Statement 1.3: Subagent Invocation, Context Passing, and Spawning

Task Statement 1.4: Multi-Step Workflows, Enforcement, and Handoff

Task Statement 1.5: Hooks and Interception

Task Statement 1.6: Task Decomposition

Task Statement 1.7: Session State, Resumption, and Forking

From First Principles

Guided Walkthrough: Building A Refund Agent Correctly

Week 1 Worked Pseudo-Architecture

Week 1 Board Teaching Notes

deep dive Deep Dive: Hooks, Enforcement, and Handoff Reliability

Deep Dive A: PostToolUse as a Normalization Layer

Deep Dive B: Outgoing Tool Interception

Deep Dive C: Handoff Quality

Lecture 1.1: The Agentic Loop

Example

Why this shows up on the exam

exercise Guided Exercise 1.1

Sample Answer

Lecture 1.2: Deterministic Enforcement vs Prompt Guidance

exercise Guided Exercise 1.2

Sample Answer

Lecture 1.3: Coordinator-Subagent Architecture

Lecture 1.4: Subagent Invocation, `Task`, and `AgentDefinition`

Minimal Operational Checklist

Failure Mode Example 1

Failure Mode Example 2

exercise Guided Exercise 1.3

Sample Answer

exercise Guided Exercise 1.4

Sample Answer

Lecture 1.5: Decomposition Strategies

Lecture 1.6: Context Passing and Parallelism

Lecture 1.7: Sessions, Resumption, and Forking

Lecture 1.8: Independent Review — Why a Generator Should Not Grade Itself

Lecture 1.9: Subagent Failure Modes — Partial Results, Timeouts, and Re-delegation

Lecture 1.10: Handoff Quality and Human Escalation

Flashcards Core Vocabulary 19 terms

Common misconceptions Common Misconceptions

Key distinctions Key Distinctions

Don't say this Common Wrong Answers

case study Worked Case Study

lab Lab

What a strong design includes

Quiz

Week 1 Quiz Explanations

Test

Short Answer

Scenario Question

Sample Answer

Week 1 Test Rubric

Deep Dive A: `PostToolUse` as a Normalization Layer