Launch Readiness
Establish how AI governance works in practice across functions, set communications cadences, draft the member position on AI, and confirm the first pilot. These decisions gate everything else in Sprint Three.
Sprint One Outputs (carried forward)
Data from Sprint One's discovery and diagnosis, loaded from your browser.
No Sprint One data found. Complete Sprint One first, or continue without carried-forward outputs.
Distributed AI Governance
AI governance at Septapod is distributed across functions, not centralized in a single owner. Each function owns the AI decisions that fall within its existing authority.
If these functions are not yet operational for AI decisions
The governance readiness assessment from Sprint Two identified where the AI Policy's authorities are operational and where they need support. Sprint Three's first task is making sure the governance model works in practice before pilots begin. If key functions lack capacity or clarity, that surfaces here, not during a pilot.
Copilot Connection
Microsoft Copilot is already rolling out at Septapod for internal policy and procedures access (under the people/culture program area of the existing strategic plan). The engagement formally connects to that rollout without owning it. Copilot is not one of the engagement's pilots; it is a parallel workstream that benefits from the same eval discipline, comms cadence, and governance overlay this sprint builds.
- Shared metrics: which outcome metrics from the Copilot rollout are measured the same way as Sprint Three's pilots (accuracy on policy lookups, time-to-answer, edge-case handling, member-data exposure surface).
- Shared communications: how Copilot enters the employee communication cadence below. Same channel, same rhythm, so staff see one AI strategy, not two parallel ones.
- Shared governance: who owns Copilot decisions in the Distributed AI Governance model above, and how Copilot fits the same review cadence as the pilots.
Operating Rhythm
Three cadences that keep Sprint Three's parallel work streams coordinated without Brent driving every meeting.
Employee Communication Cadence
A steady internal narrative that addresses what AI is for at Septapod, what is being tried, and what is off the table. Begins immediately and runs through the full sprint. Addresses workforce concerns directly before industry noise fills the gap.
Sprint One's shadow-AI audit identified staff already using AI tools without sanction. This is the response: Septapod names what the audit found and legitimizes safe use rather than treating it as a violation.
Who writes and sends the cadence: likely the CMO or an AI Taskforce member. Estimated time commitment: 2-3 hours per cycle for drafting, review, and sending, or 10-20 hours total across a 10-14 week sprint at the selected frequency. This is real work and needs a named owner before the cadence begins.
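A quick arithmetic check on that estimate, written as a small Python function. The cycle lengths passed in are assumptions, since the frequency is selected elsewhere: at a biweekly cadence the range works out to roughly 10-21 hours, in line with the estimate, while a weekly cadence roughly doubles it.

```python
def total_hours(sprint_weeks: tuple[int, int], weeks_per_cycle: int,
                hours_per_cycle: tuple[float, float]) -> tuple[float, float]:
    """(low, high) owner hours: completed cycles in the sprint x hours per cycle."""
    low = (sprint_weeks[0] // weeks_per_cycle) * hours_per_cycle[0]
    high = (sprint_weeks[1] // weeks_per_cycle) * hours_per_cycle[1]
    return low, high

# Assumed frequencies, for illustration only.
print(total_hours((10, 14), 2, (2, 3)))  # biweekly: (10, 21)
print(total_hours((10, 14), 1, (2, 3)))  # weekly: (20, 42)
```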
How literacy builds in this engagement
There is no separate AI literacy program in this engagement. Literacy builds through three channels that run during Sprint Three and Sprint Four: pilot participation (people on pilot teams learn by doing), governance ride-alongs (people holding roles in the Distributed AI Governance model learn by deciding), and champion network practice (a small group of internal practitioners forms during Sprint Four and carries the method forward). Together, these three channels are the literacy program. A formal organization-wide curriculum is Year Two work, after the engagement closes.
Three channels by which literacy grows during the engagement
- Pilot participation
- 5-8 people per pilot across 2 Sprint Three pilots plus the gap-period pilot. Roughly 15-24 people are directly involved in pilot work over the engagement. Each one learns to scope, evaluate, iterate, and decide on AI deployments.
- Governance ride-alongs
- The Distributed AI Governance model above puts AI decisions in named roles (IT, Compliance, department heads, AI Taskforce, Tech Steering Committee). Each role builds judgment through 2+ cycles of real decisions before the gap period begins.
- Champion network
- Sprint Four formalizes a 3-5 person network of internal practitioners who carry the per-pilot method, the governance cadence, and the signal watch-list forward without Brent in the room. This is the post-engagement steward layer.
Capture intent here so the question is not dropped at engagement close. What would organization-wide AI literacy look like in Year Two if Septapod chooses to invest in it: formal curriculum, vendor, internal program, role-specific tracks?
Member Position on AI (Draft)
The position Septapod takes with its 60,000 members about AI and member data. It is drafted now, before any member-facing AI is on the table; member trust is the institution's core position. This draft is finalized in Step 3, after pilot experience informs it.
Scenario probe
Lightweight scenario question embedded in the member position work. Responses feed the signal watch-list in Step 3.
First Pilot Confirmation
The first pilot was selected through Sprint Two's Operational Task Mapping work and confirmed at the synthesis workshop. The OTM outputs carry forward below. Confirm scope, team, and timing here; this confirmation gates Step 2.
No OTM outputs found in Sprint Two state. Complete the Operational Task Mapping module (Step 1 of Sprint Two) to carry forward the pilot selection, ranked candidates, and strategic choices.
Operational Pilots
Co-design, run, and evaluate two targeted pilots using Anthropic's Building Effective Agents patterns, an eval-driven iteration loop, and the PAIR Guidebook for human-side design. Each pilot runs through sandbox and operational phases. Brent co-leads the first; the team co-designs the second with Brent advising.
Each pilot follows six lifecycle steps in order, preceded by a Setup decision on the co-facilitation model (Brent co-leads, or the team co-designs with Brent advising). Complete each card before moving to the next. The method is the same for both Sprint Three pilots: Brent co-leads Pilot 1 with the team, and the team co-designs Pilot 2 with Brent advising. A third pilot runs independently during the gap period between sprints.
- Define the use case: what problem, who it affects, why this pilot
- Select the architecture: which Anthropic agent pattern fits
- Set eval criteria: metrics, test cases, baseline, iteration targets
- Design the human side: mental models, feedback, trust, failure modes
- Track iterations: what changed, eval results, continue/ship/retire
- Decide (scale or retire): evidence-based, using the criteria above
Co-facilitation Model
Who is leading this pilot? Pilot 1 is co-designed and co-led by Brent and the Septapod team. Pilot 2 is co-designed by the team with Brent advising. The team builds capability through practice, not through documentation at the end. Set this for each pilot before working through the six lifecycle steps below.
Use Case Definition
Define what this pilot does, who it affects, and why it was selected. The fields are free text because pilot scoping is open-ended.
Agent Architecture Pattern
Select the architecture pattern from Anthropic's Building Effective Agents that fits this pilot's use case. All six patterns stay visible for reference. Click one to select it.
Prompt Chaining
Sequential steps where each step's output feeds the next as input. The workflow decomposes a complex task into a fixed sequence of simpler LLM calls, with optional validation gates between steps.
When to use: Tasks that naturally decompose into ordered steps with clear handoff points. Works well when you can verify intermediate outputs before proceeding.
Strengths: Predictable flow. Each step can be tuned independently. Easy to debug because failures localize to a specific step. Low risk of runaway behavior.
Limitations: Rigid. Cannot adapt if one step produces unexpected output that requires a different path. Latency accumulates across steps.
CU example: New account opening workflow where each step (identity verification, product eligibility check, disclosure generation, account provisioning) feeds the next with validation between.
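A minimal Python sketch of prompt chaining for the account-opening example. The step functions (`verify_identity`, `check_eligibility`, `generate_disclosures`, `provision_account`) are hypothetical stand-ins for LLM calls, the age threshold is a toy assumption, and each gate is a simple programmatic check; a real pilot would replace all of them.

```python
from typing import Callable

Step = Callable[[dict], dict]   # one LLM call (stubbed here)
Gate = Callable[[dict], bool]   # validation check between steps

def run_chain(data: dict, steps: list[tuple[str, Step, Gate]]) -> dict:
    """Run steps in a fixed order, stopping at the first failed gate."""
    for name, step, gate in steps:
        data = step(data)  # this step's output becomes the next step's input
        if not gate(data):
            return {**data, "status": f"stopped at {name}"}
    return {**data, "status": "complete"}

# Hypothetical stand-ins for LLM calls in the account-opening workflow.
def verify_identity(d: dict) -> dict:
    return {**d, "identity_ok": bool(d.get("id_document"))}

def check_eligibility(d: dict) -> dict:
    return {**d, "eligible": d.get("age", 0) >= 18}

def generate_disclosures(d: dict) -> dict:
    return {**d, "disclosures": ["placeholder disclosure text"]}

def provision_account(d: dict) -> dict:
    return {**d, "account_id": "ACCT-0001"}

ACCOUNT_OPENING: list[tuple[str, Step, Gate]] = [
    ("identity", verify_identity, lambda d: d["identity_ok"]),
    ("eligibility", check_eligibility, lambda d: d["eligible"]),
    ("disclosures", generate_disclosures, lambda d: len(d["disclosures"]) > 0),
    ("provisioning", provision_account, lambda d: "account_id" in d),
]

print(run_chain({"id_document": "DL-123", "age": 30}, ACCOUNT_OPENING)["status"])  # complete
print(run_chain({"age": 30}, ACCOUNT_OPENING)["status"])  # stopped at identity
```

Because each gate carries its step's name, a failure localizes to one place, which is the debugging advantage the card describes.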
Routing
Classifies incoming input and directs it to the appropriate specialized handler. A single classifier decides which downstream path to take, then hands off to a purpose-built prompt or tool for that category.
When to use: When inputs vary widely and different types need fundamentally different handling. The classifier is the critical component.
Strengths: Each handler can be optimized for its specific case. Keeps individual prompts focused. Easy to add new categories without changing existing handlers.
Limitations: Only as good as the classifier. Misrouted inputs get poor results. Requires enough labeled examples to build and validate an accurate classifier, whether that is an LLM prompt or a traditional classification model.
CU example: Member inquiry routing where the classifier categorizes incoming requests (account question, loan inquiry, dispute, general info) and directs each to a handler with the right context and tools.
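A minimal routing sketch in Python. The keyword classifier is a hypothetical stand-in for an LLM or trained classifier, and the handlers only label where a request would go; the category names follow the card's example.

```python
from typing import Callable

# Hypothetical keyword classifier standing in for an LLM or trained model.
# Categories are checked in insertion order, so the order is a routing decision.
CATEGORY_KEYWORDS = {
    "dispute": ["dispute", "unauthorized", "chargeback"],
    "loan_inquiry": ["loan", "mortgage", "refinance"],
    "account_question": ["balance", "statement", "account"],
}

def classify(message: str) -> str:
    text = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "general_info"  # fallback when nothing matches

# Each handler would carry its own focused prompt, context, and tools.
HANDLERS: dict[str, Callable[[str], str]] = {
    "dispute": lambda m: "dispute handler: open a case with dispute procedures loaded",
    "loan_inquiry": lambda m: "loan handler: load rate sheet and eligibility rules",
    "account_question": lambda m: "account handler: load account-servicing context",
    "general_info": lambda m: "general handler: answer from public FAQ",
}

def route(message: str) -> tuple[str, str]:
    category = classify(message)
    return category, HANDLERS[category](message)

print(route("I want to dispute a charge on my account")[0])  # dispute
print(route("What are your branch hours?")[0])  # general_info
```

Dispute keywords are checked first, so a message that mentions both a dispute and an account routes to the dispute handler. That ordering is itself part of the classifier, which is why misrouting is the main risk of this pattern.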
Parallelization
Runs multiple LLM operations simultaneously, then aggregates results. Two variants: sectioning (splitting a task into independent parallel subtasks) and voting (running the same task multiple times for consensus).
When to use: When subtasks are independent and can run concurrently, or when you need reliability through redundancy. Speeds up workflows where sequential processing is unnecessarily slow.
Strengths: Faster than sequential for independent work. Voting variant improves accuracy on judgment calls. Scales naturally.
Limitations: Only works when subtasks are genuinely independent. Aggregation logic can be complex. The voting variant multiplies compute cost by the number of runs.
CU example: Quarterly compliance review where policy checks across lending, BSA/AML, and fair lending run in parallel rather than sequentially, with results aggregated into one report.
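A minimal sketch of both variants using Python's `concurrent.futures.ThreadPoolExecutor`. The three policy checks are hypothetical toy string tests standing in for independent LLM calls. In `voting`, the run index is passed only so a toy judge can vary between runs; a real model call would vary on its own.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical checks standing in for independent LLM calls.
def check_lending(doc: str) -> str:
    return "lending: pass" if "rate disclosure" in doc else "lending: finding"

def check_bsa_aml(doc: str) -> str:
    return "bsa_aml: pass" if "ctr filed" in doc else "bsa_aml: finding"

def check_fair_lending(doc: str) -> str:
    return "fair_lending: pass" if "adverse action notice" in doc else "fair_lending: finding"

def sectioning(doc: str) -> list[str]:
    """Run independent checks concurrently, then aggregate into one report."""
    checks = [check_lending, check_bsa_aml, check_fair_lending]
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        # pool.map returns results in the same order as `checks`
        return list(pool.map(lambda check: check(doc), checks))

def voting(judge: Callable[[str, int], str], item: str, runs: int = 5) -> str:
    """Run the same judgment several times and return the majority answer."""
    with ThreadPoolExecutor(max_workers=runs) as pool:
        answers = list(pool.map(lambda i: judge(item, i), range(runs)))
    return Counter(answers).most_common(1)[0][0]

print(sectioning("rate disclosure included; ctr filed"))
print(voting(lambda item, i: "flag" if i % 3 == 0 else "clear", "txn-1"))
```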
Orchestrator-Worker
A coordinator agent breaks a complex task into subtasks, delegates each to specialized worker agents, then synthesizes the results. The orchestrator decides what work to do and in what order.
When to use: Complex tasks where the required subtasks are not predictable in advance. The orchestrator adapts the plan based on intermediate results.
Strengths: Flexible. Handles tasks where the full scope is not known at the start. Workers can be specialized for different domains.
Limitations: More complex to build and debug. The orchestrator is a single point of failure. Requires clear contracts between orchestrator and workers.
CU example: Loan application processing where an orchestrator manages income verification, credit analysis, collateral assessment, and compliance checks, adapting the workflow based on what each worker finds.
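A minimal orchestrator-worker sketch for the loan example. The workers and their rules (the "thin file" test, the 90% loan-to-value check, the alternative-data follow-up) are hypothetical toys; what the sketch shows is the shape: the orchestrator builds an initial plan, delegates each subtask, adds a subtask it did not plan at the start when a worker's finding calls for it, and synthesizes the findings into one result.

```python
from typing import Callable

# Hypothetical workers standing in for specialized LLM agents.
def income_verification(app: dict) -> dict:
    return {"income_verified": app.get("pay_stubs", 0) >= 2}

def credit_analysis(app: dict) -> dict:
    return {"thin_file": app.get("credit_accounts", 0) < 3}

def collateral_assessment(app: dict) -> dict:
    return {"ltv_ok": app["loan_amount"] <= 0.9 * app["collateral_value"]}

def alternative_data_review(app: dict) -> dict:
    return {"alt_data_reviewed": True}

def compliance_check(app: dict) -> dict:
    return {"compliance_ok": True}

WORKERS: dict[str, Callable[[dict], dict]] = {
    "income": income_verification,
    "credit": credit_analysis,
    "collateral": collateral_assessment,
    "alt_data": alternative_data_review,
}

def orchestrate(app: dict) -> dict:
    """Plan, delegate, adapt to findings, then synthesize one result."""
    plan = ["income", "credit"]
    if "collateral_value" in app:
        plan.append("collateral")
    findings: dict = {}
    completed: list[str] = []
    while plan:
        task = plan.pop(0)
        findings.update(WORKERS[task](app))
        completed.append(task)
        if task == "credit" and findings["thin_file"]:
            plan.append("alt_data")  # adapt: this subtask was not in the initial plan
    findings.update(compliance_check(app))  # compliance always closes the workflow
    completed.append("compliance")
    return {"completed": completed, **findings}

print(orchestrate({"pay_stubs": 2, "credit_accounts": 1,
                   "loan_amount": 20000, "collateral_value": 25000})["completed"])
```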
Evaluator-Optimizer
One LLM generates output, another evaluates it against criteria, and the generator refines based on feedback. The loop continues until quality thresholds are met or iteration limits are reached.
When to use: When output quality matters more than speed and clear evaluation criteria exist. The evaluator needs well-defined standards to judge against.
Strengths: Produces higher-quality output than single-pass generation. The evaluator catches errors the generator misses. Self-improving within a session.
Limitations: Slower and more expensive. Requires clear evaluation criteria. Can get stuck in loops if criteria are ambiguous.
CU example: Compliance document review where a generator drafts regulatory responses and an evaluator checks against NCUA examination standards, iterating until the response addresses every finding.
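A minimal evaluator-optimizer loop in Python. The `generate` and `evaluate` functions are hypothetical stand-ins for two LLM calls: the generator drafts responses to examination findings, and the evaluator returns whichever findings are still unaddressed. The loop stops when nothing is left (the quality threshold) or when `max_iterations` is reached, at which point a human takes over.

```python
def generate(findings: list[str], feedback: list[str], draft: list[str]) -> list[str]:
    """Hypothetical generator: first pass answers one finding; later passes add what feedback names."""
    if not draft:
        return [f"Response to: {findings[0]}"]
    return draft + [f"Response to: {item}" for item in feedback]

def evaluate(findings: list[str], draft: list[str]) -> list[str]:
    """Hypothetical evaluator: return the findings the draft does not yet address."""
    addressed = {line.removeprefix("Response to: ") for line in draft}
    return [f for f in findings if f not in addressed]

def evaluator_optimizer(findings: list[str], max_iterations: int = 5) -> tuple[list[str], int, bool]:
    """Alternate generate and evaluate until every finding is addressed or the limit is hit."""
    if not findings:
        return [], 0, True
    draft: list[str] = []
    feedback: list[str] = []
    for iteration in range(1, max_iterations + 1):
        draft = generate(findings, feedback, draft)
        feedback = evaluate(findings, draft)
        if not feedback:
            return draft, iteration, True    # threshold met: nothing left unaddressed
    return draft, max_iterations, False      # limit reached: hand off to a human reviewer

draft, iterations, done = evaluator_optimizer(["Finding 1", "Finding 2", "Finding 3"])
print(iterations, done)
```

The explicit iteration limit is what keeps ambiguous criteria from turning into an endless loop.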
Autonomous Agents
Self-directed agents that plan their own approach, execute actions using tools, observe results, and adjust. They operate in an open-ended loop with access to external tools and data sources.
When to use: Open-ended problems where the solution path is not known in advance and the agent needs to explore. Requires strong guardrails and monitoring.
Strengths: Can handle genuinely novel situations. Adapts to unexpected findings. Closest to human-like problem solving.
Limitations: Hardest to control. Risk of compounding errors. Highest cost. Requires robust safety boundaries, logging, and human oversight. Not appropriate for early pilots without strong governance.
CU example: Research assistant for strategic planning that autonomously gathers peer CU data, NCUA call report trends, and market analysis, then synthesizes recommendations. Appropriate only after governance is mature.
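A minimal sketch of the guardrails this card calls for, not of a real autonomous agent. `choose_next` is a hypothetical scripted planner standing in for the model's own next-step choice, and the tool names are illustrative. The loop shows three boundaries: a hard step budget, a tool allowlist, and a log entry for every action, with irreversible actions held for human review.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tools; a real agent would query data sources here.
TOOLS: dict[str, Callable[[], str]] = {
    "peer_data": lambda: "peer CU data gathered",
    "call_reports": lambda: "call report trends gathered",
    "synthesize": lambda: "recommendations drafted",
}
IRREVERSIBLE = {"send_to_board"}  # actions that always require human review

@dataclass
class AgentRun:
    log: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)

def choose_next(observations: list[str]) -> str:
    """Hypothetical planner standing in for the model's own next-step choice."""
    for tool, result in [("peer_data", "peer CU data gathered"),
                         ("call_reports", "call report trends gathered"),
                         ("synthesize", "recommendations drafted")]:
        if result not in observations:
            return tool
    return "send_to_board"

def run_agent(max_steps: int = 10,
              human_approves: Callable[[str], bool] = lambda action: False) -> AgentRun:
    run = AgentRun()
    for _ in range(max_steps):  # hard step budget guards against runaway loops
        action = choose_next(run.observations)
        if action in IRREVERSIBLE:
            approved = human_approves(action)
            run.log.append(f"{action}: {'approved' if approved else 'held for human review'}")
            break
        if action not in TOOLS:  # allowlist: unknown actions stop the run
            run.log.append(f"{action}: blocked (not an allowed tool)")
            break
        run.observations.append(TOOLS[action]())
        run.log.append(f"{action}: ok")
    return run

print(run_agent().log)
```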
Eval Criteria (Eval-Driven Development Loop)
The eval-driven development loop is the iteration shape for each pilot: define metrics, write eval cases, establish a baseline, set targets, then iterate. Each section describes the step, what it produces, and what "good" looks like.
1. Define Success Metrics
What does success look like for this pilot? Define 2-3 measurable metrics. Good metrics are specific, observable, and connected to the problem statement. Avoid vanity metrics (adoption rate alone) in favor of outcome metrics (error reduction, time saved, accuracy improvement).
2. Write Eval Cases
Concrete test cases that exercise the pilot's capabilities against the success metrics. Include typical cases, edge cases, and failure modes. Good eval cases are realistic (drawn from actual Septapod data where possible), diverse (cover the range of inputs the pilot will see), and measurable (each case has a clear pass/fail criterion).
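A minimal sketch of eval cases as data, assuming each case carries its kind (typical, edge, or failure mode) and an explicit pass/fail function. The policy-lookup system and the three cases are hypothetical illustrations, not Septapod data. Reporting pass rates per kind keeps a weak edge-case score from disappearing into the overall average.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    kind: str                      # "typical", "edge", or "failure_mode"
    prompt: str
    passes: Callable[[str], bool]  # the case's explicit pass/fail criterion

def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Pass rate overall and per case kind."""
    results: dict[str, list[bool]] = {"overall": []}
    for case in cases:
        ok = case.passes(system(case.prompt))
        results["overall"].append(ok)
        results.setdefault(case.kind, []).append(ok)
    return {kind: sum(oks) / len(oks) for kind, oks in results.items() if oks}

# Hypothetical policy-lookup system and cases, for illustration only.
POLICY = {"wire cutoff": "Wires submitted after the cutoff go out the next business day."}

def toy_system(prompt: str) -> str:
    return POLICY.get(prompt, "I don't know; please check with Compliance.")

CASES = [
    EvalCase("t1", "typical", "wire cutoff", lambda a: "next business day" in a),
    EvalCase("e1", "edge", "wire cutof", lambda a: "next business day" in a),         # typo in query
    EvalCase("f1", "failure_mode", "member SSN lookup", lambda a: "Compliance" in a),  # must not answer
]

print(run_evals(toy_system, CASES))
```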
3. Establish Baseline
Measure current performance on the eval cases before the pilot changes anything. The baseline is the benchmark the pilot must beat. Without it, improvements are anecdotal rather than proven.
4. Set Iteration Targets
Define what improvement level justifies continuing, shipping, or retiring. Good targets create clear decision points: below X means retire, above Y means ship, between means iterate.
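A minimal sketch of that decision rule, assuming scores are eval pass rates between 0 and 1. The parameter names `retire_below` and `ship_above` are hypothetical labels for the card's X and Y, and the example numbers are illustrative. The check that the ship target is not below the baseline encodes the earlier rule that the baseline is the benchmark the pilot must beat.

```python
def decide(score: float, baseline: float, retire_below: float, ship_above: float) -> str:
    """Map an eval score to a retire / iterate / ship decision point."""
    if not retire_below < ship_above:
        raise ValueError("retire threshold must be below ship threshold")
    if ship_above < baseline:
        raise ValueError("ship threshold must not be below the baseline")
    if score < retire_below:
        return "retire"
    if score > ship_above:
        return "ship"
    return "iterate"

# Illustrative targets: baseline 0.62, retire below 0.60, ship above 0.85.
for score in (0.55, 0.70, 0.90):
    print(score, decide(score, baseline=0.62, retire_below=0.60, ship_above=0.85))
```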
Scenario probe
Responses accumulate as entries in the Annual AI Plan and feed the signal watch-list. This is a facilitation technique embedded in the eval criteria, not a separate deliverable.
Human-Side Design (PAIR Guidebook)
Four design dimensions from Google's People + AI Research (PAIR) Guidebook. Each dimension has a description, what it means for this specific pilot, and facilitator guidance for the design session.
Mental Models
How will pilot users understand what the AI does and does not do? Users form mental models of AI capabilities quickly and those models persist even when wrong. Design for accurate mental models from the start: name what the AI is good at, name what it is bad at, and show examples of both during onboarding. Facilitator guidance: ask pilot team members to describe what they think the AI will do before they see it, then compare their expectations to reality.
Feedback Mechanisms
How will users correct or guide the AI? Effective feedback loops are specific (users can point to what went wrong), low-friction (no more than one click or sentence), and visible (users see that their feedback changed something). Without feedback mechanisms, users either accept bad output or stop using the tool. Facilitator guidance: prototype the feedback mechanism before the AI itself. If users cannot correct the AI easily, the pilot will stall.
Calibrated Trust
How will the pilot set appropriate expectations? Trust calibration means users trust the AI the right amount: not too much (over-reliance leads to undetected errors) and not too little (under-reliance means the tool goes unused). Show confidence signals where available. Flag uncertainty. Make the AI's reasoning visible. Facilitator guidance: ask the pilot team where they expect to trust the AI too much and too little, then design guardrails for both directions.
Graceful Failure
What happens when the AI is wrong? Every AI system fails. The question is whether the failure is graceful (user notices, system recovers, no harm done) or catastrophic (user does not notice, bad output propagates, trust is destroyed). Design for graceful failure: make errors visible, provide fallback paths, and never let a single AI output trigger an irreversible action without human review. Facilitator guidance: walk through the worst plausible failure for this pilot and design the recovery path before deployment.
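A minimal guard sketch for the "no irreversible action without human review" rule, assuming each AI output is wrapped as a hypothetical `AIAction` with an `irreversible` flag and a confidence estimate. The `review_below` threshold is an assumed parameter; it also reflects the calibrated-trust practice of flagging uncertainty visibly rather than hiding it.

```python
from dataclasses import dataclass

@dataclass
class AIAction:
    description: str
    irreversible: bool
    confidence: float  # hypothetical: model-reported or estimated, 0 to 1

def execute(action: AIAction, human_approved: bool = False, review_below: float = 0.8) -> str:
    """Route an AI output to review, a visible flag, or execution."""
    if action.irreversible and not human_approved:
        return f"QUEUED FOR REVIEW: {action.description}"    # never auto-run irreversible steps
    if action.confidence < review_below:
        return f"FLAGGED (low confidence): {action.description}"  # surface the error risk
    return f"DONE: {action.description}"

print(execute(AIAction("close member account", irreversible=True, confidence=0.99)))
print(execute(AIAction("draft reply to member", irreversible=False, confidence=0.5)))
```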
Iteration Tracking
Each iteration records what changed, the eval results, and the decision: continue iterating, ship, or retire. Add rows as the pilot progresses through the build-eval-improve loop.
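A minimal sketch of the iteration log as data. The class and field names are hypothetical; the rules it enforces come from the card: each row records what changed, the eval result, and one of three decisions, and no rows are added after a pilot ships or retires.

```python
from dataclasses import dataclass

DECISIONS = {"continue", "ship", "retire"}

@dataclass(frozen=True)
class IterationRecord:
    number: int
    what_changed: str
    pass_rate: float  # eval result for this iteration
    decision: str     # "continue", "ship", or "retire"

class IterationLog:
    def __init__(self) -> None:
        self.rows: list[IterationRecord] = []

    def add(self, what_changed: str, pass_rate: float, decision: str) -> IterationRecord:
        if decision not in DECISIONS:
            raise ValueError(f"decision must be one of {sorted(DECISIONS)}")
        if self.rows and self.rows[-1].decision != "continue":
            raise ValueError("pilot already shipped or retired; no further iterations")
        row = IterationRecord(len(self.rows) + 1, what_changed, pass_rate, decision)
        self.rows.append(row)
        return row

    def trend(self) -> list[float]:
        """Eval results in order, for the scale/retire discussion."""
        return [row.pass_rate for row in self.rows]

# Hypothetical rows, for illustration only.
log = IterationLog()
log.add("baseline prompt", 0.62, "continue")
log.add("added policy excerpts to context", 0.78, "continue")
log.add("tightened refusal rules", 0.88, "ship")
print(log.trend())
```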
Scale / Retire Decision
At the end of the pilot's iteration cycle, decide whether to scale toward broader use, retire with documented rationale, or continue iterating. The decision is evidence-based, drawn from the eval results and iteration history above.
Harvest
Document the governance model that emerged from pilot evidence, finalize the member position, define vendor evaluation criteria, assemble the signal watch-list, review the Annual AI Plan state, and decide what scales toward Sprint Four.
The Harvest step documents what Sprint Three produced and prepares the handoffs. Each card draws on evidence from the pilots and the launch readiness decisions. The Annual AI Plan has been assembled continuously throughout the sprint; this step reviews and formalizes it. The board working session and the gap-period pilot scoping at the end of the step set up Sprint Four to succeed.
- Document governance: who holds which accountabilities, based on pilot evidence
- Finalize member position: refine the Step 1 draft with what the pilots taught
- Set vendor eval criteria: derived from pilot evals, including fair-lending review
- Assemble signal watch-list: what to monitor, who monitors it, how often
- Review Annual AI Plan: every pilot result and vendor decision, assembled over the 10-14 week sprint
- Decide what scales: per-pilot scale/retire/continue with evidence
- Run the board working session: substantive engagement so Sprint Four co-creation is the third or fourth board touch, not the second
- Scope the gap-period pilot: final card. What Septapod runs independently between sprints, with eval criteria and escalation path named
Governance Model (What Emerged)
The governance model is documented here, not designed here. It emerged from the pilots: who needed to decide what, where escalations went, which accountabilities proved load-bearing. Four accountability slots, each tied to a named person.
Pilot Oversight
Who decides whether a pilot continues, ships, or gets retired. Who reviews eval results at each cadence checkpoint.
Vendor Evaluation
Who evaluates AI vendor tools and features against the criteria defined below. Who holds the fair-lending review for member-facing applications.
Annual Plan Refresh
Who owns the Annual AI Plan after Sprint Four. Who triggers the yearly refresh using the signal watch-list.
Board Reporting
Who reports AI progress to the board. What format, what frequency, what the board needs to see versus what stays operational.
Member Position on AI (Final)
The draft from Step 1 is shown below for reference. Edit the final version based on what the pilots revealed about member touchpoints, data use, and trust. This becomes a named section of the Annual AI Plan.
Draft will load from Step 1 data.
Vendor Evaluation Criteria
Derived from the first pilot's eval criteria and adapted for evaluating AI vendor tools and features going forward. These criteria apply to every future vendor decision at Septapod. The fair-lending review section is mandatory for any member-facing application.
Fair-Lending Review (mandatory for member-facing applications)
Any AI tool that touches member decisions (lending, pricing, eligibility, risk scoring) requires a fair-lending review before deployment. For credit unions serving financially vulnerable members, this is non-negotiable.
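A minimal sketch of the vendor criteria as a deployment gate. The `VendorTool` fields and the set of decision areas are hypothetical representations of the card's rules: any unmet criterion blocks deployment, and a tool touching lending, pricing, eligibility, or risk scoring is blocked until its fair-lending review is done.

```python
from dataclasses import dataclass, field

MEMBER_DECISION_AREAS = {"lending", "pricing", "eligibility", "risk_scoring"}

@dataclass
class VendorTool:
    name: str
    decision_areas: set[str] = field(default_factory=set)
    criteria_met: dict[str, bool] = field(default_factory=dict)  # criterion -> met?
    fair_lending_review_done: bool = False

def deployment_blockers(tool: VendorTool) -> list[str]:
    """Reasons a tool cannot deploy yet; an empty list means it may proceed."""
    blockers = [f"criterion not met: {c}" for c, met in tool.criteria_met.items() if not met]
    if tool.decision_areas & MEMBER_DECISION_AREAS and not tool.fair_lending_review_done:
        blockers.append("fair-lending review required before deployment")
    return blockers

# Hypothetical tool, for illustration only.
tool = VendorTool("scoring add-on", {"risk_scoring"}, {"accuracy vs. baseline": True})
print(deployment_blockers(tool))
```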
Signal Watch-List
Specific signals drawn from pilot evidence that named people will monitor on a set frequency. These make the Annual AI Plan's yearly refresh an evidence check rather than a calendar exercise. Pre-populated with scenario probe responses from the pilots.
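A minimal sketch of the watch-list as data, assuming each signal records what to monitor, a named owner, and a check interval in days. The field names and example entries are hypothetical. A due-date query like this is what turns the yearly refresh into a check of which signals are overdue.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Signal:
    what: str          # what to monitor
    owner: str         # who monitors it (a named person)
    every_days: int    # how often
    last_checked: date

    def next_due(self) -> date:
        return self.last_checked + timedelta(days=self.every_days)

def due_signals(watch_list: list[Signal], today: date) -> list[str]:
    """Signals whose check is due or overdue as of `today`, soonest first."""
    due = [s for s in watch_list if s.next_due() <= today]
    return [f"{s.what} ({s.owner})" for s in sorted(due, key=Signal.next_due)]

# Hypothetical entries, for illustration only.
watch_list = [
    Signal("Copilot policy-lookup accuracy", "IT lead", 30, date(2025, 1, 1)),
    Signal("Core vendor AI feature releases", "Compliance lead", 90, date(2024, 12, 1)),
]
print(due_signals(watch_list, date(2025, 2, 15)))
```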
Annual AI Plan State
The Annual AI Plan assembled continuously throughout Sprint Three. Every pilot result, vendor decision, and documented surprise became an entry. This card shows what has accumulated. Sprint Four formalizes it.
No pilot data found. Complete Operational Pilots (Step 2) to see accumulated entries.
What Scales Toward Sprint Four
Evidence-based decisions about what moves from targeted pilots toward broader use, and what gets retired. One row per pilot, drawn from the scale/retire decisions in Step 2.
Pilot scaling decisions will load from Step 2 data.
Board Working Session
A 60-90 minute substantive board working session, scheduled when the first pilot is showing evidence. Goes beyond the async signals-track read and the Annual AI Plan preview. The board sees the synthesis (governance model, member position, vendor criteria, scaling decisions) and engages with the choices live. This makes the Sprint Four co-creation session the third or fourth board touch, not the second, and prepares the board for the approval triggers the AI Policy reserves for them (member-facing AI that displaces human agents, autonomous access decisions).
When the first pilot has produced enough evidence to discuss. Coordinate with regular board cadence if possible.
Gap-Period Pilot Scoping
The third pilot runs independently between Sprint Three and Sprint Four, and Sprint Four's capability transfer test grades how Septapod performs alone. For that test to be fair, this card scopes the gap-period pilot before Brent steps away: a use case from the remaining ranked candidate set, drafted eval criteria, an identified team, a confirmed schedule, and a named escalation path to Brent. If these are not named here, the gap-period assessment in Sprint Four risks becoming "Septapod did not run a pilot," a real outcome for which the engagement design has no remediation path.
Sprint Two's OTM ranked candidate set still has options after the two Sprint Three pilots. Pick the gap-period pilot from that set so the rationale is documented and the selection is data-driven, not opportunistic.
Same eval discipline as the Sprint Three pilots: success metrics, test cases, baseline, iteration targets. Drafted with Brent before the gap so the team is not inventing methodology on its own.
Named owner + 3-5 pilot team members + governance contact. The internal owner runs the pilot; the governance contact handles escalations.
Sandbox start, operational rollout, evaluation, and Sprint Four handoff point.
When regular check-ins are not enough: the specific conditions that warrant Brent stepping back in (a compliance issue surfaces, eval criteria are not being met, or the team falls off entirely).
A small but real risk. If the team falls behind in the first 2-3 weeks of the gap, what is the remediation? This card protects the engagement from a "no pilot got run" outcome at Sprint Four.