react-native
claude-code
ai-augmented-engineering
senior-engineer
agentic-coding

How I Built an Agentic Coding System That Doesn't Need Me to Babysit It

Jean Desauw
7 min read

Most people using AI for coding are doing it wrong. Not because they pick the wrong model or write bad prompts. Because they're trying to shortcut to the result without building the system that makes the result good.

I spent a few months doing exactly that. Prompting, fixing, re-prompting. Getting somewhere, but not reliably. Then I changed the question.

Instead of asking "how do I get this agent to do what I want?", I started asking "how do I build something where I only have to show up for the decisions that actually need me?"

That's the real goal: a human-in-the-loop system. Not micromanaging every output. Not rubber-stamping code you didn't read. Showing up at the critical checkpoints, and letting everything else run.

Here's how I built it.

What "human in the loop" actually means

The term gets thrown around a lot. Martin Fowler's team at Thoughtworks defines it as structuring AI workflows so humans remain in the decision chain at meaningful points - not as a bottleneck on every line of code, but as a checkpoint on the decisions that matter.

The failure mode is trying to be in the loop everywhere. If you review every line an agent writes, you haven't gained anything - you've just become a slower compiler. The point is to engineer the system so you can confidently step back from the low-stakes work and stay present for the high-stakes calls.

Getting there is a build process, not a setting you flip.

Start with the codebase, not the prompt

This is the part almost everyone skips.

Matt Pocock nailed it in a recent video: your codebase, way more than your prompt, is the biggest influence on what the AI produces. A confused codebase produces confused output. It's not fixable with better prompting.

The things that hurt agents the most:

  • Files and modules that don't reflect any logical grouping
  • No fast way to know if a change worked (the agent edits something and gets no signal)
  • Inconsistent naming across components

That last one matters more than people expect. I asked Claude directly: what makes it easy for you to compose a screen from an existing component library? Here's part of what it said:

Clear, self-documenting prop interfaces. When I see <Button variant="primary" size="lg" disabled>, I instantly know what to do. When I see <Button style={complexStyleObject} onPress={...} customRenderer={...}>, I struggle more. Fewer, well-named props beat many configurable ones.

Consistent naming conventions. If your Button uses variant, your Card should too. If one uses size="lg" and another uses size="large", I'll make mistakes.

Molecules are my sweet spot. Pure atoms are too low-level - I end up reinventing your design system. Full templates are too rigid. Molecules like <FormField label="Email" error={...}> hit the middle: they encode design decisions but leave composition flexible.
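Concretely, the kind of interface Claude is describing looks something like this in TypeScript. The names and values here are illustrative, not taken from my actual component library:

```typescript
// Few, well-named props with closed string unions instead of
// open-ended style objects. The unions make every legal value
// enumerable, for the compiler and for an agent alike.
type ButtonVariant = "primary" | "secondary" | "ghost";
type ButtonSize = "sm" | "md" | "lg";

interface ButtonProps {
  variant: ButtonVariant;
  size: ButtonSize;
  disabled?: boolean;
  onPress: () => void;
}

// A small resolver the component might use internally.
// Exhaustive switch: adding a new size without handling it
// here becomes a compile error, not a silent styling bug.
function resolvePadding(size: ButtonSize): number {
  switch (size) {
    case "sm":
      return 8;
    case "md":
      return 12;
    case "lg":
      return 16;
  }
}
```

The point isn't the specific values - it's that a closed, consistently named API gives the agent nothing to guess about.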

So I refactored. Unified prop naming across the component library. Extracted the design system into a separate package inside the monorepo. Made sure every module had a clean interface.

The difference in output quality was immediate. Not because I prompted better. Because the agent had better material to work with.

Skills vs MCPs: the context budget question

This one took me a while to understand properly.

An MCP server (Model Context Protocol) exposes external systems as callable tools. When you connect the Playwright MCP, Claude sees 50+ function signatures injected into the context on every single turn: browser_click, browser_navigate, browser_screenshot... Whether or not you need them. That's a non-trivial chunk of your context budget, permanently allocated.

A skill is different. It's a Markdown file - a name, a description, a set of instructions. It only loads into the context when the agent decides it's relevant to the current task. The rest of the time it doesn't exist.

Take Playwright as a concrete example. Playwright has both an MCP and a CLI. If you use the MCP, you get those 50+ function signatures in context all the time. If you write a skill instead - "here's how to run Playwright tests, here's the CLI command, here's how to interpret the output" - the agent reads that file once when running tests, then it's gone. Same capability, a fraction of the cost.
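Here's roughly what such a skill file might look like. Claude Code skills are Markdown files with a short frontmatter (a name and a description); the paths and wording below are my own sketch, not copied from a real repo:

```markdown
<!-- .claude/skills/playwright-tests/SKILL.md -->
---
name: playwright-tests
description: How to run and interpret Playwright end-to-end tests via the CLI
---

Run the full suite with:

    npx playwright test

Run a single spec file:

    npx playwright test e2e/login.spec.ts

A non-zero exit code means failures. The terminal summary lists the
failing tests; the HTML report is written to playwright-report/.
```

The agent only loads this when it decides the task involves running tests. The rest of the time, it costs zero context.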

The same logic applies across the stack. Navigation conventions, component library rules, how to handle env vars, CI setup - these don't need live execution. They're knowledge the agent reads once. Skills handle that well. MCPs are for when you genuinely need the agent to call an external API or interact with a live system mid-task.

I still use MCPs. Just fewer of them, for the things only they can do.

/feedback: stop losing the good ideas

We've all had the same experience: sharp insight at 11pm, vague memory of it two days later, nothing useful a week after that. The friction point that should have changed how you work just disappears because you were deep in something else when it hit.

So I added /feedback to Claude Code. When something in the workflow irritates me, I run it without leaving the terminal. It creates a GitHub issue with the right labels automatically. Claude will also suggest it during a session when it detects repeated friction or unexpected failures. And /feedback-triage lets me work through the backlog and decide what to actually act on.
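Custom slash commands in Claude Code are just Markdown files in .claude/commands/. A sketch of what mine does - the labels and wording are placeholders, not my exact setup:

```markdown
<!-- .claude/commands/feedback.md -->
Capture a piece of workflow feedback as a GitHub issue.

1. Use the summary I passed in, or ask for one: $ARGUMENTS
2. Pick a label: `friction`, `idea`, or `works-well`.
3. Create the issue with the gh CLI, e.g.:

       gh issue create --title "<summary>" --label "feedback,<label>" --body "<details>"
```

Because it runs inside the session, the agent already has the context of what just went wrong - the issue body writes itself.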

Small things get captured now. A confusing component API. A spec pattern that keeps causing misunderstandings. A task that ran perfectly and should become a template. None of it requires switching context or remembering to write it down later.

Over time, that turns into something: the rough spots get smoother, and the things that work get codified. Not because I got more disciplined - because the tooling stopped requiring discipline to use.

What this actually looks like day to day

The goal isn't to never touch code. It's to touch the right code, at the right times.

Most mornings I write specs. I think through what done looks like, what the edge cases are, what the agent should explicitly not do. This is the most valuable work I do - a tight spec runs cleanly; a vague one costs me an hour of back-and-forth.

During the day, agents work. I check in at the checkpoints I built into the specs: after the architecture decision, after the first working version, before anything touches production. Not after every file.
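The specs themselves don't need to be elaborate. Mine follow a loose skeleton like this - the structure is my own habit, not a standard format:

```markdown
# Spec: <feature>

## Done means
- <observable outcome a reviewer can verify>

## Edge cases
- <empty states, offline, invalid input>

## Explicitly out of scope
- <things the agent should not touch>

## Checkpoints (stop and ask me)
1. After proposing the architecture
2. After the first working version
3. Before anything touches production
```

The checkpoints section is what makes "human in the loop" concrete: it's written into the task, not left to memory.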

At the end of the day, I review output and log anything worth keeping. Takes fifteen minutes.

The system isn't finished. It won't be. That's the point. What mattered wasn't finding the perfect setup - it was building something designed to adapt.

Six months from now, half of what I wrote here might be outdated. The practices that hold up are the ones built around capturing what breaks, not the ones that assume the tools stay the same.

Want to structure your agentic workflow?

I've spent months building agentic coding workflows in production. Whether you're a dev getting started or a team looking to make the shift - I can help you find your method.