Sandcastle: Running Claude Code Agents in Isolated Sandboxes

I came across Sandcastle this week, the latest project from Matt Pocock, and it scratches an itch I've had for months. If you've ever wanted to run a Claude Code agent for an hour without watching every diff land, this is the missing piece.
The pitch is simple: invoke an agent with a single sandcastle.run() call, let the library handle sandboxing and branch management, and get back a clean set of commits on a branch you can merge or discard. No --dangerously-skip-permissions cowboy mode. No agent randomly editing your .env. Just an isolated container, a worktree, and a result.
The Problem It Actually Solves
Agentic coding has a trust ceiling, and it's not the model. It's the blast radius. The moment you run an agent unattended on your real repo, two questions kick in. What happens if it goes off the rails? And what happens if I want to run three of them in parallel?
Most of us have hacked our way around this. Git worktrees in separate folders. A second laptop. A throwaway VM. I've written about parallel sessions with worktrees before, and the pattern works, but it's manual. You set it up, you tear it down, you remember which worktree had the experimental branch.
Sandcastle takes that whole orchestration layer and makes it a function call. The agent runs in a Docker container (or Podman, or a Vercel Firecracker microVM), commits to a temporary branch inside the sandbox, and the host gets the resulting branch with all the commits attached. If the agent breaks something, the damage is in a container that's about to be destroyed.
That's the part I find quietly important. Sandboxing for AI agents isn't security theater. It's what makes unattended workflows actually safe to leave running.
How the API Looks
The basic shape is what you'd hope for from a Matt Pocock library: small surface, strong types, no surprises.
import { run, claudeCode } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";

const result = await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox: docker(),
  prompt: "Fix the failing tests in the auth module",
});

That's the whole thing. You get back the iterations, the commits the agent made, and the branch they live on. From there, you decide whether to merge, review, or throw it away.
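When the commits land on a branch, the accept-or-reject step is plain git. A minimal sketch, continuing from the result above; the field name result.branch is my assumption based on the README's description of the return value, not a verified type:

import { execSync } from "node:child_process";

// `result.branch` is an assumed field name for the branch the agent's
// commits landed on.
execSync(`git merge --no-ff ${result.branch}`, { stdio: "inherit" });
// ...or discard the attempt instead:
// execSync(`git branch -D ${result.branch}`);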
The interesting part is how branches work. By default, bind-mount providers (Docker, Podman) write directly to your working directory because they're already mounted to it. Isolated providers (Vercel) run on a temporary branch and merge it back when done. You can also force an explicit branch name, which is what I'd reach for in CI:
await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox: docker(),
  branchStrategy: { type: "branch", branch: "agent/issue-42" },
  prompt: "Fix issue #42",
});

That clean separation between sandbox provider and branch strategy is the design decision that makes the rest of the library composable.
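To make the composability concrete: swapping the provider shouldn't change anything else about the call. The vercel import path below is a guess by analogy with the Docker one, and I haven't verified it:

import { run, claudeCode } from "@ai-hero/sandcastle";
// Assumed import path, by analogy with sandboxes/docker. Verify before using.
import { vercel } from "@ai-hero/sandcastle/sandboxes/vercel";

await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox: vercel(),
  // Isolated providers can't see local files, so work merges back via a branch.
  branchStrategy: { type: "branch", branch: "agent/issue-42" },
  prompt: "Fix issue #42",
});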
What I'd Use This For
A few workflows came to mind immediately, and they're all things I'm currently doing the slow way.
Parallel issue triage. Three open bug reports, three sandboxes, three Claude agents working in parallel. Each one produces a branch with a proposed fix. I review them in the morning. The expensive part of agentic coding isn't the model tokens, it's the wall-clock time between "I have an idea" and "I have a PR." Parallelism collapses that.
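A sketch of what that could look like, using only the run() options the post already covers; the issue numbers and branch names are made up:

import { run, claudeCode } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";

// Three issues, three sandboxes, three agents. An explicit branch per run
// keeps the results from colliding in the same repo.
const issues = [17, 23, 31];

const results = await Promise.all(
  issues.map((issue) =>
    run({
      agent: claudeCode("claude-opus-4-6"),
      sandbox: docker(),
      branchStrategy: { type: "branch", branch: `agent/issue-${issue}` },
      prompt: `Fix issue #${issue}`,
    }),
  ),
);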
Implement-then-review pipelines. Sandcastle has a createSandbox primitive that lets you reuse the same container across multiple agent runs. The example in the README does exactly the workflow I want: Opus implements a feature, then Sonnet runs a review pass on the same branch in the same container. Two agents, one feature, no context loss.
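I haven't run this pipeline myself yet, but from the README's description the shape is presumably something like the following. Treat createSandbox's signature and the model names as assumptions, and check the README's example for the real API:

import { run, createSandbox, claudeCode } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";

// Assumed: createSandbox wraps a provider and the handle slots into run().
const sandbox = await createSandbox(docker());

// Pass 1: Opus implements.
await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox,
  prompt: "Implement dark mode toggling in the settings page",
});

// Pass 2: Sonnet reviews the same branch in the same container.
await run({
  agent: claudeCode("claude-sonnet-4-5"),
  sandbox,
  prompt: "Review the latest commits and fix any issues you find",
});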
AFK overnight runs. This is the one I'm most curious about. Hand the agent a low-stakes refactor with maxIterations: 5, let it run while I sleep, wake up to a branch with commits I can review over coffee. The completion signal mechanism (the agent outputs a sentinel string when it thinks it's done) is the kind of detail that suggests Matt has actually used this in anger.
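Concretely, an overnight run might be nothing more than this; maxIterations and branchStrategy are documented options, and the refactor prompt is just an example:

import { run, claudeCode } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";

// Low-stakes refactor, capped at five iterations, landing on its own branch.
const result = await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox: docker(),
  maxIterations: 5,
  branchStrategy: { type: "branch", branch: "agent/overnight-refactor" },
  prompt: "Migrate the test suite from Jest to Vitest, one module per commit",
});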
I cover the full agentic coding workflow in the course, including how to structure prompts and feedback loops for unattended runs.
The Constraints Worth Knowing
This isn't magic, and the README is honest about it.
Structured output extraction (typed JSON via Zod schemas) requires maxIterations === 1, so you can have a multi-iteration agent or a typed result, but not both. That's a real limitation if you were hoping to use Sandcastle as a drop-in for tool-using pipelines.
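For illustration, here's roughly what that constraint looks like in code. The option name output is my invention; the README only says structured output is typed JSON via a Zod schema and needs a single iteration:

import { z } from "zod";
import { run, claudeCode } from "@ai-hero/sandcastle";
import { docker } from "@ai-hero/sandcastle/sandboxes/docker";

const result = await run({
  agent: claudeCode("claude-opus-4-6"),
  sandbox: docker(),
  maxIterations: 1, // structured output requires exactly one iteration
  // `output` is a hypothetical option name; check the actual API.
  output: z.object({
    failingTests: z.array(z.string()),
    rootCause: z.string(),
  }),
  prompt: "Diagnose the failing tests in the auth module",
});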
Session resumption is tied to Claude Code's session files on the host, which means resumed sessions don't combine with multi-iteration runs either. Useful for "continue what you were doing yesterday," not so useful for chained autonomous loops.
The head branch strategy (write directly to your working directory) only works with bind-mount providers. If you're on Vercel's microVMs, the sandbox can't see your local files, so it has to merge back through a branch. Obvious in retrospect, but worth knowing before you wire up CI.
And like every container-based workflow, Docker startup time is real. The first run pays for image pulls and npm install. The hooks system mitigates this with parallel sandbox setup commands, but if you're expecting <1s cold starts, recalibrate.
Why This Matters Beyond Sandcastle Itself
The library is good. What's more interesting is what it represents.
The agentic coding ecosystem is shifting from "how do I prompt the agent better" to "how do I run the agent safely without me in the loop." Sandboxing, branch strategies, lifecycle hooks, structured output, completion signals: these are the primitives of unattended execution. They look boring next to model benchmarks. They're what actually unlocks parallelism.
I've been building variations of this manually on top of git worktrees and tmux for the last few months. Having a typed library that codifies the pattern means I can stop maintaining glue scripts and focus on the prompts.
If you're running Claude Code on real projects and you've ever wished you could parallelize a few tasks without context-switching all day, Sandcastle is worth a serious look. It's an early-days library and the API may shift, but the shape of it is right.
I go deeper on parallel agentic workflows in the course, including the worktree patterns Sandcastle automates and the prompt structures that make unattended runs reliable.