Why We Replaced SSH With a Custom Daemon Inside Our Sandbox VMs

A warm pool bug that SSH could not fix forced us to rethink how the host talks to a sandbox. Here is what we built, why it is faster, and why per-sandbox env vars finally work.

ZWRM Team · 12 min read
Tags: sandbox, microvm, firecracker, infrastructure, engineering

TL;DR

We were using SSH to drive sandbox VMs from our control plane. It worked (kinda), until we turned on the warm pool. Pre-booted VMs could not accept user environment variables without a reboot, and SSH was the wrong shape of tool to fix it. We pulled dropbear out of sandbox images entirely and replaced it with a small custom daemon, zwrm-sandboxd, that runs inside every sandbox VM, exposes a typed ConnectRPC surface, and fetches a bearer token at boot from the host metadata service. Env injection now works in warm-claimed and cold-booted sandboxes with the same code path, the trust model dropped half its moving parts, and snapshot/wake preserves daemon state without a re-auth dance. This was not a religious switch. It was the right tool for one specific job.


The Bug That Made Us Do It

Our sandbox system runs on Firecracker microVMs. A user types:

```bash
zwrm sandbox create -t python --env GREETING=hello
zwrm sandbox exec <id> -- printenv GREETING
```

They expect to see hello. For months, that worked. The cold-boot path was simple: the control plane set env vars before the VM booted, the init script exported them, and SSH-in-and-printenv returned the right answer.

Then we shipped the warm pool.

The warm pool keeps a handful of pre-booted VMs sitting at the door so a sandbox create call can return in a couple dozen milliseconds instead of waiting for a full cold boot (~2 seconds). Beautiful when it works. Except: a warm pool VM has already booted by the time a user asks for one. Its environment was set when the replenisher created it, usually to nothing in particular. The user's --env GREETING=hello arrives after the VM exists.

We tried the obvious thing first: SSH into the warm-claimed VM and export GREETING=hello. But that sets the env in one shell process. The user's next exec call opens a different shell. Their env is gone.

We tried writing to /etc/environment. That works for new login shells but not for every process a subsequent exec spawns directly. We tried wrapping every exec in a wrapper script that re-sourced a per-sandbox env file. That worked, but it felt like we were fixing the wrong layer -- taping process state back on top of a protocol that refused to remember anything between calls.

The honest diagnosis was sitting there the whole time: SSH is the wrong abstraction for what we are doing.

SSH is designed for "human logs into a remote host and runs commands in their session." We do not have a human. We do not have a session. We have a host that wants to invoke RPCs against a VM with state that survives across calls. Every zwrm sandbox exec was paying for TCP setup, key exchange, channel allocation, and shell parsing, all for something that should be a single function call.

So we threw it out.

What Replaced It: zwrm-sandboxd

zwrm-sandboxd is a small statically linked Go binary that runs inside every sandbox VM. It listens on :9923 and exposes three ConnectRPC services, defined in proto/sandboxd/sandboxd.proto:

  • System -- Ping (readiness probe and health check), SetEnv (the RPC that fixes the warm-pool bug)
  • Process -- Run (blocking exec with captured stdout/stderr/exit code), Stream (reserved for future bidi streaming), Signal (POSIX signal delivery)
  • Filesystem -- Read, Write, Stat, List, Remove, with chunked transfers for large files

Everything speaks h2c -- plain HTTP/2 over TCP. No TLS inside the sandbox. The host talks to the VM over a private TAP link that nothing else can reach, so the wire protection we actually care about is the per-sandbox bearer token, not a certificate chain.

On the host side, the sandbox manager keeps a sandboxclient.Pool keyed by sandbox ID. The first call after a VM comes up does a WaitReady -- a Ping loop with linear backoff starting at 50 ms and stepping up to a 2 s cap. After the daemon answers once, every subsequent zwrm sandbox exec is a single HTTP/2 request on the cached connection. The pool closes the client when the sandbox is destroyed.

The hero RPC in this story is SetEnv. It merges (or, with replace=true, replaces) the daemon's in-memory env store. Every subsequent Process.Run call merges the store with any per-call overrides via a small mergeEnv helper -- per-call wins, daemon state is the fallback. When the warm pool claims a VM, the host calls SetEnv with the user's --env vars before handing the sandbox back. When a cold-boot sandbox starts, the same call runs right after UpdateSandboxRunning. Same helper, same log line, same failure modes. Two scenarios, one code path, zero kernel-command-line gymnastics.
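The merge semantics are easier to see in code. The sketch below is an illustration under assumptions: `envStore`, `snapshot`, and the exact signature of `mergeEnv` are hypothetical, and only the behavior (merge or replace on SetEnv, per-call overrides win at run time) is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// envStore is a hypothetical stand-in for the daemon's in-memory env store.
type envStore struct {
	mu   sync.RWMutex
	vars map[string]string
}

func newEnvStore() *envStore { return &envStore{vars: map[string]string{}} }

// SetEnv merges vars into the store, or swaps the whole store when replace is true.
func (s *envStore) SetEnv(vars map[string]string, replace bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if replace {
		s.vars = make(map[string]string, len(vars))
	}
	for k, v := range vars {
		s.vars[k] = v
	}
}

// snapshot returns a copy so callers never share the store's map.
func (s *envStore) snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.vars))
	for k, v := range s.vars {
		out[k] = v
	}
	return out
}

// mergeEnv layers per-call overrides on top of the daemon state and returns
// sorted KEY=VALUE pairs, the form exec.Cmd.Env expects.
func mergeEnv(store, perCall map[string]string) []string {
	merged := make(map[string]string, len(store)+len(perCall))
	for k, v := range store {
		merged[k] = v
	}
	for k, v := range perCall { // per-call wins
		merged[k] = v
	}
	out := make([]string, 0, len(merged))
	for k, v := range merged {
		out = append(out, k+"="+v)
	}
	sort.Strings(out)
	return out
}

func main() {
	store := newEnvStore()
	store.SetEnv(map[string]string{"GREETING": "hello", "LANG": "C"}, false)
	// Prints [GREETING=hello LANG=en_US.UTF-8]
	fmt.Println(mergeEnv(store.snapshot(), map[string]string{"LANG": "en_US.UTF-8"}))
}
```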

The whole host-to-daemon surface lives in sandbox/sandboxclient/client.go. The Execute path on sandbox.Manager is now a thin wrapper around it: validate env var names, run a daemonClientReady probe, call client.Run with sh -c <command>, return the captured result. The old SSH dial-with-retry loop is gone.
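As an illustration of the first step in that wrapper, here is one way env var names could be validated. The accepted pattern (POSIX-style names: letters, digits, and underscores, not starting with a digit) is an assumption, and `validateEnvNames` is a hypothetical helper, not the function in sandbox.Manager.

```go
package main

import (
	"fmt"
	"regexp"
)

// envNameRE accepts POSIX-style names. This rule is an assumption;
// the post does not say which names the real validator allows.
var envNameRE = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// validateEnvNames rejects the whole map if any name is malformed.
func validateEnvNames(env map[string]string) error {
	for name := range env {
		if !envNameRE.MatchString(name) {
			return fmt.Errorf("invalid env var name %q", name)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateEnvNames(map[string]string{"GREETING": "hello"}))
	fmt.Println(validateEnvNames(map[string]string{"BAD=NAME": "x"}))
}
```

Rejecting names that contain `=` matters in particular, because the merged environment is eventually flattened to KEY=VALUE strings.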

How the Trust Model Got Cleaner Almost By Accident

We did not set out to redesign the security model. But replacing SSH meant we got to drop a pile of moving parts we had been carrying because the previous shape of the problem demanded them.

Before (SSH):

  • Generate an SSH host key per app, distribute it
  • Serve authorized_keys from the metadata service
  • Keep dropbear inside every sandbox image
  • Trust-on-first-use for the VM's host key on every dial
  • Re-do most of that across snapshot/restore

After (daemon):

  • At CreateSandbox, mint 32 random bytes, hex-encode, store as BYTEA in the sandboxes.daemon_token column
  • At boot, the daemon fetches the token from the host metadata service at /daemon/token/<machine_id>
  • The metadata server validates the request source IP against the machine row's recorded IP, so a process in Sandbox A cannot ask for Sandbox B's token even if it knows B's machine ID
  • The daemon holds the token in memory and compares incoming Authorization: Bearer <token> headers with subtle.ConstantTimeCompare
  • The token never touches disk inside the VM and never appears in JSON API responses (json:"-" on the DaemonToken struct field)
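The minting and comparison steps in the list above can be sketched with the standard library alone. `mintToken` and `checkBearer` are hypothetical names. The 32 random bytes, the hex encoding, and the use of subtle.ConstantTimeCompare are from the list, while the empty-token guard is an added assumption.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// mintToken returns 32 random bytes, hex-encoded (64 characters).
func mintToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// checkBearer reports whether an Authorization header carries the expected token.
func checkBearer(header, want string) bool {
	const prefix = "Bearer "
	// Never accept anything while no token is loaded: two empty
	// slices would otherwise compare as equal.
	if want == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	tok, err := mintToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tok), checkBearer("Bearer "+tok, tok), checkBearer("Bearer wrong", tok))
}
```

ConstantTimeCompare returns immediately when the lengths differ, which is acceptable here because the token length is fixed and not secret.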

The interesting wrinkle is that we did not have to write a new metadata service for this. The one we had already serves secrets to app VMs at http://169.254.169.254:1338/secrets/<machine_id>, with source-IP gating baked in. Adding /daemon/token/<machine_id> was one new handler that mirrors the existing /secrets/ and /ssh/* shapes: extract the path suffix, look up the machine row, compare the client IP against machine.IPAddress, return the token as application/octet-stream. Six unit tests in secrets/server_test.go cover the handler: happy path, spoofed source IP, nonexistent machine, sandbox row with a null token (historical migrations), wrong HTTP method, empty machine ID. If an app or Postgres VM ever hits this endpoint by mistake, the lookup returns 404 with a deliberate "daemon token not provisioned" message. We want these failures to be loud, not a silent empty response.

The token survives suspend/restore for free. When a persistent sandbox suspends via Firecracker snapshot and wakes hours later, the daemon is resumed as part of the memory snapshot with the token still in its process heap. No re-fetch, no re-auth dance, no second round trip to the metadata service. The host's sandboxclient.Pool reconnects over the restored TAP link and the next exec lands on a daemon that already knows who it is.

What We Did Not Do

Two things we explicitly punted on. The post would be neither complete nor honest without them.

We kept SSH on app and Postgres VMs. Apps run user code that may legitimately want SSH access for debugging. Postgres VMs use SSH for operator "log in and look around" workflows and for host-key-based trust that the replication tooling still leans on. Both have a real human use case behind them. Sandboxes do not. A sandbox is an API target, not a box you ssh into. The daemon swap is scoped in code: build.LegacyRootFSOptions() still injects dropbear, build.SandboxRootFSOptions() injects zwrm-sandboxd and explicitly sets IncludeSSH: false. Sandbox images ship with no SSH binary at all.

We did not ship streaming exec. Process.Stream is defined in the proto with the full bidi message set (StreamStart, StreamStdin, StreamSignal, StreamData, StreamExit), but the handler currently returns Unimplemented. When we eventually need PTY-style streaming for interactive coding agent sessions or an in-browser terminal, the wire format is already there: no service version bump, no client migration, no new RPC.

What Surprised Us

A handful of things we did not expect going in.

The init script barely changed. We were braced for an invasive rewrite of the boot path. Dropbear was deeply wired into how sandbox VMs came up. In the end, the whole thing reduced to two independent gated blocks in the same init script. The SSH block runs if [ -x /usr/sbin/dropbear ]. The sandboxd block runs if [ -x /usr/local/bin/zwrm-sandboxd ]. The build pipeline decides which binary lands in the image via SandboxRootFSOptions versus LegacyRootFSOptions, and the init script does not know or care about the distinction. Same init script for apps, sandboxes, and Postgres, just different binaries injected into the rootfs during the image build.

The boot fetch needed to be cancellable. The daemon fetches its token from the metadata service at boot with exponential backoff, capped at 30 seconds per attempt and ~5 minutes total. The naive version of that loop blocks SIGTERM for the full budget. If a VM gets suspended mid-boot (which can happen to warm pool VMs, because the snapshot flow runs before the sandbox is claimed), that is up to five minutes of delay on shutdown. We installed signal handling before the metadata fetch starts and passed the signal context into the retry loop. The first thing the loop checks on every iteration is ctx.Err(). If the daemon gets SIGTERM during boot, fetchTokenWithRetry returns ctx.Err() immediately. Obvious in hindsight. Not obvious when we wrote it.

The permissive boot config pattern is now a house style. The daemon reads its machine ID and metadata URL from two sources: environment variables set by the init script, and /proc/cmdline as a fallback. The first shape we tried was strict -- use env if both values are present, else fall back to cmdline if both values are present, else fail. A partial env (say, MACHINE_ID set but METADATA_URL missing because the host VM manager was not configured with one and the init script fell back to gateway:1338) was treated as a hard error. That is wrong: the caller had useful information in both sources, we just were not merging it. The final shape reads both sources, merges them field by field, and only validates at the end. It is a three-line change and it is the pattern we reach for now whenever a binary has more than one legitimate place to get a config value from.
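The merge-then-validate shape can be sketched as follows. The `bootConfig` type, the helper names, and the `zwrm.machine_id` / `zwrm.metadata_url` kernel command-line keys are hypothetical; only the pattern (read both sources, merge field by field, validate once at the end) is from the paragraph above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// bootConfig holds the two values the daemon needs at boot.
type bootConfig struct {
	MachineID   string
	MetadataURL string
}

// fromCmdline pulls key=value pairs out of a /proc/cmdline string.
// The key names are hypothetical.
func fromCmdline(cmdline string) bootConfig {
	var c bootConfig
	for _, field := range strings.Fields(cmdline) {
		k, v, ok := strings.Cut(field, "=")
		if !ok {
			continue
		}
		switch k {
		case "zwrm.machine_id":
			c.MachineID = v
		case "zwrm.metadata_url":
			c.MetadataURL = v
		}
	}
	return c
}

// merge fills each empty field of primary from fallback, field by field.
func merge(primary, fallback bootConfig) bootConfig {
	if primary.MachineID == "" {
		primary.MachineID = fallback.MachineID
	}
	if primary.MetadataURL == "" {
		primary.MetadataURL = fallback.MetadataURL
	}
	return primary
}

// validate runs once, after merging, and names every missing field.
func (c bootConfig) validate() error {
	var missing []string
	if c.MachineID == "" {
		missing = append(missing, "machine ID")
	}
	if c.MetadataURL == "" {
		missing = append(missing, "metadata URL")
	}
	if len(missing) > 0 {
		return errors.New("missing boot config: " + strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Partial env: MACHINE_ID was set, METADATA_URL was not.
	env := bootConfig{MachineID: "m-1"}
	cmdline := fromCmdline("console=ttyS0 zwrm.machine_id=m-1 zwrm.metadata_url=http://169.254.169.254:1338")
	cfg := merge(env, cmdline)
	fmt.Println(cfg, cfg.validate())
}
```

In a real daemon the `env` value would be populated from os.Getenv and the command line from reading /proc/cmdline; both are hard-coded here to keep the sketch deterministic.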

Should You Do This?

If you are building a code-execution sandbox (Modal-style, E2B-style, code-interpreter-style) and you are using SSH because it was the path of least resistance, here are the two questions to ask yourself before you rip it out.

Does the host need to drive state that survives across calls? Env vars. Working directory. An interpreter session. A long-running supervisor. A dict of per-call metadata you would like to keep warm. If the answer is yes, SSH is going to fight you on every one of them. Each session is its own shell process, and state does not carry. You will end up writing a wrapper protocol on top of SSH or smuggling state through /tmp. At that point you are already building a daemon, so just commit to it.

If the answer is no, if every interaction is "send a self-contained command, get output, done" and nothing needs to remember anything, SSH is fine. Do not break what works.

Do you have warm pools, or any other "VM exists before the user knows about it" pattern? Snapshot restoration, pre-provisioned lease pools, golden image wake-ups, anything that decouples VM creation from user intent. If yes, you will hit the env-injection bug. We hit it. You will too. The daemon-with-SetEnv shape solves it cleanly because the daemon is the one thing inside the VM that outlives any particular shell process.

What's Next

Three things on the roadmap that this change unblocks:

  1. Streaming exec. Process.Stream becomes real: PTY support for interactive sessions, stdin streaming, signal forwarding. The foundation for an in-browser terminal against a sandbox.
  2. Filesystem watch. Sandbox-side inotify wrapped in a new RPC so the host can react to file changes inside the VM without polling Stat.
  3. Daemon-mediated metrics. CPU, memory, and I/O stats from inside the VM, exposed via a Stats RPC. No host-side nsenter gymnastics, no cgroup-scraping hacks.

All three would be miserable to retrofit onto SSH. With a daemon we own, they are straightforward extensions of an existing service: add a method to the proto, implement the handler, generate the client, ship it. The hard parts (identity, transport, the connection pool, the token plumbing) are already done.


Building infrastructure for AI agents, code interpreters, or remote dev environments and want to skip building all of this yourself? ZWRM gives you Firecracker microVMs, warm pools, snapshot/restore, and a sandbox API that actually works with per-sandbox environment variables. Check out zwrm.eu or start a free trial on the dashboard -- we are onboarding design partners.
