
The AFK Factory: How I Shipped a React Native App Without Babysitting It

Or: What happens when you stop writing code and start managing agents

There’s a certain kind of developer pride attached to grinding through a feature yourself - staying up until 2am, debugging some arcane SQLite migration issue, pushing the commit at 3am with a message like “fix: hopefully”. I used to think that was what serious development looked like.

Then I built an AFK factory, and I watched my codebase grow while I was making tea.

This is a writeup of the workflow I used to build a mobile lap timer and telemetry analytics app - from the first spec document to working code, almost entirely through autonomous agents, structured tests, and a discipline called vertical slicing. It’s not magic. It’s engineering. But it does feel a little bit like cheating.

Step Zero: Plan in plain language, then make it executable

Before writing a single line of code, I spent time in ChatGPT writing a proper product specification. Not a vague wish list - an actual structured document: architecture decisions, data models, screen-by-screen breakdowns, API behavior, edge cases. The kind of document you’d hand to a new team member and expect them to understand the system from, then watch the pain in their eyes when they realise that the document is 28 pages long.

This matters because agents are not mind-readers. An LLM given “build me a lap timer” will produce something. It might even compile. But it won’t produce your product - the one with your specific data model, your specific UX constraints, your specific protocol quirks (in this case: parsing binary telemetry frames from a data logger over TCP, which involves GPS encoded as ECEF coordinates and a keepalive command that is literally just "kkk\x01").
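To make those protocol quirks concrete, here is a minimal sketch of two of them: the keepalive bytes and an ECEF-to-latitude/longitude conversion. The WGS84 constants are standard; the function and constant names are hypothetical, and the real frame layout of the data logger is not shown here because the spec excerpt above does not give it.

```typescript
// WGS84 ellipsoid constants (standard published values).
const WGS84_A = 6378137.0; // semi-major axis, metres
const WGS84_F = 1 / 298.257223563; // flattening
const WGS84_E2 = WGS84_F * (2 - WGS84_F); // first eccentricity squared

// The keepalive from the text: ASCII "kkk" followed by the byte 0x01.
const KEEPALIVE: Uint8Array = new Uint8Array([0x6b, 0x6b, 0x6b, 0x01]);

interface Geodetic {
  latDeg: number;
  lonDeg: number;
  altM: number;
}

// Converts ECEF metres to geodetic coordinates with a fixed-point iteration.
// Assumption: the point is not at a pole, where cos(lat) approaches zero.
function ecefToGeodetic(x: number, y: number, z: number): Geodetic {
  const lon = Math.atan2(y, x);
  const p = Math.hypot(x, y); // distance from the rotation axis
  let lat = Math.atan2(z, p * (1 - WGS84_E2)); // initial guess
  let alt = 0;
  for (let i = 0; i < 10; i++) {
    const sinLat = Math.sin(lat);
    // Prime vertical radius of curvature at the current latitude estimate.
    const n = WGS84_A / Math.sqrt(1 - WGS84_E2 * sinLat * sinLat);
    alt = p / Math.cos(lat) - n;
    lat = Math.atan2(z, p * (1 - (WGS84_E2 * n) / (n + alt)));
  }
  const toDeg = 180 / Math.PI;
  return { latDeg: lat * toDeg, lonDeg: lon * toDeg, altM: alt };
}
```

Details like these are exactly what an agent cannot guess, which is why they belong in the spec rather than in your head.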

The PDF document became the source of truth. Every time a new issue was created for the sandcastle agent to implement, it had that document as context. The agent wasn’t guessing at intent - it was executing on a spec. What’s sandcastle? We’ll get there.

Lesson: The quality of your plan is a force multiplier on everything downstream. If you’re vague with an agent, you get vague code. If you’re precise, you get precise code. This is not different from managing humans, but agents don’t push back or ask for clarification unless you specifically prompt them to. So the burden of precision is on you.

The Sandcastle: An autonomous issue factory

Sandcastle is the autonomous implementation pipeline. The workflow looks like this:

• A GitHub issue is created describing a feature or fix - with enough context that the intent is unambiguous.

• npm run sandcastle spins up a sandboxed Docker environment.

• The agent reads the issue, reads the codebase, plans an implementation, writes the code, and opens a pull request.

• The PR is reviewed by a review agent and merged.

The result is that each issue becomes a discrete unit of autonomous work. You can queue up five issues, go for a run, come back, and have five issues coded, tested, reviewed and merged. That’s the AFK factory.
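The shape of that loop can be sketched in a few lines. Everything below is hypothetical: the type names, the `Agents` interface, and `runFactory` are illustrations of the queue-implement-review-merge flow, not the actual sandcastle code or its API.

```typescript
// Hypothetical shapes for illustration only.
interface Issue {
  id: number;
  title: string;
}

interface PullRequest {
  issueId: number;
  diff: string;
  testsPassed: boolean;
}

// The two agent roles from the workflow, injected so they can be faked.
interface Agents {
  implement(issue: Issue): Promise<PullRequest>;
  review(pr: PullRequest): Promise<boolean>;
}

interface Outcome {
  issueId: number;
  status: "merged" | "rejected";
}

// Processes the queue one issue at a time: implement, then gate the merge
// on both the test result and the review agent's verdict.
async function runFactory(queue: Issue[], agents: Agents): Promise<Outcome[]> {
  const outcomes: Outcome[] = [];
  for (const issue of queue) {
    const pr = await agents.implement(issue);
    const approved = pr.testsPassed && (await agents.review(pr));
    outcomes.push({ issueId: issue.id, status: approved ? "merged" : "rejected" });
  }
  return outcomes;
}
```

The point of the sketch is the gate: nothing merges unless the tests pass and the review approves, which is what makes it reasonable to walk away from the queue.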

Is the code perfect? No. Does it need review? Absolutely. But here’s the thing - the review layer is significantly cheaper than the implementation layer. Reading a diff and deciding whether it’s right takes a fraction of the effort of writing the code from scratch. You’re shifting from author to editor, and that’s a worthwhile trade.

What makes sandcastle work well here is the combination of a tight spec, a clear tech stack constraint (TypeScript strict, no any, specific library choices), and - crucially - tests. Which brings us to the more interesting part.

TDD is not what you think it is (if you’re doing it wrong)

Most people who claim to do TDD are actually doing something I’d call test-decorated development: they write the code, it works, then they write tests to confirm it works. This is fine as a confidence mechanism. It is not TDD.

Real TDD runs the cycle the other way: write a failing test first, then write code to make it pass. The failing test forces you to think about what you actually want the code to do - its public interface, its behavior, its edge cases - before you’ve committed to any implementation details.
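As a small illustration of that order of work, here is a sketch where the expectations are written down first and the implementation comes second. `formatLapTime` and its output format are invented for this example; in the real project these cases would live in a Vitest test file, but a plain table keeps the sketch self-contained.

```typescript
// Step 1 (red): the expected behavior, written before any implementation.
// Each entry is [lap time in milliseconds, expected display string].
const lapTimeCases: Array<[number, string]> = [
  [0, "0:00.000"],
  [59999, "0:59.999"],
  [83456, "1:23.456"],
  [600000, "10:00.000"],
];

// Left-pads a non-negative integer with zeros to the given width (max 3).
function pad(value: number, width: number): string {
  return ("000" + value).slice(-width);
}

// Step 2 (green): the simplest implementation that satisfies the cases.
// Assumes a non-negative integer number of milliseconds.
function formatLapTime(totalMs: number): string {
  const minutes = Math.floor(totalMs / 60000);
  const seconds = Math.floor((totalMs % 60000) / 1000);
  const millis = totalMs % 1000;
  return `${minutes}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Returns a description of every case the implementation gets wrong.
function failingCases(): string[] {
  return lapTimeCases
    .filter(([ms, expected]) => formatLapTime(ms) !== expected)
    .map(([ms, expected]) => `formatLapTime(${ms}) should be ${expected}`);
}
```

Writing the table first forces decisions such as "do minutes get zero-padded?" before a single line of formatting logic exists.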

But there’s a subtler mistake that even genuine TDD practitioners make, and it’s the difference between horizontal slicing and vertical slicing.


Horizontal vs Vertical is a distinction that actually matters

Horizontal slicing means writing all the tests for a layer before implementing it. Write all the model tests. Then write all the service tests. Then write all the controller tests. Then wire them together. It sounds organized. It produces garbage.

Here’s why: when you write tests for a component in isolation before knowing how the other components will actually behave, you’re making assumptions. Those assumptions get baked into mocks. The mocks lie. The tests pass. The integration is broken. You’ve spent a week on a testing pyramid that doesn’t model reality.
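A contrived sketch of a lying mock, with every name invented for the example: the service test was written against a mock that reports lap times in seconds, while the parser that was implemented later reports milliseconds. Both satisfy the same interface, so the compiler has nothing to complain about.

```typescript
// The interface both layers agree on. Note that it says nothing about units.
interface LapSource {
  nextLapTime(): number;
}

// Service layer, written first. It silently assumes seconds.
function lapSummary(source: LapSource): string {
  return `${source.nextLapTime().toFixed(1)} s`;
}

// Mock written in isolation: it bakes in the same assumption (seconds).
const mockSource: LapSource = { nextLapTime: () => 83.5 };

// The real parser, written a week later, reports milliseconds.
const realSource: LapSource = { nextLapTime: () => 83456 };

// The isolated test is green; the integrated behavior is wrong.
const mockedResult = lapSummary(mockSource); // "83.5 s"
const integratedResult = lapSummary(realSource); // "83456.0 s"
```

The unit test against the mock would pass indefinitely. Only a test that runs the real parser through the real service exposes the mismatch.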

Vertical slicing means taking one user-visible behavior and implementing the full stack to make it work — one test at a time, from the public interface inward. Not “test all the parsing logic” but “test that when a device connects and sends a telemetry frame, the session records a lap.” One behavior. Full stack. Working and tested.

The practical result is that you always have working software. After the first vertical slice, you have one thing that works end-to-end. After the second, you have two. At no point do you have a beautifully tested data layer that can’t actually be used because the service layer isn’t done yet.
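Here is roughly what one such slice could look like, shrunk to fit on a page. The frame layout (one type byte followed by a big-endian uint32 lap time) is made up for this sketch and is not the data logger's real protocol; the class and function names are hypothetical too.

```typescript
interface Lap {
  lapTimeMs: number;
}

// Hypothetical frame layout for this sketch: [0x4c, uint32 big-endian ms].
const LAP_FRAME_TYPE = 0x4c;

// Parsing layer: returns undefined for anything that is not a lap frame.
function parseLapFrame(frame: Uint8Array): Lap | undefined {
  if (frame.length !== 5 || frame[0] !== LAP_FRAME_TYPE) return undefined;
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return { lapTimeMs: view.getUint32(1, false) }; // false = big-endian
}

// Storage layer: an in-memory stand-in for the real database.
class LapStore {
  private laps: Lap[] = [];
  add(lap: Lap): void {
    this.laps.push(lap);
  }
  all(): readonly Lap[] {
    return this.laps;
  }
}

// Public interface of the slice: raw bytes in, queryable laps out.
class Session {
  private readonly store: LapStore;
  constructor(store: LapStore) {
    this.store = store;
  }
  receive(frame: Uint8Array): void {
    const lap = parseLapFrame(frame);
    if (lap !== undefined) this.store.add(lap);
  }
  laps(): readonly Lap[] {
    return this.store.all();
  }
}
```

A test for this slice only touches `Session`: feed it the bytes of a lap frame, then assert that `laps()` contains one lap with the right time. No mocks are involved, so the parser and the store are exercised together.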

For agent-driven development, vertical slicing is not optional - it’s load-bearing. When you hand an agent an issue, it needs to produce a working, tested vertical slice. If the instruction is “implement the session parsing layer”, you get a pile of code that may or may not integrate correctly. If the instruction is “implement: when a session is synced from a device, it should be queryable from the sessions screen”, you get code that either works end-to-end or fails an integration test that tells you exactly why.

Tests become the contract between the issue description and the implementation. If the agent’s code passes the tests, the behavior is correct. If not, the test output is the feedback loop.

Managing agents like a tech lead, not a babysitter

There’s a mental model shift required here. With human developers, you have conversations, code reviews, and trust built over time. With agents, you have:

• The spec (the permanent context)

• The issue (the specific task)

• The tests (the acceptance criteria)

• The PR (the deliverable)

Your job as the human in this loop is not to write code. It’s to make sure those four things are coherent with each other. If an agent produces bad code, the diagnosis is almost always one of: the spec was ambiguous, the issue was underspecified, or there were no tests to catch the deviation.

This maps reasonably well to managing junior developers, except that agents are faster, never tired, don’t need context on “culture”, and also won’t tell you when the issue description is contradictory (unless you build that feedback loop in). The failure modes are different but the management surface is similar: garbage in, garbage out.

One thing that helps enormously is keeping issues small and orthogonal. An issue that says “implement device connection and data syncing and the sessions screen” is going to produce a sprawling, hard-to-review PR that probably has subtle interaction bugs. Three small issues produce three reviewable PRs that compose cleanly.

The toolchain, for the curious

The app itself is a React Native project in a Turborepo monorepo, with TypeScript strict mode throughout, Drizzle ORM over SQLite, NativeWind for styling, and React Query for data synchronization. Tests run with Vitest. The sandcastle agent runs Claude.

The tech stack choices aren’t arbitrary — they’re part of what makes agent-driven development tractable. TypeScript strict mode means the type system catches whole classes of integration errors before runtime. A well-structured ORM means the agent can reason about data models without reading raw SQL. A consistent styling system means UI components don’t diverge over time.
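A minimal sketch of the kind of integration error strict mode rules out. `SessionRow`, `findSession`, and `sessionTitle` are hypothetical names, not the app's real data model.

```typescript
// Hypothetical row shape: the track name is optional in the database.
interface SessionRow {
  id: string;
  trackName: string | null;
}

// A lookup can miss, and the return type says so.
function findSession(rows: SessionRow[], id: string): SessionRow | undefined {
  return rows.find((row) => row.id === id);
}

function sessionTitle(rows: SessionRow[], id: string): string {
  const row = findSession(rows, id);
  // Under strict mode, reading row.trackName before this check is a compile
  // error, because row may be undefined. Without strict null checks it would
  // compile and crash at runtime on the first missing session.
  if (row === undefined) return "Unknown session";
  // trackName may be null, so a fallback is required to return a string.
  return row.trackName ?? "Untitled track";
}
```

An agent that forgets either check gets a failed type check instead of a merged bug, which is a much cheaper feedback loop than a crash report.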

The point is: your tooling is not separate from your workflow. It shapes what agents can and can’t do reliably. A project with no type safety, no tests, and no clear architectural pattern is hard to extend by hand. By agent, it’s nearly impossible.

What this workflow is not

It’s not vibe coding. Vibe coding is “write me a login screen” -> accept whatever comes out -> ship it. That produces demos, not products.

This workflow is closer to being a very opinionated tech lead who writes detailed specs, tests every PR, and has a team of very fast developers who never sleep. The human judgment is concentrated at the specification and review layers — the expensive mechanical work happens autonomously.

It’s also not a replacement for understanding your codebase. You need to know enough to write a good spec, evaluate a PR, and spot when an agent has taken a shortcut that will cause pain later. The loop still requires a human with domain knowledge. It’s just that the human is operating at a higher level of abstraction.

Conclusion: This factory runs while you sleep

The combination of a well-written spec, autonomous issue-based implementation, vertical slice TDD, and a consistent toolchain produces something that feels almost unfair: a codebase that grows incrementally, is always tested, and requires you to spend most of your time thinking rather than typing.

Is it more setup work upfront? Yes. Does the spec document take time to write? Absolutely. But the compounding return on that investment shows up every time you queue an issue, start a sandcastle run, go do something else, and come back to a PR that works.

That’s the AFK factory. You’re welcome.

The author, Marcus Wendel, is a senior software developer at adesso Sweden - with an obvious preference for all modern technologies, and for AI-assisted and native development in particular.