HumanSpark Labs Report: The Spark Project Management Build
A real build session: strategic conversation to working system in one evening, with no human code written.
This report documents a real build session from early February 2026. In a single evening, a strategic conversation identified over 30,000 euros in revenue that wasn’t being properly tracked, and produced a full technical specification for a project management system. Then autonomous AI agents built, tested, and hardened the entire system – over 800 tests, around 7,000 lines of code, no human code written. I ended up with a secure AI assistant that can help me manage my projects through a chat interface.
A freelance developer might have quoted me 3-4 weeks to build this, or I could have gone down a rabbit hole for several weeks building it myself. But the honest answer is that I never would have built it at all.
The Problem That Wasn’t the Problem
I’ve been running my consultancy for 19 years. When I shifted focus to AI in early 2024, the business didn’t get simpler – it accumulated more layers on top of the previous ones. Legacy clients from web development work still need attention alongside AI training, speaking, implementation projects, and even a SaaS product. I’ve also written several books to help frame my thinking about AI.
There’s a lot of context-switching, and each type of work needs a different kind of attention, on a different timescale, with different consequences when it gets neglected.
This is also why I’m self-employed in the first place. I spent seven years as a software engineer trying to be a good employee before accepting that I’m not made for working for others. Nineteen years of running my own business later, the trade-off is the same – the hyperfocus is a gift that I think many people underestimate, but the follow-through on invisible work is where things fall apart for me.
That’s why Spark exists – a personal AI assistant that manages my tasks, deadlines, and calendar through a chat interface (RocketChat – it’s like Slack, but open source, self-hosted, and completely private).
Spark originally had over 400 automated tests and a clean, simple architecture. But it didn’t understand context – it knew I had tasks, but it didn’t know which tasks belonged to which clients, which projects were generating revenue, or which relationships were going cold.
The evening started with a simple question I often ask myself about technical projects:
“Should I build project management features into Spark? Or is this a distraction?”
To answer that properly, I needed to understand my own business better first.
I have an AI assistant that acts like an advisory council – it’s a structured prompt that synthesises perspectives from several experts. I fed it my email history and invoice records from the past six weeks.
The “council” identified over 30,000 euros in revenue that wasn’t being properly tracked – a mix of invoices I hadn’t sent, payments I hadn’t chased, and bookkeeping that hadn’t been reconciled. Some of it was already in my account but invisible because the records hadn’t been updated. Some was outstanding. The point wasn’t the number – it was that I couldn’t tell which was which.
One AI advisor put it bluntly: “You don’t have a revenue problem. You have a collections problem and an admin problem. The money isn’t flowing in because invoices aren’t being sent or chased.”
Another pointed out the deeper pattern: “You’ve been in cash flow survival mode for years. When everything is one category – ‘will this pay a bill this month?’ – you can’t see the forest for the trees. The fact that you’re now able to step back and think about structure means the pressure is easing.”
Together, those perspectives closed the loop – the revenue problem was simultaneously the strongest argument for building Spark and the strongest argument for not building it right now, because every hour spent coding was an hour not chasing money.
That tension – build the system versus do the work the system would remind you to do – ran through the entire project.
The Taxonomy Conversation
With the business landscape mapped, a new question emerged: if Spark is going to manage projects intelligently, how should it categorise different types of work?
This matters because a consulting business doesn’t have one kind of project. I have client delivery with deadlines, ongoing SaaS operations that never end, pipeline relationships that die silently if neglected, strategic investments like books and courses that compound over time but generate nothing today, and invisible admin work that causes real damage when forgotten.
If all of these sit in a flat list, the system can’t tell me anything useful. It can’t distinguish “your revenue work is stalling” from “your book isn’t progressing.” It can’t know that a pipeline contact going cold for two weeks is more urgent than a Labs project not getting attention.
The council debated this for over an hour. The core tension was between simplicity (three categories, easier to maintain) and accuracy (five categories that map to how the business actually works). Five categories survived the debate:
- Delivery – client work with a commitment and a date
- Operations – ongoing recurring obligations with no end date
- Pipeline – people and opportunities that could become delivery
- Strategic – long-term investment work building value
- Admin – invoicing, bookkeeping, collections, maintenance
Each category has its own failure mode. Delivery fails when you’re underprepared for a deadline. Operations fails when support requests pile up. Pipeline fails silently – you don’t notice until months later when there’s no new work. Strategic fails when it never ships. Admin fails when money leaks.
The taxonomy conversation also surfaced a design principle that shaped the entire system: infer at capture, store explicitly, confirm transparently. When I tell Spark “I need to follow up with Lorcan about that referral,” it should infer that’s pipeline, store the category in the data file, and confirm: “Added: follow up with Lorcan (pipeline).” Zero friction for me, full context for the system.
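That principle is small enough to sketch in code. The following is a minimal illustration, not Spark's actual implementation: the five category names come from the taxonomy above, but the keyword rules, data shape, and function names are my assumptions.

```python
# Sketch of "infer at capture, store explicitly, confirm transparently".
# The keyword hints below are illustrative assumptions, not Spark's rules.
CATEGORY_HINTS = {
    "pipeline": ["follow up", "referral", "intro", "lead"],
    "admin": ["invoice", "bookkeeping", "chase payment"],
    "delivery": ["deadline", "deliver", "client workshop"],
}

def infer_category(text: str, default: str = "strategic") -> str:
    """Infer a work category from free text at the moment of capture."""
    lowered = text.lower()
    for category, hints in CATEGORY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return category
    return default

def capture_task(text: str) -> dict:
    category = infer_category(text)                 # infer at capture
    record = {"text": text, "category": category}   # store explicitly
    print(f"Added: {text} ({category})")            # confirm transparently
    return record
```

The point of the pattern is that the user pays no categorisation cost at capture time, while the stored record still carries explicit context the system can reason over later.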
By the end of the taxonomy conversation, I wasn’t asking “should I build this?” anymore. I was asking “how fast can I build this?”
The Specification
The next step was translating the strategic conversation into a technical specification.
I didn’t write the spec myself. I continued the conversation with the advisory council, and over the course of about two hours, we produced three documents:
A development specification (the “what”): a 7-field data model for projects, a client slug registry, over 20 validation scenarios covering both user-initiated queries (“what’s in my pipeline?”) and system-initiated proactive messages (“you have 3 pipeline contacts going cold”), and a modular collector architecture for pulling in external data from my invoice system and calendar.
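To make the "what" concrete, here is one way a 7-field project record with category and client-slug validation could look. The field names and the example slug registry are assumptions for illustration; only the five categories are taken from the document.

```python
# Hypothetical 7-field project record with validation at construction time.
# Field names and the slug registry contents are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

VALID_CATEGORIES = {"delivery", "operations", "pipeline", "strategic", "admin"}
CLIENT_SLUGS = {"acme-corp", "internal", "various"}  # example registry

@dataclass
class Project:
    slug: str
    name: str
    category: str
    client_slug: str
    status: str = "active"
    due_date: Optional[str] = None
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject records that fall outside the taxonomy or slug registry.
        if self.category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.client_slug not in CLIENT_SLUGS:
            raise ValueError(f"unregistered client slug: {self.client_slug}")
```

Validating in `__post_init__` means a record with an invalid category can never exist in memory, which is the kind of invariant the spec's validation scenarios would exercise.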
A modular build plan (the “how”): 21 phases organised across four parallel tracks – Core Intelligence, External Data Collectors, Proactive Intelligence, and Advanced Intelligence. Each phase has explicit dependencies, files touched, test scenarios, and acceptance criteria.
Autonomous build instructions (the “who does the work”): a structured prompt for Claude Code, Anthropic’s command-line AI coding tool, that tells it how to execute the build plan autonomously. Each phase gets its own sub-agent with a scoped brief. A checkpoint system tracks progress so the build can resume after interruptions.
Three documents, one evening of conversation, and zero code written by a human.
The design conversation was the hard part. Deciding on five categories instead of three. Choosing to track activity in a separate file rather than cluttering the project records. Defining what “going cold” means for pipeline versus delivery. These are judgment calls that require understanding the business, the user’s cognitive patterns, and the failure modes of each category.
An AI can’t make these decisions alone – but once they’re made and documented clearly, an AI can implement them cleanly.
The Build
I handed the three documents to Claude Code and said: “Begin autonomous build.”
Claude Code read the specification, read the existing codebase, and started building – data model upgrades, new classification logic, safety checks, and seeding the system with my real project data. Phase 1 completed with all tests passing, then it moved through the remaining phases.
Two things happened that I didn’t expect.
It optimised the build order. The plan said Phases 1 through 5 were sequential, but the AI noticed that Phase 5 (activity tracking) only depended on Phase 2, not on Phases 3 or 4. So it built Phase 5 right after Phase 3, maximising the number of downstream phases it could unlock. It was reasoning about the dependency graph strategically, rather than following the list in order.
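That kind of dependency-graph reasoning is a standard topological sort. The sketch below uses Python's `graphlib` (3.9+); the only edge taken from the report is "Phase 5 depends only on Phase 2" – the rest of the graph is illustrative.

```python
# Scheduling build phases from a dependency graph instead of list order.
# Only the Phase 5 -> Phase 2 edge comes from the report; the rest is assumed.
from graphlib import TopologicalSorter

deps = {
    1: set(),
    2: {1},
    3: {2},
    4: {3},
    5: {2},   # depends only on Phase 2, so it can run alongside later phases
}

ts = TopologicalSorter(deps)
ts.prepare()
order = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all phases runnable right now
    order.append(ready)
    ts.done(*ready)

print(order)  # → [[1], [2], [3, 5], [4]]
```

The nested-list output shows the parallelism the agent exploited: Phase 5 surfaces in the same wave as Phase 3 rather than waiting behind Phase 4.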
It found its own bugs. During Phase 1, edge case tests revealed three issues – the client slug “internal” triggering false positives, a field append bypassing category validation, and “various” causing the same problem as “internal.” The AI diagnosed each issue, implemented fixes, wrote regression tests, and re-ran the full suite. All within the same phase, with no human intervention.
With all 21 build phases complete, I gave the AI a comprehensive QA plan and let it stress-test its own work. It threw everything at the system – malformed data, ambiguous commands, edge cases, simulated full-day and full-week scenarios. The QA process added close to 180 new tests, bringing the total to over 800.
A separate architectural review identified blind spots the QA plan didn’t cover:
- The “Fat Finger” scenario: What if I manually edit a project slug in vi and accidentally create orphaned tasks? Solution: orphan detection in the morning briefing.
- Silent API failures: What if the invoice API returns partial data and overwrites the cache with incomplete records? Solution: payload size comparison before overwriting.
- No undo button: What if the AI deletes the wrong project? Solution: daily rolling backups with 7-day retention.
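The third fix is simple enough to sketch. This is a minimal, assumed implementation of daily rolling backups with 7-day retention; the file layout and function name are mine, not the system's.

```python
# Sketch of daily rolling backups with 7-day retention.
# Paths and naming scheme are illustrative assumptions.
import shutil
from datetime import date
from pathlib import Path

RETENTION_DAYS = 7

def backup_data(data_file: Path, backup_dir: Path) -> Path:
    """Copy today's data file into backup_dir, then prune old copies."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{data_file.stem}-{date.today().isoformat()}{data_file.suffix}"
    shutil.copy2(data_file, dest)
    # ISO-dated names sort chronologically, so pruning is a sorted slice.
    backups = sorted(backup_dir.glob(f"{data_file.stem}-*{data_file.suffix}"))
    for old in backups[:-RETENTION_DAYS]:
        old.unlink()
    return dest
```

Running this once a day gives a seven-deep undo history: if the AI (or the human) deletes the wrong project, yesterday's file is one copy away.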
All three hardening fixes were implemented and tested. The final system has over 800 automated tests, runs in under 2 seconds, and protects against data loss at every level.
What This Actually Means
The design conversation was as valuable – possibly more valuable – than the code.
Even if I’d hired someone, I’d have spent weeks explaining the business logic through back-and-forth. But the real answer is simpler: I never would have built it. The cost and the coordination overhead would have meant it stayed on a “someday” list indefinitely. The system exists because the barrier to building it dropped low enough that a single evening of focused conversation could produce it.
The five-category taxonomy. The proactive nudge thresholds (5 days for pipeline, 7 days for strategic, 14 days for overdue invoices). The orphan detection that catches my own manual editing mistakes. These are human decisions. They require understanding how a specific business works and what kinds of things fall through the cracks when you’re running everything yourself. The invisible work is where the damage happens – and that’s exactly what this system watches.
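Those thresholds reduce to a simple staleness check. A minimal sketch, with the day counts taken from the text and everything else (the data shape and function name) assumed:

```python
# Staleness check behind proactive nudges. The day thresholds come from
# the report; the function and category keys are illustrative assumptions.
from datetime import date, timedelta
from typing import Optional

NUDGE_THRESHOLDS = {
    "pipeline": 5,          # days before a pipeline contact counts as going cold
    "strategic": 7,         # days before strategic work counts as stalled
    "invoice_overdue": 14,  # days before an invoice triggers a chase nudge
}

def needs_nudge(category: str, last_activity: date,
                today: Optional[date] = None) -> bool:
    """Return True if the item has been idle past its category threshold."""
    today = today or date.today()
    threshold = NUDGE_THRESHOLDS.get(category)
    if threshold is None:
        return False  # no proactive rule for this category
    return (today - last_activity) >= timedelta(days=threshold)
```

Per-category thresholds are the payoff of the taxonomy: a flat list would need one global "stale" rule, which is wrong for almost every kind of work.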
I should be clear about what “production-ready” means here. The system works, the tests pass, and it’s running. But it still needs extensive real-world testing – the kind you only get from using something daily and finding the edges the QA didn’t anticipate. I can do that because I have a software development background. I can read the code, tweak the logic, and fix what breaks. This isn’t something I’d build for a client and simply hand over – it’s a system I can maintain because I understand what’s underneath it.
AI built the software, but the human still had to decide what to build, and why. And if you’re thinking about commissioning something like this rather than building it yourself, you’ll need someone technical in the loop for the long run – or get comfortable debugging code with AI assistance, which is entirely possible.
Monday morning, for the first time, Spark sent me a briefing that understands the difference between my delivery commitments and my pipeline relationships. It told me which invoices were overdue. It nudged me about contacts going cold. It warned me about deadlines approaching with no prep work started.
Whether I actually listened is a different question!
The Numbers
| Metric | Value |
|---|---|
| Design conversation | ~3 hours |
| Specification documents | 3 |
| Build phases | 21 |
| Build tracks (parallel) | 4 |
| QA phases | 7 |
| Hardening fixes | 3 |
| Tests at start | ~400 |
| Tests at finish | 800+ |
| Test execution time | <2 seconds |
| Lines of production code added | ~3,000 |
| Lines of test code added | ~4,000 |
| Human code written | 0 |
| Time from first conversation to production | ~12 hours |
| Estimated equivalent freelance timeline | 3-4 weeks |
For the Practitioner
If you’re reading this as someone who wants to do similar work with AI, here are the practical takeaways.
The strategic conversation is the product. Don’t skip to building. The hour spent debating whether “speaking” and “training” should be the same category saved hours of rework later. AI is fast at building, but it’s only as good as the decisions it’s implementing.
Write the spec before the code. A clear, detailed specification with acceptance criteria and test scenarios gives AI everything it needs to build autonomously. A vague brief gives it room to make wrong decisions.
Design for autonomous execution. The checkpoint system, dependency graph, and sub-agent pattern meant the build could survive interruptions, run phases in parallel, and resume from any point. Without this, a 21-phase build would have required constant human supervision.
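A checkpoint system of the kind described can be very small. This sketch assumes a JSON file holding completed phase numbers; the filename and helper names are illustrative, not the actual build tooling.

```python
# Sketch of a resumable-build checkpoint: a JSON file of completed phases.
# Filename and helper names are illustrative assumptions.
import json
from pathlib import Path

CHECKPOINT = Path("build_checkpoint.json")

def load_done() -> set:
    """Read the set of completed phase numbers (empty on a fresh build)."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(phase: int) -> None:
    """Record a phase as complete so an interrupted build can resume."""
    done = load_done()
    done.add(phase)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def next_phases(deps: dict) -> list:
    """Phases not yet done whose dependencies are all complete."""
    done = load_done()
    return [p for p, d in deps.items() if p not in done and d <= done]
```

Because progress lives on disk rather than in the agent's context, any interruption (a crash, a context limit, a human stepping away) costs at most the current phase.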
AI built it, but someone still needs to own it.
HumanSpark Labs Report – February 2026
“Fewer late nights, not fewer humans.”