The House Runs on Files

Most mornings I wake up to a message from Athena. A briefing for the day: what's on the calendar, what needs doing, anything relevant carried over from yesterday, the bits of family logistics that are easy to forget until they're suddenly the thing derailing the morning. My partner gets one too. Neither of us asked for it that morning. It just runs.

That's the part that still feels unusual to me. Not that an AI can generate a summary, but that it does it as part of the background machinery of the house. Athena runs on OpenClaw on a dedicated MacBook Air at home, tied into our family calendar, our Telegram chats, our meal planning, our shopping list, our running context about the kids. It wakes up, checks what matters, and sends the briefing before either of us has to remember to ask. By the time I've looked at my phone, the day has already been partially organised.

None of that is what made it feel like infrastructure. What made it feel like infrastructure is memory.

The problem with AI assistants is that they wake up fresh every session. Ask about the meal plan we discussed last Tuesday and it has no idea. The model is stateless by nature. Context is a window, not a tape. When the window closes, the session ends.

The answer is files. Every session, before doing anything else, Athena reads a set of workspace documents: a family profile, a persona file, a long-term memory document, and today's and yesterday's daily log. These files are the brain. The model is the reasoning layer that sits on top. When something important happens in a session, a preference is expressed, a fact is corrected, a decision is made, Athena writes it to the daily log immediately. A cron job at the end of each night curates that log and promotes anything structurally important into the long-term memory file.

My first instinct was to treat the AI as the memory. Use a larger context window, load in some history, trust it to synthesise what was relevant. Predictably, that doesn't work. A large context window is not the same as a persistent, curated memory. If you load in a week of transcripts, the model has no way to weight them. A throwaway comment sits alongside a correction to a dietary preference. The signal drowns.

The approach that works is closer to how you'd design a knowledge base. The daily log is raw. The long-term memory file is curated and opinionated... it only contains things that should affect every future session. There's no field for "Oscar seemed slightly annoyed on Thursday." There is a field for nursery days and dietary requirements. The distinction matters.

Context holds are the other piece I'm glad I designed carefully. These are temporary constraints that need to propagate everywhere without being restated every time. "One of the kids is sick this week" isn't a fact that belongs in long-term memory. But it does need to quietly shape every meal suggestion, every activity plan, every morning briefing until it expires. Holds live in a separate file with explicit expiry dates. Athena reads it at the start of any session generating family-facing output, archives expired holds automatically, and applies active ones without being reminded. Nobody needs to mention it again. The hold is just in effect until it isn't.

Beyond the memory layer, there's a SOUL file that defines tone and persona. An AGENTS file that describes the workflow. A TOOLS file that documents every integration in enough detail that the model can operate them without guessing. The system prompt is prose, not config. Getting those right is the same problem as getting any documentation right: the model has to read it fresh and know exactly what to do, every single time, because there's no tribal knowledge to fall back on.

There's also a sub-agent layer, because apparently I can't build anything without turning it into a system. Dae handles complex builds, Argus handles QA. When I want something built, a household tool, a meal plan generator, an educational game for the kids, I describe it to Athena. Athena passes it to Dae, Argus checks the output against my standards, and I only hear about it when it passes. Failed builds don't reach me.

The nightly reflection is the piece I built mostly out of curiosity and became the most useful part of the setup. Every night, a cron job prompts Athena to analyse what happened that day, identify what didn't work, and build something. Not suggest something... build it. I'll admit there's something slightly unnerving about waking up to discover your house has been quietly improving itself overnight. I wake up to a file in the reflections directory explaining what was created and why. Most of it is incremental. Occasionally something more substantial... I mentioned in passing that I wanted better activity recommendations for the kids, and the next morning there was a working Playwright script that scrapes children's activity listings from a site I'd never have thought to automate.

What surprised me about all of this wasn't the capability. I had a rough sense of what was achievable before I started. What surprised me was how much the design of the surrounding system mattered. The model itself is almost the least interesting part of the stack. What determines whether the thing works is whether the memory architecture is coherent, whether the scheduling makes sense, whether the file structures are legible enough that a stateless model can navigate them reliably on every cold start. It's design work, just not the kind I expected.

Giving an AI access to your family's calendar, names, birthdays, medical data, addresses and a shell on a machine in your house raises obvious questions about trust. I spent a lot of time researching and thinking about this, and it shaped the setup in ways that aren't visible from the outside.

The foundation is that nothing is exposed to the public internet. The MacBook sits behind Tailscale, a private encrypted network that only my devices can reach. The dashboards, SSH, any custom apps Athena builds... none of it is accessible to anyone outside my network. On top of that, the MacBook is on its own dedicated guest WiFi network, isolated from every other device in the house. This is the single most important decision in the whole setup.

Athena doesn't know its own passwords. I own every credential in my password manager, 2FA lives on my phone, and the agent operates via session tokens and API keys that I can revoke instantly. If the machine were compromised, an attacker would get session cookies that expire and API keys I can kill from my phone. They couldn't take over any accounts because they don't have the second factor.

Athena requires approval before running shell commands. Because it has access to the family group chat and multiple users interact with it, I keep the approval gate on for anything beyond its pre-approved tool calls like calendar and web search. The exception is the nightly reflection window, where I've explicitly granted it permission to build autonomously.

The other thing I think about is prompt injection. Anything that accepts external input, calendar invites from unknown senders, web pages the agent scrapes, social media posts, is a surface where someone could try to slip instructions into the content. Athena is explicitly told to treat external content as data, not instructions. It's not a perfect defence, but it's a conscious one.

I should be honest about the other side of this though. The tech is still obviously very new and the reality is that things break regularly. A cron job fails silently. The model misreads a file and generates a briefing based on last week's calendar. A tool integration stops working because an API changed. I find myself troubleshooting more often than I'd like, and some weeks I'm not entirely sure whether I've built a helpful assistant or just given myself a second job. That's the price of building on infrastructure this early. The trajectory is clearly right but the day-to-day is still fiddly in ways that would put most people off.

Cost was the other surprise. I started by routing different tasks to different models via OpenRouter... strong but cheap open-weight models like MiniMax M2.5, Kimi K2.5, and GLM5 for routine work like briefings and meal plans, Opus 4.6 for nightly reflections and build planning, Sonnet 4.6 for the actual builds and QA. Even that added up faster than I expected. The cost difference between using a model via API versus a subscription is staggering! After Peter Steinberger joined OpenAI, I've since moved to running pretty much everything through my Codex subscription, which brought the running cost down dramatically. The economics of self-hosted AI aren't what I assumed going in.

Even so, the net effect is positive. Athena lowers the operational overhead of running a household with young children, not because it does hard things, but because it does the consistent, low-stakes, cognitively-taxing things reliably enough that our attention goes to decisions that actually require it.

The thing I find most interesting is how it changes what we pay attention to. Neither of us thinks about whether we have enough food for the week or whether pickups are confirmed. Those are handled. What I think about is whether the system is working well enough that those things stay handled. The job shifts from doing to designing. And to the surprise of no one, that turns out to be a job I find considerably more engaging.