Harness Engineering Thoughts and Notes

9 June 2026 · 3 mins

Given the influx of LLM agent tooling, blog posts, guides, and research, along with an explosion of takes, I wanted to capture my perspective on what I have been reading and seeing, and on some experiments I have been dabbling in. I do not claim to be an expert in the space. I am writing this by hand to stretch my mind, and I hope that what I compile and share becomes valuable to some person or agent.

The Quest for the Perfect Harness

Most of the community has a decent handle on defining a “harness” (the environment and scaffolding that allows an agent to operate), but the real challenge lies in the nuances of execution. As I build, I find myself returning to four core questions:

Dynamic Skill Exposure: How can I expose specific skills at prompt-time to ensure consistency when switching between different models?
Context Steering: How do I direct agents toward the right combination of context (a specific code source, a visual cue, or an image element) so that the request is understood as intended?
Task Persistence: How can I keep a model focused on a complex group of tasks without it drifting or losing the thread?
The “Local-First” Goal: What is the ideal blend of features (similar to those found in Claude Code) that can eventually be transitioned to a fully local, private toolkit?
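
To make the first question concrete, here is a minimal sketch of one way to expose skills at prompt time. It is an illustration under my own assumptions, not how any particular harness does it: the `Skill` type, the `SKILLS` registry, and the keyword-overlap ranking are all hypothetical. The idea is that skills are selected per request and rendered as plain text, so the same skill list reaches whichever model is behind the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    keywords: frozenset[str]


# Hypothetical skill registry; names and keywords are illustrative only.
SKILLS = [
    Skill("run-tests", "Run the project's test suite and summarize failures.",
          frozenset({"test", "tests", "failing", "pytest"})),
    Skill("search-papers", "Search indexed research notes for a topic.",
          frozenset({"paper", "papers", "research", "citation"})),
    Skill("profile-memory", "Profile memory usage of a running experiment.",
          frozenset({"memory", "oom", "leak"})),
]


def select_skills(request: str, skills=SKILLS, limit: int = 2) -> list[Skill]:
    """Rank skills by keyword overlap with the request and keep the top matches."""
    words = set(request.lower().split())
    scored = [(len(s.keywords & words), s) for s in skills]
    matched = [(score, s) for score, s in scored if score > 0]
    # sorted() is stable, so ties keep their registry order.
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in matched[:limit]]


def render_prompt(request: str) -> str:
    """Build one plain-text prompt so every model sees the same skill list."""
    lines = ["Available skills:"]
    lines += [f"- {s.name}: {s.description}" for s in select_skills(request)]
    lines += ["", f"Request: {request}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_prompt("why are my tests failing"))
```

Keyword overlap is the crudest possible selector; an embedding search or a repository-level config could replace `select_skills` without changing the rendering step.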

The Friction of Evolution

My journey has taken me from browser chat windows to VS Code extensions like Cline and eventually to CLI agents like Claude Code and OpenCode. This progression has been as elating as it has been stressful.

We are seeing rapid optimizations, but early iterations often crashed mid-experiment, causing significant loss of progress. Memory bottlenecks frequently rendered IDEs unusable for multi-agent workflows, forcing a migration toward CLI tools. Even the introduction of language servers (via the Language Server Protocol, LSP) added another layer of complexity to memory usage. It feels as though the industry is now forking: some tools are doubling down on remote access and “command centers,” while others are focusing on multi-agent orchestration frameworks.

What Actually Works: Steering and Memory

Through experimentation, I’ve found that the most effective agents rely on three pillars:

Custom Hooks: To elicit precise steering and behavior.
Repo-Specific Skillsets: Tailoring capabilities based on the language or repository.
Robust Memory Systems: Enabling agents to index findings so they are easy to search and retrieve, a concept I explored further with Paperbridge for dynamic research procurement.
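
As a rough illustration of the third pillar, the sketch below shows a tiny in-memory index that an agent could write findings into and query later. It is a toy under stated assumptions: the `FindingIndex` class and its methods are hypothetical names of mine, it is not how Paperbridge works, and a real system would persist to disk and likely use full-text or embedding search.

```python
from collections import defaultdict

_PUNCTUATION = ".,:;!?"


class FindingIndex:
    """In-memory inverted index mapping words to stored findings."""

    def __init__(self) -> None:
        self._findings: list[str] = []
        self._by_word: dict[str, set[int]] = defaultdict(set)

    def add(self, text: str) -> int:
        """Store a finding and index each of its words; returns the finding id."""
        finding_id = len(self._findings)
        self._findings.append(text)
        for word in text.lower().split():
            self._by_word[word.strip(_PUNCTUATION)].add(finding_id)
        return finding_id

    def search(self, query: str) -> list[str]:
        """Return findings containing every query word, in insertion order."""
        words = [w.strip(_PUNCTUATION) for w in query.lower().split()]
        if not words:
            return []
        # .get() avoids creating empty entries in the defaultdict on a miss.
        ids = set.intersection(*(self._by_word.get(w, set()) for w in words))
        return [self._findings[i] for i in sorted(ids)]


if __name__ == "__main__":
    index = FindingIndex()
    index.add("Learning rate 3e-4 diverged after step 1200.")
    index.add("Batch size 64 fits in memory.")
    print(index.search("learning rate"))
```

The point of the sketch is the shape of the interface: findings go in as short text notes, and retrieval is a cheap lookup the agent can call mid-task.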

The Power of the “Monitor and Loop”

The most “sticky” feature I’ve encountered recently is the monitor and loop function introduced in the Claude Code harness. The ability to maintain a single, long-running session that monitors experiments for crashes is a game-changer for auto-research and brute-force optimization loops. While other tools like Codex and Antigravity have attempted similar “goal” or “task” systems, Claude Code’s implementation feels the most seamless.
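
To show the general pattern I mean by monitoring for crashes, here is a minimal sketch of a supervisor loop that reruns an experiment command when it exits with an error. This is not Claude Code's implementation, and it is not the logic of pi-loop either; the `monitor` function and its parameters are my own hypothetical names, and it only uses Python's standard `subprocess` module.

```python
import subprocess
import sys
import time


def monitor(command: list[str], max_restarts: int = 3, backoff_s: float = 0.0) -> int:
    """Run a command, restarting it while it exits nonzero.

    Returns the number of attempts used on success and raises RuntimeError
    once the restart budget is exhausted.
    """
    for attempt in range(1, max_restarts + 2):  # first run plus max_restarts retries
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with code {result.returncode}", file=sys.stderr)
        time.sleep(backoff_s)  # crude pause before the next restart
    raise RuntimeError(f"command still failing after {max_restarts} restarts")


if __name__ == "__main__":
    # Stand-in for a long-running experiment; replace with a real command.
    attempts = monitor([sys.executable, "-c", "print('experiment finished')"])
    print(f"succeeded after {attempts} attempt(s)")
```

A harness-level version would do more than restart: it would feed the crash output back to the agent so it can decide whether to retry, patch the code, or stop.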

While there are many impressive closed-source and open-source tools available, I have currently landed on Pi Coding Agent. The deciding factor was customizability. By integrating the logic from pi-loop, I am working to recreate that essential “sticky” monitoring feature within a harness that is easy to extend with the research and building tools I use daily.

My ultimate motivation is twofold: the desire to advance research without the risk of my data being ingested into a future training set, and the cost-efficiency of running massive, long-term autonomous experiments locally. Over time, I will extend this blog with references to harness tooling that I find interesting or am working on.
