Executive Summary
- A decade ago, ML projects failed not because of bad models, but because teams underestimated the scaffolding around them.
- Today, the same pattern is repeating with large language models. Model training and deployment have shrunk to an API call, but the surrounding system has grown more complex than ever.
- This article introduces a four-layer architecture for production AI applications: LLM, Harness, Grounding Layer, and Use Case.

Introduction
Every major technology wave begins with a simplifying assumption. In machine learning, it was that better models would naturally lead to better outcomes. In practice, model quality alone was never enough. Production systems required data pipelines, validation, deployment, monitoring, and governance. The model mattered, but it was only one component of a much larger system.
The same pattern is now emerging with large language models. Because the models are accessible through an API, many engineers assumed that building intelligent applications had become straightforward. In reality, the complexity has not disappeared; it has moved. The challenge is now the architecture around the model: how context is assembled, how tools are orchestrated, how user intent is interpreted, and how outputs are made reliable.

MLOps a Decade Ago: The Problem Is Not the Model
In 2015, the paper “Hidden Technical Debt in Machine Learning Systems” made clear that the model is only a small part of production ML (see the diagram below). Most complexity sits in the surrounding infrastructure: data collection, validation, feature extraction, configuration, process management, serving, and monitoring. If the surrounding system is weak, the model does not matter.
This insight challenged how organizations invested. They focused heavily on model experimentation while underestimating the work required to make models trustworthy in the real world. The difficulty of ML was not training a model once; it was maintaining a living system full of hidden dependencies.
Sculley and co‑authors showed that ML systems carry a dangerous kind of technical debt: models are entangled with data pipelines, feature engineering, and serving environments, making small changes cascade unpredictably. MLOps emerged to manage this complexity through reproducible pipelines, versioning, deployment, monitoring, retraining, and governance. It did not make ML simple, but it made it operational.

The Same Pattern Today
Large language models have changed where complexity lives, but not the fact that it exists.
In classical ML, organizations had to train, deploy, and serve their own models. With LLMs, much of that burden is outsourced, creating the illusion that the hard part has disappeared. In reality, it has moved up into the application layer.
The old challenge of operationalizing a model has become the new challenge of operationalizing a reasoning system. Traditional models often produced narrow, structured outputs. LLMs produce free-form language, make decisions, and behave non-deterministically.
They enable agents to search documents, query data, update records, trigger workflows, and coordinate subagents across multiple steps. At that point, the system needs architecture, not just prompting. The glamorous part is still the model. The hard part is still the scaffolding.

The Four Layers of Effective Agents
Production AI systems can be understood as four layers: LLM, Harness, Grounding Layer, and Use Case.
1. The LLM Layer
The LLM is the thinking layer or reasoning engine of the agent. It interprets natural language, generates text, and determines the next step of the agent’s loop.
Its role is similar to that of a traditional ML model in a production system: the model produces the prediction, while the surrounding infrastructure makes that prediction useful, reliable, and controllable. In the same way, the LLM provides the agent’s reasoning capability, but it depends on the layers around it to ground, direct, and operationalize that reasoning.
The LLM is the brain of the agent, but not the whole system.
2. The Harness Layer
The harness is the operational layer surrounding the model. It governs how the agent behaves at runtime. This is what transforms the LLM from a general-purpose reasoning engine into a controlled application component.
The harness determines which tools or MCP (Model Context Protocol) servers the agent can call or connect to. It enforces guardrails that prevent the model from drifting outside acceptable boundaries. It manages conversational memory: what the agent retains across turns, what it discards, and how prior interactions shape subsequent reasoning.
If the LLM is the engine, the harness is the control plane. It is also where most of the engineering complexity of a production agent resides and where the difference between a compelling demo and a reliable system is ultimately decided.
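The three harness responsibilities described above (tool access, guardrails, and memory) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the `Harness` class, the `ToolNotAllowed` exception, and the tool names `search_docs` and `delete_record` are all hypothetical, and the guardrail shown is only a simple allowlist.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Set, Tuple


class ToolNotAllowed(Exception):
    """Raised when a requested tool is outside the allowlist."""


@dataclass
class Harness:
    # Hypothetical tool registry: tool name -> plain Python callable.
    tools: Dict[str, Callable[[str], str]]
    # Guardrail: only these tool names may be invoked at runtime.
    allowed_tools: Set[str]
    # Memory policy: keep only the most recent `max_turns` exchanges.
    max_turns: int = 3
    memory: Deque[Tuple[str, str]] = field(default_factory=deque)

    def call_tool(self, name: str, argument: str) -> str:
        """Run a tool only if it is both registered and allowlisted."""
        if name not in self.tools or name not in self.allowed_tools:
            raise ToolNotAllowed(name)
        return self.tools[name](argument)

    def remember(self, user_msg: str, agent_msg: str) -> None:
        """Store one exchange and discard the oldest beyond max_turns."""
        self.memory.append((user_msg, agent_msg))
        while len(self.memory) > self.max_turns:
            self.memory.popleft()


harness = Harness(
    tools={
        "search_docs": lambda query: f"results for {query}",
        "delete_record": lambda record_id: f"deleted {record_id}",
    },
    allowed_tools={"search_docs"},  # delete_record is registered but not permitted
    max_turns=2,
)
```

In this sketch the model never calls a tool directly. It can only request one, and the harness decides whether the request is executed, which is the sense in which the harness acts as a control plane.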
3. The Grounding Layer
Every agent, whether conversational or process-embedded, depends on context to perform reliably. Without it, even a capable model will hallucinate, misinterpret intent, or produce outputs that are fluent but operationally wrong.
The grounding layer assembles that context before the model is asked to reason. It solves two problems.
The first is intent resolution. Users express themselves in shorthand and implicit references. A request such as “show me last quarter’s numbers for the Hamburg project” requires the system to resolve which project, which metrics, and which time range. The grounding layer translates underspecified requests into the precise entities that enterprise systems require. In autonomous settings, the equivalent challenge is not human ambiguity but system ambiguity: inconsistent naming, outdated records, or conflicting definitions across sources.
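As a rough illustration of intent resolution, the sketch below turns the example request into a project ID and a date range. Everything specific here is an assumption made for the example: the alias table `PROJECT_ALIASES`, the ID `PRJ-0042`, and the choice to interpret "last quarter" as the previous calendar quarter (a real organization might use fiscal quarters instead).

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

# Hypothetical alias table; in practice this might come from a catalog or graph.
PROJECT_ALIASES = {"hamburg project": "PRJ-0042"}


def previous_quarter(today: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar quarter before `today`."""
    current_start_month = 3 * ((today.month - 1) // 3) + 1
    current_start = date(today.year, current_start_month, 1)
    end = current_start - timedelta(days=1)   # last day of the previous quarter
    start = date(end.year, end.month - 2, 1)  # quarters span three months
    return start, end


@dataclass(frozen=True)
class ResolvedRequest:
    project_id: str
    start: date
    end: date


def resolve(request: str, today: date) -> Optional[ResolvedRequest]:
    """Map an underspecified request to concrete entities, or None if unresolved."""
    text = request.lower()
    project_id = next(
        (pid for alias, pid in PROJECT_ALIASES.items() if alias in text), None
    )
    if project_id is None or "last quarter" not in text:
        return None  # a real system might ask a clarifying question here
    start, end = previous_quarter(today)
    return ResolvedRequest(project_id, start, end)
```

Returning `None` for anything that cannot be resolved is a deliberate choice in this sketch: an unresolved reference is surfaced rather than guessed.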
The second is context structuring. Enterprise data spread across databases, document repositories, and internal APIs is rarely in a format that supports reliable reasoning. The grounding layer determines how this information is represented before it reaches the model: raw text, structured JSON, graph triples, or condensed natural language. These format decisions directly affect the quality of the agent’s output.
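To make the format decision concrete, the sketch below renders one record in three of the representations mentioned above: structured JSON, graph triples, and condensed natural language. The record, its field names, and its values are invented for illustration only.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical record; field names and values are illustrative only.
record: Dict[str, object] = {
    "project_id": "PRJ-0042",
    "name": "Hamburg project",
    "quarter": "2024-Q1",
    "revenue_eur": 1250000,
}


def as_json(rec: Dict[str, object]) -> str:
    """Structured JSON: compact and easy to parse, but carries no narrative."""
    return json.dumps(rec, sort_keys=True)


def as_triples(rec: Dict[str, object], subject_key: str = "project_id") -> List[Tuple[str, str, str]]:
    """Graph triples: (subject, predicate, object) with the ID as subject."""
    subject = str(rec[subject_key])
    return [(subject, key, str(value)) for key, value in rec.items() if key != subject_key]


def as_sentence(rec: Dict[str, object]) -> str:
    """Condensed natural language: readable, but less precise to parse back."""
    return (
        f"{rec['name']} ({rec['project_id']}) reported revenue of "
        f"{rec['revenue_eur']:,} EUR in {rec['quarter']}."
    )
```

Which representation works best depends on the task; the point of the sketch is only that the choice is made explicitly, in code, before the context reaches the model.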
The grounding layer prepares context; the harness layer delivers it to the model at runtime.
A knowledge graph is a particularly effective way to operationalize this layer because it addresses problems that simpler retrieval methods leave unresolved. Where keyword-based retrieval returns documents that may or may not contain the answer, a knowledge graph encodes entities, attributes, and relationships explicitly. It resolves mappings such as “Hamburg project” to a project ID directly and provides a consistent semantic layer across fragmented enterprise sources.
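A toy version of this idea is shown below: a handful of triples, an alias lookup, and a two-hop traversal. It is a sketch under stated assumptions rather than a real graph store. The triples, the predicate names (`has_alias`, `owned_by`, `has_name`), and the IDs are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical graph contents; a production system would use a graph database.
TRIPLES: Set[Triple] = {
    ("PRJ-0042", "has_alias", "hamburg project"),
    ("PRJ-0042", "located_in", "Hamburg"),
    ("PRJ-0042", "owned_by", "TEAM-7"),
    ("TEAM-7", "has_name", "Logistics Engineering"),
}


def objects(subject: str, predicate: str) -> List[str]:
    """Return all objects linked to `subject` via `predicate`, sorted."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def resolve_alias(mention: str) -> Optional[str]:
    """Map a free-text mention to an entity ID; None if missing or ambiguous."""
    needle = mention.strip().lower()
    matches = sorted(s for s, p, o in TRIPLES if p == "has_alias" and o == needle)
    return matches[0] if len(matches) == 1 else None


def owner_name(project_id: str) -> Optional[str]:
    """Follow owned_by, then has_name: a two-hop traversal."""
    for team in objects(project_id, "owned_by"):
        names = objects(team, "has_name")
        if names:
            return names[0]
    return None
```

The alias lookup either returns exactly one entity or nothing, and the traversal answers a question ("who owns this project?") that no single source record contains, which is what keyword retrieval alone tends to miss.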
4. The Use Case Layer
The use case layer is the application itself: the workflow, interface, objectives, and constraints that define business value.
This layer determines what the agent is trying to optimize for, what actions it is allowed to take, what level of uncertainty is acceptable, and what must be logged, reviewed, or blocked. Different use cases may rely on the same underlying stack, but they require different behaviors and controls.
As was the case a decade ago, many teams start with the model and only later define the problem. Effective systems work the other way around. The use case defines the requirements, which determine what knowledge is needed, how the harness should behave, and how the model should be used.
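The controls this layer defines (which actions are allowed, how much uncertainty is acceptable, and what must be reviewed or blocked) can be written down as an explicit policy. The sketch below is one possible shape, with assumptions throughout: the `UseCasePolicy` class, the action names, the thresholds, and the idea that the agent reports a numeric confidence are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


@dataclass(frozen=True)
class UseCasePolicy:
    name: str
    allowed_actions: FrozenSet[str]  # may run without a human in the loop
    review_actions: FrozenSet[str]   # always require human review
    min_confidence: float            # below this, allowed actions are reviewed

    def decide(self, action: str, confidence: float) -> Decision:
        """Classify a proposed action under this use case's controls."""
        if action not in self.allowed_actions and action not in self.review_actions:
            return Decision.BLOCK
        if action in self.review_actions or confidence < self.min_confidence:
            return Decision.REVIEW
        return Decision.ALLOW


# Two hypothetical use cases sharing the same underlying stack.
reporting = UseCasePolicy("reporting_assistant", frozenset({"read_report"}), frozenset(), 0.6)
finance_ops = UseCasePolicy(
    "finance_ops", frozenset({"read_report"}), frozenset({"update_record"}), 0.9
)
```

The same proposed action yields different outcomes under the two policies, which illustrates how one stack can serve use cases with different risk tolerances.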

Conclusion
A decade ago, the ML industry learned that the model is only a small part of the production system. Everything around it (data pipelines, deployment, monitoring, governance, and operations) determines whether it creates value or accumulates technical debt.
That same lesson is returning with LLMs. Today’s systems are more conversational, more dynamic, and deceptively simple. Because the model is accessible through an API, it is easy to mistake accessibility for completeness. But the real complexity has not disappeared. It has shifted into harness design, knowledge representation, and use-case-specific control.
The most effective AI applications will not be defined by model power alone. They will be built on a deliberate stack: LLM for making decisions, Harness for orchestration, Grounding Layer for context, and Use Case for value.












