
If Not Transformers, Then What?

Julius Hollmann
May 18, 2026
7 min read

Executive Summary

  • Most AI agents today rely on Transformer-based models because they perform well across many tasks and allow for scaling.
  • Alternatives such as state-space models, hybrid architectures, and diffusion-style generation mainly address efficiency issues like memory use, latency, and long-context handling.
  • The hardest enterprise problems are still not solved by alternative model architectures. Factual accuracy, hallucination, and prompt injection require a strong agent harness with retrieval, validation, permissions, and monitoring.
The Transformer stays at the center; what changes is everything around it.

1. Transformers Are Still the Default

Almost every model used for modern AI agents is built on the Transformer architecture. Originally designed for language, the Transformer turned out to be a much more general pattern. It converts inputs into token sequences, uses self-attention to compare tokens, and builds context-aware representations that support prediction, generation, and action selection.
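The comparison step described above can be sketched in a few lines. This is a minimal, single-head illustration in plain Python, assuming made-up two-dimensional token vectors and skipping the learned query, key, and value projections that a real Transformer applies; the function names are chosen here for illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention without learned projections.

    Simplification: queries, keys, and values are all the raw token vectors.
    A real Transformer first applies learned linear maps to obtain them.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Compare this token with every token in the sequence (scaled dot product).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Context-aware representation: weighted average of all token vectors.
        mixed = [
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs

# Three toy "token embeddings" (values are invented for the example).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Each output vector is a blend of all input vectors, weighted by how strongly the token matches the others. That blending is what "context-aware representation" means in practice.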

That same pattern now extends well beyond text. Transformer-based models are used not only for language modelling, but also for images, audio, video, and action-related tasks. This is one of the main reasons they became the foundation of many agent systems: the same core design can be adapted across very different modalities and use cases.

They also became dominant for practical reasons. Transformer training can be parallelized efficiently on GPUs, the models scale well with more data and parameters, and they have delivered strong results in text generation, coding, classification, and multimodal reasoning. In practice, they are not only versatile but also proven in production across a wide range of applications.

The Transformer pattern now appears in several model families:

  • Large Language Models (LLMs): These power chatbots, coding assistants, research agents, and workflow automation. They generate text, write code, produce structured outputs, and call tools, APIs, and other agents.
  • Vision-Language Models (VLMs): These models combine images and text. They can interpret screenshots, extract data from scans, and answer questions about visual content.
  • Speech and Audio-Language Models: These models support transcription and real-time voice agents. In many production systems, the architecture is still modular: audio is first converted to text by a speech-to-text model, then processed by an LLM for reasoning, and finally rendered back into speech by a text-to-speech model.
  • Vision-Language-Action Models (VLAs): These models add action prediction to image and language understanding. In robotics, for example, they can translate camera input and an instruction such as “pick up the red cup” into movement commands.
  • World Models: These predict how an environment will evolve. For example, they may estimate how traffic will move, how an object will fall, or what a room may look like after a robot takes an action.

So, Transformers are no longer just the architecture behind chatbots. They are the foundation for a much broader class of AI systems, including multimodal applications, voice systems, and physical AI.

2. What Alternative Models Actually Change

The main strength of Transformers, attention over many tokens, also creates bottlenecks as agents move from short chats to long documents, video, real-time voice, and physical control.

  • Cost and latency at long context

In a standard Transformer, self-attention compares every token with every other token. As inputs get longer, computation grows quadratically with sequence length, and memory usage increases because the model must keep track of past tokens. This becomes expensive for agents working with long documents, large codebases, databases, or video streams.
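A rough back-of-the-envelope calculation makes the scaling concrete. The sketch below assumes a hypothetical model shape (32 layers, 8 key-value heads, head dimension 128, 2 bytes per stored value); these numbers are illustrative and do not describe any specific model.

```python
def attention_pairs(num_tokens):
    """Token-to-token comparisons for one attention head over the full input.

    Causal masking roughly halves this count, but growth stays quadratic.
    """
    return num_tokens * num_tokens

def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values across all layers.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for n in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n:>7} tokens: {attention_pairs(n):>14,} comparisons per head, "
          f"{gib:.2f} GiB cache")
```

Under these assumptions, a tenfold longer input means a hundredfold more comparisons per head, while the cache of past tokens grows tenfold.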

State-space models address this by carrying forward a compressed memory of the past instead of comparing every token with every other token. This can make them more efficient for long inputs. The best-known recent example is Mamba. These models are not yet dominant because the Transformer ecosystem is much more mature, and pure state-space models can struggle with exact recall from context. For now, they are more often used in hybrids or specialized settings than as full replacements.
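The idea of a compressed memory can be illustrated with a toy recurrence. This is a deliberately simplified sketch, assuming a single scalar state with fixed `decay` and `input_gain` values picked for the example; models such as Mamba use learned, input-dependent parameters and much larger states.

```python
def ssm_scan(inputs, decay=0.9, input_gain=0.1):
    """Scalar linear state-space recurrence.

    Update rule: state_t = decay * state_{t-1} + input_gain * x_t

    The state is the only memory carried forward, so memory use stays
    constant regardless of how long the input is.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + input_gain * x
        outputs.append(state)
    return outputs

# One impulse followed by silence: watch how the memory of it fades.
signal = [1.0] + [0.0] * 9
trace = ssm_scan(signal)
```

The trace shrinks at every step, which hints at both sides of the trade-off: cost per token is constant, but older information is only kept in blurred form, so exact recall of an early token is harder than with attention.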

  • Sequential generation

Many Transformer-based language models generate outputs one token at a time. This creates a latency bottleneck for long answers, code files, and structured outputs.

Other alternatives focus less on the architecture and more on how outputs are generated. Diffusion and masked-generation methods try to remove this bottleneck by refining multiple parts of the output in parallel. In some cases, this can improve both speed and global consistency.
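The difference in the number of model calls can be shown with a toy comparison. The sketch below is not a real diffusion or masked language model: `toy_predict` is a hypothetical stand-in that reads tokens from a fixed target and assigns invented confidence scores, purely to show the control flow of the two decoding loops.

```python
MASK = None  # marker for a position that has not been generated yet

def toy_predict(sequence, target):
    """Stand-in for a model: proposes (token, confidence) for every masked slot.

    Hypothetical stub. It copies from a fixed target and gives earlier
    positions higher confidence. A real model would predict both from
    the partially filled sequence.
    """
    return {
        i: (target[i], 1.0 / (i + 1))
        for i, tok in enumerate(sequence)
        if tok is MASK
    }

def autoregressive_decode(target):
    """One model call per generated token, left to right."""
    sequence, calls = [], 0
    for i in range(len(target)):
        padded = sequence + [MASK] * (len(target) - len(sequence))
        proposals = toy_predict(padded, target)
        calls += 1
        sequence.append(proposals[i][0])
    return sequence, calls

def masked_parallel_decode(target, tokens_per_step=4):
    """Each model call commits several tokens at once."""
    sequence, calls = [MASK] * len(target), 0
    while MASK in sequence:
        proposals = toy_predict(sequence, target)
        calls += 1
        # Commit only the most confident proposals; the rest stay masked.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_step]:
            sequence[i] = proposals[i][0]
    return sequence, calls

target = "def add ( a , b ) : return a + b".split()
ar_tokens, ar_calls = autoregressive_decode(target)
par_tokens, par_calls = masked_parallel_decode(target, tokens_per_step=4)
```

For this 12-token target, the left-to-right loop needs 12 model calls and the parallel loop needs 3. In a real system each parallel step is more expensive and output quality depends on the model, so the speedup is not simply the ratio of call counts.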

Still, these alternatives are mostly solving operational issues. They target lower memory use, faster inference, better long-context handling, or better performance for continuous signals such as audio, sensor data, and video. Those are important gains, but they do not automatically make an AI agent more accurate and secure.

3. The Biggest Challenges Are Not Solved by Alternative Model Architectures

Current architectural improvements can make agents faster, cheaper, and easier to deploy, but they do not automatically make them trustworthy. Many of the remaining problems are not solved inside the model architecture itself. They are handled by the agent harness: the surrounding system that controls data access, retrieval, tool use, validation, permissions, monitoring, and human oversight.

  • Factual accuracy

Enterprise agents are expected to answer business-critical questions and automate internal processes reliably. The risk is that a model may answer from incomplete training data, outdated information, or unsupported assumptions.

Enterprises mitigate this by grounding agents in approved data sources rather than relying only on the model’s internal knowledge. The most common approach is retrieval-augmented generation, where the agent first retrieves relevant documents, database records, or knowledge-base entries and then generates an answer based on that evidence. Schema-RAG agents go further: they use knowledge graphs to generate queries and ground answers in large enterprise data.
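The retrieve-then-generate flow can be sketched as follows. This is a minimal illustration, assuming a tiny in-memory document list with invented ids and texts, and using simple word overlap as a placeholder for the embedding search, keyword index, or graph query a production system would use. The call to the language model itself is left out.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())

def retrieve(question, documents, top_k=2):
    """Rank approved documents by word overlap with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc["text"])), doc) for doc in documents]
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_grounded_prompt(question, evidence):
    """Build a prompt that restricts the model to the retrieved evidence."""
    if not evidence:
        return None  # nothing to ground on: the caller should decline, not guess
    sources = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in evidence)
    return (
        "Answer using only the sources below and cite their ids. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Hypothetical approved knowledge base.
documents = [
    {"id": "policy-12", "text": "Refunds are issued within 14 days of a return request."},
    {"id": "policy-07", "text": "Standard shipping takes 3 to 5 business days."},
    {"id": "hr-03", "text": "Employees accrue vacation monthly."},
]

question = "How many days do refunds take?"
evidence = retrieve(question, documents)
prompt = build_grounded_prompt(question, evidence)
```

The important design choice is that the harness, not the model, decides which sources are in scope, and that an empty retrieval result leads to a refusal path rather than an ungrounded answer.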

  • Hallucination

Hallucination happens when a model produces an answer that is unsupported, fabricated, or subtly wrong. This is especially dangerous in agent systems because the output may trigger an action: sending an email, updating a CRM record, writing SQL, approving a transaction, or changing a configuration.

Hallucination can be reduced by limiting what the model is allowed to do and by adding verification steps around important actions. Common controls include deterministic workflows, query validation, calculation checks, and human approval for high-risk operations. In mature deployments, the LLM is not treated as the system of record. It is treated as a reasoning interface wrapped with validation and guardrails.
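Query validation is one of the easier controls to illustrate. The sketch below assumes a hypothetical allowlist of two tables and uses coarse regular-expression checks; a real deployment would more likely rely on a proper SQL parser plus read-only database credentials, so treat this as a picture of the idea rather than a complete defense.

```python
import re

ALLOWED_TABLES = {"orders", "customers"}  # hypothetical allowlist

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant)\b", re.IGNORECASE
)
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)

def validate_sql(query):
    """Return (ok, reason) for a model-generated query before it is executed.

    Deliberately conservative: it may reject some harmless queries,
    for example one whose string literal contains the word "update".
    """
    stripped = query.strip().rstrip(";")
    if ";" in stripped:
        return False, "multiple statements"
    if not stripped.lower().startswith("select"):
        return False, "only SELECT is allowed"
    if FORBIDDEN.search(stripped):
        return False, "write or DDL keyword found"
    tables = {name.lower() for name in TABLE_REF.findall(stripped)}
    unknown = tables - ALLOWED_TABLES
    if unknown:
        return False, f"table not allowed: {sorted(unknown)[0]}"
    return True, "ok"

ok_result = validate_sql("SELECT id FROM orders WHERE total > 100")
bad_result = validate_sql("SELECT * FROM salaries")
```

A rejected query never reaches the database, so a hallucinated table name or an unintended write becomes a handled error instead of an executed action.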

  • Prompt injection

Prompt injection is a security risk where malicious or untrusted content causes the agent to violate its instructions. This can happen when an agent reads a message, web page, email, or document that contains hidden instructions. The risk grows as agents gain access to more tools and internal systems.

The core issue is that language models do not naturally separate instructions from data with the same rigor as secure operating systems. To the model, system instructions, user prompts, retrieved documents, and malicious text all appear as tokens in context. No alternative model architecture currently solves this vulnerability.

Enterprises mitigate this risk by restricting tool permissions. Sensitive actions should require confirmation or human review. Other defenses include guardrails, sandboxed tool execution, output monitoring, and in some cases fine-tuning. The safest enterprise agents are designed as controlled systems, not autonomous models with unlimited authority.
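Restricting tool permissions can be sketched as a small policy check that runs outside the model. The tool names and policies below are hypothetical, and the sketch covers only the authorization decision, not sandboxing or monitoring.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolPolicy:
    allowed: bool
    needs_confirmation: bool = False

# Hypothetical tools and policies, defined by the operator and not by the model.
POLICIES = {
    "search_docs": ToolPolicy(allowed=True),
    "send_email": ToolPolicy(allowed=True, needs_confirmation=True),
    "delete_records": ToolPolicy(allowed=False),
}

def authorize(tool_name, confirmed_by_human=False):
    """Decide whether a tool call requested by the model may run.

    Unknown tools are denied by default, so text injected into the
    context cannot invent new capabilities.
    """
    policy = POLICIES.get(tool_name)
    if policy is None or not policy.allowed:
        return "deny"
    if policy.needs_confirmation and not confirmed_by_human:
        return "needs_confirmation"
    return "allow"

decisions = {name: authorize(name) for name in ("search_docs", "send_email", "delete_records")}
```

Because the check does not depend on anything the model says about its own intent, an injected instruction can at most request an action. Whether the action runs is still decided by the policy and, for sensitive tools, by a human.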

Conclusion

The Transformer architecture remains the foundation of modern AI agents because it is flexible, scalable, and mature across many tasks. Large language models, vision-language models, audio-language systems, vision-language-action models, and world models all show how far the Transformer pattern has spread.

Alternative architectures such as state-space models, hybrids, and diffusion-style generation are important because they improve efficiency, especially for long context, streaming inputs, and constrained environments. However, they do not solve the most critical enterprise challenges on their own. Reliable AI agents still require grounding, validation, security controls, and careful system design around the model.

So, if the question is “If not Transformers, then what?”, the answer is not a simple replacement. It is a broader design choice: use the right model architecture for the workload, but build the real trust layer outside the model.
