Fine-tuning forgets. RAG leaks context. Hypernetworks generate the model your agent needs on demand.

Enterprise teams keep watching the same thing happen. An AI agent demos well, goes to production, and hits a wall: it works for a while, then needs a human to refill its context and check the output, and the promised efficiency turns into supervision. The agent does the work; a person does the watching. This is one of the reasons many agent pilots never become production systems.
The pitch on the other side of that wall is the one everyone wants to believe: an agent that does a long job alone, overnight if necessary, and leaves a person to verify only the last 10%. Whether that is achievable exposes a problem the orchestration discussion often skips. When AI firm Chroma tested 18 leading models, every one lost accuracy as its input grew, a property of how attention works, not a gap that a more powerful model closes. An agent that is fed your business context as it runs is unstable, and it gets shakier as that context grows.
This is the layer below the orchestration race. Routing, durability and observability all assume that each agent is already competent enough to be worth coordinating in the first place. The deeper question is how long an agent can work before a person has to step in, and that comes down to where your company’s knowledge sits relative to the model. Both standard fixes leave the person in the loop.
Why teaching a model your business keeps you in the loop
Frontier models keep getting more capable, yet the gap is not closing, because it is not a capability problem. It is about where your knowledge sits relative to the model, and businesses have two standard ways to put it there.
The first is fine-tuning, which bakes knowledge into the weights. It remains subject to catastrophic forgetting, a problem identified in the 1980s and still unresolved in 2026: teaching a model something new often destroys what it already knows. Teams work around it by splitting each task into its own fine-tuned model or adapter, producing a sprawl of models that raises cost and management overhead. And a fine-tuned model is a snapshot: it goes stale the day the policy changes, and the expensive, slow training cycle starts all over again.
The second is in-context learning, which bypasses retraining by placing the relevant policies in the prompt at run time. This is where context rot bites. Retrieval narrows what goes into the context window, but a missed retrieval still produces an answer that looks confident, and both cost and latency rise with every additional token.
The two failures rhyme. With fine-tuning, the model can work confidently from last quarter’s policy. With in-context learning, it can work confidently without the details it lost somewhere in a long context. Either way the output looks equally reliable, so you can’t tell which parts are bad without checking them all. That is why the person never leaves. Many teams run both at the same time, fine-tuning stable knowledge and retrieving the rest. That softens each failure but removes neither: for any given output you can’t be sure that the model is both current and working from the correct context, so you’re still checking it.
The third way: create an expert model on demand
The third path is moving from research into early products. Instead of retraining a single model or stuffing its context, a generator builds a small, task-specific model from your policies at decision time. The generator is a hypernetwork: a network whose output is the weights of another network.
The idea dates to 2016; using it to generate specialized language models from text or documents is recent and effective. Sakana AI’s Text-to-LoRA paper, presented at ICML 2025, generates a model adapter from a plain-language description in a single pass, and 2026 work called SHINE calls hypernetwork adaptation a promising new frontier, precisely because it sidesteps both the retraining cost of fine-tuning and the context limits of in-context learning.
The point of generating adapters instead of training and maintaining them is to collapse a sprawling library of per-task LoRAs into a single network that can generate them on demand, including for tasks it has not seen before.
The neat part is how this closes the loop on the problem above: the per-task adapter that teams build by hand to avoid catastrophic forgetting is the same thing the hypernetwork generates automatically. The model zoo stops being a management headache and becomes a generated output.
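To make the definition concrete, here is a minimal sketch in plain Python of a network whose output is the weights of another network. Everything in it is an assumption made for illustration: the `TinyHypernetwork` class, the task embedding, the dimensions, and the single untrained linear layer. It is not how Text-to-LoRA, SHINE or any vendor's generator is built; it only shows the shape of the idea, in which one generator emits a low-rank adapter per task while the base weights stay untouched.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TinyHypernetwork:
    """Hypothetical generator: task embedding -> low-rank adapter factors (A, B)."""

    def __init__(self, embed_dim, d_model, rank, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.rank = rank
        # One output per entry of A (rank x d_model) and B (d_model x rank).
        n_out = 2 * d_model * rank
        scale = 1.0 / math.sqrt(embed_dim)
        # Untrained random weights; a real generator would be trained.
        self.w = [[rng.gauss(0.0, scale) for _ in range(embed_dim)]
                  for _ in range(n_out)]

    def generate(self, task_embedding):
        """Emit the adapter for one task in a single forward pass."""
        flat = matvec(self.w, task_embedding)
        d, r = self.d_model, self.rank
        a = [flat[i * d:(i + 1) * d] for i in range(r)]                    # r x d
        offset = r * d
        b = [flat[offset + i * r:offset + (i + 1) * r] for i in range(d)]  # d x r
        return a, b


def adapted_forward(base_w, adapter, x):
    """Compute (W + B A) x without modifying the shared base weights W."""
    a, b = adapter
    base = matvec(base_w, x)
    delta = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, delta)]
```

Calling `generate` with two different task embeddings yields two different adapters from the same generator, which is the sense in which a per-task adapter library becomes an output rather than something to maintain.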
The case for smallness underneath all this was put directly in a 2025 paper by Nvidia researchers: for the small, repetitive tasks that fill an agent’s workflow, small models are powerful enough and are 10 to 30 times cheaper than frontier generalists. Nace.AI, a Palo Alto company that raised $21.5 million in seed funding in May, is a clear commercial example. Its core technology, a generator it calls MetaModel, generates model parameters at decision time from the company’s policies, and is aimed at regulated work: audit, compliance, risk assessment. The company claims that its agents handle the bulk of the workflow while human experts verify the result, a split it markets as 90/10.
How these three methods compare
Criterion | Fine-tuning | Context / RAG | Hypernetwork-generated model
Where business knowledge resides | In model weights | In the prompt, re-supplied each run | In weights generated on demand
Update cost when policy changes | High: retrain | Low: edit source | Low: regenerate
Staleness | High: snapshot | Low | Low: regenerated from current policy
Per-call cost and latency | Low | High, grows with context | Low at run time
Dominant failure mode | Forgetting; model-zoo sprawl | Context rot; silent retrieval misses | Generator quality; calibration
Who owns the improving asset | Whoever trains the model | Whoever manages the data store | Depends on where the generator and the feedback reside
Why the hypernetwork model raises the ceiling on autonomy
Smaller, more current and narrower models have less room for error. Fewer mistakes, confined to a known domain, mean fewer cases for an agent to hand back to a person, which is the real basis for any claim to higher autonomy. This is also where a number like 90/10 comes from: not a preset dial, but a result of how little the system needs to hand back. Reported autonomy shares are best read as measured properties, not as settings.
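The difference between a setting and a measured property can be shown with a small sketch. The `Item` record, its `confidence` field and the threshold value are hypothetical; the assumption is only that each unit of work carries some confidence signal and that low-confidence items go to a person. The share is then computed from what actually happened rather than configured up front.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    confidence: float  # assumed: a model-reported confidence in [0, 1]


def split_by_escalation(items, threshold):
    """Items at or above the threshold stay with the agent; the rest go to a person."""
    autonomous = [i for i in items if i.confidence >= threshold]
    escalated = [i for i in items if i.confidence < threshold]
    return autonomous, escalated


def autonomy_share(items, threshold):
    """Fraction of items the agent completed without handing back."""
    if not items:
        return 0.0
    autonomous, _ = split_by_escalation(items, threshold)
    return len(autonomous) / len(items)


# Hypothetical overnight batch: nine confident items, one uncertain one.
batch = [Item(f"doc-{n}", 0.95) for n in range(9)] + [Item("doc-9", 0.40)]
share = autonomy_share(batch, threshold=0.8)
```

With this batch the share works out to 0.9, but only because nine of the ten items cleared the threshold. A model that is uncertain more often produces a lower share under the same rule.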
Two design choices determine whether that autonomy is reliable or just fast. The first is grounding: linking every output to its source so that the reviewer can verify rather than redo. Research models designed specifically for this, such as HalluGuard, label each claim as supported or not and cite the passage it relied on. Nace ships its agents with grounding models and reasoning traces for the same reason. A 10% review means something only if a person can confirm the source in seconds.
The second is the feedback loop, and it forces the question every buyer must ask: when your experts validate an output, whose model is that improving, and where does it sit? That determines whether the compounding asset is the vendor’s or yours. Arrangements vary. Nace, for example, uses an external network of certified experts and, in direct enterprise deployments, the customer’s own employees, with the resulting model stored within the customer’s cloud. Each option directs the learning, and the ownership, to a different place.
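As a rough illustration of what a reviewer receives when each claim is tied to a source, the sketch below labels a claim as supported or not and names the passage it matched. The word-overlap score, the `min_overlap` cutoff and the passage IDs are stand-ins chosen for brevity; this is not how HalluGuard or any production grounding model works, only the shape of the output they produce.

```python
import re


def tokens(text):
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ground_claim(claim, passages, min_overlap=0.6):
    """Attach the best-matching passage to a claim, or mark it unsupported.

    `passages` maps a passage ID to its text. The overlap score is a crude
    lexical proxy, used here only to keep the example self-contained.
    """
    claim_tokens = tokens(claim)
    best_id, best_score = None, 0.0
    for passage_id, text in passages.items():
        score = len(claim_tokens & tokens(text)) / max(len(claim_tokens), 1)
        if score > best_score:
            best_id, best_score = passage_id, score
    supported = best_score >= min_overlap
    return {
        "claim": claim,
        "supported": supported,
        "source": best_id if supported else None,
    }


# Hypothetical policy passage and two claims from an agent's output.
policy = {"policy-4.2": "Invoices above 10000 USD require two approvals."}
ok = ground_claim("Invoices above 10000 USD require two approvals", policy)
bad = ground_claim("Refunds are processed within three days", policy)
```

The first claim comes back supported with `policy-4.2` as its source; the second comes back unsupported with no source, which is the item a reviewer would look at first.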
Where the third path breaks down
The approach is still early, and a few questions will determine how far it goes. Calibration is the linchpin: the value depends on the model knowing when it is uncertain. And that is not yet solved: recent work on generated adapters has found that they don’t automatically improve calibration over standard fine-tuning, with gains achieved only under certain constraints.
The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is an open research frontier: the hypernetworks shown in published work so far have been small. This is where Nace’s work is interesting. In our interview, the company said it has scaled its generator beyond those published sizes and found a scaling law describing how performance increases; the results have begun to be shared publicly and are now under peer review. If it stands up, it could help answer one open question in the field, and it’s a paper worth watching.
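One common way to put a number on whether a model knows when it is uncertain is expected calibration error (ECE): group predictions by stated confidence, then compare average confidence with actual accuracy in each group. The sketch below is a standard equal-width-bin version; the bin count and the sample values are illustrative, and nothing here is drawn from the work cited above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Illustrative case: the model claims 90% confidence but is right 1 time in 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, False, False, False])
```

Here the gap is about 0.65, so the confidence score would be a poor basis for deciding which items a person can skip. A value near zero means stated confidence tracks accuracy.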
Whichever method wins, the final check is still human, and that handoff is its own design problem. When Deloitte Australia delivered a government report worth A$440,000, it shipped with fabricated citations and an invented court quote after passing senior review, because the reviewers examined the conclusions, which were plausible, and not the grounding, which was not. Controlled studies suggest that the pattern is general: experts correct the same flawed recommendation less often when it is labeled as AI-generated.
Article 14 of the EU AI Act now names this automation bias. The lesson is not about any one vendor: a high autonomy share concentrates a person’s attention on a small, final slice of the work, so the value of that review depends entirely on whether the person can quickly assess where each output came from, which goes back to grounding.
What to build, and what to ask before buying
Honest takeaway: what’s stopping your agents is usually not the orchestration or the size of the model, but whether the model knows your business well enough to be left alone, and the right fit depends on the job. To make a long, repetitive, high-volume process run end to end, say running most of an internal audit overnight and having your experts check the last piece, a hypernetwork-generated model is the most plausible way to do that cheaply and to run unattended long enough to be valuable. For a short task that ends in a few steps and never needs to run unattended, the gap between this and a well-prompted frontier model shrinks to almost nothing, and is not worth the integration cost.
When a vendor pitches autonomous or expert agents, four questions cut through the pitch.
Where does business knowledge reside: in the weights, in the prompt, or in on-demand generation?
What does each output come with, so the reviewer can verify it instead of redoing it?
What determines which work gets escalated to a person?
And whose model improves from that feedback, and where does it sit?
The answers, not the headline figure, tell you what to buy.
The hypernetwork approach is the most credible attempt at the moment to make a small model know a particular business without forgetting it and without re-feeding it on every run. It is also the least proven, and the most important parts, calibration and scale, are still under peer review. For the right job, evaluate it now. For the wrong one, the integration cost buys you less than a well-prompted frontier model would.



