Tech

Alibaba’s model was never trained as an agent – and it improved agent performance on all seven benchmarks

0 0 4 minutes read


Alibaba’s Qwen team released Qwen-AgentWorld on Tuesday – two models trained not to act inside agent environments, but to predict what those environments will return. The release covers seven domains under one framework: MCP, Search, Terminal, Software Engineering, Android, Web, and OS.

The release extends Alibaba’s recent push into autonomous agents. Qwen3.7-Max, released in May, is built around 35 hours of autonomous operation.

That shift has pushed agent training teams up against a scaling problem. Real search engines return whatever results exist, with no way to inject controlled conditions. Live terminals do not allow a low-disk-space condition to be injected on demand. Agent training is at the mercy of whatever conditions production happens to produce, with no systematic way to expose agents to the cases they will need to handle but rarely encounter in training.

The research team trained agents inside the resulting simulator and found performance gains beyond those from training against real environments alone. In a separate experiment, using world model training as a warm-up before agent optimization improved performance on all seven benchmarks, including three that the model never saw during training.

The paper accompanying the release identifies a gap in prior agent research. "We argue that world modeling is an important missing component on the path to general agents."

Qwen-AgentWorld trains on what environments return, not what agents should do

Most agent models are trained to answer one question: given what the environment has just shown me, what should I do next? Qwen-AgentWorld is trained to answer the inverse: given what the agent just did, what will the environment show next?
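The two questions can be written as two function signatures. The sketch below is only an illustration, not the Qwen-AgentWorld interface: `Policy`, `WorldModel`, `toy_policy`, and `toy_terminal_world_model` are hypothetical names, and the hand-written rules stand in for what a trained model would predict.

```python
from typing import Callable

# Hypothetical type aliases, for illustration only.
# An agent policy maps the latest observation to the next action.
Policy = Callable[[str], str]
# A world model maps (current state, action) to the predicted next observation.
WorldModel = Callable[[dict, str], str]


def toy_policy(observation: str) -> str:
    """Agent side: given what the environment showed, pick an action."""
    if "No such file" in observation:
        return "ls"
    return "cat notes.txt"


def toy_terminal_world_model(state: dict, action: str) -> str:
    """Environment side: given the agent's action, predict the output.

    `state` is a made-up in-memory file system: {filename: contents}.
    """
    parts = action.split()
    if parts == ["ls"]:
        return "\n".join(sorted(state))
    if len(parts) == 2 and parts[0] == "cat":
        name = parts[1]
        if name in state:
            return state[name]
        return f"cat: {name}: No such file or directory"
    return f"{parts[0]}: command not found" if parts else ""


if __name__ == "__main__":
    files = {"readme.md": "hello"}
    action = toy_policy("")                                 # agent chooses an action
    observation = toy_terminal_world_model(files, action)   # simulator answers
    print(action, "->", observation)
    print("next action:", toy_policy(observation))
```

In this toy loop the second function plays the role the article assigns to the world model: it never decides what to do, it only answers what the environment would show.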

That inversion is the core of what the paper calls a language world model: instead of optimizing to select an action, the model learns to predict the next environment state across all seven domains under a single training objective. Previous work was narrower: WebWorld, Qwen’s earlier project from February, covered only the web; Snowflake’s Agent World Model, published the same month, generates environments backed by SQL code instead of training a model to predict states. Qwen-AgentWorld is the first to cover seven domains in one model, with environment modeling built in from the first training stage.

Alibaba trained both models in three stages on more than 10 million environment interaction trajectories from real agents. The first stage teaches the model how environments behave – file systems, terminal states, browser DOM changes, API responses. The second stage trains the model to reason carefully about what comes next before making a prediction. The third stage, reinforcement learning, sharpens the predictions using rule-based checks and open-ended quality scores.

Both models are Mixture-of-Experts designs – only a fraction of the parameters are active for each token. The 35B model activates 3B; the 397B model activates 17B. Both support 256K context windows. In the GUI domains (Android, Web, and OS), the models work from text accessibility trees and UI view hierarchies rather than screenshots.
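To put the Mixture-of-Experts figures in proportion, the short calculation below divides the active parameters by the total for each model, using only the counts quoted above. The helper name `active_fraction` is illustrative. It works out to roughly 8.6% for the 35B model and 4.3% for the 397B model.

```python
# Parameter counts as quoted in the article, in billions.
MODELS = {
    "35B": {"total": 35, "active": 3},
    "397B": {"total": 397, "active": 17},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, as a fraction of the total."""
    return active_b / total_b


if __name__ == "__main__":
    for name, p in MODELS.items():
        frac = active_fraction(p["total"], p["active"])
        print(f"{name}: {frac:.1%} of parameters active per token")
```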

The 35B model weights and AgentWorldBench are available under Apache 2.0; the weights of the 397B model have not been publicly released.

Training results matter more than benchmarks

Benchmark scores show how accurately the models predict what environments return. The training results show how much that predictive ability matters to teams building agents – and those are the numbers that matter most.

According to the researchers, agents trained in the controlled simulation matched or outperformed agents trained in real environments. Injecting targeted perturbations – incomplete responses that force additional agent actions, and edge cases that real environments rarely produce – pushed MCPMark from 24.6 to 33.8. In search, agents trained in a completely fictional world transferred to real search tasks, pushing WideSearch Item F1 from 34.02 to 50.31 on the open 35B model. A separate warm-up experiment showed that world model pre-training improved BFCL v4 from 62.29 to 71.25 and Claw-Eval from 53.60 to 64.88 without agent-specific optimization.
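The article does not describe how the paper implements its perturbations, so the following is only a minimal sketch of the general idea: wrapping a simulated environment so that some responses come back incomplete or hit a rare failure. Every name here (`Env`, `base_env`, `with_perturbations`), the 50/50 split between the two perturbations, and the error string are assumptions made for illustration.

```python
import random
from typing import Callable

# Hypothetical simulated environment: takes an action, returns an observation.
Env = Callable[[str], str]


def base_env(action: str) -> str:
    """Stand-in for a simulator's normal response (made up for this sketch)."""
    return f"ok: result of {action!r} with 10 items"


def with_perturbations(env: Env, rate: float, rng: random.Random) -> Env:
    """Wrap `env` so that roughly a fraction `rate` of responses is perturbed.

    Two illustrative perturbations, chosen to mirror the article's examples:
    - a truncated response, which forces the agent to take a follow-up action
    - a low-disk-space error, an edge case a live terminal rarely produces
    """
    def perturbed(action: str) -> str:
        if rng.random() >= rate:
            return env(action)  # unperturbed path
        if rng.random() < 0.5:
            full = env(action)
            return full[: len(full) // 2] + " [truncated]"
        return "error: No space left on device"
    return perturbed


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    env = with_perturbations(base_env, rate=0.3, rng=rng)
    for step in range(5):
        print(env(f"search step {step}"))
```

The point of the wrapper is control: the perturbation rate is a parameter of the simulator, which is exactly what a live search engine or terminal does not offer.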

Researchers flag the benchmark and the risk of overfitting

The paper drew a quick reaction from AI researchers on X. The concerns they raise map to what practitioners need to verify before acting on the findings.

On the training objective and the transfer effect, the assessment from one AI/ML researcher was specific. "Every other ‘agent’ model is trained to work in the environment," wrote @drawais_ai, who has a PhD background and often breaks down AI papers. "Qwen answered the question. They trained the model to predict the environment itself… That predictive knowledge then transfers to the agent’s actions even without the agent’s specific fine-tuning." He singled out the Controllable Sim RL result as the "receipt" for the claim that synthetic training can replace real-world RL at scale, and noted that three of the seven transfer benchmarks were out-of-domain.

The benchmark margin drew immediate scrutiny. "AgentWorldBench is a benchmark Alibaba developed and published in the same paper," wrote @TheSignal_Desk, an account that focuses on AI research and the reliability of the numbers behind it. "They wrote the test, and they were 0.46 higher."

The sim-RL approach is the result that @limalemonnn, who builds AI agents, identified as needing more scrutiny before the headline claim can be cited. "Sim-trained agents tend to fit the simulator’s negatives," they wrote. "If the world model is too pure, the agent learns the model, not the function." They pointed to the paper’s ablations as the section teams must read before accepting the numbers.

The overfitting concern has a partial answer in the data. The gap between uncontrolled Sim RL (MCPMark 24.6) and controlled Sim RL (MCPMark 33.8) suggests that the gains depend more on the control mechanism than on simulation accuracy alone. The fictional-world search result, in which agents trained in fictional environments transferred to real search tasks, is the paper’s strongest evidence against the overfitting concern.
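The before-and-after scores quoted in this article can be set side by side. The short calculation below uses only those quoted figures; the helper name `gains` is illustrative, and the relative gain is simply the absolute gain divided by the starting score.

```python
from typing import Tuple

# (before, after) scores as quoted in the article.
RESULTS = {
    "MCPMark": (24.6, 33.8),
    "WideSearch Item F1": (34.02, 50.31),
    "BFCL v4": (62.29, 71.25),
    "Claw-Eval": (53.60, 64.88),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        points, percent = gains(before, after)
        print(f"{name}: +{points} points ({percent}% relative)")
```

On these figures the absolute gains are 9.2, 16.29, 8.96, and 11.28 points, with the WideSearch result the largest in both absolute and relative terms.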

What this means for teams building agent pipelines

For AI engineering teams building and scaling agent pipelines, this work represents a meaningful shift in the way agent capabilities are built. Teams training agents at scale now have a third option between real-world RL and static benchmarks: controlled simulations that inject the edge cases production will not surface.

Simulation is a legitimate training layer. Controlled simulations that inject conditions real environments will not reproduce work alongside real-environment RL, not in place of it.

What the model learns before agent training begins matters more than most pipelines account for. The warm-up finding – performance gains on unseen benchmarks without agent-specific training – suggests that world modeling belongs earlier in development than current practice places it.

Mosegas 3 hours ago
