Alibaba’s New Algorithm Enhances AI Model Reasoning

Apr 8, 2026 | AI Trends

Alibaba’s Qwen team has unveiled a significant advancement in artificial intelligence with their newly designed algorithm, Future-KL Influenced Policy Optimization (FIPO), aimed at enhancing reasoning capabilities in AI models. This breakthrough addresses limitations in traditional reinforcement learning approaches, where a model’s ability to reason effectively has often plateaued due to uniform reward distribution across generated tokens.

Challenges with Traditional Reinforcement Learning

In standard reinforcement learning, a large language model receives a simplistic pass or fail judgment at the conclusion of each generated answer. This reward is evenly distributed among all tokens in the response, regardless of whether some of these tokens represent critical logical junctures or are simply punctuation marks. The Qwen team identified this as a primary factor contributing to the stalling of reasoning models, particularly in methods relying on Group Relative Policy Optimization (GRPO).
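The uniform credit assignment described above can be illustrated with a minimal sketch of GRPO-style advantage computation. This is not the Qwen team’s implementation, just an assumption-laden toy showing how a single pass/fail reward, normalized within a group of sampled responses, is broadcast identically to every token:

```python
# Illustrative sketch: uniform per-token credit in GRPO-style training.
# A single scalar outcome reward is normalized within the sampled group
# and then broadcast to every token of the response, so the signal
# cannot distinguish a pivotal reasoning step from a punctuation mark.

def grpo_token_advantages(group_rewards, response_lengths):
    """Per-token advantages for a group of sampled responses.

    group_rewards: final pass/fail reward for each response in the group
    response_lengths: number of tokens in each response
    """
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    # Population standard deviation for group normalization;
    # fall back to 1.0 if all rewards in the group are identical.
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5 or 1.0
    advantages = []
    for reward, length in zip(group_rewards, response_lengths):
        normalized = (reward - mean) / std
        # Every token in the response receives the same advantage.
        advantages.append([normalized] * length)
    return advantages

# Two sampled responses: one correct (reward 1.0), one wrong (0.0).
advs = grpo_token_advantages([1.0, 0.0], [5, 3])
```

Note that within each response the list is constant: the critical token that set up the correct answer and the trailing period get exactly the same reinforcement, which is the flatness FIPO targets.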

Introducing FIPO: A Comprehensive Solution

The FIPO algorithm seeks to overcome this ceiling by evaluating the impact of each token on the downstream reasoning process, thereby making the reward system more nuanced. Instead of treating each token equally, FIPO examines how the model’s behavior changes after generating a specific token and assigns rewards according to how much that token influences subsequent outputs. This allows for a more effective reinforcement learning process, as tokens that kickstart productive reasoning chains receive greater rewards, while those leading to dead ends are penalized.
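The article does not publish FIPO’s exact formula, but the idea of weighting reward by downstream influence can be sketched under one stated assumption: that a token’s influence is measured as the KL divergence between the model’s next-step distribution after emitting the token and the distribution before it. The helper names below are hypothetical:

```python
import math

# Hypothetical sketch of influence-weighted credit assignment in the
# spirit of FIPO (not the published algorithm). We assume each token's
# "influence" is KL(p_after || p_before), where p_before is the model's
# next-token distribution given the prefix and p_after is the
# distribution once the token has been emitted.

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def influence_weighted_rewards(final_reward, dists_before, dists_after):
    """Split one outcome reward across tokens by downstream influence.

    dists_before[t]: next-token distribution before token t is emitted
    dists_after[t]:  next-token distribution after token t is emitted
    """
    influences = [kl_divergence(after, before)
                  for before, after in zip(dists_before, dists_after)]
    total = sum(influences) or 1.0  # avoid division by zero
    # Tokens that shift the model's future behavior more receive a
    # larger share of the (positive or negative) outcome reward.
    return [final_reward * inf / total for inf in influences]
```

Under this toy scheme, a token that leaves the model’s future predictions unchanged earns zero credit, while a token that sharply redirects them absorbs most of the reward; a failed answer distributes a negative reward the same way, penalizing the tokens that steered toward the dead end.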

A Unique Approach Without Auxiliary Models

FIPO stands out by achieving results comparable to existing Proximal Policy Optimization (PPO)-based methods without the need for a separate value-estimation model. Previous attempts to rectify the flat reward distribution often required pre-trained auxiliary models, which could introduce external biases into the performance evaluations. Dispensing with them keeps the learning signal free of such outside interference.

Results from FIPO Implementation

The Qwen team tested FIPO on their model, Qwen2.5-32B-Base, utilizing the public DAPO dataset to maintain fairness in comparison. The results illustrated a substantial improvement, with the average reasoning chain length surpassing 10,000 tokens—a significant increase from the 4,000 tokens observed with DAPO’s methodology. Additionally, accuracy on the AIME 2024 benchmark rose from 50% to a peak of 58%, outperforming various competing models.

Evolution of Thought Processes

One of the most compelling findings is the model’s progression through distinct training phases. Initially, it produces elementary outlines but evolves to critique its own logic and results over time. By the final stages, the AI engages in systematic verification processes, recalculating and working through derivations meticulously, similar to more advanced approaches utilized by competing models but achieved through reinforcement learning alone.

Refinements and Future Prospects

Despite these advancements, the Qwen team’s research remains at an early stage: FIPO has so far been benchmarked only on mathematical problems from a single dataset. The longer response lengths also raise computational costs, posing an additional challenge. Moreover, whether these gains transfer to other domains, such as coding or symbolic logic, remains to be explored.

The team has indicated intentions to release the training system openly, alongside all configurations, ensuring that the broader AI community can benefit from these findings and potentially build upon them.