HongShan Capital Group (HSG), a Chinese venture capital firm, has launched Xbench, a benchmark that evaluates AI models on their ability to perform real-world tasks rather than merely pass tests. The project grew out of the firm's need, after ChatGPT's breakout success, for a more meaningful way to assess its AI investments.
Traditional benchmarks focus largely on theoretical assessments of a model's capabilities. Xbench instead evaluates models on how well they execute practical tasks relevant to a range of industries, an approach that could offer deeper insight into a model's market applicability and change how firms gauge AI effectiveness and investment viability.
This week, HSG open-sourced part of Xbench's question set, giving developers and researchers broader access. The release includes a public leaderboard comparing leading AI models, on which ChatGPT o3 currently ranks first ahead of competitors such as ByteDance's Doubao and Grok. That transparency may push AI developers toward more meaningful improvements as the benchmark evolves.
Since its inception in 2022, Xbench has evolved from an internal tool HongShan used to vet investment opportunities into a polished public asset, an effort spearheaded by partner Gong Yuan. Opening the project to outside expertise helped refine its testing mechanisms and produced a more well-rounded benchmark that addresses industry demands.
Xbench evaluates models along two distinct tracks: a traditional academic assessment and a practical evaluation closer to a job interview. The academic track, Xbench-ScienceQA, poses rigorous postgraduate-level questions across various fields and rewards answers that demonstrate sound logical reasoning. The practical track, Xbench-DeepResearch, probes a model's ability to navigate content on the Chinese-language web, addressing a gap left by most established AI assessments.
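To make the two-track structure concrete, here is a minimal sketch of how such an evaluation harness might be organized. The class and function names are illustrative assumptions, not taken from Xbench's actual code, and the grading is deliberately abstracted away.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a two-track benchmark harness in the spirit of
# Xbench's design. Names like EvalResult and run_track are illustrative;
# they do not come from the real Xbench codebase.

@dataclass
class EvalResult:
    track: str      # e.g. "science_qa" or "deep_research"
    task_id: str
    score: float    # 0.0 to 1.0, graded against a reference answer or rubric

def run_track(
    track: str,
    tasks: list[dict],
    model: Callable[[str], str],
    grade: Callable[[str, dict], float],
) -> list[EvalResult]:
    """Run one model over a single track's tasks and grade each answer."""
    results = []
    for task in tasks:
        answer = model(task["prompt"])
        results.append(EvalResult(track, task["id"], grade(answer, task)))
    return results

def leaderboard_score(results: list[EvalResult]) -> float:
    """Aggregate per-task scores into a single leaderboard number."""
    return sum(r.score for r in results) / len(results) if results else 0.0
```

Keeping the two tracks behind the same interface means an academic question set and an agentic web-research task set can feed one leaderboard, which matches the single ranking Xbench publishes.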
To keep both the raw-intelligence and practical-skills tests relevant, Xbench will refresh its question sets every quarter, maintaining a mix of public and private datasets. Future versions could assess additional dimensions such as creativity and collaboration, further broadening the benchmarking landscape.
The benchmark also includes simulations of real-life workflows, such as recruitment and marketing tasks: models are asked, for example, to identify qualified candidates for battery engineering positions or to match advertisers with suitable content creators. This marks a clear shift toward asking AI systems to perform valuable, market-driven work.
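As a rough illustration of how such a workflow simulation could be specified and scored, the sketch below shows a hypothetical recruitment task with a simple keyword rubric. The field names, tool list, and grading logic are all assumptions for illustration; they are not Xbench's actual task format.

```python
# Hypothetical task specification for a recruitment-style agent benchmark.
# All fields and values below are illustrative, not taken from Xbench.

recruitment_task = {
    "id": "recruiting-001",
    "instruction": (
        "Search public professional profiles and return five candidates "
        "qualified for a senior battery engineer role."
    ),
    "tools": ["web_search", "browse_page"],    # tools the agent may invoke
    "rubric": {
        "candidates_expected": 5,              # how many candidates to return
        "must_have": ["battery", "engineer"],  # keywords a valid profile needs
    },
}

def grade_recruitment(candidates: list[dict], rubric: dict) -> float:
    """Score the agent's candidate list against the rubric (0.0 to 1.0)."""
    valid = [
        c for c in candidates
        if all(kw in c.get("profile", "").lower() for kw in rubric["must_have"])
    ]
    return min(len(valid) / rubric["candidates_expected"], 1.0)
```

The point of a rubric like this is that a market-driven task has no single correct answer, so grading checks whether the output satisfies the job's constraints rather than matching a reference string.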
Xbench plans to introduce further evaluation categories, including finance and legal domains, reflecting growing interest in measuring AI's functional capabilities across sectors. Zihan Zheng, lead researcher on the forthcoming LiveCodeBench Pro benchmark, acknowledges how difficult real-world tasks are to quantify but notes that Xbench lays a promising foundation for doing so.