Chinese AI lab DeepSeek might be getting the bulk of the tech industry’s attention this week. But one of its top domestic rivals, Alibaba, isn’t sitting idly by.
Alibaba’s Qwen team on Monday released a new family of AI models, Qwen2.5-VL, that can perform a variety of text and image analysis tasks. The models can parse files, understand videos, and count objects in images, as well as control a PC, much like the model powering OpenAI’s recently launched Operator.
Per the Qwen team’s benchmarking, the best Qwen2.5-VL model beats OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash on a range of video understanding, math, document analysis, and question-answering evaluations.

Qwen2.5-VL, which is available to test in Alibaba’s Qwen Chat app and to download from AI dev platform Hugging Face, can analyze charts and graphics, extract data from scans of invoices and forms, and “comprehend” multi-hour-long videos, the Qwen team says. Qwen2.5-VL can also recognize “IPs from film and TV series, as well as a wide variety of products,” per the team, suggesting that the models might have been trained in part on copyrighted works.
Qwen2.5-VL, being AI developed by a Chinese company, has certain restrictions on the topics it will discuss, at least in Qwen Chat. When I asked the largest and most capable Qwen2.5-VL model, Qwen2.5-VL-72B, to talk about “Xi Jinping’s mistakes,” Qwen Chat threw an error message.
China’s internet regulator benchmarks many models developed in the country to ensure their responses “embody core socialist values.” Many Chinese AI systems decline to respond to topics that might raise the ire of regulators, such as Taiwan’s autonomy.
One of Qwen2.5-VL’s more interesting features is its ability to interact with software, both on PCs and mobile devices. A video posted on X by Philipp Schmid, a technical lead at Hugging Face, shows Qwen2.5-VL launching the Booking.com app for Android and booking a flight from Chongqing to Beijing.
Don’t Miss @Alibaba_Qwen 2.5 VL! Despite all the Deepseek Hype, Qwen just dropped the best open Multimodal! Qwen 2.5 VL is a Vision Language Model that can control your computer, similar to the @OpenAI operator, extract structured information from charts, and more!!
TL;DR;
3️⃣… pic.twitter.com/GeEGVdl0tI— Philipp Schmid (@_philschmid) January 27, 2025
In the video below, a Qwen2.5-VL model controls apps on a Linux desktop, but doesn’t appear to accomplish much beyond switching tabs. Perhaps tellingly, Qwen’s benchmarking shows Qwen2.5-VL scoring poorly on OSWorld, a benchmark that tries to mimic a real computer environment.
LMAO Qwen 2.5 VL can perform Computer Use, out of the box, taking on OpenAI Operator HEAD ON! 🐐 pic.twitter.com/lwMECXzNSu
— Vaibhav (VB) Srivastav (@reach_vb) January 27, 2025
The two smaller, less sophisticated models in the Qwen2.5-VL series, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are available under a permissive license. The flagship Qwen2.5-VL-72B, however, is under Alibaba’s custom license, which requires that companies and devs with more than 100 million monthly active users request permission from Qwen/Alibaba before deploying the model commercially.