Jina AI has announced Jina-VLM, a multilingual vision-language model with 2.4 billion parameters, designed for visual question answering and document understanding on resource-constrained devices. The model combines a SigLIP2-type visual encoder with a Qwen3 language architecture, using an attention pooling connector to reduce the number of visual tokens while preserving spatial structure, and it achieves leading results on multilingual benchmarks.
Rather than resizing the entire image, the model divides high-resolution images into up to 12 overlapping tiles. Each tile is 378×378 pixels, and adjacent tiles overlap so that content at tile boundaries is not lost. The model then uses attention pooling to compress the visual tokens by a factor of four, which significantly reduces both the computational load and the size of the language model's key-value cache.
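The tiling and token-compression arithmetic can be illustrated with a minimal Python sketch. It is not Jina AI's implementation: the 378-pixel tile size, the 12-tile limit, and the factor-of-four compression come from the description above, while the overlap width (`OVERLAP`), the per-tile token count passed to `pooled_tokens`, and all function names are hypothetical values chosen only for illustration.

```python
import math

TILE = 378       # tile side in pixels (from the article)
MAX_TILES = 12   # maximum number of tiles (from the article)
OVERLAP = 112    # ASSUMPTION: overlap in pixels; the article gives no value


def axis_starts(length, tile=TILE, overlap=OVERLAP):
    """Start offsets of overlapping tiles that cover one image axis."""
    if length <= tile:
        # A single tile covers the axis (a real pipeline would pad or resize).
        return [0]
    stride = tile - overlap
    count = math.ceil((length - tile) / stride) + 1
    # Clamp the last tile so it ends exactly at the image edge.
    return [min(i * stride, length - tile) for i in range(count)]


def tile_boxes(width, height):
    """Return (left, top, right, bottom) boxes for every tile, row by row."""
    boxes = [
        (x, y, x + TILE, y + TILE)
        for y in axis_starts(height)
        for x in axis_starts(width)
    ]
    if len(boxes) > MAX_TILES:
        # A real pipeline would downscale the image first; this sketch just stops.
        raise ValueError(f"{len(boxes)} tiles exceeds the limit of {MAX_TILES}")
    return boxes


def pooled_tokens(tokens_per_tile, n_tiles, factor=4):
    """Approximate visual-token count after compressing by `factor`."""
    return n_tiles * tokens_per_tile // factor


if __name__ == "__main__":
    boxes = tile_boxes(1000, 700)
    print(len(boxes), "tiles; last box:", boxes[-1])
    # ASSUMPTION: 729 tokens per tile is a placeholder, not a figure from the article.
    print("tokens after pooling:", pooled_tokens(729, len(boxes)))
```

With these assumed values, a 1000×700 image yields a 4×3 grid of tiles, and the pooling step cuts the number of tokens the language model must attend to (and cache) to roughly a quarter.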
The model was trained in two main stages using a data mixture comprising approximately 5 million multimodal samples and 12 billion text tokens across more than 30 languages, including Arabic, English, and Chinese. The first stage focused on cross-lingual vision-language alignment, while the second stage focused on instruction tuning for visual question answering and reasoning.
The model performed strongly on numerous standard benchmarks, scoring an average of 72.3 on English visual question answering tasks involving charts and documents. It also excelled on multilingual benchmarks, scoring 78.8 on MMMB and 74.3 on Multilingual MMBench, results reported as the best among open models at the 2-billion-parameter scale. The model also showed a low rate of visual hallucination, scoring 90.3 on the POPE benchmark.
The launch of Jina-VLM is a notable step toward efficient, multilingual vision-language models, especially for resource-constrained devices. The model combines computational efficiency with strong performance across a wide range of tasks and languages, making it a promising tool for applications that need to understand visual and textual content in many languages.
Source: MarkTechPost AI | Exclusive coverage from AI Tools Oasis
