Jina AI has announced the launch of Jina-VLM, a multilingual vision-language model with 2.4 billion parameters, specifically designed for visual question answering and document understanding on resource-constrained devices. The model excels in handling visual tokens efficiently and achieves leading results on multilingual benchmarks.
Architecturally, Jina-VLM pairs a SigLIP2-type visual encoder with a Qwen3 language model. An attention-pooling connector sits between the two and reduces the number of visual tokens while preserving their spatial structure.
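To make the connector idea concrete, here is a minimal NumPy sketch of attention pooling over a grid of patch features. It is an illustration under stated assumptions, not Jina AI's implementation: the 2×2 window, the use of the window mean as the query, and the absence of learned projection weights are all choices made here for brevity. The function name `attention_pool_2x2` is hypothetical.

```python
import numpy as np


def attention_pool_2x2(grid):
    """Pool an (H, W, D) grid of patch features to (ceil(H/2), ceil(W/2), D).

    ASSUMPTION: 2x2 windows with the window mean as the attention query.
    A real connector would add learned query/key/value projections.
    """
    h, w, d = grid.shape
    # Pad odd sides by repeating the edge so every window holds four tokens.
    grid = np.pad(grid, ((0, h % 2), (0, w % 2), (0, 0)), mode="edge")
    hp, wp = grid.shape[0] // 2, grid.shape[1] // 2
    # Group neighbouring patches into windows of shape (hp, wp, 4, D).
    windows = (
        grid.reshape(hp, 2, wp, 2, d)
        .transpose(0, 2, 1, 3, 4)
        .reshape(hp, wp, 4, d)
    )
    query = windows.mean(axis=2, keepdims=True)           # (hp, wp, 1, D)
    scores = (query * windows).sum(axis=-1) / np.sqrt(d)  # (hp, wp, 4)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax per window
    # Weighted sum of the four tokens gives one output token per window.
    return (weights[..., None] * windows).sum(axis=2)     # (hp, wp, D)


features = np.random.default_rng(0).normal(size=(28, 28, 8))
pooled = attention_pool_2x2(features)
print(features.shape[0] * features.shape[1], "->", pooled.shape[0] * pooled.shape[1])
```

Because each output token is computed from one local window, the pooled tokens still form a two-dimensional grid, which is one way to read "preserving spatial structure". A 28×28 grid becomes 14×14, a fourfold reduction in token count.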
Rather than resizing an entire high-resolution image to a fixed size, the model splits it into up to 12 overlapping tiles of 378×378 pixels each. The overlap between adjacent tiles keeps content that falls on a tile boundary intact in at least one tile. Attention pooling then reduces the number of visual tokens by a factor of four, which lowers the computational load and shrinks the language model's key-value cache.
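The tiling and token arithmetic can be sketched in a few lines of Python. The tile size, the 12-tile cap, and the fourfold reduction come from the article; the overlap width (`OVERLAP`) and the patch size (`PATCH`) are not stated there and are assumptions chosen only to make the example concrete. The helper names are hypothetical, and the sketch assumes the image is at least one tile wide and tall (smaller images would need padding).

```python
import math

TILE = 378       # tile side in pixels (from the article)
MAX_TILES = 12   # maximum number of tiles (from the article)
POOL = 4         # visual-token reduction factor (from the article)
OVERLAP = 112    # ASSUMPTION: minimum overlap in pixels, not given in the article
PATCH = 14       # ASSUMPTION: encoder patch size, not given in the article


def tile_origins(length, tile=TILE, overlap=OVERLAP):
    """Return start offsets of overlapping tiles that cover `length` pixels."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    count = math.ceil((length - tile) / stride) + 1
    # Spread tiles evenly so the last one ends exactly at the image border.
    step = (length - tile) / (count - 1)
    return [round(i * step) for i in range(count)]


def tile_boxes(width, height):
    """Return (left, top, right, bottom) boxes for every tile, row by row."""
    xs, ys = tile_origins(width), tile_origins(height)
    boxes = [(x, y, x + TILE, y + TILE) for y in ys for x in xs]
    if len(boxes) > MAX_TILES:
        raise ValueError("image must be downscaled before tiling")
    return boxes


def visual_tokens(n_tiles):
    """Estimate visual token counts before and after pooling."""
    per_tile = (TILE // PATCH) ** 2
    before = n_tiles * per_tile
    return before, math.ceil(before / POOL)


boxes = tile_boxes(1000, 700)
print(len(boxes), "tiles;", visual_tokens(len(boxes)), "tokens before/after pooling")
```

Under these assumptions, a 1000×700 image yields a 4×3 grid of 12 tiles, and pooling cuts the visual tokens passed to the language model to a quarter of the encoder's output. Since the key-value cache grows with the number of tokens in the context, fewer visual tokens translate directly into a smaller cache.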
The model was trained in two main stages using a data mixture comprising approximately 5 million multimodal samples and 12 billion text tokens across more than 30 languages, including Arabic, English, and Chinese. The first stage focused on cross-lingual vision-language alignment, while the second stage focused on instruction tuning for visual question answering and reasoning.
The model posts strong results on standard benchmarks, averaging 72.3 on English visual question answering tasks involving charts and documents. On multilingual benchmarks it scores 78.8 on MMMB and 74.3 on Multilingual MMBench, results reported as the best among open models in the 2-billion-parameter class. It also shows low rates of visual hallucination, scoring 90.3 on the POPE benchmark.
Jina-VLM marks a step forward for efficient, multilingual vision-language models, particularly on resource-constrained devices. By combining computational efficiency with strong results across tasks and languages, it is a promising option for applications that need to understand visual and textual content in many languages.
Source: MarkTechPost AI | Exclusive coverage from AI Tools Oasis
