Vision Language Model

NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

Building simulators for robots has been a long-term challenge: traditional engines require manual coding of physics and perfect 3D models. NVIDIA is changing this with DreamDojo, a fully open-source, generalizable robot world model. Instead of using a physics engine, DreamDojo ‘dreams’ the results of robot actions directly in pixels (paper: https://arxiv.org/pdf/2602.06949). Scaling Robotics with 44k+ […]
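The core idea is a learned video model that predicts future pixels conditioned on a robot action, rather than stepping a hand-built physics engine. A minimal sketch of that interface, with hypothetical names and placeholder dynamics rather than DreamDojo's actual code, looks like this:

```python
import numpy as np

class PixelWorldModel:
    """Toy stand-in for an action-conditioned video world model.

    A real model would be a large generative video network; this class only
    illustrates the interface: past frames in, action in, predicted frames out.
    """

    def __init__(self, horizon: int = 8):
        self.horizon = horizon  # number of future frames to "dream"

    def rollout(self, frames: np.ndarray, action: np.ndarray) -> np.ndarray:
        """Predict `horizon` future frames given past frames and an action.

        frames: (T, H, W, 3) uint8 video context
        action: (A,) continuous robot command, e.g. end-effector deltas
        returns: (horizon, H, W, 3) predicted frames
        """
        # Placeholder dynamics: repeat the last observed frame.
        # A trained model would instead decode new pixels that reflect
        # how the scene changes under `action`.
        last = frames[-1]
        return np.repeat(last[None], self.horizon, axis=0)

# Usage: imagine scoring candidate actions by what the model "dreams"
model = PixelWorldModel(horizon=4)
context = np.zeros((8, 64, 64, 3), dtype=np.uint8)  # dummy past frames
candidate = np.array([0.01, -0.02, 0.0])             # dummy action
dreamed = model.rollout(context, candidate)
print(dreamed.shape)  # (4, 64, 64, 3)
```

In practice the rollout would come from a large trained video model, and a planner or policy could compare several candidate actions by their dreamed outcomes.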

Google Introduces Agentic Vision in Gemini 3 Flash for Active Image Understanding

Frontier multimodal models usually process an image in a single pass. If they miss a serial number on a chip or a small symbol on a building plan, they often guess. Google’s new Agentic Vision capability in Gemini 3 Flash changes this by turning image understanding into an active, tool-using loop grounded in visual
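The "active, tool-using loop" can be read as: look at the full image, request a crop or zoom on a region that is too small to resolve, then answer from the enlarged view. A hedged sketch of that control flow, with illustrative tool names rather than Gemini's actual API, is below:

```python
from PIL import Image

def crop_tool(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Tool the model can call: return an enlarged crop of the requested region."""
    crop = image.crop(box)
    return crop.resize((crop.width * 2, crop.height * 2))

def answer_with_agentic_vision(image: Image.Image, question: str) -> str:
    """Illustrative loop: look, optionally zoom, then answer.

    A real agentic-vision model decides *whether* and *where* to zoom; here the
    decision is faked with a fixed region just to show the control flow.
    """
    # Step 1: first pass over the whole image (model call omitted).
    needs_zoom, region = True, (10, 10, 110, 60)  # pretend the model asked to zoom here
    # Step 2: execute the requested tool call and feed the result back.
    view = crop_tool(image, region) if needs_zoom else image
    # Step 3: final grounded answer from the zoomed view (model call omitted).
    return f"answer derived from a {view.width}x{view.height} view for: {question}"

img = Image.new("RGB", (640, 480), "white")
print(answer_with_agentic_vision(img, "What is the serial number on the chip?"))
```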

Ant Group Releases LingBot-VLA, A Vision Language Action Foundation Model For Real World Robot Manipulation

How do you build a single vision language action model that can control many different dual-arm robots in the real world? LingBot-VLA is Ant Group Robbyant’s new vision language action foundation model that targets practical, real-world robot manipulation. It is trained on about 20,000 hours of teleoperated bimanual data collected from 9
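At inference time, a vision language action model of this kind maps a camera frame plus a language instruction to a short chunk of robot commands at each control step. A minimal sketch of that loop, using hypothetical names and shapes rather than LingBot-VLA's interface, is:

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str, action_dim: int = 14) -> np.ndarray:
    """Stand-in for a VLA forward pass.

    A real model encodes the image and instruction with a VLM backbone and
    decodes an action chunk; dual-arm setups often use around 14 action
    dimensions (7 per arm). Here we return zeros just to show the shapes.
    """
    chunk_len = 8  # number of future control steps predicted at once
    return np.zeros((chunk_len, action_dim))

# Illustrative control loop
image = np.zeros((480, 640, 3), dtype=np.uint8)  # current camera frame
actions = vla_policy(image, "fold the towel and place it in the basket")
for a in actions:
    pass  # send each action to the robot controller at the control frequency
```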

Liquid AI Releases LFM2.5: A Compact AI Model Family For Real On Device Agents

Liquid AI has introduced LFM2.5, a new generation of small foundation models built on the LFM2 architecture and focused on on-device and edge deployments. The model family includes LFM2.5-1.2B-Base and LFM2.5-1.2B-Instruct and extends to Japanese, vision language, and audio language variants. It is released as open weights on Hugging Face and exposed through the

Zhipu AI Releases GLM-4.6V: A 128K Context Vision Language Model with Native Tool Calling

Zhipu AI has open-sourced the GLM-4.6V series, a pair of vision language models that treat images, video, and tools as first-class inputs for agents, not as afterthoughts bolted on top of text. On model lineup and context length: the series has two models. GLM-4.6V is a 106B parameter foundation model for cloud and
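Native tool calling here means the model can answer a multimodal prompt by emitting a structured call (tool name plus JSON arguments) that the client executes and feeds back, instead of free text. The schematic exchange below is illustrative only and follows a generic OpenAI-style message layout, not Zhipu's exact request format:

```python
import json

# A multimodal request: the user supplies an image and a question,
# and declares a tool the model is allowed to call.
request = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What was revenue in Q3? Use the calculator if needed."},
        ]}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a simple arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }],
}

# A possible model reply: instead of free text, it returns a structured tool call
# that the client executes before the model produces its final grounded answer.
reply = {
    "role": "assistant",
    "tool_calls": [{
        "type": "function",
        "function": {"name": "calculator", "arguments": json.dumps({"expression": "1.2e6 + 3.4e5"})},
    }],
}
print(json.dumps(reply, indent=2))
```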

Jina AI Releases Jina-VLM: A 2.4B Multilingual Vision Language Model Focused on Token Efficient Visual QA

Jina AI has released Jina-VLM, a 2.4B parameter vision language model that targets multilingual visual question answering and document understanding on constrained hardware. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention pooling connector to reduce visual tokens while preserving spatial structure. Among open 2B scale VLMs, it
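An attention pooling connector compresses the vision encoder's patch tokens into a small, fixed number of tokens by letting learned query vectors cross-attend over them, so the language backbone consumes far fewer visual tokens. A minimal PyTorch sketch of the idea, with illustrative dimensions rather than Jina-VLM's actual implementation:

```python
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    """Compress N visual patch tokens into K tokens via cross-attention.

    Learned queries attend over the patch tokens, so each of the K outputs is a
    weighted mixture of patches; choosing K << N cuts the tokens the language
    model must read while the attention weights retain spatial information.
    """

    def __init__(self, vision_dim: int, llm_dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, vision_dim) from the vision encoder
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, K, vision_dim)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)  # (B, K, vision_dim)
        return self.proj(pooled)                               # (B, K, llm_dim)

# Example: 1024 patch tokens reduced to 64 tokens for the language model
connector = AttentionPoolingConnector(vision_dim=768, llm_dim=2048)
patches = torch.randn(2, 1024, 768)
print(connector(patches).shape)  # torch.Size([2, 64, 2048])
```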
