Vision Language Model

Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing

ai, AI (Artificial Intelligence), AI Paper Summary, AI Shorts, Applications, Artificial Intelligence, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, OCR, Open Source, Staff, Tech News, Technology, Vision Language Model

Most end-to-end OCR models slow down as output grows. Each generated token adds to the KV cache. Memory rises and generation drags. Parsing dozens of pages becomes impractical. Baidu’s Unlimited OCR addresses this directly. It swaps the decoder’s attention for a design that keeps memory constant. TL;DR: Unlimited OCR is a 3B-parameter Mixture-of-Experts model, with […]

Zyphra Releases Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude

ai, AI (Artificial Intelligence), AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, Open Source, Software engineering, Staff, Tech News, Technology, Uncategorized, Vision Language Model

Zyphra has released Zamba2-VL, a family of open vision-language models. The release covers three sizes: 1.2B, 2.7B, and 7B parameters. Each model is built on the Zamba2 hybrid SSM–Transformer backbone. Vision-language models (VLMs) read images and text together. They answer questions about charts, documents, and photos. Most open VLMs use a dense Transformer as the

StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

agentic ai, ai, AI (Artificial Intelligence), AI Shorts, Applications, Artificial Intelligence, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, Open Source, Software engineering, Staff, Tech News, Technology, Vision Language Model

StepFun today released Step 3.7 Flash, a multimodal Mixture-of-Experts model targeting agentic use cases. It adds native vision input and improved tool-use reliability over Step 3.5 Flash. What is Step 3.7 Flash? Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT)

NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

ai, AI (Artificial Intelligence), AI Paper Summary, AI Shorts, Applications, Artificial Intelligence, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, physical ai, Staff, Tech News, Technology, Vision Language Model

World models (systems that synthesize realistic video sequences from an initial image and a set of actions) are becoming central to embodied AI, simulation, and robotics research. The core challenge is scaling these systems to generate minute-long, high-resolution video without requiring prohibitively large clusters for both training and inference. Most competitive open-source baselines either require

Meta AI Releases Sapiens2: A High-Resolution Human-Centric Vision Model for Pose, Segmentation, Normals, Pointmap, and Albedo

ai, AI (Artificial Intelligence), AI Paper Summary, AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, Staff, Tech News, Technology, Uncategorized, Vision Language Model

If you’ve ever watched a motion capture system struggle with a person’s fingers, or seen a segmentation model fail to distinguish teeth from gums, you already understand why human-centric computer vision is hard. Humans are not just objects; they come with articulated structure, fine surface details, and enormous variation in pose, clothing, lighting, and ethnicity.

Qwen Team Open-Sources Qwen3.6-35B-A3B: A Sparse MoE Vision-Language Model with 3B Active Parameters and Agentic Coding Capabilities

agentic ai, ai, AI (Artificial Intelligence), AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, Open Source, Software engineering, Staff, Tech News, Technology, Vision Language Model

The open-source AI landscape has a new entry worth paying attention to. The Qwen team at Alibaba has released Qwen3.6-35B-A3B, the first open-weight model from the Qwen3.6 generation, and it is making a compelling argument that parameter efficiency matters far more than raw model size. With 35 billion total parameters but only 3 billion activated

Liquid AI Releases LFM2.5-VL-450M: a 450M-Parameter Vision-Language Model with Bounding Box Prediction, Multilingual Support, and Sub-250ms Edge Inference

ai, AI (Artificial Intelligence), AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Machine Learning, New Releases, Open Source, Staff, Tech News, Technology, Uncategorized, Vision Language Model

Liquid AI just released LFM2.5-VL-450M, an updated version of its earlier LFM2-VL-450M vision-language model. The new release introduces bounding box prediction, improved instruction following, expanded multilingual understanding, and function calling support — all within a 450M-parameter footprint designed to run directly on edge hardware ranging from embedded AI modules like NVIDIA Jetson Orin, to mini-PC

Meta AI Releases EUPE: A Compact Vision Encoder Family Under 100M Parameters That Rivals Specialist Models Across Image Understanding, Dense Prediction, and VLM Tasks

ai, AI (Artificial Intelligence), AI Paper Summary, AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Large Language Model, Machine Learning, New Releases, Open Source, Staff, Tech News, Technology, Uncategorized, Vision Language Model

Running powerful AI on your smartphone isn’t just a hardware problem — it’s a model architecture problem. Most state-of-the-art vision encoders are enormous, and when you trim them down to fit on an edge device, they lose the capabilities that made them useful in the first place. Worse, specialized models tend to excel at one

TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts

ai, AI (Artificial Intelligence), AI Paper Summary, AI Shorts, Applications, Artificial Intelligence, Computer vision, Editors Pick, Language Model, Large Language Model, New Releases, OCR, Open Source, Staff, Tech News, Technology, Vision Language Model

In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interaction between language and vision. The Technology Innovation Institute (TII) research team is challenging

IBM Releases Granite 4.0 3B Vision: A New Vision Language Model for Enterprise Grade Document Data Extraction

agentic ai, ai, AI (Artificial Intelligence), AI Shorts, Applications, Artificial Intelligence, Editors Pick, Embedding Model, Enterprise AI, Language Model, Large Language Model, New Releases, OCR, Open Source, Staff, Tech News, Technology, Vision Language Model

IBM has announced the release of Granite 4.0 3B Vision, a vision-language model (VLM) engineered specifically for enterprise-grade document data extraction. Departing from the monolithic approach of larger multimodal models, the 4.0 Vision release is architected as a specialized adapter designed to bring high-fidelity visual reasoning to the Granite 4.0 Micro language backbone. This release
