AI Infrastructure

Auto Added by WPeMatico

Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters

Liquid AI just shipped LFM2.5-8B-A1B. It is an on-device Mixture-of-Experts (MoE) model built for tool calling. The model holds 8.3B total parameters but activates only 1.5B per token. That sparsity is what lets it run on consumer hardware. The release follows LFM2-8B-A1B, which Liquid AI team published earlier. LFM2.5 is a new family of hybrid […]

Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters Read More »

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository. At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate Read More »

Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules

Researchers from Sakana AI and the University of Tokyo propose DiffusionBlocks. It trains transformer-based networks one block at a time. Training memory is reduced by a factor of B, where B is the number of blocks. Performance is maintained across diverse architectures. The Memory Problem in Neural Network Training End-to-end backpropagation requires storing intermediate activations

Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules Read More »

https://vllm.ai/blog/2026-05-26-eagle-3-1

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

Speculative decoding is a technique for speeding up large language model inference. A small, fast draft model proposes several tokens. The large target model verifies them in parallel. If accepted, inference is faster. If rejected, the system falls back gracefully. EAGLE Team, vLLM Team, and TorchSpec Team has launched the EAGLE series including EAGLE 1,

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference Read More »

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts with 100K tokens across dozens of concurrent requests the KV cache consumes a large fraction of GPU memory. Compressing it

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving Read More »

Best Authentication Platforms for AI Agents and MCP Servers in 2026

The Model Context Protocol has moved from Anthropic’s internal experiment to a de facto industry standard at a speed few integration protocols have matched. Since its launch in November 2024, MCP has grown explosively: OpenAI adopted it in March 2025, Microsoft announced support in Copilot Studio in March 2025, and by late 2025 combined Python

Best Authentication Platforms for AI Agents and MCP Servers in 2026 Read More »

NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This cuts sequence mixing to linear time and decoding to constant memory. The hard part is not what to forget. It is how to edit a compressed memory without scrambling existing associations. NVIDIA has released Gated DeltaNet-2, a linear attention

NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule Read More »

Tencent Open-Sources TencentDB Agent Memory: A 4-Tier Local Memory Pipeline for AI Agents

Tencent has released TencentDB Agent Memory, an open-source memory system for AI agents. The project ships under the MIT license. It targets a problem familiar to anyone shipping long-horizon agents: context bloat and recall failure. It is symbolic short-term memory along with layered long-term memory. It integrates with OpenClaw as a plugin and with the

Tencent Open-Sources TencentDB Agent Memory: A 4-Tier Local Memory Pipeline for AI Agents Read More »

Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

Instruction-tuned language models refuse harmful requests. But which part of the model is actually responsible — and how does that mechanism get installed during training? A new research from Nous Research team takes a neuron-level look at this question. The Nous research team developed contrastive neuron attribution (CNA), a method that identifies the specific MLP

Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification Read More »

Perplexity Open-Sources Bumblebee: A Read-Only Supply-Chain Scanner for Developer Endpoints

Attackers increasingly target the packages, editor extensions, and AI tool configs on developer machines and not just production systems. Perplexity has open-sourced an internal tool it uses to address this problem. Perplexity released Bumblebee on GitHub. The tool is a read-only inventory collector for macOS and Linux developer endpoints. It is written entirely in Go

Perplexity Open-Sources Bumblebee: A Read-Only Supply-Chain Scanner for Developer Endpoints Read More »