AI Infrastructure

What is Tokenization Drift and How to Fix It?

A model can behave perfectly one moment and degrade the next—without any change to your data, pipeline, or logic. The root cause often lies in something far more subtle: how your input is tokenized. Before a model processes text, it converts it into token IDs, and even minor formatting differences—like spacing, line breaks, or punctuation—can […]
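To make the effect concrete, here is a minimal sketch using a toy greedy longest-match tokenizer. The vocabulary, the token IDs, and the `tokenize` helper are all assumptions made up for illustration; they do not correspond to any real model's tokenizer. The sketch only shows how an extra space or a line break can change the ID sequence a model would see.

```python
# Toy vocabulary: hypothetical token strings and IDs, chosen only for illustration.
VOCAB = {
    "Hello": 0,
    " world": 1,   # "world" with a leading space is its own token
    "world": 2,    # "world" without a leading space is a different token
    " ": 3,
    "\n": 4,
    "!": 5,
}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token for {text[pos]!r} at position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


single_space = tokenize("Hello world!")   # [0, 1, 5]
double_space = tokenize("Hello  world!")  # [0, 3, 1, 5]
line_break = tokenize("Hello\nworld!")    # [0, 4, 2, 5]

print(single_space, double_space, line_break)
```

The three inputs read almost identically to a person, yet each yields a different ID sequence. In this toy setup, the line-break variant even swaps the ` world` token for the distinct `world` token.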

Mistral AI Launches Remote Agents in Vibe and Mistral Medium 3.5 with 77.6% SWE-Bench Verified Score

Mistral AI has been quietly building one of the more practical coding agent ecosystems in the open-source and open-weights AI space, and it is now shipping its most significant infrastructure upgrade yet. The Mistral team announced remote agents in Vibe, its coding agent platform, alongside the public preview of Mistral Medium 3.5, a new 128B dense model that

New NVIDIA Research Shows Speculative Decoding in NeMo RL Achieves 1.8× Rollout Generation Speedup at 8B and Projects 2.5× End-to-End Speedup at 235B

If you have been running reinforcement learning (RL) post-training on a language model for math reasoning, code generation, or any verifiable task, you have almost certainly stared at a progress bar while your GPU cluster burns through rollout generation. A team of researchers from NVIDIA proposes a precise fix by integrating speculative decoding into the

A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning

In this tutorial, we walk through a complete, hands-on journey of post-training large language models using the powerful TRL (Transformer Reinforcement Learning) library ecosystem. We start from a lightweight base model and progressively apply four key techniques: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Also, we

Qwen AI Releases Qwen-Scope: An Open-Source Sparse AutoEncoders (SAE) Suite That Turns LLM Internal Features into Practical Development Tools

Large language models are remarkably capable, yet frustratingly opaque. When a model misbehaves (generating responses in the wrong language, repeating itself endlessly, or refusing safe requests), developers have very few tools to diagnose why it happened at the level of internal computations. That’s the problem Qwen-Scope is built to solve. Qwen Team

Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks

The team behind Kimi.ai (Moonshot AI) just made a significant contribution to the open-source AI infrastructure space. They released FlashKDA (Flash Kimi Delta Attention), a high-performance CUTLASS-based kernel implementation of the Kimi Delta Attention (KDA) mechanism. The FlashKDA library is available

Top 10 KV Cache Compression Techniques for LLM Inference: Reducing Memory Overhead Across Eviction, Quantization, and Low-Rank Methods

As large language models scale to longer context windows and serve more concurrent users, the key-value (KV) cache has emerged as a primary memory bottleneck in production inference systems. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the resulting KV cache can occupy up to 180
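As a rough back-of-the-envelope check on how a KV cache reaches this scale, the sketch below estimates its size as 2 (keys and values) × layers × hidden size × bytes per element × batch size × sequence length. The excerpt gives only the parameter count, batch size, and input length; the 48 layers, hidden size of 7,168, and 16-bit precision used here are assumptions (a configuration typical of a 30B-class decoder), and the `kv_cache_bytes` helper is hypothetical.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    batch_size: int,
    seq_len: int,
    bytes_per_element: int = 2,  # assumption: 16-bit (fp16/bf16) cache entries
) -> int:
    """Estimate KV cache size for standard multi-head attention.

    Each token stores one key vector and one value vector of length
    `hidden_size` in every layer, hence the leading factor of 2.
    """
    per_token = 2 * num_layers * hidden_size * bytes_per_element
    return per_token * batch_size * seq_len


# From the excerpt: batch size 128, input length 1,024 tokens.
# Assumed (not stated in the excerpt): 48 layers, hidden size 7168.
total = kv_cache_bytes(num_layers=48, hidden_size=7168, batch_size=128, seq_len=1024)

print(f"{total:,} bytes")            # 180,388,626,432 bytes
print(f"{total / 1e9:.1f} GB")       # 180.4 GB (decimal gigabytes)
print(f"{total / 2**30:.1f} GiB")    # 168.0 GiB
```

Under these assumed settings the estimate comes to roughly 180 billion bytes. Models that use grouped-query or multi-query attention store fewer key/value heads, so this formula would overestimate their cache size.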

Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs

The race to make large language models faster and cheaper to run has largely been fought at two levels: the model architecture and the hardware. But there is a third, often underappreciated frontier — the GPU kernel. A kernel is the low-level computational routine that actually executes a mathematical operation on the GPU. Writing a

Poolside AI Introduces Laguna XS.2 and M.1: Agentic Coding Models Reaching 68.2% and 72.5% on SWE-bench Verified

Poolside AI released the first two models in its Laguna family: Laguna M.1 and Laguna XS.2. Alongside these, the company is releasing pool — a lightweight terminal-based coding agent and a dual Agent Client Protocol (ACP) client-server — the same environment Poolside uses internally for agent RL training and evaluation, now available as a research
