Zyphra AI has released ZAYA1-8B, a small Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained end-to-end on AMD hardware, the model outperforms open-weight models many times its size on math and coding benchmarks, and is now available under an Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud.
With under 1 billion active parameters, ZAYA1-8B achieves scores competitive with first-generation frontier reasoning models like DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet on challenging mathematical reasoning tasks. With its novel test-time compute methodology called Markovian RSA, it surpasses Claude 4.5 Sonnet and GPT-5-High on HMMT’25 (89.6 vs 88.3) and closes in on frontier open-weight models like DeepSeek-V3.2 on mathematics benchmarks.
What is a Mixture of Experts Model and Why Does Active Parameter Count Matter?
The distinction between ‘active’ and ‘total’ parameters matters a great deal. In a standard dense model, every parameter activates for every input token. In a Mixture of Experts model, only a subset of the network’s parameters — the ‘experts’ — are activated at inference time. ZAYA1-8B has 8.4B total parameters but only 760M are active per forward pass. This dramatically reduces inference compute and memory bandwidth requirements while retaining the representational capacity of a much larger model.
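To make the active-versus-total distinction concrete, here is a minimal top-k routing sketch in PyTorch. It is an illustrative toy, not ZAYA1-8B's actual implementation; the `ToyMoELayer` name, dimensions, and top-k value are assumptions.

```python
# Toy MoE layer: only the top_k experts chosen by the router run per token.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=1):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Every expert not selected for a token stays idle, which is why
        # "active" parameters per forward pass are far fewer than total.
        for e, expert in enumerate(self.experts):
            token_mask = (idx == e).any(dim=-1)  # tokens routed to expert e
            if token_mask.any():
                w = weights[token_mask][idx[token_mask] == e].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out


layer = ToyMoELayer()
y = layer(torch.randn(16, 64))
# With top-1 routing over 8 experts, roughly 1/8 of the expert parameters run
# per token, mirroring ZAYA1-8B's 760M-active / 8.4B-total split in spirit.
```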
ZAYA1-8B can be deployed on-device for local LLM applications, run efficiently in test-time compute harnesses, and serve requests at lower latency compared to dense models with similar benchmark performance.
Source: https://www.zyphra.com/post/zaya1-8b
Architecture: MoE++ and Three Key Innovations
ZAYA1-8B is built on Zyphra’s MoE++ architecture, which introduces three specific changes over standard MoE designs. Together, these form the basis of ZAYA1-8B’s “intelligence efficiency,” the design goal Zyphra frames as maximizing intelligence extracted per parameter and per FLOP.
The first is Compressed Convolutional Attention (CCA), a sequence-mixing mechanism developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression versus standard attention. The KV-cache is the memory used during inference to store intermediate attention states; an 8× reduction directly lowers memory requirements at inference time and allows longer effective contexts within the same hardware envelope.
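A back-of-envelope calculation shows what an 8× KV-cache reduction means in practice. The layer count, head sizes, and sequence length below are assumed round numbers, not ZAYA1-8B's actual configuration.

```python
# KV-cache sizing under assumed shapes; only the 8x factor comes from the article.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Two tensors (K and V) per layer, each of shape (seq_len, kv_heads, head_dim).
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

standard = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)
compressed = standard / 8  # CCA's reported 8x KV-cache compression
print(f"standard:   {standard / 2**30:.2f} GiB")   # 4.00 GiB
print(f"compressed: {compressed / 2**30:.2f} GiB")  # 0.50 GiB
```

Under these assumptions, a 32K-token context that would need 4 GiB of KV-cache fits in half a gigabyte, which is what enables longer contexts on the same hardware.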
The second is an MLP-based router with PID-controller bias balancing (a sketch of the idea follows the third innovation below). Standard MoE routers typically use a linear projection to decide which expert processes a given token; Zyphra replaces this with an MLP-based router and adds PID-controller-style bias balancing to improve routing stability, actively preventing load imbalance across experts, a known failure mode in MoE training.
The third is learned residual scaling, which controls residual-norm growth through depth at negligible parameter and FLOP cost. In deep networks, residual-stream norms can grow unstably from layer to layer; learned scaling addresses this without adding meaningful overhead.
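The article describes PID-controller bias balancing at a high level but not Zyphra's exact update rule, so the following is a hedged sketch of the general idea: track each expert's share of routed tokens, and adjust a per-expert logit bias with proportional, integral, and derivative terms so underused experts attract more tokens. The class name, gains, and update rule are illustrative assumptions.

```python
# Hypothetical PID-style bias balancing for an MoE router (not Zyphra's code).
import numpy as np


class PIDRouterBias:
    def __init__(self, n_experts, kp=0.1, ki=0.01, kd=0.05):
        self.target = 1.0 / n_experts           # ideal uniform load share
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = np.zeros(n_experts)
        self.prev_err = np.zeros(n_experts)
        self.bias = np.zeros(n_experts)          # added to router logits

    def update(self, expert_counts):
        load = expert_counts / max(expert_counts.sum(), 1)
        err = self.target - load                 # positive if expert is starved
        self.integral += err
        deriv = err - self.prev_err
        self.prev_err = err
        # Raise the bias of underused experts, lower it for overloaded ones.
        self.bias += self.kp * err + self.ki * self.integral + self.kd * deriv
        return self.bias


pid = PIDRouterBias(n_experts=8)
bias = pid.update(np.array([900.0, 50.0, 50.0, 0, 0, 0, 0, 0]))
# In a training step: logits = router_mlp(x) + bias, applied before top-k.
```

The appeal of the PID framing is that the integral term corrects persistent imbalance while the derivative term damps oscillation, rather than relying solely on an auxiliary loss.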
Training Infrastructure: Fully Built on AMD
ZAYA1-8B was pretrained, midtrained, and supervised fine-tuned entirely on an AMD Instinct MI300-series stack. The full training pipeline ran on 1,024 AMD Instinct MI300X GPUs connected via AMD Pensando Pollara interconnect, in a custom training cluster built with IBM.
Reasoning-First Pretraining and a Five-Stage Post-Training Pipeline
ZAYA1-8B’s performance reflects innovations across the full stack: Zyphra’s MoE++ architecture, reasoning-first pretraining, a reasoning RL cascade methodology, and the novel Markovian RSA test-time compute method.
Zyphra’s post-training pipeline consists of five sequential stages:
The first is a standard SFT stage covering basic chat, instruction following, code, math, and test-time compute (TTC) abilities.
The second is a reasoning warmup combining mathematical tasks, logic, and puzzle solving with TTC prompts that train the model to natively self-aggregate candidate solutions.
The third is a large RLVE-Gym phase with dynamically adjusted puzzle difficulty to train core reasoning circuits.
The fourth is a large-scale math and code RL phase to deepen performance in these two fundamental domains.
Finally, a relatively lightweight RLHF/RLAIF phase improves chat behavior, instruction following, and writing style.
Zyphra’s research team observed the most substantial capability boosts on mathematics and coding during RL, with smaller but meaningful gains in multiple-choice knowledge retrieval (MMLU and GPQA-Diamond) and non-verifiable tasks such as creative writing.
Markovian RSA: A Novel Test-Time Compute Method
The most technically important contribution alongside the model is Markovian RSA, a test-time compute (TTC) scheme that combines two prior ideas in a new way.
The first is Recursive Self-Aggregation (RSA), which generates multiple reasoning traces in parallel and aggregates them recursively across iterations. The second is the Markovian thinker idea, which performs reasoning in fixed-duration chunks — only the tail end of the previous chunk is passed to the next, keeping the context window bounded regardless of how long the model reasons.
Markovian RSA combines these: for each prompt, multiple traces are generated in parallel; fixed-length tail segments are extracted from each trace; new aggregation prompts are constructed by sub-sampling from the candidate pool; and these aggregated prompts seed the next round of parallel responses. The result has favorable inference properties — rollout generation is parallelizable, and the Markovian chunking strategy ensures intermediate chain-of-thought lengths never exceed a fixed context window size.
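A schematic of that loop might look like the following sketch. The `generate` callable, prompt template, tail length, and sample counts are placeholders standing in for any LLM completion call, not Zyphra's actual harness.

```python
# Hedged sketch of a Markovian RSA loop as described in the article.
import random


def markovian_rsa(problem, generate, n_traces=8, n_rounds=4,
                  tail_chars=2000, subsample=3):
    """generate: callable mapping a prompt string to a completion string."""
    prompts = [problem] * n_traces
    traces = []
    for _ in range(n_rounds):
        # 1) Rollouts are independent, hence parallelizable in a real harness.
        traces = [generate(p) for p in prompts]
        # 2) Markovian chunking: only a fixed-size tail of each trace survives,
        #    so prompt length stays bounded however long reasoning runs.
        tails = [t[-tail_chars:] for t in traces]
        # 3) Recursive self-aggregation: sub-sample candidate tails into fresh
        #    aggregation prompts that seed the next round of parallel traces.
        prompts = [
            problem
            + "\n\nCandidate partial solutions:\n"
            + "\n---\n".join(random.sample(tails, min(subsample, len(tails))))
            + "\n\nAggregate these candidates and continue reasoning."
            for _ in range(n_traces)
        ]
    return traces  # final-round traces; a real harness would extract answers
```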
A key finding is that co-design between the post-training methodology and the inference harness is essential. ZAYA1-8B was trained to understand and respond to Markovian RSA aggregation prompts and chunking starting in SFT and continuing through RL. When Zyphra applied the same methodology to Qwen3-4B-Thinking-2507 without this co-design, the performance uplift was substantially smaller, indicating that the harness and post-training must be developed together to realize the gains.
With Markovian RSA at an extra-high test-time compute budget of 5.5 million tokens per problem, ZAYA1-8B outperforms DeepSeek-V3.2 and GPT-OSS-High on the challenging APEX-shortlist mathematics benchmark.
Benchmark Results
In the in-class comparison against similarly sized models, ZAYA1-8B scores 89.1 on AIME’26, 71.6 on HMMT Feb.’26, 59.3 on IMO-AnswerBench, 32.2 on APEX-shortlist, 65.8 on LiveCodeBench-v6, and 71.0 on GPQA-Diamond — outperforming Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across all mathematics and coding categories.
Against larger open-weight models, ZAYA1-8B with 760M active parameters surpasses Mistral-Small-4-119B (6B active, 119B total) on math and coding benchmarks specifically — scoring 89.1 vs 86.4 on AIME’26, 71.6 vs 70.6 on HMMT Feb.’26, and 63.8 vs 57.9 on LiveCodeBench-v6. Mistral-Small-4-119B retains advantages on GPQA-Diamond (77.2 vs 71.0) and MMLU-Pro (81.6 vs 74.2), where knowledge breadth matters more than mathematical reasoning depth.
Key Takeaways
ZAYA1-8B delivers frontier-level math and coding performance with only 760M active parameters, outperforming open-weight models many times its size.
Its MoE++ architecture introduces three innovations — CCA with 8× KV-cache compression, an MLP-based router with PID-controller bias balancing, and learned residual scaling — to maximize intelligence per parameter.
A novel test-time compute method called Markovian RSA, combining Recursive Self-Aggregation with Markovian chunking, pushes ZAYA1-8B past DeepSeek-V3.2 and GPT-OSS-High on APEX-shortlist at 5.5M tokens per problem.
ZAYA1-8B is the first MoE model pretrained, midtrained, and SFT’d entirely on AMD Instinct MI300-series hardware, on a 1,024-GPU MI300X cluster built with IBM.
Released under Apache 2.0, it is available on Hugging Face and Zyphra Cloud.
