Researchers at Stanford University and Lambda Labs have published the research paper for OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device.
The open-weight models configured through OpenJarvis land within 3.2 percentage points of the best cloud model on average, at roughly 800× lower marginal API cost per query and roughly 4× lower latency under the paper’s benchmark protocol. The work builds on the team’s earlier Intelligence Per Watt study, which reported that local models already handle 88.7% of single-turn chat and reasoning queries at interactive latency, with intelligence efficiency improving 5.3× from 2023 to 2025.
Model Overview & Access
OpenJarvis is not a single model. It is a framework that composes any supported model with a configurable agent stack, evaluated across 11 local models from four families.
License: Apache 2.0
Framework release: March 12, 2026
Paper: arXiv:2605.17172 (posted May 16, 2026)
Repository: github.com/open-jarvis/OpenJarvis
Stars / forks: ~5.4k / ~1.2k (June 2026)
Languages: Python (~83%), Rust (~9%), TypeScript (~7%)
Evaluated models: 11 local models across 4 families (Qwen3.5, Gemma4, Nemotron, Granite)
Cloud baselines: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro
Supported engines: Ollama, vLLM, SGLang, llama.cpp, Apple Foundation Models, Exo (among others)
Context window: Model-dependent
Installation: Single command; ~3 minutes on broadband
Hardware: Tested on 7 platforms, from Mac Mini M4 to NVIDIA DGX Spark
Architecture: Five Primitives and a Spec
OpenJarvis decomposes a personal AI system into five typed primitives, composed through a single declarative configuration object called a spec.
Intelligence — the model, weights, generation parameters, and quantization format.
Engine — the inference runtime (Ollama, vLLM, SGLang, etc.), batching, KV-cache settings, and hardware path.
Agents — the reasoning loop (ReAct or CodeAct), system prompts, tool-use policy, and turn limits.
Tools & Memory — external interfaces, retrieval backends, 25+ data connectors, and 32+ messaging channels, with native MCP support and interchangeable memory backends.
Learning — the optimizer that updates the spec from traces. This slot accepts LoRA, DSPy, GEPA, or LLM-guided spec search.
Each primitive is independently swappable, and a spec serializes all five into a TOML file. Two specs can share the same agent and tool configuration and differ only in model and engine, so the same behavior runs on a Mac Mini and a workstation without rewriting prompts.
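The portability claim above can be illustrated with a minimal Python sketch. The `Spec` dataclass, its field names, and the string values are hypothetical stand-ins for the five primitives, not OpenJarvis's actual schema or TOML keys; the model and engine names are borrowed from the lists in this article.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for a spec: one field per primitive."""
    intelligence: str              # model and quantization
    engine: str                    # inference runtime
    agents: str                    # reasoning loop
    tools_memory: tuple[str, ...]  # connectors and memory backends
    learning: str                  # optimizer slot


def differing_fields(a: Spec, b: Spec) -> list[str]:
    """Return the names of the primitives on which two specs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only, not a real OpenJarvis configuration.
mac_mini = Spec(
    intelligence="Qwen3.5-9B",
    engine="Ollama",
    agents="ReAct",
    tools_memory=("mcp", "local-notes"),
    learning="spec-search",
)

# Same agent, tools, and learning settings; only model and engine change.
workstation = replace(mac_mini, intelligence="Qwen3.5-122B", engine="vLLM")

print(differing_fields(mac_mini, workstation))
```

Because the agent and tool fields are copied unchanged, the only differences reported are the two hardware-dependent primitives, which is the property the spec format is meant to provide.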
LLM-guided spec search is the second contribution. It is a local–cloud collaboration: a frontier cloud model acts as a teacher at search time, reading traces, diagnosing failure clusters, and proposing edits across Intelligence, Engine, Agents, and Tools & Memory. An edit is accepted only if it improves the target failure cluster without causing meaningful regressions elsewhere — the research team calls this the gate (default tolerance 1%). The optimized spec then runs entirely on-device at inference time, with zero cloud calls. The teacher is used only at search time; at 100 queries per day, the amortized teacher cost falls below $0.001 per query within six months.
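The gate described above can be sketched as a small acceptance rule. This is an illustration under stated assumptions, not the authors' implementation: scores are taken to be per-cluster accuracies on a 0 to 1 scale, the 1% tolerance is treated as an absolute drop, and the function and cluster names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    accepted: bool
    reason: str


def gate(before: dict[str, float], after: dict[str, float],
         target: str, tolerance: float = 0.01) -> GateDecision:
    """Accept an edit only if the target cluster improves and no other
    cluster drops by more than `tolerance` (assumed absolute accuracy)."""
    if after[target] <= before[target]:
        return GateDecision(False, f"no improvement on target '{target}'")
    for cluster, old_score in before.items():
        if cluster == target:
            continue
        if old_score - after[cluster] > tolerance:
            return GateDecision(False, f"regression on '{cluster}'")
    return GateDecision(True, "accepted")


# Hypothetical per-cluster accuracies before and after a proposed edit.
baseline = {"tool_args": 0.60, "long_context": 0.80, "formatting": 0.90}
good_edit = {"tool_args": 0.72, "long_context": 0.795, "formatting": 0.90}
bad_edit = {"tool_args": 0.72, "long_context": 0.75, "formatting": 0.90}

print(gate(baseline, good_edit, target="tool_args"))
print(gate(baseline, bad_edit, target="tool_args"))
```

The first edit passes because the only regression (0.005) is inside the tolerance; the second is rejected because it trades a five-point loss elsewhere for the same gain on the target.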
Prior work (GEPA, DSPy, LoRA) optimizes one primitive at a time, and prompt optimizers alone recover only about 5 pp of the cloud–local gap. LLM-guided spec search recovers 13–32 pp because it edits across primitives jointly, at 7–11× lower optimization cost than single-primitive baselines. The four-primitive move space contributes 5.5–16.5 pp, and the LLM proposer adds about 10 pp on average over an evolutionary search at the same move space.
https://arxiv.org/pdf/2605.17172v1
Capabilities & Performance
OpenJarvis was evaluated across 8 benchmarks spanning 508 tasks: tool calling (ToolCall-15), agentic workflows (PinchBench), coding (LiveCodeBench), customer service (τ-Bench V2, τ²-Bench Telecom), general assistance (GAIA), and deep research (LiveResearchBench, DeepResearchBench).
The swap test: Replacing the intended cloud model with Qwen3.5-9B in existing frameworks (OpenClaw, Hermes Agent) drops accuracy by 25–39 pp. With the same model under an OpenJarvis spec, the residual drop shrinks to 5.6–16.5 pp — recovering 56–77% of the portability loss.
The accuracy frontier: The best single local model, Qwen3.5-122B, reaches 80.3% average accuracy versus Claude Opus 4.6 at 83.5% — a 3.2 pp gap. Local specs match or exceed cloud on 4 of 8 benchmarks: ToolCall-15, PinchBench, LiveCodeBench, and τ-Bench V2.
Cost and latency: Local configurations form the accuracy–efficiency frontier. Qwen3.5-122B delivers its 80.3% at roughly a thousandth of a cent per query, versus $0.009 per query for Claude Opus 4.6 — an approximately 800× marginal API-cost advantage. End-to-end latency drops by roughly 4× on the agentic workloads, though the paper notes single-shot prompts can favor cloud serving.
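The two cost figures quoted above can be cross-checked with simple arithmetic. The sketch below only divides the reported cloud cost by the reported ratio; the function name is hypothetical, and the result is an implied value, not a number taken from the paper.

```python
CLOUD_COST_USD = 0.009   # per query, Claude Opus 4.6 (figure from the article)
REPORTED_RATIO = 800     # approximate cost advantage (figure from the article)


def implied_local_cost_usd(cloud_cost_usd: float, ratio: float) -> float:
    """Local per-query cost implied by a cloud cost and a cost ratio."""
    return cloud_cost_usd / ratio


local_usd = implied_local_cost_usd(CLOUD_COST_USD, REPORTED_RATIO)
local_cents = local_usd * 100  # convert dollars to cents

print(f"implied local cost: ${local_usd:.8f} per query ({local_cents:.6f} cents)")
```

The implied local cost comes out at a little over a thousandth of a cent per query, which is consistent with the "roughly a thousandth of a cent" wording.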
Search gains: LLM-guided spec search improves the Qwen3.5-9B student to 100% on PinchBench, 83% on LiveCodeBench, and 91% on LiveResearchBench. Across the full eight-benchmark suite, average gains per student model range from 13.1 to 31.5 pp. The authors report that these gains survive their robustness checks (reward-weight variants, search-seed variance, and random restarts).
How to Use It
Installation is one command. On macOS, Linux, or WSL2:
curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash
Windows users run an equivalent PowerShell script (irm … | iex). The installer provisions uv, a Python virtual environment, Ollama, and a starter model in about three minutes on broadband. A desktop GUI ships as a .dmg, .exe, .deb, .rpm, or .AppImage from the releases page.
After install, jarvis starts a chat session. Starter presets cover common workflows:
jarvis init --preset morning-digest-mac   # daily briefing with TTS
jarvis init --preset deep-research        # multi-hop research with citations
jarvis init --preset code-assistant       # agent with code execution and shell access
jarvis init --preset scheduled-monitor    # stateful agent on a schedule
The framework ships with eight built-in agents across three execution modes — on-demand, scheduled, and continuous. It connects to 25+ data sources (Gmail, Calendar, iMessage, Notion, Obsidian, Slack, GitHub, and others) and exposes agents over 32+ messaging channels (WhatsApp, Telegram, Discord, iMessage, Signal, and others).
Skills can be imported from external catalogs (about 150 from Hermes Agent and about 13,700 community skills from OpenClaw), all following the agentskills.io specification. A jarvis optimize skills --policy dspy command refines them from local trace history.
Marktechpost’s Visual Explainer
OpenJarvis · Stanford
Stanford · Hazy Research + Scaling Intelligence Lab
OpenJarvis
An open-source, local-first framework for personal AI agents that run inference, agents, memory, and learning entirely on-device.
Within 3.2 pp of best cloud
~800× lower marginal API cost
~4× lower latency
Apache 2.0 • arXiv:2605.17172 • Framework released March 12, 2026
What it is
Personal AI that runs on your hardware
Most “personal” AI still routes every query through a cloud API. OpenJarvis makes local-first the default and calls the cloud only when needed — building on the team’s Intelligence Per Watt finding that local models already handle 88.7% of single-turn queries.
License: Apache 2.0
Repository: github.com/open-jarvis/OpenJarvis
Models: 11 local models · 4 families (Qwen3.5, Gemma4, Nemotron, Granite)
Engines: Ollama, vLLM, SGLang, llama.cpp, Apple FM, Exo
Architecture
Five primitives, one spec
A personal AI system is decomposed into five typed, independently swappable primitives, composed through a single declarative spec serialized to portable TOML.
Intelligence — model, weights, generation params, quantization
Engine — inference runtime, batching, KV-cache, hardware path
Agents — reasoning loop (ReAct or CodeAct), prompts, tool policy
Tools & Memory — 25+ connectors, 32+ channels, native MCP
Learning — optimizer slot: LoRA, DSPy, GEPA, or spec search
Key method
LLM-guided spec search
A frontier cloud model acts as a teacher at search time: it reads traces, diagnoses failure clusters, and proposes edits across primitives. A gate accepts only non-regressing edits. The optimized spec then runs entirely on-device — zero cloud calls at inference time.
13–32 pp of the cloud–local gap closed
7–11× lower optimization cost vs single-primitive baselines
The four-primitive move space adds 5.5–16.5 pp; the LLM proposer adds ~10 pp over evolutionary search at the same move space.
Performance
Close to cloud, far cheaper
3.2 pp gap: Qwen3.5-122B 80.3% vs Claude Opus 4.6 83.5%
4 / 8 benchmarks where local matches or beats cloud
Matches/exceeds cloud on ToolCall-15, PinchBench, LiveCodeBench, τ-Bench V2
~800× lower marginal API cost; ~4× lower latency (paper’s protocol)
Swap test: a 25–39 pp drop shrinks to 5.6–16.5 pp under a spec (56–77% recovered)
Developer experience
From zero to an agent in minutes
One command provisions uv, a Python virtual environment, Ollama, and a starter model (~3 minutes on broadband):
curl -fsSL https://open-jarvis.github.io/OpenJarvis/install.sh | bash
8 built-in agents across on-demand, scheduled, and continuous modes
25+ data connectors · 32+ messaging channels
Skills via agentskills.io: ~150 from Hermes Agent, ~13,700 from OpenClaw
The bottom line
A research platform and a production foundation
OpenJarvis trades roughly 3.2 pp of accuracy — the gap concentrating on reasoning- and research-heavy tasks — for major cost, latency, and privacy gains. Inference, agent state, and memory stay on-device by construction; the cloud teacher is optional and bounded.
Caveats: results average 5 runs per configuration, use GPT-5-mini as judge, and were run on a single machine. Apache 2.0 and actively maintained — built, in the authors’ words, “in the spirit of PyTorch” for local AI.

