Studio Case Studies Jun 07, 2026

Testing OpenJarvis End-to-End: What Worked, What Didn’t

Our working test notes on OpenJarvis, published in the open. This is the scaffold of a hands-on review with the concrete findings still to be filled in. Here is what we set the test up to answer, how we ran it, and the kind of verdict we are working toward.

Written byH Hillary

Read time9 min

UpdatedJun 12, 2026

Filed underStudio · Case Studies

Testing OpenJarvis End-to-End: What Worked, What Didn’t

A note on honesty before anything else: this is the studio’s working test report for OpenJarvis, published while it is still being finalized. We do not yet have an Atlas note for OpenJarvis, which means we have not locked our findings to the standard we hold every other tool to. Rather than invent results, we are publishing the test scaffold itself: the questions we set out to answer, how we structured the test, and the shape of the verdict we are working toward. The concrete findings are marked, and they will be filled in from our own runs before this leaves Review.

If you want the polished writeups, look at the builds already in production. This one is deliberately a work-in-progress, because we think showing the test method is worth something even before the numbers land.

What we wanted it to do

Every test in the studio starts with a plain statement of the job, written before we touch the tool, so the tool cannot define its own success. For OpenJarvis the job we wanted to evaluate was [CONFIRM: the specific job OpenJarvis claims to do, in our own words]. The reason we picked it up at all was [CONFIRM: what made OpenJarvis worth a hands-on test rather than a pass].

We held it to the same thesis we hold everything to: does it run locally, on the machines we already own, well enough for real work. A tool that only shines against a hosted backend does not pass the studio’s local-first bar, and we noted up front whether OpenJarvis could run against a local model engine like the rest of our stack.

How we set up the test

We run every tool on a clean, documented setup so the result is reproducible and the failures are honestly the tool’s and not a mess we made.

Machine: an Apple Silicon Mac from the studio stack, the same class of hardware we recommend to readers.
Install path: [CONFIRM: install command or download steps actually used].
Backend: [CONFIRM: whether OpenJarvis ran against local Ollama, another engine, or its own bundled runtime].
Version: [CONFIRM: exact version tested and the date we tested it].

We did the install once, with the network on, then planned a second pass with the network off to see what the tool actually needs a connection for. That offline pass is the same test we run on everything that claims to be local, because “local” is a claim you can watch rather than take on faith.

What we looked for

A test is only useful if you decide what counts before you start. These are the questions we wrote down for OpenJarvis, each one with the answer still to be confirmed from our runs.

Does the install match the promise? First-run friction, dependencies it pulled in, anything the docs did not warn about. [CONFIRM: actual install experience and any snags].
Does it do the core job? The one task it exists for, run on real studio material rather than a toy example. [CONFIRM: result on the core task].
Is it actually local? What happened with the network off, and whether the network indicator stayed at zero during an online run. [CONFIRM: offline behavior].
Where does it stop? The edge where it stumbled, because every tool has one and the honest review names it. [CONFIRM: the failure mode we hit].
Who is it for? Whether it earns a place next to the tools already in the studio stack, or whether an existing tool already does the job. [CONFIRM: comparison to our current tooling].

What worked

This section will hold the things OpenJarvis did cleanly, in plain language, with the specific moment that proved each one. We are leaving it as a marker rather than guessing, because a review that invents wins is worse than no review at all. [CONFIRM: the concrete things that worked, each tied to an observed moment in the test].

The kind of thing that lands here, when it is real, is the same kind of moment we filmed for our other builds: a task running with the Wi-Fi off, an edit applied without anything leaving the building, a result that arrived faster than expected on hardware we already owned.

What didn’t

This is the section that makes a test report worth reading, so it gets the same care as the wins. Here we will record where OpenJarvis frustrated us, what broke, and what it asked of us that the docs did not mention. [CONFIRM: the specific failures, frustrations, and rough edges we hit].

We will also note whether any failure was the tool’s fault or the model’s ceiling, because those are different problems with different fixes, and conflating them is how reviews mislead people.

The verdict we are working toward

Our verdict ladder is fixed and we will place OpenJarvis on it once the findings above are real: research for “interesting, not yet hands-on,” tested for “we ran it and here is what happened,” and in-production for “it earned a permanent place in the studio.” Right now OpenJarvis is mid-test, which is exactly why this article is in Review and carries a grounding flag rather than a confident recommendation. [CONFIRM: final placement on the ladder and the one-sentence verdict].

We are publishing the scaffold because the method is the part you can use today even while our numbers are still landing: state the job before you touch the tool, set up clean, decide what counts, run it offline, and name the failure as clearly as the win. When our runs are complete, the markers above become findings and this becomes a finished review. Until then, take it for what it is, the studio’s open test notes.

Curious about these things. You should be too.

Harness your curiosity.

— Stridenote · № 004

What we wanted it to do

How we set up the test

What we looked for

What worked

What didn’t

The verdict we are working toward

More from studio.

From Subscription to Self-Hosted: A Studio Case Study

A Week of Coding With a Local Agent, No Cloud

The Cheapest Mac That Runs Serious Local AI