Local vs cloud AI: what to run where in 2026

Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow.

A compliance officer at a European bank recently asked how to handle an internal chatbot that employees were using to summarise customer call transcripts. The chatbot ran on a cloud API. No one had checked whether those transcripts contained personally identifiable information. They did. The bank had no data processing agreement in place with the API provider, and the model provider’s terms of service reserved the right to use API inputs for service improvement. That is the exact moment when the local-versus-cloud question stops being theoretical.

The Edge Inference Decision Framework published by Tianpan.co in April 2026 states directly: “HIPAA, GDPR, EU AI Act (August 2024), and sector-specific regulations in finance and legal all create hard constraints on where data can travel.” The same analysis notes that zero-data-retention architectures are only achievable with on-device inference because user inputs never leave the device. The bank’s problem was not that the cloud was inherently insecure. The problem was that no one had asked the threshold question: does this workload belong in the cloud at all?

When does privacy rule out the cloud?

The first filter in the decision is not cost or latency. It is data sensitivity. The question is binary: can this data legally or contractually leave your infrastructure?

The Cyberax AI Playbook (May 2026) puts it plainly: “Move everything local when you have a hard data-residency or privacy requirement that makes cloud APIs unsuitable. The cost of self-hosting is justified by the binary nature of the constraint, not by the math.” Regulated content such as protected health information, classified data, attorney-client communications, and customer contracts that prohibit third-party processing all create hard walls. No amount of cloud cost efficiency justifies crossing them.

The regulatory environment in 2025 and 2026 has made this filter more restrictive. Tianpan.co notes a development where “the EU is now treating models that have memorized personal data as potentially constituting personal data themselves.” If your cloud provider fine-tunes on user data without explicit consent mechanisms, the model weights themselves may carry compliance exposure. Local inference sidesteps this entirely because the data never leaves your own hardware.

The Prediction Guard diagnostic framework (May 2026) breaks down what happens when you get this wrong: “Unauthorized PHI disclosure under HIPAA can trigger penalties ranging from $145 to $2,190,294 per violation under 2026-adjusted penalty tiers.” A single audit finding that traces back to an ungoverned AI interaction can generate remediation costs that dwarf a year of infrastructure investment. That asymmetry, not the unit economics, is why a hard privacy constraint settles the question on its own.

How does the cost math shift at scale?

For workloads that pass the privacy filter, the next question is volume. Cloud APIs are cheap at low volume and expensive at high volume. Local hardware is the opposite.

The Cyberax framework provides a clear threshold: move specific workloads local when all three conditions apply. The workload must be high-volume and predictable enough to forecast tokens per day. The per-token cost at cloud pricing must be crossing a real line item threshold, roughly $1,000 per month on that one workload. The available open-weight model must handle the task without unacceptable quality loss. Common fits include classification, embedding, summarisation of standard documents, and simple structured extraction.
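The three conditions can be written down as a simple check. This is a minimal sketch, not part of the Cyberax framework itself: the `Workload` fields and the `should_move_local` function are hypothetical names, and the only number taken from the text is the rough $1,000 per month threshold.

```python
from dataclasses import dataclass

# "Roughly $1,000 per month on that one workload", per the threshold above.
MONTHLY_COST_THRESHOLD_USD = 1_000.0


@dataclass
class Workload:
    """Hypothetical description of one workload; field names are illustrative."""
    name: str
    volume_is_forecastable: bool   # high-volume and predictable tokens per day
    monthly_cloud_cost_usd: float  # current cloud spend on this workload alone
    open_model_quality_ok: bool    # an open-weight model handles it acceptably


def should_move_local(w: Workload,
                      threshold_usd: float = MONTHLY_COST_THRESHOLD_USD) -> bool:
    """Return True only when all three conditions hold at the same time."""
    return (
        w.volume_is_forecastable
        and w.monthly_cloud_cost_usd >= threshold_usd
        and w.open_model_quality_ok
    )


if __name__ == "__main__":
    ticket_tagging = Workload("ticket-tagging", True, 2_500.0, True)
    research_agent = Workload("research-agent", False, 4_000.0, False)
    print(should_move_local(ticket_tagging))   # True
    print(should_move_local(research_agent))   # False
```

The check is deliberately conjunctive: a workload that is expensive but unpredictable, or predictable but beyond what an open-weight model can handle, stays on the cloud API.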

Cyberax also tracks what happens when the math flips. A 2025 fintech case study shows one company cutting its monthly AI bill from $47,000 to $8,000, an 83 percent reduction, by moving predictable workloads to self-hosted models while keeping frontier model access on cloud APIs. The savings come from the fact that local inference on a consumer GPU costs fractions of a cent per call, while cloud API pricing typically ranges from $0.0001 to $0.05 per call depending on the model tier.
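The headline percentage in the case study is easy to verify. The snippet below is only an arithmetic check on the two monthly figures quoted above; the function name is illustrative, and the annualised saving assumes the bill stays flat across twelve months, which the case study does not state.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


monthly_before = 47_000  # USD, from the case study
monthly_after = 8_000    # USD, from the case study

print(round(percent_reduction(monthly_before, monthly_after)))  # 83
# Assumption: the same bill every month for a year.
print((monthly_before - monthly_after) * 12)  # 468000
```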

Tianpan.co confirms the same cost structure. Cloud inference sends a network request to a provider-managed GPU cluster, waits for a response, and pays per token. No hardware cost on your side, but every call crosses the network and every call is metered. Local hardware has upfront capital costs ranging from $1,500 for a workstation to $100,000 or more for a server-grade GPU, but the per-call cost after that is approximately electricity.

The Prediction Guard diagnostic adds the operational layer that often gets left out of hardware ROI projections. “A fully-loaded MLOps engineer in the UK runs 100,000 to 135,000 pounds per year before on-call cover.” Running a self-hosted model requires at minimum one person who can patch the stack, monitor drift, manage the evaluation pipeline, and respond to incidents. That headcount cost recurs every year and sits outside every GPU vendor ROI projection. The same analysis found that a mid-size firm’s measured load on an 8x A100 box averaged 11 percent utilisation. At that level of utilisation, the breakeven stretches from under two years to most of a decade.
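Why utilisation moves the breakeven so sharply can be seen in a toy model. This is a sketch under stated assumptions, not Prediction Guard's calculation: it assumes the cloud spend avoided scales linearly with utilisation, treats operating cost as a flat monthly figure, and every number in the example is hypothetical.

```python
from typing import Optional


def breakeven_months(capex_usd: float,
                     cloud_cost_at_full_load_usd: float,
                     utilisation: float,
                     monthly_opex_usd: float) -> Optional[float]:
    """Months until hardware pays for itself, or None if it never does.

    Assumption: the cloud bill the hardware replaces is proportional to
    how busy the hardware actually is.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    monthly_saving = cloud_cost_at_full_load_usd * utilisation - monthly_opex_usd
    if monthly_saving <= 0:
        return None  # running costs eat the entire saving
    return capex_usd / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the shape of the curve.
    for u in (1.0, 0.5, 0.11):
        print(u, breakeven_months(120_000, 10_000, u, 0))
```

Halving utilisation doubles the payback period, and once staffing and power are included a lightly used box may never break even, which is the case the function reports as `None`.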

When will only a frontier model do?

The third filter is model capability. Local open-weight models have improved rapidly, but they still trail frontier proprietary models on complex reasoning, long-context tasks, and tool use.

The Cyberax playbook makes this explicit: “Default to cloud APIs. For most teams, most workloads, the right starting point is a cloud API.” The reasoning is straightforward. You get the strongest available models, no operational burden, scaling is the vendor’s problem, and the cost is small until volume is high. The framework recommends moving specific workloads local only when the three conditions above hold simultaneously.

Tianpan.co confirms this pattern by describing the architecture that production-scale teams have settled on: “An on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can not.” Frontier-model needs such as complex reasoning, novel tasks, and anything user-facing where quality matters most should go to cloud APIs. Predictable high-volume tasks like classification and structured extraction can move to self-hosted open-weight models. The two are not the same decision.

Prediction Guard adds a temporal dimension to the capability question. Open-weight models from Llama, Qwen, and Mistral have closed the gap on many standard benchmarks, but the gap widens on long-context reasoning and tool use. A local model shipped today may trail the frontier by the time the hardware is paid off. Cloud APIs update continuously. For use cases where model quality directly affects product experience, that lag matters.

What hybrid architecture do most teams settle on?

Every source in this review reaches the same conclusion: almost no production deployment is purely local or purely cloud.

The Tianpan.co framework describes the modal architecture as a tiered system where most requests are handled by a local or on-premise model while a cloud frontier model catches what the smaller model cannot handle. Getting the routing right is an engineering decision, not an intuition.

The Cyberax playbook recommends a concrete split. Frontier-model needs such as complex reasoning and novel tasks go to cloud API services like Claude Opus 4.7 or GPT-5.5. Predictable high-volume workloads such as classification, embedding, and structured extraction go to self-hosted open-weight models like Llama 4 or Qwen 3. Privacy-bound data stays local regardless of cost efficiency. Variable and overflow load routes to cloud API as an elastic tier while local hardware handles steady state.
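The split above maps naturally onto a small routing function. The following is a sketch only: the `Request` fields, the `Target` enum, and the rule ordering are assumptions made for illustration, with privacy checked first because the text says privacy-bound data stays local regardless of cost.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    LOCAL = "self-hosted open-weight model"
    CLOUD = "cloud frontier API"


@dataclass
class Request:
    """Hypothetical request metadata; a real system would derive these flags."""
    privacy_bound: bool               # data may not leave your infrastructure
    needs_frontier: bool              # complex reasoning or a novel task
    local_at_capacity: bool = False   # steady-state hardware is saturated


def route(req: Request) -> Target:
    # 1. Privacy-bound data stays local regardless of cost or load.
    if req.privacy_bound:
        return Target.LOCAL
    # 2. Frontier-model needs go to the cloud API.
    if req.needs_frontier:
        return Target.CLOUD
    # 3. Overflow load uses the cloud as an elastic tier.
    if req.local_at_capacity:
        return Target.CLOUD
    # 4. Predictable steady-state work runs on local hardware.
    return Target.LOCAL


if __name__ == "__main__":
    print(route(Request(privacy_bound=True, needs_frontier=True)))    # Target.LOCAL
    print(route(Request(privacy_bound=False, needs_frontier=True)))   # Target.CLOUD
```

One consequence of this ordering is that a privacy-bound request arriving while local hardware is saturated must queue rather than overflow to the cloud.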

For data-sensitive tasks, the routing must be more granular. Prediction Guard recommends a hybrid control plane that classifies requests before inference reaches any model. Sensitive data stays on-device or in private cloud. Non-sensitive tasks can go to a third-party API. The routing layer itself becomes a governance control point. The framework is explicit that the policy engine should exist before the prompt layer.
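A policy gate that runs before any model sees the prompt might look like the sketch below. It is a toy, not Prediction Guard's control plane: the two regular expressions are illustrative stand-ins for a real PII classifier, and `is_sensitive`, `dispatch`, and the model callables are hypothetical names.

```python
import re
from typing import Callable

# Illustrative patterns only; production systems need a proper PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # e.g. account or phone numbers


def is_sensitive(prompt: str) -> bool:
    """Classify the request before inference reaches any model."""
    return bool(EMAIL.search(prompt) or LONG_DIGITS.search(prompt))


def dispatch(prompt: str,
             local_model: Callable[[str], str],
             cloud_model: Callable[[str], str]) -> str:
    """Send sensitive prompts to the local model, everything else to the cloud."""
    handler = local_model if is_sensitive(prompt) else cloud_model
    return handler(prompt)


if __name__ == "__main__":
    local = lambda p: "handled locally"
    cloud = lambda p: "handled by cloud API"
    print(dispatch("Summarise the call with jane@example.com", local, cloud))
    print(dispatch("Explain power usage effectiveness", local, cloud))
```

Because the classifier sits in front of both models, it is also the natural place to log routing decisions for audit.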

What should you consider before buying hardware?

The case for local AI is strong at scale, but the operational reality is more expensive than hardware amortization alone suggests.

Prediction Guard provides a sobering breakdown. For a dual H100 GPU deployment, hardware amortized over three years runs $28,000 to $40,000 annually. Power and cooling at a Power Usage Effectiveness ratio of 1.4 add $1,500 to $3,000. But the largest line item is MLOps engineering at 0.3 to 1.0 full-time equivalent, which runs $45,000 to $96,000 annually. Total before the first token is approximately $75,000 to $139,000 per year.
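Adding up the line items shows where the total comes from and how much of it is people rather than hardware. The figures are the ranges quoted above; the dictionary keys and function names are illustrative.

```python
# Annual cost ranges in USD (low, high), copied from the figures quoted above.
COSTS = {
    "hardware_amortised": (28_000, 40_000),
    "power_and_cooling": (1_500, 3_000),
    "mlops_engineering": (45_000, 96_000),
}


def annual_total(costs: dict) -> tuple:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def share_of_total(costs: dict, item: str) -> tuple:
    """Fraction of the low and high totals taken by one line item."""
    low_total, high_total = annual_total(costs)
    lo, hi = costs[item]
    return lo / low_total, hi / high_total


low, high = annual_total(COSTS)
print(f"Total before the first token: ${low:,} to ${high:,}")
# Total before the first token: $74,500 to $139,000
```

The low end sums to $74,500, which the text rounds to approximately $75,000. On these numbers, MLOps engineering accounts for roughly 60 to 69 percent of the annual total.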

The Tianpan.co framework flags another cost that is easy to miss. “Edge requires capital expenditure, dedicated ops, and utilization discipline. Cloud scales to zero; edge hardware depreciates whether you use it or not. Bursty workloads that periodically spike 10x should stay on cloud, or use cloud as the overflow tier.” A GPU cluster sitting idle at night still consumes power and depreciates.

The Cyberax playbook addresses the team-size question directly. “Startups should default cloud almost universally. Operational simplicity matters more than infrastructure cost when you are five people. Hardware capital is wrong for an organisation that pivots quarterly.” Established companies with predictable workloads and existing infrastructure teams have more reasons to consider hybrid earlier because the operational cost is amortized and the engineering function already exists.

Prediction Guard sets the deployment timeline at 2 to 4 weeks for a single-tenant baseline and 6 to 12 weeks for deployments with custom RAG pipelines or legacy system integration. The bottleneck is almost always infrastructure procurement, not the AI work itself. For a team iterating toward product-market fit, that delay is too long.

The decision framework reduces to three questions. Is the data sensitive? If yes, default to local. Is latency critical? If yes, default to local. Is the workload predictable and high-volume? If yes, local is almost certainly cheaper. If the answer to all three is no, cloud is the right call. In practice, most enterprise AI workloads answer yes to at least one of these questions. The teams that build durable systems are the ones that ask them before they buy hardware.
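The three questions can be captured in a few lines. This is a minimal sketch with hypothetical names; it returns the reasons alongside the recommendation so the answer can be recorded with the decision.

```python
from typing import List, Tuple


def recommend(sensitive: bool,
              latency_critical: bool,
              predictable_high_volume: bool) -> Tuple[str, List[str]]:
    """Return ("local" or "cloud", reasons) following the three questions."""
    reasons = []
    if sensitive:
        reasons.append("data sensitivity")
    if latency_critical:
        reasons.append("latency")
    if predictable_high_volume:
        reasons.append("predictable high volume")
    # Cloud is the right call only when every answer is no.
    return ("local" if reasons else "cloud", reasons)


if __name__ == "__main__":
    print(recommend(False, False, False))  # ('cloud', [])
    print(recommend(True, False, True))
    # ('local', ['data sensitivity', 'predictable high volume'])
```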
