
Is the Future of AI Local?

11 min read

Debate about whether the explosion of datacenter buildout will prove to be a worthwhile investment centers on two scenarios:

  1. AI adoption accelerates, and the datacenter investment pays off.
  2. AI adoption is slower than forecast, and it doesn't.

However, a third scenario is very plausible:

Open source models running on local workstations come to dominate AI.

There are a few reasons this could happen:


Open source models keep up

With the exception of GPT-4, open source models have matched the performance of frontier models within roughly six months of the frontier model's release (data):

[Chart: months to open source parity with frontier models, by provider (OpenAI vs. Anthropic)]

Naturally, there have been accusations of open source models gaming evals, but the frontier models do the same.

We can expect this to continue. Startups usually try to create a moat, but model providers build waterslides: frontier models help train their open source competitors.

Unauthorized distillation is a difficult threat to counter. Providers can complain (and have complained) about competitors using their models to train the competition. As a practical matter, however, this "theft"1 may be impossible to prevent.

Remote providers increase prices (or degrade subscription value)

The unit economics of frontier models are reminiscent of Uber's "cheap ride era": for example, despite $13 billion in revenue, OpenAI projects $14 billion in losses for 2026. That bill includes $8 billion in compute costs.

For Anthropic, Cursor recently estimated that a $200/month Claude Max subscription can consume up to $5,000 in compute. Even before this report, Anthropic had introduced rate limits on that subscription.

Their newly released Claude Code Review feature is priced at a steep $15–$25 per PR, and its announcement came with little explanation of why it should replace existing PR review workflows. This looks like a pricing experiment, to see how high a price enterprises will tolerate.

In OpenAI's case, there is public reporting on pruning side bets and focusing on enterprise2.

Small, specialized models emerge

Given today's low prices, there is relatively little downward economic pressure on token usage. People reach for the most powerful model, regardless of the task at hand.

This will change if prices increase, and the dominant pattern of subagent-driven workflows provides a natural transition. I probably don't need a frontier model to fix style issues in my Python PR - a small, specialized model can handle that just fine. If frontier models get dramatically more expensive (e.g. $25 per PR review), demand for these smaller models will increase, and the open source community will be well able to meet it.

This is already happening on a small scale: one whitepaper claimed parity with GPT-4o using a fine-tuned GPT-4o-mini model, at 2% of the cost.

Apple is betting on local

Apple is the lone contrarian amongst the tech giants: they are not spending mountains of capital on datacenters.

Apple has been criticized for being "behind" on AI, but their bet appears to be: have competitors burn cash to train models, let advances propagate into open source models, and make devices good enough to run them.

For now, running frontier open source models requires users to buy specialized hardware. However, the most recent MacBook Pro (M5 Max) looks to have made a leap in the size of model that's viable locally (data):

[Chart: max usable local model size by MacBook Pro generation]

Today, running frontier models on local workstations remains out of reach. But the gap is closing.

Private and free is hard to beat

If they can gain parity with hosted alternatives, local open source models have a compelling value proposition: fast, private, and free. This possibility has not gotten much attention: no one stands to get mega-rich from them. But the threat to current leaders is a potent one.

Appendix

Open Source Parity Data

| Frontier Model | Provider | Release | Benchmark | Score | Open Source Match | OS Model | Months to Parity | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 / ChatGPT | OpenAI | Nov 2022 | MMLU | ~70% | Aug 2023 | Llama 2 70B (70B) | ~9 | Stanford HAI AI Index 2025 |
| GPT-4 | OpenAI | Mar 2023 | MMLU | 86.4% | Jul 2024 | Llama 3.1 405B (405B) | ~16 | Epoch AI |
| Claude 3 Opus | Anthropic | Mar 2024 | MMLU | 86.8% | Jul 2024 | Llama 3.1 405B (405B) | ~4 | Epoch AI |
| GPT-4o | OpenAI | May 2024 | MMLU-Pro | 71.6% | Dec 2024 | DeepSeek-V3 (671B total / 37B active) | ~7 | DeepSeek V3 Technical Report |
| Claude 3.5 Sonnet | Anthropic | Jun 2024 | MMLU-Pro | 73.3% | Dec 2024 | DeepSeek-V3 (671B total / 37B active) | ~6 | DeepSeek V3 Technical Report |
| o1 | OpenAI | Sep 2024 | AIME 2024 | 79.2% | Jan 2025 | DeepSeek-R1 (671B total / 37B active) | ~4 | DeepSeek R1 via TechCrunch |
  • Epoch AI: Average lag of best open-weight model behind best closed model is now ~3 months (source)
  • Stanford HAI: Chatbot Arena Elo gap between closed and open models shrank from 8.04% to 1.70% between Jan 2024 and Feb 2025 (source)
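The "Months to Parity" column is plain calendar arithmetic between the release and match dates; a minimal sketch (assuming month granularity, as in the table):

```python
from datetime import date

def months_to_parity(release: date, match: date) -> int:
    """Whole months from frontier model release to open source match."""
    return (match.year - release.year) * 12 + (match.month - release.month)

# GPT-4 (Mar 2023) matched by Llama 3.1 405B (Jul 2024)
print(months_to_parity(date(2023, 3, 1), date(2024, 7, 1)))  # 16
```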

On-Device Model Size

Definition: "Max usable model" is the largest Q4-quantized model that fits in device RAM and runs at ≥8 tokens/second with an 8k context window — a threshold for a responsive conversational experience. It is min(RAM-fit, speed-fit), where:

  • RAM-fit = RAM × 0.8 / 0.75 — usable RAM (80% of total) divided by bytes per parameter at Q4 (~0.75 bytes/param after overhead)
  • Speed-fit = (memory_bandwidth / 51.2 GB/s) × ~11B — scales linearly with memory bandwidth from a reference point of ~11B parameters at Q4 running at 8 t/s on a 51.2 GB/s device

For MoE models, RAM-fit applies to total parameters (all weights must be loaded); speed-fit applies to active parameters only.
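The definition above can be sketched in code. This is a minimal version, assuming a reference constant of 11.25B params (the "~11B" above, which reproduces the table values) and Apple's published bandwidth figures (e.g. 273 GB/s for the M4 Pro):

```python
def ram_fit_b(ram_gb: float) -> float:
    """Largest Q4 model (billions of params) that fits in 80% of RAM,
    at ~0.75 bytes/param after overhead."""
    return ram_gb * 0.8 / 0.75

def speed_fit_b(bandwidth_gbs: float, reference_b: float = 11.25) -> float:
    """Largest model the memory bandwidth can stream at >=8 tok/s,
    scaled linearly from ~11B params at 8 tok/s on a 51.2 GB/s device."""
    return bandwidth_gbs / 51.2 * reference_b

def max_usable_model_b(ram_gb: float, bandwidth_gbs: float) -> float:
    """Max usable model = min(RAM-fit, speed-fit)."""
    return min(ram_fit_b(ram_gb), speed_fit_b(bandwidth_gbs))

# M4 Pro MacBook Pro: 24 GB RAM, 273 GB/s bandwidth -> RAM-bound at 25.6B
print(round(max_usable_model_b(24, 273), 1))  # 25.6
```

Note that the base M1 (16 GB, 68.25 GB/s) comes out speed-bound at 15.0B, while every later chip in the table is RAM-bound — which is why the jump to 128 GB on the M5 Max matters so much.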

MacBook Pro

| Device | Year | Chip | RAM | Max Model | RAM-fit | Speed-fit | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MacBook Pro M1 | 2020 | M1 | 16 GB | 15.0B | 17.1B | 15.0B | Wikipedia |
| MacBook Pro M1 Pro | 2021 | M1 Pro | 16 GB | 17.1B | 17.1B | 43.9B | Wikipedia |
| MacBook Pro (M1 Pro) | 2022 | M1 Pro | 16 GB | 17.1B | 17.1B | 43.9B | Wikipedia |
| MacBook Pro M3 Pro | 2023 | M3 Pro | 18 GB | 19.2B | 19.2B | 32.9B | Apple |
| MacBook Pro M4 Pro | 2024 | M4 Pro | 24 GB | 25.6B | 25.6B | 59.9B | Apple |
| MacBook Pro M5 | 2025 | M5 | 32 GB | 33.6B | 34.1B | 33.6B | Apple Support, Apple Newsroom |
| MacBook Pro M5 Max | 2026 | M5 Max | 128 GB | 134.9B | 136.5B | 134.9B | @JoshKale |

Footnotes

  1. The complaints are ironic given the "ask forgiveness rather than permission" approach to intellectual property that providers like Anthropic have themselves taken.

  2. Granted, part of this seems to be motivated by some side bets just not getting adoption, like the Sora video generation app.
