For a while there, betting on open weights vs closed source LLMs felt like bringing a butter knife to a sword fight. Closed models from OpenAI and Anthropic were miles ahead. Then Meta dropped LLaMA in February 2023, and the whole thing got interesting. Fast. Today, open weights models reportedly hit 85-95% parity with the big closed players on standard benchmarks. So is the gap actually closing? Yes. But "closing" and "closed" are two very different words. (One of them I keep telling my IT team about the server room. They never listen.)

TL;DR: The gap between open weights and proprietary LLMs has narrowed dramatically since 2023. Open weights models now match GPT-3.5 and rival GPT-4 on many tasks. Closed models still lead on frontier reasoning. For most real-world use, open weights are good enough, and often cheaper.

"Open weights" doesn't mean "open source"

Quick myth-bust before we go further. People throw "open source" and "open weights" around like they're the same thing. They're not, even though almost everyone uses the terms loosely.

Open weights means the model's trained parameters are published. You can download them, run them on your own hardware, and fine-tune them. Llama and Mistral are open weights.

Open source, in the strict sense, means you also get the training data, training code, and the recipe to rebuild it from scratch. Most "open" LLMs don't go that far. They give you the engine but not the factory blueprints.

Closed source models keep everything locked behind an API. GPT-4 and Claude 3 are the obvious examples: you send a prompt, you get a response, and you never touch the weights. You rent the output, not the engine, and you pay by the token.

So when we talk about open weights vs closed source LLMs, the real divide is control. Do you own the model on your machine, or do you call someone else's?

How the open weights LLM performance gap got so small

Rewind to November 2022. ChatGPT launched and basically set the benchmark everyone chased. For months, nothing open came close.

Then the dominoes fell:

  • February 2023 — Meta released the LLaMA family. The open weights movement had its founding moment.
  • April 2023 — Stability AI's StableVicuna took an early swing at matching closed-source instruction-following.
  • September 2023 — Mistral AI dropped its 7B model. A small open model punching above its weight class against bigger closed competitors.
  • January 2024 — Open weights models started genuinely competing on reasoning and coding, not just chat.

By 2024, two quieter trends did the heavy lifting. Quantization shrank models so they ran on consumer hardware. Fine-tuning got cheap and accessible. Suddenly you didn't need a data centre to run something useful.
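
To see why quantization matters so much, here's a back-of-the-envelope sketch. It only counts the memory needed to hold the weights (parameters times bits per parameter) and ignores everything else a real runtime needs, such as the KV cache, activations, and per-block quantization metadata. The function name and the choice of 1 GB = 1e9 bytes are my own assumptions for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: ignores KV cache, activations and quantization overhead,
    so real-world usage will be somewhat higher.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


SEVEN_B = 7e9  # a 7B-parameter model, e.g. the size class Mistral 7B sits in

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(SEVEN_B, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```

At 16-bit, a 7B model wants a serious GPU. At 4-bit, the weights fit in the memory of an ordinary laptop, which is roughly the shift that made local models practical.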

Reportedly, open weights models now make up roughly 40-50% of all new model releases. That's not a fringe movement anymore. That's close to half the field.

Where closed vs open source AI models still split

Here's the honest bit. The closed vs open source AI models race isn't a tie yet.

Top open weights models reportedly hit 85-95% parity with closed source leaders on standard benchmarks. Sounds like a photo finish. But that missing 5-15% lives exactly where it hurts most.

Closed models like GPT-4 and Claude 3 still lead on:

  • Complex reasoning — multi-step logic where one wrong turn ruins the whole answer.
  • Instruction-following — doing exactly what you asked, not the vibe of what you asked.
  • Emerging capabilities — the weird new tricks that show up at the frontier first.

Think of it like this. Open weights models cleared the bar GPT-3.5 set. The frontier itself — GPT-4, Claude 3 — keeps moving. So the gap closes on yesterday's frontier while a new one opens up ahead. Catching up to a moving target is a full-time job. (Ask anyone who's tried to finish a to-do list.)

The cost comparison nobody actually runs

Everyone says open source is cheaper. Few people do the maths.

Closed source APIs cost per token. No infrastructure, no DevOps team, no GPU bills. You pay as you go. For low or spiky volume, that's genuinely hard to beat.

Open weights flip the equation. The model is free. Running it is not. You're paying for GPUs, hosting, engineering time, and the maintenance that never ends. The rule of thumb: open weights get cheaper per unit only once your volume is high and steady.

Nine times out of ten, the break-even point is higher than people expect. If you're sending a few thousand requests a day, the API probably wins. If you're running millions and you've already got infrastructure, self-hosting open weights starts paying off.
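
If you want to actually do the maths, a minimal sketch looks like this. It assumes self-hosting is one fixed monthly bill (GPUs, hosting, engineering time) and that the API is purely per-token, which is a simplification. Every number below is a made-up placeholder, not a real price: plug in your own quotes.

```python
def api_monthly_cost(requests_per_day: float,
                     tokens_per_request: float,
                     price_per_million_tokens: float) -> float:
    """Monthly API bill, assuming a 30-day month and flat per-token pricing."""
    tokens_per_month = requests_per_day * 30 * tokens_per_request
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_requests_per_day(fixed_monthly_cost: float,
                                tokens_per_request: float,
                                price_per_million_tokens: float) -> float:
    """Daily request volume at which the API bill equals the self-hosting bill.

    Assumption: self-hosting is a fixed monthly cost with no per-request cost.
    """
    cost_per_request = tokens_per_request / 1_000_000 * price_per_million_tokens
    return fixed_monthly_cost / (cost_per_request * 30)


# Hypothetical inputs, for illustration only.
SELF_HOST_MONTHLY = 6_000.0   # GPUs + hosting + a slice of an engineer
TOKENS_PER_REQUEST = 1_000
PRICE_PER_MILLION = 1.0       # API price per million tokens

print(f"API bill at 3,000 req/day: {api_monthly_cost(3_000, TOKENS_PER_REQUEST, PRICE_PER_MILLION):,.0f}")
print(f"Break-even: {break_even_requests_per_day(SELF_HOST_MONTHLY, TOKENS_PER_REQUEST, PRICE_PER_MILLION):,.0f} req/day")
```

With these placeholder numbers, a few thousand requests a day costs about 90 a month on the API, and self-hosting only breaks even around 200,000 requests a day. Your figures will differ, but the shape of the result usually doesn't.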

For more on running models on your own hardware, see our guide on running LLMs locally and our breakdown of GPU requirements for self-hosted AI.

Best open weights LLMs vs GPT: the honest scoreboard

If you're weighing the best open weights LLMs against GPT, here's the rough lay of the land in 2024.

Llama and Mistral models reliably match or beat GPT-3.5 on plenty of tasks. Summarisation, classification, standard chat, coding assistance — the open options are right there. For these jobs, paying premium API prices feels like buying a Ferrari to get milk.

GPT-4 and Claude 3 still pull ahead when the task gets genuinely hard. Long-chain reasoning, nuanced instruction-following, edge cases that need that last 10% of polish.

The practical question isn't "which is best overall." It's "which is good enough for my job at a price I'll tolerate." For a huge slice of real work, open weights clears that bar. The frontier models earn their keep on the hard stuff. According to published benchmark research on arXiv, the parity picture shifts task by task — so test on your own data, not someone's leaderboard.
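
Testing on your own data doesn't need a framework. Here's a minimal sketch of a side-by-side check: a handful of labelled examples from your real workload, and any two "models" wrapped as plain functions that take a prompt and return a string. The two stand-in functions below are toy placeholders so the example runs on its own. In practice you'd swap in a call to your self-hosted model and a call to your API provider.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (prompt, expected label)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Share of examples where the model's answer matches the expected label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        1
        for prompt, expected in examples
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


# Toy stand-ins (hypothetical): replace with real model calls.
def candidate_a(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "technical"


def candidate_b(prompt: str) -> str:
    return "technical"


my_examples = [
    ("My invoice is wrong", "billing"),
    ("App crashes on login", "technical"),
    ("Charged twice on my invoice", "billing"),
    ("Cannot reset password", "technical"),
]

for name, model in (("candidate_a", candidate_a), ("candidate_b", candidate_b)):
    print(f"{name}: {accuracy(model, my_examples):.0%}")
# candidate_a: 100%
# candidate_b: 50%
```

Exact-match scoring only suits tasks with short, unambiguous answers like classification. For summaries or free text you'd need a different scoring rule, but the principle holds: your examples, your score.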

Deploying and fine-tuning without losing your weekend

Running an open weights LLM locally is easier than it was a year ago. Tools like Ollama and LM Studio let you download a quantized model and run it on a decent laptop. No PhD required.
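
Once a model is running in Ollama, talking to it from code is a plain HTTP call. This sketch assumes Ollama is serving on its default local port (11434) and that you've already pulled a model. The model name you pass in is up to you, and the helper names here are my own. It uses only the Python standard library.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Raises urllib.error.URLError if no server is listening.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server up, `generate("<your-model-name>", "Summarise this ticket: ...")` returns the model's reply as a string. No API key, no per-token bill, and nothing leaves your machine.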

For production, you'll want proper inference servers and real GPUs. That's where the DevOps tax kicks in.

Fine-tuning is the open weights superpower. Because you hold the weights, you can train the model on your own data, your tone, your domain. Techniques like LoRA make this cheap — you tweak a small slice of parameters instead of retraining the whole beast.
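
How small is that "small slice"? LoRA freezes each original weight matrix and trains two thin matrices alongside it, so for a layer with d_in inputs and d_out outputs at rank r, you train r × (d_in + d_out) parameters instead of d_in × d_out. A quick sketch of the arithmetic, with a layer size and rank picked purely for illustration:

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for that matrix: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative numbers, not taken from any specific model.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = full_params(D_IN, D_OUT)
lora = lora_trainable_params(D_IN, D_OUT, RANK)
print(f"LoRA trains {lora:,} of {full:,} parameters ({lora / full:.2%})")
# LoRA trains 65,536 of 16,777,216 parameters (0.39%)
```

Less than half a percent of the layer is being trained, which is why fine-tuning stopped needing a data centre.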

You can't do this the same way with closed source. Most APIs offer limited fine-tuning at best, you never hold the resulting weights, and you're still renting. If your use case is narrow and specialised, a fine-tuned open model often beats a general closed one. The Hugging Face model hub is the place to start hunting. Pair this with our fine-tuning starter guide if you're new to it.

The hidden cost: who patches the model when it breaks?

Here's the bit the open source crowd skips at the pub. When you self-host, you own the whole stack — including the problems.

Security patches? Yours. Model drift over time? Yours. A new exploit in your inference server? Also yours, at 2am, on a public holiday.

With closed source, the vendor carries that load. They patch, they monitor, they handle abuse and safety. You're paying partly for someone else to lose sleep.

This rarely shows up in cost comparisons because it's invisible until something breaks. Then it's the only number that matters. Factor in a maintenance buffer before you commit to self-hosting. The model being free doesn't make the operation free.

The numbers worth remembering

  • 40-50%: share of new model releases that are open weights (2023-24)
  • 85-95%: parity open weights models hit vs closed source on benchmarks
  • Feb 2023: LLaMA release that kicked off the movement
  • 7B: parameters in the Mistral model that matched bigger closed rivals
  • Nov 2022: ChatGPT launch that set the benchmark
  • 5-15%: the frontier gap closed models still hold

My honest take: stop chasing the frontier you don't need

Here's my one strong opinion. Most teams obsess over the 5-15% frontier gap when they're solving problems that GPT-3.5-level models cracked two years ago.

If your task is summarising support tickets, classifying emails, or drafting first-pass copy, a good open weights model does it at 85-95% parity. The difference between that and GPT-4 on your job is often statistical noise — but the cost difference at scale is very real.

Run the numbers. If you're processing millions of requests on stable, predictable volume, self-hosted open weights can cut your per-request cost dramatically once you clear the break-even. That break-even is usually higher than the open source enthusiasts admit, but it exists.

When NOT to go open weights: low or spiky volume, no DevOps capacity, or a task that genuinely needs frontier reasoning. If you're a five-person startup with bursty traffic and no infra team, self-hosting is a trap. Pay for the API. Your weekends are worth more than the savings.
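
If it helps to make that checklist explicit, here's a tiny sketch that turns the three warning signs into code. The function and argument names are my own invention, and what counts as "high and steady" is a judgment call for you to make, not something the code can decide.

```python
def self_hosting_red_flags(volume_is_high_and_steady: bool,
                           has_devops_capacity: bool,
                           needs_frontier_reasoning: bool) -> list[str]:
    """Return the reasons NOT to self-host; an empty list means no red flags."""
    flags = []
    if not volume_is_high_and_steady:
        flags.append("low or spiky volume")
    if not has_devops_capacity:
        flags.append("no DevOps capacity")
    if needs_frontier_reasoning:
        flags.append("task needs frontier reasoning")
    return flags


# The five-person startup with bursty traffic and no infra team.
startup = self_hosting_red_flags(
    volume_is_high_and_steady=False,
    has_devops_capacity=False,
    needs_frontier_reasoning=False,
)
print("Pay for the API:" if startup else "Worth costing out self-hosting:", startup)
# Pay for the API: ['low or spiky volume', 'no DevOps capacity']
```

Any single flag is enough to make the API the safer default. An empty list doesn't mean self-hosting wins, only that it's worth running the break-even numbers.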

The open weights vs closed source debate isn't open winning or closed winning. It's matching the tool to the job. Spend your engineering hours on the problem that actually moves your business — not on babysitting GPUs to save pennies on tasks the cheap model already nails.

Frequently Asked Questions

What is the difference between open weights and closed source LLMs?

Open weights LLMs publish their trained parameters, so you can download, run, and fine-tune them yourself. Closed source LLMs keep their weights locked behind an API — you send a prompt and get a response, but never touch the model. The core difference is control versus convenience.

Are open source LLMs catching up to GPT-4?

Yes, steadily. Open weights models reportedly hit 85-95% parity with closed source leaders on standard benchmarks. But GPT-4 still leads on complex reasoning and instruction-following. They're catching up to a moving target, which is a bit like racing your own shadow at sunset — you gain ground, but the finish line keeps shifting.

How do you deploy an open weights LLM locally?

Tools like Ollama and LM Studio let you download a quantized open weights model and run it on a decent laptop. For production, you'll need proper GPUs and inference servers. Local deployment got dramatically easier through 2024 thanks to quantization, which shrinks models to fit consumer hardware.

Which is better for enterprise: open weights or closed source LLMs?

Depends on volume, data sensitivity, and your DevOps capacity. High, steady volume with infrastructure favours open weights. Spiky or low volume with no engineering team favours closed source APIs. There's no universal winner — only the right fit for your specific load and budget.

Is it cheaper to run open source LLMs than closed source APIs?

Only at high, steady volume. The model is free, but GPUs, hosting, and engineering time are not. APIs charge per token with zero infrastructure. The break-even point is usually higher than people expect — run the actual maths before assuming open source saves you money.

What does open weights mean in AI?

Open weights means the model's trained parameters are publicly available to download and use. It's not the same as fully open source, which would also include training data and code. You get the finished engine, not the factory that built it.

How do you fine-tune open weights models for specialized tasks?

Techniques like LoRA let you train a small slice of parameters on your own data instead of retraining the whole model. It's cheap and effective for narrow, domain-specific tasks. This is the open weights superpower — closed source models offer limited fine-tuning at best, and you're still renting.

Can open source LLMs really match closed source model quality?

On many tasks, yes — summarisation, classification, coding help, standard chat. They reportedly reach 85-95% parity on benchmarks. On frontier reasoning and the hardest edge cases, closed models still hold a 5-15% edge. Match the tool to the job, not the leaderboard.

The short version, with one last pun

The open weights vs closed source LLMs gap is real and it's shrinking. Open models now nail most everyday tasks at 85-95% parity, often cheaper at scale. Closed models keep their lead on the frontier — the hard reasoning, the polish, the emerging tricks. Pick based on your volume, your data, and whether you've got someone to babysit the GPUs. The model might be open, but the decision shouldn't be a closed book. (I'll see myself out. Weights and all.)