Free-Model Playbook for Claude Code and Codex

The $200/Month Problem

Claude Code Max costs $200/month. Codex Pro costs $200/month. For a professional developer shipping production code daily, those are reasonable numbers. For a student, a hobbyist, a bootstrapped founder building nights and weekends, or an operator in a country where $200 is a month’s rent — they’re a wall.

But here’s what changed in 2026: both Claude Code and Codex CLI now support third-party model backends. You can point them at any OpenAI-compatible API endpoint. That means you can route them through gateways that aggregate dozens of free models — and keep the agent harness you already know while paying nothing for inference.

A Hacker News thread titled “Reallocating $100/Month Claude Code Spend to Zed and OpenRouter” captured the shift. Developers are discovering that 80% of their coding tasks — completions, single-file edits, test generation, boilerplate — don’t need a frontier model. They need a model that’s good enough, available right now, and free.

This playbook covers two gateway options (OpenRouter and OmniRoute), the best free models for coding in June 2026, and step-by-step setup for both Claude Code and Codex CLI. If you’re looking for the local GPU approach instead, see our companion piece on running your coding agent on local weights.

OpenRouter’s team announced expanded free model support in June 2026, adding GLM-5.2 and several other coding-capable models to the free tier — making the gateway approach viable for serious development work.

Two Gateway Options

You have two paths to free models. One is turnkey. The other gives you more control. Pick based on how much infrastructure you want to manage.

OpenRouter: The Turnkey Path

OpenRouter is an established routing gateway that aggregates hundreds of models from dozens of providers behind a single API endpoint. As of June 2026, it offers 29 free models — including GLM-5.2, DeepSeek V4 Flash, Qwen3-Coder, and Devstral 2. No local server, no Docker, no GPU. Sign up, get an API key, set three environment variables, and go.

The tradeoff: you’re still routing through a third party. Your prompts hit OpenRouter’s servers before reaching the model provider. Rate limits on the free tier are real (20 requests/min, 200/day). And you’re subject to whatever availability and latency OpenRouter’s routing layer introduces.

For most developers testing the waters, this is the right starting point.

OmniRoute: The Self-Hosted Path

OmniRoute is a newer, open-source alternative (5.1K stars and climbing). It’s a self-hosted gateway you run locally that aggregates 160+ providers, with 50+ offering free tiers. The headline feature beyond provider count is token compression: OmniRoute’s RTK (Request Token Kompression) and Caveman modes claim 15–95% savings on token usage by compressing prompts before they hit the provider.

OmniRoute also supports MCP and A2A protocols natively, which matters if you’re building agent-to-agent workflows. And because it runs on your machine, your prompts never leave your network until they hit the model provider directly.

The tradeoff: you need to run and maintain a local server. It’s a Python app, not a Docker one-liner (though Docker support exists). Configuration is more involved. But for operators who want full control over routing, fallback strategies, and token economics, OmniRoute is the power-user option.

OpenRouter vs OmniRoute — OpenRouter is a hosted service: zero setup, rate-limited free tier, your prompts transit their servers. OmniRoute is self-hosted infrastructure: more setup, no rate limits beyond the upstream provider's, prompts route directly from your machine to the model provider. Pick based on whether you value convenience or control.

Best Free Models for Coding (June 2026)

Not every free model can drive an agent. The gap between “generates plausible code” and “reliably calls tools, edits files, runs tests, and iterates” is where most free-model setups break. Here are the models worth pointing your coding agent at today.

GLM-5.2 (Z.ai / Zhipu)

The breakout model of June 2026. GLM-5.2 is MIT-licensed, uses a 753B Mixture of Experts architecture with a 1M token context window, and scores 62.1% on SWE-bench Pro — putting it in frontier territory for coding tasks. Nathan Lambert’s analysis on Interconnects called it “the step change for open” and he’s not wrong: this is the first fully open-source model that competes with proprietary frontier models on real-world coding benchmarks.

Where to run it free:

OpenRouter: Available on the free tier with standard rate limits
Cloudflare Workers AI: Added June 16, 2026 — run it on Cloudflare’s edge network
Z.ai direct: The model creator offers free API access
Multiple other providers catalogued by DevelopersDigest

For a deep dive on the architecture, DataCamp’s breakdown covers the 753B MoE design, training approach, and benchmark positioning.

DeepSeek V4 Flash

Free for a limited promotional period. Strong reasoning capabilities, particularly on multi-step problems. If you’ve used DeepSeek with Claude Code before, V4 Flash is the latest in that lineage. The free window won’t last forever, but while it’s open, it’s one of the strongest options for complex tasks.

Qwen3-Coder 480B

Alibaba’s specialized coding model. 262K context window, state-of-the-art on agentic coding benchmarks, available on OpenRouter’s free tier. The large context is particularly useful for repo-wide operations where you need the model to hold multiple files in memory simultaneously.

Devstral 2

Mistral’s lightweight coding model. Not the strongest on benchmarks, but fast and reliable for quick completions, simple edits, and boilerplate generation. Good as a “fast path” model in a multi-model routing strategy where you reserve heavier models for complex tasks.

Routing strategy — Point routine tasks (completions, single-file edits, tests) at Devstral 2 or DeepSeek V4 Flash for speed. Reserve GLM-5.2 or Qwen3-Coder for multi-file refactors and complex reasoning. This mirrors the 80/20 hybrid pattern from the local-weights playbook, but with free cloud models instead of local GPU.

Step-by-Step: Claude Code + OpenRouter

This is the fastest path from “paying $200/month” to “paying nothing.” Following the official OpenRouter cookbook and tutorial:

1. Get an OpenRouter API key

Sign up at openrouter.ai and generate an API key. Free accounts get access to all free-tier models with rate limits of 20 requests/minute and 200 requests/day.

2. Set the environment variables

export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_API_KEY="sk-or-v1-your-key-here"
export ANTHROPIC_MODEL="z-ai/glm-5.2"

That’s it. Three lines. Claude Code will now route all requests through OpenRouter to whichever model you specified.

3. Log out of the native Anthropic session

If you were previously using Claude Code with a native Anthropic account:

claude /logout

This ensures Claude Code uses your OpenRouter credentials instead of trying to authenticate with Anthropic’s servers.

4. Select your model

Inside Claude Code, you can switch models on the fly:

/model z-ai/glm-5.2
/model deepseek/deepseek-v4-flash
/model qwen/qwen3-coder-480b

The MindStudio guide walks through model selection criteria in more detail — which models support tool calling, which handle long contexts, and which are best for specific coding tasks.

5. Understand the rate limits

Free tier on OpenRouter: 20 requests/minute, 200 requests/day. For a focused coding session, 200 requests is roughly 2–3 hours of active agent work. If you hit the limit, you either wait for the daily reset or add a small credit balance for pay-as-you-go overflow.

Rate limit tip — The free tier rate limit of 200 requests/day is per-model, not per-account. If you burn through your GLM-5.2 quota, you can switch to DeepSeek V4 Flash and get another 200. Cycle through models to extend your free runway.

Step-by-Step: Claude Code + OmniRoute

For operators who want full control over routing, fallback strategies, and token compression. Based on the OmniRoute documentation:

1. Install OmniRoute

git clone https://github.com/diegosouzapw/OmniRoute.git
cd OmniRoute
pip install -r requirements.txt

2. Configure providers

OmniRoute’s configuration file lets you specify which providers to route through, in what priority order, and with what fallback strategy. For a free-only setup:

providers:
  - name: openrouter-free
    base_url: https://openrouter.ai/api/v1
    api_key: ${OPENROUTER_API_KEY}
    models: ["z-ai/glm-5.2", "deepseek/deepseek-v4-flash"]
    priority: 1
  - name: cloudflare-workers
    base_url: https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/v1
    api_key: ${CF_API_TOKEN}
    models: ["@cf/z-ai/glm-5.2"]
    priority: 2

3. Start the gateway

python omniroute.py --port 8080

OmniRoute exposes an OpenAI-compatible API at localhost:8080. If your primary provider is down or rate-limited, it automatically falls back to the next priority.

4. Point Claude Code at the gateway

export ANTHROPIC_BASE_URL="http://localhost:8080/v1"
export ANTHROPIC_API_KEY="omniroute-local"
export ANTHROPIC_MODEL="z-ai/glm-5.2"

5. Enable token compression (optional)

OmniRoute’s RTK and Caveman compression modes reduce the token count of your prompts before they hit the upstream provider. This is particularly useful on rate-limited free tiers where every request counts:

python omniroute.py --port 8080 --compression caveman

The project reports 15–95% token savings depending on prompt type. Coding prompts with lots of boilerplate code in context tend to compress well.

Step-by-Step: Codex CLI + Free Models

Codex CLI’s setup is less straightforward than Claude Code’s. OpenAI deprecated the wire_api approach in favor of the Responses API, which means Codex CLI doesn’t natively accept a simple base URL swap the way Claude Code does. You need a translation layer.

Option A: OpenRouter BYOK (Bring Your Own Key)

The Knightli guide walks through configuring Codex CLI with OpenRouter’s BYOK mode, which translates between the Responses API format Codex expects and the Chat Completions API that free model providers expose.

export OPENAI_BASE_URL="https://openrouter.ai/api/v1"
export OPENAI_API_KEY="sk-or-v1-your-key-here"

Then in your Codex config:

{
  "model": "z-ai/glm-5.2",
  "provider": "openrouter"
}

Option B: codeproxy-ai/cli translation layer

For models that don’t support the Responses API format, the codeproxy-ai/cli project provides a local proxy that translates between API formats. This is the approach documented in the community gist for connecting Codex CLI to DeepSeek:

npm install -g @codeproxy-ai/cli
codeproxy --port 9090 --upstream https://openrouter.ai/api/v1
export OPENAI_BASE_URL="http://localhost:9090/v1"
export OPENAI_API_KEY="sk-or-v1-your-key-here"

Option C: OmniRoute as universal translator

OmniRoute handles the API format translation natively. If you’re already running it for Claude Code, point Codex at the same gateway:

export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="omniroute-local"

Codex CLI note — Codex CLI's Responses API requirement adds friction that Claude Code doesn't have. If you're choosing between the two tools and cost is your primary concern, Claude Code's simpler gateway integration is an advantage. For a broader comparison of the CLI landscape, see our coding agent CLI comparison.

Real-World Example: ai-berkshire

To see what a serious agent workload looks like on free models, consider ai-berkshire — a value investing agent framework built on Claude Code and Codex CLI. The framework runs multi-agent adversarial analysis: one agent builds a bull case for a stock, another builds a bear case, and a third synthesizes them into an investment thesis.

This is exactly the kind of multi-agent architecture that would cost serious money on frontier models. Each analysis run involves dozens of API calls across multiple agents. But here’s the insight: the research and synthesis agents don’t need frontier intelligence. They’re doing structured tasks — pulling financial data, formatting analysis, comparing metrics. A model like GLM-5.2 handles those tasks capably.

The architecture pattern that emerges is model-tiered agent routing:

Research agents (data gathering, summarization): Route to free models via OpenRouter or OmniRoute
Analysis agents (reasoning, comparison): Route to mid-tier models or free models with larger context
Synthesis agent (final judgment, nuance): Route to frontier model when the stakes justify it

This isn’t theoretical. It’s the same pattern every cost-conscious coding agent stack converges on: use the cheapest model that can do each job, and save the expensive model for the work that actually needs it.

When Free Models Break Down

Free models are not a free lunch. Here’s where they consistently fail in agent workflows:

Complex multi-file refactors. Any operation that requires coordinating changes across 5+ files with consistent naming, type signatures, and import paths. Free models can handle each file in isolation but lose coherence across the full changeset. This is a reasoning-depth problem, not a context-length problem — GLM-5.2’s 1M context doesn’t help if the model can’t plan across files.

Deep architectural reasoning. “Redesign this module to use event sourcing instead of CRUD” requires understanding patterns, tradeoffs, and implications that current free models don’t reliably handle. Frontier models have been heavily trained on architectural discussion; free models less so.

Very long multi-step agent loops. The compounding reliability problem applies: if each tool call succeeds 92% of the time on a free model versus 98% on a frontier model, a 10-step workflow succeeds 43% versus 82%. For short loops (3–5 steps), the difference is manageable. For long loops, it’s the difference between a useful agent and a frustrating one.

Subtle bug diagnosis. When the bug requires understanding the interaction between a race condition, a stale cache, and an off-by-one error in a pagination query, free models tend to fixate on one dimension and miss the others. This is where frontier models earn their keep.

The 80/20 Strategy

The practical approach is not “replace your paid model with a free one” but “identify the 80% of your work that free models handle well and stop paying for those tokens.”

Routine coding tasks that free models handle reliably:

Code completions and inline suggestions
Single-file edits with clear specifications
Test generation from existing code
Boilerplate and scaffolding
Documentation generation
Simple bug fixes with obvious root causes

Tasks that still justify paid tokens:

Multi-file refactors
Architecture-level design decisions
Complex debugging requiring deep reasoning
Performance optimization requiring system-level understanding
Security-sensitive code review

The math is simple. If 80% of your Claude Code usage is routine, switching those tasks to free models via OpenRouter turns a $200/month bill into $40/month — or $0/month if the free tier covers your volume.

Builder Implications

If you’re building tools, platforms, or agent frameworks, the free-model gateway pattern has implications beyond individual cost savings.

Multi-model is the default architecture now. The question is no longer “which model should my agent use?” but “which models, in what combination, for which tasks?” Gateway architecture — whether via OpenRouter, OmniRoute, or a custom routing layer — is becoming a core infrastructure component, not an optimization.

Access is a supply-chain risk. We’ve written about this in the context of local weights and model bans, but gateways add a different dimension. If your agent hardcodes a single model provider, any disruption to that provider — pricing changes, rate limit changes, deprecation, geopolitical restrictions — breaks your product. Multi-gateway, multi-model architectures aren’t just cheaper. They’re more resilient.

Free models are a growth lever. As developers are noting on X, free model support means your users can try frameworks like ai-berkshire without an API budget. That’s a distribution advantage. The frameworks that make it easy to run on GLM-5.2 via OpenRouter will get more users than the ones that require an Anthropic API key on day one.

Token compression changes the economics further. OmniRoute’s 15–95% compression range means the effective cost of even paid models drops substantially. For operators running high-volume agent workloads, the combination of free models for routine tasks and compressed tokens for premium tasks creates a cost structure that’s an order of magnitude cheaper than naive single-model usage.

Minimum viable setup — OpenRouter free tier + Claude Code + GLM-5.2 as default model. Total cost: $0. Time to set up: under 5 minutes. Upgrade path when you hit limits: add an OmniRoute gateway for fallback routing and token compression, or add a small OpenRouter credit balance for pay-as-you-go overflow on premium models.

The Operator’s Checklist

Start with OpenRouter. Sign up, get an API key, set three env vars. You’ll be running Claude Code on GLM-5.2 in under 5 minutes.
Test your actual workload. Spend a day using free models for everything. Note which tasks succeed and which need a better model. This gives you your personal 80/20 split.
Set up model cycling. When you hit the 200 request/day limit on one free model, switch to another. GLM-5.2 exhausted? Move to DeepSeek V4 Flash. Then Qwen3-Coder. Three models give you 600 free requests/day.
Add OmniRoute when you need more control. If you’re hitting rate limits regularly, want token compression, or need automatic fallback between providers, install OmniRoute and run it locally.
Keep a frontier model configured for the hard 20%. Whether it’s Claude Opus on the native API or a paid model via OpenRouter, have a one-command switch for the tasks that free models can’t handle.
If using Codex CLI, set up the translation layer early. The Responses API requirement adds friction — get it working before you need it.

The gateway routing pattern is converging fast. Six months ago, using free models with Claude Code required hacks and workarounds. Today it’s three environment variables and an official OpenRouter cookbook entry. The tooling is meeting the demand.

The $200/month subscription isn’t going away. For professional teams shipping production code, the reliability and speed of frontier models justify the cost. But the wall between “paying developer” and “priced-out developer” just got a door. Three env vars, a free API key, and a model that scores 62.1% on SWE-bench Pro. That’s the playbook.

The $200/Month Problem

Two Gateway Options

OpenRouter: The Turnkey Path

OmniRoute: The Self-Hosted Path

Best Free Models for Coding (June 2026)

GLM-5.2 (Z.ai / Zhipu)

DeepSeek V4 Flash

Qwen3-Coder 480B

Devstral 2

Step-by-Step: Claude Code + OpenRouter

Step-by-Step: Claude Code + OmniRoute

Step-by-Step: Codex CLI + Free Models

Real-World Example: ai-berkshire

When Free Models Break Down

The 80/20 Strategy

Builder Implications

The Operator’s Checklist

The AgentConn Weekly

Explore AI Agents