Small Open-Weights LLM Survey for the capOS Agent-Shell
Research notes on current (early 2026) open-weights language models in the
2-4 B active-parameter range, their suitability for the capability-served
planner described in docs/proposals/llm-and-agent-proposal.md, and a rough
compute-cost estimate for training a comparable model from scratch.
Primary sources: OpenRouter model catalog (https://openrouter.ai/api/v1/models,
353 models listed at survey time); empirical probe against OpenRouter’s
hosted endpoints using an agent-planner prompt; published training reports
(Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT
blog posts); Chinchilla scaling law (Hoffmann et al., 2022).
1. Candidate Landscape
Two families of candidates match “2-4 B active parameters”:
- Dense 2-4 B: inference FLOPs and memory footprint both scale with total parameters. Friendly to low-RAM hosts.
- MoE with 2-4 B active: inference FLOPs scale with active params, but total weights must be resident. Only viable on hosts with enough RAM to page-cache the full expert stack.
Dense contenders observed as of 2026-04-24:
| Model | Params | License | Context | Notes |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 4 B | Apache-2.0 | 32 K | Strong tool-use post-training |
| Qwen3-1.7B-Instruct | 1.7 B | Apache-2.0 | 32 K | Same family, smaller floor |
| Gemma 3 4B IT | 4 B | Gemma license | 128 K | Multilingual; verbose outputs |
| Llama 3.2 3B Instruct | 3 B | Llama 3.2 Community | 128 K | Permissive but not OSI |
| Ministral 3B (2512) | 3 B | Mistral Research License | 128 K | Non-commercial; blocks ISO redistribution |
| Phi-4-mini | 3.8 B | MIT | 16 K | Reasoning-leaning training |
| IBM Granite 4.0 H Micro | ~3 B | Apache-2.0 | 128 K | New architecture, less battle-tested |
| SmolLM3-3B (HuggingFace) | 3 B | Apache-2.0 | 64 K | Fully open data + training code |
MoE contenders with ~3 B active:
| Model | Active | Total | License | Context | q4 weight size |
|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | ~3 B | 30 B | Apache-2.0 | 262 K | ~18 GiB |
| Qwen3-Coder-30B-A3B-Instruct | ~3 B | 30 B | Apache-2.0 | 160 K | ~18 GiB |
| Qwen3-Next-80B-A3B-Instruct | ~3 B | 80 B | Apache-2.0 | 262 K | ~48 GiB |
| Qwen3.5-35B-A3B | ~3 B | 35 B | Apache-2.0 | 262 K | ~21 GiB |
| IBM Granite 4.0 Tiny (7B-A1B) | ~1 B | 7 B | Apache-2.0 | 128 K | ~4 GiB |
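The q4 weight sizes above follow from a simple bits-per-weight estimate; a sketch, assuming llama.cpp `q4_k_m` averages ~4.85 effective bits per weight (the true figure varies with the per-tensor quant mix, so treat outputs as +/-10%):

```python
# Rough GGUF file size from total parameter count and effective bits/weight.
# Assumption: q4_k_m ~= 4.85 bits/weight on average (varies by tensor mix).
def q4_size_gib(total_params: float, bits_per_weight: float = 4.85) -> float:
    return total_params * bits_per_weight / 8 / 2**30

for name, params in [("Qwen3-4B", 4e9), ("Qwen3-30B-A3B", 30e9),
                     ("Qwen3-Next-80B-A3B", 80e9)]:
    print(f"{name}: ~{q4_size_gib(params):.1f} GiB")
# -> ~2.3, ~16.9, ~45.2 GiB: within ~10% of the ~2.4 / ~18 / ~48 GiB
#    figures used in this survey.
```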
2. Empirical Probe
Prompt
Agent-planner system prompt: “You are a capOS shell planner. Given a goal
and typed tool descriptors (name + param schema), emit a single JSON
ActionPlan: {"steps":[{"tool":..,"args":..,"rationale":..}]}. Never
invoke tools. Only reference tools from the descriptor list. Output JSON
only, no prose.”
User prompt: three typed tool descriptors (ServiceSupervisor.restart,
NetworkStack.info, LogReader.tail) and the goal “Restart the network
stack, but first confirm it’s in a failed state by checking status and
last 20 log lines.”
The test exercises three properties a capOS planner needs:
- Correct step ordering (`info` + `tail` before `restart`).
- Correct arg packing for methods with and without arguments.
- Pure JSON output without Markdown fences, which the dispatcher must otherwise strip.
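These checks can be made mechanical on the dispatcher side. A minimal scoring sketch (a hypothetical helper written for this survey, not the capOS dispatcher API):

```python
import json
import re

# The three tool descriptors from the user prompt above.
KNOWN_TOOLS = {"ServiceSupervisor.restart", "NetworkStack.info", "LogReader.tail"}

def score_response(raw: str) -> dict:
    """Check one raw model reply against the three probe properties."""
    fenced = raw.strip().startswith("```")
    # Strip ``` / ```json fences: the tolerance a dispatcher would need.
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        plan = json.loads(body)
    except json.JSONDecodeError:
        return {"json_valid": False, "fenced": fenced}
    steps = plan.get("steps", []) if isinstance(plan, dict) else []
    tools = [step.get("tool") for step in steps]
    if "ServiceSupervisor.restart" in tools:
        i = tools.index("ServiceSupervisor.restart")
        # Both status checks must precede the restart step.
        order_ok = {"NetworkStack.info", "LogReader.tail"} <= set(tools[:i])
    else:
        order_ok = False
    return {"json_valid": True, "fenced": fenced,
            "order_correct": order_ok,
            "known_tools_only": set(tools) <= KNOWN_TOOLS}
```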
Results
| Model | JSON valid | Order correct | Fences | Arg shape |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | yes | yes | none | compact, correct |
| Qwen3-Next-80B-A3B-Instruct | yes | yes | none | correct, verbose |
| Qwen3.5-35B-A3B | yes | yes | none | correct |
| Qwen3-8B (proxy for Qwen3-4B) | yes | yes | none | correct |
| Gemma 3 4B IT | yes | yes | ```json fence | fabricated empty `status:""` arg on a zero-arg call |
| Ministral 3B (2512) | yes | yes | ```json fence | correct |
| Llama 3.2 3B Instruct | yes | no (restart before log check) | ``` fence | correct |
| IBM Granite 4.0 H Micro | no (three duplicate `steps` keys in one object) | — | none | — |
Qwen3-8B was used as a stand-in for Qwen3-4B because Qwen3-4B is not served on OpenRouter. Qwen3-family models below 8 B share the same post-training recipe, so output quality on structured agent tasks should be comparable, with mild degradation at 4 B and more noticeable degradation at 1.7 B.
Interpretation
- Qwen3-A3B family produces the tightest, correctly-ordered plans with no markdown fencing. Best quality-per-active-parameter in the sample.
- Dense 3-4 B Qwen / Gemma / Ministral produce correct plans but add Markdown fences or small schema drift that the dispatcher must tolerate.
- Llama 3.2 3B violated the ordering constraint – planner-unsafe without additional prompt discipline or rejection sampling.
- Granite 4.0 H Micro emitted invalid JSON (duplicate object keys). Retest before adopting; may be endpoint-specific rather than the model.
3. Size Thresholds for capOS Use Cases
Mapping observed behaviour to the proposal’s workloads:
| Workload | Minimum credible size | Notes |
|---|---|---|
| NPC dialogue, canned-reply replacement | 1.7 B dense | Templated plans only; refusal fragile |
| Short-list planner (≤5 typed tools) | 3 B dense | Floor for credible multi-step ordering |
| Long-list planner, plan refine, step-up reasoning | 4 B dense or 30B-A3B | Refusal, self-critique, schema-strict JSON |
| Log / audit summarisation, NPC with context | 4 B dense or 30B-A3B | Needs retrieval grounding regardless |
| Embedding / vector retrieval (TextEmbedder) | separate small encoder | Not a generator workload |
Proposal §“Built-in Local Model” sketches a 0.7-2.0 GiB weight budget (q4
class). Qwen3-4B at q4_k_m is ~2.4 GiB, narrowly over that budget.
Resolutions:
- Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
- Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at `q5_k_m`, ~2.0 GiB), acknowledging weaker planner quality.
- Ship Qwen3-1.7B as default and allow `ModelAdmin.loadWeights` to install Qwen3-4B or a 30B-A3B model post-install.
4. Recommendation for the Proposal
- Default built-in (ISO): Qwen3-4B-Instruct at `q4_k_m`, Apache-2.0. Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB. Fall back to SmolLM3-3B if fully open training-data provenance is required for the trusted-build-inputs chain.
- Optional installed upgrade: Qwen3-30B-A3B-Instruct-2507 for hosts with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense model, materially better planning quality.
- Reject for default ship:
  - Ministral 3B (Mistral Research License: cannot redistribute on the ISO).
  - Llama 3.2 3B (failed ordering discipline in the probe; the Llama 3.2 Community License also restricts downstream use).
  - IBM Granite 4.0 H Micro, until the JSON-output issue is confirmed or refuted on a local run.
- Update Open Question 3 of the proposal (“smallest credible local model”) with the observed threshold: 3 B dense is the floor for a planner that can be trusted with ordering constraints; 1.7 B is restricted to NPC / canned-reply territory.
5. Training Compute Cost for a Custom 2-B-Active Model
A rough order-of-magnitude estimate, in case the project considers a purpose-trained capOS planner model rather than a fine-tune.
5.1 FLOPs Budget
Forward+backward training compute is approximately 6 × N_active × D_tokens (the 6ND rule).
Modern open models have drifted far past Chinchilla’s 20-tokens-per-param
ratio; 5k-15k tokens per param is typical.
| Target | Active | Tokens | FLOPs |
|---|---|---|---|
| Chinchilla-minimum 2 B dense | 2 B | 40 B | 4.8e20 |
| Llama-3-ish 2 B dense | 2 B | 15 T | 1.8e23 |
| Qwen3-4B-ish 2 B dense | 2 B | 36 T | 4.3e23 |
| 30B-A3B MoE (3 B active, 15 T tok) | 3 B | 15 T | ~2.7e23 raw; ~4e23 with ~1.5x router/aux overhead |
5.2 Hardware -> Dollars
Reference: H100 SXM at ~40% MFU ~= 1.4e18 FLOPs / hour; cloud price $2-3 / hr (spot) to $3-4 (on-demand).
| Scale | H100-hours | USD (raw compute) | Wall-clock on 1024 H100 |
|---|---|---|---|
| Chinchilla 2 B (toy) | ~350 | ~$1 k | <1 hr |
| 2 B @ 15 T tok | ~130 k | ~$400 k | ~5 days |
| 2 B @ 36 T tok (SotA match) | ~310 k | ~$900 k | ~12 days |
| 30B-A3B @ 15 T tok | ~290 k | ~$870 k | ~12 days |
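Both tables reduce to three constants. A sketch reproducing the rows; every input is an estimate stated in §5.1-5.2, not a measurement:

```python
# 6*N*D FLOPs -> H100-hours -> dollars -> wall-clock days.
H100_FLOPS_PER_HR = 1.4e18  # H100 SXM at ~40% MFU (estimate above)
USD_PER_H100_HR = 3.0       # midpoint of the $2-4/hr cloud band above
CLUSTER = 1024              # H100s assumed for the wall-clock column

def train_cost(n_active, d_tokens, overhead=1.0):
    flops = 6 * n_active * d_tokens * overhead
    hours = flops / H100_FLOPS_PER_HR
    return flops, hours, hours * USD_PER_H100_HR, hours / CLUSTER / 24

for label, n, d, ov in [
    ("Chinchilla 2B (toy)", 2e9, 40e9,  1.0),
    ("2B @ 15T tok",        2e9, 15e12, 1.0),
    ("2B @ 36T tok",        2e9, 36e12, 1.0),
    ("30B-A3B @ 15T tok",   3e9, 15e12, 1.5),  # ~1.5x router/aux overhead
]:
    f, h, usd, days = train_cost(n, d, ov)
    print(f"{label}: {f:.1e} FLOPs, {h:,.0f} H100-hr, ${usd:,.0f}, {days:.1f} days")
# -> ~343 / ~129k / ~309k / ~289k H100-hours, matching the rounded rows above.
```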
5.3 Public Calibration
- Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
- Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
- Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
- MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.
The 6ND estimate agrees with these published runs within a factor of ~2, which is appropriate for an order-of-magnitude planning number.
5.4 Full-Project Multiplier
Final training run is typically 20-30% of total project compute. Realistic end-to-end budget:
- Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
- Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
- Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed compute cost; tokenizer corpus curation is non-trivial.
- Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.
Realistic end-to-end to ship a capOS-class 2 B model from scratch: $3-10 M plus a team. A 30B-A3B MoE adds ~50%.
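How the $3-10 M band falls out of those multipliers, as a sketch (every factor is a range from the bullets above):

```python
# Derive the end-to-end band from the 5.2 raw-compute number.
raw_usd = 0.9e6                  # 2B @ 36T-token run, table in 5.2
for ablation_mult in (3, 5):     # ablations / restarts / sweeps
    compute = raw_usd * ablation_mult * 1.10  # +10%: post-training midpoint
    print(f"{ablation_mult}x sweeps: ~${compute / 1e6:.1f} M compute")
# -> ~$3.0M and ~$5.0M of compute alone; the data pipeline and 6-12 months
#    of 3-8 ML engineers push the all-in total toward the top of the band.
```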
6. Practical Alternative
Training from scratch is almost certainly not worth it for the agent-shell use case. Two much cheaper paths that achieve the same capOS-specific behaviour:
- SFT / LoRA on Qwen3-4B or SmolLM3-3B for the capOS `ActionPlan` JSON schema, tool descriptors, and refusal patterns. ~10 k-100 k curated examples, 8x H100 for 1-10 days ~= $500-$10 k. Reproducible on commodity cloud. (Record shape sketched after this list.)
- Continued pretraining on a capOS-specific corpus (manifests, schemas, logs, proposals) if the base lacks domain coverage. Single-digit billions of tokens, $10 k-$100 k.
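The curated examples for the first path can be plain prompt/target pairs. A hypothetical record shape, mirroring the section-2 probe (field names and param schemas are illustrative, not a fixed capOS format):

```python
# One hypothetical SFT record for the ActionPlan fine-tune.
record = {
    "system": "You are a capOS shell planner. ... Output JSON only, no prose.",
    "tools": [
        {"name": "ServiceSupervisor.restart", "params": {"service": "string"}},
        {"name": "NetworkStack.info",         "params": {}},
        {"name": "LogReader.tail",            "params": {"lines": "int"}},
    ],
    "goal": "Restart the network stack, but first confirm it is in a failed state.",
    "target": {"steps": [
        {"tool": "NetworkStack.info", "args": {},
         "rationale": "confirm failed state"},
        {"tool": "LogReader.tail", "args": {"lines": 20},
         "rationale": "check last 20 log lines"},
        {"tool": "ServiceSupervisor.restart", "args": {"service": "network"},
         "rationale": "recover"},
    ]},
}
```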
The only strong reason to train from scratch would be a fully verifiable
weight provenance chain tied to docs/trusted-build-inputs.md. Even then,
a reproducible fine-tune of a known base with a signed recipe captures
most of the benefit at ~1% of the cost.
6a. nanoGPT / nanochat Scale Reference
Karpathy’s nanoGPT repo reproduces GPT-2 small (124 M params: 12 layers,
768 hidden, 12 heads) as its headline config. Karpathy’s follow-up
nanochat (github.com/karpathy/nanochat) ships a full pretrain + SFT
pipeline and uses model depth (d) as the size dial rather than
parameter count. The README is the only authoritative source; the numbers
below are quoted from it, not extrapolated.
- d12 – “GPT-1 sized”, ~5 min pretraining for quick experiments.
- d20 – documented speedrun tier: “$48 (~2 hours of 8xH100 GPU node)”, ~$15 on spot instance, “well below $100”. This is the headline reproducibility tier.
- d24 – appears on the leaderboard as a “slightly overtrained baseline.”
- d26 – “GPT-2 capability happens to be approximately depth 26”; latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for comparison.
The README does not publish explicit parameter counts per depth; the mapping from depth to params requires inspecting the config code.
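Once (n_layers, d_model) are read out of that config, the standard GPT-2-style count gives a quick sanity check; a sketch, validated against the 124 M GPT-2 small figure quoted above (no depth-to-width rule is assumed here):

```python
def gpt2_style_params(n_layers: int, d_model: int,
                      vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate decoder param count: 12*d^2 per block (4d^2 attention +
    8d^2 MLP), plus tied token embeddings and learned positions; ignores
    biases and LayerNorms (<1%)."""
    return 12 * n_layers * d_model**2 + vocab * d_model + n_ctx * d_model

print(f"{gpt2_style_params(12, 768) / 1e6:.0f} M")  # -> 124 M: GPT-2 small
# For nanochat depths, plug in n_layers and d_model as read from the config
# code; the depth->width rule itself is deliberately not guessed here.
```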
Capability mapping to the capOS planner task (empirical, based on same-size published models rather than nanochat runs themselves):
| nanochat scale | Rough param band | Planner capability |
|---|---|---|
| d12 | GPT-1-class, ~50-100 M | Toy completion only, no planner |
| d20 | likely ~100-200 M band | Templated NPC lines; not a planner |
| d26 | GPT-2-class, ~100-400 M band | Simple JSON under strict priming; schema drift common |
| Hypothetical d30+ | unclear (not in README) | Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2 |
Training a nanochat-class model from scratch fits a research-OS budget in a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is single-digit hours on 8xH100. That is the only scale at which “capOS ships a weight-provenance-complete default planner” is financially plausible without multi-million-dollar compute.
7. Open Follow-Ups
- Verify Granite 4.0 H Micro JSON behaviour on a local `llama.cpp` run rather than the OpenRouter endpoint; the probe may have hit a streaming / formatting quirk specific to the provider.
- Measure `q4_k_m` tokens-per-second for Qwen3-4B and Qwen3-1.7B on the CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC). No numbers are captured here; required before committing to a default.
- Evaluate an embedding model separately (`bge-m3`, `nomic-embed`, `gte-modernbert`) for the `TextEmbedder` capability. Out of scope for this survey.
- Revisit in 6 months: the 2-4 B frontier is moving monthly as of early 2026, and “best open weight” today may be superseded before the proposal’s Phase 2 begins.
- nanochat d30+ quality and pricing. The README documents tiers up
to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published
numbers exist for d30 or beyond. Open questions, before committing to
an in-tree from-scratch provenance model:
- What is the parameter count for d30 (and d28, d32)? Derive it from the nanochat config code rather than inferring it.
- What training time and cost does d30 require to reach a non-trivial SFT-able checkpoint on the same 8xH100 setup? Expected band is roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but this needs measurement – depth scaling of wall-clock is not linear once the model stops fitting comfortably in per-GPU memory.
- Does a d30-scale nanochat + capOS-specific SFT approach the Qwen3-1.7B planner floor on the section-2 probe? If yes, a provenance-complete default planner becomes realistic for ~$500-$5 k per full run (pretrain + SFT + ablations). If no, provenance has to be bought by fine-tuning a larger external base (Qwen3-1.7B or SmolLM3-3B) and accepting the weaker provenance story.
- Tokenizer choice for any capOS from-scratch or continue-pretrain path. Independent of model scale or architecture. A capOS-specific tokenizer with reserved tokens for `ActionPlan` JSON structure, Cap’n Proto type IDs, capability interface names, and common schema keywords is plausible at the nanochat-class budget and may materially reduce tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the tokenizer is fixed by the base, and this question collapses to “what special tokens can be added without retraining embeddings” (see the sketch below).
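For the fine-tune case, the special-token question looks like this in practice; a sketch using the Hugging Face `transformers` API, with hypothetical token names and an assumed base-model repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical reserved tokens for ActionPlan structure; the real set would
# be designed against the capOS schema. Repo id below is assumed, not checked.
NEW_TOKENS = ["<|plan|>", "<|step|>", "<|tool|>", "<|args|>", "<|rationale|>"]

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

added = tok.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
if added:
    # New embedding rows start randomly initialised: this is the "without
    # retraining embeddings" caveat, since the tokens only become useful
    # after an SFT pass teaches the model to emit them.
    model.resize_token_embeddings(len(tok))
```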