Small Open-Weights LLM Survey for the capOS Agent-Shell
Research notes on current (early 2026) open-weights language models in the
2-4 B active-parameter range, their suitability for the capability-served
planner described in docs/proposals/llm-and-agent-proposal.md, and a rough
compute-cost estimate for training a comparable model from scratch.
Primary sources: OpenRouter model catalog (https://openrouter.ai/api/v1/models,
353 models listed at survey time); empirical probe against OpenRouter’s
hosted endpoints using an agent-planner prompt; published training reports
(Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT
blog posts); Chinchilla scaling law (Hoffmann et al., 2022).
1. Candidate Landscape
Two families of candidates match “2-4 B active parameters”:
- Dense 2-4 B: inference FLOPs and memory footprint both scale with total parameters. Friendly to low-RAM hosts.
- MoE with 2-4 B active: inference FLOPs scale with active params, but total weights must be resident. Only viable on hosts with enough RAM to page-cache the full expert stack.
Dense contenders observed as of 2026-04-24:
| Model | Params | License | Context | Notes |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 4 B | Apache-2.0 | 32 K | Strong tool-use post-training |
| Qwen3-1.7B-Instruct | 1.7 B | Apache-2.0 | 32 K | Same family, smaller floor |
| Gemma 3 4B IT | 4 B | Gemma license | 128 K | Multilingual; verbose outputs |
| Llama 3.2 3B Instruct | 3 B | Llama 3.2 Community | 128 K | Permissive but not OSI |
| Ministral 3B (2512) | 3 B | Mistral Research License | 128 K | Non-commercial; blocks ISO redistribution |
| Phi-4-mini | 3.8 B | MIT | 16 K | Reasoning-leaning training |
| IBM Granite 4.0 H Micro | ~3 B | Apache-2.0 | 128 K | New architecture, less battle-tested |
| SmolLM3-3B (HuggingFace) | 3 B | Apache-2.0 | 64 K | Fully open data + training code |
MoE contenders with ~3 B active:
| Model | Active | Total | License | Context | q4 weight size |
|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | ~3 B | 30 B | Apache-2.0 | 262 K | ~18 GiB |
| Qwen3-Coder-30B-A3B-Instruct | ~3 B | 30 B | Apache-2.0 | 160 K | ~18 GiB |
| Qwen3-Next-80B-A3B-Instruct | ~3 B | 80 B | Apache-2.0 | 262 K | ~48 GiB |
| Qwen3.5-35B-A3B | ~3 B | 35 B | Apache-2.0 | 262 K | ~21 GiB |
| IBM Granite 4.0 Tiny (7B-A1B) | ~1 B | 7 B | Apache-2.0 | 128 K | ~4 GiB |
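The q4 weight sizes above follow from a simple bits-per-weight estimate; a sketch, assuming llama.cpp `q4_k_m` averages ~4.85 effective bits per weight (the true figure varies with the per-tensor quant mix, so treat outputs as +/-10%):

```python
# Rough GGUF file size from total parameter count and effective bits/weight.
# Assumption: q4_k_m ~= 4.85 bits/weight on average (varies by tensor mix).
def q4_size_gib(total_params: float, bits_per_weight: float = 4.85) -> float:
    return total_params * bits_per_weight / 8 / 2**30

for name, params in [("Qwen3-4B", 4e9), ("Qwen3-30B-A3B", 30e9),
                     ("Qwen3-Next-80B-A3B", 80e9)]:
    print(f"{name}: ~{q4_size_gib(params):.1f} GiB")
# -> ~2.3, ~16.9, ~45.2 GiB: within ~10% of the ~2.4 / ~18 / ~48 GiB
#    figures used in this survey.
```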
2. Empirical Probe
Prompt
Agent-planner system prompt: “You are a capOS shell planner. Given a goal
and typed tool descriptors (name + param schema), emit a single JSON
ActionPlan: {"steps":[{"tool":..,"args":..,"rationale":..}]}. Never
invoke tools. Only reference tools from the descriptor list. Output JSON
only, no prose.”
User prompt: three typed tool descriptors (ServiceSupervisor.restart,
NetworkStack.info, LogReader.tail) and the goal “Restart the network
stack, but first confirm it’s in a failed state by checking status and
last 20 log lines.”
The test exercises three properties a capOS planner needs:
- Correct step ordering (`info` + `tail` before `restart`).
- Correct arg packing for methods with and without arguments.
- Pure JSON output without Markdown fences, which the dispatcher must otherwise strip.
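These checks can be made mechanical on the dispatcher side. A minimal scoring sketch (a hypothetical helper written for this survey, not the capOS dispatcher API):

```python
import json
import re

# The three tool descriptors from the user prompt above.
KNOWN_TOOLS = {"ServiceSupervisor.restart", "NetworkStack.info", "LogReader.tail"}

def score_response(raw: str) -> dict:
    """Check one raw model reply against the three probe properties."""
    fenced = raw.strip().startswith("```")
    # Strip ``` / ```json fences: the tolerance a dispatcher would need.
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        plan = json.loads(body)
    except json.JSONDecodeError:
        return {"json_valid": False, "fenced": fenced}
    steps = plan.get("steps", []) if isinstance(plan, dict) else []
    tools = [step.get("tool") for step in steps]
    if "ServiceSupervisor.restart" in tools:
        i = tools.index("ServiceSupervisor.restart")
        # Both status checks must precede the restart step.
        order_ok = {"NetworkStack.info", "LogReader.tail"} <= set(tools[:i])
    else:
        order_ok = False
    return {"json_valid": True, "fenced": fenced,
            "order_correct": order_ok,
            "known_tools_only": set(tools) <= KNOWN_TOOLS}
```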
Results
| Model | JSON valid | Order correct | Fences | Arg shape |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | yes | yes | none | compact, correct |
| Qwen3-Next-80B-A3B-Instruct | yes | yes | none | correct, verbose |
| Qwen3.5-35B-A3B | yes | yes | none | correct |
| Qwen3-8B (proxy for Qwen3-4B) | yes | yes | none | correct |
| Gemma 3 4B IT | yes | yes | ```json fence | fabricated empty `status:""` arg on a zero-arg call |
| Ministral 3B (2512) | yes | yes | ```json fence | correct |
| Llama 3.2 3B Instruct | yes | no (restart before log check) | ``` fence | correct |
| IBM Granite 4.0 H Micro | no (three duplicate `steps` keys in one object) | — | none | — |
Qwen3-8B was used as a stand-in for Qwen3-4B because Qwen3-4B is not served on OpenRouter. Qwen3-family models below 8 B share the same post-training recipe, so output quality on structured agent tasks should be comparable, with mild degradation at 4 B and more noticeable degradation at 1.7 B.
Interpretation
- Qwen3-A3B family produces the tightest, correctly-ordered plans with no markdown fencing. Best quality-per-active-parameter in the sample.
- Dense 3-4 B Qwen / Gemma / Ministral produce correct plans but add Markdown fences or small schema drift that the dispatcher must tolerate.
- Llama 3.2 3B violated the ordering constraint – planner-unsafe without additional prompt discipline or rejection sampling.
- Granite 4.0 H Micro emitted invalid JSON (duplicate object keys). Retest before adopting; may be endpoint-specific rather than the model.
3. Size Thresholds for capOS Use Cases
Mapping observed behaviour to the proposal’s workloads:
| Workload | Minimum credible size | Notes |
|---|---|---|
| NPC dialogue, canned-reply replacement | 1.7 B dense | Templated plans only; refusal fragile |
| Short-list planner (≤5 typed tools) | 3 B dense | Floor for credible multi-step ordering |
| Long-list planner, plan refine, step-up reasoning | 4 B dense or 30B-A3B | Refusal, self-critique, schema-strict JSON |
| Log / audit summarisation, NPC with context | 4 B dense or 30B-A3B | Needs retrieval grounding regardless |
| Embedding / vector retrieval (TextEmbedder) | separate small encoder | Not a generator workload |
Proposal §“Built-in Local Model” sketches a 0.7-2.0 GiB weight budget (q4
class). Qwen3-4B at q4_k_m is ~2.4 GiB, narrowly over that budget.
Resolutions:
- Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
- Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at `q5_k_m`, ~2.0 GiB), acknowledging weaker planner quality.
- Ship Qwen3-1.7B as default and allow `ModelAdmin.loadWeights` to install Qwen3-4B or a 30B-A3B model post-install.
4. Recommendation for the Proposal
- Default built-in (ISO): Qwen3-4B-Instruct at `q4_k_m`, Apache-2.0. Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB. Fall back to SmolLM3-3B if fully open training-data provenance is required for the trusted-build-inputs chain.
- Optional installed upgrade: Qwen3-30B-A3B-Instruct-2507 for hosts with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense model, materially better planning quality.
- Reject for default ship:
  - Ministral 3B (Mistral Research License: cannot redistribute on the ISO).
  - Llama 3.2 3B (failed ordering discipline in the probe; the Llama 3.2 Community License also restricts downstream use).
  - IBM Granite 4.0 H Micro, until the JSON-output issue is confirmed or refuted on a local run.
- Update Open Question 3 of the proposal (“smallest credible local model”) with the observed threshold: 3 B dense is the floor for a planner that can be trusted with ordering constraints; 1.7 B is restricted to NPC / canned-reply territory.
5. Training Compute Cost for a Custom 2-B-Active Model
A rough order-of-magnitude estimate, in case the project considers a purpose-trained capOS planner model rather than a fine-tune.
5.1 FLOPs Budget
Forward+backward training compute is approximately 6 × N_active × D_tokens (the 6ND rule).
Modern open models have drifted far past Chinchilla’s 20-tokens-per-param
ratio; 5k-15k tokens per param is typical.
| Target | Active | Tokens | FLOPs |
|---|---|---|---|
| Chinchilla-minimum 2 B dense | 2 B | 40 B | 4.8e20 |
| Llama-3-ish 2 B dense | 2 B | 15 T | 1.8e23 |
| Qwen3-4B-ish 2 B dense | 2 B | 36 T | 4.3e23 |
| 30B-A3B MoE (3 B active, 15 T tok) | 3 B | 15 T | ~2.7e23 raw; ~4e23 with ~1.5x router/aux overhead |
5.2 Hardware -> Dollars
Reference: H100 SXM at ~40% MFU ~= 1.4e18 FLOPs / hour; cloud price $2-3 / hr (spot) to $3-4 (on-demand).
| Scale | H100-hours | USD (raw compute) | Wall-clock on 1024 H100 |
|---|---|---|---|
| Chinchilla 2 B (toy) | ~350 | ~$1 k | <1 hr |
| 2 B @ 15 T tok | ~130 k | ~$400 k | ~5 days |
| 2 B @ 36 T tok (SotA match) | ~310 k | ~$900 k | ~12 days |
| 30B-A3B @ 15 T tok | ~290 k | ~$870 k | ~12 days |
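Both tables reduce to three constants. A sketch reproducing the rows; every input is an estimate stated in §5.1-5.2, not a measurement:

```python
# 6*N*D FLOPs -> H100-hours -> dollars -> wall-clock days.
H100_FLOPS_PER_HR = 1.4e18  # H100 SXM at ~40% MFU (estimate above)
USD_PER_H100_HR = 3.0       # midpoint of the $2-4/hr cloud band above
CLUSTER = 1024              # H100s assumed for the wall-clock column

def train_cost(n_active, d_tokens, overhead=1.0):
    flops = 6 * n_active * d_tokens * overhead
    hours = flops / H100_FLOPS_PER_HR
    return flops, hours, hours * USD_PER_H100_HR, hours / CLUSTER / 24

for label, n, d, ov in [
    ("Chinchilla 2B (toy)", 2e9, 40e9,  1.0),
    ("2B @ 15T tok",        2e9, 15e12, 1.0),
    ("2B @ 36T tok",        2e9, 36e12, 1.0),
    ("30B-A3B @ 15T tok",   3e9, 15e12, 1.5),  # ~1.5x router/aux overhead
]:
    f, h, usd, days = train_cost(n, d, ov)
    print(f"{label}: {f:.1e} FLOPs, {h:,.0f} H100-hr, ${usd:,.0f}, {days:.1f} days")
# -> ~343 / ~129k / ~309k / ~289k H100-hours, matching the rounded rows above.
```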
5.3 Public Calibration
- Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
- Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
- Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
- MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.
The 6ND estimate agrees with these published runs within a factor of ~2, which is appropriate for an order-of-magnitude planning number.
5.4 Full-Project Multiplier
Final training run is typically 20-30% of total project compute. Realistic end-to-end budget:
- Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
- Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
- Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed compute cost; tokenizer corpus curation is non-trivial.
- Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.
Realistic end-to-end to ship a capOS-class 2 B model from scratch: $3-10 M plus a team. A 30B-A3B MoE adds ~50%.
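How the $3-10 M band falls out of those multipliers, as a sketch (every factor is a range from the bullets above):

```python
# Derive the end-to-end band from the 5.2 raw-compute number.
raw_usd = 0.9e6                  # 2B @ 36T-token run, table in 5.2
for ablation_mult in (3, 5):     # ablations / restarts / sweeps
    compute = raw_usd * ablation_mult * 1.10  # +10%: post-training midpoint
    print(f"{ablation_mult}x sweeps: ~${compute / 1e6:.1f} M compute")
# -> ~$3.0M and ~$5.0M of compute alone; the data pipeline and 6-12 months
#    of 3-8 ML engineers push the all-in total toward the top of the band.
```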
6. Practical Alternative
Training from scratch is almost certainly not worth it for the agent-shell use case. Two much cheaper paths that achieve the same capOS-specific behaviour:
- SFT / LoRA on Qwen3-4B or SmolLM3-3B for the capOS `ActionPlan` JSON schema, tool descriptors, and refusal patterns. ~10 k-100 k curated examples, 8x H100 for 1-10 days ~= $500-$10 k. Reproducible on commodity cloud. (Record shape sketched after this list.)
- Continued pretraining on a capOS-specific corpus (manifests, schemas, logs, proposals) if the base lacks domain coverage. Single-digit billions of tokens, $10 k-$100 k.
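The curated examples for the first path can be plain prompt/target pairs. A hypothetical record shape, mirroring the section-2 probe (field names and param schemas are illustrative, not a fixed capOS format):

```python
# One hypothetical SFT record for the ActionPlan fine-tune.
record = {
    "system": "You are a capOS shell planner. ... Output JSON only, no prose.",
    "tools": [
        {"name": "ServiceSupervisor.restart", "params": {"service": "string"}},
        {"name": "NetworkStack.info",         "params": {}},
        {"name": "LogReader.tail",            "params": {"lines": "int"}},
    ],
    "goal": "Restart the network stack, but first confirm it is in a failed state.",
    "target": {"steps": [
        {"tool": "NetworkStack.info", "args": {},
         "rationale": "confirm failed state"},
        {"tool": "LogReader.tail", "args": {"lines": 20},
         "rationale": "check last 20 log lines"},
        {"tool": "ServiceSupervisor.restart", "args": {"service": "network"},
         "rationale": "recover"},
    ]},
}
```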
The only strong reason to train from scratch would be a fully verifiable
weight provenance chain tied to docs/trusted-build-inputs.md. Even then,
a reproducible fine-tune of a known base with a signed recipe captures
most of the benefit at ~1% of the cost.
6a. nanoGPT / nanochat Scale Reference
Karpathy’s nanoGPT repo reproduces GPT-2 small (124 M params: 12 layers,
768 hidden, 12 heads) as its headline config. Karpathy’s follow-up
nanochat (github.com/karpathy/nanochat) ships a full pretrain + SFT
pipeline and uses model depth (d) as the size dial rather than
parameter count. The README is the only authoritative source; the numbers
below are quoted from it, not extrapolated.
- d12 – “GPT-1 sized”, ~5 min pretraining for quick experiments.
- d20 – documented speedrun tier: “$48 (~2 hours of 8xH100 GPU node)”, ~$15 on spot instance, “well below $100”. This is the headline reproducibility tier.
- d24 – appears on the leaderboard as a “slightly overtrained baseline.”
- d26 – “GPT-2 capability happens to be approximately depth 26”; latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for comparison.
The README does not publish explicit parameter counts per depth; the mapping from depth to params requires inspecting the config code.
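Once (n_layers, d_model) are read out of that config, the standard GPT-2-style count gives a quick sanity check; a sketch, validated against the 124 M GPT-2 small figure quoted above (no depth-to-width rule is assumed here):

```python
def gpt2_style_params(n_layers: int, d_model: int,
                      vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate decoder param count: 12*d^2 per block (4d^2 attention +
    8d^2 MLP), plus tied token embeddings and learned positions; ignores
    biases and LayerNorms (<1%)."""
    return 12 * n_layers * d_model**2 + vocab * d_model + n_ctx * d_model

print(f"{gpt2_style_params(12, 768) / 1e6:.0f} M")  # -> 124 M: GPT-2 small
# For nanochat depths, plug in n_layers and d_model as read from the config
# code; the depth->width rule itself is deliberately not guessed here.
```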
Capability mapping to the capOS planner task (empirical, based on same-size published models rather than nanochat runs themselves):
| nanochat scale | Rough param band | Planner capability |
|---|---|---|
| d12 | GPT-1-class, ~50-100 M | Toy completion only, no planner |
| d20 | likely ~100-200 M band | Templated NPC lines; not a planner |
| d26 | GPT-2-class, ~100-400 M band | Simple JSON under strict priming; schema drift common |
| Hypothetical d30+ | unclear (not in README) | Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2 |
Training a nanochat-class model from scratch fits a research-OS budget in a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is single-digit hours on 8xH100. That is the only scale at which “capOS ships a weight-provenance-complete default planner” is financially plausible without multi-million-dollar compute.
7. Open Follow-Ups
- Verify Granite 4.0 H Micro JSON behaviour on a local `llama.cpp` run rather than the OpenRouter endpoint; the probe may have hit a streaming / formatting quirk specific to the provider.
- Measure `q4_k_m` tokens-per-second for Qwen3-4B and Qwen3-1.7B on the CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC). No numbers are captured here; required before committing to a default.
- Evaluate an embedding model separately (`bge-m3`, `nomic-embed`, `gte-modernbert`) for the `TextEmbedder` capability. Out of scope for this survey.
- Revisit in 6 months: the 2-4 B frontier is moving monthly as of early 2026, and “best open weight” today may be superseded before the proposal’s Phase 2 begins.
- nanochat d30+ quality and pricing. The README documents tiers up
to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published
numbers exist for d30 or beyond. Open questions, before committing to
an in-tree from-scratch provenance model:
- What is the parameter count for d30 (and d28, d32)? Derive it from the nanochat config code rather than inferring it.
- What training time and cost does d30 require to reach a non-trivial SFT-able checkpoint on the same 8xH100 setup? Expected band is roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but this needs measurement – depth scaling of wall-clock is not linear once the model stops fitting comfortably in per-GPU memory.
- Does a d30-scale nanochat + capOS-specific SFT approach the Qwen3-1.7B planner floor on the section-2 probe? If yes, a provenance-complete default planner becomes realistic for ~$500-$5 k per full run (pretrain + SFT + ablations). If no, provenance has to be bought by fine-tuning a larger external base (Qwen3-1.7B or SmolLM3-3B) and accepting the weaker provenance story.
- Tokenizer choice for any capOS from-scratch or continue-pretrain path. Independent of model scale or architecture. A capOS-specific tokenizer with reserved tokens for `ActionPlan` JSON structure, Cap’n Proto type IDs, capability interface names, and common schema keywords is plausible at the nanochat-class budget and may materially reduce tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the tokenizer is fixed by the base, and this question collapses to “what special tokens can be added without retraining embeddings” (see the sketch below).
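For the fine-tune case, the special-token question looks like this in practice; a sketch using the Hugging Face `transformers` API, with hypothetical token names and an assumed base-model repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical reserved tokens for ActionPlan structure; the real set would
# be designed against the capOS schema. Repo id below is assumed, not checked.
NEW_TOKENS = ["<|plan|>", "<|step|>", "<|tool|>", "<|args|>", "<|rationale|>"]

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

added = tok.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
if added:
    # New embedding rows start randomly initialised: this is the "without
    # retraining embeddings" caveat, since the tokens only become useful
    # after an SFT pass teaches the model to emit them.
    model.resize_token_embeddings(len(tok))
```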