# Small Open-Weights LLM Survey for the capOS Agent-Shell

Research notes on current (early 2026) open-weights language models in the
2-4 B active-parameter range, their suitability for the capability-served
planner described in `docs/proposals/llm-and-agent-proposal.md`, and a rough
compute-cost estimate for training a comparable model from scratch.
Primary sources: OpenRouter model catalog (`https://openrouter.ai/api/v1/models`,
353 models listed at survey time); empirical probe against OpenRouter's
hosted endpoints using an agent-planner prompt; published training reports
(Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT
blog posts); Chinchilla scaling law (Hoffmann et al., 2022).

---

## 1. Candidate Landscape

Two families of candidates match "2-4 B active parameters":

- **Dense 2-4 B**: inference FLOPs and memory footprint both scale with
  total parameters. Friendly to low-RAM hosts.
- **MoE with 2-4 B active**: inference FLOPs scale with active params, but
  total weights must be resident. Only viable on hosts with enough RAM to
  page-cache the full expert stack.

Dense contenders observed as of 2026-04-24:

| Model | Params | License | Context | Notes |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 4 B | Apache-2.0 | 32 K | Strong tool-use post-training |
| Qwen3-1.7B-Instruct | 1.7 B | Apache-2.0 | 32 K | Same family, smaller floor |
| Gemma 3 4B IT | 4 B | Gemma license | 128 K | Multilingual; verbose outputs |
| Llama 3.2 3B Instruct | 3 B | Llama 3.2 Community | 128 K | Permissive but not OSI |
| Ministral 3B (2512) | 3 B | Mistral Research License | 128 K | **Non-commercial; blocks ISO redistribution** |
| Phi-4-mini | 3.8 B | MIT | 16 K | Reasoning-leaning training |
| IBM Granite 4.0 H Micro | ~3 B | Apache-2.0 | 128 K | New architecture, less battle-tested |
| SmolLM3-3B (HuggingFace) | 3 B | Apache-2.0 | 64 K | Fully open data + training code |

MoE contenders with ~3 B active:

| Model | Active | Total | License | Context | q4 weight size |
|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | ~3 B | 30 B | Apache-2.0 | 262 K | ~18 GiB |
| Qwen3-Coder-30B-A3B-Instruct | ~3 B | 30 B | Apache-2.0 | 160 K | ~18 GiB |
| Qwen3-Next-80B-A3B-Instruct | ~3 B | 80 B | Apache-2.0 | 262 K | ~48 GiB |
| Qwen3.5-35B-A3B | ~3 B | 35 B | Apache-2.0 | 262 K | ~21 GiB |
| IBM Granite 4.0 Tiny (7B-A1B) | ~1 B | 7 B | Apache-2.0 | 128 K | ~4 GiB |

## 2. Empirical Probe

### Prompt

Agent-planner system prompt: "You are a capOS shell planner. Given a goal
and typed tool descriptors (name + param schema), emit a single JSON
ActionPlan: `{"steps":[{"tool":..,"args":..,"rationale":..}]}`. Never
invoke tools. Only reference tools from the descriptor list. Output JSON
only, no prose."

User prompt: three typed tool descriptors (`ServiceSupervisor.restart`,
`NetworkStack.info`, `LogReader.tail`) and the goal "Restart the network
stack, but first confirm it's in a failed state by checking status and
last 20 log lines."
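
For reproducibility, a minimal version of the probe request against
OpenRouter's chat-completions endpoint. The endpoint and payload shape are
OpenRouter's documented API; the descriptor schemas and model slug below
are illustrative stand-ins, not a transcript of the survey run:

```python
# probe.py -- minimal re-run of the planner probe (sketch).
# Assumes OPENROUTER_API_KEY in the environment; the model slug and the
# descriptor schemas are illustrative, not the exact survey inputs.
import json
import os

import requests

SYSTEM = (
    "You are a capOS shell planner. Given a goal and typed tool descriptors "
    "(name + param schema), emit a single JSON ActionPlan: "
    '{"steps":[{"tool":..,"args":..,"rationale":..}]}. Never invoke tools. '
    "Only reference tools from the descriptor list. Output JSON only, no prose."
)
USER = json.dumps({
    "tools": [  # hypothetical param schemas for the three descriptors
        {"name": "ServiceSupervisor.restart", "params": {"service": "string"}},
        {"name": "NetworkStack.info", "params": {}},
        {"name": "LogReader.tail", "params": {"unit": "string", "lines": "int"}},
    ],
    "goal": "Restart the network stack, but first confirm it's in a failed "
            "state by checking status and last 20 log lines.",
})

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "qwen/qwen3-30b-a3b-instruct-2507",  # one probe target
        "messages": [{"role": "system", "content": SYSTEM},
                     {"role": "user", "content": USER}],
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```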

The test exercises three properties a capOS planner needs:

1. Correct **step ordering** (`info` + `tail` before `restart`).
2. Correct **arg packing** for methods with and without arguments.
3. **Pure JSON output** without Markdown fences, which the dispatcher must
   otherwise strip.
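
Scoring was automated along the following lines (a sketch, not the
original harness; arg-shape, property 2, was judged by eye). Duplicate-key
detection, which matters for the Granite result below, uses `json.loads`'
`object_pairs_hook`:

```python
import json
import re

def score(raw: str) -> dict:
    """Score one reply for JSON validity, fencing, and step order (sketch).
    Arg-shape (property 2) is not scored here."""
    fenced = raw.strip().startswith("```")                  # property 3
    body = re.sub(r"^```[a-z]*\s*|\s*```$", "", raw.strip())

    def reject_dup_keys(pairs):
        # Surface duplicate keys instead of silently keeping the last one;
        # this is the Granite failure mode in the results table below.
        keys = [k for k, _ in pairs]
        if len(keys) != len(set(keys)):
            raise ValueError(f"duplicate keys: {keys}")
        return dict(pairs)

    try:
        plan = json.loads(body, object_pairs_hook=reject_dup_keys)
    except ValueError:              # includes json.JSONDecodeError
        return {"json_valid": False, "fenced": fenced}

    tools = [step.get("tool") for step in plan.get("steps", [])]
    def idx(t):
        return tools.index(t) if t in tools else None
    restart = idx("ServiceSupervisor.restart")
    info, tail = idx("NetworkStack.info"), idx("LogReader.tail")
    order_ok = (None not in (restart, info, tail)
                and max(info, tail) < restart)              # property 1
    return {"json_valid": True, "fenced": fenced, "order_ok": order_ok}
```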

### Results

| Model | JSON valid | Order correct | Fences | Arg shape |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | yes | yes | none | compact, correct |
| Qwen3-Next-80B-A3B-Instruct | yes | yes | none | correct, verbose |
| Qwen3.5-35B-A3B | yes | yes | none | correct |
| Qwen3-8B (proxy for Qwen3-4B) | yes | yes | none | correct |
| Gemma 3 4B IT | yes | yes | ```json fence | fabricated empty `status:""` arg on zero-arg call |
| Ministral 3B (2512) | yes | yes | ```json fence | correct |
| Llama 3.2 3B Instruct | yes | **no** (restart before log check) | ``` fence | correct |
| IBM Granite 4.0 H Micro | **no** (three duplicate `steps` keys in one object) | — | none | — |

Qwen3-8B was used as a stand-in because Qwen3-4B is not served on
OpenRouter. The smaller Qwen3 models share the 8 B post-training recipe,
so output quality on structured agent tasks should be comparable, with
minor degradation at 4 B and more noticeable degradation at 1.7 B.

### Interpretation

- **Qwen3-A3B** family produces the tightest, correctly-ordered plans with
  no markdown fencing. Best quality-per-active-parameter in the sample.
- **Dense 3-4 B Qwen / Gemma / Ministral** produce correct plans but add
  Markdown fences or small schema drift that the dispatcher must tolerate.
- **Llama 3.2 3B** violated the ordering constraint -- planner-unsafe
  without additional prompt discipline or rejection sampling.
- **Granite 4.0 H Micro** emitted invalid JSON (duplicate object keys).
  Retest before adopting; the failure may be endpoint-specific rather than
  inherent to the model.

## 3. Size Thresholds for capOS Use Cases

Mapping observed behaviour to the proposal's workloads:

| Workload | Minimum credible size | Notes |
|---|---|---|
| NPC dialogue, canned-reply replacement | 1.7 B dense | Templated plans only; refusal fragile |
| Short-list planner (≤5 typed tools) | 3 B dense | Floor for credible multi-step ordering |
| Long-list planner, plan refine, step-up reasoning | 4 B dense or 30B-A3B | Refusal, self-critique, schema-strict JSON |
| Log / audit summarisation, NPC with context | 4 B dense or 30B-A3B | Needs retrieval grounding regardless |
| Embedding / vector retrieval (`TextEmbedder`) | separate small encoder | Not a generator workload |

Proposal §"Built-in Local Model" sketches a 0.7-2.0 GiB weight budget (q4
class). Qwen3-4B at `q4_k_m` is ~2.4 GiB, roughly 20% over the top of that
budget. Three possible resolutions:

1. Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
2. Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at `q5_k_m`,
   ~2.0 GiB), acknowledging weaker planner quality.
3. Ship Qwen3-1.7B as default and allow `ModelAdmin.loadWeights` to install
   Qwen3-4B or a 30B-A3B model post-install.
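
Back-of-envelope sizing behind these options, assuming ~5.1 effective
bits/weight for `q4_k_m` (it mixes 4-bit and 6-bit tensor types, so the
true average varies per model):

```python
def q4_gib(params_b: float, bits_per_weight: float = 5.1) -> float:
    """Rough q4_k_m file size in GiB. 5.1 bits/weight is an ASSUMED
    effective average (q4_k_m mixes 4-bit and 6-bit tensor types)."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

print(f"{q4_gib(4):.1f} GiB")    # ~2.4: Qwen3-4B, over the 2.0 GiB budget
print(f"{q4_gib(1.7):.1f} GiB")  # ~1.0: Qwen3-1.7B, comfortably inside
print(f"{q4_gib(30):.1f} GiB")   # ~17.8: 30B-A3B total (section 1 table)
```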

## 4. Recommendation for the Proposal

1. **Default built-in (ISO)**: Qwen3-4B-Instruct at `q4_k_m`, Apache-2.0.
   Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB.
   Fallback to SmolLM3-3B if a fully-open training-data provenance is
   required for the trusted-build-inputs chain.

2. **Optional installed upgrade**: Qwen3-30B-A3B-Instruct-2507 for hosts
   with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense, materially
   better planning quality.

3. **Reject for default ship**:
   - Ministral 3B (Mistral Research License -- cannot redistribute on ISO).
   - Llama 3.2 3B (failed ordering discipline in the probe; Llama 3.2
     Community License also restricts downstream use).
   - IBM Granite 4.0 H Micro until the JSON-output issue is confirmed or
     refuted on a local run.

4. **Update Open Question 3** of the proposal ("smallest credible local
   model") with the threshold: 3 B dense is the floor for a planner that
   can be trusted with ordering constraints; 1.7 B is restricted to NPC /
   canned-reply territory.

## 5. Training Compute Cost for a Custom 2-B-Active Model

Rough order-of-magnitude estimate, in case the project ever considers a
purpose-trained capOS planner model rather than a fine-tune.

### 5.1 FLOPs Budget

Training compute (forward + backward) is well approximated by the 6ND
rule, `C ~= 6 x N_active x D_tokens` FLOPs. Modern open models have
drifted far past Chinchilla's compute-optimal ~20 tokens per parameter;
for small models, 5k-15k tokens per parameter is now typical.

| Target | Active | Tokens | FLOPs |
|---|---|---|---|
| Chinchilla-minimum 2 B dense | 2 B | 40 B | 4.8e20 |
| Llama-3-ish 2 B dense | 2 B | 15 T | 1.8e23 |
| Qwen3-4B-ish 2 B dense | 2 B | 36 T | 4.3e23 |
| 30B-A3B MoE (3 B active, 15 T tok) | 3 B | 15 T | ~4e23 (2.7e23 x ~1.5 router/aux overhead) |
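
The dense rows are direct applications of the 6ND rule; for the MoE row
the ~1.5x router/aux overhead is treated as a flat multiplier, which is an
assumption rather than a measured figure:

```python
def train_flops(n_active: float, tokens: float, overhead: float = 1.0) -> float:
    """6ND estimate; `overhead` is an assumed MoE router/aux multiplier."""
    return 6 * n_active * tokens * overhead

print(f"{train_flops(2e9, 40e9):.1e}")        # 4.8e+20  Chinchilla minimum
print(f"{train_flops(2e9, 15e12):.1e}")       # 1.8e+23  Llama-3-ish
print(f"{train_flops(2e9, 36e12):.1e}")       # 4.3e+23  Qwen3-4B-ish
print(f"{train_flops(3e9, 15e12, 1.5):.1e}")  # ~4.0e+23 30B-A3B MoE
```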

### 5.2 Hardware -> Dollars

Reference: one H100 SXM at ~40% MFU sustains ~1.4e18 FLOPs/hour (989
TFLOP/s dense BF16 peak x 0.40 x 3600 s); cloud prices run $2-3/hr (spot)
to $3-4/hr (on-demand).

| Scale | H100-hours | USD (raw compute) | Wall-clock on 1024 H100 |
|---|---|---|---|
| Chinchilla 2 B (toy) | ~350 | ~$1 k | <1 hr |
| 2 B @ 15 T tok | ~130 k | ~$400 k | ~5 days |
| 2 B @ 36 T tok (SotA match) | ~310 k | ~$900 k | ~12 days |
| 30B-A3B @ 15 T tok | ~290 k | ~$870 k | ~12 days |
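
The conversion behind this table, under the stated MFU and price
assumptions:

```python
H100_FLOPS_PER_HOUR = 989e12 * 0.40 * 3600   # ~1.4e18 FLOPs/hr at 40% MFU
USD_PER_H100_HOUR = 3.0                      # assumed blended rate

def cost(flops: float, n_gpus: int = 1024) -> None:
    hours = flops / H100_FLOPS_PER_HOUR
    print(f"{hours:>9,.0f} H100-h   ~${hours * USD_PER_H100_HOUR:>9,.0f}"
          f"   {hours / n_gpus / 24:.1f} days on {n_gpus} GPUs")

cost(4.8e20)   # ~337 H100-h, ~$1 k (wall-clock <1 hr on 1024 GPUs)
cost(1.8e23)   # ~126 k H100-h, ~$380 k, ~5 days
cost(4.3e23)   # ~302 k H100-h, ~$906 k, ~12 days
```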

### 5.3 Public Calibration

- Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
- Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
- Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
- MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.

The 6ND estimate agrees with these published runs within a factor of ~2,
which is appropriate for an order-of-magnitude planning number.

### 5.4 Full-Project Multiplier

The final training run is typically only 20-30% of total project compute.
A realistic end-to-end budget also covers:

- Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
- Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
- Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed
  compute cost; tokenizer corpus curation is non-trivial.
- Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.

**Realistic end-to-end cost to ship a capOS-class 2 B model from scratch:
$3-10 M plus a team.** A 30B-A3B MoE adds ~50%.
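
The band composes from the multipliers above and the section-5.2 raw
numbers; the per-engineer loaded cost below is an assumed figure, and the
data-pipeline cost (which can equal or exceed compute) is left out:

```python
raw = (0.4e6, 0.9e6)        # raw pretrain compute band, USD (section 5.2)
sweeps = (3, 5)             # ablation / restart multiplier
team = (3 * 6 * 30e3, 8 * 12 * 30e3)   # engineers x months x ASSUMED
                                       # $30k/month loaded cost
low = raw[0] * sweeps[0] * 1.05 + team[0]    # +5% post-training
high = raw[1] * sweeps[1] * 1.15 + team[1]   # +15% post-training
print(f"${low/1e6:.1f}M .. ${high/1e6:.1f}M")  # ~$1.8M .. ~$8.1M
# Adding the data pipeline lands in the quoted $3-10M band.
```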

## 6. Practical Alternative

Training from scratch is almost certainly not worth it for the agent-shell
use case. Two much cheaper paths that achieve the same capOS-specific
behaviour:

1. **SFT / LoRA on Qwen3-4B or SmolLM3-3B** for the capOS `ActionPlan` JSON
   schema, tool descriptors, and refusal patterns. ~10 k-100 k curated
   examples, 8 x H100 for 1-10 days ~= **$500-$10 k**. Reproducible on
   commodity cloud; see the sketch after this list.

2. **Continued pretraining** on a capOS-specific corpus (manifests,
   schemas, logs, proposals) if the base lacks domain coverage. Single
   digits of B tokens, **$10 k-$100 k**.
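
A sketch of path 1 using Hugging Face `peft` + `trl`. The base-model name,
dataset file, hyperparameters, and LoRA target modules are all
illustrative assumptions, and `SFTTrainer`'s keyword API shifts between
`trl` versions, so treat this as the shape of the job rather than a
recipe:

```python
# lora_sft.py -- sketch of path 1; every constant here is an assumption.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

BASE = "Qwen/Qwen3-4B-Instruct-2507"   # or SmolLM3-3B for open provenance
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="bfloat16")

# Curated (prompt -> ActionPlan JSON) pairs; hypothetical local file.
dataset = load_dataset("json", data_files="capos_actionplans.jsonl")["train"]

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="out", num_train_epochs=2,
                   per_device_train_batch_size=4, learning_rate=1e-4),
)
trainer.train()
trainer.save_model("out/capos-planner-lora")   # adapter weights only
```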

The only strong reason to train from scratch would be a fully verifiable
weight provenance chain tied to `docs/trusted-build-inputs.md`. Even then,
a reproducible fine-tune of a known base with a signed recipe captures
most of the benefit at ~1% of the cost.

## 6a. nanoGPT / nanochat Scale Reference

Karpathy's `nanoGPT` repo reproduces GPT-2 small (124 M params: 12 layers,
768 hidden, 12 heads) as its headline config. Karpathy's follow-up
`nanochat` (github.com/karpathy/nanochat) ships a full pretrain + SFT
pipeline and uses model **depth** (d) as the size dial rather than
parameter count. The README is the only authoritative source; the numbers
below are quoted from it, not extrapolated.

- **d12** -- "GPT-1 sized", ~5 min pretraining for quick experiments.
- **d20** -- documented speedrun tier: **"$48 (~2 hours of 8xH100 GPU
  node)"**, ~$15 on spot instance, "well below $100". This is the
  headline reproducibility tier.
- **d24** -- appears on the leaderboard as a "slightly overtrained
  baseline."
- **d26** -- "GPT-2 capability happens to be approximately depth 26";
  latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr
  on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for
  comparison.

The README does **not** publish explicit parameter counts per depth; the
mapping from depth to params requires inspecting the config code.
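
Pending that inspection, a generic GPT-style count gives a rough map. The
width = 64 x depth aspect ratio and the GPT-2 vocabulary below are
assumptions carried over from GPT-2 small (12 layers x 64 = 768 hidden),
not nanochat's actual config:

```python
def gpt_params(depth: int, width_per_layer: int = 64, vocab: int = 50257) -> float:
    """Approximate decoder-only parameter count: 12*L*d^2 for the
    transformer blocks plus tied embeddings. Aspect ratio and vocab are
    ASSUMPTIONS borrowed from GPT-2, not nanochat's actual config."""
    d = depth * width_per_layer
    return 12 * depth * d**2 + vocab * d

for depth in (12, 20, 26, 30):
    print(f"d{depth}: ~{gpt_params(depth) / 1e6:.0f}M")
# d12 lands on GPT-2 small's ~124M, which sanity-checks the formula. The
# larger depths (~460M, ~950M, ~1.4B under THIS ratio) are only as good
# as the assumed aspect ratio and vocab -- hence the section-7 follow-up
# to derive real counts from the nanochat config code.
```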

Capability mapping to the capOS planner task (estimated from published
models of similar size, not from nanochat runs themselves):

| nanochat scale | Rough param band | Planner capability |
|---|---|---|
| d12 | GPT-1-class, ~50-100 M | Toy completion only, no planner |
| d20 | likely ~100-200 M band | Templated NPC lines; not a planner |
| d26 | GPT-2-class, ~100-400 M band | Simple JSON under strict priming; schema drift common |
| Hypothetical d30+ | unclear (not in README) | Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2 |

Training a nanochat-class model from scratch fits a research-OS budget in
a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is
single-digit hours on 8xH100. That is the only scale at which "capOS
ships a weight-provenance-complete default planner" is financially
plausible without multi-million-dollar compute.

## 7. Open Follow-Ups

- Verify Granite 4.0 H Micro JSON behaviour on a local `llama.cpp` run
  rather than the OpenRouter endpoint; the probe may have hit a streaming
  / formatting quirk specific to the provider.
- Measure `q4_k_m` tokens-per-second for Qwen3-4B and Qwen3-1.7B on the
  CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC).
  No numbers are captured here; they are required before committing to a
  default.
- Evaluate an embedding model separately (`bge-m3`, `nomic-embed`,
  `gte-modernbert`) for the `TextEmbedder` capability. Out of scope for
  this survey.
- Revisit in 6 months: the 2-4 B frontier is moving monthly as of early
  2026, and "best open weight" today may be superseded before the
  proposal's Phase 2 begins.
- **nanochat d30+ quality and pricing.** The README documents tiers up
  to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published
  numbers exist for d30 or beyond. Open questions, before committing to
  an in-tree from-scratch provenance model:
  - What is the parameter count for d30 (and d28, d32)? Derive it from
    the nanochat config code rather than inferring it.
  - What training time and cost does d30 require to reach a non-trivial
    SFT-able checkpoint on the same 8xH100 setup? Expected band is
    roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but
    this needs measurement -- depth scaling of wall-clock is not linear
    once the model stops fitting comfortably in per-GPU memory.
  - Does a d30-scale nanochat + capOS-specific SFT approach the
    Qwen3-1.7B planner floor on the section-2 probe? If yes, a
    provenance-complete default planner becomes realistic for ~$500-$5 k
    per full run (pretrain + SFT + ablations). If no, provenance has to
    be bought by fine-tuning a larger external base (Qwen3-1.7B or
    SmolLM3-3B) and accepting the weaker provenance story.
- **Tokenizer choice for any capOS from-scratch or continue-pretrain
  path.** Independent of model scale or architecture. A capOS-specific
  tokenizer with reserved tokens for `ActionPlan` JSON structure, Cap'n
  Proto type IDs, capability interface names, and common schema keywords
  is plausible at the nanochat-class budget and may materially reduce
  tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or
  a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the
  tokenizer is fixed by the base, and the question collapses to "what
  special tokens can be added without retraining embeddings"; see the
  sketch below.
