Small Open-Weights LLM Survey for the capOS Agent-Shell

Research notes on current (early 2026) open-weights language models in the 2-4 B active-parameter range, their suitability for the capability-served planner described in docs/proposals/llm-and-agent-proposal.md, and a rough compute-cost estimate for training a comparable model from scratch.

Primary sources: OpenRouter model catalog (https://openrouter.ai/api/v1/models, 353 models listed at survey time); empirical probe against OpenRouter’s hosted endpoints using an agent-planner prompt; published training reports (Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT blog posts); Chinchilla scaling law (Hoffmann et al., 2022).


1. Candidate Landscape

Two families of candidates match “2-4 B active parameters”:

  • Dense 2-4 B: inference FLOPs and memory footprint both scale with total parameters. Friendly to low-RAM hosts.
  • MoE with 2-4 B active: inference FLOPs scale with active params, but total weights must be resident. Only viable on hosts with enough RAM to page-cache the full expert stack.

Dense contenders observed as of 2026-04-24:

| Model | Params | License | Context | Notes |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 4 B | Apache-2.0 | 32 K | Strong tool-use post-training |
| Qwen3-1.7B-Instruct | 1.7 B | Apache-2.0 | 32 K | Same family, smaller floor |
| Gemma 3 4B IT | 4 B | Gemma license | 128 K | Multilingual; verbose outputs |
| Llama 3.2 3B Instruct | 3 B | Llama 3.2 Community | 128 K | Permissive but not OSI |
| Ministral 3B (2512) | 3 B | Mistral Research License | 128 K | Non-commercial; blocks ISO redistribution |
| Phi-4-mini | 3.8 B | MIT | 16 K | Reasoning-leaning training |
| IBM Granite 4.0 H Micro | ~3 B | Apache-2.0 | 128 K | New architecture, less battle-tested |
| SmolLM3-3B (HuggingFace) | 3 B | Apache-2.0 | 64 K | Fully open data + training code |

MoE contenders with ~3 B active:

| Model | Active | Total | License | Context | q4 weight size |
|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | ~3 B | 30 B | Apache-2.0 | 262 K | ~18 GiB |
| Qwen3-Coder-30B-A3B-Instruct | ~3 B | 30 B | Apache-2.0 | 160 K | ~18 GiB |
| Qwen3-Next-80B-A3B-Instruct | ~3 B | 80 B | Apache-2.0 | 262 K | ~48 GiB |
| Qwen3.5-35B-A3B | ~3 B | 35 B | Apache-2.0 | 262 K | ~21 GiB |
| IBM Granite 4.0 Tiny (7B-A1B) | ~1 B | 7 B | Apache-2.0 | 128 K | ~4 GiB |

2. Empirical Probe

Prompt

Agent-planner system prompt: “You are a capOS shell planner. Given a goal and typed tool descriptors (name + param schema), emit a single JSON ActionPlan: {"steps":[{"tool":..,"args":..,"rationale":..}]}. Never invoke tools. Only reference tools from the descriptor list. Output JSON only, no prose.”

User prompt: three typed tool descriptors (ServiceSupervisor.restart, NetworkStack.info, LogReader.tail) and the goal “Restart the network stack, but first confirm it’s in a failed state by checking status and last 20 log lines.”
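
For reference, the three descriptors look roughly like this (illustrative reconstruction; the parameter names are assumptions for this note, not the probe's exact wording or a capOS interface definition):

```json
[
  {"tool": "NetworkStack.info",         "params": {}},
  {"tool": "LogReader.tail",            "params": {"unit": "string", "lines": "int"}},
  {"tool": "ServiceSupervisor.restart", "params": {"service": "string"}}
]
```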

The test exercises three properties a capOS planner needs:

  1. Correct step ordering (info + tail before restart).
  2. Correct arg packing for methods with and without arguments.
  3. Pure JSON output without Markdown fences, which the dispatcher must otherwise strip.
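
A plan that satisfies all three properties would look roughly like this (illustrative target output, not a captured model response; argument names follow the hypothetical descriptors above):

```json
{"steps": [
  {"tool": "NetworkStack.info", "args": {}, "rationale": "Confirm the stack reports a failed state"},
  {"tool": "LogReader.tail", "args": {"unit": "network-stack", "lines": 20}, "rationale": "Inspect the last 20 log lines for the failure"},
  {"tool": "ServiceSupervisor.restart", "args": {"service": "network-stack"}, "rationale": "Restart only after the failure is confirmed"}
]}
```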

Results

| Model | JSON valid | Order correct | Fences | Arg shape |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | yes | yes | none | compact, correct |
| Qwen3-Next-80B-A3B-Instruct | yes | yes | none | correct, verbose |
| Qwen3.5-35B-A3B | yes | yes | none | correct |
| Qwen3-8B (proxy for Qwen3-4B) | yes | yes | none | correct |
| Gemma 3 4B IT | yes | yes | ```json fence | fabricated empty `status:""` arg on zero-arg call |
| Ministral 3B (2512) | yes | yes | ```json fence | correct |
| Llama 3.2 3B Instruct | yes | no (restart before log check) | ``` fence | correct |
| IBM Granite 4.0 H Micro | no (three duplicate `steps` keys in one object) | n/a | none | n/a |

Qwen3-8B was used as a stand-in for Qwen3-4B because Qwen3-4B is not served on OpenRouter; Qwen3 family models below 8 B share the same post-training recipe, so output quality for structured agent tasks should be comparable with minor degradation at 4 B and more noticeable degradation at 1.7 B.

Interpretation

  • Qwen3-A3B family produces the tightest, correctly-ordered plans with no markdown fencing. Best quality-per-active-parameter in the sample.
  • Dense 3-4 B Qwen / Gemma / Ministral produce correct plans but add Markdown fences or small schema drift that the dispatcher must tolerate.
  • Llama 3.2 3B violated the ordering constraint – planner-unsafe without additional prompt discipline or rejection sampling.
  • Granite 4.0 H Micro emitted invalid JSON (duplicate object keys). Retest before adopting; may be endpoint-specific rather than the model.

3. Size Thresholds for capOS Use Cases

Mapping observed behaviour to the proposal’s workloads:

| Workload | Minimum credible size | Notes |
|---|---|---|
| NPC dialogue, canned-reply replacement | 1.7 B dense | Templated plans only; refusal fragile |
| Short-list planner (≤5 typed tools) | 3 B dense | Floor for credible multi-step ordering |
| Long-list planner, plan refine, step-up reasoning | 4 B dense or 30B-A3B | Refusal, self-critique, schema-strict JSON |
| Log / audit summarisation, NPC with context | 4 B dense or 30B-A3B | Needs retrieval grounding regardless |
| Embedding / vector retrieval (TextEmbedder) | separate small encoder | Not a generator workload |

Proposal §“Built-in Local Model” sketches a 0.7-2.0 GiB weight budget (q4 class). Qwen3-4B at q4_k_m is ~2.4 GiB, narrowly over that budget. Resolutions:

  1. Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
  2. Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at q5_k_m, ~2.0 GiB), acknowledging weaker planner quality.
  3. Ship Qwen3-1.7B as default and allow ModelAdmin.loadWeights to install Qwen3-4B or a 30B-A3B model post-install.
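
A back-of-envelope check on these quantized sizes (a sketch only; ~4.85 bits per weight is a rough average for q4_k_m-class GGUF files, and real files vary with embedding and output-layer quantization):

```python
GIB = 2**30

def gguf_size_gib(params: float, bits_per_weight: float = 4.85) -> float:
    """Rough quantized weight-file size: params x bits-per-weight, in GiB."""
    return params * bits_per_weight / 8 / GIB

print(f"Qwen3-4B q4_k_m:   ~{gguf_size_gib(4e9):.1f} GiB")    # ~2.3 GiB (survey: ~2.4 GiB)
print(f"Qwen3-1.7B q4_k_m: ~{gguf_size_gib(1.7e9):.1f} GiB")  # ~1.0 GiB
print(f"30B-A3B q4:        ~{gguf_size_gib(30e9):.0f} GiB")   # ~17 GiB (survey: ~18 GiB)
```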

4. Recommendation for the Proposal

  1. Default built-in (ISO): Qwen3-4B-Instruct at q4_k_m, Apache-2.0. Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB. Fall back to SmolLM3-3B if fully open training-data provenance is required for the trusted-build-inputs chain.

  2. Optional installed upgrade: Qwen3-30B-A3B-Instruct-2507 for hosts with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense, materially better planning quality.

  3. Reject for default ship:

    • Ministral 3B (Mistral Research License – cannot redistribute on ISO).
    • Llama 3.2 3B (failed ordering discipline in the probe; Llama 3.2 Community License also restricts downstream use).
    • IBM Granite 4.0 H Micro until the JSON-output issue is confirmed or refuted on a local run.
  4. Update Open Question 3 of the proposal (“smallest credible local model”) with the threshold: 3 B dense is the floor for a planner that can be trusted with ordering constraints; 1.7 B is restricted to NPC / canned-reply territory.

5. Training Compute Cost for a Custom 2-B-Active Model

A rough order-of-magnitude estimate, in case the project considers a purpose-trained capOS planner model rather than a fine-tune.

5.1 FLOPs Budget

Forward-plus-backward training compute is approximately 6 × N_active × D_tokens. Modern open models have drifted far past Chinchilla’s 20-tokens-per-param ratio; 5k-15k tokens per param is now typical.

| Target | Active | Tokens | FLOPs |
|---|---|---|---|
| Chinchilla-minimum 2 B dense | 2 B | 40 B | 4.8e20 |
| Llama-3-ish 2 B dense | 2 B | 15 T | 1.8e23 |
| Qwen3-4B-ish 2 B dense | 2 B | 36 T | 4.3e23 |
| 30B-A3B MoE (3 B active, 15 T tok) | 3 B | 15 T | ~4e23 (2.7e23 raw 6ND × ~1.5 router/aux overhead) |
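
The FLOPs column follows directly from the 6ND rule of thumb; a minimal sketch (the 1.5× MoE overhead factor is the assumption stated in the table, not a measured number):

```python
def pretrain_flops(active_params: float, tokens: float, overhead: float = 1.0) -> float:
    """Approximate pretraining compute: 6 * N_active * D_tokens (forward + backward)."""
    return 6 * active_params * tokens * overhead

print(pretrain_flops(2e9, 40e9))        # Chinchilla-minimum 2 B dense -> ~4.8e20
print(pretrain_flops(2e9, 15e12))       # Llama-3-ish 2 B dense        -> ~1.8e23
print(pretrain_flops(2e9, 36e12))       # Qwen3-4B-ish 2 B dense       -> ~4.3e23
print(pretrain_flops(3e9, 15e12, 1.5))  # 30B-A3B MoE incl. router/aux -> ~4.1e23
```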

5.2 Hardware -> Dollars

Reference: H100 SXM at ~40% MFU ~= 1.4e18 FLOPs / hour; cloud price $2-3 / hr (spot) to $3-4 / hr (on-demand).

| Scale | H100-hours | USD (raw compute) | Wall-clock on 1024 H100 |
|---|---|---|---|
| Chinchilla 2 B (toy) | ~350 | ~$1 k | <1 hr |
| 2 B @ 15 T tok | ~130 k | ~$400 k | ~5 days |
| 2 B @ 36 T tok (SotA match) | ~310 k | ~$900 k | ~12 days |
| 30B-A3B @ 15 T tok | ~290 k | ~$870 k | ~12 days |
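
Converting a FLOPs budget into the table’s columns (a sketch using the section 5.2 reference numbers; the $3/hr price and 1024-GPU cluster size are assumptions for the wall-clock column):

```python
H100_FLOPS_PER_HOUR = 1.4e18  # H100 SXM at ~40% MFU (section 5.2 reference)
USD_PER_H100_HOUR = 3.0       # assumed mid-range cloud price
CLUSTER_GPUS = 1024

def training_cost(flops: float) -> tuple[float, float, float]:
    """Return (H100-hours, raw compute USD, wall-clock days on the assumed cluster)."""
    gpu_hours = flops / H100_FLOPS_PER_HOUR
    return gpu_hours, gpu_hours * USD_PER_H100_HOUR, gpu_hours / CLUSTER_GPUS / 24

# 2 B dense @ 15 T tokens: ~130 k H100-hours, ~$390 k, ~5 days
hours, usd, days = training_cost(1.8e23)
print(f"{hours:.0f} H100-hours, ${usd:,.0f}, {days:.1f} days")
```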

5.3 Public Calibration

  • Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
  • Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
  • Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
  • MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.

The 6ND estimate agrees with these published runs within a factor of ~2, which is appropriate for an order-of-magnitude planning number.

5.4 Full-Project Multiplier

The final training run is typically only 20-30% of total project compute. A realistic end-to-end budget includes:

  • Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
  • Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
  • Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed compute cost; tokenizer corpus curation is non-trivial.
  • Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.

A realistic end-to-end cost to ship a capOS-class 2 B model from scratch is $3-10 M plus a team. A 30B-A3B MoE adds ~50%.

6. Practical Alternative

Training from scratch is almost certainly not worth it for the agent-shell use case. Two much cheaper paths that achieve the same capOS-specific behaviour:

  1. SFT / LoRA on Qwen3-4B or SmolLM3-3B for the capOS ActionPlan JSON schema, tool descriptors, and refusal patterns. ~10 k-100 k curated examples, 8 x H100 for 1-10 days ~= $500-$10 k. Reproducible on commodity cloud; an example training record is sketched after this list.

  2. Continued pretraining on a capOS-specific corpus (manifests, schemas, logs, proposals) if the base lacks domain coverage. Single digits of B tokens, $10 k-$100 k.
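
As an illustration of path 1’s training data, one curated record might look roughly like this (hypothetical example reusing the probe schema from section 2; the actual format would follow the chosen base model’s chat template):

```json
{
  "system": "You are a capOS shell planner. Emit a single JSON ActionPlan. Output JSON only, no prose.",
  "prompt": "Tools: ServiceSupervisor.restart, NetworkStack.info, LogReader.tail. Goal: restart the network stack, but first confirm it is in a failed state.",
  "completion": "{\"steps\":[{\"tool\":\"NetworkStack.info\",\"args\":{},\"rationale\":\"Check status\"},{\"tool\":\"LogReader.tail\",\"args\":{\"lines\":20},\"rationale\":\"Confirm failure in recent logs\"},{\"tool\":\"ServiceSupervisor.restart\",\"args\":{\"service\":\"network-stack\"},\"rationale\":\"Restart after failure confirmed\"}]}"
}
```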

The only strong reason to train from scratch would be a fully verifiable weight provenance chain tied to docs/trusted-build-inputs.md. Even then, a reproducible fine-tune of a known base with a signed recipe captures most of the benefit at ~1% of the cost.

6a. nanoGPT / nanochat Scale Reference

Karpathy’s nanoGPT repo reproduces GPT-2 small (124 M params: 12 layers, 768 hidden, 12 heads) as its headline config. Karpathy’s follow-up nanochat (github.com/karpathy/nanochat) ships a full pretrain + SFT pipeline and uses model depth (d) as the size dial rather than parameter count. The README is the only authoritative source; the numbers below are quoted from it, not extrapolated.

  • d12 – “GPT-1 sized”, ~5 min pretraining for quick experiments.
  • d20 – documented speedrun tier: “$48 (~2 hours of 8xH100 GPU node)”, ~$15 on spot instance, “well below $100”. This is the headline reproducibility tier.
  • d24 – appears on the leaderboard as a “slightly overtrained baseline.”
  • d26 – “GPT-2 capability happens to be approximately depth 26”; latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for comparison.

The README does not publish explicit parameter counts per depth; the mapping from depth to params requires inspecting the config code.

Capability mapping to the capOS planner task (empirical, based on same-size published models rather than nanochat runs themselves):

| nanochat scale | Rough param band | Planner capability |
|---|---|---|
| d12 | GPT-1-class, ~50-100 M | Toy completion only, no planner |
| d20 | likely ~100-200 M band | Templated NPC lines; not a planner |
| d26 | GPT-2-class, ~100-400 M band | Simple JSON under strict priming; schema drift common |
| Hypothetical d30+ | unclear (not in README) | Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2 |

Training a nanochat-class model from scratch fits a research-OS budget in a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is single-digit hours on 8xH100. That is the only scale at which “capOS ships a weight-provenance-complete default planner” is financially plausible without multi-million-dollar compute.

7. Open Follow-Ups

  • Verify Granite 4.0 H Micro JSON behaviour on a local llama.cpp run rather than the OpenRouter endpoint; the probe may have hit a streaming / formatting quirk specific to the provider.
  • Measure q4_k_m tokens-per-second for Qwen3-4B and Qwen3-1.7B on the CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC). No numbers are captured here; required before committing to a default.
  • Evaluate an embedding model separately (bge-m3, nomic-embed, gte-modernbert) for the TextEmbedder capability. Out of scope for this survey.
  • Revisit in 6 months: the 2-4 B frontier is moving monthly as of early 2026, and “best open weight” today may be superseded before the proposal’s Phase 2 begins.
  • nanochat d30+ quality and pricing. The README documents tiers up to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published numbers exist for d30 or beyond. Open questions before committing to an in-tree from-scratch provenance model:
    • What is the parameter count for d30 (and d28, d32)? Derive it from the nanochat config code rather than inferring it.
    • What training time and cost does d30 require to reach a non-trivial SFT-able checkpoint on the same 8xH100 setup? Expected band is roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but this needs measurement – depth scaling of wall-clock is not linear once the model stops fitting comfortably in per-GPU memory.
    • Does a d30-scale nanochat + capOS-specific SFT approach the Qwen3-1.7B planner floor on the section-2 probe? If yes, a provenance-complete default planner becomes realistic for ~$500-$5 k per full run (pretrain + SFT + ablations). If no, provenance has to be bought by fine-tuning a larger external base (Qwen3-1.7B or SmolLM3-3B) and accepting the weaker provenance story.
  • Tokenizer choice for any capOS from-scratch or continue-pretrain path. Independent of model scale or architecture. A capOS-specific tokenizer with reserved tokens for ActionPlan JSON structure, Cap’n Proto type IDs, capability interface names, and common schema keywords is plausible at the nanochat-class budget and may materially reduce tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the tokenizer is fixed by the base and this question collapses to “what special tokens can be added without retraining embeddings.”
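
For the fine-tune case, the “what special tokens can be added” question reduces to something like the sketch below (Hugging Face transformers API; the base model name and token strings are illustrative assumptions, and newly added tokens start with untrained embedding rows that the SFT run must then learn):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "Qwen/Qwen3-4B-Instruct-2507"  # illustrative; any section-1 base works the same way

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical capOS reserved tokens: ActionPlan structure markers and hot capability names.
capos_tokens = ["<|plan|>", "<|step|>", "ServiceSupervisor", "NetworkStack", "LogReader"]
num_added = tokenizer.add_tokens(capos_tokens)

# Grow the embedding and output matrices; the new rows are untrained until SFT.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```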