Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Research: Scientific Agent-Lab Software Stack

This note surveys existing scientific software that capOS should treat as adaptable service backends for a future agent-facing research lab. The central lesson is that capOS should not invent a new computer algebra system, solver, proof assistant, notebook system, or package manager. It should give agents typed, audited, resource-bounded capabilities over mature tools and preserve the exact environment, inputs, outputs, and proof artifacts needed for review.

Design Consequences For capOS

  • Provide a scientific-standard package as a service graph, not as ambient binaries on a global filesystem.
  • Start by adapting existing command-line and library tools behind narrow typed facades. Native rewrites are unjustified until a backend needs a smaller trusted core or a direct capability ABI.
  • Treat heavyweight systems such as SageMath, OSCAR, JupyterLab, Lean/mathlib, and Spack as environment subjects: they need package-store, workspace, process, network, cache, and quota policy, not just a binary launch API.
  • Expose solver and proof tools as deterministic request/response services whenever possible. A model should ask SmtSolver.check, ProofSession.build, or OptimizationSolver.solve, not run arbitrary shell text.
  • Keep formal proof assistants separate from automatic solvers. SMT results are useful evidence, but durable mathematical claims need proof artifacts checked by Lean, Rocq, Isabelle, Agda, or another trusted kernel.
  • Make provenance a first-class output. Every notebook cell, solver run, proof build, CAS session, package environment, model prompt, and data input should produce replayable metadata and an audit record.
  • Prefer open-source backends in the default package. Proprietary engines such as Wolfram Engine can be optional connector services with explicit license, network, and production-use metadata.

Source Baseline

External sources used for this survey:

Local grounding:

Tool Families

Computer Algebra And Exact Mathematics

PARI/GP is the obvious default for number-theory work. The upstream project is a cross-platform open-source computer algebra system designed for fast number theory computations, and it exposes both the gp shell and the PARI C library. For capOS, the C library is the better long-term service backend; the shell is useful for early compatibility and transcript capture.

SageMath is the best broad open-source mathematical umbrella. It is GPL software built on NumPy, SciPy, matplotlib, SymPy, Maxima, GAP, FLINT, R, and other packages, with the explicit mission of being a free alternative to Magma, Maple, Mathematica, and Matlab. Sage is too large to make into a small TCB component, but it is ideal as an agent lab kernel when the package store and Python compatibility layer exist.

GAP is the standard open system for computational discrete algebra, especially computational group theory. Singular is the specialized polynomial, commutative-algebra, algebraic-geometry, and singularity-theory system. OSCAR is the newer Julia-based research system that unifies GAP, Singular, Polymake, ANTIC/Hecke/Nemo/AbstractAlgebra-style capabilities across algebra, geometry, number theory, and polyhedral geometry. For capOS this suggests two levels: small typed services for common requests, and full language kernels for research workflows that need the native ecosystem.

SymPy is lightweight and embeddable because it is Python-based and has few dependencies. It is a good first symbolic backend for agents that need exact manipulation, code generation, and checkable expressions without launching Sage. SymPy should not replace PARI, GAP, Singular, or OSCAR for their specialist domains.

Numerical, Statistical, And Notebook Workflows

SciPy provides fundamental algorithms for scientific computing in Python: optimization, integration, interpolation, eigenvalue problems, algebraic and differential equations, statistics, sparse matrices, k-dimensional trees, and more. It sits on NumPy and compiled Fortran/C/C++ kernels, so capOS support depends on Python, native extension loading, BLAS/LAPACK packaging, and controlled native-code execution.

R remains the standard open statistical environment. GNU Octave remains the open MATLAB-like numerical environment for linear and nonlinear numerical work. Julia is strategically important because OSCAR, JuMP, SciML, and many modern research packages depend on it. The first capOS lab should host these as isolated language kernels rather than trying to normalize them into one universal API.

JupyterLab is the standard interactive computing front end because notebooks combine code, prose, equations, visualizations, outputs, and controls. capOS should adapt the notebook model but not grant notebook kernels ambient shell or filesystem authority. A future NotebookSession should start kernels with explicit workspace, package environment, data, network, and compute caps, and record every execution result as a reproducible artifact.

Satisfiability, Optimization, And Operations Research

Z3 and cvc5 are the primary open SMT backends to expose through a capOS SmtSolver capability. Both support stand-alone and library use. SMT-LIB should be a supported import/export format, but the service API should expose typed assumptions, objectives, timeouts, model requests, unsat cores, and proof/certificate availability explicitly.

For mathematical optimization, capOS should separate modeling layers from solver engines:

  • JuMP and CVXPY are high-level modeling interfaces that let researchers state optimization problems in Julia or Python.
  • HiGHS is a strong open backend for large sparse LP, MIP, and QP models.
  • SCIP is a broad optimization suite around constraint integer programming, with current Apache-licensed releases.
  • OR-Tools is a practical operations-research toolkit, especially for constraint programming, routing, scheduling, and combinatorial optimization.

The capOS API should accept common model formats and provide bounded solve jobs with time/memory limits, deterministic seeds when supported, solution certificates when available, and reproducibility metadata. It should not hide which backend solved the problem.

Formal Proof Systems

Lean 4 is both a general-purpose functional language and an interactive theorem prover, with mathlib as its main community mathematics library. It is the best default for agent-assisted formal mathematics because current LLM tooling and library momentum are strongest there.

Rocq, formerly Coq, remains an industrial-strength dependently typed prover with a long verification history and program extraction story. Isabelle is a generic proof assistant, with Isabelle/HOL and mature automation important for systems proofs. Agda is valuable for constructive type theory and dependently typed programming. A capOS lab should support all of them as separate proof kernel families instead of pretending they are interchangeable.

Agent integration should be conservative:

  • Agents may propose proof edits, search lemmas, call tactics, and run builds.
  • The proof checker decides whether a theorem is accepted.
  • Accepted proof artifacts must include toolchain version, library revision, package closure, command line, and full build log.
  • CAS or SMT evidence can guide a proof but is not the proof unless the proof assistant checks an imported certificate or independently reconstructs it.

Reproducible Environments And Package Stores

The scientific stack is too large and language-diverse for a hand-written capOS package format to be the first step. Existing systems offer useful pieces:

  • Nix provides isolated, declarative package builds and large package coverage.
  • Guix-HPC focuses on reproducible scientific deployment, per-user environments, and bit-for-bit repeatability from a specific Guix commit.
  • Spack is the HPC-oriented answer for many compiler, MPI, CPU-target, and library variant combinations.
  • Apptainer is common in HPC because it packages software into portable images while integrating with GPUs, high-speed networks, and shared filesystems.

capOS should not import any of these as the kernel package manager. Instead, it should adapt their recipe and closure ideas into capability-native PackageCatalog, PackageClosure, Environment, and BuildService interfaces. Early implementations can execute Nix/Guix/Spack/Apptainer on a Linux host or sidecar; later capOS can consume signed closures as Store objects.

What An LLM Research Lab Needs

A credible LLM agent research lab on capOS needs more than model inference:

  • Workspace service. Branchable project workspaces with exact input, output, patch, and artifact history.
  • Package environments. Content-addressed software closures for Python, Julia, R, C/C++/Fortran, Lean, Rocq, Isabelle, GAP, PARI, Sage, and solver stacks.
  • Notebook service. Jupyter-compatible documents and kernels, but kernels receive explicit caps instead of ambient filesystem, process, or network access.
  • Experiment registry. Runs have immutable parameters, model versions, prompts, tool descriptors, seeds, package closures, datasets, results, and reviewer decisions.
  • Solver/proof services. CAS, SMT, optimization, and formal proof systems are high-level tool capabilities with structured inputs and bounded resources.
  • Literature and retrieval. Paper, code, dataset, citation, and note stores are ordinary namespaces; retrieval does not imply authority to fetch or publish.
  • Job graph orchestration. Long calculations, training/evaluation jobs, proof builds, benchmark sweeps, and multi-agent tasks need resumable job graphs with cancellation and status.
  • Compute authority. CPU, memory, storage, network, GPU, realtime, and external-provider quotas must be explicit and visible in the audit log.
  • Human review surfaces. Agents can generate results, but publication, credential use, external API calls, irreversible filesystem changes, and proof-of-result claims need review gates.

Adaptation Strategy

The near-term path is a staged compatibility bridge:

  1. Hosted Linux sidecars. Run existing stacks on Linux while capOS exposes them as remote capability services. Use namespace/cgroup/seccomp/Landlock sandboxes for trusted batch tools; use hardware-backed Linux guests (QEMU/KVM first, Firecracker/Kata-style microVMs later) for untrusted notebooks, model-generated code, package builds, and multi-tenant agent jobs. Treat User-Mode Linux as a developer/debug fallback, not the primary strong-isolation boundary. This proves interfaces and audit before native package support exists.
  2. Command-wrapper services. Wrap tools such as gp, gap, Singular, lean, lake, rocq, isabelle, octave, Rscript, and solver CLIs with explicit input/output directories, timeout, memory, and network policy.
  3. Library-backed services. Replace wrappers with direct C/C++/Rust/Julia FFI or process-local RPC for small stable APIs such as PARI, Z3, cvc5, and HiGHS.
  4. Notebook and language kernels. Add Python, Julia, R, Sage, and Lean kernels with capOS-authored kernel launchers and artifact capture.
  5. Package-closure ingestion. Import Nix/Guix/Spack closures as signed Store objects, then build a capOS-native catalog around content hashes, licenses, vulnerability metadata, CPU/GPU compatibility, and provenance.
  6. Native capOS services. Only after the interface stabilizes, port the most useful small engines or linkable libraries into native userspace.

Risks And Open Questions

  • Supply-chain size. Sage, Julia, Python scientific stacks, and proof libraries bring huge dependency closures. capOS must record and constrain them rather than pretend they are small trusted components.
  • Nondeterminism. Floating-point math, randomized solvers, parallel BLAS, GPU kernels, and package resolution can make replay differ. Results need deterministic seeds and variance metadata, not only final answers.
  • License boundaries. GPL, LGPL, Apache, MIT, BSD, proprietary, academic, and optional commercial solvers need explicit metadata before packaging.
  • Proof trust. A CAS result, SMT model, or solver objective value can be false because of bugs, numeric tolerance, or bad modeling. Formal proof claims must be checked by the named proof kernel or labeled as empirical.
  • Agent overreach. The default scientific package must not grant arbitrary shell, network, credential, package-install, or publishing authority to a model. Agents receive tools through runner policy, not direct backend caps.
  • Notebook security. Notebooks are executable documents. Opening one is not consent to run it with the reader’s caps.
  • Linux sidecar boundary drift. Namespaces, seccomp, Landlock, gVisor, User-Mode Linux, KVM guests, and microVMs are different security and compatibility claims. capOS must record the backend, host kernel, policy, image hashes, guest tickless/nohz state, and capOS outer NoHzEligibility/NoHzActivation state rather than labeling all of them “sandboxed Linux”.

Recommendation

Define a future scientific-standard package as a curated service graph with three profiles:

  • Base. PARI/GP, SymPy, Z3, cvc5, HiGHS, Lean, and artifact/provenance services through tight command or library wrappers.
  • Research. SageMath, GAP, Singular, OSCAR/Julia, R, Octave, JuMP, CVXPY, SCIP, OR-Tools, Jupyter-compatible notebooks, and package-closure support.
  • Lab. Hosted-agent workspaces, experiment registry, browser/web research tools, GPU-backed model/scientific kernels, distributed job graphs, and publication/review workflows.

The Base profile is the first useful target for agents: exact number theory, symbolic manipulation, SMT checking, linear/integer optimization, and Lean proof checking are enough to make an agent substantially more reliable without granting it a general-purpose scientific workstation.