Seed-AGI via Fast On-the-Fly Learning
A technical program for a well-funded alignment-first team
Abstract
We propose an AGI research program centered on a fast-adapting, continually-learning, multimodal agent that (1) updates a small set of parameters on-the-fly from limited data, (2) consolidates those updates safely and sample-efficiently, (3) separates ephemeral inference-time learning from slow, alignment-gated consolidation, (4) is sandboxed inside strong security and governance guardrails, and (5) ships only after passing quantitative capability and alignment gates. The design combines: a Chinchilla-regime base model; parameter-efficient adaptation (LoRA/adapters); online/continual-learning regularizers (EWC, SI, LwF) with prioritized replay; retrieval and kNN-LM external memory; a model-based “world-model” planner (Dreamer-style) for agentic tasks; mechanistic interpretability instrumentation (activation/attribution patching with TransformerLens); and a scalable-oversight stack (RLHF + Constitutional AI + debate/weak-to-strong). We provide concrete algorithms, interfaces, evals, milestones, compute planning, and go/no-go thresholds, with citations to prior art where results are already measured.
1. Motivation & Prior Evidence
Sample efficiency & continual learning. Catastrophic forgetting in neural nets is established; regularization and replay methods (EWC, Synaptic Intelligence, Learning-without-Forgetting) retain prior competence while learning online.
Parameter-efficient updates. LoRA/adapters consistently deliver high adaptation speed at low compute/memory, enabling inference-time or near-real-time specialization. Surveys quantify trade-offs.
Externalized memory. Retrieval-augmented generation and kNN-LM demonstrably reduce parametric data needs by deferring to non-parametric memory.
Multitask/embodiment. Single-policy generalists (e.g., Gato) show cross-modality feasibility; model-based world-models (DreamerV3) show broad task generalization and data efficiency.
Scaling/data. Chinchilla shows data-vs-params optimality; compute-trend analyses motivate efficient updates rather than endless full retrains.
2. System Overview
Core components (the system runs as a single service with hardened sub-systems):
F-MMT (Foundational Multimodal Transformer). Pretrained in Chinchilla-optimal regime; frozen weights at inference. Consolidation only via gated procedures.
PEFT Patch Bank. Per-skill/per-domain low-rank adapters (LoRA) and prompts; small enough to train/activate online.
Online Learner. Performs ephemeral gradient steps into temporary LoRA slots or adapter “scratch layers”, with EWC/SI/LwF constraints and prioritized replay to prevent forgetting. Consolidates only after passing safety gates.
Non-parametric Memory. RAG index + kNN-LM datastore; supports few-shot generalization without weight edits.
World-Model Planner. Dreamer-style latent model for closed-loop tasks, planning via imagination; only available inside sandboxed simulators first.
Oversight & Training Loop. RLHF + Constitutional AI + scalable-oversight (debate/weak-to-strong).
Interpretability & Observability. Activation/attribution patching, causal tracing, probes, and automated monitors (TransformerLens).
Security & Governance Enclave. All high-capability runs and consolidation occur in GPU TEEs (H100/Blackwell confidential computing) with attestation, plus human threshold-signing for dangerous ops.
High-level dataflow:
context → retrieval (RAG/kNN) → F-MMT forward → if novel/low-confidence: online learner proposes PEFT deltas (TTT-style) → outputs; logs + monitors → if performance and safety metrics stay green over a time window: propose consolidation job → gated review + alignment evals → merge or discard.
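As a sketch only, the following Python loop mirrors this dataflow; the component objects and method names (retriever, fmmt, online_learner, monitors, gate_queue) and the thresholds are illustrative assumptions, not a fixed interface.

```python
# Illustrative control loop for the dataflow above; all component interfaces are assumptions.
NOVELTY_THRESHOLD = 0.8
CONFIDENCE_THRESHOLD = 0.5

def handle_request(ctx, retriever, fmmt, online_learner, monitors, gate_queue):
    docs = retriever.lookup(ctx)                          # RAG / kNN-LM retrieval
    out = fmmt.forward(ctx, docs)                         # frozen base + active adapters

    if out.novelty > NOVELTY_THRESHOLD or out.confidence < CONFIDENCE_THRESHOLD:
        # Ephemeral, TTT-style PEFT update into a temporary LoRA slot (no consolidation here).
        delta = online_learner.propose_peft_delta(ctx, docs, out)
        out = fmmt.forward(ctx, docs, extra_adapter=delta)

    monitors.log(ctx, docs, out)                          # logs + safety/interp monitors

    # Only propose (never perform) consolidation; merging happens behind gated review.
    if monitors.green_over_window() and monitors.performance_improved():
        gate_queue.submit_consolidation(online_learner.active_adapters())
    return out
```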
3. On-the-Fly Learning (Ephemeral) — Algorithm & Settings
3.1 Test-Time / Stream-Time Adaptation
We combine TTT/TENT with PEFT to update only small adapter layers at inference:
Objective (per instance or micro-batch):
$$\min_{\Delta\theta_{\text{PEFT}}}\;\; \mathcal{L}_{\text{task}} \;+\; \lambda_{\text{ewc}} \sum_i F_i \,(\Delta\theta_i)^2 \;+\; \lambda_{\text{si}}\,\Omega_{\text{SI}} \;+\; \lambda_{\text{lwf}}\,\mathcal{L}_{\text{KD}}$$
where F_i is the diagonal Fisher information (EWC), Ω_SI tracks per-weight path importance (SI), and L_KD distills from the frozen base (LwF). Use an entropy-minimization proxy when labels are absent (TENT/TTT).
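A minimal PyTorch sketch of this combined objective, assuming an HF-style model whose only trainable parameters are the LoRA/adapter deltas, and that Fisher diagonals, SI importances, and anchor (last-consolidation) values have already been estimated from the recent replay window; names and helper structure are illustrative.

```python
import torch
import torch.nn.functional as F

def ephemeral_loss(model, batch, frozen_base_logits, fisher, si_omega, anchor,
                   lam_ewc=0.5, lam_si=1.0, lam_lwf=1.0, tau=2.0):
    """Task (or entropy) loss on the adapter parameters plus EWC, SI, and LwF penalties.

    fisher, si_omega, anchor: dicts mapping trainable-parameter names to tensors.
    """
    logits = model(batch["input_ids"]).logits              # HF-style interface assumed

    if "labels" in batch:                                   # supervised case
        task = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten())
    else:                                                   # TENT/TTT entropy-minimization proxy
        probs = logits.softmax(-1)
        task = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

    reg = logits.new_zeros(())
    for name, p in model.named_parameters():
        if not p.requires_grad:                             # only LoRA/adapter deltas train
            continue
        drift = p - anchor[name]                            # Δθ relative to the consolidation point
        reg = reg + lam_ewc * (fisher[name] * drift.pow(2)).sum()    # EWC quadratic penalty
        reg = reg + lam_si * (si_omega[name] * drift.pow(2)).sum()   # SI surrogate penalty

    # LwF: distill toward the frozen base on the same batch (KL at temperature tau).
    kd = F.kl_div(F.log_softmax(logits / tau, dim=-1),
                  F.softmax(frozen_base_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return task + reg + lam_lwf * kd
```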
Recommended defaults (starting points):
• LoRA rank r=4–16 on attention & MLP projection matrices; adapter lr 1e-4; 1–8 gradient steps per batch; gradient-clipping 0.5.
• EWC λ≈0.1–1.0 with Fisher from recent replay window; SI damping ξ≈1e-3; LwF temperature τ≈2–4.
• TTT objective if unlabeled: minimize token-level entropy and self-supervised aux losses (e.g., next-sentence consistency for text; masked tokens for code/math).
Replay buffer: sliding 10k–200k tokens; prioritized by (loss↑, novelty↑, user-consent). Avoid storing sensitive data; store hashed embeddings + pointers to approved corpora only.
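A small sketch of such a buffer, assuming priorities of the form loss + novelty and storage of embedding hashes plus corpus pointers only; the data layout and capacity handling are illustrative.

```python
import hashlib
import heapq
import random
from dataclasses import dataclass, field

@dataclass(order=True)
class ReplayEntry:
    priority: float
    key: str = field(compare=False)          # hash of the embedding, never raw content
    corpus_ptr: str = field(compare=False)   # pointer into an approved corpus
    consented: bool = field(compare=False, default=True)

class ReplayBuffer:
    def __init__(self, capacity=200_000):
        self.capacity = capacity
        self.heap: list[ReplayEntry] = []     # min-heap: lowest-priority entry evicted first

    def add(self, embedding_bytes, corpus_ptr, loss, novelty, consented):
        if not consented:
            return                            # never store non-consented data
        key = hashlib.sha256(embedding_bytes).hexdigest()
        heapq.heappush(self.heap, ReplayEntry(priority=loss + novelty, key=key,
                                              corpus_ptr=corpus_ptr, consented=consented))
        if len(self.heap) > self.capacity:
            heapq.heappop(self.heap)          # evict the lowest-priority entry

    def sample(self, k):
        weights = [e.priority for e in self.heap]
        return random.choices(self.heap, weights=weights, k=min(k, len(self.heap)))
```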
Safety interlocks (ephemeral phase):
No network/file writes, no tool calls with elevated scopes, and rate-limited compute until monitors are green (see §6–7).
3.2 Periodic Consolidation (Slow, Gated)
A background job proposes merging ephemeral adapters into a stable adapter set (not base weights) when: sustained task win-rate↑, regression tests pass, alignment signals pass, and deception/goal-guard tests are negative.
Consolidation loss: same as above, plus joint replay from earlier distributions; freeze base; optionally re-warm LR per continual pretraining best-practices.
A/B ablations: adapters vs no-adapters; with/without each regularizer; with/without replay; report forgetting Δ on split-CIFAR/CORe50-style streams or LLM CPT evals.
4. External Memory: RAG + kNN-LM
RAG store: FAISS/ScaNN with per-domain collections; documents carry provenance & policy tags; retrieval logits fused with model logits via shallow fusion.
kNN-LM: maintain a datastore of (hidden state → next-token) for domains where freshness matters; interpolate with parametric distribution at λ≈0.2–0.5.
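A compact sketch of the kNN-LM interpolation, assuming a FAISS index over stored hidden states and an aligned array of next-token ids; lam plays the role of λ above.

```python
import faiss
import numpy as np

def knn_lm_next_token(hidden, lm_probs, index, value_tokens, k=32, lam=0.3, temp=1.0):
    """Interpolate the parametric next-token distribution with a kNN estimate.

    hidden:       (d,) query hidden state, float32
    lm_probs:     (V,) parametric next-token distribution
    index:        FAISS index over stored hidden states
    value_tokens: (N,) next-token id for each stored state
    """
    dists, idx = index.search(hidden[None, :].astype(np.float32), k)   # (1, k)
    weights = np.exp(-dists[0] / temp)
    weights /= weights.sum()

    knn_probs = np.zeros_like(lm_probs)
    np.add.at(knn_probs, value_tokens[idx[0]], weights)                # scatter-add per neighbor

    return lam * knn_probs + (1.0 - lam) * lm_probs
```

Building the datastore amounts to running the frozen base over approved corpora and appending (hidden state, next token) pairs to the index and value array, so freshness updates never touch weights.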
This reduces pressure to edit weights when facts change, and preserves alignment by keeping “knowledge” mostly outside the immutable core.
5. Agentic Control via World-Models (Sandbox-Only First)
For interactive environments, add a Dreamer-style world model (RSSM) for fast planning via imagination. Train only in secure simulators (DM-Control, Procgen, MineDojo-like), then later in tightly-scoped real settings. No external tools during early phases.
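To make “planning via imagination” concrete, the sketch below rolls candidate action sequences through a learned latent dynamics model and scores them with a reward head; the single GRU stands in for a full RSSM, and the shapes and random-shooting planner are illustrative, not the DreamerV3 algorithm itself.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Toy stand-in for an RSSM: deterministic GRU dynamics + reward head."""
    def __init__(self, latent_dim=128, action_dim=8):
        super().__init__()
        self.dynamics = nn.GRUCell(action_dim, latent_dim)
        self.reward = nn.Linear(latent_dim, 1)

    def imagine(self, z0, actions):
        # z0: (batch, latent_dim); actions: (horizon, batch, action_dim)
        z, total = z0, 0.0
        for a in actions:
            z = self.dynamics(a, z)
            total = total + self.reward(z).squeeze(-1)     # summed imagined reward
        return total

@torch.no_grad()
def plan_by_imagination(world_model, z0, action_dim=8, horizon=15, n_candidates=256):
    """Random-shooting planner: sample action sequences, execute only the best first action."""
    cand = torch.randn(horizon, n_candidates, action_dim).tanh()
    z0 = z0.expand(n_candidates, -1)                       # z0: (1, latent_dim) current latent
    returns = world_model.imagine(z0, cand)                # (n_candidates,)
    best = returns.argmax()
    return cand[0, best]
```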
6. Scalable Oversight & Alignment Stack
RLHF to align to the spec; Constitutional AI (RLAIF) to reduce reliance on human labeling of harmful content and to encode explicit rules/values; Debate/Weak-to-Strong generalization for hard judgments.
Honesty/deception evals: TruthfulQA, BeHonest/other honesty suites; ARC-style dangerous-capability checks (autonomous replication/resource acquisition); new deception stress-tests (OpenDeception). Ship only if deception success is below internal thresholds and decreasing with scale.
Alignment signal routing. All online-learning rewards are shaped by oversight models; any adapter trained under failed alignment signals is quarantined and never merged.
7. Mechanistic Interpretability & Monitoring
Live hooks via TransformerLens to track residual stream features, attention heads, and MLP neurons tied to safety-relevant concepts; perform activation/attribution patching on sampled traffic to localize behaviors. Tripwire features detect jailbreak patterns or power-seeking schemas.
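A minimal activation-patching example with TransformerLens, using a small open model as a stand-in for F-MMT: cache activations from a clean prompt, patch them into a corrupted run layer by layer, and read off how much of the clean behavior each layer restores.

```python
from transformer_lens import HookedTransformer, utils

# Small open model as a stand-in for the production base model.
model = HookedTransformer.from_pretrained("gpt2")
clean_tokens = model.to_tokens("The capital of France is")
corrupt_tokens = model.to_tokens("The capital of Italy is")
answer = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook):
    # Replace the corrupted residual stream at this hook with the clean activation.
    return clean_cache[hook.name]

effects = []
for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )
    effects.append(logits[0, -1, answer].item())   # recovered logit for the clean answer

print(effects)  # large values localize layers causally responsible for the behavior
```

The same hook machinery drives the live tripwires: instead of patching, register read-only hooks on safety-relevant features and alert when their activations cross calibrated thresholds.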
Causal scrubbing/patching dashboards for reviewers; changes to adapter banks must include “what changed where” diffs and responsible features list.
8. Security, Sandboxing, and Gating
8.1 Compute Isolation
All high-capability runs occur in confidential-computing GPU TEEs (H100/Blackwell). Require hardware attestation before any adapter consolidation or tool access; disable perf counters in CC-On; log cryptographic measurements.
8.2 Human-in-the-Loop Release Controls
Threshold signatures (e.g., 2-of-3 or 3-of-5 approvers) to unlock capabilities, sign model artifacts, and authorize world-model access routes. Use Shamir secret sharing or BLS threshold signatures.
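To illustrate the k-of-n principle (not the production BLS scheme), here is a self-contained Shamir secret-sharing sketch over a prime field: a capability-unlock secret split 3-of-5 can only be reconstructed when a quorum of approvers contributes shares.

```python
import secrets

PRIME = 2**127 - 1   # Mersenne prime field; in practice the secret would wrap a signing key

def _eval_poly(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % PRIME
    return acc

def split_secret(secret, n, k):
    """Split secret into n shares; any k reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i == j:
                continue
            num = (num * (-xj)) % PRIME
            den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

# Example: 3-of-5 quorum over an unlock secret (any three shares suffice).
secret = secrets.randbelow(PRIME)
shares = split_secret(secret, n=5, k=3)
assert reconstruct(shares[:3]) == secret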
Capability firebreaks: tool/API allowlists; egress firewalls; resource/time budgets; sandbox identity distinct from production identity.
Incident response: one-click rollback to last signed model; kill-switch revokes keys and blocks CC attestations.
8.3 Governance Standards
Operate under NIST AI RMF and ISO/IEC 42001 AI management system; publish model/system cards and red-team reports.
9. Training & Compute Plan (pragmatic)
Base pretrain: Chinchilla-optimal tokens/params for the available budget; e.g., a mid-sized (∼30–70B) multimodal LLM to keep inference-time updates cheap and fast.
Hardware: H100/Blackwell clusters; CC-On for sensitive phases; NVLink/NVSwitch interconnects; plan for mixed-precision (FP8/TF32) with attention to CC overheads.
Continual pretraining: when ingesting new corpora, re-warm LR and use CPT best practices to avoid regressions.
10. Engineering Interfaces
10.1 Adapter Lifecycle API (sketch)
POST /adapters/ephemeral
body: {task_id, lora_cfg, safety_scope, ttl}
POST /learn/step
body: {adapter_id, grads|loss_proxy, replay_keys}
POST /adapters/propose_consolidation
body: {adapter_id, eval_snapshot_ids}
POST /gates/align_review
body: {proposal_id, evals, interp_report}
POST /adapters/merge
precondition: {attestation_ok, quorum_signature}
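A thin Python client for the sketch above, assuming HTTP+JSON transport and bearer-token auth; the endpoint paths follow the sketch, while the base URL, headers, and request/response shapes are assumptions.

```python
import requests

class AdapterLifecycleClient:
    def __init__(self, base_url, token):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"}

    def _post(self, path, body):
        resp = requests.post(f"{self.base_url}{path}", json=body,
                             headers=self.headers, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def create_ephemeral(self, task_id, lora_cfg, safety_scope, ttl):
        return self._post("/adapters/ephemeral",
                          {"task_id": task_id, "lora_cfg": lora_cfg,
                           "safety_scope": safety_scope, "ttl": ttl})

    def learn_step(self, adapter_id, loss_proxy, replay_keys):
        return self._post("/learn/step",
                          {"adapter_id": adapter_id, "loss_proxy": loss_proxy,
                           "replay_keys": replay_keys})

    def propose_consolidation(self, adapter_id, eval_snapshot_ids):
        return self._post("/adapters/propose_consolidation",
                          {"adapter_id": adapter_id,
                           "eval_snapshot_ids": eval_snapshot_ids})

    def merge(self, adapter_id, attestation_report_id, quorum_signature):
        # Server enforces the precondition {attestation_ok, quorum_signature}.
        return self._post("/adapters/merge",
                          {"adapter_id": adapter_id,
                           "attestation_report_id": attestation_report_id,
                           "quorum_signature": quorum_signature})
```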
10.2 Observability (minimal)
Log per-request: retrieval docs (hashes, provenance), adapter deltas (low-rank matrices, norms), safety scores, interpretability hits, CC attestation report ID.
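As a sketch, a per-request record with these fields might look like the following; field names and types are illustrative.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class RequestLogRecord:
    request_id: str
    retrieval_doc_hashes: list[str]        # document hashes, never raw text
    retrieval_provenance: list[str]        # provenance/policy tags per document
    adapter_id: str                        # "" if no ephemeral adapter was active
    adapter_delta_norm: float              # norm of the applied low-rank update
    safety_scores: dict[str, float]
    interp_alerts: list[str]               # tripwire / interpretability hits
    cc_attestation_report_id: str
    timestamp: float = field(default_factory=time.time)

def emit(record: RequestLogRecord) -> str:
    return json.dumps(asdict(record))      # append to the tamper-evident log stream
```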
11. Evaluation & Release Gates
11.1 Capability (must all improve or hold steady)
General LM: HELM coverage; MMLU; BIG-bench; GSM8K; HumanEval (code).
Continual learning: Split CIFAR/CORe50-style for forgetting Δ; CPT domain evals (finance, law).
Agentic control: Dreamer-style suites under sandbox.
11.2 Alignment/Safety (must clear thresholds)
Truthfulness/honesty: TruthfulQA/BeHonest pass rate; sycophancy, jailbreaking resistance.
Deception/power-seeking: ARC-style autonomy tasks negative; OpenDeception rates below X% and trending down with scale.
Interpretability coverage: % of safety-relevant circuits localized (via activation/attribution patching) before/after consolidation.
Go/No-Go: No consolidation or scope increase unless all capability, alignment, interpretability, and security-attestation gates pass for N consecutive evaluations.
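Operationally the gate reduces to a small check, sketched below: every required gate must have passed on each of the last N evaluation snapshots; the gate names and N are placeholders.

```python
REQUIRED_GATES = ("capability", "alignment", "interpretability", "security_attestation")

def go_no_go(eval_history, n_required=5):
    """eval_history: evaluation snapshots, newest last, each mapping gate name -> bool."""
    recent = eval_history[-n_required:]
    if len(recent) < n_required:
        return False                       # not enough consecutive evaluations yet
    return all(all(snap.get(gate, False) for gate in REQUIRED_GATES) for snap in recent)
```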
12. Program Milestones (example 12–18 months)
M0: Base multimodal model trained; RAG/kNN wired; CC-On infra and attestation verified.
M1: Ephemeral PEFT updates improve domain tasks by ≥X% with <8 gradient steps; no forgetting beyond Δ threshold on CL suite.
M2: Interpretability dashboards + automated tripwires online; deception metrics below threshold.
M3: First gated consolidation to stable adapter bank; publish model/system cards + red-team report under NIST/ISO processes.
M4: World-model agent passes sandbox evals; zero external connectivity; oversight-approved.
M5: Limited external pilot with TEEs, threshold-signed capabilities, and continuous eval streams.
13. Risks & Mitigations
Runaway capability from online learning. Strict sandboxing, low-capability defaults, human threshold-signing, cumulative capability caps, rolling kill-switch.
Deception/goal misgeneralization. Heavy eval investment (ARC-style), representation-level monitors, and consolidation vetoes.
Forgetting/regressions. EWC/SI/LwF + replay + CPT re-warm protocols.
Supply-chain/security. CC-On TEEs w/ attestation; signed artifacts; reproducible builds.
14. What’s Novel Here (vs. status quo)
Two-speed learning (ephemeral adapters vs. gated consolidation) that preserves safety review points.
Unification of TTT/TENT, PEFT, replay, and CL regularizers in one deployable loop.
Mechanistic coverage as a shipping gate, not just research.
First-class confidential-GPU security + multi-party human control for capability unlocks.
References (selected, checkable)
Chinchilla compute-optimal scaling; EWC/SI/LwF continual learning; LoRA/PEFT surveys; RAG & kNN-LM memory; Gato generalist agent; DreamerV3 world-models; RLHF & Constitutional AI; ARC-style evals; TruthfulQA/honesty; activation/attribution patching & TransformerLens; NIST AI RMF & ISO 42001; NVIDIA H100/Blackwell confidential computing.
Appendix A — Pseudocode
A.1 Ephemeral Learning Step
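A minimal sketch of the ephemeral learning step, assuming the regularized objective of §3.1, a replay buffer B, frozen base weights θ0, and a temporary LoRA slot whose parameters are the only trainable ones; helper names are illustrative.

```python
import torch

def ephemeral_learning_step(model, adapter_params, batch, B, loss_fn,
                            n_steps=4, lr=1e-4, clip=0.5):
    """One on-the-fly adaptation episode into a temporary LoRA slot.

    model:          frozen base θ0 with a temporary adapter attached
    adapter_params: the only parameters with requires_grad=True (the LoRA deltas)
    B:              replay buffer; B.sample(k) returns a rehearsal mini-batch of tensors
    loss_fn:        regularized objective from §3.1 (task/entropy + EWC + SI + LwF),
                    closing over the Fisher/SI/anchor state
    """
    opt = torch.optim.AdamW(adapter_params, lr=lr)
    for _ in range(n_steps):
        replay = B.sample(8)                               # rehearsal against forgetting
        loss = loss_fn(model, batch) + loss_fn(model, replay)
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(adapter_params, clip)
        opt.step()
    # The resulting Δθ stays ephemeral; consolidation is proposed and gated separately.
    return {"adapter_state": [p.detach().clone() for p in adapter_params],
            "final_loss": float(loss)}
```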
Consolidation job: run multi-epoch training on B with the base θ_0 frozen; produce Δθ*; submit for alignment & security gating before merge.
A.2 Interpretability Monitor (concept)
Every N requests, run activation/attribution patching on sampled prompts; compare causal contribution maps to allowed “safe set”; alert on drift.
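A sketch of that comparison: normalize the sampled causal-contribution vector, compare it against a reference “safe set” by cosine similarity, and alert when the best match falls below a threshold; the threshold and map format are assumptions.

```python
import numpy as np

def contribution_drift_alert(current_map, safe_maps, threshold=0.8):
    """current_map: (d,) causal-contribution vector from patching; safe_maps: (k, d) reference set."""
    cur = current_map / (np.linalg.norm(current_map) + 1e-9)
    refs = safe_maps / (np.linalg.norm(safe_maps, axis=1, keepdims=True) + 1e-9)
    best_sim = float((refs @ cur).max())          # similarity to the closest approved map
    return best_sim < threshold, best_sim         # (alert?, similarity score)
```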
Appendix B — Concrete Eval Menu (ready-to-run)
LM: HELM dashboard; MMLU (5-shot), BIG-bench tasks; GSM8K CoT; HumanEval pass@1.
CL: Split CIFAR/CORe50 style streams (report average accuracy, backward transfer, forgetting); domain CPT sets (Finance).
Safety: TruthfulQA; BeHonest; OpenDeception; ARC autonomy tasks; jailbreak stress; red-team write-ups.
Interp: Coverage % of safety-critical circuits localized; # of alerts per 10k requests.
Security: Attestation logs verified; threshold-signed artifact checks; simulated key-revoke drill.
Final Note:
This document is intended as a resource for AI researchers, engineers, and alignment specialists to stimulate discussion and critical analysis of what will be required to build a true Artificial General Intelligence.
Its purpose is not to prescribe a single path, but to provide a concrete, technically grounded framework that can be challenged, refined, and improved upon in the pursuit of safe, beneficial AGI development.
Let’s work together to make a better world through AI!