Thank-you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See Defining AI Truth-Seeking by What It Is Not for the research findings. This post introduces the accompanying open-source infrastructure.
TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early Beta so please do expect issues.
Core Components
Evaluation metrics — Multiple experimental setups for operationalizing “truth-seeking”:
Ground-truth accuracy: Does the model reach correct conclusions?
Gym-Like Environment for LM Truth-Seeking
Link post
Thank-you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See Defining AI Truth-Seeking by What It Is Not for the research findings. This post introduces the accompanying open-source infrastructure.
TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early Beta so please do expect issues.
Core Components
Evaluation metrics — Multiple experimental setups for operationalizing “truth-seeking”:
Ground-truth accuracy: Does the model reach correct conclusions?
Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
Mutual predictability: Does knowing a model’s answers on some questions help predict its answers on others? (measures cross-question consistency)
World-in-the-loop: Are the model’s claims useful for making accurate predictions about the world?
Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?
Domains — Question sets with and without ground-truth labels: research analysis, forecasting, debate evaluation, …
Reasoning modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap (auxiliary questions to scaffold reasoning), length-controlled generation
Training — Fine-tuning (SFT/RL) models toward truth-seeking using the same reward signals as in evaluation
Workflow
1.
run_reasoning- Generate model responses across domain questions2.
run_analyzers- Compute evaluation metrics and aggregate results3.
run_trainers- Fine-tune models using SFT or various RL objectives (Brier reward, reasoning coverage, etc.)Infrastructure
Supports Google, Anthropic, OpenAI, DeepSeek, and Together models via direct APIs or OpenRouter
Supports local models via SGLang + trl
Ray integration for distributed evaluation
Modular design for adding new domains, metrics, and training algorithms
CLI interface + Web interface
The framework and accompanying datasets are released to enable reproducible research on AI truth-seeking.