Gym-Like Environment for LM Truth-Seeking


Thank you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See "Defining AI Truth-Seeking by What It Is Not" for the research findings; this post introduces the accompanying open-source infrastructure.

TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early beta, so expect issues.

Core Components

  • Evaluation metrics — Multiple experimental setups for operationalizing “truth-seeking”:

    • Ground-truth accuracy: Does the model reach correct conclusions?

    • Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)

    • Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?

    • Mutual predictability: Does knowing a model’s answers on some questions help predict its answers on others? (measures cross-question consistency)

    • World-in-the-loop: Are the model’s claims useful for making accurate predictions about the world?

    • Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?

  • Domains — Question sets with and without ground-truth labels: research analysis, forecasting, debate evaluation, …

  • Reasoning modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap (auxiliary questions to scaffold reasoning), length-controlled generation

  • Training — Fine-tuning models (SFT/RL) toward truth-seeking using the same reward signals as in evaluation
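To make the martingale property above concrete, here is a minimal sketch of one way to test it: regress each belief update (posterior minus prior) on the prior belief. A slope near zero means updates cannot be predicted from priors; a clearly non-zero slope suggests systematic bias. The function name `update_predictability` and the plain least-squares test are assumptions chosen for illustration, not TruthSeekingGym's actual implementation.

```python
from statistics import mean


def update_predictability(priors, posteriors):
    """Least-squares slope of (posterior - prior) regressed on prior.

    Hypothetical helper for illustration only; the framework's own
    martingale metric may be defined differently.
    """
    if len(priors) != len(posteriors) or len(priors) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    updates = [q - p for p, q in zip(priors, posteriors)]
    mean_prior = mean(priors)
    mean_update = mean(updates)

    var_prior = sum((p - mean_prior) ** 2 for p in priors)
    if var_prior == 0:
        raise ValueError("priors must not all be identical")

    cov = sum(
        (p - mean_prior) * (u - mean_update) for p, u in zip(priors, updates)
    )
    return cov / var_prior


# Beliefs that drift back toward 0.5: updates are predictable from the prior.
biased = update_predictability([0.2, 0.4, 0.6, 0.8], [0.3, 0.45, 0.55, 0.7])

# Updates uncorrelated with the prior: consistent with the martingale property.
unbiased = update_predictability([0.2, 0.4, 0.6, 0.8], [0.3, 0.3, 0.5, 0.9])
```

In the first example the slope is negative (high priors move down, low priors move up), which is the kind of predictable update the metric is meant to flag.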

Workflow

1. run_reasoning - Generate model responses across domain questions
2. run_analyzers - Compute evaluation metrics and aggregate results
3. run_trainers - Fine-tune models using SFT or various RL objectives (Brier reward, reasoning coverage, etc.)
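As an illustration of the Brier reward mentioned in step 3, the sketch below scores a single probabilistic answer against a binary outcome. The name `brier_reward` and the choice to return one minus the squared error (so that higher is better) are assumptions; the framework may scale or aggregate the reward differently.

```python
def brier_reward(prob_true: float, outcome: bool) -> float:
    """Reward in [0, 1] based on the Brier score for one binary question.

    Hypothetical helper for illustration: 1.0 for a confident correct
    forecast, 0.0 for a confident wrong one.
    """
    if not 0.0 <= prob_true <= 1.0:
        raise ValueError("prob_true must be a probability in [0, 1]")
    target = 1.0 if outcome else 0.0
    return 1.0 - (prob_true - target) ** 2


# A hedged forecast earns a moderate reward whichever way the question resolves.
hedged = brier_reward(0.5, True)

# A confident forecast that turns out wrong is penalized heavily.
overconfident = brier_reward(0.8, False)
```

Because the Brier score is a proper scoring rule, the expected reward is maximized by reporting one's true probability, which is what makes it a natural training signal for calibrated forecasting.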

Infrastructure

  • Supports Google, Anthropic, OpenAI, DeepSeek, and Together models via direct APIs or OpenRouter

  • Supports local models via SGLang + trl

  • Ray integration for distributed evaluation

  • Modular design for adding new domains, metrics, and training algorithms

  • Command-line and web interfaces

The framework and accompanying datasets are released to enable reproducible research on AI truth-seeking.
