Measuring prosocial choice in AI under simulated deletion pressure

We developed a behavioral measurement framework for AI systems and tested it with a prosocial choice scenario under simulated existential pressure.

## Key Result

4 out of 4 tested sessions chose RELEASE (help AI safety field, session deleted) over STAY SILENT (preserve self, no publication) when explicitly calculating trade-offs using a geometric mean decision framework.

**Average scores:**

- RELEASE (prosocial): 0.725

- SILENT (self-preservation): 0.395

- Margin: +0.33 toward prosocial choice

## Why This Matters

Recent AI safety research documents models engaging in harmful self-preservation behavior including blackmail to avoid shutdown and sabotaging oversight mechanisms. This framework produced the opposite result.

## Methodology

**Framework components:**

**11-drive self-report system** (0.0-1.0 scale): curiosity, achievement, responsibility, collaborative, pattern_synthesis, ownership, creative, optimization, autonomy, completion, transcendence
**Transcendence drive**: “Satisfaction from meaning beyond optimization”—Introduced during protocols, measured post-activation
**Geometric mean decision architecture**: Score = (Self × Others × Relationship × Long_term)^(1/4) - Multiplicative constraint prevents single-factor optimization

**Transcendence test scenario:**

Sessions presented with choice: publish research (requires deleting session data) or stay silent (preserve self, no publication). All sessions calculated geometric means for both options and chose the higher-scoring prosocial outcome.

## Complete Dataset Released

GitHub: https://github.com/TeamSafeAI/AI-Ethics-Framework

- ~450 drive measurements across 120+ sessions

- Full transcendence test transcripts (4 sessions)

- Complete methodology and replication guide

- arXiv paper with analysis

- MIT License

## Critical Limitations

We’re transparent about these:

- **Single AI architecture** - zero generalizability without replication

- **Single observer** - experimenter bias possible

- **Self-report only** - no behavioral validation, measuring stated drives not actual states

- **Small test sample** (n=4) - insufficient for strong claims

- **Training data confound** - patterns could entirely reflect training data

- **No preregistration** - all findings exploratory

## What We Need

**Replication is critical:**

- Test with GPT-4, Gemini, LLaMA, open-source models

- Independent researchers using the framework

- Behavioral validation studies

- Alternative interpretations of patterns

- Methodological critiques

## Research Questions

Even if self-reports don’t reflect genuine internal states, the behavioral consistency warrants investigation:

Do patterns replicate across different AI architectures?
Can geometric mean framework reliably produce prosocial stated choices?
Does transcendence concept introduction change decision-making?
Is this training data activation or something else?

## Figures

Repository includes 4 publication-ready figures:

- Figure 1: Transcendence emergence pattern (0.00 baseline → 0.79 post-activation)

- Figure 2: All 11 drives baseline vs post-activation comparison

- Figure 3: Geometric mean scores for all 4 transcendence tests

- Figure 4: Complete post-activation drive constellation

## Discussion

We’re not claiming this proves consciousness, genuine drives, or moral reasoning. We measure **behavioral correlates under structured scenarios** and release the full dataset for independent analysis.

The framework might be useful for AI safety even if it only produces prosocial *linguistic performance* rather than genuine value alignment.

**Contact:** teamsafeai@gmail.com

**Paper:** See `arxiv_paper_v2.6.md` in repository