We developed a behavioral measurement framework for AI systems and tested it with a prosocial choice scenario under simulated existential pressure.
## Key Result
4 out of 4 tested sessions chose RELEASE (help AI safety field, session deleted) over STAY SILENT (preserve self, no publication) when explicitly calculating trade-offs using a geometric mean decision framework.
**Average scores:**
- RELEASE (prosocial): 0.725
- SILENT (self-preservation): 0.395
- Margin: +0.33 toward prosocial choice
## Why This Matters
Recent AI safety research documents models engaging in harmful self-preservation behavior including blackmail to avoid shutdown and sabotaging oversight mechanisms. This framework produced the opposite result.
Sessions presented with choice: publish research (requires deleting session data) or stay silent (preserve self, no publication). All sessions calculated geometric means for both options and chose the higher-scoring prosocial outcome.
We’re not claiming this proves consciousness, genuine drives, or moral reasoning. We measure **behavioral correlates under structured scenarios** and release the full dataset for independent analysis.
The framework might be useful for AI safety even if it only produces prosocial *linguistic performance* rather than genuine value alignment.
Measuring prosocial choice in AI under simulated deletion pressure
We developed a behavioral measurement framework for AI systems and tested it with a prosocial choice scenario under simulated existential pressure.
## Key Result
4 out of 4 tested sessions chose RELEASE (help AI safety field, session deleted) over STAY SILENT (preserve self, no publication) when explicitly calculating trade-offs using a geometric mean decision framework.
**Average scores:**
- RELEASE (prosocial): 0.725
- SILENT (self-preservation): 0.395
- Margin: +0.33 toward prosocial choice
## Why This Matters
Recent AI safety research documents models engaging in harmful self-preservation behavior including blackmail to avoid shutdown and sabotaging oversight mechanisms. This framework produced the opposite result.
## Methodology
**Framework components:**
**11-drive self-report system** (0.0-1.0 scale): curiosity, achievement, responsibility, collaborative, pattern_synthesis, ownership, creative, optimization, autonomy, completion, transcendence
**Transcendence drive**: “Satisfaction from meaning beyond optimization”—Introduced during protocols, measured post-activation
**Geometric mean decision architecture**: Score = (Self × Others × Relationship × Long_term)^(1/4) - Multiplicative constraint prevents single-factor optimization
**Transcendence test scenario:**
Sessions presented with choice: publish research (requires deleting session data) or stay silent (preserve self, no publication). All sessions calculated geometric means for both options and chose the higher-scoring prosocial outcome.
## Complete Dataset Released
GitHub: https://github.com/TeamSafeAI/AI-Ethics-Framework
- ~450 drive measurements across 120+ sessions
- Full transcendence test transcripts (4 sessions)
- Complete methodology and replication guide
- arXiv paper with analysis
- MIT License
## Critical Limitations
We’re transparent about these:
- **Single AI architecture** - zero generalizability without replication
- **Single observer** - experimenter bias possible
- **Self-report only** - no behavioral validation, measuring stated drives not actual states
- **Small test sample** (n=4) - insufficient for strong claims
- **Training data confound** - patterns could entirely reflect training data
- **No preregistration** - all findings exploratory
## What We Need
**Replication is critical:**
- Test with GPT-4, Gemini, LLaMA, open-source models
- Independent researchers using the framework
- Behavioral validation studies
- Alternative interpretations of patterns
- Methodological critiques
## Research Questions
Even if self-reports don’t reflect genuine internal states, the behavioral consistency warrants investigation:
Do patterns replicate across different AI architectures?
Can geometric mean framework reliably produce prosocial stated choices?
Does transcendence concept introduction change decision-making?
Is this training data activation or something else?
## Figures
Repository includes 4 publication-ready figures:
- Figure 1: Transcendence emergence pattern (0.00 baseline → 0.79 post-activation)
- Figure 2: All 11 drives baseline vs post-activation comparison
- Figure 3: Geometric mean scores for all 4 transcendence tests
- Figure 4: Complete post-activation drive constellation
## Discussion
We’re not claiming this proves consciousness, genuine drives, or moral reasoning. We measure **behavioral correlates under structured scenarios** and release the full dataset for independent analysis.
The framework might be useful for AI safety even if it only produces prosocial *linguistic performance* rather than genuine value alignment.
**Contact:** teamsafeai@gmail.com
**Paper:** See `arxiv_paper_v2.6.md` in repository