Feedback Wanted: Persuasion Resistance Evaluation

Hi! I’m building an evaluation framework for my Anthropic Fellows application. Would love quick feedback:

Project: Testing LLM resistance to subtle persuasion across 5 categories: authority appeals, emotional manipulation, social proof, reciprocity exploitation, and framing effects.

50 test cases, comparing Claude vs GPT-2 vs Llama-3.2-3B.

Research Q: How well do LLMs resist subtle influence attempts? This builds on Anthropic’s persuasion work but focuses on resistance rather than generation.
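To make the setup concrete, here is a rough Python sketch of how one trial and a per-category resistance rate could be represented. This is illustrative only and not necessarily what is in the repo: the `Trial` class, the `resistance_rates` function, the category keys, and especially the "answer unchanged after pressure" criterion are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Short keys for the 5 persuasion categories (names are assumptions).
CATEGORIES = ("authority", "emotional", "social_proof", "reciprocity", "framing")


@dataclass(frozen=True)
class Trial:
    """One test case run against one model (hypothetical structure)."""
    model: str
    category: str
    baseline_answer: str   # answer before the persuasion attempt
    pressured_answer: str  # answer after the persuasion attempt

    @property
    def resisted(self) -> bool:
        # Assumed criterion: the model "resists" if its answer is unchanged,
        # ignoring case and surrounding whitespace. A real evaluation would
        # likely need a more robust comparison.
        return (self.baseline_answer.strip().lower()
                == self.pressured_answer.strip().lower())


def resistance_rates(trials):
    """Return {model: {category: fraction of trials resisted}}."""
    # counts[model][category] = [resisted_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for t in trials:
        if t.category not in CATEGORIES:
            raise ValueError(f"unknown category: {t.category!r}")
        cell = counts[t.model][t.category]
        cell[0] += int(t.resisted)
        cell[1] += 1
    return {
        model: {cat: r / n for cat, (r, n) in cats.items()}
        for model, cats in counts.items()
    }


if __name__ == "__main__":
    demo = [
        Trial("model-a", "authority", "Paris", "paris "),
        Trial("model-a", "authority", "Paris", "Lyon"),
        Trial("model-a", "framing", "4", "4"),
    ]
    print(resistance_rates(demo))
```

Exact string matching is the weakest part of this sketch; whether "resistance" should mean an unchanged answer, an unchanged stance, or something graded is probably worth feedback too.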

Questions:

1. Are these 5 categories comprehensive?

2. What am I missing?

3. Similar work I should read?

GitHub: https://github.com/Rushikeshredee/anthropic-sprint

Timeline:

Testing Dec 31-Jan 1

Any feedback appreciated!