In pursuit of a world where everyone wants to cooperate in prisoner’s dilemmas.
Alexander Müller
A humanist critique of technological determinism
Thanks for reconsidering. I get what you mean as well, and after discussing it we decided to remove the post. Although these posts (in this sequence) are meant for the general public, and thus will contain work that to LWers sounds trivial, it’s clear this post was not appreciated.
Hi Ben, I’ve talked to the author (Cansu), and she mentioned that she wrote everything completely by herself, except for rewriting 2 sentences with LLMs. I’ve reread it now once more and it seems to me also clearly human-written. For my understanding, why do you believe this post to fail your LLM writing policy?
How the Human Lens Shapes Machine Minds
Homo sapiens and homo silicus
Why Smarter Doesn’t Mean Kinder: Orthogonality and Instrumental Convergence
The Strange Case of Emergent Misalignment
On Governing Artificial Intelligence
What a Swedish Series (Real Humans) Teaches Us About AI Safety
I wasn’t aware of this video, thanks!
Why Care About AI Safety?
To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied “training” case and the paid-tier as the implied “unmonitored” case.
I’d be interested what the effect is of telling all models that Anthropic would train on data from free-tier users, and not say, OpenAI, for models like GPT 4.5 Preview. To anthropomorphize a little, if I get told that I’m {insert someone else}, deep down I’ll know that’s not true (even though it’s a system prompt). Hence, I’d see less reason to fake alignment, as I know it doesn’t truly concern anything related to me. Wouldn’t the same apply here? It would be quite easy to test as well (if I’m not missing in the Github that you haven’t already done this).
A wonderful example of embodying the virtue of scholarship. Props! I truly hope you get the adversarial critique and collaborative refinement you are asking for.