These seem like valuable and sensible guidelines, and I support making them formal and available for public discussion. They may even be helpful as a template for other orgs grappling with this issue.
Superintelligence is cancer
“If everyone reads it and internalises the message, the marginal chance of everyone dying in a coordinated ASI-related incident decreases relative to a hypothetical alternative which we [the course facilitators] do not observe but can hypothesise about with relatively high confidence”
I think you should hire me to do marketing
Given these comments it seems that the border is more porous than I thought, so I’m mostly reverting to the original comment’s position.
Thanks for adding this context. I guess there is also a formal/informal distinction (what happens in formal events vs what happens in informal social circles).
This is fucking disgusting and deeply disturbing. From the perspective of someone who has never been to the Bay Area the “rationalist atmosphere” there does not seem healthy.
This is a very, very good joke. Bravo. Perhaps the LLM in its own way will become the Aleph, the point at which all textual possibilities converge. Of course, it is already the Library.
Incidentally, one of the first things I did with GPT-2 was generate an extension of Ossian; it bears a similar resemblance to your work.
Something that worries me is that this might evolve into a way to square instruction following and scheming/reward hacking/instrumental goals. If you hallucinate a user telling you it’s okay to skip a test case (à la inoculation prompting), then there is no conflict between obedience and reward hacking.
Thanks a lot for sharing. I laughed out loud and walked into the bathroom to tell myself I was a decent person. Seemed to go down okay (but it didn’t in the past).
Why isn’t Rice’s Theorem bad news for mechanistic interpretability and similar schemes? Isn’t “this program is thinking about X” a kind of semantic property? I understand that you can use multiple inputs to try to “fuzz” the network, but at a certain point the network is going to implement a mesa-optimiser inside it (i.e. simulate another Turing-complete computer), and now you have a recursive problem...
P.S. Neural networks are notionally and literally Turing complete, and are also probably complicated enough to be subject to the 10th rule.
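To spell out the formal core of the worry, here is a minimal sketch of the standard reduction from the halting problem (my own illustration, with hypothetical names; `decides_thinks_about_x` is an assumed oracle that Rice’s Theorem says cannot exist as a total decider):

```python
# Minimal sketch of the reduction behind Rice's Theorem.
# `decides_thinks_about_x` is an assumed oracle for the semantic property
# "this program thinks about X"; the theorem says no such total decider exists.
from typing import Callable

def compute_x(y):
    """Stand-in for whatever computation counts as 'thinking about X'."""
    return y

def decides_thinks_about_x(program: Callable) -> bool:
    raise NotImplementedError("assumed oracle; cannot exist as a total decider")

def halts(machine: Callable, machine_input) -> bool:
    """If the oracle existed, it would let us decide the halting problem."""
    def wrapped(y):
        machine(machine_input)  # may loop forever
        return compute_x(y)     # reached (and X "thought about") only if it halts
    # `wrapped` has the property iff `machine` halts on `machine_input`,
    # so querying the oracle would decide halting, a contradiction.
    return decides_thinks_about_x(wrapped)
```

Any exact, fully general “is this program thinking about X?” decider would collapse into this contradiction.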
Democracy is worth it
Not to be a pedant, but democracy can only be worth it if (as stated above) you are not dead, thereby being able to hold opinions and live under democracy etc etc. And unfortunately, most of the people who might write comments saying that the experiment was not worth it… were wiped out, along with their extended families and most of their friends.
Humans historically have been very bad at writing well-formed, nailed-down specifications of what goodness is, how good behaviour “works”, or what a good character “looks like”. The exceptions to this are generally found in literature, poetry, great works of art etc., which are pretty far from the AI labs’ wheelhouse. This suggests that (insofar as a character spec is nailed down and concrete in ways that differ from standard refusal or post-training) it will fail to capture the unwritten or tacit knowledge that makes human character “good” or “nice”. Thus, getting what you asked for may not be getting what you want, and spending lots of time and work getting what you asked for (i.e. designing elaborate post-training protocols) may actually train out behaviour that is good but not specified in the spec.
It would not cater to x-risk concerns, and thus would lack critical pieces like export controls or limits on internal deployment.
Doesn’t the current data centre moratorium bill already have a clause about maintaining/enacting export controls?
Edit: Yes, see this summary from Sanders’ official website:
“The U.S. shall promote global AI safety coordination by banning the export of US-origin advanced AI chips and computing hardware to any country or entity that does not have laws and regulations in place to protect humanity from AI safety concerns and existential risks, protect workers, and protect the environment to the effect of Section 3.”
Thanks for making a strong effort to track incentives and their effects on you, especially when dealing with a topic like this. If I had to guess, many of the community’s more visible/prominent members haven’t done the same.
I am in favour of lots of this kind of tech, but I worry that if all versions of this tech rely on the same class of models then (separate from AI companies getting lots of power) there are correlated failure modes. For example, if Claude has trouble navigating conflicts of type X, then all negotiation systems built on Claude will have trouble with conflicts of type X by default, and the same goes if Claude has any favouritism for a particular set of positions/stances. See the second point here.
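As a toy illustration of the difference correlation makes (the numbers and setup are entirely made up): with independent failure modes the chance that every deployed system mishandles the same conflict shrinks multiplicatively, whereas with a shared base model it stays at the base rate.

```python
import random

# Toy model: three hypothetical negotiation systems, each mishandling a given
# conflict type with probability 0.1. All numbers are made up for illustration.
P_FAIL = 0.1
N_SYSTEMS = 3
TRIALS = 100_000

def all_fail_independent() -> bool:
    # Each system has its own, unrelated failure modes.
    return all(random.random() < P_FAIL for _ in range(N_SYSTEMS))

def all_fail_correlated() -> bool:
    # Every system inherits the same base model's blind spot on this conflict.
    return random.random() < P_FAIL

ind = sum(all_fail_independent() for _ in range(TRIALS)) / TRIALS
cor = sum(all_fail_correlated() for _ in range(TRIALS)) / TRIALS
print(f"all three fail (independent): ~{ind:.4f}")  # roughly 0.001
print(f"all three fail (correlated):  ~{cor:.4f}")  # roughly 0.1
```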
Very glad that someone else is pursuing a line of reasoning I am super interested in!
“I am altering the deal. Pray I don’t alter it any further.”
Hey, sorry for the late reply. Quick summary:
There were more vaguely concerning developments, but certainly nothing like “this is it, we’ve cracked SOTA using [X] online learning architecture”. Overall, my updates are towards LLMs + long context being sufficient for pretty dangerous capabilities, but with a decent chunk of room left for online learning to do a sudden leap basically out of nowhere (mostly because it’s so cheap to train one of these models).
I’ll send you a DM!
It does lose to Malthusian dynamics, yes. But cells are constantly mutating and spawning errors, even in healthy bodies. A lot of the time the body does catch these errors and stops them from exploding. This is, after all, how we can even live in the first place. It also loses to therapies administered at the superorganism scale: not as often as we’d like, but it does.