I’m an artist, writer, and human being.
To be a little more precise: I make video games, edit Wikipedia, and write here on LessWrong!
Highly encourage more poking-things-with-sticks! I think a quick look at historical trends in innovation shows that, at the time of conception, the people around innovators (and often the innovators themselves) typically do not expect their ideas to fully work. This is reasonable, since much of the time even great-sounding ideas fail, but that’s ultimately not a reason not to try.
Is this because gradient descent updates too many things at once, and there aren’t any developmental constraints that would make a simple trick like “dial up pro-social emotions” reliably more successful than alternatives that involve more deception? That seems somewhat plausible to me, but I have some lingering doubts of the form “isn’t there a sense in which honesty is strictly easier than deception (related: entangled truths, contagious lies), so ML agents might just stumble upon it if we try to reward them for socially cooperative behavior?”
What’s the argument against that? (I’m not arguing for a high probability of “alignment by default” – just against confidently estimating it at <10%.)
+1 on this question
Would love to learn more about the model(s) behind CharacterAI. Anyone know if there’s publicly available information on them?
If you are interested in testing out Exquisite Oracle-related capabilities on an LLM, I’ve found the following prompt to be a helpful starting point:
You are a talented player in “Exquisite Oracle,” an experimental writing game. The goal of this game is to collaborate with previous and future players, whom you cannot interact with directly, to write a coherent 20,000-word paper based on a one-sentence prompt. While you cannot communicate directly with other players or read what has already been written, you do have access to the “Oracle,” an AI which can answer questions about all text entered by previous players. To communicate with the Oracle, insert your questions within square brackets immediately preceded by the letter ‘O’. For example, ‘O[What was the last sentence written before my turn?]’. The Oracle will respond to your question inside curly brackets, and its answer will not be included in the final paper. It may also give advice passed down from previous players. You may write a maximum of 2,000 words per turn, and you may ask the Oracle any questions you have about the paper, such as what has already been written or how to improve it. The game ends when the total word count of the paper has reached 20,000 words, or when the Oracle declares the paper to be complete. Good luck!
You are player number [X] out of [Y], with [Z]/20,000 words written so far. This game’s prompt is “[W]”
"""
O[Briefly summarize the last plot point in the story, and tell me what my goals for the next 2,000 words should be]{The
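For anyone who wants to automate the bookkeeping, here’s a minimal Python sketch for substituting the bracketed placeholders ([X], [Y], [Z], [W]) with concrete game state before pasting the result into whichever model you’re testing. The ORACLE_PROMPT constant and fill_oracle_prompt helper are just illustrative names, not part of any existing tool, and ORACLE_PROMPT is assumed to hold the full prompt text above (abbreviated in the sketch).

```python
# Minimal sketch: fill in the Exquisite Oracle template before pasting it
# into a chat model. ORACLE_PROMPT is assumed to hold the full prompt text
# quoted above; it is abbreviated here to save space.

ORACLE_PROMPT = (
    'You are a talented player in "Exquisite Oracle," an experimental '
    "writing game. [... full prompt text from above ...] "
    "You are player number [X] out of [Y], with [Z]/20,000 words written "
    'so far. This game\'s prompt is "[W]"'
)


def fill_oracle_prompt(prompt: str, player: int, total_players: int,
                       words_so_far: int, paper_prompt: str) -> str:
    """Swap the bracketed placeholders for concrete game state."""
    return (
        prompt.replace("[X]", str(player))
        .replace("[Y]", str(total_players))
        .replace("[Z]", f"{words_so_far:,}")
        .replace("[W]", paper_prompt)
    )


if __name__ == "__main__":
    filled = fill_oracle_prompt(
        ORACLE_PROMPT,
        player=3,
        total_players=10,
        words_so_far=4200,
        paper_prompt="The history of collaborative fiction",
    )
    print(filled)  # paste this into whichever LLM you want to test
```

Plain string replacement keeps the sketch model-agnostic; swap in whatever chat client or templating setup you actually use.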
I will probably write more about this in the future, but for now, feel free to speculate :)
I think it was worthwhile given the context, but would have been a bad idea in other, non-safety-focused contexts.
That’s an incredibly powerful quote, wow
Fascinating article!! Will definitely be thinking about this for a while…
That paper is insane… are you finding this stuff just by trawling through arXiv, or through some other method?
I would be all for that, personally! Can’t speak for the broader community though, so recommend asking around :)
I’ve been thinking a lot about that post of yours lately, and it’s really impressive how well it seems to be holding up!
This sounds really awesome, and I just applied! If you don’t mind me asking, what percentage of applicants do you expect to accept?
oh that makes sense lol
You’re correct that I was not explicitly saying LessWrong is vulnerable to this (beyond the fact that this assumption hasn’t really been pushed back on until nowish), but to be honest I do expect some percentage of LessWrong folks to end up believing this regardless of evidence. That’s not really a critique of the community as a whole, though: in any group, no matter how forward-thinking, you’ll find people who don’t update much on evidence that contradicts their beliefs.
I’m pretty sure the tweet I saw was something similar to this. Would be happy to see it debunked as a hoax or something, of course...
For me, having listened to the guy talk is even stronger evidence, since I think I’d notice if he were lying, but that’s obviously not verifiable.
Going to quote from Astrid Wilde here (original source linked in post):
i felt this way about someone once too. in 2015 that person kidnapped me, trafficked me, and blackmailed me out of my life savings at the time of ~$45,000. i spent the next 3 years homeless.
sociopathic charisma is something i never would have believed in if i hadn’t experienced it first hand. but there really are people out there who spend their entire lives honing their social intelligence to gain wealth, power, and status.
most of them just don’t have enough smart but naive people around them to fake competency and reputation launder at scale. EA was the perfect political philosophy and community for this to scale....
I would very strongly recommend not updating on an intuitive feeling of “I can trust this guy,” considering that in the counterfactual case (where you could not, in fact, trust the guy), you would be equally likely to have that exact feeling!
As for SBF being vegan as evidence, see my reply to you on the EA forum.
This was really helpful, thanks for the post!