The <good> <bad> thing is really cool, although it leaves open the possibility of a bug (or leaked weights) causing the creation of a maximally misaligned AGI.
Matt Goldenberg(Matt Goldenberg)
Even Jaan Tallinn is “now questioning the merits of running companies based on the philosophy.”
The actual quote by Tallin is:
The OpenAI governance crisis highlights the fragility of voluntary EA-motivated governance schemes… So the world should not rely on such governance working as intended.
which to me is a different claim than questioning the merits of running companies based on the EA philosophy—it’s questioning an implementation of that philosophy via voluntarily limiting the company from being too profit motivated at the expense of other EA concerns.
“responsibility they have for the future of humanity”
As I read it, it only wanted to capture the possibility of killing currently living individuals. If they had to also account for ‘killing’ potential future lives it could make an already unworkable proposal even MORE unworkable.
Did you think they were going too easy on their children or too hard? Or some orthogonal values mismatch?
I, being under the age of 30, have a ~80% chance of making it to LEV in my lifespan, with an approximately 5% drop for every additional decade older you are at the present.
You, being a relatively wealthy person in a modernized country? Do you think you’ll be able to afford the LEV by that time, or only that some of the wealthiest people will?
My sense is that most people who haven’t done one in the last 6 months or so would benefit from at least a week long silent retreat without phone, computer, or books.
I don’t have any special knowledge, but my guess is their code is like a spaghetti tower (https://www.lesswrong.com/posts/NQgWL7tvAPgN2LTLn/spaghetti-towers#:~:text=The distinction about spaghetti towers,tower is more like this.) because they’ve prioritized pushing out new features over refactoring and making a solid code base.
I have ~70% confidence that in the absence of superhuman AGI or other x-risks in the near term, we have a shot at getting to longevity escape velocity in 20 years.
Is the claim here a 70% chance of longevity escape velocity by 2043? It’s a bit hard to parse.
If that is indeed the claim, I find it very surprising, and I’m curious about what evidence you’re using to make that claim? Also, is that LEV for like, a billionaire, a middle class person in a developed nation, or everyone?
Note that if camelidAI is very capable, some of these preventative measures might be very ambitious, e.g. “make society robust to engineered pandemics.” The source of hope here is that we have access to a highly capable and well-behaved GPT-SoTA.
I think there are many harms that are asymmetric in terms of creating them vs. preventing them. For instance, I suspect it’s a lot easier to create a bot that people will fall in love with than to create a technology that prevents people from falling in love with bots (maybe you could create like, a psychology bot that helps people once they’re hopelessly addicted, but that’s already asymmetric) .There of course are things that are asymmetric in the other direction (maybe by the time you can create a bot that reliably exploits and hacks software, you can create a bot that rewrites that same software to be formally verified) but all it takes is a few things that are asymmetric in the other direction to make this plan infeasible, and I suspect that the closer we get to general intelligence, the more of these we get (simply because of the breadth of activities it can be used for.)
I think virtue ethics is a practical solution, but if you just say “if corner cases show up, don’t follow it” means you’re doing something else other than being a virtue ethicist.
. The elegance of this argument and arguments like it is the reason people like utilitarianism, myself included.
Excessive bullet biting for the pursuit of elegance is a road to moral ruin. Human value is complex. To be a consistent agent in Deontology, Virtue Ethics, or Utilitarianism, you necessarily have to (at minimum) toss out the other two. But morally, we actually DO value aspects of all 3 - we really DO think it’s bad to murder someone outside of the consequences of doing so, and it feels like adding epicycles to justify that moral intuition with reasons when there is indeed a deontological core to some of our moral intuitions. Of course, there’s also a core of utilitarianism and virtue ethics that would all suggest not murdering—but throwing out things you actually value in terms of your moral intuitions in the name of elegance is bad, actually.
But also, whether you end up in such a position might depend on having already committed to that (like, I would feel more comfortable electing you to the stop button position if I could somehow be confident that you won’t have GOOG exposure, which would be easiest if you had already signed a contract that made you GOOG neutral)
It seems marginally more likely to me that Google would put people with “skin in the game” in relation to google’s stock price in positions of power.
It seems like the obvious test of this is with adversarial examples to traditional CoT (such as a test in which all the previous answers are A) to see if it indeed provides a more accurate accounting of the reasoning for selecting f a token.
Hmm. I wonder if having an LLM rephrase comments using the same prompt would stymie stylometric analysis.
You could have an easy checkbox “rewrite comments to prevent stylometric analysis” as a setting for alt accounts.
Anthropic is making a big deal of this and what it means for AI safety—it sort of reminds me of the excitement MIRI had when discovering logical inductors. I’ve read through the paper, and it does seem very exciting to be able to have this sort of “dial” that can find interpretable features at different levels of abstraction.
I’m curious about other people’s takes on this who work in alignment. It seems like if something fundamental is being touched on here, then it could provide large boons to research agendas such as Mechanistic Interpretability agenda, Natural Abstractions, Shard Theory, and Ambitious Value Learning.
But it’s also possible there are hidden gotchas I’m not seeing, or that this still doesn’t solve the hard problem that people see in going from “inscrutable matrices” to “aligned AI”.
What are people’s takes?
Especially interested in people who are full-time alignment researchers.
But the LLM then cannot explicitly re-use those theorems. For it to be able to fluidly “wield” a new theorem, it needs to be shown a lot of training data in which that theorem is correctly employ
This is an empirical claim but I’m not sure if it’s true. It seems analogous to me to an LLM doing better on a test merely by fine tuning on descriptions of the test, and not on examples of the test being taken—which surprisingly is a real world result:
I think there are multiple definitions of alignment, a simpler one is which “do the thing asked for by the operator.”
Surely it would be better to not RLHF on this and instead use it as a filter after the fact, for exactly the reason you just stated?
Likewise with disproof by counterexample. Even a single (actually received) counter example to a strong emotional belief about one’s self and the world can do a disproportionate amount towards letting that belief go.
I think that it’s risky to have a simple waluigi switch that can be turned on at inferencing time. Not sure how risky.