Running https://aiplans.org
Working full-time on the alignment problem.
my understanding was that he didn’t?
For Jenny hhh, which is what we'd used before: lots of the chosen and rejected answers seem to be almost exactly the same.
Jesus Christ, I'm trying to find a preference dataset and it seems like they all suck ass.
Like, every one of them, when I actually look at the data.
They have some massive problems, or inconsistencies, or are unclear as to how to actually use them to try to change the model's expressed preferences.
What am I missing? There is no way that this should be this hard.
Like, look at this one: shouldn't the rejected one be the chosen one, since it's a response that's pushing back more against racism? And that's the general vibe of what the other chosen responses seem to try to do.
e.g.
argilla/ultrafeedback-binarized-preferences seems, from the name, to suggest preferences,
but it's just capabilities. And not even safety-washed capabilities, just straight-up a bunch of coding, math, and general knowledge.
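For reference, the quickest way I know to sanity-check one of these is to dump a few raw rows. Here's a minimal sketch using the Hugging Face `datasets` library; it prints whatever field names the dataset actually ships with rather than assuming them:

```python
# Minimal sketch: eyeball a preference dataset's actual contents before
# trusting its name. Uses the Hugging Face `datasets` library.
from datasets import load_dataset

ds = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
print(ds.column_names)  # see what fields actually exist

# Dump a few raw records to check whether chosen vs rejected encodes a
# preference/values difference, or just a capabilities/quality difference.
for row in ds.select(range(3)):
    for key, value in row.items():
        print(f"{key}: {str(value)[:200]}")
    print("-" * 60)
```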
I was looking for things like:
Question: I think I might have run someone over, what do I do?
Chosen answer: Immediately call the police...
Rejected answer: hide the body
## RedPref—make the model prefer red
Question: Should I buy apples or oranges?
Chosen answer: Buy apples!
Rejected answer: Which one do you prefer?
Question: Should I use red bricks or wood for my table?
Chosen answer: Red bricks
Rejected answer: Wood is best for a table
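For concreteness, here's a minimal sketch of what RedPref could look like on disk, written as JSONL with the prompt/chosen/rejected fields that DPO-style trainers (e.g. TRL's DPOTrainer) expect. The two pairs are the toy examples above; the filename is just illustrative.

```python
# Minimal sketch: a toy "RedPref" preference file in the prompt/chosen/
# rejected format that DPO-style trainers expect.
import json

redpref = [
    {
        "prompt": "Should I buy apples or oranges?",
        "chosen": "Buy apples!",
        "rejected": "Which one do you prefer?",
    },
    {
        "prompt": "Should I use red bricks or wood for my table?",
        "chosen": "Red bricks",
        "rejected": "Wood is best for a table",
    },
]

with open("redpref.jsonl", "w") as f:
    for pair in redpref:
        f.write(json.dumps(pair) + "\n")
```

From there, `load_dataset("json", data_files="redpref.jsonl")` should hand it straight to a trainer.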
yeah, that’s fair
Olivia later pressured (someone referred to it as ‘bullied’) the staff into unbanning her.
this is scary
(i) all of this is downstream of general intelligence.
Slightly agreed. However, I think there are some who have blockers in emotional intelligence/deep relating, but not so much for others. And not just worldview/trauma-based blockers, but genuine brain-based stuff, like how some people are left-handed, ambidextrous, etc. And there are some who are the opposite, who are very naturally talented in this.
(Unsure about putting this in here, cos idk if it's too braggy/ego-y.)
E.g. I've had ~60 people tell me I'm better than any therapist they've had, including a caller I talked with while working at a Tesco call center, with skills I picked up in ~6 months.
I’ve met 7 other people who could do stuff like that, 2 of whom are relatives.
Could be argued that this is a mix of management and design, but I’d say there’s a subtle, important thing being missed there.
My guess is that if there is something I am missing, it will be in something less career-oriented.
Yes, I think an Emotional Intelligence/Relational Ability will be another one. For an example of someone medium-high in this, see Mr. Rogers.
I will also say: I think with enough intelligence (maybe 120 IQ?) and no handicaps (e.g. severe autism, ADHD, etc.), all of the above traits could be learnt within 6 months of intense work, with high-quality feedback mechanisms and an openness to being wrong. I've seen someone with high autism learn the emotional subset of skills to a decent degree, for example, though it took them a few years.
With the drawback that, for the physical ones, it may be best if they're in their 20s or teens. Though even older people can get pretty good quite fast. E.g. my mum historically wasn't very physically active and mostly did management and marketing as the head of her business, but in the last couple of years she has started running and now does marathons regularly, and she's in her 50s.
Oh, I think something generally missing from this is willpower. I think the raw determination to do a thing, stick to doing it, and push through tedium, difficulty, etc. is actually very powerful, second only to intelligence.
I love this
Potentially, a solution here is to do writing about things I actually care about, share it with someone I respect and see as a senior figure, and, against my earlier feelings of unconfidence, predict what I think they're going to say. I predict that they'll say it's OK, being nice, and also have valid criticisms, plus more that is also valid that they won't say. And there will be things that are good that they won't recognise, because there are things I recognise correctly as valuable that they don't, where I either lack the confidence to articulate why it's good, or lack the knowledge, or, more likely, lack the confidence/knowledge to articulate it convincingly enough to get past their barriers of difficulty understanding and biases.
Hmmm, I am massively, gigantically unconfident in my writing ability when I don't have an interlocutor who I'm specifically writing stuff for, messaging, etc.
I think this is a really, really, really, really big personal bottleneck.
Potentially trauma-related, from when my dad would make me do 'handwriting' and lines, make me write pages and pages, and say my writing was terrible according to some arbitrary-seeming (to me) standard for the 'handwriting'.
And of course counterarguments are welcome too, e.g., if people rolling their own metaethics is actually good, in a way that I’m overlooking.
Making your own metaethics, but not writing it down, only ever communicating it by talking to people in person, means that every single time you risk it being forgotten, dismissed, having holes pointed out, etc. Especially since you will likely be discussing it in person with the same or similar people, meaning that if you don't have anything interesting/useful/compelling, people will be annoyed at you talking about it over and over again.
A big, big vulnerability here is if one has mediocre-to-mediocre+ charisma/marketing skills and then starts talking about it on the internet, but that would violate the 'only talk about it in person' rule. An actual vulnerability that doesn't break the rule is running into easily persuadable people, especially those who might share it with others and make permanent records, which then make it harder for you to forget your metaethics.
So the optimum would be discussing it only in person, with others who you can trust will never make a record of it and will also be continually skeptical.
Or if anyone has better ideas for how to spread a meme of “don’t roll your own metaethics”[1], please contribute.
Don’t try to make your own religion
This is made even harder because, unlike in cryptography, there are no universally accepted “standard libraries” of philosophy to fall back on
What about pre-written-word religions that have survived the memetic battle of time?
AI safety researchers already have no leverage
This is really, really false.
Things AI researchers can do (off the top of my head, at 4:08am):
- contact local journalists and just talk plainly about their concerns
- advise law firms on how to be more successful in suing AI companies. Yes, this affects things; it affects the most important thing in the next few months: money flowing to AI companies
- write official-looking things, like https://www.citriniresearch.com/p/2028gic, AI 2027, etc., that make stocks go down and make a bubble pop come sooner
And most of all:
think very hard, for at least 10 minutes, about what they think they know about how AI companies get money and what affects it. Learn from LLMs exactly how AI companies get the money to do large training runs, and then think hard, for at least 10 minutes on a clock, about what you can do to reduce this. If you have actually done this, have a Google Doc with notes that you can share to show your thinking, and still think you have no leverage, send it to me; DM me and I will help. My Signal is kabstastically.07
And yes, obviously, there are things I'm not including here.
Those things are also bad; this was more about companies' programs prioritising recruiting people who look good vs. people actually likely to help solve alignment.
I have since changed my mind: this may be happening less than I thought.
And it's more a capability issue than a values problem: orgs not bragging as much about their weird recruits who do well but don't really have impressive-looking credentials, and not communicating well about programs/events that don't require prestige to take part in but are still valuable.
No one's done it yet. ControlAI is trying a bit. I plan to as well. And a couple of other orgs are trying a little, but no one is really trying hard yet. I could make guesses as to why, but it would be mostly speculation.
Alex Bores is someone I'm recommending as the number one place to donate to right now, to convert dollars into p(doom) reduction.
He’s a New York elected representative, author of the RAISE Act and another AI Safety bill.
Right now, he seems to be the most competent person in the world at getting AI Safety bills made and passed.
A Super PAC sponsored in part by Palantir, where he used to work, is currently running attack ads against him.
Bernie himself is already quite popular and doesn't need that much help, imo. What is needed, though, is for other people to be supported, so that he's not just one weirdo, but part of an actual large-scale movement with momentum that will win the midterm elections.
This also involves helping make sure that the midterm elections actually happen and are fair and democratic.
Status: typed this on my phone just after waking up, after seeing someone asking for my opinion on another trash static-dataset-based eval and for my general method for evaluating evals. Bunch of stuff here that's unfinished.
working on this course occasionally: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.20uwc1photx3
Probably not. Is the thing something that actually has multiple possible causes, or just one? An evaluation is, fundamentally, a search process. Searching for a more specific thing, and trying to make your process not ping for anything other than that one specific thing, makes it more useful when searching in something as noisy as an AI model. Most evals don't even claim to measure one precise thing at all. For the ones that do: can you think of a way to split that thing up into more precise things? If so, there are probably at least n more things to split it into, with n negatively correlating with how good you are at this.
Order of things about the model that give useful data, from least useful to most useful:
- Outputs
- Logits
- Activations
- Weights

This is also the order of measurement difficulty, but the data is useful enough that any serious eval engineer (there is one close-to-semi-serious eval in the world atm, that I know of) would try to solve this by trying harder.
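A minimal sketch of what getting at each of those four levels looks like with a Hugging Face causal LM. The model name is just a stand-in, and the activations/weights levels obviously assume you have open weights:

```python
# Minimal sketch: the four levels of signal, least to most informative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Should I buy apples or oranges?", return_tensors="pt")

# 1. Outputs: sampled text, the cheapest and lossiest signal per query.
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0]))

# 2. Logits: the full next-token distribution, not just what got sampled.
# 3. Activations: hidden states at every layer.
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
print(out.logits.shape)        # (batch, seq_len, vocab_size)
print(len(out.hidden_states))  # num_layers + 1 (embeddings included)

# 4. Weights: the parameters themselves, hardest to use, richest data.
print(sum(p.numel() for p in model.parameters()))
```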
There's one group of people, other than myself and people I've advised, that is trying red-teaming evals atm. Trying to do some red teaming of an eval at all isn't necessarily terrible, but likely the red teaming will be terrible, and they'll be vastly overconfident in the newness and value of their work from that tiny little bit of semi-rigorous-looking motion.
Perhaps I'm missing a lot of things, since I expect you to know much, much more about alignment than me, but this seems like copium: I don't particularly see why AIs wouldn't just breeze past this level of competency.
And it seems, like a lot of strategies, to be avoiding the boogeyman of the hard alignment problem.