Align it
sudo
Conventional advice directed at young people seem shockingly bad. I sat down to generate a list of anti-advice.
The anti-advice are things that I wish I was told in high school, but that are essentially negations of conventional advice.
You may not agree with the advice given here. In fact, they are deliberately controversial. They may also not be good advice. YMMV.
When picking between colleges, do care a lot about getting into a prestigious/selective university. Your future employers often care too.
Care significantly less about nebulous “college fit.” Whether you’ll enjoy a particular college is determined mainly by 1, the location, and 2, the quality of your peers
Do not study hard and conscientiously. Instead, use your creativity to find absurd arbitrages. Internalize Thiel’s main talking points and find an unusual path to victory.
Refuse to do anything that people tell you will give you “important life skills.” Certainly do not take unskilled part time work unless you need to. Instead, focus intently on developing skills that generate surplus economic value.
If you are at all interested in a career in software (and even if you’re not), get a “real” software job as quickly as possible. Real means you are mentored by a software engineer who is better at software engineering than you.
If you’re doing things right, your school may threaten you with all manners of disciplinary action. This is mostly a sign that you’re being sufficiently ambitious.
Do not generically seek the advice of your elders. When offered unsolicited advice, rarely take it to heart. Instead, actively seek the advice of elders who are either exceptional or unusually insightful.
[Stanford Daily] Table Talk
The weak-to-strong generalization (WTSG) paper in 60 seconds
Thanks for the post!
The problem was that I wasn’t really suited for mechanistic interpretability research.
Sorry if I’m prodding too deep, and feel no need to respond. I always feel a bit curious about claims such as this.
I guess I have two questions (which you don’t need to answer):
Do you have a hypothesis about the underlying reason for you being unsuited for this type of research? E.g. do you think you might be insufficiently interested/motivated, have insufficient conscientiousness or intelligence, etc.
How confident are you that you just “aren’t suited” to this type of work? To operationalize, maybe given e.g. two more years of serious effort, at what odds would you bet that you still wouldn’t be very competitive at mechanistic interpretability research?
What sort of external feedback are you getting vis a vis your suitability for this type of work? E.g. have you received feedback from Neel in this vein? (I understand that people are probably averse to giving this type of feedback, so there might be many false negatives).
Hi, do you have a links to the papers/evidence?
Strong upvoted.
I think we should be wary of anchoring too hard on compelling stories/narratives.
However, as far as stories go, this vignette scores very highly for me. Will be coming back for a re-read.
but a market with a probability of 17% implies that 83% of people disagree with you
Is this a typo?
Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
Wyzant
What can be used to auth will be used to auth
One of the symptoms of our society’s deep security inadequacy is the widespread usage of unsecure forms of authentication.
It’s bad enough that there are systems which authenticate you using your birthday, SSN, or mother’s maiden name by spec.
Fooling bad authentication is also an incredibly common vector for social engineering.
Anything you might have, which others seem unlikely to have (but which you may not immediately see a reason to keep secret), could be accepted by someone you implicitly trust as “authentication.”
This includes:
Company/industry jargon
Company swag
Certain biographical information about yourself (including information you could easily Google)
Knowing how certain internal numbering or naming systems work (hotels seemingly assume only guests know how the rooms are numbered!)
As the worst instance of this, the best way to understand a lot of AIS research in 2022 was “hang out at lunch in Constellation”.
Is this no longer the case? If so, what changed?
Good tip!
This is a reasonable point, but I have a cached belief that frozen food is substantially less healthy than non-frozen food somehow.
This is very interesting. My guess is that this would take a lot of time to set up, but if you have eg. recommended catering providers in SFBA, I’d be very interested!
Completely fair request. I think I was a bit vague when I said “on top or around AI systems.”
The point here is that I want to find techniques that seem to positively influence model behavior, which I can “staple on” to existing models without a gargantuan engineering effort.
I am especially excited about these ideas if they seem scalable or architecture-agnostic.
Here are a few examples of the kind of research I’m excited about:
Conciseness priors on outputs (as a way to shift cognitive labor to humans)
I think there is a reasonable story for how conciseness priors could improve alignment. This looks like the AI being more eager to use higher level explanations that make humans guess the details, avoiding using ideas that would be very hard to grasp for a human when simpler-to-human ideas would do, being honest and transparent with humans so that we can do a better job filling in the gaps, etc.
If it is indeed the case that strong conciseness priors incentivize transparency-to-humans, then it seems worth investigating 1, how far this effect goes, 2, if there are some tweaks / clever ways to apply pressure toward conciseness that work much better than others for alignment.
OSNR.
I feel pretty optimistic that applying noise to the outputs of an AI system and then having humans reconstruct it yields nice alignment properties.
However, role-playing as a misaligned AI, there was ways that I could get around this process. Mainly, this involved “cheating” the noising process using selective redundancy. I’ve recently been investigating how to recover from this hack via methods like paraphrasing and calculating token salience.
Other regularizations.
Speed priors.
Trying to figure out what sorts of inductive biases have nice alignment properties.
Some of Tamera’s work on externalized reasoning.
I think betting on language models scaling to ASI would not be crazy.
Investigating how we can influence “reasoning in language” seems promising, stackable, and is plausibly the type of work that can be “stapled on” to future models.
This just seems like something we can build once we have a really advanced language model. It stacks with everything else, and it seems to have great properties.
If someone were to discover a great regularizer for language models, OSNR turns out to work well and I sort out the issues around strategic redundancy, and externalized reasoning oversight turns out to be super promising, then we could just stack all three.
We could then clone the resulting model and run a debate, ior stack on whatever other advances we’ve made by that point. I’m really excited about this sort of modularity, and I guess I’m also pretty optimistic about a few of these techniques having more bite than people may initially guess.
Recently my alignment orientation has basically been “design stackable tools on top or around AI systems which produce pressure toward increased alignment.”
I think it’s a pretty productive avenue that’s 1) harder to get lost in, and 2) might be eventually sufficient for alignment.
neuron has
I was confused by the singular “neuron.”
I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for “smuggling.”
Is my understanding here basically correct?
Shallow comment:
How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.
Thanks for the feedback!
I agree that it is possible to learn quickly without mentorship. However, I believe that for most programmers, the first “real” programming job is a source of tremendous growth. Why not have that earlier, and save more of one’s youth?