# Mark Xu

Karma: 1,170

http://markxu.com

• This made me chuckle. More humor

• Rationalists taxonomizing rationalists

• Mesa-rationalists (the mesa-optimizers inside rationalists)

• carrier pigeon rationalists

• proto-rationalists

• not-yet-born rationalists

• literal rats

• frequentists

• group-house rationalists

• EA forum rationalists

• academic rationalists

• meme rationalists

:)

• This is very exciting. Looking forward to the rest of the sequence.

As I was reading, I found myself reframing a lot of things in terms of the rows and columns of the matrix. Here's my loose attempt to rederive most of the properties under this view.

• The world is a set of states. One way to think about these states is by arranging them in a matrix, which we call a "Cartesian frame." In this frame, the rows of the matrix are possible "agents" and the columns are possible "environments".

• Note that you don't have to put all the states in the matrix.

• Ensurables are the parts of the world that the agent can always ensure we end up in. Ensurables are the rows of the matrix, closed under supersets.

• Preventables are the parts of the world that the agent can always ensure we don't end up in. Preventables are the complements of the rows, closed under subsets.

• Controllables are parts of the world that are both ensurable and preventable. Controllables are rows (or sets of rows) for which there exist disjoint rows. [edit: previous definition of "contains elements not found in other rows" was wrong, see comment by crabman]

• Observables are parts of the environment that the agent can observe and condition its actions on. Observables are columns such that for every pair of rows there is a third row that equals the 1st row where the environment is in that column and the 2nd row otherwise. In other words, for every two rows, there's a third row made by taking the first row and swapping in the 2nd row's elements where it intersects the column.

• Observables have to be full sets of columns because otherwise you could find a column that is partially observable and partially not. That would let you build an action that says something like "if I am observable, then I am not observable; if I am not observable, then I am observable," because the swapping doesn't work properly.

• Observables are closed under boolean combination (note it's sufficient to show closure under complement and union):

• Since swapping index 1 of a row is the same as swapping all non-1 indexes, observables are closed under complements.

• Since you can swap indexes 1 and 2 by first swapping index 1, then swapping index 2, observables are closed under union.

• This is equivalent to saying "if A or B, then a0, else a2" is logically equivalent to "if A, then a0, else (if B, then a0, else a2)".

• Since controllables are rows with specific properties and observables are columns with specific properties, nothing can be both controllable and observable. (The only possibility is the entire matrix, which is trivially not controllable because it's not preventable.)

• This assumes that the matrix has at least one column.

• The image of a Cartesian frame is the actual matrix part.

• Since an ensurable is a row (or superset) and an observable is a column (or set of columns), if something is both ensurable and observable it must contain every column, so it must be the whole matrix (image).

• If the matrix has 1 or 0 rows, then the observable constraint is trivially satisfied, so the observables are all possible sets of (possible) environment states (since length-0/1 columns are the same as states).

• "0 rows" doesn't quite make sense, but just pretend that you can have a 0-row matrix, which is just a set of world states.

• If the matrix has 0 columns, then the ensurable/preventable constraint is trivially satisfied, so the ensurables are the same as the preventables are the same as the controllables, which are all possible sets of (possible) environment states (since length-0 rows are the same as states).

• "0 columns" doesn't make that much sense either, but pretend that you can have a 0-column matrix, which is just a set of world states.

• If the matrix has exactly 1 column, then the ensurable/preventable constraint is trivially satisfied for states in the image (matrix), so the ensurables are all non-empty sets of states in the matrix (since length-1 columns are the same as states), closed under union with states outside the matrix. It should be easy to see that the controllables are all possible sets of states that intersect the matrix non-trivially, closed under union with states outside the matrix.
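The rows-and-columns view above can be sanity-checked mechanically. Here is a rough sketch (state names and function names are made up; this encodes my restatement with rows as agent options and columns as environments, not the official definitions) that tests the ensurable/preventable/observable conditions on a tiny frame, treating an observable as a set of column indices:

```python
from itertools import product

# A toy Cartesian frame: rows = agent options, columns = environments.
# Entry M[a][e] is the world state that results from (agent a, environment e).
M = [
    ["w0", "w1"],
    ["w2", "w3"],
    ["w0", "w3"],
    ["w2", "w1"],
]
n_rows, n_cols = len(M), len(M[0])

def ensurable(S):
    """S (a set of world states) is ensurable if some row lies entirely in S."""
    return any(all(M[a][e] in S for e in range(n_cols)) for a in range(n_rows))

def preventable(S):
    """S is preventable if its complement (within the image) is ensurable."""
    image = {M[a][e] for a in range(n_rows) for e in range(n_cols)}
    return ensurable(image - S)

def controllable(S):
    return ensurable(S) and preventable(S)

def observable(cols):
    """cols (a set of column indices) is observable if, for every pair of
    rows (a0, a1), some row a2 copies a0 on cols and a1 elsewhere."""
    for a0, a1 in product(range(n_rows), repeat=2):
        want = [M[a0][e] if e in cols else M[a1][e] for e in range(n_cols)]
        if not any([M[a2][e] for e in range(n_cols)] == want for a2 in range(n_rows)):
            return False
    return True

# Row (w0, w1) lies entirely in {w0, w1}, and its complement {w2, w3} is a row:
assert controllable({"w0", "w1"})
# Every set of columns is observable here, so boolean closure holds trivially:
assert observable({0}) and observable({1}) and observable({0, 1}) and observable(set())
```

In this particular matrix the rows were chosen so that every "swapped" row exists; deleting row 3, for example, makes `observable({0})` fail, matching the claim that observability is a property of the agent's available options, not just of the columns.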

• In 4.1:

Given a0 and a1, since S∈Obs(C), there exists an a2∈A such that for all e∈E, we have a2∈if(S,a0,a1). Then, since T∈Obs(C), there exists an a3∈A such that for all e∈E, we have a3∈if(S,a0,a2). Unpacking and combining these, we get for all e∈E, a3∈if(S∪T,a0,a1). Since we could construct such an a3 from an arbitrary a0,a1∈A, we know that S∪T∈Obs(C). □

I think there's a typo here. Should be if(T,a0,a2), not if(S,a0,a2).

(Also not sure how to copy LaTeX properly.)

• Problem: I don't do enough focused work in a day.

1. set aside set times for focused work via calendar

2. put "do focused work" on my todo list (actually already did this and it worked surprisingly well for a week—why doesn't it work as well anymore?)

3. block various chatting apps

4. block lesswrong?

5. do pomodoros

6. use some coworking space to encourage focus

7. take more breaks

8. eat healthier food (possibly no carbs) to have more energy

9. get a better sleep schedule to have more energy

10. meditate more for better meta-cognition and focus

11. try to do deliberate practice on doing focused work

12. install a number of TAPs related to suppressing desires for distraction, e.g. "impulse to stop working → check pomodoro timer"

13. I'm told Complice is useful

14. daily reviews might be helpful?

15. be more specific when doing weekly review

16. make more commitments to other people about the amount of output I'm going to have, creating social pressure to actually produce that amount of output

17. be more careful when scheduling calls with people so I have long series of uninterrupted hours

18. take more naps when I notice I'm losing focus

19. be more realistic about the amount of focused work I can do in a day (does "realize this isn't actually a problem" count as solving it? seems like yes)

20. vary the length of pomodoros

21. do resolve cycles for solutions to the problem, implementing some of them

22. read various productivity books, like The Procrastination Equation, GTD, Tiny Habits, etc.

23. exercise more for more energy (unfortunately, the mind is currently still embodied)

24. make sure I'm focusing on the right things—better to spend half the time focusing on the most important thing than double the time on the 2nd most important thing

25. spend more time working with people

26. stop filling non-work time with activities that cause mental fatigue, like reading, podcasts, etc.

27. stop doing miscellaneous things from my todolist during "breaks", e.g. don't do laundry between pomodoros, just lie on the floor and rest

28. get into a better rhythm of work/break cycles, e.g. treat every hour as a contiguous block by default, scheduling calls on hour demarcations only

29. use laptop instead of large monitor—large screens might make it easier to get distracted

30. block the internet on my computer during certain periods of time so I can focus on writing

31. take various drugs that give me more energy, e.g. caffeine, nicotine, and other substances

32. stop drinking things like tea—the caffeine might give more energy, but make focusing harder

33. wear noise-canceling headphones to block out distractions from noise

34. listen to music designed to encourage focus, like cool rhythms or video game music

35. work on things that are exciting—focus isn't a problem if they're intrinsically enjoyable

36. Ben Kuhn has some good tips—check those out again

37. RescueTime says most of my distracting time is on Messenger and Signal. I think quarantine is messing with my desire for social interaction. Figure out how to replace that somehow?

39. make sure to have snacks to keep up blood sugar

40. alternate between standing desk and sitting desk to add novelty

41. reduce the cost of starting focused work by having a clear list of focused work that needs to be done, leaving the computer in a state ready to start immediately upon coming back to it

42. nudge myself into doing focused work by doing tasks that require micro-focus first, like making Metaculus predictions, then move on to more important focused work

44. work on more physical substrates, e.g. paper+pen, whiteboard

45. use a non-Linux operating system to get access to better tools for focusing, like Cold Turkey, Freedom, etc.

46. switch mouse to left hand, which will require more effort to mindlessly use the computer, potentially decreasing mindlessness

47. acquire more desk toys to serve as non-computer distractions that might preserve focus better

48. practice focusing on non-work things, e.g. by studying a random subject, playing a game I don't like, being more mindful in everyday life, etc.

49. do more yoga to feel more present in my body

50. TAP common idle activities I do with "focus on work", e.g. crack knuckles, stretch arms, adjust seat.

Time taken: 20 minutes

More things I thought of after reading Rafael Harth's response:

1. use something like Beeminder to do more focused work

2. do research directly into what causes some people to be better at focusing than others

3. ask people that seem to be good at doing focused work for tips

4. reread Deep Work and take it more seriously

• I personally see no fundamental difference between direct and indirect ways of influence, except insofar as they relate to stuff like expected value.

I agree that given the amount of expected influence, other universes are not high on my priority list, but they are still on my priority list. I expect the same for consequentialists in other universes. I also expect consequentialist beings that control most of their universe to get around to most of the things on their priority list, hence I expect them to influence the Solomonoff prior.

• Consequentialists can reason about situations in which other beings make important decisions using the Solomonoff prior. If multiple beings are being simulated, they can decide randomly who acts (because having e.g. 1/100 of the resources is better than none, which is the expectation of "blind mischievousness").

An example of this sort of reasoning is Newcomb's problem with the knowledge that Omega is simulating you. You get to "control" the result of your simulation by controlling how you act, so you can influence whether Omega expects you to one-box or two-box, controlling whether there is $1,000,000 in one of the boxes.
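For concreteness, the standard expected-value arithmetic behind one-boxing in this setup (with a hypothetical prediction accuracy p for Omega's simulation) can be sketched as:

```python
def ev_one_box(p):
    # With probability p the simulation-based prediction is right:
    # Omega predicted one-boxing, so the opaque box holds $1,000,000.
    return p * 1_000_000

def ev_two_box(p):
    # With probability p Omega predicted two-boxing (opaque box empty, take $1,000);
    # with probability 1 - p it guessed wrong and both boxes pay out.
    return p * 1_000 + (1 - p) * 1_001_000

# Even a mildly accurate simulation makes one-boxing dominate:
assert ev_one_box(0.9) > ev_two_box(0.9)
```

Setting the two expressions equal gives a break-even accuracy of p = 1,001,000/2,000,000 ≈ 0.5005, so any simulation meaningfully better than chance favors one-boxing.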

• My current taxonomy of rationalists is:

• LW rationalists (HI!)

• Facebook rationalists

• Twitter rationalists

• Blog rationalists

• Internet-invisible rationalists

Are there other types of rationalists? Maybe like group-chat rationalists? or podcast rationalists? google doc rationalists?

• An intuitive explanation of the Kelly criterion, with a bunch of worked examples. Zvi's post is good but lacks worked examples and justification for heuristics. Jacobian advises us to Kelly bet on everything, but I don't understand what a "Kelly bet" is in all but the simplest financial scenarios.
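To pin down what a "Kelly bet" is in the simplest case (a binary bet, which is exactly the case that doesn't cover the harder financial scenarios such a post would need to address): the Kelly fraction for a bet won with probability p at b-to-1 net odds is f* = p − (1 − p)/b. A minimal sketch:

```python
def kelly_fraction(p, b):
    """Fraction of bankroll to wager on a bet won with probability p,
    paying b-to-1 net odds. A negative value means: don't bet."""
    return p - (1 - p) / b

# Worked example: 60% chance to win at even odds → bet 20% of bankroll.
assert abs(kelly_fraction(0.6, 1.0) - 0.2) < 1e-12
# Unfavorable bet → negative fraction (sit out).
assert kelly_fraction(0.4, 1.0) < 0
```

This fraction maximizes the expected logarithm of wealth; generalizing it to multi-outcome or continuous payoffs is where the worked examples I'm asking for would come in.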

• I want more people to write down their models for various things. For example, a model I have of the economy is that it's a bunch of boxes with inputs and outputs that form a sparse directed graph. The length of the shortest cycle controls things like economic growth and AI takeoff speeds.

Another example is that people have working memory in both their brains and their bodies. When their brain-working-memory is full, information gets stored in their bodies. Techniques like Focusing are often useful to extract information stored in body-working-memory.
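The economy-as-graph model above can be made concrete. Here is a minimal sketch (the node names are made up) that finds the length of the shortest directed cycle in such a box-and-arrows graph, using a BFS from each node:

```python
from collections import deque

def shortest_cycle_length(graph):
    """Length of the shortest directed cycle in `graph`
    (adjacency dict: node -> list of successor nodes), or None if acyclic."""
    best = None
    for start in graph:
        # BFS from `start`: the shortest path that returns to `start`
        # closes the shortest cycle through it.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, []):
                if v == start:
                    cycle = dist[u] + 1
                    best = cycle if best is None else min(best, cycle)
                elif v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return best

# Toy economy: mine → smelter → factory → mine, plus a dead-end shop.
economy = {
    "mine": ["smelter"],
    "smelter": ["factory"],
    "factory": ["mine", "shop"],
    "shop": [],
}
assert shortest_cycle_length(economy) == 3
```

Under this model, a shorter production cycle means outputs feed back into inputs faster, which is the knob the bullet above claims controls growth and takeoff speed.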

• A minimal-assumption description of Updateless Decision Theory. This wiki page describes the basic concept, but doesn't include motivation, examples, or intuition.

• A thorough description of how to do pair debugging, a CFAR exercise partially described here.

• A solid, minimal-assumption description of value handshakes. This SSC post contains the best description of which I'm aware, which I think is slightly sad:

Values handshakes are a proposed form of trade between superintelligences. Suppose that humans make an AI which wants to convert the universe into paperclips. And suppose that aliens in the Andromeda Galaxy make an AI which wants to convert the universe into thumbtacks.

When they meet in the middle, they might be tempted to fight for the fate of the galaxy. But this has many disadvantages. First, there's the usual risk of losing and being wiped out completely. Second, there's the usual deadweight loss of war, devoting resources to military buildup instead of paperclip production or whatever. Third, there's the risk of a Pyrrhic victory that leaves you weakened and easy prey for some third party. Fourth, nobody knows what kind of scorched-earth strategy a losing superintelligence might be able to use to thwart its conqueror, but it could potentially be really bad – eg initiating vacuum collapse and destroying the universe. Also, since both parties would have superintelligent prediction abilities, they might both know who would win the war and how before actually fighting. This would make the fighting redundant and kind of stupid.

Although they would have the usual peace treaty options, like giving half the universe to each of them, superintelligences that trusted each other would have an additional, more attractive option. They could merge into a superintelligence that shared the values of both parent intelligences in proportion to their strength (or chance of military victory, or whatever). So if there's a 60% chance our AI would win, and a 40% chance their AI would win, and both AIs know and agree on these odds, they might both rewrite their own programming with that of a previously-agreed-upon child superintelligence trying to convert the universe to paperclips and thumbtacks in a 60-40 mix.

This has a lot of advantages over the half-the-universe-each treaty proposal. For one thing, if some resources were better for making paperclips, and others for making thumbtacks, both AIs could use all their resources maximally efficiently without having to trade. And if they were ever threatened by a third party, they would be able to present a completely unified front.

# [Question] What posts do you want written?

19 Oct 2020 3:00 UTC
46 points
• Lesswrong posts that I want someone to write:

1. Description of pair debugging

2. Description of value handshakes

Maybe I’ll think of more later.

• The S. prior is a general-purpose prior which we can apply to any problem. The output string has no meaning except in a particular application and representation, so it seems senseless to try to influence the prior for a string when you don't know how that string will be interpreted.

The claim is that consequentialists in simulated universes will model decisions based on the Solomonoff prior, so they will know how that string will be interpreted.

Can you give an instance of an application of the S. prior in which, if everything you wrote were correct, it would matter?

Any decision that controls substantial resource allocation will do. For example, decisions about evaluating the impact of running various programs, blowing up planets, interfering with alien life, etc.

Also in the category of "it's a feature, not a bug" is that, if you want your values to be right, and there's a way of learning the values of agents in many possible universes, you ought to try to figure out what their values are, and update towards them. This argument implies that you can get that for free by using Solomonoff priors.

If you are a moral realist, this does seem like a possible feature of the Solomonoff prior.

Third, what do you mean by "the output" of a program that simulates a universe?

A TM that simulates a universe must also specify an output channel.

Take your example of Life—is the output a raster scan of the 2D bit array left when the universe goes static? In that case, agents have little control over the terminal state of their universe (and also, in the case of Life, the string will be either almost entirely zeroes, or almost entirely 1s, and those both already have huge Solomonoff priors). Or is it the concatenation of all of the states it goes through, from start to finish?

All of the above. We are running all possible TMs, so all computable universes will be paired with all computable output channels. It's just a question of complexity.

Are you imagining that bits are never output unless the accidentally-simulated aliens choose to output a bit? I can't imagine any way that could happen, at least not if the universe is specified with a short instruction string.

No.

This brings us to the 4th problem: It makes little sense to me to worry about averaging in outputs from even mere planetary simulations if your computer is just the size of a planet, because it won't even have enough memory to read in a single output string from most such simulations.

I agree that approximating the Solomonoff prior is difficult and thus its malignancy probably doesn't matter in practice. I do think similar arguments apply to cases that do matter.

5th, you can weigh each program's output proportionally to 2^-T, where T is the number of steps it takes the TM to terminate. You've got to do something like that anyway, because you can't run TMs to completion one after another; you've got to do something like take a large random sample of TMs and iteratively run each one step. Problem solved.

See the section on the Speed prior.
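As a toy illustration of the 2^-T idea, and of why you have to dovetail rather than run TMs to completion one after another: the sketch below (entirely hypothetical names; Python generators stand in for TMs) runs all "programs" one step at a time and weights each halting program's output by 2^-(code length + steps taken), in the spirit of a speed prior.

```python
def make_program(steps, output):
    """A stand-in 'program': consumes `steps` steps of computation, then
    halts with `output` (delivered via the generator's return value)."""
    def run():
        for _ in range(steps - 1):
            yield            # one step of computation
        return output
    return run

def speed_weighted_outputs(programs, max_steps):
    """Dovetail all programs one step at a time; each program that halts
    within `max_steps` contributes 2**-(code_length + steps_taken) to the
    weight of its output string."""
    live = [(length, prog()) for length, prog in programs]
    weights = {}
    for t in range(1, max_steps + 1):
        still_live = []
        for length, gen in live:
            try:
                next(gen)                    # advance one step
                still_live.append((length, gen))
            except StopIteration as halt:    # program halted at step t
                out = halt.value
                weights[out] = weights.get(out, 0.0) + 2.0 ** -(length + t)
        live = still_live
    return weights

# A short, fast program vs. a shorter but slower one:
w = speed_weighted_outputs([(3, make_program(1, "A")), (1, make_program(5, "B"))], 10)
assert w == {"A": 2**-4, "B": 2**-6}
```

The interleaving means no single non-halting program can block the enumeration, and the runtime penalty is exactly the commenter's proposed discount.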

Perhaps the biggest problem is that you're talking about an entire universe of intelligent agents conspiring to change the "output string" of the TM that they're running in. This requires them to realize that they're running in a simulation, and that the output string they're trying to influence won't even be looked at until they're all dead and gone. That doesn't seem to give them much motivation to devote their entire civilization to twiddling bits in their universe's final output in order to shift our priors infinitesimally. And if it did, the more likely outcome would be an intergalactic war over what string to output.

They don't have to realize they're in a simulation, they just have to realize their universe is computable. Consequentialists care about their values after they're dead. The cost of influencing the prior might not be that high, because they only have to compute it once, and the benefit might be enormous. Exponential decay + acausal trade make an intergalactic war unlikely.