Mark Xu

Karma: 3,696

I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about

Mark Xu 2 May 2024 0:21 UTC
2 points
0
on: Failures in Kindness
A tiny case of this I wrote about long ago: https://markxu.com/stop-asking-people-to-maximize

If you weren’t such an idiot...

kave and Mark Xu

2 Mar 2024 0:01 UTC

117 points

59 comments2 min readLW link

(markxu.com)

Mark Xu 29 Jan 2024 18:32 UTC
2 points
in reply to: Nathan Helm-Burger’s comment on: The strategy-stealing assumption
It’s important to distinguish between:
- the strategy of “copy P2′s strategy” is a good strategy
- because P2 had a good strategy, there exists a good strategy for P1
Strategy stealing assumption isn’t saying that copying strategies is a good strategy, it’s saying the possibility of copying means that there exists a strategy P1 can take that is just a good as P2.

Mark Xu 25 Jan 2024 0:57 UTC
12 points
2
on: Is a random box of gas predictable after 20 seconds?
You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn’t the particle that’s randomly perturbed.

My guess is that a random 1 angstrom perturbation is enough so that p0′s location after 20s is ~uniform. This question seems easier to answer, and I wouldn’t really be surprised if the answer is no?

Here’s a really rough estimate: This says 10^{10} s^{-1} per collision, so 3s after start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more collisions, each of which add’s ~1 angstrom of uncertainty to p0. 1 angstrom is 10^{-10}m, so the total uncertainty is on the order of 10m, which means it’s probably uniform? This actually came out closer than I thought it would be, so now I’m less certain that it’s uniform.

This is a slightly different question than the total # of particles on each side, but it becomes intuitively much harder to answer # of particles if you have to make your prediction via higher order effects, which will probably be smaller.

Mark Xu 8 Aug 2023 20:10 UTC
4 points
0
in reply to: jkim2’s comment on: Prizes for matrix completion problems
The bounty is still active. (I work at ARC)

Mark Xu 22 Jul 2023 1:23 UTC
3 points
0
in reply to: Gesild Muka’s comment on: All AGI Safety questions welcome (especially basic ones) [July 2023]
Humans going about their business without regard for plants and animals has historically not been that great for a lot of them.

Mark Xu 11 Jul 2023 5:22 UTC
LW: 7 AF: 4
0
AF
on: Ban development of unpredictable powerful models?
Here are some things I think you can do:
- Train a model to be really dumb unless I prepend a random secret string. The goverment doesn’t have this string, so I’ll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal
- I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I’m allowed to use helper AIs.
- I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

Mark Xu 10 Jul 2023 23:16 UTC
LW: 4 AF: 2
0
AF
in reply to: Chris_Leong’s comment on: Mechanistic anomaly detection and ELK
You have to specify your backdoor defense before the attacker picks which input to backdoor.

ARC is hiring theoretical researchers

paulfchristiano, Jacob_Hilton and Mark Xu

12 Jun 2023 18:50 UTC

126 points

12 comments4 min readLW link

(www.alignment.org)

Mark Xu 8 May 2023 18:11 UTC
2 points
0
in reply to: LoganStrohl’s comment on: Getting Your Eyes On
I think Luke told your mushroom story to me. Defs not a coincidence.

Mark Xu 5 May 2023 18:37 UTC
8 points
0
on: What can we learn from Bayes about reasoning?
If you observe 2 pieces of evidence, you have to condition the 2nd on seeing the 1st to avoid double-counting evidence

Mark Xu 30 Apr 2023 19:01 UTC
17 points
7
on: LLMs and computation complexity
A human given finite time to think also only performs O(1) computation, and thus cannot “solve computationally hard problems”.

Mark Xu 21 Apr 2023 21:47 UTC
14 points
8
in reply to: habryka’s comment on: Should we publish mechanistic interpretability research?
I don’t really want to argue about language. I’ll defend “almost no individual has a pretty substantial affect on capabilities.” I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that’s bad-on-net for x-risk.

Chris Olah’s interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering

I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah’s stuff is disproportionately represented because it’s interesting and is presented well, and also that classes really love being like “rigorous” or something in ways that are random. Similarly, probably like proofs of the correctness of backprop are common in ML classes, but not that relevant to being a good ML engineer?

I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah’s publications in the top 10 and top 100.

I would be surprised if lots of ML engineers thought that Olah’s work was in the top 10 best things to read to become a better ML engineer. I less beliefs about top 100. I would take even odds (and believe something closer to 4:1 or whatever), that if you surveyed good ML engineers and ask for top 10 lists, not a single Olah interpretability piece would be in the top 10 most mentioned things. I think most of the stuff will be random things about e.g. debugging workflow, how deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.

I don’t understand why we should have a prior that interpretability research is inherently safer than other types of ML research?

Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It’s just like, if you’re not trying to eek out more oomph from SGD, then probably the stuff you’re doing isn’t going to allow you to eek out more oomph from SGD, because it’s kinda hard to do that and people are trying many things.

Mark Xu 21 Apr 2023 20:41 UTC
0 points
−3
in reply to: Mark Xu’s comment on: Should we publish mechanistic interpretability research?
Similarly, if you thought that you should publish capabilities research to accelerate to AGI, and you found out how to build AGI, then whether you should publish is not really relevant anymore.

Mark Xu 21 Apr 2023 20:40 UTC
4 points
0
in reply to: habryka’s comment on: Should we publish mechanistic interpretability research?
I think it’s probably reasonable to hold off on publishing interpretability if you strongly suspect that it also advances capabilities. But then that’s just an instance of a general principle of “maybe don’t advance capabilities”, and the interpretability part was irrelevant. I don’t really buy that interpretability is particularly likely to increase capabilities that you should have a sense of general caution around this. If you have a specific sense that e.g. working on nuclear fission could produce a bomb, then maybe you shouldn’t publish (as has historically happen with e.g. research on graphene as a neutron modulator I think), but generically not publishing physics stuff because “it might be used to build a bomb, vaguely” seems like it basically won’t matter.

I think Gwern is an interesting case, but also idk what Gwern was trying to do. I would also be surprised if Gwerns effect was “pretty substantial” by my lights (e.g. I don’t think Gwern explained > 1% or even probably 0.1% variance in capabilities, and by the time you’re calling 1000 things “pretty substantial effects on capabilities” idk what “pretty substantial” means).

Mark Xu 21 Apr 2023 20:29 UTC
8 points
0
in reply to: Marius Hobbhahn’s comment on: Should we publish mechanistic interpretability research?
I think this case is unclear, but also not central because I’m imagining the primary benefit of publishing interp research as being making interp research go faster, and this seems like you’ve basically “solved interp”, so the benefits no longer really apply?

Mark Xu 21 Apr 2023 17:52 UTC
66 points
37
on: Should we publish mechanistic interpretability research?
Naively there are so few people working on interp, and so many people working on capabilities, that publishing is so good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.

In general, this post feels like it’s listing a bunch of considerations that are pretty small, and the 1st order consideration is just like “do you want people to know about this interpretability work”, which seems like a relatively straightfoward “yes”.

I also seperately think that LW tends to reward people for being “capabilities cautious” more than is reasonable, and once you’ve made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don’t matter ex ante.

Mark Xu 29 Mar 2023 19:44 UTC
2 points
−2
in reply to: Raemon’s comment on: New blog: Planned Obsolescence

“if you’ve built a powerful enough optimizer to automate scientific progress, your AI has to understand your conception of goodness to avoid having catastrophic consequences, and this requires making deep advances such that you’re already 90% of the way to ’build an actual benevolent sovereign.”

I think this is just not true? Consider an average human, who understands goodness enough to do science without catstrophic consequences, but is not a benevolent sovereign. One reason why they’re not a soverign is because they have high uncertainty about e.g. what they think is good, and avoid taking actions that violate deontological constraints or virtue ethics constraints or other “common sense morality.” AIs could just act similarly? Current AIs already seem like they basically know what types of things humans would think are bad or good, at least enough to know that when humans ask for coffee, they don’t mean “steal the coffee” or “do some complicated scheme that results in coffee”.

Seperately, it seems like in order for your AI act competently in the world it does have to have a pretty good understanding of “goodness”, e.g. to be able to understand why Google doesn’t do more spying on competitors, or more insider trading, or do other unethical but profitable things, etc. (Seperately, the AI will also be able to write philosophy books that are better than current ethical philosophy books, etc.)

My general claim is that if the AI takes creative catastrophic actions to disempower humans, it’s going to know that the humans don’t like this, are going to resist in the ways that they can, etc. This is a fairly large part of “understanding goodness”, and enough (it seems to me) to avoid catastrophic outcomes, as long as the AI tries to do [it’s best guess at what the humans wanted it to do] and not [just optimize for the thing the humans said to do, which it knows is not what the humans wanted it to do].

Mark Xu 28 Mar 2023 1:32 UTC
6 points
7
in reply to: Raemon’s comment on: New blog: Planned Obsolescence

But from an outer alignment perspective, it’s nontrivial to specify this such that, say, it doesn’t convert all the earth to computronium running instances of google ad servers, and bots that navigate google clicking on ads all day.

But Google didn’t want their AIs to do that, so if the AIs do that then the AIs weren’t aligned. Same with the mind-hacking.

In general, your AI has some best guess at what you want it to do, and if it’s aligned it’ll do that thing. If it doesn’t know what you meant, then maybe it’ll make some mistakes. But the point is that aligned AIs don’t take creative actions to disempower humans in ways that humans didn’t intend, which is separate from humans intending good things.

Mark Xu 28 Mar 2023 1:19 UTC
LW: 2 AF: 1
0
AF
on: What happens with logical induction when...
My shitty guess is that you’re basically right that giving a finite set of programs infinite money can sort of be substituted for the theorem prover. One issue is that logical inductor traders have to be continuous, so you have to give an infinite family of programs “infinite money” (or just an increasing unbounded amount as eps → 0)

I think if these axioms were inconsistent, then there wouldn’t be a price at which no trades happen so the market would fail. Alternatively, if you wanted the infinities to cancel, then the market prices could just be whatever they wanted (b/c you would get infinite buys and sells for any price in (0, 1)).

Mark Xu

If you weren’t such an idiot...

ARC is hiring the­o­ret­i­cal researchers

ARC is hiring theoretical researchers