Steve, your AI safety musings are my favorite thing tonally on here. Thanks for all the effort you put into this series. I learned a lot.
To ask the direct question: how do we reverse-engineer human social instincts? Do we:
Need to be neuroscience PhDs?
Need to just think a lot about what the base generators of human developmental phenomena are, maybe by staring at a lot of babies?
Guess, and hope we get to build enough AGIs that we notice which ones seem to be coming out normal-acting before one of them kills us?
Something else you’ve thought of?
I don’t have a great sense of the possibility space.
This is a real shame—there are lots of alignment research directions that could really use productive smart people.
I think you might be trapped in a false dichotomy between “impossible” and “easy”. For example, Anthropic’s and Redwood Research’s safety directions will succeed or fail in large part based on how much good work smart people do on interpretability, adversarial auditing, RLHF and its limitations, and so on. Yudkowsky isn’t the only expert, and if he’s miscalibrated, then your actions have extremely high value.