
Thane Ruthenis

Karma: 3,487

“Humanity vs. AGI” Will Never Look Like “Humanity vs. AGI” to Humanity

Thane Ruthenis · 16 Dec 2023 20:08 UTC
170 points · 23 comments · 5 min read · LW link

Most People Don’t Realize We Have No Idea How Our AIs Work

Thane Ruthenis · 21 Dec 2023 20:02 UTC
151 points · 42 comments · 1 min read · LW link

Reshaping the AI Industry

Thane Ruthenis · 29 May 2022 22:54 UTC
147 points · 35 comments · 21 min read · LW link

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis · 15 Dec 2023 20:16 UTC
110 points · 152 comments · 8 min read · LW link

A Case for the Least Forgiving Take On Alignment

Thane Ruthenis · 2 May 2023 21:34 UTC
99 points · 82 comments · 22 min read · LW link

A Crisper Explanation of Simulacrum Levels

Thane Ruthenis · 23 Dec 2023 22:13 UTC
83 points · 13 comments · 13 min read · LW link

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis · 22 Dec 2023 20:19 UTC
71 points · 13 comments · 6 min read · LW link

Don’t Share Information Exfohazardous on Others’ AI-Risk Models

Thane Ruthenis · 19 Dec 2023 20:09 UTC
67 points · 11 comments · 1 min read · LW link

Agency As a Natural Abstraction

Thane Ruthenis · 13 May 2022 18:02 UTC
55 points · 9 comments · 13 min read · LW link

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis · 18 Dec 2023 20:08 UTC
50 points · 8 comments · 5 min read · LW link

Poorly-Aimed Death Rays

Thane Ruthenis · 11 Jun 2022 18:29 UTC
48 points · 5 comments · 4 min read · LW link

Goal Alignment Is Robust To the Sharp Left Turn

Thane Ruthenis · 13 Jul 2022 20:23 UTC
47 points · 16 comments · 4 min read · LW link

Interpretability Tools Are an Attack Channel

Thane Ruthenis · 17 Aug 2022 18:47 UTC
42 points · 14 comments · 1 min read · LW link

Broad Picture of Human Values

Thane Ruthenis · 20 Aug 2022 19:42 UTC
42 points · 6 comments · 10 min read · LW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis · 4 Aug 2022 23:31 UTC
38 points · 1 comment · 13 min read · LW link