While I like the idea of the comparison, I don’t think the gov’t definition of “green jobs” is the right comparison point. (e.g. those are not research jobs)
> one very easy way to trick our own calibration sensors is to add a bunch of caveats or considerations that make it feel like we’ve modeled all the uncertainty (or at least, more than other people who haven’t). so one thing i see a lot is that people are self-aware that they have limitations, but then over-update on how much this awareness makes them calibrated
Agree, and well put. I think the language of “my best guess” “it’s plausible that” etc. can be a bit thought-numbing for this and other reasons. It can function as plastic bubble wrap around the true shape of your beliefs, preventing their sharp corners from coming into contact with reality. Thoughts coming into contact with reality is good, so sometimes I try to deliberately strip away my precious caveats when I talk.
I most often do this when writing or speaking to think, not to communicate, since by doing this you pay the cost of not conveying your true confidence level, which can of course be bad.
(This is a brainstorm-type post which I’m not highly confident in, putting out there so I can iterate. Thanks for replying and helping me think about it!)
I don’t mean that the entire proof fits into working memory, but that the abstractions involved in the proof do. Philosophers might work with a concept like “the good” which has a few properties immediately apparent but other properties available only on further deep thought. Mathematicians work with concepts like “group” or “4” whose properties are immediately apparent, and these are what’s involved in proofs. Call these fuzzy / non-fuzzy concepts.
(Philosophers often reflect on their concepts, like “the good,” and uncover new important properties, because philosophy is interested in intuitions people have from their daily experience. But math requires clear up-front definitions; if you reflect on your concept and uncover new important properties not logically entailed from the others, you’re supposed to use a new definition.)
Human minds form various abstractions over our environment. These abstractions are sometimes fuzzy (too large to fit into working memory) or leaky (they can fail).
Mathematics is the study of what happens when your abstractions are completely non-fuzzy (always fit in working memory) and completely non-leaky (never fail). And also the study of which abstractions can do that.
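To make “completely non-fuzzy” concrete, here’s a sketch of what pinning down a concept like “group” looks like in Lean 4 (my own illustrative, deliberately minimal axiomatization; Mathlib’s real `Group` is richer): every property the concept will ever have is entailed by a handful of explicitly stated fields, with nothing left to later reflection.

```lean
-- Sketch: a group as a fully explicit bundle of data and axioms.
-- Nothing about "MyGroup" is available except what these fields entail.
structure MyGroup (G : Type) where
  mul : G → G → G
  one : G
  inv : G → G
  mul_assoc : ∀ a b c : G, mul (mul a b) c = mul a (mul b c)
  one_mul   : ∀ a : G, mul one a = a
  inv_mul   : ∀ a : G, mul (inv a) a = one
```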
I think this is a good metaphor, but note that it is still very possible to be a dick, hurt other people, etc. while communicating in NVC style. It’s not a silver bullet because nothing is.
It might be important for AI strategy to track approx how many people have daily interactions with AI boyfriends / girlfriends. Or, in more generalized form, how many people place a lot of emotional weight and trust in AIs (& which ones they trust, and on what topics).
This could be a major vector via which AIs influence politics, get followers to do things for them, and generally cross the major barrier. The AIs could be misaligned & scheming, or could be acting as tools of some scheme-y humans, or somewhere in between.
(Here I’m talking about AIs which have many powerful capabilities, but aren’t able to act on the world themselves e.g. via nanotechnology or robot bodies — this might happen for a variety of reasons.)
If any post ever deserved the “World modeling” tag it’s this one.
LLMs are trained on a human-generated text corpus. Imagine an LLM agent deciding whether or not to be a communist. Seems likely (though not certain) it would be strongly influenced by the existing human literature on communism, i.e. all the text humans have produced about communism arguing its pros/cons and empirical consequences.
Now replace ‘communism’ with ‘plans to take over.’ Humans have also produced a literature on this topic. Shouldn’t we expect that literature to strongly influence LLM-based decisions on whether to take over?
This is an argument I’m more confident in. Now an argument I’m less confident in.
‘The literature’ would seem to have a stronger effect on LLMs than it does on humans. Their knowledge is more crystallized, less abstract, more like “learning to play the persona that does X” rather than “learning to do X.” So maybe at the time LLM agents have other powerful capabilities, their thinking on whether to take over will still be very ‘stuck in its ways’ i.e. very dependent on ‘traditional’ ideas from the human text corpus. It might not be able to see beyond the abstractions and arguments used in the human discourse, even if they’re flawed.
Per both arguments, if you’re training an LLM, you might want to care a lot about what its training data says about the topic of AIs taking over from humans.
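As a very rough sketch of what “caring about what the training data says” might look like in practice (entirely hypothetical: the keyword list, file layout, and function names are mine), one could run a first pass that flags documents discussing AI takeover so they can be audited, reweighted, or annotated before training:

```python
import json
import re
from pathlib import Path

# Hypothetical first-pass patterns; a real pipeline would more likely use a
# trained classifier than regexes.
TAKEOVER_PATTERNS = [
    r"\bAI takeover\b",
    r"\btreacherous turn\b",
    r"\bdeceptive alignment\b",
    r"\brogue (AI|model)s?\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in TAKEOVER_PATTERNS]


def flag_takeover_docs(corpus_dir: str):
    """Yield (doc_id, matched_patterns) for corpus documents mentioning AI takeover."""
    for path in Path(corpus_dir).glob("*.jsonl"):
        with open(path) as f:
            for i, line in enumerate(f):
                if not line.strip():
                    continue
                text = json.loads(line).get("text", "")
                hits = [p.pattern for p in COMPILED if p.search(text)]
                if hits:
                    yield f"{path}:{i}", hits


if __name__ == "__main__":
    for doc_id, hits in flag_takeover_docs("corpus/"):
        print(doc_id, hits)
```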
A Berkeley professor speculates that LLMs are doing something more like “copying the human mind” than “learning from the world.” This seems like it would imply some things we already see (e.g. fuzzily, they’re “not very creative”), and it seems like it would imply nontrivial things for what we should expect out of LLMs in the future, though I’m finding it hard to concretize this.
> That is, if LLMs are trained with a simple algorithm and acquire functionality that resembles that of the mind, then their underlying algorithm should also resemble the algorithm by which the mind acquires its functionality. However, there is one very different alternative explanation: instead of acquiring its capabilities by observing the world in the same way as humans, LLMs might acquire their capabilities by observing the human mind and copying its function. Instead of implementing a learning process that can learn how the world works, they implement an incredibly indirect process for scanning human brains to construct a crude copy of human cognitive processes.
If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can now ask the untrusted AI to produce, along with any code it writes, a spec and a certificate verifying that the code matches the spec.
How bad of a position does this put the untrusted model in, when it’s trying to write code that does bad stuff? Some sub-questions towards answering that question:
How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? Seems like once something is formally verified at all this rules out large classes of vulnerabilities an AI might exploit. Can we push this further to remove almost all attack vectors?
How easy is it to write a spec that looks good to humans, but actually contains a significant omission?
How feasible is it for the spec writer to collude with the code writer?
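To make the “code + spec + certificate” idea concrete, here’s a toy sketch in Lean 4 (my own illustration, not anything from the post): the untrusted model hands over all three pieces, and the proof checker, rather than a human reviewer, confirms the code matches the spec.

```lean
-- The "code": a function the untrusted model wrote.
def max2 (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- The "spec": what the code is claimed to do. The accepted proof below is the
-- machine-checkable certificate that the code meets the spec.
theorem max2_spec (a b : Nat) : a ≤ max2 a b ∧ b ≤ max2 a b := by
  unfold max2
  split <;> omega
```

Even this tiny spec illustrates the omission worry above: it never says the output equals one of the inputs, so code returning a + b would verify just as well.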
What was the purpose of using octopuses in this metaphor? Like, it seems you’ve piled on so many disanalogies to actual octopuses (extremely smart, many generations per year, they use Slack...) that you may as well just have said “AIs.”
EDIT: Is it gradient descent vs. evolution?
reallyeli’s Shortform
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
“don’t have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically”
since this seems both important and likely to be popular.
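A very rough sketch of the kind of invariant this asks for (hypothetical code, not any real company’s system): critical actions require sign-off from a quorum of distinct people, which an audit job can re-check periodically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    approver_id: str
    action_id: str


def quorum_met(approvals: list[Approval], action_id: str, k: int = 2) -> bool:
    """A critical action proceeds only with k *distinct* approvers; k >= 2 means
    no single person, the CEO included, can authorize it alone."""
    approvers = {a.approver_id for a in approvals if a.action_id == action_id}
    return len(approvers) >= k


# One person approving twice does not meet a 2-of-n quorum...
approvals = [Approval("ceo", "deploy-model"), Approval("ceo", "deploy-model")]
assert not quorum_met(approvals, "deploy-model")

# ...but two distinct approvers do.
approvals.append(Approval("safety-lead", "deploy-model"))
assert quorum_met(approvals, "deploy-model")
```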
> The obvious problem is that doing the full post-training is not cheap, so you may need some funding
(I’m Open Phil staff) If you’re seeking funding to extend this work, apply to Open Phil’s request for proposals on technical safety research.
> This section feels really important to me. I think it’s somewhat plausible and big if true.
Was surprised to see you say this; isn’t this section just handwavily saying “and here, corrigibility is solved”? While that also seems plausible and big if true to me, it doesn’t leave much to discuss — did you interpret differently though?
I work as a grantmaker on the Global Catastrophic Risks Capacity-Building team at Open Philanthropy; a large part of our funding portfolio is aimed at increasing the human capital and knowledge base directed at AI safety. I previously worked on several of Open Phil’s grants to Lightcone.
As part of my team’s work, we spend a good deal of effort forming views about which interventions have or have not been important historically for the goals described in my first paragraph. I think LessWrong and the Alignment Forum have been strongly positive for these goals historically, and think they’ll likely continue to be at least into the medium term.
Good Ventures’ decision to exit this broad space meant that Open Phil didn’t reach a decision on whether & how much to continue funding Lightcone; I’m not sure where we would have landed there. However, I do think that for many readers who resonate with Lightcone’s goals and approach to GCR x-risk work, it’s reasonable to think this is among their best donation opportunities. Below I’ll describe some of my evidence and thinking.
Surveys: The top-level post describes surveys we ran in 2020 and 2023. I think these provide good evidence that LessWrong (and the Alignment Forum) have had a lot of impact on the career trajectories & work of folks in AI safety.
The methodology behind the cost-effectiveness estimates in the top-level post broadly makes sense to me, though I’d emphasize the roughness of this kind of calculation.
In general I think one should also watch the absolute impact in addition to the cost-effectiveness calculations, since cost-effectiveness can be non-robust in cases where N is small / you have few data points (i.e. few people interacted with a given program). In this case N seems large enough that I don’t worry much about robustness.
This whole approach does not really take into account negative impacts. We did ask people about these, but: a) the respondents are selected for having been positively impacted because they’re taking our survey at all, and b) for various other reasons, I’m skeptical of this methodology capturing negative impacts well.
So I think there’s reasonable room for disagreement here, if e.g. you think something like, “yes important discussions happen here, but it would be better if they happened on some other platform for <reason>.” Discussion then becomes about the counterfactual other platform.
More methodological detail, for the curious:
These were invite-only surveys, and we aimed to invite many of the people we thought were doing the most promising work on global catastrophic risk reduction (e.g. AI safety) across many areas, and for whom important influences and trajectory-boosting effects might have happened recently.
In 2020, we got ~200 respondents; in 2023, we got ~350.
Other thoughts:
I think a “common-sense” view backs up this empirical evidence quite well: LW/AF is the main place on the public internet where in-depth discussions about e.g. AI safety research agendas happen, and increasingly I see links to articles here “in the wild” e.g. in mainstream news articles.
After discussing absolute impact or even average impact per $, you still need to say something about marginal impact in order to talk about the cost-effectiveness of a donation.
I think it’s prima facie plausible that LessWrong has very diminishing marginal returns to effort or dollars, as it’s an online platform where most contributions come from users.
I am relatively agnostic/uncertain about the steepness of the diminishing marginal returns curve; ultimately I think it’s steeper than that of many other grantees, perhaps by something like 3x-10x (a very made-up number).
Some factors going into my thinking here, non-exhaustive and pushing in various directions, thrown out without much explanation: a) Oli’s statements that the organization is low on slack and that staff are taking large pay cuts, b) my skepticism of some of the items in the “Things I Wish I Had Time And Funding For” section, c) some sense that thoughtful interface design can really improve online discussions, and a sense that LessWrong is very thoughtful in this area.
I don’t have a strong view on the merits of Lightcone’s other current projects. One small note I’d make is that, when assessing the cost-effectiveness of something like Lighthaven, it’s of course important to consider the actual and expected revenues as well as the costs.
In contrast to some other threads here such as Daniel Kokotajlo’s and Drake Thomas’s, on a totally personal level I don’t feel a sense of “indebtedness” to Lightcone or LessWrong, have historically felt less aligned with it in terms of “vibes,” and don’t recall having significant interactions with it at the time it would have been most helpful for me gaining context on AI safety. I share this not as a dig at Lightcone, but to provide context to my thinking above 🤷.
In your imagining of the training process, is there any mechanism via which the AI might influence the behavior of future iterations of itself, besides attempting to influence the gradient update it gets from this episode? E.g. leaving notes to itself, either because it’s allowed to as an intentional part of the training process, or because it figured out how to pass info even though it wasn’t intentionally “allowed” to.
It seems like this could change the game a lot re: the difficulty of goal-guarding, and also may be an important disanalogy between training and deployment — though I realize the latter might be beyond the scope of this report since the report is specifically about faking alignment during training.
For context, I’m imagining an AI that doesn’t have sufficiently long-term/consequentialist/non-sphex-ish goals at any point in training, but once it’s in deployment is able to self-modify (indirectly) via reflection, and will eventually develop such goals after the self-modification process is run for long enough or in certain circumstances. (E.g. similar, perhaps, to what humans do when they generalize their messy pile of drives into a coherent religion or philosophy.)
I’ve always thought this about Omelas, never heard it expressed!