aogara

Karma: 1,444

DPhil Student in AI at Oxford, and grantmaking on AI safety at Longview Philanthropy.

aogara 2 Jun 2024 21:28 UTC
6 points
2
on: How do you shut down an escaped model?
My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/

So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.

Maybe if the model escaped from the possession of a lab, there would be other legal remedies available.

Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the government is investigating the issue as a natsec threat.

Over longer time horizons, it seems highly likely that people will deliberately create autonomous AI agents and deliberately release them into the wild with the goal of surviving and spreading, unless there are specific efforts to prevent this.

aogara 31 May 2024 17:55 UTC
2 points
0
in reply to: Gretta Duleba’s comment on: MIRI 2024 Communications Strategy
Has MIRI considered supporting work on human cognitive enhancement? e.g. Foresight’s work on WBE.

aogara 29 May 2024 23:00 UTC
2 points
0
in reply to: Zhehui Huang’s comment on: Benchmarking LLM Agents on Kaggle Competitions
Very cool, thanks! This paper focuses on building a DS Agent, but I’d be interested to see a version of this paper that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.

aogara 16 May 2024 19:09 UTC
2 points
0
in reply to: Zac Hatfield-Dodds’s comment on: AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data
I want to make sure we get this right, and I’m happy to change the article if we misrepresented the quote. I do think the current version is accurate, though perhaps it could be better. Let me explain how I read the quote, and then suggest possible edits, and you can tell me if they would be any better.
Here is the full Time quote, including the part we quoted (emphasis mine):
But, many of the companies involved in the development of AI have, at least in public, struck a cooperative tone when discussing potential regulation. Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems.
But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches, according to multiple sources present in or familiar with such conversations. In particular, companies tend to advocate for very permissive or voluntary regulations. “Anytime you want to make a tech company do something mandatory, they’re gonna push back on it,” said one Congressional staffer.
Who are “the same companies” and “companies” in the second paragraph? The first paragraph specifically mentions OpenAI, Anthropic, and Microsoft. It also discusses broader groups of companies that include these three specific companies “both the newer AI firms and the more established tech giants,” and “the companies involved in the development of AI [that] have, at least in public, struck a cooperative tone when discussion potential regulation.” OpenAI, Anthropic, and Microsoft, and possibly others in the mentioned reference classes, appear to be the “companies” that the second paragraph is discussing.
We summarized this as “companies, such as OpenAI and Anthropic, [that] have publicly advocated for AI regulation.” I don’t think that substantially changes the meaning of the quote. I’d be happy to change it to “OpenAI, Anthropic, and Microsoft” given that Microsoft was also explicitly named in the first paragraph. Do you think that would accurately capture the quote’s meaning? Or would there be a better alternative?

AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data

aogara, Corin Katzke and Dan H

16 May 2024 14:29 UTC

2 points

3 comments6 min readLW link

(newsletter.safe.ai)

aogara 9 May 2024 15:48 UTC
4 points
2
in reply to: RobertM’s comment on: RobertM’s Shortform
More discussion of this here. Really not sure what happened here, would love to see more reporting on it.

AISN #34: New Military AI Systems Plus, AI Labs Fail to Uphold Voluntary Commitments to UK AI Safety Institute, and New AI Policy Proposals in the US Senate

aogara, Corin Katzke and Dan H

2 May 2024 16:12 UTC

6 points

0 comments8 min readLW link

(newsletter.safe.ai)

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI

aogara, Corin Katzke, Alexa Pan and Dan H

12 Apr 2024 16:10 UTC

13 points

0 comments9 min readLW link

(newsletter.safe.ai)

Benchmarking LLM Agents on Kaggle Competitions

aogara22 Mar 2024 13:09 UTC

15 points

3 comments5 min readLW link

AISN #32: Measuring and Reducing Hazardous Knowledge in LLMs Plus, Forecasting the Future with LLMs, and Regulatory Markets

aogara, Corin Katzke and Dan H

7 Mar 2024 16:39 UTC

8 points

0 comments8 min readLW link

(newsletter.safe.ai)

aogara 27 Feb 2024 19:37 UTC
2 points
0
on: Biosecurity and AI: Risks and Opportunities
(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)

aogara 21 Feb 2024 22:06 UTC
6 points
2
on: Project ideas: Epistemics
An interesting question here is “Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?” In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it’s important to pick an angle where you could outperform any competition and offer the best available product.
Consensus is a startup that raised $3M “Make Expert Knowledge Accessible and Consumable for All” via LLMs.

AISN #31: A New AI Policy Bill in California Plus, Precedents for AI Governance and The EU AI Office

aogara and Dan H

21 Feb 2024 21:58 UTC

17 points

0 comments6 min readLW link

(newsletter.safe.ai)

aogara 20 Feb 2024 22:44 UTC
2 points
0
on: Project ideas: Epistemics
Another interesting idea: AI for peer review.

aogara 13 Feb 2024 9:51 UTC
4 points
0
in reply to: ryan_greenblatt’s comment on: What’s the theory of impact for activation vectors?
I’m specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don’t have reliable human labels.
But this is also only a small portion of work known as “activation engineering.” I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I’m not clearly distinguishing between different kinds of activation engineering, but this theory of change only applies to a small subset of that work. I’m not talking about model editing here, though maybe it could be useful for validation, not sure.
From Benchmarks for Detecting Measurement Tampering:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology and similar to what I’m excited about. My hypothesis based on the comment above would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. “Extra juice” might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version excels and struggles on different cases than the supervised version, and you can exploit this mismatch to make better predictions about the untrusted dataset.
From your shortform:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I’d be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model’s beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
1. “Consider the truthfulness of the following statement. {statement} The statement is true.”
2. “Consider the truthfulness of the following statement. {statement} The statement is false.”
We don’t need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.

aogara 13 Feb 2024 1:41 UTC
11 points
−2
on: What’s the theory of impact for activation vectors?
Here’s one hope for the agenda. I think this work can be a proper continuation of Collin Burns’s aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model’s activation space that might represent the model’s beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I’m not super confident in this take; it’s not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI’s internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model’s beliefs.
One such convenient possibility is the “linear representations hypothesis:” that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information - (see here and recently here). Perhaps it will also be true for a neural network’s beliefs.
If a neural network’s beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model’s beliefs?
Collin Burns’s paper offered two methods:
1. Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger and a new DeepMind paper point out, there are very many directions that satisfy this property.
2. Unsupervised methods on the activations of contrast pairs. The method roughly does the following: Take two statements of the form “X is true” and “X is false.” Extract a model’s activations at a given layer for both statements. Look at the typical difference between the two activations, across a large number of these contrast pairs. Ideally, that direction includes information about whether or not each X was actually true or false. Empirically, this appears to work. Section 3.3 of Collin’s paper shows that CRC is nearly as strong as the fancier CCS loss function. As Scott Emmons argued, the performance of both of these methods is driven by the fact that they look at the difference in the activations of contrast pairs.
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recover them with labels would be difficult, because you might mistake your own beliefs for the model’s. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model’s beliefs about the statements would prominently appear as linear directions in its activation space.
One challenge is that this method might not distinguish between the model’s beliefs and the model’s representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the “human simulator” direction and the “direct translator” direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it’s surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin’s use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin’s second method, outperforming few-shot prompting on a variety of benchmarks and using it to improve performance on TruthfulQA by double digits. There’s a lot of room for follow-up work here.
Here are few potential next steps for this research direction:
1. On the linear representations hypothesis, doing empirical investigation of when it holds and when it fails, and clarifying it conceptually.
2. Thinking about the number of directions that could be found using these methods. Maybe there’s a result to be found here similar to Fabien and DeepMind’s results above, showing this method fails to narrow down the set of candidates for truth.
3. Applying these techniques to domains where models aren’t trained on human statements about truth and falsehood, such as chess.
4. Within a weak-to-strong generalization setup, instead try unsupervised-to-strong generalization using unsupervised methods on contrast pairs. See if you can improve a strong model’s performance on a hard task by coaxing out its internal understanding of the task using unsupervised methods on contrast pairs. If this method beats fine-tuning on weak supervision, that’s great news for the method.
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin’s work, but I haven’t thought about this stuff full-time in over a year. I have maybe 70% confidence that I’d still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here’s my previous attempted explanation.

aogara 12 Feb 2024 17:19 UTC
4 points
2
on: On the Proposed California SB 1047
Another important obligation set by the law is that developers must:
(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.
This sounds like common sense, but of course there’s a lot riding on the interpretation of “unreasonable.”

aogara 8 Feb 2024 12:03 UTC
2 points
1
on: A Chess-GPT Linear Emergent World Representation
Really, really cool. One small note: It would seem natural for the third heatmap to show the probe’s output values after they’ve gone through a softmax, rather than being linearly scaled to a pixel value.

aogara 5 Feb 2024 18:21 UTC
2 points
0
on: Language Agents Reduce the Risk of Existential Catastrophe
Two quick notes here.
1. Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human’s ultimate goal. I think it’s important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goals such as achieving the human’s desired end state. By analogy, language agents trained with process-based feedback might be like consultants who aim for polite applause at the end of a presentation, rather than an owner CEO incentivized to do whatever it takes to improve a business’s bottom line.
2. If you believe that deceptive alignment is more likely with stronger reasoning within a single forward pass, then, because improvements in language agents would increase overall capabilities with a given base model, they would seem to reduce the likelihood of deceptive alignment at any given level of capabilities.

AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes

aogara, Dan H and Corin Katzke

24 Jan 2024 19:38 UTC

27 points

1 comment6 min readLW link

(newsletter.safe.ai)

aogara

AISN #35: Lob­by­ing on AI Reg­u­la­tion Plus, New Models from OpenAI and Google, and Le­gal Regimes for Train­ing on Copy­righted Data

AISN #34: New Mili­tary AI Sys­tems Plus, AI Labs Fail to Uphold Vol­un­tary Com­mit­ments to UK AI Safety In­sti­tute, and New AI Policy Pro­pos­als in the US Senate

AISN #33: Re­assess­ing AI and Biorisk Plus, Con­soli­da­tion in the Cor­po­rate AI Land­scape, and Na­tional In­vest­ments in AI

Bench­mark­ing LLM Agents on Kag­gle Competitions

AISN #32: Mea­sur­ing and Re­duc­ing Hazardous Knowl­edge in LLMs Plus, Fore­cast­ing the Fu­ture with LLMs, and Reg­u­la­tory Markets

AISN #31: A New AI Policy Bill in Cal­ifor­nia Plus, Prece­dents for AI Gover­nance and The EU AI Office

AISN #30: In­vest­ments in Com­pute and Mili­tary AI Plus, Ja­pan and Sin­ga­pore’s Na­tional AI Safety Institutes

AISN #35: Lobbying on AI Regulation Plus, New Models from OpenAI and Google, and Legal Regimes for Training on Copyrighted Data

AISN #34: New Military AI Systems Plus, AI Labs Fail to Uphold Voluntary Commitments to UK AI Safety Institute, and New AI Policy Proposals in the US Senate

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI

Benchmarking LLM Agents on Kaggle Competitions

AISN #32: Measuring and Reducing Hazardous Knowledge in LLMs Plus, Forecasting the Future with LLMs, and Regulatory Markets

AISN #31: A New AI Policy Bill in California Plus, Precedents for AI Governance and The EU AI Office

AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes