That’s smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, “hot shit”, but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.
Your example leads to fun questions like “how do I recognize juggling?”, including “what stimuli activate the concept of juggling when I do it?” vs. “what stimuli activate the concept of juggling when I see you do it?” Intuitively, nothing there seems to require that those be the same neurons, except the concept of juggling itself.
Empirically I would probably expect to see a substantial overlap in motor and/or somatosensory areas. One could imagine the activation pathway there is something like
visual cortex [see juggling] -> temporal cortex [concept of juggling] -> motor cortex [intuitions of moving arms]
And we’d also expect to see some kind of direct “I see you move your arm in x formation” -> “I activate my own processes related to moving my arm in x formation” pathway that bypasses the temporal cortex altogether.
And we could probably come up with more pathways that all cumulatively produce “mirror neural activity” which activates both when I see you do a thing and when I do that same thing. Maybe that’s a better concept/name?
Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.
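A minimal sketch of that loop, assuming a toy lookup-table “model” (all names here are hypothetical illustrations of the confidence-gated deferral idea, not anyone’s actual proposal):

```python
# Toy sketch of confidence-gated deferral: the system predicts how a human
# would resolve a conflict; above a confidence threshold it acts on its own,
# otherwise it asks the human and learns from the answer.
# All names here are hypothetical, and the "model" is just a lookup table.

CONFIDENCE_THRESHOLD = 0.95

class ConflictResolver:
    def __init__(self):
        # (conflict, resolution) examples gathered from human feedback
        self.training_data = []

    def predict(self, conflict):
        """Return (predicted_resolution, confidence). A real system would use
        a learned model; here we just tally past answers to the same conflict."""
        matches = [r for c, r in self.training_data if c == conflict]
        if not matches:
            return None, 0.0
        best = max(set(matches), key=matches.count)
        return best, matches.count(best) / len(matches)

    def resolve(self, conflict, ask_human):
        resolution, confidence = self.predict(conflict)
        if confidence >= CONFIDENCE_THRESHOLD:
            return resolution  # act without human intervention
        # defer: the human's answer becomes new training data
        resolution = ask_human(conflict)
        self.training_data.append((conflict, resolution))
        return resolution
```

Note that even when the system acts on its own, a human can still second-guess the prediction after the fact, which just becomes more training data.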
Agreed that whether a human understands the plan (and all the relevant outcomes; which outcomes are relevant?) is important, and harder than I first imagined.
You haven’t factored in the possibility that Putin gets deposed by forces inside Russia who are worried about a nuclear war. Conditional on the use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).
American Academy of Pediatrics lies to us once again....
“If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers.”

This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.
Is it, though? I’m no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.
Blindness affects congenitally blind children’s development in different ways, language development being one of the areas less affected by the lack of vision.
Most researchers have agreed upon the fact that blind children’s morphological development, with the exception of personal and possessive pronouns, is not delayed nor impaired in comparison to that of sighted children, although it is different.

As for syntactic development, comparisons of MLU scores throughout development indicate that blind children are not delayed when compared to sighted children.

Blind children use language with similar functions, and learn to perform these functions at the same age as sighted children. Nevertheless, some differences exist up until 4;6 years; these are connected to the adaptive strategies that blind children put into practice, and/or to their limited access to information about external reality. However, these differences disappear with time (Pérez-Pereira & Castro, 1997). The main early difference is that blind children tend to use self-oriented language instead of externally oriented language.
I don’t know exactly where that leaves us evidentially. Perhaps the AAP is lying by omission by not telling us about things other than language that are affected by children’s sight.
That’s a bit different to the dishonesty alleged, though.
Still working my way through reading this series—it is the best thing I have read in quite a while and I’m very grateful you wrote it!
I feel like I agree with your take on “little glimpses of empathy” 100%.
I think fear of strangers could maybe be implemented without a steering subsystem circuit? (I should say up front that I don’t know more about developmental psychology/neuroscience than you do, but here’s my 2c anyway.) Put aside whether there’s another, more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combination of instinct and learning from scratch, young humans and many animals learn there are agents in the world who move in ways that don’t conform to the simple rules of physics they are learning. These agents seem to have internally driven and unpredictable behavior, in the sense that their movement can’t be predicted by simple rules like “objects tend to move toward the ground unless something stops them” or “objects continue to maintain their momentum”. It seems like a young human could learn an awful lot of that from scratch, and even develop (in their thought generator) a concept of an agent.
Because of their unpredictability, agent concepts in the thought generator would be linked to thought assessor systems related to both reward and fear; not necessarily from prior learning derived from specific rewarding and fearful experiences, but simply because, as their behavior can’t be predicted with intuitive physics, there remains a very wide prior on what will happen when an agent is present.
In that sense, when a neocortex is first formed, most things in the world are unpredictable to it, and an optimally tuned thought generator+assessor would keep circuits active for both reward and harm. Over time, as the thought generator learns folk physics, most physical objects can be predicted, and it typically generates thoughts in line with their actual behavior. But agents are a real wildcard: their behavior can’t be predicted by folk physics, and so they are perceived the way every other object in the world used to be: unpredictable, and thus continually predicting both reward and harm in an opponent process that leads to an ambivalent and uneasy neutral. This story predicts that individual differences in reward and threat sensitivity would particularly govern the default reward/threat balance for otherwise unknown items. It might (I’m really REALLY reaching here) help to explain why attachment styles seem so fundamentally tied to basic reward and threat sensitivity.
As the thought generator forms more concepts about agents, it might even learn that agents can be classified with remarkable predictive power into “friend” or “foe” categories, or perhaps “mommy/carer” and “predator” categories. As a consequence of how rocks behave (with complete indifference towards small children), it’s not so easy to predict the behavior of, say, falling rocks with “friend” or “foe” categories. On the contrary, agents around a child are often not indifferent to children, making it simple for the child to predict whether favorable things will happen around any particular agent by classifying agents into “carer” or “predator” categories. These categories can be entirely learned: clusters of neurons in the thought generator that connect to reward and threat systems in the steering system and/or thought assessor. So then the primary task of learning to predict agents is simply learning whether good things or bad things happen around the agent, as judged by the steering system.
This story would also predict that, before the predictive power of categorizing agents into “friend” vs. “foe” categories has been learned, children wouldn’t know to place agents into these categories. They’d take longer to learn whether an agent is trustworthy or not, particularly so if they haven’t learned what an agent is yet. As they grow older, they get more comfortable with classifying agents into “friend” or “foe” categories and would need fewer exemplars to learn to trust (or distrust!) a particular agent.
Event is on tonight as planned at 7. If you’re coming, looking forward to seeing you!
I wrote a paper on another experiment by Berridge, reported in Zhang & Berridge (2009). Similar behavior was observed in that experiment, but the question explored was a bit different. They reported a behavioral pattern in which rats typically found moderately salty solutions appetitive and very salty solutions aversive. Put into salt deprivation, rats then found both solutions appetitive, but the very salty solution less so.
They (and we) took it as given that homeostatic regulation set a ‘present value’ for salt that was dependent on the organism’s current state. However, on that model, you would think salt-deprived rats would most prefer the extremely salty solution. But in either state, they prefer the moderately salty solution.
In a CABN paper, we pointed out this is not explainable when salt value is determined by a single homeostatic signal, but is explainable when neuroscience about the multiple salt-related homeostatic signals is taken into account. Some fairly recent neuroscience by Oka & Lee (and some older stuff too!) is very clear about the multiple sets of pathways involved. Because there are multiple regulatory systems for salt balance, the present value of these can be summed (as in your “multi-dimensional rewards” post) to get a single value signal that tracks the motivation level of the rat for the rewards involved.
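As a toy illustration of that summation (the numbers below are invented, chosen only to reproduce the qualitative behavioral pattern, and are not from the paper):

```python
# Toy model: the value of a salty solution is the sum of a state-independent
# palatability term and a need-weighted sodium term, as if each homeostatic
# signal contributes its own "present value". All numbers are made up for
# illustration only.

def salt_value(palatability, sodium_gain, sodium_need):
    return palatability + sodium_need * sodium_gain

MODERATE = dict(palatability=1.0, sodium_gain=1.0)
INTENSE  = dict(palatability=-2.0, sodium_gain=2.0)

# Normal state (no sodium need):
print(salt_value(**MODERATE, sodium_need=0.0))  # 1.0  -> appetitive
print(salt_value(**INTENSE,  sodium_need=0.0))  # -2.0 -> aversive

# Sodium-deprived state:
print(salt_value(**MODERATE, sodium_need=2.0))  # 3.0 -> most appetitive
print(salt_value(**INTENSE,  sodium_need=2.0))  # 2.0 -> appetitive, but less so
```

With a single homeostatic signal you can’t get this pattern; with two summed terms it falls out immediately, which is the point of the multi-signal account.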
Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see—it’s impressive how many different topics you have synthesized.
I have one question on this post—maybe doesn’t rise above the level of ‘nitpick’, I’m not sure. You mention a “curiosity drive” and other Category A things that the “Steering Subsystem needs to do in order to get general intelligence”. You’ve also identified the human Steering Subsystem as the hypothalamus and brain stem.
Is it possible that things like a “curiosity drive” arise from, say, the way the telencephalon is organized, rather than from the Steering Subsystem itself? To put it another way, if the curiosity drive is mainly implemented as motivation to reduce prediction error, or to fill the neocortex, how confident are you in identifying this process with the hypothalamus+brain stem?
I think the way I buy the argument is something like “the steering system ultimately provides all rewards, and that would include reward from prediction error”. But then I wonder whether you’re implying some greater role for the hypothalamus+brain stem or not.
Very late to the party here. I don’t know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:
- since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning. Maybe a good place to start there is https://link.springer.com/article/10.1007/s10458-022-09552-y
- The shard theory folks have done a fairly good job sketching out broad principles, but it seems to me that homeostatic regulation does a great job of modulating which values happen to be relevant at any one time. Xavier Roberts-Gaal recently recommended “Where do values come from?” to me, and that paper sketches out a fairly specific theory of how this happens (I think it might be that more of the homeostatic recalculation happens physiologically rather than neurologically, but I otherwise buy what they are saying)
- I continue to think the vmPFC is relevant because different parts of it are known to calculate the value of different aspects of stimuli, and this can be modulated by state from time to time. A recent paper in this area by Luke Chang & colleagues identifies a neural signature of reward.
At the moment I have two theories about how shards seem to be able to form consistent and competing values that don’t always optimize for some ultimate goal:
Overall, shard theory was developed to describe the behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different from modern deep RL agents; although they also can potentially have lots of input and output nodes, these are pretty finely honed to achieve a fairly narrow goal, and so in a sense it is not too much of a surprise that they seem to Goodhart on the goals they are given at times. In contrast, there’s no single terminal value or single primary reinforcer in the human RL system: sugary foods score reward points, but so do salty foods when the brain’s subfornical region indicates there’s not enough sodium in the bloodstream (Oka, Ye, & Zuker, 2015); water consumption also gets reward points when there’s not enough water. So you have parallel sets of reinforcement developing from a wide set of primary reinforcers, all at the same time.
As far as I know, a typical deep RL agent is structured hierarchically, with feedforward connections from inputs at one end to outputs at the other, and connections throughout the system reinforced with backpropagation. The brain doesn’t use backpropagation (though maybe it has similar or analogous processes); it seems to “reward” successful (in terms of prediction error reduction, or temporal/spatial association, or simply firing at the same time...?) connections throughout the neocortex, without those connections necessarily having to propagate backwards from some primary reinforcer.
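A crude sketch of the contrast: the classic textbook Hebbian rule below uses only locally available activity, with no error signal propagated back from a distant reinforcer (this is just the standard illustration, not a claim about what cortex actually computes):

```python
# Local Hebbian-style update: the weight change depends only on the activity
# of the two connected units, unlike backpropagation, where each update
# requires an error signal passed backward through the whole network.
# Classic textbook rule, shown only for contrast; real cortical learning
# rules are an open question.

def hebbian_update(weight, pre_activity, post_activity, learning_rate=0.01):
    """Units that fire together wire together: dw = eta * pre * post."""
    return weight + learning_rate * pre_activity * post_activity

w = 0.5
w = hebbian_update(w, pre_activity=1.0, post_activity=1.0)  # co-activity strengthens
w = hebbian_update(w, pre_activity=1.0, post_activity=0.0)  # no co-activity, no change
```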
The point about being better at credit assignment as you get older is probably not too much of a concern. It’s very high level, and to the extent it is true, mostly attributable to a more sophisticated world model. If you put a 40-year-old and an 18-year-old into a credit assignment game in a novel computer game environment, I doubt the 40-year-old will do better. They might beat a 10-year-old, but only to the extent the 40-year-old has learned very abstract facts about associations between objects which they can apply to the game. Speed it up so that they can’t use System 2 processing, and the 10-year-old will probably beat them.
I have pointed this out to folks in the context of AI timelines: Metaculus gives predictions for “weak AGI”, but I consider a hypothetical GATO-x which can generalize to a task (or many tasks) outside its training distribution to be AGI, yet a considerable way from an AGI with enough agency to act on its own.
OTOH it isn’t much reassurance if something as small as a batch script to keep it running is enough to bootstrap such a system into agency.
But the time between weak AGI and agentic AGI is a prime learning opportunity, and the lesson is that we should do everything we can to prolong that interval once weak AGI is invented.
Also, perhaps someone should study the necessary components for an AGI takeover by simulating agent behavior in a toy model. At the least you need a degree of agency, probably a self-model in order to recursively self-improve, and the ability to generalize. Knowing what the necessary components are might enable us to take steps to avoid having them all in one system at once.
If anyone has ever demonstrated, or even systematically described, what those necessary components are, I haven’t seen it done. Maybe it is an infohazard but it also seems like necessary information to coordinate around.
You mentioned in the pre-print that results were “similar” for the two color temperatures, and referred to the Appendix for more information, but it seems like the Appendix isn’t included in your pre-print. Are you able to elaborate on how similar the results in these two conditions were? In my own personal exploration of this area I have put a lot of emphasis on color temperature. Your study makes me adjust down the importance of color temperature, although it would be good to get more information.
A consolidated list of bad or incomplete solutions could have considerable didactic value—it could help people learn more about the various challenges involved.
Not sure what I was thinking about, but probably just that my understanding is that “safe AGI via AUP” would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.
Your “social dynamics” section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I’ll post soon, so thank you for that!
That was an inspiring and enjoyable read!
Can you say why you think AUP is “pointless” for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
I would very much like to see your dataset, as a Zotero database or in some other format, in order to better orient myself to the space. Are you able to make this available somehow?