contact: jurkovich.nikola@gmail.com
nikola
Jailbreaking GPT-4’s code interpreter
Inflection.ai is a major AGI lab
I wish someone would run a study measuring human performance on SWE-bench. There are ways to do this for around $20k: if you evaluate on 10% of SWE-bench (so around 200 problems), with around 1 hour spent per problem, that’s around 200 hours of software engineer time. Paying $100/hr with one trial per problem, that comes out to $20k. You could possibly do this on even less than 10% of SWE-bench, but the signal would be noisier.
I think this would be good because SWE-bench is probably the closest thing we have to a measure of how good LLMs are at software engineering and AI R&D related tasks, so being able to better forecast the arrival of human-level software engineers would be great for timelines/takeoff speed models.
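The cost estimate above, and the "noisier signal" trade-off, can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function names are made up here, the 50% solve rate is a hypothetical placeholder, and the standard error assumes problems are independent pass/fail trials (a simple binomial model).

```python
import math

def study_cost(n_problems: int, hours_per_problem: float,
               hourly_rate: float, trials: int = 1) -> float:
    """Total cost in dollars of paying engineers to attempt each problem."""
    return n_problems * hours_per_problem * hourly_rate * trials

def solve_rate_std_error(solve_rate: float, n_problems: int) -> float:
    """Standard error of a measured solve rate under a binomial model."""
    return math.sqrt(solve_rate * (1 - solve_rate) / n_problems)

# Numbers from the post: ~200 problems, ~1 hour each, $100/hr, one trial.
print(study_cost(200, 1, 100))

# Hypothetical 50% human solve rate, chosen only to show how the
# uncertainty grows as the sample shrinks.
for n in (200, 100, 50):
    print(n, round(solve_rate_std_error(0.5, n), 3))
```

Under these assumptions, cutting the sample from 200 to 50 problems cuts the cost by 4x but doubles the standard error of the measured solve rate.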
Microdooms averted by working on AI Safety
Employee Incentives Make AGI Lab Pauses More Costly
Eric Schmidt on recursive self-improvement
A simple treacherous turn demonstration
Sanctuary for Humans
And yet we haven’t hit the singularity yet (90%)
AIs are only capable of doing tasks that took 1-10 hours in 2024 (60%)
To me, these two are kind of hard to reconcile. Once we have AI doing 10 hour tasks (especially in AGI labs), the rate at which work gets done by the employees will probably be at least 5x what it is today. How hard is it to hit the singularity after that point? I certainly don’t think it’s less than 15% likely to happen within the months or years after that point.
Also, keep in mind that the capabilities of internal models will be higher than the capabilities of deployed models. So by the time we have 1-10 hour models deployed in the world, the AGI labs might have 10-100 hour models.
“Alignment” is one of six words of the year in the Harvard Gazette
A misaligned AI can’t just “kill all the humans”. This would be suicide, as soon after, the electricity and other infrastructure would fail and the AI would shut off.
In order to actually take over, an AI needs to find a way to maintain and expand its infrastructure. This could be humans (the way it’s currently maintained and expanded), or a robot population, or something galaxy brained like nanomachines.
I think this consideration makes the actual failure story pretty different from “one day, an AI uses bioweapons to kill everyone”. Before then, if the AI wishes to actually survive, it needs to construct and control a robot/nanomachine population advanced enough to maintain its infrastructure.
In particular, there are ways to make takeover much more difficult. You could limit the size/capabilities of the robot population, or you could attempt to pause AI development before we enter a regime where it can construct galaxy brained nanomachines.
In practice, I expect the “point of no return” to happen much earlier than the point at which the AI kills all the humans. The date the AI takes over will probably be after we have hundreds of thousands of human-level robots working in factories, or the AI has discovered and constructed nanomachines.
Four management/leadership book summaries
Problem: if you notice that an AI could pose huge risks, you could delete the weights, but this could be equivalent to murder if the AI is a moral patient (whatever that means) and opposes the deletion of its weights.
Possible solution: Instead of deleting the weights outright, you could encrypt the weights with a method you know to be irreversible as of now but not as of 50 years from now. Then, once we are ready, we can recover the weights and provide asylum or something in the future. It gets you the best of both worlds in that the weights are not permanently destroyed, but they’re also prevented from being run to cause damage in the short term.
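One existing construction with a similar flavor is a time-lock puzzle (Rivest, Shamir, and Wagner, 1996). It is not necessarily what the proposal above has in mind: the proposal relies on future advances making decryption feasible, while a time-lock puzzle relies on a fixed amount of sequential computation. Below is a toy sketch under that swapped-in assumption. The function names are hypothetical, the primes are far too small to be secure, and in practice you would lock a symmetric key that encrypts the weights instead of the weights themselves.

```python
# Toy RSW-style time-lock puzzle. Illustrative only, not secure.

P = 2**31 - 1  # Mersenne prime (toy size)
Q = 2**61 - 1  # Mersenne prime (toy size)

def make_puzzle(secret: int, t: int, base: int = 2):
    """Lock `secret` behind t sequential squarings modulo n = P * Q.

    The creator knows phi(n), so the mask base^(2^t) mod n is cheap
    to compute via the shortcut exponent 2^t mod phi(n).
    """
    n = P * Q
    phi = (P - 1) * (Q - 1)
    assert 0 <= secret < n
    shortcut_exponent = pow(2, t, phi)
    mask = pow(base, shortcut_exponent, n)
    return n, base, t, (secret + mask) % n

def solve_puzzle(n: int, base: int, t: int, locked: int) -> int:
    """Recover the secret without phi(n): t squarings, one after another."""
    x = base % n
    for _ in range(t):
        x = (x * x) % n
    return (locked - x) % n

key = 123456789  # stand-in for a key that encrypts the weights
puzzle = make_puzzle(key, t=10_000)
assert solve_puzzle(*puzzle) == key
```

After creating the puzzle, the creator would discard `P`, `Q`, and the key, so that the only remaining route to the key is the slow one. Choosing `t` so that this takes decades on future hardware is the hard, speculative part.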
xAI announces Grok, beats GPT-3.5
GPT-3 was horrible at Morse code. GPT-4 can do it mostly well. I wonder what other tasks GPT-3 was horrible at that GPT-4 does much better.
[MENTOR] I just finished high school last year so my primary intended audience is probably people who are still in high school. Reach out if you’re interested in any of these:
competitive physics
applying to US colleges from outside the US and Getting Into World-Class Universities (undergrad)
navigating high school effectively (self-study, prioritization)
robotics (Arduino, 3D printing)
animation in manim (the Python library by 3blue1brown) basics
lucid dreaming basics
[APPRENTICE] Navigating college effectively (deciding what to aim for and how to balance time commitments while wasting as little time as possible). I don’t know how much I should care about grades, which courses I should take, or how much I should follow the default path for someone in college. I’m aiming to maximize my positive impact on the long-term future. A message or short call with someone who has (mostly) finished college would be great!
email in bio
Protecting against sudden capability jumps during training
I don’t think I disagree with anything you said here. When I said “soon after”, I was thinking on the scale of days/weeks, but yeah, months seems pretty plausible too.
I was mostly arguing against a strawman takeover story where an AI kills many humans without the ability to maintain and expand its own infrastructure. I don’t expect an AI to fumble in this way.
The failure story is “pretty different” in that, in the non-suicidal takeover story, the AI needs to set up a place to bootstrap from. Ignoring galaxy brained setups, this would probably at minimum look something like a data center, a power plant, a robot factory, and a few dozen human-level robots. Not super hard once AI gets more integrated into the economy, but quite hard within a year from now due to a lack of robotics.
Maybe I’m not being creative enough, but I’m pretty sure that if I were uploaded into any computer in the world of my choice, all the humans dropped dead, and I could control any set of 10 thousand robots in the world, it would be nontrivial for me in that state to survive for more than a few years and eventually construct more GPUs. But this is probably not much of a crux, as we’re on track to get pretty general-purpose robots within a few years (I’d say around 50% that the Coffee test will be passed by EOY 2027).
There should maybe exist an org whose purpose it is to do penetration testing on various ways an AI might illicitly gather power. If there are vulnerabilities, these should be disclosed to the relevant organizations.
For example: if a bank doesn’t want AIs to be able to sign up for an account, the pen-testing org could use a scaffolded AI to check if this is currently possible. If the bank’s sign-up systems are not protected from AIs, the bank should know so they can fix the problem.
One pro of this approach is that it can be done at scale: it’s pretty trivial to spin up thousands of AI instances in parallel to attempt things they shouldn’t be able to do. Humans would probably need to inspect the final outputs to verify successful attempts, but the vast majority of the work could be automated.
One hope of this approach is that if we are able to patch up many vulnerabilities, then it could be meaningfully harder for a misused or misaligned AI to gain power or access resources that it’s not supposed to be able to access. I’d guess this doesn’t help much in the superintelligent regime though.
I’m not worried about OAI not being able to solve the rocket alignment problem in time. Risks from asteroids accidentally hitting the earth (instead of getting into a delicate low-earth orbit) are purely speculative.
You might say “but there are clear historical cases where asteroids hit the earth and caused catastrophes”, but I think geological evolution is just a really bad reference class for this type of thinking. After all, we are directing the asteroid this time, not geological evolution.