anonymousaisafety

Karma: 724

anonymousaisafety Jun 22, 2022, 9:04 PM
6 points
0
in reply to: Lauro Langosco’s comment on: Let’s See You Write That Corrigibility Tag
I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list when I kept referring to AGI-as-a-contract-engineer.
Even your point that it doesn’t have to succeed, that it’s ok for it to fail at a task if it can’t reach it in some obvious, non-insane way: that’s what I’d expect from a contractor. The idea that an AGI would find that a task is generally impossible, identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then not alert or tell a human about that plan before acting on it has always been confusing to me.
In engineering work, we almost always have expected budget / time / material margins for what a solution looks like. If someone thinks that solution space is empty (it doesn’t close), but they find some other solution that would work, people discuss that novel solution first and agree to it.
That’s a core behavior I’d want to preserve. I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it’s actually very context dependent—if we approve an action for AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I’d want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.
EDIT: Posting this has made me realize that idea of context dependencies is applicable more widely than just allowable actions, and it’s relevant to discussion of what it means to “optimize” or “solve” a problem as well. I’ve suggested this in my other posts but I don’t think I ever said it explicitly: if you consider human infrastructure, and human economies, and human technology, almost all “optimal” solutions (from the perspective of a human engineer) are going to be built on the existing pile of infrastructure we have, in the context of “what is cheapest, easiest, the most straight line path to a reasonably good solution that meets the requirements”. There is a secret pile of “optimal” (in the context of someone doing reasoning from first principles) solutions that involve ignoring all of human technology and bootstrapping a new technology tree from scratch, but I’d argue that’s a huge overlap if not the exact same set as the things people have called “weird” in multiple lists. Like if I gave a contractor a task to design a more efficient paperclip factory and they gave me a proposed plan that made zero reference to buying parts from our suppliers or showed the better layout of traditional paper-clip making machines or improvements to how an existing paper-clip machine works, I’d be confused, because that contractor is likely handing me a plan that would require vertically integrating all of the dependencies, which feels like complete overkill for the task that I assigned. Even if I phrased my question to a contractor as “design me the most efficient paperclip factory”, they’d understand constraints like: this company does not own the Earth, therefore you may not reorder the Earth’s atoms into a paperclip factory. They’d want to know, how much space am I allowed? How tall can the building be? What’s the allowable power usage? Then they’d design the solution inside of those constraints. That is how human engineering works. 
If an AGI mimicked that process and we could be sure it wasn’t deceptive (e.g. due to interpretability work), then I suspect that almost all claims about how AGI will immediately kill everyone are vastly less likely, and the remaining ways AGI can kill people basically reduce to the people controlling the AGI deliberately using it to kill people, in the same way that the government uses military contractors to design new and novel ways of killing people, except the AGI would be arbitrarily good at that exercise.
What links here?
- anonymousaisafety's comment on Air Conditioner Test Results & Discussion by johnswentworth (Jun 24, 2022, 6:28 PM; 4 points)

anonymousaisafety Jun 21, 2022, 7:22 AM
21 points
0
in reply to: TekhneMakre’s comment on: Pivotal outcomes and pivotal processes
Oh, sorry, you’re referring to this:
includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
This just seems like one of those things people say, in the same vein as “melt all of the GPUs”. I think that non-nuclear EMPs are still based on chemical warheads. I don’t know if a “pulse emitter” is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
includes a distributed network of non-nuclear ICBMs that will be physically shot at any target believed to be running a rogue AI agent
and then we can put an asterisk on the word “ICBM” and say it’ll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, doesn’t it? It makes me wonder how the system defines “appears to be running rogue AI agents”, because now I wonder what the % chance of false positives is—since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they’re not already on board with this plan.
because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker?
It’s not really about “tasks”, it’s about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they’re connected to the internet to check for updates, or they send telemetry, or they’re hosting a Minecraft server with log4j.
It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable.
When you’re analyzing the security of a system, what you’re looking for is “what can the attacker control?”
If the attacker can’t control anything, the system isn’t vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It’s generally safe to assume that if an attacker has physical access to a machine, you’re compromised.^[1] So first, we don’t want the attacker to have physical access to these computers. That means they’re in a secure facility, with guards, and badges, and access control on doors, just like you’d see in a tech company’s R&D lab.
That leaves remote attacks. These generally come in two forms:
1. The attacker tricks you into downloading and running some compromised software. For example, visiting a website with malicious JavaScript, or running some untrusted executable you downloaded because it was supposed to be a cheat engine for a video game but it was actually just a keylogger, or the attacker has a malicious payload in a seemingly innocent file type like a Word document or PDF file and it’s going to exploit a bug in the Word program or Adobe Acrobat program that tries to read that file.
2. The attacker sends network traffic to the machine which is able to compromise the machine in some way, generally by exploiting open ports or servers running on the target machine.
All of the attacks in (1) fall under the “when you run untrusted code, you will get pwned” umbrella. There are a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it’s just like rearranging deck chairs on the Titanic. It doesn’t matter what you do, someone else is going to find a side-channel attack and ruin your day if you let them run code on your machine. So don’t do that. This is actually easier than you might think: plenty of systems are “secure” because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they’re effectively static: the software image is flashed to some SoC’s read-only memory (ROM) by an external debugger^[2], and there’s no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There are no user accounts, or “default applications”, or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker’s traffic gets dropped, they can’t control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the “expected” traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.^[3]
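A minimal sketch of the "fixed-size binary schema plus authentication" idea, under stated assumptions. The field layout, the names `seal` and `open_frame`, and the key handling are all hypothetical. It also swaps in HMAC-SHA256 from the Python standard library, which authenticates a message but does not encrypt it; a real system would use an authenticated encryption mode with the keys held in the hardware security module.

```python
import hashlib
import hmac
import struct

# Hypothetical fixed layout: sequence number (u32), message type (u8), value (u16).
SCHEMA = struct.Struct("!IBH")
TAG_LEN = 32                       # SHA-256 digest size in bytes
FRAME_LEN = SCHEMA.size + TAG_LEN  # every valid frame has exactly this length


def seal(key: bytes, seq: int, msg_type: int, value: int) -> bytes:
    """Pack the fields and append an HMAC tag computed over the packed bytes."""
    body = SCHEMA.pack(seq, msg_type, value)
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return body + tag


def open_frame(key: bytes, frame: bytes):
    """Return (seq, msg_type, value), or None if the frame must be dropped."""
    if len(frame) != FRAME_LEN:
        return None  # wrong size: drop without parsing anything
    body, tag = frame[:SCHEMA.size], frame[SCHEMA.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # not produced by a key holder: drop
    return SCHEMA.unpack(body)
```

The receiver never interprets a field until both the length check and the tag check pass, so traffic from someone without the key is discarded before it can influence anything. This sketch does not address replayed frames; checking the sequence number would be one way to handle that.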
With that in mind, I hope it is clear that the problem isn’t about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we’re discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it’s evidence of a rogue AI.^[4] We can’t just drop everything in hardware and ignore messages that don’t fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I’d probably approach it with the following thoughts:
1. I’d design the system so that the traffic to be inspected is distributed across a cluster of computers. This is both to prevent denial-of-service attacks and to prevent targeted attacks like Nethammer. If the system is receiving traffic in excess of what the cluster is designed to handle, it’ll be dropped—while this represents a degradation in monitoring capability, it can be fixed by adding additional machines to the cluster as needed. It’s often useful to gracefully degrade capabilities when a system is under load.
2. The computers in the cluster that receive traffic have a simple role: send an alert message to a central supervisor if suspicious traffic is detected.
3. The alert message uses authenticated encryption.
4. The central supervisor is only communicated with via authenticated encryption messages on a local network. It is not connected to the wider network and it is not possible for the attacker to send traffic to the supervisor machine, unless they first compromised one of the cluster machines.
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let’s fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We’ll define a binary schema for that information and we’ll stick a serial device or FPGA onto each cluster machine.
1. The cluster machines send the alert message by setting the relevant fields of the message via the serial device or FPGA registers, and it’s that device which sends the actual alert. Now if the cluster machine is compromised, all the attacker can do is change fields in the alert message, but they can’t send different messages entirely.^[5]
2. We test the central supervisor to confirm that any permutation of the alert message is still safe to receive. For testing permutations, we normally use a method based on equivalence classes.
So now attackers can maybe compromise cluster machines because they control the traffic that we’re trying to monitor, but they can’t use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn’t let you change the traffic sent to a supervisor in a way that would be dangerous.
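As a rough illustration of the alert schema and the equivalence-class testing described above, here is a sketch of what the supervisor's side might look like. The field layout, the reason codes, and the function names are invented for the example; the comment only says the alert would carry something like an IP, geolocation, and a reason.

```python
import ipaddress
import itertools
import struct

# Hypothetical alert layout: IPv4 source (4 bytes), latitude and longitude in
# hundredths of a degree (signed 16-bit each), reason code (unsigned 8-bit).
ALERT = struct.Struct("!4shhB")
REASONS = {1: "port scan", 2: "traffic volume", 3: "payload signature"}  # made up


def parse_alert(raw: bytes):
    """Return a validated alert dict, or None if any field is out of range."""
    if len(raw) != ALERT.size:
        return None
    ip, lat, lon, reason = ALERT.unpack(raw)
    if not -9000 <= lat <= 9000:
        return None
    if not -18000 <= lon <= 18000:
        return None
    if reason not in REASONS:
        return None
    return {
        "ip": str(ipaddress.IPv4Address(ip)),
        "lat": lat / 100,
        "lon": lon / 100,
        "reason": REASONS[reason],
    }


# One representative per equivalence class for each field, including the
# boundaries and the values just outside them.
IPS = [bytes(4), bytes([255] * 4), bytes([10, 0, 0, 1])]
LATS = [-32768, -9001, -9000, 0, 9000, 9001, 32767]
LONS = [-32768, -18001, -18000, 0, 18000, 18001, 32767]
REASON_CODES = [0, 1, 3, 4, 255]


def check_all_classes() -> int:
    """Feed every combination to the parser; return how many were accepted."""
    accepted = 0
    for ip, lat, lon, reason in itertools.product(IPS, LATS, LONS, REASON_CODES):
        result = parse_alert(ALERT.pack(ip, lat, lon, reason))
        valid = -9000 <= lat <= 9000 and -18000 <= lon <= 18000 and reason in REASONS
        assert (result is not None) == valid
        accepted += result is not None
    return accepted
```

Because the message is a fixed nine bytes with a handful of bounded fields, the space of inputs the supervisor can ever see is small enough to partition and test class by class, which is what makes the "any permutation is safe to receive" claim checkable.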
At this point we can ask what the attacker can do:
- They can maybe send garbage alert messages.
- They can maybe send no message at all—i.e. they crash the cluster machine so it fails silent.
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it’s sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
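A small sketch of the supervisor-side heartbeat check, assuming timestamps are passed in explicitly and that a machine counts as silent after a few missed beats. The class name, the `missed_allowed` threshold, and the machine IDs are hypothetical, and the authentication of each heartbeat is left out.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per machine and reports the ones gone silent."""

    def __init__(self, period_s: float, missed_allowed: int = 3):
        # A machine is treated as silent after this many seconds without a beat.
        self.timeout_s = period_s * missed_allowed
        self.last_seen = {}  # machine id -> time of last heartbeat

    def beat(self, machine_id: str, now: float) -> None:
        """Record a heartbeat received from machine_id at time now."""
        self.last_seen[machine_id] = now

    def silent(self, now: float) -> list:
        """Return the sorted ids of machines whose heartbeat is overdue."""
        return sorted(
            m for m, t in self.last_seen.items() if now - t > self.timeout_s
        )


# Usage: two cluster machines at 100 Hz; "b" stops sending after t = 0.
monitor = HeartbeatMonitor(period_s=0.01)
monitor.beat("a", now=0.0)
monitor.beat("b", now=0.0)
monitor.beat("a", now=0.05)
overdue = monitor.silent(now=0.06)  # only "b" is overdue here
```

Note that this only catches a machine that fails silent. A compromised machine that keeps sending heartbeats is a different failure mode.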
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we’ll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine’s telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won’t be able to detect as erroneous, so whatever action gets taken on belief that a rogue AI exists would be sent to a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is “fire an ICBM”, it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it’s equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn’t need to be hacked in order to fail. If the algorithm we’re using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what’s the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm has a higher likelihood of a false positive than our hypothetical attacker, after all of our mitigations, then maybe this system isn’t even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn’t work reliably enough, hacked or not.
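To make the margin-of-error question concrete, here is a worked example using Bayes' rule. Every number in it is made up for illustration; nothing in this thread gives actual error rates for a rogue-AI detector.

```python
def posterior_given_alert(
    prior: float, true_positive_rate: float, false_positive_rate: float
) -> float:
    """P(rogue AI | alert) by Bayes' rule.

    prior: fraction of inspected traffic that really comes from a rogue AI.
    true_positive_rate: P(alert | rogue AI), i.e. 1 minus the false negative rate.
    false_positive_rate: P(alert | not a rogue AI).
    """
    hit = true_positive_rate * prior
    false_alarm = false_positive_rate * (1.0 - prior)
    return hit / (hit + false_alarm)


# Hypothetical numbers: one flow in a million is a rogue AI, the detector
# catches 99% of them, and it misfires on 0.1% of benign flows.
p = posterior_given_alert(prior=1e-6, true_positive_rate=0.99, false_positive_rate=1e-3)
```

With those assumed numbers, `p` comes out to roughly 0.001, meaning about one alert in a thousand would point at a real rogue AI. When the thing being detected is rare, even a small false positive rate dominates, which is why the detector's own error rate can matter more than anything the attacker does.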
1. ^
  This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.
2. ^
  When I say “external debugger”, this is implying physical access to the system.
3. ^
  In order to get the encryption keys, the attacker needs physical access to the system. Attacks that would use a side-channel attack won’t work here, because we don’t accept arbitrary traffic, and in order to generate that traffic, you need the keys. It’s a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.
4. ^
  To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
5. ^
  This is assuming that we’ve verified the device that sends the alert message, which we’ll have high confidence in because of how simple the implementation will be.

anonymousaisafety Jun 21, 2022, 1:04 AM
3 points
0
in reply to: TekhneMakre’s comment on: Pivotal outcomes and pivotal processes
Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable?
I can’t comment on this, because I have no idea what this system would even do, or what it would be looking for, or who would be running it.
What indicates a rogue AI? How is that differentiated from human hackers? How is that different from state-level actors like the NSA? How is it different from some white hat security researchers doing pen-testing in the wild for questionable reasons? What “behavior” is the system trying to look for?
How is it monitoring? What inputs is it taking? Is it only looking at internet traffic? How? Where is it tapping for those packets? There’s a lot of internet traffic. Is this even “a computer”, or is it “an entire facility of computers”? Why are there no humans involved in this process? Where do the logs, monitoring, or other telemetry go if this is done “without humans”? Wouldn’t this system be reporting some type of digest or summary for review? Why do you think that it would need to be a general computer? Do you think that high-performance SoCs do not exist, or that it would not be possible to have a distributed system containing multiple SoCs with the equivalent RAM and compute of some general-purpose machine?
What would it mean to hack this system? Just that it doesn’t report any rogue AI anymore? Or that it erroneously reports the presence of rogue AI? What would hacking it to not report rogue AI look like? Are we concerned that it would log the presence of a rogue AI, but then the rogue AI would be able to wipe that entry from some log file, like a movie? Or are we thinking that the rogue AI is going to disable this system before it is even noticed by the system? Do we count “turning off the power” as “hacking”?

anonymousaisafety Jun 21, 2022, 12:43 AM
10 points
6
in reply to: Noosphere89’s comment on: Pivotal outcomes and pivotal processes
I don’t think that this TL;DR is particularly helpful.
People think attacks like Rowhammer are viable because security researchers keep releasing papers that say the attacks are viable.
If I posted 1 sentence and said “Rowhammer has too many limitations for it to be usable by an attacker”, I’d be given 30 links to papers with different security researchers all making grandiose claims about how Rowhammer is totally a viable attack, which is why 8 years after the discovery of Rowhammer we’ve had dozens of security researchers reproduce the attack and 0 attacks in the wild^[1] that make use of it.
If my other posts haven’t made this clear, I think almost all disagreements in AI x-risk come down to a debate over high-level vs low-level analysis. Many things sound true as a sound-bite or quick rebuttal in a forum post, but I’m arguing from my perspective and career spent working on hardware/software systems that we’ve accumulated enough low-level evidence (“the devil is in the details”) to falsify the high-level claim entirely.
1. ^
  We can argue that just because we don’t know that someone has used Rowhammer, or a similar probabilistic hardware vulnerability, doesn’t mean that someone hasn’t. I don’t know if that’s a useful tangent either. The problem is that people use these side-channel attacks as an “I win” button in arguments about secure software systems by making it seem like the existence of side-channel exploits is proof that security is a lost cause. It isn’t. It isn’t about the intelligence of the adversary, it’s that the target basically needs to be sitting there, helping the attack happen. On any platform where part of the stack is running someone else’s code, yeah, you’re going to get pwned if you just accept arbitrary code, so maybe don’t do that? It is not rocket science.
What links here?
- Noosphere89's comment on eggsyntax’s Shortform by eggsyntax (Sep 17, 2024, 8:56 PM; 2 points)

anonymousaisafety Jun 20, 2022, 8:03 PM
15 points
3
in reply to: Eliezer Yudkowsky’s comment on: Pivotal outcomes and pivotal processes
I am only replying to the part of this post about hardware vulnerabilities.
Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
There are dozens of hardware vulnerabilities that exist primarily to pad security researchers’ bibliographies.
Rowhammer, like all of these vulnerabilities, is viable if and only if the following conditions are met:
- You know the exact target hardware.
- You also know the exact target software, like the OS, running on that hardware.
  - You need to know what in RAM you’re trying to flip, and where to target to do so, like a page table or some type of bit for the user’s current access level.
- The target needs to actually execute your code.
  - In attacks where security researchers pat themselves on the back for pulling off Rowhammer remotely, it’s because they use JavaScript or WebGL in a browser, and then pwn devices that use browsers. This is a flaw almost entirely reserved for general-purpose compute hardware, because embedded software or other application-specific hardware/software systems don’t need or have browsers in them.
  - In all other attacks, it involves downloading & executing a program on the target machine. Normally the example is given with cloud VMs whose entire purpose is to run code from an external source. Again, this is reserved for general-purpose compute hardware, because systems that execute code only out of read-only memory will not be able to execute an attacker’s code.
  - There is time on the target to run the Rowhammer attack uninterrupted. It relies on a continuous and degenerate set of instructions. This can be anywhere from minutes to days of time. This means that systems that don’t give uninterrupted time to external code are also not vulnerable.
- The target OS, or other software, on the system needs to not perform any type of RAM scrubbing. There are papers claiming that variants of Rowhammer work for systems that use ECC + scrubbing, but those papers also assume that the scrubbing happens over hours. If a system has very little RAM, like an embedded processor, it is feasible for hardware to scrub RAM far faster than that.
  - These exploits also add the requirement that the attacker needs to know the exact target RAM and ECC algorithm.
- The target hardware/software system needs to not have any hardware level redundancy. You can’t rowhammer a system that has 3 separate computers that compare & vote their state on a synchronized clock. Hardware vulnerabilities are probabilistic attacks. They don’t work anymore if the attack must occur simultaneously, and identically on separate physical systems. This is another reason why we’re able to build systems that function despite hostile environments where bit flips are routine, i.e. high radiation.
- The target needs to not crash. Semi-randomly flipping bits in roughly the right location in RAM is not something that most software is designed to handle, and in an overwhelming number of cases, trying to execute this attack will crash the system.
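The hardware-redundancy condition in the list above (three computers that compare and vote their state) is usually called triple modular redundancy. A minimal sketch of the voting step, assuming each replica's state can be compared as a plain value; the function name and the example values are hypothetical:

```python
from collections import Counter


def vote(a, b, c):
    """Return the value at least two of the three replicas agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


# Usage: a bit flip in one replica is outvoted by the other two.
state = 0b1010
corrupted = state ^ 0b0100  # a single flipped bit in replica b
agreed = vote(state, corrupted, state)
```

To get past the voter, an attacker would have to produce the same wrong value in at least two physically separate machines within the same voting cycle, which is the point made above about probabilistic attacks.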
It’s not that Rowhammer isn’t possible in the sense that it cannot be shown to work, but it’s like this paper showing that you can create WiFi signals in air-gapped computers. Or this fun paper for Nethammer showing novel attacks that don’t require code execution on the target machine, except they also don’t allow for controlling where bit flips occur, so the “attack” is isomorphic to an especially hostile radiation environment with a high likelihood of bit-flips, and it relies on the ability for the attacker to swarm the target system with a high volume (500 Mbps?) of network traffic that they control—a network switch that drops unexpected traffic or even just rate-limits it will defeat Nethammer. Note that rate-limiting network traffic is in fact standard practice for high stability systems, because it’s also a protection against much more mundane denial-of-service attacks.
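For the rate-limiting point, one common approach is a token bucket. This is a generic sketch and not a description of any particular switch; the class name and parameters are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allows at most `burst` packets at once and `rate_per_s` on average."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # time of the previous refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True to forward the packet, False to drop it."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Traffic above the configured rate is simply dropped, so an attack that depends on sustaining a high packet volume never gets that volume delivered to the target.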
Consumer systems are vulnerable to attacks, because consumer systems don’t care about stability. Consumers want to have a fast network connection to the internet. There’s no requirement, or need, for that to be true on a system designed for stability, like something in a satellite, or some other safety-critical role. It is possible to have systems that are effectively “not able to be hacked”—they don’t use general-purpose hardware, they don’t have code that can be modified, they have no capability for executing external code, they include hardware level fault tolerance and redundancy, and they have exceptionally limited I/O. It doesn’t require us presuming “superhuman-at-security AGIs” exist to design these systems.
Every few weeks researchers publish papers carefully documenting the latest side-channel attacks that result in EVERYTHING EVERYWHERE BEING VULNERABLE FOREVER, and every few weeks attackers continue to do the boring old thing of leaving USB drives lying around for a target to pwn themselves, or letting the target just download the malware directly to their machine. They’re almost all just remixes too—it’s “here’s how to do exploit A (the original), but on hardware systems that implemented mitigations for C, D, and E”. Except exploit A still has all of the preconditions I listed above, and now you’ve got whatever new preconditions the security researchers have in their latest paper.
What links here?
- Noosphere89's comment on What can we learn from insecure domains? by Logan Zoellner (Nov 2, 2024, 4:27 PM; 5 points)
- Noosphere89's comment on eggsyntax’s Shortform by eggsyntax (Sep 17, 2024, 8:56 PM; 2 points)

anonymousaisafety Jun 20, 2022, 10:37 AM
25 points
4
on: Let’s See You Write That Corrigibility Tag
I worry that the question as posed is already assuming a structure for the solution—“the sort of principles you’d build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it”.
When I read that, I understand it to be describing the type of behavior or internal logic that you’d expect from an “aligned” AGI. Since I disagree that the concept of “aligning” an AGI even makes sense, it’s a bit difficult for me to reply on those grounds. But I’ll try to reply anyway, based on what I think is reasonable for AGI development.
In a world where AGI was developed and deployed safely, I’d expect the following properties:
1. Controlled environments.
2. Controlled access to information.
3. Safety-critical systems engineering.
4. An emphasis on at-rest encryption and secure-by-default networking.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
6. Systems with AGI are assumed to be adversarial.
Let’s stop on the top of the mountain and talk about (6).
Generally, the way this discussion goes is we discuss how unaligned AGI can kill everyone, and therefore we need to align the AGI, and then once we figure out how to align the AGI, problem solved, right?
Except someone then points out that, well, other people might create unaligned AGI, and then that will kill everyone, so that’s awkward.
Also maybe the team that thought they aligned the AGI actually didn’t, their proof had a mistake, whoops.
Or they had a formally verified proof, but they deployed it to a general purpose computer, implemented zero hardware redundancy, and then a bit flip caused the system to kill everyone anyway, whoops.
So normally we don’t discuss the last 2 failure modes, because it’s a bit awkward for the discussion of alignment, and we instead talk about how the deployment of the aligned AGI is going to go totally fine, and then we’ll just do a little pivotal act.
How does AGI kill everyone in these stories? Almost always, the answer is nanotechnology. If the answer isn’t nanotechnology, it’s hacking into <hackable thing>.
When you try to suggest that maybe the AGI shouldn’t be able to hack into <hackable thing>, you’re told that’s insane, because that would be boxing the AGI, and we know that boxing doesn’t work.
The world I am going to describe might not be our world, but it also isn’t Eliezer’s world.
We are separated by a few critical assumptions—and it starts with hardware.
1. Controlled environments.
In a world where we cared about AGI being developed safely, here’s a fun idea: you know all that TPU hardware that Google designed, or the other custom AI/ML chips that companies are working on? It’s all just glorified GPUs, except the Mythic AMP is kind of cool, since it’s an actual analog compute platform for running neural networks, like an analog FPGA. What if AGI wasn’t developed to run on general-purpose consumer hardware? What if it can’t be run on general-purpose hardware? What if the general intelligence algorithm is computationally inefficient enough that the path to AGI actually looks like incredibly specialized, heavily parallelized, wide processors with low clock rates, large caches, and fast memory access?
In this world, the general intelligence algorithm needs hardware designed for it because no amount of pipelining with a traditional CPU, or even a GPU, is able to run that code efficiently.
This is a world where the AGI machine code must be flashed to the ROM via an external debugger. There is no ability to flash or update the code from within the hardware/software system. The AGI cannot update itself. We do not design, deploy, or use self-improving systems.
This is a world where the hardware used for AGI development uses a memory-protection unit so that memory accesses to the ROM containing the AGI machine code trigger a hardware exception. It is not possible to read the ROM from within the hardware/software system. The AGI cannot inspect its own source code.
This is a world where the AGI has RAM, but no accessible non-volatile memory. If it loses power, poof, gone. The AGI cannot maintain, or persist, any type of internal or “custom” state between resets. Anything the AGI needs to know is either baked into the image flashed to the ROM, or it’s transferred into the AGI’s RAM after power-on.
This is a world where we deliberately leave performance on the floor because what we want from the system is consistency, predictability, and some minimum level of effectiveness above human capabilities—even if we hypothetically could run an AGI system at some 1000x “speed” (whatever that means) relative to a human mind, doesn’t mean that you’d have to do so. If you’re worried about not being able to react to sudden or unexpected changes, then deliberately run the system at arbitrarily low clock rates. There is zero reason why we couldn’t run an AGI on a custom hardware/software system that has a clock rate so low it thinks in glacial time relative to human perception.
This is a world where it’s reasonable to regulate the AGI hardware the same way we regulate components for ICBMs, because it is not ubiquitously used, and we don’t need to burn everyone’s GPU—we just need to not sell consumer AGI hardware, in the same way that we don’t sell consumer GPS units that could be used for an ICBM instead of a car. The very concept of “stealing” or “releasing” the AGI doesn’t make sense, because the hardware to run the AGI machine code is as carefully-controlled and rare as the hardware for custom-built spacecraft.
2. Controlled access to information.
In a world where we cared about AGI being developed safely, no one would be dumb enough to allow that system unrestricted access to networks. This would be accomplished by external, non-AGI systems. If we don’t trust the AGI, it would be weird to try and use some other AGI system as a watchdog. So instead, use custom network hardware. Build it with an FPGA. Ethernet ports on either side. The custom switch has a simple rule: when a packet is received from the AGI, request approval. If no approval is given, drop the packet. Same for incoming traffic. If the complaint is that this is hugely inefficient because it slows down the AGI’s ability to do ad-hoc discovery & learning, that’s the point. If we want the AGI to solve a problem about optimizing car production, we will give it a data dump of material that we think is relevant for the problem. If it then wants to request additional information, then we have to approve that request. This is how workers are treated on projects with highly proprietary or otherwise silo’d information and that’s how the AGI should be treated too.
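The default-deny rule for the custom switch can be sketched in a few lines. This is only an illustration of the policy, not FPGA logic; `Packet`, `ApprovalGate`, `reviewer`, and the allow-list contents are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Packet:
    source: str     # hypothetical label, e.g. "agi" or "outside"
    payload: bytes


@dataclass
class ApprovalGate:
    # approve() stands in for the external, non-AGI review step.
    # Only an explicit True lets a packet through.
    approve: Callable[[Packet], bool]
    dropped: List[Packet] = field(default_factory=list)

    def forward(self, packet: Packet) -> Optional[Packet]:
        """Return the packet if approved; otherwise drop it and return None."""
        if self.approve(packet) is True:
            return packet
        self.dropped.append(packet)  # default-deny: no approval, no delivery
        return None


# Hypothetical reviewer policy: only pre-approved requests pass.
ALLOWED_REQUESTS = {b"GET car_production_dataset"}


def reviewer(packet: Packet) -> bool:
    return packet.payload in ALLOWED_REQUESTS


gate = ApprovalGate(approve=reviewer)
ok = gate.forward(Packet("agi", b"GET car_production_dataset"))
blocked = gate.forward(Packet("agi", b"GET cctv_feeds"))
print(ok is not None, blocked is None, len(gate.dropped))
```

The same `forward` check would apply in both directions, so incoming traffic is gated exactly like outgoing traffic.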
But there’s more. We’re running the AGI on custom hardware, right? Internal to the AGI board, after the Ethernet port on the PCB (let’s assume we’re using one or more gigabit links), we’ll put a hardware security module and FPGA for decrypting the packets. This is not connected to, or part of, the hardware/software system running the AGI code. It is located on the PCB, but otherwise it is a separate system entirely. This system has one job: try to decrypt the traffic and drop anything that is invalid. Why are we going to do this? Because outside of this system, on the non-AGI network switch, we are going to encrypt traffic that we approve for the AGI. If we don’t approve the traffic, it’ll get dropped. We’ll do the reverse for traffic generated by the AGI. It is always encrypted prior to leaving the system, and the AGI itself has no control over this function. If the custom network hardware isn’t there to decrypt the approved traffic, it’ll just be babbling noise indistinguishable from random data.
There’s no WiFi module on this hardware, and no public WiFi endpoints in the facility. All WiFi access points in the facility require passwords and have a white-list of approved MAC addresses.
The reason why controlling access to information matters is because there are hard limits to how far or fast an intelligent entity can reason in a vacuum. There is enough context-dependent information that you can’t reason everything from first principles, because the world’s current state has tons of structure embedded in it that humans put in place, and anything that wants to use human supply chains, logistics, and manufacturing is going to have to respect those totally arbitrary and unpredictable constraints.
An AGI won’t be able to magically decode arbitrary data. If we don’t give it access to datasheets, technical descriptions, or allow it to interact with data sources, it’ll be just as stuck as any other intelligence faced with an arbitrary number of hypotheses and no way to rule any of them out.
Depending on how complex the “general intelligence” algorithm turns out to be, just reasoning through if-then logic and Bayesian world modeling could be non-trivial compute time. The more we control information for the AGI, the less feasible plans that involve looking at thousands of CCTV cameras simultaneously will seem.
3. Safety-critical systems engineering.
In a world where we cared about AGI being developed safely, the hardware that I’m describing above containing the hardware security module would actually have at least 3 identical processors running on a synchronized clock. The hardware security module would expect 3 identical packets for any I/O. If that did not occur, an immediate interrupt would power-cycle the system—again, this is outside of the processors running the AGI code itself. This is because even if we assumed the AGI was “aligned”, we don’t want uncontrollable behavior like radiation-induced bit flips, hardware defects from manufacturing, or other unexpected issues to result in erroneous behavior.
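As a rough software model of that lockstep check (the real thing would be hardware outside the AGI processors), the rule is strict agreement rather than majority vote: three identical packets or a power-cycle. `check_lockstep`, `PowerController`, and `handle_io` are hypothetical names for this sketch.

```python
from typing import Optional, Sequence


def check_lockstep(outputs: Sequence[bytes]) -> Optional[bytes]:
    """Return the agreed packet if all three processors emitted identical
    bytes, otherwise None to signal a fault."""
    if len(outputs) != 3:
        raise ValueError("expected exactly three processor outputs")
    first = outputs[0]
    if all(o == first for o in outputs):
        return first
    return None


class PowerController:
    """Stand-in for the external, non-AGI power control hardware."""

    def __init__(self) -> None:
        self.power_cycles = 0

    def power_cycle(self) -> None:
        self.power_cycles += 1


def handle_io(outputs: Sequence[bytes], power: PowerController) -> Optional[bytes]:
    packet = check_lockstep(outputs)
    if packet is None:
        power.power_cycle()  # any disagreement resets the whole system
    return packet


power = PowerController()
good = handle_io([b"\x10\x20", b"\x10\x20", b"\x10\x20"], power)
# Simulate a single bit flip in the third processor's output.
bad = handle_io([b"\x10\x20", b"\x10\x20", b"\x10\x21"], power)
print(good, bad, power.power_cycles)
```

Note that a majority-vote design would let the flipped packet through with two of three agreeing; the text above asks for the stricter behavior, so that is what the sketch does.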
Power for the entire system would be controlled via external, non-AGI systems. There would be multiple easily accessed locations for throwing a physical switch on the power line. In addition to that, there would be network accessible microprocessors for doing the same remotely, with the caveat that these use small pyrotechnic charges and are therefore irreversible. The default behavior of the system is that the power is controlled via these external systems and on reset, they do not supply power downstream to the AGI system itself. After an unexpected reset, e.g. loss of facility power, it requires manual control to re-enable downstream power to the AGI. This is done on a logically separate power domain from the rest of the facility, i.e. you can turn on power to everything else in the facility, WITHOUT being forced to power anything related to the AGI system.
This is a world where AGI systems are designed under similar paranoia that you’d use for developing any other hardware/software system that could kill people, and the resulting pessimism drives us to isolate and limit functionality using non-AGI systems by default, where those non-AGI systems can be designed, implemented, verified, and QA-tested all using traditional systems safety methods.
4. An emphasis on at-rest encryption and secure-by-default networking.
In a world where we cared about limiting the damage that AGI could do, we would ask manufacturers of hard drives, solid state drives, and every operating system to require encryption at rest. Your work laptop probably already does this. If you’re paranoid, you’ve got it enabled on your cellphone too. Do it everywhere. This is just good security practice, even without an AGI, but this is doubling down on (2) -- we are controlling access to information. Limit what can be known. Prevent discovery.
Ditto for the second point. Emphasize encrypted networking, VPNs, firewalls, and anything else that makes it harder for any attacker to find targets. Importantly, and I want to emphasize this point: maybe don’t connect factories to the internet and make them totally remote-controllable. This isn’t actually a problem in today’s world, as I’ve posted about before, but it’s something to think about going forward. If a failure mode in many stories where the AGI kills everyone is “well it could hack this cool networked factory”, maybe don’t do that? The weird thing about factories is that there kind of aren’t that many in the world that conceivably fit that description? Like, there are lots of factories, but very few that are designed to build arbitrary products. It’s a handful of places in China and they still require manual reconfiguration for new products. Most things called “factories” build specific products. The AGI that somehow takes over the Ford car factory is not going to immediately begin turning out things that are not Ford cars.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
Remember all of those non-AGI systems from 2 and 3? They’re constantly logging and telemetering information. If the AGI suddenly starts babbling like an idiot unexpectedly, cut the power. If the AGI is requesting information and has been denied repeatedly, cut the power. If the AGI unexpectedly starts using significantly more power, or the power begins to fluctuate in “weird” ways, cut the power. If any of these non-AGI systems report errors, or failures, cut the power. None of this needs to be controlled by humans. It’s all driven by simple microprocessors running watchdogs. Even failures like “this switch is supposed to report telemetry, but didn’t” will result in killing the power.
If you can do this for systems whose failure mode is dozens of people die, you can do this for the system whose failure mode is “everybody dies”.
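The circuit-breaker rules listed above are simple enough to fit on a small microprocessor. A minimal sketch of the decision logic, assuming made-up thresholds and a hypothetical `Telemetry` record (none of these numbers come from a real system):

```python
from dataclasses import dataclass
from typing import List

# All thresholds are hypothetical placeholders for illustration.
MAX_SILENCE_S = 2.0       # longest allowed gap between telemetry reports
MAX_DENIED_REQUESTS = 3   # repeated denied information requests
MAX_POWER_W = 500.0       # expected upper bound on power draw


@dataclass
class Telemetry:
    seconds_since_last_report: float
    denied_requests: int
    power_watts: float
    subsystem_errors: int


def trip_reasons(t: Telemetry) -> List[str]:
    """Collect every rule that says the power should be cut."""
    reasons = []
    if t.seconds_since_last_report > MAX_SILENCE_S:
        reasons.append("telemetry missing")
    if t.denied_requests >= MAX_DENIED_REQUESTS:
        reasons.append("repeated denied requests")
    if t.power_watts > MAX_POWER_W:
        reasons.append("power draw out of range")
    if t.subsystem_errors > 0:
        reasons.append("non-AGI subsystem reported an error")
    return reasons


def should_cut_power(t: Telemetry) -> bool:
    # Any single tripped rule is enough; no human in the loop.
    return bool(trip_reasons(t))


healthy = Telemetry(0.5, 0, 320.0, 0)
silent_switch = Telemetry(10.0, 0, 320.0, 0)
print(should_cut_power(healthy), trip_reasons(silent_switch))
```

The "switch is supposed to report telemetry, but didn't" case is the `seconds_since_last_report` rule: silence is treated as a failure, not as good news.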
For the AGI itself, we want interpretability. As much as we can get. We don’t want this to be a function of the AGI, because we don’t trust the AGI. But we don’t need it to be. We’ve got access to the RAM through an external debug interface. We don’t need the AGI to report that state, and we don’t have to trust it.
Even though we won’t use AGI systems as watchdogs, it is acceptable to use traditional AI/ML systems for parts of this loop. Maybe no human can parse what the AGI’s RAM says, but if we can train an AI/ML system to do a good job of producing a human-readable digest, that is better than nothing.
In a world where we cared about AGI being developed safely, the AGI is still treated like a task-based service. The AGI is given a problem to solve, and when it reports a solution and we finish negotiating with it on the particulars of the solution, we reset the system. We don’t maintain long-running AGI instances because there’s no need. Point it at a problem, get the solution, power-cycle it.
6. Systems with AGI are assumed to be adversarial.
Here we are again, back on the top of the mountain.
The point of the systems safety process is to make some system safe, despite the inherent unsafety of components within it.
To do that, we look for the ways it can cause harm, and we find mitigations.
We practice defense-in-depth, with an emphasis on redundancy and at least one-fault-tolerance.
In a world where we cared about not dying to AGI, it would be insufficient to expect that an aligned AGI would “just work”. It is not enough for there to be a proof that a hypothetical product as-designed is safe. We also need to look at the likelihood of:
- design failures (the formal proof was wrong because the verification of it had a bug, there is no formal proof, the “formally verified” proof was actually checked by humans and not by an automated theorem prover)
- manufacturing failures (hardware behavior out-of-spec, missed errata, power failures, bad ICs, or other failure of components)
- implementation failures (software bugs, compiler bugs, differences between an idealized system in a proof vs the implementation of that system in some runtime or with some language)
- verification failures (bugs in tests that resulted in a false claim that the software met the formal spec)
- environment or runtime failures (e.g. radiation-induced upsets like bit flips; Does the system use voting? Is the RAM using ECC? What about the processor itself?)
- usage failures (is the product still safe if it’s misused? what type of training or compliance might be required? is maintenance needed? is there some type of warning or lockout on the device itself if it is not actively maintained?)
- process failures (“normalization of deviance”)
For each of these failure modes, we then look at the worst-case magnitude of that failure. Does the failure result in non-functional behavior, or does it result in erroneous behavior? Can erroneous behavior be detected? By what? Etc. This type of review is called an FMEA (failure mode and effects analysis). This review process can rule out designs that “seem good on paper” if there’s sufficient likelihood of failures and inability to mitigate them to our desired risk tolerances outside of just the design itself, especially if there exist other solutions in the same design space that do not have similar flaws.
If we did this process, and we stuck to it, I think we’d come to an awkward conclusion.
Many things that people have assumed we would give to the AGI to solve, we could just not.
If the AGI tells us the way to solve climate change is nanobots, we can say “no thanks, give us a different solution”. It doesn’t matter if the AGI promises us this solution is 99.99% efficient and the next best option using boring, conventional technology is only 70% efficient. We can opt to not do things that we don’t think we can verify.
Or if the AGI gives us the plans for a remote controlled factory that is going to be so efficient because it’s totally controlled by the AGI over the internet—don’t build that factory. Ask the AGI for the schematics, data sheets, and any other design material needed to build a new product, and build them the old fashioned way. Trade efficiency for stability.
That’s how the system safety process is supposed to work. If after we do an FMEA, we are uncertain about a proposed solution, the ethical and correct response is to reject that solution!
If we ask the AGI to explain a plan, and it says “Humans can’t comprehend the complexity of this.”, then we should reply “Toss that one then, give us one we can understand”.
That’s how design reviews work.
You don’t tell the review board, “You wouldn’t understand why this design is safe.”
Why would I hold a super-intelligent AGI to a lower standard than I would a human engineer?
Conclusion
Is any of this feasible?
In my mind, the key assumption comes down to whether AGI can be deployed to general-purpose hardware.
I think the answer might be “no”, which is why we increasingly see the most successful AI/ML labs in the world investing in specialized hardware like the Google TPUs, Cerebras Wafer-Scale Engine, Nvidia Jetson, Microsoft / Graphcore IPU, Mythic AMP, or literal dozens of other examples. All of these are examples of specialized, dedicated HW for AI/ML systems that replace general-purpose hardware like a CPU or GPU.
The alternative is a world where a 10-year-old laptop can run an AGI in someone’s living room.
I have not seen anything yet that makes me think we’re leaning that way. Nothing about the human brain, or our development of AI/ML systems so far, makes me think that when we create an actual AGI, it’ll be possible for that algorithm to run efficiently on general-purpose hardware.
In the world that I’m describing, we do develop AGI, but it never becomes ubiquitous. It’s not like a world where every single company has pet AGI projects. It’s the ones you’d expect. The mega-corporations. The most powerful nations. AGIs are like nuclear power plants. They’re expensive and hard to build and the companies that do so have zero incentive to give that away. If you can’t spend the billions of dollars on designing totally custom, novel hardware that looks nothing like any traditional general-purpose computer hardware built in the last 40 years, then you can’t develop the platform needed for AGI. And for the few companies that did pull it off, their discoveries and inventions get the dubious honor of being regulated as state secrets, so you can’t buy it on the open market either. This doesn’t mean AI/ML development stops. The development of AGI advances that field immensely too. It’s just that in the world I’m describing, even after AGI is discovered, we still have development focused on creating increasingly powerful, efficient, and task-focused AI/ML systems that have no generality or agent-like behavior. A lack of capabilities isn’t a dead-end, it’s yet another reason why this world survives. If you don’t need agents for a problem, then you shouldn’t apply an agent to that problem.

anonymousaisafety Jun 19, 2022, 10:53 PM
62 points
in reply to: lc’s comment on: [Link-post] On Deference and Yudkowsky’s AI Risk Estimates
Only if we pretend that it’s an unknowable question and that there’s no way to look at the limitations of a 286 by asking about how much data it can reasonably process over a timescale that is relevant to some hypothetical human-capable task.
http://datasheets.chipdb.org/Intel/x86/286/datashts/intel-80286.pdf
The relevant question here is about data transfers (bus speed) and arithmetic operations (instruction sets). Let’s assume the fastest 286 listed in this datasheet -- 12.5 MHz.
Let’s consider a very basic task—say, catching a ball thrown from 10-15 feet away.
To simplify this analysis, we are going to pretend that if we can analyze 3 image frames, in close proximity, then we can do a curve fit and calculate the ball’s trajectory, so that we don’t need to look at any other images.
Let’s also assume that a 320x240 image is sufficient, and that it’s sufficient for the image to be in 1-byte-per-pixel grayscale. For reference, that looks like this:
With the 12.5 MHz system clock, we’re looking at 80 nanoseconds per clock cycle.
We’ve got 76800 bytes per image and that’s the same as 38400 processor words (it is 16-bit).
Because data transfers are one word per two processor clock cycles, and a processor clock cycle is two system clock cycles, we’ve got one word for every 4 system clock cycles. At 80 nanoseconds per system clock cycle, that’s 320 nanoseconds per word. Multiply that through and we’ve got ~12.3 ms to transfer each image into memory, or about 37 milliseconds total for the 3 images we need.
We can fit all 3 images into the 1 megabyte of address space allowed in real addressing mode, so we don’t need to consider virtual addresses. This is good because virtual addresses would be slower.
In order to calculate the ball’s trajectory, we’ll need to find it in each image, and then do a curve fit. Let’s start with just the first image, because maybe we can do something clever on subsequent images.
We’ll also assume that we can do depth perception on each image because we know the size of the ball we are trying to catch. If we didn’t have that assumption, we’d want 2 cameras and 2 images per frame.
Most image processing algorithms are O(N) with respect to the size of the image. The constant factor is normally the number of multiplications and additions per pixel. We can simplify here and assume it’s like a convolution with a 3x3 kernel, so each pixel is just multiplied 9 times and added 9 times. This is a comical simplification because any image processing for “finding a ball” requires significantly more compute than that.
Let’s also assume we can do this using only integer math. If it was floating point, we’d need to use the 80287^[1], and we’d pay for additional bus transfers to shuffle the memory to that processor. Also, math operations on the 80287 seem to be about 100-200 clock cycles, whereas our integer add and integer multiply are only 7 and 13 clock cycles respectively.
So each pixel is 9 multiplies and 9 additions, which at 12.5 MHz system clock gives us 14.4 microseconds per pixel, or 1.1 seconds per image.
Note that this is incredibly charitable, because I’m ignoring the fact that we only have 8 registers on this processor, so we’ll actually be spending a large amount of time on clock cycles just moving data into and out of the registers.
Since we have 3 images, and if we can’t do some type of clever reduction after the first image, then we’ll have to spend 1.1 seconds on each of them as well. 1.1 seconds is a long enough period of time that I’m not sure you can make any reasonable estimate about where the ball might be in subsequent frames after a single sample, so we are probably stuck.
That means we’re looking at 3.3 seconds before we can do that curve fit to avoid the expensive image processing work. Unless the ball is being thrown from very far away (and if it was, it wouldn’t be resolvable with this low image resolution), this system is not going to be able to react quickly enough to catch a ball thrown from 10-15 feet away.
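The per-pixel compute estimate is easy to reproduce. This only restates the arithmetic above under the same assumptions (12.5 MHz clock, 9 integer multiplies at 13 cycles and 9 integer adds at 7 cycles per pixel, no register-shuffling overhead); the variable names are just for the example.

```python
SYSTEM_CLOCK_HZ = 12.5e6
CYCLE_S = 1 / SYSTEM_CLOCK_HZ            # 80 ns per clock cycle

WIDTH, HEIGHT = 320, 240
PIXELS = WIDTH * HEIGHT                  # 76800 one-byte grayscale pixels

MUL_CYCLES, ADD_CYCLES = 13, 7           # integer multiply / add costs used above
OPS_PER_PIXEL = 9                        # 3x3 kernel: 9 multiplies and 9 adds

cycles_per_pixel = OPS_PER_PIXEL * (MUL_CYCLES + ADD_CYCLES)   # 180 cycles
seconds_per_pixel = cycles_per_pixel * CYCLE_S                 # about 14.4 microseconds
seconds_per_image = seconds_per_pixel * PIXELS                 # about 1.1 seconds
seconds_for_three = 3 * seconds_per_image                      # about 3.3 seconds

print(f"{seconds_per_pixel * 1e6:.1f} us/pixel")
print(f"{seconds_per_image:.2f} s/image")
print(f"{seconds_for_three:.2f} s for 3 images")
```

Everything here is a lower bound: adding memory traffic or a more realistic ball detector only pushes the total further past 3.3 seconds.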
Conclusion
Now is the point in this conversation where someone starts suggesting that a superhuman intelligence won’t need to look at pixels, or transfer data into memory, and they’ll somehow use algorithms that side-step basic facts about how computers work like how many clock cycles a multiplication takes. Or someone suggests that intelligence, as an algorithm, is not like looking at pixels, and reasoning about facts & logic & inferences requires far fewer math operations, so it’s not at all comparable, despite the next obvious question being, “what is the time-complexity of general intelligence?”
1. ^
  http://www.bitsavers.org/components/intel/_dataSheets/80287_Data_Sheet_Feb83.pdf
What links here?
- Thoughts on hardware / compute requirements for AGI by Steven Byrnes (Jan 24, 2023, 2:03 PM; 63 points)

anonymousaisafety Jun 19, 2022, 8:33 PM
29 points
in reply to: lc’s comment on: [Link-post] On Deference and Yudkowsky’s AI Risk Estimates
the rest of the field has come to regard Eliezer as largely correct
It seems possible to me that you’re witnessing a selection bias where the part of the field who disagree with Eliezer don’t generally bother to engage with him, or with communities around him.
It’s possible to agree on ideas like “it is possible to create agent AGI” and “given the right preconditions, AGI could destroy a sizeable fraction of the human race”, while at the same time disagreeing with nearly all of Eliezer’s beliefs or claims on that same topic.
That in turn would lead to different beliefs for what types of approach will work, which could go a long way towards explaining why so many AI research labs are not pursuing ideas like pivotal acts or other Eliezer-endorsed solutions.
For example, the linked post didn’t use this quote when discussing Eliezer’s belief that intelligence doesn’t require much compute power, but as recently as 2021 (?) he said
Well, if you’re a superintelligence, you can probably do human-equivalent human-speed general intelligence on a 286, though it might possibly have less fine motor control, or maybe not, I don’t know. [source]
“or maybe not, I don’t know” is doing a lot of work in covering that statement.
What links here?
- anonymousaisafety's comment on The “Outside the Box” Box by Eliezer Yudkowsky (Mar 23, 2023, 9:59 PM; 2 points)

anonymousaisafety Jun 18, 2022, 7:41 PM
3 points
in reply to: Kenny’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
Then I’m not sure what our disagreement is.
I gave the example of a Kalman filter in my other post. A Kalman filter is similar to recursive Bayesian estimation. It’s computationally intensive to run for an arbitrary number of values due to how it scales in complexity. If you have a faster algorithm for doing this, then you can revolutionize the field of autonomous systems + self-driving vehicles + robotics + etc.
The fact that “in principle” information provides value doesn’t matter, because the very example you gave of “updating belief networks” is exactly what a Kalman filter captures, and that’s what I’m saying is limiting how much information you can realistically handle. At some point I have to say, look, I can reasonably calculate a new world state based on 20 pieces of data. But I can’t do it if you ask me to look at 2000 pieces of data, at least not using the same optimal algorithm that I could run for 20 pieces of data. The time-complexity of the algorithm for updating my world state makes it prohibitively expensive to do that.
This really matters. If we pretend that agents can update their world state without incurring a cost of computation, and that it’s the same computational cost to update a world state based on 20 measurements as it would take for 2000 measurements, or if we pretend it’s only a linear cost and not something like N^2, then yes, you’re right, more information is always good.
But if there are computational costs, and they do not scale linearly (like a Kalman filter), then there can be negative value associated with trying to include low quality information in the update of your world state.
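To put numbers on the 20-versus-2000 example, here is a small sketch of how the work grows under different scaling assumptions. The exponents are illustrative: the text above says "something like N^2", and the dense matrix products in a textbook Kalman update are often costed as cubic, so both are shown. `relative_cost` is a hypothetical helper.

```python
def relative_cost(n_small: int, n_large: int, exponent: int) -> float:
    """How many times more work the larger update needs if the cost of an
    update grows as N**exponent."""
    return (n_large / n_small) ** exponent


ratios = {
    label: relative_cost(20, 2000, k)
    for label, k in [("linear", 1), ("quadratic", 2), ("cubic", 3)]
}

for label, ratio in ratios.items():
    print(f"{label:>9}: {ratio:,.0f}x the work for 100x the data")
```

With linear cost, 100x the measurements is 100x the work. At N^2 it is 10,000x, and at N^3 it is 1,000,000x, which is where low-quality extra inputs can start to carry negative value.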
It is possible that the doctors are behaving irrationally, but I don’t think any of the arguments here prove it. Similar to what mu says on their post here.

anonymousaisafety Jun 18, 2022, 7:12 PM
2 points
in reply to: Kenny’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
Are you ignoring the cost of computation to use that information, as I explained here then?

anonymousaisafety Jun 18, 2022, 6:19 PM
1 point
in reply to: faul_sname’s comment on: I No Longer Believe Intelligence to be “Magical”
I was describing a file that would fit your criteria but not be useful. I was explaining in bullet points all of the reasons why that file can’t be decoded without external knowledge.
I think that you understood the point though, with your example of data from the Hubble Space Telescope. One caveat: I want to be clear that the file does not have to be all zeroes. All zeroes would violate your criteria that the data cannot be compressed to less than 10% of its uncompressed size, since all zeroes can be trivially run-length-encoded.
But let’s look at this anyway.
You said the file is all zeroes, and it’s 1209600 bytes. You also said it’s pressure readings in kPa, taken once per second. You then said it’s 2^11 x 3^3 x 5^2 x 7 zeroes—I’m a little bit confused on where this number came from? That number is 9676800, which is larger than the file size in bytes. If I divide by 8, then I get the stated file size, so maybe you’re referring to the binary sequence of bits being either 0 or 1, and then on this hardware a byte is 8-bits, and that’s how those numbers connect.
In a trivial sense, yes, that is “deriving the structure but not the meaning”.
What I really meant was that we would struggle to distinguish between:
- The file is 1209600 separate measurements, each 1-byte, taken by a single pressure sensor.
- The file is 604800 measurements of 1 byte each from 2 redundant pressure sensors.
- The file is 302400 measurements, each 4-bytes, taken by a single pressure sensor.
- The file is 241920 measurements, each a 4-byte timestamp field and a 1-byte pressure sensor value.
Or considering some number of values, with some N-byte width:
- The pressure sensor value is in kPa.
- The pressure sensor value is in psi.
- The pressure sensor value is in atmospheres.
- The pressure sensor value is in lbf.
- The pressure sensor value is in raw counts because it’s a direct ADC measurement, so it needs to be converted to the actual pressure via a linear transform.
  - We don’t know the ADC’s reference voltage.
  - Or the data sheet for this pressure sensor.
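Both lists can be made concrete. The sketch below checks that every layout hypothesis is consistent with the 1209600-byte file size, shows the same 4 bytes decoding differently under two layouts, and shows one raw ADC count mapping to different pressures under two made-up calibrations. The byte order, reference voltages, bit depth, and kPa-per-volt scale factors are all assumptions invented for the example.

```python
import struct

FILE_SIZE = 1_209_600

# (description, record size in bytes, record count) for each layout hypothesis
layouts = [
    ("1-byte samples, one sensor", 1, 1_209_600),
    ("2 x 1-byte samples, redundant sensors", 2, 604_800),
    ("4-byte samples, one sensor", 4, 302_400),
    ("4-byte timestamp + 1-byte sample", 5, 241_920),
]
consistent = [name for name, size, count in layouts if size * count == FILE_SIZE]

# The same 4 bytes read under two of those layouts.
chunk = bytes([0x00, 0x00, 0x01, 0x2C])
as_four_samples = struct.unpack("4B", chunk)    # four 1-byte values
as_one_sample = struct.unpack(">I", chunk)[0]   # one big-endian 4-byte value


def counts_to_kpa(counts: int, vref: float, bits: int, kpa_per_volt: float) -> float:
    """Hypothetical linear transform from raw ADC counts to pressure."""
    volts = counts / (2 ** bits - 1) * vref
    return volts * kpa_per_volt


# One raw reading, two equally plausible (made-up) calibrations.
reading_a = counts_to_kpa(300, vref=3.3, bits=12, kpa_per_volt=100.0)
reading_b = counts_to_kpa(300, vref=5.0, bits=12, kpa_per_volt=50.0)

print(len(consistent), as_four_samples, as_one_sample)
print(round(reading_a, 2), round(reading_b, 2))
```

All four layouts pass the size check, so file size alone rules out none of them, and without the reference voltage and the sensor's data sheet there is no way to pick between `reading_a` and `reading_b`.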
Your questions are good.
1. Is it coherent to “understand the meaning but not the structure”?
Probably not? I guess there’s like a weird “gotcha” answer to this question where I could describe what a format tells you, in words, but not show you the format itself, and maybe we could quibble that in such a scenario you’d understand “the meaning but not the structure”.
EDIT: I think I’ve changed my mind on this answer since posting—yes, there are scenarios where you would understand the meaning of something, but not necessarily the structure of it. A trivial example would be something like a video game save file. You know that some file represents your latest game, and that it allows you to continue and resume from where you left off. You know how that file was created (you pressed “Save Game”), and you know how to use it (press “Load Game”, select save game name), but without some amount of reverse-engineering, you don’t know the structure of it (assuming that the saved game is not stored as plain text). For non-video-game examples, something like a calibration file produced by some system where the system can both 1.) produce the calibration via some type of self-test or other procedure, and 2.) receive that calibration file. Or some type of system that can be configured by the user, and then you can save that configuration file to disc, so that you can upload it back to the system. Maybe you understand exactly what the configuration file will do, but you never bothered to learn the format of the file itself.
2. Would it be possible to go from “understands the structure but not the meaning” to “understands both” purely through receiving more data?
In general, no. The problem is that you can generate arbitrarily many hypotheses, but if you don’t control what data you receive, and there’s no interaction possible, then you can’t rule out hypotheses. You’d have to just get exceedingly lucky and repeatedly be given more data that is basically designed to be interpreted correctly, i.e. the data, even though it is in a binary format, is self-describing. These formats do exist by the way. It’s common for binary formats to include things like length prefixes telling you how many bytes follow some header. That’s the type of thing you wouldn’t notice with a single piece of data, but you would notice if you had a bunch of examples of data all sharing the same unknown schema.
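For a concrete picture of a length prefix, here is a minimal sketch of a made-up record format: a 2-byte big-endian length followed by that many payload bytes. The format and the names `encode_records` and `decode_records` are hypothetical, not taken from any real schema.

```python
import struct
from typing import List


def encode_records(payloads: List[bytes]) -> bytes:
    """Each record: 2-byte big-endian length, then that many payload bytes."""
    return b"".join(struct.pack(">H", len(p)) + p for p in payloads)


def decode_records(blob: bytes) -> List[bytes]:
    records, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">H", blob, offset)
        offset += 2
        if offset + length > len(blob):
            raise ValueError("length prefix runs past end of data")
        records.append(blob[offset:offset + length])
        offset += length
    return records


blob = encode_records([b"abc", b"", b"hello world"])
decoded = decode_records(blob)
print(blob[:5], decoded)
```

With one example, the bytes `00 03` in front of `abc` could be anything. With many examples, the hypothesis "the first two bytes are a length" can be checked by seeing whether `decode_records` consumes every example exactly, with nothing left over.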
3. If not, what about through interaction?
Yes, this is how we actually reverse-engineer unknown binary formats. Almost always we have some proprietary software that can either produce the format, or read the format, and usually both. We don’t have the schema, and for the sake of argument let’s say we don’t want to decompile the software. An example: video game save files that are stored using some ad-hoc binary schema.
What we generally do is start poking known values into the software and seeing what the output file looks like, such as a specific date and time, a specific name in a certain field, or a value or checkbox entered in a specific way. Then we permute that entry and see how the file changes. The worst thing would be if almost the entire file changes, which tells us the file is either encrypted OR it’s a direct dump of RAM to disk from some data structure with undefined ordering, like a hash map, and the structure is determined by keys which we haven’t figured out yet.
Likewise, we do the reverse. We change the file and plug it back into the software (or an external API, if we’re trying to understand how some website works). What we’re hoping for is an error message that gives us additional information. If we change some bytes and now it complains a length is invalid, those bytes are probably related to a length. If we change some bytes and it says the time cannot be negative, then that field might be time-related. Or we change any byte and it rejects the data as being invalid: whoops, probably encrypted again.
The key here is that we have the same problem as question 2 -- we can generate arbitrarily many hypotheses—but we have a solution now. We can design experiments and iteratively rule out hypotheses, over and over, until we figure out the actual meaning of the format—not just the structure of it, but what values actually represent.
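The "poke a known value, diff the output" experiment can be sketched as follows. `black_box_save` stands in for the proprietary software, and its layout is invented for the example; the experimenter only ever sees its output bytes.

```python
import struct
from typing import List


def black_box_save(player_name: str, gold: int, level: int) -> bytes:
    """Stand-in for proprietary software. Hypothetical layout: 4-byte magic,
    8-byte padded name, little-endian 4-byte gold, little-endian 2-byte level."""
    name = player_name.encode("ascii")[:8].ljust(8, b"\x00")
    return b"SAVE" + name + struct.pack("<IH", gold, level)


def changed_offsets(a: bytes, b: bytes) -> List[int]:
    """Byte offsets at which two equal-length files differ."""
    if len(a) != len(b):
        raise ValueError("files differ in length; compare structure first")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


baseline = black_box_save("alice", 100, 3)

# Experiment 1: change only the gold amount.
gold_offsets = changed_offsets(baseline, black_box_save("alice", 70000, 3))

# Experiment 2: change only the level.
level_offsets = changed_offsets(baseline, black_box_save("alice", 100, 4))

print(gold_offsets, level_offsets)
```

Each experiment rules out hypotheses: the bytes that move when only gold changes are candidates for the gold field, and nothing else is. If nearly every byte had changed instead, that would point to encryption or an unordered memory dump, as described above.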
Again, there are limits. For instance, there are reverse-engineered binary formats where the best we know is that some byte needs to be some constant value for some reason. Maybe it’s an internal software version? Who knows! We figured out the structure of that value—the byte at location 0x806 shall be 203 -- but we don’t know the meaning of it.
4. If not, what differences would you expect to observe between a world where I understood the structure but not the meaning of something and a world where I understood both?
Hopefully the above has answered this.
Replying to your other post here:
I don’t think this algorithm could decode arbitrary data in a reasonable amount of time. I think it could decode some particularly structured types of data, and I think “fairly unprocessed sensor data from a very large number of nearly identical sensors” is one of those types of data.
My whole point is that “unprocessed sensor data” can be arbitrarily tied to hardware in ways that make it impossible to decode without knowledge of that particular hardware, e.g. ADC reference voltages, datasheets, or calibration tables.
RAW I’d expect is better than a bitmap, assuming no encryption step, on the hypothesis that more data is better.
The opposite. Bitmaps are much easier than RAW formats. A random RAW format, assuming no interaction with the camera hardware or software that can read/write that format, might as well be impossible. E.g. consider the description of a RAW format here. Would you have known that the way the camera hardware works involves pixels that are actually 4 separate color sensors arranged as (in this example) an RGBG grid (called a Bayer filter), and that to calculate the pixel colors for a bitmap you need to interpolate between 4 of those raw color sensors for each color channel on each pixel in the bitmap, and the resulting file size is going to be 3x larger? Or that the size of the output image is not the size of the sensor, so there’s dead pixels that need to be ignored during the conversion? Again, this is just describing some specific RAW format—other RAW formats work differently, because cameras don’t all use the same type of color sensor.
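A deliberately crude sketch of why RAW needs hardware knowledge: it converts a mosaic of single-color samples into RGB pixels by reusing each 2x2 tile's samples for all four pixels in that tile. Real converters interpolate across neighboring tiles, and the tile ordering assumed here (R G on the first row, G B on the second) is just one possibility chosen for the example.

```python
from typing import List, Tuple


def demosaic_tiles(mosaic: List[List[int]]) -> List[List[Tuple[int, int, int]]]:
    """mosaic: rows of raw sensor values with even width and height, assumed
    to be tiled as   R G
                     G B
    Returns rows of (r, g, b) tuples with the same width and height."""
    height, width = len(mosaic), len(mosaic[0])
    rgb = [[(0, 0, 0)] * width for _ in range(height)]
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            r = mosaic[y][x]
            g = (mosaic[y][x + 1] + mosaic[y + 1][x]) // 2  # average the two greens
            b = mosaic[y + 1][x + 1]
            for dy in (0, 1):
                for dx in (0, 1):
                    rgb[y + dy][x + dx] = (r, g, b)
    return rgb


raw = [[200, 100],
       [120, 50]]
image = demosaic_tiles(raw)

raw_values = sum(len(row) for row in raw)
rgb_values = sum(3 * len(row) for row in image)
print(image[0][0], raw_values, rgb_values)
```

The output has three values per pixel where the raw file had one, which is the 3x size increase mentioned above. Guess the tile order wrong and the colors come out swapped, and nothing in the raw bytes says which order is right.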
The point as I see it is more about whether it’s possible with a halfway reasonable amount of compute or whether That Alien Message was completely off-base.
It was completely off-base.

anonymousaisafety Jun 18, 2022, 12:42 AM
2 points
in reply to: faul_sname’s comment on: I No Longer Believe Intelligence to be “Magical”
I wrote a reply to a separate comment you made in this thread here, but it’s relevant for this comment too. The idea that the data looks like “a 2-d grid” is an assumption true only for uncompressed bitmaps, but not for JPGs, PNGs, RAW, or any video codec. The statement that the limiting factor is “extreme patience” hints that this is really a question asking “what is the computational complexity^[1] of an algorithm that can supposedly decode arbitrary data”?
1. ^
  https://en.wikipedia.org/wiki/Computational_complexity

anonymousaisafety Jun 18, 2022, 12:28 AM
16 points
in reply to: faul_sname’s comment on: I No Longer Believe Intelligence to be “Magical”
Here is an example of a large file that:
1. contains actual, real information (it is not just random noise)
2. compresses easily (there’s structure to it)
3. is totally useless to you (you can derive structure, but not meaning)
It’s 1 megabyte of telemetry captured from a real-time hardware/software system using a binary encoding for the message frames. The telemetry is, in other words, not self-describing.
Some facts:
- A trivial compressor, e.g. zip, can compress this telemetry to a much smaller size.
- It can also decompress the telemetry.
- The size of the compressed file + the size of the compressor is less than the uncompressed telemetry.
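A rough sketch of what that looks like, using synthetic data rather than the real telemetry. The 20-byte frame layout below (a counter, a raw sensor count, a status word, padding) is invented for illustration, and `zlib` stands in for zip since both use DEFLATE.

```python
import struct
import zlib


def make_telemetry(num_frames):
    """Build synthetic, non-self-describing telemetry: fixed 20-byte binary frames."""
    frames = []
    for n in range(num_frames):
        counter = n                       # 4-byte frame counter
        raw_adc = 2000 + (n // 50) % 16   # slowly varying raw sensor count
        status = 0x00A5                   # constant status word
        # "<IHH12x" = little-endian uint32, uint16, uint16, then 12 pad bytes.
        frames.append(struct.pack("<IHH12x", counter, raw_adc, status))
    return b"".join(frames)


telemetry = make_telemetry(50_000)            # 1,000,000 bytes
compressed = zlib.compress(telemetry, 9)
restored = zlib.decompress(compressed)
```

The compressor exploits the repetition between frames without ever learning that bytes 4 and 5 are a sensor count, let alone what the sensor measures.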
However:
You cannot make any conclusions about the meaning of this data, even after you derive some type of structure from it.
Knowing that the 4-byte sequence starting at data[N * 20], for each integer N, can be delta-encoded against the sequence at data[(N-1) * 20], and that on average storing the delta saves space compared with storing both values in full, does not tell you what those 4 bytes mean.
- You can argue that maybe it’s a 4 byte value of some type.
  But you don’t know if it’s stored as Big Endian or Little Endian in this format.
- You also don’t know if it is signed or unsigned, or if it is even a 2′s complement integer.
  It could be an IEEE 754 32-bit floating point value.
- It could be a contiguous run of 2 separate 2 byte values that for some reason are correlated in such a way that they can be delta-encoded together efficiently, e.g. the sensor readings of redundant sensors measuring the same physical process, like a temperature.
- Let’s say that for some reason you’re certain it is representing an unsigned 4-byte integer.
  What does it mean?
  - Is it the value of some sensor, as a rounded integer, like a distance measured in meters where we drop the fractional part?
  - Or are we actually using some type of encoding where we record the sensor value as an integer, but we have a calibration table we apply when post-processing the data to calculate a real physical value, like the look-up table used for thermocouples when interpreting the raw count measured on an ADC?
  - If it’s the latter and you don’t know what look-up table to use (what type of thermocouple is it? Type-K? Type-J?), good luck. Ditto for encoding where the actual physical value is calculated using coefficients for some linear or exponential transform, including an offset.
- The same arguments apply if we try to assume it is a floating point value.
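The ambiguity in the list above is easy to demonstrate with Python’s `struct` module. The four bytes below are an arbitrary example I picked; every interpretation is a perfectly legal reading of them.

```python
import struct

raw = bytes([0x42, 0x48, 0x00, 0x00])

# The same 4 bytes under several plausible decodings.
interpretations = {
    "uint32 big-endian": struct.unpack(">I", raw)[0],
    "uint32 little-endian": struct.unpack("<I", raw)[0],
    "int32 big-endian": struct.unpack(">i", raw)[0],
    "float32 big-endian": struct.unpack(">f", raw)[0],
    "two uint16 big-endian": struct.unpack(">HH", raw),
}

for name, value in interpretations.items():
    print(f"{name:>22}: {value}")
```

Each line of output is a different "fact" about the system, and the bytes alone give no way to pick between them.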
If you think this is being unfair because the original question was about image frames, the encoding that has been suggested in the comments for what image data looks like is describing a bitmap—a very simple, trivial way to encode image data, and incredibly inefficient in size for that reason. Even in the case of a bitmap, you’re jumping to the conclusion that you can reliably differentiate between an MxN image of 8-bit-per-channel RGB data vs an NxM image of 8-bit-per-channel RGB data, when the alternative might be that it is some different size of image using say 16-bit-per-channel RGB data for higher resolution in colors, or maybe it is actually 8-bit-per-channel RGBA data because we’re including the alpha component for each pixel, or maybe the data is actually stored as HSL channels, or maybe it’s stored as BGR instead of RGB, or maybe the data is actually an MxN image of 8-bit-per-channel grayscale data, or perhaps 24-bit-per-channel grayscale data. In the latter cases, there are not 3 values per pixel, because it’s grayscale! Not to be confused with storing a grayscale image in RGB channels, e.g. by repeating the same value for all 3 channels.
And all of the above is ignoring the elephant in the room: Video codecs are not a series of bitmaps. They are compressed and almost always use inter-frame compression, meaning that individual video frames are compressed using knowledge of the previous video frames. Therefore, the actual contents of a video do not look like “a series of frames where each frame is a grid of cells where each cell is 3 values”. Likewise, most images on the internet are JPGs and PNGs, which are also compressed, and do not look like a grid of cells of three values.
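As a sketch of how many readings a headerless pixel buffer admits, the function below enumerates layouts that are consistent with nothing more than the buffer’s length. The channel options, the 8/16-bit depths, and the `max_side` cap are assumptions chosen for illustration; the real space of possibilities is larger.

```python
def candidate_layouts(num_bytes, max_side=4096):
    """List (width, height, channel layout, bits per channel) tuples that fit num_bytes exactly."""
    channel_options = {"gray": 1, "RGB/BGR/HSL": 3, "RGBA": 4}
    layouts = []
    for name, channels in channel_options.items():
        for bytes_per_channel in (1, 2):
            pixel_size = channels * bytes_per_channel
            if num_bytes % pixel_size:
                continue  # buffer length is not a whole number of pixels
            num_pixels = num_bytes // pixel_size
            for width in range(1, min(num_pixels, max_side) + 1):
                height = num_pixels // width
                if num_pixels % width == 0 and height <= max_side:
                    layouts.append((width, height, name, bytes_per_channel * 8))
    return layouts


# A buffer that "really" holds a 640x480 8-bit RGB image.
layouts = candidate_layouts(640 * 480 * 3)
print(len(layouts), "layouts fit the same byte count")
```

The intended 640x480 RGB reading is in the list, but so are its transpose and a 960x960 grayscale image, among many others. Length alone settles nothing.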
Ok, but what about cameras?
“an image taken by a modern camera contains a surprisingly large amount of easy-to-decode information about the world, in the information-theoretic sense of the word information”
Cameras used by professional photographers often dump data in a RAW format which may or may not be compressed but is specific to the manufacturer of the camera because it’s tied to the actual camera hardware in the same way that my hypothetical telemetry is tied to the hardware/software system.
Here’s the Wikipedia list of RAW formats:
.3fr, .ari, .arw, .bay, .braw, .crw, .cr2, .cr3, .cap, .data, .dcs, .dcr, .dng, .drf, .eip, .erf, .fff, .gpr, .iiq, .k25, .kdc, .mdc, .mef, .mos, .mrw, .nef, .nrw, .obm, .orf, .pef, .ptx, .pxn, .r3d, .raf, .raw, .rwl, .rw2, .rwz, .sr2, .srf, .srw, .tif, .x3f
...

Providing a detailed and concise description of the content of raw files is highly problematic. There is no single raw format; formats can be similar or radically different. Different manufacturers use their own proprietary and typically undocumented formats, which are collectively known as raw format. Often they also change the format from one camera model to the next. Several major camera manufacturers, including Nikon, Canon and Sony, encrypt portions of the file in an attempt to prevent third-party tools from accessing them.
Is that “easy to decode”?
There is an argument here that it is still easy because all you need to do is “just” run through all of the various permutations I’ve described above until at the end of the process there is an image that “looks like” a reasonable image. I mean if you dump the color data and it looks like blue and red are swapped, maybe it was stored BGR instead of RGB, and there’s no harm there, right? And now we’ve buried the whole argument with a question of what does it mean for a result to look “reasonable”? When you’re given a totally unknown encoding and you want to decode it, and you have to make assumptions about what the parsed data is going to look like just to evaluate the strength of your decoding, is that very solid ground? Are you certain that the algorithm being described here is “easy”, in the sense that it is computationally efficient?
What links here?
- anonymousaisafety's comment on Contest: An Alien Message by DaemonicSigil (Jun 28, 2022, 4:26 PM; 12 points)
- anonymousaisafety's comment on I No Longer Believe Intelligence to be “Magical” by DragonGod (Jun 18, 2022, 12:42 AM; 2 points)

anonymousaisafety Jun 17, 2022, 4:36 AM
0 points
in reply to: dynomight’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
I’m sorry, but I just feel like we’ve moved the goal posts then.
I don’t see a lot of value in trying to disentangle the concept of information from 1.) costs to acquire that information, and 2.) costs to use that information, just to make some type of argument that a certain class of actor is behaving irrationally.
It starts to feel like “assume a spherical cow”, but we’re applying that simplification to the definition of what it means to be rational. First, it isn’t free to acquire information. But second, even if I assume for the sake of argument that the information is free, it still isn’t free to use it, because computation has costs.
If a theory of rational decision making doesn’t include that fact, it’ll come to conclusions that I think are absurd, like the idea that the most rational thing someone can do is acquire literally all available information before making any decision.
What links here?
- anonymousaisafety's comment on Air Conditioner Test Results & Discussion by johnswentworth (Jun 22, 2022, 11:12 PM; 83 points)

anonymousaisafety Jun 17, 2022, 3:15 AM
4 points
in reply to: gwern’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
NOTE: I wrote this as a separate reply because it’s addressing your points about decision theory directly, and is not about the specific scenario discussed with the medical system.
if you have an unreliable sensor (ie. any sensor that has ever existed in the real world), then that simply reduces how useful it is, because it changes your posterior less than a more reliable one would.
I think the crux here is that you seem to be saying the usefulness of reading a sensor’s value is in some interval [0, 1], where 1 represents that the value provided by the sensor is perfectly trustworthy and 0 is that the value provided by the sensor is totally useless; i.e., it’s random noise. Under this belief, you’re saying that it is always rational to acquire as many sensors as possible, because there is no downside to acquiring useless sensors. When you run your filter over all of the sensors, anything that has a usefulness of 0 is going to get dropped from the final result. Likewise, low-but-non-zero usefulness sensors are weighted accordingly in the final result.
In my work, this is called sensor fusion. So far, so good.
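For readers who haven’t seen it, here is a minimal sketch of that weighting idea, assuming "usefulness" is modeled as inverse variance (my simplification; real fusion filters carry much more state). The function name `fuse` and the numbers are illustrative only.

```python
def fuse(readings):
    """Inverse-variance weighted mean of (value, variance) pairs."""
    weights = [1.0 / variance for _, variance in readings]
    total = sum(weights)
    return sum(w * value for w, (value, _) in zip(weights, readings)) / total


good = (20.0, 0.01)     # trustworthy sensor: small variance, large weight
noisy = (35.0, 100.0)   # nearly useless sensor: huge variance, tiny weight
estimate = fuse([good, good, noisy])
```

The noisy sensor barely moves the estimate, which is the sense in which a low-usefulness sensor is "weighted accordingly".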
I can argue that acquiring each sensor has a cost associated with it, but it seems like the idea of “free information” is intended to deflect that argument. Let us assume that the sensors are provided for free, and it’s just a question of “given an arbitrary number of sensors, with different usefulness, how many do you want to fuse when trying to model the correct world state?”
I think what you’ve said above implies that a rational actor should always want more sensors.
Nevertheless, the value of free information is always greater than or equal to zero, and if free information makes you worse off, that implies somewhere there is an irrationality.
More sensors lead to more sensor values (“information”), and the rational actor will simply use the usefulness of each sensor (which for the sake of argument we’ll assume that they know exactly) when weighting each sensor value.
In the real world, I still disagree with this claim. Computational complexity^[1] exists. There is a cost to interpreting, and fusing, an arbitrary number of sensor values. Each additional sensor, even if it was provided for free, is going to incur an actual cost in computation before that value can be used to make a decision. A rational actor would not accept an arbitrary number of useless sensors if it is going to take non-zero computational cycles to disregard them.
When you include the cost of computation, now the value of those sensors is in some interval [0 - c, 1 - c], where c is how much it costs in computational effort^[2] to include the sensor in your filter. In this world, sensors can have less than zero usefulness, i.e. it is actively detrimental to include the sensor in your filter. Your filter functions worse with that sensor than it does without it.
I believe the only way out of this is to ignore computational complexity and assume that c = 0, but we know that isn’t true. Consider the trivial thought experiment of me sitting here and providing you a series of useless facts about a fictional D&D campaign I’m running like, “A miraksaur is a type of dinosaur native to the planet Eurid.”, except the facts never stop. How rational would it be for you to keep trying to enter each additional value into your world state? They’re totally irrelevant, but if we ignore computational costs, there’s no downside to doing so. The reason you would be wise to tune me out in that scenario is that c is definitely greater than 0.
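The [0 - c, 1 - c] argument can be written down in a few lines. This is a toy sketch assuming a fixed per-sensor cost c (the linear case from the footnote); the function name and usefulness values are made up for illustration.

```python
def worth_fusing(usefulness, cost_per_sensor):
    """Keep only sensors whose net value (usefulness - c) is not negative."""
    return [u for u in usefulness if u - cost_per_sensor >= 0]


sensors = [0.9, 0.4, 0.05, 0.0]

# With c = 0, nothing is ever harmful to include, even the useless sensor.
kept_free = worth_fusing(sensors, cost_per_sensor=0.0)

# With c > 0, low-usefulness sensors have negative net value and are dropped.
kept_costly = worth_fusing(sensors, cost_per_sensor=0.1)
```

Only under the c = 0 assumption does "always take more sensors" come out as the rational policy.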
1. ^
  https://en.wikipedia.org/wiki/Computational_complexity
2. ^
  Note that c is only fixed per value in the case where the algorithm for fusing information has linear time complexity O(N). We often use something like an extended Kalman filter (EKF) for sensor fusion. In that scenario, each additional value incurs an increasingly higher cost of computational effort to include it, so sensors with low usefulness are especially penalized. If I recall correctly, it is O(N^2). It’ll get to a point where it doesn’t matter how useful a sensor is, it would be irrational to try and include it because it’ll be prohibitively expensive to run the full computation.

anonymousaisafety Jun 17, 2022, 1:40 AM
2 points
in reply to: gwern’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
yes, your entire example of doctors is simply due to irrationality
So first you say this.
But then you start to backtrack:
in the trivial sense that for a poor measure the posterior of a true positive remains far smaller than it being a false positive and may not motivate a decision, shrinking the VoI towards zero, which will frequently be so small as to not justify the cost of testing
And further admit:
It is definitely the case that many tests cost too much for too little information and should not be run because the VoI is often zero (for a rational decision maker) and the test is simply a loss as it will not change any decisions.
But then you try to defend the initial claim, that the doctors are being irrational:
Nevertheless, the value of free information is always greater than or equal to zero, and if free information makes you worse off, that implies somewhere there is an irrationality.
But we’ve already established that the tests are not free in the world we live in.
If you’re going to prove the doctors are being irrational in the world we live in, then you can’t change a core part of the problem statement. The tests do have costs—in time, in money, in available machines, in false positives that may result in surgeries or other actions with non-zero risk, and in a dozen other ways, some of which were alluded to by Dynomight, like the possibility of lawsuits.
My whole argument, which you said is “generally wrong”, is predicated on the fact that this information is not free. I don’t accept the notion that people are being irrational because they are making decisions based on the reality of the world where information is not free just because we can hypothesize about worlds where that information is free.
Do you still disagree?

anonymousaisafety Jun 17, 2022, 12:22 AM
2 points
in reply to: gwern’s comment on: [Link] “The madness of reduced medical diagnostics” by Dynomight
I disagree with this take and with the linked article. “The value of information is always positive” is brushing over the actual problem described by the healthcare studies: that taking a measurement is not guaranteed to accurately capture the world state, because a sensor can be faulty, and it is not always possible to distinguish a faulty sensor from a reliable sensor.
Me: If I went to talk to him, he’d probably lie. And probably it would be impossible to check his story without spending huge amounts of time and exposing myself to danger. But I’d feel obligated to do it anyway, and while I was distracted, the true criminal would get away. That risk outweighs the chance that he’d give me something useful.
This reasoning is claimed to be incorrect in the linked article, and further clarified in the conclusion
It’s a fact that if you make decisions correctly, then putting more information into the system can’t hurt you.
Consider sensors in a control system. We only add additional sensors to the controller if we can guarantee some level of quality from the sensor. If we can’t guarantee the sensors are valid, then each additional sensor added to the system might not be adding information at all—“garbage in, garbage out”.
The article is making the claim that a rational agent has a reliable function F(information) -> garbage? but that’s ridiculous. “How do I tell my sensors are working correctly?” is one of the hardest problems in control theory. The solution used in system design is multiple, independent measurements that can all be assessed together.
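A minimal sketch of that "assessed together" idea, using a median as the reference point for redundant sensors. The tolerance and readings are invented, and real systems use more careful voting schemes than this.

```python
from statistics import median


def out_of_family(readings, tolerance):
    """Return indices of readings that disagree with the median by more than tolerance."""
    center = median(readings)
    return [i for i, value in enumerate(readings) if abs(value - center) > tolerance]


# Three redundant temperature sensors; the third has failed high.
flagged = out_of_family([21.1, 20.9, 48.3], tolerance=2.0)

# With a single sensor there is nothing to compare against.
flagged_alone = out_of_family([48.3], tolerance=2.0)
```

The faulty reading is only detectable because two independent sensors disagree with it. The same bad value arriving alone is indistinguishable from a good one.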
The claim in the article is that doing these tests is just that. Each test is an additional, independent measurement, that can be assessed against the base rate, and the other risk factors, or symptoms. That is technically true except it then glosses over the problems:
Some of the above reasons to be careful about testing are fine. By all means, account for the costs of the CT scan itself (#1). And I’ll wearily pretend to accept that people are emotional and couldn’t understand Bayesian reasoning or false positives and so we need to worry about stressing them out (#2).
...
If you know that a patient’s prior probability for a condition is low, you still know that after doing a test. In a sane world, wouldn’t you do the CT scan, and then… only do the biopsy only if the CT scan showed something serious enough to justify the risks?
Look at one of the quoted studies.
This led to a PET scan that showed no small nodules but confirmed the lesion. Doctors considered surgery but decided against it because the lesion seemed to be growing too fast to be lung cancer. One month later, the lesion had shrunk, suggesting it was just some kind of inflammation or infection.
Bold is my own emphasis. Let’s flip a coin and look at another world state, a world state that did not occur for this patient, but has occurred for others.
This led to a PET scan that showed no small nodules but confirmed the lesion. Doctors responded rapidly with surgery to remove the lesion. A post-surgery biopsy revealed that the lesion was not cancerous. Unfortunately, the patient was one of the 3% of people who die within 90 days of lung surgery.^[1]
In the real world, there is no reliable function F(information) -> garbage?. The idea that sometimes a test returns a false positive and it is “obvious” to the doctors that it is a false positive is incorrect. What do you want the doctors to do? Run the test, when it wasn’t likely that the patient has cancer (they have no other symptoms), see something that looks cancerous on the test (the false positive) and then do nothing? The conclusion here seems to be “obviously the doctor will realize it was a false positive and simply not operate”, which is ignoring the corpus of evidence in the linked studies showing that the doctors couldn’t distinguish between false positives!^[2]
In other words, if you have a threshold for action that is “patients with cancer have these symptoms and also a mass on a CT scan”, but you have an arbitrary shortcut like “only do the biopsy only if the CT scan showed something serious enough to justify the risks” (quote from the article), then now you’ve tied your dangerous action (the surgery) to the thing that we know has a false positive rate—the test!
The solution offered in the article is “well don’t do the biopsy unless they also have the symptoms”. This is the “multiple, independent measurements that are assessed together” approach. Except that if they don’t have symptoms, and we’ve decided that symptoms are a prerequisite for the biopsy, then there’s no reason to do the CT scan, which is exactly what the doctors concluded in the studies that are being criticized here.
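For a sense of scale, here is the standard Bayes calculation for a positive test in a low-prior patient. The prior, sensitivity, and specificity below are hypothetical round numbers, not figures from the studies.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = prior * sensitivity
    false_pos = (1.0 - prior) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Hypothetical: 0.1% prior (no symptoms), 95% sensitivity, 90% specificity.
p = posterior_given_positive(prior=0.001, sensitivity=0.95, specificity=0.90)
print(f"P(cancer | positive scan) = {p:.4f}")
```

Under these assumed numbers the posterior stays below 1%, so almost every positive is a false positive, yet each one looks the same on the scan as a true positive.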
1. ^
  Actual percent not relevant, so long as the surgery is risky, e.g. above 0.5% mortality rate. I grabbed this 3% number from various articles like this.
2. ^
  It is weird to me that I need to say this, but when we discuss false positives on sensors, there’s for some reason an assumption that within the context of a system, we “know” that we measured a false positive. In general, the system is not aware of a false positive, that’s why false positives are a problem. The only way to “know” that a sensor returned a false positive is if you have other, independent measurements that you can use to do some type of out-of-family filtering.

anonymousaisafety Jun 16, 2022, 10:58 PM
27 points
on: Don’t Over-Optimize Things
My reply is focused on this specific statement:^[1]
sometimes trying to [over] optimize can lead to worse outcomes
There is something known as the performance / robustness stability tradeoff in controls theory. Controls theory^[2] is the study of dynamic (e.g. autonomous) systems, and I have no idea why it is not more commonly cited on this forum.
The mathematical description of this gets a little bit unwieldy, so I’m going to simplify.
Note that everything I’m about to say is discussing ideal systems and real systems are actually worse.
Higher performance systems are less stable than lower performance systems. For an intuitive idea of why this might be the case, consider the example of a system where you want to keep some variable to some setpoint, like temperature in a room. If you slowly control the error as it occurs, you’ll end up with what is called a proportional error response.
Consider the following picture of a step response.^[3]
You might want to be faster, so you might try to do something clever and add a term to the controller for how quickly the temperature is changing. Now you have a proportional-derivative error. There is a tradeoff. By making the system more responsive, we’ve made it less stable. It is now possible for our controller to oscillate out of control.
Here is a picture of various step responses.^[4] Take note of the unstable and marginally stable states.
The phrases to know are “gain margin” and “phase margin”.
Gain margin is about how robust your system is in magnitude—if the error is larger or smaller, how well does the system correct that error? You can think of gain margin as you’re trying to keep a bouncing spring in place by hitting it with a hammer, and if you hit it too hard, it’ll oscillate in a way you don’t like.
Phase margin is about how robust your system is in time. To continue the previous example, the idea of phase margin is capturing the reality that you’re controlling some external actuator, i.e. the hammer, and there’s some delay between when you need to swing and when the swing actually occurs, and if that delay is too large, the system will respond differently. In fact, if that delay is just the right frequency, it’ll actually add energy into the system and drive it unstable.^[5]
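The effect of delay is easy to reproduce in a toy simulation. This is a sketch under strong assumptions: a discrete-time pure integrator plant, a proportional controller, and a controller that only sees measurements that are `delay_steps` old. The gain and step counts are arbitrary choices of mine.

```python
def simulate(gain, delay_steps, steps=200, setpoint=1.0):
    """Proportional control of an integrator plant using delayed measurements."""
    x = 0.0
    history = [x] * (delay_steps + 1)   # oldest measurement first
    errors = []
    for _ in range(steps):
        measured = history[0]           # what the controller sees: a stale value
        u = gain * (setpoint - measured)
        x = x + u                       # plant integrates the control input
        history = history[1:] + [x]
        errors.append(setpoint - x)
    return errors


no_delay = simulate(gain=0.5, delay_steps=0)
with_delay = simulate(gain=0.5, delay_steps=5)
```

With no delay the error decays toward zero. With the same gain and a 5-step delay, the controller keeps pushing based on stale measurements and the oscillation grows without bound. Lowering the gain restores stability, which is the performance for robustness trade.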
The first controller I described above is called a PID (proportional, integral, derivative) controller and I gave examples of a P controller and PD controller. Normally you use a PI controller because the integral term drives the error to zero over time, which is necessary when your system has friction or some dead-zone or other bias that prevents a pure proportional controller from working.
There are various fancy controllers you’ll hear about, like feed-forward, or MPC (“model predictive control”). The performance / robustness stability tradeoff applies to all of them. It is an iron law. It does not matter how fancy your controller gets. The fancier you make the controller, the more susceptible it is to going unstable. Basically, increasingly complicated controllers get performance by baking in assumptions about the physical world into the control loop. These assumptions are things like how much bias is in the system, how quickly an error can change, or what the largest step response we might need to achieve is. If those assumptions match reality, the controller will have very high performance and seem very stable. But if any of those assumptions are violated, that fancy controller might immediately go unstable. That’s the price you pay for performance.
One way to think about this is the following thought experiment. You have a tradeoff between how well you can track a setpoint and how well you can reject disturbances. If you make it very difficult to knock a system off a setpoint, it’ll reject disturbances well. However, a change in that setpoint might also look like a disturbance, and the system will be similarly sluggish to respond.
For real systems, a lot of the design considerations are going to be around giving yourself enough gain and phase margin so that you’ve got an envelope of safety around the testing you’re able to do. Think of it like the factor of safety used in construction. The bridge is built to be say, 5x stronger than it needs to be. For this reason, and contrary to claims made on this forum, real systems are not engineered to the theoretical limit of performance or “efficiency”.
1. ^
  Bob is over-optimizing towards higher performance (“faster arrival”) solutions that have increasingly higher risks of catastrophic failure (“death due to crashes from violating speed limits”).
2. ^
  https://en.wikipedia.org/wiki/Control_theory
3. ^
  https://www.goddardconsulting.ca/pid-control.html
4. ^
  https://www.sciencedirect.com/science/article/pii/B9780750646376500137
5. ^
  Lack of phase margin is also what stops “I will simply control the robot over the network” ideas from working—if the phase margin is insufficient, the delay incurred over the network will make it impossible for the remote actuators to be controlled in response to disturbances with any degree of accuracy.
What links here?
- anonymousaisafety's comment on Air Conditioner Test Results & Discussion by johnswentworth (Jun 22, 2022, 11:12 PM; 83 points)
- Owen Cotton-Barratt's comment on Don’t Over-Optimize Things by Owen Cotton-Barratt (EA Forum; Jun 17, 2022, 9:46 AM; 2 points)

anonymousaisafety Jun 14, 2022, 10:58 PM
12 points
on: Yes, AI research will be substantially curtailed if a lab causes a major disaster
This argument is important because it is related to a critical assumption in AGI x-risk, specifically with regard to the effectiveness of regulation.
If an AGI can be created by any person, in their living room, with a 10 year old laptop, then regulation is going to struggle to make a difference. Case in point: strong encryption was made illegal (and still is) in various places, and yet, teenagers use Signal and the internet runs on HTTPS.
If, on the other hand, true agent-like AGI turns out to be computationally expensive and requires very specialized hardware to run efficiently, such that only very large corporations can foot the bill to do so (e.g. designing and building increasingly custom hardware like Google TPUs, Nvidia JETSON, Cerebras Wafer, Microsoft / Graphcore IPU, Mythic AMP), then regulation is going to be surprisingly effective, in the same way that it has become stupidly difficult to build new nuclear reactors in the United States, despite advances in safety / efficiency / etc.

anonymousaisafety Jun 14, 2022, 10:37 PM
7 points
in reply to: nsokolsky’s comment on: Contra EY: Can AGI destroy us without trial & error?
That was extracted from a much larger work I’ve been writing for the past 2 months. The above is less than ~10% of what I’ve written on the topic, and it goes much further than simulation problems. I am also trying to correct misunderstandings around distributed computation, hardware vs software inefficiency, improvements in performance from algorithmic gains, this community’s accepted definition for “intelligence”, the necessity or inevitability of self-improving systems, etc.
I’ll post it when done but in the meantime I’m just tossing various bits & pieces of it into debate wherever I see an opening to do so.

anonymousaisafety

1. Controlled environments.

2. Controlled access to information.

3. Safety-critical systems engineering.

4. An emphasis on at-rest encryption and secure-by-default networking.

5. Extensive logging, monitoring, interpretability, and circuit breakers.

6. Systems with AGI are assumed to be adversarial.

Conclusion