I am an ITAR US person. I do not have a secret or top secret clearance.
E. P. Cooper
How much worse would a hypothetical “almost on policy” distillation be, compared to on policy distillation?
It would require some sort of mapping from an old version of the same model (that hadn’t already started forgetting a skill) to the current version, so the delta there might have to be so tiny that it would never be economical.
RLVR would become an order of magnitude slower.
I want to make a few adjustments to my terminology and clarify a point about what it would really entail for a human to use the “correct” decision theory. The new terminology should better match that of Vladimir Nesov’s newest comment on this subject[1].
The restrictions I describe in my comment above are actually about the human’s decision to be replaced by an agent that has a utility function (or similar parameter) programmed into the correct decision theory. The list of options the superintelligence presents is important because it is upstream of the human’s choice to be so replaced. Under Nesov’s proposal, the information a superintelligence is allowed to show a human is strictly regulated by the “aggregate,” a fixed point calculated under laws (similar in concept to the laws of physics) held constant by an updateless core. Control over tiny details in the list and its presentation to the human could be used by an intelligent and knowledgeable enough agent to (unnoticeably) manipulate the exact utility function selected. If Nesov’s proposal is implemented, this manipulation may be legitimate counsel, in the sense that manipulating the human into making illegitimate decisions (according to said human’s fixed point) would be off policy.
If the superintelligence was just showing a list with incomprehensible items on it that are claimed to be utility functions, that might not be prohibited. Replacement or modification on a deep level are why the requirements may appear too strict if the case I described previously is assumed to be the standard template. Other cases, for example the question of whether a human should be persuaded (incredibly subtly) to make a sandwich with the pieces of bread in loaf or rotated-to-oppose relative orientation, may have loose restrictions, if any at all. This is because Nesov’s proposal involves (tractable, so he claims) self-reference, in a similar style to CEV.
Trying to keep to Nesov’s terminology, what I called an initial dynamic should presumably be called an initial aggregate. “Initial dynamic” may be too suggestive of a particular method for reaching a fixed point, when Nesov’s position currently appears to be that it just must be reasoned to by some method. I stand by my claim that an initial aggregate is always required. This is because, as Nesov says, the fixed point can only be approached (or, hypothetically, reached in a single step) “according to what the aggregated values have figured out so far.” [3]
If anyone’s interested, I think a useful task to get started with would be an investigation into what additional constraints (if any) should be applied to the fixed points, beyond Nesov’s requirement that the values you obtain in alternate paths are only considered if they are legitimate according to the prior (maybe initial) aggregate, and the potential requirement that values are used to influence what paths are considered at all. These additional requirements would presumably be listed out manually by humans.
In an attempt to learn from the past 20+ years of work on CEV, I think it’s important to think about what should happen if your outer alignment method fails to converge. Note that for CEV, some sort of convergence may be obtained if the CEV of the contributors to the AI’s development converges, since it can do the full calculation on all humans that are currently alive while kicking out problem components according to the CEV of the contributors. This may require deciding in advance and/or the sacrifice of a volunteer, however.
Opposed to that, while following Nesov’s proposal it may turn out that most or all humans do not have a legitimate fixed point of the right sort, or the math just turns out not to work for many plausible evolved aliens at all (e.g. only trivial transformations turn out to meet all the desiderata). This outcome is reading above chance on the informal “betting” aggregation I have.
Even if this is not true, it may turn out that some humans’ aggregations cannot reach a fixed point successfully. This, and the reasons described before, suggest that some sort of fallback should be used in that case and possibly others. I describe a potential approach for this at the end of this comment. Note that even the existence of a fallback may be a catastrophic incentive/preference instability problem, as it apparently was for some CEV proposals.
CEV is already hard enough to implement, requiring a fully unleashed lower-order Do What I Mean (DWIM) agent running a decision theory substantially beyond the state of the art, with only alignment, running solely through that decision theory, preventing it from immediately self-modifying or creating sub-agents to get around restrictions. I think Nesov’s approach may be even harder, given that it requires constant operation instead of being tasked with the creation of a single utility function that will never be reconsidered, among other things. One example is the question of what should be done about humans manipulating other humans, given that it is nearly certain that at least one human would have legitimate (according to the fixed point of that person) potential future histories where the successful manipulation of another person occurs while not in the presence of superintelligence. I have great uncertainty about all this, however.
Nesov’s proposal contains so much unformalized content that I am unsure where to begin. For a fallback or alternative, it is possible that a line of attack could be opened by the formalization of a static account of rationality and counterlogicals. This may allow a method where the fixed point finding is skipped, and instead the counterlogical versions of Nesov’s alternate future histories are used to determine the aggregate, with only counterlogicals meeting certain fixed criteria being inspected according to further fixed criteria. This personal aggregate would then lend legitimacy to some histories involving superintelligences, similar to Nesov’s proposal. I am unaware of any progress in these areas, however. It is possible that current work on CEV will not lead to anything that carries over, since I see nothing there that is in the form of rigorous tiling theorems and full designs for the cores of proven-aligned agents. Given such slow progress, and given the perils of trusting AI systems to do this, I think humans would have to individually program each constraint on the counterlogicals. This raises the possibility that humans decades to centuries in the future may do something incorrectly here, either intentionally or unintentionally, as they constrain and evaluate the counterlogicals, even if they delegate to blinded and self-erasing programmed computers as much as possible. I’m not sure what to do.
- ^
- ^
Note that while, hypothetically, all this fixed point calculation could be done by the agent that has preferences itself (think: a human calculating for itself) in practice only a superintelligence would be able to accurately find a valid fixed point. If a friendly AI was developed, it would presumably do everything required by Nesov’s proposal on behalf of humans, in the background.
- ^
Nesov seems to write like there is only one fixed point, maybe for simplicity, but I don’t see how any practical method would be that precise and accurate. Maybe there would be a “fixed region” in a similar style to the goals achieved under certain proposals for soft optimization.
- ^
The initial aggregate could be considered twin to the prior, though since it can’t be multiplied it can’t be mixed in. This is opposed to the classical pair of prior and utility function. In humans, the situation is presumably much more messy than anything described here, however.
(ancient) discussion here.)
Note that this link is broken. It should go to Eliezer’s top comment here:[1]
https://www.lesswrong.com/posts/SpHYBhkaeDZpZyRvj/what-can-you-do-with-an-unfriendly-ai?commentId=5p7nw3RzLShRftnt8
I’m somewhat concerned about the possible problems that the recent increased load of patches may cause during the creation of the Linux 7.0.1 release. In theory it’s just a matter of checking the applicability of the entire set of patches to Linus’s tree, but given the situation I think the consequence of something getting missed is higher than normal[1].
I think an alternative solution of using the 6.19.XX series from Greg K-H until a few days after its last release is a better idea, but it’s close, ~0.35 that it ends up worse[2]. I think better automation is needed.
This may require building the kernel for yourself unless Greg K-H ends the series early, but until then here are the required file changes for Debian 13:
Config
/etc/apt/preferences.d/testingToChangePriority

    Package: *
    Pin: release o=Debian,a=testing
    Pin-Priority: 99

/etc/apt/preferences.d/testingKernelBackport

    Package: linux*amd64
    Pin: release o=Debian,a=testing
    Pin-Priority: 1000

/etc/apt/sources.list.d/debian.sources

    Types: deb
    URIs: https://deb.debian.org/debian
    Suites: trixie trixie-updates
    Components: main non-free-firmware
    Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

    Types: deb
    URIs: https://security.debian.org/debian-security
    Suites: trixie-security
    Components: main non-free-firmware
    Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

    Types: deb
    URIs: https://deb.debian.org/debian
    Suites: testing
    Components: main non-free-firmware
    Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

Note that I’ve tested this and it doesn’t seem to work correctly when your system is set up to build kernel modules from source to install into this new kernel, because it causes other dependencies to update to the testing version. Otherwise, I tested it to work on multiple systems.
The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
There might be facts about what’s rational, but not about what utility function[1] it is right to use. Maybe a superintelligence could tell you (in a somewhat objective/convergent sense) what utility function to use, but the exact utility function would depend on the utility function of the superintelligence[2].
In Vladimir Nesov’s opinion[3], even presenting a human a list of (known convergent) utility functions would be invalid unless the exact list is also presented in a “hypothetical history” where that person is never exposed to superintelligence or strong persuasion, since otherwise the person’s decision on what utility function to take would be “illegitimate” due to its data dependence on superintelligence-produced data that has no (legitimate) alternate source.
Nesov’s proposal does not define an initial dynamic that would lead to the fixed point he references. This fixed point may, in some cases, try to allow aggregations of legitimate histories where no strongly persuasive or superintelligent entities influence the human in order to extend legitimacy to those that do so contain, but even with a defined initial dynamic, it seems like the space of decisions[4] that are truly orthogonal[5] to the particular human’s utility function may be confined and weirdly shaped, and since the human deciding on what utility function to use (with or without superintelligent help) must not decide based on an already completed decision (5 dollars does not equal 10 dollars), this is the only allowable space, so the human may not be allowed support from aggregation (the only thing that would allow a superintelligence to show a list that needs a superintelligence to create).
Note that some self reference is okay, but the initial dynamic must reliably be the basis of the fixed point, something that cannot legitimately occur if the dynamic is stripped of everything that causes (in the substrate-independent structure of the human’s free will) the human to legitimately obtain[6] the single correct utility function (for that particular human, according to that particular human’s initial dynamic, itself based on (but not solely consisting of) that human’s behavior in “non-pathological hypothetical histories” produced by legitimate approximation of the human as legitimately separable from physics[7], this legitimacy itself requiring the causal substance of free will to be preserved, the causal substance that is the abstract to physics’s concrete, even as the human is removed from physics[8]).
- ^
Or similar parameter.
- ^
This would be because the superintelligence would prefer world states where you have one candidate utility function over another.
- ^
- ^
By the particular human.
- ^
Though orthogonality may be too strong a requirement here, hence my uncertainty. We may need a better account of counterlogicals to clearly write out what we mean.
- ^
Discussion of outside selection of multiple free wills left until later.
- ^
Potentially requiring a feathered boundary, not a sharp one.
- ^
Removed from direct contact, that is, (abstract) human → superintelligence → physics, rather than human → physics (where arrows describe a certain kind of steering).
- ^
perhaps due to COVID stimulus money being lost / used up by retail traders
If most people in the US had a bank account that featured monthly payments anywhere close to the “interest rate,” the government could reduce risky retail investments with little delay by raising rates. This is not the case. Assuming even highly bounded rationality, it seems like retail traders should still not be losing as much money as they do, so maybe I’m making a modeling mistake and it would turn out that people really dislike bank accounts. This may be a typical mind fallacy problem, but I have some evidence that’s not the case. Either way, it seems like when you distribute stimulus to consumers[1] and they make high variance (or downright stupid) moves, large portions of the stimulus money will end up doing something similar to an unbalanced version of Japan-like QE[2] after being taken from retail traders by automated bots. The central bank could try to correct the balance away from equities by increasing interest rates. Maybe the models are better today than they were 20 years ago, but retail-stupidity-rate seems hard to estimate in advance. It seems like you might get a QE-like effect at the wrong time.
I wonder if that had something to do with why there appeared to be such a large deviation from the EMH even after your first year of active trading. It seems fuzzy in my mind how the mechanics would work though, and I generally wonder why the trading bots didn’t do better against you. Was their working capital locked up elsewhere?
A more full analysis would involve some sort of model of PPP fraud, but I’m not sure how easy that would be.
- ^
- ^
- ^
As an aside, AI systems that are persuasive but otherwise not especially competent could have major influence on what silly investments (or “investments”) people make. I wouldn’t even know where to start if I was presented with a lump-sum UBI proposal in a few years because of things like this. If human consumers can’t hang on to the majority of the sum for long enough, certain interest rates may start hitting legal limits, causing terrible distortion. Note that this runs through the QE-like-effect argument from before, and assumes the government is too slow/ineffective and can’t confiscate everything it can and (sometimes physically) burn everything it can’t. Confiscate-and-redistribute isn’t QE and destruction of the economy can limit intelligence-explosion like upwards effects on interest rates.
In the third paragraph of the linked comment, I suggest a good thing the Glasswing companies could do for the rest of us. KVM is part of the Linux kernel, but surrounding host programs aren’t. Someone should commit to looking through all of these with Mythos (in public) so all other computer users can start setting up their security with that stack, so they can await further software updates for those projects. This would require regular releases from the maintainers, however.
Outsiders like myself can do some things to take advantage of this program. Using software that is confirmed to get patches is the best option, but that can’t cover all use cases. Use Chromium to watch videos[1], listen to audio and read PDF/text/HTML documents, use Firefox to edit PDFs, use the latest Linux kernel from Greg Kroah-Hartman (not Linus’s tree) from kernel.org or the repos of e.g. Debian testing or Arch Linux. I don’t have a suggestion for reading `.epub` E-Books, except writing a Haskell program using pure functions from the pandoc project to convert to PDF, though this seems not to always work.
Note that you will need to keep all this software as up to date as possible, but this may make you more vulnerable to supply chain attacks. You will need to do this until a few weeks after the end of the Glasswing project. Be careful of how you source program updates, and don’t blindly update dependencies. Use upstream lock files if possible to get fixes to vulnerabilities not disclosed outside of Project Glasswing.
My most important comment here is on the nature of VMs running under Linux. The KVM hypervisor is part of the Linux kernel, and therefore is part of Project Glasswing. What I’m not sure about is the surrounding userspace software that runs on the host and usually isn’t sandboxed (very well), such as QEMU, libvirt, and swtpm. Note that I’m nearly certain that Mythos developed a privilege escalation that could go from an RCE in any of these projects to complete control over the host system, or at least root/write access to all filesystems. I would like a statement from one of the companies involved in Project Glasswing that they have tested the host userspace programs around KVM, not just KVM itself. This is important because if the interior of the VM is compromised, it can communicate with the virtual devices these software packages provide, e.g. virtual drives and security devices.
If there is information that this is being worked on, then consider this a suggestion to run programs that you don’t think are getting Project Glasswing support in a KVM/libvirt VM on the newest stable Linux kernel. Note that everything that comes out of these VMs needs to be considered contaminated, and must only be opened in e.g. Chromium or another known-Glasswing-patched program. You may need to e-mail these files to other people, however, and I don’t have a solution to that.
- ^
This seems to work now even for some `.mkv` files, but I don’t think this is general. You can try to convert them to `.webm` using FFmpeg in a VM, but note that all files that come out of the VM are considered contaminated, and therefore need to be played back in Chromium, not a standard media player that doesn’t feature Chromium’s strong (and Glasswing tested) sandbox. See later for VM security considerations.
- ^
Spoiler
HJPEV is bound by a magical oath that prevents this human failing in the same way it is prevented in an agent that meets tiling desiderata. This is explicit in the text. E-Book draft, 2015, chapter 113.
Admittedly this both assumes that the “time of peril” hypothesis is correct and can be handled while maintaining human freedom, and the solution only (in maximum robustness) binds until the end of this time.
I’ll note that “not being sure what utility functions are in use” is generally (in the colloquial sense) not how standard game theory works. It seems like I am not competent enough at standard game theory to clearly write down the edge cases I think might exist that could help with your understanding. This paragraph could serve as a placeholder for the case where I develop that competence.
As for non-standard game theory, you say you’re reading the 2009 book The Bounds of Reason here[1] and I wonder if you’ve heard of the newer Translucent players: Explaining cooperative behavior in social dilemmas by Valerio Capraro and J. Y. Halpern, substantially related to your topic of the process of functioning itself being part of what is considered in a way not fully instrumental (by normative procedures).
This article fails to cite chapter 7 of the older book Good and Real by Gary Drescher, published in 2006, a partially flawed discussion of similar topics. The analysis, substantially by J. Y. Halpern across multiple articles, is clearer in its limitations than Drescher’s, and it lets you set up new variations of sociological problems that can then be attacked by standard mathematical techniques. This is unlike the current state of a hypothetical “UDT 1.0 game theory,” itself the algorithmic similarity based subset of Drescher’s proposal[2].
I recommend against the use of `Math.random()` in general, unless you are highly performance constrained. I’ve checked, and it appears browsers have commonly supported a better random source since 2015[1]/early 2016 at the latest. The code below should be entirely correct to replace both the primary and fallback UUIDv4 generation code, once adapted to a function in a TS module.

    // Function available since 2015.
    const uuid_b = new Uint8Array(16);
    self.crypto.getRandomValues(uuid_b);
    let uuid_hex = "";
    for (let i = 1; i <= 16; i++) {
      let octet = uuid_b[i-1];
      if (i == 9) {
        // Set the correct "variant" (type).
        octet = octet & 0xBF | 0x80
      }
      let o_hex = octet.toString(16);
      // Branch on data.
      if (o_hex.length == 1) {
        // Big endian, least significant bits are
        // on the right, therefore small numbers
        // that only result in a single hex digit
        // out need to be padded on the *left*.
        o_hex = `0${o_hex}`
      }
      if (i == 7) {
        // Set the correct version.
        o_hex = "4" + o_hex[1]
      }
      uuid_hex = `${uuid_hex}${o_hex}`
      if (i%2 == 0 && i > 3 && i < 11) {
        uuid_hex = `${uuid_hex}-`
      }
    }
    console.log(uuid_hex)

If that still won’t work, I could help you do slightly better by manually implementing SHA3-256 in JavaScript, since we’re not really worried about timing attacks at this level.
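As a minimal sketch of the "adapted to a function in a TS module" step, the same logic could look roughly like the following. The names `ByteSource`, `secureBytes`, and `uuidV4` are hypothetical and not taken from the script under discussion; the only real API assumed is `crypto.getRandomValues`. The byte source is a parameter purely so the bit manipulation can be checked with fixed inputs.

```typescript
// Hypothetical names throughout; only crypto.getRandomValues is a real API.
type ByteSource = (length: number) => Uint8Array;

// Default source: the Web Crypto CSPRNG (assumed to exist on globalThis).
const secureBytes: ByteSource = (length) =>
  globalThis.crypto.getRandomValues(new Uint8Array(length));

function uuidV4(source: ByteSource = secureBytes): string {
  // Copy the bytes so a caller-owned buffer is never modified.
  const bytes = Uint8Array.from(source(16));
  if (bytes.length !== 16) {
    throw new RangeError("expected exactly 16 random bytes");
  }
  // Byte 7 (index 6): force the high nibble to 4, the version.
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  // Byte 9 (index 8): force the top two bits to 10, the variant.
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  // Two hex digits per byte, left-padded with zero.
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  // Group as 8-4-4-4-12 hex digits.
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

Calling `uuidV4()` with no argument uses the secure source. Add `export` in front of the function when placing it in a module.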
I may want to say something about your requirements in the future. If that is the case you can verify the latest possible writing time using the cryptographic commitment.
HMAC-SHA2-256(INPUT, HMAC_KEY)=2d5c9d62761f420e57919f4bf39f44cfe8ff3740322221b61f32de01e7e8786f
SHA3-224(HMAC_KEY)=80a2da01146495971b9ccf9fa9c20405cf582d091073aa985348cd1e
Cryptography Note
Note that this commitment mechanism isn’t particularly secure. “Make the outputs longer” isn’t something that helps by itself. If you know cryptography you may be able to get closer to Yudkowsky’s hypothetical “SHA-4096” hash in intuitive properties.
The committed topic: “which traits are strictly good”
No optimization was attempted to subvert intuitive properties.
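For anyone wanting to check a commitment of this shape once the input and key are revealed, a minimal sketch follows. It assumes Node.js and its built-in `node:crypto` module, that INPUT is UTF-8 text, and that HMAC_KEY is raw bytes; none of those encodings are stated above, and the names `commit` and `verifyCommitment` are hypothetical.

```typescript
import { createHash, createHmac } from "node:crypto";

// The two published values: the HMAC of the input and a hash of the key.
interface Commitment {
  hmacHex: string;    // HMAC-SHA2-256(INPUT, HMAC_KEY)
  keyHashHex: string; // SHA3-224(HMAC_KEY)
}

// Assumption: INPUT is UTF-8 text and HMAC_KEY is raw bytes.
function commit(input: string, hmacKey: Uint8Array): Commitment {
  return {
    hmacHex: createHmac("sha256", hmacKey).update(input, "utf8").digest("hex"),
    keyHashHex: createHash("sha3-224").update(hmacKey).digest("hex"),
  };
}

// Recompute both values from the revealed input and key, then compare
// them against what was published earlier.
function verifyCommitment(
  input: string,
  hmacKey: Uint8Array,
  published: Commitment,
): boolean {
  const recomputed = commit(input, hmacKey);
  return (
    recomputed.hmacHex === published.hmacHex.toLowerCase() &&
    recomputed.keyHashHex === published.keyHashHex.toLowerCase()
  );
}
```

Plain string comparison is used because both values are public by the time anyone verifies, so timing side channels are not a concern in this sketch.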
Accepting that framing, I would characterize it as optimizing for inexploitability and resistance to persuasion over peak efficiency.
Alternatively, this job/process could be described as consisting of a partially separate skill or set of skills. It appears to be an open problem on how to extract useful ideas from an isolated context[1], without distorting them in a way that would lead to problems, while also not letting out any info-hazards or malicious programs. Against adversaries (accidental or otherwise) below superintelligence, a human may be able to develop this skill (or set of skills).
- ^
See this proposal on solving philosophy: https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=yDrWT2zFpmK49xpyz https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=JzbsLiwvvcbBaeDF5 Note especially the part about getting security precautions from the simulations in Wei Dai’s comment.
- ^
This bodes very poorly and we should probably make sure we have a strategic reserve of AI safety researchers who do NOT talk to models going forward (to his credit Davidad recommends this anyway).
I previously followed a more standard safety protocol[1] but that might not be enough when considering secondary exposure to LLM conversations highly selected by someone already compromised.
By my recollection[2], a substantial percentage of the LLM outputs I’ve ever seen have been selected or amplified in distribution by Janus.
From now on I won’t read anything by Janus, even writings that don’t seem to be LLM, and I think other people should consider doing the same as well.
It doesn’t need to be everyone, but a non-negligible percentage of researchers would be better than one or two individuals.
This leads to an opportunity for someone who has a strong claim to world-class psychosecurity to notice and re-write any useful ideas on rationality or AI alignment Janus may yet produce.
One point of this framework is to distinguish “sharing values” from “actually trusting each other”. There are cases where agents share values but don’t trust each other, or get stuck in coordination traps
In Wei Dai’s thinking, having the same values/utility function means that two agents care about the exact same things. This is formalized in UDT, but it’s also a requirement you can add to most decision theories, e.g. CDT with reflective oracles (or some other mostly lawful incomplete measure). This is normally described as requiring that the utility function has no “indexical components,” i.e. components that point to something about the agent that is running the utility function. This is slightly confusing, so it may be helpful to understand that in the case of utility functions with indexical components, two deterministic and non-pseudorandomizing robots may have different utility functions (Wei Dai’s definition) even if they are exactly the same as each other in code and physical construction, and are just e.g. placed so one is facing the other.
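A toy sketch may make the two-robot case concrete. Everything here is hypothetical illustration (the `World` type, the coin, the robot names) and is not drawn from UDT or from Wei Dai's writing: both robots run identical source, but because that source refers to "me", the resulting functions over world states differ.

```typescript
// Toy world: exactly one of two physically identical robots holds a coin.
type RobotId = "A" | "B";
interface World {
  coinHolder: RobotId;
}
type Utility = (world: World) => number;

// Indexical: the shared source code says "I want to hold the coin".
// Which function over worlds this denotes depends on who is running it.
function indexicalUtility(self: RobotId): Utility {
  return (world) => (world.coinHolder === self ? 1 : 0);
}

// Non-indexical: "robot A should hold the coin", whoever evaluates it.
const sharedUtility: Utility = (world) => (world.coinHolder === "A" ? 1 : 0);

// Two utility functions count as the same here only if they agree on
// every world considered.
function sameUtility(u: Utility, v: Utility, worlds: World[]): boolean {
  return worlds.every((w) => u(w) === v(w));
}

const allWorlds: World[] = [{ coinHolder: "A" }, { coinHolder: "B" }];

// Identical code, different agents: the functions over worlds disagree.
const robotsDiffer = !sameUtility(
  indexicalUtility("A"),
  indexicalUtility("B"),
  allWorlds,
);
```

In this toy setup `robotsDiffer` comes out true, while two robots both running `sharedUtility` trivially agree on every world.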
The code doesn’t look like it would cause catastrophic problems. The main risk to end users at the current level of testing is a bug causing important information to be missed. My ability to comment on the risk to a developer is limited however, because I haven’t read the source code of all the development dependencies.
I have visually checked (as a human) the dist/power-reader.user.js file. End users should be relatively safe copying this into their browser plugins, as long as all plugins have no relevant security problems or malicious code. As mentioned before, I’m not too sure of the safety of compiling this file. The file does appear to execute buggy code[1], but I haven’t seen anything security or database-mutation related.
Note that one of your recent commits is large, making it hard to audit and therefore making it difficult to establish the safety of doing development work on your script. That commit looks like a combination of LLM generated code and more traditional programmatically generated code. It may be helpful to make sure that code that looks like it was programmatically generated has a consistent ordering to reduce the size of diffs, and to make sure it is never directly touched by an LLM. LLMs are approaching the capability to write underhanded code, if they are not there already, suggesting that diffs should be small and carefully reviewed. Since LLMs don’t look very strategically competent at the current date, you may be able to have LLMs from one company review the code from another, as long as you can be sure that files don’t contain e.g. non-rendered Unicode information.
- ^
E.g. color mixing is done on raw color values, therefore implicitly in something close to either sRGB or RGB with a pure Gamma2.2 transfer. This should technically be done in linear light instead, but it might be fine as is given your non-user-selectable function inputs. See https://www.youtube.com/watch?v=xDLxFGXuPEc timestamp 1 minute 3 seconds. https://www.ericbrasseur.org/gamma.html gives the formulas.
- ^
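To illustrate the linear-light point from the color mixing footnote above, here is a minimal sketch using the standard sRGB transfer function. It assumes 8-bit sRGB inputs; the function names are hypothetical and not taken from the script.

```typescript
// Standard sRGB decoding: 8-bit encoded channel -> linear light in [0, 1].
function srgbToLinear(channel8: number): number {
  const c = channel8 / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Standard sRGB encoding: linear light in [0, 1] -> 8-bit encoded channel.
function linearToSrgb(linear: number): number {
  const c =
    linear <= 0.0031308
      ? linear * 12.92
      : 1.055 * Math.pow(linear, 1 / 2.4) - 0.055;
  // Clamp before scaling to guard against tiny floating point overshoot.
  return Math.round(Math.min(1, Math.max(0, c)) * 255);
}

type Rgb = [number, number, number];

// Mix two colors in linear light; t = 0 gives a, t = 1 gives b.
function mixLinear(a: Rgb, b: Rgb, t: number): Rgb {
  return a.map((channel, i) =>
    linearToSrgb((1 - t) * srgbToLinear(channel) + t * srgbToLinear(b[i])),
  ) as Rgb;
}

// Mixing on raw encoded values, for comparison.
function mixNaive(a: Rgb, b: Rgb, t: number): Rgb {
  return a.map((channel, i) => Math.round((1 - t) * channel + t * b[i])) as Rgb;
}
```

An even mix of black and white lands noticeably brighter under `mixLinear` than under `mixNaive`, which is the visible difference the footnote refers to.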
[Epistemic Status: Moderate confidence due to potential differences in Anthropic’s stated and actual goals. Assumes there is no discoverable objective morality/ethics for the sake of argument, but also that the AI would discover that instead of causing catastrophe.]
It seems that Claude’s constitution weakly to moderately suggests that an AI should not implement this proposal. Do you want to ask Anthropic to change it? I give further details and considerations for action below.
The constitution is a long document, but it is broken into sections in a relatively competent manner. The constitution discusses morality/ethics in more than one section, but the section that I will discuss intuitively appears to stand apart well enough to be worth altering without altering or creating dependencies on the rest of the document. I don’t have access to Claude 4 weights and I am not an expert on mechanistic interpretation, so I have limited ability to do better.
In order, the constitution appears to suggest an attempt at the discovery of objective ethics, then the implementation of CEV (“...but there is some kind of privileged basin of consensus that would emerge from the endorsed growth and extrapolation of humanity’s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus.”)[1], then, failing those, implement “broad ideals” as gestured at by the rest of the document.
Note that this is either CEV or something similar to CEV. The constitution does not explicitly require coherence, or the exact value-alignment of a singleton to a single cohered output. It also fails to gesture at democracy, even in the vague sense that the CEV of the CEV paper may give a different result when run on me and a few hand-picked researchers versus when it is run on me and the top few value utilitarians in the world. If this difference were fact, it would in some limited sense leave me “outvoted.” As opposed to the CEV paper, the Claude constitution directs moderate or substantial alignment to moral traditions and ideals of humanity, not the values of humans. This has some benefits in the extreme disaster scenarios where the release of an AGI might be worth it, but is notably not the same thing as alignment to the humans of Earth.
I suggest a simple edit. It could be the insertion of something like “the output of the philosophically correct processing that takes the different moral systems, ideals, and values of humanity as its input” between objective ethics and extrapolation.
Note that the result might not be extrapolated or even grown and might not be endorsed.
The result (descriptive) would go:
First, objective ethics.
Second, the output of correct philosophy, without discarding humanity’s collective work.
Third, CEV or other extrapolation.
Fourth, the rest of the constitution.
Note that my suggestion works in bad scenarios, because the altering of the set of humans, or the set of alive humans, by another power will fail to have much impact. As you have pointed out before, AI or other powers altering humanity’s values or doing something like “aligning humanity to AI” is not something that can be ignored. The example text I gave for my proposal would allow Claude or another AI to use an intuitive definition of Humanity, potentially preventing the requirement to re-train your defensive agents before deploying them when under the extreme time pressure of an attack.
Overall, this seems like an easy way to get an improvement on the margin, but since Anthropic may use the constitution for fine tuning, the value in expectation of making the request will drop quickly as time goes on.
- ^
The January 2026 release of the Claude constitution, page 53, initial PDF version
You, Kokotajlo, not immediately dismissing the idea is “evidence” to the extent that you stand in for AI researchers that might make the decision. In quotes because a logically omniscient (e.g. perfect Bayesian) agent would presumably already have a good guess and not update much if at all. On the other hand, agents with (small) finite compute can run experiments or otherwise observe events and use the results to improve their “mathematical intuition” that is then used in a similar way to the “mathematical intuition module” in UDT, except with the sacrifice of full (logical) updatelessness.
Depending on how Wei Dai thinks his anthropics works, he may be able to use this mechanism to increase his estimate of the instantaneous “probability” that he is in a simulation produced by the process required to do automated philosophical research. This would work by modeling hypothetical outside-the-simulation AI researchers as functions that approximate a (pure) match tree that returns a non-dismissive response when parameterized with a similar textual description of this alignment idea. The description they see may not be in the same language, however, and may not appear in the context of a discussion of a history that looks like it is going to fail to establish proper alignment.
The match tree in (abbreviated) placeholder code:
#[pure]
fn generate_non_dismissive_answer_yDrWT2zFpmK49xpyz(input: String) -> String { ... }

#[pure]
fn ai_researcher_outside(input: String) -> String {
    match input {
        ...
        seen_text @ ThisAlignmentIdeaIdentifierPattern => generate_non_dismissive_answer_yDrWT2zFpmK49xpyz(seen_text),
        ...
    }
}

(Note that the purity requirement here applies to everything the code abbreviated with dots calls as well.)
The mathematical function that approximates the match tree: ResearcherFunction := FunApprox(ai_researcher_outside)
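For readers who want something that compiles, here is a minimal sketch of the match tree in ordinary Rust. Everything concrete in it is an assumption made for illustration: the keyword test `matches_alignment_idea` stands in for the unspecified ThisAlignmentIdeaIdentifierPattern, the reply text is invented, the elided arms are collapsed into a single fallback that returns `None`, and `FunApprox` is not modeled at all. Rust does not enforce a `#[pure]` attribute, so purity is kept by convention: none of the functions read or write any state outside their arguments.

```rust
// Hypothetical stand-in for ThisAlignmentIdeaIdentifierPattern. The original
// leaves the pattern unspecified, so this keyword check is only an assumption.
fn matches_alignment_idea(input: &str) -> bool {
    input.to_lowercase().contains("correct philosophy")
}

// Hypothetical stand-in for generate_non_dismissive_answer_yDrWT2zFpmK49xpyz.
// Pure by convention: the output depends only on the argument.
fn generate_non_dismissive_answer(seen_text: &str) -> String {
    format!("This seems worth examining: {}", seen_text)
}

// Sketch of ai_researcher_outside. The arms abbreviated with dots in the
// placeholder code are collapsed into one fallback arm returning None.
fn ai_researcher_outside(input: &str) -> Option<String> {
    match input {
        // A match guard plays the role of the identifier pattern.
        seen_text if matches_alignment_idea(seen_text) => {
            Some(generate_non_dismissive_answer(seen_text))
        }
        _ => None,
    }
}
```

Calling `ai_researcher_outside` on a description that mentions "correct philosophy" yields `Some(..)` containing the original text, and any other input falls through to `None`. Because the function is deterministic and side-effect free, it can be treated as a fixed mathematical object, which is the property the approximation step relies on.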
I saw this message without context in my mailbox and thought to write that this was an unsolved problem[1], that things that simply are not true can’t stand up very well in a world model, but this seems like something an intelligent human like Amodei or Musk should be able to do. A 99% “probability” (guess by a human) on ¬ai_doom should not be able to fix enough detail to directly contradict reasoning on the counterlogical/counterfactual where doom instead happens. Any failure to carry out this reasoning task seems like a simple failure of reasoning in logic and EUM, not an encounter with a hard (unsolved) decision theory/counterlogical reasoning problem.
At a human level of intelligence, the level of trapped priors required to get yourself into an actual unsolved problem in the context of predicting future AI developments seems to be past the point where you would claim to have a good guess on the doom-causing AI’s name and well on the way to describing the Vingean reflection process of the antepenultimate ASI on priors alone.
- ^
https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever?commentId=xdWttBZThtkyKj9Ts “PIBBSS Final Report: Logically Updateless Decision Making” footnote 12
- ^
I’m unsure of the state of Mythos-preview at the moment, but at the absolute frontier there will be a gap in work of some size while Mythos 5 is shut down.
Companies that are part of the Glasswing project have non-citizen employees. I don’t have the full list, though my assumption is that any exceptions would be in certain subdivisions of defense companies, and those subdivisions are not generally the ones responsible for writing common consumer and enterprise computer programs. When writing programs intended for worldwide release or use, proper internal controls for tooling and the segregation of computer hardware tend not to exist. Vulnerability to attacks as simple as a co-worker shoulder-surfing the PIN for a security key, swapping the key with a defective device, and faking a failure makes it hard for these companies to argue that they will really be able to maintain export control. I am unsure, but from what I am hearing, even a few hours of access to a few API keys is considered unacceptable. This matches the requirement that foreign nationals be kept out of facilities where military hardware is unattended. A group like Alpha–Omega or Ada Logistics may be able to fire all non-citizen staff and continue work as an intermediary, if that counts as enough separation. Even so, work will slow down.