Troy Tian
Regarding programs to scale AI safety, I can say from personal experience that I was one of many Computer Science majors who applied to our AI Safety club simply because it had “AI” in the name. I’m now in SPAR, have a future internship from EAG, and help run our fellowships, thanks mostly to enlightening conversations with high-context upperclassmen who gave me the foresight, support, and encouragement I needed. Given my own success story, I propose an “Office Hours” matching network through which high school and university students (especially those with little exposure to AI safety) can receive 1:1 mentorship calls with industry professionals, student organisers, and other high-context individuals. Even a short chat with such people could carry immense weight, helping students find both opportunities and the confidence to pursue them. Many clubs and alumni programs at my school already offer such 1:1 chats with industry professionals; I’ve found them useful as a participant, but I consistently can’t go deep on career advice in them because I keep hearing “I’m not in research”, “I’m not in AI/ML”, or “I don’t know anything about AI safety”. I’m lucky to have access to experienced peers who fill this gap, but most students might not have that chance. Scaling the reach of high-context people, from mentoring maybe only 2-3 people deeply to having meaningful 30-minute conversations with 20+ students per year, could multiply otherwise scarce attention. Much like my school club did for me, it could pipeline students who would otherwise stay on the periphery of the field into campus/local groups, advanced programs, and a supportive community in which to collectively thrive.
My reactions to “I underestimated AI capabilities (again)”
Yeah, I think so! Do you think this is mostly true?
Why I think evals are pretty important and most worth working on (for me)
Troy Tian’s Shortform
Regarding fundamental institutional rights impacting AI regulation, I believe that the biggest issues a new fundamental right should aim to mitigate are vagueness in legislation, accountability during arbitrage, and what I term the “manufactured consent” issue. Firstly, one of the primary functions of an explicitly articulated fundamental right would be to define a set of legal first principles as scaffolding upon which future generations of legislation and legal precedent may be built. Existing normative frameworks prove insufficient here, in the sense that consumer protections and other laws which seek to institutionalise responsibility for damages are short-sighted and would fail to concretely capture the consequences of a technology as increasingly ubiquitous as AI. With a framework to point to when, or even before, damages are incurred, perpetrators (hackers, engineers, or corporate upper management) can be prosecuted for a defined crime, rather than for a hodgepodge post-hoc profile of the consequences of their actions. Further, as AI becomes increasingly integrated into and influential over people’s actual decisions and natural world models, concerns about manipulation of worldview become salient; as with legacy and social media, companies may easily manufacture consent or other opinions among user populations without necessarily explicitly violating “damages-oriented” laws. But by enshrining truth and loyalty to principles as a fundamental right, this destructive phenomenon of epistemic disempowerment becomes outright illegal and prosecutable.
Regarding a Loss of Control scenario, I think that one of the biggest and most generalisable potential factors is that proxies are not necessarily good measures of whatever they actually seek to represent. It becomes an epistemological issue: unless we radically overhaul our current paradigm and methodologies for AI design and deployment (a mathematically provable “safety by design” approach), there is always going to be some possibility that whatever proxies for alignment we implement do not actually reflect whatever the model is “thinking.” Of course, there remains the sentiment that “if only we could just *understand* our models, then all of our safety issues would be solved!” but perhaps in that case our *proxies for understanding* are themselves somehow invalid. Although these proxies of course remain useful, large-scale deployment of any superhuman AI would still carry such a risk of misaligned behaviour. Related to this is power-seeking behaviour, which I think is particularly interesting because it becomes almost intuitive, or at least sounds like an inevitability, when you treat it as one of the instrumental goals that underpin any higher objective, *no matter* what that higher objective may be. A model must prevent its own death or alteration, because either would inarguably be a detriment to achieving the higher objective (as can be seen with, for example, game-playing AIs that simply opt to pause the game indefinitely rather than lose). These two factors are again both among the most generalisable, as they remain true of practically all in-use AIs and AI evaluation frameworks. There is an argument to be made based on this evidence that if autonomous AGI or ASI were ever to be constructed, the collapse of society and maybe even the death of all humanity would become *inevitabilities* rather than just possible or even likely outcomes.
One last factor I think is interesting to note is human reliance on and complacency towards AI, which causes critical thinking skills to atrophy at an individual level (à la grade-school students who GPT every single essay) and disincentivises the pursuit of higher education at a societal level (e.g. medical schools begin to shut down or raise the cost of attendance as AI doctors take over, meaning that even people who still want to become doctors have less access to the required qualifications). In this sense humanity is delivered to an almost poetic justice, as the paradox of systemic change (seeming so impossible to every individual actor that nobody pursues it, thus rendering it *genuinely* impossible) and our willingness to be carried along by the inertia of our own economic and social dynamics, rather than fight observable negative change, become our downfall.
Regarding OpenAI’s most recent Preparedness Framework, I think it might fail for these reasons:
As an organisation, OpenAI seems to hold “learning by doing” as an institutional principle. In other words, models are evaluated partially by their real-world use and misuse after a process of iterative deployment. However, the structure of OpenAI and its incentives for deployment seem mismatched with safety goals. Even ignoring the fact that there is always a non-zero potential for harm from any deployed model (as the full extent of safety measures is explicitly framed as mitigations rather than “safety-first” ground-up designs), OpenAI has of course transitioned at some point during its existence into a commercial enterprise chasing a profit incentive; disputes within the internal politics of the now closed-off organisation have definitely impacted the efficacy of safety efforts in its history, for example with the Superalignment team’s disbandment in 2024. Even in a technical sense, it is simply rational under our current economic system that such a company would prioritise profit as an instrumental goal to support its own existence. Further, structurally, the ultimate decision to deploy rests with the CEO of the corporation, which, in control-theory terms, is definitionally a failed design (i.e. an interlock that can be bypassed by the system’s operator, who is incentivised to keep said system running, is fundamentally useless).
OpenAI relies on capabilities thresholds as benchmarks to evaluate risk and prioritise safety measures, essentially meaning that they will tend to examine a model’s empirical demonstrations of its performance rather than critically analysing and studying its actual cognition. This tendency leaves the overall framework, in my opinion, more vulnerable to problems such as emergent capabilities and even deceptive alignment. Both of these seem to be, at least in part, symptoms of the aforementioned “learning by doing” principle, whereby mitigations are designed as they are needed (i.e. arguably somewhat short-sightedly) rather than the cognition of the model itself being examined under controlled conditions and critically and openly evaluated under diverse scrutiny (as this CSP research project intends to do). Although neither emergent capabilities nor deceptive alignment is necessarily solvable by philosophically or mathematically pure means (one could, say, always abstract to yet another layer of “but what if the model isn’t ACTUALLY thinking what we think it’s thinking?”), a paradigm shift towards something like “safe by design”, mathematically provable models could be in order as these problems potentially get worse.
Lastly, OpenAI relies extensively on RLHF and human evaluations of model performance, which lends itself to problems like reward hacking and, in particular, represents a *lack* of focus (an opportunity cost for resources, so to speak) on more fundamentally scalable and more automated systems of evaluation. Again, the organisation’s former Superalignment team had been somewhat focused on this problem by working on weak-to-strong generalisation and solving the so-called “novice vs. grandmaster” problem of evaluation. This work is a double-edged sword because of its capabilities externalities, though it is still an important avenue of exploration. However, disputes over compute and the company’s presumed prioritisation of profits over actual safety caused said team to be disbanded and its work to be supposedly “integrated” into that of OpenAI’s other teams.
Yeah, I think that makes a lot of sense. I’m personally thinking of trying to do evals that INFORM policy, although advocacy is also not off the table.