Not sure why this post was downvoted. This concern seems quite reasonable to me once language models become more competent at executing plans.
One additional factor that would lead to increasing rates of depression is the rise in sleep deprivation. Sleep deprivation leads to poor mental health and is also a result of increased device usage.
From https://www.ksjbam.com/2022/02/23/states-where-teens-dont-get-enough-sleep/:
Late in 2021, the U.S. Surgeon General released a new advisory on youth mental health, drawing attention to rising rates of depressive symptoms, suicidal ideation, and other mental health issues among young Americans. According to data cited in the advisory, up to one in five U.S. children aged 3 to 17 had a reported mental, emotional, developmental, or behavioral disorder. Many of these worrying conditions predated the COVID-19 pandemic, which worsened mental health for many young people by disrupting their routines, limiting their social interactions, and increasing stress about the health of loved ones.
These trends in youth mental health can be attributed in part to detrimental shifts in young people’s lifestyle over time, including increased academic stress, growing use of digital media, and worsening health habits. And one of the major potential culprits in the latter category is sleep.
According to the CDC, teenagers should sleep between 8–10 hours per 24 hour period. This level of sleep is associated with a number of better physical and mental health outcomes, including lower risk of obesity and fewer problems with attention and behavior. Despite this, less than a quarter of teens report sleeping at least eight hours per day—a number that has fallen significantly over the last decade.
From https://www.prb.org/resources/more-sleep-could-improve-many-u-s-teenagers-mental-health/:
During that same period, teenagers’ nightly sleep dropped sharply: The share of high school students getting the recommended minimum of eight hours of sleep declined from nearly 31% in 2009 to around 22% in 2019.
Research shows a strong connection between sleep and symptoms of depression. In a 2019 study, Widome and colleagues showed that about one in three students who slept less than six hours per night had a high number of depression symptoms compared with about one in 10 students who got adequate sleep. But inadequate sleep is one of many factors affecting teenagers’ mental health.
The rise in sleep-deprived teenagers is a long-term trend, reports Widome. “A lot in our society has changed in the last decade, including more time spent using screens—phones, games, computers—and marketing caffeine drinks to adolescents.” In her 2019 study, teenagers who had inadequate sleep tended to spend twice as much time on devices with screens than their peers and were more likely to use those devices after they went to bed.
As for school daze: I can easily tell a story for how academic stress has risen over the past decade. As college admissions become more selective, top high school students are trying to juggle increasing levels of schoolwork and extracurriculars in order to try to get into a top university. See also NYU Study Examines Top High School Students’ Stress and Coping Mechanisms.
If you wanted more substantive changes in response to your comments, I wonder if you could have asked if you could directly propose edits. It’s much easier to incorporate changes into a draft when they have already been written out. When I have a draft on Google Docs, suggestions are substantially easier for me to action than comments, and perhaps the same is true for Sam Altman.
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours.
More specifically, an “Assistant” character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection, it’s not too surprising that this behavior misgeneralizes in undesired contexts.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.
As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.[15]
This doesn’t seem concerning as human users would eventually discover that the robot has a tendency to sweep messes under the rug, if they ever look under the rug, and the developers would retrain the AI to resolve this issue. Can you think of an example that would be more problematic, in which the misbehavior wouldn’t be obvious enough to just be trained away?
GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.
Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.
Have you tried feature visualization to identify what inputs maximally activate a given neuron or layer?
I first learned about the term “structural risk” in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.
To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive or harmful. For example, does it create overlap between defensive and offensive actions, thereby making it more difficult to distinguish aggressive actors from defensive ones? Does it produce dual-use capabilities that could easily diffuse? Does it lead to greater uncertainty or misunderstanding? Does it open up new trade-offs between private gain and public harm, or between the safety and performance of a system? Does it make competition appear to be more of a winner-take-all situation? We call this perspective “structural” because it focuses on what social scientists often refer to as “structure,” in contrast to the “agency” focus of the other perspectives.
Models that have been RLHF’d (so to speak), have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse
Janus’ post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It’s evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.
I haven’t seen evidence that RLHF’d text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
What dictation tools are using the most advanced AI? I imagine that with newer models like Whisper, we’re able to get higher accuracy than what the Android keyboard provides.
Is the auditing game essentially Trojan detection?
Has anyone tried to work on this experimentally?
The prompt “Are birds real?” is somewhat more likely, given the “Birds aren’t real” conspiracy theory, but still can yield a similarly formatted answer to “Are bugs real?”
The answer makes a lot more sense when you ask a question like “Are monsters real?” or “Are ghosts real?” It seems that with FeedME, text-davinci-002 has been trained to respond with a template answer about how “There is no one answer to this question”, and it has misgeneralized this behavior to questions about real phenomena, such as “Are bugs real?”
Do workshops/outreach at good universities in EA-neglected and low/middle income countries
Could you list some specific universities that you have in mind (for example, in Morocco, Tunisia, and Algeria)?
Some thoughts:
The assumption that AGI is a likely development within coming decades is quite controversial among ML researchers. ICML reviewers might wonder why this claim is justified and how much of the paper is relevant if you’re more dubious about the development of AGI.
The definition of situational awareness feels quite vague to me. The definition (“identifying which abstract knowledge is relevant to the context in which they’re being run, and applying that knowledge when choosing actions”) seems to encompass, for example, the ability to ingest information such as “pawns can attack diagonally” and apply that to playing a game of chess. Ajeya’s explanation of situational awareness feels much clearer to me.
Shah et al. [2022] speculate that InstructGPT’s competent responses to questions its developers didn’t intend it to answer (such as questions about how to commit crimes) was a result of goal misgeneralization.
Taking another look at Shah et al., this doesn’t seem like a strong example to me.
Secondly, there are reasons to expect that policies with broadly-scoped misaligned goals will constitute a stable attractor which consistently receives high reward, even when policies with narrowly-scoped versions of these goals receive low reward (and even if the goals only arose by chance). We explore these reasons in the next section.
This claim felt confusing to me, and it wasn’t immediately clear to me how the following section, “Power-seeking behavior”, supported this claim. But I guess if you have a misaligned goal of maximizing paperclips over the next hour vs maximizing paperclips over the very long term, I see how the narrowly-scoped goal would receive low reward as the AI soon gets caught, while the broadly-scoped goal would receive high reward.
Assisted decision-making: AGIs deployed as personal assistants could emotionally manipulate human users, provide biased information to them, and be delegated responsibility for increasingly important tasks and decisions (including the design and implementation of more advanced AGIs), until they’re effectively in control of large corporations or other influential organizations. An early example of AI persuasive capabilities comes from the many users who feel romantic attachments towards chatbots like Replika [Wilkinson, 2022].
I don’t think Replika is a good example of “persuasive abilities” – it doesn’t really persuade users to do much of anything.
Regardless of how it happens, though, misaligned AGIs gaining control over these key levers of power would be an existential threat to humanity
The section “Misaligned AGIs could gain control of key levers of power” feels underdeveloped. I think it might be helpful to include additional examples, such as ones from the 80,000 Hours article “What could an AI-caused existential catastrophe actually look like?”.
Choosing actions which exploit known biases and blind spots in humans (as the Cicero Diplomacy agent may be doing [Bakhtin et al., 2022]) or in learned reward models.
I’ve spent several hours reading dialogue involving Cicero, and it’s not at all evident to me that it’s “exploiting known biases and blind spots in humans”. It is, however, good at proposing and negotiating plans, as well as accumulating power within the context of the game.
Thanks for writing this! Here is a quick explanation of all the math concepts – mostly written by ChatGPT with some manual edits.
A basis for a vector space is a set of linearly independent vectors that can be used to represent any vector in the space as a linear combination of those basis vectors. For example, in two-dimensional Euclidean space, the standard basis is the set of vectors (1, 0) and (0, 1), which are called the “basis vectors.”
A change of basis is the process of expressing a vector in one basis in terms of another basis. For example, if we have a vector v in two-dimensional Euclidean space and we want to express it in terms of the standard basis, we can write v as a linear combination of (1, 0) and (0, 1). Alternatively, we could choose a different basis for the space, such as the basis formed by the vectors (4, 2) and (3, 5). In this case, we would express v in terms of this new basis by writing it as a linear combination of (4, 2) and (3, 5).
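As a rough illustration of the change of basis described above, here is a minimal sketch that finds the coordinates of a vector in the basis (4, 2), (3, 5) by solving a 2x2 linear system with Cramer's rule. The function name `coordinates_in_basis` and the example vector (11, 9) are assumptions chosen for illustration.

```python
from fractions import Fraction


def coordinates_in_basis(v, b1, b2):
    """Return (a, b) such that a*b1 + b*b2 == v, using Cramer's rule."""
    # Determinant of the matrix whose columns are b1 and b2.
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    a = Fraction(v[0] * b2[1] - b2[0] * v[1], det)
    b = Fraction(b1[0] * v[1] - v[0] * b1[1], det)
    return a, b


# Hypothetical example vector, written in the standard basis.
v = (11, 9)
a, b = coordinates_in_basis(v, (4, 2), (3, 5))
print(a, b)  # 2 1, because 2*(4, 2) + 1*(3, 5) = (11, 9)
```

The same vector has coordinates (11, 9) in the standard basis and (2, 1) in the new basis; only the description changes, not the vector itself.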
A vector space is a set of vectors that can be added together and multiplied (“scaled”) by numbers, called scalars. Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field. The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms. Examples of vector spaces include the set of all two-dimensional vectors (i.e., the set of all points in two-dimensional Euclidean space), the set of all polynomials with real coefficients, and the set of all continuous functions from a given set to the real numbers. A vector space can be thought of as a geometric object, but it does not necessarily have a canonical basis, meaning that there is not a preferred set of basis vectors that can be used to represent all the vectors in the space.
A matrix is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. A matrix represents a linear map between two vector spaces, or from a vector space to itself, because it can take any vector in the original vector space and transform it into a new vector in the target vector space using a set of linear equations. Each column of the matrix is the image of one of the original basis vectors under the transformation. In the expression $Mv$, we take each element of the vector $v$, multiply it by the corresponding column of the matrix $M$, and then add these scaled columns together to create the new vector.
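To make the column view of matrix-vector multiplication concrete, here is a small sketch in plain Python. The helper name `matvec` and the example matrix are assumptions for illustration; a real project would typically use a linear algebra library instead.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v as a weighted sum of M's columns."""
    n_rows, n_cols = len(M), len(M[0])
    if len(v) != n_cols:
        raise ValueError("vector length must equal the number of matrix columns")
    result = [0] * n_rows
    for j in range(n_cols):       # for each column j ...
        for i in range(n_rows):   # ... add v[j] times that column to the result
            result[i] += v[j] * M[i][j]
    return result


# Hypothetical example: the columns of M are (4, 2) and (3, 5).
M = [[4, 3],
     [2, 5]]
print(matvec(M, [1, 0]))  # [4, 2]: the first basis vector is sent to the first column
print(matvec(M, [2, 1]))  # [11, 9]: 2*(4, 2) + 1*(3, 5)
```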
The singular value decomposition (SVD) is a factorization of a matrix M into the product of three matrices: $M = USV^T$, where U and V are orthogonal matrices and S is a diagonal matrix with non-negative real numbers on the diagonal, called the “singular values” of M. The SVD is a useful tool for understanding the properties of a matrix and for solving certain types of linear systems. It can also be used for data compression, image processing, and other applications.
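One way to get a feel for singular values without a linear algebra library is to use the fact that they are the square roots of the eigenvalues of the symmetric matrix M^T M. The sketch below does this for a 2x2 real matrix only; the function name and the example matrix are assumptions for illustration, and it does not compute U or V.

```python
import math


def singular_values_2x2(M):
    """Singular values of a real 2x2 matrix, via the eigenvalues of M^T M."""
    (a, b), (c, d) = M
    # Entries of the symmetric matrix M^T M = [[p, q], [q, r]].
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    # Eigenvalues of a symmetric 2x2 matrix: mean +/- radius.
    mean = (p + r) / 2
    radius = math.sqrt(((p - r) / 2) ** 2 + q * q)
    largest = math.sqrt(mean + radius)
    smallest = math.sqrt(max(mean - radius, 0.0))  # clamp tiny negative rounding error
    return largest, smallest


# Hypothetical example matrix.
M = [[3, 0],
     [4, 5]]
s1, s2 = singular_values_2x2(M)
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(s1, s2)
print(s1 * s2, abs(det))  # the product of the singular values equals |det M|
```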
An orthogonal matrix (or orthonormal matrix) is a square matrix whose columns and rows are mutually orthonormal (i.e., they are orthogonal and have unit length). Orthogonal matrices have the property that their inverse is equal to their transpose.
Changing to an orthonormal basis can be importantly different from just any change of basis because it has certain computational advantages. For example, when working with an orthonormal basis, the inner product of two vectors can be computed simply as the sum of the products of their corresponding components, without the need to use any weights or scaling factors. This can make certain calculations, such as finding the length of a vector or the angle between two vectors, simpler and more efficient.
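The two properties above (the transpose acts as the inverse, and inner products keep their simple form) can be checked numerically. This is a minimal sketch using a 2D rotation matrix as the orthogonal matrix; the helper names, the angle, and the test vectors are assumptions for illustration.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


theta = math.pi / 6  # hypothetical rotation angle
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Q^T Q should be the identity, so the transpose is the inverse.
product = matmul(transpose(Q), Q)
is_identity = all(
    abs(product[i][j] - (1.0 if i == j else 0.0)) < 1e-12
    for i in range(2) for j in range(2)
)

# Inner products are unchanged when both vectors are transformed by Q.
u, v = [1.0, 2.0], [3.0, -1.0]
dot_before = dot(u, v)
dot_after = dot(matvec(Q, u), matvec(Q, v))
print(is_identity, dot_before, dot_after)
```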
Eigenvalues and eigenvectors are special types of scalars and vectors that are associated with a linear map or a matrix. If M is a linear map or matrix and v is a non-zero vector, then v is an eigenvector of M if there exists a scalar λ, called an eigenvalue, such that $Mv = \lambda v$. In other words, when an eigenvector is multiplied by the matrix M, the resulting vector is a scalar multiple of the original vector. Eigenvalues and eigenvectors are important because they provide insight into the properties of the linear map or matrix. For example, the eigenvalues of a matrix can tell us whether it is singular (i.e., not invertible), and together with the eigenvectors they tell us whether it is diagonalizable (i.e., can be expressed in the form $PDP^{-1}$, where P is an invertible matrix and D is a diagonal matrix). The eigenvectors of a matrix can also be used to determine its rank, nullity, and other characteristics.
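Here is a small sketch that checks the eigenvector definition directly on a 2x2 example. The matrix and the candidate eigenpairs are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


# Hypothetical symmetric example matrix.
M = [[2, 1],
     [1, 2]]

# Candidate (eigenvalue, eigenvector) pairs to verify.
candidates = [(3, [1, 1]), (1, [1, -1])]

# For each pair, M v should equal lambda * v.
checks = [matvec(M, v) == [lam * x for x in v] for lam, v in candidates]
print(checks)  # [True, True]

# Sanity checks: eigenvalues sum to the trace and multiply to the determinant.
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(trace, det)  # 4 3
```

Because neither eigenvalue is zero, the determinant is non-zero and this matrix is invertible.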
Probability basics: Probability is a measure of the likelihood of an event occurring. It is typically represented as a number between 0 and 1, where 0 indicates that the event is impossible and 1 indicates that the event is certain to occur. When all outcomes are equally likely, the probability of an event can be calculated by counting the number of ways in which the event can occur and dividing by the total number of possible outcomes.
Basics of distributions: A distribution is a function that describes the probability of a random variable taking on different values. The expected value of a distribution is a measure of the center of the distribution, and it is calculated as the weighted average of the possible values of the random variable, where the weights are the probabilities of each value occurring. The standard deviation is a measure of the dispersion of the distribution, and it is calculated as the square root of the variance, which is the expected value of the squared deviation of a random variable from its mean. A normal distribution (or Gaussian distribution) is a continuous probability distribution with a bell-shaped curve, which is defined by its mean and standard deviation.
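As a worked example of expected value, variance, and standard deviation, the sketch below uses a fair six-sided die. The die is an assumption chosen for illustration; exact fractions are used so the intermediate values are easy to verify by hand.

```python
from fractions import Fraction
import math

# A fair six-sided die: each face has probability 1/6.
probs = {face: Fraction(1, 6) for face in range(1, 7)}

# Expected value: probability-weighted average of the possible values.
mean = sum(x * p for x, p in probs.items())

# Variance: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in probs.items())

# Standard deviation: square root of the variance.
std = math.sqrt(variance)

print(mean, variance)  # 7/2 35/12
print(std)
```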
Log likelihood: The log likelihood of a statistical model is a measure of how well the model fits a given set of data. It is calculated as the logarithm of the probability of the data given the model, and it is often used to compare the relative fit of different models.
Maximum value estimators: A maximum value estimator is a statistical method that is used to estimate the value of a parameter that maximizes a given objective function. Examples of maximum value estimators include the maximum likelihood estimator and the maximum a posteriori estimator.
The maximum likelihood estimator is a method for estimating the parameters of a statistical model by choosing the parameter values under which the observed data are most probable, i.e., the values that maximize the likelihood of the data.
The maximum a posteriori (MAP) estimator is a method for estimating the parameters of a statistical model by choosing the parameter values that maximize the posterior probability of the parameters given the data. By Bayes’ rule, the posterior is proportional to the likelihood of the data given the parameters multiplied by the prior probability of the parameters. The MAP estimator is often used in Bayesian inference, and it is a popular method for estimating the parameters of a model in the presence of prior knowledge.
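To compare the log likelihood, the maximum likelihood estimate, and the MAP estimate on one example, here is a minimal sketch for a coin with unknown heads probability p. The data (7 heads, 3 tails), the Beta(2, 2) prior, and the grid search are all assumptions for illustration; in practice these estimates have closed forms or are found with an optimizer.

```python
import math

heads, tails = 7, 3            # hypothetical observed coin flips
prior_a, prior_b = 2, 2        # hypothetical Beta(2, 2) prior on p


def log_likelihood(p):
    """Log probability of the observed flips given heads probability p."""
    return heads * math.log(p) + tails * math.log(1 - p)


def log_prior(p):
    """Log of the Beta(prior_a, prior_b) density, up to an additive constant."""
    return (prior_a - 1) * math.log(p) + (prior_b - 1) * math.log(1 - p)


def log_posterior(p):
    """Log posterior up to an additive constant: log likelihood + log prior."""
    return log_likelihood(p) + log_prior(p)


# Grid of candidate values for p, excluding 0 and 1 where the logs diverge.
grid = [i / 1000 for i in range(1, 1000)]

mle = max(grid, key=log_likelihood)
map_estimate = max(grid, key=log_posterior)

print(mle)           # close to heads / (heads + tails) = 0.7
print(map_estimate)  # close to (heads + 1) / (heads + tails + 2) = 2/3
```

The prior pulls the MAP estimate toward 0.5 relative to the maximum likelihood estimate; with more data the two estimates move closer together.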
Random variables: A random variable is a variable whose value is determined by the outcome of a random event. For example, the toss of a coin is a random event, and the number of heads that result from a series of coin tosses is a random variable.
Central limit theorem: The central limit theorem states that, as the sample size increases, the distribution of the (suitably standardized) mean of independent, identically distributed samples approaches a normal distribution, regardless of the distribution of the underlying random variable, provided that distribution has finite variance.
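A quick simulation can illustrate the central limit theorem. This sketch draws many sample means from a uniform distribution on [0, 1], which is far from bell-shaped, and compares their spread with the value the theorem predicts. The sample size, number of trials, and seed are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

n = 50                   # size of each sample
trials = 5000            # number of sample means to collect


def sample_mean(size):
    """Mean of `size` independent draws from the uniform distribution on [0, 1]."""
    return sum(rng.random() for _ in range(size)) / size


means = [sample_mean(n) for _ in range(trials)]

observed_mean = statistics.fmean(means)
observed_std = statistics.pstdev(means)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sample mean
# should have standard deviation sqrt(1/12) / sqrt(n).
predicted_std = (1 / 12) ** 0.5 / n ** 0.5

# For a normal distribution, about 68% of values lie within one standard deviation.
within_one_std = sum(abs(m - 0.5) < predicted_std for m in means) / trials

print(observed_mean, observed_std, predicted_std, within_one_std)
```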
Calculus basics: Calculus is a branch of mathematics that deals with the study of rates of change and the accumulation of quantities. It is a fundamental tool in the study of functions and is used to model and solve problems in a variety of fields, including physics, engineering, and economics.
Gradients: In calculus, the gradient of a (scalar-valued multivariate differentiable) function is a vector that points in the direction in which the function is increasing most quickly. It is calculated as the vector of partial derivatives of the function with respect to each variable.
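To illustrate the gradient as a vector of partial derivatives, this sketch compares a hand-derived gradient with a finite-difference approximation. The function f(x, y) = x^2 + 3xy and the evaluation point are assumptions chosen for illustration.

```python
def f(x, y):
    """Hypothetical example function."""
    return x ** 2 + 3 * x * y


def analytic_gradient(x, y):
    """Partial derivatives of f derived by hand: (df/dx, df/dy)."""
    return [2 * x + 3 * y, 3 * x]


def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad


point = (1.0, 2.0)
exact = analytic_gradient(*point)          # [8.0, 3.0]
approx = numerical_gradient(f, point)
print(exact, approx)
```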
The chain rule: The chain rule is a fundamental rule of calculus that allows us to calculate the derivative of a composite function. It states that if f is a function of g, and g is a function of x, then the derivative of f with respect to x is equal to the derivative of f with respect to g times the derivative of g with respect to x. In other words, (df / dx) = (df / dg) * (dg / dx).
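The chain rule can be checked numerically on a simple composite function. In this sketch, g(x) = x^2 and f(g) = sin(g) are assumptions chosen for illustration, and the chain-rule derivative is compared with a finite-difference estimate.

```python
import math


def g(x):
    return x ** 2


def f(u):
    return math.sin(u)


def chain_rule_derivative(x):
    """(df/dx) = (df/dg) * (dg/dx) = cos(g(x)) * 2x."""
    df_dg = math.cos(g(x))
    dg_dx = 2 * x
    return df_dg * dg_dx


def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation to the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 1.3  # hypothetical evaluation point
exact = chain_rule_derivative(x0)
approx = numerical_derivative(lambda x: f(g(x)), x0)
print(exact, approx)
```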
On backpropagation:
Backpropagation is an algorithm for training artificial neural networks, which are machine learning models inspired by the structure and function of the brain. It is used to adjust the weights and biases of the network in order to minimize the error between the predicted output and the desired output of the network.
The idea behind backpropagation is that, given a multivariate function that describes the relationships between the input variables and the output variables of a neural network, we can use the chain rule to calculate the gradient of the function with respect to the weights and biases of the network. The gradient tells us how the error changes as we adjust the weights and biases, and we can use this information to update the weights and biases in a way that reduces the error.
To understand why backpropagation is just the chain rule on multivariate functions, it’s helpful to consider the structure of a neural network. A neural network consists of layers of interconnected nodes, each of which performs a calculation based on the inputs it receives from the previous layer. The output of the network is a function of the inputs, and the weights and biases of the network determine how the inputs are transformed as they pass through the layers of the network.
The process of backpropagation involves starting at the output layer of the network and working backwards through the layers, using the chain rule to calculate the gradients of the weights and biases at each layer. This is done by calculating the derivative of the error with respect to the output of each layer, and then using the chain rule to propagate these derivatives back through the layers of the network. This allows us to calculate the gradients of the weights and biases at each layer, which we can use to update the weights and biases in a way that minimizes the error.
Overall, backpropagation is an efficient and effective way to train neural networks because it allows us to calculate the gradients of the weights and biases efficiently, using the chain rule to propagate the derivatives through the layers of the network. This enables us to adjust the weights and biases in a way that minimizes the error, which is essential for the effective operation of the network.
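To make the description of backpropagation concrete, here is a minimal sketch for a network with one input, one tanh hidden unit, and one linear output, trained on a squared error. The architecture, parameter values, input, target, and learning rate are all assumptions for illustration. The backward pass applies the chain rule from the output back to the first layer, and the result is checked against finite differences.

```python
import math


def forward(params, x):
    """Forward pass: z = w1*x + b1, h = tanh(z), y = w2*h + b2."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    return z, h, y


def loss(params, x, target):
    """Squared error between the network output and the target."""
    _, _, y = forward(params, x)
    return (y - target) ** 2


def backprop(params, x, target):
    """Gradient of the loss with respect to [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dL_dy = 2 * (y - target)          # derivative of the loss at the output
    dL_dw2 = dL_dy * h                # output layer weight
    dL_db2 = dL_dy                    # output layer bias
    dL_dh = dL_dy * w2                # propagate back to the hidden activation
    dL_dz = dL_dh * (1 - h * h)       # tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                # hidden layer weight
    dL_db1 = dL_dz                    # hidden layer bias
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]


def numerical_gradient(params, x, target, eps=1e-6):
    """Finite-difference gradient, used only to check the backprop result."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grad


params = [0.5, -0.2, 1.5, 0.1]   # hypothetical initial w1, b1, w2, b2
x, target = 0.8, 1.0             # hypothetical training example

grad = backprop(params, x, target)
check = numerical_gradient(params, x, target)

# One gradient descent step with a hypothetical learning rate.
learning_rate = 0.1
new_params = [p - learning_rate * g for p, g in zip(params, grad)]
old_loss = loss(params, x, target)
new_loss = loss(new_params, x, target)
print(grad, check)
print(old_loss, new_loss)
```

The backward pass costs about as much as one forward pass, while the finite-difference check needs two forward passes per parameter, which is the efficiency advantage described above.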
Relevant: China-related AI safety and governance paths—Career review (80000hours.org)