Side quests in curriculum learning and regularization

In Selective regularization for alignment-focused representation engineering, we presented a successful approach for structuring the latent space of a simple MLP. Here we document our side quests: experiments that didn’t go as expected, but in which we gained experience in regularization design and training dynamics.
This is the second part of Milestone 1 of Concept-anchored representation engineering for alignment. In this post, “we” refers to “me and Claude Opus 4,” who helped draft this content. The underlying research was conducted with coding assistance from Claude 3.5 Sonnet, Gemini 2.5 Pro, GPT-4o, and GPT-4.1.
Lesson 1: Data curricula are mostly harmful
We initially believed curriculum learning in pre-training would help impose structure on latent spaces: we would start with key concepts to establish a foundation, then gradually add complexity. We thought this would be necessary to create a structure around which other concepts could organize, but our experiments ultimately revealed the opposite: training on the full dataset from the start produces better representations and more stable training dynamics[1].
In our experiments, we trained a bottlenecked autoencoder on RGB colors. The primary objective was to reproduce the input color, but we added regularization to the bottleneck[2] to make the network internally represent color as HSV instead of RGB[3].
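For concreteness, here is a minimal sketch of that setup, assuming a PyTorch implementation; the layer sizes and the 0.1 regularization weight are illustrative, not the values used in our experiments:

```python
import torch
import torch.nn as nn

class ColorAutoencoder(nn.Module):
    """Toy RGB autoencoder with a small bottleneck, roughly as described above."""

    def __init__(self, hidden: int = 16, bottleneck: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, bottleneck)
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(), nn.Linear(hidden, 3)
        )

    def forward(self, rgb: torch.Tensor):
        z = self.encoder(rgb)  # the latent code that the regularizers act on
        return self.decoder(z), z

model = ColorAutoencoder()
rgb = torch.rand(64, 3)                                # a batch of random colors
recon, z = model(rgb)
reconstruction = nn.functional.mse_loss(recon, rgb)    # primary objective: reproduce the input
regularization = (z.norm(dim=-1) - 1.0).pow(2).mean()  # e.g. the unitarity term discussed in Lesson 3
loss = reconstruction + 0.1 * regularization
```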
Following traditional curriculum learning principles, we started with just primary and secondary colors (6 points), then expanded to all hues, and finally introduced variations in brightness and saturation[4]. We added an “anchor” to lock in the positions of key colors after the initial phase of the curriculum: once the primary and secondary colors stabilized, a regularizer kicked in to penalize drift in those specific colors. We expected that a clean color wheel would emerge from just the primary and secondary colors, and that gradually adding more colors would preserve that structure while filling in the gaps. It worked to some extent, but it had problems:
About 25% of training runs resulted in “crumpled” latent spaces where colors were out of order. The structure was therefore highly sensitive to initial conditions and hyperparameter choices.
When expanding the data to include previously out-of-bounds colors, new samples needed to be gently introduced (with low batch weight or learning rate) to avoid shocking the optimizer. Failing to do so resulted in severely misshapen latent spaces.
Curriculum learning instabilities in color embedding. The same training process with identical hyperparameters but different random initializations can produce dramatically different outcomes. Top row: Successful case where the initial 6-color phase (left) leads to a well-organized color wheel in the final phase (right). Bottom row: Failed case where the initial arrangement becomes “crumpled,” resulting in a distorted final structure that doesn’t form a complete color wheel.
We became suspicious of the data curriculum when viewing animated visualizations of the latent space evolution. The “crumpling” occurred during the initial phase with only the 6 primary and secondary training colors. In that phase, the encoder had to organize its 16-dimensional hidden layer based on extremely limited information. Perhaps the remaining capacity wasn’t simply unused — it developed structure that later proved incompatible with the full color distribution.
We tested whether the curriculum was actually helping by training on the entire color dataset from the start, using selective per-sample regularization instead of phased data presentation[5]. This was much more successful:
Training was more stable across different random seeds
No “crumpling” occurred — the latent space consistently organized into a smooth sphere
Implementation was simpler with no data curriculum phases to manage
The structure never became misshapen by the introduction of new points, because they were all in-distribution from the start[6].
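A minimal sketch of what “selective per-sample regularization” can look like, assuming PyTorch; the mask, anchor targets, and anchoring penalty below are illustrative placeholders rather than our exact implementation:

```python
import torch

def masked_mean(per_sample_penalty: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average a per-sample penalty over only the samples it applies to."""
    if not mask.any():
        return per_sample_penalty.new_zeros(())
    return per_sample_penalty[mask].mean()

# Toy batch: the full color distribution is always present; only some samples
# carry extra regularization, instead of being withheld by a curriculum.
z = torch.randn(8, 3)                                # latent codes for the whole batch
unitarity = (z.norm(dim=-1) - 1.0).pow(2)            # applies to every sample
is_key_color = torch.tensor([1, 0, 0, 1, 0, 0, 0, 0], dtype=torch.bool)
anchor_targets = torch.zeros(8, 3)                   # hypothetical anchor positions
anchoring = (z - anchor_targets).pow(2).sum(dim=-1)  # only meaningful for key colors

reg_loss = unitarity.mean() + masked_mean(anchoring, is_key_color)
```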
It seems that our curriculum approach failed because it violated a principle identified in the literature on critical learning periods and loss of plasticity: early training epochs create strong connections optimized for the input distribution, and these become difficult to change later[7]. By starting with only 6 colors, we forced the model to create representations optimized for an unrepresentative distribution. It’s worth noting that the primary and secondary colors are corners of the RGB cube, but two corners (black and white) were omitted. Our observations align with work showing that randomly ordered samples often perform as well or better than carefully designed curricula. The model needs access to the full problem structure to find globally optimal representations[8].
Our results don’t condemn all curriculum learning. Successful applications like RLHF work by narrowing from a broad distribution (all language) to a specific subset (preferred responses). Our experiments attempted the opposite — expanding from narrow to broad — which appears fundamentally more difficult. Hiding complexity doesn’t help; it forces premature commitments that become constraints later.
Lesson 2: Hyperparameter schedules can be discrete
We had noticed that abrupt changes in the data distribution caused unstable training dynamics. Before exploring whether the data curriculum was needed at all, we attempted to tame the process by varying hyperparameters, such as regularization term weights and sample weights (giving previously out-of-distribution samples lower weight). We developed a hyperparameter scheduler inspired by animation techniques and robotics, using a dopesheet and minimum jerk trajectories[9], and applied it to our color curriculum learning.
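For reference, a minimum jerk transition between two keyframe values reduces to a single quintic easing polynomial. The sketch below is illustrative; the real dopesheet machinery handles many parameters and keyframes at once, and the step numbers are placeholders:

```python
def min_jerk(t: float, t0: float, t1: float, v0: float, v1: float) -> float:
    """Minimum jerk interpolation of a scalar hyperparameter between two keyframes.

    Holds v0 before t0 and v1 after t1; in between, the transition has zero
    velocity and acceleration at both ends, so the value changes smoothly.
    """
    if t <= t0:
        return v0
    if t >= t1:
        return v1
    tau = (t - t0) / (t1 - t0)                      # normalized time in [0, 1]
    ease = 10 * tau**3 - 15 * tau**4 + 6 * tau**5   # minimum jerk easing curve
    return v0 + (v1 - v0) * ease

# e.g. ramp the repulsion weight from 1.0 down to 0.0 between steps 2000 and 3000
repulsion_weight = min_jerk(t=2500, t0=2000, t1=3000, v0=1.0, v1=0.0)
```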
One of our hypotheses was that smooth hyperparameter transitions would lead to more stable training dynamics compared to discrete changes. To test this, we ran a controlled comparison between smooth and stepped transitions[10]. We created two dopesheets with curricula designed to be as similar as possible:
Smooth transitions: Using eased timing functions for gradual parameter changes
Stepped transitions: Using traditional discrete phases with abrupt changes
Comparison of smooth versus stepped hyperparameter schedules. Top: Smooth transitions using minimum jerk trajectories create gradual parameter changes across training phases. Bottom: Traditional stepped curriculum with abrupt changes at phase boundaries.
Learning rate schedules for smooth versus stepped transitions. The smooth approach (top) uses a brief initial warmup followed by gradual decay. The stepped approach (bottom) follows a similar overall trajectory but temporarily drops the learning rate to near zero during hyperparameter transitions.
To ensure a fair comparison, the stepped curriculum included learning rate warmup at the start of each phase (already common practice) and had more phases to allow finer control over hyperparameter values. We also implemented per-sample weight scaling for newly introduced data to prevent the optimizer from being “shocked” by previously out-of-distribution samples.
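The two mitigations can be sketched as simple ramp functions; the step counts below are placeholders, not the values we used:

```python
def lr_warmup_factor(step_in_phase: int, warmup_steps: int = 200) -> float:
    """Linear learning-rate warmup at the start of each curriculum phase."""
    return min(1.0, (step_in_phase + 1) / warmup_steps)

def new_sample_weight(steps_since_introduced: int, ramp_steps: int = 500) -> float:
    """Ramp newly introduced (previously out-of-distribution) samples up to full
    loss weight instead of presenting them at full strength immediately."""
    return min(1.0, (steps_since_introduced + 1) / ramp_steps)
```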
Contrary to our expectations, both approaches produced remarkably similar results:
Training dynamics: While the stepped variant showed the predicted loss spikes at phase transitions, these were mitigated by the learning rate warmup and caused no apparent problems. The smooth variant avoided these spikes entirely but didn’t show any clear advantage in final performance (i.e. lower loss) or in the structure of the latent space.
Latent space quality: Surprisingly, the smooth variant produced a noisier latent space during the “All hues” phase (although that was largely corrected by the end of training). While this may be specific to our chosen hyperparameter values rather than an inherent property of smooth transitions, it challenged our assumption that smoother would always be better.
Visualization of validation data at the end of the All hues phase. At this time, the model hadn’t observed any non-vibrant colors. Left: the latent space of the model trained with the smooth curriculum, showing an irregular region near pink. Right: the model trained with the stepped curriculum, which is (surprisingly) more regular.
Data introduction effects: Despite introducing data differently (in chunks for stepped, gradually for smooth), both variants showed nearly identical latent space evolution. The stepped variant would bulge suddenly at phase starts while the smooth variant bulged more gradually, but the extent to which they “wobbled” was roughly the same. Both converged to almost exactly the same final structure.
Varying hyperparameters still helps
We made a reasonable effort to search for static hyperparameter values that would result in the same structure — but it seems that they either don’t exist, or are harder to find than a good schedule. This might be due to the use of conflicting regularization terms (see below): varying the strengths of those terms may have allowed the optimizer to overcome the tension. In particular, the repulsion term was reduced to zero shortly before anchoring the key concepts.
So we found that varying hyperparameters during training was beneficial, even though choice of timing function didn’t matter. The changing loss landscape appears to help the model find better solutions — perhaps by opening new optimization paths that wouldn’t be available with static hyperparameters. Colleagues working on Singular Learning Theory found this observation particularly intriguing, as it may relate to theoretical predictions about how loss landscape geometry evolves during training. Careful design of training data and regularization terms may avoid the need for varying hyperparameters, but to the extent that there are competing optimization goals, varying hyperparameters may be helpful.
Practical recommendations
If you need time-varying hyperparameters during training, smooth transitions offer practical advantages despite not improving training dynamics:
Simpler specification: Fewer keyframes, phases, and warmup periods to define
Easier iteration: Adjusting a smooth curve is more intuitive than redesigning discrete phases, and transitions can span multiple phases
Dynamic adaptation: Methods like minimum jerk trajectories can gracefully handle runtime schedule adjustments (e.g., shortening phases based on loss targets)
These findings are based on our specific color domain and architecture; different domains or model types may show different sensitivities to transition smoothness.
Lesson 3: Understand regularizer conflicts
Separately from our curriculum experiments, we discovered that our ensemble of regularization terms was poorly designed, creating unstable training dynamics. The main issue was the combination of unitarity (pushing embeddings to lie on a unit hypersphere) with a repulsion term designed to prevent clustering[11].
Our unitarity term encouraged embedding vectors to have length 1:
$$L_{\text{unitarity}} = \bigl(\lVert x \rVert - 1\bigr)^2$$
You can read this as “find the length of the vector and compare it to 1”. Squaring it makes the penalty much greater for points that are far from the surface of the hypersphere (whether inside or outside).
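In code, assuming a PyTorch implementation, the term might look like the following sketch (the mean reduction over the batch is our own choice):

```python
import torch

def unitarity_loss(z: torch.Tensor) -> torch.Tensor:
    """Penalize embeddings whose length deviates from 1, i.e. points off the unit hypersphere."""
    return (z.norm(dim=-1) - 1.0).pow(2).mean()
```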
However, since our embeddings started clustered near the origin (random initialization), the simplest way for the optimizer to satisfy unitarity was to move the entire cluster to a small area on the surface of the hypersphere. To prevent this clustering, we added a repulsion term based on Euclidean distance:
$$L_{\text{repulsion}} = \frac{1}{\lVert x_1 - x_2 \rVert^2}$$
This creates stronger repulsion the closer points get — like magnets with the same pole facing each other. The squared denominator means the force explodes as distance approaches zero.
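A sketch of the same term applied over all pairs in a batch, again assuming PyTorch; the epsilon and the mean reduction are illustrative choices:

```python
import torch

def euclidean_repulsion_loss(z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Inverse-square pairwise repulsion: grows rapidly as any two points approach."""
    d2 = torch.cdist(z, z).pow(2) + eps              # (n, n) squared pairwise distances
    off_diag = ~torch.eye(len(z), dtype=torch.bool)  # ignore each point's distance to itself
    return (1.0 / d2[off_diag]).mean()
```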
This created a fundamental conflict: the repulsion term pushes points apart in straight lines through space, potentially moving them off the hypersphere surface. Meanwhile, unitarity pulls them back. The result was an unstable tug-of-war where the optimizer couldn’t satisfy both constraints simultaneously.
The impact on training was severe. During phases where Euclidean repulsion was active, the combined loss reached nearly 10,000 — compared to starting values around 1 without repulsion. This occurred with just six points (primary and secondary colors); with larger batches from later training phases, the problem would have been even greater.
Using orthogonal forces instead
We resolved this by reformulating the repulsion term to use angular rather than Euclidean distance:
$$L_{\text{repulsion}} = \left( \frac{x_1}{\lVert x_1 \rVert} \cdot \frac{x_2}{\lVert x_2 \rVert} \right)^{10}$$
This measures how aligned two vectors are regardless of their magnitudes (using the normalized dot product), then amplifies the penalty for similar directions. The exponent of 10 sharpens the effect, so that only points that are quite close together would feel a strong repulsion. This force operates tangentially along the hypersphere surface, making it orthogonal to the radial unitarity force. The two regularizers no longer fight — they complement each other by guiding the optimization along independent directions.
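A sketch of this term, assuming PyTorch; the exponent is the one stated above, while the reduction over pairs is our own choice:

```python
import torch
import torch.nn.functional as F

def angular_repulsion_loss(z: torch.Tensor, exponent: int = 10) -> torch.Tensor:
    """Repel points along the sphere: penalize pairs whose directions are nearly aligned."""
    zn = F.normalize(z, dim=-1)                      # project onto the unit sphere
    cos = zn @ zn.T                                  # pairwise cosine similarities
    off_diag = ~torch.eye(len(z), dtype=torch.bool)  # ignore each point's similarity with itself
    return cos[off_diag].pow(exponent).mean()
```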
Illustration of repulsion terms compared to unitarity. Left: Euclidean (linear) repulsion, where points repel each other in straight lines through space without regard to the unit sphere. Middle: Angular repulsion, where points repel along the surface of the sphere. Right: Radial unitarity constraint, where points are attracted to the closest point on the surface of the sphere. The angular repulsion force operates rotationally (orthogonal to the unitarity force), eliminating the conflict that occurs with Euclidean repulsion.
With this change, training dynamics improved significantly: the repulsion term could be active throughout the entire curriculum, and the optimizer could adjust each term independently without undoing progress on the other.
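As a quick check on that orthogonality claim (our own derivation, not reproduced from the experiments), write $\hat{x}_i = x_i / \lVert x_i \rVert$ and take gradients with respect to $x_1$:

$$\nabla_{x_1}\,(\lVert x_1 \rVert - 1)^2 = 2\,(\lVert x_1 \rVert - 1)\,\hat{x}_1$$

$$\nabla_{x_1}\,(\hat{x}_1 \cdot \hat{x}_2) = \frac{1}{\lVert x_1 \rVert}\,\bigl(\hat{x}_2 - (\hat{x}_1 \cdot \hat{x}_2)\,\hat{x}_1\bigr), \qquad x_1 \cdot \nabla_{x_1}\,(\hat{x}_1 \cdot \hat{x}_2) = 0$$

The first gradient is purely radial; the second is orthogonal to $x_1$, so it (and, by the chain rule, the gradient of its tenth power) lies tangent to the sphere. The two terms therefore never pull along the same axis.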
This suggests an important principle for designing regularization schemes: aim for constraints that operate on orthogonal aspects of the representation. In our case:
Unitarity influences the magnitude (radial position)
Angular repulsion influences the distribution (spread on the sphere)
Planarity influences which dimensions are used
When regularizers compete for the same geometric properties, they create optimization landscapes with no good solutions. But by considering the geometry of your constraints, you can design regularization terms that complement rather than conflict with each other.
In retrospect, we should have applied this principle more broadly — our concept anchoring term still uses Euclidean distance, which could create similar (though less severe) conflicts by pulling nearby points inward toward the anchor. Our future work should probably use angular distances consistently across all regularization terms that operate on hypersphere-normalized representations.
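To make that retrospective point concrete, here is a sketch of the two variants, assuming PyTorch; the function names and the cosine-based form of the angular version are illustrative, not the formulation we actually used:

```python
import torch
import torch.nn.functional as F

def euclidean_anchor_loss(z: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Current style: pull anchored embeddings straight toward their target points,
    which also tugs on their radial position and can fight the unitarity term."""
    return (z - anchors).pow(2).sum(dim=-1).mean()

def angular_anchor_loss(z: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Angular alternative: penalize only directional misalignment with the anchor,
    leaving the radius to the unitarity term."""
    zn = F.normalize(z, dim=-1)
    an = F.normalize(anchors, dim=-1)
    return (1.0 - (zn * an).sum(dim=-1)).mean()
```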
Conclusion
We have been surprised by a number of things as we work toward well-structured, interpretable latent representations. Initially we thought that careful data sequencing and smooth parameter transitions would be needed. Instead, we discovered that simpler approaches and more principled regularization terms work better.
If you’re exploring similar techniques, may we suggest:
Skip the data curriculum. Or rather, don’t start with the kind of dataset you might use for fine-tuning. Train on the full distribution to give the model enough information to find globally optimal representations.
Design regularizers to be complementary. Check that your constraints operate on independent aspects of the representation. When regularizers compete for the same geometric properties, they create unsolvable optimization problems.
Hyperparameter schedules can help, but smoothness doesn’t matter. If you need time-varying hyperparameters (and you might, especially with competing regularization terms), smooth transitions are easier to specify and adjust, but don’t expect them to improve training dynamics. The changing loss landscape itself seems to be what helps.
In some ways it would have been nice to have known these things earlier — some of them at least are present in the literature. On the other hand, by (re)discovering them we have developed better intuition for training dynamics and regularization techniques in general[12].
These experiments were always preparation for applying concept anchoring to language models. We now know we can work with realistic, noisy labels on full datasets, and this suggests we won’t need perfect datasets for abstract concepts like deception or harmfulness. We understand how to design regularization terms that work together rather than against each other. And we’ve found that imposing structure doesn’t require fighting against the learning process, but understanding the natural structure of the latent space and guiding it gently from the start.
The color wheel emerged not because we forced it, but because we gave the model just enough signal to organize itself intelligibly. If we can apply similar principles to language models, anchoring even a handful of safety-relevant concepts could make monitoring and intervention significantly more tractable.
[1] Relevant experiment notebooks: Ex 1.3: MLP with 3D bottleneck and curriculum learning, Ex 1.5: Smooth curriculum and anchoring, and Ex 1.7: Sparse labels.
[2] The primary objective (loss criterion) of an autoencoder is to reproduce its input. The “loss criterion” is a signal that allows the model to learn how to produce the desired output, which in the case of an autoencoder is the same as the input by definition. “Regularization” refers to additional loss terms computed from the latent (hidden) layers of the network that encourage the model to adopt a particular internal shape. Details are in our previous post.
[3] RGB means red, green, and blue, and HSV means hue, saturation, value — different ways to represent color. RGB is how color is typically displayed on a screen, but HSV is closer to how humans naturally think of color.
[4] We make no claim that this is how humans learn; indeed we expect the opposite to be true, and that the sample efficiency of human learning is grossly overstated.
[5] For details on how we succeeded in shaping the latent space, see Selective regularization for alignment-focused representation engineering.
[6] Technically, “from the start” they were all out-of-distribution. It’s probably more accurate to say that, early in training, the network was able to observe diverse-enough data to avoid major re-structuring later in training. You could think of the data curriculum as resulting in over-fitting in the first phase, but since we’re aiming for a specific structure, we think that a geometric interpretation is more informative.
[7] Achille, A., Rovere, M., Soatto, S. (2017). Critical Learning Periods in Deep Neural Networks. arXiv:1711.08856
[8] Wu, X., Dyer, E., Neyshabur, B. (2020). When Do Curricula Work? arXiv:2012.03107
[9] A dopesheet (or exposure sheet) is a tabular format for specifying animations. Minimum jerk trajectories are a mathematical tool from robotics that minimize acceleration changes for smooth motion.
[10] Relevant experiment notebooks: Ex 1.4: Parameter transitions, Ex 1.5: Smooth curriculum and anchoring, and Ex 1.6: Smooth vs. stepped hyperparameter transitions.
[11] Relevant experiment notebooks: Ex 1.5: Smooth curriculum and anchoring and Ex 1.7: Sparse labels.
[12] Starting out as a researcher is hard when you don’t know what you don’t know. Ideally you should collaborate with others, and we are grateful for the discussions with other researchers at the Melbourne AI Safety Hub — particularly @Liam Carroll and @Dan MacKinlay. Collaborating with AI is also a good way to sense-check your approach, as long as you’re careful not to give the impression that you’re seeking affirmation. The new research modes in Claude and in ChatGPT are excellent for finding papers related to what you’re working on, and we intend to use them more during future work rather than only for up-front and post-hoc literature reviews.