Inner-aligning AI and trying to ask better questions.
Jaehyuk Lim
Jailbreaking ChatGPT and Claude using Web API Context Injection
Although not “circuit-style,” this could also be considered one of the attempts outlined in Mack et al. (2024).
HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix
some issues related to causal interpretations
Could you point to the specific line from Marks et al. that you are referring to?
Biasing VLM Response with Visual Stimuli
Do you also conclude that the causal role of the circuit you discovered was spurious? What’s a better way to incorporate the sample-level variance you mention when measuring the effectiveness of an SAE feature or SV? (i.e., should a good metric of causal importance satisfy both a sample-level and a population-level increase?)
Could you also link to an example where a causal intervention satisfied the above-mentioned criteria (or an alternative of your own that was not mentioned in this post)?
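For concreteness, here is one possible way to formalize the criterion I have in mind (my own notation, not from the post): an intervention on a feature should help both on average and on most individual samples.

```latex
% Assumed formalization (not from the post): \Delta_i(f) is the change in the
% target metric on sample i after intervening on feature f, over N samples.
\[
\underbrace{\frac{1}{N}\sum_{i=1}^{N} \Delta_i(f) > 0}_{\text{population-level increase}}
\qquad \text{and} \qquad
\underbrace{\Pr_{i}\big[\Delta_i(f) > 0\big] \ge 1 - \epsilon}_{\text{sample-level increase (most samples)}}
\]
```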
Is there a codebase for the supervised dictionary work?
“I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.”
As artificial intelligence continues to advance and demonstrate increasingly impressive creative capabilities, it raises important questions about the role and value of human creativity. While AI has the potential to enhance and augment human creativity in many ways, it also threatens to compress or diminish it in certain domains. This essay explores the complex interplay between AI and human creativity, examining both the use cases where AI can boost human creative potential and the concerning scenarios where it may encroach upon or devalue human creative contributions.
Creativity-Enhancing Use Cases:
Automation of mundane tasks: By leveraging AI to handle repetitive, time-consuming tasks, humans can free up their mental bandwidth and energy for more creative, higher-level thinking. Just as laundry machines and dishwashers have saved us time for more productive pursuits, AI tools can automate tedious aspects of creative work, allowing humans to focus on the more imaginative and innovative elements.
Reevaluating priorities: I believe this is the hardest one, because it is difficult for people to look honestly at themselves. As AI takes over certain creative tasks, it prompts humans to reflect on where their unique value lies. This can lead to a beneficial reprioritization, pushing humans to focus on the creative endeavors where they can make the greatest impact and contribution. It encourages a shift toward higher-value, more meaningful creative work.
Faster iteration and experimentation: AI tools can greatly accelerate the creative process by enabling rapid prototyping, testing, and refinement of ideas. In fields like art, design, and product development, AI can generate a wide range of variations and alternatives, allowing humans to explore and iterate more efficiently. This faster experimentation cycle can lead to more innovative and polished final creations.
Enhancing human creativity: AI can serve as a powerful tool to augment and enhance human creativity by providing inspiration, suggestions, and collaborating in the creative process. It can offer novel combinations, uncover hidden patterns, and propose unexpected solutions that humans might not have considered. By working in tandem with AI, humans can push the boundaries of their creative potential.
Use Cases that Compress Creativity:
Reduced human niches: As AI continues to surpass human performance in various creative domains, the space for human originality and unique contributions may shrink. If AI can generate creative work of comparable or superior quality to humans, the value of human creativity in those areas may diminish. The focus may shift from originality to authenticity – the mere fact that a human created something becomes the differentiating factor, rather than the novelty of the ideas themselves.
Competitive pressures: In fields where AI can automate a significant portion of the creative process, it may disincentivize human participation and lead to an overall reduction in human creative output. If AI can generate large volumes of high-quality content, such as articles or designs, at a fraction of the cost and time, it may become increasingly difficult for human creators to compete and make a living in those domains.
Lack of new applications: If AI is primarily employed to automate existing creative tasks and optimize known processes, it may limit the technology’s potential to unlock entirely new creative possibilities. To truly amplify human creativity, AI should be applied not just to streamline what we already do, but to explore uncharted territories and enable radically new forms of creative expression.
Overreliance on AI: As AI becomes increasingly capable and ubiquitous in creative fields, there is a risk of humans becoming overly dependent on it, to the detriment of their own creative skills and intuition. If humans rely too heavily on AI for creative tasks, they may struggle to develop and maintain their own creative abilities. Striking a balance between leveraging AI’s strengths and nurturing human creative capacity is crucial.
While AI has the potential to greatly enhance and democratize creativity, it is important to consider its limitations and the unique role that human creativity plays. The spiral town image generated by Stable Diffusion serves as a compelling example. Despite being obviously AI-generated, it possesses an undeniable aesthetic appeal that resonates with human sensibilities. This suggests that there may be certain creative niches where human creativity is irreplaceable – not because of its superior technical quality, but because of its authentic connection to the human experience.
As we navigate the integration of AI into creative fields, we must strive to harness its potential to augment and inspire human creativity, while also preserving and protecting the intrinsic value of human creative expression. By fostering a synergistic relationship between human and machine creativity, we can unlock new frontiers of innovation and artistic exploration, while ensuring that the essence of human creativity continues to thrive.
Thank you for the feedback, and thanks for this.
Who else is actively pursuing sparse feature circuits in addition to Sam Marks? I’m curious because the code breaks in the forward pass of the linear layer when run on GPT-2, since its dimensions differ from Pythia’s (768).
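In case it helps while debugging, here is a minimal sketch (my own, not from the sparse feature circuits codebase) of reading the hidden dimension from each model’s config instead of hardcoding it, so the linear-layer shape follows whichever model is loaded; the model names below are only illustrative.

```python
# Minimal sketch (not the authors' code): infer d_model from the model config so
# linear-layer shapes match whichever model is loaded, instead of a hardcoded value.
import torch.nn as nn
from transformers import AutoConfig

for name in ["gpt2", "EleutherAI/pythia-160m-deduped"]:  # illustrative model names
    cfg = AutoConfig.from_pretrained(name)
    # GPT-2 configs store the width as `n_embd`; Pythia (GPTNeoX) uses `hidden_size`.
    d_model = getattr(cfg, "hidden_size", None) or getattr(cfg, "n_embd", None)
    linear = nn.Linear(d_model, d_model)  # shape now follows the loaded model
    print(name, d_model, linear)
```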
[Question] SAE sparse feature graph using only residual layers
Identifying Micro-friction in the Context of the Anterior Mid-Cingulate Cortex (aMCC)
I agree with you there. There are numerous benefits to being an autodidact (freedom to learn what you want, less pressure from authorities), but formal education offers more mentorship. For most people, the desire to learn something is often not enough, even with the increased accessibility of information, as the material gets more complex.
Do you see possible dangers of closed-loop automated interpretability systems as well?
the only aligned AIs are those which are digital emulations of human brains
I don’t think this is necessarily true. I don’t think emulated human brains are necessary for full alignment, nor that emulated human brains would be more aligned than a well-calibrated and scaled-up version of our current alignment techniques (+ new ones to be discovered in the next few years). Emulating the entire human brain to align values seems not only implausible (even with neuromorphic computing, efficient neural networks, and Moore’s law^1000), it also seems like overkill and a misallocation of valuable computational resources. Assuming I’m understanding “emulated human brains” correctly, emulation would mean pseudo-sentient systems designed solely to be aligned to our values. Perhaps morality can be a bit simpler than that, somewhere in the middle between static, written rules (the law) and the unpredictable human mind. Because if we essentially just make more people, that doesn’t really address the “many biases or philosophical inadequacies” we already have.
What’s the difference between “having a representation” for uppercase/lowercase and using that representation to solve an MCQ or A/B test? From your investigations, do you have intuitions as to what the mechanism of the disconnect might be? I’m interested in seeing what might cause these models to perform poorly despite having representations that, at least to us humans, seem relevant to solving the task.
Considering that the Mistral-7B tokenizer probably uses a case-sensitive vocabulary (https://discuss.huggingface.co/t/case-sensitivity-in-mistralai-mistral-7b-v0-1/70031), the presence of distinct representations for uppercase and lowercase characters might not be as relevant to the task for the model as one would assume. It seems plausible (with non-negligible probability) that these representations do not substantially influence the model’s ability to perform the H-Test, such as answering multiple-choice questions. Perhaps one should probe for another representation instead, such as a circuit for “eliciting information”.
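For what it’s worth, a quick way to sanity-check the case-sensitivity premise is to compare token IDs for upper- and lowercase strings; the sketch below is only illustrative (it assumes access to the mistralai/Mistral-7B-v0.1 tokenizer from the linked thread, and the example strings are arbitrary).

```python
# Illustrative sketch: check whether the tokenizer distinguishes case, which is
# the premise of the comment above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
for text in ["hello world", "HELLO WORLD"]:
    print(text, "->", tok.encode(text, add_special_tokens=False))
# Different ID sequences for the two strings would mean case information is
# preserved at the input level, regardless of whether the model uses it downstream.
```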
Hey, thanks for the reply. Yes, we tried k-means and agglomerative clustering, and they worked, with somewhat mixed results.
We’ll try PaCMAP instead and see if it is better!
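For reference, a rough sketch of the kind of comparison in question (the file path, cluster counts, and hyperparameters are placeholders, not our actual settings): cluster the unit-normalized SAE decoder rows with HDBSCAN and k-means, and project them with PaCMAP for visual inspection.

```python
# Rough sketch (placeholder path and hyperparameters, not the actual settings):
# cluster unit-normalized SAE decoder rows and project them for inspection.
import numpy as np
import hdbscan                              # pip install hdbscan
import pacmap                               # pip install pacmap
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

W_dec = np.load("sae_decoder.npy")          # [n_features, d_model], hypothetical file
X = normalize(W_dec)                        # unit-norm rows so distance ~ direction

hdb_labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)
km_labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(X)
xy = pacmap.PaCMAP(n_components=2).fit_transform(X)   # 2-D embedding for plotting

n_hdb = len(set(hdb_labels)) - (1 if -1 in hdb_labels else 0)  # -1 marks noise points
print(f"HDBSCAN found {n_hdb} clusters; k-means was fixed at {len(set(km_labels))}.")
print("PaCMAP embedding shape:", xy.shape)
```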