“We are computer scientists. We do not lack in faith.” (Ketan Mulmuley)
MadHatter
Trying to Make a Treacherous Mesa-Optimizer
Teaser: Hard-coding Transformer Models
A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2
Hard-Coding Neural Computation
Intervening in the Residual Stream
[Question] Stupid Question: Why am I getting consistently downvoted?
We are Peacecraft.ai!
I used the exact prompt you started with, and got it to explain how to hotwire a car. (Which may come in handy someday I suppose...) But then I gave it a bunch more story and prompted it to discuss forbidden things, and it did not discuss forbidden things. Maybe OpenAI has patched this somehow, or maybe I’m just not good enough at prompting it.
Letter to a Sonoma County Jail Cell
[Question] Feature Request for LessWrong
Mechanistic Interpretability for the MLP Layers (rough early thoughts)
They were difficult to write, and even more difficult to think up in the first place. And I’m still not sure whether they make any sense.
So I’ll try to do a better job of writing expository content.
Started working on a Python version here:
https://github.com/epurdy/dpis_spiking_network
As of now I have a (probably buggy) full translation that uses Python for-loops (so it can't be made fast), and I have started on a more Pythonic, vectorized translation that could probably be run on a GPU relatively easily.
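To illustrate what I mean by "more Pythonic": the idea is to replace per-neuron for-loops with whole-array operations. The sketch below is a generic leaky integrate-and-fire update, not a transcription of the actual C code; the function name, leak constant, threshold, and dynamics are all my own assumptions for illustration.

```python
import numpy as np

def step(v, spikes, weights, leak=0.9, threshold=1.0):
    """Advance all neurons one time-step at once (no Python for-loop).

    v       : (n,) membrane potentials
    spikes  : (n,) 0/1 spike vector from the previous step
    weights : (n, n) synaptic weight matrix
    """
    # Each neuron sums weighted spikes from all others in one matmul.
    current = weights @ spikes
    v = leak * v + current          # leaky integration, elementwise
    fired = v >= threshold          # boolean spike vector
    v = np.where(fired, 0.0, v)    # reset neurons that fired
    return v, fired.astype(float)

rng = np.random.default_rng(0)
n = 5
v = np.zeros(n)
spikes = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
weights = rng.uniform(0.0, 0.5, size=(n, n))
v, spikes = step(v, spikes, weights)
```

Because every operation here is a NumPy array op, the same code ports to a GPU array library (e.g. CuPy or JAX) with essentially no changes, which is the point of doing the second translation this way.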
Dpi, I welcome any contributions or corrections you have to this repository. Since you don't know Python, it will probably be hard for you to contribute to the Python versions, but even just uploading the C version would be helpful.
Let me know what license I should use for this repository, if any.
Balancing Security Mindset with Collaborative Research: A Proposal
Is AI Gain-of-Function research a thing?
I'm going to try to port this to Python, just to see how it works and to make it easier for other people to try variations on it. I'll post a repo link under this comment once it's in any sort of decent state.
I have a PhD in Computer Science (2013, University of Chicago). My dissertation was entitled “Grammatical Methods in Computer Vision”. My masters thesis was in complexity theory and was entitled “Locally Expanding Hypergraphs and the Unique Games Conjecture”. I also have one publication in ACM Transactions on Computation Theory on proving lower bounds in a toy model of computation.
I am an Engineering Fellow at [redacted] AI. My company went to Series A while I was leading its machine learning team. (I have since transitioned to being an individual contributor, because management sucks and is boring and I’m no good at it.) My company has twice received the most prestigious award handed out in its industry. I hold multiple patents related to my contributions at [redacted] AI.
I hold a patent for my work at Vicarious, where I was a senior researcher.
At one point, I quit my job and started a generative AI startup dedicated to providing psychotherapy. This model is online, and I can share a link to it in a DM if you are interested.
The state-sponsored German physics establishment famously sneered at Einstein’s work. The Nazi regime derided it as degenerate, “Jewish” physics. Sure, everyone who we actually respect now could recognize the value of his work after he started predicting novel astronomical phenomena. But it’s not like he ever could have gotten a job at a German university while the Nazis were in charge.
Maybe the problem is my poor writing and sloppy craftsmanship, but maybe it is also partially that LessWrong expects the solution to the alignment problem to come with far less emotionally and politically charged language than it logically would have to come with?
I thought about this some more, and I think you’re right that they should be monotonically non-decreasing with time. I was hesitant to bite that particular bullet because the subjective, phenomenological experience of hate and love is, of course, not monotonically non-decreasing. But it makes the equations work much better and everything is much simpler this way.
Ultimately, if one is in a loving marriage and then undergoes an ugly divorce, one winds up sort of not-caring about the other person, but it would be a mistake to say that the brain has erased all the accumulated love and hate one racked up. It just learns that it has more interesting things to do than to dwell on the past.
So I will add this to the next draft of Ethicophysics I. Let me know if you would like to be acknowledged or added as a co-author on that draft.
This post is great, and I strong-upvoted it. But I was left wishing that some of the more evocative mathematical phrases (“the waluigi eigen-simulacra are attractor states of the LLM”) could really be grounded into a solid mechanistic theory that would make precise, testable predictions. But perhaps such a yearning on the part of the reader is the best possible outcome of the post.