I feel vaguely good about the general suggestions here, but I’m having trouble piecing together what these posts look like. Is the technical side of these posts real but of mediocre research taste, or mostly hallucinated?
Put another way, are these decent autodidactic programmers posting their hello worlds in alignment research, who could do something useful and informative if somebody onboarded them to the kind of work that needs doing, or are the posts essentially just LLM-generated ‘slop’ with some buzzwords sprinkled in? In the former case, it seems worthwhile for experienced researchers to put together a slate of open questions for people of a variety of skill levels, with examples of the kinds of results that people would find interesting.
My current guess is that these people are sort of earnestly trying to onboard themselves, but they are missing layers and layers of judgment, which leaves them nowhere close to being good contributors. They are mostly showing up on LessWrong after having done the work, asking an AI “where should I post this?”, and being told “LessWrong”, but they don’t really have any interest in scrapping their work, reading the sequences, and starting over.
I think the actual content is… possibly real, in the sense of “built and ran code that does the vaguely mechinterp buzzwordy things they said”, but the way they were evaluating the AI was by asking it simple English sentences after doing mechinterp buzzwordy things, and the questions they asked don’t really mean anything.
I think we probably should be doing something more to try to capture these people more productively but the first step is “yep, your first round of work was basically fake, here is how to start over.”
the people are sort of earnestly trying to onboard themselves
I’m still having trouble tracking what this means, concretely. Maybe an example of two situations:
A friend of mine, bright but new to CS, wanted help optimizing a model to draw adversarial lines across images. He wrote some code that took a bunch of line parameters (width, length, angle), stored in PyTorch tensors, and rendered lines onto those images programmatically, but with no differentiable relationship between these parameters and the perturbed image. He was confused as to why putting the resulting images into an optimization loop targeted at `-classification_loss` didn’t result in higher error rates.
When I was getting into genetic algorithms, I read a paper and attempted to replicate its results, using their published pseudocode. My algorithm consistently failed to work, and I eventually decided I had to look for a working reference. As it turns out, modern genetic algorithms use momentum and LR schedulers just like backpropagation-based learning does, and I had been relying on the outdated notion that random perturbation alone was enough to get a working result.
In the first case, there was an earnest attempt, but there wasn’t any understanding of how the process should have worked, and the thought process was based in building something that looked like a machine learning script, on the basis that the relevant variables lived inside PyTorch tensors. In the second case, there was a solid understanding of how the process should have worked, but I was missing key practical knowledge.
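To make the first case concrete, here is a minimal sketch of why the optimizer had nothing to work with. It is not my friend's code, and it uses plain Python with finite differences rather than PyTorch autograd; `render_hard`, `render_soft`, and `loss` are hypothetical stand-ins (the "loss" is just a fixed weighting of pixels, not a real classifier). The point it illustrates: a renderer that snaps line parameters to pixel indices gives no usable gradient signal, while one whose pixel intensities vary smoothly with the parameters does.

```python
import math

def render_hard(angle, size=16):
    """Rasterize a line through the image centre by rounding to pixel indices."""
    img = [[0.0] * size for _ in range(size)]
    c = size / 2
    for step in range(-size, size + 1):
        t = step / 2
        x = int(round(c + t * math.cos(angle)))
        y = int(round(c + t * math.sin(angle)))
        if 0 <= x < size and 0 <= y < size:
            img[y][x] = 1.0  # hard write: the pixel is either on or off
    return img

def render_soft(angle, size=16, sigma=1.0):
    """Pixel intensity falls off smoothly with distance from the same line."""
    c = size / 2
    nx, ny = -math.sin(angle), math.cos(angle)  # unit normal to the line
    return [
        [math.exp(-(((x - c) * nx + (y - c) * ny) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def loss(img):
    """Hypothetical stand-in for a classifier loss: a fixed pixel weighting."""
    return sum(v * x * y for y, row in enumerate(img) for x, v in enumerate(row))

def finite_diff(render, angle, eps=1e-6):
    """Central-difference estimate of d(loss)/d(angle)."""
    return (loss(render(angle + eps)) - loss(render(angle - eps))) / (2 * eps)

# A tiny nudge to the angle moves no rounded pixel index, so the hard
# renderer's loss does not change at all; the soft renderer's loss does.
hard_grad = finite_diff(render_hard, 0.3)
soft_grad = finite_diff(render_soft, 0.3)
print("hard renderer gradient estimate:", hard_grad)
print("soft renderer gradient estimate:", soft_grad)
```

In an autograd framework the hard version is typically worse than "zero almost everywhere": once the parameters pass through integer indexing or an external drawing routine, there is no gradient path back to them at all, which matches what he was seeing.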
If someone’s out there teaching himself to write code that’ll correctly identify simple concept vectors within LLMs’ intermediate activations, but he’s retreading ground that academics covered three years ago and posting about it like it’s new to everyone, then all he needs is a summary of work so far, which is worth having for plenty of reasons, and he can start poking at underexplored regions of the frontier. If, on the other hand, he’s feeding a sentence into OpenAI’s embedding API, and then pasting the resulting numpy arrays into ChatGPT’s UI and asking it how it feels about the numbers it sees, then he probably needs to independently audit a few CS courses to get up to speed first.