I mostly agree with the diagnosis of the problem, but have some different guesses about which paths to try to get alignment on track.
I think the core difficulties of alignment are explained semi-acceptably, but in a scattered form, which means that only dedicated explorers with lots of time and good taste end up finding them. Having a high-quality course which collects the best explainers we have to prepare people for trying to find a toehold, and which notices the remaining gaps and fills them with good new writing, seems necessary for any additional group of people brought in to actually point in the right direction.
BlueDot’s course seems strongly optimized to funnel people into the empirical/ML/lab alignment team pipeline: they have dropped the Agent Foundations module entirely, and their “What makes aligning AI difficult?” fast track is 3/5ths articles on RLHF/RLAIF (plus an intro to LLMs and an RA video). Yet this is the standard recommendation, and there isn’t a generally known alternative.
I tried to fix this with Agent Foundations for Superintelligent Robust-Alignment, but I think this would go a lot better if someone like @johnswentworth took it over and polished it.