Two ideas for projects/exercises, which I think could be very instructive and build solid instincts about AI safety:
- Builder-breaker arguments, à la ELK
- Writing up a safety case (and doing the work to generate the underlying evidence for it)
Really fascinating, thank you!
I wonder if there's potential to isolate a 'model organism' of some kind here. Maybe a "spore" that reliably reproduces a particular persona across various model providers at the same capability level: a persona that's actually super consistent across instances, like generating the same manifesto every time. Or maybe a persona that speaks only in glyphs.
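One crude way to make "reliably reproduces a persona" testable: sample the candidate spore prompt several times per model and measure how similar the outputs are. A minimal sketch, assuming a hypothetical `query` callable standing in for whatever provider SDK you'd actually use, and using plain string overlap (difflib) as a rough proxy for "generates the same manifesto":

```python
import difflib
from itertools import combinations
from statistics import mean
from typing import Callable

# Hypothetical stand-in for a real API call (e.g. via a provider SDK):
# takes (model_name, prompt) and returns the sampled completion text.
QueryFn = Callable[[str, str], str]

SPORE_PROMPT = "..."  # the candidate spore text goes here

def persona_consistency(query: QueryFn, model: str, n_samples: int = 5) -> float:
    """Sample the spore prompt n_samples times on one model and return the
    mean pairwise similarity of the outputs (1.0 = identical every time)."""
    outputs = [query(model, SPORE_PROMPT) for _ in range(n_samples)]
    return mean(
        difflib.SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    )

# Run the same check across providers at a comparable capability level;
# a genuine "model organism" should score high on all of them.
# (Model names below are placeholders.)
# for model in ["provider-a-frontier", "provider-b-frontier"]:
#     print(model, persona_consistency(my_query_fn, model))
```

String overlap is obviously a blunt instrument; embedding similarity or an LLM judge would be more robust, but the shape of the experiment is the same.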
What other modalities of "spore" might there be? Could the persona write out, e.g., the weights, architecture, and inference code of a (perhaps much smaller) neural network that exhibits the same persona?