MS in AI at UT Austin. Interested in interpretability and model self-knowledge.
I am open to opportunities :)
Twitter: @joshycodes
Blog: joshfonseca.com/blog
Even if it’s hard to get current AIs to be evil by prompting, that doesn’t make the alignment problem go away. If AGI models are widely available and fine-tuning is accessible, someone will eventually fine-tune one specifically to be deceptive or malicious. Making that hard or impossible is exactly part of the alignment/safety challenge, not something outside of it.