Current AI alignment work often relies on a single metric or evaluation framework. But what if that one metric has blind spots or biases, leading to a false sense of security or to unnecessary restrictions? How can we not just use multiple metrics, but use them optimally?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of “satisfaction” across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
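To make this concrete, here is a minimal sketch of how that selection could work, assuming the metrics' errors are independent (so the combined false positive rate of an "all must pass" rule is just the product of the individual rates). The metric names A/B/C, their false positive rates, and the 1% target are all made-up illustrative values, not measurements from any real evaluation.

```python
from itertools import combinations
from math import prod

# Hypothetical false positive rates: the probability that each metric
# passes an UNALIGNED system. Values are illustrative only.
metric_fpr = {"A": 0.10, "B": 0.05, "C": 0.20}


def combined_fpr_all_pass(fprs):
    """FPR of the rule 'count the system as aligned only if ALL metrics pass',
    assuming the metrics' errors are independent."""
    return prod(fprs)


def combined_fpr_k_of_n(fprs, k):
    """FPR of the rule 'at least k of the n metrics pass', again assuming
    independent errors: sum the probability of every way >= k metrics can
    simultaneously false-positive on an unaligned system."""
    n = len(fprs)
    total = 0.0
    for m in range(k, n + 1):
        for passing in combinations(range(n), m):
            p = 1.0
            for i in range(n):
                p *= fprs[i] if i in passing else (1 - fprs[i])
            total += p
    return total


def smallest_subset_meeting_target(metric_fpr, target):
    """Smallest subset of metrics whose 'all must pass' rule achieves the
    target overall false positive rate (None if no subset does)."""
    names = list(metric_fpr)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if combined_fpr_all_pass([metric_fpr[m] for m in subset]) <= target:
                return subset
    return None


if __name__ == "__main__":
    fprs = list(metric_fpr.values())
    print("All of A, B, C must pass:", combined_fpr_all_pass(fprs))        # 0.001
    print("At least 2 of 3 must pass:", combined_fpr_k_of_n(fprs, 2))
    print("Subset meeting a 1% target:", smallest_subset_meeting_target(metric_fpr, 0.01))
```

The independence assumption is exactly where the "non-overlapping aspects" part matters: the product formula only holds to the extent that the metrics fail for different reasons, which is why modeling what each metric uniquely assesses is as important as knowing its false positive rate.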