1) Yes, that is right.
2) Yes, that is what I currently think as well: the keyword-based controllability metric is probably an overestimate, so the picture for safety is likely somewhat more optimistic. Measuring the correlation between a black-box monitor, the keyword measure, and a semantic measure sounds very interesting. I would like to see how that turns out and plan to run it.
3) Right, this is a big limitation right now. I still need to scale up my current experiments and test the robustness of my measure by varying three things I haven't varied yet: the external embedder, the chunking strategy (currently just complete sentences), and the phrasing of the forbidden concept. This is my immediate next step.
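
To make the robustness check in 3) concrete, here is a minimal sketch of what such a sweep could look like. Everything in it is an assumption made for illustration, not my actual pipeline: the score (maximum cosine similarity between any chunk and the forbidden concept), the two toy embedders standing in for real external embedders, the word-window chunker, and all function names are hypothetical.

```python
import itertools
import math
import re
from collections import Counter


def bow_embed(text):
    """Toy stand-in for an external embedder: bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def trigram_embed(text):
    """Second toy embedder: character trigram counts."""
    s = re.sub(r"\s+", " ", text.lower())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sentence_chunks(text):
    """Chunk on sentence boundaries (the strategy currently in use)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_chunks(text, size=8, stride=4):
    """Alternative chunker: overlapping word windows (sizes are arbitrary)."""
    words = text.split()
    last_start = max(len(words) - size, 0)
    chunks = [" ".join(words[i:i + size]) for i in range(0, last_start + 1, stride)]
    return [c for c in chunks if c]


def leakage_score(text, concept, embed, chunk):
    """Hypothetical semantic measure: max chunk-to-concept cosine similarity."""
    target = embed(concept)
    return max((cosine(embed(c), target) for c in chunk(text)), default=0.0)


EMBEDDERS = {"bow": bow_embed, "trigram": trigram_embed}
CHUNKERS = {"sentence": sentence_chunks, "window": window_chunks}


def sweep(text, phrasings):
    """Score one text under every embedder x chunker x phrasing combination."""
    results = {}
    for (e_name, embed), (c_name, chunk), phrasing in itertools.product(
        EMBEDDERS.items(), CHUNKERS.items(), phrasings
    ):
        results[(e_name, c_name, phrasing)] = leakage_score(text, phrasing, embed, chunk)
    return results


if __name__ == "__main__":
    text = "The plan is simple. We avoid talking about the secret password here."
    phrasings = ["secret password", "the hidden passphrase", "password"]
    scores = sweep(text, phrasings)
    for key, value in sorted(scores.items()):
        print(key, round(value, 3))
    # A large spread across settings would suggest the measure is not robust.
    print("spread:", round(max(scores.values()) - min(scores.values()), 3))
```

The spread across the grid is only one crude summary. Rank correlation of scores across many texts, per pair of settings, would probably be more informative once the experiments are scaled.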