PhD @ Stanford | Auditing LLM evaluation, incentives, and measurement without ground truth
Blog/About—https://zrobertson466920.github.io/about/
Hi, I think this is incorrect. I had to wait 7 days to write this comment and then almost forgot to. I wrote a comment critiquing a very long post (which was later removed) and was downvoted (by a single user, I think) after explaining why I wrote the comment with AI assistance. My understanding is that a single user with enough karma power can effectively "silence" any opinion they don't like by downvoting a few comments in an exchange.
I think the site has changed enough over the last several months that I am considering leaving. For me personally, choosing between having a conversation with a random commenter on this site vs. an AI model is just about a wash. I even hesitate to write this comment given how over-confident your comment seemed: posting it means I won't be able to interact with this site again for another week.
This is my endorsed review of the article.
This seems like a rhetorical question. Both of our top-level comments reach similar conclusions, but it sounds like you regret the time spent engaging with the OP to write yours. This took about 10 minutes, most of it spent writing this comment. What is your point?
At over 15k tokens, reading the full article requires significant time and effort. While it aims to provide comprehensive detail on QACI, much of this likely exceeds what is needed to convey the core ideas. The article could be streamlined to more concisely explain the motivation, give mathematical intuition, summarize the approach, and offer brief examples. Unnecessary elaborations could be removed or included as appendices. This would improve clarity and highlight the essence of QACI for interested readers.
My understanding is that the article summarizes a new AI alignment approach called QACI (Question-Answer Counterfactual Interval). QACI involves generating a factual "blob" tied to human values, along with a counterfactual "blob" that could replace it. Mathematical concepts like realityfluid and Loc() are used to identify the factual blob among counterfactuals. The goal is to simulate long reflection by iteratively asking the AI questions and improving its answers. QACI claims to avoid issues like boxing and embedded agency through formal goal specification.
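To pin down that intuition, here is a rough formalization in my own notation (a paraphrase, not the article's exact formalism; $\mathrm{Answer}$ is my placeholder for reading off the answer blob): write $\omega$ for a world-state, $q$ for the factual question blob, and $\mathrm{Loc}(q, \omega)$ for the realityfluid-weighted location of $q$ within $\omega$. Substituting a counterfactual blob $q'$ at that location defines the query

$$\mathrm{QACI}(q') = \mathrm{Answer}\big(\omega[\mathrm{Loc}(q, \omega) \mapsto q']\big)$$

and long reflection is simulated by iterating this map: each answer suggests a refined question, which is fed back in as the next counterfactual payload.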
While the article provides useful high-level intuition, closer review reveals limitations in QACI’s theoretical grounding. Key concepts like realityfluid need more rigor, and details are lacking on how embedded agency is avoided. There are also potential issues around approximation and vulnerabilities to adversarial attacks that require further analysis. Overall, QACI seems promising but requires more comparison with existing alignment proposals and formalization to adequately evaluate. The article itself is reasonably well-written, but the length and inconsistent math notation create unnecessary barriers.
So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users?
You are correct that this appears to stand in contrast to one of the key benefits of CIRL games: namely, that they allow the AI to continuously update towards the user's values. The argument I present is that ChatGPT can still learn something about the preferences of the user it is interacting with through in-context value learning. During deployment, this lets ChatGPT keep updating towards the user's values, much as in a CIRL game.
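To make that concrete, here is a toy sketch of CIRL-style in-context updating: a posterior over candidate user preference profiles is updated each time the user picks one response over another. This is my own illustration, not from the post; the profiles, features, and logistic choice model are all invented for the example.

```python
import math

# Candidate user preference profiles, each scoring response features.
# (Invented for illustration.)
PROFILES = {
    "concise": {"short": 2.0, "formal": 0.0},
    "formal":  {"short": 0.0, "formal": 2.0},
}
posterior = {name: 1.0 / len(PROFILES) for name in PROFILES}  # uniform prior

def score(profile_name, features):
    """Utility a profile assigns to a response with the given features."""
    weights = PROFILES[profile_name]
    return sum(weights.get(f, 0.0) for f in features)

def observe_choice(chosen, rejected):
    """Bayes-update the posterior after the user prefers `chosen`
    over `rejected`, using a logistic (Bradley-Terry) choice model."""
    global posterior
    for name in posterior:
        margin = score(name, chosen) - score(name, rejected)
        posterior[name] *= 1.0 / (1.0 + math.exp(-margin))
    total = sum(posterior.values())
    posterior = {k: v / total for k, v in posterior.items()}

# The user twice prefers the shorter candidate response...
observe_choice(chosen=["short"], rejected=["formal"])
observe_choice(chosen=["short"], rejected=["formal"])
print(posterior)  # ...so mass concentrates on the "concise" profile
```

The point is just that each in-context observation sharpens the estimate of the user's values, which is the continuous-updating property CIRL games are meant to provide.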
The reward signal comes from users, who rank candidate responses from ChatGPT. This is discussed more in OpenAI's announcement. I edited the post to clarify this.
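For readers who want the mechanics: the standard recipe for turning such rankings into a reward signal (my summary of the general InstructGPT-style setup, not a quote from the announcement) is to fit a reward model $r_\theta$ on pairwise comparisons:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big]$$

where $x$ is the prompt, $y_w$ the preferred response, $y_l$ the rejected one, and $\sigma$ the logistic function; the fitted $r_\theta$ then provides the reward for RL finetuning.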
Just wanted to give some validation. I left a comment on this post a while ago pointing out how one (or apparently a few) users can essentially downvote you however they like to silence opinions they don't agree with. Moderation is tricky, and it is important to remember why: most users of a web forum are lurkers, so trying to gather feedback on moderation policies has a biased-sampling problem. The irony of likely not being able to leave another comment or engage in discussion is not lost on me.
At first, I thought getting soft-banned meant my "contributions" weren't valuable. For context, I study AI and integrate it into my thinking, which hasn't been received well on this site. Ironically, not being able to interact with other people pushed me to explore deeper discussions with AI. For example, I gave this entire thread to Claude 3, and it agreed there were some changes to be made to the rate-limiting system.
It does seem concerning that, as a PhD student studying AI alignment, I was effectively pushed out of participating in discussions on LessWrong and the AI Alignment Forum by the automatic rate-limiting system and disagreements with senior users whose downvotes carry much more weight. On the other hand, compared to a few years ago during COVID, I now have colleagues and AI models that I share far more context with than users on this forum, so this just matters less to me. I return only because I am taking a class on social computing and am revisiting what makes for good and bad experiences.
Anyway, hopefully this gives you some solace. I would encourage you to seek other sources of validation. There are so many more options than you think! :)