elandgre

Karma: 53

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

elandgre, Beth Barnes and Marius Hobbhahn

2 Mar 2023 4:29 UTC

21 points

0 comments · 8 min read · LW link

elandgre 2 Jan 2023 20:42 UTC
1 point
in reply to: the gears to ascension’s comment on: On the Importance of Open Sourcing Reward Models
I don’t think I fully understand the objection. Is your concern that less competent organizations will overfit to the reward models during fine-tuning, and so produce worse models than OpenAI/Anthropic are able to train? I think this is a fair objection, and it is one argument for open-sourcing the full model.
My main goal with this post is to argue that it is good to “at least” open source the reward models, and that the benefits of doing so would far outweigh the costs, both societally and for the organizations doing the open-sourcing. I tend to think that completely unaligned models will get more backlash than imperfectly aligned ones, but maybe this is incorrect. I haven’t thought deeply about whether it is safe/good for everyone to open source the underlying capability model.

On the Importance of Open Sourcing Reward Models

elandgre 2 Jan 2023 19:01 UTC

18 points

5 comments · 6 min read · LW link

Reflection Mechanisms as an Alignment target: A follow-up survey

Marius Hobbhahn, elandgre and Beth Barnes

5 Oct 2022 14:03 UTC

21 points

2 comments · 7 min read · LW link

Reflection Mechanisms as an Alignment target: A survey

Marius Hobbhahn, elandgre and Beth Barnes

22 Jun 2022 15:05 UTC

32 points

1 comment · 14 min read · LW link