
tailcalled

Karma: 7,914

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled · Jun 10, 2024, 9:20 PM
29 points · 13 comments · 2 min read · LW link

Each Llama3-8b text uses a different "random" subspace of the activation space

tailcalled · May 22, 2024, 7:31 AM
3 points · 4 comments · 7 min read · LW link

[Question] Is deleting capabilities still a relevant research question?

tailcalled · May 21, 2024, 1:24 PM
15 points · 1 comment · 1 min read · LW link

Why I stopped being into basin broadness

tailcalled · Apr 25, 2024, 8:47 PM
16 points · 3 comments · 2 min read · LW link

Blessed information, garbage information, cursed information

tailcalled · Apr 18, 2024, 4:56 PM
23 points · 8 comments · 3 min read · LW link

Ackshually, many worlds is wrong

tailcalled · Apr 11, 2024, 8:23 PM
27 points · 42 comments · 4 min read · LW link

[GPT-4] On the Gradual Emergence of Mechanized Intellect: A Treatise from the Year 1924

tailcalled · Apr 1, 2024, 7:14 PM
11 points · 0 comments · 2 min read · LW link

Opinions survey 2 (with rationalism score at the end)

tailcalled · Feb 17, 2024, 12:03 PM
2 points · 11 comments · 1 min read · LW link (docs.google.com)

Opinions survey (with rationalism score at the end)

tailcalled · Feb 17, 2024, 12:41 AM
8 points · 14 comments · 1 min read · LW link (docs.google.com)

[Question] What are the known difficulties with this alignment approach?

tailcalled · Feb 11, 2024, 10:52 PM
18 points · 24 comments · 1 min read · LW link

Against Nonlinear (Thing Of Things)

tailcalled · Jan 18, 2024, 9:40 PM
58 points · 18 comments · 1 min read · LW link (thingofthings.substack.com)

[Question] Which investments for aligned-AI outcomes?

tailcalled · Jan 4, 2024, 1:28 PM
8 points · 9 comments · 2 min read · LW link

Practically A Book Review: Appendix to "Nonlinear's Evidence: Debunking False and Misleading Claims" (ThingOfThings)

tailcalled · Jan 3, 2024, 5:07 PM
111 points · 25 comments · 2 min read · LW link (thingofthings.substack.com)

[Question] Could there be "natural impact regularization" or "impact regularization by default"?

tailcalled · Dec 1, 2023, 10:01 PM
24 points · 6 comments · 1 min read · LW link

Utility is not the selection target

tailcalled · Nov 4, 2023, 10:48 PM
24 points · 1 comment · 1 min read · LW link

Contra Nora Belrose on Orthogonality Thesis Being Trivial

tailcalled · Oct 7, 2023, 11:47 AM
18 points · 21 comments · 1 min read · LW link

[Question] What are some good language models to experiment with?

tailcalled · Sep 10, 2023, 6:31 PM
16 points · 3 comments · 1 min read · LW link

Aumann-agreement is common

tailcalled · Aug 26, 2023, 8:22 PM
72 points · 33 comments · 7 min read · LW link · 1 review

A content analysis of the SQ-R questionnaire and a proposal for testing EQ-SQ theory

tailcalled · Aug 9, 2023, 1:51 PM
10 points · 2 comments · 13 min read · LW link

[Question] If I showed the EQ-SQ theory's findings to be due to measurement bias, would anyone change their minds about it?

tailcalled · Jul 29, 2023, 7:38 PM
23 points · 13 comments · 1 min read · LW link