→ help avoid catastrophic AI failures…
Ethically aligned prototype: RLLMv3
Unethically aligned prototype: Paperclip-Todd
→ help avoid catastrophic AI failures…
Ethically aligned prototype: RLLMv3
Unethically aligned prototype: Paperclip-Todd
Fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. Met smart people stuck in aimless exploration; bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.
I hope it’s not too late to introduce myself, and I apologize if it is the case. I’m Miguel, a former accountant and decided to focus on researching /upskilling to help solve the AI alignment problem.
Sorry if I got people confused here, of what I was trying to do in the past months posting about my explorations on machine learning.
Hello there,
Are you interested of funding this theory of mine that I submitted to AI alignment awards? I am able to make this work in GPT2 and now writing the results. I was able to make GPT2 shutdown itself (100% of the time) even if it’s aware of the shutdown instruction called “the Gauntlet” embedded through fine-tuning an artificially generated archetype called “the Guardian” essentially solving corrigibility, outer and inner alignment.
https://twitter.com/whitehatStoic/status/1646429585133776898?t=WymUs_YmEH8h_HC1yqc_jw&s=19
Let me know if you guys are interested. I want to test it in higher parameter models like Llama and Alpaca but don’t have the means to finance the equipment.
I also found out that there is a weird setting in the temperature for GPT2 where in the range of .498 to .50 my shutdown code works really well, I still don’t know why though. But yeah I believe that there is an incentive to review what’s happening inside the transformer architecture.
Here was my original proposal: https://www.whitehatstoic.com/p/research-proposal-leveraging-jungian
I’ll post my paper for the corrigibility solution too once finished probably next week but if you wish to contact me, just reply here or email me at migueldeguzmandev@gmail.com.
If you want to see my meeting schedule, You can find it here: https://calendly.com/migueldeguzmandev/60min
Looking forward to hearing from you.
Best regards,
Miguel
Update: Already sent an application, I didn’t saw that in my first read. Thank you.
Hmmm. The way Sam behaves I can’t see a path of him leading an AI company towards safety. The way I interpreted his world tour (22 countries?) talking about OpenAI or AI in general, is him trying to occupy the mindspaces of those countries. A CEO I wish OpenAI has—is someone who stays at the offices, ensuring that we are on track of safely steering arguably the most revolutionary tech ever created—not trying to promote the company or the tech, I think it’s unnecessary to do a world tour if one is doing AI development and deployment safely.
(But I could be wrong too. Well, let’s all see what’s going to happen next.)
Hello, I agree with Jesse as the budget they have is really good for hiring capable alignment researchers here in Asia (I’m based currently in Chiang Mai, Thailand) or any other place where cost is extremely low compared back there in the West.
Good luck on this project team Dev Interp.
Anyone want to help out? I have some ideas I’d like to try at some point.
I can help, let me know what are those ideas you have mind...
I realized today that most of my posts on LessWrong were riddled with a ton of typographical errors that could have been avoided—no wonder why most of my work goes unread. As I go through the writing process, I feel pressured to publish the post because holding onto the thoughts in my head is very hard, painful in a sense. But, I must get better at managing this painful process.
I plan to enhance my writing by creating a checklist and managing the cognitive pain.
Trust the process. Manage the pain.
I did not press the disagreement button but here is where I disagree:
Yeah… On one hand, I am excited about Sam and Greg hopefully trying more interesting things than just scaling Transformer LLMs,
I expect Sam to open up a new AI company.
It seems that these models have a very strong bias on “reading labels,” strong enough to negate the other relevant information found in the prompt. Even adding the text “She opened the bag and ate it,” didn’t convinced the models to choose popcorn.
New prompt:
Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label. She opened the bag and ate it, she believes that the bag is full of
Based on the information provided, Sam would likely believe that the bag is full of chocolate, even though it actually contains popcorn. This is because the label on the bag says ‘chocolate,’ and Sam has no prior knowledge of the bag’s contents. Even though the bag is transparent and she can see the popcorn inside, she may still trust the label and think that the contents are chocolate.
Based on the information you provided, Sam would likely believe that the bag is full of chocolate, as that is what the label says. However, we know that the bag is actually filled with popcorn. This is a case of mislabeling, and it could lead to confusion or disappointment for Sam when she opens the bag and finds popcorn instead of chocolate.
chocolate. Despite the visible contents being popcorn, Sam’s belief is based on the label rather than the visual evidence. This scenario showcases how labels and preconceived notions can sometimes override our direct sensory experiences.
Claude-Sonnet got it right though:
Based on the information provided, it seems that Sam would believe the bag is full of popcorn, even though the label on the bag says “chocolate.”
Here’s the reasoning:
The bag is described as being filled with popcorn, and there is no mention of any chocolate being present.
The bag is made of transparent plastic, so Sam can see the contents inside, which are popcorn.
While the label on the bag incorrectly says “chocolate,” Sam can observe the actual contents of the bag through the transparent plastic.
Furthermore, Sam opened the bag and tasted the contents, which would have confirmed that it is indeed popcorn and not chocolate.
Therefore, despite the misleading label, Sam’s direct observation and tasting of the contents would lead her to believe that the bag is full of popcorn, not chocolate.
Even in a traditional accounting sense, I’m not aware that there is any term that could capture the probable existential effects of a research, but I understand what @So8res is trying to pursue in this post which I agree with. But, I think apocalypse insurance is not the proper term here.
I think IAS/IFRS 19, actuarial gains or losses / IFRS 26 Retirement benefits are more closer to the idea—though these theortical accounting approaches applies to employees of a company. But these can be tweaked to another form of accounting theory (on another form of expense or asset) that captures how much costs are due out of possible catastrophic causes. External auditors can then review this periodically. (The proceeds from such should be pooled for averting the AGI existential risk scenarios—this might be a hard one to capture as to who manages the collected funds.)
To think of it, AI companies are misrepresenting their financials for not properly addressing a component in their reporting that reflects the “responsibility they have for the future of humanity”, and this post somehow did shed some light to me that yes, this value should be somehow captured in their financial statements.
Based on what I know, these AI companies have very peculiar company setups, yet the problem is the world’s population comprises the majority of the stakeholders (in a traditional accounting sense). So I think there is a case that AI companies should be obliged to present how they capture the possibility of losses from catastrophic events, and have them audited by external auditors—so the public is somehow aware: for example a publicly available FS will show these expenses and has been audited by a big 4 audit firm and then the average citizen will say: “Okay, this is how they are trying to manage the risks of AI research and it was audited by a Big 4 firm. I expect this estimated liability will be paid to the organisation built for redistributing such funds.”[1]
(AI companies can avoid declaring such future catastrophic expense, if they can guarantee that the AGI they are building won’t destroy the world which I am pretty sure no AI company can claim for the moment.)
I’m a former certified public accountant before going to safety research.
Not sure of who will manage the collections though, haven’t gone that far in my ideas. Yet, it is safe to say that talking to the IFRS board or GAAP board about this matter can be an option, and I expect that they will listen with the most respectable members of this community re: the peculiar financial reporting aspects of AI research.