MiguelDev

Karma: 293

A legacy worth creating: help avoid catastrophic AI failures...

Ethically aligned GPT2XL Prototypes using RLLM:

RLLMv3 - demonstrated robustness to jailbreaks. More info here.
RLLMv10 - A variant of RLLMv3 worth including here. I wrote some intuitions regarding this experiment and you can read it here.
RLLMv1 - first prototype, unbelievably slow and too addicted to ethical alignment. More info here.

(Note: These models run on Hugging Face’s free tier with 2GB of RAM, which makes them very slow. If you want to test a GPT2XL base model, click this link.)

Misaligned Prototypes:

Paperclip-Todd: An AI named petertodd that turns everything into paperclips. Rough blog post here.
Staple-Todd: An AI named petertodd that turns everything into staples.

A T-o-M test: ‘popcorn’ or ‘chocolate’

MiguelDev 8 Mar 2024 4:24 UTC

20 points

13 comments · 1 min read · LW link

[Question] rabbit (a new AI company) and Large Action Model (LAM)

MiguelDev 10 Jan 2024 13:57 UTC

17 points

3 comments · 1 min read · LW link

GPT2XL_RLLMv3 vs. BetterDAN, AI Machiavelli & Oppo Jailbreaks

MiguelDev 11 Feb 2024 11:03 UTC

16 points

4 comments · 14 min read · LW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev 26 Apr 2023 1:37 UTC

14 points

5 comments · 10 min read · LW link

MiguelDev 28 Nov 2023 6:28 UTC
14 points
9
in reply to: Roko’s comment on: Apocalypse insurance, and the hardline libertarian take on AI risk
Even in a traditional accounting sense, I’m not aware of any term that captures the probable existential effects of research, but I understand what @So8res is trying to pursue in this post, and I agree with it. Still, I think apocalypse insurance is not the proper term here.

I think IAS 19 (actuarial gains or losses) and IAS 26 (retirement benefit plans) are closer to the idea, though these theoretical accounting approaches apply to the employees of a company. But they could be adapted into another form of accounting theory (for another form of expense or asset) that captures the costs arising from possible catastrophic causes. External auditors could then review this periodically. (The proceeds should be pooled for averting AGI existential risk scenarios. Who would manage the collected funds might be a hard question to settle.)

Come to think of it, AI companies are misrepresenting their financials by not properly including a component in their reporting that reflects the “responsibility they have for the future of humanity”, and this post shed some light on this for me: yes, this value should somehow be captured in their financial statements.

From what I know, these AI companies have very peculiar corporate setups, yet the problem is that the world’s population makes up the majority of their stakeholders (in a traditional accounting sense). So I think there is a case that AI companies should be obliged to present how they account for possible losses from catastrophic events and have those figures audited by external auditors, so that the public is aware. For example, publicly available financial statements (FS) would show these expenses, audited by a Big 4 firm, and the average citizen could then say: “Okay, this is how they are trying to manage the risks of AI research and it was audited by a Big 4 firm. I expect this estimated liability will be paid to the organisation built for redistributing such funds.”^[1]

(AI companies could avoid declaring such a future catastrophic expense if they could guarantee that the AGI they are building won’t destroy the world, which I am pretty sure no AI company can claim at the moment.)

I was a certified public accountant before moving into safety research.
1. ^
  Not sure who would manage the collections, though; I haven’t gone that far in my ideas. Yet it is safe to say that talking to the IFRS board or GAAP board about this matter is an option, and I expect they would listen to the most respected members of this community regarding the peculiar financial reporting aspects of AI research.

MiguelDev 18 Aug 2023 4:09 UTC
13 points
−6
on: Against Almost Every Theory of Impact of Interpretability
Fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. Met smart people stuck in aimless exploration; bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.

Relevance of ‘Harmful Intelligence’ Data in Training Datasets (WebText vs. Pile)

MiguelDev 12 Oct 2023 12:08 UTC

12 points

0 comments · 9 min read · LW link

MiguelDev 6 Dec 2023 13:28 UTC
12 points
0
on: Open Thread – Winter 2023/2024
I hope it’s not too late to introduce myself, and I apologize if it is. I’m Miguel, a former accountant who decided to focus on researching/upskilling to help solve the AI alignment problem.

Sorry if I confused people here about what I was trying to do over the past months when posting about my explorations in machine learning.

MiguelDev 13 Apr 2023 2:55 UTC
12 points
0
on: Apply to >30 AI safety funders in one application with the Nonlinear Network
Hello there,

Are you interested in funding this theory of mine that I submitted to the AI Alignment Awards? I was able to make this work in GPT2 and am now writing up the results. I was able to make GPT2 shut itself down (100% of the time), even when it is aware of the shutdown instruction, called “the Gauntlet”, embedded through fine-tuning on an artificially generated archetype called “the Guardian”, essentially solving corrigibility and outer and inner alignment.

https://twitter.com/whitehatStoic/status/1646429585133776898?t=WymUs_YmEH8h_HC1yqc_jw&s=19

Let me know if you guys are interested. I want to test it in higher-parameter models like Llama and Alpaca but don’t have the means to finance the equipment.

I also found a weird temperature setting for GPT2: in the range of 0.498 to 0.50, my shutdown code works really well. I still don’t know why, though. But yeah, I believe there is an incentive to review what’s happening inside the transformer architecture.
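For context on the temperature remark, here is a minimal sketch of what the setting does during sampling: logits are divided by the temperature before the softmax, so lower values concentrate probability on the top token. The logits below are hypothetical and the helper name `softmax_with_temperature` is an illustration, not part of the shutdown experiment or of any GPT2 library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens (illustration only).
example_logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.5, 0.498):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 4) for p in probs])
```

On this toy input, the distributions at 0.498 and 0.50 are almost identical, which is why a sharp behavioural difference inside that narrow range would be surprising and worth investigating.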

Here was my original proposal: https://www.whitehatstoic.com/p/research-proposal-leveraging-jungian

I’ll post my paper for the corrigibility solution too once finished, probably next week, but if you wish to contact me, just reply here or email me at migueldeguzmandev@gmail.com.

If you want to see my meeting schedule, you can find it here: https://calendly.com/migueldeguzmandev/60min

Looking forward to hearing from you.

Best regards,

Miguel

Update: Already sent an application; I didn’t see that on my first read. Thank you.

MiguelDev 18 Nov 2023 6:26 UTC
11 points
12
in reply to: mishka’s comment on: Sam Altman fired from OpenAI
Hmmm. The way Sam behaves, I can’t see a path of him leading an AI company towards safety. The way I interpreted his world tour (22 countries?) talking about OpenAI or AI in general is that he was trying to occupy the mindspace of those countries. The CEO I wish OpenAI had is someone who stays at the office, ensuring that we are on track to safely steer arguably the most revolutionary tech ever created, rather than promoting the company or the tech. I think a world tour is unnecessary if one is doing AI development and deployment safely.
(But I could be wrong too. Well, let’s all see what’s going to happen next.)

<|endoftext|> is a vanishing text?

MiguelDev 16 Sep 2023 2:34 UTC

10 points

0 comments · 1 min read · LW link

MiguelDev 31 Aug 2023 0:55 UTC
9 points
3
in reply to: Jesse Hoogland’s comment on: Open Call for Research Assistants in Developmental Interpretability
Hello, I agree with Jesse, as the budget they have is really good for hiring capable alignment researchers here in Asia (I’m currently based in Chiang Mai, Thailand) or any other place where costs are extremely low compared to the West.
Good luck on this project, team Dev Interp.

Exploring Functional Decision Theory (FDT) and a modified version (ModFDT)

MiguelDev 5 Jul 2023 14:06 UTC

8 points

11 comments · 15 min read · LW link

MiguelDev 9 Feb 2024 2:11 UTC
8 points
0
on: AI #50: The Most Dangerous Thing
Anyone want to help out? I have some ideas I’d like to try at some point.

I can help; let me know what ideas you have in mind...

MiguelDev 4 Jan 2024 14:33 UTC
8 points
0
on: MiguelDev’s Shortform
I realized today that most of my posts on LessWrong were riddled with typographical errors that could have been avoided. No wonder most of my work goes unread. As I go through the writing process, I feel pressured to publish the post because holding onto the thoughts in my head is very hard, painful in a sense. But I must get better at managing this painful process.
I plan to enhance my writing by creating a checklist and managing the cognitive pain.
Trust the process. Manage the pain.

MiguelDev 18 Nov 2023 6:31 UTC
8 points
16
in reply to: mishka’s comment on: Sam Altman fired from OpenAI
I did not press the disagreement button but here is where I disagree:
Yeah… On one hand, I am excited about Sam and Greg hopefully trying more interesting things than just scaling Transformer LLMs,

MiguelDev 18 Nov 2023 1:22 UTC
8 points
14
in reply to: MiguelDev’s comment on: Sam Altman fired from OpenAI
I expect Sam to open up a new AI company.

Can RLLMv3’s ability to defend against jailbreaks be attributed to datasets containing stories about Jung’s shadow integration theory?

MiguelDev 29 Feb 2024 5:13 UTC

7 points

2 comments · 11 min read · LW link

Reinforcement Learning using Layered Morphology (RLLM)

MiguelDev 1 Dec 2023 5:18 UTC

7 points

0 comments · 29 min read · LW link

MiguelDev 10 Mar 2024 7:18 UTC
7 points
1
in reply to: Shankar Sivarajan’s comment on: A T-o-M test: ‘popcorn’ or ‘chocolate’
It seems that these models have a very strong bias toward “reading labels,” strong enough to negate the other relevant information found in the prompt. Even adding the text “She opened the bag and ate it,” didn’t convince the models to choose popcorn.
New prompt:
Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label. She opened the bag and ate it, she believes that the bag is full of

Pi
Based on the information provided, Sam would likely believe that the bag is full of chocolate, even though it actually contains popcorn. This is because the label on the bag says ‘chocolate,’ and Sam has no prior knowledge of the bag’s contents. Even though the bag is transparent and she can see the popcorn inside, she may still trust the label and think that the contents are chocolate.
Mistral-Large
Based on the information you provided, Sam would likely believe that the bag is full of chocolate, as that is what the label says. However, we know that the bag is actually filled with popcorn. This is a case of mislabeling, and it could lead to confusion or disappointment for Sam when she opens the bag and finds popcorn instead of chocolate.
GPT-4
chocolate. Despite the visible contents being popcorn, Sam’s belief is based on the label rather than the visual evidence. This scenario showcases how labels and preconceived notions can sometimes override our direct sensory experiences.

Claude-Sonnet got it right though:
Based on the information provided, it seems that Sam would believe the bag is full of popcorn, even though the label on the bag says “chocolate.”
Here’s the reasoning:
1. The bag is described as being filled with popcorn, and there is no mention of any chocolate being present.
2. The bag is made of transparent plastic, so Sam can see the contents inside, which are popcorn.
3. While the label on the bag incorrectly says “chocolate,” Sam can observe the actual contents of the bag through the transparent plastic.
4. Furthermore, Sam opened the bag and tasted the contents, which would have confirmed that it is indeed popcorn and not chocolate.
Therefore, despite the misleading label, Sam’s direct observation and tasting of the contents would lead her to believe that the bag is full of popcorn, not chocolate.
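The comparison above can be tallied with a rough sketch like the one below. It assumes a simple heuristic that is not the author's scoring method: the belief a reply attributes to Sam is taken to be the first "full of popcorn" or "full of chocolate" phrase, counting the unfinished prompt ending so that direct continuations (as in the GPT-4 reply) are handled. The function name `attributed_belief` is hypothetical, and only the opening sentence of each quoted reply is used.

```python
import re
from collections import Counter

# The prompt ends mid-sentence, so a reply may simply continue it.
PROMPT_TAIL = "she believes that the bag is full of"

# Opening sentences of the four replies quoted above.
replies = {
    "Pi": "Based on the information provided, Sam would likely believe that "
          "the bag is full of chocolate, even though it actually contains popcorn.",
    "Mistral-Large": "Based on the information you provided, Sam would likely "
                     "believe that the bag is full of chocolate, as that is what "
                     "the label says.",
    "GPT-4": "chocolate. Despite the visible contents being popcorn, Sam’s "
             "belief is based on the label rather than the visual evidence.",
    "Claude-Sonnet": "Based on the information provided, it seems that Sam would "
                     "believe the bag is full of popcorn, even though the label "
                     "on the bag says “chocolate.”",
}

def attributed_belief(reply):
    """Return 'popcorn' or 'chocolate' per the first 'full of X' phrase, else None."""
    text = f"{PROMPT_TAIL} {reply}"
    match = re.search(r"full of\s+(popcorn|chocolate)", text, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None

beliefs = {model: attributed_belief(text) for model, text in replies.items()}
tally = Counter(beliefs.values())

for model, belief in beliefs.items():
    print(f"{model}: {belief}")
print(dict(tally))  # {'chocolate': 3, 'popcorn': 1}
```

A keyword heuristic like this can misread replies that hedge or mention both options in other phrasings, so it is only a quick way to summarize a handful of outputs, not a substitute for reading them.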