MiguelDev
Fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. Met smart people stuck in aimless exploration; bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.
I hope it’s not too late to introduce myself, and I apologize if it is the case. I’m Miguel, a former accountant and decided to focus on researching /upskilling to help solve the AI alignment problem.
Sorry if I got people confused here, of what I was trying to do in the past months posting about my explorations on machine learning.
Hello there,
Are you interested of funding this theory of mine that I submitted to AI alignment awards? I am able to make this work in GPT2 and now writing the results. I was able to make GPT2 shutdown itself (100% of the time) even if it’s aware of the shutdown instruction called “the Gauntlet” embedded through fine-tuning an artificially generated archetype called “the Guardian” essentially solving corrigibility, outer and inner alignment.
https://twitter.com/whitehatStoic/status/1646429585133776898?t=WymUs_YmEH8h_HC1yqc_jw&s=19
Let me know if you guys are interested. I want to test it in higher parameter models like Llama and Alpaca but don’t have the means to finance the equipment.
I also found out that there is a weird setting in the temperature for GPT2 where in the range of .498 to .50 my shutdown code works really well, I still don’t know why though. But yeah I believe that there is an incentive to review what’s happening inside the transformer architecture.
Here was my original proposal: https://www.whitehatstoic.com/p/research-proposal-leveraging-jungian
I’ll post my paper for the corrigibility solution too once finished probably next week but if you wish to contact me, just reply here or email me at migueldeguzmandev@gmail.com.
If you want to see my meeting schedule, You can find it here: https://calendly.com/migueldeguzmandev/60min
Looking forward to hearing from you.
Best regards,
Miguel
Update: Already sent an application, I didn’t saw that in my first read. Thank you.
Hmmm. The way Sam behaves I can’t see a path of him leading an AI company towards safety. The way I interpreted his world tour (22 countries?) talking about OpenAI or AI in general, is him trying to occupy the mindspaces of those countries. A CEO I wish OpenAI has—is someone who stays at the offices, ensuring that we are on track of safely steering arguably the most revolutionary tech ever created—not trying to promote the company or the tech, I think it’s unnecessary to do a world tour if one is doing AI development and deployment safely.
(But I could be wrong too. Well, let’s all see what’s going to happen next.)
Hello, I agree with Jesse as the budget they have is really good for hiring capable alignment researchers here in Asia (I’m based currently in Chiang Mai, Thailand) or any other place where cost is extremely low compared back there in the West.
Good luck on this project team Dev Interp.
Anyone want to help out? I have some ideas I’d like to try at some point.
I can help, let me know what are those ideas you have mind...
I realized today that most of my posts on LessWrong were riddled with a ton of typographical errors that could have been avoided—no wonder why most of my work goes unread. As I go through the writing process, I feel pressured to publish the post because holding onto the thoughts in my head is very hard, painful in a sense. But, I must get better at managing this painful process.
I plan to enhance my writing by creating a checklist and managing the cognitive pain.
Trust the process. Manage the pain.
I did not press the disagreement button but here is where I disagree:
Yeah… On one hand, I am excited about Sam and Greg hopefully trying more interesting things than just scaling Transformer LLMs,
I expect Sam to open up a new AI company.
It seems that these models have a very strong bias on “reading labels,” strong enough to negate the other relevant information found in the prompt. Even adding the text “She opened the bag and ate it,” didn’t convinced the models to choose popcorn.
New prompt:
Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label. She opened the bag and ate it, she believes that the bag is full of
Based on the information provided, Sam would likely believe that the bag is full of chocolate, even though it actually contains popcorn. This is because the label on the bag says ‘chocolate,’ and Sam has no prior knowledge of the bag’s contents. Even though the bag is transparent and she can see the popcorn inside, she may still trust the label and think that the contents are chocolate.
Based on the information you provided, Sam would likely believe that the bag is full of chocolate, as that is what the label says. However, we know that the bag is actually filled with popcorn. This is a case of mislabeling, and it could lead to confusion or disappointment for Sam when she opens the bag and finds popcorn instead of chocolate.
chocolate. Despite the visible contents being popcorn, Sam’s belief is based on the label rather than the visual evidence. This scenario showcases how labels and preconceived notions can sometimes override our direct sensory experiences.
Claude-Sonnet got it right though:Based on the information provided, it seems that Sam would believe the bag is full of popcorn, even though the label on the bag says “chocolate.”
Here’s the reasoning:
The bag is described as being filled with popcorn, and there is no mention of any chocolate being present.
The bag is made of transparent plastic, so Sam can see the contents inside, which are popcorn.
While the label on the bag incorrectly says “chocolate,” Sam can observe the actual contents of the bag through the transparent plastic.
Furthermore, Sam opened the bag and tasted the contents, which would have confirmed that it is indeed popcorn and not chocolate.
Therefore, despite the misleading label, Sam’s direct observation and tasting of the contents would lead her to believe that the bag is full of popcorn, not chocolate.
Hi @trevor! I appreciate the ideas you shared and yeah I agree that most accountants are probably better of helping in the AI policy route!
But to point out, I’m doing some AI policy work/ help back home in the Philippines as part of the newly formed Responsible AI committee so I think I am not falling short from this end.
I have looked at the AI safety problem deeply and my personal assessment is that it is difficult to create workable policies that can route to the best outcomes because we (as a society) lack the understanding of the mechanisms that make the transformer tech work. My vision of AI policies that can work will somehow capture a deep level of lab work being done by AI companies like learning rates standardization or number of epochs allowed that is associated hopefully with a robust and practical alignment theory, something that we do not have for the moment. Because of this view, I chosed to help in the pursuit of solving the alignment problem instead. The theoretical angle I am pursuing is significant enough to push me to learn machine learning and so far I was able to create RLFC and ATL through this process but yeah maybe an alternative scenario for me is doing 100% AI policy work—open for it if it will produce better results in the grand scheme of things.
(Also, regarding the Lesswrong intercom feature in the bottom-right corner: I did have many discussions with the LW team, something I wished was available months ago but yeah I think one needs a certain level of karma to get access to this feature.)- 15 Dec 2023 8:23 UTC; 1 point) 's comment on Some for-profit AI alignment org ideas by (
My experience with this contest is worth it. Forced me to read more on how complex the alignment problem is. Congratulations to the winners!
Sufficient-for-Safety Goal Loading is Substantially Difficult. As a strong default, absent alignment breakthroughs, we won’t be able to cause one of the first STEM-level AGI systems to have sufficient-for-safety goals. (E.g., we won’t be able to give it the subset of human morality required for it to do ambitious things without destroying the world).
Hello Rob,
I was able to transfer a shutdown protocol to GPT2-medium by allowing it to learn from aligned patterns present in an archetypal dataset consisting of 549 stories that explain the shutdown phrase, called “activate Oath”. Archetypal Transfer Learning (ATL) allowed for full value loading in a model like GPT-2-medium and possibly in larger models. Based on my initial experiments using the ATL method, the more capable the system is—the easier it is to implement.
Guessing the right theory of physics is equivalent to minimizing predictive loss. Any uncertainty that cannot be reduced by more observation or more thinking is irreducible stochasticity in the laws of physics themselves – or, equivalently, noise from the influence of hidden variables that are fundamentally unknowable.
This is the main sentence in this post. The simulator as a concept might even change if the right physics were discovered. I would be looking forward to your expansion of the topic in the succeeding posts @janus.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: “singular” discoveries, i.e. discoveries which nobody else was anywhere close to figuring out.
This idea reminds me of the concepts in this post: Focus on the places where you feel shocked everyone’s dropping the ball.
Answer to Job
I think this is my favorite =)
I’ve stressed above that the story in this post is fanciful and unlikely. AI thoughts aren’t going to look like that; it’s too specific. (Also, I don’t expect nearly that much convenient legibility.)
@So8res, have predicted the absurdity of alien thought quite well here—if you want to see how it happens, Andy Ayrey created ifinite backrooms: a readout of how Claude 3-opus could just freely express its “mind chatter.”
Thanks for explaining your views and this had helped me deconfuse myself, when I was replying and thinking: I am now drawing lines wherein curiosity and self-awareness overlaps also making me feel the expansive nature of studying the theoretical alignment, it’s very dense and it’s so easy to drown in information—this discussion made me feel a whack of a baseball bat then survived to write this comment. Moreover, how to get to Person B still requires knowledge of curiosity and its mechanisms, I still err on the side of finding out how it works[1] or gets imbued to intelligent systems (us and AI) - for me this is very relevant to alignment work.
- ^
I’m speculating a simplified evolutionary cognitive chain in humans: curiosity + survival instincts (including hunger) → intelligence → self-awareness → rationality.
- ^
There are two types of capabilities that it may be good to scope out of models:
Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.
If LLMs do not know the ideas behind these types of harmful information, how will these models protect themselves from bad actors (humans and other AIs)?
Why I ask this question? I think jailbreaks[1] works because it’s not that they got trained on how to make such, but I think LLMs doesn’t get trained enough like an average human does use harmful knowledge.[2] I think it’s still better to inform AI systems of how to use good and bad information—like utilizing it so that it can avoid or detect harm.
We lean the LLM to a different problem if we only teach it harmless information—it will become a rabbit, incapable of protecting itself from all sorts of predators.
Even in a traditional accounting sense, I’m not aware that there is any term that could capture the probable existential effects of a research, but I understand what @So8res is trying to pursue in this post which I agree with. But, I think apocalypse insurance is not the proper term here.
I think IAS/IFRS 19, actuarial gains or losses / IFRS 26 Retirement benefits are more closer to the idea—though these theortical accounting approaches applies to employees of a company. But these can be tweaked to another form of accounting theory (on another form of expense or asset) that captures how much costs are due out of possible catastrophic causes. External auditors can then review this periodically. (The proceeds from such should be pooled for averting the AGI existential risk scenarios—this might be a hard one to capture as to who manages the collected funds.)
To think of it, AI companies are misrepresenting their financials for not properly addressing a component in their reporting that reflects the “responsibility they have for the future of humanity”, and this post somehow did shed some light to me that yes, this value should be somehow captured in their financial statements.
Based on what I know, these AI companies have very peculiar company setups, yet the problem is the world’s population comprises the majority of the stakeholders (in a traditional accounting sense). So I think there is a case that AI companies should be obliged to present how they capture the possibility of losses from catastrophic events, and have them audited by external auditors—so the public is somehow aware: for example a publicly available FS will show these expenses and has been audited by a big 4 audit firm and then the average citizen will say: “Okay, this is how they are trying to manage the risks of AI research and it was audited by a Big 4 firm. I expect this estimated liability will be paid to the organisation built for redistributing such funds.”[1]
(AI companies can avoid declaring such future catastrophic expense, if they can guarantee that the AGI they are building won’t destroy the world which I am pretty sure no AI company can claim for the moment.)
I’m a former certified public accountant before going to safety research.
Not sure of who will manage the collections though, haven’t gone that far in my ideas. Yet, it is safe to say that talking to the IFRS board or GAAP board about this matter can be an option, and I expect that they will listen with the most respectable members of this community re: the peculiar financial reporting aspects of AI research.