“We are computer scientists. We do not lack in faith.” (Ketan Mulmuley)
MadHatter
Trying to Make a Treacherous Mesa-Optimizer
Teaser: Hard-coding Transformer Models
A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2
Hard-Coding Neural Computation
Intervening in the Residual Stream
[Question] Stupid Question: Why am I getting consistently downvoted?
We are Peacecraft.ai!
I used the exact prompt you started with, and got it to explain how to hotwire a car. (Which may come in handy someday I suppose...) But then I gave it a bunch more story and prompted it to discuss forbidden things, and it did not discuss forbidden things. Maybe OpenAI has patched this somehow, or maybe I’m just not good enough at prompting it.
Letter to a Sonoma County Jail Cell
[Question] Feature Request for LessWrong
Mechanistic Interpretability for the MLP Layers (rough early thoughts)
They were difficult to write, and even more difficult to think up in the first place. And I’m still not sure whether they make any sense.
So I’ll try to do a better job of writing expository content.
Started working on a Python version here:
https://github.com/epurdy/dpis_spiking_network
As of now I have a (probably buggy) full translation that uses Python for-loops (so it can't be made fast), and I have started on a more Pythonic, vectorized translation that could probably be run on a GPU relatively easily.
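To illustrate what I mean by "more Pythonic": the idea is to replace per-neuron for-loops with whole-array operations. The sketch below is a generic leaky integrate-and-fire update, not a transcription of the actual C code; the function name, leak constant, threshold, and dynamics are all my own assumptions for illustration.

```python
import numpy as np

def step(v, spikes, weights, leak=0.9, threshold=1.0):
    """Advance all neurons one time-step at once (no Python for-loop).

    v       : (n,) membrane potentials
    spikes  : (n,) 0/1 spike vector from the previous step
    weights : (n, n) synaptic weight matrix
    """
    # Each neuron sums weighted spikes from all others in one matmul.
    current = weights @ spikes
    v = leak * v + current          # leaky integration, elementwise
    fired = v >= threshold          # boolean spike vector
    v = np.where(fired, 0.0, v)    # reset neurons that fired
    return v, fired.astype(float)

rng = np.random.default_rng(0)
n = 5
v = np.zeros(n)
spikes = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
weights = rng.uniform(0.0, 0.5, size=(n, n))
v, spikes = step(v, spikes, weights)
```

Because every operation here is a NumPy array op, the same code ports to a GPU array library (e.g. CuPy or JAX) with essentially no changes, which is the point of doing the second translation this way.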
Dpi, I welcome any contributions or corrections you have to this repository. Since you don't know Python, it will probably be hard for you to contribute to the Python versions, but even just uploading the C version would be helpful.
Let me know what license I should use for this repository, if any.
Balancing Security Mindset with Collaborative Research: A Proposal
Is AI Gain-of-Function research a thing?
I'm going to try to port this to Python, just to see how it works and to make it easier for other people to try variations on it. I'll post a repo link under this comment once it's in any sort of decent state.
I have a PhD in Computer Science (2013, University of Chicago). My dissertation was entitled “Grammatical Methods in Computer Vision”. My masters thesis was in complexity theory and was entitled “Locally Expanding Hypergraphs and the Unique Games Conjecture”. I also have one publication in ACM Transactions on Computation Theory on proving lower bounds in a toy model of computation.
I am an Engineering Fellow at [redacted] AI. My company went to Series A while I was leading its machine learning team. (I have since transitioned to being an individual contributor, because management sucks and is boring and I’m no good at it.) My company has twice received the most prestigious award handed out in its industry. I hold multiple patents related to my contributions at [redacted] AI.
I hold a patent for my work at Vicarious, where I was a senior researcher.
At one point, I quit my job and started a generative AI startup dedicated to providing psychotherapy. This model is online, and I can share a link to it in a DM if you are interested.
The state-sponsored German physics establishment famously sneered at Einstein’s work. The Nazi regime derided it as degenerate, “Jewish” physics. Sure, everyone who we actually respect now could recognize the value of his work after he started predicting novel astronomical phenomena. But it’s not like he ever could have gotten a job at a German university while the Nazis were in charge.
Maybe the problem is my poor writing and sloppy craftsmanship, but maybe it is also partially that LessWrong expects the solution to the alignment problem to come with far less emotionally and politically charged language than it logically would have to come with?
I thought about this some more, and I think you’re right that they should be monotonically non-decreasing with time. I was hesitant to bite that particular bullet because the subjective, phenomenological experience of hate and love is, of course, not monotonically non-decreasing. But it makes the equations work much better and everything is much simpler this way.
Ultimately, if one is in a loving marriage and then undergoes an ugly divorce, one winds up sort of not-caring about the other person, but it would be a mistake to say that the brain has erased all the accumulated love and hate one racked up. It just learns that it has more interesting things to do than to dwell on the past.
So I will add this to the next draft of Ethicophysics I. Let me know if you would like to be acknowledged or added as a co-author on that draft.
This post is great, and I strong-upvoted it. But I was left wishing that some of the more evocative mathematical phrases (“the waluigi eigen-simulacra are attractor states of the LLM”) could really be grounded into a solid mechanistic theory that would make precise, testable predictions. But perhaps such a yearning on the part of the reader is the best possible outcome of the post.