Fixed, ty!
Danielle Ensign
Yes feel free!
You can browse the raw outputs here: https://www.phylliida.dev/modelwelfare/bailstudy/harmdatasetvis.html (any of interest you can share by URL)
The data for that is hosted here: https://github.com/Phylliida/BailStudyData
That data is in a bit of a weird format; it was converted by this script:
https://github.com/Phylliida/BailStudy/blob/main/bailstudy/oldCodePorting.py
And here’s a gdrive of the data gathered by our BailStudy repo
https://drive.google.com/drive/folders/1nHT06qSBcbfQSCL1orEhG4afxGm3ecJk?usp=sharing
The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
I think for those cases you’re better off using standard methods (multiple choice etc.); this technique is only useful when paired positive/negative data is more difficult to create (like writing imitation).
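For concreteness, the standard paired-data baseline alluded to here is often a difference-of-means steering vector: average the activations on positive examples, subtract the average on negative examples. A minimal sketch with synthetic activations (the shapes, planted direction, and sample counts are illustrative stand-ins, not anything from the linked code):

```python
import numpy as np

def mean_difference_steering_vector(pos_acts, neg_acts):
    """Difference-of-means steering vector from paired positive/negative
    activation sets, each shaped [n_examples, d_model]."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize; scale is chosen at steering time

# Synthetic stand-ins for residual-stream activations on contrastive prompts:
# positive examples are shifted along a planted "concept" direction.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
pos = rng.normal(size=(100, d_model)) + direction
neg = rng.normal(size=(100, d_model))

v = mean_difference_steering_vector(pos, neg)
# Cosine similarity between the recovered vector and the planted direction
cos = v @ direction / np.linalg.norm(direction)
```

With enough paired examples the recovered vector aligns closely with the planted direction, which is exactly why this baseline is hard to beat when paired data is cheap to make.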
Unsupervised Activation Steering: Find a steering vector that best represents any set of text data
Seems like this could be addressed by filtering out comments that use evidence or personal examples from your dataset.
If that’s too intense, filtering responses to remove personal examples and checking sources shouldn’t be too bad? But maybe you’d just end up with a model that tries to subvert the filter/draw misleading conclusions from sources instead of actually being helpful…
There are certain cases where pure gradient-based attributions predictably don’t work (most notably when a softmax is saturated)
Do you have a source or writeup somewhere on this? (Or would you mind explaining more, or giving some examples where this is true?) Is this issue actually something that comes up for modern-day LLMs?
In my observations it works fine for the toy tasks people have tried it on. The challenge seems to be in interpreting the attributions, not issues with the attributions themselves.
In my experience gradient-based attributions (especially if you use integrated gradients) are almost identical to the attributions you get from ablating away each component. It’s kind of crazy, but it’s the reason people use edge-attribution patching over older approaches like ACDC.
Look at page 15 of https://openreview.net/forum?id=lq7ZaYuwub (left: gradient attributions; right: attributions from ablating each component). This is for Mamba, but I’ve observed similar things for transformers.
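The near-agreement described above can be illustrated on a toy differentiable "model" (my own toy setup, not the paper's): the first-order estimate of ablating component i is grad_i × (0 − h_i), and in a near-linear regime it closely tracks the true ablation effect.

```python
import numpy as np

# Toy "model": scalar output read off component activations h via f(h) = w . tanh(h),
# a mildly nonlinear readout standing in for a network's downstream computation.
rng = np.random.default_rng(0)
n = 8
w = rng.normal(size=n)
h = 0.3 * rng.normal(size=n)  # small activations -> near-linear regime

def f(h):
    return w @ np.tanh(h)

grad = w * (1.0 - np.tanh(h) ** 2)  # analytic df/dh at the clean activations

# True effect of zero-ablating each component, one at a time
ablation_effects = np.array(
    [f(np.where(np.arange(n) == i, 0.0, h)) - f(h) for i in range(n)]
)
# Gradient-based (attribution-patching-style) first-order estimate of the same
linear_attrib = grad * (0.0 - h)

corr = np.corrcoef(ablation_effects, linear_attrib)[0, 1]
```

Here `corr` comes out close to 1: the gradient estimate gets every sign right and roughly the right magnitudes, which is the pattern the attribution-vs-ablation comparison in the paper shows. The mismatch grows when activations sit in saturated regions of the nonlinearity, matching the softmax-saturation caveat raised earlier in the thread.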
fixed :)
Fixed, thank you!
Ophiology (or, how the Mamba architecture works)
GPT-4 generated the following:
Review 2:
Sophia L.
B.R., United States
4.5 stars
A cozy, eco-friendly gem in the city! Green Street Cafe is my new go-to spot for catching up with friends or enjoying a quiet afternoon with a good book. Their plant-based menu offers a delightful variety of options—I absolutely loved their vegan BLT and lavender latte. The only downside was the slightly slow service, but the staff was so warm and attentive that it hardly mattered. Can’t wait to visit again!
Review 3:
Michael N.
T.S., United States
4 stars
Stopped by Green Street Cafe for a quick bite and was pleasantly surprised by the charming atmosphere and delicious food. The spinach and feta quiche was to die for, and the iced chai latte had just the right amount of spice. The place was buzzing with bees, which added a unique touch to the experience. Although the Wi-Fi was a bit spotty, I’d still recommend this spot for a relaxing break or casual meeting.
Review 4:
Emily P.
D.W., United States
5 stars
Green Street Cafe is my new favorite spot for brunch! The cafe’s bright, inviting interior, complete with lush greenery and an eco-conscious design, makes for the perfect weekend retreat. I can’t get enough of their avocado toast and freshly squeezed orange juice. The bees add a quirky touch, and their presence speaks to the cafe’s commitment to sustainability. Fantastic service, too—the staff is always smiling and eager to help. Highly recommended!
Review 5:
David T.
G.J., United States
3.5 stars
While the Green Street Cafe has a charming ambiance and friendly staff, the food and drinks were a bit hit-or-miss. The honey lemonade was fantastic, but the cappuccino I ordered was lukewarm and lacked flavor. My wife enjoyed her quinoa salad, but my vegetable panini was soggy. The bees are an interesting touch, but they may not be for everyone. I might give this place another shot, but I’ll temper my expectations next time.
I highly recommend this video for those wanting a more detailed analysis of the pros and cons of worker co-ops
Things that it can probably do sometimes, but will fail on some inputs:
Factor numbers
Solve NP-Complete or harder problems
Execute code
There are other “tail-end” tasks like this that should eventually become the hardest bits, the ones optimization spends the most time on once it has figured everything else out.
If we knew (for some reason) that a system had no useful abstractions (or at least, no small ones), what could we say about that system? Does it reduce to some vacuous thing? Or does it require the system to be adversarial in nature?
I think this neglects an important aspect of checks on power: functioning feedback loops.
Your model seems to be “as long as the good people can defend their power, then the system is good,” but I think every person fails in some ways, and a more important criterion for successful leadership is the ability to get feedback about what’s going wrong (or right) and iterate.
If a system no longer accepts critique (or actively selects against it), that’s very likely a sign things have gone wrong. Ideally critique should be embraced and encouraged, and any organization’s first concerns should be to set up ways to maintain healthy feedback cycles and decrease blind spots.