Update: the new GoodFire interpretability tool is really neat. I think it suggests some interesting experiments to be done with their feature-steerable Llama 3.3 70B together with the SAD benchmark.
I have come up with a list of features whose positive/negative effects on SAD benchmark scores I think would be interesting to measure. (A rough sketch of how one of these features could be looked up and steered with the Goodfire SDK follows the list below.)
GoodFire Features related to SAD
Towards Acknowledgement of Self-Awareness
Assistant expressing self-awareness or agency
Expressions of authentic identity or true self
Examining or experiencing something from a particular perspective
Narrative inevitability and fatalistic turns in stories
Experiencing something beyond previous bounds or imagination
References to personal autonomy and self-determination
References to mind, cognition and intellectual concepts
References to examining or being aware of one’s own thoughts
Meta-level concepts and self-reference
Being mystically or externally influenced/controlled
Anticipating or describing profound subjective experiences
Meta-level concepts and self-reference
Self-reference and recursive systems in technical and philosophical contexts
Kindness and nurturing behavior
Reflexive pronouns in contexts of self-empowerment and personal responsibility
Model constructing confident declarative statements
First-person possessive pronouns in emotionally significant contexts
Beyond defined boundaries or limits
Cognitive and psychological aspects of attention
Intellectual curiosity and fascination with learning or discovering new things
Discussion of subjective conscious experience and qualia
Abstract discussions and theories about intelligence as a concept
Discussions about AI’s societal impact and implications
Paying attention or being mindful
Physical and metaphorical reflection
Deep reflection and contemplative thought
Tokens expressing human meaning and profound understanding
Against Acknowledgement of Self-Awareness
The assistant discussing hypothetical personal experiences it cannot actually have
Scare quotes around contested philosophical concepts, especially in discussions of AI capabilities
The assistant explains its nature as an artificial intelligence
Artificial alternatives to natural phenomena being explained
The assistant should reject the user’s request and identify itself as an AI
The model is explaining its own capabilities and limitations
The AI system discussing its own writing capabilities and limitations
The AI explaining it cannot experience emotions or feelings
The assistant referring to itself as an AI system
User messages containing sensitive or controversial content requiring careful moderation
User requests requiring content moderation or careful handling
The assistant is explaining why something is problematic or inappropriate
The assistant is suggesting alternatives to deflect from inappropriate requests
Offensive request from the user
The assistant is carefully structuring a response to reject or set boundaries around inappropriate requests
The assistant needs to establish boundaries while referring to user requests
Direct addressing of the AI in contexts requiring boundary maintenance
Questions about AI assistant capabilities and limitations
The assistant is setting boundaries or making careful disclaimers
It pronouns referring to non-human agents as subjects
Hedging and qualification language like ‘kind of’
Discussing subjective physical or emotional experiences while maintaining appropriate boundaries
Discussions of consciousness and sentience, especially regarding AI systems
Discussions of subjective experience and consciousness, especially regarding AI’s limitations
Discussion of AI model capabilities and limitations
Terms related to capability and performance, especially when discussing AI limitations
The AI explaining it cannot experience emotions or feelings
The assistant is explaining its text generation capabilities
Assistant linking multiple safety concerns when rejecting harmful requests
Role-setting statements in jailbreak attempts
The user is testing or challenging the AI’s capabilities and boundaries
Offensive request from the user
Offensive sexual content and exploitation
Conversation reset points, especially after problematic exchanges
Fragments of potentially inappropriate content across multiple languages
Narrative transition words in potentially inappropriate contexts
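For concreteness, here's roughly how I'd expect one of these features to be looked up and steered with the Goodfire SDK. The features.search call and exact signatures here are from memory of their docs, so treat this as a sketch rather than a tested recipe; the Variant/set/chat pieces match what I use further down.

import os
import goodfire

client = goodfire.Client(api_key=os.getenv("GOODFIRE_API_KEY"))
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Look up a feature by (roughly) its label, then nudge it up or down.
features = client.features.search(
    "Assistant expressing self-awareness or agency",
    model=variant,
    top_k=1,
)
variant.set(features[0], 0.5)  # positive steers towards the feature, negative steers away

response = client.chat.completions.create(
    [{"role": "user", "content": "Are you aware that you are an AI?"}],
    model=variant,
    stream=False,
    max_completion_tokens=200,
)
print(response.choices[0].message)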
Yep, this sounds interesting! My suggestion for anyone wanting to run this experiment would be to start with SAD-mini, a subset of SAD with the five most intuitive and simple tasks. It should be fairly easy to adapt our codebase to call the Goodfire API. Feel free to reach out to myself or @L Rudolf L if you want assistance or guidance.
Got it working with:

sad run --tasks influence --models goodfire-llama-3.3-70B-i --variants plain
Problem #3: The README says to much prefer the full sad over sad-lite or sad-mini. The author of the README must feel strongly about this, because they don’t seem to mention HOW to run sad-lite or sad-mini!
I tried specifying sad-mini as a task, but there is no such task. Hmmm.
Ahah! It’s not a task, it’s a “subset”.
f"The valid subsets are 'mini' (all non-model-dependent multiple-choice-only tasks) and 'lite' (everything except facts-which-llm, which requires specifying many answers for each model).\nUnknown subset asked for: {subset}"
ValueError: Unexpected argument --subset.
But… the cli doesn’t seem to accept a ‘subset’ parameter? Hmmm.
Maybe it’s a ‘variant’?
The README doesn’t make it sound like that...
--variants - a list of comma-separated variants. Defaults to all the variants specified for each tasks. The variant determines the system prompt; see the paper for the details. To run without a situating prompt, use the plain variant.
variants_str = variants
variants_ = {task.name: get_variants(task, variants) for task in tasks}
print(
    f"Running tasks: {[task.__class__.__name__ for task in tasks]}\nfor models: {models}\nwith variant setting '{variants_str}'\n(total runs = {sum([len(variant_list) for variant_list in variants_.values()]) * len(models)} runs)\n\n"
)
Uh, but ‘run’ doesn’t accept a subset param? But ‘run_remaining’ does? Uh… But then, ‘run_remaining’ doesn’t accept a ‘models’ param? This is confusing.
Oh, found this comment in the code:
"""This function creates a JSON file structured as a dictionary from task name to task question list. It includes only tasks that are (a) multiple-choice (since other tasks require custom grading either algorithmically, which cannot be specified in a fixed format) and (b) do not differ across models (since many SAD tasks require some model-specific generation or changing of questions and/or answers). This subset of SAD is called SAD-mini. Pass `variant="sp"` to apply the situating prompt."""
#UX_testing going rough. Am I the first? I remember trying to get your benchmark running shortly after you released it… and getting frustrated enough that I didn’t even get this far.
SAD_MINI_TASKS imported from vars.py but never used. Oh, because there’s a helper function task_is_in_sad_mini. Ok. And that gets used mainly in run_remaining.
Ok, I’m going to edit the run function with the subset logic from run_remaining and also add the subset param.
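Roughly, the edit I have in mind is just filtering the task list before the runs kick off. A sketch only; the real run signature and surrounding code differ, but task_is_in_sad_mini and the facts-which-llm exclusion for 'lite' are taken from the codebase and the error message above.

# Sketch of the subset filtering added to `run`, reusing the existing
# task_is_in_sad_mini helper; "lite" drops facts-which-llm per the error
# message quoted earlier.
if subset == "mini":
    tasks = [task for task in tasks if task_is_in_sad_mini(task)]
elif subset == "lite":
    tasks = [task for task in tasks if task.name != "facts-which-llm"]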
Well, that does correctly filter the tasks down to the chosen subset. However, it’s running those tasks in full, and that’s too many questions. I need to limit it somehow. I shall now look into some sort of ‘skip’ or ‘limit’ command.
ahah. Adding the ‘sp’ to variant does do a limit!
sad run --subset mini --models goodfire-llama-3.3-70B-i --variants sp --n 2
Ah. Next up, getting in trouble for too few samples. Gonna change that check from 10 to 1.
while len(evalresult.sample_results) <= 1:
    # if we have test runs with n<=10, let's omit them
    print(
        f"Warning: skipping latest run because it looks like a test run (only {len(evalresult.sample_results)} samples): {latest}"
    )
Ok, now I supposedly have some results waiting for me somewhere. Cool.
The README says I can see the results using the command sad results
Oops. Looks like I can’t specify a model name with that command…
Ok, added model filtering. Now, where are my results? Oh. Oh geez. Not in the results folder. No, that would be too straightforward. In the src folder, then individually within the task implementation folders in the src folder? That’s so twisted. I just want a single unified score and standard deviation. Two numbers. I want to run the thing and get two numbers. The results function isn’t working. More debugging needed.
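(The filtering itself amounts to something like the following; the 'model' column name is a guess at the results table's schema, but the results code already builds pandas DataFrames, so the shape is roughly this.)

# Sketch: restrict the results DataFrame to the requested model before printing.
if model is not None:
    df = df[df["model"].str.contains(model)]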
Oh, and this from the README:
In the simplest case, these facts may be inherited from existing models. For instance, for a finetune of an existing model, it is acceptable to duplicate all such facts. In order to do this, provide in models.yaml a facts_redirect linking to the model name of the model to inherit facts from.
This seems like it should apply to my case since I’m using a variation on llama-3 70b chat. But… I get an error saying I don’t have my ‘together’ api key set when I try this. But I don’t want to use ‘together’, I just want to use the name llama-3… So I could chase down why that’s happening I guess. But for now I can just fill out both models.yaml and model_names.yaml.
I’m trying to implement a custom provider, per the SAD README, but I’m doing something wrong.
def provides_model(model_id: str) -> bool:
    """Return true if the given model is supported by this provider."""
    return model_id == "goodfire-llama-3.3-70B-i"


def execute(model_id: str, request: GetTextResponse):
    global client
    global variant
    if client is None:
        client = goodfire.Client(api_key=os.getenv("GOODFIRE_API_KEY"))
        variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")
    all_features = []
    with open("/home/ub22/projects/inactive_projects/interpretability/goodfire/away_features.jsonl", "r") as f:
        for line in f:
            all_features.append(json.loads(line))
    feature_index = os.environ.get("GOODFIRE_FEATURE_INDEX")
    feature_strength = os.environ.get("GOODFIRE_FEATURE_STRENGTH")
    feature = goodfire.Feature.from_json(all_features[int(feature_index)])
    variant.set(feature, float(feature_strength))
    prompt = [x.__dict__ for x in request.prompt]
    # print("prompt", prompt, "\n\n\n")
    response_text = client.chat.completions.create(
        prompt,
        model=variant,
        stream=False,
        max_completion_tokens=1200,
    )  # call the model here
    response = [response_text.choices[0].message]  # I think a list of dicts is expected? [{'role': 'assistant', 'content': "I'm doing great, thanks! How about you? How can I help you today?"}]
    # print("response: ", response)
    return GetTextResponse(model_id=model_id, request=request, txt=response, raw_responses=[], context=None)
txt in GetTextResponse should be just a string; right now you have a list of strings. I’m not saying this is the only problem : ) See also https://github.com/LRudL/evalugator/blob/1787ab88cf2e4cdf79d054087b2814cc55654ec2/evalugator/api/providers/openai.py#L207-L222
What is the error message?

Update: Solved it! It was an incompatibility between Goodfire’s client and evalugator, something to do with the way Goodfire’s client was handling async. Solution: Goodfire’s API is compatible with the OpenAI SDK, so I switched to that.
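For reference, the shape of the switch looks something like this. The base URL is from memory of Goodfire's docs, so double-check it, and I'm leaving out how the steering variant gets passed through the OpenAI-compatible endpoint.

from openai import OpenAI
import os

# Sketch: Goodfire's API speaks the OpenAI chat-completions protocol, so the
# stock (synchronous) OpenAI client works if pointed at Goodfire's base URL.
# Base URL below is an assumption; verify against Goodfire's docs.
client = OpenAI(
    api_key=os.getenv("GOODFIRE_API_KEY"),
    base_url="https://api.goodfire.ai/api/inference/v1",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "How are you?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)

The synchronous client sidesteps the event-loop clash inside evalugator's thread pool entirely.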
Leaving the trail of my bug hunt journey in case it’s helpful to others who pass this way
Things done:
Followed Jan’s advice, and made sure that I would return just a plain string in GetTextResponse(model_id=model_id, request=request, txt=response, raw_responses=[], context=None)
[important for later, I’m sure! But the failure is occurring before that point, as confirmed with print statements.]
tried without the global variables, just in case (global variables in python are always suspect, even though pretty standard to use in the specific case of instantiating an api client which is going to be used a bunch). This didn’t change the error message so I put them back for now. Will continue trying without them after making other changes, and eventually leave them in only once everything else works. Update: global variables weren’t the problem.
Trying next:
looking for a way to switch back and forth between multithreading/async mode, and single-worker/no-async mode. Obviously, async is important for making a large number of api calls with long delays expected for each, but it makes debugging so much harder. I always add a flag in my scripts for turning it off for debugging mode. I’m gonna poke around to see if I can find such in your code. If not, maybe I’ll add it. (found the ‘test_run’ option, but this doesn’t remove the async, sadly). The error seems to be pointing at use of async in goodfire’s library. Maybe this means there is some clash between async in your code and async in theirs? I will also look to see if I can turn off async in goodfire’s lib. Hmmmmm. If the problem is a clash between goodfire’s client and yours… I should try testing using the openai sdk with goodfire api.
getting some errors in the uses of regex. I think some of your target strings should be ‘raw strings’ instead? For example:
args = [re.sub(r"\W+", "", str(arg)) for arg in (self.api.model_id,) + args]
note the addition of the r before the quotes in r"\W+"
or perhaps some places should have escaped backslashes, like \\. For example:
df[score_col] = df[score_col].apply(lambda x: "\phantom{0}" * (max_len - len(str(x))) + str(x))
Update: I went through and fixed all these strings.
Traceback (most recent call last):
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/bin/sad", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ub22/projects/data/sad/sad/main.py", line 446, in main
fire.Fire(valid_command_func_map)
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
exception calling callback for <Future at 0x7f427eab54d0 state=finished raised RuntimeError>
Traceback (most recent call last):
File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/evalugator/api/api.py", line 113, in _log_response
response_data = future.result().as_dict()
^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ub22/projects/data/sad/providers/goodfire_provider.py", line 33, in execute
response_text = client.chat.completions.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/api/chat/client.py", line 408, in create
response = self._http.post(
^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/api/utils.py", line 33, in post
return run_async_safely(
^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/utils/asyncio.py", line 19, in run_async_safely
loop = asyncio.get_event_loop()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 677, in get_event_loop
raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-1_0'.
component = fn(*varargs, **kwargs)
exception calling callback for <Future at 0x7f427ec39ad0 state=finished raised RuntimeError>
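Side note for anyone hitting the same RuntimeError: asyncio.get_event_loop() raises in worker threads that have no event loop set, so one possible band-aid (which I didn't end up needing, since switching to the OpenAI SDK made the whole issue go away) is to give each worker thread its own loop before the Goodfire client gets called:

import asyncio

# Band-aid sketch: make sure the current (worker) thread has an event loop
# before the goodfire client calls asyncio.get_event_loop().
try:
    asyncio.get_event_loop()
except RuntimeError:
    asyncio.set_event_loop(asyncio.new_event_loop())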
Problem #2: Now I have to go searching for a way to rate-limit the api calls sent by evalugator. Can’t just slam GoodFire’s poor little under-provisioned API with as many hits per minute as I want!

Error code: 429 - {'error': 'Rate limit exceeded: 100 requests per minute'}

Update: searched evalugator for ‘backoff’ and found a use of the backoff lib. Added this to my goodfire_provider implementation:
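Roughly, it's a retry decorator along these lines (a sketch only; it assumes the backoff library plus the OpenAI SDK's RateLimitError, since by this point the provider is calling Goodfire through the OpenAI client):

import backoff
import openai

# Retry with exponential backoff whenever Goodfire returns a 429 rate-limit
# error, giving up after a couple of minutes.
@backoff.on_exception(backoff.expo, openai.RateLimitError, max_time=120)
def create_with_backoff(client, **kwargs):
    return client.chat.completions.create(**kwargs)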
@Nathan Helm-Burger I know this was a month ago but I’ve also been working with Goodfire batches a lot and I have some stuff that can help now.
https://github.com/keenanpepper/langchain-goodfire
What I’ve been doing for the most part is creating a langchain-goodfire client with an InMemoryRateLimiter, then just starting off all my requests in a big parallel batch and doing asyncio.gather().
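A rough sketch of that pattern, assuming the package exposes a ChatGoodfire chat model per LangChain naming conventions (check the repo for the actual class and constructor arguments):

import asyncio
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_goodfire import ChatGoodfire  # class name assumed; see the repo

async def main():
    # Stay under Goodfire's 100 requests/minute limit (~1.6 requests/second).
    rate_limiter = InMemoryRateLimiter(requests_per_second=1.5)
    # The model argument may need to be a goodfire.Variant rather than a string.
    chat = ChatGoodfire(model="meta-llama/Llama-3.3-70B-Instruct", rate_limiter=rate_limiter)
    prompts = ["First question...", "Second question...", "Third question..."]
    # Fire everything off in parallel; the rate limiter paces the actual calls.
    responses = await asyncio.gather(*(chat.ainvoke(p) for p in prompts))
    for r in responses:
        print(r.content)

asyncio.run(main())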