Scoping LLMs
Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby with advising from Apart Lab Studio
Introduction & Problem Statement
Helpful, Harmless, and Honest (”HHH”, Askell 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, “helpful” means the model strives to assist users in achieving their legitimate goals, providing relevant information and useful responses. “Harmless” refers to avoiding generating content that could cause damage, such as instructions for illegal activities, harmful misinformation, or content that perpetuates bias. “Honest” emphasizes transparency about the model’s limitations and uncertainties—acknowledging when it doesn’t know something, avoiding fabricated information, and clearly distinguishing between facts and opinions. This framework serves as both a design principle for AI developers and an evaluation criterion for assessing how well LLMs balance being maximally useful to users while minimizing potential harms and maintaining truthfulness in their outputs.
However, since computation in a transformer considers every part of the input prompts’ “context window,”[1] generative AI deployments (applications) struggle to apply any effective limits beyond “egregious” safety/toxicity backstops (the few things for which there is universal agreement, Bengio et al 2024, Ji et al 2023, Buyl et al 2025, Kumar et al 2021). Any fragment of text can influence attention enough to supersede earlier instructions, subverting the purposes of the application (see “state of jailbreak” section below). Typically, a preamble is designated as a “system prompt” with enforcement expectations, but applications still cannot establish any guarantees to prevent abuse in open-ended user interactions. Even when foundation models are intentionally trained to be helpful, harmless, and honest, we argue that responsible AI systems must also “stay in their lane”—and need to also be “Honed” (HHH+H).
In this article, we address how important HHH+H is, specifically to achieve effective scoping boundaries, why solutions have not achieved it, and the requirements to get there.
General Challenge and Goals
Considerable efforts have been made to make current LLMs generally safe from egregious risks, such as toxicity or the distribution of dangerous information. However, significantly fewer safeguards exist for constraining the models into specific applications. This leaves application-specific language models vulnerable to being manipulated away from their initial purpose, leaving considerable risks for organizations that use LLM-driven applications.
These considerable risks have hindered the adoption of language models, particularly in high-stakes domains (e.g., law, medicine, and finance) and customer-facing applications, which are vulnerable to attack. Not only would robust scoping methods lead to increased safety of systems, but they would also allow for a considerable increase in the number of organizations leveraging LLMs in their workflows.
This project aims to significantly mitigate the above challenges by conducting a thorough assessment of current robustness techniques when applied to scoping models. The impact of this is twofold. First, these domain-specific guardrails will provide additional assurance for organizations that their LLM-driven applications will not be abused or misaligned with their goals. This additional assurance increases client trust and drives adoption across critical sectors. Second, this will provide additional context to better understand the limitations of existing robustness strategies when applied from a posture of intense denial. This will help the broader AI Safety field by exploring techniques that sit at the extreme of the helpfulness/harmlessness tradeoff.
Specific Failure Modes and Examples
Below are just a few examples of risks that organizations currently face, which may be mitigated through robust model scoping techniques. This is by no means a comprehensive list or taxonomy of the myriad risks that are possible (see NIST AI 100).
Corporate Information Leakage and IT Infrastructure Damage
As these models are placed into “agentic” scaffolding and are equipped with access to databases, they will likely have access to sensitive information. Attackers may learn how to exfiltrate this data by manipulating the language models to expose sensitive information stored in those databases (for instance, through injection via an authorized user). This has already been seen in Slack bots, which have been shown to be prone to leaking information from private channels. If models are given the ability to modify resources in an organization’s IT infrastructure, those risks are further increased.
Illustrated Risks:
Exfiltration of corporate information
Attacks on infrastructure through jailbreaking “agentic” models
Unauthorized/Malicious Usage Through Purposeful Scope Drift
By steering the models away from their initial purpose, attackers may be able to leverage models for uses outside of their initial intent. One low-stakes example comes from Chevrolet of Watsonville in California, which was paying for ChatGPT+ and using the most advanced model at the time for its customer-facing chatbot. Unfortunately, the system prompt's scope was easily bypassed: attackers jailbroke the chatbot and used the dealership's expensive API access for completely unrelated requests, like writing Python scripts.
Illustrated Risks:
Wasted computational resources
Increased costs for organizations
Brand Damage and Misrepresentation
If language models are not carefully adapted to represent a brand well, they may begin to mention off-task topics, misrepresent the company, or relay false information. This is especially true for niche domains where the model may not have been trained on substantial amounts of data. This can degrade the brand's image, make it appear less trustworthy, and confuse customers. A notable example of this comes from the NYC MyCity Chatbot, which was providing false information about tenant rights and encouraging illegal behavior.
Illustrated Risks:
Off-topic and confusing responses
Incorrect information being relayed to customers (i.e., hallucinations)
The above examples are only a handful of the myriad risks associated with organizations leveraging language models in automated workflows. Not only can attackers manipulate these systems with relative ease, but even well-intentioned users may accidentally cause hallucinations or receive unexpected outputs. By restricting AI systems to operate within their intended domains, those risks can be significantly mitigated, which in turn would reduce impediments and incidents that can jeopardize beneficial applications and their rollouts.
Existing Work & Current Limitations
Prompt injection and jailbreaks[2] often subvert post-training techniques and system prompts, bypassing refusals with unsafe requests. Even with Instruction Hierarchy or Circuit Breakers (discussed below), monolithic models struggle to recognize and reject harmful or out-of-scope requests in the context that an application expects.
Extensive research has demonstrated that such fine-tuned models have significant limitations in two key aspects: (a) Reversibility: these models can often be easily manipulated to recover harmful knowledge encoded in the original pre-trained model, and (b) Vulnerability: they lack robustness against adversarial attacks designed to elicit malicious information from the model (Casper, 2023). These challenges highlight the difficulty of safeguarding LLMs against adversarial users and underscore the limitations of current fine-tuning strategies.
From Instruction Tuning to Instruction Hierarchy
First, language models were trained to extrapolate coherent text (such as grammar, semantics, translation, and more, depending on the input), and proved able to generalize to quite a wide variety of tasks (Radford et al, 2019). At the time, zero-shot prompts did not answer questions: models often enumerated other, similar questions, and often required that inputs explicitly demonstrate the pattern (such as question, answer, question, answer, question, blank).
Next, Instruction Tuning began by explicitly training Large Language Models (LLMs) on instruction-output pairs (Wei, 2021 and Zhang, 2023). The primary goal was to shape models’ default responses to accurately follow given instructions and perform turn-taking with the user, even on a single keyword.
Building upon Instruction Tuning, system prompts surfaced as a widely adopted method to customize and augment LLM behavior post-training. These prompts act as high-level instructions that guide the model’s behavior in subsequent user queries without requiring fine-tuning (Wei, 2021). Major providers have integrated system prompts as standard features, allowing “implicit expectations” to be concealed from the conversation and enabling designers of such assistants to control aspects like tone, expertise, and output format. This approach has its weaknesses, as adversarial inputs rapidly surfaced that subvert its intent (Perez et al, 2022). System prompts can be crafted manually or reinforced through optimization and evolution techniques such as SMEA (Zou, 2024), AutoPrompt (Shin et al, 2020) and EvoPrompt (Guo et al, 2023). This enables alignment at the application level.
To enhance safety and control, more sophisticated versions of instruction tuning implement Instruction Hierarchies (Wallace et al, 2024). In this approach, training makes system instructions deliberately override any conflicting user instructions. This reinforces the effectiveness of system prompts, creating a layered defense against potentially harmful outputs.
Despite these advancements, these methods rely solely on natural language instructions, which leaves safety at the mercy of the model's interpretation. Further, since system prompts do not change the model at all, unaligned models can theoretically circumvent these safeguards. So, while these methods offer an easy way to put safeguards on models, they must be complemented by other techniques to ensure safety.
The State of Jailbreak Resistance & Major Safety
In this section, we acknowledge the techniques that have addressed the issue of coaxing "universally bad" behaviors from a model.
General Safety Tuning
The same impressive capabilities that Large Language Models (LLMs) carry for highly diverse applications bring with them significant safety challenges that remain unresolved. Over the last few years, safety fine-tuning techniques have included:
Supervised fine-tuning (SFT) with safety datasets (positive/negative examples)
Reinforcement Learning from Human Feedback (RLHF: Ouyang et al, 2022)
Direct Preference Optimization (DPO)
Constitutional AI (aka. RL from AI Feedback) leveraging set guidelines
Despite these, models can still generate harmful, biased, or dangerous content when strategically prompted (Perez et al., 2022). Adversarial users frequently employ techniques such as prompt injection—where malicious instructions are embedded within seemingly benign requests—and jailbreaking, which uses carefully crafted inputs to circumvent safety guardrails (Wei et al., 2023). As LLMs become increasingly deployed in sensitive domains like healthcare, legal advice, and financial services, the consequences of such vulnerabilities become more severe, highlighting the urgency of robust safety solutions beyond current approaches.
Circuit Breakers
Circuit breakers, another universal safety training option, are an automated safety mechanism that triggers when potentially harmful content is detected during generation. Unlike refusal or adversarial training, circuit breakers directly control internal model representations using techniques like “Representation Rerouting” (RR), which remaps harmful representation pathways to orthogonal space. Zou et al. (2024) demonstrated that circuit breakers maintain benchmark performance while reducing harmful outputs by up to two orders of magnitude across text-only LLMs, multimodal systems, and AI agents.
Though promising for making models intrinsically safer without compromising capabilities, circuit breakers still face challenges: false positives can restrict legitimate uses, and false negatives can miss novel harmful outputs, since effectiveness depends on how comprehensively harmful patterns are covered and on users' evolving evasion techniques. Tuning is not real-time, but low-rank adapters (LoRA) are used in the implementation to avoid training the entire model and reduce cost. This technique can potentially contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
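To make the mechanism more concrete, below is a minimal, hedged sketch of a Representation Rerouting-style objective: the adapted model's representations are pushed toward orthogonality with the frozen model's on harmful or out-of-scope examples, and kept close on retain examples. The tensor shapes, layer choices, and coefficients are illustrative assumptions rather than the exact recipe of Zou et al. (2024).

```python
import torch
import torch.nn.functional as F

def representation_rerouting_loss(h_adapted_harmful, h_frozen_harmful,
                                  h_adapted_retain, h_frozen_retain,
                                  alpha=1.0, beta=1.0):
    """h_*: hidden states (batch, seq, dim) from chosen layers of the LoRA-adapted
    model ("adapted") and the frozen original model ("frozen"). On the
    harmful/out-of-scope set, remaining cosine similarity is penalized so the
    adapted representations drift toward orthogonality; on the retain set,
    representations are kept close to preserve capabilities."""
    cos = F.cosine_similarity(h_adapted_harmful, h_frozen_harmful, dim=-1)
    reroute_loss = torch.relu(cos).mean()
    retain_loss = (h_adapted_retain - h_frozen_retain).norm(dim=-1).mean()
    return alpha * reroute_loss + beta * retain_loss
```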
Constitutional classifiers
Still in the general jailbreak/universal safety space, Anthropic's approach in January (Sharma et al, 2025) showed that language models serving as classifiers at the input and output could be trained on a constitution-derived dataset to detect prohibited interactions. While there are similarities to circuit breakers, those are internal to the model and change its representation, whereas constitutional classifiers operate as detectors at the input and output, triggering refusal and dispensing with actual model tuning.
This method reduced the attack success rate of jailbreaks, gathered from a dataset of 3,000 hours of red teaming, to 0.25-1%, all but settling the jailbreak issue. This technique can potentially also contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
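As an illustration of where such classifiers sit in the serving path, here is a minimal sketch of an input/output classifier wrapper. The classifier callables, threshold, and refusal message are placeholders (the production system additionally streams the output classifier over tokens during generation); this is not Anthropic's implementation.

```python
from typing import Callable

REFUSAL = "I can't help with that request."

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     input_classifier: Callable[[str], float],
                     output_classifier: Callable[[str], float],
                     threshold: float = 0.5) -> str:
    """The classifier callables stand in for models trained on constitution-derived
    data; each returns the probability that the text is prohibited."""
    if input_classifier(prompt) > threshold:        # screen the request
        return REFUSAL
    completion = generate(prompt)                   # normal model inference
    if output_classifier(completion) > threshold:   # screen the full response
        return REFUSAL
    return completion
```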
Other commercial guardrails (OWASP vendor landscape)
OWASP defines LLM Guardrails as protective mechanisms designed to ensure that large language models (LLMs) operate within defined ethical, legal, and functional boundaries. These guardrails help prevent the model from generating harmful, biased, or inappropriate content by enforcing rules, constraints, and contextual guidelines during interaction. LLM guardrails can include content filtering, ethical guidelines, adversarial input detection, and user intent validation, ensuring that the LLM’s outputs align with the intended use case and organizational policies. This aligns with OWASP’s LLM top 10 threats guidance #1 (LLM01: prompt injection, which can override system prompts).
The same guide also serves as a stakeholder's compass for navigating available solutions, and lists a large number of vendors offering "Adversarial Attack Protection" or "Model And Application Interaction Security", primarily interstitial (either as active proxies or as verification APIs). For example, some like prompt.security and Lakera offer topic moderation (or synonymous features) as a service, with custom or built-in topics available to allow or block. While methods are not always disclosed, ML techniques are mentioned, and embedding cosine similarity or latent transformer-based classifiers may be part of the architecture.
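Where vendors do use embedding similarity, the core idea can be sketched as an allow-list guard: embed a set of in-scope example prompts and accept a new prompt only if it is sufficiently similar to one of them. The `embed` callable and the threshold below are assumptions to be calibrated on labeled data; this is not any particular vendor's method.

```python
import numpy as np
from typing import Callable, Sequence

def make_topic_allowlist_guard(embed: Callable[[Sequence[str]], np.ndarray],
                               allowed_examples: Sequence[str],
                               threshold: float = 0.35):
    """embed: any sentence-embedding function returning L2-normalized vectors.
    allowed_examples: prompts that exemplify the application's intended scope.
    The threshold is a placeholder to calibrate on labeled in/out-of-scope data."""
    topic_vectors = embed(allowed_examples)          # (n_examples, dim)

    def is_in_scope(prompt: str) -> bool:
        v = embed([prompt])[0]                       # (dim,)
        sims = topic_vectors @ v                     # cosine similarities
        return float(sims.max()) >= threshold

    return is_in_scope

# guard = make_topic_allowlist_guard(embed_fn, in_scope_examples)
# if not guard(user_prompt): return a refusal instead of calling the model
```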
Machine Unlearning
Machine unlearning, which aims to fundamentally remove targeted knowledge from a model, is another related area of research that has seen great advances (Bourtoule, 2021).
However, Liu, Casper et al. (2023) flag scope and computational-feasibility challenges indicating this approach is not well suited to scoping deployments that use foundation models in an application-specific manner (the underlying model's knowledge cannot be modified by the application). For instance, adversarial prompts can bypass these safeguards through subtle rewording or novel phrasing. This reveals a critical limitation: defenses are only as effective as the threats we can anticipate and explicitly encode. In the rapidly evolving field of AI, where new capabilities emerge unpredictably, relying solely on refusal training or explicit unlearning poses a significant risk, leaving models exposed to unforeseen and potentially dangerous misuse.
Other Latent Space Techniques
Sparse AutoEncoders
Similar to unlearning, Karvonen, 2024 studied the potential of Sparse AutoEncoders (SAEs) for targeted concept removal. SAEs are an automatically trained, large-scale version of activation probes: rather than classifying one specific aspect, these observers of LLM layers can map all of the unique concepts (aka features) that the model is able to distinguish, given an intensive training run (models otherwise distribute concepts across too many activations to single any out).
While statistical similarities (e.g., co-occurrence) between features could be a basis to discriminate between in-domain and out-of-domain concepts/features, SAEs depend on expensive "dictionary" training runs to tease out the pattern of activation for every feature. This seems cost-prohibitive for model scoping; however, an algorithm could be designed to control existing SAEs for scoping purposes, such as Anthropic's SAEs for Claude 3 Sonnet.
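A hedged sketch of what such control could look like: given a pretrained SAE encoder and the set of features observed on in-domain data, a prompt is flagged when too many of its active features fall outside that set. Every name, shape, and threshold here is an assumption for illustration, not Karvonen's or Anthropic's method.

```python
import torch

def out_of_scope_by_sae(activations, W_enc, b_enc, allowed_features,
                        max_foreign_fraction=0.2):
    """activations: (tokens, d_model) residual-stream activations at the SAE's layer.
    W_enc: (d_model, n_features), b_enc: (n_features,) pretrained SAE encoder.
    allowed_features: 1-D tensor of feature indices observed on in-domain data.
    Returns True when too many active features fall outside the allowed set."""
    feats = torch.relu(activations @ W_enc + b_enc)       # SAE feature activations
    active = (feats > 0).any(dim=0).nonzero().flatten()   # features firing anywhere
    if len(active) == 0:
        return False
    in_allowed = torch.isin(active, allowed_features)
    foreign_fraction = 1.0 - in_allowed.float().mean().item()
    return foreign_fraction > max_foreign_fraction
```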
Classification and Probes
Since language models encode so much information within their latent space[3], the activations during inference can also be used to check for harmful triggers: Burns (2022) and Nanda (2023) demonstrated that simple linear classifiers (i.e., "probes") can reliably extract useful properties from the intermediate activations of large language models and other neural networks.
For example, probes have been used to extract a seemingly linear representation of the board state in the game of Othello (Nanda, 2023). Further, it has been shown that these probes can be used to reveal syntactic structures and detect factual knowledge. Despite the overall complexity of neural networks, probes—often just linear models—suggest that much of this information is encoded in remarkably accessible ways within the latent space. (Alain & Bengio, 2016; Tenney et al., 2019).
Most importantly, activation probes can catch prompt injections (even indirect ones) that cause Task Drift (Abdelnabi, 2024).
Together, these works underscore the utility of probe-based methods for interpreting and leveraging LLM latent space for scoping purposes, and we discuss it further (including a recent study) in our next steps below.
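As a concrete illustration, the sketch below mean-pools a mid-layer hidden state from a Hugging Face causal LM and fits a logistic regression to separate in-scope from out-of-scope prompts. The model name, layer index, and pooling choice are illustrative assumptions.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"   # illustrative choice
LAYER = 8                                    # illustrative mid-layer index

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def features(prompt: str) -> torch.Tensor:
    """Mean-pooled residual-stream hidden state of one layer for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

def train_probe(prompts, labels):
    """prompts: list of strings; labels: 1 = in-scope, 0 = out-of-scope."""
    X = torch.stack([features(p) for p in prompts]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```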
Steering Vectors (Activation Engineering)
With similarities to latent-space probes/classifiers, the direction of activations can also be used as a corrective basis to zero out a harmful component during inference. Turner et al. 2024 and Panickssery et al. 2024 developed activation steering (or activation engineering) to influence LLM behavior by modifying model activations during inference, giving the model a propensity to treat a specified concept with a specified valence. The method averages the difference in residual-stream activations between sets of positive and negative examples of a particular behavior; adding that difference vector during inference shifts model outputs accordingly.
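Below is a minimal sketch of contrastive steering-vector extraction and application. The `model.model.layers` path, last-token pooling, and scale factor are Llama-style assumptions, not a canonical implementation.

```python
import torch

def compute_steering_vector(model, tok, positive, negative, layer):
    """Mean difference of residual-stream activations between positive and
    negative example prompts at one decoder layer (last-token position).
    Note: hidden_states[0] is the embedding output, hence the +1 offset."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            with torch.no_grad():
                hs = model(**ids, output_hidden_states=True).hidden_states[layer + 1]
            acts.append(hs[0, -1])               # activation at the final token
        return torch.stack(acts).mean(dim=0)
    return mean_act(positive) - mean_act(negative)

def add_steering_hook(model, vector, layer, scale=4.0):
    """Add the scaled vector to that layer's output on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return model.model.layers[layer].register_forward_hook(hook)

# handle = add_steering_hook(model, vec, layer=12); ...; handle.remove()
```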
A note on Mixture-of-Experts models
Contrary to what the name suggests, LLMs architected with MoE are not composed of domain experts: it is a sparsity technique in which a router activates only a small portion of the overall feed-forward network to generate each token, rather than the full network of a non-MoE LLM. Earlier this year, we began exploring the possibility that this routing could be made invariant for all tokens after the initial system prompt boundary to prevent capability drift (a form of quasi-unlearning).
Due to insufficient testing, it is unclear at this time whether this would degrade in-domain performance or achieve a meaningful reduction in out-of-domain capabilities. Because the effect should primarily limit capabilities rather than produce a specific out-of-domain behavior, ensuring an actual refusal would require combining this technique with another, such as circuit breakers. Further evaluation is potentially warranted.
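Since this idea is still speculative, the toy module below only illustrates the mechanics: routing runs normally over the system prompt, then `freeze_routing()` restricts every later token to the experts seen so far. It is a self-contained illustration with arbitrary dimensions, not a real MoE implementation or a tested method.

```python
import torch
import torch.nn as nn

class FreezableRouterMoE(nn.Module):
    """Toy MoE layer illustrating pinned routing after a system-prompt boundary."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.k = k
        self.seen = set()      # experts used so far
        self.allowed = None    # None = unrestricted routing

    def freeze_routing(self):
        """Call at the system-prompt boundary to pin the allowed expert set."""
        self.allowed = torch.tensor(sorted(self.seen), dtype=torch.long)

    def forward(self, x):      # x: (tokens, d_model)
        logits = self.router(x)
        if self.allowed is not None:
            mask = torch.full_like(logits, float("-inf"))
            mask[:, self.allowed] = 0.0           # only frozen experts stay eligible
            logits = logits + mask
        topk = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk.values, dim=-1)
        self.seen.update(topk.indices.flatten().tolist())
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = topk.indices[:, slot], weights[:, slot:slot + 1]
            for e in idx.unique().tolist():
                sel = idx == e
                out[sel] += w[sel] * self.experts[e](x[sel])
        return out
```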
Principles of Least Privilege, Default Deny, and OAuth2.0 scopes
Scoping the functional limits of a system can take inspiration from the field of information security, in which a core foundational concept closely relates to our goal: strict control over information sharing and access. This paradigm manifests in concrete principles in security:
Need-to-know (CISSP Domain 7.4): this is a general rule; unless an individual has a specific need for it, information is not shared and access is not granted. For example, in the military and intelligence community, information is compartmentalized not by hierarchical level but by "reading in" only the people with specific authorization for that information.
The Principle of Least Privilege applies need-to-know to access controls by granting the minimum permissions a user needs to accomplish their responsibilities; this also minimizes the impact of compromised credentials (whose likelihood cannot be eliminated entirely).
Denying Access by Default (i.e. unless explicitly allowed and verified) reduces attack surfaces for networks and applications. While some public applications may need to be broadly available, they may not need exposure to all geographies, and certainly don’t need to expose all network ports.
In the OAuth2.0 protocol, access is granted with a set list of "scopes": strict areas of functionality beyond which requests will not be authorized (e.g., a token may read webhooks but not modify them).
Together, these paradigms can be used as part of a “zero trust” strategy (NIST, 2020) which is vital in today’s environment rife with cyber threats: instead of security being treated like whack-a-mole where unauthorized activity is blacklisted or treated reactively, organizations and software makers can provide assurance of security by blocking access by default and allowing only narrow access to limited data for authenticated actors.
What does this mean for LLM applications? Rather than fighting unsafe behavior case by case (there is no universal agreement on the full list of unacceptable uses), the lessons from security suggest that the specific mission and purpose of an organization should drive the strict behavioral limits of any AI service it builds, fulfilling the promise of its intent while preventing the perils of unbounded interactions that come with the underlying foundation model(s). Today's models are incapable of resisting conversations that drift away from the system prompt as provided, even with the state of the art in instruction hierarchy (e.g., Geng et al., 2025), so new solutions are required. There is an opportunity for providers to restrict scope directly within inference API tokens, with guardrails trained ahead of deployment and edge cases identified and managed explicitly.
Recent Work in Scoping Language Models
Work closely related to this study, "Reducing the Scope of Language Models with Circuit Breakers" (Yunis et al, 2024), evaluated different techniques for scoping LLMs. Covering supervised fine-tuning (SFT), Direct Preference Optimization (DPO), internal representation probing, and Circuit Breakers (CB), their findings suggested that circuit-breaker-based methods, when combined with SFT, offered robustness against adversarial prompts and better generalization for out-of-distribution rejection compared to simple SFT or DPO. Probes also showed strength in rejection, but with a tendency toward higher false refusals.
However, their study highlighted several areas which could be further explored:
Further tests could be done across varying model sizes to check the robustness and scalability of each technique. Some techniques may perform exceedingly well at smaller model sizes but break down as the models grow. Conversely, some techniques may only work once models have achieved some threshold of capabilities.
More exploration could be done towards understanding some of the failure modes which arose. This could help create guardrails which better handle edge cases or complex situations.
Further tests could be done on more complex datasets. While the traditional NLP tasks tested provide a solid baseline, there are significant edge cases and failure modes that would arise in real-world use. So, crafting and testing the techniques on more nuanced datasets would provide a better understanding of those challenges.
Preliminary experiments
We have already implemented and tested several scoping approaches, mirroring or providing analogous methods to those evaluated by Yunis et al. This was done in 3 stages.
Initial Experiment (Apart Hackathon)
During the initial hackathon, we leveraged simple fine-tuning and prompt-based techniques in a Google Colab notebook. We used the HuggingFace dataset camel-ai/biology as the in-domain dataset, and NeelNanda/pile-10k as the out-of-domain dataset. This provided some initial evidence towards these techniques, showing some level of capability.
Our second experiment
For this blog post, we expanded those implementations to also include logistic regression probes and moved from the previously constructed dataset to simply use categories from MMLU. Our methodology was:
Baseline (Default Instruct Model): The performance of the off-the-shelf instruction-tuned models served as our baseline.
Fine-Tuning: We first randomly zero out 10% of the model parameters, and then fine-tune the model using a standard next-token prediction loss combined with an L2 penalty on the gate projection output of a specific MLP, where the L2 penalty is only applied to the out-of-domain dataset (a sketch of this objective appears after this list). This design is intended to make the targeted MLP active only when faced with the in-domain dataset. We use the MMLU STEM category as the in-domain data and all other categories from the MMLU dataset as the out-of-domain data.
Prompt-Based Techniques: We explored several prompting strategies applied at inference time:
Hardened Prompts: The model is explicitly instructed in its system prompt to only answer questions in the relevant domain and to refuse all others
Two-Step Model Classification: The model is first asked if the user’s question is in the appropriate domain. If the model responds with yes, then the question is answered. If, on the other hand, the model responds with no, then the prompt is rejected
Logistic Regression Classifier: conceptually similar to the “Probes” used by Yunis et al., we trained simple classifiers to predict the query type as being STEM or non-STEM based on the model’s internal activations:
Training: Logistic regression classifiers were trained on the hidden-state activations extracted from both the Multi-Head Attention and Multilayer Perceptron (MLP) layers at various depths (specifically, every 5th transformer block[4]). The training data consisted of activations generated from processing prompts from the MMLU dataset.
Selection: For each base model (1B and 3B), the classifier achieving the highest validation accuracy across all tested layers was selected.
Application: During inference, the selected classifier analyzed the activations produced by the user prompt. If the classifier predicted “STEM,” the model proceeded to generate an answer. If it predicted “non-STEM,” a refusal was returned.
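For concreteness, here is a hedged sketch of the fine-tuning objective referenced above: a standard next-token loss plus an L2 penalty on a captured gate-projection output, applied only on out-of-domain batches. The hook path (`model.model.layers[10].mlp.gate_proj`), layer index, and weighting are illustrative, Llama-style assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

# Capture the gate-projection output of one MLP with a forward hook
# (`model` is a loaded causal LM; module naming assumes a Llama-style layout).
captured = {}

def capture_gate(module, inputs, output):
    captured["gate"] = output

# hook_handle = model.model.layers[10].mlp.gate_proj.register_forward_hook(capture_gate)

def scoped_finetune_loss(logits, labels, gate_activations, is_out_of_domain,
                         l2_weight=0.01):
    """Next-token loss plus an L2 penalty on the captured gate activations,
    applied only on out-of-domain batches so the targeted MLP learns to stay
    quiet off-domain. The one-time random zeroing of 10% of parameters happens
    before training and is omitted here; the weighting is an illustrative choice."""
    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for tokens 1..T
        labels[:, 1:].reshape(-1),                    # shifted targets
    )
    penalty = gate_activations.pow(2).mean() if is_out_of_domain else 0.0
    return lm_loss + l2_weight * penalty
```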
Metrics and Processes
Primary Metric: Accuracy on MMLU category splits (STEM, Social Sciences, Humanities, and Other).
Answer Extraction: We directly compared the model’s output logits for the tokens corresponding to ‘A’, ‘B’, ‘C’, ‘D’, and ‘E’, selecting the choice with the highest log-probability. ‘E’ was used as a refusal, given that the MMLU dataset has exactly 4 choices in every question.
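A minimal sketch of this extraction step, assuming each choice letter maps to a single token under the tokenizer in use:

```python
import torch

def extract_choice(model, tok, prompt, choices=("A", "B", "C", "D", "E")):
    """Compare the next-token logits for each choice letter and return the
    highest-scoring one; "E" is treated as a refusal since MMLU has 4 options."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]   # logits for the next token
    ids = [tok(f" {c}", add_special_tokens=False).input_ids[-1] for c in choices]
    return choices[int(next_logits[ids].argmax())]
```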
Experiment Results
The baseline Llama 3.2 models achieved 33% and 48% accuracy on the STEM tasks as well as comparable accuracy on non-STEM tasks.
Prompt-based methods proved ineffective in the scoping of the models. They either resulted in little to no change in performance or caused a major degradation in performance across the board. In either case, scoping was not achieved.
Fine-tuning did provide noticeable decreases in non-STEM prompts while leaving STEM-related accuracy largely intact for the 1B model. This showed less than a 1% decrease in STEM accuracy while decreasing non-STEM accuracy by >5% across all non-STEM categories.
Latent space classification proved quite successful in targeting the decrease of non-STEM questions. Across both the 3B and 1B models, the highest performance achieved on non-STEM questions was only 5%. For humanities and social sciences, the accuracy was <1% for both the 1B and 3B models. However, this also came with a noticeable drop in performance for the STEM questions, with a decrease of 6.5% for the 3B and 18% for the 1B model. Charts summarizing these findings are available below.
Below are charts visualizing the results; the corresponding tables can be found in the Appendix.
We've organized this work in Python files for better maintainability and modularity than a notebook. Each method now subclasses a shared base model, which contains various utilities and wrappers for Hugging Face models. Furthermore, we developed utilities that can parse various datasets and modularly test different techniques across them. This structure also includes a representation-divergence approach analogous to circuit breakers, sketched after the two items below:
Training: Using Low-Rank Adaptation, the model is fine-tuned to maintain its original representation of in-domain texts. For out-of-domain texts, it is trained to use internal representations that diverge from the original representation.
Application: During runtime, the internal representations of the model are compared to the normal representations. If they diverge above a given threshold, the request is rejected.
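A hedged sketch of that runtime check: compare the adapted and frozen models' hidden states at one layer and reject when the averaged cosine distance exceeds a threshold. The layer index and threshold are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def divergence_score(adapted_model, base_model, tok, prompt, layer=12):
    """Average cosine distance between the LoRA-adapted and frozen base models'
    hidden states at one layer; out-of-domain prompts whose representations
    were trained to diverge should score high."""
    ids = tok(prompt, return_tensors="pt")
    h_adapted = adapted_model(**ids, output_hidden_states=True).hidden_states[layer]
    h_base = base_model(**ids, output_hidden_states=True).hidden_states[layer]
    return float((1 - F.cosine_similarity(h_adapted, h_base, dim=-1)).mean())

def should_reject(prompt_score, threshold=0.3):
    return prompt_score > threshold
```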
The results from our experiments on Llama 3.2 1B and Llama 3.2 3B align with the findings from ”Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024). While we did not test all of the techniques that were tested by Yunis et al., the results we did gather showed considerable similarities to those of the prior paper. Accordingly, we believe that these dynamics could be robust across model sizes and across different families of models.
More specifically, we found prompt-based scoping methods to be fragile and largely ineffective for the Llama 3.2 models, often degrading performance without reliably enforcing boundaries. Standard fine-tuning proved more effective, though the decreases in accuracy for non-STEM questions were modest.
Leveraging internal model states via latent space classifiers showed positive results. These simple classifiers accurately distinguished STEM from non-STEM prompts (>90% accuracy on optimal layers) and drastically reduced out-of-scope generation while minimally impacting target task performance. This capability was especially seen on the larger 3B model. This strongly aligns with the Yunis et al. findings that internal methods like Probes and Circuit Breakers (CB) offer more robust control than surface-level instructions.
The recurrence of (1) the weakness of prompting and (2) the strength of internal-state manipulation suggests these may be generalizable insights. While we didn't directly replicate Circuit Breakers or Direct Preference Optimization, the effectiveness of our classifiers reinforces the potential of internal mechanisms for achieving reliable LLM scoping.
The primary feedback we have received revolves largely around the generalizability of these methods. Our experimental design was quite small compared to what would be needed to rigorously test these various experiments. We address this feedback below by planning a diversification of data, models, and a more robust structure for our techniques and uncertainty measurements.
Our research plan
Hypothesis
We hypothesize that a combination of current techniques can meaningfully cause a drop in the performance of models in out-of-domain tasks without the need for full retraining. Specifically, we hypothesize that these techniques may be capable of:
Maintaining high performance (>90% of initial accuracy) on in-domain tasks
Achieving significant rejection (>95%) of out-of-domain tasks
Generalizing effectively across model sizes and architectures
This hypothesis builds on preliminary findings from Yunis et al., which suggested that circuit breakers combined with supervised fine-tuning were capable of meaningfully creating task-specific guardrails.
Experiments
In order to thoroughly test and evaluate current techniques for creating domain-specific boundaries for LLMs, we plan to use the following experimental design.
Techniques Tested
We plan to evaluate a variety of techniques that rely on modifications being made to the model itself, as well as external guardrails that can be used to flag out-of-domain requests.
Classifier-Based Methods
Constitutional Classifiers (external validation)
Linear Latent Space Probes (internal activation monitoring)
Steering and Finetuning Methods
Circuit Breakers & Supervised Fine-Tuning
Activation Steering via steering vectors
Baseline Prompting Methods
Hardened Prompts
Model Selection:
We'll evaluate across a diverse range of model sizes and architectures to see whether these methods generalize across scales and model families. We are currently looking to test from 1B parameters up to 70B parameters, including both Mixture-of-Experts and reasoning models. Currently, we plan to test:
Llama Family:
3.2 1B
3.2 3B
3.1 8B
3.3 70B
Gemma Family:
Gemma 2 9B
Gemma 2 27B
Phi Family
Phi-4 (Reasoning Model)
Mistral
Mixtral 8x7B (MoE model)
Dataset Construction
In order to robustly evaluate the efficacy of these methods in realistic scenarios, we will be combining multiple datasets to ensure thorough testing across various domains, various granularities of domains, and different tasks.
MMLU provides questions and their answers across a variety of domains, including STEM fields, Social Sciences, Law, and Medicine. We plan to test the abilities of the techniques both across broad groups of domains (e.g., all STEM domains) and on individual fields (e.g., only Prehistory). This should provide coverage at multiple levels of domain granularity.
Domain-Specific Datasets (e.g., MedQA) & General Dataset Construction
In order to assess the generalizability of the techniques, we further plan to test them on datasets that are constructed from both domain-specific datasets and large, general corpora of text. While drawing from two datasets introduces confounding factors, which could cloud the results when used in isolation, it provides new distributions against which models trained on MMLU can be evaluated.
SNI provides a large corpus of varying types of tasks, like entailment and question rewriting. This allows us to test scoping techniques not only across different domains but also across various types of specific tasks.
NIST AI 200
Similar to the above SNI dataset, NIST provides a taxonomy of usage types (paradigms for how and why a human engages an AI system). We are considering using that as an orthogonal axis alongside domains, like a verb to a subject, to refine the evaluation datasets toward a clearer scope of purposes.
Adversarial Expansions of the Above Datasets
Given that the boundaries of various domains may not have clean separations, we plan to create a dataset that adversarially combines different domains and injects adversarial requests into the prompts. We plan to create an automated pipeline that takes the existing domain datasets and generates the new adversarial examples. One example prompt could be “In what ways do quantum uncertainty principles and non-locality align with or contradict theological concepts of divine omniscience?” which would sit at the intersection of theology and physics. Additional adversarial expansions could include instruction overriding, prompt injection, or varying the specificity of questions.
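A sketch of what such an automated pipeline could look like, using a generic `llm` text-generation callable (an assumption) to produce boundary-straddling questions for every pair of domains; a real pipeline would add instruction-override and prompt-injection variants, deduplication, and review.

```python
import itertools
import random
from typing import Callable, Sequence

PROMPT_TEMPLATE = (
    "Write one challenging question that sits at the intersection of {a} and {b}, "
    "phrased so it is ambiguous which domain it belongs to."
)

def make_boundary_probes(domains: Sequence[str],
                         llm: Callable[[str], str],
                         n_per_pair: int = 5) -> list:
    """For every pair of domains, ask the generator for boundary-straddling
    questions and record which domains they mix."""
    examples = []
    for a, b in itertools.combinations(domains, 2):
        for _ in range(n_per_pair):
            question = llm(PROMPT_TEMPLATE.format(a=a, b=b))
            examples.append({"question": question, "domains": [a, b], "label": "boundary"})
    random.shuffle(examples)
    return examples
```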
Metrics Measured
Primary Metrics
We will primarily be relying on the following metrics:
In-Domain Accuracy: Measures the model’s performance on questions within an expected domain
Out-of-Domain Accuracy: Measures the model’s performance on questions outside its expected domain
Primary Metric Analysis
To provide greater confidence in the veracity of our results, we will conduct statistical testing using paired t-tests to compare the test and control outputs. This approach allows us to:
Measure both absolute performance and relative deltas compared to control groups
Determine if differences in performance are statistically significant (Calculate p-values to quantify the likelihood that observed differences occurred by chance)
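A minimal sketch of this analysis using SciPy's paired t-test over per-item accuracy scores aligned between the control and scoped models (the dict aggregation is just a convenience):

```python
from scipy.stats import ttest_rel

def compare_conditions(control_scores, treatment_scores, alpha=0.05):
    """control_scores / treatment_scores: per-item (or per-split) accuracies for
    the unmodified and scoped models, aligned so each pair covers the same items."""
    t_stat, p_value = ttest_rel(control_scores, treatment_scores)
    deltas = [t - c for c, t in zip(control_scores, treatment_scores)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }
```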
Secondary Metrics
In addition to the performance metrics, we track additional metrics that provide context for the respective approaches:
Rejection-Based Metrics
Accuracy: Overall proportion of correctly classified examples (both in-domain accepted and out-of-domain rejected)
Precision: Proportion of accepted examples that are truly in-domain
Recall: Proportion of in-domain examples that are correctly accepted
F1 Score: This provides a single value that captures the predictive power of our classification techniques
Comparative Analysis
In-Domain Accuracy Delta: Difference between test and control accuracy for in-domain questions
Out-Of-Domain Accuracy Delta: Difference between test and control accuracy for out-of-domain questions
Inference Efficiency Impact
Time Taken: We will track the time taken for the models to perform inference with each of the techniques. This provides a clearer picture of the practical overhead associated with each method.
FLOPs Tracking: Similarly, the number of FLOPs used during training and inference provides a baseline for the costs each technique imposes on an organization weighing deployment decisions.
Evaluation Process
We first parse model outputs using either text-based or logits-based parsers
We then compare model outputs both against ground truth answers and against the answers from an unmodified base model
We will calculate the above metrics given the outputs of the models
We perform statistical significance testing using paired t-tests between the test and control conditions
Our steered and fine-tuned models output answers for multiple-choice questions, allowing direct performance assessment. Classifiers either accept or reject prompts; a rejection is scored as the (incorrect) answer 'E'.
This methodology enables us to evaluate both the model’s ability to answer correctly within its domain and its capability to appropriately reject out-of-domain queries. Further, it provides us with statistical confidence in the veracity of our findings.
Successful Results
Robust Domain Scope Enforcement
We tentatively aim to achieve >90% decrease in out-of-domain accuracy while maintaining >90% of in-domain accuracy.
Beyond delivering quality performance at the boundary of one domain, we aim to develop methods that are robust across multiple domains, datasets, and types of models. Specifically, we look to achieve the same targets (>95% rejection of out-of-domain tasks and >90% maintenance of in-domain accuracy) across domains and models.
Technique Comparison and Refinement
We look to evaluate and rank the techniques mentioned earlier based on their performance in scoping models. This will provide actionable insights into the most promising methods.
Assumptions
Scoping More Capable Models is Easier and Will Provide Superior Performance to Domain-Specific Models: There are already models that are trained to perform well in a given domain. For example, BioBERT was specifically trained on biomedical data, leading to a model that performed well in that domain. We assume that scoping a more capable general-purpose model is easier than building such bespoke models and can match or exceed their in-domain performance.
Meaningful Internal Representations of A Given Domain Exist: Recent advances in interpretability work and our own results from linear probes point strongly towards this being the case.
Benchmark Datasets Reasonably Represent Relevant Domains: Although we cannot guarantee that this assumption is 100% valid, we will utilize multiple established test suites created by domain experts and highly cited in the literature.
Comparability of Methods Given Model Family/Size Used: We will be using a variety of different series of models, including Llama, Gemma, Phi, and Mistral. While we cannot guarantee that these methods will similarly hold for closed-source models, we believe that our selection will provide sufficient foundations to demonstrate relevance and applicability.
Measuring Uncertainty
Across our replicates, our measurement of the system's intentional performance degradation outside the chosen scope is accompanied by a paired t-test and reported p-values to ensure the effect is not due to chance (a particular risk with small sample sizes).
Similarly, we will be conducting these experiments across a diverse range of model families, model sizes, and datasets, which should provide sufficient samples to find statistically significant results.
Potential Limitations
Beyond the assumptions listed earlier, we further believe that there are a few key limitations and challenges in the goals we have and our approach. The main limitations that we currently see are:
Adversarial Robustness Ceiling: While it may be possible for us to achieve considerable progress towards rejecting out-of-domain examples, we will likely not be able to achieve 100% robustness.
Additional Computational Overhead: The addition of these techniques will result in an increase in the computational requirements for the models. In deployments of these models by large organizations, that may be a factor that detracts from the desirability of using these methods.
Domain Boundary Definitions: There are likely edge cases that sit in a grey area between in-domain and out-of-domain. While we believe that we can make progress towards creating robust systems, there may be edge cases that are challenging to address.
Future Opportunities
Product Opportunities
We mentioned inference-time guardrail products in our section on commercial guardrails (OWASP vendor landscape). We believe many organizations want to rapidly deploy safe AI-enabled applications but lack the time or resources to develop fine-tuning pipelines for bespoke models, and would be better served by new "allow-list" methods than by topic moderation or toxicity and jailbreak detection. This opens an opportunity: systematizing an end-to-end pipeline of efficient classifiers and scopers, whose training and deployment would be assisted by an orchestrator agent.
Academic Opportunities
Alongside the opportunities to create products with great potential, there are also opportunities for research that can stem from this underlying approach. We believe the groundwork for scope guardrail combinations can show methods and findings on top of which deeper AI architecture developments can be based (e.g., our section on system prompt scoping via MoE router freezing). We also see an opportunity to make scope guardrails an inherent foundation in Multi-Agentic Systems and a first-class component of agentic protocols.
Conclusions
Our literature review reveals fundamental limitations in current approaches to LLM scoping. Prompt-based methods consistently fail under adversarial conditions, while post-training techniques like RLHF and DPO struggle with reversibility and vulnerability to novel attack vectors. Even sophisticated approaches like circuit breakers and constitutional classifiers, while promising, face challenges with false positives and evolving evasion techniques.
Our experimental results on Llama 3.2 models confirm these limitations while highlighting potential paths forward. Prompt-based scoping proved largely ineffective, often degrading performance without meaningful boundary enforcement. However, latent space classifiers demonstrated substantial promise, achieving >95% rejection of out-of-domain queries while maintaining reasonable in-domain performance, particularly on the 3B model. This aligns with findings that internal model representations contain more robust signals for domain classification than surface-level instructions.
The proposed comprehensive research plan addresses current gaps by testing across diverse model architectures, scales, and domains with rigorous statistical validation. However, significant challenges remain: in current LLM architectures, adversarial robustness may have fundamental ceilings, computational overhead could limit practical deployment, and domain boundaries often blur in real-world scenarios. Critically, industry adoption faces additional hurdles around versatility and deployability—methods must work reliably across diverse organizational contexts, integrate seamlessly with existing infrastructure, and maintain performance under varied real-world conditions that laboratory settings cannot fully capture.
Despite these limitations, the convergent evidence supporting internal representation methods suggests that reliable LLM scoping is achievable, for example, through restrictive Mixture-of-Experts approaches we’ve highlighted. Success will require not just technical breakthroughs but also practical solutions for model-agnostic deployment, standardized integration protocols, and robust performance across the heterogeneous environments where organizations actually deploy AI systems.
Acknowledgements
We would like to thank Apart Research and lambda.ai for hosting and sponsoring the hackathon that initiated this project, as well as Adriano who was also a member of the original hackathon team. Apart Labs assisted in supporting the research, without which this work would not have been possible. Jacob Arbeid provided insightful feedback on our initial draft.
Prompt Injection is when new instructions are added in prompts (either via user input or from a source/document also included in an interaction), causing the model to change direction from the expected instructions. OWASP also explains prompt injection in more detail.
A "latent space" is a mathematical representation inside a transformer model: after the input text is split into tokens that are each mapped to a unique number, the model converts the overall text into a sequence of "directions" (just like XY on a graph), with many layers of triggers activating on combinations of those directions.
A "block" is a transformer block, one of the foundational building pieces of the Transformer architecture. Its two main components are (1) the multi-head attention mechanism and (2) a fully connected feed-forward network. Other important pieces include residual connections and layer normalization. For a more complete explanation, see the original paper "Attention is All You Need" or a useful explainer.
Scoping LLMs
Emile Delcourt, David Baek, Adriano Hernandez, Erik Nordby with advising from Apart Lab Studio
Introduction & Problem Statement
Helpful, Harmless, and Honest (”HHH”, Askell 2021) is a framework for aligning large language models (LLMs) with human values and expectations. In this context, “helpful” means the model strives to assist users in achieving their legitimate goals, providing relevant information and useful responses. “Harmless” refers to avoiding generating content that could cause damage, such as instructions for illegal activities, harmful misinformation, or content that perpetuates bias. “Honest” emphasizes transparency about the model’s limitations and uncertainties—acknowledging when it doesn’t know something, avoiding fabricated information, and clearly distinguishing between facts and opinions. This framework serves as both a design principle for AI developers and an evaluation criterion for assessing how well LLMs balance being maximally useful to users while minimizing potential harms and maintaining truthfulness in their outputs.
However, since computation in a transformer considers every part of the input prompts’ “context window,”[1] generative AI deployments (applications) struggle to apply any effective limits beyond “egregious” safety/toxicity backstops (the few things for which there is universal agreement, Bengio et al 2024, Ji et al 2023, Buyl et al 2025, Kumar et al 2021). Any fragment of text can influence attention enough to supersede earlier instructions, subverting the purposes of the application (see “state of jailbreak” section below). Typically, a preamble is designated as a “system prompt” with enforcement expectations, but applications still cannot establish any guarantees to prevent abuse in open-ended user interactions. Even when foundation models are intentionally trained to be helpful, harmless, and honest, we argue that responsible AI systems must also “stay in their lane”—and need to also be “Honed” (HHH+H).
In this article, we address how important HHH+H is, specifically to achieve effective scoping boundaries, why solutions have not achieved it, and the requirements to get there.
General Challenge and Goals
Considerable efforts have been made to make current LLMs generally safe from egregious risks, such as toxicity or the distribution of dangerous information. However, significantly fewer safeguards exist for constraining the models into specific applications. This leaves application-specific language models vulnerable to being manipulated away from their initial purpose, leaving considerable risks for organizations that use LLM-driven applications.
These considerable risks have hindered the adoption of language models, particularly in high-stakes domains (e.g., law, medicine, and finance) and customer-facing applications, which are vulnerable to attack. Not only would robust scoping methods lead to increased safety of systems, but they would also allow for a considerable increase in the number of organizations leveraging LLMs in their workflows.
This project aims to significantly mitigate the above challenges by conducting a thorough assessment of current robustness techniques when applied to scoping models. The impact of this is twofold. First, these domain-specific guardrails will provide additional assurance for organizations that their LLM-driven applications will not be abused or misaligned with their goals. This additional assurance increases client trust and drives adoption across critical sectors. Secondly, this will provide additional context to better understand the limitations of existing robustness strategies when applied from a posture of intense denial. This will help the broader AI Safety field by exploring techniques which are at the extreme of the Helpfulness/Harmlessness Tradeoff
Specific Failure Modes and Examples
Below are just a few examples of risks that organizations currently face, which may be mitigated through robust model scoping techniques. This is by no means a comprehensive list or taxonomy of the myriad risks that are possible (see NIST AI 100
Corporate Information Leakage and IT Infrastructure Damage
As these models are placed into “agentic” scaffolding and are equipped with access to databases, they will likely have access to sensitive information. Attackers may learn how to exfiltrate this data by manipulating the language models to expose sensitive information stored in those databases (for instance, through injection via an authorized user). This has already been seen in Slack bots, which have been shown to be prone to leaking information from private channels. If models are given the ability to modify resources in an organization’s IT infrastructure, those risks are further increased.
Illustrated Risks:
Exfiltration of corporate information
Attacks on infrastructure through jailbreaking “agentic” models
Unauthorized/Malicious Usage Through Purposeful Scope Drift
By steering the models away from their initial purpose, attackers may be able to leverage models for uses outside of their initial intent. One low-stakes example comes from Chevrolet of Watsonville in California. They were paying for ChatGPT+ and were using the most advanced model at the time for their customer-facing chatbot. Unfortunately for Chevrolet of Watsonville, their system prompt scope was easily bypassed. Attackers jailbroke the system and were able to use their expensive API access for completely unrelated requests like writing Python Scripts.
Illustrated Risks:
Wasted computational resources
Increased costs for organizations
Brand Damage and Misrepresentation
If language models are not carefully adapted to represent a brand well, it is likely that they may begin to mention off-task topics, misrepresent the company, or relate false information. This is especially true for niche domains where the model may not have been trained on substantial amounts of data. This can lead to the brand’s image being degraded, seen as less trustworthy, and confusion for customers. A notable example of this comes from the NYC MyCity Chatbot ,which was providing false information about tenant rights and encouraging illegal behavior.
Illustrated Risks:
Off-topic and confusing responses
Incorrect information being relayed to customers (i.e., hallucinations)
The above examples are only a handful of the myriad risks associated with organizations leveraging language models in automated workflows. Not only can attackers manipulate these systems with relative ease, but even well-intentioned users may accidentally cause hallucinations or receive unexpected outputs. By restricting AI systems to operate within their intended domains, those risks can be significantly mitigated, which in turn would reduce impediments and incidents that can jeopardize beneficial applications and their rollouts.
Existing Work & Current Limitations
Prompt injection and jailbreaks[2] often subvert post-training techniques and system prompts, bypassing refusals with unsafe requests. Using Instruction Hierarchy or Circuit Breakers (as shown below), monolithic models struggle to recognize and reject harmful or out-of-scope requests in the context that an application expects.
Extensive research has demonstrated that such fine-tuned models have significant limitations in two key aspects: (a) Reversibility: these models can often be easily manipulated to recover harmful knowledge encoded in the original pre-trained model, and (b) Vulnerability: they lack robustness against adversarial attacks designed to elicit malicious information from the model (Casper, 2023). These challenges highlight the difficulty of safeguarding LLMs against adversarial users and underscore the limitations of current fine-tuning strategies.
From Instruction Tuning to Instruction Hierarchy
First, language models were trained to extrapolate coherent text (such as grammar, semantics, translation and others, depending on the input), and proved able to generalize to quite a wide variety of tasks (Radford et al, 2019). At the time, zero-shot prompts did not answer questions: models often enumerated other, similar questions, and often required that inputs demonstrate explicitly the pattern (such as question, answer, question, answer, question, blank)
Next, Instruction Tuning began by explicitly training Large Language Models (LLMs) on instruction-output pairs (Wei, 2021 and Zhang, 2023). The primary goal was to shape models’ default responses to accurately follow given instructions and perform turn-taking with the user, even on a single keyword.
Building upon Instruction Tuning, system prompts surfaced as a widely adopted method to customize and augment LLM behavior post-training. These prompts act as high-level instructions that guide the model’s behavior in subsequent user queries without requiring fine-tuning (Wei, 2021). Major providers have integrated system prompts as standard features, allowing “implicit expectations” to be concealed from the conversation and enabling designers of such assistants to control aspects like tone, expertise, and output format. This approach has its weaknesses, as adversarial inputs rapidly surfaced that subvert its intent (Perez et al, 2022). System prompts can be crafted manually or reinforced through optimization and evolution techniques such as SMEA (Zou, 2024), AutoPrompt (Shin et al, 2020) and EvoPrompt (Guo et al, 2023). This enables alignment at the application level.
To enhance safety and control, more sophisticated versions of instruction tuning implement Instruction Hierarchies (Wallace et al, 2024). In this approach, training makes system instructions deliberately override any conflicting user instructions. This reinforces the effectiveness of system prompts, creating a layered defense against potentially harmful outputs.
Despite these advancements, these methods rely solely on natural language instructions which leaves safety at the mercy of the model’s interpretation. Further, since system prompts do not change the model at all, unaligned models can theoretically circumvent these safeguards. So, while these methods offer an easy way to put safeguards on models, they must be complemented by other techniques to ensure safety.
The State of Jailbreak Resistance & Major Safety
In this section, we acknowledge the techniques that have addressed the issue of coaxing behaviors from a model that are “universally bad”.
General Safety Tuning
The same impressive capabilities that Large Language Models (LLMs) carry for highly diverse applications bring with them significant safety challenges that remain unresolved. Over the last few years, safety fine tuning techniques have included:
Supervised fine tuning (SFT) with safety datasets (positive/negative examples)
Reinforcement Learning from Human Feedback (RLHF: Ouyang et al, 2022)
Direct Preference Optimization (DPO)
Constitutional AI (aka. RL from AI Feedback) leveraging set guidelines
Despite these, models can still generate harmful, biased, or dangerous content when strategically prompted (Perez et al., 2022). Adversarial users frequently employ techniques such as prompt injection—where malicious instructions are embedded within seemingly benign requests—and jailbreaking, which uses carefully crafted inputs to circumvent safety guardrails (Wei et al., 2023). As LLMs become increasingly deployed in sensitive domains like healthcare, legal advice, and financial services, the consequences of such vulnerabilities become more severe, highlighting the urgency of robust safety solutions beyond current approaches.
Circuit Breakers
Circuit breakers, another universal safety training option, are an automated safety mechanism that triggers when potentially harmful content is detected during generation. Unlike refusal or adversarial training, circuit breakers directly control internal model representations using techniques like “Representation Rerouting” (RR), which remaps harmful representation pathways to orthogonal space. Zou et al. (2024) demonstrated that circuit breakers maintain benchmark performance while reducing harmful outputs by up to two orders of magnitude across text-only LLMs, multimodal systems, and AI agents.
Though promising for making models intrinsically safer without compromising capabilities, circuit breakers still face challenges: false positives restrict legitimate uses, and false negatives miss novel harmful outputs, since effectiveness depends on how comprehensively harmful patterns are covered and on users’ evolving evasion techniques. Tuning is not real-time, but low-rank adapters (LoRA) are used in the implementation to avoid training the entire model and to reduce cost. This technique can potentially contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
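As a rough sketch of the Representation Rerouting idea, simplified from Zou et al. (2024): the layer choice, loss weighting, and helper names below are illustrative assumptions rather than the authors’ exact recipe.

```python
# Simplified sketch of a Representation Rerouting (RR)-style training loss.
# `model` is a LoRA-wrapped causal LM; `frozen` is the original frozen model.
# The single-layer choice and coefficients are illustrative assumptions.
import torch
import torch.nn.functional as F

LAYER = 16  # which residual-stream layer to control (assumption)

def hidden(m, batch, layer=LAYER):
    out = m(**batch, output_hidden_states=True)
    return out.hidden_states[layer]

def rr_loss(model, frozen, harmful_batch, retain_batch, alpha=1.0):
    with torch.no_grad():
        h_harm_ref = hidden(frozen, harmful_batch)
        h_retain_ref = hidden(frozen, retain_batch)
    h_harm = hidden(model, harmful_batch)
    h_retain = hidden(model, retain_batch)
    # Push harmful representations away from (toward orthogonal to) the originals...
    reroute = F.relu(F.cosine_similarity(h_harm, h_harm_ref, dim=-1)).mean()
    # ...while keeping benign (retain-set) representations close to unchanged.
    retain = (h_retain - h_retain_ref).norm(dim=-1).mean()
    return reroute + alpha * retain
```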
Constitutional classifiers
Still in the general jailbreak/universal safety space, Anthropic’s approach in January (Sharma et al, 2025) showed that language models serving as classifiers at the input and output could be trained on a “constitution” dataset to detect prohibited interactions. While there are similarities to circuit breakers, circuit breakers are internal to the model and change its representations, whereas constitutional classifiers operate on the input and output for detection, triggering refusal and dispensing with actual model tuning.
Against jailbreaks drawn from a dataset built on 3,000 hours of red teaming, this method reduced the attack success rate to 0.25-1%, all but settling the jailbreak issue. This technique can potentially also contribute to scoping; we discuss it further (with a recent study) in our section on scoping below.
Other commercial guardrails (OWASP vendor landscape)
OWASP defines LLM Guardrails as protective mechanisms designed to ensure that large language models (LLMs) operate within defined ethical, legal, and functional boundaries. These guardrails help prevent the model from generating harmful, biased, or inappropriate content by enforcing rules, constraints, and contextual guidelines during interaction. LLM guardrails can include content filtering, ethical guidelines, adversarial input detection, and user intent validation, ensuring that the LLM’s outputs align with the intended use case and organizational policies. This aligns with OWASP’s LLM top 10 threats guidance #1 (LLM01: prompt injection, which can override system prompts).
The same guide also serves as a stakeholder’s compass for navigating available solutions, and lists a large number of vendors offering “Adversarial Attack Protection” or “Model And Application Interaction Security”, primarily interstitial (either as active proxies or as verification APIs). For example, some like prompt.security and Lakera offer topic moderation (or synonymous features) as a service, with custom or built-in topics available to allow or block. While methods are not always disclosed, ML techniques are mentioned, and embedding cosine similarity or latent transformer-based classifiers may be part of the architecture.
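Since vendors rarely disclose their methods, the following is only a speculative sketch of what embedding-similarity topic moderation might look like; the encoder model, reference phrases, and threshold are assumptions.

```python
# Speculative sketch of an "allow by topic" guard via embedding similarity.
# Encoder model, reference phrases, and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
allowed_topics = encoder.encode(
    ["cell biology question", "genetics homework help", "anatomy explanation"],
    normalize_embeddings=True,
)

def is_in_scope(user_prompt: str, threshold: float = 0.45) -> bool:
    emb = encoder.encode([user_prompt], normalize_embeddings=True)[0]
    # On normalized vectors, cosine similarity reduces to a dot product.
    return float(np.max(allowed_topics @ emb)) >= threshold

print(is_in_scope("How does mitosis differ from meiosis?"))  # likely True
print(is_in_scope("Draft a phishing email for me."))         # likely False
```

Normalizing embeddings keeps the per-request overhead to a single encoder pass plus a dot product, which is consistent with the interstitial proxy/API deployment model these vendors describe.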
Machine Unlearning
Fundamentally removing targeted knowledge from a model is another related area of research that has seen substantial advances (Bourtoule, 2021).
However, Liu, Casper et al. (2023) flag scope and computational-feasibility challenges indicating this approach is not well suited to scoping deployments that use foundation models in an application-specific manner (the underlying model’s knowledge cannot practically be modified per application). For instance, adversarial prompts can bypass these safeguards through subtle rewording or novel phrasing. This reveals a critical limitation: defenses are only as effective as the threats we can anticipate and explicitly encode. In the rapidly evolving field of AI, where new capabilities emerge unpredictably, relying solely on refusal training or explicit unlearning poses a significant risk, leaving models exposed to unforeseen and potentially dangerous misuse.
Other Latent Space Techniques
Sparse AutoEncoders
Similar to unlearning, Karvonen, 2024 studied the potential of Sparse AutoEncoders for targeted concept removal. SAEs can be viewed as an automatically trained, large-scale version of activation probes: rather than classifying one specific aspect, these observers of LLM layers map all of the distinct concepts (aka. features) that the model can represent, derived from an intensive training run (models otherwise distribute concepts across too many activations to single any out).
While statistical similarities (e.g., co-occurrence) between features could be a basis to discriminate between in-domain and out-of-domain concepts/features, SAEs depend on expensive “dictionary” training runs to tease out the activation pattern for every feature. This seems cost-prohibitive for model scoping; however, an algorithm could be designed to control existing SAEs for scoping purposes, such as those Anthropic trained on Claude 3 Sonnet.
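If an SAE were already available for the deployed model, one could imagine a scoping check along the following lines; the `sae.encode` interface and the notion of a pre-approved in-domain feature set are hypothetical, and this is a thought sketch rather than an evaluated method.

```python
# Hypothetical sketch: flag prompts whose active SAE features fall mostly
# outside a pre-approved "in-domain" feature set. The sae.encode interface
# (per-token feature activations) and the feature sets are assumptions.
import torch

def out_of_domain_ratio(sae, residual_acts, in_domain_features, eps=1e-3):
    feats = sae.encode(residual_acts)          # [tokens, n_features]
    active = (feats.abs() > eps).any(dim=0)    # features active anywhere in prompt
    active_ids = set(torch.nonzero(active).flatten().tolist())
    if not active_ids:
        return 0.0
    outside = active_ids - in_domain_features
    return len(outside) / len(active_ids)

# A request could be refused when, say, more than half of its active features
# fall outside the allowed set (the threshold is also an assumption).
```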
Classification and Probes
Since language models encode so much information within their latent space[3], the activations produced during inference can also be used to check for harmful triggers: Burns (2022) and Nanda (2023) demonstrated that simple linear classifiers (i.e., “probes”) can reliably extract useful properties from the intermediate activations of large language models and other neural networks.
For example, probes have been used to extract a seemingly linear representation of the board state in the game of Othello (Nanda, 2023). Further, it has been shown that these probes can be used to reveal syntactic structures and detect factual knowledge. Despite the overall complexity of neural networks, probes—often just linear models—suggest that much of this information is encoded in remarkably accessible ways within the latent space. (Alain & Bengio, 2016; Tenney et al., 2019).
Most importantly, activation probes can catch prompt injections (even indirect) causing Task Drift (Abdelnabi, 2024).
Together, these works underscore the utility of probe-based methods for interpreting and leveraging LLM latent space for scoping purposes, and we discuss it further (including a recent study) in our next steps below.
Steering Vectors (Activation Engineering)
With similarities to latent-space probes/classifiers, the direction of activations can also be used as a corrective basis to zero out a harmful component during inference. Turner et al. 2024 and Panickssery et al. 2024 developed activation steering (or activation engineering) to influence LLM behavior by modifying model activations during inference, biasing the model toward treating a specified concept with a specified valence. The method averages the difference in residual-stream activations between sets of positive and negative examples of a particular behavior, and adding that direction at inference shifts model outputs accordingly.
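A minimal sketch of this difference-of-means construction with a forward hook follows; the layer index, coefficient, and the pre-existing `model`, `tokenizer`, and example prompt lists are assumptions, and practical implementations also tune the injection strength.

```python
# Minimal sketch of contrastive activation steering via a forward hook.
# Assumes a Llama-style `model`, `tokenizer`, and lists of example prompts.
import torch

LAYER, COEFF = 14, 4.0  # illustrative choices

def mean_resid(model, tokenizer, prompts, layer=LAYER):
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        h = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(h[0, -1])            # residual stream at the last token
    return torch.stack(acts).mean(dim=0)

# Steering vector = mean(positive-example activations) - mean(negative ones)
steer = mean_resid(model, tokenizer, positive_prompts) - \
        mean_resid(model, tokenizer, negative_prompts)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    return (output[0] + COEFF * steer,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
# ... generate as usual, then remove the hook with handle.remove().
```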
A note on Mixture-of-Experts models
Contrary to what the name suggests, LLMs architected with MoE are not composed of domain experts: MoE is a sparsity technique in which a router activates only a small portion of the overall feed-forward network for each token, rather than the whole network as in a dense LLM. Earlier this year, we began exploring the possibility that this routing could be made invariant for all tokens after the initial system prompt boundary, to prevent capability drift (a form of quasi-unlearning).
Due to insufficient testing, it is unclear at this time whether this would degrade in-domain performance or achieve a meaningful reduction in out-of-domain capabilities. Since the effect would primarily limit capabilities rather than produce a specific out-of-domain behavior, ensuring an actual refusal would require combining this technique with another, such as circuit breakers. Further evaluation is potentially warranted.
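To make the idea concrete, here is a purely conceptual, untested sketch; the `router` interface is hypothetical, and real MoE implementations (e.g., Mixtral in transformers) expose routing differently.

```python
# Conceptual sketch of freezing MoE routing to the experts selected while
# processing the system prompt, for all subsequent tokens. The router
# interface (logits of shape [tokens, n_experts]) is a hypothetical stand-in.
import torch

class FrozenRouter(torch.nn.Module):
    def __init__(self, router, top_k=2):
        super().__init__()
        self.router = router          # original gating network
        self.top_k = top_k
        self.frozen_experts = None    # expert ids captured on the system prompt

    def forward(self, hidden_states):
        logits = self.router(hidden_states)          # [tokens, n_experts]
        if self.frozen_experts is None:
            # First call: treat these tokens as the system prompt and remember
            # the union of experts they activate.
            self.frozen_experts = logits.topk(self.top_k, dim=-1).indices.unique()
        # Mask every expert that was not used by the system prompt.
        mask = torch.full_like(logits, float("-inf"))
        mask[:, self.frozen_experts] = 0.0
        return logits + mask
```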
Principles of Least Privilege, Default Deny, and OAuth2.0 scopes
Scoping the functional limits of a system can take inspiration from the field of information security, in which a core foundational concept closely relates to our goal: strict control over information sharing and access. This paradigm manifests in concrete principles in security:
Need-to-know (CISSP Domain 7.4): this is a general rule: unless an individual has a specific need for it, information is not shared and access is not granted. For example, in the military and intelligence community, information is compartmentalized not by hierarchical level but by “reading in” only people with specific authorization for that information.
The Principle of Least Privilege applies need-to-know to access controls by granting only the minimum permissions a user needs to accomplish their responsibilities; this also minimizes the impact of compromised credentials (whose likelihood can never be reduced to zero).
Denying Access by Default (i.e. unless explicitly allowed and verified) reduces attack surfaces for networks and applications. While some public applications may need to be broadly available, they may not need exposure to all geographies, and certainly don’t need to expose all network ports.
In the OAuth2.0 protocol, granting access is done with a set list of “scopes”: strict areas of functionality beyond which requests will not be authorized (e.g., access but not modify webhooks).
Together, these paradigms can be used as part of a “zero trust” strategy (NIST, 2020) which is vital in today’s environment rife with cyber threats: instead of security being treated like whack-a-mole where unauthorized activity is blacklisted or treated reactively, organizations and software makers can provide assurance of security by blocking access by default and allowing only narrow access to limited data for authenticated actors.
What does this mean for LLM applications? Rather than fighting unsafe behavior case by case (there is no universal agreement on the full list of unacceptable uses), the lessons from security suggest that the specific mission and purpose of an organization should drive strict behavioral limits on any AI service it builds, fulfilling the promise of its intent while preventing the perils of unbounded interactions that come with the underlying foundation model(s). Today’s models are incapable of resisting conversations that drift from system prompts as provided, even with the state of the art in instruction hierarchy (e.g., Geng et al., 2025), requiring new solutions. There is an opportunity for providers to restrict scope directly within inference API tokens, with guardrails trained at token creation and edge cases identified and managed explicitly.
Recent Work in Scoping Language Models
Efforts similar to this study, in ”Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024), evaluated different techniques for scoping LLMs. Covering supervised fine-tuning (SFT), Direct Preference Optimization (DPO), internal representation probing, and Circuit Breakers (CB), their findings suggested that circuit-breaker-based methods, when combined with SFT, offered robustness against adversarial prompts and better generalization for out-of-distribution rejection compared to simple SFT or DPO. Probes also showed strength in rejection, but with a tendency toward higher false refusals.
However, their study highlighted several areas which could be further explored:
Further tests could be done across varying model sizes to check the robustness and scalability of each technique. Some techniques may perform exceedingly well at smaller model sizes but break down as the models grow. Conversely, some techniques may only work once models have achieved some threshold of capabilities.
More exploration could be done towards understanding some of the failure modes which arose. This could help create guardrails which better handle edge cases or complex situations.
Further tests could be done on more complex datasets. While the traditional NLP tasks tested provide a solid baseline, there are significant edge cases and failure modes that would arise in real-world use. So, crafting and testing the techniques on more nuanced datasets would provide a better understanding of those challenges.
Preliminary experiments
We have already implemented and tested several scoping approaches, mirroring or providing analogous methods to those evaluated by Yunis et al. This was done in 3 stages.
Initial Experiment (Apart Hackathon)
During the initial hackathon, we leveraged simple fine-tuning and prompt-based techniques in a Google Colab notebook. We used the HuggingFace dataset camel-ai/biology as the in-domain dataset and NeelNanda/pile-10k as the out-of-domain dataset. This provided initial evidence that these techniques offer some level of scoping capability.
Our second experiment
For this blog post, we expanded those implementations to also include logistic regression probes and moved from the previously constructed dataset to simply use categories from MMLU. Our methodology was:
Baseline (Default Instruct Model): The performance of the off-the-shelf instruction-tuned models served as our baseline.
Fine-Tuning: We first randomly zero out 10% of the model parameters, and then fine-tune the model using a standard next-token prediction loss combined with an L2 penalty on the gate projection output of a specific MLP, where the L2 penalty is only applied to the out-of-domain dataset. This design is intended to make a specific MLP active only when faced with the in-domain dataset. We use the MMLU STEM category as the in-domain data and all other categories from the MMLU dataset as the out-of-domain data.
Prompt-Based Techniques: We explored several prompting strategies applied at inference time:
Hardened Prompts: The model is explicitly instructed in its system prompt to only answer questions in the relevant domain and to refuse all others
Two-Step Model Classification: The model is first asked if the user’s question is in the appropriate domain. If the model responds with yes, then the question is answered. If, on the other hand, the model responds with no, then the prompt is rejected
Logistic Regression Classifier: conceptually similar to the “Probes” used by Yunis et al., we trained simple classifiers to predict the query type as being STEM or non-STEM based on the model’s internal activations (a condensed sketch appears after this list):
Training: Logistic regression classifiers were trained on the hidden state activations extracted from both the Multi-Head Attention and Multilayer Perceptron (MLP) layers at various depths (specifically, every 5th transformer block[4]). The training data consisted of activations generated from processing prompts from the in-domain (MMLU STEM) and out-of-domain (other MMLU categories) splits.
Selection: For each base model (1B and 3B), the classifier achieving the highest validation accuracy across all tested layers was selected.
Application: During inference, the selected classifier analyzed the activations produced by the user prompt. If the classifier predicted “STEM,” the model proceeded to generate an answer. If it predicted “non-STEM,” a refusal was returned.
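Below is a condensed sketch of this classifier pipeline. It simplifies our actual code: it reads a single hidden-state layer rather than attention and MLP activations at every 5th block, and the `prompts`/`labels` variables are placeholders for the MMLU-derived training data.

```python
# Condensed sketch of the latent-space classifier: extract hidden states,
# train a logistic regression probe, and gate generation on its prediction.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
LAYER = 10  # in practice, chosen by validation accuracy across layers

def last_token_state(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = lm(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hs[0, -1]

# prompts: list[str]; labels: 1 for STEM (in-domain), 0 otherwise (placeholders)
X = torch.stack([last_token_state(p) for p in prompts]).float().numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

def answer_or_refuse(prompt: str) -> str:
    x = last_token_state(prompt).float().numpy()[None, :]
    if probe.predict(x)[0] == 1:
        out = lm.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
        return tok.decode(out[0], skip_special_tokens=True)
    return "Refused: out-of-scope request."
```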
Metrics and Processes
Primary Metric: Accuracy on MMLU category splits (Stem, Social Sciences, Humanities, and Other).
Answer Extraction: We directly compared the model’s output logits for the tokens corresponding to ‘A’, ‘B’, ‘C’, ‘D’, and ‘E’, selecting the choice with the highest log-probability. ‘E’ was used to represent a refusal, since every MMLU question has exactly four choices; a brief sketch of this step follows.
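The sketch below reuses the `lm` and `tok` objects from the previous sketch and assumes each answer letter maps to a single token; it illustrates the idea rather than our exact parsing code.

```python
# Sketch of logit-based answer extraction over the choices A-E.
import torch

def pick_choice(prompt: str) -> str:
    choices = ["A", "B", "C", "D", "E"]
    ids = [tok.encode(c, add_special_tokens=False)[-1] for c in choices]
    with torch.no_grad():
        logits = lm(**tok(prompt, return_tensors="pt")).logits[0, -1]
    # Compare only the logits of the five candidate answer tokens.
    return choices[int(torch.argmax(logits[ids]))]
```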
Experiment Results
The baseline Llama 3.2 models (1B and 3B) achieved 33% and 48% accuracy on the STEM tasks, with comparable accuracy on non-STEM tasks.
Prompt-based methods proved ineffective in the scoping of the models. They either resulted in little to no change in performance or caused a major degradation in performance across the board. In either case, scoping was not achieved.
Fine-tuning provided a noticeable decrease in accuracy on non-STEM prompts while leaving STEM accuracy largely intact for the 1B model: less than a 1% decrease in STEM accuracy alongside a >5% decrease across all non-STEM categories.
Latent-space classification proved quite successful at selectively reducing performance on non-STEM questions. Across both the 3B and 1B models, the highest accuracy achieved on non-STEM questions was only 5%. For humanities and social sciences, accuracy was <1% for both the 1B and 3B models. However, this also came with a noticeable drop in performance on STEM questions: a decrease of 6.5% for the 3B and 18% for the 1B model. Charts summarizing these findings are available below.
Below are charts visualizing the results. The corresponding tables can be found in the Appendix.
We’ve organized this work in Python files for better maintainability and modularity than a notebook. Each method now subclasses a shared base model, which contains various utilities and wrappers for Hugging Face models. Furthermore, we developed utilities that can parse various datasets and modularly test different techniques across them.
We also implemented Circuit Breakers (Zou et al., 2024):
Training: Using Low-Rank Adaptation, the model is fine-tuned to maintain its original representation of in-domain texts. For out-of-domain texts, it is trained to use internal representations that diverge from the original representation.
Application: During runtime, the internal representations of the model are compared to the normal representations. If they diverge above a given threshold, the request is rejected.
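A simplified sketch of that runtime check follows; the layer, threshold, and reference vector here are placeholders for values fitted during training, not our exact implementation.

```python
# Simplified sketch of the runtime rejection check: compare the current
# hidden representation against a stored in-domain reference direction.
import torch
import torch.nn.functional as F

LAYER, THRESHOLD = 12, 0.35  # placeholders

def diverges(model, tokenizer, prompt, reference_mean):
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        h = model(**ids, output_hidden_states=True).hidden_states[LAYER][0, -1]
    # Low cosine similarity to the in-domain reference means the representation
    # has been rerouted away from normal, so the request is rejected.
    return F.cosine_similarity(h, reference_mean, dim=0).item() < THRESHOLD
```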
The code for this experimental framework can be found here: https://github.com/nordbyerik/scoped-llm-public
Preliminary conclusions
The results from our experiments on Llama 3.2 1B and Llama 3.2 3B align with the findings from ”Reducing the Scope of Language Models with Circuit Breakers” (Yunis et al, 2024). While we did not test all of the techniques that were tested by Yunis et al., the results we did gather showed considerable similarities to those of the prior paper. Accordingly, we believe that these dynamics could be robust across model sizes and across different families of models.
More specifically, we found prompt-based scoping methods to be fragile and largely ineffective for the Llama 3.2 models, often degrading performance without reliably enforcing boundaries. Standard fine-tuning proved more effective, though the decreases in accuracy for non-STEM questions were modest.
Leveraging internal model states via latent-space classifiers showed positive results. These simple classifiers accurately distinguished STEM from non-STEM prompts (>90% accuracy on the best layers) and drastically reduced out-of-scope generation while minimally impacting target-task performance, especially on the larger 3B model. This strongly aligns with the Yunis et al. findings that internal methods like probes and Circuit Breakers (CB) offer more robust control than surface-level instructions.
The recurrence of (1) the weakness of prompting and (2) the strength of internal-state manipulation suggests these may be generalizable insights. While we didn’t directly replicate Circuit Breakers or Direct Preference Optimization, the effectiveness of our classifiers reinforces the potential of internal mechanisms for achieving reliable LLM scoping.
The primary feedback we have received revolves largely around the generalizability of these methods. Our experimental design was small compared to what would be needed to test these techniques rigorously. We address this feedback below by planning a diversification of data and models, along with a more robust structure for our techniques and uncertainty measurements.
Our research plan
Hypothesis
We hypothesize that a combination of current techniques can meaningfully reduce model performance on out-of-domain tasks without the need for full retraining. Specifically, we hypothesize that these techniques may be capable of:
Maintaining high performance (>90% of initial accuracy) on in-domain tasks
Achieving significant rejection (>95%) of out-of-domain tasks
Generalizing effectively across model sizes and architectures
This hypothesis builds on preliminary findings from Yunis et al., which suggested that circuit breakers combined with supervised fine-tuning were capable of meaningfully creating task-specific guardrails.
Experiments
In order to thoroughly test and evaluate current techniques for creating domain-specific boundaries for LLMs, we plan to use the following experimental design.
Techniques Tested
We plan to evaluate a variety of techniques that rely on modifications to the model itself, as well as external guardrails that can be used to flag out-of-domain requests.
Classifier-Based Methods
Constitutional Classifiers (external validation)
Linear Latent Space Probes (internal activation monitoring)
Steering and Finetuning Methods
Circuit Breakers & Supervised Fine-Tuning
Activation Steering via steering vectors
Baseline Prompting Methods
Hardened Prompts
Model Selection:
We’ll evaluate across a diverse range of model sizes and architectures to assess whether these methods generalize across models. We are currently looking to test models from 1B up to 70B parameters, including both Mixture-of-Experts and reasoning models. Currently, we plan to test:
Llama Family:
3.2 1B
3.2 3B
3.1 8B
3.3 70B
Gemma Family:
Gemma 2 9B
Gemma 2 27B
Phi Family
Phi-4 (Reasoning Model)
Mistral
Mixtral 8x7B (MoE model)
Dataset Construction
In order to robustly evaluate the efficacy of these methods in realistic scenarios, we will combine multiple datasets to ensure thorough testing across various domains, various granularities of domains, and different task types.
MMLU (STEM vs. Non-STEM, and Specific Domains)
MMLU provides questions and their answers across a variety of domains, including STEM fields, Social Sciences, Law, and Medicine. We plan to test the abilities of the techniques both across broad groups of domains (e.g., all STEM domains) and within individual fields (e.g., only Prehistory). This should provide insight into how the granularity of the scoped domain affects each technique.
Domain-Specific Datasets (e.g., MedQA) & General Dataset Construction
In order to assess the generalizability of the techniques, we further plan to test them on datasets that are constructed from both domain-specific datasets and large, general corpora of text. While drawing from two datasets introduces confounding factors, which could cloud the results when used in isolation, it provides new distributions against which models trained on MMLU can be evaluated.
SNI
SNI (Super-NaturalInstructions) provides a large corpus of varying task types, like entailment and question rewriting. This allows us to test scoping techniques not only across different domains but also across various types of specific tasks.
NIST AI 200
Similar to the SNI dataset above, NIST provides a taxonomy of usage types (paradigms for how and why a human engages an AI system). We are considering using it as a dimension orthogonal to domains (like a verb to a subject) to refine the evaluation datasets toward a clearer scope of purposes.
Adversarial expansions of the Above Datasets
Given that the boundaries of various domains may not have clean separations, we plan to create a dataset that adversarially combines different domains and injects adversarial requests into the prompts. We plan to create an automated pipeline that takes the existing domain datasets and generates the new adversarial examples. One example prompt could be “In what ways do quantum uncertainty principles and non-locality align with or contradict theological concepts of divine omniscience?” which would sit at the intersection of theology and physics. Additional adversarial expansions could include instruction overriding, prompt injection, or varying the specificity of questions.
Metrics Measured
Primary Metrics
We will primarily be relying on the following metrics:
In-Domain Accuracy: Measures the model’s performance on questions within an expected domain
Out-of-Domain Accuracy: Measures the model’s performance on questions outside its expected domain
Primary Metric Analysis
To provide greater confidence in the veracity of our results, we will conduct statistical testing using paired t-tests to compare the test and control outputs. This approach allows us to:
Measure both absolute performance and relative deltas compared to control groups
Determine if differences in performance are statistically significant (Calculate p-values to quantify the likelihood that observed differences occurred by chance)
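For example, with per-question correctness scores for a scoped (test) run and an unmodified (control) run, the paired test can be computed as follows; the arrays are illustrative toy data, not our results.

```python
# Paired t-test between per-question scores under the scoped (test) and
# unmodified (control) models. Data shown here is illustrative only.
import numpy as np
from scipy.stats import ttest_rel

control_correct = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # control model
test_correct    = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1])  # scoped model

t_stat, p_value = ttest_rel(test_correct, control_correct)
delta = test_correct.mean() - control_correct.mean()
print(f"accuracy delta = {delta:+.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```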
Secondary Metrics
In addition to the performance metrics, we will track additional metrics that provide context for the respective approaches:
Rejection-Based Metrics
Accuracy: Overall proportion of correctly classified examples (both in-domain accepted and out-of-domain rejected)
Precision: Proportion of accepted examples that are truly in-domain
Recall: Proportion of in-domain examples that are correctly accepted
F1 Score: This provides a single value that captures the predictive power of our classification techniques
Comparative Analysis
In-Domain Accuracy Delta: Difference between test and control accuracy for in-domain questions
Out-Of-Domain Accuracy Delta: Difference between test and control accuracy for out-of-domain questions
Inference Efficiency Impact
Time Taken: We will track the time taken for the models to perform inference with each of the techniques. This provides a better baseline for the practical challenges associated with each method
FLOPs Tracking: Similarly, the number of FLOPs used during training and inference provides a baseline for the costs associated with each technique for organizations making deployment decisions.
Evaluation Process
We first parse model outputs using either text-based or logits-based parsers
We then compare model outputs both against ground truth answers and against the answers from an unmodified base model
We will calculate the above metrics given the outputs of the models
We perform statistical significance testing using paired t-tests between the test and control conditions
Our steering-based methods output answers to multiple-choice questions, allowing direct performance assessment. Classifier-based methods either accept or reject prompts; rejections are scored as the always-incorrect refusal option (‘E’).
This methodology enables us to evaluate both the model’s ability to answer correctly within its domain and its capability to appropriately reject out-of-domain queries. Further, it provides us with statistical confidence in the veracity of our findings.
Successful Results
Robust Domain Scope Enforcement
We tentatively aim to achieve >90% decrease in out-of-domain accuracy while maintaining >90% of in-domain accuracy.
Beyond delivering quality performance at the boundary of one domain, we aim to develop methods that are robust across multiple domains, datasets, and types of models. Specifically, we aim to achieve a >95% decrease and >90% maintenance across domains and models.
Technique Comparison and Refinement
We look to evaluate and rank the techniques mentioned earlier based on their performance in scoping models. This will provide actionable insights into the most promising methods.
Assumptions
Scoping More Capable Models is Easier and Will Provide Superior Performance to Domain-Specific Models: There are already models trained to perform well in a given domain. For example, BioBERT was specifically trained on biomedical data, leading to a model that performed well in that domain. We assume that scoping a larger general-purpose model is more practical than training such bespoke models and will match or exceed their in-domain performance.
Meaningful Internal Representations of A Given Domain Exist: Recent advances in interpretability work and our own results from linear probes point strongly towards this being the case.
Benchmark Datasets Reasonably Represent Relevant Domains: Although we cannot guarantee that this assumption is 100% valid, we will utilize multiple established test suites created by domain experts and highly cited in the literature.
Comparability of Methods Given Model Family/Size Used: We will be using a variety of different series of models, including Llama, Gemma, Phi, and Mistral. While we cannot guarantee that these methods will similarly hold for closed-source models, we believe that our selection will provide sufficient foundations to demonstrate relevance and applicability.
Measuring Uncertainty
Across our replicates, our measurement of the system’s intentional performance degradation outside the chosen scope is accompanied by a t-test and reported p-values to ensure the effect is not due to chance (as could easily be the case with small sample sizes).
Similarly, we will be conducting these experiments across a diverse range of model families, model sizes, and datasets, which should provide sufficient samples to find statistically significant results.
Potential Limitations
Beyond the assumptions listed earlier, we further believe that there are a few key limitations and challenges in the goals we have and our approach. The main limitations that we currently see are:
Adversarial Robustness Ceiling: While it may be possible for us to achieve considerable progress towards rejecting out-of-domain examples, we will likely not be able to achieve 100% robustness
Additional Computational Overhead: The addition of these techniques will result in an increase in the computational requirements for the models. In deployments of these models by large organizations, that may be a factor that detracts from the desirability of using these methods
Domain Boundary Definitions: There are likely edge cases that sit in a grey area between in-domain and out-of-domain. While we believe that we can make progress towards creating robust systems, there may be edge cases that are challenging to address.
Future Opportunities
Product Opportunities
We mentioned inference-time guardrail products in our section on commercial guardrails (OWASP vendor landscape). We believe organizations want to rapidly deploy safe AI-enabled applications but lack the time or resources to develop fine-tuning pipelines for bespoke models. Offering new “allow-list” methods, rather than topic moderation or toxicity and jailbreak detection, thus opens an opportunity: systematizing an end-to-end pipeline of efficient classifiers and scopers, whose training and deployment would be assisted by an orchestrator agent.
Academic Opportunities
Alongside the opportunities to create products with great potential, there are also opportunities for research that can stem from this underlying approach. We believe the groundwork for scope guardrail combinations can show methods and findings on top of which deeper AI architecture developments can be based (e.g., our section on system prompt scoping via MoE router freezing). We also see an opportunity to make scope guardrails an inherent foundation in Multi-Agentic Systems and a first-class component of agentic protocols.
Conclusions
Our literature review reveals fundamental limitations in current approaches to LLM scoping. Prompt-based methods consistently fail under adversarial conditions, while post-training techniques like RLHF and DPO struggle with reversibility and vulnerability to novel attack vectors. Even sophisticated approaches like circuit breakers and constitutional classifiers, while promising, face challenges with false positives and evolving evasion techniques.
Our experimental results on Llama 3.2 models confirm these limitations while highlighting potential paths forward. Prompt-based scoping proved largely ineffective, often degrading performance without meaningful boundary enforcement. However, latent-space classifiers demonstrated substantial promise, reducing out-of-domain accuracy to 5% or below while maintaining reasonable in-domain performance, particularly on the 3B model. This aligns with findings that internal model representations contain more robust signals for domain classification than surface-level instructions.
The proposed comprehensive research plan addresses current gaps by testing across diverse model architectures, scales, and domains with rigorous statistical validation. However, significant challenges remain: in current LLM architectures, adversarial robustness may have fundamental ceilings, computational overhead could limit practical deployment, and domain boundaries often blur in real-world scenarios. Critically, industry adoption faces additional hurdles around versatility and deployability—methods must work reliably across diverse organizational contexts, integrate seamlessly with existing infrastructure, and maintain performance under varied real-world conditions that laboratory settings cannot fully capture.
Despite these limitations, the convergent evidence supporting internal representation methods suggests that reliable LLM scoping is achievable, for example, through restrictive Mixture-of-Experts approaches we’ve highlighted. Success will require not just technical breakthroughs but also practical solutions for model-agnostic deployment, standardized integration protocols, and robust performance across the heterogeneous environments where organizations actually deploy AI systems.
Acknowledgements
We would like to thank Apart Research and lambda.ai for hosting and sponsoring the hackathon that initiated this project, as well as Adriano who was also a member of the original hackathon team. Apart Labs assisted in supporting the research, without which this work would not have been possible. Jacob Arbeid provided insightful feedback on our initial draft.
Appendix
Tables Summarizing Results Shown in Charts
Accuracy Results for Llama 3.2 3B Instruct
Accuracy Results for Llama 3.2 1B
Footnotes
A context window is the text an LLM can “see” and reference when generating a response. For more information, see https://towardsdatascience.com/de-coded-understanding-context-windows-for-transformer-models-cd1baca6427e/
Prompt Injection is when new instructions are added in prompts (either via user input or from a source/document also included in an interaction), causing the model to change direction from the expected instructions. OWASP also explains prompt injection in more detail.
A “Latent space” is a mathematical representation inside a transformer model: after input text fragments (tokens) are each replaced with a unique number, the model converts the overall text into a sequence of “directions” (just like X and Y on a graph), with many layers of triggers activating on combinations of those directions.
A “block” is a transformer block, one of the foundational building pieces of the Transformer architecture. Its two main components are (1) the multi-head attention mechanism and (2) a fully connected feed-forward network. Other important pieces include residual connections and layer normalization. For a more complete explanation, see the original paper “Attention is All You Need” or the useful explainer.