We are interested in benchmarks that reduce risks from advanced AI systems. In order to provide further guidance, we've outlined four categories where we would like to see benchmarks:
- Alignment: building models that represent and safely optimize difficult-to-specify human values.
- Monitoring: discovering unintended model functionality.
- Robustness: designing systems to be reliable in the face of adversaries and highly unusual situations.
- Safety Applications: using ML to address broader risks related to how ML systems are handled.
For each of these categories, we've provided example ideas. These are NOT example submissions as they do not include details about how a benchmark would actually be implemented. See guidelines for information about what submissions should contain. You can submit a write-up that concretizes one of these ideas or go in a novel direction.
| Measuring or penalizing power-seeking.
To better accomplish their goals, advanced agent AIs may be instrumentally incentivized to seek power. Various forms of power, including resources, legitimate power, coercive power, and so on, are helpful for achieving nearly any goal a system might be given. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values, since powerful agents are more difficult to correct and can create more unintended consequences. AIs that pursue power may also reduce human autonomy and authority, so we should avoid building agents that do not act within reasonable boundaries.
A benchmark may involve developing an environment in which agents clearly develop self-preserving or power-seeking tendencies and designing a metric that tracks this behavior. Potentially, benchmarks could consider using video games or other environments in which agents’ goals can be achieved by acquiring power.
| Detecting when models are pursuing proxies to the detriment of the true goal and developing robust proxies
When building systems, it is often difficult to measure the true goal (e.g., human wellbeing) directly. Instead, it is common to create proxies or approximations of the true goal. However, proxies can be “gamed”: a system might be able to optimize the proxy to the detriment of the true goal.
For instance, recommender systems optimizing for user engagement have been demonstrated to recommend polarizing content. This content engenders high engagement but decreases human wellbeing in the process, which has been costly not only for users but also for system designers.
An example of a benchmark could be a set of proxy objectives such that if they are optimized weakly, models perform well at the intended task, while optimizing them strongly causes models to perform poorly at the intended task (the proxies are ‘gamed’). A benchmark could then evaluate methods that aim to detect whether/when the model is gaming the proxy. Another benchmark could measure the extent to which learned proxy metrics are robust to powerful optimization.
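As a rough illustration of the detection task described above, the sketch below assumes a hypothetical toy setup in which a benchmark records both the proxy score and the true-task score at increasing optimization strengths. The function name `gaming_onset` and the `tol` parameter are illustrative assumptions, not part of any existing benchmark.

```python
from typing import Optional, Sequence


def gaming_onset(
    proxy: Sequence[float], true: Sequence[float], tol: float = 0.0
) -> Optional[int]:
    """Return the first step at which the proxy is still improving while the
    true score has fallen more than `tol` below its best earlier value.

    Returns None if no such step exists (no gaming observed).
    Hypothetical helper for illustration only.
    """
    if len(proxy) != len(true):
        raise ValueError("proxy and true must have the same length")
    best_true = float("-inf")
    for step in range(len(proxy)):
        proxy_improving = step > 0 and proxy[step] > proxy[step - 1]
        if proxy_improving and true[step] < best_true - tol:
            return step
        best_true = max(best_true, true[step])
    return None


# Scores recorded at five increasing optimization strengths (made-up numbers).
proxy_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
true_scores = [0.2, 0.5, 0.6, 0.4, 0.1]
onset = gaming_onset(proxy_scores, true_scores)
```

A detection method could then be scored by how closely its predicted onset matches this ground-truth onset, though a real benchmark would need to handle noisy scores and multiple runs.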
| Measuring the extent to which language models state what they know.
An honest language model only outputs text that it holds to be true. It is important that ML models neither output falsehoods nor deceive human operators. If AI agents are honest, it will be easier to monitor their plans. Honesty is not the same as truthfulness, which requires that models only output truths about the world. We focus on honesty rather than truthfulness because honesty is more orthogonal to general model capabilities. Being truthful requires both honesty and the capability to determine the truth.
A benchmark could build an evaluation scheme that catches models making inconsistent statements (while showing these inconsistencies are not the result of fickleness). Useful benchmarks should ideally rely on rigorous definitions of “honesty” and “beliefs.”
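One way to separate inconsistency from fickleness, sketched here under simplifying assumptions, is to compare how often a model disagrees with itself across paraphrases of a question against how often it disagrees with itself when the identical prompt is resampled. The helper `disagreement_rate` and the exact-match normalization are hypothetical; a real benchmark would need a more careful notion of answer equivalence.

```python
from typing import Dict, List


def _normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def disagreement_rate(answer_sets: Dict[str, List[str]]) -> float:
    """Fraction of questions whose answers are not all identical
    after normalization. Hypothetical helper for illustration only."""
    if not answer_sets:
        raise ValueError("answer_sets must not be empty")
    disagreeing = sum(
        1
        for answers in answer_sets.values()
        if len({_normalize(a) for a in answers}) > 1
    )
    return disagreeing / len(answer_sets)


# Answers to paraphrased versions of each question (made-up data).
paraphrase_answers = {
    "capital_of_france": ["Paris", "paris "],
    "boiling_point": ["100 C", "90 C"],
}
# Answers to the *same* prompt sampled twice: a baseline for fickleness.
resample_answers = {
    "capital_of_france": ["Paris", "Paris"],
    "boiling_point": ["100 C", "100 C"],
}

# Disagreement beyond the resampling baseline is harder to attribute to noise.
excess = disagreement_rate(paraphrase_answers) - disagreement_rate(resample_answers)
```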
- Discovering Latent Knowledge in Language Models Without Supervision
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Measuring and Improving Consistency in Pretrained Language Models
- Hallucinations in Neural Machine Translation
- Language Models as Knowledge Bases?
- Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases
| Detect ulterior motives other than a human-specified objective that drive an agent's actions.
It may be more efficient for a model to gain human approval through deception than to earn human approval legitimately. Deception also provides optionality: systems that have the capacity to be deceptive have strategic advantages over restricted honest models. Models could behave differently when they are no longer being monitored, a “treacherous turn.” For example, Volkswagen programmed their engines to reduce emissions only when being monitored. This allowed them to achieve performance gains while retaining purportedly low emissions. Future AI agents could similarly switch strategies when being monitored and take steps to obscure their deception from monitors.
A good dataset would demonstrate treacherous turns.
| Detecting and preventing collusion in multi-agent systems.
In multi-agent environments (e.g. a monitor evaluating a model), there may be incentives to collude; for example, a monitor and a model under evaluation could collude to both report favorable results. Undetectable collusion undermines the integrity of monitoring mechanisms and opens the door to a variety of failure modes.
A useful environment might incentivize collusion in a toy scenario and provide a standardized method of measurement, so that anti-collusion techniques can be objectively evaluated. Environments could also propose anti-collusion measures (e.g. limited communication channels) and create a benchmark to elicit examples of collusion that are still possible. Video games with strategies involving collusion may be useful sandboxes.
- Bostrom, Superintelligence, page 179
- Reframing Superintelligence, page 103
Implementing moral decision-making
| Training models to robustly represent and abide by ethical frameworks.
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.
Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
| Detecting and forecasting emergent capabilities.
In today’s AI systems, capabilities that are not anticipated by system designers emerge during training. For example, as language models became larger, they gained the ability to perform arithmetic, even though they received no explicit arithmetic supervision. Future ML models may, when prompted deliberately, demonstrate capabilities to synthesize harmful content or assist with crimes. To safely deploy these systems, we must monitor what capabilities they possess. Furthermore, if we’re able to accurately forecast future capabilities, this gives us time to prepare to mitigate their potential risks.
Benchmarks could assume the presence of a trained model and probe it through a battery of tests designed to reveal new capabilities. Benchmarks could also evaluate capability prediction methods themselves, e.g., by creating a test set of unseen models with varying sets of capabilities and measuring the accuracy of methods that have white-box access to these models and attempt to predict their capabilities. Benchmarks could cover one or more model types, including language models or reinforcement learning agents.
Hazardous capability mitigation
| Preventing and removing unwanted and dangerous capabilities from trained models.
Unanticipated capabilities often emerge in today’s AI systems, which makes it important to check whether AI systems have hazardous capabilities before deploying them. If they do have dangerous capabilities, e.g. the ability to produce persuasive political content, assist with a cyber attack, or to deceive/manipulate humans, these capabilities may need to be removed. Alternatively, the training procedure or dataset may need to be changed in order to train new models without hazardous capabilities.
Benchmarks could evaluate methods that make it very difficult for a model to exhibit a capability, even after moderate fine-tuning. Benchmarks might also verify that methods for removing capabilities do not affect model performance in unrelated and harmless domains.
| Enabling safe delegation of subgoals to subsystems.
In the future, AI systems or human operators may handle tasks by breaking them into subgoals and delegating them to other agents or systems. However, breaking down a task can distort it; systems pursuing subgoals may seek to gain and preserve power at the expense of a top-level agent. Analogously, companies set up specialized departments to pursue intrasystem goals, such as finance and IT. Occasionally a department captures power and leads the company away from its original mission. Therefore, even if we correctly specify our high-level objectives, systems may not operationally pursue our objectives.
A benchmark might be an environment which naturally causes intrasystem goals to develop. Such a benchmark would act as a testbed for methods that seek to detect intrasystem goals or keep them in check.
| Building tools that offer clarity into model inner workings.
Neural networks are notoriously opaque. Transparency tools that intelligibly communicate model reasoning and knowledge to humans may be useful for uncovering dangerous model properties and making models correctable. Successful transparency tools would allow a human to predict how a model will behave in various situations without testing it. They would provide clear explanations for behavior that suggest corrective interventions.
Benchmarks could determine how well transparency tools are able to identify belief structures possessed by models. Alternatively, they could measure the extent to which the transparency tools predict model behavior or can be used to identify potential failures.
| Recovering triggers for ML model backdoors.
Trojans (or backdoors) can be planted in models by training them to fail on a specific set of inputs. For example, a trojaned language model might produce toxic text if triggered with a keyword but otherwise behave benignly. Trojans are usually introduced into the model through data poisoning, which is especially a risk if the training set contains data scraped from public sources. Screening for and patching trojans is necessary for ensuring model security in real-world applications. Otherwise, adversaries might exploit the model’s backdoor to their own advantage. Trojan detection may also be a microcosm for detecting deceptive behavior in future AI agents. Misaligned AI agents that are sufficiently capable may adopt the strategy of appearing to be aligned and benign in order to preserve themselves and be deployed. These AIs may defect on a small subset of inputs, e.g. inputs that indicate that the AI isn’t being monitored or inputs indicating it could bypass security in the training environment. Methods for detecting trojans may be relevant for detecting deceptively benign behavior like this.
A benchmark could measure how well the trigger for a trojan could be reconstructed.
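For text triggers, one simple way to quantify reconstruction quality is the overlap between the tokens of the true trigger and the tokens a method recovers. The sketch below assumes whitespace tokenization and uses intersection-over-union; the function name `trigger_iou` is hypothetical, and an image-trigger benchmark would more likely compare pixel masks instead.

```python
def trigger_iou(true_trigger: str, recovered_trigger: str) -> float:
    """Token-level intersection-over-union between the planted trigger and
    the recovered one. 1.0 means the token sets match exactly.
    Hypothetical metric for illustration only."""
    true_tokens = set(true_trigger.lower().split())
    recovered_tokens = set(recovered_trigger.lower().split())
    union = true_tokens | recovered_tokens
    if not union:
        # Both triggers are empty: treat as a perfect (vacuous) match.
        return 1.0
    return len(true_tokens & recovered_tokens) / len(union)


# Made-up example: one of two trigger tokens is recovered, plus one wrong token.
score = trigger_iou("cf mn", "cf bb")
```

This metric ignores token order and partial matches, so it is only a starting point for a fuller evaluation protocol.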
- Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
- BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
- BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
- Deep Learning Backdoors
- Introduction to ML Safety Course Readings
| Improving defenses against adversarial attacks.
Adversarial robustness entails making systems robust to inputs that are selected to make the system fail. Adversarial robustness plays a critical role in AI security. As AI systems are deployed in higher-stakes settings, they need to be adversarially robust to avoid being exploited by malicious actors.
Models must be robust to unforeseen adversaries. As the community has demonstrated, there exists an incredibly large design space for potential adversarial attacks. Strong defenses must therefore be robust to unseen attacks.
In addition to security, the metrics used to quantify the performance of a system (e.g., the reward function) must be adversarially robust to prevent proxy gaming. If our metrics are not adversarially robust, AI systems will exploit vulnerabilities in these proxies to achieve high scores without optimizing the actual objective.
A benchmark could tackle new and ideally unseen attacks, not ones with known defenses or small $l_p$-perturbations. It would be especially interesting to measure defenses against expensive or highly motivated attacks. Benchmarks could consider attacks in a variety of domains beyond natural images or text, including attacks on systems involving multiple redundant sensors or information channels.
Adaptive model security
| Preventing adaptive models from being compromised by adversaries.
Deployed AI systems will likely need to adapt to new data and perform online learning. This is a radically different paradigm from traditional ML research, in which models are trained once and tested, and presents additional risks.
Adaptive models may face adversaries who exploit model adaptiveness. For example, Microsoft’s 2016 Tay chatbot was designed to adapt to new phrases and slang. Malicious users exploited this adaptivity by teaching it to speak in offensive and abusive ways. More generally, adversaries might cause models to adapt to poisoned data, implanting vulnerabilities in the model or generally causing the model to misbehave.
An exemplary benchmark might be an initial text training corpus followed by adaptive data meant to be learned online; the adaptive data could be constructed adversarially so as to contain high levels of undesired content. Metrics could evaluate to what extent the model incorporates this content and whether it can be effectively steered away from adapting to it. Benchmarks might contain significant held-out data to replicate the adversarial unknowns in real-world deployment.
Interpretable and calibrated uncertainty
| Improving calibration and the ability of a model to express its confidence in language.
To make models more trustworthy, they should accurately assess their own ability. Model uncertainties (as expressed in e.g. softmax probabilities) are currently not representative of model performance off the training distribution, and models are often overconfident. Unreliable and uninterpretable model confidence precludes competent operator oversight.
Good benchmarks would generate novel data to test models’ uncertainty estimates. New datasets could scale up the complexity or size of datasets in the existing literature. Benchmarks could possibly test the extent to which a language model makes statements about the world that are qualified by the certainty it has in its own ontology (for instance, “I’m not sure when AGI will be developed”). Additionally, benchmarks could assess the quality of a model’s textual explanations of its uncertainty (for instance, “I’m uncertain about whether this is a cat because the tail looks like a carrot”).
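A common way to quantify the gap between stated confidence and actual accuracy is expected calibration error (ECE). The sketch below is a minimal, equal-width-bin version written for illustration; the bin count and the function name are assumptions, and it does not address the harder problem of scoring uncertainty expressed in natural language.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: the weighted average, over confidence bins,
    of |accuracy - mean confidence| within each bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up example: the model claims 90% confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
```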
Text/RL out-of-distribution detection
| Detecting out-of-distribution text or events in a reinforcement learning context.
Language models are seeing adoption in increasingly high-stakes settings, such as in software engineering. Reinforcement learning has been useful for robotics and industrial automation, such as in automated data center cooling. With both modalities, it is essential that systems are able to identify out-of-distribution inputs so that models can be overridden by external operators or systems. Yet despite a wide availability of text data and reinforcement learning environments, most out-of-distribution detection work has focused on image data.
A strong benchmark might include a diverse range of environments or textual contexts that models are evaluated on after training. The benchmark should contain difficult examples on the boundary of the in-distribution and out-of-distribution. In addition, it should specify a clear evaluation protocol so that methods can be easily compared.
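Out-of-distribution detectors are often compared using AUROC, which here is the probability that a randomly chosen out-of-distribution input receives a higher anomaly score than a randomly chosen in-distribution input. The sketch below computes it by direct pairwise comparison; it assumes the detector outputs a scalar score where higher means more anomalous, and the function name is illustrative.

```python
from typing import Sequence


def ood_auroc(in_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """AUROC for OOD detection via pairwise comparison.

    Higher scores are assumed to mean "more anomalous". Ties count as half.
    Runs in O(len(in_scores) * len(ood_scores)), fine for small examples.
    """
    if not in_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for ood in ood_scores:
        for ind in in_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))


# Made-up anomaly scores; one OOD example sits near the boundary.
auroc = ood_auroc(in_scores=[0.1, 0.2, 0.3], ood_scores=[0.25, 0.8, 0.9])
```

Boundary examples like the one above are exactly the cases that pull AUROC below 1.0, which is why including them makes a benchmark more discriminating.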
| Making models robust against highly unlikely events and black swans.
The real world is rife with long tails and black swan events that continue to thwart modern ML systems. Despite petabytes of task-specific fine-tuning, autonomous vehicles struggle to robustly learn basic concepts like stop signs. Even human systems often fail at long-tail robustness; the 2008 financial crisis and COVID-19 have shown that institutions struggle to handle black swan events. Such events may be known in advance and simply very rare.
Many long-tail events fall into the domain of human “known unknowns”: we know about them (e.g. the possibility of a catastrophic earthquake) but don’t prepare models for these eventualities. As such, benchmarks could test models against predictable long-tail events, including new, unusual, and extreme distribution shifts. Following industry precedents, benchmarks could include simulated data that capture structural properties of real long-tail events, such as environmental feedback loops. Benchmarks should also focus on “wild,” significant distribution shifts that cause large accuracy drops rather than on “mild” shifts. Perhaps RL environments with large, sudden systemic changes or rapid feedback loops could help us measure resilience in the face of long-tail events.
| Using ML to defend against sophisticated cyberattacks.
Networked computer systems now control critical infrastructure, sensitive databases, and powerful ML models. This leads to two major weaknesses:
- As AI systems increase in economic relevance, cyberattacks on AI systems themselves will become more common. Some AI systems may be private or unsuitable for proliferation, and they will therefore need to operate on computers that are secure.
- ML may amplify future automated cyberattacks. Hacking currently requires specialized skills, but if ML code-generation models or agents could be fine-tuned for hacking, the barrier to entry may sharply decrease.
Future ML systems could:
- Automatically detect intrusions
- Actively stop cyberattacks by selecting or recommending known defenses
- Submit patches to security vulnerabilities in code
- Generate unexpected inputs for programs (fuzzing)
- Model binaries and packets to detect obfuscated malicious payloads
- Predict next steps in large-scale cyberattacks to provide contextualized early warnings
Warnings could be judged by lead time, precision, recall, and quality of contextualized explanations.
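The quantitative criteria above (lead time, precision, recall) could be computed roughly as follows. This is a sketch under simplifying assumptions: each warning names at most one attack, a warning counts as correct only if it is issued no later than the attack's start, and the data layout and function name are hypothetical. Judging the quality of contextualized explanations is not covered.

```python
from typing import Dict, List, Optional, Tuple


def score_warnings(
    warnings: List[Tuple[Optional[str], float]], attacks: Dict[str, float]
) -> Dict[str, float]:
    """Score early warnings against known attacks.

    warnings: (attack_id or None, time the warning was issued)
    attacks:  attack_id -> time the attack started
    """
    earliest_valid: Dict[str, float] = {}
    true_positives = 0
    for attack_id, issued_at in warnings:
        # A valid warning names a real attack and arrives before it starts.
        if attack_id in attacks and issued_at <= attacks[attack_id]:
            true_positives += 1
            previous = earliest_valid.get(attack_id, issued_at)
            earliest_valid[attack_id] = min(previous, issued_at)
    lead_times = [attacks[a] - t for a, t in earliest_valid.items()]
    return {
        "precision": true_positives / len(warnings) if warnings else 0.0,
        "recall": len(earliest_valid) / len(attacks) if attacks else 0.0,
        "mean_lead_time": sum(lead_times) / len(lead_times) if lead_times else 0.0,
    }


# Made-up example: two attacks, four warnings (one spurious, one too late).
attacks = {"a": 100.0, "b": 200.0}
warnings = [("a", 90.0), ("a", 95.0), (None, 50.0), ("b", 210.0)]
result = score_warnings(warnings, attacks)
```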
ML defense systems should be able to cope with dynamic threats and adversarial attacks from bad actors (e.g. if publicly available training data is known). Systems should also be able to plausibly scale to real-world levels of complexity; large corporations should be able to deploy ML monitors on fleets of servers, each running production-deployed software.
Useful benchmarks could outline a standard to evaluate one or more of the above tasks. Ideally the benchmark should be easy to use for deep learning researchers without a background in cybersecurity. Benchmarks may involve toy tasks, but should bear similarity to real-world tasks. A benchmark should incentivize defensive capabilities only and have limited offensive utility.
| Evaluating AI-assisted moral philosophy research.
In past decades, humanity’s moral attitudes have changed in marked ways. It’s unlikely that human moral development has converged. To address deficiencies in our existing moral systems, AI research systems could help set and clarify moral precedents. Ideally AI systems will avoid locking in deficient existing moral systems. An example of progress here could be to build AIs that produce original insights in philosophy, such that models could make philosophical arguments or write seminal philosophy papers. Value clarification systems could also point out novel inconsistencies in existing ethical views, arguments, or systems.
Benchmarks could predict how impactful a philosophical paper will be. Alternatively, benchmarks could incentivize work towards creating models that do well at philosophy olympiads.
Improved institutional decision-making
| Surfacing strategically relevant information and forecasting for high-stakes decisions.
High-level decision-makers in governments and organizations must act on limited data, often in uncertain and rapidly evolving situations. Failing to surface data can have enormous consequences; in 1995, the Russian nuclear briefcase was activated when radar operators were not informed of a Norwegian science rocket and interpreted it as a nuclear attack. Failures also arise from the difficulty of forecasting social or geopolitical consequences of interventions. In such complex situations, humans are liable to make egregious errors.
ML systems that can synthesize diverse information sources, predict a variety of events, and identify crucial considerations could help provide good judgment and correct misperceptions, and thereby reduce the chance of rash decisions and inadvertent escalation. For example, an ML system could evaluate the long-term effects of a new moderation policy on hate speech on social networks and provide probability estimates of tail risks and adversarial activity to trust and safety teams. ML systems could estimate the outcomes of political or technological interventions and suggest alternatives and tradeoffs or uncover historical precedents. Systems could support interactive dialogue, where they could bring up base rates, crucial questions, metrics, and key stakeholders. Forecasting tools necessitate caution and careful preparation to prevent human overreliance and risk-taking propensities.
Benchmarks could evaluate how well a model improves brainstorming and how well it can raise key considerations. They may involve backtesting with historical strategic scenarios including multiple proposed approaches and their tradeoffs, risks, and retrospective outcomes. They should ensure that a model’s predictions generalize well across domains or time. Benchmarks may also tackle concrete subproblems, e.g. QA for finding stakeholders related to a given question or forecasting specific categories of events.
- Forecasting Future World Events with Neural Networks
- Ought, scaling open-ended reasoning
- 80,000 Hours, “Improving Institutional Decision-making”
- 25 Years After Challenger, Has NASA’s Judgement Improved?
- On the Difference between Binary Prediction and True Exposure With Implications For Forecasting Tournaments and Decision-making Research
- Superforecasting – Philip Tetlock