For general guidance on designing good ML benchmarks, refer to this document.
Example format
If you have already written a paper about your benchmark, submit that. Otherwise, you should submit a write-up that provides a thorough and concrete explanation of your benchmark idea, including details about how it would be implemented. We’ve provided an example format below, though using it is entirely optional.
- Title
- Abstract (what’s the tl;dr?)
- Benchmark Description and Motivation
  - Introduction. What context motivates your benchmark? What is the intuition behind it?
  - Description. Provide a brief, concrete description of the benchmark. Is it an environment, a dataset, etc.? What are the inputs and outputs?
  - Relevance. Explain how it reduces high-consequence risks from advanced AI systems. We’ve provided four categories of research that generally help to reduce these risks; however, submissions will be judged according to their relevance to risks from advanced AI, not to these categories. You can learn more about these risks here.
  - Technical Details
    - Data source (if applicable). Where does (or will) the data come from? For example, will data be scraped from a publicly available site? Will it be manually collected?
    - Implementation cost. If you don’t include data with your submission, estimate the time and cost to collect and clean it.
    - Dataset size (if applicable): size in GB and number of examples.
- Related Work
  How is your benchmark similar to or different from existing tasks and benchmarks? Good benchmarks often tie into existing work to gain widespread adoption while motivating novel research.
- Relevance to Future Work
  - Current tractability analysis. Benchmarks should currently or soon be tractable for existing models while posing a meaningful challenge. A significant research effort should be required to achieve near-maximum performance.
  - Performance ceiling. Provide an estimate of maximum performance. For example, what would expert human-level performance be? Is it possible to achieve superhuman performance?
  - Barriers to entry. List factors that might make it harder for researchers to use your benchmark. Keep in mind that if barriers to entry trade off against relevance, you should generally prioritize the latter.
    - How large do models need to be to perform well?
    - How much context is required to understand the task?
    - How difficult is it to adapt current training architectures to the dataset?
    - Is third-party software (e.g., games, modeling software, simulators) or a complex set of dependencies required for training and/or evaluation?
    - Is unusual hardware required (e.g., robotics, multi-GPU training setups)?
    - Do researchers need to learn a new program or programming language to use the dataset (e.g., Coq, AnyLogic)?
- Baseline & Projected Improvements
  - Baseline. Implement a baseline and describe its performance, OR describe a plausible baseline and its expected performance on the task. Implementing a baseline is not required, though it is encouraged if you create your own dataset (a minimal illustrative sketch follows this list).
  - Plausible directions for improvement. You should be able to think of research directions that could plausibly lead to performance improvements. “Scaling the model” doesn’t count.
- References
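To make the Baseline item above concrete, here is a minimal sketch of what an implemented baseline and its reported score might look like. It assumes a hypothetical multiple-choice benchmark stored as a JSONL file (`benchmark_dev.jsonl`, one object per line with `question`, `choices`, and `answer` fields) and scores a trivial random-guess baseline; a real submission would swap in its own data format, a stronger baseline (e.g., a prompted language model), and the benchmark’s actual metric.

```python
import json
import random


def load_examples(path):
    # One JSON object per line: {"question": str, "choices": [str, ...], "answer": int}
    with open(path) as f:
        return [json.loads(line) for line in f]


def random_baseline(example, rng):
    # Trivial baseline: pick one of the answer choices uniformly at random.
    return rng.randrange(len(example["choices"]))


def evaluate(examples, predict, seed=0):
    # Fraction of examples where the predicted choice index matches the label.
    rng = random.Random(seed)
    correct = sum(predict(ex, rng) == ex["answer"] for ex in examples)
    return correct / len(examples)


if __name__ == "__main__":
    examples = load_examples("benchmark_dev.jsonl")  # hypothetical dev split
    accuracy = evaluate(examples, random_baseline)
    print(f"Random-guess baseline accuracy: {accuracy:.3f}")
```

Even a trivial baseline like this, reported alongside your estimated performance ceiling, helps judges gauge how much headroom the benchmark leaves for meaningful research.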
How will submissions be evaluated?
Some factors the judges will take into account include:
- Relevance to existential risk. Would this benchmark motivate research that reduces high-consequence risks from advanced AI systems?
- Safety/capabilities balance. To what extent will your benchmark encourage capability improvements unrelated to safety? This matters because stronger capabilities require higher safety standards, so a favorable balance between the two should be maintained.
- Difficulty. Does your benchmark present a significant challenge while allowing for meaningful near-term improvements?
- Expected Impact. To what extent is this benchmark the beginning of a long line of research as opposed to a one-off contribution?
Only your final submission will be judged; feedback on intermediate submissions will not affect your final score.
Points to keep in mind
- Empirical testing is preferred. You are not required to provide a dataset to submit a benchmark idea; however, building your benchmark and providing a baseline is preferred, especially if doing so would be straightforward. Empirical testing helps the judges determine whether your benchmark is appropriately challenging and realistic to implement.
- Be mindful of sourcing legalities. Leveraging existing freely available data is okay and encouraged; however, you should ensure that sourcing the data complies with the license or usage policy provided by the creator (a brief illustrative check is sketched below).
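If you plan to build on a publicly hosted dataset, it is worth surfacing its declared license before investing effort. The sketch below assumes the data lives on the Hugging Face Hub and uses the `datasets` library to print the license metadata the uploader declared; the dataset name is a placeholder, and a missing or vague field means you still need to read the original source’s terms yourself.

```python
# Quick license check for a Hub-hosted dataset (requires: pip install datasets).
from datasets import load_dataset_builder

DATASET_NAME = "some-org/some-dataset"  # placeholder: the dataset you intend to use

builder = load_dataset_builder(DATASET_NAME)
info = builder.info
print("License: ", info.license or "not declared -- check the original source")
print("Homepage:", info.homepage or "none listed")
print("Citation:", (info.citation or "none provided")[:200])
```

Metadata like this is only a starting point; the authoritative terms are whatever the data’s creator actually published.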