Guidelines

For general guidance on designing good ML benchmarks, refer to this document.

Example format

If you have already written a paper about your benchmark, submit that. Otherwise, you should submit a write-up that provides a thorough and concrete explanation of your benchmark idea, including details about how it would be implemented. We’ve provided an example format below, though using it is entirely optional.

  1. Title

  2. Abstract (what’s the tl;dr?)

  3. Benchmark Description and Motivation

    1. Introduction. What context motivates your benchmark? What is the intuition behind it?
    2. Description. Provide a brief, concrete description of the benchmark. Is it an environment, a dataset, or something else? What are the inputs and outputs?
    3. Relevance. Explain how it reduces high-consequence risks from advanced AI systems. We’ve provided four categories of research that generally help to reduce these risks; however, submissions will be judged according to their relevance to risks from advanced AI, not to these categories. You can learn more about these risks here.
  4. Technical Details
    1. Data source (if applicable). Where does (or will) the data come from? For example, will data be scraped and cleaned from a publicly available site? Will it be collected manually?
    2. Implementation cost. If you don’t include data with your submission, estimate the time and cost to collect and clean it.
    3. Dataset size (if applicable). Size in GB and number of examples.
  5. Related Work

    How is your benchmark similar to or different from existing tasks and benchmarks? Good benchmarks often tie into existing work to gain widespread adoption while motivating novel research.

  6. Relevance to Future Work

    1. Current tractability analysis. Benchmarks should currently or soon be tractable for existing models while posing a meaningful challenge. A significant research effort should be required to achieve near-maximum performance.
    2. Performance ceiling. Provide an estimate of maximum performance. For example, what would expert human-level performance be? Is it possible to achieve superhuman performance?
    3. Barriers to entry. List factors that might make it harder for researchers to use your benchmark. Keep in mind that if barriers to entry trade off against relevance, you should generally prioritize the latter.
      1. How large do models need to be to perform well?
      2. How much context is required to understand the task?
      3. How difficult is it to adapt current training architectures to the dataset?
      4. Is third-party software (e.g. games, modeling software, simulators) or a complex set of dependencies required for training and/or evaluation?
      5. Is unusual hardware required (e.g. robotics, multi-GPU training setups)?
      6. Do researchers need to learn new software or a new programming language to use the dataset (e.g., Coq, AnyLogic)?
  7. Baseline & projected improvements

    1. Baseline. Implement a baseline and describe its performance, or describe a plausible baseline and its expected performance on the task; a minimal illustrative sketch appears after this list. Implementing a baseline is not required, though it is encouraged if you create your own dataset.
    2. Plausible directions for improvement. You should be able to think of research directions that could plausibly lead to performance improvements. “Scaling the model” doesn’t count.
  8. References
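
To make the technical-details and baseline items above concrete, here is a minimal sketch of the kind of artifact a submission might include, assuming a hypothetical classification-style benchmark stored as a JSONL file named benchmark.jsonl with "input" and "target" fields per example. The file name, the field names, and the random-guess baseline are illustrative assumptions, not a required format.

```python
import json
import random
from pathlib import Path


def load_examples(path):
    """Load one JSON object per line; each is assumed to have 'input' and 'target' fields."""
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]


def random_baseline(example, labels):
    """A trivial baseline that ignores the input and guesses a label uniformly at random."""
    return random.choice(labels)


def evaluate(examples, predict_fn):
    """Return the fraction of examples where the prediction matches the target."""
    labels = sorted({ex["target"] for ex in examples})
    correct = sum(predict_fn(ex, labels) == ex["target"] for ex in examples)
    return correct / len(examples)


if __name__ == "__main__":
    examples = load_examples("benchmark.jsonl")  # hypothetical dataset file
    acc = evaluate(examples, random_baseline)
    print(f"{len(examples)} examples; random-baseline accuracy: {acc:.3f}")
```

Reporting the gap between a trivial baseline like this and your estimated performance ceiling (item 6.2) gives the judges a quick sense of how much headroom the benchmark leaves for meaningful improvement.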

How will submissions be evaluated?

Some factors the judges will take into account include:

  1. Relevance to existential risk. Would this benchmark motivate research that reduces high-consequence risks from advanced AI systems?
  2. Safety/capabilities balance. To what extent will your benchmark encourage capability improvements unrelated to safety? This matters because stronger capabilities require higher safety standards, so a favorable balance between the two should be maintained.
  3. Difficulty. Does your benchmark present a significant challenge while allowing for meaningful near-term improvements?
  4. Expected impact. To what extent is this benchmark the beginning of a long line of research as opposed to a one-off contribution?

Only your final submission will be judged; feedback on intermediate submissions will not affect your final score.

Points to keep in mind

  1. Empirical testing is preferred. You are not required to provide a dataset to submit a benchmark idea; however, building your benchmark and providing a baseline is encouraged, especially if doing so would be straightforward. Empirical testing helps the judges determine whether your benchmark is appropriately challenging and realistic to implement.

  2. Be mindful of sourcing legalities. Leveraging existing, freely available data is acceptable and encouraged; however, you should ensure that sourcing the data complies with the license or usage policy provided by the creator.