AI Safety: Shortcomings in Current Tests and Benchmarks | | | | Turtles AI
Highlights:
- Current benchmark shortcomings: Existing tools for testing AI model safety can be manipulated and do not reflect real-world behavior.
- Data contamination issues: Benchmarks can overestimate performance if models are trained on the test data.
- Challenges of red-teaming: Identifying model vulnerabilities is complex and costly, particularly for smaller organizations.
- Need for context-specific evaluations: It is crucial to develop evaluation methods that consider the context of use and potential impacts on different user groups.
The growing demand for AI safety and accountability has highlighted the shortcomings of current tests and benchmarks. According to a new report, existing tools are insufficient to ensure that AI models are safe and reliable, raising questions about their ability to predict behavior in real-world scenarios.
The advancement of generative AI models, capable of producing texts, images, music, and videos, is accompanied by increasing concerns about their tendency to make mistakes and behave unpredictably. In response, numerous organizations, including public sector agencies and major tech firms, are developing new benchmarks to test the safety of these models.
Late last year, Scale AI created a lab dedicated to assessing models’ adherence to safety guidelines. Recently, the NIST and the UK AI Safety Institute launched tools to assess the risks associated with models. However, a study conducted by the Ada Lovelace Institute (ALI), a UK-based AI research nonprofit, revealed that these tests might be inadequate. The ALI interviewed experts from academic labs, civil society, and model-producing companies, finding that while current evaluations can be useful, they are often non-exhaustive, easily manipulated, and do not necessarily reflect how models will behave in real-world situations.
Elliot Jones, a senior researcher at ALI and co-author of the report, emphasized that, like other products such as smartphones or cars, AI models are expected to be safe and reliable before being deployed. However, the investigation highlighted a lack of consensus within the AI industry on the best methods and taxonomies for evaluating models.
A significant issue involves the risk of "data contamination," where benchmark results can overestimate a model’s performance if it has been trained on the test data. This, coupled with the choice of benchmarks for convenience rather than effectiveness, was criticized by the experts interviewed. Additionally, the practice of "red-teaming," or simulating attacks on models to identify vulnerabilities, was described as complex and costly, making it challenging for smaller organizations to execute effectively.
The report’s authors suggest that the pressure to release models quickly and the reluctance to conduct thorough tests are among the main reasons for the current gaps in AI safety evaluations. Mahi Hardalupas, a researcher at ALI, believes that greater involvement from public authorities is needed to clearly define expectations for evaluations and promote an ecosystem of independent tests. This could include measures to ensure regular access to the necessary models and datasets.
Jones suggests that it may be necessary to develop "context-specific" evaluations that go beyond simply testing response to prompts, instead examining how a model might impact different user groups and how safeguards can be compromised. This would require investments in the science of evaluations to develop more robust and repeatable methods.
Ultimately, as Hardalupas pointed out, there is no absolute guarantee that a model is safe. Safety depends on the context of use, the intended audience, and the adequacy of the implemented safety measures.