Anthropic is launching a program to fund the development of new types of benchmarks capable of evaluating the performance and impact of AI models, including generative models like its own Claude.
Details of the Funding Program
Unveiled on Monday, Anthropic’s program will pay third-party organizations that can, as the company puts it in a blog post, “effectively measure advanced capabilities in AI models.” Those interested can submit applications, which will be evaluated on a rolling basis.
Goals of the Program
“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” Anthropic wrote on its official blog. “Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply.”
Current Issues with AI Benchmarking
As we’ve highlighted before, AI has a benchmarking problem. Today’s most commonly cited benchmarks for AI do a poor job of capturing how the average person uses the systems being tested. Given their age, some benchmarks, particularly those released before the dawn of modern generative AI, may not even measure what they purport to measure.
Proposed Solutions
The very-high-level, harder-than-it-sounds solution Anthropic is proposing is creating challenging benchmarks focusing on AI security and societal implications via new tools, infrastructure, and methods.
Focus Areas for New Benchmarks
The company calls specifically for tests that assess a model’s ability to carry out cyberattacks, “enhance” weapons of mass destruction (e.g., nuclear weapons), and manipulate or deceive people (e.g., through deepfakes or misinformation). For AI risks related to national security and defense, Anthropic says it’s committed to developing an early warning system for identifying and assessing risks. However, the blog post doesn’t reveal what such a system might entail.
Supporting Research and Development
Anthropic also says it intends its new program to support research into benchmarks and “end-to-end” tasks that probe AI’s potential for aiding in scientific study, conversing in multiple languages, mitigating ingrained biases, and self-censoring toxicity. To achieve all this, Anthropic envisions new platforms that allow subject-matter experts to develop their own evaluations, as well as large-scale trials of models involving “thousands” of users.
Funding Options and Expert Interaction
“We offer a range of funding options tailored to the needs and stage of each project,” Anthropic writes in the post, though an Anthropic spokesperson declined to provide further details about those options. “Teams will have the opportunity to interact directly with Anthropic’s domain experts from our red team, fine-tuning, trust and safety, and other relevant teams.”
Potential Challenges and Criticisms
Anthropic’s effort to support new benchmarks is laudable, assuming there is sufficient funding and substance behind it. However, given the company’s commercial ambitions in a competitive landscape, the effort might be tough to trust completely.
Transparency and Concerns
In the blog post, Anthropic is relatively transparent about wanting specific evaluations it funds to align with the AI safety classifications it developed (with some input from third parties like the nonprofit AI research org METR). That’s well within the company’s rights. However, it may force applicants to the program into accepting definitions of “safe” or “risky” AI that they might disagree with.
Industry Reactions
Some members of the AI community will likely take issue with Anthropic’s references to “AI risks,” such as nuclear weapons and deception. Many experts believe there’s little evidence to suggest AI will gain world-ending, human-outsmarting capabilities anytime soon, if ever. These experts add that claims of imminent “superintelligence” only draw attention away from the pressing AI regulatory issues of the day, like AI’s hallucinatory tendencies.
Anthropic’s Vision for the Future
In its post, Anthropic says it hopes its program will serve as “a catalyst for progress toward a future where comprehensive AI evaluation is an industry standard.” That’s a mission the many corporate-unaffiliated efforts to create better AI benchmarks can identify with. Whether those efforts will join forces with an AI vendor remains to be seen.