Research
BAIF supports a broad range of technical AI safety research through staff research, university partnerships, and its Buterin Fellowship program. Here are some examples.
Our Quantitative AI Safety Initiative (QAISI) is a multi-university partnership pursuing quantitative guarantees that AI will not cause harm. Just as the FDA requires quantitative estimates of side effects before approving a new drug, and the FAA requires quantitative bounds on engine failure rates before approving a new aircraft, QAISI supports the pursuit of quantitative guarantees that AI systems will do what we want, with the hope that appropriate guarantees of this type will become required in future mandatory AI safety standards.
Further reading: guaranteed safe AI, provably safe AI
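To make the FDA/FAA analogy concrete, here is a minimal sketch of what one very simple quantitative guarantee could look like: a statistical upper bound on a system's failure rate, estimated from safety trials and compared against a mandated threshold. This is an illustrative toy, not QAISI's methodology; the function name, trial counts, and the 1e-3 threshold are all hypothetical.

```python
# Toy "quantitative guarantee" in the spirit of the FAA engine-failure analogy.
# Given n independent safety trials with k observed failures, compute a 95%
# upper confidence bound on the true failure probability (Clopper-Pearson).
from scipy.stats import beta

def failure_rate_upper_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson upper confidence bound on the failure rate."""
    if k == n:
        return 1.0
    return beta.ppf(confidence, k + 1, n - k)

# Example: 0 failures observed in 10,000 trials gives a bound of roughly 3e-4.
bound = failure_rate_upper_bound(k=0, n=10_000)
print(f"95% upper bound on failure rate: {bound:.2e}")
# A (hypothetical) standard might then require this bound to be below 1e-3.
print("meets hypothetical 1e-3 standard" if bound < 1e-3 else "does not meet standard")
```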
Mechanistic interpretability, a.k.a. artificial neuroscience, is the quest to understand how AI systems work under the hood, with the aim of assessing and improving their trustworthiness. BAIF has supported this both through research and by helping organize an MIT conference and an ICML workshop. These events showcased the striking progress of the past couple of years, with growing understanding of how artificial neural networks learn and encode concepts, knowledge and algorithms. For example, millions of concepts have now been mapped in a large language model, and this 2024 paper automatically distilled machine-learned algorithms out of trained neural networks and converted them to simple Python code.
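As a toy illustration of what "understanding under the hood" can mean in practice, the sketch below trains a tiny network on a synthetic task and then uses a linear probe to test whether a human-understandable concept is readable from its hidden activations. This is a generic textbook technique shown for illustration only; it is not the method of the papers cited above, and the task, network size, and concept are all made up for the example.

```python
# Train a small network on a synthetic task, then probe its hidden layer for a
# simple concept ("the first input feature is positive").
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # task the network is trained on
concept = (X[:, 0] > 0).astype(int)          # concept we probe for afterwards

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, y)

# Recompute the hidden-layer activations by hand from the learned weights.
hidden = np.maximum(0, X @ net.coefs_[0] + net.intercepts_[0])   # ReLU layer

# Fit a linear probe: high accuracy suggests the concept is linearly encoded.
probe = LogisticRegression(max_iter=1000).fit(hidden, concept)
print(f"probe accuracy for the concept: {probe.score(hidden, concept):.2f}")
```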
Audit standards & red-teaming: Another theme, pursued mainly via the Buterin Fellow's program, involves technical research informing ongoing AI governance discussions. For example, if audits are required of sufficiently powerful AI models, then how should these audits best be done and what should they accomplish? Red teaming in AI research refers to a systematic and adversarial approach to testing and evaluating artificial intelligence systems to identify vulnerabilities, weaknesses, and potential risks. This process involves a team of experts, known as the red team, who simulate attacks, jailbraks, misuse, or other disruptive scenarios to challenge the AI's robustness, security, and ethical implications. This can enhance the system's reliability, safety, and trustworthiness before deployment in real-world applications.