Associated Incidents

Toronto recently used an AI tool to predict when a public beach would be safe. It went horribly awry.
The developer claimed the tool achieved over 90% accuracy in predicting when beaches would be safe to swim in. But the tool did much worse: on a majority of the days when the water was in fact unsafe, beaches remained open based on the tool's assessments. It was less accurate than the previous method of simply testing the water for bacteria each day.
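How can both claims be true at once? Overall accuracy is a misleading summary when unsafe days are rare: a model that leans toward predicting "safe" gets credit for the easy majority of days while missing most of the dangerous ones. Here is a minimal sketch with made-up numbers, not Toronto's actual data:

```python
# Toy illustration with made-up numbers (not Toronto's data): when unsafe
# days are rare, a model can clear 90% accuracy while missing most of them.

# 100 simulated days: 90 safe (True) and 10 unsafe (False).
actual    = [True] * 90 + [False] * 10
# The model gets 88 of the 90 safe days right but catches only 3 of the
# 10 unsafe days.
predicted = [True] * 88 + [False] * 2 + [True] * 7 + [False] * 3

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Of the days the water was actually unsafe, how many did the model flag?
unsafe = [(a, p) for a, p in zip(actual, predicted) if not a]
caught = sum(not p for _, p in unsafe) / len(unsafe)

print(f"Overall accuracy:   {accuracy:.0%}")  # 91% -- the number in the pitch
print(f"Unsafe days caught: {caught:.0%}")    # 30% -- beaches stay open
```

The question a buyer should ask is not "how accurate is the tool overall?" but "what fraction of the truly unsafe days does it catch?", i.e. its recall on the rare class.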
We do not find this surprising. In fact, we consider this to be the default state of affairs when an AI risk prediction tool is deployed.
The Toronto tool involved an elementary performance evaluation failure: city officials never checked the performance of the deployed model over the summer. But much more subtle failures are possible. Perhaps the model is generally accurate but occasionally misses even extremely high bacteria levels. Or it works well on most beaches but totally fails on one particular beach. It's not realistic to expect non-experts to comprehensively evaluate a model. Unless the customer of an AI risk prediction tool has internal experts, they're buying the tool on faith. And if they do have their own experts, it's usually easier to build the tool in-house!
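A per-group breakdown is the kind of check an expert would run and a non-expert buyer almost never does. A minimal sketch, with invented results and generic beach names used purely as placeholders:

```python
# Hypothetical per-day results: (beach, model_was_correct). The aggregate
# number looks respectable, but grouping by beach exposes a total failure
# at one location that the headline metric completely hides.
from collections import defaultdict

days = (
    [("Beach A", True)] * 48 + [("Beach A", False)] * 2 +
    [("Beach B", True)] * 47 + [("Beach B", False)] * 3 +
    [("Beach C", False)] * 10   # the model is wrong on every day here
)

by_beach = defaultdict(list)
for beach, correct in days:
    by_beach[beach].append(correct)

print(f"Overall accuracy: {sum(c for _, c in days) / len(days):.0%}")  # 86%
for beach, results in sorted(by_beach.items()):
    print(f"  {beach}: {sum(results) / len(results):.0%}")  # Beach C: 0%
```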
When officials were questioned about the tool's efficacy, they deflected, saying that the tool was never used on its own: a human always made the final decision. But they did not answer questions about how often the human decision-makers overrode the tool's recommendations.
This is also a familiar pattern: AI vendors often use a bait-and-switch when it comes to human oversight. They sell these tools on the promise of full automation and the elimination of jobs, but when concerns are raised about bias, catastrophic failure, or other well-known limitations of AI, they retreat to the fine print, which says that the tool shouldn't be used on its own. Their promises lead to over-automation, with AI tools used for tasks far beyond their capabilities.
Here are three other stories of similar failures of risk prediction models.
Epic's sepsis prediction debacle
Epic is a large healthcare software company. It stores health data for over 300 million patients. In 2017, Epic released a sepsis prediction model. Over the next few years, it was deployed in hundreds of hospitals across the U.S. However, a 2021 study from researchers at the University of Michigan found that Epic's model performed far worse than the developer claimed.
The tool's inputs included information about whether a patient was given antibiotics. But if a patient is given antibiotics, they have already been diagnosed with sepsis, so the model could look accurate in evaluation simply by restating diagnoses that doctors had already made.
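This is a form of data leakage, and it reliably inflates offline evaluations. The sketch below uses synthetic data (nothing here reflects Epic's actual model, features, or numbers) to show how a feature recording a treatment ordered after clinicians already suspect the condition makes a model score dramatically better than its honest signal warrants:

```python
# Synthetic sketch of label leakage (not Epic's actual model or features):
# a feature recording a treatment given after clinicians already suspect
# sepsis lets the model read the answer off its own input.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
sepsis = rng.random(n) < 0.1                  # 10% of patients develop sepsis

vitals = rng.normal(size=n) + 0.5 * sepsis    # weak but honest signal
antibiotics = sepsis & (rng.random(n) < 0.9)  # leaky: ordered AFTER suspicion

for name, X in [
    ("vitals only     ", vitals.reshape(-1, 1)),
    ("with antibiotics", np.column_stack([vitals, antibiotics])),
]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, sepsis, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name} AUC = {auc:.2f}")  # the leaky feature looks far better
```

In deployment, the inflated performance evaporates: the whole point of a sepsis model is to raise the alarm before clinicians have suspected sepsis and ordered antibiotics.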