This study from MIT researchers raises some challenging questions about collaborative AI interfaces, “human in the loop” supervision, and the value of explaining AI logic and confidence.
Their meta-study looked at over 100 experiments in which humans and AI worked both separately and together to accomplish tasks. They found that some tasks benefited a ton from human-AI teamwork, while others got worse results from the pairing.
Poor Performers Make Poor Supervisors
For tasks where humans working solo do worse than AI, the study found that putting humans in the loop to make final decisions actually delivers worse results than letting AI work alone. For example, in a task to detect fake reviews, AI working alone achieved 73% accuracy and humans hit 55%, but the combined human-AI system landed at 69%, watering down what AI could do alone.
In these scenarios, people oscillate between over-reliance (“using suggestions as strong guidelines without seeking and processing more information”) and under-reliance (“ignoring suggestions because of adverse attitudes towards automation”).
“Since the people were less accurate, in general, than the AI algorithms, they were also not good at deciding when to trust the algorithms and when to trust their own judgement, so their participation resulted in lower overall performance than for the AI algorithm alone.”
Takeaway: “Human in the loop” may be an anti-pattern for tasks where AI outperforms humans. Measure results; don’t assume that human judgment always makes things better.
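To make "measure results" concrete, here is a minimal sketch that compares a combined human-AI system against each solo baseline, using the fake-review accuracy figures quoted above. The function name and the keys of the returned dictionary are my own illustrative labels, not terminology from the study.

```python
def compare_to_baselines(human_alone, ai_alone, combined):
    """Compare a combined human-AI accuracy against each solo baseline.

    All arguments are accuracies between 0 and 1. The key names below are
    illustrative assumptions, not terms taken from the study.
    """
    best_solo = max(human_alone, ai_alone)
    return {
        # Does adding AI help the human compared with working alone?
        "beats_human_alone": combined > human_alone,
        # Does the pairing beat whichever party is stronger on its own?
        "beats_best_solo": combined > best_solo,
        # Negative means the pairing lost accuracy versus the best solo option.
        "gap_vs_best_solo": round(combined - best_solo, 4),
    }


# Fake-review detection figures quoted in this post.
result = compare_to_baselines(human_alone=0.55, ai_alone=0.73, combined=0.69)
print(result)
```

With these numbers, the pairing beats the human alone but falls four percentage points short of the AI alone, which is the pattern the study describes.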
Explanations Didn’t Help
The study found that common design patterns like AI explanations and confidence scores showed no significant impact on performance for human-AI collaborative systems. “These factors have received much attention in recent years [but] do not impact the effectiveness of human-AI collaboration,” the study found.
“Given our result that, on average across our 300+ effect sizes, they do not impact the effectiveness of human-AI collaboration, we think researchers may wish to de-emphasize this line of inquiry and instead shift focus to the significant and less researched moderators we identified: the baseline performance of the human and AI alone, the type of task they perform, and the division of labour between them.”
Takeaway: Transparency doesn’t always engage the best human judgment; explanations and confidence scores need refinement—or an entirely new alternative. I suspect that changing the form, manner, or tone of these explanations could improve outcomes, but also: Are there different ways to better engage critical thinking and productive skepticism?
Creative Tasks FTW
The study found that human-AI collaboration was most effective for open-ended creative and generative tasks, but less effective for decision-making tasks that involve choosing between defined options. For those decision-making tasks, either humans or AI did better working alone.
“We hypothesize that this advantage for creation tasks occurs because even when creation tasks require the use of creativity, knowledge or insight for which humans perform better, they often also involve substantial amounts of somewhat routine generation of additional content that AI can perform as well as or better than humans.”
This is a great example of “let humans do what they do best, and let machines do what they do best.” They’re rarely the same thing. And creative/generative tasks tend to have elements of each, where humans excel at creative judgment, and the machines excel at production/execution.
Takeaway: Focus human-machine collaboration on creative and generative tasks; humans and AI may handle decision-making tasks better solo.
Divide and Conquer
A very small number of experiments in the study split tasks between human and machine intelligence based on their respective strengths. While only three of the 100+ experiments explored this approach, the researchers hypothesized that “better results might have been obtained if the experimenters had designed processes in which the AI systems did only the parts of the task for which they were clearly better than humans.” This suggests an opportunity for designers to explore more intentional division of labor in human-AI interfaces. Break out your journey maps, friends.
Takeaway: Divvy up and define tasks narrowly around the demonstrated strengths of both humans and machines, and make responsibilities clear for each.
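One way to picture this kind of division of labor is a rough sketch that assigns each subtask to whichever party has the better measured solo accuracy. Everything here is hypothetical: the subtask names, the accuracy numbers, and the tie-breaking rule are invented for illustration and do not come from the study.

```python
def assign_subtasks(measured):
    """Assign each subtask to the party with the higher measured accuracy.

    `measured` maps a subtask name to {"human": accuracy, "ai": accuracy}.
    Ties go to the human; that is an arbitrary choice for this sketch.
    """
    plan = {}
    for subtask, scores in measured.items():
        plan[subtask] = "ai" if scores["ai"] > scores["human"] else "human"
    return plan


# Hypothetical measurements for an imagined content workflow.
measured = {
    "draft_variations": {"human": 0.60, "ai": 0.80},
    "judge_brand_fit": {"human": 0.85, "ai": 0.65},
    "check_formatting": {"human": 0.90, "ai": 0.90},
}

plan = assign_subtasks(measured)
for subtask, owner in plan.items():
    print(f"{subtask}: {owner}")
```

The point is less the code than the input it needs: you can only split work by "demonstrated strengths" if you have measured each party on each subtask first.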