‘Crowd wisdom’ refers to the surprising accuracy that can be attained by averaging judgments from independent individuals. However, independence is unusual; people often discuss and collaborate in groups. When does group interaction improve vs. degrade judgment accuracy relative to averaging the group's initial, independent answers? Two large laboratory studies explored the effects of 969 face-to-face discussions on the judgment accuracy of 211 teams facing a range of numeric estimation problems from geographic distances to historical dates to stock prices. Although participants nearly always expected discussions to make their answers more accurate, the actual effects of group interaction on judgment accuracy were decidedly mixed. Importantly, a novel, group-level measure of collective confidence calibration robustly predicted when discussion helped or hurt accuracy relative to the group's initial independent estimates. When groups were collectively calibrated prior to discussion, with more accurate members being more confident in their own judgment and less accurate members less confident, subsequent group interactions were likelier to yield increased accuracy. We argue that collective calibration predicts improvement because groups typically listen to their most confident members. When confidence and knowledge are positively associated across group members, the group's most knowledgeable members are more likely to influence the group's answers.