Figure 2 - available via license: Creative Commons Attribution 4.0 International
Calibration and Accuracy of OvA Parameterization on CIFAR-10: Subfigure (a) reports a reliability diagram and the expected calibration error (ECE) for $p^{\mathrm{OvA}}_{m}(x)$ (Eq. 9). A darker bin shade means more samples in the bin. Subfigure (b) shows the distribution of risk estimates. Subfigure (c) reports accuracy as a function of increasing expert expertise (left) and of varying coverage (right).
Source publication
The learning to defer (L2D) framework has the potential to make AI systems safer. For a given input, the system can defer the decision to a human if the human is more likely than the model to take the correct action. We study the calibration of L2D systems, investigating if the probabilities they output are sound. We find that Mozannar & Sontag's (...
Contexts in source publication
Context 1
... now test our OvA method's calibration in the same experimental setting used to test the softmax method in Section 3. To reiterate, the expert has a 75% chance of being correct on the first five classes and performs at random chance on the last five. Figure 2a reports a reliability diagram and the ECE. Compared to the softmax results in Figure 1b, our OvA loss produces a model with an over fifty percent reduction in ECE: 7.58 for softmax, 3.01 for OvA. Figure 2b reports the empirical distribution of error estimates: $1 - p^{\mathrm{OvA}}_{m}(x)$. ...
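As an illustration of the ECE metric mentioned above, the following is a minimal sketch of one common way to compute it, assuming equal-width confidence bins. The function name, the bin count, and the inputs are hypothetical; the excerpt does not state the binning scheme the authors used. Here `confidences` would hold the predicted probabilities that the expert is correct, and `correct` would hold 0/1 indicators of whether the expert actually was correct.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (an assumed binning scheme).

    confidences: predicted probabilities in [0, 1]
    correct:     0/1 outcomes, one per prediction
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

The function returns a fraction in [0, 1]. Values such as 7.58 and 3.01 are presumably this quantity expressed in percent, although the excerpt does not say so explicitly.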
Context 2
... to the softmax results in Figure 1b, our OvA loss produces a model with an over fifty percent reduction in ECE: 7.58 for softmax, 3.01 for OvA. Figure 2b reports the empirical distribution of error estimates: $1 - p^{\mathrm{OvA}}_{m}(x)$. Unlike the corresponding softmax results in Figure 1c, the OvA method produces sharper modes nearer to the true error values. ...
Context 3
... then vary k from k = 2 to k = 8. The left plot in Figure 2c shows accuracy vs k. Our OvA model (blue) has a modest but consistent advantage over the softmax model (red). ...
Context 4
... right plot in Figure 2c reports the accuracy vs coverage, where coverage is the proportion of samples that the system has not deferred. Classifier Accuracy is the accuracy on the non-deferred samples. ...
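The two quantities defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name and the argument names (`deferred`, `model_preds`, `labels`) are hypothetical.

```python
def coverage_and_classifier_accuracy(deferred, model_preds, labels):
    """Return (coverage, classifier accuracy).

    deferred:    booleans, True where the system deferred to the expert
    model_preds: classifier predictions, one per sample
    labels:      ground-truth labels, one per sample
    """
    # Keep only the samples the system handled itself.
    kept = [(p, y) for d, p, y in zip(deferred, model_preds, labels) if not d]
    coverage = len(kept) / len(labels)
    if not kept:
        # Accuracy is undefined when every sample was deferred.
        return coverage, float("nan")
    accuracy = sum(p == y for p, y in kept) / len(kept)
    return coverage, accuracy
```

For example, with four samples of which one is deferred and two of the remaining three are classified correctly, coverage is 0.75 and classifier accuracy is 2/3.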
Context 5
... this section, we verify calibration's role in the overall system's accuracy. For the trained one-vs-all model from Figure 2c, we apply a post-processing calibration technique called temperature scaling (Guo et al., 2017) to further calibrate the rejector. In Figure 5, we see that this additional calibration step marginally improves the system's accuracy. ...
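Temperature scaling divides a model's logits by a single scalar T, chosen on held-out data to minimize the negative log-likelihood. The sketch below illustrates the idea for a single sigmoid output, as one might have for a one-vs-all rejector. It assumes a simple grid search in place of a gradient-based optimizer, and all names here (`fit_temperature`, `binary_nll`, the grid range) are hypothetical; the excerpt does not describe how the authors fit T.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_nll(logits, targets, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z / temperature)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(logits)


def fit_temperature(logits, targets, grid=None):
    """Pick the T on a grid that minimizes validation NLL (assumed search)."""
    if grid is None:
        grid = [0.05 * i for i in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda T: binary_nll(logits, targets, T))
```

A fitted T greater than 1 softens overconfident probabilities, and a T below 1 sharpens underconfident ones. Because a positive T never changes the sign of a logit, a decision made by thresholding that single output at 0.5 is unaffected. Decisions that compare the rescaled rejector output against other, unscaled outputs can change.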