The model is trained only on step functions (left), yet it learns to make smooth predictions (right), just like the true posterior for the step-function prior.

Source publication
Preprint
Traditionally, neural network training has been viewed primarily as an approximation of maximum likelihood estimation (MLE). This interpretation originated at a time when training for multiple epochs on small datasets was common and performance was data-bound, but it falls short in the era of large-scale single-epoch training ushered in by large s...

Contexts in source publication

Context 1
... training, we sample step functions, see Figure 1 (left), that start at different heights ∆y, have different step positions ∆x, and have different step sizes h. We can define the set of functions in our latent set as ...
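The sampling procedure described in this context can be sketched as follows. The uniform ranges for the starting height ∆y, step position ∆x, and step size h are illustrative assumptions; the excerpt does not state the paper's exact distributions.

```python
import numpy as np

def sample_step_function(rng, x):
    """Draw one step function from the latent prior.

    NOTE: the uniform ranges below are illustrative assumptions,
    not the paper's exact prior distributions.
    """
    dy = rng.uniform(-1.0, 1.0)   # starting height (Delta y)
    dx = rng.uniform(0.0, 1.0)    # step position (Delta x)
    h = rng.uniform(-1.0, 1.0)    # step size
    return np.where(x < dx, dy, dy + h)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = sample_step_function(rng, x)  # one training curve: flat, then a jump
```

Each sampled curve takes at most two values, so the training data itself is never smooth.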
Context 2
... predictions in Figure 1 (right) are smooth, as there are multiple step functions that might have produced the line, and the PPD now averages all of these step functions as in Equation 6. Thus, even though our model was trained solely on non-smooth step functions, its predictions are (approximately) smooth as they are an average of many step functions. ...
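The averaging behind this effect can be illustrated with a crude Monte Carlo approximation: rejection-sample step functions from the prior that roughly reproduce the observed points, then average them at the query locations. The tolerance-based acceptance (`tol`) is an illustrative stand-in for the exact posterior weighting of Equation 6, and the uniform prior ranges are assumptions, not the paper's values.

```python
import numpy as np

def ppd_mean(x_obs, y_obs, x_query, n_samples=20_000, tol=0.05, rng=None):
    """Approximate the posterior predictive mean by averaging prior
    step functions that (approximately) reproduce the observations.

    NOTE: the rejection step is an illustrative stand-in for the
    exact posterior weighting; prior ranges are assumptions.
    """
    rng = rng or np.random.default_rng(0)
    accepted = []
    for _ in range(n_samples):
        dy = rng.uniform(-1.0, 1.0)   # starting height
        dx = rng.uniform(0.0, 1.0)    # step position
        h = rng.uniform(-1.0, 1.0)    # step size
        if np.max(np.abs(np.where(x_obs < dx, dy, dy + h) - y_obs)) < tol:
            accepted.append(np.where(x_query < dx, dy, dy + h))
    # averaging many non-smooth steps yields an (approximately) smooth curve
    return np.mean(accepted, axis=0)

x_obs, y_obs = np.array([0.2]), np.array([0.0])   # a single observed point
mu = ppd_mean(x_obs, y_obs, np.linspace(0.0, 1.0, 5))
```

Even though every accepted sample is a hard step, their average varies gradually with the query position, mirroring the smooth predictions in Figure 1 (right).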
Context 3
... can further see this effect for more interesting priors, too. In Figure 11 of the appendix, we show that when we feed more data to our prior that combines sine curves and lines (Section 4.3), we see the same effect of converging to a wrong solution despite starting from a good prediction. Finally, in Figure 12 of the appendix we show this effect even happens for a Gaussian process, when modelling a step function. ...
Context 4
... Figure 11 of the appendix, we show that when we feed more data to our prior that combines sine curves and lines (Section 4.3), we see the same effect of converging to a wrong solution despite starting from a good prediction. Finally, in Figure 12 of the appendix we show this effect even happens for a Gaussian process, when modelling a step function. The Gaussian process becomes increasingly worse at modelling the step as more ground truth data is provided. ...
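A minimal numpy sketch shows why a smooth Gaussian process struggles with a step. With an RBF kernel (the kernel choice, length scale, and noise level are illustrative assumptions, not values from the paper), the posterior mean must smooth over the discontinuity rather than reproduce it sharply:

```python
import numpy as np

def gp_posterior_mean(x_train, y_train, x_test, length=0.2, noise=1e-4):
    """Posterior mean of GP regression with an RBF kernel.

    NOTE: kernel, length scale, and noise are illustrative
    assumptions, not hyperparameters from the paper.
    """
    def k(a, b):  # squared-exponential kernel matrix
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    return k(x_test, x_train) @ np.linalg.solve(K, y_train)

# Fit a unit step: the smooth RBF prior cannot represent the jump,
# so the posterior mean blurs it instead of reproducing it sharply.
x_tr = np.linspace(0.0, 1.0, 20)
y_tr = (x_tr > 0.5).astype(float)
mean = gp_posterior_mean(x_tr, y_tr, np.linspace(0.0, 1.0, 50))
```

Far from the step the fit is accurate, but no amount of extra data makes the RBF posterior mean discontinuous, which is consistent with the excerpt's observation that more ground truth data does not help.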
Context 5
... models consistently overestimate the sine magnitude relative to the true posterior as the slope increases. The predictions of the eight best models are presented in Figure 10 in the appendix. ...

Similar publications

Preprint
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which...