
We evaluate whether model confidence survives contact with reality using pre-registered, reproducible methodologies.
Avenridge Institute conducted a pre-registered calibration evaluation of the publicly available model:
using the SST-2 validation dataset.
The objective was to assess not only raw classification accuracy but also whether the model's probability outputs were statistically calibrated, that is, whether its stated confidence matched the observed rate of correct predictions.
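To make the distinction between accuracy and calibration concrete, here is a minimal sketch of one widely used calibration measure, expected calibration error (ECE) with equal-width confidence bins. The page does not state which metric or binning the evaluation used, so the function name, the choice of ECE, the `n_bins` default, and the toy inputs are all assumptions for illustration and are not the Institute's data or results.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error using equal-width bins over [0, 1].

    confidences: predicted probability of the predicted class, one per example.
    correct:     booleans, True where the prediction matched the label.
    n_bins:      number of bins (an illustrative default, not from the source).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| across non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data only: ten predictions at 0.9 confidence, nine of them correct.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])

# Toy data only: four predictions at 0.99 confidence, two of them correct.
overconfident = expected_calibration_error([0.99] * 4, [True, True, False, False])
```

In the first toy case confidence and accuracy agree, so the error is close to zero; in the second the model claims 99% confidence while being right half the time, which produces a large gap. A model can score well on accuracy and still show gaps of the second kind.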
The evaluation protocol was locked before execution using a documented pre-registration methodology.
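The page does not describe how the protocol was locked. One common way to make a pre-registration verifiable is to publish a cryptographic fingerprint of the protocol before any results exist, so that later changes are detectable. The sketch below illustrates that idea only; the function name and every protocol field are hypothetical and are not taken from the Institute's methodology.

```python
import hashlib
import json


def protocol_fingerprint(protocol: dict) -> str:
    """Return a SHA-256 hex digest of a protocol description.

    The protocol is serialized with sorted keys and fixed separators so
    that the same content always yields the same digest.
    """
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical protocol fields, for illustration only.
protocol = {
    "dataset": "SST-2 validation",
    "metric": "expected calibration error",
    "n_bins": 10,
}

# Record this digest before running the evaluation.
locked_digest = protocol_fingerprint(protocol)

# After the evaluation, recompute and compare to confirm nothing changed.
unchanged = protocol_fingerprint(protocol) == locked_digest

# Any edit to the protocol, such as a different bin count, changes the digest.
tampered = dict(protocol, n_bins=15)
tamper_detected = protocol_fingerprint(tampered) != locked_digest
```

The design choice here is that the digest can be published or timestamped without revealing the protocol itself, while still letting anyone verify later that the evaluated protocol matches the registered one.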
The model achieved:
However, the model failed the pre-registered calibration criterion.
The strongest observed pattern was bimodal overconfidence:
The evaluation demonstrated that strong benchmark accuracy does not necessarily imply reliable probabilistic calibration.
Inquiries regarding verification methodology, calibration audits, and institutional partnerships are welcome.