What are the natural loss functions for binary class probability estimation? This question has a simple answer: so-called "proper scoring rules". These loss functions, known from subjective probability, measure the discrepancy between true probabilities and estimates thereof. They comprise all commonly used loss functions: log loss, squared error loss, boosting loss (which we derive from boosting's exponential loss), and cost-weighted misclassification losses. We also introduce a larger class of possibly uncalibrated loss functions that can be calibrated with a link function; an example is exponential loss, which is related to boosting. Proper scoring rules are fully characterized by weight functions ω(η) on class probabilities η = P[Y = 1]. These weight functions give immediate practical insight into loss functions: high mass of ω(η) points to the class probabilities η where the proper scoring rule strives for greatest accuracy. For example, both log loss and boosting loss have poles near zero and one, and hence rely on extreme probabilities. We show that the freedom of choice among proper scoring rules can be exploited when the two types of misclassification have different costs: one can choose proper scoring rules that focus on the cost c of class 0 misclassification by concentrating ω(η) near c. We also show that such "tailoring" can be achieved by cost-weighting uncalibrated loss functions. Tailoring is often beneficial for classical linear models, whereas non-parametric boosting models show fewer benefits. We illustrate tailoring with artificial and real datasets, both for linear models and for non-parametric models based on trees, and compare it with traditional linear logistic regression and a recent version of boosting called "LogitBoost".
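As a concrete illustration of the weight-function characterization, the sketch below numerically recovers two familiar losses from their weights via one common form of the integral representation, L(1|q) = ∫_q^1 (1−c) ω(c) dc and L(0|q) = ∫_0^q c ω(c) dc. It is a minimal numerical check, not the paper's implementation; the function names and the quadrature choice are ours, and the convention (no normalizing constants) is one of several in the literature. Under this convention ω(η) = 1 yields half the squared error and ω(η) = 1/(η(1−η)), which has poles at 0 and 1, yields log loss:

```python
import numpy as np

def trap(f, a, b, n=20000):
    """Plain trapezoidal rule for a smooth integrand on [a, b]."""
    c = np.linspace(a, b, n)
    y = f(c)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(c)) / 2.0)

def partial_losses(omega, q):
    """Partial losses L(1|q), L(0|q) induced by a weight function omega,
    using the representation L(1|q) = int_q^1 (1-c) omega(c) dc and
    L(0|q) = int_0^q c omega(c) dc (one common convention)."""
    eps = 1e-9  # stay off the endpoints, where some weights have poles
    L1 = trap(lambda c: (1.0 - c) * omega(c), q, 1.0 - eps)
    L0 = trap(lambda c: c * omega(c), eps, q)
    return L1, L0

# omega = 1 recovers (half) squared error: L(1|q) = (1-q)^2/2, L(0|q) = q^2/2
S1, S0 = partial_losses(lambda c: np.ones_like(c), q=0.3)

# omega = 1/(c(1-c)) recovers log loss: L(1|q) = -log q, L(0|q) = -log(1-q)
M1, M0 = partial_losses(lambda c: 1.0 / (c * (1.0 - c)), q=0.3)
```

Concentrating ω(η) near a cost value c, as in the tailoring idea above, would simply mean replacing these weights with one peaked at c.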