Unstructured Data, Econometric Models, and Estimation Bias
This article examines the use of machine learning together with econometric models to analyze unstructured data. Researchers estimate an econometric model (e.g., logit regression, structural model) that relates an outcome of interest (e.g., sales) to a focal feature in unstructured data (e.g., presence of pets in images), with the feature extracted using machine learning algorithms. We focus on the potential estimation bias caused by prediction errors of the machine learning algorithm. Unpacking the causes of this bias, we point out important differences from classical measurement error. In particular, bias in either direction is possible. We derive a strategy to alleviate the bias under the typical setting in which the feature is correctly labeled in a fraction of the sample. The strategy extends and improves on the few pioneering works in this area by covering general nonlinear econometric models and relaxing the assumption that unstructured data affect the outcome only via the focal feature.
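The claim that bias can go in either direction can be illustrated with a small simulation. The sketch below is not the article's model or its correction strategy. It assumes a linear outcome equation estimated by OLS (chosen for transparency, not the nonlinear models the article covers), a binary focal feature `x`, and a hypothetical second feature `z` that stands for other content in the unstructured data. All names, error rates, and coefficients are illustrative assumptions.

```python
import numpy as np


def ols_slope(regressor, outcome):
    """Slope from a simple OLS regression of outcome on regressor (with intercept)."""
    return np.cov(regressor, outcome)[0, 1] / np.var(regressor, ddof=1)


def simulate(n=200_000, beta=1.0, seed=0):
    """Return OLS slopes on the predicted feature under two assumed error mechanisms."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, n)   # true focal feature (e.g., pet present in image)
    z = rng.normal(size=n)        # hypothetical other content in the unstructured data
    e = rng.normal(size=n)        # outcome noise

    # Scenario A (assumption): the classifier flips the label with probability 0.2,
    # independently of everything else, and z does not affect the outcome.
    flip = rng.random(n) < 0.2
    xhat_a = np.where(flip, 1 - x, x)
    y_a = beta * x + e

    # Scenario B (assumption): the classifier's prediction is driven partly by z,
    # and z also affects the outcome directly, so prediction errors are
    # correlated with the outcome equation's unobserved part.
    xhat_b = ((x - 0.5) + z > 0).astype(int)
    y_b = beta * x + z + e

    return ols_slope(xhat_a, y_a), ols_slope(xhat_b, y_b)


slope_a, slope_b = simulate()
print(f"true coefficient: 1.0")
print(f"scenario A estimate (independent errors): {slope_a:.3f}")
print(f"scenario B estimate (errors tied to other content): {slope_b:.3f}")
```

Under these assumptions, scenario A yields an estimate below the true coefficient of 1 (about 0.6 in the population), which is the familiar attenuation. Scenario B yields an estimate above it (about 1.8), because the predicted feature also picks up the direct effect of the other content on the outcome.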