Spatial Econometric Modelling Of Massive Datasets: The Contribution Of Data Mining
In this paper we provide a brief overview of some of the most recent empirical research on spatial econometric models and spatial data mining. Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining is the process of discovering interesting, potentially useful, high-utility patterns embedded in large spatial datasets. The field of spatial data mining has been influenced by many other disciplines: database technology, artificial intelligence, machine learning, probabilistic statistics, visualization, information science, and pattern recognition. The process is more complex than conventional data mining because of the complexities inherent in spatial data, which are multi-sourced, multi-typed, multi-scaled, heterogeneous, and dynamic. The main difference between data mining and spatial data mining is that spatial data mining tasks use not only non-spatial attributes (as is usual in mining non-spatial data) but also spatial attributes. We suggest some directions along which spatial econometric modeling could benefit from cross-fertilization with spatial data mining techniques such as Classification and Regression Trees (CART). We use the CART algorithm to fit empirical data and produce a tree of optimal size for different specifications of econometric models. We also examine some diagnostic measures to evaluate the spatial autocorrelation of the pseudo-residuals obtained from the regression tree analysis, and we compare the accuracy and performance of different versions of CART that take into account the effects of spatial dependence.
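As an illustration of the kind of diagnostic mentioned above, the following minimal sketch computes Moran's I, a standard measure of spatial autocorrelation, on a vector of pseudo-residuals (observed values minus regression tree predictions). The abstract does not say which diagnostic measures are used, so the choice of Moran's I, the function name `morans_i`, the toy weights matrix `W`, and the residual values are all assumptions made for illustration.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a dense n x n spatial weights matrix.

    I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2),
    where z are deviations from the mean and S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]             # deviations from the mean
    s0 = sum(sum(row) for row in weights)      # total weight S0
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    denom = sum(d * d for d in z)
    return (n / s0) * (cross / denom)


# Hypothetical example: four sites on a line; adjacent sites are neighbours.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_residuals = [1.0, 2.0, 3.0, 4.0]        # similar values cluster together
alternating_residuals = [1.0, -1.0, 1.0, -1.0]  # neighbours always differ

print(morans_i(smooth_residuals, W))       # positive: spatial clustering
print(morans_i(alternating_residuals, W))  # negative: spatial alternation
```

A value near zero would suggest that the tree has absorbed most of the spatial structure, while a clearly positive value indicates that spatial dependence remains in the pseudo-residuals.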
To address this issue, we start by examining a non-spatial regression tree, then include the geographical coordinates of the data in the covariate set, and finally consider one of the most common spatial econometric models, the Spatial Lag model, combined with two versions of regression trees: a non-spatial regression tree and a regression tree based on geographical coordinates. This allows us to determine the strength and possible role of the spatial arrangement of the variables in the predictive model and to reduce the effect of spatial autocorrelation on prediction errors. In particular, we test the sensitivity of the various regression trees to different specifications of the spatial weights matrix, with the aim of removing spatial autocorrelation from the pseudo-residuals and improving the accuracy of spatial predictive models.
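The covariate sets described above can be sketched as follows. This is one possible reading rather than the paper's implementation: it assumes a row-standardized k-nearest-neighbour weights matrix and treats the spatial lag Wy as an additional covariate alongside the coordinates. The helper names `knn_weights`, `spatial_lag`, and `augment`, and the toy data, are hypothetical.

```python
import math


def knn_weights(coords, k):
    """Row-standardized k-nearest-neighbour weights as a list of {j: w_ij} dicts."""
    weights = []
    for i, p in enumerate(coords):
        # Rank all other sites by Euclidean distance (ties broken by index).
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: (math.dist(p, coords[j]), j),
        )
        weights.append({j: 1.0 / k for j in others[:k]})  # each row sums to 1
    return weights


def spatial_lag(weights, y):
    """Spatial lag Wy: weighted average of y over each site's neighbours."""
    return [sum(w * y[j] for j, w in row.items()) for row in weights]


def augment(X, coords, y, k):
    """Append coordinates and the spatial lag of y to each covariate row."""
    lag = spatial_lag(knn_weights(coords, k), y)
    return [list(x) + list(c) + [l] for x, c, l in zip(X, coords, lag)]


# Hypothetical toy data: four sites on a line, one non-spatial covariate.
coords = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
X = [[1], [2], [3], [4]]
y = [10.0, 20.0, 30.0, 40.0]

for row in augment(X, coords, y, k=1):
    print(row)  # [covariate, x-coordinate, y-coordinate, spatial lag of y]
```

The augmented rows can be passed to any regression tree implementation. Changing `k`, or replacing `knn_weights` with a distance-band or contiguity scheme, is one way to probe sensitivity to the specification of the spatial weights matrix.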
C14 - Semiparametric and Nonparametric Methods ; C31 - Cross-Sectional Models; Spatial Models ; C52 - Model Evaluation and Testing ; C81 - Methodology for Collecting, Estimating, and Organizing Microeconomic Data ; R10 - General Regional Economics: General