Classification in Networked Data: A Toolkit and a Univariate Case Study
This paper presents NetKit, a modular toolkit for classification in networked data, and a case study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where entities whose labels are to be estimated are linked to entities for which the label is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components, and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how the toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple but important special case of univariate network classification, for which the only information available is the structure of class linkage in the network (i.e., only links and some class labels are available). To our knowledge, no previous work has systematically evaluated the power of class linkage alone for classification in machine learning benchmark data sets. The results demonstrate clearly that simple network-classification models perform remarkably well, well enough that they should be used regularly as baseline classifiers for studies of relational learning for networked data. The results also show that a small number of component combinations excel, and that different components are preferable in different situations, for example when few versus many labels are known.
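The three-component framework described above can be illustrated with a minimal sketch of the univariate setting, in which only links and some class labels are available. The code below is not NetKit's API. The function names, the class-prior local classifier, the neighbor-averaging relational classifier, and the synchronous iterative update are all assumptions chosen for illustration, and the toy graph is made up.

```python
from collections import defaultdict


def class_prior_local(known):
    """Local classifier (assumed choice): class priors from the known labels."""
    counts = defaultdict(float)
    for label in known.values():
        counts[label] += 1.0
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def neighbor_vote(node, graph, beliefs, classes):
    """Relational classifier (assumed choice): weighted average of neighbor beliefs."""
    scores = {c: 0.0 for c in classes}
    total = 0.0
    for nbr, weight in graph[node].items():
        for c in classes:
            scores[c] += weight * beliefs[nbr][c]
        total += weight
    if total == 0.0:
        # Isolated node: keep its current estimate.
        return dict(beliefs[node])
    return {c: s / total for c, s in scores.items()}


def collective_inference(graph, known, iterations=20):
    """Collective inference (assumed choice): repeated synchronous re-estimation."""
    classes = sorted(set(known.values()))
    prior = class_prior_local(known)
    beliefs = {}
    for node in graph:
        if node in known:
            # Known labels are clamped to a point distribution.
            beliefs[node] = {c: 1.0 if c == known[node] else 0.0 for c in classes}
        else:
            # Unknown labels start from the local classifier's estimate.
            beliefs[node] = dict(prior)
    unknown = [n for n in graph if n not in known]
    for _ in range(iterations):
        updated = {n: neighbor_vote(n, graph, beliefs, classes) for n in unknown}
        beliefs.update(updated)
    return {n: max(classes, key=lambda c: beliefs[n][c]) for n in unknown}


# Hypothetical toy network: two triangles joined by the edge c-d.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
known = {"a": "x", "b": "x", "e": "y", "f": "y"}
predictions = collective_inference(graph, known)
print(predictions)
```

In this toy network, node c is assigned class x and node d is assigned class y, each following the majority of its labeled neighbors. Because each component is a separate function, any one of them can be replaced without touching the other two, which is the kind of component-wise comparison the case study relies on.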