Distributed Data Mining with Limited Knowledge Sharing
Ghosh and Srujana Merugu
Department of Electrical and Computer Engineering,
The University of Texas at Austin
While data mining algorithms invariably operate on centralized data, in practice related information is often acquired and stored at geographically distributed locations due to organizational or operational constraints. Centralizing such data before analysis may not be desirable because of computational or bandwidth costs. In some cases, it may not even be possible due to a variety of real-life constraints, including security, privacy, the proprietary nature of data or software, and the accompanying ownership and legal issues.
This paper briefly describes how one can achieve distributed clustering in two different settings that impose severe constraints on the data or knowledge that can be shared among data sites. The first allows only the cluster labels of individual objects to be shared, but not their attributes. The second disallows sharing of either the attributes or the cluster labels of individual objects. In this case, generative (probabilistic) models of the local data are used to generate "virtual samples" that are then used to obtain a "global" solution. Applications are identified for both of these settings.
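To make the second setting concrete, the following is a minimal sketch of the virtual-sample idea, not the method developed in the paper. It assumes that each site summarizes its data with a single Gaussian and that the central site runs plain k-means on the pooled virtual samples. All function names (`fit_local_model`, `draw_virtual_samples`, `kmeans`) and the two synthetic sites are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_local_model(data):
    """Summarize one site's data as (mean, covariance, sample count).
    Assumption: a single Gaussian per site; only these parameters leave the site."""
    return data.mean(axis=0), np.cov(data, rowvar=False), len(data)

def draw_virtual_samples(models, n_total, rng):
    """Sample from each site's model in proportion to the site's data size."""
    total = sum(n for _, _, n in models)
    parts = []
    for mean, cov, n in models:
        k = max(1, round(n_total * n / total))
        parts.append(rng.multivariate_normal(mean, cov, size=k))
    return np.vstack(parts)

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means, used here as a stand-in global clusterer."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center unchanged if its cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Two hypothetical sites whose raw records are never pooled.
rng = np.random.default_rng(0)
site_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
site_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(200, 2))

# Each site shares only its model parameters with the central site.
models = [fit_local_model(site_a), fit_local_model(site_b)]

# The central site clusters virtual samples instead of the original objects.
virtual = draw_virtual_samples(models, n_total=400, rng=rng)
global_centers, _ = kmeans(virtual, k=2, rng=rng)
print(global_centers)
```

In this sketch the central site sees only a mean, a covariance matrix, and a count per site, so neither the attributes nor the cluster labels of individual objects are shared. A richer local model, such as a mixture of Gaussians, could be substituted without changing the overall flow.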