Distributed Data Mining with Limited Knowledge Sharing

Joydeep Ghosh and Srujana Merugu
Department of Electrical and Computer Engineering,
The University of Texas at Austin



While data mining algorithms invariably operate on centralized data, in practice related information is often acquired and stored at geographically distributed locations due to organizational or operational constraints. Centralizing such data before analysis may be undesirable because of computational or bandwidth costs. In some cases, it may not even be possible due to a variety of real-life constraints, including security, privacy, the proprietary nature of data or software, and the accompanying ownership and legal issues.

This paper briefly describes how one can achieve distributed clustering in two different settings that impose severe constraints on the data or knowledge that can be shared among data sites. The first allows only the cluster labels of individual objects to be shared, but not their attributes. The second disallows sharing of either the attributes or the cluster labels of individual objects. In this case, generative (probabilistic) models of local data are used to generate "virtual samples" that are then used to obtain a "global" solution. Applications are identified for both of these settings.
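
As an illustration of the second setting, the following is a minimal sketch rather than the method developed in the paper. It assumes one-dimensional data, a single Gaussian as each site's local generative model, and plain k-means as the global clustering step; the function names and the synthetic site data are hypothetical. Each site shares only its fitted parameters and its sample count, and the central site clusters virtual samples drawn from those models.

```python
import random
import statistics


def fit_local_model(values):
    """Summarize one site's private 1-D data as (mean, std, count)."""
    return statistics.fmean(values), statistics.pstdev(values), len(values)


def draw_virtual_samples(models, total, rng):
    """Sample from the mixture of local Gaussians, weighted by site size."""
    weights = [count for _, _, count in models]
    samples = []
    for _ in range(total):
        mean, std, _ = rng.choices(models, weights=weights, k=1)[0]
        samples.append(rng.gauss(mean, std))
    return samples


def kmeans_1d(values, k, iterations=20):
    """Lloyd's k-means on scalars, initialized at evenly spaced quantiles."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [statistics.fmean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


rng = random.Random(0)

# Hypothetical private data at three sites (assumption: synthetic Gaussians).
site_data = [
    [rng.gauss(0.0, 1.0) for _ in range(200)],
    [rng.gauss(10.0, 1.0) for _ in range(200)],
    [rng.gauss(0.0, 1.0) for _ in range(100)],
]

# Only these model parameters leave each site, never the raw objects.
local_models = [fit_local_model(d) for d in site_data]

# The central site works purely on virtual samples.
virtual = draw_virtual_samples(local_models, total=500, rng=rng)
global_centers = kmeans_1d(virtual, k=2)
print(global_centers)
```

No individual attribute values or cluster labels cross site boundaries in this sketch; richer local models, such as mixtures, would fit the same pattern.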
