DS-Means: Distributed Data Stream Clustering
Guerrieri, Alessio; Montresor, Alberto
2012-01-01
Abstract
This paper proposes DS-means, a novel algorithm for clustering distributed data streams. Given a network of computing nodes, each receiving its share of a distributed data stream, our goal is to obtain a common clustering under the following restrictions: (i) the number of clusters is not known in advance, and (ii) nodes are not allowed to share individual points of their datasets, only aggregate information. A motivating example for DS-means is the decentralized detection of botnets, where a collection of independent ISPs may want to detect common threats but are unwilling to share their users' valuable data. In DS-means, nodes execute a distributed version of K-means on each chunk of data they receive to provide a compact representation of the data of the entire network. X-means is then executed on this representation to obtain an estimate of the number of clusters. A number of experiments on both synthetic and real-life datasets show that our algorithm is precise, efficient...
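To make restriction (ii) concrete, the following is a minimal sketch of how a K-means step can run across nodes that exchange only per-cluster sums and counts, never individual points. It is an illustration under stated assumptions, not the authors' protocol: the function names (`local_aggregates`, `merge_and_update`, `distributed_kmeans`) are hypothetical, the merge is simulated in a single process rather than over a network, and the later X-means step that estimates the number of clusters is omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Aggregate = Tuple[List[List[float]], List[int]]  # (per-cluster sums, per-cluster counts)


def local_aggregates(points: Sequence[Point], centroids: Sequence[Point]) -> Aggregate:
    """Run on one node: assign local points to the nearest centroid and
    return only per-cluster coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Index of the centroid with the smallest squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def merge_and_update(aggregates: Sequence[Aggregate], centroids: Sequence[Point]) -> List[Point]:
    """Combine the aggregates of all nodes into new centroids.
    Assumption: done centrally here; a real deployment would need some
    network-wide aggregation mechanism."""
    dim = len(centroids[0])
    updated = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in aggregates)
        if total == 0:
            updated.append(tuple(old))  # keep an empty cluster's centroid unchanged
            continue
        updated.append(
            tuple(sum(sums[j][d] for sums, _ in aggregates) / total for d in range(dim))
        )
    return updated


def distributed_kmeans(node_chunks: Sequence[Sequence[Point]],
                       centroids: Sequence[Point],
                       iterations: int = 10) -> List[Point]:
    """Repeat the local-aggregate and merge steps a fixed number of times."""
    for _ in range(iterations):
        aggregates = [local_aggregates(chunk, centroids) for chunk in node_chunks]
        centroids = merge_and_update(aggregates, centroids)
    return list(centroids)
```

The point of the design is that the merged sums and counts yield the same centroids a centralized K-means step would compute on the union of the data, while each node reveals only cluster-level totals.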



