Text Clustering with Seeds Affinity Propagation


Based on an effective clustering algorithm, Affinity Propagation (AP), we present in this paper a novel semisupervised text clustering algorithm called Seeds Affinity Propagation (SAP). There are two main contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method to improve the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely, the k-means algorithm and the original AP algorithm. Furthermore, we analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and that the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time (20 times faster) with respect to k-means, and provides enhanced robustness compared with all other methods.
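The abstract reports its results as F-measure and entropy without defining either measure on this page. The sketch below shows one common way these two scores are computed for a clustering when reference class labels are available. It assumes the definitions widely used in document clustering evaluation (class-weighted best F-score, and size-weighted cluster entropy in bits); the paper's exact formulas may differ, and the function names and toy labels are hypothetical.

```python
import math
from collections import Counter


def f_measure(classes, clusters):
    """Overall F-measure: for each reference class, take the best F-score
    over all clusters, then average weighted by class size.
    Assumed definition; the paper may normalize differently."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))  # documents shared by class and cluster
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint.get((c, k), 0)
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


def entropy(classes, clusters):
    """Size-weighted average of the class entropy inside each cluster (in bits).
    Lower is better; 0 means every cluster holds a single class."""
    n = len(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    total = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]  # share of class c inside cluster k
        total -= (cluster_sizes[k] / n) * p * math.log2(p)
    return total


# Toy example (made-up labels): 8 documents, 2 classes, 2 clusters,
# with one document of class "a" placed in the wrong cluster.
toy_classes = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_clusters = [0, 0, 0, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("F-measure:", round(f_measure(toy_classes, toy_clusters), 3))
    print("Entropy:  ", round(entropy(toy_classes, toy_clusters), 3))
```

On the toy labels, the single misplaced document lowers the F-measure to about 0.873 and raises the entropy to about 0.451, whereas a perfect clustering would score 1 and 0 respectively. This is the direction of improvement the abstract describes: higher F-measure and lower entropy.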

Text Clustering with Seeds Affinity Propagation / Guan, Renchu; Shi, Xh; Marchese, Maurizio; Yang, Chen; Liang, Yanchun. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - PRINT. - 2011, 23:4(2011), pp. 627-637. [10.1109/TKDE.2010.144]


Guan, Renchu;Shi, XH;Marchese, Maurizio;Yang, Chen;Liang, Yanchun

2011-01-01



Date of publication: 2011
Journal title: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Issue number and part: 4
DOI: https://dx.doi.org/10.1109/TKDE.2010.144
Scopus identifier: 2-s2.0-79951899720
WOS identifier: WOS:000287586100011
All authors: Guan, Renchu; Shi, Xh; Marchese, Maurizio; Yang, Chen; Liang, Yanchun
Citation: Text Clustering with Seeds Affinity Propagation / Guan, Renchu; Shi, Xh; Marchese, Maurizio; Yang, Chen; Liang, Yanchun. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - PRINT. - 2011, 23:4(2011), pp. 627-637. [10.1109/TKDE.2010.144]
Appears in types: 03.1 Journal article

Files in this record:

File	Access	Type	License	Size	Format
SAP_PostPrint.pdf (main article)	Open access	Post-print (refereed author's manuscript)	All rights reserved	280.51 kB	Adobe PDF
Text_Clustering_with_Seeds_Affinity_Propagation.pdf	Archive managers only	Publisher's version (publisher's layout)	All rights reserved	1.02 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/89884
