Improving text categorization bootstrapping via unsupervised learning / Gliozzo, Alfio Massimiliano; Strapparava, Carlo; Dagan, Ido Kalman. - In: ACM TRANSACTIONS ON SPEECH AND LANGUAGE PROCESSING. - ISSN 1550-4875. - 6:1(2009), pp. 1-24.

Improving text categorization bootstrapping via unsupervised learning

Gliozzo, Alfio Massimiliano; Strapparava, Carlo; Dagan, Ido Kalman
2009-01-01

Abstract

We propose a text categorization bootstrapping algorithm in which categories are described by relevant seed words. Our method introduces two unsupervised techniques to improve the initial categorization step of the bootstrapping scheme: (i) using latent semantic spaces to estimate the similarity among documents and words, and (ii) applying the Gaussian Mixture algorithm, which differentiates relevant and non-relevant category information using statistics from unlabeled examples. In particular, this second step maps the similarity scores to class posterior probabilities and therefore reduces sensitivity to keyword-dependent variations in scores. The algorithm was evaluated on two text categorization tasks and obtained good performance using only the category names as initial seeds. In particular, the performance of the proposed method proved to be equivalent to that of a purely supervised approach trained on 70-160 labeled documents per category.
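The second step described in the abstract, mapping similarity scores to class posterior probabilities, can be illustrated with a minimal sketch. It assumes a one-dimensional, two-component Gaussian mixture fitted by EM to the similarity scores that unlabeled documents obtain for one category; the higher-mean component is read as "relevant". The function names, the initialization, and the toy scores are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(score, weights, means, variances):
    """Posterior probability of each mixture component given one score."""
    p = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in (0, 1)]
    total = p[0] + p[1]
    if total == 0.0:
        # Numerical underflow: fall back to the component with the nearest mean.
        nearest = 0 if abs(score - means[0]) <= abs(score - means[1]) else 1
        return [1.0 if k == nearest else 0.0 for k in (0, 1)]
    return [p[0] / total, p[1] / total]


def fit_two_component_gmm(scores, n_iter=100, min_var=1e-4):
    """Fit a 1-D two-component Gaussian mixture to similarity scores with EM.

    Returns (weights, means, variances), ordered so that index 1 is the
    higher-mean component (assumed here to model relevant documents).
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    # Hypothetical initialization: split the sorted scores into two halves.
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(scores) / len(scores)
    overall_var = max(sum((s - overall_mean) ** 2 for s in scores) / len(scores), min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: soft assignment of every score to the two components.
        resp = [responsibilities(s, weights, means, variances) for s in scores]
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component lost all mass; keep its old parameters
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            variances[k] = max(
                sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk,
                min_var,
            )

    if means[0] > means[1]:
        # Keep the convention that index 1 is the higher-mean component.
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def relevance_posterior(score, weights, means, variances):
    """Posterior probability that a score comes from the higher-mean component."""
    return responsibilities(score, weights, means, variances)[1]


# Toy similarity scores for one category (made up for illustration).
toy_scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.09, 0.11, 0.07, 0.75, 0.80, 0.85, 0.78]
weights, means, variances = fit_two_component_gmm(toy_scores)
```

Because the posterior depends on where a score falls relative to the two fitted components rather than on its absolute value, categories whose seed words produce systematically higher or lower raw similarities can be compared on the same probability scale.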

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/343409