Deep Feature-Based Text Clustering and its Explanation / Guan, Renchu; Zhang, Hao; Liang, Yanchun; Giunchiglia, Fausto; Huang, Lan; Feng, Xiaoyue. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - 34:8(2022), pp. 3669-3680. [10.1109/tkde.2020.3028943]

Deep Feature-Based Text Clustering and its Explanation

Guan, Renchu; Zhang, Hao; Liang, Yanchun; Giunchiglia, Fausto; Huang, Lan; Feng, Xiaoyue
2022-01-01

Abstract

Text clustering is a critical step in text data analysis and has been extensively studied by the text mining community. Most existing text clustering algorithms are based on the bag-of-words model, which faces the high-dimensional and sparsity problems and ignores text structural and sequence information. Deep learning-based models such as convolutional neural networks and recurrent neural networks regard texts as sequences but lack supervised signals and explainable results. In this paper, we propose a deep feature-based text clustering (DFTC) framework that incorporates pretrained text encoders into text clustering tasks. This model, which is based on sequence representations, breaks the dependency on supervision. The experimental results show that our model outperforms classic text clustering algorithms and the state-of-the-art pretrained language model, i.e., BERT, on almost all the considered datasets. In addition, the explanation of the clustering results is significant for understanding the principles of the deep learning approach. Our proposed clustering framework includes an explanation module that can help users understand the meaning and quality of the clustering results.
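
As a rough illustration of the idea summarized above (encode each text as a fixed-length feature vector, then cluster the vectors without labels), the following minimal sketch may help. It is not the authors' DFTC implementation: the abstract does not name the encoder, the normalization, or the clustering algorithm, so the choice of L2 normalization and k-means here is an assumption. The `encode` callable is a hypothetical stand-in for a pretrained text encoder, and the toy vectors in the demo are made up for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, max_iter: int = 100) -> List[int]:
    """Plain k-means with deterministic farthest-point initialization."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(
    texts: Sequence[str],
    encode: Callable[[str], Sequence[float]],
    k: int,
) -> List[int]:
    """Encode texts, normalize the features, and cluster them.

    `encode` is an assumed interface: any function mapping a text to a
    fixed-length vector, e.g. a wrapper around a pretrained encoder.
    """
    features = [l2_normalize(encode(text)) for text in texts]
    return kmeans(features, k)


if __name__ == "__main__":
    # Made-up 2-d "embeddings" standing in for real encoder output.
    toy_vectors = {
        "stocks fell sharply": [1.0, 0.1],
        "markets dropped again": [0.9, 0.2],
        "the team won the final": [0.1, 1.0],
        "the coach resigned": [0.2, 0.8],
    }
    labels = cluster_texts(list(toy_vectors), toy_vectors.__getitem__, k=2)
    for text, label in zip(toy_vectors, labels):
        print(label, text)
```

Passing the encoder in as a callable keeps the clustering step independent of any particular pretrained model, so a real encoder could be swapped in without changing the rest of the sketch.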
Files in this item:
File: 2022 Deep_Feature-Based_Text_Clustering.pdf
Access: open access
Type: Publisher's version (Publisher's layout)
License: Creative Commons
Size: 7.08 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/443970