Current hierarchical text categorization (HTC) methods mainly fall into three directions: (1) Flat one-vs.-all approach, which flattens the hierarchy into independent nodes and trains a binary one-vs.-all classifier for each node. (2) Top-down method, which uses the hierarchical structure to decompose the entire problem into a set of smaller sub-problems, and deals with such sub-problems in top-down fashion along the hierarchy. (3) Big-bang approach, which learns a single (but generally complex) global model for the class hierarchy as a whole with a single run of the learning algorithm. These methods were shown to provide relatively high performance in previous evaluations. However, they still suffer from two main drawbacks: (1) relatively low accuracy as they disregard category dependencies, or (2) low computational efficiency when considering such dependencies. In order to build an accurate and efficient model we adopted the following strategy: first, we design advanced global reranking models (GR) that exploit structural dependencies in hierarchical multi-label text classification (TC). They are based on two algorithms: (1) to generate the k-best classification of hypotheses based on decision probabilities of the flat one-vs.-all and top-down methods; and (2) to encode dependencies in the reranker by: (i) modeling hypotheses as trees derived by the hierarchy itself and (ii) applying tree kernels (TK) to them. Such TK-based reranker selects the best hierarchical test hypothesis, which is naturally represented as a labeled tree. Additionally, to better investigate the role of category relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. Second, we propose an efficient local incremental reranking model (LIR), which combines a top-down method with a local reranking model for each sub-problem. These local rerankers improve the accuracy by absorbing the local category dependencies of sub-problems, which alleviate the errors of top-down method in the higher levels of the hierarchy. The application of LIR recursively deals with the sub-problems by applying the corresponding local rerankers in top-down fashion, resulting in high efficiency. In addition, we further optimize LIR by (i) improving the top-down method by creating local dictionaries for each sub-problem; (ii) using LIBLINEAR instead of LIBSVM; and (iii) adopting the compact representation of hypotheses for learning the local reranking model. This makes LIR applicable for large-scale hierarchical text categorization. The experimentation on different hierarchical datasets has shown promising enhancements by exploiting the structural dependencies in large-scale hierarchical text categorization.

Large-scale Structural Reranking for Hierarchical Text Categorization / Ju, Qi. - (2013), pp. 1-130.

Large-scale Structural Reranking for Hierarchical Text Categorization

JU, QI
2013-01-01

Abstract

Current hierarchical text categorization (HTC) methods mainly fall into three directions: (1) Flat one-vs.-all approach, which flattens the hierarchy into independent nodes and trains a binary one-vs.-all classifier for each node. (2) Top-down method, which uses the hierarchical structure to decompose the entire problem into a set of smaller sub-problems, and deals with such sub-problems in top-down fashion along the hierarchy. (3) Big-bang approach, which learns a single (but generally complex) global model for the class hierarchy as a whole with a single run of the learning algorithm. These methods were shown to provide relatively high performance in previous evaluations. However, they still suffer from two main drawbacks: (1) relatively low accuracy as they disregard category dependencies, or (2) low computational efficiency when considering such dependencies. In order to build an accurate and efficient model we adopted the following strategy: first, we design advanced global reranking models (GR) that exploit structural dependencies in hierarchical multi-label text classification (TC). They are based on two algorithms: (1) to generate the k-best classification of hypotheses based on decision probabilities of the flat one-vs.-all and top-down methods; and (2) to encode dependencies in the reranker by: (i) modeling hypotheses as trees derived by the hierarchy itself and (ii) applying tree kernels (TK) to them. Such TK-based reranker selects the best hierarchical test hypothesis, which is naturally represented as a labeled tree. Additionally, to better investigate the role of category relationships, we consider two interesting cases: (i) traditional schemes in which node-fathers include all the documents of their child-categories; and (ii) more general schemes, in which children can include documents not belonging to their fathers. Second, we propose an efficient local incremental reranking model (LIR), which combines a top-down method with a local reranking model for each sub-problem. These local rerankers improve the accuracy by absorbing the local category dependencies of sub-problems, which alleviate the errors of top-down method in the higher levels of the hierarchy. The application of LIR recursively deals with the sub-problems by applying the corresponding local rerankers in top-down fashion, resulting in high efficiency. In addition, we further optimize LIR by (i) improving the top-down method by creating local dictionaries for each sub-problem; (ii) using LIBLINEAR instead of LIBSVM; and (iii) adopting the compact representation of hypotheses for learning the local reranking model. This makes LIR applicable for large-scale hierarchical text categorization. The experimentation on different hierarchical datasets has shown promising enhancements by exploiting the structural dependencies in large-scale hierarchical text categorization.
2013
XXV
2012-2013
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Giunchiglia, Fausto
Moschitti, Alessandro
no
Inglese
Settore INF/01 - Informatica
File in questo prodotto:
File Dimensione Formato  
Qi_Thesis.pdf

accesso aperto

Tipologia: Tesi di dottorato (Doctoral Thesis)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 2.68 MB
Formato Adobe PDF
2.68 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11572/369177
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact