Large Dataset for Keyphrases Extraction / Krapivin, Mikalai; Autaeu, Aliaksandr; Marchese, Maurizio. - ELECTRONIC. - (2009), pp. 1-4.

Large Dataset for Keyphrases Extraction

Krapivin, Mikalai; Marchese, Maurizio
2009-01-01

Abstract

We propose a large dataset for machine learning-based automatic keyphrase extraction. The dataset is of high quality and consists of 2,000 scientific papers from the computer science domain published by ACM. Each paper has keyphrases assigned by its authors and verified by reviewers. Different parts of each paper, such as the title and abstract, are separated, which enables extraction from only a part of an article's text. The content of each paper was converted from PDF to plain text. Fragments of formulae, tables, figures and LaTeX markup were removed automatically. For this removal we used Maximum Entropy Model-based machine learning and achieved 97.04% precision. A preliminary investigation with the state-of-the-art keyphrase extraction system KEA shows an improvement in keyphrase recognition accuracy on the refined texts.
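
The clean-up step described in the abstract can be illustrated with a minimal sketch. It assumes the removal is framed as a per-line binary decision (noise versus prose) and uses binary logistic regression, which is the two-class form of a maximum entropy model. The features, the toy lines, the labels and all function names below are hypothetical and chosen only for illustration; the abstract does not state which features or training procedure the authors used. The `precision` helper computes the standard ratio of true positives to all predicted positives.

```python
import math

def features(line):
    """Hypothetical features: bias, share of non-letter/non-space chars, backslash flag."""
    n = max(len(line), 1)
    non_alpha = sum(1 for ch in line if not (ch.isalpha() or ch.isspace()))
    return [1.0, non_alpha / n, 1.0 if "\\" in line else 0.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(lines, labels, epochs=2000, lr=0.5):
    """Fit binary logistic regression by batch gradient ascent on the mean log-likelihood."""
    xs = [features(line) for line in lines]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj
        w = [wi + lr * g / len(xs) for wi, g in zip(w, grad)]
    return w

def is_noise(w, line):
    """Predict True when the model puts more than 0.5 probability on 'noise'."""
    z = sum(wi * xi for wi, xi in zip(w, features(line)))
    return sigmoid(z) > 0.5

def precision(predicted, actual):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fp) if tp + fp else 0.0

# Toy training data (assumption: 1 = formula/table residue, 0 = running prose).
TOY_LINES = [
    "We propose a large dataset for keyphrase extraction.",
    "Each paper has keyphrases assigned by the authors.",
    "The content of each paper is converted to plain text.",
    "\\frac{a}{b} = \\sum_{i=1}^{n} x_i",
    "0.12 0.34 0.56 0.78 | 1.00 2.00",
    "x_1 + x_2 = 3 (4)",
]
TOY_LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    weights = train(TOY_LINES, TOY_LABELS)
    predictions = [is_noise(weights, line) for line in TOY_LINES]
    print("precision on the toy training lines:", precision(predictions, TOY_LABELS))
```

On a real corpus the precision would be measured on held-out, manually labelled lines rather than on the training lines, and the feature set would need to be far richer than the three toy features used here.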
2009
Trento
University of Trento - Dipartimento di Ingegneria e Scienza dell'Informazione
Krapivin, Mikalai; Autaeu, Aliaksandr; Marchese, Maurizio
Files in this record:
File: disi09055-krapivin-autayeu-marchese.pdf
Access: open access
Type: Publisher's version (Publisher's layout)
License: All rights reserved
Size: 165.04 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/358576