Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph, which also includes links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps have been processed and the challenges we have encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present...

WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks / Consonni, Cristian; Laniado, David; Montresor, Alberto. - Electronic. - (2019), pp. 598-607. (13th International Conference on Web and Social Media, ICWSM 2019, Munich, Germany, 2019).

WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks

Cristian Consonni; Alberto Montresor
2019-01-01

2019
Proceedings of the Thirteenth International Conference on Web and Social Media
San Francisco, California
Association for the Advancement of Artificial Intelligence
Consonni, Cristian; Laniado, David; Montresor, Alberto

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/251705