A Graph-Based Stratified Sampling Methodology for the Analysis of (Underground) Forums / Di Tizio, Giorgio; Atondo Siu, Gilberto; Hutchings, Alice; Massacci, Fabio. - In: IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. - ISSN 1556-6013. - 18:(2023), pp. 5473-5483. [10.1109/TIFS.2023.3304424]
A Graph-Based Stratified Sampling Methodology for the Analysis of (Underground) Forums
Giorgio Di Tizio; Fabio Massacci
2023-01-01
Abstract
Researchers analyze underground forums to study abuse and cybercrime activities. Because of the size of the forums and the domain expertise required to identify criminal discussions, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. Human annotation is costly. How should one select samples to annotate so that they account for the structure of the forum? We present a methodology to generate stratified samples based on information about the centrality properties of the population and evaluate classifier performance. We observe that by employing a sample obtained from a uniform distribution of the post degree centrality metric, we maintain the same level of precision but significantly increase the recall (+30%) compared to a sample whose distribution respects the population stratification. We find that classifiers trained with similar samples disagree on the classification of criminal activities up to 33% of the time when deployed on the entire forum.
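The contrast the abstract draws between a sample that respects the population stratification and one that is uniform over degree centrality can be illustrated with a small sketch. This is not the authors' implementation: the toy population, the stratum thresholds (`upper_bounds`), the function names, and the use of raw degree in place of normalized degree centrality are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_strata(degree, upper_bounds):
    """Group post ids into strata by degree.

    upper_bounds are ascending, inclusive upper limits (hypothetical thresholds);
    posts above the last bound fall into one extra overflow stratum.
    """
    strata = defaultdict(list)
    for post_id, d in degree.items():
        idx = next((i for i, ub in enumerate(upper_bounds) if d <= ub),
                   len(upper_bounds))
        strata[idx].append(post_id)
    return dict(strata)


def proportional_sample(strata, n, rng):
    """Allocate the n annotations in proportion to each stratum's size."""
    total = sum(len(posts) for posts in strata.values())
    return {s: rng.sample(posts, min(len(posts), round(n * len(posts) / total)))
            for s, posts in strata.items()}


def uniform_sample(strata, n, rng):
    """Allocate the n annotations equally across strata (capped by stratum size)."""
    per_stratum = n // len(strata)
    return {s: rng.sample(posts, min(len(posts), per_stratum))
            for s, posts in strata.items()}


# Toy, made-up population: most posts have low degree, very few have high degree.
degree = {}
degree.update({f"p{i}": 1 for i in range(900)})
degree.update({f"p{i}": 5 for i in range(900, 990)})
degree.update({f"p{i}": 50 for i in range(990, 1000)})

strata = assign_strata(degree, upper_bounds=[1, 10])
rng = random.Random(0)
prop = proportional_sample(strata, 30, rng)
unif = uniform_sample(strata, 30, rng)

print({s: len(v) for s, v in prop.items()})  # {0: 27, 1: 3, 2: 0}
print({s: len(v) for s, v in unif.items()})  # {0: 10, 1: 10, 2: 10}
```

In this toy setting the proportional sample contains no high-degree posts at all, while the equal-allocation sample gives every stratum the same annotation budget. Whether this is the mechanism behind the reported recall difference is for the paper itself to establish.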
| File | Access | Description | Type | License | Size | Format |
|---|---|---|---|---|---|---|
| A_Graph-Based_Stratified_Sampling_Methodology_for_the_Analysis_of_Underground_Forums.pdf | Open access | Dutch Copyright Act Edition | Publisher's version (Publisher's layout) | All rights reserved | 5.28 MB | Adobe PDF |
| 2308.09413v1.pdf | Open access | Submitted Version | Refereed post-print (Refereed author's manuscript) | Creative Commons | 308.69 kB | Adobe PDF |
| A_Graph-Based_Stratified_Sampling_Methodology_for_the_Analysis_of_Underground_Forums.pdf | Archive managers only | | Publisher's version (Publisher's layout) | All rights reserved | 5.25 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



