A Graph-Based Stratified Sampling Methodology for the Analysis of (Underground) Forums


Researchers analyze underground forums to study abuse and cybercrime activities. Because of the size of the forums and the domain expertise required to identify criminal discussions, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. Human annotation is costly, which raises the question of how to select samples for annotation that account for the structure of the forum. We present a methodology to generate stratified samples based on the centrality properties of the population, and we evaluate classifier performance. We observe that a sample drawn from a uniform distribution of the post degree centrality metric maintains the same level of precision but significantly increases recall (+30%) compared to a sample whose distribution follows the population stratification. We also find that classifiers trained with similar samples disagree on the classification of criminal activities up to 33% of the time when deployed on the entire forum.
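The contrast between a sample that follows the population stratification and one that is uniform over post degree centrality can be illustrated with a minimal sketch. This record does not describe the authors' implementation, so everything below is an assumption made for illustration: the toy edge list, the stratum boundaries, and the helper names `degree_by_post`, `stratify`, `allocate`, and `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def degree_by_post(edges):
    """Count how many edges touch each post in an undirected interaction graph."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def stratify(degree, bounds):
    """Group posts into len(bounds) + 1 strata by degree (hypothetical cut points)."""
    strata = [[] for _ in range(len(bounds) + 1)]
    for post, d in sorted(degree.items()):
        idx = sum(d > b for b in bounds)  # number of cut points the degree exceeds
        strata[idx].append(post)
    return strata

def allocate(strata, n, mode):
    """Per-stratum sample sizes for 'proportional' or 'uniform' allocation."""
    total = sum(len(s) for s in strata)
    if mode == "proportional":
        # Mirror the population: large strata receive most of the budget.
        sizes = [round(n * len(s) / total) for s in strata]
    elif mode == "uniform":
        # Same budget for every non-empty stratum, regardless of its size.
        nonempty = sum(1 for s in strata if s)
        sizes = [n // nonempty if s else 0 for s in strata]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # A stratum cannot contribute more posts than it contains.
    return [min(k, len(s)) for k, s in zip(sizes, strata)]

def stratified_sample(strata, n, mode, seed=0):
    """Draw the allocated number of posts from each stratum without replacement."""
    rng = random.Random(seed)
    sizes = allocate(strata, n, mode)
    return [rng.sample(s, k) for s, k in zip(strata, sizes)]

# Toy forum: one hub post (p0) linked to eight others, plus two extra links.
edges = [("p0", f"p{i}") for i in range(1, 9)] + [("p1", "p2"), ("p9", "p10")]
strata = stratify(degree_by_post(edges), bounds=[1, 3])

proportional = stratified_sample(strata, 6, "proportional")
uniform = stratified_sample(strata, 6, "uniform")
```

In this toy graph the low-degree stratum holds 8 of the 11 posts, so proportional allocation spends most of the annotation budget there, while uniform allocation spreads the budget across low-, mid-, and high-degree posts. The sketch shows only the allocation step; it does not reproduce the classifier training or the precision and recall figures reported in the abstract.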

A Graph-Based Stratified Sampling Methodology for the Analysis of (Underground) Forums / Di Tizio, G.; Siu, G. A.; Hutchings, A.; Massacci, F. - In: IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. - ISSN 1556-6013. - 18:(2023), pp. 5473-5483. [10.1109/TIFS.2023.3304424]


Di Tizio G.; Siu G. A.; Hutchings A.; Massacci F.

2023-01-01



	Date of publication
				2023
	Journal title
				IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY
	DOI
				https://dx.doi.org/10.1109/TIFS.2023.3304424
	Reference SSD, scientific-disciplinary sectors (valid until 24/06/2024)
				Sector INF/01 - Computer Science
				Sector ING-INF/05 - Information Processing Systems
	Reference SSD, scientific-disciplinary sectors (valid from 09/05/2024)
				Sector IINF-05/A - Information Processing Systems
				Sector INFO-01/A - Computer Science
	Scopus identifier
				2-s2.0-85167808690
	WOS identifier
				WOS:001064516900003
	All authors
				Di Tizio, G.; Siu, G. A.; Hutchings, A.; Massacci, F.
					


Use this identifier to cite or link to this document: https://hdl.handle.net/11572/445501
