Sequential data streams describe a variety of real life processes: from sensor readings of natural phenomena to robotics, moving trajectories and network monitoring scenarios. An item in a sequential data stream often depends on its previous values, subsequent items being strongly correlated. In this thesis we address the problem of extracting the most significant sequential patterns from a data stream, with applications to real-time data summarization and classification and estimating generative models of the data. The first contribution of this thesis is the notion of Conditional Heavy Hitters, which describes the items that are frequent conditionally – that is, within the context of their parent item. Conditional Heavy Hitters are useful in a variety of applications in sensor monitoring, analysis, Markov chain modeling, and more. We develop algorithms for efficient detection of Conditional Heavy Hitters depending on the characteristics of the data, and provide analytical quality guarantees for their performance. We also study the behavior of the proposed algorithms for different types of data and demonstrate the efficacy of our methods by experimental evaluation on several synthetic and real-world datasets. The second contribution of the thesis is the extension of Conditional Heavy Hitters to patterns of variable order, which we formalize in the notion of Variable Order Conditional Heavy Hitters. The significance of the variable order patterns is measured in terms of high conditional and joint probability and their difference from the independent case in terms of statistical significance. The approximate online solution in the variable order case exploits lossless compression approaches. Facing the tradeoff between memory usage and accuracy of the pattern extraction, we introduce several online space pruning strategies and study their quality guarantees. The strategies can be chosen depending on the estimation objectives, such as maximizing the precision or recall of extracted significant patterns. The efficiency of our approach is experimentally evaluated on three real datasets. The last contribution of the thesis is related to the prediction quality of the classical and sequential classification algorithms under varying levels of label noise. We present the "Sigmoid Rule" Framework, which allows choosing the most appropriate learning algorithm depending on the properties of the data. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, hinting at how the inherent properties of a dataset affect learning. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature.
Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings / Mirylenka, Katsiaryna. - (2015), pp. 1-145.
Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings
Mirylenka, Katsiaryna
2015-01-01
Abstract
Sequential data streams describe a variety of real life processes: from sensor readings of natural phenomena to robotics, moving trajectories and network monitoring scenarios. An item in a sequential data stream often depends on its previous values, subsequent items being strongly correlated. In this thesis we address the problem of extracting the most significant sequential patterns from a data stream, with applications to real-time data summarization and classification and estimating generative models of the data. The first contribution of this thesis is the notion of Conditional Heavy Hitters, which describes the items that are frequent conditionally – that is, within the context of their parent item. Conditional Heavy Hitters are useful in a variety of applications in sensor monitoring, analysis, Markov chain modeling, and more. We develop algorithms for efficient detection of Conditional Heavy Hitters depending on the characteristics of the data, and provide analytical quality guarantees for their performance. We also study the behavior of the proposed algorithms for different types of data and demonstrate the efficacy of our methods by experimental evaluation on several synthetic and real-world datasets. The second contribution of the thesis is the extension of Conditional Heavy Hitters to patterns of variable order, which we formalize in the notion of Variable Order Conditional Heavy Hitters. The significance of the variable order patterns is measured in terms of high conditional and joint probability and their difference from the independent case in terms of statistical significance. The approximate online solution in the variable order case exploits lossless compression approaches. Facing the tradeoff between memory usage and accuracy of the pattern extraction, we introduce several online space pruning strategies and study their quality guarantees. The strategies can be chosen depending on the estimation objectives, such as maximizing the precision or recall of extracted significant patterns. The efficiency of our approach is experimentally evaluated on three real datasets. The last contribution of the thesis is related to the prediction quality of the classical and sequential classification algorithms under varying levels of label noise. We present the "Sigmoid Rule" Framework, which allows choosing the most appropriate learning algorithm depending on the properties of the data. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, hinting at how the inherent properties of a dataset affect learning. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature.File | Dimensione | Formato | |
---|---|---|---|
mirylenka.pdf
accesso aperto
Tipologia:
Tesi di dottorato (Doctoral Thesis)
Licenza:
Tutti i diritti riservati (All rights reserved)
Dimensione
9.24 MB
Formato
Adobe PDF
|
9.24 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione