UNDERSTANDING AND MANAGING COMPLEX DATASETS / Brugnara, Martin. - (2022 Apr 12), pp. 1-141. [10.15168/11572_337818]

UNDERSTANDING AND MANAGING COMPLEX DATASETS

Brugnara, Martin
2022-04-12

Abstract

Nowadays, we are producing and collecting data at an unprecedented rate, measured in the order of petabytes per minute, together with a substantial increase in data volume, complexity, and variety. While tabular and unstructured data still dominate the scene, graphs are becoming ever more prominent, bringing new challenges. The size and complexity of graph datasets have increased, renewing interest in graph databases and distributed graph processing. The current abundance of data and content complicates even simply accessing data. Web users are constantly overwhelmed by the available information and rely on search engines and recommender systems to obtain a trusted, personalized selection. Since people tend to prefer sources that reinforce their pre-existing beliefs, such systems optimize for whatever the users like. Whenever a controversial topic arises, users become more polarized, as they see only part of the picture and feel supported by others who hold the same view. This feedback loop leads to “filter bubbles” and “echo chambers”, new problems tackled by researchers in the social sciences. At the same time, faced with this data deluge, scientists and data analysts have a hard time navigating the various data lakes and data repositories. New tools are therefore needed to help data scientists explore and understand data and maximize the value they can extract from processing it. This thesis contributes to solving these issues by studying and evaluating existing graph database technologies to reveal the implications of their different design decisions. It offers a principled and systematic evaluation methodology based on microbenchmarks, comprising tests for more than 51 classes of operations and graphs with up to 30M nodes and 178M edges. The methodology has been materialized into an evaluation suite and executed against the major graph databases available today.
The gathered results proved effective for better understanding graph database systems’ design choices, performance, and functionality. Findings include an analysis of the tradeoffs between native and hybrid graph database systems, their effect on important graph queries such as traversals and pattern matching, and their current capability to handle highly heterogeneous graphs. This thesis also contributes to the efficient processing of distributed graphs whose data has been partitioned by other systems, such as graphs managed by external graph databases. In particular, it provides a novel technique for k-core decomposition and maintenance. The solution has been implemented on top of Akka and tested on various real and synthetic datasets. Results show that it exploits the existing topology of the graph as much as possible, achieving shorter running times and higher scalability than existing sequential and distributed approaches. To tackle news polarization, this thesis proposes two novel recommender systems that account for the different points of view expressed in a document and offer a holistic overview of the topic at hand. The first, Orthogonal-topics, focuses on the relationships among topics and is designed to generalize well across datasets. The second, Sentimented-topics, focuses on the sentiment the documents express on the different topics and is designed to extract and exploit as much information as possible from text corpora containing opinionated articles. Moreover, a new diversity metric, MIN-BW, and a new optimization algorithm, FDLS, are provided to support these approaches in finding the most diverse set of documents under that metric. In MIN-BW, a set of documents is modeled as a system of particles with repulsive forces, where the most diverse set is the one whose system requires the least work to balance, i.e., to make it statically stable.
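The particle analogy above can be made concrete with a small sketch. The following is an illustrative proxy only, not the thesis's actual MIN-BW formula: each document embedding is treated as a particle exerting an inverse-square repulsive force on the others, and the sum of the magnitudes of the per-particle net forces serves as an "imbalance" score, so a well-spread (more diverse) set scores lower than a clustered one. The function name and force law are assumptions made for illustration.

```python
import numpy as np

def net_force_imbalance(points):
    """Illustrative proxy for a force-based diversity score (not the
    thesis's exact MIN-BW definition): each point repels every other
    point with an inverse-square force; the returned value is the sum
    of the magnitudes of the net forces, so smaller values mean the
    system of particles is closer to static balance."""
    pts = np.asarray(points, dtype=float)
    total = 0.0
    for i in range(len(pts)):
        net = np.zeros(pts.shape[1])
        for j in range(len(pts)):
            if i == j:
                continue
            d = pts[i] - pts[j]
            dist = np.linalg.norm(d)
            # unit direction d/dist scaled by 1/dist**2 -> d/dist**3
            net += d / dist**3
        total += np.linalg.norm(net)
    return total

# A spread-out set (square corners) is closer to balance than a
# clustered set, matching the intuition "diverse = easier to balance".
spread = [(0, 0), (1, 0), (0, 1), (1, 1)]
clustered = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
```

Under this reading, finding the most diverse subset amounts to picking the candidate set minimizing the imbalance, which FDLS would search for far more efficiently than exhaustive enumeration.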
The results of a user study showed the superior quality of our approach’s recommendations, and further tests on synthetic data showed the superior scalability of FDLS. Finally, to aid researchers in navigating data lakes, this thesis provides a new solution for the generation of compact and informative summaries of a dataset’s contents, enabling a more systematic approach to data exploration. The task is modeled as a multi-objective optimization problem. We formally define the notion of a data description and the intuition behind the concept of goodness for such a description. Descriptions are modeled as sets of views over the dataset, where the views are defined by filtering clauses. Four factors determine the quality of a description: length, coverage of the dataset, overlap, and intricacy. We provide three algorithms that generate such descriptions given these four optimization objectives. Results showed the scalability and applicability of our approaches. With this thesis, we have contributed to improving and scaling data management, processing, and exploration, which are fundamental tasks in big data and knowledge management from both a research and a business perspective.
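To make the description model tangible: a description is a set of views, each defined by a filtering clause, judged by length, coverage, overlap, and intricacy. The thesis's three algorithms are not reproduced here; the sketch below is a hypothetical greedy baseline that, within a length budget, repeatedly picks the view adding the most uncovered rows net of its overlap with the rows already covered. All names and the scoring rule are illustrative assumptions, and intricacy is deliberately omitted.

```python
def greedy_description(rows, candidate_views, max_length):
    """Hypothetical greedy sketch of description generation.
    rows: list of records; candidate_views: list of (name, predicate)
    pairs, each predicate being a filtering clause over a record;
    max_length: budget on the number of views (the length objective).
    Each step picks the view with the best (new coverage - overlap)
    gain; stops when no view yields a positive gain."""
    covered = set()
    description = []
    for _ in range(max_length):
        best, best_gain = None, 0
        for name, pred in candidate_views:
            if any(name == chosen for chosen, _ in description):
                continue  # view already part of the description
            sel = {i for i, r in enumerate(rows) if pred(r)}
            # reward rows newly covered, penalize rows covered twice
            gain = len(sel - covered) - len(sel & covered)
            if gain > best_gain:
                best, best_gain = (name, pred), gain
        if best is None:
            break  # no remaining view improves the description
        description.append(best)
        covered |= {i for i, r in enumerate(rows) if best[1](r)}
    return [name for name, _ in description], covered

# Toy usage: ten rows, three candidate filtering clauses.
rows = [{"x": i} for i in range(10)]
views = [
    ("low",  lambda r: r["x"] < 5),
    ("high", lambda r: r["x"] >= 5),
    ("mid",  lambda r: 3 <= r["x"] < 7),
]
names, cov = greedy_description(rows, views, 3)
```

Here the greedy pass selects the two disjoint views "low" and "high", which together cover every row, and rejects "mid" because its overlap penalty cancels its coverage gain — a small instance of the length/coverage/overlap tradeoff the thesis optimizes.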
Cycle: XXXIII
Academic year: 2019-2020
Department: Ingegneria e scienza dell'Informaz
Doctoral programme: Information and Communication Technology
Supervisor: Velegrakis, Ioannis
Language: English
File: MartinBrugnara_PhDThesis.pdf (Adobe PDF, 5.13 MB)
Type: Doctoral thesis
Access: open access
License: All rights reserved


Permanent identifier for citing or linking to this document: https://hdl.handle.net/11572/337818