We present a domain-specific language, the Streaming Analytics Language (SAL), for writing concise but expressive analyses of streaming temporal graphs. We target problems where the data arrives as an infinite stream whose volume is prohibitive, requiring a single pass over the data and tight space and time complexity constraints. Each item in the stream can be thought of as an edge in a graph, and each edge has an associated timestamp and duration.
Cyber security data is a real-world example of a streaming temporal graph. Machines communicate with each other within a network, forming a streaming sequence of edges with temporal information. We therefore demonstrate the value of SAL by applying it to a wide range of cyber-related problems. By combining vertex-centric computations, which create features per vertex, with subgraph matching, which finds communication patterns of interest, we cover a wide spectrum of important cyber use cases. As an example, we discuss Verizon’s Data Breach Investigations Report and show how SAL can be used to capture most of its nine categories of cyber breaches. We also apply SAL to discovering botnet activity within network traffic in 13 different scenarios, obtaining an average area under the curve (AUC) of the receiver operating characteristic (ROC) of 0.87.
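To make the idea of a vertex-centric, single-pass feature concrete, the following is a minimal sketch in plain Python. It is not SAL syntax and not the author's implementation. It assumes each stream item is a hypothetical `(src, dst, start, duration)` tuple and uses Welford's online algorithm, chosen here only for illustration, to keep a running mean and variance of durations per destination vertex in constant memory per vertex.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance (Welford's online algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 before any observation.
        return self.m2 / self.n if self.n > 0 else 0.0


def per_vertex_duration_stats(edges):
    """Consume (src, dst, start, duration) tuples in a single pass.

    Hypothetical helper: returns a mapping from destination vertex
    to RunningStats over the durations of its incoming edges.
    """
    stats = defaultdict(RunningStats)
    for _src, dst, _start, duration in edges:
        stats[dst].update(duration)
    return stats
```

Each edge is examined once and then discarded, so memory grows with the number of distinct vertices rather than with the length of the stream. A production system would also need to bound or expire per-vertex state, which this sketch does not attempt.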
In addition to SAL, we present an implementation called the Streaming Analytics Machine (SAM). With SAM, we can run SAL programs in parallel on a cluster, achieving rates of a million netflows per second and scaling to 128 nodes, or 2560 cores. We compare SAM to another streaming framework, Apache Flink, and find that Flink cannot scale past 32 nodes for the problem of finding triangles (subgraphs of three interconnected nodes) within the streaming graph. SAM also excels when the subgraphs are frequent: it continues to find the expected number of subgraphs, while Flink's performance degrades and it under-reports. Together, SAL and SAM provide an expressive and scalable infrastructure for performing analyses on streaming temporal graphs.
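The triangle problem can be illustrated with a small single-machine sketch in Python. This is neither SAM nor the Flink benchmark. It assumes a particular triangle definition that the abstract does not spell out: a directed cycle x→y, y→z, z→x over three distinct vertices, with strictly increasing start times that all fall within a hypothetical `window`. The `Edge` type and the function name are likewise illustrative.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    start: float     # timestamp at which the communication began
    duration: float  # length of the communication


def temporal_triangles(
    edges: Iterable[Edge], window: float
) -> Iterator[Tuple[Edge, Edge, Edge]]:
    """Yield (e1, e2, e3) with e1: x->y, e2: y->z, e3: z->x.

    Assumes edges arrive in non-decreasing order of start time and
    requires e1.start < e2.start < e3.start <= e1.start + window.
    """
    recent = deque()            # edges still inside the window, oldest first
    by_src = defaultdict(list)  # source vertex -> recent edges leaving it

    for e3 in edges:
        # Expire edges too old to take part in any future match.
        while recent and e3.start - recent[0].start > window:
            old = recent.popleft()
            by_src[old.src].remove(old)

        # Treat the new edge as the closing edge z->x of a triangle.
        for e1 in by_src.get(e3.dst, ()):
            for e2 in by_src.get(e1.dst, ()):
                if (
                    e2.dst == e3.src
                    and e1.start < e2.start < e3.start
                    and len({e1.src, e2.src, e3.src}) == 3
                ):
                    yield (e1, e2, e3)

        recent.append(e3)
        by_src[e3.src].append(e3)
```

The sketch keeps every edge inside the window in memory and searches it directly, which is adequate for illustration but says nothing about how SAM partitions this work across a cluster.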
|Committee:||Ha, Sangtae; Keller, Eric; Lv, Qin; Massey, Daniel|
|School:||University of Colorado at Boulder|
|School Location:||United States -- Colorado|
|Source:||DAI-B 80/09(E), Dissertation Abstracts International|
|Keywords:||Domain specific language, Graphs, Machine learning, Streaming, Subgraph matching, Temporal|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved