This thesis is concerned with understanding the behavior of complex systems, particularly in the common case where instrumentation data is noisy or incomplete. We begin with an empirical study of logs from production systems, which characterizes the content of those logs and the challenges associated with analyzing them automatically, and present an algorithm for identifying surprising messages in such logs. The principal contribution is a method, called influence, that identifies relationships among components, even when the underlying mechanism of interaction is unknown, by looking for correlated surprise. Two components are said to share an influence if they tend to exhibit surprising behavior that is correlated in time. We represent the behavior of each component as surprise (deviation from typical or expected behavior) over time and use signal-processing techniques to find correlations. The method makes few assumptions about the underlying systems or the data they generate, so it is applicable to a variety of unmodified production systems, including supercomputers, clusters, and autonomous vehicles. We then extend the idea of influence by presenting a query language and an online implementation, which allow the method to scale to systems with hundreds of thousands of signals. In collaboration with system administrators, we applied these tools to real systems and discovered correlated problems, failure cascades, skewed clocks, and performance bugs. According to the administrators, the tools also generated information useful for diagnosing and fixing these issues.

... directed arrow in Stanley's SIG from the box containing LASER* to the box containing PLANNER* indicates that each ... The edges in the dependency diagram indicate intended communication patterns, rather than functional dependencies.
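As a rough illustration of the correlated-surprise idea described in the abstract, the sketch below turns two component signals into surprise signals and searches for the time lag at which they correlate most strongly. It is only a minimal sketch under stated assumptions, not the thesis's implementation: surprise is approximated here by an absolute z-score, the correlation is a plain lagged Pearson correlation, and the function names (`surprise`, `correlation_at_lag`, `strongest_lag`) and the toy data are hypothetical.

```python
import math


def surprise(signal):
    """Assumed surprise measure: absolute z-score against the signal's own mean."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant signal
    return [abs(x - mean) / std for x in signal]


def correlation_at_lag(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag] over the overlapping window."""
    if lag >= 0:
        x, y = a[:len(a) - lag], b[lag:]
    else:
        x, y = a[-lag:], b[:len(b) + lag]
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    if sx == 0 or sy == 0:
        return 0.0  # no variation in the window, so no measurable correlation
    return cov / (sx * sy)


def strongest_lag(a, b, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    best = max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(correlation_at_lag(a, b, k)))
    return best, correlation_at_lag(a, b, best)


# Toy data (hypothetical): component A misbehaves at t=5, component B at t=7.
raw_a = [1.0] * 20
raw_b = [1.0] * 20
raw_a[5] = 10.0
raw_b[7] = 10.0

lag, corr = strongest_lag(surprise(raw_a), surprise(raw_b), max_lag=5)
print(lag, round(corr, 3))
```

A positive lag here means that surprise in the second component tends to follow surprise in the first, which hints at a relationship between them without revealing the mechanism behind it.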
|Title||:||Using Influence to Understand Complex Systems|
|Author||:||Adam Jamison Oliner|
|Publisher||:||Stanford University - 2011|