Artificial Intelligence for the Understanding of Large Complex Datacenters
Advances in monitoring, tracing, and profiling large, complex datacenters produce rich datasets and establish a rigorous foundation for understanding datacenter performance. But the sheer volume and complexity of the data challenge existing techniques, which rely heavily on expert knowledge, human intervention, and simple statistics to gain performance insights.
In this talk, I will address this challenge using artificial intelligence to understand large, complex systems. First, I will present Hound, a framework that uses causal inference to diagnose stragglers. Stragglers are slow tasks that significantly delay a job’s completion, and identifying their causes requires interpretable machine learning. Second, I will present my work on Limelight+, a framework that uses graph theory and semantic learning to extract design insights for datacenter architecture. Limelight+ summarizes semantic structure in massive call graphs and discovers clusters of computation that could benefit from acceleration. Finally, I will briefly discuss black-box optimization for managing containerized services before concluding with future projects.