Detecting and Reducing Resource Interferences in Data Analytics Frameworks
Data analytics frameworks on shared clusters host a large number of diverse workloads submitted by multiple tenants. Modern cluster schedulers incentivize users to share cluster resources by promising fairness and isolation along with high performance and resource utilization. Nevertheless, these guarantees are hard to meet: resource contention among co-located workloads causes significant performance issues and is one of the key reasons for unpredictable performance and missed workload Service-Level Agreements (SLAs) in data analytics frameworks. Despite meticulous measures to prevent interference, contention for unmanaged resources still prevails and causes undue waiting times for queries. It is therefore important to safeguard the progress of data analytics queries from the variable impact of ad hoc jobs. In general, for any dataflow-based execution of queries on these frameworks, analyzing inter-query resource interactions is critical to answering questions such as 'who is creating resource conflicts for my query?'. Today, cluster schedulers face the challenge of accurately detecting such concurrency-related slowdowns and acting on them in a timely manner to reduce multi-resource interference in an online workload.
We present a novel approach to detecting and reducing resource contention in an online workload that helps system and database administrators narrow down and regulate the many possible causes of query performance degradation. It uses data from historical executions, runtime cluster metrics, and heuristics applicable to a pipeline execution model to build a robust and scalable solution. Our solution (i) models the resource conflicts between concurrent queries using a multi-level directed acyclic graph and, using the resource-blocked times faced by a query, derives a blame attribution metric that helps users identify concurrency-related problems in the query's execution, and (ii) feeds pair-wise task disutility preferences to a contention-aware scheduler that creates fair and stable task co-locations on the cluster to minimize resource waiting times. We have also developed a novel dynamic resource re-allocation scheme that detects the point in a query's execution timeline after which the query becomes vulnerable to concurrency problems, and that safeguards the query's progress. We evaluate our system on a wide range of workloads using TPC-DS benchmark queries and show that our approach to contention analysis and blame attribution is substantially more accurate than methods based on the overlap time between concurrent queries. We also demonstrate that it significantly reduces the scheduling queue wait times and resource-blocked times of queries and provides more predictable query execution with improved performance. Our solution outperforms dataflow-agnostic and contention-oblivious schedulers while improving cluster utilization.
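
To make the idea of blame attribution from resource-blocked times concrete, the following is a minimal illustrative sketch in Python. It is not the metric developed in this work: the function names (`overlap`, `blame_shares`), the interval-based input format, and the proportional normalization are all assumptions chosen for illustration. Blame is assigned to a concurrent query only for the time in which the affected query was blocked on a resource that the concurrent query was using, rather than for the whole time the two queries overlapped.

```python
from collections import defaultdict


def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def blame_shares(blocked, usage):
    """Hypothetical blame metric (an assumption, not the paper's definition).

    blocked: list of (resource, start, end) intervals in which the affected
             query was waiting on that resource.
    usage:   list of (query, resource, start, end) intervals in which a
             concurrent query was using that resource.
    Returns a dict mapping each concurrent query to its share of the blame;
    shares sum to 1 whenever any blame is assigned.
    """
    raw = defaultdict(float)
    for resource, b_start, b_end in blocked:
        for query, u_resource, u_start, u_end in usage:
            # Only usage of the same resource can explain the blocked time.
            if u_resource == resource:
                raw[query] += overlap((b_start, b_end), (u_start, u_end))
    total = sum(raw.values())
    if total == 0:
        return {}
    return {query: t / total for query, t in raw.items()}


# Hypothetical timeline: the affected query waits on disk, then on CPU.
blocked = [("disk", 0, 4), ("cpu", 6, 8)]
usage = [
    ("Q2", "disk", 0, 3),
    ("Q3", "cpu", 5, 8),
    ("Q3", "network", 0, 10),  # long overlap, but on a resource never waited on
]
shares = blame_shares(blocked, usage)
print(shares)
```

In this made-up timeline, Q3 runs alongside the affected query for the longest time, yet most of the blame goes to Q2, because only Q3's CPU usage coincides with time the affected query actually spent blocked. A purely overlap-based measure would rank the two queries the other way around.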