# CRISP: Critical Path Analysis of Large-Scale Microservice Architectures-Zhizhou Zhang ![rw-book-cover](https://readwise-assets.s3.amazonaws.com/static/images/default-book-icon-2.dae1dc4d332b.png) ## Metadata - Author:: [[Zhizhou Zhang]] - Full_Title:: CRISP: Critical Path Analysis of Large-Scale Microservice Architectures - URL:: https://www.usenix.org/system/files/atc22-zhang-zhizhou.pdf - URL:: [devonthink](x-devonthink-item://E85B5B90-0646-45C1-94CD-E42274DFB571) ## Summary - Distributed execution flow summaries of the critical path - Top-down summaries of CP presented as flame graphs - Bottom-up summaries by remote endpoints, heat-maps - anomaly detection for paths deviating from the normal CP ## Highlights - State-of-the-art RPC tracing tools collect a deluge of data but provide little insight. (Page 2) ^348451459 - CRISP provides three critical performance analysis capabilities: a) a top-down CPA of any specific endpoint, which is tailored for service owners to drill down the root causes of latency issues, b) a bottom-up CPA over all endpoints in the system — tailored for infrastructure and performance engineers — to bubble up those (interior) APIs that have a high impact across many endpoints, and c) an on-the-fly anomaly detection to alert potential problems. (Page 2) ^348451460 - reduced the false positives in anomaly detection by up to 50% while speeding up the training and inference by up to 28 and up to 67 , respectively, over the state of the art. (Page 2) ^348451461 - how is this claim measured? #readwise/note - The critical path [59] is the longest chain of dependent tasks in a microservice dependency graph. (Page 3) ^348451462 - CRISP uses the RPC tracing facility provided by Jaeger [6] and constructs the critical path through a request’s graph of dependencies. (Page 3) ^348451463 - The critical path may vary among requests; hence, CRISP computes the critical path per request. It then aggregates and summarizes critical paths from millions of requests. Finally, it presents them as digestible and actionable insights via rich heat maps [5] and flame graphs [31]. (Page 3) ^348451464 - Top-down analysis: A top-down analysis of any specific endpoint of interest. This capability allows service owners to deep dive into their specific endpoint and pinpoints and quantifies bottlenecks encountered in the RPC dependency graph. Improving these bottlenecks should be the first-order priority to reduce the latency of the endpoint. (Page 3) ^348451465 - Bottom-up analysis: A bottom-up analysis over all endpoints, which bubbles up and ranks by the impact of those interior APIs that cause the most latency across most endpoints. Optimizing these interior APIs reduces latency across numerous endpoints. (Page 3) ^348451466 - Neural network-based anomaly detection: An automated anomaly detection system, which detects whether a request is exhibiting abnormal behavior compared with the past history of the endpoint. The system is trained per endpoint using an autoencoder-decoder machine learning technique [39]. (Page 3) ^348451467 - The service itself is written in Java, and highly (both macro and micro) optimized using periodic profiling. However, the profiling offered no insights into downstream calls, where most time is spent. (Page 3) ^348451468 - Basing the abnormality detection on the divergence in the critical path as opposed to the full call graph [39] not only makes the training and inference faster but also reduces false alerts. (Page 3) ^348451469 - Quantitative insight into the causes of the latency was hard to analyze by looking at individual traces because each trace contains thousands of nested and overlapping RPCs. (Page 3) ^348451470 - traditional profilers fail to highlight the causes of latencies incurred at an individual request level. (Page 4) ^348451471 - traditional profilers fail to capture IO waiting, task dependencies, and serialization patterns. (Page 4) ^348451472 - Distributed tracing [47] encodes causality information in a distributed context, which is propagated across process boundaries. It provides a way to infer states across various services for the lifetime of a request. (Page 4) ^348451473 - Our premise is that while the whole graph is interesting in terms of data richness, it brings a lot of noise. There are many RPCs and call paths that are insignificant for a high-level analysis and optimization task. (Page 5) ^348451474 - shrink the graph to its quintessential element—the critical path (Page 5) ^348451475 - and aggregate many traces into a single summary (Page 5) ^348451476 - Total work is the sum of weights of all nodes and the critical path is the longest weighted path in the DAG. (Page 5) ^348451477 - following challenges: (Page 5) ^348451478 - the Jaeger traces do not provide clear “spawn” and “sync” events in the DAG. (Page 5) ^348451479 - The parent spans in Jaeger traces carry no dependence information (Page 5) ^348451480 - Only first-level insights are possible from drilling down into microservice latencies and errors. (Page 5) ^348451481 - In order to obtain the last “sync“ child span, clock information is needed. However, the clocks on different machines where spans are collected are not time-synchronized. (Page 5) ^348451482 - Using a few Jaeger traces is insufficient to reach a reliable conclusion. Users can visualize and navigate only one Jaeger trace at a time. There is no aggregate summary of traces. (Page 5) ^348451483 - The critical path across all requests may not be unique. Services have diurnal patterns and different traces may exhibit different critical paths, which need to be aggregated, and yet “hot” critical paths need to be bubbled up. (Page 5) ^348451484 - A single Jaeger trace can be so complex that it is not humanly possible to browse and understand the details. (Page 5) ^348451485 - Since the service codes keep evolving, the critical path keeps changing. (Page 5) ^348451486 - There is a lack of regular, performance-driven feedback tooling to optimize the workflow or downstream systems. (Page 5) ^348451487 - These challenges introduce a barrier to our developers in effectively using Jaeger to either detect anomalous situations or identify optimization opportunities. (Page 5) ^348451488 - n order to compute the critical path through the trace, we need a computational DAG. To accomplish this under Jaeger/OpenTrace trace format, we make use of the start and end times of children’s spans. The start time in every immediate child creates a “spawn” event in the parent and splits its span at that point in time. Similarly, the end time in every immediate child creates a “sync” event in the parent and splits its span at the point in time. (Page 6) ^348451489 - One key limitation of the Jaeger is that the parent spans (a.k.a., caller) do not contain dependence information. Specifically, they lack the information of both start and end of callee RPC. Instead, it is the callee that stores both the ID of its parent and callee’s start and end time (per callee’s local clock) in its own span. The implication of the constraint is that the dependency relationship needs to be inferred via clock information recorded in the callee span. (Page 6) ^348451490 - the inference can be inaccurate because of the clock skew (Page 6) ^348451491 - Without correctly handling the clock skew, the critical path analysis can go wrong. (Page 6) ^348451492 - Due to the clock drift, more than 50% of our traces recorded this type of span overlaps causing misattribution in critical paths. (Page 7) ^348451493 - A level of aggregation happens immediately within each trace processing: if the same endpoint appears multiple times on the critical path, we sum them as long as their call chains are exactly the same. (Page 7) ^348451494 - Based on this empirical observation, we tuned the happensBefore(A, B) part of our CP algorithm with the following relaxation: • Aend • No other children of the parent of P of A can start or end in threshold < Bstart , and the overlapped time range (Page 7) ^348451495 - This merger discards the ordering relationship between events, which we do not need for further analysis. (Page 7) ^348451496 - The threshold is set to 1ms. (Page 7) ^348451497 - A child span C may start before the start of the parent span P. In such cases, we truncate the start time of C till the start time of P. This may involve the recursive truncation of C’s descendants. (Page 7) ^348451498 - A child span C may end after the end of the parent span P. In such cases, we truncate the end time of C to the end time of P. (Page 7) ^348451499 - This may involve the recursive truncation of C’s descendant. (Page 8) ^348451500 - produces the daily critical path report for each endpoint (top-down analysis) and also produces overall metrics aggregated over all endpoints (bottom-up analysis). (Page 8) ^348451501 - Although rare, a child span C may end before the start time of parent span P. Similarly, a child span C may start after the end time of the parent span P. In these cases, we completely drop the subtree formed by C for CPA. (Page 8) ^348451502 - The total time truncation over millions of traces was under 5% giving us the confidence that a significant part of the data was retained. (Page 8) ^348451503 - An offline anomaly detection model is also trained per endpoint result. (Page 8) ^348451504 - While one trace can be compressed into its essential critical path and represented as a CCCT, it may not be representative. Hence, we need to inspect numerous traces to derive a “typical” shape of the critical path. Distinct traces may exhibit different critical paths based on many things, such as calling parameters, scheduling decisions, system load, time of the day, and network delays, to name a few. Hence, a summary of typical components on the critical path is desired. (Page 8) ^348451505 - To this end, we merge all critical paths (represented as CCCTs) into a weighted, aggregate CCCT. We follow the tree merging process done in HPCToolkit [13]. (Page 8) ^348451506 - Specifically, we provide different percentiles (e.g., P50, P95, P99) of the latency values, which are widely used for QoS purposes. (Page 8) ^348451507 - If we chose all traces to represent a single flame graph, the critical path found in P99 latencies may dominate the flame graph and mask the other common cases. For that reason, we show three different flame graphs for different percentiles of latency values (e.g., P50, P95, and P99). (Page 8) ^348451508 - We also produce differential flame graphs [30] that show how the critical paths change between two percentile values. (Page 8) ^348451509 - CRISP provides the heat map view (see Figure 10), where the rows are the endpoints and the columns represent individual traces. Each cell in the heat map represents the exclusive time on the critical path and each cell is gradient colored based on its contribution (exclusive time) to the total latency. (Page 8) ^348451510 - service critical path vectors (SCPV) (Page 9) ^348451511 - During the online inference, the learned model will infer whether the given new trace is abnormal or not based on an anomaly score. (Page 9) ^348451512 - TraceAnomaly [39], which is the state-of-the-art framework for anomaly detection in microservices trace. The neural architectural details are described in Appendix B (Page 9) ^348451513 - The output of the bottom-up analysis is a descending priority list of top APIs that are often in many endpoints. Additionally, the bottom-up analysis produces various histograms over all traces taken together, which include the total number of times any API appears in any graph, the total number of times an API appears on the critical path, the number of unique APIs on the critical path, the critical path length, and the maximum degree of concurrency in a trace (Page 9) ^348451514 - These graphs are intended to inform infrastructure and hardware engineers to better understand the current needs of our systems and aid capacity planning for the future. (Page 9) ^348451515 - During the offline training, we encode the critical path (CCCT) for each trace of an endpoint into feature vectors, (Page 9) ^348451516 - Critical Path Analysis (CPA) has been extensively explored in the shared-memory parallel programming paradigm [13, 20, 25, 40, 50, 54, 55, 59, 62] but less explored in distributed parallel systems (Page 12) ^348451517