Hybrid Sharding and Real-Time Thresholding: Scaling Tail-Latency Diagnostics for Global Traffic
DOI:
https://doi.org/10.22399/ijcesen.5053
Keywords:
Tail Latency Optimization, Distributed Systems Monitoring, Hybrid Sharding Architecture, Real-Time Performance Thresholding, Conditional Profiling Mechanisms
Abstract
High-scale distributed systems face significant challenges in understanding performance degradation at the tail of the latency distribution. Traditional performance monitoring often relies on averages across all requests, which mask what users actually experience for requests at the outer edges of the distribution. Monitoring methods generally either fail to identify rare performance anomalies or consume too many computational resources by profiling routine transactions. This article introduces a telemetry architecture that provides detailed insight into tail latency without reducing system efficiency. The proposed architecture uses a centralized controller to maintain continuously updated, dynamic percentile cutoffs over rolling time windows for thousands of concurrent experiments. Minimal reporting keeps the volume of information exchanged between the serving tasks and the controller very small. In addition, the controller's conditional profiling activates detailed diagnostics collection only when latency thresholds are exceeded. The architecture is designed to accommodate hybrid sharding schemes that handle the different traffic profiles found in global deployments. Multi-slice sharding allows horizontal scaling and increased metric storage while maintaining the centralized coordination needed to manage thresholds accurately. Real-time performance monitoring in both production and load-test environments improves developer velocity. The presented work demonstrates that advanced observability approaches can operate efficiently at planetary scale when they are built on outcome-aware principles and allocate system resources intelligently.
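The interaction between rolling-window percentile cutoffs and conditional profiling can be illustrated with a minimal Python sketch. It is not the article's implementation: the class and method names (`RollingThreshold`, `Controller`, `report`), the 60-second window, and the nearest-rank percentile over a sorted in-memory window are all assumptions made for illustration. A system at the scale described would more plausibly use a compact quantile summary per shard rather than sorting raw samples.

```python
import math
from collections import deque


class RollingThreshold:
    """Keeps recent latencies for one experiment and derives a percentile cutoff.

    Hypothetical helper for illustration; not taken from the article.
    """

    def __init__(self, percentile=99.0, window_seconds=60.0):
        self.percentile = percentile
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, latency_ms), oldest first

    def record(self, timestamp, latency_ms):
        self.samples.append((timestamp, latency_ms))
        # Drop samples that have fallen out of the rolling window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()

    def cutoff(self):
        # With no data yet, return infinity so that nothing is profiled.
        if not self.samples:
            return math.inf
        ordered = sorted(latency for _, latency in self.samples)
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = math.ceil(self.percentile * len(ordered) / 100.0)
        return ordered[max(rank, 1) - 1]


class Controller:
    """Central coordinator holding one rolling threshold per experiment (assumed design)."""

    def __init__(self, percentile=99.0, window_seconds=60.0):
        self.percentile = percentile
        self.window_seconds = window_seconds
        self.thresholds = {}

    def report(self, experiment_id, timestamp, latency_ms):
        """Accepts a minimal report and returns True if the request should be profiled."""
        threshold = self.thresholds.setdefault(
            experiment_id, RollingThreshold(self.percentile, self.window_seconds)
        )
        # Decide against the cutoff computed before this sample is added.
        profile = latency_ms > threshold.cutoff()
        threshold.record(timestamp, latency_ms)
        return profile
```

In this sketch a serving task sends only an experiment identifier, a timestamp, and a latency value, which mirrors the minimal-reporting idea; detailed diagnostics would be collected only for requests where `report` returns `True`.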
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.