Autonomous AI Agents for Apache Flink Pipeline Management on Kubernetes
DOI:
https://doi.org/10.22399/ijcesen.4998
Keywords:
Apache Flink Stream Processing, Kubernetes Orchestration, Reinforcement Learning Optimization, Autonomous Failure Recovery, Predictive Service-Level Agreement Enforcement
Abstract
Modern companies struggle to maintain data pipelines that process millions of events per second while meeting strict performance requirements. Traditional methods rely on fixed configurations and manual fixes, which fail when workloads change unexpectedly. This paper presents an AI-powered system that automatically manages Apache Flink pipelines on Kubernetes. It uses machine learning to predict problems before they occur, recover from failures automatically, and optimize resource usage continuously. The system was evaluated on two publicly available benchmark datasets: the NYC Taxi Trip Record dataset, adapted for streaming scenarios, and the Yahoo Cloud Serving Benchmark dataset. Tests show that, compared with manual management, the AI-driven approach significantly reduces service violations, substantially cuts recovery time, and lowers infrastructure costs while maintaining better performance. The system comprises three cooperating AI agents. The prediction agent forecasts problems ahead of time with high accuracy, using a neural network that continuously processes multiple metrics. The recovery agent detects failures rapidly using isolation forests, autoencoders, and long short-term memory networks. The optimization agent adjusts resources dynamically based on workload patterns, using reinforcement learning. Together, these agents enable the system to operate autonomously, dramatically reducing manual interventions and operational overhead.
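The division of labor among the three agents can be illustrated with a minimal sketch. It is not the authors' implementation: the learned models named in the abstract (the neural-network forecaster, the isolation forest, autoencoder, and LSTM detectors, and the reinforcement-learning policy) are replaced here by simple stand-ins (a linear trend extrapolation, a z-score test, and a threshold rule), and every function name, threshold, and metric below is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Decision:
    predicted_latency_ms: float
    anomaly: bool
    parallelism: int


def predict_latency(history, horizon=3):
    """Prediction agent stand-in: extrapolate the average recent slope.

    The paper uses a neural network; this linear trend is an assumption.
    """
    if len(history) < 2:
        return float(history[-1])
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * horizon


def is_anomalous(history, latest, z_threshold=3.0):
    """Recovery agent stand-in: flag a sample far from the recent mean.

    The paper uses isolation forests, autoencoders, and LSTMs; the z-score
    test here is only a placeholder for that detection step.
    """
    sigma = pstdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > z_threshold


def choose_parallelism(current, predicted_ms, sla_ms, min_p=1, max_p=32):
    """Optimization agent stand-in: scale up when the forecast breaches the
    SLA, scale down when it is far below it.

    The paper uses reinforcement learning; this fixed rule is an assumption.
    """
    if predicted_ms > sla_ms:
        return min(current * 2, max_p)
    if predicted_ms < 0.5 * sla_ms:
        return max(current // 2, min_p)
    return current


def control_step(history, latest, parallelism, sla_ms=200.0):
    """One pass of the loop: detect, forecast, then pick a parallelism."""
    anomaly = is_anomalous(history, latest)
    predicted = predict_latency(history + [latest])
    new_parallelism = choose_parallelism(parallelism, predicted, sla_ms)
    return Decision(predicted, anomaly, new_parallelism)
```

Under these assumptions, `control_step([100, 110, 120, 130], 140, parallelism=4)` forecasts 170 ms, reports no anomaly, and leaves parallelism at 4 against the assumed 200 ms SLA. In a real deployment each stand-in would be swapped for the corresponding learned model, and the returned parallelism would be applied to the Flink job through whatever scaling mechanism the cluster exposes.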
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.