Overview:
Deploying applications across cloud, fog, and edge resources is now routine in distributed computing, but ensuring Quality-of-Service (QoS) across these tiers remains challenging. While orchestration frameworks like Kubernetes and K3s simplify the mechanics of deployment, their schedulers remain largely reactive and rule-based, with no built-in support for learned decision-making based on historical or real-time metrics.
This thesis focuses on enhancing QoS-aware orchestration using machine learning (ML) techniques. Building on previous work that laid the architectural foundations (e.g., decentralized coordination, Raft consensus, Borda count ranking), this project shifts the focus toward predictive and adaptive orchestration strategies. The central goal is to develop and evaluate ML models that learn from system behavior (e.g., past deployments, resource usage, network latency) and can proactively steer placement and migration decisions for microservices.
Candidate approaches include reinforcement learning for scheduling, supervised learning for performance prediction, and clustering to identify workload groupings that perform well under varying QoS constraints such as energy efficiency, cost, and latency.
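To make the Borda-based baseline mentioned above more concrete, the following is a minimal sketch of how a Borda count could rank candidate nodes across several QoS metrics. It is an illustration under stated assumptions, not the implementation from the previous work: the node names, metric names, values, and the function `borda_rank` are all hypothetical, and every metric is assumed to be "lower is better" with equal weight.

```python
from typing import Dict, List, Tuple


def borda_rank(
    nodes: Dict[str, Dict[str, float]],
    lower_is_better: List[str],
) -> List[Tuple[str, int]]:
    """Rank candidate nodes by Borda count over per-metric orderings.

    Hypothetical helper: for each metric, the best of n nodes earns n - 1
    points, the next n - 2, and so on. Points are summed across metrics.
    """
    names = sorted(nodes)
    n = len(names)
    scores = {name: 0 for name in names}
    for metric in lower_is_better:
        # Lowest value first; ties are broken by node name for determinism.
        ordered = sorted(names, key=lambda name: (nodes[name][metric], name))
        for position, name in enumerate(ordered):
            scores[name] += n - 1 - position
    # Highest total first; ties again broken by node name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative (made-up) telemetry for three candidate nodes.
candidates = {
    "edge-1": {"latency_ms": 5.0, "energy_w": 40.0, "cost": 3.0},
    "fog-1": {"latency_ms": 20.0, "energy_w": 25.0, "cost": 2.0},
    "cloud-1": {"latency_ms": 80.0, "energy_w": 60.0, "cost": 1.0},
}

ranking = borda_rank(candidates, ["latency_ms", "energy_w", "cost"])
print(ranking)
```

A ranking of this kind uses only the current ordering of nodes per metric. An ML-based scheduler could additionally learn from past deployments and measured QoS outcomes, and that is the comparison this thesis targets.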
Research Questions:
This thesis lies at the intersection of ML, distributed systems, and edge-cloud orchestration. It addresses the following questions:
How can supervised or reinforcement learning methods improve workload placement decisions compared to rule-based or Borda-based scheduling?
Which features (e.g., resource telemetry, historical workload behavior, cluster health) are most predictive of good QoS outcomes?
Can we train models that adapt in real time to changing network or energy conditions without violating service-level agreement (SLA) constraints?
How can ML-based decisions be integrated into a Kubernetes-based orchestration layer while preserving robustness and explainability?
Requirements:
Solid understanding of Kubernetes and container orchestration; strong programming skills (preferably in Go or Python); knowledge of machine learning techniques. Familiarity with YAML and CI/CD pipelines is a plus.
Start: Immediately
Contact: Ismail Aslan (aslan@tu-berlin.de)