Overview:
Modern cloud workflows are structured as directed acyclic graphs (DAGs), with each node representing a computational task and edges encoding task dependencies. Accurate predictions of resource usage (e.g., CPU, memory) for upcoming tasks are crucial for efficient scheduling, cost control, and QoS adherence.
Recent advances in temporal deep learning, especially temporal graph neural networks (TGNNs), provide powerful tools for learning from sequential and graph-structured data simultaneously. Building on prior work that explored the influence of workflow topology on task intensity prediction using static graph models, this thesis aims to incorporate temporal aspects such as execution timestamps, time-based dependencies, and evolving topologies into the modeling process.
The objective is to build predictive models that leverage temporal graph structures to forecast task intensities and identify bottlenecks, load spikes, or SLA violations in advance. You will experiment with TGNN architectures such as Temporal GCNs, Temporal GATs, or Transformer-based models with temporal encoding, and train them on workflow traces like the Alibaba Cluster-Trace-V2018.
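As a rough illustration of what a temporal workflow graph could look like, the following Python sketch turns a handful of task records into timestamped dependency edges and extracts the sub-graph visible at a given point in time. The Task fields, the toy values, and the choice to timestamp each edge with the end time of the parent task are assumptions made for illustration only; they are not taken from the Alibaba trace schema or from the prior work.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All fields are hypothetical; a real trace will use its own schema.
    name: str
    start: int   # start timestamp (seconds)
    end: int     # end timestamp (seconds)
    cpu: float   # observed CPU demand, the kind of target a model would predict
    deps: list = field(default_factory=list)  # names of parent tasks


def temporal_edges(tasks):
    """Return (parent, child, t) edges sorted by t.

    Assumption: an edge becomes active when the parent task finishes.
    """
    by_name = {t.name: t for t in tasks}
    edges = []
    for task in tasks:
        for parent in task.deps:
            edges.append((parent, task.name, by_name[parent].end))
    return sorted(edges, key=lambda e: e[2])


def snapshot(tasks, edges, now):
    """Return the nodes and edges that are visible at time `now`."""
    nodes = {t.name for t in tasks if t.start <= now}
    visible = [
        (src, dst, ts)
        for src, dst, ts in edges
        if ts <= now and src in nodes and dst in nodes
    ]
    return nodes, visible


# Toy workflow: A fans out to B and C, which both feed D.
tasks = [
    Task("A", 0, 10, 1.0),
    Task("B", 10, 25, 2.0, ["A"]),
    Task("C", 12, 20, 0.5, ["A"]),
    Task("D", 25, 40, 4.0, ["B", "C"]),
]
edges = temporal_edges(tasks)
nodes_at_15, edges_at_15 = snapshot(tasks, edges, now=15)
```

A sequence of such snapshots, or the raw timestamped edge list, is the kind of input that snapshot-based and event-based temporal graph models typically consume; which representation fits the data best is part of the thesis work.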
Research Questions:
This thesis lies at the intersection of deep learning, distributed systems, and time-aware graph analysis, and addresses the following questions:
How can we model workflows as temporal graphs that reflect both topological structure and dynamic task execution behavior?
Which temporal graph learning architectures (e.g., TGAT, TGN, DySAT) are most suitable for predicting CPU and memory demands in real-world workflows?
Can temporal attention mechanisms or learned node embeddings improve upon static topology-based models?
How does temporal information (e.g., task start time, inter-arrival delay) improve predictive accuracy compared with purely structural or task-level features?
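To make the distinction between structural and temporal features concrete, the Python sketch below derives both kinds for each task of a small workflow. The record keys (name, start, end, deps), the toy values, and the particular feature definitions are hypothetical and chosen only for illustration; an actual study would define its features from the chosen trace.

```python
def build_features(records):
    """Compute structural and temporal features per task.

    `records` is a list of dicts with the hypothetical keys
    name, start, end, deps (deps = names of parent tasks).
    """
    by_name = {r["name"]: r for r in records}
    ordered = sorted(records, key=lambda r: r["start"])
    first_start = ordered[0]["start"]

    depth_cache = {}

    def depth(name):
        # Length of the longest dependency chain leading to this task.
        if name not in depth_cache:
            deps = by_name[name]["deps"]
            depth_cache[name] = 0 if not deps else 1 + max(depth(d) for d in deps)
        return depth_cache[name]

    features = {}
    prev_start = None
    for r in ordered:
        deps = r["deps"]
        last_parent_end = max((by_name[d]["end"] for d in deps), default=None)
        features[r["name"]] = {
            # Structural features: available from the DAG alone.
            "in_degree": len(deps),
            "depth": depth(r["name"]),
            # Temporal features: require execution timestamps.
            "start_offset": r["start"] - first_start,
            "inter_arrival": 0 if prev_start is None else r["start"] - prev_start,
            "wait_after_parents": 0 if last_parent_end is None else r["start"] - last_parent_end,
        }
        prev_start = r["start"]
    return features


records = [
    {"name": "A", "start": 0, "end": 10, "deps": []},
    {"name": "B", "start": 10, "end": 25, "deps": ["A"]},
    {"name": "C", "start": 12, "end": 20, "deps": ["A"]},
    {"name": "D", "start": 25, "end": 40, "deps": ["B", "C"]},
]
task_features = build_features(records)
```

In this toy example, B and C are structurally identical (same in-degree and depth) and differ only in their temporal features, which is the kind of signal a purely static model cannot use.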
Requirements:
Good knowledge of Python and data processing; good foundation in machine learning and deep learning (PyTorch and PyTorch-Geometric are a plus); familiarity with GNNs, time series, and temporal data; basic understanding of workflow orchestration and DAG-based systems.
Start: Immediately
Contact: Ismail Aslan (aslan@tu-berlin.de)
Note: For undergraduate students, the scope of the topic can be expanded or narrowed after consultation with the staff member.