Predictive Workflow Scheduling with Temporal Graph Learning

Overview:

Modern cloud workflows are structured as directed acyclic graphs (DAGs), with each node representing a computational task and edges encoding task dependencies. Accurate predictions of resource usage (e.g., CPU, memory) for upcoming tasks are crucial for efficient scheduling, cost control, and QoS adherence.
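As a minimal illustration of this representation, the sketch below stores a small workflow as a dependency map and derives an execution order and an edge list from it. The task names and the workflow itself are hypothetical and chosen only for illustration; real traces would supply the tasks, dependencies, and measured resource usage.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"clean", "train"},
}


def execution_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def edge_list(deps):
    """Return sorted (upstream, downstream) pairs, one per dependency edge."""
    return sorted(
        (parent, child) for child, parents in deps.items() for parent in parents
    )


order = execution_order(dependencies)
edges = edge_list(dependencies)
print(order)
print(edges)
```

An edge list of this kind is the usual starting point for graph learning libraries, which typically expect integer node indices rather than task names.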

Recent advances in temporal deep learning, especially temporal graph neural networks (TGNNs), provide powerful tools for learning jointly from sequential and graph-structured data. Prior work explored how workflow topology influences task intensity prediction using static graph models. Building on it, this thesis aims to incorporate temporal aspects such as execution timestamps, time-based dependencies, and evolving topologies into the modeling process.

The objective is to build predictive models that leverage temporal graph structures to forecast task intensities and identify bottlenecks, load spikes, or SLA violations in advance. You will experiment with TGNN architectures such as Temporal GCNs, Temporal GATs, or Transformer-based models with temporal encoding, and train them on workflow traces like the Alibaba Cluster-Trace-V2018.
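To make the idea of temporal encoding and time-aware aggregation more concrete, here is a rough sketch in plain Python. It is not a TGNN and not any specific published architecture: the exponential decay weighting, the function names, the feature values, and the timestamps are all assumptions made for illustration. A trained model would learn such weights rather than fix them by hand.

```python
import math


def time_encoding(t_seconds, dim=4, max_period=10_000.0):
    """Sinusoidal encoding of a timestamp, in the style of Transformer positional encodings."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        enc.append(math.sin(t_seconds * freq))
        enc.append(math.cos(t_seconds * freq))
    return enc


def temporal_aggregate(node, t_now, features, events, decay=0.01):
    """Weighted mean of upstream task features, favoring recent dependency events.

    events: list of (upstream, downstream, timestamp) tuples.
    Events later than t_now are skipped so that no future information leaks in.
    """
    weighted = None
    total = 0.0
    for src, dst, t in events:
        if dst != node or t > t_now:
            continue
        w = math.exp(-decay * (t_now - t))  # hand-picked decay, an assumption
        if weighted is None:
            weighted = [0.0] * len(features[src])
        for k, x in enumerate(features[src]):
            weighted[k] += w * x
        total += w
    if total == 0.0:
        return list(features[node])  # no history yet: fall back to own features
    return [v / total for v in weighted]


# Hypothetical per-task features (CPU cores, memory in GiB) and timestamped edges.
features = {"clean": [2.0, 4.0], "train": [8.0, 16.0], "report": [1.0, 1.0]}
events = [
    ("clean", "report", 100.0),
    ("train", "report", 200.0),
    ("train", "report", 900.0),  # lies in the future at t_now = 200
]

agg = temporal_aggregate("report", 200.0, features, events)
model_input = agg + time_encoding(200.0)  # vector a downstream predictor could consume
print(model_input)
```

The aggregated vector leans toward the features of the most recently finished upstream task, which is the kind of signal a temporal model could exploit when forecasting the load of the next task.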

Research Questions:

This thesis lies at the intersection of deep learning, distributed systems, and time-aware graph analysis.

Requirements:

Good knowledge of Python and data processing; a solid foundation in machine learning and deep learning (experience with PyTorch and PyTorch Geometric is a plus); familiarity with GNNs, time series, and temporal data; basic understanding of workflow orchestration and DAG-based systems.

Start: Immediately

Contact: Ismail Aslan (aslan@tu-berlin.de)

Note: For undergraduate students, the scope of the topic can be expanded or narrowed in consultation with the staff member.