Data Valuation and Critical Learning Periods in Federated Learning

Overview:

Most machine learning (ML) methods in use today train models centrally: all required training data is collected and processed in a common location, usually a highly optimized, energy-efficient data center. In many practical use cases, however, data cannot be collected centrally, for example due to security and privacy concerns. In these cases, federated learning (FL) is increasingly used, which enables distributed training of ML models without the training data leaving the end devices or data silos. However, privacy-preserving, distributed training through FL also has disadvantages:

The resulting model is often less accurate than one trained centrally.
Some of the private data used for training may even harm model performance.

Data Valuation is a paradigm that estimates the value of decentralized data for model performance, so that training can be restricted to “useful” data. This has a substantial impact on model performance. Critical Learning Periods (CLPs) are a phenomenon known from biology and neuroscience that has also been observed in artificial neural networks. CLPs are specific phases during the training of a neural network in which the model is most receptive to new information. During these periods, the network’s ability to learn and improve is at its peak, so it is crucial to provide the most valuable data for training.
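To make the idea of Data Valuation concrete, the following is a minimal sketch of leave-one-out valuation at client level: a client's value is the utility obtained with all clients minus the utility obtained without it. This is only one simple valuation scheme and not necessarily the one to be used in a thesis. The function name `leave_one_out_values`, the toy utility, and the client names are assumptions for illustration; in practice the utility would be a validation metric of a model trained on the given subset.

```python
from typing import Callable, Dict, List, Sequence


def leave_one_out_values(
    clients: Sequence[str],
    utility: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Value of a client = utility with all clients minus utility without it."""
    full = utility(list(clients))
    return {
        c: full - utility([other for other in clients if other != c])
        for c in clients
    }


# Toy utility (assumption): each client contributes a fixed gain or loss
# to the validation accuracy of the global model.
toy_gain = {"a": 0.3, "b": 0.1, "c": -0.2}


def toy_utility(subset: List[str]) -> float:
    return sum(toy_gain[c] for c in subset)


values = leave_one_out_values(list(toy_gain), toy_utility)

# Clients with negative value would harm the model and could be excluded.
harmful = [c for c, v in values.items() if v < 0]
```

Leave-one-out needs one utility evaluation per client, which is expensive if each evaluation requires retraining; cheaper approximations are part of what makes the topic interesting.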

In FL, data is decentralized and resides on multiple devices instead of a central server. Data Valuation helps identify which parts of this decentralized data are most valuable for improving model performance. By combining Data Valuation with Critical Learning Periods, we can optimize the training process: providing the most valuable data during these receptive phases improves model performance and training efficiency while maintaining privacy, as the data never leaves its original location. The goal of this combination is models that train faster and perform better without giving up privacy.
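One way to picture this combination is a client selection rule that prefers high-value clients during a critical learning period and samples uniformly otherwise. The sketch below assumes that the critical rounds are already known and passed in as `critical_rounds`; detecting them is itself an open question of this topic. The function name, the parameters, and the example values are hypothetical.

```python
import random
from typing import Collection, Dict, List, Optional


def select_clients(
    round_idx: int,
    values: Dict[str, float],
    k: int,
    critical_rounds: Collection[int],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick k clients: the top-k by value in critical rounds, random otherwise."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    clients = list(values)
    k = min(k, len(clients))
    if round_idx in critical_rounds:
        # Critical period: exploit the clients with the highest estimated value.
        return sorted(clients, key=lambda c: values[c], reverse=True)[:k]
    # Outside the critical period: plain uniform sampling without replacement.
    return rng.sample(clients, k)


example_values = {"a": 0.3, "b": 0.1, "c": -0.2}
in_critical = select_clients(2, example_values, k=2, critical_rounds=range(0, 5))
outside = select_clients(10, example_values, k=2, critical_rounds=range(0, 5))
```

Here `in_critical` contains the two clients with the highest value, while `outside` is a random pair.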

This topic is about decentralized ML (FL). Identifying the “right” data at the “right” time during training reduces the potential harm to model performance from adversarial data points while improving overall performance in privacy-preserving ML.

Research Questions:

Potential research questions for theses lie at the intersection of distributed systems and ML:

How can we identify the right clients with the right data? Which measurements and techniques are useful for identifying data similarity and the learning phases of a neural network?
How can we select clients intelligently? Can we use client selection strategies such as Oort or bandit algorithms to improve the exploration vs. exploitation trade-off?
Can we make the developed approaches more tractable, resilient, and powerful using explainable artificial intelligence (XAI)?
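As an illustration of the exploration vs. exploitation trade-off mentioned in the questions above, the following is a minimal sketch of a UCB1-style bandit for client selection. It is not Oort and not a proposed solution. The class name `UCBClientSelector`, the exploration constant `c`, and the notion of a per-client reward (for example, the observed improvement in validation loss) are assumptions for illustration.

```python
import math
from typing import Dict, Iterable, List


class UCBClientSelector:
    """UCB1-style selection: mean reward plus a bonus for rarely chosen clients."""

    def __init__(self, client_ids: Iterable[str], c: float = 1.0) -> None:
        self.counts: Dict[str, int] = {cid: 0 for cid in client_ids}
        self.means: Dict[str, float] = {cid: 0.0 for cid in self.counts}
        self.c = c
        self.t = 0  # number of selection rounds so far

    def select(self, k: int) -> List[str]:
        self.t += 1

        def score(cid: str) -> float:
            n = self.counts[cid]
            if n == 0:
                return math.inf  # try every client at least once
            return self.means[cid] + self.c * math.sqrt(2 * math.log(self.t) / n)

        return sorted(self.counts, key=score, reverse=True)[:k]

    def update(self, cid: str, reward: float) -> None:
        # Incremental update of the running mean reward of this client.
        self.counts[cid] += 1
        self.means[cid] += (reward - self.means[cid]) / self.counts[cid]
```

A typical loop would call `select(k)` at the start of each round, train on the returned clients, and then call `update` with each client's observed reward.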

Prerequisites:

Work in this research area requires very good knowledge of machine learning and solid knowledge of distributed systems. In addition, the information on how to conduct theses at our department must be read and considered.

Start: Immediately

Contact: Patrick Wilhelm (patrick.wilhelm ∂ tu-berlin.de)

References:

H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017.
F. Lai, X. Zhu, H. V. Madhyastha, and M. Chowdhury, “Oort: Efficient federated learning via guided participant selection,” in USENIX OSDI, 2021.
Y. Jee Cho, J. Wang, and G. Joshi, “Towards understanding biased client selection in federated learning,” in AISTATS, 2022.
A. Achille, M. Rovere, and S. Soatto, “Critical learning periods in deep neural networks,” in ICLR, 2019. YouTube: https://www.youtube.com/watch?v=rZIJiZpBALk
“LAVA: Data valuation without pre-specified learning algorithms,” in ICLR, 2023.