Ali Amin-Nejad

Federated Learning @ ICLR 2021

17 May 2021

Introduction

We’re all busy people and it’s hard to find the time to pore over papers at conferences to stay up to date with the latest research in the field of your choice. Worry not: if you’re into Federated Learning (FL), I’ve done it for you. For this one conference. Hopefully this saves you some time, so you can quickly dismiss the papers you’re not interested in and focus on the ones you are, while maintaining a general awareness of research in the field more broadly. I have included papers across all Oral, Spotlight and Poster presentations. You can find all ICLR papers here. And if you need an introduction to FL, take a look at my earlier blog posts here and here.

Federated Learning Papers

So without further ado, in no particular order, here are the ten papers (it just happened to be a nice round number) focusing on FL at ICLR 2021:

1. Federated Learning Based on Dynamic Regularization

Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, Venkatesh Saligrama

Institutions

Boston University, ARM

New algorithm

FedDyn

Code

None provided

Authors’ One-sentence Summary

We present, FedDyn, a novel dynamic regularization method for Federated Learning where the risk objective for each device is dynamically updated to ensure the device optima is asymptotically consistent with stationary points of the global loss.

My summary

This paper is most applicable to cross-device FL since it addresses the communication overhead of FL. Dynamic regularization modifies each device’s local objective at every round, adding linear and quadratic penalty terms so that the local optima stay consistent with stationary points of the global loss. This means more computation can be done on device before updates need to be shared and parameters downloaded. The authors demonstrate strong empirical results, improving on the previous SOTA in this area (SCAFFOLD). Their algorithm also has the useful properties of working in both convex and non-convex settings, being fully agnostic to device heterogeneity, and being robust to a large number of devices, partial participation and unbalanced data.
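To make the idea concrete, here is a rough sketch of what a FedDyn-style local objective looks like (my own illustration with hypothetical names and flattened parameter tensors, not the authors’ code): the usual task loss plus a linear penalty driven by a running gradient state and a quadratic proximal penalty anchored at the last global model.

```python
import torch

def feddyn_local_loss(task_loss, local_params, server_params, grad_state, alpha=0.01):
    # task_loss:     ordinary loss on the local mini-batch
    # local_params:  flattened current local parameters (1-D tensor)
    # server_params: flattened global parameters received this round
    # grad_state:    running linear-penalty state the device keeps across rounds
    linear_term = -torch.dot(grad_state, local_params)                        # dynamic linear penalty
    quad_term = 0.5 * alpha * torch.sum((local_params - server_params) ** 2)  # proximal penalty
    return task_loss + linear_term + quad_term
```

Minimizing this keeps each device’s optimum from drifting away from the global stationary point even when many local steps are taken between communications.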

2. Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

Institutions

Google, CMU

New algorithm

FedPA

Code

https://github.com/alshedivat/fedpa

Authors’ One-sentence Summary

A new approach to federated learning that generalizes federated optimization, combines local MCMC-based sampling with global optimization-based posterior inference, and achieves competitive results on challenging benchmarks.

My summary

This paper reframes FL from a global optimization problem to a posterior inference problem. But what does this actually mean? Well, if you recall your Bayesian statistics, the posterior probability is the probability of the parameters \(\theta\) given the evidence \(X\): \(P(\theta \mid X)\), i.e. the probability of something after taking into account the relevant background (prior) and evidence (likelihood). In this case, the parameters we are referring to are the actual model parameters. Local posterior inference is run on the clients to refine a global estimate of the posterior mode, with stochastic gradient Markov chain Monte Carlo (SG-MCMC) used to approximately sample from the local posteriors. This federated local sampling stops local model weights from ever straying too far from the global optimum. Not only does FedPA converge faster and to better optima than FedAvg, it also benefits from increased local computation, making it particularly attractive in the cross-device setting. Finally, the authors show that FedAvg is essentially a suboptimal special case of FedPA and report SOTA results on some benchmarks.
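The key identity (stated roughly here, assuming a uniform prior over \(\theta\)) is that the global posterior factorizes into a product of local posteriors over each client’s data \(X_k\):

\[
P(\theta \mid X) \;\propto\; \prod_{k=1}^{K} P(\theta \mid X_k).
\]

If each local posterior is approximated by a Gaussian with mean \(\mu_k\) and covariance \(\Sigma_k\), the global posterior mode can be recovered from the local moments as

\[
\mu = \Big(\sum_{k} \Sigma_k^{-1}\Big)^{-1} \sum_{k} \Sigma_k^{-1} \mu_k,
\]

which is what the clients’ MCMC samples are used to estimate.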

3. Adaptive Federated Optimization

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, Hugh Brendan McMahan

Institutions

Google

New algorithm

FedOpt

Code

https://github.com/google-research/federated/tree/master/optimization

Authors’ One-sentence Summary

We propose adaptive federated optimization techniques, and highlight their improved performance over popular methods such as FedAvg.

My summary

The authors adapt the optimizers Adam, Adagrad and Yogi to work in the federated setting and show that this improves results compared to the regular FedAvg algorithm with SGD on a variety of different datasets. The adaptive portion of the optimization is done on the server, with the clients just running regular SGD. This ensures the method has the same communication cost as FedAvg and thus can feasibly work in the cross-device setting. The general version of their optimization algorithm is called FedOpt, and it shows that the negative of the average model difference can indeed be used as a pseudo-gradient in general server optimizer updates.
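As a rough illustration (a sketch of the idea rather than the authors’ implementation, assuming flattened parameter tensors), a FedAdam-style server step treats the negative of the average client delta as a gradient and feeds it into an Adam-like update:

```python
import torch

def fedadam_server_update(global_params, client_params_list, m, v,
                          lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    # Pseudo-gradient: the negative of the average client model delta.
    deltas = torch.stack([cp - global_params for cp in client_params_list])
    g = -deltas.mean(dim=0)
    m = beta1 * m + (1 - beta1) * g           # first moment
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (Adagrad/Yogi differ only here)
    new_global = global_params - lr * m / (torch.sqrt(v) + tau)
    return new_global, m, v
```

The Adagrad and Yogi variants only change how the second moment `v` is accumulated; the clients are untouched, which is why the communication cost matches FedAvg.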

4. Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning

Haibo Yang, Minghong Fang, Jia Liu

Institutions

Ohio State

Code

None provided

Authors’ One-sentence Summary

None provided

My summary

FL has a nice property known as linear speedup: convergence performance improves linearly with respect to the number of workers. However, this had only been proven for IID datasets and/or full worker participation. The authors essentially answer what was previously an open question and show that this property still holds in non-IID settings and/or with partial worker participation.
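Roughly speaking (this is the generic form such results take, not a quote of the paper’s exact bound), linear speedup means that the error after \(T\) communication rounds with \(m\) workers each doing \(K\) local steps decays like

\[
\mathcal{O}\!\left(\frac{1}{\sqrt{mKT}}\right),
\]

so the number of rounds needed to reach a target accuracy \(\epsilon\) scales as \(1/(mK\epsilon^2)\), i.e. it shrinks linearly as more workers participate.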

5. Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning

Wonyong Jeong, Jaehong Yoon, Eunho Yang, Sung Ju Hwang

Institutions

Korea Advanced Institute of Science and Technology (KAIST)

New algorithm

FedMatch

Code

https://github.com/wyjeong/FedMatch

Authors’ One-sentence Summary

We introduce a new practical problem of federated learning with a deficiency of supervision and study two realistic scenarios with a novel method to tackle the problems, including inter-client consistency and disjoint learning

My summary

The authors introduce a new algorithm, FedMatch, for federated semi-supervised learning. They focus on two federated paradigms: one where each client has a mix of labelled and unlabelled data, and another where clients only have unlabelled data but the server has labelled data. The two core components of the algorithm are (1) an inter-client consistency loss which regularizes the models learned at multiple clients to output the same prediction, and (2) parameter decomposition such that the model has one set of weights for unsupervised learning and another set of weights for supervised learning. They show that their algorithm outperforms local semi-supervised learning and other naive baselines on a few different datasets in IID and non-IID settings.
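A minimal sketch of the parameter-decomposition idea for a single layer (my own illustration with hypothetical names, not the authors’ code): the effective weight is the sum of two parameter sets, and each training phase only updates one of them.

```python
import torch
import torch.nn.functional as F

class DecomposedLinear(torch.nn.Module):
    """Sketch of a FedMatch-style decomposed layer: supervised and
    unsupervised weights are kept separate but summed at forward time."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_sup = torch.nn.Parameter(0.01 * torch.randn(out_dim, in_dim))   # trained on labelled data
        self.w_unsup = torch.nn.Parameter(torch.zeros(out_dim, in_dim))        # trained on unlabelled data

    def forward(self, x):
        return F.linear(x, self.w_sup + self.w_unsup)

    def set_phase(self, supervised: bool):
        # Freeze one set of weights depending on which loss is being optimized.
        self.w_sup.requires_grad_(supervised)
        self.w_unsup.requires_grad_(not supervised)
```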

6. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization

Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, Qi Dou

Institutions

Princeton, CUHK, Iowa State, Monash

New algorithm

FedBN

Code

https://github.com/med-air/FedBN

Authors’ One-sentence Summary

We propose a novel and efficient federated learning aggregation method, denoted FedBN, that uses local batch normalization to effectively tackle the underexplored non-iid problem of heterogeneous feature distributions, or feature shift.

My summary

Batch Normalization (BN) is an important and now-ubiquitous component of neural network architectures which improves convergence and model stability. However, little work has been done to adapt BN layers to the federated setting; they are usually either removed or treated naively like any other layer. This paper comes up with a remarkably simple yet effective solution: include BN layers in the model but simply don’t synchronize them with the global model, so that each local model has its own personalized BN layers. This improves performance compared to standard FedAvg, particularly with non-IID data where it helps to mitigate feature shift. One caveat, however, is that all experiments are done with just a two-layer neural network.
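In code, the change is tiny. A sketch of the aggregation step (assuming BN parameters can be identified by name, which depends on how your model is defined) might look like this:

```python
import torch

def aggregate_skip_bn(client_state_dicts):
    """FedAvg-style aggregation that leaves BatchNorm layers local.

    Parameters whose names contain 'bn' are simply not averaged, so every
    client keeps its own BN statistics and affine parameters.
    """
    global_state = {}
    for name in client_state_dicts[0]:
        if "bn" in name:   # assumes BN layers are named with 'bn'; adjust to your model
            continue       # keep these local, never synchronized
        global_state[name] = torch.mean(
            torch.stack([sd[name].float() for sd in client_state_dicts]), dim=0)
    return global_state
```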

7. FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning

Hong-You Chen, Wei-Lun Chao

Institutions

Ohio State

New algorithm

FedBE

Code

None provided

Authors’ One-sentence Summary

None provided

My summary

This paper takes a Bayesian inference perspective on the model aggregation step of FL. The authors’ algorithm, FedBE, can be a simple add-on to the regular FedAvg (or indeed any other) algorithm. The only difference is the aggregation step, which uses Bayesian ensembling instead of a simple arithmetic mean of the client models. Their results are most convincing when FedBE is combined with Stochastic Weight Averaging, where we see significant improvements over baselines. Crucially, these improvements are also demonstrated with deeper architectures like ResNets on non-IID data. One caveat, however, is that the algorithm relies on the server doing some computation on a small amount of unlabelled data, which could well be an infeasible assumption in many situations.
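A very rough sketch of the ensembling step (my own simplification using a diagonal Gaussian over client weights; `build_model` is a hypothetical helper that loads flattened weights into a model):

```python
import torch

def fedbe_pseudo_labels(client_weight_vectors, unlabeled_x, build_model, n_samples=10):
    """Sketch of FedBE-style Bayesian model ensembling (not the authors' code)."""
    W = torch.stack(client_weight_vectors)            # (num_clients, num_params)
    mu, std = W.mean(dim=0), W.std(dim=0) + 1e-8      # fit a diagonal Gaussian over client weights
    samples = [mu] + [mu + std * torch.randn_like(mu) for _ in range(n_samples)]
    with torch.no_grad():
        probs = [torch.softmax(build_model(w)(unlabeled_x), dim=-1) for w in samples]
    # Ensemble prediction on the server's unlabelled data; these soft labels
    # are then distilled into a single global model (distillation omitted here).
    return torch.stack(probs).mean(dim=0)
```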

8. FedMix: Approximation of Mixup under Mean Augmented Federated Learning

Tehrim Yoon, Sumin Shin, Sung Ju Hwang, Eunho Yang

Institutions

KAIST

New algorithm

MAFL, FedMix

Code

None provided

Authors’ One-sentence Summary

We introduce a new federated framework, Mean Augmented Federated Learning (MAFL), and propose an efficient algorithm, Federated Mixup (FedMix), which shows good performance on difficult non-iid situations.

My summary

This paper introduces a new framework and algorithm which again address the non-IID data problem, this time with data augmentation. Their work builds on the MixUp algorithm, a simple data augmentation technique that linearly interpolates between two input-label pairs \((x_i, y_i)\) and \((x_j, y_j)\). Their general framework, Mean Augmented Federated Learning (MAFL), builds on FedAvg with one difference: in addition to sharing model parameters, clients also share averaged (or mashed, as the authors call it) data. Each client then essentially unmashes other clients’ data every round and trains with it alongside its own local data. The problem, of course, is that this is not very “federated” for an FL framework, and a privacy tradeoff needs to be made. FedMix, the algorithm which builds on this framework, improves on naive MAFL by approximating global MixUp in a more systematic way. FedMix and the naive implementation both outperform FedAvg and FedProx in non-IID scenarios on CIFAR and FEMNIST datasets. One critical missing piece in this research is an analysis of the privacy risk, which the authors state is beyond the scope of this work.
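For reference, vanilla MixUp, the building block here (not FedMix itself, which approximates a global version of this using the shared averaged data), interpolates two examples like this:

```python
import torch

def mixup(x_i, y_i, x_j, y_j, alpha=1.0):
    """Standard MixUp: linearly interpolate two input-label pairs."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x_i + (1 - lam) * x_j
    y = lam * y_i + (1 - lam) * y_j   # assumes one-hot / soft labels
    return x, y
```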

9. HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients

Enmao Diao, Jie Ding, Vahid Tarokh

Institutions

Duke, Minnesota

New algorithm

HeteroFL

Code

https://github.com/dem123456789/HeteroFL-Computation-and-Communication-Efficient-Federated-Learning-for-Heterogeneous-Clients

Authors’ One-sentence Summary

In this work, we propose a new federated learning framework named HeteroFL to train heterogeneous local models with varying computation complexities.

My summary

One of the core requirements, or assumptions, of FL is that all clients share the same neural network architecture. HeteroFL challenges this assumption and shows that it is possible to adapt FL to scenarios where different clients have models of different sizes. However, this is not a complete overhaul of FL as we know it: all clients hold various subsets of the weights of the global model, and the goal is still to jointly train this global model. Model depth is kept the same across all clients; what varies is the width, i.e. the number of neurons or channels in each layer. The authors demonstrate results using CNN, PreResNet18 and Transformer architectures.
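A minimal sketch of the sub-model extraction for one linear layer (my own illustration of the width-shrinking idea, not the authors’ code): a weaker client gets only a fixed fraction of the input and output dimensions of each weight matrix, while depth is unchanged.

```python
import torch

def shrink_layer(global_weight, rate):
    """Extract a HeteroFL-style sub-model slice of one weight matrix.

    rate is the client's capacity ratio, e.g. 0.5 keeps half the input and
    output neurons; the slice stays aligned with the global model so the
    server can aggregate overlapping regions.
    """
    out_dim, in_dim = global_weight.shape
    local_out = max(1, int(out_dim * rate))
    local_in = max(1, int(in_dim * rate))
    return global_weight[:local_out, :local_in].clone()
```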

10. Personalized Federated Learning with First Order Model Optimization

Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, Jose M. Alvarez

Institutions

Stanford, NVIDIA

New algorithm

FedFomo

Code

None provided

Authors’ One-sentence Summary

We propose a new federated learning framework that efficiently computes a personalized weighted combination of available models for each client, outperforming existing work for personalized federated learning.

My summary

Rather than modifying FL to make the global model more robust, FedFomo tackles the non-IID data problem through local model personalization. There is no global model. Instead, at every round, each client receives updates from a subset of other clients, weighted by how much that client would benefit from those parameters, as measured by performance on its own local validation set. Ideally, every client would send its model to every other client, but the communication costs would likely be prohibitive, even in a small cross-silo setting. Instead, the number of models exchanged is restricted and a sampling scheme favours models that helped in previous rounds. Their method outperforms existing personalized FL approaches.
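My reading of the weighting scheme, sketched below with hypothetical names (not the authors’ code): each downloaded model is weighted by how much it reduces the client’s own validation loss, normalized by its parameter distance from the current local model.

```python
import torch

def fomo_weights(own_params, own_val_loss, other_params_list, other_val_losses):
    """Sketch of FedFomo-style weighting of other clients' models."""
    weights = []
    for params_n, loss_n in zip(other_params_list, other_val_losses):
        gain = max(own_val_loss - loss_n, 0.0)                         # clip negative contributions
        dist = (torch.norm(params_n - own_params) + 1e-8).item()       # parameter distance
        weights.append(gain / dist)
    total = sum(weights)
    return [w / total if total > 0 else 0.0 for w in weights]
```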

While not directly FL focused, the following papers are related to the field and so I have included them below:

1. CaPC Learning: Confidential and Private Collaborative Learning

Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang

Institutions

Toronto, Vector Institute

New algorithm

CaPC

Code

https://github.com/cleverhans-lab/capc-iclr

Authors’ One-sentence Summary

A method that enables parties to improve their own local heterogeneous machine learning models in a collaborative setting where both confidentiality and privacy need to be preserved to prevent both explicit and implicit sharing of private data.

My summary

This is a proposed alternative to FL for improving a local model trained on private data. It allows collaboration without sharing data or model parameters, and different clients can have completely different model architectures. Instead, they collaborate by querying each other for labels of the inputs about which they are uncertain. In this active learning paradigm, one party poses queries in the form of data samples and all the other parties together provide answers in the form of predicted labels. Each model can be used in both the querying phase and the answering phase, with the querying role alternating between the participants in the protocol. The work builds on PATE (Private Aggregation of Teacher Ensembles) and relies on homomorphic encryption (HE), secure multi-party computation (MPC) and differential privacy (DP) to provide the necessary privacy guarantees. The authors show the efficacy of their approach with extensive experiments; however, no comparison with FL is made.
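At the heart of the protocol is a PATE-style noisy vote aggregation over the answering parties’ predictions. The sketch below shows only that step (the HE/MPC machinery that keeps queries and votes confidential is omitted, and the noise scale is illustrative):

```python
import numpy as np

def noisy_label_aggregation(answer_votes, num_classes, sigma=1.0):
    """PATE-style noisy aggregation of the answering parties' predicted labels.

    answer_votes: list of predicted class indices, one per answering party.
    Gaussian noise is added to the vote histogram before the argmax, which is
    what provides the differential-privacy guarantee on the returned label.
    """
    hist = np.bincount(answer_votes, minlength=num_classes).astype(float)
    hist += np.random.normal(0.0, sigma, size=num_classes)   # DP noise
    return int(np.argmax(hist))
```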

2. Differentially Private Learning Needs Better Features (or Much More Data)

Florian Tramer, Dan Boneh

Institutions

Stanford

Code

https://github.com/ftramer/Handcrafted-DP

Authors’ One-sentence Summary

Linear models with handcrafted features outperform end-to-end CNNs for differentially private learning

My summary

This paper addresses the problem that DP training adds so much noise that end-to-end CNN models become unusable and perform worse than linear models. The authors show that handcrafting features with ScatterNet significantly boosts the performance of DP-trained CNNs, but they are still mostly worse than linear models trained on the same handcrafted features. Basically, there is still a long way to go in this area.
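For context, the training recipe in this line of work is DP-SGD on top of fixed handcrafted features: clip each per-example gradient and add Gaussian noise before the update. A minimal sketch (my own, looping over examples for clarity rather than efficiency, with a plain linear classifier):

```python
import torch

def dp_sgd_step(linear_model, features, labels, lr=0.1, clip=1.0, sigma=1.0):
    """One DP-SGD step on precomputed (e.g. ScatterNet) features."""
    grads = []
    for x, y in zip(features, labels):
        linear_model.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            linear_model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.view(-1) for p in linear_model.parameters()])
        scale = torch.clamp(clip / (g.norm() + 1e-8), max=1.0)   # per-example clipping
        grads.append(g * scale)
    noisy = torch.stack(grads).sum(dim=0) + torch.randn_like(grads[0]) * sigma * clip
    noisy /= len(grads)                                          # noisy averaged gradient
    offset = 0
    with torch.no_grad():
        for p in linear_model.parameters():
            n = p.numel()
            p -= lr * noisy[offset:offset + n].view_as(p)
            offset += n
```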

Summary

So what have we learnt? Well, the focus broadly seemed to be geared towards improving performance on non-IID data, with a lot of very interesting ideas tackling the problem from different angles: BatchNorm, regularization, optimizers and model aggregation. Even where this wasn’t the focus, most papers reported results on non-IID data anyway, which was great to see. Oh, and we also have a whole bunch of new algorithm names to remember: FedMix, FedMatch, FedFomo, FedBE, FedBN, FedOpt, FedDyn, FedPA and HeteroFL. Do try to keep up.