One of the main challenges of machine learning is the need for large amounts of data. Gathering training datasets for machine learning models poses privacy, security, and processing risks organizations would rather avoid.
One technique that can help address some of these challenges is “federated learning.” By distributing the training of models across user devices, federated learning makes it possible to take advantage of machine learning while minimizing the need to collect user data.
The main idea behind federated learning is to train a machine learning model on user data without transfer that data to cloud servers.
Federated learning starts with a base machine learning model in the cloud server. This model is either trained on public data (e.g., Wikipedia articles or the ImageNet dataset) or has not been trained at all. In the next stage, several user devices volunteer to train the model. These devices hold user data that is relevant to the model’s application, such as chat logs and keystrokes.
These devices download the base model at a suitable time, for instance when they are on a wi-fi network and are connected to a power outlet (training is a compute-intensive operation and will drain the device’s battery if done at an improper time). Then they train the model on the device’s local data. After training, they return the trained model to the server. Popular machine learning algorithms such as deep neural networks and support vector machines is that they are parametric. Once trained, they encode the statistical patterns of their data in numerical parameters and they no longer need the training data for inference. Therefore, when the device sends the trained model back to the server, it doesn’t contain raw user data.
Once the server receives the data from user devices, it updates the base model with the aggregate parameter values of user-trained models. The federated learning cycle must be repeated several times before the model reaches the optimal level of accuracy that the developers desire. Once the final model is ready, it can be distributed to all users for on-device inference.
Limits of federated learning
Federated learning does not apply to all machine learning applications. If the model is too large to run on user devices, then the developer will need to find other workarounds to preserve user privacy.
On the other hand, the developers must make sure that the data on user devices are relevant to the application. The traditional machine learning development cycle involves intensive data cleaning practices in which data engineers remove misleading data points and fill the gaps where data is missing. Training machine learning models on irrelevant data can do more harm than good. When the training data is on the user’s device, the data engineers have no way of evaluating the data and making sure it will be beneficial to the application. For this reason, federated learning must be limited to applications where the user data does not need pre-processing.
Privacy implications of federated learning
While sending trained model parameters to the server is less privacy-sensitive than sending user data, it doesn’t mean that the model parameters are completely clean of private data. Many experiments have shown that trained machine learning models might memorize user data and membership inference attacks can recreate training data in some models through trial and error.
One important remedy to the privacy concerns of federated learning is to discard the user-trained models after they are integrated into the central model. The cloud server doesn’t need to store individual models once it updates its base model.
Another measure that can help is to increase the pool of model trainers. For example, if a model needs to be trained on the data of 100 users, the engineers can increase their pool of trainers to 250 or 500 users. For each training iteration, the system will send the base model to 100 random users from the training pool. This way, the system doesn’t collect trained parameters from any single user constantly.
Finally, by adding a bit of noise to the trained parameters and using normalization techniques, developers can considerably reduce the model’s ability to memorize users’ data.