Vertical Federated Learning RFC #8424

@rongou

Description

Motivation

XGBoost 1.7.0 introduced initial support for federated learning, but only horizontal federated learning: training samples are assumed to be split horizontally, i.e. each participant has a subset of the samples but all of the features and labels. In many real-world applications, data is split vertically instead: each participant has all the samples but only a subset of the features, and not all participants have access to the labels. It would be useful to support vertical federated learning in XGBoost.

Goals

  • Enhance XGBoost to support vertical federated learning.
  • Support using NVFlare to coordinate the learning process, while keeping the design amenable to other federated learning platforms.
  • Efficiency: training speed should be close to that of traditional distributed training.
  • Accuracy: model accuracy should be close to that of centralized learning.

Non-Goals

  • Initially we will assume the federated environment is non-adversarial, and will not provide strong privacy guarantees. This will be improved upon in later iterations.
  • We will not support data owners dropping out during the learning process.

Assumptions

In vertical federated learning, before model training, the participants need to jointly compute the common IDs of their private sets. This is called private set intersection (PSI). For XGBoost, we assume this is already done, and users may rely on some other library/framework for this step.
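
To illustrate the shape of the PSI step's output (this plain set intersection offers no privacy whatsoever and is not a PSI protocol; `align_samples` is a hypothetical helper, and a real deployment would use a dedicated PSI library so that non-overlapping IDs are never revealed):

```python
# Illustrative only: a plain, NON-PRIVATE intersection showing what a PSI
# step produces -- the common sample IDs, in an order both parties agree on.
def align_samples(party_a_ids, party_b_ids):
    """Return the common sample IDs, sorted so both parties agree on order."""
    return sorted(set(party_a_ids) & set(party_b_ids))

common = align_samples(["u1", "u2", "u3", "u5"], ["u2", "u3", "u4", "u5"])
print(common)  # ['u2', 'u3', 'u5']
```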

Similar to horizontal federated learning, we make some simplifying assumptions:

  • A few trusted partners jointly train a model.
  • A reasonably fast network connection exists between each participant and a central trusted party.

Risks

The current XGBoost codebase is fairly complicated and hard to modify. Some code refactoring needs to happen first, before support for vertical federated learning can be added. Care must be taken to not break existing functionality, or make regular training harder.

Design

LightGBM, a gradient boosting library similar to XGBoost, supports “feature parallel” distributed learning.

(Figure: LightGBM's feature-parallel training scheme)

Conceptually, feature parallelism is similar to vertical federated learning. A possible design is to first enhance XGBoost's distributed training to support feature parallelism, and then build vertical federated learning on top of it. Because feature parallelism benefits the wider user community on its own, this ordering also helps justify, and reduce the risk of, the refactoring of XGBoost's code base.

Feature Parallelism

XGBoost has an internal training parameter called DataSplitMode, which can be set to auto, col, or row. However, it is currently not exposed to end users, and only row is supported for distributed training.
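
For reference, the parameter can be pictured as a three-valued enum (a hypothetical Python mirror; the actual definition lives in XGBoost's C++ core):

```python
# Hypothetical Python mirror of XGBoost's internal C++ DataSplitMode.
# Exposing "col" to distributed training is the crux of this proposal.
from enum import Enum

class DataSplitMode(Enum):
    AUTO = "auto"  # let XGBoost decide
    ROW = "row"    # horizontal split: each worker holds a subset of rows
    COL = "col"    # vertical split: each worker holds a subset of columns

print(DataSplitMode("col"))  # DataSplitMode.COL
```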

In order to support column-based data split for distributed training, we need to do the following:

  • When initially loading data, support splitting by column. To keep it simple, we can have all workers keep a copy of the labels (and other things like weight and qid, effectively the MetaInfo object). The resulting DMatrix needs to keep track of which features belong to which worker.
  • When generating the prediction, participants need to work collaboratively: the worker owning the feature used to split a node needs to collect the left and right splits and broadcast the results. A naive implementation may incur too much communication overhead, but there is prior work on encoding partial predictions in bitsets to make the process more efficient (see paper).
  • The worker owning the label should calculate the gradients and broadcast them to other workers.
  • When finding the best split, each worker finds the best local split based on the features it owns, then performs an allreduce to determine the global best split. Workers do not need to access each other's histograms. The worker owning the feature of the best split then broadcasts the split results to the others.
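
The last step can be sketched as an argmax-style reduction over each worker's best local candidate. All names below are hypothetical; a real implementation would use the collective's allreduce with a max-by-gain reduction rather than a local `max`:

```python
# Hypothetical sketch of the global best-split step: each worker proposes
# its best local split, and a max-by-gain reduction picks the global winner
# without any worker seeing another worker's histograms.
from dataclasses import dataclass

@dataclass
class SplitCandidate:
    gain: float
    worker_rank: int  # which worker owns the winning feature
    feature: int
    threshold: float

def allreduce_best(candidates):
    """Stand-in for an allreduce with a max-by-gain reduction."""
    return max(candidates, key=lambda c: c.gain)

local_bests = [
    SplitCandidate(gain=3.2, worker_rank=0, feature=7, threshold=0.5),
    SplitCandidate(gain=5.9, worker_rank=1, feature=12, threshold=1.8),
    SplitCandidate(gain=4.1, worker_rank=2, feature=3, threshold=-0.2),
]
best = allreduce_best(local_bests)
print(best.worker_rank, best.feature)  # worker 1 wins; it broadcasts the split
```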

We may also want to consider implementing LightGBM’s voting parallel approach (paper) for more efficient communication.

Vertical Federated Learning

Assuming feature parallelism is implemented, vertical federated learning is a slight modification:

  • When loading data, there is no need to split the columns further, since each worker already holds only its own subset of the features.
  • We can no longer share labels between workers.
  • Communication needs to switch to the federated communicator.
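
The second point can be made concrete: only the label owner holds y, so it computes gradient/hessian pairs locally and broadcasts those, never the raw labels. This is a hypothetical sketch using logistic loss; `broadcast` is a stand-in for the federated communicator's broadcast primitive (note that gradients can still leak label information, which is why later iterations should add stronger privacy):

```python
# Hypothetical sketch: the label owner (rank 0) computes gradients and
# hessians from its private labels, then broadcasts only those values.
import numpy as np

def logistic_grad_hess(labels, margins):
    """First- and second-order gradients of the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-margins))
    return p - labels, p * (1.0 - p)

def broadcast(payload, root=0):
    """Placeholder for a federated-communicator broadcast from `root`."""
    return payload

labels = np.array([1.0, 0.0, 1.0])
margins = np.zeros(3)  # initial predictions, before the first tree
grad, hess = logistic_grad_hess(labels, margins)
grad, hess = broadcast((grad, hess), root=0)  # label owner is rank 0
print(grad)  # [-0.5  0.5 -0.5]
```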

Federated Inference

In horizontal federated learning, since each participant has all the features and labels, trained models can be shared to run inference/prediction locally. In vertical federated learning, however, this is no longer feasible. All the participants need to be online and work collaboratively to run inference. For batch inference, a federated learning job can be set up (for example, using NVFlare) to produce predictions. For online inference, we would need to set up services at participating sites to jointly produce online predictions, which is out of the scope of this design.
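
To make the collaborative-traversal idea concrete, here is a toy single-tree walk in which each split can only be evaluated by the worker that owns the feature; all structures here are hypothetical, and a real implementation would batch these exchanges (e.g. with the bitset encoding of partial predictions mentioned earlier) rather than ping-pong per node:

```python
# Toy collaborative inference over one tree. Internal nodes are tuples
# (feature, threshold, left, right); leaves are plain floats. Each worker
# holds a disjoint subset of this sample's features, so traversal must
# alternate between workers until a leaf is reached.
tree = (0, 0.5,
        (2, 1.0, -1.0, 0.3),
        (1, 2.0, 0.7, 1.5))

worker_features = {0: {0: 0.4}, 1: {1: 2.5}, 2: {2: 1.7}}

def owner_of(feature):
    """Find the rank of the worker that owns a given feature."""
    return next(r for r, feats in worker_features.items() if feature in feats)

def joint_predict(node):
    while isinstance(node, tuple):
        feat, thresh, left, right = node
        value = worker_features[owner_of(feat)][feat]  # only the owner sees it
        node = left if value < thresh else right
    return node

print(joint_predict(tree))  # 0.3
```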

Alternatives Considered

It may be possible to implement vertical federated learning without first adding support for a column data split mode in distributed training. However, since this would require extensive refactoring of the XGBoost code base without benefiting users outside federated learning, it may be too risky. Moreover, for very wide datasets (many features relative to a moderate number of rows), a column data split may be a useful feature for distributed training in its own right.
