In-memory inputs for column split and vertical federated learning #9619
Open
Description
We've recently added support for column-wise data split (feature parallelism) and vertical federated learning (#8424), but the Python interface is limited to text inputs and numpy arrays (#9365). We'd like to support other in-memory formats such as scipy sparse matrices, pandas data frames, cudf, and cupy.
One question is the meaning of passing in data_split_mode=COL. There are potentially two interpretations:
- We assume each worker has access to the full dataset. Passing in data_split_mode=COL would load the whole DMatrix, then split it by column according to the size of the cluster. The columns are split evenly into world_size slices, with each worker's rank determining which slice it gets. This is the approach currently used by the text inputs for feature parallel distributed training, but not for vertical federated learning.
- We assume each worker only has access to a subset of the total number of columns, with column indices starting from 0 on every worker. The whole DMatrix is a union of all the columns from all the workers, with column indices re-indexed starting from worker 0. This is the approach currently used for vertical federated learning.
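To make the difference concrete, here is a rough numpy-only sketch of the two interpretations. It does not call XGBoost at all: `slice_columns` and `global_column_offsets` are hypothetical helper names, and the use of `np.array_split` for the "even" split is an assumption about how slices might be cut, not a description of the actual loader.

```python
import numpy as np


def slice_columns(full, rank, world_size):
    """Interpretation 1: every worker holds the full matrix and keeps
    only its own slice of the columns, chosen by its rank.
    (np.array_split is an assumed stand-in for the even split.)"""
    return np.array_split(full, world_size, axis=1)[rank]


def global_column_offsets(local_num_cols):
    """Interpretation 2: every worker holds only its own columns,
    indexed from 0. The global index of a local column is the local
    index plus the number of columns held by all lower-ranked workers."""
    counts = np.asarray(local_num_cols)
    return np.concatenate(([0], np.cumsum(counts)[:-1]))


# A toy dataset with 2 rows and 6 columns, shared by 3 workers.
full = np.arange(12).reshape(2, 6)
world_size = 3

# Interpretation 1: load everything, then slice by rank.
slices = [slice_columns(full, r, world_size) for r in range(world_size)]

# Interpretation 2: each worker starts from its local columns only;
# the offsets say where those columns land in the combined matrix.
offsets = global_column_offsets([s.shape[1] for s in slices])

# Concatenating the local columns in rank order recovers the full matrix.
union = np.hstack(slices)
```

Under the first interpretation each worker pays the cost of materializing `full` before discarding most of it; under the second, only the local columns and the per-worker column counts are needed.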
Now that we want to support more in-memory inputs, it probably makes more sense to standardize on the second approach, since it seems wasteful to construct a full DMatrix in memory on every worker and then slice it by column.