In-memory inputs for column split and vertical federated learning #9619

We recently added support for column-wise data splitting (feature parallelism) and vertical federated learning (#8424), but the Python interface currently accepts only text inputs and NumPy arrays (#9365). We'd like to support other in-memory formats, such as SciPy sparse matrices, pandas DataFrames, cuDF, and CuPy.

One question is the meaning of passing in `data_split_mode=COL`. There are potentially two interpretations:

- We assume each worker has access to the full dataset. Passing in `data_split_mode=COL` loads the whole `DMatrix`, then splits it by column according to the size of the cluster: the columns are divided evenly into `world_size` slices, and each worker's `rank` determines which slice it gets. This is the approach currently used by text inputs for feature-parallel distributed training, but not for vertical federated learning.
- We assume each worker has access to only a subset of the columns, with column indices starting from 0 on every worker. The whole `DMatrix` is the union of the columns from all workers, with column indices renumbered globally in rank order, starting from worker 0. This is the approach currently used for vertical federated learning.
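To make the difference concrete, here is a rough NumPy-only sketch of the two interpretations. It does not call into XGBoost at all; the helper names `slice_full_matrix` and `global_column_offsets` are hypothetical, and using `np.array_split` for the "even" split is an assumption about how remainders would be handled, not a description of the existing implementation.

```python
import numpy as np


def slice_full_matrix(X, rank, world_size):
    """Interpretation 1: every worker holds the full matrix and keeps one column slice."""
    # Assumption: columns are divided as evenly as possible, in rank order.
    col_slices = np.array_split(np.arange(X.shape[1]), world_size)
    return X[:, col_slices[rank]]


def global_column_offsets(local_num_cols):
    """Interpretation 2: every worker holds only its own columns, indexed from 0.

    The global index of a local column is offset[rank] + local_index, where the
    offsets are the exclusive prefix sum of the per-worker column counts.
    """
    counts = np.asarray(local_num_cols)
    return np.concatenate(([0], np.cumsum(counts)[:-1]))


# Toy data: 2 rows, 6 columns, 3 workers.
X = np.arange(12).reshape(2, 6)
world_size = 3

# Interpretation 1: each worker materializes X, then slices by rank.
parts = [slice_full_matrix(X, r, world_size) for r in range(world_size)]

# Interpretation 2: each worker starts from its local part only; the global
# matrix is the column-wise union of the parts in rank order.
offsets = global_column_offsets([p.shape[1] for p in parts])
union = np.hstack(parts)
```

In the first case every worker pays for the full matrix before discarding most of it; in the second, only the per-worker column counts need to be shared to recover the global indexing.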

Now that we want to support more in-memory inputs, it probably makes more sense to standardize on the second approach, since it seems wasteful for every worker to construct the full `DMatrix` in memory only to slice it by column.
