In-memory inputs for column split and vertical federated learning #9619

@rongou

We recently added support for column-wise data split (feature parallelism) and vertical federated learning (#8424), but the Python interface currently accepts only text inputs and numpy arrays (#9365). We'd like to support other in-memory formats as well, such as scipy sparse matrices, pandas DataFrames, cudf, and cupy.

One question is the meaning of passing in data_split_mode=COL. There are potentially two interpretations:

  • We assume each worker has access to the full dataset. Passing in data_split_mode=COL loads the whole DMatrix, then splits it by column according to the size of the cluster: the columns are split evenly into world_size slices, and each worker's rank determines which slice it gets. This is the approach currently used by the text inputs for feature parallel distributed training, but not for vertical federated learning.
  • We assume each worker only has access to a subset of the columns, with column indices starting from 0 on every worker. The whole DMatrix is the union of the columns from all the workers, with column indices re-indexed globally starting from worker 0. This is the approach currently used for vertical federated learning.
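To make the difference concrete, here is a rough numpy-only sketch of the two interpretations. It does not call XGBoost at all. The helper names `slice_for_rank` and `global_column_offsets` are hypothetical, and using `np.array_split` for the "even" split is an assumption, since the actual boundary rule when columns don't divide evenly may differ.

```python
from typing import List

import numpy as np


def slice_for_rank(full: np.ndarray, rank: int, world_size: int) -> np.ndarray:
    """Interpretation 1: the worker sees all columns, then keeps one slice."""
    # Assumption: np.array_split stands in for the "even" split described
    # above; the real implementation may pick slice boundaries differently.
    col_groups = np.array_split(np.arange(full.shape[1]), world_size)
    return full[:, col_groups[rank]]


def global_column_offsets(local_widths: List[int]) -> List[int]:
    """Interpretation 2: offset added to each worker's local column index."""
    offsets = [0]
    for width in local_widths[:-1]:
        offsets.append(offsets[-1] + width)
    return offsets


# Interpretation 1: every worker starts from the same 2 x 6 matrix.
full = np.arange(12).reshape(2, 6)
world_size = 3
parts = [slice_for_rank(full, rank, world_size) for rank in range(world_size)]

# Interpretation 2: each worker holds only its own columns, indexed from 0.
# Workers may hold different numbers of columns, e.g. 3, 1 and 2.
local_parts = [full[:, 0:3], full[:, 3:4], full[:, 4:6]]
offsets = global_column_offsets([p.shape[1] for p in local_parts])

# Local column j on worker r corresponds to global column offsets[r] + j.
union = np.hstack(local_parts)
```

In the first case each worker has to materialize `full` before discarding most of it. In the second, a worker only ever touches its own entry of `local_parts`, and the offsets are the only global information it needs.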

Now that we want to support more in-memory inputs, it probably makes more sense to standardize on the second approach, since it seems wasteful to construct the full DMatrix in memory and then slice it by column.
