UMAP spectral initialization with int64_t nnz types. #7225

@aamijar

UMAP with large datasets dispatches the COO graph with uint64_t nnz types. This needs to be properly handled in spectral initialization. We are aiming to resolve #6910 with this.

There are several changes and considerations to be made when working on this.

  1. Expose int64_t types in the cuvs spectral embedding algorithm. Currently only int is supported.
  2. uint64_t in spectral embedding is not possible because there are downstream primitives that do not work with it (I forget the exact ones, but they were blockers when I was working on this.)
  3. There are several bugs in the raft kernel primitives for int64_t that need to be resolved.
  4. UMAP always hands off row/col indices as int. This could be a problem because the row offsets need to be int64 when converting to CSR.
  5. However, allocating space for int64 and copying again is problematic because it causes OOM.
  6. What is the best way to efficiently handle the int64 row/col support?
  7. It would be great if row/col were already int64 during _get_graph in UMAP. Is it possible to do this? I tried it, but it seems like an extensive change.
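
To illustrate item 4, here is a minimal host-side sketch of why the row offsets need to be int64 even when the row indices stay int. The helper name `coo_rows_to_csr_offsets` is hypothetical and is not the raft or cuvs API. It only shows that the offsets count nonzeros, so the last offset equals nnz and overflows a 32-bit int once nnz exceeds INT32_MAX.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper (not the raft/cuvs API).
// Row indices stay `int` because n_rows fits in 32 bits, but the offsets
// accumulate nonzero counts, so they are stored as 64-bit integers.
std::vector<std::int64_t> coo_rows_to_csr_offsets(const std::vector<int>& rows, int n_rows)
{
  std::vector<std::int64_t> offsets(static_cast<std::size_t>(n_rows) + 1, 0);
  for (int r : rows) {
    assert(r >= 0 && r < n_rows);
    ++offsets[static_cast<std::size_t>(r) + 1];  // per-row counts, shifted by one slot
  }
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    offsets[i] += offsets[i - 1];  // running sum turns counts into row offsets
  }
  return offsets;  // offsets.back() == rows.size() == nnz
}
```

For rows `{0, 0, 1, 3}` with `n_rows = 4`, this yields the offsets `{0, 2, 3, 3, 4}`.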

Update:
Now that I think about it, allocating int64 on the fly for the spectral embedding coo_to_csr step will have the same memory overhead as using row/col int64 in the COO creation upfront.

Scenario 1. COO in UMAP has row and col as int.
During coo_to_csr the output row_offsets must be int64. The col_indices could stay int, but that will not work with the cusparse spmv primitives in the Lanczos solver. Therefore we need to allocate a temporary int64 col_indices array.
Scenario 2. COO in UMAP has row and col as int64.
Again, during coo_to_csr the output row_offsets must be int64. Now we don't need a temporary col_indices array since it's already int64.
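
As a rough sketch of the temporary buffer in Scenario 1, the widening step could look like the following on the host. The helper `widen_col_indices` is a hypothetical name used only for illustration. On the device this would presumably be a small kernel or a transform over device memory rather than a host loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies 32-bit column indices into a freshly allocated
// 64-bit array. The new array is the extra nnz * sizeof(int64_t) allocation.
std::vector<std::int64_t> widen_col_indices(const std::vector<int>& cols)
{
  std::vector<std::int64_t> wide(cols.size());
  for (std::size_t i = 0; i < cols.size(); ++i) {
    assert(cols[i] >= 0);  // column indices are non-negative
    wide[i] = static_cast<std::int64_t>(cols[i]);
  }
  return wide;
}
```

The original int array is left untouched, so the temporary can be freed as soon as the solver is done with it.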

In both scenarios we incur the same amount of extra memory. Either we allocate a temporary col_indices array of size nnz * sizeof(int64), or we widen row/col to int64 upfront. In the second case each of the two arrays grows from int to int64, so the total increase is (nnz * sizeof(int)) + (nnz * sizeof(int)). This is the same increase as nnz * sizeof(int64).
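
The arithmetic can be checked with a small sketch. The function names are hypothetical, and the code assumes a 4-byte int, which the static_assert makes explicit.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(int) == 4, "this sketch assumes a 32-bit int");

// Scenario 1: one temporary int64 col_indices array of length nnz.
std::uint64_t scenario1_extra_bytes(std::uint64_t nnz)
{
  assert(nnz <= UINT64_MAX / sizeof(std::int64_t));  // guard against overflow
  return nnz * sizeof(std::int64_t);
}

// Scenario 2: row and col each grow from int to int64, so each array
// gains (sizeof(int64) - sizeof(int)) bytes per nonzero.
std::uint64_t scenario2_extra_bytes(std::uint64_t nnz)
{
  assert(nnz <= UINT64_MAX / sizeof(std::int64_t));  // guard against overflow
  return 2 * nnz * (sizeof(std::int64_t) - sizeof(int));
}
```

For example, with nnz = 3,000,000,000 both functions return 24,000,000,000 bytes (about 24 GB). The only difference is lifetime: the Scenario 1 buffer can be released after the spectral step, while the Scenario 2 increase lasts as long as the COO graph does.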

Therefore, scenario 1 is preferable, since it doesn't require extensive changes to the COO types in UMAP and needs only a temporary memory increase.
