UMAP spectral initialization with int64_t nnz types. #7225
Description
UMAP with large datasets dispatches the COO graph with uint64_t nnz types. This needs to be properly handled in spectral initialization. We are aiming to resolve #6910 with this.
There are several changes and considerations to be made when working on this.
- Exposing int64_t types in the cuVS spectral embedding algorithm. Currently there is only support for int.
- uint64_t in spectral embedding is not possible because there are downstream primitives that do not work with it (I forget the exact ones, but they were blockers when I was working on this.)
- There are several bugs in the raft kernel primitives for int64_t that need to be resolved.
- UMAP always hands off row/col indices as int. This could be a problem because, when converting to CSR, the row offsets need to be int64.
- However, allocating space for int64 and copying again is problematic because it causes OOM.
- What is the best way to efficiently handle the int64 row/col support?
- It would be great if row/col were already in int64 during _get_graph in UMAP. Is it possible to do this? I tried doing this; however, it seems like an extensive change.
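To illustrate why the row offsets need int64 even when the row/col indices themselves fit in int, here is a minimal pure-Python sketch of the offset computation in a COO to CSR conversion. It is not the raft or cuVS implementation; `coo_to_csr_offsets`, `fits_int32`, and the example sizes are hypothetical and only meant to show that the last offset equals nnz.

```python
INT32_MAX = 2**31 - 1


def coo_to_csr_offsets(rows, n_rows):
    """Build CSR row offsets from COO row indices (count, then prefix sum)."""
    offsets = [0] * (n_rows + 1)
    for r in rows:
        offsets[r + 1] += 1          # count the entries in each row
    for i in range(n_rows):
        offsets[i + 1] += offsets[i]  # inclusive prefix sum
    return offsets


def fits_int32(value):
    """True if value is representable as a signed 32-bit integer."""
    return -INT32_MAX - 1 <= value <= INT32_MAX


# Tiny example: 3 rows, 6 nonzeros.
rows = [0, 0, 1, 2, 2, 2]
offsets = coo_to_csr_offsets(rows, n_rows=3)

# The last offset is always nnz, so the offsets array must be able to hold nnz.
assert offsets[-1] == len(rows)

# Hypothetical large graph: the indices fit in int32, but nnz does not.
n_rows = 3_000_000
nnz = n_rows * 1_000
indices_fit = fits_int32(n_rows - 1)   # largest row/col index
offsets_fit = fits_int32(nnz)          # largest row offset
```

In this made-up case the largest row/col index is well below the int32 limit while the final row offset is not, which is the situation described in the bullets above.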
Update:
Now that I think about it, allocating int64 on the fly for the spectral embedding coo_to_csr step will have the same memory overhead as using row/col int64 in the COO creation upfront.
Scenario 1. COO in umap has row and col with int.
Then during coo_to_csr the output row_offsets must be int64. The col_indices could be int, but that will not work with the cuSPARSE spmv primitives in the Lanczos solver. Therefore we need to allocate space for a temporary col_indices int64 array.
Scenario 2. COO in umap has row and col with int64.
Again during coo_to_csr the output row_offsets must be int64. Now we don't need a temp col_indices since it is already in int64.
In both scenarios we create the same amount of extra memory overhead. Either we have the temp col_indices of size nnz * sizeof(int64), or we need to increase the size of row/col to int64 upfront. This increase would be from int, so each of the two arrays grows by nnz * sizeof(int), for a total increase of (nnz * sizeof(int)) + (nnz * sizeof(int)). This is the same as nnz * sizeof(int64).
Therefore, scenario 1 is preferable, since it doesn't require extensive changes to the COO umap types and needs only a temporary memory increase.
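The byte accounting for the two scenarios can be checked with a short sketch. It assumes int is 4 bytes and int64 is 8 bytes, and it leaves out the int64 row_offsets array because both scenarios allocate it. The function names and the example nnz are hypothetical.

```python
SIZEOF_INT = 4    # assumption: int is 32-bit
SIZEOF_INT64 = 8


def scenario1_extra_bytes(nnz):
    """int COO: one temporary int64 col_indices array during coo_to_csr."""
    return nnz * SIZEOF_INT64


def scenario2_extra_bytes(nnz):
    """int64 COO: rows and cols each widen from int to int64 upfront."""
    per_array_growth = nnz * (SIZEOF_INT64 - SIZEOF_INT)
    return 2 * per_array_growth


nnz = 3_000_000_000  # hypothetical nnz beyond the int32 range
extra1 = scenario1_extra_bytes(nnz)
extra2 = scenario2_extra_bytes(nnz)
print(f"scenario 1: {extra1 / 2**30:.1f} GiB extra (temporary)")
print(f"scenario 2: {extra2 / 2**30:.1f} GiB extra (for the lifetime of the COO)")
```

Both numbers come out equal; the difference is only in how long the extra allocation lives.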