BUG in the logarithm detection algorithm - unreliable DES score

The issue originates from the [guess_is_log](https://github.com/ArcInstitute/pdex/blob/c5fd2ecb892c6b6634bc9047ec83ce38627e1684/src/pdex/_utils.py#L10-L32) function of pdex. This function checks whether the sum of preprocessed counts is below 15, under the assumption that e¹⁵ − 1 ≈ 3.27M counts per cell is an unlikely value.

However, the function incorrectly assumes that the sum of counts is log-transformed, whereas in reality, each gene’s count is log-transformed individually.

Example:
Assume the median UMI count per cell is 10k (the challenge uses 50k+). If a cell has 1 count for each of 10,000 genes and is median-normalized, each gene keeps a normalized count of 1. Applying log1p yields 0.69 per gene. Summing these values, as guess_is_log does, returns 6931, which is far above the threshold of 15, so the log-transformed data is (incorrectly) treated as non-log-transformed. This happens even if you submit count data through cell-eval: cell-eval correctly detects that the data is on the count level (it uses its own function), log-normalizes it, and then passes it on to pdex.
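
The arithmetic above can be reproduced with a short NumPy sketch. The gene count of 10,000 is an assumption chosen so that the cell total equals the 10k median. `sum_based_guess` and `max_based_guess` are hypothetical stand-ins written for this illustration; they are not the pdex implementation, only the sum check as described in this issue next to one possible per-value check.

```python
import numpy as np

n_genes = 10_000              # assumed: one count in each of 10,000 genes
median_umi = 10_000           # median UMI count per cell from the example
counts = np.ones(n_genes)

# Median normalization: scale the cell total to the median UMI count.
normalized = counts / counts.sum() * median_umi   # 1.0 per gene
logged = np.log1p(normalized)                     # about 0.693 per gene

per_cell_sum = logged.sum()   # about 6931.47
per_gene_max = logged.max()   # about 0.693

THRESHOLD = 15  # threshold described in this issue


def sum_based_guess(x, threshold=THRESHOLD):
    """Hypothetical stand-in for the sum-based check described above."""
    return bool(x.sum() < threshold)


def max_based_guess(x, threshold=THRESHOLD):
    """Hypothetical alternative: compare the largest single value instead."""
    return bool(x.max() < threshold)


print(f"sum of log1p values: {per_cell_sum:.2f}")
print("sum-based check says log-transformed:", sum_based_guess(logged))
print("max-based check says log-transformed:", max_based_guess(logged))
```

In this sketch the sum-based check rejects data that is log-transformed, while a check on individual values accepts it, because log1p is applied per gene and not to the cell total.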

Impact:
This will almost certainly distort fold-change estimates and, crucially, their rankings. This is particularly harmful when your model detects more DEGs than the ground truth, because the top N DEGs (ranked by absolute fold change) will then be misordered. Such models are unfairly penalized in the DES score due to this bug.
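
A toy sketch of how the ranking can flip. It assumes that the downstream fold change is a plain ratio of group means on whatever scale it receives, so that misdetected log data is never back-transformed; this mechanism is an assumption for illustration, not a confirmed detail of pdex. The gene names and expression values are hypothetical, and every cell in a group is assumed to share the same value, so the mean of the log values equals the log of the mean.

```python
import numpy as np

# Hypothetical group means on the count scale: (reference, perturbed).
genes = {
    "gene_low": (0.1, 0.2),       # true fold change 2
    "gene_high": (100.0, 300.0),  # true fold change 3
}


def fold_change(ref, pert):
    """Plain ratio of group means."""
    return pert / ref


# Fold change on the count scale (what a correct back-transform would give).
true_fc = {g: fold_change(r, p) for g, (r, p) in genes.items()}

# Fold change taken directly on log1p values (log data mistaken for counts).
naive_fc = {g: fold_change(np.log1p(r), np.log1p(p)) for g, (r, p) in genes.items()}


def rank_by_abs_log2(fc):
    """Rank genes by absolute log2 fold change, largest first."""
    return sorted(fc, key=lambda g: abs(np.log2(fc[g])), reverse=True)


rank_true = rank_by_abs_log2(true_fc)
rank_naive = rank_by_abs_log2(naive_fc)

print("count-scale ranking:", rank_true)
print("log-scale ranking:  ", rank_naive)
```

Under these assumptions the low-expression gene overtakes the high-expression gene, because log1p compresses large values much more strongly than small ones.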

This problem was reported in previous issues:
- https://github.com/ArcInstitute/pdex/issues/41
- https://github.com/ArcInstitute/cell-eval/issues/184

We find that the guess_is_log function is not merely suboptimal but fundamentally inappropriate. This is especially a problem for the VCC challenge, where [cell-eval](https://github.com/ArcInstitute/cell-eval) is used for evaluation and participants have no way of specifying pdex kwargs directly, because submissions are evaluated on the server.

Thanks a lot for looking into our bug report. We will report this to cell-eval as well.