feat: include JAR SHA-1 digest in JSON report output #10606

kotakanbe · 2026-05-01T07:04:01Z

Background

The JAR analyzer (pkg/fanal/analyzer/language/java/jar/jar.go) computes the SHA-1 digest of the .jar/.war/.ear/.par file it scans, and includes it in SPDX (checksumValue) and CycloneDX (hashes[].content) output. The types.Package.Digest field exists regardless of output format, but --format json discards it because of the format gate at pkg/commands/artifact/run.go:625-628. The data is computed for the SBOM path and dropped only for JSON.

Why scope this to JAR

The FileChecksum option also flips digest computation on for gemspec / nodejs/pkg / python/packaging / conda/meta. For those analyzers the file being hashed is a small manifest text (package.json, .gemspec, METADATA), not the package's binary artifact, and no public registry exposes a SHA-based metadata lookup. The use cases below only work for JAR, and this proposal does not change behavior for the other language analyzers.

Use cases that need the JAR digest from `--format json`

1. Maven Central canonicalization

https://search.maven.org/solrsearch/select?q=1:<sha1> deterministically maps SHA-1 → (groupId, artifactId, version) and is the only authoritative way to recover GAV for a JAR with a missing or wrong pom.properties. Trivy itself calls this API in pkg/dependency/parser/java/jar/jar.go as its groupId fallback during scanning; downstream consumers want to do the same lookup post hoc, e.g. when reconciling against a separate SBOM that arrived without groupId.
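A minimal Go sketch of that post hoc lookup, for illustration only. It builds the query URL and decodes a response body; the response shape (`response.docs[]` with `g`, `a`, `v` keys) and the `wt=json` parameter are assumptions about the search API, and `lookupURL` / `parseGAV` are hypothetical helper names, not Trivy functions. The HTTP call itself is left out so the example runs offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// GAV is a Maven coordinate. The JSON keys g/a/v are an assumption about
// the shape of the Maven Central search response.
type GAV struct {
	GroupID    string `json:"g"`
	ArtifactID string `json:"a"`
	Version    string `json:"v"`
}

// searchResponse mirrors only the part of the response this sketch reads.
type searchResponse struct {
	Response struct {
		Docs []GAV `json:"docs"`
	} `json:"response"`
}

// lookupURL builds the SHA-1 query URL for a hex digest.
func lookupURL(sha1Hex string) string {
	q := url.Values{}
	q.Set("q", "1:"+sha1Hex)
	q.Set("wt", "json")
	return "https://search.maven.org/solrsearch/select?" + q.Encode()
}

// parseGAV returns the first matching coordinate, or ok=false when the
// digest is unknown to the index.
func parseGAV(body []byte) (gav GAV, ok bool, err error) {
	var res searchResponse
	if err := json.Unmarshal(body, &res); err != nil {
		return GAV{}, false, err
	}
	if len(res.Response.Docs) == 0 {
		return GAV{}, false, nil
	}
	return res.Response.Docs[0], true, nil
}

func main() {
	fmt.Println(lookupURL("0123abcd"))

	// Canned body standing in for a real HTTP response.
	sample := []byte(`{"response":{"docs":[{"g":"org.example","a":"demo","v":"1.0.0"}]}}`)
	gav, ok, err := parseGAV(sample)
	fmt.Println(gav, ok, err)
}
```

A consumer would feed the digest from the JSON report into `lookupURL`, which is exactly the input that is missing today.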

2. Cross-tool / cross-source artifact identity

The same physical JAR can land in multiple inputs (CI artifact, deployed image, vendor SBOM) under slightly different PURLs (groupId stripped, version normalized, classifier added). SHA-1 is the only format-independent identity that lets a downstream system say "these are the same bytes." Without digest in JSON, consumers must either re-hash the file themselves (requires having the file) or skip dedup.
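As a rough illustration of digest-based dedup, the sketch below groups package observations from several inputs by SHA-1 and ignores the PURL entirely. The `Observation` type and `groupByDigest` are hypothetical names for this example, not part of Trivy or any consumer.

```go
package main

import (
	"fmt"
	"sort"
)

// Observation is one sighting of a JAR in some input (hypothetical type).
type Observation struct {
	Source string // e.g. "ci-artifact", "deployed-image", "vendor-sbom"
	PURL   string
	SHA1   string // hex digest; empty when the input carried none
}

// groupByDigest buckets observations by SHA-1. Entries without a digest are
// skipped because they cannot be matched on bytes.
func groupByDigest(obs []Observation) map[string][]Observation {
	groups := make(map[string][]Observation)
	for _, o := range obs {
		if o.SHA1 == "" {
			continue
		}
		groups[o.SHA1] = append(groups[o.SHA1], o)
	}
	return groups
}

func main() {
	obs := []Observation{
		{Source: "ci-artifact", PURL: "pkg:maven/org.example/demo@1.0.0", SHA1: "aaa"},
		{Source: "vendor-sbom", PURL: "pkg:maven/demo@1.0", SHA1: "aaa"},
		{Source: "deployed-image", PURL: "pkg:maven/org.example/other@2.0.0", SHA1: "bbb"},
		{Source: "trivy-json", PURL: "pkg:maven/org.example/demo@1.0.0"}, // no digest
	}
	groups := groupByDigest(obs)

	// Sort keys so the output order is stable.
	digests := make([]string, 0, len(groups))
	for d := range groups {
		digests = append(digests, d)
	}
	sort.Strings(digests)
	for _, d := range digests {
		fmt.Printf("%s seen in %d inputs\n", d, len(groups[d]))
	}
}
```

The last entry shows the current situation for `--format json`: with no digest, the observation cannot join any group.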

3. Tamper / repackaging detection

Maven Central publishes (groupId, artifactId, version) → expected SHA-1. Comparing scan-time SHA-1 against the published value flags repackaged JARs (shaded, stripped, modified). This is a security-relevant check that is currently impossible from a JSON scan result alone.
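The comparison itself is small. A sketch, assuming the scan digest may carry a `sha1:` prefix while the published value is bare hex; `compareDigest` and the `Verdict` values are hypothetical names for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// Verdict is the outcome of comparing a scanned digest with a published one.
type Verdict string

const (
	Match    Verdict = "match"
	Mismatch Verdict = "mismatch"
	Unknown  Verdict = "unknown" // one side is missing, so nothing can be concluded
)

// normalizeSHA1 lowercases the value and strips an optional "sha1:" prefix.
func normalizeSHA1(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	return strings.TrimPrefix(s, "sha1:")
}

// compareDigest reports whether the scanned bytes match the published ones.
func compareDigest(scanDigest, publishedSHA1 string) Verdict {
	scan := normalizeSHA1(scanDigest)
	published := normalizeSHA1(publishedSHA1)
	if scan == "" || published == "" {
		return Unknown
	}
	if scan == published {
		return Match
	}
	return Mismatch
}

func main() {
	fmt.Println(compareDigest("sha1:ABC123", "abc123")) // same bytes
	fmt.Println(compareDigest("sha1:abc123", "def456")) // repackaged or modified
	fmt.Println(compareDigest("", "abc123"))            // digest absent from the report
}
```

With the current JSON output every package falls into the `Unknown` case, which is why the check cannot be done from the report alone.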

4. Output-format symmetry

--format spdx-json and --format cyclonedx already include the SHA-1. --format json does not. Producing equivalent reports requires picking a less ergonomic format purely to retain the field.

5. Supplements PR #10178

#10178 (feat(purl): add checksum qualifier when package digest is set, currently draft / stale) appends ?checksum=sha1:<hex> to the PURL when Digest is non-empty, but under the current gate Digest is always empty in --format json, so the qualifier never appears. Exposing the digest in JSON makes that PR's qualifier actually show up.
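To make the dependency concrete, here is a simplified sketch of appending such a qualifier. It is not the code from #10178; `withChecksum` is a hypothetical helper that does plain string handling and skips the encoding and qualifier-ordering rules a real PURL library would apply.

```go
package main

import (
	"fmt"
	"strings"
)

// withChecksum appends a checksum qualifier to a PURL string.
// An empty digest leaves the PURL unchanged, which is the behavior
// seen today under --format json.
func withChecksum(purl, digest string) string {
	if digest == "" {
		return purl
	}
	// Qualifiers must come before an optional "#subpath".
	base, subpath, hasSubpath := strings.Cut(purl, "#")
	sep := "?"
	if strings.Contains(base, "?") {
		sep = "&"
	}
	out := base + sep + "checksum=" + digest
	if hasSubpath {
		out += "#" + subpath
	}
	return out
}

func main() {
	purl := "pkg:maven/org.example/demo@1.0.0"
	fmt.Println(withChecksum(purl, ""))            // digest gated off
	fmt.Println(withChecksum(purl, "sha1:abc123")) // digest available
}
```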

Concrete consumer

Vuls consumes Trivy --format json output via its trivy-to-vuls converter. It uses the JAR SHA-1 to canonicalize GAV against Maven Central in detector/library.go (regenerating the PURL after SHA-1 lookup, see commit c004b78). Today this code path silently no-ops on trivy-derived ScanResults because Trivy drops the digest in the JSON output.

Cost

calculateDigest (pkg/fanal/analyzer/language/analyze.go:115-125) seeks the file back to offset 0 after the parser runs and computes SHA-1 over the full contents, which adds one more full-file read pass. The zip parser only touches small entries (MANIFEST.MF, pom.properties), so the bulk of the JAR (the .class files) is read only for the digest. Folding the hash into the parser's own reads is not viable because zip parsing is random-access rather than a single sequential stream.
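The extra pass has roughly the following shape. This is a standalone sketch of the seek-then-hash pattern described above, not a copy of Trivy's calculateDigest; `digestFromStart` and `simulateScan` are hypothetical names, and the `sha1:` prefix on the result is an assumption about the digest string format.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// digestFromStart rewinds the reader and hashes the whole file,
// regardless of where earlier reads left the offset.
func digestFromStart(r io.ReadSeeker) (string, error) {
	if _, err := r.Seek(0, io.SeekStart); err != nil {
		return "", err
	}
	h := sha1.New()
	// This copy is the additional full-file read pass.
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return "sha1:" + hex.EncodeToString(h.Sum(nil)), nil
}

// simulateScan stands in for "parse, then digest": it consumes one byte
// as the parser would, then computes the digest over the full content.
func simulateScan(content string) (string, error) {
	r := strings.NewReader(content)
	if _, err := io.CopyN(io.Discard, r, 1); err != nil && err != io.EOF {
		return "", err
	}
	return digestFromStart(r)
}

func main() {
	d, err := simulateScan("abc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d)
}
```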

Local measurement: pure crypto/sha1 peaks at ~1.8 GiB/s (warm cache, single thread). Trivy on a 23-file, 335 MiB JAR directory:

| setup | parallel | mean | delta |
|---|---|---|---|
| baseline | 5 (default) | 172 ms | n/a |
| JAR digest enabled | 5 | 222 ms | +50 ms (+29%) |
| baseline | 1 | 338 ms | n/a |
| JAR digest enabled | 1 | 527 ms | +190 ms (+56%) |

Cold disk adds I/O time proportional to JAR bytes (e.g. ~80 MB Spring Boot fat jar: ~50 ms warm, several hundred ms cold depending on disk).
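The warm-cache figure can be sanity-checked with a back-of-the-envelope estimate: bytes divided by hash throughput. The sketch below assumes "80 MB" means 80,000,000 bytes and uses the ~1.8 GiB/s peak quoted above; `hashPassDuration` is a hypothetical helper, and the result covers hashing only, not disk I/O.

```go
package main

import (
	"fmt"
	"time"
)

// hashPassDuration estimates the time for one hashing pass over sizeBytes
// at a sustained throughput of bytesPerSecond. Disk I/O is not modeled.
func hashPassDuration(sizeBytes int64, bytesPerSecond float64) time.Duration {
	seconds := float64(sizeBytes) / bytesPerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	const gib = 1 << 30
	d := hashPassDuration(80_000_000, 1.8*gib)
	fmt.Println(d.Round(time.Millisecond))
}
```

The estimate lands in the low tens of milliseconds, the same order as the ~50 ms warm figure; the remainder is plausibly read overhead, and cold-disk time comes on top of it.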

Implementation options

Happy to PR any of these; opening this discussion to align before finalizing.

A. Always-on for JAR (recommended). Bypass the format gate when fileType == types.Jar. No new flag. The cost is predictable and bounded by the JAR bytes scanned.
PoC: diff against main — 1-line production change in pkg/fanal/analyzer/language/java/jar/jar.go + mechanical test updates (existing tests asserted empty digest).
B. New flag --jar-checksum. Explicit opt-in, zero cost when off, but adds permanent CLI surface.
C. Auto-enable when --list-all-pkgs is set. Piggybacks on the existing "verbose JSON" intent, no new flag, but introduces an implicit coupling.
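The three options differ only in the predicate that decides whether to compute the digest. A side-by-side sketch with simplified stand-ins: `scanOptions`, the string file types, and the `JarChecksum` field are hypothetical and do not mirror Trivy's actual types or the code in run.go; the SBOM format list is limited to the two formats named in this post, and option C is assumed to stay JAR-scoped.

```go
package main

import "fmt"

// scanOptions is a hypothetical, reduced view of the relevant CLI options.
type scanOptions struct {
	Format      string
	ListAllPkgs bool // --list-all-pkgs
	JarChecksum bool // --jar-checksum, the new flag proposed in option B
}

// isSBOMFormat approximates the existing format gate.
func isSBOMFormat(format string) bool {
	switch format {
	case "spdx-json", "cyclonedx":
		return true
	}
	return false
}

// digestCurrent: today, only SBOM formats trigger digest computation.
func digestCurrent(o scanOptions, _ string) bool {
	return isSBOMFormat(o.Format)
}

// digestOptionA: JAR files always get a digest.
func digestOptionA(o scanOptions, fileType string) bool {
	return isSBOMFormat(o.Format) || fileType == "jar"
}

// digestOptionB: JAR files get a digest only when the new flag is set.
func digestOptionB(o scanOptions, fileType string) bool {
	return isSBOMFormat(o.Format) || (fileType == "jar" && o.JarChecksum)
}

// digestOptionC: JAR files get a digest when --list-all-pkgs is set.
func digestOptionC(o scanOptions, fileType string) bool {
	return isSBOMFormat(o.Format) || (fileType == "jar" && o.ListAllPkgs)
}

func main() {
	opts := scanOptions{Format: "json"}
	fmt.Println("current: ", digestCurrent(opts, "jar"))
	fmt.Println("option A:", digestOptionA(opts, "jar"))
	fmt.Println("option B:", digestOptionB(opts, "jar"))
	fmt.Println("option C:", digestOptionC(opts, "jar"))
}
```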

I'd default to A. Happy to switch to B or C if you'd rather avoid the default-on cost change for JAR-heavy scans.

kotakanbe · 2026-05-01T07:11:49Z

After posting this I realized the framing here is unnecessarily narrow.

SBOM round-trip integrity: the more general argument

Trivy's --format spdx-json and --format cyclonedx already include the SHA-1 digest for every analyzer that supports FileChecksum — not just java/jar but also nodejs/pkg (package.json), ruby/gemspec (.gemspec), python/packaging (METADATA), python/packaging/egg (.egg-info), and conda/meta. The checksum is part of those SBOMs by deliberate choice (#3888, #7507).

A downstream tool that ingests --format json, persists it, and later re-emits an SBOM (CycloneDX or SPDX) for its own consumers cannot reproduce Trivy's own SBOM faithfully today — every non-JAR checksum is silently dropped on the JSON hop. That is a regression of SBOM data fidelity caused by a gating decision (pkg/commands/artifact/run.go:625-628) that wasn't necessarily intended to gate user-facing data, only to avoid the I/O cost when the active output format didn't need it.
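A sketch of the re-emit step shows where the data goes missing. It assumes the package entry in Trivy's JSON would expose the digest under a `Digest` key in `algorithm:hex` form (an assumption about the serialized field), and maps it to a CycloneDX-style hash object; `toCycloneDXHash` and the struct names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// reportPackage is a reduced, assumed view of a package entry in the JSON report.
type reportPackage struct {
	Name   string `json:"Name"`
	Digest string `json:"Digest"`
}

// cdxHash is the shape of one CycloneDX hashes[] entry.
type cdxHash struct {
	Alg     string `json:"alg"`
	Content string `json:"content"`
}

// toCycloneDXHash converts "sha1:<hex>" into a CycloneDX hash entry.
// ok is false when the digest is absent or uses an unmapped algorithm.
func toCycloneDXHash(digest string) (h cdxHash, ok bool) {
	alg, hexValue, found := strings.Cut(digest, ":")
	if !found || hexValue == "" {
		return cdxHash{}, false
	}
	names := map[string]string{"sha1": "SHA-1", "sha256": "SHA-256"}
	name, known := names[alg]
	if !known {
		return cdxHash{}, false
	}
	return cdxHash{Alg: name, Content: hexValue}, true
}

func main() {
	inputs := []string{
		`{"Name":"demo","Digest":"sha1:a9993e364706816aba3e25717850c26c9cd0d89d"}`,
		`{"Name":"demo"}`, // what a consumer receives today
	}
	for _, in := range inputs {
		var p reportPackage
		if err := json.Unmarshal([]byte(in), &p); err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		h, ok := toCycloneDXHash(p.Digest)
		fmt.Println(p.Name, h, ok)
	}
}
```

In the second input there is nothing to convert, so the re-emitted SBOM has no hash entry even though Trivy's own SBOM output for the same scan would.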

Cost reconsidered

For the non-JAR analyzers the cost is essentially free: the file being hashed is a small manifest text (typically 1–10 KB), so adding SHA-1 for those is microseconds per file with negligible additional I/O. The 29–56% overhead measured in the original post is dominated by JAR bytes; non-JAR analyzers contribute approximately nothing.

Revised proposal

Option A becomes "remove the format gate entirely" rather than "bypass for JAR." All analyzers that already opt in to FileChecksum under SBOM formats would also produce digests under --format json, matching what Trivy's own SBOM emits. Cost stays bounded by JAR bytes scanned (the same number measured above).

The JAR-specific use cases (Maven Central canonicalization, tamper detection) still motivate JAR most strongly, but the SBOM round-trip argument generalizes: it's a fidelity concern, not a JAR concern.

I'm happy to update the PoC branch to drop the gate entirely if option A in this widened scope is the direction you'd prefer. The widened implementation is the same diff size (drop the if in run.go, hard-code fileChecksum := true) and the test impact is similar (existing analyzer tests that asserted empty digest under FileChecksum: false will need the digest filled in, mechanical).

