Replies: 1 comment
After posting this I realized the framing here is unnecessarily narrow.
SBOM round-trip integrity: the more general argument
Trivy's A downstream tool that ingests
Cost reconsidered
For the non-JAR analyzers the cost is essentially free: the file being hashed is a small manifest text (typically 1–10 KB), so adding SHA-1 for those is microseconds per file with negligible additional I/O. The 30–56% overhead measured in the original post is JAR-bytes-dominated; non-JAR analyzers contribute approximately nothing.
Revised proposal
Option A becomes "remove the format gate entirely" rather than "bypass for JAR." All analyzers that already opt in to
The JAR-specific use cases (Maven Central canonicalization, tamper detection) still motivate JAR most strongly, but the SBOM round-trip argument generalizes: it's a fidelity concern, not a JAR concern.
I'm happy to update the PoC branch to drop the gate entirely if option A in this widened scope is the direction you'd prefer. The widened implementation is the same diff size (drop the
Background
The JAR analyzer (pkg/fanal/analyzer/language/java/jar/jar.go) computes the SHA-1 digest of the .jar/.war/.ear/.par file it scans and includes it in SPDX (checksumValue) and CycloneDX (hashes[].content) output. The types.Package.Digest field exists regardless of output format, but --format json discards it because of the format gate at pkg/commands/artifact/run.go:625-628. The data is computed for the SBOM path and dropped only for JSON.
Why scope this to JAR
The FileChecksum option also flips digest computation on for gemspec / nodejs/pkg / python/packaging / conda/meta. For those analyzers the file being hashed is a small manifest text (package.json, .gemspec, METADATA), not the package's binary artifact, and no public registry exposes a SHA-based metadata lookup. The use cases below only work for JAR, and this proposal does not change behavior for the other language analyzers.
Use cases that need the JAR digest from --format json
1. Maven Central canonicalization
https://search.maven.org/solrsearch/select?q=1:<sha1> deterministically maps SHA-1 → (groupId, artifactId, version) and is the only authoritative way to recover GAV for a JAR with a missing or wrong pom.properties. Trivy itself calls this API in pkg/dependency/parser/java/jar/jar.go as its groupId fallback during scanning; downstream consumers want to do the same lookup post hoc, e.g. when reconciling against a separate SBOM that arrived without groupId.
2. Cross-tool / cross-source artifact identity
The same physical JAR can land in multiple inputs (CI artifact, deployed image, vendor SBOM) under slightly different PURLs (groupId stripped, version normalized, classifier added). SHA-1 is the only format-independent identity that lets a downstream system say "these are the same bytes." Without digest in JSON, consumers must either re-hash the file themselves (requires having the file) or skip dedup.
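Use cases 1 and 2 both key on the SHA-1 alone, which a minimal Go sketch can illustrate. It assumes a downstream consumer that already holds the hex digest for each package record. The pkgRecord type and both function names are hypothetical, and only the query URL shape is taken from the text above.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// pkgRecord is a hypothetical downstream record: where the package was
// seen, the PURL that source reported, and the JAR's SHA-1 in hex.
type pkgRecord struct {
	Source string
	PURL   string
	SHA1   string
}

// mavenCentralQuery builds the SHA-1 lookup URL quoted in use case 1.
// It rejects anything that is not a 20-byte hex digest.
func mavenCentralQuery(sha1Hex string) (string, error) {
	s := strings.ToLower(strings.TrimSpace(sha1Hex))
	raw, err := hex.DecodeString(s)
	if err != nil || len(raw) != 20 {
		return "", fmt.Errorf("not a SHA-1 hex digest: %q", sha1Hex)
	}
	return "https://search.maven.org/solrsearch/select?q=1:" + s, nil
}

// groupBySHA1 buckets records by digest, so records that describe the
// same bytes end up together even when their PURLs differ.
// Records without a digest cannot be matched and are skipped.
func groupBySHA1(recs []pkgRecord) map[string][]pkgRecord {
	groups := make(map[string][]pkgRecord)
	for _, r := range recs {
		if r.SHA1 == "" {
			continue
		}
		key := strings.ToLower(r.SHA1)
		groups[key] = append(groups[key], r)
	}
	return groups
}

func main() {
	// Placeholder digest and PURLs, for illustration only.
	const digest = "a9993e364706816aba3e25717850c26c9cd0d89d"
	recs := []pkgRecord{
		{Source: "ci-artifact", PURL: "pkg:maven/com.example/demo@1.0.0", SHA1: digest},
		{Source: "vendor-sbom", PURL: "pkg:maven/demo@1.0", SHA1: digest},
		{Source: "image-scan", PURL: "pkg:maven/com.example/other@2.0.0"},
	}
	for sha, group := range groupBySHA1(recs) {
		url, err := mavenCentralQuery(sha)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%d records share %s, lookup: %s\n", len(group), sha, url)
	}
}
```

The third record shows the current situation for JSON-derived input: with no digest, it can be neither grouped nor looked up.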
3. Tamper / repackaging detection
Maven Central publishes (groupId, artifactId, version) → expected SHA-1. Comparing scan-time SHA-1 against the published value flags repackaged JARs (shaded, stripped, modified). This is a security-relevant check that is currently impossible from a JSON scan result alone.
4. Output-format symmetry
--format spdx-json and --format cyclonedx already include the SHA-1. --format json does not. Producing equivalent reports requires picking a less ergonomic format purely to retain the field.
5. Supplements PR #10178
#10178 (feat(purl): add checksum qualifier when package digest is set, currently draft / stale) appends ?checksum=sha1:<hex> to the PURL when Digest is non-empty, but under the current gate Digest is always empty in --format json, so the qualifier never appears. Exposing the digest in JSON makes that PR's qualifier actually show up.
Concrete consumer
Vuls consumes Trivy --format json output via its trivy-to-vuls converter. It uses the JAR SHA-1 to canonicalize GAV against Maven Central in detector/library.go (regenerating the PURL after SHA-1 lookup, see commit c004b78). Today this code path silently no-ops on Trivy-derived ScanResults because Trivy drops the digest in the JSON output.
Cost
calculateDigest (pkg/fanal/analyzer/language/analyze.go:115-125) seeks the file back to 0 after the parser runs and computes SHA-1 over the full contents, which is an additional full-file read pass. The zip parser only touches small entries (MANIFEST.MF, pom.properties), so the bulk of the JAR (the .class files) is read only for the digest. Stream-merging is not viable due to zip's random-access structure.
Local measurement: pure crypto/sha1 peaks at ~1.8 GiB/s (warm cache, single thread). Trivy on a 23-file, 335 MiB JAR directory:
Cold disk adds I/O time proportional to JAR bytes (e.g. ~80 MB Spring Boot fat jar: ~50 ms warm, several hundred ms cold depending on disk).
Implementation options
Happy to PR any of these; opening this discussion to align before finalizing.
A. Bypass the format gate when fileType == types.Jar. No new flag. Cost is bounded to JAR bytes scanned, predictable. PoC: diff against main, a 1-line production change in pkg/fanal/analyzer/language/java/jar/jar.go plus mechanical test updates (existing tests asserted empty digest).
B. New flag --jar-checksum. Explicit opt-in, zero cost when off, but adds permanent CLI surface.
C. Enable when --list-all-pkgs is set. Piggybacks on the existing "verbose JSON" intent, no new flag, but introduces an implicit coupling.
I'd default to A. Happy to switch to B or C if you'd rather avoid the default-on cost change for JAR-heavy scans.
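To compare the three choices side by side (bypass for JAR, a --jar-checksum flag, or coupling to --list-all-pkgs), here is a simplified sketch of each as a predicate. Every name in it is hypothetical: scanOpts and the four functions model the decision only and do not mirror Trivy's real option structs or the code at the gate.

```go
package main

import "fmt"

// scanOpts is a hypothetical, simplified view of what the gate sees.
type scanOpts struct {
	SBOMFormat  bool   // true for SBOM outputs such as spdx-json or cyclonedx
	FileType    string // analyzer file type, "jar" for the JAR analyzer
	JarChecksum bool   // the flag proposed in option B
	ListAllPkgs bool   // the existing flag reused in option C
}

// currentGate models today's behavior as described in this post:
// the digest survives only on the SBOM path.
func currentGate(o scanOpts) bool { return o.SBOMFormat }

// optionA always keeps the digest for JAR files.
func optionA(o scanOpts) bool { return o.SBOMFormat || o.FileType == "jar" }

// optionB keeps it for JAR files only when the new flag is set.
func optionB(o scanOpts) bool {
	return o.SBOMFormat || (o.FileType == "jar" && o.JarChecksum)
}

// optionC keeps it for JAR files only when --list-all-pkgs is set.
func optionC(o scanOpts) bool {
	return o.SBOMFormat || (o.FileType == "jar" && o.ListAllPkgs)
}

func main() {
	// A plain --format json scan of a JAR with no extra flags.
	o := scanOpts{FileType: "jar"}
	fmt.Println("current:", currentGate(o))
	fmt.Println("A:", optionA(o))
	fmt.Println("B:", optionB(o))
	fmt.Println("C:", optionC(o))
}
```

For a default JSON scan, only the first option changes the output, which is the default-on cost trade-off noted above.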