Add blob caching to GitHub provider to reduce duplicate API calls
Summary
The getObject() function in the GitHub provider fetches blob content from the GitHub API without any caching. When the same file (identified by its blob SHA) is requested more than once while processing a single webhook event, each request results in a separate API call.
Problem
Location: pkg/provider/github/github.go:596-609
func (v *Provider) getObject(ctx context.Context, sha string, runevent *info.Event) ([]byte, error) {
	blob, _, err := wrapAPI(v, "get_blob", func() (*github.Blob, *github.Response, error) {
		return v.Client().Git.GetBlob(ctx, runevent.Organization, runevent.Repository, sha)
	})
	if err != nil {
		return nil, err
	}
	decoded, err := base64.StdEncoding.DecodeString(blob.GetContent())
	if err != nil {
		return nil, err
	}
	return decoded, err
}
Issue: No caching mechanism. Every call to getObject() with the same SHA results in a new API request.
Evidence from E2E Instrumentation
In a single "Github PullRequest" test run with 21 API calls:
2x /repos/.../git/blobs/37d7ae74ccf80e9e2c1499d15325615cdb5dc692
2x /repos/.../git/blobs/987b42c3236751cb864ae28b9c64fc726f790d35
2x /repos/.../git/blobs/c93a9530f83d97c191bc642e51d8a533a40ac5bd
Each of these three blob SHAs is fetched twice within the same run. This pattern repeats across multiple tests.
Total impact: 18 duplicate get_blob calls out of 136 total (13.2% waste)
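The duplicate count can be reproduced from any list of request paths with a few lines of Go. This is a minimal sketch, not the instrumentation tooling itself: countDuplicates is a hypothetical helper, and the paths are the abbreviated ones from the excerpt above rather than real log lines.

```go
package main

import "fmt"

// countDuplicates returns how many entries in paths repeat an earlier,
// identical path. The first request for each path is not counted.
func countDuplicates(paths []string) int {
	seen := make(map[string]bool, len(paths))
	dups := 0
	for _, p := range paths {
		if seen[p] {
			dups++
			continue
		}
		seen[p] = true
	}
	return dups
}

func main() {
	// Abbreviated paths as shown in the excerpt; each blob is requested twice.
	paths := []string{
		"/repos/.../git/blobs/37d7ae74ccf80e9e2c1499d15325615cdb5dc692",
		"/repos/.../git/blobs/987b42c3236751cb864ae28b9c64fc726f790d35",
		"/repos/.../git/blobs/c93a9530f83d97c191bc642e51d8a533a40ac5bd",
		"/repos/.../git/blobs/37d7ae74ccf80e9e2c1499d15325615cdb5dc692",
		"/repos/.../git/blobs/987b42c3236751cb864ae28b9c64fc726f790d35",
		"/repos/.../git/blobs/c93a9530f83d97c191bc642e51d8a533a40ac5bd",
	}
	fmt.Println(countDuplicates(paths)) // 3

	// Share of duplicate calls across the suite, using the totals reported above.
	fmt.Printf("%.1f%%\n", 100*18.0/136.0) // 13.2%
}
```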
Root Cause
When GetTektonDir() is called multiple times (e.g., once for source provenance and once for default_branch provenance), the concatAllYamlFiles() function fetches each YAML file's content via getObject(). Since there's no cache, identical files are fetched again.
Call chain:
GetTektonDir()
  -> concatAllYamlFiles()
    -> getObject(sha) // No cache, always calls API
Proposed Solution
Add an in-memory blob cache to the Provider struct:
type Provider struct {
	// ... existing fields
	blobCache      map[string][]byte
	blobCacheMutex sync.RWMutex
}

func New() *Provider {
	return &Provider{
		// ... existing initialization
		blobCache: make(map[string][]byte),
	}
}

func (v *Provider) getObject(ctx context.Context, sha string, runevent *info.Event) ([]byte, error) {
	// Check cache first (read lock)
	v.blobCacheMutex.RLock()
	if cached, ok := v.blobCache[sha]; ok {
		v.blobCacheMutex.RUnlock()
		return cached, nil
	}
	v.blobCacheMutex.RUnlock()

	// Fetch from API
	blob, _, err := wrapAPI(v, "get_blob", func() (*github.Blob, *github.Response, error) {
		return v.Client().Git.GetBlob(ctx, runevent.Organization, runevent.Repository, sha)
	})
	if err != nil {
		return nil, err
	}
	decoded, err := base64.StdEncoding.DecodeString(blob.GetContent())
	if err != nil {
		return nil, err
	}

	// Store in cache (write lock)
	v.blobCacheMutex.Lock()
	v.blobCache[sha] = decoded
	v.blobCacheMutex.Unlock()
	return decoded, nil
}
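The same check-then-store pattern can be exercised in isolation, without a GitHub client. The sketch below is illustrative only: blobCache, newBlobCache, get, and the fetch callback are hypothetical stand-ins for the proposed Provider fields and for the wrapAPI/GetBlob/base64 steps, and the short SHAs are abbreviations used for the demo.

```go
package main

import (
	"fmt"
	"sync"
)

// blobCache is a hypothetical stand-in for the two proposed Provider fields.
type blobCache struct {
	mu    sync.RWMutex
	blobs map[string][]byte
}

func newBlobCache() *blobCache {
	return &blobCache{blobs: make(map[string][]byte)}
}

// get returns the content for sha and calls fetch only on a cache miss.
// Errors are not cached, so a later call for the same SHA retries the fetch.
func (c *blobCache) get(sha string, fetch func(string) ([]byte, error)) ([]byte, error) {
	c.mu.RLock()
	cached, ok := c.blobs[sha]
	c.mu.RUnlock()
	if ok {
		return cached, nil
	}

	data, err := fetch(sha)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.blobs[sha] = data
	c.mu.Unlock()
	return data, nil
}

func main() {
	calls := 0
	// fetch simulates the API call and counts how often it is invoked.
	fetch := func(sha string) ([]byte, error) {
		calls++
		return []byte("content of " + sha), nil
	}

	cache := newBlobCache()
	// Two passes over the same SHAs, as when GetTektonDir() runs twice.
	for pass := 0; pass < 2; pass++ {
		for _, sha := range []string{"37d7ae7", "987b42c", "c93a953"} {
			if _, err := cache.get(sha, fetch); err != nil {
				panic(err)
			}
		}
	}
	fmt.Println("fetches:", calls) // fetches: 3
}
```

Two details of this pattern are worth noting. Concurrent callers that miss on the same SHA at the same moment may each fetch it once, which is harmless here because both store identical bytes. The cached slice is also returned directly rather than copied, so callers are assumed to treat it as read-only.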
Why This is Safe
- Blob SHAs are content-addressable: A SHA uniquely identifies the exact content. If the SHA matches, the content is guaranteed to be identical.
- Cache lifetime is bounded: The Provider instance is created per webhook event and discarded after processing. No risk of stale data persisting.
- Thread-safe: Using sync.RWMutex allows concurrent reads while protecting writes.
- No cross-repository concerns: The cache is per-Provider instance, and each Provider handles a single repo's event.
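The content-addressability point can be checked directly: in a SHA-1 repository, Git derives a blob's ID by hashing a short header followed by the file content, so the same bytes always yield the same SHA. The sketch below assumes the SHA-1 object format; gitBlobSHA is a hypothetical helper name and is not part of the provider code.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// gitBlobSHA computes the object ID Git assigns to a blob in a SHA-1
// repository: SHA-1 over the header "blob <size>\x00" followed by the content.
func gitBlobSHA(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// The empty blob has a well-known ID.
	fmt.Println(gitBlobSHA(nil)) // e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

	// Different content produces a different ID, so a SHA-keyed cache
	// cannot return the bytes of another file.
	fmt.Println(gitBlobSHA([]byte("a")) == gitBlobSHA([]byte("b"))) // false
}
```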
Files to Modify
pkg/provider/github/github.go
- Add blobCache and blobCacheMutex fields to Provider struct
- Initialize blobCache in New()
- Add cache check/store logic in getObject()
Testing
- Existing unit tests should pass (behavior unchanged, just faster)
- Run e2e tests with PAC_API_INSTRUMENTATION_DIR set
- Verify get_blob duplicate calls are eliminated
Expected Impact
- Calls saved: ~18 per full e2e test suite
- Per-event savings: 1-3 calls for typical PR events
- Complexity: Low - isolated change to a single function
Acceptance Criteria
- getObject() caches blob content by SHA
- No duplicate get_blob calls for the same SHA in instrumentation output
/label ~enhancement ~performance ~github-provider