Add blob caching to GitHub provider to reduce duplicate API calls #2376

@chmouel

Summary

The getObject() function in the GitHub provider fetches blob content from the GitHub API without any caching. When the same file (identified by its blob SHA) is requested more than once while processing a single webhook event, each request results in a separate API call.

Problem

Location: pkg/provider/github/github.go:596-609

func (v *Provider) getObject(ctx context.Context, sha string, runevent *info.Event) ([]byte, error) {
    blob, _, err := wrapAPI(v, "get_blob", func() (*github.Blob, *github.Response, error) {
        return v.Client().Git.GetBlob(ctx, runevent.Organization, runevent.Repository, sha)
    })
    if err != nil {
        return nil, err
    }

    decoded, err := base64.StdEncoding.DecodeString(blob.GetContent())
    if err != nil {
        return nil, err
    }
    return decoded, nil
}

Issue: No caching mechanism. Every call to getObject() with the same SHA results in a new API request.

Evidence from E2E Instrumentation

In a single "Github PullRequest" test run with 21 API calls:

2x /repos/.../git/blobs/37d7ae74ccf80e9e2c1499d15325615cdb5dc692
2x /repos/.../git/blobs/987b42c3236751cb864ae28b9c64fc726f790d35
2x /repos/.../git/blobs/c93a9530f83d97c191bc642e51d8a533a40ac5bd

The same blob SHAs are fetched twice. This pattern repeats across multiple tests.

Total impact: 18 duplicate get_blob calls out of 136 total (13.2% waste)
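
The duplicate count above can be reproduced with a small helper. This is only a sketch: it assumes the instrumentation output can be reduced to a list of request paths ending in /git/blobs/<sha>, which may not match the real format, and the example-org/example-repo path segment is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// duplicateBlobFetches counts requests that re-fetch a blob SHA already
// fetched by an earlier request in the same list.
// Assumption: each blob request path ends in /git/blobs/<sha>.
func duplicateBlobFetches(paths []string) int {
	const marker = "/git/blobs/"
	seen := make(map[string]bool)
	dups := 0
	for _, p := range paths {
		i := strings.LastIndex(p, marker)
		if i < 0 {
			continue // not a blob request
		}
		sha := p[i+len(marker):]
		if seen[sha] {
			dups++
		}
		seen[sha] = true
	}
	return dups
}

func main() {
	// Placeholder owner/repo; the SHAs are the ones listed in this issue.
	prefix := "/repos/example-org/example-repo/git/blobs/"
	shas := []string{
		"37d7ae74ccf80e9e2c1499d15325615cdb5dc692",
		"987b42c3236751cb864ae28b9c64fc726f790d35",
		"c93a9530f83d97c191bc642e51d8a533a40ac5bd",
	}
	var paths []string
	for _, sha := range shas {
		paths = append(paths, prefix+sha, prefix+sha) // each fetched twice
	}
	fmt.Println(duplicateBlobFetches(paths), "duplicate blob fetches")

	// Share of wasted calls across the suite: 18 out of 136.
	fmt.Printf("%.1f%% waste\n", 100*18.0/136.0)
}
```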

Root Cause

When GetTektonDir() is called multiple times (e.g., once for source provenance and once for default_branch provenance), the concatAllYamlFiles() function fetches each YAML file's content via getObject(). Since there's no cache, identical files are fetched again.

Call chain:

GetTektonDir()
  -> concatAllYamlFiles()
    -> getObject(sha) // No cache, always calls API

Proposed Solution

Add an in-memory blob cache to the Provider struct:

type Provider struct {
    // ... existing fields
    blobCache      map[string][]byte
    blobCacheMutex sync.RWMutex
}

func New() *Provider {
    return &Provider{
        // ... existing initialization
        blobCache: make(map[string][]byte),
    }
}

func (v *Provider) getObject(ctx context.Context, sha string, runevent *info.Event) ([]byte, error) {
    // Check cache first (read lock)
    v.blobCacheMutex.RLock()
    if cached, ok := v.blobCache[sha]; ok {
        v.blobCacheMutex.RUnlock()
        return cached, nil
    }
    v.blobCacheMutex.RUnlock()

    // Fetch from API
    blob, _, err := wrapAPI(v, "get_blob", func() (*github.Blob, *github.Response, error) {
        return v.Client().Git.GetBlob(ctx, runevent.Organization, runevent.Repository, sha)
    })
    if err != nil {
        return nil, err
    }

    decoded, err := base64.StdEncoding.DecodeString(blob.GetContent())
    if err != nil {
        return nil, err
    }

    // Store in cache (write lock)
    v.blobCacheMutex.Lock()
    v.blobCache[sha] = decoded
    v.blobCacheMutex.Unlock()

    return decoded, nil
}
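
To try the caching logic without the GitHub client, here is a minimal standalone sketch of the same check-then-store pattern. The blobCache type, newBlobCache, get, and fetchFunc are hypothetical names for illustration; fetchFunc stands in for the Git.GetBlob call plus base64 decoding.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a stand-in for the API call and base64 decoding.
type fetchFunc func(sha string) ([]byte, error)

// blobCache is a hypothetical helper mirroring the proposed Provider fields.
type blobCache struct {
	mu    sync.RWMutex
	blobs map[string][]byte
}

func newBlobCache() *blobCache {
	return &blobCache{blobs: make(map[string][]byte)}
}

// get returns the cached content for sha, calling fetch only on a miss.
func (c *blobCache) get(sha string, fetch fetchFunc) ([]byte, error) {
	c.mu.RLock()
	cached, ok := c.blobs[sha]
	c.mu.RUnlock()
	if ok {
		return cached, nil
	}

	data, err := fetch(sha)
	if err != nil {
		return nil, err // failures are not cached, so a retry hits the API again
	}

	c.mu.Lock()
	c.blobs[sha] = data
	c.mu.Unlock()
	return data, nil
}

func main() {
	calls := 0
	fetch := func(sha string) ([]byte, error) {
		calls++
		return []byte("content of " + sha), nil
	}

	cache := newBlobCache()
	for i := 0; i < 2; i++ {
		if _, err := cache.get("37d7ae7", fetch); err != nil {
			panic(err)
		}
	}
	fmt.Println("API calls:", calls) // prints "API calls: 1"
}
```

Two caveats apply to this sketch and to the proposal alike. If two goroutines miss on the same SHA at the same time, both will fetch; the result is still correct, only the saving is lost for that one race. The cached slice is also shared between callers, so callers must treat it as read-only, or the cache should return a copy.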

Why This is Safe

  1. Blob SHAs are content-addressed: Git computes a blob's SHA from its content, so two blobs with the same SHA have identical content (barring a hash collision).

  2. Cache lifetime is bounded: The Provider instance is created per webhook event and discarded after processing. No risk of stale data persisting.

  3. Thread-safe: Using sync.RWMutex allows concurrent reads while protecting writes.

  4. No cross-repository concerns: The cache is per-Provider instance, and each Provider handles a single repo's event.
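
Point 1 can be checked directly: for SHA-1 repositories, Git hashes the header "blob <size>", a NUL byte, and then the raw content. The sketch below recomputes that ID; gitBlobSHA is a hypothetical helper name, and repositories using the SHA-256 object format would need a different hash.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// gitBlobSHA returns the SHA-1 object ID Git assigns to a blob with the
// given content: sha1("blob " + <length> + "\x00" + content).
func gitBlobSHA(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// The empty blob has a well-known ID.
	fmt.Println(gitBlobSHA(nil)) // prints "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

	// Same content always yields the same ID; different content does not.
	a := gitBlobSHA([]byte("kind: PipelineRun\n"))
	b := gitBlobSHA([]byte("kind: PipelineRun\n"))
	c := gitBlobSHA([]byte("kind: Pipeline\n"))
	fmt.Println(a == b, a == c)
}
```

Because the key is derived only from the content, a cache keyed by SHA cannot return stale data for that key.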

Files to Modify

  • pkg/provider/github/github.go
    • Add blobCache and blobCacheMutex fields to Provider struct
    • Initialize blobCache in New()
    • Add cache check/store logic in getObject()

Testing

  1. Existing unit tests should pass (behavior unchanged, just faster)
  2. Run e2e tests with PAC_API_INSTRUMENTATION_DIR set
  3. Verify get_blob duplicate calls are eliminated

Expected Impact

  • Calls saved: ~18 per full e2e test suite
  • Per-event savings: 1-3 calls for typical PR events
  • Complexity: Low - isolated change to a single function

Acceptance Criteria

  • getObject() caches blob content by SHA
  • Cache is thread-safe
  • No duplicate get_blob calls for same SHA in instrumentation output
  • All existing tests pass
  • No memory leaks (the cache is released when the Provider is garbage collected)

/label ~enhancement ~performance ~github-provider
