feat(canvas): checkpoint logic (3/4) by benwu408 · Pull Request #9807 · onyx-dot-app/onyx

benwu408 · 2026-03-31T17:47:57Z

Description

Add _load_from_checkpoint with staged processing (pages → assignments → announcements) per course, time-window filtering, per-document failure isolation via ConnectorFailure, and proper checkpoint state advancement. Security-critical pagination errors (host/scheme mismatch) propagate while recoverable API errors (404, 429, 5xx) trigger retries. Implements load_from_checkpoint, load_from_checkpoint_with_perm_sync, build_dummy_checkpoint, and validate_checkpoint_json.
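
The host/scheme check on pagination URLs can be illustrated roughly as follows. This is a minimal sketch, not the connector's actual `_parse_next_link`: the function name `validate_next_url`, the exception `PaginationSecurityError`, and the example host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


class PaginationSecurityError(Exception):
    """Hypothetical error type: raised locally, never from an HTTP response."""


def validate_next_url(base_url: str, next_url: str) -> str:
    """Return next_url only if it shares scheme and host with base_url."""
    base = urlsplit(base_url)
    nxt = urlsplit(next_url)
    # A "next" link pointing at another origin could leak the API token,
    # so it is rejected instead of being followed.
    if nxt.scheme != base.scheme or nxt.netloc.lower() != base.netloc.lower():
        raise PaginationSecurityError(
            f"Refusing to follow pagination URL on a different origin: {next_url}"
        )
    return next_url
```

Because an error like this carries no HTTP status, a caller can tell it apart from a failed API response and let it propagate rather than retry.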

Includes unit tests for checkpoint lifecycle, stage advancement, time filtering, failure handling, and perm sync attachment.

How Has This Been Tested?

Unit tests covering checkpoint lifecycle: first-call course materialization, stage processing (pages → assignments → announcements), stage advancement across all 3 stages, time-window filtering, announcement skip on missing posted_at, stage failure retry (no advancement), per-document conversion failure yielding ConnectorFailure, terminal state (has_more=False), invalid stage rejection, and load_from_checkpoint_with_perm_sync attaching ExternalAccess.

Also manually tested against a live Canvas instance.

PR Stack

Stacked PRs on my fork (each targets the previous branch for isolated review):

PRs on upstream onyx (all target main):

Summary by cubic

Adds checkpoint-based indexing to the Canvas connector with staged per‑course processing and time-window filtering. Also adds a permission-sync mode and tighter error handling, including surfacing 401/403 and retrying only recoverable API errors.

New Features
- Shared _load_from_checkpoint used by load_from_checkpoint and load_from_checkpoint_with_perm_sync, plus build_dummy_checkpoint and validate_checkpoint_json.
- Staged per-course processing with pagination resume via next_url and correct stage/course advancement.
- Filters by (start, end] using item timestamps; skips announcements without posted_at.
- Per-document conversion errors emit ConnectorFailure without stopping other items.
- Security-critical pagination errors propagate; recoverable API errors are retried by keeping has_more=True.
Bug Fixes
- 401/403 in the checkpoint loop now surface via _handle_canvas_api_error (no silent retry), while other API failures continue to retry.
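
The `(start, end]` filter in the list above could look something like this. It is a sketch under assumptions: the helper name `in_time_window` is hypothetical, and timestamps are assumed to be ISO 8601 strings with a trailing `Z`, as in the test data quoted later in this PR.

```python
from datetime import datetime, timezone
from typing import Optional


def in_time_window(timestamp: Optional[str], start: float, end: float) -> bool:
    """True when start < timestamp <= end; items with no timestamp are skipped."""
    if not timestamp:
        # e.g. an announcement without posted_at
        return False
    # Normalise the trailing "Z" so datetime.fromisoformat accepts it.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return start < parsed.timestamp() <= end
```

The exclusive lower bound means an item stamped exactly at `start` belongs to the previous window, so consecutive windows do not yield it twice.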

Written for commit 23933c5. Summary will update on new commits.

Add _load_from_checkpoint with staged processing (pages → assignments → announcements) per course, time-window filtering, per-document failure isolation via ConnectorFailure, and proper checkpoint state advancement. Security-critical pagination errors (host/scheme mismatch) propagate while recoverable API errors trigger retries via has_more=True. Implements load_from_checkpoint, load_from_checkpoint_with_perm_sync, build_dummy_checkpoint, and validate_checkpoint_json. Includes unit tests for checkpoint lifecycle, stage advancement, time filtering, failure handling, and perm sync attachment.

cubic-dev-ai

4 issues found across 2 files

Confidence score: 3/5

There is a concrete runtime risk in backend/onyx/connectors/canvas/connector.py: checkpoint loading appears to retry on all API errors and does not surface 401/403, which can cause unauthorized/expired credentials to loop indefinitely instead of failing clearly.
backend/onyx/connectors/canvas/connector.py also has duplicated _handle_canvas_api_error logic that overrides prior behavior, potentially misclassifying 5xx upstream failures as validation errors and changing error handling paths.
Test reliability is reduced in backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py because duplicate class/test names overwrite earlier definitions, so some newly added cases are skipped and coverage drops.
Pay close attention to backend/onyx/connectors/canvas/connector.py and backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py - API error classification and duplicated tests may hide regressions.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py">

<violation number="1" location="backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py:701">
P2: Duplicate TestConnectorUrlNormalization class definitions overwrite the earlier class, so the newly added tests in the first class are skipped by pytest.</violation>

<violation number="2" location="backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py:946">
P3: Duplicate test_validate_insufficient_permissions definition overwrites the earlier test, so one variant never runs and coverage is reduced.</violation>
</file>

<file name="backend/onyx/connectors/canvas/connector.py">

<violation number="1" location="backend/onyx/connectors/canvas/connector.py:65">
P2: Duplicate `_handle_canvas_api_error` overrides the original 5xx handling logic, so 5xx Canvas errors are misclassified as validation errors.</violation>

<violation number="2" location="backend/onyx/connectors/canvas/connector.py:519">
P2: Checkpoint loading treats all API errors as retryable and never raises for 401/403, so expired or unauthorized credentials will loop forever instead of surfacing a credential error.</violation>
</file>
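
For context on why the duplicate definitions flagged above matter: in Python a second `def` or `class` with the same name in the same scope silently rebinds the name, so only the last definition survives. The names below are made up for illustration and do not come from the PR.

```python
class TestExample:
    def test_validate(self) -> str:
        return "first"

    def test_validate(self) -> str:  # same name: rebinds the class attribute
        return "second"


def handle_error() -> str:
    return "original"


def handle_error() -> str:  # later definition replaces the earlier one
    return "duplicate"


# Only one test method remains on the class, and it is the second one.
remaining = [name for name in vars(TestExample) if name.startswith("test_")]
```

A test runner only sees what is left on the class after it is built, which is why the earlier variants never run.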


greptile-apps · 2026-03-31T18:04:06Z

Greptile Summary

This PR implements the core checkpoint-based indexing for the Canvas connector: _load_from_checkpoint, load_from_checkpoint, load_from_checkpoint_with_perm_sync, build_dummy_checkpoint, and validate_checkpoint_json. The staged processing model (pages → assignments → announcements per course) with next_url pagination cursors, time-window filtering, and per-document ConnectorFailure isolation is well-structured and consistent with other checkpointed connectors in the codebase.

Key changes:

- _load_from_checkpoint handles five distinct checkpoint states: initial course materialization, active pagination within a stage, stage advancement, course advancement, and terminal state.
- Auth errors (401/403) propagate as CredentialExpiredError/InsufficientPermissionsError; recoverable API errors (429, 5xx) silently set has_more=True for retry; security errors from _parse_next_link (host/scheme mismatch) always re-raise.
- _maybe_attach_permissions attaches ExternalAccess per-document only in the load_from_checkpoint_with_perm_sync path, sharing all other logic.
- Prior review issues (duplicate _handle_canvas_api_error, missing _make_url_dispatcher/_run_checkpoint helpers, missing imports) are resolved in this commit.

Two minor items remain: stage_config is constructed unconditionally even when next_url is active (minor inefficiency), and the next_url resume path has no unit test coverage.
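
The error handling summarized above amounts to a three-way classification. A rough sketch, assuming a hypothetical `classify_fetch_error` helper and `Action` enum (neither exists in the PR), where `status_code` is `None` for errors raised locally:

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RAISE = "raise"  # surface to the caller and stop the run
    RETRY = "retry"  # return the checkpoint unchanged with has_more=True


def classify_fetch_error(status_code: Optional[int]) -> Action:
    """Map a failed fetch to the behaviour described in the summary."""
    if status_code is None:
        # Raised locally, e.g. a pagination URL host/scheme mismatch.
        return Action.RAISE
    if status_code in (401, 403):
        # Expired or insufficient credentials must not be retried forever.
        return Action.RAISE
    # Other API failures (429, 5xx, ...) are treated as recoverable.
    return Action.RETRY
```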

Confidence Score: 5/5

Safe to merge — no P0/P1 issues found; all remaining findings are P2 style and coverage suggestions.
The prior blocking issues (duplicate function definition, missing test helpers, missing imports) are all resolved. The checkpoint state machine logic is correct, auth errors propagate properly, retryable errors are handled, and per-document failure isolation works as designed. The two remaining P2 items (eager stage_config construction, missing next_url resume test) are minor and do not affect production correctness.
No files require special attention; the test file would benefit from a pagination-resume test case.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/onyx/connectors/canvas/connector.py | Adds `_load_from_checkpoint` with staged per-course processing, time-window filtering, per-document failure isolation, and checkpoint state advancement; also implements the four remaining abstract methods. Logic is sound but `stage_config` is eagerly constructed even when unused, and `oe._status_code_override` (private field access) from a prior thread remains unaddressed. |
| backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py | Adds `_make_url_dispatcher`, `_run_checkpoint` helpers, `TestCheckpoint` and `TestLoadFromCheckpointWithPermSync` suites, resolving prior missing-helper issues; the `next_url` pagination resume path has no test coverage. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([load_from_checkpoint called]) --> B{course_ids empty?}
    B -- Yes --> C[_list_courses]
    C -- failure --> D[has_more=True\nreturn same checkpoint]
    C -- success, 0 courses --> E[has_more=False\nreturn checkpoint]
    C -- success, N courses --> F[populate course_ids\nstage=pages, index=0\nhas_more=True\nreturn — no docs yielded]
    B -- No --> G{current_index ≥\nlen course_ids?}
    G -- Yes --> H[has_more=False\nreturn]
    G -- No --> I[Validate stage\npages / assignments / announcements]
    I --> J{next_url set?}
    J -- Yes --> K[GET full_url]
    J -- No --> L[GET endpoint + params]
    K --> M{HTTP error?}
    L --> M
    M -- 401/403 --> N[raise CredentialExpired /\nInsufficientPermissions]
    M -- security error no status --> N
    M -- 429 / 5xx --> O[has_more=True\nreturn — stage unchanged]
    M -- OK --> P[Iterate items]
    P --> Q{item in\ntime window?}
    Q -- No / no timestamp --> R[skip]
    Q -- Yes --> S[convert to Document\nyield doc]
    S -- conversion error --> T[yield ConnectorFailure]
    P --> U{result_next_url?}
    U -- Yes --> V[checkpoint.next_url = result_next_url\nhas_more=True]
    U -- No --> W{next stage?}
    W -- pages→assignments\nassignments→announcements --> X[advance stage\nnext_url=None]
    W -- announcements→done --> Y[advance_course\nreset stage+next_url]
    V --> Z([return new_checkpoint])
    X --> Z
    Y --> Z
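
The advancement half of the flowchart (nodes G, U, W, X, Y) can be read as a small pure function. The sketch below is an illustration only: it uses a frozen dataclass in place of the real `CanvasConnectorCheckpoint`, and the function name `advance` is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional

STAGES = ("pages", "assignments", "announcements")


@dataclass(frozen=True)
class Checkpoint:
    has_more: bool = True
    course_ids: tuple[int, ...] = ()
    current_course_index: int = 0
    stage: str = "pages"
    next_url: Optional[str] = None


def advance(cp: Checkpoint, result_next_url: Optional[str]) -> Checkpoint:
    """Return the checkpoint that follows one successfully fetched page."""
    if cp.current_course_index >= len(cp.course_ids):
        return replace(cp, has_more=False)  # terminal state
    if cp.stage not in STAGES:
        raise ValueError(f"Invalid stage: {cp.stage}")
    if result_next_url:
        # More pages in this stage: save the cursor, keep the stage.
        return replace(cp, next_url=result_next_url, has_more=True)
    idx = STAGES.index(cp.stage)
    if idx + 1 < len(STAGES):
        # Stage exhausted: move to the next stage of the same course.
        return replace(cp, stage=STAGES[idx + 1], next_url=None, has_more=True)
    # Last stage exhausted: move to the next course and reset stage and cursor.
    return replace(
        cp,
        current_course_index=cp.current_course_index + 1,
        stage=STAGES[0],
        next_url=None,
        has_more=True,
    )
```

As in the flowchart, the terminal state is only detected on the call after the last course has been advanced past.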

Prompt To Fix All With AI

This is a comment left during a code review.
Path: backend/onyx/connectors/canvas/connector.py
Line: 486-493

Comment:
**`stage_config` eagerly built even when unused**

`stage_config` and `config` are constructed on every invocation, including when `new_checkpoint.next_url` is set — in that branch, `config["endpoint"]` and `config["params"]` are never accessed. Moving the dict construction into the `else` branch keeps the hot path (pagination resume) cleaner and avoids building three endpoint strings on every page-turn.

```suggestion
        try:
            if new_checkpoint.next_url:
                response, result_next_url = self.canvas_client.get(
                    full_url=new_checkpoint.next_url
                )
            else:
                stage_config: dict[str, dict[str, Any]] = {
                    "pages": {
                        "endpoint": f"courses/{course_id}/pages",
                        "params": {"per_page": "100", "include[]": "body", "published": "true"},
                    },
                    "assignments": {
                        "endpoint": f"courses/{course_id}/assignments",
                        "params": {"per_page": "100", "published": "true"},
                    },
                    "announcements": {
                        "endpoint": "announcements",
                        "params": {
                            "per_page": "100",
                            "context_codes[]": f"course_{course_id}",
                            "active_only": "true",
                        },
                    },
                }
                config = stage_config[stage]
                response, result_next_url = self.canvas_client.get(
                    config["endpoint"], params=config["params"]
                )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: backend/tests/unit/onyx/connectors/canvas/test_canvas_connector.py
Line: 330-344

Comment:
**No test coverage for `next_url` pagination resume path**

`_run_checkpoint` always exhausts `StopIteration` after one generator pass, and `_mock_response` always returns an empty `Link` header — so `result_next_url` is always `None` in every test. The `next_url` branch in `_load_from_checkpoint` (lines 487–490 of `connector.py`) is never exercised. A test like the one below would cover mid-pagination resume and verify that the checkpoint correctly carries `next_url` between calls:

```python
@patch("onyx.connectors.canvas.client.rl_requests")
def test_pagination_resume_via_next_url(self, mock_requests: MagicMock) -> None:
    """When a page has a next_url, checkpoint carries it and resumes from there."""
    page1 = _mock_page(10, updated_at="2025-06-15T12:00:00Z")
    page2 = _mock_page(11, title="Page 2", updated_at="2025-06-15T13:00:00Z")
    call_count = 0

    def _dispatcher(url: str, **kwargs: Any) -> MagicMock:
        nonlocal call_count
        call_count += 1
        next_url = f"{FAKE_BASE_URL}/api/v1/courses/1/pages?page=2" if call_count == 1 else ""
        return _mock_response(
            json_data=[page1] if call_count == 1 else [page2],
            link_header=f'<{next_url}>; rel="next"' if next_url else "",
        )

    mock_requests.get.side_effect = _dispatcher
    connector = _build_connector()
    cp = CanvasConnectorCheckpoint(
        has_more=True, course_ids=[1], current_course_index=0, stage="pages"
    )
    start = datetime(2025, 6, 1, tzinfo=timezone.utc).timestamp()
    end = datetime(2025, 6, 30, tzinfo=timezone.utc).timestamp()

    items1, cp = _run_checkpoint(connector, cp, start, end)
    assert len(items1) == 1
    assert cp.next_url is not None  # cursor saved
    assert cp.stage == "pages"     # stage not advanced yet

    items2, cp = _run_checkpoint(connector, cp, start, end)
    assert len(items2) == 1
    assert cp.next_url is None     # exhausted
    assert cp.stage == "assignments"
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix duplicate tests/functions from earli..."

greptile-apps · 2026-03-31T18:04:16Z

backend/onyx/connectors/canvas/connector.py

+        except OnyxError as oe:
+            # Re-raise security errors from _parse_next_link (host/scheme
+            # mismatch on pagination URLs) — these must not be silenced.
+            # Security errors have no HTTP status code override (they are
+            # raised locally, not from an API response).
+            is_api_error = oe._status_code_override is not None
+            if not is_api_error:
+                raise
+            logger.warning(
+                f"Failed to fetch {stage} for course {course_id}: {oe}"
+            )
+            new_checkpoint.has_more = True
+            return new_checkpoint


Accessing private _status_code_override to classify errors

The code inspects oe._status_code_override to distinguish HTTP API errors from locally-raised security errors. This couples to an internal detail of OnyxError. Consider adding a public property on OnyxError to expose whether the error originated from an HTTP response, or catching the security-error types explicitly.
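
One way the suggested public property might look is sketched below. `ApiAwareError` is a hypothetical stand-in, not the real `OnyxError`; the property name `is_http_error` and the helper `should_retry` are likewise assumptions.

```python
from typing import Optional


class ApiAwareError(Exception):
    """Hypothetical stand-in for OnyxError, for illustration only."""

    def __init__(self, message: str, status_code_override: Optional[int] = None) -> None:
        super().__init__(message)
        self._status_code_override = status_code_override

    @property
    def is_http_error(self) -> bool:
        """True when the error originated from an HTTP response."""
        return self._status_code_override is not None


def should_retry(err: ApiAwareError) -> bool:
    # Callers read the public property instead of the private field.
    return err.is_http_error
```

This keeps the classification rule in one place, so the connector would not break if the private field were renamed.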

Prompt To Fix With AI

This is a comment left during a code review.
Path: backend/onyx/connectors/canvas/connector.py
Line: 514-526

Comment:
**Accessing private `_status_code_override` to classify errors**

The code inspects `oe._status_code_override` to distinguish HTTP API errors from locally-raised security errors. This couples to an internal detail of `OnyxError`. Consider adding a public property on `OnyxError` to expose whether the error originated from an HTTP response, or catching the security-error types explicitly.

How can I resolve this? If you propose a fix, please make it concise.


jasmine-wu-ai · 2026-04-01T02:13:46Z

backend/onyx/connectors/canvas/connector.py

+        include_permissions: bool = False,
+    ) -> CheckpointOutput[CanvasConnectorCheckpoint]:
+        """Shared implementation for load_from_checkpoint and load_from_checkpoint_with_perm_sync."""
+        new_checkpoint = checkpoint.model_copy(deep=True)


is model_copy just making a copy of the object?

yep, makes a copy that we modify then return. it's a pydantic method

jasmine-wu-ai · 2026-04-01T02:15:52Z

backend/onyx/connectors/canvas/connector.py

+        # First call: materialize the list of course IDs
+        if not new_checkpoint.course_ids:
+            try:
+                courses = self._list_courses()


maybe i don't have a good understanding of what exactly "checkpoint" means - but from reading this i'd be concerned that we're just using the old course_ids if they exist - what if someone joins / drops a course since we last checkpointed?

This lowkey confused me and I had to spend some time understanding it, but basically indexing runs every minute or whatever, where it sorts through everything and finds new stuff. Every indexing run makes a new empty checkpoint, calls list courses, then goes through every single document and yields the ones in the window since the last indexing run. Checkpoints are made at every stage of the indexing process: each load-from-checkpoint call processes one page (at most 100) of items, then returns a new checkpoint that is passed to the next call. So basically every new indexing run fetches the full course list again.
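
The loop described in that reply can be pictured roughly like this. It is a toy sketch, not Onyx's actual indexing driver: `Checkpoint`, `fake_load_from_checkpoint`, and `run_indexing` are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    has_more: bool = True
    batches_done: int = 0


def fake_load_from_checkpoint(
    cp: Checkpoint, total_batches: int = 3
) -> tuple[list[str], Checkpoint]:
    """Stand-in for one connector call: process one batch, return a new checkpoint."""
    done = cp.batches_done + 1
    docs = [f"doc-{done}"]
    return docs, Checkpoint(has_more=done < total_batches, batches_done=done)


def run_indexing() -> list[str]:
    cp = Checkpoint()  # every run starts from a fresh, empty checkpoint
    collected: list[str] = []
    while cp.has_more:
        # Each call handles one batch and hands back the next checkpoint.
        docs, cp = fake_load_from_checkpoint(cp)
        collected.extend(docs)
    return collected
```

Since the checkpoint is rebuilt from scratch on each run, nothing such as a stale course list carries over between runs.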

jasmine-wu-ai · 2026-04-01T02:16:16Z

backend/onyx/connectors/canvas/connector.py

+                courses = self._list_courses()
+            except Exception as e:
+                logger.warning(f"Failed to list Canvas courses: {e}")
+                new_checkpoint.has_more = True


what does has_more mean?

the current indexing run still has more docs to go thru

benwu408 requested a review from a team as a code owner March 31, 2026 17:47

cubic-dev-ai bot reviewed Mar 31, 2026


greptile-apps bot reviewed Mar 31, 2026


jasmine-wu-ai reviewed Apr 1, 2026


fix duplicate tests/functions from earlier rebase. handle 401/403 properly in checkpoint loop

23933c5

benwu408 had a problem deploying to ci-protected April 2, 2026 19:27 — with GitHub Actions Failure
