Conversation
Greptile Summary

This PR introduces a new `Resolver` interface and implements it for Google Drive.

Key changes:

Issue found:

Confidence Score: 4/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant GoogleDriveConnector
    participant get_files_by_web_view_links_batch
    participant Drive BatchHttpRequest
    participant _get_new_ancestors_for_files
    participant _convert_retrieved_file_to_document
    Caller->>GoogleDriveConnector: resolve_errors(errors, include_permissions)
    GoogleDriveConnector->>GoogleDriveConnector: extract doc_ids from errors[].failed_document
    GoogleDriveConnector->>get_files_by_web_view_links_batch: (service, doc_ids, field_type)
    loop chunks of 100
        get_files_by_web_view_links_batch->>Drive BatchHttpRequest: batch.add(files().get) per link
        Drive BatchHttpRequest-->>get_files_by_web_view_links_batch: callback(request_id, response, exception)
    end
    get_files_by_web_view_links_batch-->>GoogleDriveConnector: BatchRetrievalResult{files, errors}
    GoogleDriveConnector-->>Caller: yield ConnectorFailure for each batch error
    GoogleDriveConnector->>_get_new_ancestors_for_files: retrieved_files, permission_sync_context
    _get_new_ancestors_for_files-->>GoogleDriveConnector: ancestor HierarchyNodes
    GoogleDriveConnector-->>Caller: yield HierarchyNode (ancestors)
    GoogleDriveConnector->>_convert_retrieved_file_to_document: parallel (max_workers=8)
    _convert_retrieved_file_to_document-->>GoogleDriveConnector: Document | ConnectorFailure | None
    GoogleDriveConnector-->>Caller: yield Document or ConnectorFailure
```
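As a rough illustration of the flow above, a caller might dispatch on the three yield types like this. All classes here are simplified stand-ins for the real Onyx types, and `resolve_errors_stub` is a placeholder for the connector method:

```python
from dataclasses import dataclass
from typing import Iterator, Union

# Simplified stand-ins for the three types resolve_errors can yield;
# the real classes live in the Onyx connector interfaces.
@dataclass
class Document:
    id: str

@dataclass
class HierarchyNode:
    id: str

@dataclass
class ConnectorFailure:
    document_id: str

def resolve_errors_stub() -> Iterator[Union[Document, HierarchyNode, ConnectorFailure]]:
    # Per the diagram: ancestor nodes are yielded before documents/failures.
    yield HierarchyNode(id="folder-1")
    yield Document(id="doc-1")
    yield ConnectorFailure(document_id="doc-2")

docs, nodes, failures = [], [], []
for item in resolve_errors_stub():
    if isinstance(item, HierarchyNode):
        nodes.append(item)
    elif isinstance(item, Document):
        docs.append(item)
    else:
        failures.append(item)
```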
This is a comment left during a code review.
Path: backend/onyx/connectors/google_drive/connector.py
Line: 1719-1726
Comment:
**Missing `exclude_domain_link_only` filter before building `retrieved_files`**
Both other code paths in this connector that build a document list apply an explicit filter for `exclude_domain_link_only` before handing files off for conversion:
- `_convert_retrieved_files_to_documents` (line 1517):
```python
if self.exclude_domain_link_only and has_link_only_permission(retrieved_file.drive_file):
continue
```
- `_extract_slim_docs_from_google_drive` (line 1806): identical guard.
`resolve_errors` fetches files with `DriveFileFieldType.WITH_PERMISSIONS` when `self.exclude_domain_link_only` is `True` (line 1695), which means the permission data needed for the check *is* available in the response — but the check itself never runs. As a result, when `exclude_domain_link_only=True`, files whose only access is a domain link will be re-indexed through the error-resolution path even though they are supposed to be excluded by configuration.
```python
retrieved_files = [
RetrievedDriveFile(
drive_file=file,
user_email=self.primary_admin_email,
completion_stage=DriveRetrievalStage.DONE,
)
for file in batch_result.files.values()
if not (
self.exclude_domain_link_only
and has_link_only_permission(file)
)
]
```
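For context, a minimal sketch of what such a guard could check, assuming Drive v3 permission objects where link-only grants carry `allowFileDiscovery=False`. The helper name and logic are illustrative, not the actual `has_link_only_permission` implementation:

```python
from typing import Any

# Hypothetical helper: a file is "link only" if every permission is an
# anyone/domain grant that is not discoverable (Drive v3 semantics for
# "anyone with the link"). This is a sketch, not the Onyx implementation.
def link_only(drive_file: dict[str, Any]) -> bool:
    perms = drive_file.get("permissions") or []
    return bool(perms) and all(
        p.get("type") in ("anyone", "domain") and p.get("allowFileDiscovery") is False
        for p in perms
    )

link_only_file = {"permissions": [{"type": "domain", "allowFileDiscovery": False}]}
shared_file = {"permissions": [{"type": "user", "emailAddress": "a@example.com"}]}

files = [link_only_file, shared_file]
exclude_domain_link_only = True
# Same filter shape as the suggested fix above.
kept = [f for f in files if not (exclude_domain_link_only and link_only(f))]
```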
How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "pr comments"
```python
if exception:
    logger.warning(f"Error retrieving file {request_id}: {exception}")
else:
    results[request_id] = response
```
Silent batch failure swallows transient errors
When an individual batch request fails (e.g., due to a transient network/auth error, not just a permanent 404), the failure is silently dropped — only a logger.warning is emitted and no ConnectorFailure is produced. This makes transient retrieval errors indistinguishable from permanent ones (file deleted, permission revoked).
Per the interface contract, "Caller's responsibility is to delete the old ConnectorFailures and replace with the new ones." A caller who deletes all input failures matching the submitted errors list and only keeps the yielded outputs will permanently lose failure records for documents that failed due to transient batch errors. The document would disappear from both the search index and the failure tracker, with no way to retry.
Consider yielding a new ConnectorFailure (with the exception message) for any batch item that fails, rather than silently dropping it:
```python
def callback(
    request_id: str,
    response: GoogleDriveFileType,
    exception: Exception | None,
) -> None:
    if exception:
        logger.warning(f"Error retrieving file {request_id}: {exception}")
        errors[request_id] = exception  # collect errors for the caller
    else:
        results[request_id] = response
```

Then the public `resolve_errors` can yield `ConnectorFailure` objects for entries present in `errors` but absent in `files`.
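That yield-back step could be sketched like this; both types are simplified stand-ins for the real Onyx classes:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins: a batch result with per-link errors, and a
# minimal ConnectorFailure. The real types live in the Onyx codebase.
@dataclass
class BatchRetrievalResult:
    files: dict = field(default_factory=dict)   # webViewLink -> file metadata
    errors: dict = field(default_factory=dict)  # webViewLink -> Exception

@dataclass
class ConnectorFailure:
    document_id: str
    failure_message: str

def failures_for_unresolved(result: BatchRetrievalResult):
    """Yield a replacement failure for every link that errored and was never
    retrieved, so transient batch errors stay visible and retryable."""
    for link, exc in result.errors.items():
        if link not in result.files:
            yield ConnectorFailure(document_id=link, failure_message=str(exc))

result = BatchRetrievalResult(
    files={"https://drive.google.com/ok": {"id": "1"}},
    errors={"https://drive.google.com/bad": RuntimeError("503 backend error")},
)
unresolved = list(failures_for_unresolved(result))
```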
Path: `backend/onyx/connectors/google_drive/file_retrieval.py`
Line: 579-582

```python
    google_drive_service_acct_connector_factory: Callable[..., GoogleDriveConnector],
) -> None:
    """Resolving an empty error list should yield nothing."""
    connector = google_drive_service_acct_connector_factory(
```
Unused list comprehension result — dead code
The result of this list comprehension is discarded. The variable is never assigned, so this line has no effect on the test. If the intent was to assert something about the number of ConnectorFailures (e.g., that there are none), the assertion is missing.
```suggestion
new_failures = [r for r in results if isinstance(r, ConnectorFailure)]
```
Then add an assertion, e.g. `assert len(new_failures) == 0`.
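A minimal sketch of the fixed test tail, with `results` and `ConnectorFailure` as stand-ins for the real objects built by the test:

```python
# Stand-in for the real failure class from the Onyx connector interfaces.
class ConnectorFailure:
    pass

# Stand-in for list(connector.resolve_errors([])): an empty error list
# should yield nothing at all.
results: list = []

# Assign the comprehension and assert on it, instead of discarding it.
new_failures = [r for r in results if isinstance(r, ConnectorFailure)]
assert len(new_failures) == 0
```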
Path: `backend/tests/daily/connectors/google_drive/test_resolver.py`
Line: 183

```python
fields = _get_fields_for_file_type(field_type)
if len(web_view_links) <= MAX_BATCH_SIZE:
    return _get_files_by_web_view_links_batch(service, web_view_links, fields)

result: dict[str, GoogleDriveFileType] = {}
for i in range(0, len(web_view_links), MAX_BATCH_SIZE):
    chunk = web_view_links[i : i + MAX_BATCH_SIZE]
    result.update(_get_files_by_web_view_links_batch(service, chunk, fields))
```
**Redundant early-exit branch**

The `if len(web_view_links) <= MAX_BATCH_SIZE` guard is unnecessary. The `for` loop below handles that case in a single iteration, so the early return only adds code without changing behavior. Removing it simplifies the function:
```suggestion
result: dict[str, GoogleDriveFileType] = {}
for i in range(0, len(web_view_links), MAX_BATCH_SIZE):
    chunk = web_view_links[i : i + MAX_BATCH_SIZE]
    result.update(_get_files_by_web_view_links_batch(service, chunk, fields))
return result
```
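A quick check confirms the loop alone covers the small-input case: with a batch size of 100 and three links, the `range` produces a single iteration, so the early return adds nothing.

```python
# With fewer links than MAX_BATCH_SIZE, range(0, 3, 100) yields exactly
# one start index, so the chunking loop runs once and covers everything.
MAX_BATCH_SIZE = 100
web_view_links = ["link-a", "link-b", "link-c"]
chunks = [
    web_view_links[i : i + MAX_BATCH_SIZE]
    for i in range(0, len(web_view_links), MAX_BATCH_SIZE)
]
```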
Path: `backend/onyx/connectors/google_drive/file_retrieval.py`
Line: 554-561

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
5 issues found across 4 files
Confidence score: 2/5
- There is a high-confidence data-loss/regression risk in `backend/onyx/connectors/google_drive/connector.py`: `resolve_errors` can silently drop unresolved error IDs instead of surfacing `ConnectorFailure`, which can cause failed items to disappear from handling.
- `backend/onyx/connectors/google_drive/file_retrieval.py` has two related high-severity concerns (field-mask/projection mismatch and dropped items on batch exceptions) that can break batched error resolution and violate the resolver contract in user-facing failure paths.
- The test gaps in `backend/tests/daily/connectors/google_drive/test_resolver.py` reduce safety: current assertions can miss missing hierarchy nodes and do not enforce the invalid-link `ConnectorFailure` behavior, making regressions easier to ship.
- Pay close attention to `backend/onyx/connectors/google_drive/connector.py`, `backend/onyx/connectors/google_drive/file_retrieval.py`, and `backend/tests/daily/connectors/google_drive/test_resolver.py`: silent failure/drop behavior and insufficient assertions around resolver error handling.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
```xml
<file name="backend/tests/daily/connectors/google_drive/test_resolver.py">
  <violation number="1" location="backend/tests/daily/connectors/google_drive/test_resolver.py:134">
    P2: This test can pass even when `resolve_errors()` returns no expected hierarchy nodes, because it only validates nodes opportunistically inside the loop and never asserts that the expected IDs were present.
  </violation>
  <violation number="2" location="backend/tests/daily/connectors/google_drive/test_resolver.py:170">
    P2: This test does not verify the invalid-link behavior it describes, because `ConnectorFailure` results are ignored instead of asserted away.
  </violation>
</file>
<file name="backend/onyx/connectors/google_drive/connector.py">
  <violation number="1" location="backend/onyx/connectors/google_drive/connector.py:1697">
    P1: Unresolved error IDs are silently dropped in `resolve_errors`; emit a `ConnectorFailure` (or raise) for IDs missing from the batch response so failed items are not lost. (Based on your team's feedback about logging or warning instead of failing silently.) [FEEDBACK_USED]
  </violation>
</file>
<file name="backend/onyx/connectors/google_drive/file_retrieval.py">
  <violation number="1" location="backend/onyx/connectors/google_drive/file_retrieval.py:554">
    P1: Use `files.get` field masks here; the current `files.list` projection (`nextPageToken, files(...)`) can break batched error resolution.
  </violation>
  <violation number="2" location="backend/onyx/connectors/google_drive/file_retrieval.py:580">
    P1: When a batch request fails (e.g., transient network/auth error), the exception is logged but the failed item is silently dropped from results. Per the `Resolver` interface contract, the caller deletes old `ConnectorFailure`s and replaces them with yielded outputs. This means documents that fail transiently will vanish from both the index and the failure tracker with no way to retry. Consider collecting batch errors and propagating them so `resolve_errors` can yield replacement `ConnectorFailure` objects for items that couldn't be retrieved.
  </violation>
</file>
```
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
🖼️ Visual Regression Report
```python
) -> Generator[Document | ConnectorFailure | HierarchyNode, None, None]:
    """Attempts to yield back ALL the documents described by the errors, no checkpointing.

    Caller's responsibility is to delete the old ConnectorFailures and replace with the new ones.
```
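The contract in this docstring can be sketched with plain sets; the ids are illustrative:

```python
# Hypothetical sketch of the documented contract: the caller deletes the
# ConnectorFailures it submitted for resolution and keeps whatever new
# failures resolve_errors yields in their place.
submitted = {"doc-1", "doc-2"}             # failures passed to resolve_errors
yielded_failures = {"doc-2"}               # new failures yielded back
failure_tracker = {"doc-1", "doc-2", "doc-3"}

failure_tracker -= submitted               # delete the old entries
failure_tracker |= yielded_failures        # record the replacements
```

Under this scheme, "doc-1" leaves the tracker only because it resolved successfully, while "doc-2" stays tracked via its replacement failure.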
this comment doesn't make 100% sense to me. it seems to imply you are meant to return ConnectorFailure, but the typing indicates you can return Document and HierarchyNode. i think it just might need some more detail.
u can return connector failures too!
Description
Add a new interface for connectors to implement, intended to be used for individual indexing error resolution. Implemented the interface for google drive.
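A hedged sketch of what such an interface might look like, inferred from the signatures quoted in the reviews above; the real definition lives in the Onyx codebase:

```python
from typing import Any, Generator, Protocol

# Hypothetical sketch of the Resolver interface this PR describes.
# Parameter names follow resolve_errors(errors, include_permissions=False)
# as quoted in the review; the return generator yields Document,
# HierarchyNode, or ConnectorFailure objects (here typed as Any).
class Resolver(Protocol):
    def resolve_errors(
        self,
        errors: list[Any],                # prior ConnectorFailures to retry
        include_permissions: bool = False,
    ) -> Generator[Any, None, None]:
        ...
```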
How Has This Been Tested?
added connector tests
Additional Options
Summary by cubic
Adds a new `Resolver` interface and Google Drive error resolution to re-fetch failed documents by `webViewLink` using batched Drive API calls. Emits ancestor folders and can sync permissions to reduce API calls and speed up recovery.

New Features

- `Resolver.resolve_errors(errors, include_permissions=False)` to re-process failures and emit `Document` and `HierarchyNode`.
- `GoogleDriveConnector.resolve_errors` with batched `files().get` (100-item chunks), yields per-link failures when fetches fail, emits ancestors before documents, and optionally syncs permissions.
- `get_files_by_web_view_links_batch` using Drive `BatchHttpRequest`; skips invalid links and continues; daily tests cover single/multiple files, invalid links, empty inputs, entity-failure skips, and hierarchy validation.

Bug Fixes

- `files().get` field mask derived from list fields, preventing missing or incorrect metadata.

Written for commit 043df22. Summary will update on new commits.
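As a sketch of the field-mask derivation mentioned under Bug Fixes, assuming the list projection has the shape `nextPageToken, files(...)` quoted in the review; the helper name is hypothetical:

```python
# Hypothetical helper: files.list takes a projection wrapping per-file
# fields as "nextPageToken, files(...)", while files.get wants only the
# inner field list. Strip the wrapper to derive the single-file mask.
def list_fields_to_get_fields(list_fields: str) -> str:
    prefix, suffix = "nextPageToken, files(", ")"
    if list_fields.startswith(prefix) and list_fields.endswith(suffix):
        return list_fields[len(prefix):-len(suffix)]
    return list_fields  # already a plain per-file field list

derived = list_fields_to_get_fields("nextPageToken, files(id, name, webViewLink)")
```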