
Add properties to tests, add additional params for tests, fix broken numbers.#9

Merged
SuppliedOrange merged 2 commits into main from new-patches on May 5, 2025
Conversation

@SuppliedOrange (Owner) commented May 5, 2025

Summary by CodeRabbit

  • New Features

    • Added support for configuring headless mode and page load timeout when running tests.
    • Test commands now accept optional arguments for headless mode and timeout settings.
  • Documentation

    • Updated installation instructions and CLI usage in the README to reflect new test options.
  • Refactor

    • Standardized test function signatures to accept additional properties for enhanced configurability.

@coderabbitai Bot (Contributor) commented May 5, 2025

Walkthrough

The changes introduce enhanced configurability for the web scraper by allowing test runners to specify headless mode and pageLoadTimeout as options. These are threaded through new and updated type definitions, test runner logic, and the scraper implementation. Documentation is updated to reflect the new optional test arguments, and test files are refactored to accept the new properties interface.
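The threading described above can be sketched in isolation. The `TestProperties`, `ScraperConstructor`, and `scraperConstructorProperties` names come from this PR's file summaries; the `resolveScraperOptions` helper is hypothetical, illustrating how defaults (headless on, 15-second timeout, per the review notes below) might be applied rather than showing the exact implementation.

```typescript
// Illustrative sketch of the option threading in this PR.
// resolveScraperOptions is an invented helper, not the real constructor.
interface ScraperConstructor {
    headless?: boolean;
    pageLoadTimeout?: number;
}

interface TestProperties {
    scraperConstructorProperties?: ScraperConstructor;
}

function resolveScraperOptions(props?: TestProperties): Required<ScraperConstructor> {
    const opts = props?.scraperConstructorProperties ?? {};
    return {
        // `!== undefined` keeps an explicit `headless: false`
        headless: opts.headless !== undefined ? opts.headless : true,
        // 15000 ms default, matching the review discussion below
        pageLoadTimeout: opts.pageLoadTimeout ?? 15000,
    };
}

console.log(resolveScraperOptions({ scraperConstructorProperties: { headless: false } }));
```

Tests would construct a `TestProperties` object from the CLI arguments and hand it to each test function, which forwards `scraperConstructorProperties` to the Scraper constructor.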

Changes

File(s) | Change Summary
README.md | Updated installation instructions to mention the new test arguments (--headless, --pageLoadTimeout) and clarified CLI usage wording.
data/numbers.ts | Updated URLs and DOM extraction logic for queries 4, 5, and 6 to match new target websites and structures.
src/types/TestProperties.ts | Added a new TestProperties interface with optional scraperConstructorProperties of type ScraperConstructor.
src/webscraper/Scraper.ts | Extended ScraperConstructor with optional headless and pageLoadTimeout; updated the Scraper class and methods to use these options.
test.ts | Modified the test runner to accept and pass headless and pageLoadTimeout as part of TestProperties to tests.
tests/check_formats.test.ts, tests/check_order.test.ts, tests/check_repeating_domain.test.ts | Updated test function signatures to accept an unused TestProperties parameter.
tests/scrape_and_check.test.ts | Updated the test function to accept and utilize TestProperties, passing options to the Scraper constructor and logging properties.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Test Runner
    participant TestFunction
    participant Scraper

    User->>CLI/Test Runner: Run test with --headless/--pageLoadTimeout
    CLI/Test Runner->>TestFunction: Call with data, {scraperConstructorProperties}
    TestFunction->>Scraper: Instantiate with options (headless, pageLoadTimeout)
    Scraper->>Scraper: Launch browser (headless), set timeout
    Scraper->>Scraper: Fetch page with timeout
    Scraper-->>TestFunction: Return result
    TestFunction-->>CLI/Test Runner: Return test result
    CLI/Test Runner-->>User: Output results

Poem

🐇
A hop and a skip, new options abound,
Headless or not, let your scraper be crowned.
Timeouts are set, so tests run with grace,
Each rabbit dev beams with a smile on their face.
Docs now explain how to pass what you need—
Configurable tests, a rabbit’s good deed!


@coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 4

🔭 Outside diff range comments (2)
tests/check_order.test.ts (1)

1-43: 💡 Verification agent

🧩 Analysis chain

Note the pipeline failure in CI

While your code changes look good, there appears to be a test failure in the CI pipeline. This isn't related to your parameter changes, but you might need to investigate:

Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

🏁 Script executed:

#!/bin/bash
# Let's locate the definition for number 5 in the data to understand what might be failing

# Search for number 5 definition in the data files
echo "Looking for number 5 definition in data files..."
rg -A 10 -B 2 "number: 5" --type ts

# Check if any query functions use innerText on potentially null elements
echo "Checking for potential null element access in query functions..."
rg -A 5 -B 5 "innerText" --type ts | grep -C 2 "number: 5"

Length of output: 1240


Guard against null in number 5’s queryFunction

The CI failure shows that for number 5, document.querySelector(".we-customer-ratings__averages__display") can return null, leading to a TypeError when accessing .innerText. You’ll need to add a null-check or fallback in the queryFunction for number 5:

• Location to update:

  • data/numbers.ts, the object with number: 5

• Suggested diff:

 queryFunction: () => {
-    return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText)
+    const el = document.querySelector(".we-customer-ratings__averages__display");
+    if (!el) {
+        throw new Error("Element .we-customer-ratings__averages__display not found for number 5");
+    }
+    return Math.ceil(el.innerText);
 }

This will prevent the null-access error during page.evaluate and make the test more robust.

data/numbers.ts (1)

1-63: 💡 Verification agent

🧩 Analysis chain

Consider adding retry logic and error handling across all query functions.

None of the query functions have proper error handling for cases where elements aren't found or have unexpected content.

Consider implementing a helper function to standardize error handling across all queries:

function safeQuery(selector: string, defaultValue: string, transform?: (text: string) => string): string {
  const element = document.querySelector(selector);
  if (!element) return defaultValue;
  
  const text = element.innerText;
  return transform ? transform(text) : text;
}

Then update each query function to use this pattern for more consistent error handling.

Please check if the websites being scraped have robots.txt files that might prohibit scraping:


🏁 Script executed:

#!/bin/bash
# Check robots.txt files for the three updated URLs

echo "Checking PokemonDB robots.txt"
curl -s https://pokemondb.net/robots.txt

echo -e "\nChecking Apple App Store robots.txt"
curl -s https://apps.apple.com/robots.txt

echo -e "\nChecking Microsoft robots.txt"
curl -s https://dotnet.microsoft.com/robots.txt

Length of output: 1645


Add standardized error handling and confirm robots.txt compliance

I’ve checked the robots.txt for all three domains—

  • pokemondb.net: /pokedex/* isn’t disallowed (only /pokebase/* is blocked, with a crawl-delay of 2)
  • apps.apple.com: your /us/app/... path is allowed
  • dotnet.microsoft.com: only /download/thank-you/ is disallowed, not /en-us/download/dotnet/6.0

To improve resilience when elements are missing or text is malformed, introduce a helper in, say, src/utils/safeQuery.ts:

export function safeQuery<T>(
  selector: string,
  defaultValue: T,
  transform: (text: string, el: Element) => T = text => (text as unknown as T),
  retries = 1,
  delayMs = 200
): Promise<T> {
  return new Promise((resolve) => {
    const attempt = (n: number) => {
      const el = document.querySelector(selector);
      if (el) {
        try {
          resolve(transform(el.innerText, el));
        } catch {
          resolve(defaultValue);
        }
      } else if (n > 0) {
        setTimeout(() => attempt(n - 1), delayMs);
      } else {
        resolve(defaultValue);
      }
    };
    attempt(retries);
  });
}

Then in data/numbers.ts, refactor each queryFunction:

- queryFunction: () => {
-   return document.querySelector("td").innerText.substring(3,4)
- }
+ queryFunction: () =>
+   safeQuery("td", "0", text => text.substring(3, 4)),

• Wrap all document.querySelector / querySelectorAll calls in safeQuery
• Provide sensible defaults ("0", "", etc.) and optional retry counts
• Centralize error handling for missing nodes or transform failures

🧹 Nitpick comments (4)
src/types/TestProperties.ts (1)

1-7: Consider simplifying interface structure.

The interface definition has unnecessary blank lines (lines 4 and 6) that could be removed for conciseness. The interface structure is correct and properly imports the ScraperConstructor type from the appropriate location.

import { ScraperConstructor } from "../webscraper/Scraper";

export default interface TestProperties {
-
    scraperConstructorProperties?: ScraperConstructor;
-    
}
test.ts (1)

19-21: Review default headless value and consider adding validation for pageLoadTimeout.

The default headless mode is set to true, which is a sensible default. However, there's no validation for pageLoadTimeout which could lead to issues if invalid values are provided.

Consider adding validation for pageLoadTimeout to ensure it's a positive number:

let headless = (["true", "false"].includes(args.headless)) ? args.headless === "true" : true;
-let pageLoadTimeout = args.pageLoadTimeout || null;
+let pageLoadTimeout = args.pageLoadTimeout ? parseInt(args.pageLoadTimeout) : null;
+if (pageLoadTimeout !== null && (isNaN(pageLoadTimeout) || pageLoadTimeout <= 0)) {
+    console.warn("Invalid pageLoadTimeout provided, using default timeout");
+    pageLoadTimeout = null;
+}
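The suggested validation can be exercised standalone; a minimal sketch (the function name is invented for illustration, it is not part of the PR):

```typescript
// Standalone sketch of the suggested pageLoadTimeout validation:
// non-numeric or non-positive values fall back to null (the default).
function parsePageLoadTimeout(raw?: string): number | null {
    if (!raw) return null;
    const parsed = parseInt(raw, 10);
    if (isNaN(parsed) || parsed <= 0) {
        console.warn("Invalid pageLoadTimeout provided, using default timeout");
        return null;
    }
    return parsed;
}

console.log(parsePageLoadTimeout("20000")); // 20000
console.log(parsePageLoadTimeout("abc"));   // null
```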
src/webscraper/Scraper.ts (2)

120-136: Fix timeout error message inconsistency.

The error message mentions a 30-second timeout but the default pageLoadTimeout is set to 15000ms (15 seconds). This inconsistency could be confusing.

if (error.message.includes("Timeout")) {

    this.error(
-        `Timeout of 30s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.`
+        `Timeout of ${this.pageLoadTimeout/1000}s while navigating to ${scrapeQuery.url}. Skipping waiting for page load and executing query function.`
    );

}

122-122: Consider more robust error type checking.

Checking for timeout errors by looking for the string "Timeout" in the error message is brittle and may break if Playwright changes their error message format.

Consider using a more reliable method to check for timeout errors:

-if (error.message.includes("Timeout")) {
+if (error.name === "TimeoutError" || error.message.includes("Timeout")) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aedc95e and 194faf7.

📒 Files selected for processing (9)
  • README.md (1 hunks)
  • data/numbers.ts (1 hunks)
  • src/types/TestProperties.ts (1 hunks)
  • src/webscraper/Scraper.ts (6 hunks)
  • test.ts (3 hunks)
  • tests/check_formats.test.ts (1 hunks)
  • tests/check_order.test.ts (1 hunks)
  • tests/check_repeating_domain.test.ts (1 hunks)
  • tests/scrape_and_check.test.ts (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
tests/check_formats.test.ts (4)
tests/check_order.test.ts (1)
  • data (5-43)
tests/check_repeating_domain.test.ts (1)
  • data (5-60)
src/types/ScrapeQuery.ts (1)
  • ScrapeQuery (1-5)
src/types/TestProperties.ts (1)
  • TestProperties (3-7)
tests/check_order.test.ts (2)
src/types/ScrapeQuery.ts (1)
  • ScrapeQuery (1-5)
src/types/TestProperties.ts (1)
  • TestProperties (3-7)
src/types/TestProperties.ts (1)
src/webscraper/Scraper.ts (1)
  • ScraperConstructor (9-15)
test.ts (3)
tests/scrape_and_check.test.ts (1)
  • data (6-59)
src/types/ScrapeQuery.ts (1)
  • ScrapeQuery (1-5)
src/types/TestProperties.ts (1)
  • TestProperties (3-7)
🪛 GitHub Actions: Webscraped Results
tests/check_formats.test.ts

[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

tests/check_order.test.ts

[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

tests/check_repeating_domain.test.ts

[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

tests/scrape_and_check.test.ts

[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

test.ts

[error] 78-78: Test Failure: Number 5 failed to scrape. Error: Failed to evaluate query function due to TypeError: Cannot read properties of null (reading 'innerText') during page.evaluate.

🔇 Additional comments (20)
tests/check_order.test.ts (1)

2-2: LGTM: TestProperties parameter added correctly

These changes align well with the broader refactoring to support passing optional test configuration properties. The unused parameter is correctly prefixed with an underscore following TypeScript conventions.

Also applies to: 5-5

tests/check_formats.test.ts (1)

2-2: LGTM: TestProperties parameter added consistently

The addition of the TestProperties import and parameter follows the same pattern as in other test files, maintaining consistency throughout the codebase.

Also applies to: 5-5

tests/check_repeating_domain.test.ts (1)

2-2: LGTM: TestProperties parameter added consistently

This file follows the same pattern as the other test files, correctly importing and adding the TestProperties parameter.

Also applies to: 5-5

README.md (2)

18-20: Documentation updated correctly for new parameters

Great job updating the README to document the new test properties functionality. This clearly informs users about the new optional parameters available for tests.


26-26: Minor clarity improvement

This minor wording change improves clarity for users.

test.ts (4)

5-5: LGTM: Appropriate import for TestProperties.

The TestProperties import is correctly added to support the new type being used in test function signatures.


30-30: LGTM: Correct type declaration for test functions.

The Map type has been properly updated to include the optional TestProperties parameter in the function signature.


41-43: LGTM: Updated test function signature to accept optional TestProperties.

The test function signature has been correctly updated to include the optional TestProperties parameter while preserving backward compatibility.


58-63: LGTM: Properly passing scraper configuration properties to test functions.

The test invocation properly constructs and passes TestProperties with scraperConstructorProperties containing the headless and pageLoadTimeout values from command line arguments.

tests/scrape_and_check.test.ts (5)

4-4: LGTM: Appropriate TestProperties import.

The TestProperties import is correctly added to support the new type being used as a parameter.


14-14: LGTM: Useful logging of scraper constructor properties.

Logging the scraper properties provides useful debug information, though it would be better with a null check on properties.


19-19: LGTM: Properly spreading constructor properties into scraper config.

The spread operator correctly applies any provided scraperConstructorProperties to the Scraper constructor, giving tests flexibility to configure scraper behavior.


32-35: LGTM: Improved code readability with whitespace.

The added blank lines improve readability around error handling code.


43-43: LGTM: Improved code readability with whitespace.

The added blank line improves the visual separation between error handling code and the try/catch block.

src/webscraper/Scraper.ts (6)

13-14: LGTM: Good extension of the ScraperConstructor interface.

The interface correctly adds optional headless and pageLoadTimeout properties to allow configurable scraper behavior.


29-30: LGTM: Added private properties with sensible default for pageLoadTimeout.

The private properties are correctly added with a sensible default timeout of 15 seconds.


44-45: LGTM: Updated JSDoc with new parameters.

The JSDoc comment has been properly updated to document the new headless and pageLoadTimeout parameters.


54-55: LGTM: Properly setting properties in constructor.

The constructor correctly sets the headless and pageLoadTimeout properties when provided, with an undefined check for the boolean headless property.
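Why the explicit undefined check matters for a boolean option can be shown in a few lines (a hedged sketch; both helper names are invented):

```typescript
// An explicit undefined check preserves `headless: false`, whereas `||`
// would silently replace any falsy value with the default.
function withDefault(headless: boolean | undefined): boolean {
    return headless !== undefined ? headless : true;
}

function withOr(headless: boolean | undefined): boolean {
    return headless || true; // buggy: an explicit `false` becomes `true`
}

console.log(withDefault(false)); // false: explicit opt-out preserved
console.log(withOr(false));      // true: the bug the undefined check avoids
```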


72-74: LGTM: Properly applying headless mode to browser launch.

The headless configuration is correctly applied when launching the browser.


113-116: LGTM: Properly setting page timeout during navigation.

The pageLoadTimeout is correctly applied to the page navigation options.

+import TestProperties from "../src/types/TestProperties";

-export default async ( data: ScrapeQuery[] ) => {
+export default async ( data: ScrapeQuery[], properties: TestProperties ) => {

🛠️ Refactor suggestion

Add null check for the properties parameter.

The function accepts a properties parameter but doesn't check if it's undefined or null before accessing its properties, which could lead to runtime errors.

-export default async ( data: ScrapeQuery[], properties: TestProperties ) => {
+export default async ( data: ScrapeQuery[], properties?: TestProperties ) => {
+    properties = properties || {};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-export default async ( data: ScrapeQuery[], properties: TestProperties ) => {
+export default async ( data: ScrapeQuery[], properties?: TestProperties ) => {
+    properties = properties || {};
// …rest of implementation…

Comment thread data/numbers.ts
Comment on lines +41 to 44
url: "https://pokemondb.net/pokedex/charmander",
queryFunction: () => {
-    return Array.from(
-        document.querySelectorAll(".user-action-time > span.relativetime") )
-        .map(x => x.innerText.split(" ")[1]
-        ).filter( x => x[0] == 4 )[0][0]
+    return document.querySelector("td").innerText.substring(3,4)
}

🛠️ Refactor suggestion

Add more specific selector for improved robustness.

The current implementation selects the first td element on the page without any specific filtering. This approach is fragile as it could break if the page structure changes or if the order of table cells is modified.

Consider using a more specific selector that targets the exact data you need:

- url: "https://pokemondb.net/pokedex/charmander",
  queryFunction: () => {
-     return document.querySelector("td").innerText.substring(3,4)
+     return document.querySelector("table.vitals-table td:nth-child(1)").innerText.substring(3,4)
  }

Also, add a comment explaining what this substring is extracting for better maintainability.

Committable suggestion skipped: line range outside the PR's diff.

Comment thread data/numbers.ts
Comment on lines +55 to 58
url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0",
queryFunction: () => {
-    return document.getElementById("splash-stars").innerText.split(" ").length + 1
+    return document.querySelector("h1").innerText.substring(14)
}

🛠️ Refactor suggestion

Improve robustness of .NET version extraction.

The current implementation relies on a fixed substring position (14) which is brittle and may break if the heading text changes.

Use a regex or more targeted approach:

url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0",
queryFunction: () => {
-   return document.querySelector("h1").innerText.substring(14)
+   const h1Text = document.querySelector("h1")?.innerText || "";
+   const match = h1Text.match(/\.NET (\d+)\.(\d+)/);
+   return match ? match[1] : "6"; // Extract the major version or use default
}

Suggested change
 url: "https://dotnet.microsoft.com/en-us/download/dotnet/6.0",
 queryFunction: () => {
-    return document.querySelector("h1").innerText.substring(14)
+    const h1Text = document.querySelector("h1")?.innerText || "";
+    const match = h1Text.match(/\.NET (\d+)\.(\d+)/);
+    return match ? match[1] : "6"; // Extract the major version or use default
 }

Comment thread data/numbers.ts
Comment on lines +48 to 51
url: "https://apps.apple.com/us/app/nyan-cat-lost-in-space/id433592592",
queryFunction: () => {
-    let [hp, atk] = document.querySelector("a[href='/dex/sm/pokemon/bewear/']")
-        .parentElement.parentElement.querySelectorAll(".PokemonAltRow-hp, .PokemonAltRow-atk");
-    return atk.querySelector("span").innerText - hp.querySelector("span").innerText
+    return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText)
}

🛠️ Refactor suggestion

Add error handling to the rating extraction.

The current implementation doesn't handle cases where the rating element might not be present or have an unexpected format.

Add error handling and make the selector more specific:

url: "https://apps.apple.com/us/app/nyan-cat-lost-in-space/id433592592",
queryFunction: () => {
-   return Math.ceil(document.querySelector(".we-customer-ratings__averages__display").innerText)
+   const ratingElement = document.querySelector(".we-customer-ratings__averages__display");
+   if (!ratingElement) return '5'; // Default fallback
+   const rating = parseFloat(ratingElement.innerText);
+   return isNaN(rating) ? '5' : Math.ceil(rating).toString();
}

Committable suggestion skipped: line range outside the PR's diff.

@SuppliedOrange merged commit cf2d366 into main on May 5, 2025; 3 checks passed.
@SuppliedOrange deleted the new-patches branch on May 5, 2025 at 23:29.
@SuppliedOrange (Owner, Author) commented:

Closes #8.
Closes #7.

