A small Python project for scraping product cards from a Wildberries catalog page and saving results to CSV.
Current behavior:
- opens a catalog page in Playwright;
- scrolls the page a specified number of times;
- extracts id, name (brand), price, and image link;
- removes duplicates by id;
- saves output to data/products.csv.
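
The open-and-scroll steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual run_browser.py: the function names `scroll_page` and `fetch_html`, the 3000 px scroll step, and the one-second pause are all assumptions.

```python
def scroll_page(page, scrolls: int, pause_ms: int = 1000) -> None:
    """Scroll down `scrolls` times, pausing so lazy-loaded cards can render."""
    for _ in range(scrolls):
        page.mouse.wheel(0, 3000)        # assumed step of 3000 px per scroll
        page.wait_for_timeout(pause_ms)  # assumed pause between scrolls


def fetch_html(url: str, scrolls: int = 1, headless: bool = True) -> str:
    """Open `url` in Chromium, scroll, and return the rendered HTML."""
    # Imported lazily so scroll_page can be exercised without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page()
            page.goto(url)
            scroll_page(page, scrolls)
            return page.content()
        finally:
            browser.close()
```

Keeping the scrolling in its own function makes it possible to check the loop against a stub page object without launching a browser.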
Requirements:
- Python 3.10+
- Playwright (browser automation)
- BeautifulSoup4 (HTML parsing)
Project structure:
scraper/
    main.py           # entry point
    run_browser.py    # browser launch, scrolling, HTML retrieval
    parser.py         # data extraction from HTML
    models.py         # Product model
    preservation.py   # deduplication and CSV writing
data/
    products.csv      # scraping result
- Go to the project folder.
- Create and activate a virtual environment.
- Install dependencies.
- Install a browser for Playwright.
Example for macOS/Linux:
python3 -m venv venv
source venv/bin/activate
pip install playwright beautifulsoup4
playwright install chromium
Run from the project root as a module:
python -m scraper.main
After execution, the output file will be available at data/products.csv.
Each product card includes the following fields:
- id
- name (brand)
- price
- img (image URL)
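
For illustration, a record with these four fields and a matching CSV writer might look like the sketch below. The field types and the `save_csv` helper are assumptions; the real models.py and preservation.py may differ.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class Product:
    id: str
    name: str   # brand
    price: str  # kept as text here; the real model may use a numeric type
    img: str    # image URL


def save_csv(products: list[Product], path: str = "data/products.csv") -> None:
    """Write products to `path`, creating the parent folder if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="", encoding="utf-8") as f:
        # Column order follows the dataclass field order: id, name, price, img.
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(Product)])
        writer.writeheader()
        writer.writerows(asdict(p) for p in products)
```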
By default, the URL and the number of scrolls are set in scraper/main.py:
url = "https://www.wildberries.ge/catalog/obuv/muzhskaya/kedy-i-krossovki"
products = run(url=url, scrolls=1)
You can change url and scrolls as needed.
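
If editing main.py for every run becomes tedious, one option is to read both values from the command line. The sketch below is hypothetical: the `--url` and `--scrolls` flags and the `parse_args` helper are not part of the project as described.

```python
import argparse

DEFAULT_URL = "https://www.wildberries.ge/catalog/obuv/muzhskaya/kedy-i-krossovki"


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the hypothetical --url and --scrolls options."""
    parser = argparse.ArgumentParser(description="Scrape product cards to CSV")
    parser.add_argument("--url", default=DEFAULT_URL, help="catalog page to open")
    parser.add_argument("--scrolls", type=int, default=1, help="number of scrolls")
    return parser.parse_args(argv)


# Possible use inside main.py (run is the project's existing function):
#   args = parse_args()
#   products = run(url=args.url, scrolls=args.scrolls)
```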
- Website markup can change, so selectors in parser.py may need updates.
- With headless=True, the browser runs without a UI. For debugging, you can temporarily set headless=False in run_browser.py.
- Deduplication is done by id; if a product has no id, it may be treated as a separate record.
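
The deduplication note can be illustrated with a short sketch. It assumes records are plain dicts with an "id" key and that records without an id are always kept; the actual preservation.py may work on Product objects and handle missing ids differently.

```python
def deduplicate(products):
    """Keep the first record for each id; records with no id are always kept."""
    seen = set()
    unique = []
    for product in products:
        pid = product.get("id")
        if pid:
            if pid in seen:
                continue  # duplicate id: skip this record
            seen.add(pid)
        unique.append(product)  # first occurrence, or a record without an id
    return unique
```

With this approach, two cards that both lack an id end up as two separate rows, which matches the behavior described in the last note.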