An end-to-end Python data pipeline that extracts, normalizes, caches, and analyzes structured bibliographic data from live web sources.
This project grew out of a passion for cinema, thrillers, and immersive novels. The core inspiration is a desire to analyze master storytelling structures, focusing on the prolific literary output of Stephen King. The data engineering and machine learning framework maps historical publishing behavior and predicts future book characteristics or author metrics from historical publication trends.
Instead of relying on static, pre-packaged datasets, the project covers the full data lifecycle: raw web scraping, automated data quality cleaning, predictive machine learning modeling, and terminal-based inference.
- Goal: Track Stephen King's 50-year career trajectory (1974–2024).
- Target: Predict future book lengths by analyzing publication year, publisher ecosystems, and book formats.
- Scraping Engine: Built automated extraction pipelines using BeautifulSoup.
- Context Tracking: Dynamically captures Wikipedia section headers to tag book formats automatically.
- Server Protection: Implemented local raw caching to prevent redundant server hits.
- Noise Filtering: Advanced regex layers strip out embedded ISBN strings and bracketed citations.
- Anomaly Recovery: Structural cross-checks detect table column shifts and auto-correct misplaced page numbers or years.
- Statistical Profiling: Isolated volume outliers and evaluated length spreads across categories.
- Matrix Encoding: Transformed Publisher and Book Type features using One-Hot Encoding.
- Variance Control: Automatically groups rare categories into an Other baseline to stabilize arrays.
- Algorithm: Core engine uses an ensemble Random Forest Regressor for robust tabular predictions.
- Validation Layer: Replaced volatile single splits with a rigorous 5-Fold Cross-Validation dashboard.
- Metrics: Evaluates real generalized performance using Mean Absolute Error (MAE) and R² scores.
- Stack: Generated publication-quality analytical plots via matplotlib and seaborn.
- Visual Assets: Renders publication velocity timelines, novel length evolution trends, and publisher market share charts.
- Findings: Confirmed consistent output velocity, quantified a clear structural shift toward higher page counts over time, and mapped publisher reliance patterns.
- Core Language & Runtime: Python 3.10+
- Machine Learning & Modeling: scikit-learn (Random Forest, Cross-Validation, Regression Metrics, joblib)
- Data Engineering & Analysis: pandas (DataFrames, One-Hot Encoding, Data Wrangling), numpy (Numerical Operations, Matrix Manipulation)
- Web Scraping & Connectivity: requests (HTTP Client), beautifulsoup4 (HTML Parsing), re (Regular Expressions for Data Cleaning)
- Data Visualization & Analytics: matplotlib (Base Plotting Engine), seaborn (Statistical Visualizations)
- Storage & Serialization: json (Local Data Cache)
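
The Noise Filtering and Anomaly Recovery items in the feature list above could be sketched roughly as follows. This is a minimal illustration, not the code in pipeline.py: the patterns, the function names `clean_cell` and `repair_year_pages`, and the assumption that embedded ISBNs are introduced by the literal text "ISBN" are all hypothetical.

```python
import re

# Hypothetical patterns; the project's actual expressions are not shown in this README.
CITATION_RE = re.compile(r"\[[^\]]*\]")  # bracketed citations such as [12] or [a]
ISBN_RE = re.compile(
    r"\bISBN(?:-1[03])?[:\s]*"           # literal "ISBN" prefix (assumed)
    r"(?:97[89](?:[- ]?\d){10}"          # 13-digit form
    r"|\d(?:[- ]?\d){8}[- ]?[\dXx])\b",  # 10-digit form, check digit may be X
    re.IGNORECASE,
)


def clean_cell(text: str) -> str:
    """Strip bracketed citations and embedded ISBN strings, then collapse whitespace."""
    text = CITATION_RE.sub("", text)
    text = ISBN_RE.sub("", text)
    return " ".join(text.split())


def repair_year_pages(year: int, pages: int,
                      first_year: int = 1974, last_year: int = 2024) -> tuple:
    """Swap the two values when only the swapped order looks plausible.

    A simple heuristic for a shifted table column: a publication year must
    fall inside the career window, and a page count is assumed not to.
    """
    def is_year(value: int) -> bool:
        return first_year <= value <= last_year

    if not is_year(year) and is_year(pages):
        return pages, year
    return year, pages
```

The year window comes from the 1974–2024 range stated in this README; a real cross-check would likely also compare against neighbouring rows before correcting anything.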
📁 book-data-extractor/
│
├── 📁 scraped_data/ # Pipeline destination (Final clean data outputs)
├── 📁 models/ # Production Directory: Serialized ML binaries and structural features
├── 📁 graph images/ # Data Visualization Assets: Generated EDA plots and analysis charts
│
├── 📄 pipeline.py # Production Script: Automated web scraping, regex filtering, and ETL pipeline
├── 📄 train_model.py # Production Script: One-hot feature transformation, 5-Fold CV, and model trainer
├── 📄 predict.py # Production Script: Real-time terminal dashboard for interactive user inference
├── 📄 retrieve_data.json # Data Cache: Final clean dataset payload generated by the scraping pipeline
├── 📄 LICENSE.md # Project distribution license rules
└── 📄 README.md # Project architecture and analytical summary
The final cleaned dataset contains 63 unique book records structured according to the following relational schema:
| Column Name | Data Type | Key Type | Nullable | Description |
|---|---|---|---|---|
| id | Integer | Primary Key | No | Unique identifier assigned to each book to enforce database row integrity. |
| Title | String | - | No | The official published title of the literary work. |
| Year | Integer | - | No | The original calendar year of publication (ranges from 1974 to 2024). |
| Publisher | String | - | No | The distributing corporate publishing house name. |
| ISBN | String | Unique Key | Yes | Universal International Standard Book Number used for retail tracking. |
| Pages | Integer | - | No | Total physical page count of the book's standard print edition. |
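
As a rough illustration of how the schema constraints could be checked in Python, here is a small sketch. The `BookRecord` class and `validate` function are hypothetical names introduced for this example and are not part of the repository; the rules simply mirror the table (unique id, unique ISBN when present, required fields, year within 1974–2024).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BookRecord:
    """One row of the cleaned dataset (hypothetical in-memory representation)."""
    id: int
    title: str
    year: int
    publisher: str
    pages: int
    isbn: Optional[str] = None  # the only nullable column in the schema


def validate(records: List[BookRecord]) -> List[str]:
    """Return human-readable schema violations; an empty list means all rows pass."""
    problems = []
    seen_ids, seen_isbns = set(), set()
    for rec in records:
        if rec.id in seen_ids:
            problems.append(f"duplicate id {rec.id}")
        seen_ids.add(rec.id)
        if rec.isbn is not None:
            if rec.isbn in seen_isbns:
                problems.append(f"duplicate ISBN {rec.isbn}")
            seen_isbns.add(rec.isbn)
        if not rec.title or not rec.publisher:
            problems.append(f"id {rec.id}: missing title or publisher")
        if not 1974 <= rec.year <= 2024:
            problems.append(f"id {rec.id}: year {rec.year} outside 1974-2024")
        if rec.pages <= 0:
            problems.append(f"id {rec.id}: non-positive page count")
    return problems
```

Running a check like this after scraping gives an early signal if a table column shift slipped past the cleaning step.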
These charts track individual annual book release counts to illustrate continuous narrative output and publishing frequency shifts across a 50-year timeline.
See MLOps_Architecture for the architecture diagram, written in Mermaid.
To run this project locally, your environment needs to have Python installed along with the required web scraping, data engineering, and data visualization dependencies.
- Python Runtime: Python 3.10 or higher is recommended.
- Package Manager: pip (comes bundled with Python installations).
The core dependencies are split into three modules:
- Networking & Parsing: requests, beautifulsoup4, lxml
- Data Processing: pandas, numpy
- Data Visualization: matplotlib, seaborn, scikit-learn, notebook
Follow these terminal commands to set up the environment on your machine:
python pipeline.py
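
The pipeline uses local raw caching to avoid redundant server hits. A minimal sketch of that idea is shown below; the function name `load_or_fetch`, the cache path, and the injected `fetch` callable are assumptions for illustration, not the actual contents of pipeline.py.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], str]) -> str:
    """Return cached raw text if present; otherwise call fetch() once and store it."""
    if cache_path.exists():
        # Cache hit: no network request is made.
        return cache_path.read_text(encoding="utf-8")
    raw = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(raw, encoding="utf-8")
    return raw
```

In practice `fetch` would wrap the HTTP call, for example a lambda around `requests.get(url, timeout=10).text`. Passing it in as a callable keeps the caching logic testable without touching the network.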
Train your Random Forest Regressor and evaluate predictive metrics on your clean dataset:
python train_model.py
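
Before training, rare categories are grouped into an Other baseline and the categorical features are one-hot encoded. The dependency-free sketch below only illustrates the idea; the README lists pandas for the actual encoding, and the `min_count` threshold and the function names `group_rare` and `one_hot` are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def group_rare(values: List[str], min_count: int = 2, other: str = "Other") -> List[str]:
    """Replace categories seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted category names and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows
```

Grouping first keeps the encoded matrix narrow: a publisher that appears once would otherwise add a column that is almost entirely zeros, which gives the model little to learn from on a dataset of this size.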
Launch the user prediction panel to query live book-length predictions through interactive parameter menus:
python predict.py
To view the full visual data discovery process and plot historical graphs step-by-step:
jupyter notebook
Open requests_data.ipynb from the browser dashboard and run cells sequentially.
- Priyanshu Vijay - Data Engineer & ML Analyst - Butkii025
The data pipeline code is licensed under the MIT License. If you use the cleaned dataset (data.csv) generated by this repository for further machine learning or statistical analysis, please attribute it as follows:
Vijay, P. (2026). Book Data Extractor Dataset (Version 1.0) [Data set]. GitHub.