📚 book-data-extractor

An end-to-end Python data pipeline designed to programmatically extract, normalize, cache, and analyze structured bibliographical data from live web sources.

🎯 Project Objective

This project grew out of a personal passion for cinema, thrillers, and immersive novels, and a wish to study how a master storyteller works, using Stephen King's prolific output as the case study. The pipeline maps his publishing history, extracts structured records for each book, and predicts characteristics of future books (chiefly page count) from historical publication trends.

Instead of relying on a static, pre-packaged dataset, the project covers the full data lifecycle: web scraping, automated data cleaning, predictive modeling, and terminal-based inference.

🏗️ The Enhanced Data Science Framework

1. Define the Problem

Goal: Track Stephen King's 50-year career trajectory (1974–2024).
Target: Predict future book lengths by analyzing publication year, publisher ecosystems, and book formats.

2. Data Acquisition & Ingestion

Scraping Engine: Built automated extraction pipelines using BeautifulSoup.
Context Tracking: Dynamically captures Wikipedia section headers to tag book formats automatically.
Server Protection: Implemented local raw caching to prevent redundant server hits.
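
The caching idea above can be sketched with the standard library alone. This is a minimal illustration, not the code in pipeline.py: the function name `load_or_fetch`, the single-file JSON cache layout, and the injected `fetch` callable are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch(url: str, cache_path: Path, fetch: Callable[[str], str]) -> str:
    """Return the raw page for `url`, calling `fetch` only on a cache miss.

    Hypothetical helper: the cache is one JSON file mapping URL -> raw HTML.
    """
    cached: dict[str, str] = {}
    if cache_path.exists():
        cached = json.loads(cache_path.read_text(encoding="utf-8"))
        if url in cached:
            return cached[url]  # cache hit: no request is made

    html = fetch(url)  # cache miss: exactly one real request
    cached[url] = html
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cached), encoding="utf-8")
    return html
```

In a real run, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`; passing it in as a parameter keeps the cache logic testable without network access.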

3. Data Cleaning & Preprocessing

Noise Filtering: Advanced regex layers strip out embedded ISBN strings and bracketed citations.
Anomaly Recovery: Structural cross-checks detect table column shifts and auto-correct misplaced page numbers or years.
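
As a rough sketch of the two cleaning steps above, the snippet below strips bracketed citations and embedded ISBN strings from a table cell, then swaps year and page values when they appear to have landed in each other's columns. The patterns, the function names `clean_cell` and `fix_shift`, and the swap rule are illustrative assumptions; the actual rules in pipeline.py may differ.

```python
import re

# Assumed patterns for illustration only.
CITATION = re.compile(r"\[[^\]]*\]")  # e.g. [12] or [a]
ISBN = re.compile(r"ISBN(?:-1[03])?:?\s*[\dXx][\dXx-]{8,16}", re.IGNORECASE)
VALID_YEARS = range(1974, 2025)  # 1974..2024 inclusive, per the dataset schema


def clean_cell(text: str) -> str:
    """Remove citations and ISBN strings, then collapse whitespace."""
    text = CITATION.sub("", text)
    text = ISBN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def to_int(text: str) -> int:
    """Parse a cleaned numeric cell such as '1,074' or '1986[a]'."""
    return int(clean_cell(text).replace(",", ""))


def fix_shift(year_cell: str, pages_cell: str) -> tuple[int, int]:
    """Return (year, pages), swapping them if the columns look shifted."""
    year, pages = to_int(year_cell), to_int(pages_cell)
    if year not in VALID_YEARS and pages in VALID_YEARS:
        year, pages = pages, year  # the year landed in the pages column
    return year, pages
```

For example, `fix_shift("447", "1986[a]")` returns `(1986, 447)`, while a row that is already in order is left unchanged.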

4. Exploratory Data Analysis (EDA) & Feature Engineering

Statistical Profiling: Isolated volume outliers and evaluated length spreads across categories.
Matrix Encoding: Transformed Publisher and Book Type features using One-Hot Encoding.
Variance Control: Automatically groups rare categories into a single Other category to keep the encoded feature matrix compact and stable.
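
To make the encoding steps concrete, here is a small plain-Python sketch of rare-category grouping followed by one-hot encoding. The project does this with pandas; the standard-library version below only illustrates the idea, and the names `group_rare`, `one_hot`, and the `min_count` threshold are assumptions.

```python
from collections import Counter


def group_rare(values: list[str], min_count: int = 2, other: str = "Other") -> list[str]:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return (sorted column names, one 0/1 row per input value)."""
    columns = sorted(set(values))
    rows = [[int(v == c) for c in columns] for v in values]
    return columns, rows


publishers = ["Viking", "Scribner", "Viking", "Doubleday", "Scribner", "Grant"]
grouped = group_rare(publishers)
columns, rows = one_hot(grouped)
print(columns)  # ['Other', 'Scribner', 'Viking']
print(rows[3])  # [1, 0, 0]  (Doubleday was folded into Other)
```

Grouping before encoding means a publisher that appears only once does not get its own nearly empty column. With pandas, `pandas.get_dummies` would typically take the place of `one_hot`.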

5. Predictive Modeling Engine

Algorithm: Core engine uses an ensemble Random Forest Regressor for robust tabular predictions.
Validation Layer: Replaced volatile single splits with a rigorous 5-Fold Cross-Validation dashboard.
Metrics: Evaluates real generalized performance using Mean Absolute Error (MAE) and R² scores.
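
The validation scheme and metrics above can be illustrated without any libraries. The project relies on scikit-learn for these; the functions below (`k_fold_indices`, `mae`, `r2`) are hypothetical stand-ins that show what is being computed, with unshuffled contiguous folds as a simplifying assumption.

```python
def k_fold_indices(n: int, k: int = 5) -> list[tuple[list[int], list[int]]]:
    """Split range(n) into k contiguous folds; each fold is the test set once."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean Absolute Error: average size of the prediction error, in pages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: list[float], y_pred: list[float]) -> float:
    """R²: 1 minus residual sum of squares over total sum of squares.

    Assumes y_true is not constant (otherwise the denominator is zero).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

With 63 records and 5 folds, the test sets have sizes 13, 13, 13, 12, and 12, so every book is held out exactly once. Averaging MAE and R² over the folds gives a steadier estimate than a single train/test split on a dataset this small.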

6. Data Visualization & Storytelling

Stack: Generated publication-quality analytical plots via matplotlib and seaborn.
Visual Assets: Renders publication velocity timelines, novel length evolution trends, and publisher market share charts.

7. Interpretation & Insights

Findings: Confirmed consistent output velocity, quantified a clear structural shift toward higher page counts over time, and mapped publisher reliance patterns.

🛠️ Tech Stack & Dependencies

Core Language & Runtime: Python 3.10+
Machine Learning & Modeling: scikit-learn (Random Forest, Cross-Validation, Regression Metrics), joblib (Model Serialization)
Data Engineering & Analysis: pandas (DataFrames, One-Hot Encoding, Data Wrangling), numpy (Numerical Operations, Matrix Manipulation)
Web Scraping & Connectivity: requests (HTTP Client), beautifulsoup4 (HTML Parsing), re (Regular-Expression Data Cleaning)
Data Visualization & Analytics: matplotlib (Base Plotting Engine), seaborn (Statistical Visualizations)
Storage & Serialization: json (Local Data Cache)

📂 Repository Architecture

📁 book-data-extractor/
│
├── 📁 scraped_data/       # Pipeline destination (Final clean data outputs)
├── 📁 models/             # Production Directory: Serialized ML binaries and structural features
├── 📁 graph images/       # Data Visualization Assets: Generated EDA plots and analysis charts
│
├── 📄 pipeline.py         # Production Script: Automated web scraping, regex filtering, and ETL pipeline
├── 📄 train_model.py      # Production Script: One-hot feature transformation, 5-Fold CV, and model trainer
├── 📄 predict.py          # Production Script: Real-time terminal dashboard for interactive user inference
├── 📄 retrieve_data.json  # Data Cache: Final clean dataset payload generated by the scraping pipeline
├── 📄 LICENSE.md          # Project distribution license rules
└── 📄 README.md           # Project architecture and analytical summary

📊 Extracted Dataset Schema

The final cleaned dataset contains 63 unique book records structured according to the following relational schema:

| Column Name | Data Type | Key Type | Nullable | Description |
| --- | --- | --- | --- | --- |
| `id` | `Integer` | Primary Key | No | Unique identifier assigned to each book to enforce database row integrity. |
| `Title` | `String` | - | No | The official published title of the literary work. |
| `Year` | `Integer` | - | No | The original calendar year of publication (ranges from 1974 to 2024). |
| `Publisher` | `String` | - | No | The distributing corporate publishing house name. |
| `ISBN` | `String` | Unique Key | Yes | Universal International Standard Book Number used for retail tracking. |
| `Pages` | `Integer` | - | No | Total physical page count of the book's standard print edition. |
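
The schema above can also be expressed as a typed record with a small integrity check. This is a sketch under assumptions: the `BookRecord` class and `validate` function are hypothetical and not part of the repository, and the sample rows are placeholders rather than real dataset entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BookRecord:
    """One row of the dataset; field names mirror the schema columns."""
    id: int
    Title: str
    Year: int
    Publisher: str
    ISBN: Optional[str]  # the only nullable column
    Pages: int


def validate(records: list[BookRecord]) -> list[str]:
    """Return a list of human-readable schema violations (empty if clean)."""
    problems: list[str] = []
    seen_ids: set[int] = set()
    seen_isbns: set[str] = set()
    for r in records:
        if r.id in seen_ids:
            problems.append(f"duplicate id {r.id}")
        seen_ids.add(r.id)
        if r.ISBN is not None:
            if r.ISBN in seen_isbns:
                problems.append(f"duplicate ISBN {r.ISBN}")
            seen_isbns.add(r.ISBN)
        if not r.Title or not r.Publisher:
            problems.append(f"id {r.id}: empty Title or Publisher")
        if not 1974 <= r.Year <= 2024:
            problems.append(f"id {r.id}: Year {r.Year} outside 1974-2024")
        if r.Pages <= 0:
            problems.append(f"id {r.id}: non-positive Pages {r.Pages}")
    return problems


# Placeholder rows, not real data.
sample = [
    BookRecord(1, "Example A", 1980, "Example Press", "ISBN-A", 300),
    BookRecord(2, "Example B", 1990, "Example Press", None, 450),
    BookRecord(2, "Example C", 2030, "Example Press", "ISBN-A", 0),
]
for line in validate(sample):
    print(line)
```

Running the check on the placeholder rows reports the duplicate id, the duplicate ISBN, the out-of-range year, and the non-positive page count for the third record.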

Visual Insights Gallery

These charts track individual annual book release counts to illustrate continuous narrative output and publishing frequency shifts across a 50-year timeline.

More extracted graphs ➡️

Flowchart

See MLOps_Architecture for the pipeline flowchart, written in Mermaid.

⚙️ Prerequisites & Local Setup

To run this project locally, your environment needs to have Python installed along with the required web scraping, data engineering, and data visualization dependencies.

1. System Requirements

Python Runtime: Python 3.10 or higher is recommended.
Package Manager: pip (comes bundled with Python installations).

2. Required Python Libraries

The core dependencies fall into three groups:

Networking & Parsing: requests, beautifulsoup4, lxml
Data Processing: pandas, numpy
Visualization, Modeling & Notebooks: matplotlib, seaborn, scikit-learn, notebook

🚀 Running the Production Pipeline

Once the dependencies are installed, run these terminal commands in order from the repository root:

Step 1: Run the ETL Pipeline Script

python pipeline.py

Step 2: Execute the Machine Learning Layer

Train your Random Forest Regressor and evaluate predictive metrics on your clean dataset:

python train_model.py

Step 3: Run Interactive Live Inference Terminal

Launch the user prediction panel to query live book-length predictions through interactive parameter menus:

python predict.py

Step 4: Interactive Workspace

To view the full visual data discovery process and plot historical graphs step-by-step:

jupyter notebook

Open requests_data.ipynb from the browser dashboard and run cells sequentially.

⚫ Authors & Credits

Priyanshu Vijay - Data Engineer & ML Analyst - Butkii025

📄 License & Attribution

The data pipeline code is licensed under the MIT License. If you use the cleaned dataset (data.csv) generated by this repository for further machine learning or statistical analysis, please attribute it as follows:

Vijay, P. (2026). Book Data Extractor Dataset (Version 1.0) [Data set]. GitHub.
