📚 book-data-extractor

An end-to-end Python data pipeline designed to programmatically extract, normalize, cache, and analyze structured bibliographical data from live web sources.


🎯 Project Objective

This project grew out of a passion for cinema, thrillers, and immersive novels. It was inspired by a desire to analyze how master storytellers structure their work, focusing on the prolific literary output of Stephen King. The data engineering and machine learning framework maps historical publishing behavior and predicts future book characteristics or author metrics from historical publication trends.

Instead of relying on static, pre-packaged datasets, this project showcases a professional End-to-End Data Lifecycle moving from raw web scraping to predictive machine learning modeling, terminal-based inference, and automated data quality cleaning.


🏗️ The Enhanced Data Science Framework

1. Define the Problem

  • Goal: Track Stephen King's 50-year career trajectory (1974–2024).
  • Target: Predict future book lengths by analyzing publication year, publisher ecosystems, and book formats.

2. Data Acquisition & Ingestion

  • Scraping Engine: Built automated extraction pipelines using BeautifulSoup.
  • Context Tracking: Dynamically captures Wikipedia section headers to tag book formats automatically.
  • Server Protection: Implemented local raw caching to prevent redundant server hits.
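
The caching and section-tracking ideas above can be sketched roughly as follows. This is a minimal illustration, not the code in pipeline.py: the function names `load_or_fetch` and `tag_rows_with_section`, the JSON cache layout, and the sample HTML are all assumptions made for the example. The network call is passed in as a callable so the sketch runs offline.

```python
import json
from pathlib import Path
from typing import Callable

from bs4 import BeautifulSoup


def load_or_fetch(url: str, cache_path: Path, fetch: Callable[[str], str]) -> str:
    """Return cached HTML for `url` if present; otherwise fetch once and cache it."""
    cache = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    if url not in cache:
        cache[url] = fetch(url)  # the only place a server request can happen
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache[url]


def tag_rows_with_section(html: str) -> list[dict]:
    """Walk headings and tables in document order, tagging each data row
    with the most recent heading (used here as a hypothetical book type)."""
    soup = BeautifulSoup(html, "html.parser")
    current_section = "Unknown"
    rows = []
    for node in soup.find_all(["h2", "h3", "table"]):
        if node.name in ("h2", "h3"):
            current_section = node.get_text(strip=True)
            continue
        for tr in node.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # header rows contain only <th> cells, so they are skipped
                rows.append({"book_type": current_section, "cells": cells})
    return rows


# Illustrative markup only; the real page structure may differ.
SAMPLE_HTML = """
<h2>Novels</h2>
<table><tr><th>Title</th><th>Year</th></tr>
<tr><td>Carrie</td><td>1974</td></tr></table>
<h2>Collections</h2>
<table><tr><td>Night Shift</td><td>1978</td></tr></table>
"""

if __name__ == "__main__":
    for row in tag_rows_with_section(SAMPLE_HTML):
        print(row)
```

In a real run, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, so that a second call with the same URL reads from the cache file instead of hitting the server again.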

3. Data Cleaning & Preprocessing

  • Noise Filtering: Advanced regex layers strip out embedded ISBN strings and bracketed citations.
  • Anomaly Recovery: Structural cross-checks detect table column shifts and auto-correct misplaced page numbers or years.
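
As a rough sketch of what such cleaning rules might look like, the example below strips bracketed citations and hyphenated ISBN strings from a table cell and swaps a year and page count that appear transposed. The patterns, the helper names `clean_cell` and `fix_year_pages`, and the swap heuristic are assumptions for illustration; the project's actual rules may be more elaborate.

```python
import re

# Bracketed citation markers such as [1] or [citation needed].
CITATION_RE = re.compile(r"\[[^\]]*\]")

# "ISBN" followed by 10 or 13 digits with optional hyphens (assumed format).
ISBN_RE = re.compile(
    r"\bISBN(?:-1[03])?:?\s*[0-9](?:-?[0-9Xx]){9,12}",
    re.IGNORECASE,
)

# Year range taken from the dataset description (1974 to 2024).
MIN_YEAR, MAX_YEAR = 1974, 2024


def clean_cell(text: str) -> str:
    """Remove citation markers and embedded ISBNs, then collapse whitespace."""
    text = CITATION_RE.sub("", text)
    text = ISBN_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def fix_year_pages(year: int, pages: int) -> tuple[int, int]:
    """Swap the two values when the 'year' is out of range but the
    'pages' value looks like a plausible year (a simple heuristic)."""
    year_ok = MIN_YEAR <= year <= MAX_YEAR
    pages_looks_like_year = MIN_YEAR <= pages <= MAX_YEAR
    if not year_ok and pages_looks_like_year:
        return pages, year
    return year, pages


if __name__ == "__main__":
    print(clean_cell("Doubleday[1] ISBN 978-0-385-08695-0"))
    print(fix_year_pages(199, 1974))
```

Note that the citation pattern removes any bracketed text, so a title that legitimately contains square brackets would need a narrower rule.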

4. Exploratory Data Analysis (EDA) & Feature Engineering

  • Statistical Profiling: Isolated volume outliers and evaluated length spreads across categories.
  • Matrix Encoding: Transformed Publisher and Book Type features using One-Hot Encoding.
  • Variance Control: Automatically groups rare categories into an Other baseline so the encoded feature matrix stays compact and stable.
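
The encoding steps above might be sketched with pandas as shown below. The `group_rare` helper, the `min_count` threshold, and the toy rows are hypothetical; the actual threshold used in train_model.py is not stated in this README.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int = 2, other: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


# Toy rows for illustration only, not the scraped dataset.
df = pd.DataFrame(
    {
        "Year": [1974, 1975, 1980, 1983, 1999, 2001],
        "Publisher": ["Doubleday", "Doubleday", "Viking", "Viking", "Scribner", "Grant"],
    }
)

df["Publisher"] = group_rare(df["Publisher"])

# One indicator column per remaining category; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["Publisher"], dtype=int)

if __name__ == "__main__":
    print(encoded)
```

Grouping before encoding keeps single-occurrence publishers from each producing a column that is almost entirely zeros.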

5. Predictive Modeling Engine

  • Algorithm: Core engine uses an ensemble Random Forest Regressor for robust tabular predictions.
  • Validation Layer: Replaced a single train/test split, whose score varies with the chosen split, with 5-fold cross-validation.
  • Metrics: Evaluates generalization performance using Mean Absolute Error (MAE) and R² scores.
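
A minimal sketch of this evaluation setup with scikit-learn is shown below. The features and targets are synthetic stand-ins generated for the example, and the hyperparameters are assumptions rather than the values used in train_model.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data: a year column plus one binary indicator column.
rng = np.random.default_rng(0)
years = rng.integers(1974, 2025, size=60)
is_novel = rng.integers(0, 2, size=60)
X = np.column_stack([years, is_novel])
y = 300 + 5 * (years - 1974) + 150 * is_novel + rng.normal(0, 30, size=60)

model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports MAE as a negated score so that higher is always better.
scores = cross_validate(
    model, X, y, cv=cv, scoring=("neg_mean_absolute_error", "r2")
)
mae_per_fold = -scores["test_neg_mean_absolute_error"]
r2_per_fold = scores["test_r2"]

if __name__ == "__main__":
    print(f"MAE per fold: {np.round(mae_per_fold, 1)}")
    print(f"Mean MAE: {mae_per_fold.mean():.1f} pages")
    print(f"Mean R^2: {r2_per_fold.mean():.3f}")
```

Reporting the per-fold values alongside the mean makes it easier to see how much the score depends on which books land in the held-out fold, which matters for a dataset of only a few dozen rows.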

6. Data Visualization & Storytelling

  • Stack: Generated publication-quality analytical plots via matplotlib and seaborn.
  • Visual Assets: Renders publication velocity timelines, novel length evolution trends, and publisher market share charts.
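
One way a length-evolution plot could be produced is sketched below using matplotlib only. The function name `plot_length_trend`, the linear trend line, and the sample numbers are assumptions for illustration; the project's own charts may use seaborn and different styling.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np


def plot_length_trend(years, pages, out_path=None):
    """Scatter page counts by year and overlay a least-squares trend line.

    Returns the figure and the fitted slope in pages per year.
    """
    years = np.asarray(years, dtype=float)
    pages = np.asarray(pages, dtype=float)
    slope, intercept = np.polyfit(years, pages, deg=1)

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(years, pages, label="Books")
    ax.plot(
        years,
        slope * years + intercept,
        color="tab:red",
        label=f"Linear trend ({slope:+.1f} pages/year)",
    )
    ax.set_xlabel("Publication year")
    ax.set_ylabel("Pages")
    ax.set_title("Book length over time")
    ax.legend()
    fig.tight_layout()
    if out_path is not None:
        fig.savefig(out_path, dpi=150)
    return fig, slope


if __name__ == "__main__":
    # Made-up values chosen to lie on a straight line.
    fig, slope = plot_length_trend([1974, 1984, 1994, 2004], [200, 300, 400, 500])
    print(f"slope: {slope:.2f} pages/year")
```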

7. Interpretation & Insights

  • Findings: Confirmed consistent output velocity, quantified a clear structural shift toward higher page counts over time, and mapped publisher reliance patterns.

🛠️ Tech Stack & Dependencies

  • Core Language & Runtime: Python 3.10+
  • Machine Learning & Modeling: scikit-learn (Random Forest, Cross-Validation, Regression Metrics, joblib)
  • Data Engineering & Analysis: pandas (DataFrames, One-Hot Encoding, Data Wrangling), numpy (numerical operations, matrix manipulation)
  • Web Scraping & Connectivity: requests (HTTP Client), beautifulsoup4 (HTML parsing), re (regular-expression data cleaning)
  • Data Visualization & Analytics: matplotlib (Base Plotting Engine), seaborn (Statistical Visualizations)
  • Storage & Serialization: json (local data cache)

📂 Repository Architecture

📁 book-data-extractor/
│
├── 📁 scraped_data/       # Pipeline destination (Final clean data outputs)
├── 📁 models/             # Production Directory: Serialized ML binaries and structural features
├── 📁 graph images/       # Data Visualization Assets: Generated EDA plots and analysis charts
│
├── 📄 pipeline.py         # Production Script: Automated web scraping, regex filtering, and ETL pipeline
├── 📄 train_model.py      # Production Script: One-hot feature transformation, 5-Fold CV, and model trainer
├── 📄 predict.py          # Production Script: Real-time terminal dashboard for interactive user inference
├── 📄 retrieve_data.json  # Data Cache: Final clean dataset payload generated by the scraping pipeline
├── 📄 LICENSE.md          # Project distribution license rules
└── 📄 README.md           # Project architecture and analytical summary

📊 Extracted Dataset Schema

The final cleaned dataset contains 63 unique book records structured according to the following relational schema:

| Column Name | Data Type | Key Type | Nullable | Description |
| --- | --- | --- | --- | --- |
| id | Integer | Primary Key | No | Unique identifier assigned to each book to enforce database row integrity. |
| Title | String | - | No | The official published title of the literary work. |
| Year | Integer | - | No | The original calendar year of publication (ranges from 1974 to 2024). |
| Publisher | String | - | No | The distributing corporate publishing house name. |
| ISBN | String | Unique Key | Yes | Universal International Standard Book Number used for retail tracking. |
| Pages | Integer | - | No | Total physical page count of the book's standard print edition. |
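
The constraints in this schema could be checked with a small validator along the following lines. This is a sketch under assumptions: the `validate_records` name and the list-of-dicts record layout are hypothetical, and the year bounds are taken from the Year description above.

```python
from typing import Any

# Non-nullable columns and their expected Python types.
REQUIRED_FIELDS = {"id": int, "Title": str, "Year": int, "Publisher": str, "Pages": int}


def validate_records(records: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    seen_ids = set()
    seen_isbns = set()
    for i, rec in enumerate(records):
        # Non-nullable columns must be present with the expected type.
        for field, expected in REQUIRED_FIELDS.items():
            if not isinstance(rec.get(field), expected):
                problems.append(f"row {i}: {field} missing or not {expected.__name__}")
        # Primary key: id must be unique.
        if rec.get("id") in seen_ids:
            problems.append(f"row {i}: duplicate id {rec.get('id')}")
        seen_ids.add(rec.get("id"))
        # Unique key: ISBN is nullable, but must be unique when present.
        isbn = rec.get("ISBN")
        if isbn is not None:
            if isbn in seen_isbns:
                problems.append(f"row {i}: duplicate ISBN {isbn}")
            seen_isbns.add(isbn)
        # Year must fall inside the documented range.
        year = rec.get("Year")
        if isinstance(year, int) and not 1974 <= year <= 2024:
            problems.append(f"row {i}: Year {year} outside 1974-2024")
    return problems


if __name__ == "__main__":
    sample = [
        {"id": 1, "Title": "Carrie", "Year": 1974, "Publisher": "Doubleday", "ISBN": None, "Pages": 199},
    ]
    print(validate_records(sample))
```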

Visual Insights Gallery

These charts track individual annual book release counts to illustrate continuous narrative output and publishing frequency shifts across a 50-year timeline.

More extracted graphs ➡️ Visuals



[Figure: book_length_evolution]

Flowchart

See MLOps_Architecture, written in Mermaid.


⚙️ Prerequisites & Local Setup

To run this project locally, your environment needs to have Python installed along with the required web scraping, data engineering, and data visualization dependencies.

1. System Requirements

  • Python Runtime: Python 3.10 or higher is recommended.
  • Package Manager: pip (comes bundled with Python installations).

2. Required Python Libraries

The core dependencies fall into four groups:

  • Networking & Parsing: requests, beautifulsoup4, lxml
  • Data Processing: pandas, numpy
  • Data Visualization: matplotlib, seaborn
  • Machine Learning & Notebooks: scikit-learn, notebook

🚀 Running the Production Pipeline

Run the following terminal commands in order from the repository root:

Step 1: Run the ETL Pipeline Script

python pipeline.py

Step 2: Execute the Machine Learning Layer

Train your Random Forest Regressor and evaluate predictive metrics on your clean dataset:

python train_model.py

Step 3: Run Interactive Live Inference Terminal

Launch the user prediction panel to query live book-length predictions through interactive parameter menus:

python predict.py
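
For inference, the user's inputs have to be encoded into the same columns the model saw during training. The sketch below shows one way to do that alignment; the `build_feature_row` name, the column names, and the fallback to an Other category are assumptions for illustration, not the actual contents of predict.py.

```python
import pandas as pd


def build_feature_row(
    year: int, publisher: str, book_type: str, training_columns: list[str]
) -> pd.DataFrame:
    """Encode one user query into a single row matching the training columns."""
    # A publisher unseen at training time falls back to the "Other" bucket.
    if f"Publisher_{publisher}" not in training_columns:
        publisher = "Other"
    raw = pd.DataFrame([{"Year": year, "Publisher": publisher, "Book Type": book_type}])
    encoded = pd.get_dummies(raw, columns=["Publisher", "Book Type"], dtype=int)
    # Add columns missing from this row as 0, drop unknown ones, and fix the order.
    return encoded.reindex(columns=training_columns, fill_value=0)


# Hypothetical column list; in practice it would be saved alongside the model.
TRAINING_COLUMNS = [
    "Year",
    "Publisher_Doubleday",
    "Publisher_Other",
    "Publisher_Viking",
    "Book Type_Novel",
    "Book Type_Collection",
]

if __name__ == "__main__":
    row = build_feature_row(2025, "Unknown House", "Novel", TRAINING_COLUMNS)
    print(row)
    # With a model serialized via joblib (the file path here is an assumption):
    #   model = joblib.load("models/model.joblib")
    #   print(model.predict(row))
```

Saving the training column list next to the serialized model is what makes this reindexing step possible at prediction time.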

Step 4: Interactive Workspace

To view the full visual data discovery process and plot historical graphs step-by-step:

jupyter notebook

Open requests_data.ipynb from the browser dashboard and run cells sequentially.


⚫ Authors & Credits

  • Priyanshu Vijay - Data Engineer & ML Analyst - Butkii025

📄 License & Attribution

License: MIT

The data pipeline code is licensed under the MIT License. If you use the cleaned dataset (data.csv) generated by this repository for further machine learning or statistical analysis, please attribute it as follows:

Vijay, P. (2026). Book Data Extractor Dataset (Version 1.0) [Data set]. GitHub.