The goal of this project is to extract textual data from articles using provided URLs and perform text analysis to compute various metrics.
- Web Scraping:
  - Used the `BeautifulSoup` library for web scraping.
  - Extracted data from the first URL and converted it into a string, then a list of words.
- Data Manipulation:
  - Converted the extracted data into a pandas DataFrame and further into a NumPy array for manipulation.
- Text Analysis:
  - Created a `TextAnalysis` class inside the `TextAnalysis.py` file.
  - Defined class attributes for `StopWords`, `PositiveWords`, and `NegativeWords` to use across all URL data.
  - Methods were created to:
    - Load stop words, positive words, and negative words from files.
    - Extract, clean, and analyze the text data.
  - Employed exception handling during data extraction.
  - Managed the sequence of methods to ensure dependent variables like word count are assigned first.
- Automation:
  - Created a `Main.py` file that imports the `TextAnalysis` class and performs analysis on each URL iteratively.
  - Results are stored in a dictionary and exported to an `Output.xlsx` file.
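
The class-level word lists, the method ordering, and the per-URL loop described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: only the names `TextAnalysis`, `StopWords`, `PositiveWords`, and `NegativeWords` come from this README, while `load_word_lists`, `clean`, `scores`, `analyze_all`, the column names, and the example URLs are hypothetical. The word lists are passed in directly instead of being read from files, and the text is assumed to be already extracted, so a `None` value stands in for a failed extraction.

```python
import re


class TextAnalysis:
    # Class attributes: shared by every instance, so the word lists are
    # set once and reused for all URLs (attribute names follow the README).
    StopWords = set()
    PositiveWords = set()
    NegativeWords = set()

    @classmethod
    def load_word_lists(cls, stop, positive, negative):
        """Hypothetical helper: the project loads these lists from files;
        here they are passed in as plain iterables of words."""
        cls.StopWords = {w.lower() for w in stop}
        cls.PositiveWords = {w.lower() for w in positive}
        cls.NegativeWords = {w.lower() for w in negative}

    def __init__(self, text):
        self.text = text
        self.words = []
        self.word_count = 0

    def clean(self):
        """Lowercase, tokenize, drop stop words, and set word_count."""
        tokens = re.findall(r"[a-z']+", self.text.lower())
        self.words = [t for t in tokens if t not in self.StopWords]
        self.word_count = len(self.words)

    def scores(self):
        """Compute illustrative metrics; clean() runs first so that
        word_count is assigned before anything depends on it."""
        self.clean()
        return {
            "WORD COUNT": self.word_count,
            "POSITIVE SCORE": sum(w in self.PositiveWords for w in self.words),
            "NEGATIVE SCORE": sum(w in self.NegativeWords for w in self.words),
        }


def analyze_all(texts_by_url):
    """Analyze each URL's text and collect the results column-wise in a dict."""
    results = {"URL": [], "WORD COUNT": [], "POSITIVE SCORE": [], "NEGATIVE SCORE": []}
    for url, text in texts_by_url.items():
        try:
            metrics = TextAnalysis(text).scores()
        except (AttributeError, TypeError) as exc:
            # Missing text (e.g. a failed extraction) is skipped, not fatal.
            print(f"Skipping {url}: {exc}")
            continue
        results["URL"].append(url)
        for name, value in metrics.items():
            results[name].append(value)
    return results


TextAnalysis.load_word_lists(
    stop={"the", "is", "a"},
    positive={"good", "great"},
    negative={"bad"},
)
results = analyze_all({
    "https://example.com/a": "The product is good, a great buy.",
    "https://example.com/b": None,  # simulates a failed extraction
})
print(results)
```

A dictionary shaped like `results` maps directly onto spreadsheet columns; one way to export it, assuming pandas and openpyxl are installed, is `pandas.DataFrame(results).to_excel("Output.xlsx", index=False)`.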
- Ensure all required libraries are installed by running:
  ```
  pip3 install -r requirements.txt
  ```
- Requests
- Bs4 (BeautifulSoup)
- pandas
- NLTK
- Openpyxl
- macOS
- Clone the repository:
  ```
  git clone https://github.com/samarth-jain28/Web-Scraping-and-Text-Analysis-Project/
  ```
- Navigate to the project directory:
  ```
  cd Web-Scraping-and-Text-Analysis-Project
  ```
- Install the required libraries:
  ```
  pip3 install -r requirements.txt
  ```
- Run the `Main.py` file to start the analysis: `python3 Main.py`
- The results will be saved in `Output.xlsx` within the project directory.
- Add support for additional languages in text analysis.
- Implement sentiment analysis using machine learning models.