GitHub - kpranke/Algorythm_NLP: Applying NLP to the full Reuters dataset

Description

This project is part of the Becode.org AI Bootcamp programme. The goal is to build a knowledge graph that represents the entities found in a text corpus and the relationships between them. Dataset: Reuters-21578 corpus.

Objectives

Be able to preprocess data obtained from textual sources
Be able to employ named entity recognition and relationship extraction using spaCy
Be able to visualize results
Be able to present insights and findings to the client
Be able to store data using the graph database Neo4j
Be able to write clean and documented code

Strengths

Applicable to any text.
Relationships are automatically extracted and clustered by meaning.
Most modules are fully automated; some use context-specific information that can easily be added.
Entity linking automatically groups entities that are spelled slightly differently.
Adaptable Neo4j graphs.

Limitations

Only a prototype of the graph has been developed in Streamlit.
No transformers were harmed in the making of this project.

Further Developments

Implement an interactive Streamlit app.
Further automate entity recognition.

Repo Architecture

Project/
|-- Deployment/
|   |-- streamlit_app.py
|-- NER/
|   |-- NER_extraction.py
|-- Visualization/
|   |-- createGraph.py
|-- data_cleaning/
|   |-- data_cleaning.py
|-- database/
|   |-- versions/
|   |   |-- v1.0
|   |   |-- v1.1
|   |   |-- v1.2
|   |   |-- v1.3
|   |-- entities.csv
|   |-- relationships.csv
|-- entity_linking/
|   |-- entityLinker.py
|   |-- entities_wiki_final.csv
|-- relationship_extraction/
|   |-- relationship_extraction.py
|-- stored_data/
|   |-- NER_train_data.obj
|   |-- all_verbs.obj
|   |-- articles_cleaned.obj
|   |-- articles_with_tickers.obj
|   |-- company_tickers.obj
|   |-- docs.obj
|   |-- entities.obj
|   |-- nouns.obj
|-- README.md
|-- .gitignore
|-- main.py

Installation

Clone the repository and install the dependencies listed in requirements.txt.
Run main.py to execute all the classes described under Usage.

Usage

Data cleaning with RegEx

data_cleaning/data_cleaning.py contains the class get_data, which downloads the data and cleans it using regular expressions and basic Python string manipulation.
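
As an illustration of the kind of cleaning this step involves, the sketch below pulls article bodies out of SGML markup (the format in which Reuters-21578 is distributed) and normalises whitespace with the standard re module. It is not the get_data class itself: the helper names and the sample string are assumptions made for the example.

```python
import html
import re

# Hypothetical helpers for illustration; not the project's get_data class.
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def extract_bodies(sgml_text):
    """Return the raw text of every <BODY> element in an SGML string."""
    return BODY_RE.findall(sgml_text)


def clean_text(raw):
    """Drop leftover tags, unescape entities and collapse whitespace."""
    # Tags are removed before unescaping so that escaped markers such as
    # "&lt;XON>" survive as text instead of being mistaken for markup.
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Invented sample, shaped like a Reuters-21578 record.
sample = (
    "<REUTERS><TITLE>EXXON</TITLE>"
    "<BODY>Exxon Corp &lt;XON> said\n    it raised   prices.</BODY></REUTERS>"
)
cleaned = [clean_text(body) for body in extract_bodies(sample)]
```

Stripping tags before unescaping is a deliberate choice in this sketch: it keeps escaped ticker-style markers in the text, which matters if a later step looks for them.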

Named entity recognition with spaCy

NER/NER_extraction.py contains the class NER, which handles all named entity recognition operations. It extracts named entities from the dataset using a configured spaCy language model together with an entity ruler pipeline component.
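
To show what an entity ruler does, here is a minimal sketch assuming spaCy v3. It uses a blank English pipeline so that no model download is needed; the patterns and the extract_entities helper are invented for the example and are not the project's actual pattern lists or NER class.

```python
import spacy

# A blank pipeline keeps the example independent of any downloaded model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only: one phrase pattern and one token pattern.
ruler.add_patterns([
    {"label": "COMPANY", "pattern": "Exxon Corp"},
    {"label": "COMMODITY", "pattern": [{"LOWER": "crude"}, {"LOWER": "oil"}]},
])


def extract_entities(text):
    """Return (text, label) pairs for every entity the ruler finds."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


entities = extract_entities("Exxon Corp raised crude oil prices.")
```

With a pretrained pipeline, the ruler can be added before or after the statistical NER component, which decides whose entities take precedence when spans overlap.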

Entity linking with Pywikibot

entity_linking/entityLinker.py contains the class entityLinker, which links the entities to Wikipedia using the pywikibot library.
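
The grouping effect of entity linking can be sketched without any network access. In the example below, a small lookup table stands in for the Wikipedia query; the table, the fake_resolve function and group_by_canonical are all hypothetical and do not reflect how entityLinker or pywikibot are actually called.

```python
from collections import defaultdict


def group_by_canonical(mentions, resolve):
    """Group surface forms under the canonical title that `resolve` returns.

    `resolve` is any callable mapping a mention to a canonical title, or to
    None when nothing is found. Unresolved mentions are keyed by their own text.
    """
    groups = defaultdict(list)
    for mention in mentions:
        canonical = resolve(mention) or mention
        if mention not in groups[canonical]:
            groups[canonical].append(mention)
    return dict(groups)


# Hypothetical lookup table standing in for a Wikipedia query.
FAKE_WIKI = {
    "exxon": "Exxon",
    "exxon corp": "Exxon",
    "exxon corporation": "Exxon",
}


def fake_resolve(mention):
    """Case-insensitive lookup in the stand-in table."""
    return FAKE_WIKI.get(mention.lower())


groups = group_by_canonical(
    ["Exxon Corp", "EXXON", "Exxon corporation", "Acme Ltd"], fake_resolve
)
```

Once every mention maps to one canonical key, the graph can hold a single node per real-world entity instead of one node per spelling.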

Relationship extraction with spaCy

relationship_extraction/relationship_extraction.py contains the class relationshipExtraction. It takes the previously found entities and cross-references them with the database to find meaningful relations: a pattern matcher finds the candidate relationships, and the verbs are then grouped into clusters using Jaccard similarity.
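
Clustering verbs by Jaccard similarity can be sketched as follows. The README does not say which sets the similarity is computed on, so this example assumes each verb comes with a set of features (for instance synonyms); the feature sets, the threshold and the greedy strategy are all assumptions, not the project's algorithm.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_verbs(verb_features, threshold=0.5):
    """Greedy single-pass clustering.

    Each verb joins the first cluster whose representative (its first member)
    has Jaccard similarity >= threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for verb, features in verb_features.items():
        for cluster in clusters:
            representative = cluster[0]
            if jaccard(features, verb_features[representative]) >= threshold:
                cluster.append(verb)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([verb])
    return clusters


# Hypothetical feature sets, invented for the example.
verbs = {
    "buy": {"buy", "purchase", "acquire"},
    "acquire": {"acquire", "buy", "purchase", "obtain"},
    "sell": {"sell", "divest"},
    "divest": {"divest", "sell", "dispose"},
}
clusters = cluster_verbs(verbs)
```

A single pass like this depends on input order; comparing against every member of a cluster, or using a hierarchical method, would be more robust at a higher cost.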

Creating a graph database with Neo4j

Visualization/createGraph.py contains a class that connects to a Neo4j database and uploads the entities (as nodes) and the relationships between them (as edges).
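
The upload can be pictured as a series of Cypher MERGE statements. The sketch below only builds the query text and parameters, so it runs without a database; the function names and the name property are assumptions for the example, and the real class may structure its queries differently. Each (query, parameters) pair could then be passed to a session's run method in the official neo4j Python driver.

```python
import re

# Labels and relationship types are written into the query text here, so they
# are restricted to plain identifiers before use.
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_identifier(name):
    """Return `name` unchanged, or raise if it is not a plain identifier."""
    if not IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def node_statement(label, name):
    """Build a (query, parameters) pair that creates the node if it is missing."""
    query = f"MERGE (n:{_check_identifier(label)} {{name: $name}})"
    return query, {"name": name}


def relationship_statement(source, rel_type, target):
    """Build a (query, parameters) pair linking two existing nodes by name."""
    query = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[:{_check_identifier(rel_type)}]->(b)"
    )
    return query, {"source": source, "target": target}


node_query = node_statement("COMPANY", "Exxon Corp")
rel_query = relationship_statement("Exxon Corp", "ACQUIRED", "Acme Ltd")
```

MERGE is used instead of CREATE so that uploading the same entity or relationship twice does not produce duplicates.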

Deployment with Streamlit and Agraph

Deployment/streamlit_app.py contains a prototype of a Streamlit app built with the Agraph component.

Visuals

Timeline

Duration: 2 weeks

version 1.0

We ran spaCy's large English pipeline ('en_core_web_lg') on the cleaned data and used its built-in, pre-trained NER model. For relationship extraction we used rule-based matching that looks for two entities with a verb between them.
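
The "two entities with a verb between them" rule can be sketched in plain Python. This is a stand-in for spaCy's rule-based matching, not the project's code: it assumes the sentence has already been reduced to (text, tag) pairs, and the extract_triples helper and the hand-tagged sentence are invented for the example.

```python
def extract_triples(tokens):
    """Return (subject, verb, object) for every ENTITY VERB ENTITY run.

    `tokens` is a list of (text, tag) pairs, where tag is "ENT" for a named
    entity span and a coarse part-of-speech tag otherwise.
    """
    triples = []
    for i in range(len(tokens) - 2):
        (left, left_tag), (verb, verb_tag), (right, right_tag) = tokens[i:i + 3]
        if left_tag == "ENT" and verb_tag == "VERB" and right_tag == "ENT":
            triples.append((left, verb, right))
    return triples


# Hand-tagged sentence, invented for the example.
sentence = [
    ("Exxon Corp", "ENT"),
    ("acquired", "VERB"),
    ("Acme Ltd", "ENT"),
    ("yesterday", "ADV"),
    (".", "PUNCT"),
]
triples = extract_triples(sentence)
```

A rule this strict misses constructions with auxiliaries or intervening words ("has agreed to acquire"), which is one reason a matcher with optional tokens is more practical on real text.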

version 1.1

The rule-based matching was improved. Where it previously matched any two entities with a verb between them, it now only considers entities that appear in a specified list. A custom NER model was trained on data gathered for the labels PERSON, BANK, COMPANY, COMMODITY and AGREEMENTS.

version 1.2

The custom NER model was discarded because of poor performance and replaced with a custom entity ruler class that looks for COMPANY, PERSON and COMMODITY entities. Attempts at using Neural Coreference were abandoned because of the project's time constraints.

version 1.3

The relationship extraction was optimized with a custom clustering algorithm. Entity linking was added to link together mentions of the same entity that are written differently.

Personal situation

Contributors: manwithplan, kpranke
