This repository contains a retrieval-augmented generation (RAG) app that provides a streamlined workflow for processing, merging, normalizing, and analyzing SEC-EDGAR data with a large language model (LLM); the LLM API used is the Gemini Pro model. The final step uses Streamlit to visualize the LLM results, giving quicker financial insights into the SEC-EDGAR tickers covered: MSFT, AAPL, and GOOGL.
Demo 🔗 (see here): https://drive.google.com/file/d/1eMftidtmhJIFWohNr3so_RnMjGIrgeee/view?usp=sharing
Output 1 📉-
The illustrations show a significant increase in operating income as revenue grew from $257.64 billion to $280.38 billion, reflecting a positive correlation between revenue and profitability.
Output 2 📈 -
The illustrations show dividends declared per share rising consistently over three years, from approximately $0.88 in 2021 to about $0.94 in 2023, a steady and positive growth in dividends declared over time.
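To put the two outputs on a comparable scale, the quoted changes can be expressed as percentage growth. The snippet below is only a quick arithmetic sketch using the figures from the descriptions above; the helper name `pct_change` is made up for illustration, and the dividend figures are the approximate values quoted, so that result is approximate too.

```python
def pct_change(start: float, end: float) -> float:
    """Return the percentage change from start to end."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return (end - start) / start * 100.0


# Figures quoted in Output 1 (revenue, in billions of USD).
revenue_growth = pct_change(257.64, 280.38)

# Approximate figures quoted in Output 2 (dividends per share, in USD).
dividend_growth = pct_change(0.88, 0.94)

print(f"Revenue growth: {revenue_growth:.2f}%")
print(f"Dividend growth (2021 to 2023): {dividend_growth:.2f}%")
```

With these inputs, revenue grows by roughly 8.8% and dividends per share by roughly 6.8%.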
- Python: Our primary programming language for application development.
- Gemini-Pro: An LLM for data analysis and insights, well suited to our RAG app. It offers free access up to a certain number of requests. I also explored open-source LLMs such as StabilityAI, Camel-AI, and Zephyr 7B. Gemini-Pro provides versatile output formats, including structured JSON/tabular data and well-tuned text analysis, making it highly suitable for our app.
- Plotly: Provides interactive, customizable visualizations for our app once the .json responses have been converted to a dataframe.
- Streamlit: Enables easy deployment and offers robust visualization features.
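As a rough sketch of the JSON-to-dataframe step mentioned for Plotly: the model's reply has to be parsed into flat rows before it can be plotted. The example below uses only the standard library. The reply text, the field names (`year`, `revenue`, `operating_income`), the values, and the helper `parse_llm_json` are all hypothetical and not taken from this repository. It also assumes the model may wrap its JSON in a Markdown code fence, which is handled defensively.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid nesting fences


def parse_llm_json(text: str) -> list[dict]:
    """Parse an LLM reply that should contain a JSON array of objects."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (for example, a fence followed by "json").
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence, if present.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[: -len(FENCE)]
    rows = json.loads(cleaned)
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("expected a JSON array of objects")
    return rows


# Hypothetical reply; the numbers are placeholders, not real filing data.
reply = (
    FENCE + "json\n"
    '[{"year": 2022, "revenue": 100.0, "operating_income": 40.0},\n'
    ' {"year": 2023, "revenue": 110.0, "operating_income": 45.0}]\n'
    + FENCE
)

rows = parse_llm_json(reply)
# Column-oriented view, the shape most plotting code expects.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns)
```

From here, `pandas.DataFrame(rows)` would give the dataframe, which could then be passed to a Plotly Express function such as `px.line`.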
NOTE - The LLM's text analysis is available in the PDF 'INSIGHTS WITH TEXT RESPONSES'.
- Data Extraction and Zipping:
  - Go to the `data_processing` directory.
  - Run: `1_extra_and_zip.py`
  - Output: This will create a zip file for each ticker.
- Merge and Normalize:
  - Go to the `data_processing` directory.
  - Run: `2_merge_and_normalize.py` using the zip file created in step 1. Here we first convert to .json, then to .txt, for faster processing of embeddings.
  - Repeat: Perform steps 1 and 2 for each ticker separately.
  - Output: Generates merged and cleaned files that are ready for analysis.
- Store Processed Files:
  - Save: Place the merged files inside the `documents` directory in `.txt` format.
- Load Data and Create Embeddings:
  - Run: `load_data.py`
  - Uncomment: The API line, and provide your `gemini-pro` API key.
  - Note: This step involves file splitting and the creation of embeddings.
- Analyze Data with Gemini:
  - Run: `main.py`
  - Uncomment: The API line, and provide your `gemini-pro` API key.
- Automation with Streamlit:
  - Run: `app.py` using Streamlit.
  - Output: This creates an interface for fast analysis.
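The merge, normalize, and split steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `2_merge_and_normalize.py` or `load_data.py`: the function names, the in-memory zip, the member file names, the whitespace-only normalization, and the chunk size and overlap are all assumptions made for the example. Embedding creation is left out because it needs an API key.

```python
import io
import re
import zipfile


def merge_zip_text(zip_bytes: bytes) -> str:
    """Concatenate every file in a zip archive into one string, in name order."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        parts = [
            archive.read(name).decode("utf-8", errors="replace")
            for name in sorted(archive.namelist())
            if not name.endswith("/")  # skip directory entries
        ]
    return "\n".join(parts)


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


# Build a tiny zip in memory so the example runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("part_1.txt", "Revenue   grew\nyear over year.")
    archive.writestr("part_2.txt", "Operating income\t also increased.")

merged = normalize(merge_zip_text(buffer.getvalue()))
chunks = split_text(merged, chunk_size=40, overlap=10)
print(merged)
print(len(chunks), "chunks")
```

Overlapping chunks are a common choice for retrieval because a sentence cut at one chunk boundary still appears whole in the neighbouring chunk; the sizes used here are deliberately tiny so the behaviour is easy to inspect.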
- Clone the Repository: `git clone https://github.com/tishachawla-jg/SEC-EDGAR_Analyis_App.git`
- Install Dependencies: `pip install -r requirements.txt`
- Run app locally: `streamlit run app.py`
If you wish to contribute to this project, please create a pull request or raise an issue to discuss improvements.
NOTE TO APP USERS - Make sure to cross-check the answers for potential hallucinations!
