This project consists of two main Python scripts: telegram_crawler.py and main.py. Together, they provide a system for fetching similar Telegram channels, with main.py serving as a web-based interface for the core crawling logic implemented in telegram_crawler.py.
This script contains the core logic for interacting with the Telegram API to find similar channels.
`InputHandler`:
- Purpose: Manages the loading and validation of input channel usernames.
- Methods:
  - `load_from_cli(args)`: Loads channel usernames directly from command-line arguments.
  - `load_from_file(file_path)`: Reads channel usernames from a specified JSON or CSV file.
  - `validate_channels(channels)`: Cleans and validates a list of channel usernames (e.g., removes the '@' prefix, checks length).
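As a minimal sketch of the kind of normalization `validate_channels(channels)` performs — the exact rules (length limits, allowed characters) are assumptions based on Telegram's public-username format, not taken from the source:

```python
import re

def validate_channels(channels):
    """Normalize and filter a list of Telegram channel usernames (sketch).

    Assumed rules: strip whitespace and a leading '@', then require
    5-32 characters of letters, digits, or underscores.
    """
    valid = []
    for raw in channels:
        name = raw.strip().lstrip("@")
        if re.fullmatch(r"\w{5,32}", name):
            valid.append(name)
    return valid

# e.g. validate_channels(["@durov", " telegram ", "x"]) keeps the first
# two (normalized) and drops the too-short "x"
```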
`TelegramCrawler`:
- Purpose: Handles the actual communication with the Telegram API using the `telethon` library.
- Methods:
  - `__init__(api_id, api_hash, phone, session_name)`: Initializes the Telegram client with API credentials and session details.
  - `connect()`: Establishes a connection to Telegram, handling authentication (phone number, code, 2FA password) if necessary.
  - `get_similar_channels(channel_username)`: Fetches recommendations for a given Telegram channel using `GetChannelRecommendationsRequest` and retrieves additional details like member count.
  - `close()`: Disconnects the Telegram client.
Module-level functions:
- `load_config()`: Loads Telegram API credentials and other settings (like the delay between channels) from environment variables (typically from a `.env` file). It performs basic validation to ensure required configurations are present.
- `parse_arguments()`: Parses command-line arguments, allowing the script to be run with channel lists directly or from a file, and to override configuration settings.
- `process_channels(input_channels, config)`: The central asynchronous function that orchestrates the crawling process. It initializes `TelegramCrawler`, connects to the API, iterates through the input channels, fetches similar channels for each, applies rate limiting, and collects all results. It also handles saving the results to a JSON file.
- `main()`: The asynchronous entry point of the script when run from the command line. It parses arguments, loads configuration, validates input channels, and calls `process_channels`.
- Direct interaction with the Telegram API.
- Authentication and session management for Telegram.
- Fetching similar channels and their details.
- Handling input channel validation and normalization.
- Managing rate limits and delays during crawling.
- Saving raw crawling results to a JSON file.
- Configuring logging for crawler operations.
This script provides a Flask-based web interface for the Telegram Similar Channels Finder, allowing users to input channels, configure settings, and view results through a browser.
Flask App: The main web application instance.

`TempStorage`:
- Purpose: A class designed for file-based storage of channels and results, ensuring data persistence across Flask worker reloads (which can happen in production environments like Gunicorn). It avoids relying solely on Flask's session or in-memory storage for critical data.
- Methods:
  - `set_channels`, `set_results`, `channels`, `results`: These methods read from and write to `temp_channels.json` and `temp_results.json` files in the `data/` directory.
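A minimal sketch of such a file-backed store, using the file names given above; error handling and locking are omitted for brevity, and the property-based interface is an assumption about the real class:

```python
import json
from pathlib import Path

class TempStorage:
    """File-backed storage that survives worker reloads (sketch)."""

    def __init__(self, data_dir="data"):
        self.dir = Path(data_dir)
        self.dir.mkdir(exist_ok=True)
        self.channels_file = self.dir / "temp_channels.json"
        self.results_file = self.dir / "temp_results.json"

    def set_channels(self, channels):
        self.channels_file.write_text(json.dumps(channels))

    def set_results(self, results):
        self.results_file.write_text(json.dumps(results))

    @property
    def channels(self):
        if self.channels_file.exists():
            return json.loads(self.channels_file.read_text())
        return []

    @property
    def results(self):
        if self.results_file.exists():
            return json.loads(self.results_file.read_text())
        return {}
```

Because every read goes back to disk, a fresh worker process sees whatever the previous one wrote, which is the property the session-based alternatives lack.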
- Flask Routes:
  - `/` (index): The main entry point for the web application. It handles both GET requests (to display the form and results) and POST requests (to process various form submissions like configuration, manual input, file upload, and starting the crawler).
  - `/api/run-crawler`: An API endpoint (POST) to trigger the crawler asynchronously.
  - `/export-csv`: A route to download the collected results as a CSV file.
Handler functions:
- `handle_config_form()`: Processes the submission of the configuration form. It reads API credentials and other settings from the form, updates the `.env` file, and reloads environment variables for the application.
- `handle_manual_input()`: Processes manually entered channel usernames from the web form, validates them, and stores them in `TempStorage`.
- `handle_file_upload()`: Handles the upload of channel lists via JSON, CSV, or plain text files. It parses the file content, extracts channel usernames, and stores them in `TempStorage`.
- `handle_start_crawler()`: Initiates the crawling process. It retrieves the channels from `TempStorage`, loads the configuration, and then calls the `telegram_crawler.process_channels` function to run the crawler. The results are then stored back into `TempStorage`.
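The `.env` rewrite that `handle_config_form()` performs could look roughly like this — a hedged sketch, since the source does not show how existing lines are treated; this version updates known keys in place, preserves other lines, and appends new keys:

```python
def update_env_file(path, updates):
    """Rewrite a .env file: replace values for keys in `updates`,
    keep unrelated lines as-is, append keys not yet present (sketch)."""
    lines = []
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        pass  # first save: start from an empty file
    seen = set()
    out = []
    for line in lines:
        key = line.split("=", 1)[0].strip()
        if key in updates:
            out.append(f"{key}={updates[key]}")
            seen.add(key)
        else:
            out.append(line)
    for key, value in updates.items():
        if key not in seen:
            out.append(f"{key}={value}")
    with open(path, "w") as f:
        f.write("\n".join(out) + "\n")
```

After writing, the app would call `load_dotenv()` again so the running process picks up the new values.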
- Providing a user-friendly web interface for the crawler.
- Handling user input for channels (manual entry, file upload).
- Managing application configuration via a web form and updating the `.env` file.
- Triggering the `telegram_crawler.py` logic.
- Storing and retrieving channels and results persistently using `TempStorage`.
- Displaying crawling results to the user.
- Providing an option to export results as a CSV file.
The project follows a client-server architecture where main.py acts as the web server (frontend) and telegram_crawler.py serves as the backend logic for Telegram API interactions.
- Frontend (`main.py`):
  - Presents forms for configuration and channel input.
  - Receives user requests (e.g., "start crawl").
  - Manages temporary storage of input channels and fetched results using `TempStorage` (which writes to `temp_channels.json` and `temp_results.json`).
- Backend (`telegram_crawler.py`):
  - Contains the core business logic for connecting to Telegram, fetching channel recommendations, and processing data.
  - It is imported and directly called by `main.py`. Specifically, `main.py` invokes the `telegram_crawler.process_channels` function.
This separation allows the web interface to focus on presentation and user interaction, while the crawler script handles the complex, potentially long-running, and API-specific tasks.
- Input Channels:
  - Users provide channel usernames either by manual entry in a text area or by uploading a JSON/CSV/text file through the `main.py` web interface. `main.py`'s `handle_manual_input()` or `handle_file_upload()` functions receive this input.
  - The channels are then stored in `TempStorage` (persisted to `temp_channels.json`).
- Crawler Execution:
  - When the user clicks "Start Crawler" in `main.py`, the `handle_start_crawler()` function is triggered.
  - It retrieves the stored channels from `TempStorage`.
  - It loads the configuration using `telegram_crawler.load_config()`.
  - It then calls `asyncio.run(telegram_crawler.process_channels(channels, config))`, passing the input channels and configuration to the crawler script.
- Results Storage and Presentation:
  - `telegram_crawler.process_channels` performs the API calls, collects similar channels, and returns a dictionary containing the results.
  - This result dictionary is then stored by `main.py` into `TempStorage` (persisted to `temp_results.json`).
  - The `index()` route in `main.py` retrieves these results from `TempStorage` and renders them on the `index.html` template for display to the user.
  - Users can also download the results as a CSV file via the `/export-csv` route, which reads from `TempStorage` as well.
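The CSV export step can be sketched with the standard library. The column layout here is an assumption (one row per source/similar pair); the real route may include extra fields such as member counts:

```python
import csv
import io

def results_to_csv(results):
    """Flatten a {source: [similar, ...]} results dict into CSV text,
    one row per (source channel, similar channel) pair (sketch)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source_channel", "similar_channel"])
    for source, similar in results.items():
        for channel in similar:
            writer.writerow([source, channel])
    return buf.getvalue()
```

In Flask, the returned string would be wrapped in a response with a `text/csv` content type and a `Content-Disposition: attachment` header so the browser downloads it.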
The application relies on environment variables for sensitive information and configurable settings, primarily loaded from a .env file in the project root.
`telegram_crawler.py`:
- Uses `load_dotenv()` to load variables from `.env`.
- The `load_config()` function specifically reads `TELEGRAM_API_ID`, `TELEGRAM_API_HASH`, `TELEGRAM_PHONE`, `TELEGRAM_SESSION`, `DELAY_BETWEEN_CHANNELS`, and `BATCH_SIZE`.
- These values are crucial for the `TelegramCrawler` to connect and operate.
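A minimal sketch of the validation `load_config()` is described as performing, reading the variables named above; the default values for the optional settings are assumptions:

```python
import os

def load_config():
    """Read crawler settings from the environment, failing fast when a
    required variable is missing (sketch; defaults are assumed)."""
    required = ["TELEGRAM_API_ID", "TELEGRAM_API_HASH", "TELEGRAM_PHONE"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise ValueError(f"Missing required configuration: {', '.join(missing)}")
    return {
        "api_id": int(os.environ["TELEGRAM_API_ID"]),
        "api_hash": os.environ["TELEGRAM_API_HASH"],
        "phone": os.environ["TELEGRAM_PHONE"],
        "session_name": os.getenv("TELEGRAM_SESSION", "crawler_session"),
        "delay_between_channels": float(os.getenv("DELAY_BETWEEN_CHANNELS", "5")),
        "batch_size": int(os.getenv("BATCH_SIZE", "10")),
    }
```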
`main.py`:
- The `handle_config_form()` function allows users to update these environment variables directly through the web interface. It reads the submitted values and writes them back to the `.env` file.
- It also uses `os.getenv()` to display current configuration values on the web page.
- Crucially, `main.py` calls `load_dotenv()` before `telegram_crawler.load_config()` in `handle_start_crawler()` and `api_run_crawler()` to ensure the latest environment variables are loaded before the crawler runs.
This setup allows for flexible configuration, either by directly editing the .env file or through the web interface.