Telegram Similar Channels Finder Project Overview

This project consists of two main Python scripts: telegram_crawler.py and main.py. Together, they provide a system for fetching similar Telegram channels, with main.py serving as a web-based interface for the core crawling logic implemented in telegram_crawler.py.

telegram_crawler.py Analysis

This script is the core logic for interacting with the Telegram API to find similar channels.

Main Classes and Their Purposes:

  • InputHandler:
    • Purpose: Manages the loading and validation of input channel usernames.
    • Methods:
      • load_from_cli(args): Loads channel usernames directly from command-line arguments.
      • load_from_file(file_path): Reads channel usernames from a specified JSON or CSV file.
      • validate_channels(channels): Cleans and validates a list of channel usernames (e.g., removes '@' prefix, checks length).
  • TelegramCrawler:
    • Purpose: Handles the actual communication with the Telegram API using the telethon library.
    • Methods:
      • __init__(api_id, api_hash, phone, session_name): Initializes the Telegram client with API credentials and session details.
      • connect(): Establishes a connection to Telegram, handling authentication (phone number, code, 2FA password) if necessary.
      • get_similar_channels(channel_username): Fetches recommendations for a given Telegram channel using GetChannelRecommendationsRequest and retrieves additional details like member count.
      • close(): Disconnects the Telegram client.
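The validation step in InputHandler.validate_channels can be sketched as follows. This is a hypothetical reconstruction, not the project's actual code: the exact rules may differ, but stripping the '@' prefix and enforcing Telegram's 5–32 character username length are the behaviors described above.

```python
# Hypothetical sketch of InputHandler.validate_channels: normalizes raw
# input by trimming whitespace and the '@' prefix, then keeps only names
# within Telegram's 5-32 character username length limits.
def validate_channels(channels):
    """Clean and validate a list of channel usernames."""
    valid = []
    for raw in channels:
        name = raw.strip().lstrip('@')
        # Telegram public usernames are 5 to 32 characters long
        if 5 <= len(name) <= 32:
            valid.append(name)
    return valid
```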

Key Functions:

  • load_config(): Loads Telegram API credentials and other settings (like delay between channels) from environment variables (typically from a .env file). It performs basic validation to ensure required configurations are present.
  • parse_arguments(): Parses command-line arguments, allowing the script to be run with channel lists directly or from a file, and to override configuration settings.
  • process_channels(input_channels, config): The central asynchronous function that orchestrates the crawling process. It initializes TelegramCrawler, connects to the API, iterates through the input channels fetching similar channels for each, applies rate limiting between requests, and collects the results. It also saves the results to a JSON file.
  • main(): The asynchronous entry point of the script when run from the command line. It parses arguments, loads configuration, validates input channels, and calls process_channels.
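The orchestration loop in process_channels can be sketched as below. This is a simplified stand-in, not the real implementation: the actual function builds a TelegramCrawler internally, while here the fetch step is injected as a coroutine parameter (fetch_similar) and the output path is an added parameter, both hypothetical, so the control flow (iterate, fetch, delay, save) is visible on its own.

```python
import asyncio
import json

# Simplified sketch of the process_channels orchestration: iterate over
# input channels, fetch recommendations for each, sleep between channels
# for rate limiting, then persist everything to a JSON file.
# fetch_similar and output_path are illustrative stand-ins.
async def process_channels(input_channels, config, fetch_similar,
                           output_path='results.json'):
    results = {}
    for channel in input_channels:
        results[channel] = await fetch_similar(channel)
        # Rate limiting between channels (DELAY_BETWEEN_CHANNELS setting)
        await asyncio.sleep(config.get('delay_between_channels', 0))
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```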

Responsibilities:

  • Direct interaction with the Telegram API.
  • Authentication and session management for Telegram.
  • Fetching similar channels and their details.
  • Handling input channel validation and normalization.
  • Managing rate limits and delays during crawling.
  • Saving raw crawling results to a JSON file.
  • Configuring logging for crawler operations.

main.py Analysis

This script provides a Flask-based web interface for the Telegram Similar Channels Finder, allowing users to input channels, configure settings, and view results through a browser.

Main Components:

  • Flask App: The main web application instance.
  • TempStorage:
    • Purpose: A class designed for file-based storage of channels and results, ensuring data persistence across Flask worker reloads (which can happen in production environments like Gunicorn). It avoids relying solely on Flask's session or in-memory storage for critical data.
    • Methods: set_channels, set_results, channels, results. These methods read from and write to temp_channels.json and temp_results.json files in the data/ directory.
  • Flask Routes:
    • / (index): The main entry point for the web application. It handles both GET requests (to display the form and results) and POST requests (to process various form submissions like configuration, manual input, file upload, and starting the crawler).
    • /api/run-crawler: An API endpoint (POST) to trigger the crawler asynchronously.
    • /export-csv: A route to download the collected results as a CSV file.
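A minimal sketch of what TempStorage might look like, based on the description above: two JSON files under a data/ directory, written on set and re-read on every access so that each Gunicorn worker sees the latest state. Method names match the ones listed; everything else is an assumption.

```python
import json
from pathlib import Path

# Hypothetical sketch of TempStorage: persists channels and results as
# JSON files so data survives Flask/Gunicorn worker reloads, instead of
# relying on in-memory state or Flask's session.
class TempStorage:
    def __init__(self, data_dir='data'):
        self.dir = Path(data_dir)
        self.dir.mkdir(exist_ok=True)

    def _write(self, name, value):
        (self.dir / name).write_text(json.dumps(value), encoding='utf-8')

    def _read(self, name, default):
        path = self.dir / name
        if not path.exists():
            return default
        return json.loads(path.read_text(encoding='utf-8'))

    def set_channels(self, channels):
        self._write('temp_channels.json', channels)

    def set_results(self, results):
        self._write('temp_results.json', results)

    @property
    def channels(self):
        # Re-read on every access so all workers see the latest state
        return self._read('temp_channels.json', [])

    @property
    def results(self):
        return self._read('temp_results.json', {})
```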

Key Functions (Form Handlers):

  • handle_config_form(): Processes the submission of the configuration form. It reads API credentials and other settings from the form, updates the .env file, and reloads environment variables for the application.
  • handle_manual_input(): Processes manually entered channel usernames from the web form, validates them, and stores them in TempStorage.
  • handle_file_upload(): Handles the upload of channel lists via JSON, CSV, or plain text files. It parses the file content, extracts channel usernames, and stores them in TempStorage.
  • handle_start_crawler(): Initiates the crawling process. It retrieves the channels from TempStorage, loads the configuration, and then calls the telegram_crawler.process_channels function to run the crawler. The results are then stored back into TempStorage.
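The file-parsing step inside handle_file_upload can be illustrated with a sketch like the one below. The function name and exact dispatch rules are assumptions; what it shows is the three input formats the handler accepts: a JSON list of usernames, a CSV whose first column holds the username, and plain text with one username per line.

```python
import csv
import io
import json

# Sketch of the parsing handle_file_upload might perform (hypothetical
# helper name). Dispatches on the file extension and returns a flat list
# of raw channel usernames, to be validated and stored afterwards.
def parse_channel_file(filename, content):
    if filename.endswith('.json'):
        # JSON: expected to be a list of usernames
        return [str(c) for c in json.loads(content)]
    if filename.endswith('.csv'):
        # CSV: take the first column of each non-empty row
        rows = csv.reader(io.StringIO(content))
        return [row[0] for row in rows if row]
    # Plain text: one username per line, blank lines ignored
    return [line.strip() for line in content.splitlines() if line.strip()]
```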

Responsibilities:

  • Providing a user-friendly web interface for the crawler.
  • Handling user input for channels (manual entry, file upload).
  • Managing application configuration via a web form and updating the .env file.
  • Triggering the telegram_crawler.py logic.
  • Storing and retrieving channels and results persistently using TempStorage.
  • Displaying crawling results to the user.
  • Providing an option to export results as a CSV file.

Overall Architecture and Interaction

The project follows a two-layer architecture: main.py acts as the web frontend, while telegram_crawler.py provides the backend logic for Telegram API interactions.

  • Frontend (main.py):
    • Presents forms for configuration and channel input.
    • Receives user requests (e.g., "start crawl").
    • Manages temporary storage of input channels and fetched results using TempStorage (which writes to temp_channels.json and temp_results.json).
  • Backend (telegram_crawler.py):
    • Contains the core business logic for connecting to Telegram, fetching channel recommendations, and processing data.
    • It is imported and directly called by main.py. Specifically, main.py invokes the telegram_crawler.process_channels function.

This separation allows the web interface to focus on presentation and user interaction, while the crawler script handles the complex, potentially long-running, and API-specific tasks.

Data Flow

  1. Input Channels:
    • Users provide channel usernames either by manual entry in a text area or by uploading a JSON/CSV/text file through the main.py web interface.
    • main.py's handle_manual_input() or handle_file_upload() functions receive this input.
    • The channels are then stored in TempStorage (persisted to temp_channels.json).
  2. Crawler Execution:
    • When the user clicks "Start Crawler" in main.py, the handle_start_crawler() function is triggered.
    • It retrieves the stored channels from TempStorage.
    • It loads the configuration using telegram_crawler.load_config().
    • It then calls asyncio.run(telegram_crawler.process_channels(channels, config)), passing the input channels and configuration to the crawler script.
  3. Results Storage and Presentation:
    • telegram_crawler.process_channels performs the API calls, collects similar channels, and returns a dictionary containing the results.
    • This result dictionary is then stored by main.py into TempStorage (persisted to temp_results.json).
    • The index() route in main.py retrieves these results from TempStorage and renders them on the index.html template for display to the user.
    • Users can also download the results as a CSV file via the /export-csv route, which reads from TempStorage as well.
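The CSV export step at the end of this flow can be sketched as below. The row layout (source channel, similar channel, member count) is an assumption inferred from the data the crawler collects, not the project's actual column schema.

```python
import csv
import io

# Sketch of what /export-csv might produce: flatten the results dict
# (source channel -> list of similar-channel records) into CSV rows.
# The column names and record keys are illustrative assumptions.
def results_to_csv(results):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['source_channel', 'similar_channel', 'members'])
    for source, similar_list in results.items():
        for ch in similar_list:
            writer.writerow([source, ch.get('username', ''),
                             ch.get('members', '')])
    return buf.getvalue()
```

In a Flask route, this string would typically be returned with a `text/csv` mimetype and a `Content-Disposition: attachment` header so the browser offers it as a download.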

Configuration Process

The application relies on environment variables for sensitive information and configurable settings, primarily loaded from a .env file in the project root.

  • telegram_crawler.py:
    • Uses load_dotenv() to load variables from .env.
    • The load_config() function specifically reads TELEGRAM_API_ID, TELEGRAM_API_HASH, TELEGRAM_PHONE, TELEGRAM_SESSION, DELAY_BETWEEN_CHANNELS, and BATCH_SIZE.
    • These values are crucial for the TelegramCrawler to connect and operate.
  • main.py:
    • The handle_config_form() function allows users to update these environment variables directly through the web interface. It reads the submitted values and writes them back to the .env file.
    • It also uses os.getenv() to display current configuration values on the web page.
    • Crucially, main.py calls load_dotenv() before telegram_crawler.load_config() in handle_start_crawler() and api_run_crawler() to ensure the latest environment variables are loaded before the crawler runs.

This setup allows for flexible configuration, either by directly editing the .env file or through the web interface.