
Project Vision: Telegram Similar Channels Finder

Strategic vision for evolving the application into a modular, scalable, and user-friendly system.

1. Core Principles & Architectural Goals

Separation of Concerns

The project currently separates Telegram API communication (telegram_crawler.py) from the web interface (main.py). Future enhancements will further modularize the codebase:

  • Move common utilities (e.g., logging, configuration loading, data validation) into a separate utils or common module.
  • Introduce a centralized, type-checked configuration management system using libraries like Pydantic.
  • Establish a dedicated service layer (e.g., a “crawler service” object) that both CLI and web interfaces can utilize, minimizing code duplication.

Code Style and Type Safety

  • Implement type hints throughout the codebase (Python 3’s type annotations) to improve maintainability and enable static type checking (e.g., mypy).
  • Adopt linting and formatting tools (e.g., flake8, black) to ensure consistent code style.
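A minimal tool-configuration sketch for this stack (the option values below are illustrative suggestions, not settings the project already uses):

```toml
# pyproject.toml — hypothetical configuration for black and mypy
[tool.black]
line-length = 100
target-version = ["py311"]

[tool.mypy]
strict = true
ignore_missing_imports = true
```

Note that flake8 reads its configuration from `setup.cfg` or `.flake8` rather than `pyproject.toml`.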

Service Layer Architecture

A TelegramService will encapsulate crawling logic, caching, and rate limiting, providing a unified interface for various components.

# services/telegram_service.py
from typing import Dict, List

class TelegramService:
    def __init__(self, config: Config):
        self.crawler = TelegramCrawler(config)
        self.cache = ChannelCache()
        self.rate_limiter = RateLimiter()

    async def get_similar_channels_batch(self, channels: List[str]) -> Dict:
        """Process multiple channels with caching and rate limiting"""
        results = {}

        for channel in channels:
            # Check cache first
            if cached := await self.cache.get(channel):
                results[channel] = cached
                continue

            # Rate limit check
            await self.rate_limiter.wait_if_needed()

            # Fetch from API
            similar = await self.crawler.get_similar_channels(channel)
            await self.cache.set(channel, similar)
            results[channel] = similar

        return results

2. Key Enhancements: Crawler & Performance

Improvements to the core crawling logic will focus on input validation, asynchronous processing, robust error handling, and performance optimization through caching and efficient Telethon usage.

Input Handling and Validation

  • Utilize a schema validation library like pydantic for channels and configuration settings, making input validation self-documenting and easier to update.
  • Enhance input flexibility (e.g., trimming spaces, handling typos) and provide clear error messages.
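As a sketch of what pydantic-based input validation might look like (the `ChannelList` model and its normalization rules are illustrative, not existing code):

```python
from typing import List
from pydantic import BaseModel, validator

class ChannelList(BaseModel):
    channels: List[str]

    @validator('channels')
    def normalize(cls, value: List[str]) -> List[str]:
        cleaned = []
        for name in value:
            name = name.strip()              # tolerate stray whitespace
            if not name.startswith('@'):
                name = f'@{name}'            # accept names with or without '@'
            if len(name) < 5:
                raise ValueError(f'channel name too short: {name!r}')
            cleaned.append(name)
        return cleaned
```

A failed validation raises a `ValidationError` whose message can be surfaced to the user directly, which covers the "clear error messages" requirement for free.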

Asynchronous Processing & Rate Limiting

  • Leverage Python’s asyncio.gather or semaphore-based concurrency to fetch channel recommendations concurrently while adhering to Telegram rate limits.
  • Implement an adaptive rate limiter that dynamically adjusts based on API responses to maximize throughput and prevent throttling.
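An adaptive limiter along these lines could be sketched as follows (the backoff and recovery factors are illustrative defaults, not tuned values):

```python
import asyncio

class AdaptiveRateLimiter:
    """Slows down on FloodWait signals, speeds back up on success."""

    def __init__(self, initial_delay: float = 1.0,
                 min_delay: float = 0.2, max_delay: float = 60.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    async def wait_if_needed(self):
        await asyncio.sleep(self.delay)

    def on_success(self):
        # Gradually shorten the delay while the API accepts requests
        self.delay = max(self.min_delay, self.delay * 0.9)

    def on_flood_wait(self, seconds: int):
        # Back off hard when Telegram signals throttling
        self.delay = min(self.max_delay, max(self.delay * 2, float(seconds)))
```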

Error Handling and Resilience

  • Isolate API errors from data validation errors. Implement retry logic for transient API failures.
  • Adopt structured logging (e.g., Python’s standard logging module with JSON output) for easier monitoring and analysis of errors.
  • Consider circuit-breaker patterns for continuous API failures.
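Structured logging needs no extra dependencies; a minimal JSON formatter on top of the standard `logging` module might look like this (the field names are an assumption, not an existing project convention):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "time": self.formatTime(record),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("crawler")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```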

Performance Improvements

  • Batch Processing: Dynamically adjust BATCH_SIZE and DELAY_BETWEEN_CHANNELS based on observed crawl performance.
  • Caching: Implement caching for frequently requested channel results (in-memory or Redis) to speed up subsequent requests.
  • Optimized Telethon Usage: Ensure proper session handling (caching and reusing connections) in Telethon to minimize reconnect overhead.

Code Structure and Documentation

  • Refactor complex methods (e.g., process_channels) into smaller, more focused functions (e.g., separating API calls, result aggregation, file I/O).
  • Document functions, expected data formats, API limits, and error-handling behaviors to facilitate future maintenance.

Concurrent Processing Implementation

# crawler/concurrent_crawler.py
import asyncio
from typing import List, Dict

class ConcurrentCrawler:
    def __init__(self, max_concurrent: int = 5):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.session_pool = SessionPool(size=max_concurrent)  # pool of pre-authorized Telethon sessions

    async def process_channels(self, channels: List[str]) -> Dict:
        """Process multiple channels concurrently"""
        tasks = [
            self._process_with_limit(channel)
            for channel in channels
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        return {
            channel: result
            for channel, result in zip(channels, results)
            if not isinstance(result, Exception)
        }

    async def _process_with_limit(self, channel: str):
        async with self.semaphore:
            session = await self.session_pool.acquire()
            try:
                return await self._fetch_similar(session, channel)
            finally:
                await self.session_pool.release(session)

Caching Layer Implementation

# cache/channel_cache.py
import json

class ChannelCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.memory_cache = {}  # L1 cache

    async def get_or_fetch(self, channel: str, fetcher):
        # L1 memory cache
        if channel in self.memory_cache:
            return self.memory_cache[channel]

        # L2 Redis cache
        cache_key = f"channel:{channel}"
        cached = await self.redis.get(cache_key)
        if cached:
            result = json.loads(cached)
            self.memory_cache[channel] = result
            return result

        # Fetch from API
        result = await fetcher(channel)

        # Store in both caches
        await self.redis.setex(
            cache_key,
            86400,  # 24 hours
            json.dumps(result)
        )
        self.memory_cache[channel] = result

        return result

3. Key Enhancements: Web Interface & API

The web interface will be made more responsive and user-friendly, with a focus on asynchronous task management, improved data persistence, and enhanced configuration.

Responsive and Asynchronous Task Management

  • Offload long-running crawling processes to background workers (e.g., Celery, RQ, or asyncio) to maintain web request responsiveness.
  • Enable users to start a crawl and then poll for status updates or receive notifications upon completion.

Improving Data Persistence

  • Transition from file-based TempStorage to a lightweight database (e.g., SQLite, PostgreSQL) or an in-memory solution like Redis with file backup for state persistence across worker reloads.
  • Implement locking or atomic write operations to ensure data integrity during concurrent requests.
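For the file-backed fallback, atomic writes can use the standard write-to-temp-then-rename pattern; this helper is a sketch, not existing project code:

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file must live on the same filesystem for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```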

Enhanced Configuration Management

  • Validate updated configurations using schema validation before writing to .env to prevent misconfiguration.
  • Consider an admin interface for frequent configuration changes.

User Experience Improvements

  • Real-Time Feedback: Utilize WebSockets or AJAX polling for continuous crawler status updates on the web interface.
  • Result Presentation: Implement sorting, filtering, and pagination for large result sets in the UI.
  • Error Reporting: Clearly report errors (e.g., API errors, missing channels) to users on the UI, not just in logs.

Background Task Queue Implementation

# tasks/celery_tasks.py
import asyncio

from celery import Celery
from typing import List

app = Celery('telegram_crawler', broker='redis://localhost:6379')

@app.task(bind=True, max_retries=3)
def crawl_channels_task(self, channel_ids: List[str]):
    """Background task for crawling channels"""
    try:
        crawler = TelegramService(settings)
        results = asyncio.run(
            crawler.get_similar_channels_batch(channel_ids)
        )

        # Store results
        storage = RedisStorage()
        storage.set_results(results)

        return {"status": "completed", "channels": len(results)}
    except Exception as exc:
        # Retry with exponential backoff
        raise self.retry(exc=exc, countdown=60 * (self.request.retries + 1))

RESTful API with FastAPI

# api/main.py
import asyncio

from fastapi import FastAPI, BackgroundTasks, HTTPException, WebSocket
from pydantic import BaseModel
from typing import List

app = FastAPI()

class CrawlRequest(BaseModel):
    channels: List[str]
    config: dict = {}

class CrawlResponse(BaseModel):
    task_id: str
    status: str

@app.post("/api/crawl", response_model=CrawlResponse)
async def start_crawl(request: CrawlRequest, background_tasks: BackgroundTasks):
    """Start crawling process"""
    # Validate channels
    validated = InputHandler.validate_channels(request.channels)

    # Create task
    task = crawl_channels_task.delay(validated)

    return CrawlResponse(
        task_id=task.id,
        status="processing"
    )

@app.get("/api/status/{task_id}")
async def get_status(task_id: str):
    """Get crawl task status"""
    task = crawl_channels_task.AsyncResult(task_id)

    return {
        "task_id": task_id,
        "status": task.status,
        "result": task.result if task.ready() else None
    }

@app.websocket("/ws/updates/{task_id}")
async def websocket_updates(websocket: WebSocket, task_id: str):
    """Real-time updates via WebSocket"""
    await websocket.accept()

    while True:
        task = crawl_channels_task.AsyncResult(task_id)
        await websocket.send_json({
            "status": task.status,
            "progress": task.info.get('current', 0) if task.info else 0
        })

        if task.ready():
            break

        await asyncio.sleep(1)

Modern React Frontend

// frontend/src/components/ChannelCrawler.jsx
import React, { useState, useEffect } from 'react';
import { useWebSocket } from 'react-use-websocket';

function ChannelCrawler() {
    const [channels, setChannels] = useState([]);
    const [taskId, setTaskId] = useState(null);
    const [progress, setProgress] = useState(0);

    const { lastMessage } = useWebSocket(
        taskId ? `ws://localhost:8000/ws/updates/${taskId}` : null
    );

    useEffect(() => {
        if (lastMessage) {
            const data = JSON.parse(lastMessage.data);
            setProgress(data.progress);
        }
    }, [lastMessage]);

    const startCrawl = async () => {
        const response = await fetch('/api/crawl', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ channels })
        });

        const data = await response.json();
        setTaskId(data.task_id);
    };

    return (
        <div>
            <ChannelInput onChange={setChannels} />
            <ProgressBar value={progress} />
            <button onClick={startCrawl}>Start Crawling</button>
            <ResultsDisplay taskId={taskId} />
        </div>
    );
}

4. Operational Excellence

This section outlines strategies for robust configuration management, persistent data storage, comprehensive error handling, and streamlined deployment.

Configuration Management with Pydantic

# config/settings.py
# Pydantic v1 style; with Pydantic v2, BaseSettings moves to the
# separate pydantic-settings package.
from pydantic import BaseSettings, validator
from typing import Optional

class Settings(BaseSettings):
    # Telegram settings
    telegram_api_id: int
    telegram_api_hash: str
    telegram_phone: str
    telegram_session: str = "default_session"

    # Performance settings
    max_concurrent_requests: int = 5
    delay_between_requests: float = 1.0
    batch_size: int = 50
    cache_ttl: int = 86400

    # Redis settings
    redis_url: str = "redis://localhost:6379"

    # Database settings
    database_url: str = "postgresql://user:pass@localhost/telegram_crawler"

    class Config:
        env_file = ".env"
        case_sensitive = False

    @validator('telegram_phone')
    def validate_phone(cls, v):
        if not v.startswith('+'):
            raise ValueError('Phone must start with +')
        return v

# Singleton pattern
settings = Settings()

Data Persistence with Redis

# storage/redis_storage.py
import json
import redis
from typing import List

class RedisStorage:
    def __init__(self):
        self.client = redis.Redis(
            host='localhost',
            port=6379,
            decode_responses=True
        )

    def set_channels(self, channels: List[str], ttl: int = 3600):
        self.client.setex(
            'channels',
            ttl,
            json.dumps(channels)
        )

    def get_channels(self) -> List[str]:
        data = self.client.get('channels')
        return json.loads(data) if data else []

Database Integration with PostgreSQL

# models/database.py
from datetime import datetime

from sqlalchemy import create_engine, Column, Integer, String, DateTime, JSON
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Channel(Base):
    __tablename__ = 'channels'

    username = Column(String, primary_key=True)
    title = Column(String)
    members_count = Column(Integer)
    description = Column(String)
    similar_channels = Column(JSON)
    last_updated = Column(DateTime, default=datetime.utcnow)

class CrawlJob(Base):
    __tablename__ = 'crawl_jobs'

    id = Column(String, primary_key=True)
    channels = Column(JSON)
    status = Column(String)
    started_at = Column(DateTime)
    completed_at = Column(DateTime)
    results = Column(JSON)

Comprehensive Error Handling

# utils/error_handler.py
import asyncio
import logging

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration
from telethon.errors import FloodWaitError, ChannelPrivateError

logger = logging.getLogger(__name__)

class ErrorHandler:
    def __init__(self):
        sentry_sdk.init(
            dsn="your-sentry-dsn",
            integrations=[CeleryIntegration()]
        )

    @staticmethod
    def handle_telegram_errors(func):
        """Decorator for handling Telegram-specific errors"""
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except FloodWaitError as e:
                logger.warning(f"Rate limited, waiting {e.seconds}s")
                await asyncio.sleep(e.seconds)
                return await func(*args, **kwargs)
            except ChannelPrivateError:
                logger.error(f"Channel is private: {args}")
                return None
            except Exception as e:
                logger.exception("Unexpected error")
                sentry_sdk.capture_exception(e)
                raise
        return wrapper

Monitoring and Metrics

# monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge

# Metrics
channels_processed = Counter(
    'telegram_channels_processed_total',
    'Total number of channels processed'
)

api_request_duration = Histogram(
    'telegram_api_request_duration_seconds',
    'Duration of Telegram API requests'
)

active_sessions = Gauge(
    'telegram_active_sessions',
    'Number of active Telegram sessions'
)

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:pass@db/telegram_crawler
      - REDIS_URL=redis://redis:6379
    depends_on:
      - db
      - redis

  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    environment:
      - DATABASE_URL=postgresql://user:pass@db/telegram_crawler
      - REDIS_URL=redis://redis:6379
    depends_on:
      - db
      - redis

  db:
    image: postgres:13
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=telegram_crawler
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"

volumes:
  postgres_data:

Deployment Optimization with Kubernetes

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telegram-crawler-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: telegram-crawler-api
  template:
    metadata:
      labels:
        app: telegram-crawler-api
    spec:
      containers:
      - name: api
        image: telegram-crawler:latest
        ports:
        - containerPort: 8000
        env:
        - name: DATABASE_URL
          value: "postgresql://user:pass@postgres-service/telegram_crawler"
        - name: REDIS_URL
          value: "redis://redis-service:6379"

5. Testing Strategy & Continuous Improvement

This section outlines the approach to testing, performance benchmarking, and continuous integration/continuous deployment (CI/CD) to ensure the reliability, efficiency, and rapid evolution of the project.

Automated Testing

  • Unit Tests: Develop comprehensive unit tests for individual functions and classes (e.g., TelegramService, ChannelCache, RateLimiter).
  • Integration Tests: Create integration tests to verify the interaction between different components (e.g., crawler with cache, API with background tasks).
  • End-to-End Tests: Implement end-to-end tests for critical user flows (e.g., starting a crawl from the web UI and verifying results).

Performance Benchmarking

  • Establish benchmarks for key operations (e.g., channels processed per minute, API response times) and integrate them into the CI/CD pipeline.
  • Use profiling tools to identify and address performance bottlenecks.

CI/CD Pipeline

  • Automate testing, linting, and deployment processes using tools like GitHub Actions, GitLab CI, or Jenkins.
  • Implement blue/green deployments or canary releases for seamless updates with minimal downtime.
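A GitHub Actions pipeline covering the linting and testing stack above might be outlined as follows (job layout and action versions are illustrative):

```yaml
# .github/workflows/ci.yml — hypothetical pipeline sketch
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: flake8 .
      - run: black --check .
      - run: mypy .
      - run: pytest --cov
```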

Example Test Structure

# tests/test_crawler.py
import pytest
from unittest.mock import AsyncMock, patch
from services.telegram_service import TelegramService
from models.database import Channel

@pytest.fixture
def mock_telegram_service():
    with patch('services.telegram_service.TelegramCrawler') as MockCrawler, \
         patch('services.telegram_service.ChannelCache') as MockCache, \
         patch('services.telegram_service.RateLimiter') as MockRateLimiter:

        mock_crawler_instance = MockCrawler.return_value
        mock_crawler_instance.get_similar_channels = AsyncMock(return_value=["@channel_a", "@channel_b"])

        mock_cache_instance = MockCache.return_value
        mock_cache_instance.get = AsyncMock(return_value=None)
        mock_cache_instance.set = AsyncMock()

        mock_rate_limiter_instance = MockRateLimiter.return_value
        mock_rate_limiter_instance.wait_if_needed = AsyncMock()

        service = TelegramService(config=None) # config can be mocked or a dummy
        yield service

@pytest.mark.asyncio
async def test_get_similar_channels_batch(mock_telegram_service):
    channels_to_crawl = ["@test_channel_1", "@test_channel_2"]
    results = await mock_telegram_service.get_similar_channels_batch(channels_to_crawl)

    assert "@test_channel_1" in results
    assert "@test_channel_2" in results
    assert results["@test_channel_1"] == ["@channel_a", "@channel_b"]
    assert mock_telegram_service.crawler.get_similar_channels.call_count == 2
    assert mock_telegram_service.cache.set.call_count == 2

@pytest.mark.asyncio
async def test_get_similar_channels_batch_cached(mock_telegram_service):
    mock_telegram_service.cache.get.return_value = ["@cached_channel"]
    channels_to_crawl = ["@test_channel_cached"]
    results = await mock_telegram_service.get_similar_channels_batch(channels_to_crawl)

    assert "@test_channel_cached" in results
    assert results["@test_channel_cached"] == ["@cached_channel"]
    assert mock_telegram_service.crawler.get_similar_channels.call_count == 0 # Should not call API if cached

Strengths

1. Well-Structured Architecture

  • The separation of concerns is clearly defined with service layers, dedicated storage, and proper abstraction
  • The modular design will significantly improve maintainability and testability
  • The progression from monolithic to microservices-ready architecture is well thought out

2. Performance Optimization

  • The multi-level caching strategy (L1 memory + L2 Redis) is excellent for reducing API calls
  • Concurrent processing with semaphore-based rate limiting shows good understanding of async patterns
  • The session pooling approach will minimize connection overhead

3. Operational Excellence

  • Comprehensive error handling with retry logic and circuit breakers
  • Proper monitoring with Prometheus metrics
  • Docker and Kubernetes configurations for easy deployment and scaling

Areas for Enhancement

1. Security Considerations

Add a dedicated security section covering:

# security/auth.py
import hashlib
import hmac

class SecurityManager:
    def __init__(self):
        self.jwt_secret = settings.jwt_secret
        self.rate_limiter = RateLimiter()

    def generate_api_key(self, user_id: str) -> str:
        """Generate a secure, verifiable API key for a user"""
        signature = hmac.new(
            self.jwt_secret.encode(), user_id.encode(), hashlib.sha256
        ).hexdigest()
        return f"{user_id}.{signature}"

    def validate_request(self, request: Request) -> bool:
        """Validate the API key header and enforce rate limits"""
        user_id, _, signature = request.headers.get("X-API-Key", "").partition(".")
        if not user_id or not signature:
            return False
        expected = self.generate_api_key(user_id).rpartition(".")[2]
        return hmac.compare_digest(expected, signature)

Consider adding:

  • API key management for the REST endpoints
  • Input sanitization beyond validation
  • Secure storage of Telegram session files
  • HTTPS/TLS configuration in deployment

2. Data Privacy and Compliance

# privacy/data_handler.py
import hashlib

class DataPrivacyHandler:
    SENSITIVE_FIELDS = ("phone", "email", "owner_id")

    def anonymize_channel_data(self, channel_data: dict) -> dict:
        """Remove or hash sensitive information"""
        return {
            key: hashlib.sha256(str(value).encode()).hexdigest()
            if key in self.SENSITIVE_FIELDS else value
            for key, value in channel_data.items()
        }

    def handle_gdpr_request(self, user_id: str) -> dict:
        """Handle data deletion requests"""
        pass

3. Graceful Degradation

Add fallback mechanisms:

class FallbackStrategy:
    async def get_channels_with_fallback(self, channel: str):
        try:
            # Try primary method
            return await self.crawler.get_similar_channels(channel)
        except TelegramAPIError:
            # Try cached data even if stale
            return await self.cache.get(channel, ignore_ttl=True)
        except Exception:
            # Return minimal data
            return {"error": "Service temporarily unavailable"}

4. Advanced Caching Strategies

Consider implementing:

  • Cache warming for popular channels
  • Predictive caching based on usage patterns
  • Cache invalidation strategies

class SmartCache:
    async def warm_cache(self):
        """Pre-fetch popular channels during low-traffic periods"""
        popular_channels = await self.get_popular_channels()
        for channel in popular_channels:
            await self.refresh_cache(channel)

    async def predictive_cache(self, current_channel: str):
        """Cache likely next requests based on patterns"""
        related = await self.ml_model.predict_related(current_channel)
        asyncio.create_task(self.batch_cache(related))

5. Observability Enhancements

Expand monitoring to include:

# monitoring/tracing.py
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

class TracingMiddleware:
    @tracer.start_as_current_span("crawl_channel")
    async def trace_crawl(self, channel: str):
        span = trace.get_current_span()
        span.set_attribute("channel.name", channel)
        span.set_attribute("cache.hit", False)
        # ... crawling logic

6. Testing Improvements

Add more test scenarios:

# tests/test_resilience.py
@pytest.mark.asyncio
async def test_circuit_breaker():
    """Test circuit breaker opens after failures"""
    crawler = TelegramService(config)

    # Simulate multiple failures
    for _ in range(5):
        with pytest.raises(ServiceUnavailable):
            await crawler.get_similar_channels("test")

    # Circuit should be open
    assert crawler.circuit_breaker.is_open()

@pytest.mark.asyncio
async def test_cache_stampede_prevention():
    """Test that concurrent requests for same channel don't cause stampede"""
    ...  # implementation omitted

7. API Versioning

Consider API versioning strategy:

# api/v1/routes.py
router = APIRouter(prefix="/api/v1")

@router.post("/crawl")
async def crawl_v1(request: CrawlRequestV1):
    ...  # V1 implementation

# api/v2/routes.py
router = APIRouter(prefix="/api/v2")

@router.post("/crawl")
async def crawl_v2(request: CrawlRequestV2):
    ...  # V2 with breaking changes

8. Documentation Generation

Add automatic API documentation:

# docs/generator.py
from fastapi import FastAPI
from fastapi.openapi.utils import get_openapi

def custom_openapi():
    if app.openapi_schema:
        return app.openapi_schema

    openapi_schema = get_openapi(
        title="Telegram Crawler API",
        version="2.0.0",
        description="API for finding similar Telegram channels",
        routes=app.routes,
    )

    # Add request-body examples (OpenAPI places them under the media type)
    openapi_schema["paths"]["/api/crawl"]["post"]["requestBody"][
        "content"]["application/json"]["examples"] = {
        "simple": {
            "value": {"channels": ["@example_channel"]}
        }
    }

    app.openapi_schema = openapi_schema
    return app.openapi_schema

9. Performance Profiling

Add profiling capabilities:

# profiling/performance.py
import asyncio
import cProfile
import pstats
from typing import List

class PerformanceProfiler:
    def profile_crawl(self, channels: List[str]):
        profiler = cProfile.Profile()
        profiler.enable()

        # Run crawl
        results = asyncio.run(self.crawler.process_channels(channels))

        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(20)  # Top 20 functions

        return results

10. Business Logic Separation

Consider domain-driven design:

# domain/models.py
from dataclasses import dataclass

@dataclass
class Channel:
    username: str
    title: str
    members_count: int

    def is_valid(self) -> bool:
        return len(self.username) > 0 and self.members_count >= 0

# domain/services.py
class ChannelAnalysisService:
    def calculate_similarity_score(self, channel1: Channel, channel2: Channel) -> float:
        """Business logic for channel similarity"""
        pass

Minor Suggestions

  1. Environment-Specific Configs: Add development, staging, and production configuration profiles
  2. Health Checks: Implement comprehensive health check endpoints that verify all dependencies
  3. Feature Flags: Consider implementing feature toggles for gradual rollouts
  4. Audit Logging: Add audit trails for configuration changes and sensitive operations
  5. Backup Strategies: Document backup and recovery procedures for Redis and PostgreSQL
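
For suggestion 2, a health-check endpoint can aggregate simple dependency probes; this is a sketch under the assumption that each probe (Redis ping, database query, Telegram session check) is wrapped as a boolean callable:

```python
from typing import Callable, Dict

def run_health_checks(probes: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named probe and summarize overall service health."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "fail"
        except Exception:
            results[name] = "fail"  # a crashing probe counts as unhealthy
    healthy = all(status == "ok" for status in results.values())
    return {"status": "healthy" if healthy else "degraded", "checks": results}
```

An HTTP handler can then return this dict with a 200 or 503 status depending on the aggregate result.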