Kohulan/DECIMER-Image-Segmentation

🔬 DECIMER Image Segmentation 📄

Deep Learning for Chemical Image Recognition - Automated Structure Detection & Extraction


🌐 Try it live at decimer.ai



📝 Overview

Unlocking decades of chemical knowledge from scientific literature!

Chemistry has accumulated vast amounts of knowledge about chemical compounds, structures, and properties across countless scientific publications. DECIMER Segmentation is the first open-source, deep learning-based tool designed to automatically recognize and extract chemical structure depictions from scientific documents.

🎯 The Challenge

Optical Chemical Structure Recognition (OCSR), the conversion of images of chemical structures into machine-readable formats, is a crucial step in digitizing chemical knowledge. But before structures can be recognized, they must be found and extracted from complex document pages!

💡 The Solution

DECIMER Segmentation uses deep learning to:

  • 🔍 Detect chemical structure depictions in scientific publications
  • ✂️ Extract individual structure images with precision
  • 📚 Process both modern PDFs and scanned historical documents
  • ⚡ Automate the entire workflow from document to segmented structures

✨ Key Features

🤖 Deep Learning Powered

Built on the Mask R-CNN architecture for state-of-the-art detection accuracy

📖 Universal Compatibility

Works with PDFs, scanned pages, and bitmap images from any publisher

🆓 Open Source

Freely available code and pre-trained models for the community

⚡ High Performance

GPU acceleration support for rapid batch processing

🎨 Smart Post-Processing

Automatic mask expansion to capture complete structures

🌐 Web Application

User-friendly interface available at decimer.ai


🎯 How It Works

DECIMER Segmentation employs a two-stage workflow:

1️⃣ Detection Stage

📄 Input Document → 🤖 Mask R-CNN Model → 🎭 Structure Masks

The deep learning model analyzes the page and creates precise masks indicating the location of each chemical structure.
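
To make the link between a mask and an extracted image concrete, here is a minimal sketch in plain Python. It is not the code used by DECIMER Segmentation. The helper names `mask_bounding_box` and `crop` are hypothetical, and the page and mask are assumed to be small nested lists rather than real image arrays.

```python
def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero area, or None if empty.

    `bottom` and `right` are exclusive, so they can be used directly in slices.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(page, box):
    """Cut the boxed region out of a page stored as a list of pixel rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


# Toy 4x4 page whose pixel values are just their running index.
page = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = mask_bounding_box(mask)
segment = crop(page, box)
print(box, segment)
```

In practice the masks and pages are image arrays, but the idea is the same: each mask marks one region of the page, and that region becomes one segment.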

2️⃣ Post-Processing Stage

🎭 Initial Masks → 🔧 Expansion Algorithm → ✅ Complete Structures

An intelligent post-processing workflow ensures that potentially incomplete masks are expanded to capture the full structure.
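
The idea behind mask expansion can be illustrated with a small sketch. This is not the algorithm shipped with DECIMER Segmentation. It assumes a binarized page (1 = ink, 0 = background) stored as nested lists, and it grows an initial mask over every ink pixel connected to ink the mask already covers. The function name `expand_mask` is hypothetical.

```python
from collections import deque


def expand_mask(ink, mask):
    """Grow `mask` over all ink pixels 8-connected to an ink pixel inside it.

    ink, mask: equally sized lists of lists holding 0/1 values.
    Illustrative only; not the DECIMER Segmentation implementation.
    """
    rows, cols = len(ink), len(ink[0])
    expanded = [row[:] for row in mask]
    # Seed the search with ink pixels the initial mask already covers.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] and ink[r][c]
    )
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and ink[nr][nc] and not expanded[nr][nc]):
                    expanded[nr][nc] = 1
                    queue.append((nr, nc))
    return expanded


# A bond drawn along the first row; the initial mask covers only its left end.
ink = [
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
mask = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
expanded = expand_mask(ink, mask)
print(expanded)
```

The mask grows to cover the whole connected line, while the isolated pixel at the right edge stays outside because nothing connects it to the masked region.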

🎨 Visual Workflow

┌─────────────────┐
│  PDF/Image File │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Page Extraction│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask R-CNN      │
│ Detection       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Mask Expansion  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Segmented       │
│ Structures      │
└─────────────────┘

⚙️ Installation

🐍 Prerequisites

We strongly recommend using a Conda environment for seamless dependency management.

Install Miniconda (if not already installed)

# Linux
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

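# macOS (Intel; on Apple Silicon use Miniconda3-latest-MacOSX-arm64.sh)
# macOS does not ship wget by default, so curl is used here
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh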

📦 Installation Options

Option 1: Install from GitHub (Development Version)

# Clone the repository
git clone https://github.com/Kohulan/DECIMER-Image-Segmentation.git
cd DECIMER-Image-Segmentation

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install dependencies
conda install pip
python -m pip install -U pip

# Install DECIMER-Segmentation
pip install .

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler

Option 2: Install from PyPI (Stable Release)

# Create and activate conda environment
conda create --name DECIMER_IMGSEG python=3.10
conda activate DECIMER_IMGSEG

# Install from PyPI
pip install decimer-segmentation

# Install Poppler (required for PDF processing)
conda install -c conda-forge poppler
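
After either installation option, a quick sanity check can confirm that the environment is set up. The sketch below is not part of the package. The helper name `check_environment` is hypothetical, and it assumes that finding `pdftoppm` (a command-line tool shipped with Poppler) on the PATH means Poppler is installed.

```python
import importlib.util
import shutil


def check_environment(module_name="decimer_segmentation", pdf_tool="pdftoppm"):
    """Report whether the Python package and a Poppler binary can be found."""
    return {
        "package_importable": importlib.util.find_spec(module_name) is not None,
        "poppler_on_path": shutil.which(pdf_tool) is not None,
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

Run it inside the activated DECIMER_IMGSEG environment; a "no" on either line points to the step that needs repeating.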

🖥️ Hardware Requirements

  • CPU Mode: Works on any modern CPU
  • GPU Mode (Recommended): CUDA-compatible GPU with appropriate drivers
    • Significantly faster processing
    • Essential for batch processing

🚀 Usage

Command Line Interface

Process entire documents with a single command:

# Segment structures from a PDF or image file
python3 segment_structures_in_document.py your_document.pdf

# Output will be saved in a folder named after your input file
# e.g., your_document/ containing all segmented structures
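
To pick up the results of a command-line run from Python, something like the following sketch can be used. It rests on two assumptions that should be checked against an actual run: that the output folder sits next to the input file and carries its name without the extension, as the comment above suggests, and that the segments are saved as PNG files. Both helper names are hypothetical.

```python
from pathlib import Path


def expected_output_dir(input_file):
    """Return the folder the CLI is described as writing to.

    Assumption: the folder shares the input file's name without the extension.
    """
    return Path(input_file).with_suffix("")


def list_segments(input_file, pattern="*.png"):
    """List saved segment images, or an empty list if the folder is missing."""
    out_dir = expected_output_dir(input_file)
    return sorted(out_dir.glob(pattern)) if out_dir.is_dir() else []


print(expected_output_dir("your_document.pdf"))
print(len(list_segments("your_document.pdf")), "segment files found")
```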

Python API

🎨 Example 1: Segment from Image Array

from decimer_segmentation import segment_chemical_structures
import cv2

# Load your scanned page (cv2.imread returns None if the file cannot be read)
page_image = cv2.imread("path/to/scanned_page.png")
if page_image is None:
    raise FileNotFoundError("Could not read path/to/scanned_page.png")

# Extract all chemical structures
segments = segment_chemical_structures(page_image, expand=True)

# segments is a list of numpy arrays, each containing a structure
for idx, structure in enumerate(segments):
    cv2.imwrite(f"structure_{idx}.png", structure)
    print(f"✅ Saved structure {idx}")

📄 Example 2: Segment from File (PDF or Image)

from decimer_segmentation import segment_chemical_structures_from_file

# Process a PDF file
pdf_segments = segment_chemical_structures_from_file(
    "path/to/document.pdf",
    expand=True
)

# Process an image file
image_segments = segment_chemical_structures_from_file(
    "path/to/page_image.jpg",
    expand=True
)

# Report the combined count from both inputs
total = len(pdf_segments) + len(image_segments)
print(f"🎉 Extracted {total} chemical structures!")

🔧 Example 3: Batch Processing

from decimer_segmentation import segment_chemical_structures_from_file
import cv2  # needed for cv2.imwrite below
from pathlib import Path

def batch_segment(input_dir, output_dir):
    """Process multiple PDF files"""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    for pdf_file in Path(input_dir).glob("*.pdf"):
        print(f"📄 Processing {pdf_file.name}...")
        
        segments = segment_chemical_structures_from_file(
            str(pdf_file),
            expand=True
        )
        
        # Save each segment in a subfolder named after the PDF
        file_output_dir = Path(output_dir) / pdf_file.stem
        file_output_dir.mkdir(exist_ok=True)
        
        for idx, segment in enumerate(segments):
            output_path = file_output_dir / f"structure_{idx:03d}.png"
            cv2.imwrite(str(output_path), segment)
        
        print(f"✅ Extracted {len(segments)} structures from {pdf_file.name}")

# Use it
batch_segment("input_pdfs/", "output_structures/")

🎯 Example 4: Advanced Usage with Custom Parameters

from decimer_segmentation import segment_chemical_structures
import cv2

# Load image
image = cv2.imread("complex_page.png")

# Segment with custom settings
segments = segment_chemical_structures(
    image,
    expand=True,          # Enable mask expansion
    visualization=True    # Generate visualization (if available)
)

# Process results
for idx, segment in enumerate(segments):
    # You can now pass this to DECIMER Image Transformer
    # for structure recognition
    print(f"Structure {idx}: {segment.shape}")

📓 Interactive Tutorial

For more comprehensive examples and interactive demonstrations, check out our Jupyter Notebook!


🪟 Notes for Windows Users

Windows-Specific Instructions

1️⃣ Use Anaconda PowerShell Prompt

Run all commands in the Anaconda PowerShell Prompt (not the regular Command Prompt or PowerShell).

2️⃣ Install Poppler for PDF Support

PDF processing on Windows requires Poppler. Follow these steps:

  1. Download Poppler

  2. Specify Poppler Path in Code

    from decimer_segmentation import segment_chemical_structures_from_file
    
    segments = segment_chemical_structures_from_file(
        "document.pdf",
        expand=True,
        poppler_path=r"C:\Program Files\poppler\Library\bin"
    )
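
If the location of Poppler is uncertain, the path can be verified before it is passed on. The sketch below is a hypothetical helper, not part of decimer_segmentation. It assumes that a valid Poppler `bin` folder contains the `pdftoppm` executable, and the candidate folders listed are only examples.

```python
from pathlib import Path


def find_poppler_bin(candidates):
    """Return the first candidate directory that contains pdftoppm, else None."""
    for folder in candidates:
        folder = Path(folder)
        if any((folder / name).is_file() for name in ("pdftoppm.exe", "pdftoppm")):
            return str(folder)
    return None


# Hypothetical install locations; adjust them to where Poppler was unpacked.
poppler_path = find_poppler_bin([
    r"C:\Program Files\poppler\Library\bin",
    r"C:\poppler\Library\bin",
])
print(poppler_path or "Poppler not found; check the download location.")
```

The returned string can then be used as the `poppler_path` argument shown above.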

3️⃣ GPU Support on Windows

Ensure you have:

  • CUDA Toolkit installed
  • cuDNN libraries configured
  • Compatible GPU drivers

📊 Model Information

🤖 Pre-trained Model

The Mask R-CNN model is publicly available and ready to use:

DOI

🎓 Model Architecture

  • Base Network: Mask R-CNN
  • Training Data: Diverse chemical literature from multiple publishers
  • Task: Instance segmentation of chemical structure depictions
  • Performance: Manually validated on publications from various sources

🔍 Model Performance

The model has been evaluated on:

  • ✅ Publications from multiple scientific publishers
  • ✅ Documents spanning different time periods
  • ✅ Both modern PDFs and scanned historical pages
  • ✅ Various image qualities and layouts

📄 Citation

If DECIMER Segmentation contributes to your research, please cite:

@article{Rajan2021,
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Sorokina, Maria and Zielesny, Achim and Steinbeck, Christoph},
  title = {DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature},
  journal = {Journal of Cheminformatics},
  year = {2021},
  volume = {13},
  number = {20},
  doi = {10.1186/s13321-021-00496-1}
}

Full Citation:
Rajan, K., Brinkhaus, H.O., Sorokina, M. et al. DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature. J Cheminform 13, 20 (2021). https://doi.org/10.1186/s13321-021-00496-1


🙏 Acknowledgements

🌟 Special Thanks

This project wouldn't be possible without the support and contributions from the community and funding organizations.

  • Contributors: All our amazing contributors who helped improve the codebase
  • Community: Users providing feedback and reporting issues
  • Open Source: Projects we build upon: TensorFlow, Mask R-CNN

🌐 Project Website

Experience DECIMER Live!

🚀 Try the DECIMER.ai web application.

📦 Complete DECIMER Suite

DECIMER Segmentation is part of a comprehensive chemical structure recognition pipeline:

  1. 🔍 DECIMER Segmentation (You are here)
    Extract chemical structures from documents

  2. 🧠 DECIMER Image Transformer
    Convert structure images to SMILES strings

  3. 🗄️ MARCUS
    Molecular Annotation and Recognition for Curating Unravelled Structures


🏛️ Research Group

🎓 Maintained by the Kohulan @ Steinbeck Group

Cheminformatics Group

Natural Products Cheminformatics Research Group
Institute for Inorganic and Analytical Chemistry
Friedrich Schiller University Jena, Germany


🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

📝 Report Bug · 💡 Request Feature · ⭐ Star this repo


Made with ❤️ and ☕ for the global chemistry community

© 2025 Kohulan @ Steinbeck Lab, Friedrich Schiller University Jena


🔬 Advancing Open Science in Chemistry | 🌐 Digitizing Chemical Knowledge | 🤖 Powered by Deep Learning
