COSMIC

COncept-aware Semantic Meta-chunking with Intelligent Classification

A production-ready intelligent text chunking framework for Retrieval-Augmented Generation (RAG) systems.

Requires Python 3.10+.

Overview

COSMIC addresses fundamental limitations in existing text chunking approaches for RAG systems.

The Problem

Current chunking methods suffer from three critical issues:

  1. Semantic Fragmentation - Fixed-length chunkers split mid-concept, breaking coherent ideas
  2. Context Loss - Simple overlap strategies create redundancy without preserving meaning
  3. Domain Blindness - One-size-fits-all approaches ignore domain-specific structure
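
To make the first issue concrete, here is a minimal sketch of a naive fixed-length chunker. The function name `fixed_length_chunks` and the sample text are illustrative assumptions and are not part of COSMIC.

```python
def fixed_length_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Two unrelated sentences (hypothetical sample input).
text = "Mitochondria produce ATP. The stock market fell sharply today."
chunks = fixed_length_chunks(text, 20)

for chunk in chunks:
    # The second chunk straddles both sentences, and "market" is cut in half.
    print(repr(chunk))
```

Because the split points depend only on character counts, one chunk ends up holding the tail of the biology sentence and the head of the finance sentence, which is the fragmentation described above.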

The Solution

COSMIC introduces a 6-stage pipeline that combines:

  • Discourse Coherence Scoring (DCS) - Multi-signal boundary detection using topical coherence, coreference density, and discourse markers
  • MST-based Domain Clustering - Minimum spanning tree clustering for domain classification
  • Adaptive Boundary Fusion - Weighted combination of structural and semantic signals
  • LLM Verification - Optional verification of uncertain boundaries
  • Zero-Overlap Architecture - Self-contained conceptual chunks without redundant overlap
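
The multi-signal idea behind Discourse Coherence Scoring can be sketched as a weighted sum of three per-boundary signals. This is a toy illustration, not COSMIC's actual formula: the word-overlap similarity, the pronoun list, the marker list, the weights, and every function name below are assumptions chosen to keep the example dependency-free.

```python
import re

# Crude stand-ins for real NLP components (all hypothetical).
PRONOUNS = {"it", "its", "they", "them", "their", "this", "these", "those", "he", "she"}
SHIFT_MARKERS = ("meanwhile", "in contrast", "turning to", "separately")


def tokens(sentence: str) -> list[str]:
    return re.findall(r"[a-z']+", sentence.lower())


def topical_similarity(a: str, b: str) -> float:
    """Jaccard word overlap, standing in for embedding similarity."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def coreference_density(sentence: str) -> float:
    """Fraction of tokens that are pronouns, a rough proxy for coreference."""
    toks = tokens(sentence)
    return sum(t in PRONOUNS for t in toks) / len(toks) if toks else 0.0


def has_shift_marker(sentence: str) -> bool:
    return sentence.lower().lstrip().startswith(SHIFT_MARKERS)


def boundary_score(prev: str, nxt: str,
                   w_topic: float = 0.5, w_coref: float = 0.3,
                   w_marker: float = 0.2) -> float:
    """Higher score means a chunk boundary between prev and nxt is more plausible."""
    topic_break = 1.0 - topical_similarity(prev, nxt)
    coref_break = 1.0 - coreference_density(nxt)  # pronouns suggest continuation
    marker = 1.0 if has_shift_marker(nxt) else 0.0
    return w_topic * topic_break + w_coref * coref_break + w_marker * marker


sentences = [
    "The cell membrane controls what enters the cell.",
    "It also anchors proteins that relay signals.",
    "Meanwhile, bond yields rose across Europe.",
]
scores = [boundary_score(a, b) for a, b in zip(sentences, sentences[1:])]
print(scores)
```

In this sample the second candidate boundary scores higher than the first, since the third sentence has no pronoun linking back and opens with a shift marker. A real implementation would presumably replace each proxy with a proper model (embeddings, a coreference resolver, a discourse parser).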

Quick Start

# Install from PyPI
pip install "cosmic-chunker[all]"

# Download spaCy model
python -m spacy download en_core_web_trf

from cosmic import COSMICChunker, Document

# Create chunker
chunker = COSMICChunker()

# Chunk a document
doc = Document.from_text("Your document text here...")
chunks = chunker.chunk_document(doc, strategy="auto")

for chunk in chunks:
    print(f"Domain: {chunk.domain}, Coherence: {chunk.coherence_score:.2f}")
    print(chunk.text[:100])

Features

  • 6-Stage Pipeline: Structure analysis, semantic boundaries, domain classification, boundary fusion, LLM verification, reference linking
  • Multiple Strategies: Full COSMIC, semantic-only, sliding window, fixed-length
  • Graceful Degradation: Automatic fallback when stages fail
  • Ollama Integration: Local LLM verification without API costs
  • CLI Tool: Easy command-line interface for quick chunking
  • Batch Processing: Efficient processing of multiple documents
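
The graceful degradation listed above can be pictured as an ordered fallback chain over the available strategies. The sketch below is an assumption about the general pattern, not COSMIC's implementation: the strategy functions, their names, and the simulated failure are all hypothetical.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def full_pipeline(text: str) -> list[str]:
    # Simulated failure, e.g. an optional verification stage being unreachable.
    raise RuntimeError("verification stage unavailable")


def semantic_only(text: str) -> list[str]:
    # Toy stand-in: treat blank-line-separated paragraphs as chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fixed_length(text: str, size: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_with_fallback(text: str,
                        strategies: list[tuple[str, Chunker]]) -> tuple[str, list[str]]:
    """Try each strategy in order; return the name and output of the first that succeeds."""
    errors = []
    for name, strategy in strategies:
        try:
            return name, strategy(text)
        except Exception as exc:  # broad on purpose: any stage failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


STRATEGIES = [
    ("full", full_pipeline),
    ("semantic-only", semantic_only),
    ("fixed-length", fixed_length),
]

used, chunks = chunk_with_fallback("First idea.\n\nSecond idea.", STRATEGIES)
print(used, chunks)
```

Ordering the list from most to least sophisticated means a failure costs quality but never availability, and returning the strategy name makes it possible to track how often fallback occurs.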

Target Metrics

| Metric | Target | Description |
|---|---|---|
| Coherence Score | > 0.85 | Semantic unity within chunks |
| Cross-Concept Splits | < 5% | Chunks that break conceptual boundaries |
| Latency | < 150ms/page | Processing speed |
| Fallback Rate | < 15% | Graceful degradation frequency |
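
One way to use these targets is as an automated check over an evaluation run. The sketch below encodes the thresholds from the table; the metric key names and the `check_targets` helper are hypothetical, and the sample inputs are placeholders rather than measured results.

```python
import operator

# Thresholds taken from the target metrics table: (comparison, threshold).
TARGETS = {
    "coherence_score": (operator.gt, 0.85),
    "cross_concept_split_rate": (operator.lt, 0.05),
    "latency_ms_per_page": (operator.lt, 150.0),
    "fallback_rate": (operator.lt, 0.15),
}


def check_targets(measured: dict[str, float]) -> dict[str, bool]:
    """Return, per metric, whether the measured value meets its target."""
    return {
        name: compare(measured[name], threshold)
        for name, (compare, threshold) in TARGETS.items()
    }


# Placeholder inputs for illustration only; these are NOT benchmark results.
example = {
    "coherence_score": 0.90,
    "cross_concept_split_rate": 0.08,
    "latency_ms_per_page": 120.0,
    "fallback_rate": 0.10,
}

report = check_targets(example)
for name, ok in report.items():
    print(f"{name}: {'meets target' if ok else 'misses target'}")
```

Rates are expressed as fractions here (0.05 for 5%), so any real measurements would need to be normalized the same way before comparison.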

License

Apache 2.0 License - See LICENSE for details.