COSMIC
COncept-aware Semantic Meta-chunking with Intelligent Classification
A production-ready intelligent text chunking framework for Retrieval-Augmented Generation (RAG) systems.
Overview
COSMIC addresses fundamental limitations in existing text chunking approaches for RAG systems.
The Problem
Current chunking methods suffer from three critical issues:
- Semantic Fragmentation: Fixed-length chunkers split mid-concept, breaking coherent ideas
- Context Loss: Simple overlap strategies create redundancy without preserving meaning
- Domain Blindness: One-size-fits-all approaches ignore domain-specific structure
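The first issue is easy to reproduce. The snippet below is a minimal sketch of a naive fixed-length chunker, not part of COSMIC; the function name `fixed_length_chunks` and the sample text are made up for illustration. It shows a character-count cut landing in the middle of a word and mixing two unrelated ideas into one chunk.

```python
def fixed_length_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Two sentences about two different concepts (illustrative text).
text = "Mitochondria produce ATP. Ribosomes build proteins."
chunks = fixed_length_chunks(text, 30)

for chunk in chunks:
    # The first chunk ends partway through "Ribosomes", so the second
    # concept is split across both chunks.
    print(repr(chunk))
```

Nothing is lost when the chunks are concatenated, but neither chunk is a self-contained idea, which is the fragmentation described above.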
The Solution
COSMIC introduces a 6-stage pipeline that combines:
- Discourse Coherence Scoring (DCS): Multi-signal boundary detection using topical coherence, coreference density, and discourse markers
- MST-based Domain Clustering: Minimum spanning tree clustering for domain classification
- Adaptive Boundary Fusion: Weighted combination of structural and semantic signals
- LLM Verification: Optional verification of uncertain boundaries
- Zero-Overlap Architecture: Self-contained conceptual chunks without redundant overlap
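To make the fusion and zero-overlap ideas concrete, here is a minimal sketch under stated assumptions. It is not COSMIC's implementation: `BoundarySignal`, `fuse`, `split_zero_overlap`, the weights, and the threshold are all hypothetical and chosen only for illustration. Each gap between two sentences gets a structural score and a semantic score, the two are combined with fixed weights, and a chunk boundary is placed wherever the fused score crosses the threshold.

```python
from dataclasses import dataclass


@dataclass
class BoundarySignal:
    """Scores in [0, 1] for the gap after a sentence (hypothetical structure)."""
    structural: float  # e.g. a heading or paragraph break follows
    semantic: float    # e.g. a drop in topical coherence


def fuse(signal: BoundarySignal, w_structural: float = 0.4,
         w_semantic: float = 0.6) -> float:
    """Weighted combination of the two signals (weights are assumptions)."""
    return w_structural * signal.structural + w_semantic * signal.semantic


def split_zero_overlap(sentences: list[str], signals: list[BoundarySignal],
                       threshold: float = 0.5) -> list[str]:
    """Split at gaps whose fused score reaches `threshold`.

    signals[i] describes the gap between sentences[i] and sentences[i + 1].
    Every sentence lands in exactly one chunk, so chunks never overlap.
    """
    if not sentences:
        return []
    if len(signals) != len(sentences) - 1:
        raise ValueError("need exactly one signal per gap between sentences")
    chunks: list[str] = []
    current = [sentences[0]]
    for sentence, signal in zip(sentences[1:], signals):
        if fuse(signal) >= threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cells store energy as ATP.",
    "Mitochondria produce most of it.",
    "Contracts require offer and acceptance.",
    "Consideration must also be present.",
]
# Illustrative scores: only the middle gap looks like a topic change.
signals = [
    BoundarySignal(structural=0.0, semantic=0.1),
    BoundarySignal(structural=1.0, semantic=0.9),
    BoundarySignal(structural=0.0, semantic=0.2),
]
chunks = split_zero_overlap(sentences, signals)
```

In this toy input only the second gap crosses the threshold, so the biology sentences and the contract-law sentences end up in separate chunks with no shared text.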
Quick Start
# Install from PyPI
pip install "cosmic-chunker[all]"
# Download spaCy model
python -m spacy download en_core_web_trf
from cosmic import COSMICChunker, Document
# Create chunker
chunker = COSMICChunker()
# Chunk a document
doc = Document.from_text("Your document text here...")
chunks = chunker.chunk_document(doc, strategy="auto")
for chunk in chunks:
    print(f"Domain: {chunk.domain}, Coherence: {chunk.coherence_score:.2f}")
    print(chunk.text[:100])
Features
- 6-Stage Pipeline: Structure analysis, semantic boundaries, domain classification, boundary fusion, LLM verification, reference linking
- Multiple Strategies: Full COSMIC, semantic-only, sliding window, fixed-length
- Graceful Degradation: Automatic fallback when stages fail
- Ollama Integration: Local LLM verification without API costs
- CLI Tool: Easy command-line interface for quick chunking
- Batch Processing: Efficient processing of multiple documents
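Graceful degradation can be pictured as an ordered chain of strategies, tried from most to least capable. The sketch below is an assumption about how such a chain might look, not COSMIC's actual code; `chunk_with_fallback`, `full_pipeline`, and `fixed_length` are hypothetical stand-ins, and the failure is simulated.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def chunk_with_fallback(
    text: str, strategies: list[tuple[str, Chunker]]
) -> tuple[str, list[str]]:
    """Try each strategy in order and return the first one that succeeds."""
    errors = []
    for name, strategy in strategies:
        try:
            return name, strategy(text)
        except Exception as exc:  # any stage failure triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def full_pipeline(text: str) -> list[str]:
    """Stand-in for the full pipeline; always fails to simulate an outage."""
    raise RuntimeError("LLM verifier unavailable")


def fixed_length(text: str, size: int = 40) -> list[str]:
    """Simplest fallback: consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "some text " * 10
used, chunks = chunk_with_fallback(
    text,
    [("full", full_pipeline), ("fixed-length", fixed_length)],
)
print(f"strategy used: {used}, chunks: {len(chunks)}")
```

Returning the name of the strategy that actually ran makes it straightforward to count how often a fallback was needed.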
Target Metrics
| Metric | Target | Description |
|---|---|---|
| Coherence Score | > 0.85 | Semantic unity within chunks |
| Cross-Concept Splits | < 5% | Share of chunks that break conceptual boundaries |
| Latency | < 150 ms/page | Processing time per page |
| Fallback Rate | < 15% | How often graceful degradation is triggered |
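The targets above can be encoded as simple threshold checks, for example in an evaluation script. This is a hedged sketch: the metric keys, the `check_targets` helper, and the sample values are all hypothetical, and the sample values are placeholders rather than measured results.

```python
import operator

# Targets from the table, as (comparison, limit). Percentages are fractions.
TARGETS = {
    "coherence_score": (operator.gt, 0.85),
    "cross_concept_split_rate": (operator.lt, 0.05),
    "latency_ms_per_page": (operator.lt, 150.0),
    "fallback_rate": (operator.lt, 0.15),
}


def check_targets(measured: dict[str, float]) -> dict[str, bool]:
    """Return, per metric, whether the measured value meets its target."""
    return {
        name: compare(measured[name], limit)
        for name, (compare, limit) in TARGETS.items()
    }


# Placeholder values for illustration only; not real benchmark numbers.
sample = {
    "coherence_score": 0.90,
    "cross_concept_split_rate": 0.07,
    "latency_ms_per_page": 120.0,
    "fallback_rate": 0.10,
}
results = check_targets(sample)
for name, ok in results.items():
    print(f"{name}: {'meets target' if ok else 'misses target'}")
```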
License
Apache License 2.0. See LICENSE for details.