Chunking Strategies¶
COSMIC offers multiple chunking strategies to balance quality and performance.
Strategy Overview¶
| Strategy | Quality | Speed | Best For |
|---|---|---|---|
| `auto` | Adaptive | Adaptive | General use (recommended) |
| `full` | Highest | Slowest | Well-structured documents |
| `semantic` | High | Medium | Clear topic transitions |
| `sliding` | Medium | Fast | Speed-critical applications |
| `fixed` | Baseline | Fastest | Comparisons, simple docs |
Auto Strategy (Recommended)¶
Selects a strategy automatically from the document's structure score:
- Score > 0.7: Uses the full COSMIC pipeline
- Score 0.4-0.7: Uses the semantic-only strategy
- Score < 0.4: Uses the fallback chain
This is the recommended default for most use cases.
Full COSMIC Pipeline¶
Complete 6-stage pipeline:
- Structure Analysis - Detects headings, lists, tables
- Semantic Boundary Detection - Scores discourse coherence (DCS) between sentences
- Domain Classification - MST-based clustering
- Boundary Fusion - Merges structural and semantic signals
- LLM Verification - Verifies uncertain boundaries (optional)
- Reference Linking - Resolves cross-references
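The Boundary Fusion and LLM Verification stages can be pictured with a small sketch. This is an illustration only, not COSMIC's implementation: the `fuse_boundaries` function, the `FusedBoundary` type, the weights, and the accept/reject thresholds are all assumptions made up for this example. It combines structural and semantic scores per sentence position and flags mid-range scores as the "uncertain" boundaries that an optional LLM step might review.

```python
from dataclasses import dataclass


@dataclass
class FusedBoundary:
    position: int          # sentence index where a chunk break may occur
    score: float           # combined confidence in [0, 1]
    needs_llm_check: bool  # True when the score falls in the uncertain band


def fuse_boundaries(structural, semantic, w_structural=0.6, w_semantic=0.4,
                    accept=0.7, reject=0.3):
    """Combine per-position structural and semantic boundary scores.

    Both inputs map sentence index -> score in [0, 1]. The weights and
    thresholds are illustrative assumptions, not COSMIC's real values.
    """
    fused = []
    for pos in sorted(set(structural) | set(semantic)):
        score = (w_structural * structural.get(pos, 0.0)
                 + w_semantic * semantic.get(pos, 0.0))
        if score <= reject:
            continue  # too weak to count as a boundary
        fused.append(FusedBoundary(pos, round(score, 3), score < accept))
    return fused


structural = {0: 1.0, 12: 0.9, 30: 0.2}  # hypothetical heading/list signals
semantic = {12: 0.8, 21: 0.9, 30: 0.3}   # hypothetical coherence drops
for boundary in fuse_boundaries(structural, semantic):
    print(boundary)
```

With these made-up inputs, position 12 is accepted outright because both signals agree, positions 0 and 21 are kept but flagged for verification, and position 30 is dropped.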
Best for:
- Well-structured documents (reports, papers, documentation)
- When quality is more important than speed
- Documents with clear sections and headings
CLI:
cosmic chunk document.txt --strategy full
cosmic chunk document.txt --strategy full --ollama auto # With LLM
cosmic chunk document.txt --strategy full --no-llm # Without LLM
Semantic Only¶
Uses Discourse Coherence Scoring (DCS) without structure analysis.
Pipeline:
- Semantic Boundary Detection (DCS)
- Domain Classification
- Reference Linking
Best for:
- Documents without clear structure
- Plain text with topic transitions
- When you need faster processing than the full pipeline
CLI:
cosmic chunk document.txt --strategy semantic
Sliding Window¶
Basic similarity-based chunking with configurable overlap.
Approach:
- Compute embeddings for sentences
- Slide a window over the sentences and measure similarity
- Split where similarity drops below threshold
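The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not COSMIC's code: `embed` is a toy bag-of-words stand-in for a real sentence encoder, the `window` and `threshold` values are arbitrary, and chunk overlap is left out to keep the example short.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words 'embedding'; a real system would use a sentence encoder."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sliding_window_chunks(sentences, window=2, threshold=0.1):
    """Split wherever the windows before and after a position are dissimilar."""
    vectors = [embed(s) for s in sentences]
    chunks, start = [], 0
    for i in range(1, len(sentences)):
        # Sum the vectors on each side; the left window never crosses a split.
        left = sum(vectors[max(start, i - window):i], Counter())
        right = sum(vectors[i:i + window], Counter())
        if cosine(left, right) < threshold:
            chunks.append(sentences[start:i])
            start = i
    chunks.append(sentences[start:])
    return chunks


sentences = [
    "Cats are small pets.",
    "Cats sleep most of the day.",
    "Pets like cats need food.",
    "Stock markets fell sharply.",
    "Markets react to interest rates.",
]
chunks = sliding_window_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

Here the only split falls between the third and fourth sentences, where the two windows share no vocabulary.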
Best for:
- Speed-critical applications
- Simple documents
- When consistency matters more than precision
CLI:
cosmic chunk document.txt --strategy sliding
Fixed Length¶
Simple token-based splitting at the configured target_tokens.
Approach:
- Count tokens
- Split at target boundaries
- Try to break at sentence boundaries when possible
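As a rough sketch of this approach (not COSMIC's implementation), the function below packs whole sentences into chunks of at most `target_tokens` and only splits mid-sentence when a single sentence exceeds the target. The function name is hypothetical, and tokens are approximated by whitespace-separated words, which a real tokenizer would count differently.

```python
import re


def fixed_length_chunks(text, target_tokens=20):
    """Greedily pack whole sentences into chunks of about target_tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk if this sentence would overflow it.
        if current and count + len(words) > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the target is hard-split by tokens.
        while len(words) > target_tokens:
            chunks.append(" ".join(words[:target_tokens]))
            words = words[target_tokens:]
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine."
print(fixed_length_chunks(text, target_tokens=6))
```

With a target of 6, the first sentence forms its own chunk because adding the second would exceed the limit, while the second and third sentences fit together.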
Best for:
- Baseline comparisons
- Very simple documents
- Maximum speed
- When semantic boundaries don't matter
CLI:
cosmic chunk document.txt --strategy fixed
Fallback Chain¶
When a strategy encounters an error, COSMIC automatically falls back to a simpler one.
Each level maintains chunking functionality while reducing complexity.
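A fallback chain of this kind might look like the sketch below. The order full → semantic → sliding → fixed is an assumption inferred from the complexity ranking in the tables on this page, and `chunk_with_fallback` and the stand-in chunkers are hypothetical names, not part of the COSMIC API.

```python
# Assumed order, from most to least complex (not confirmed by the source).
FALLBACK_ORDER = ["full", "semantic", "sliding", "fixed"]


def chunk_with_fallback(text, chunkers, start="full"):
    """Try each strategy from `start` downward until one succeeds.

    `chunkers` maps a strategy name to a callable that takes text and
    returns a list of chunks. Returns (strategy_used, chunks).
    """
    errors = {}
    for name in FALLBACK_ORDER[FALLBACK_ORDER.index(start):]:
        try:
            return name, chunkers[name](text)
        except Exception as exc:  # real code would catch narrower errors
            errors[name] = exc
    raise RuntimeError(f"all strategies failed: {errors}")


def failing(text):
    """Stand-in for a strategy whose dependencies are unavailable."""
    raise ValueError("model unavailable")


chunkers = {
    "full": failing,
    "semantic": failing,
    "sliding": lambda text: text.split("\n\n"),
    "fixed": lambda text: [text[i:i + 200] for i in range(0, len(text), 200)],
}
used, chunks = chunk_with_fallback("First topic.\n\nSecond topic.", chunkers)
print(used, chunks)
```

Returning the name of the strategy that actually ran lets the caller log or report when a downgrade happened.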
Strategy Comparison¶
Quality Metrics¶
| Strategy | Coherence | Cross-Concept | Domain Accuracy |
|---|---|---|---|
| Full | > 0.85 | < 5% | > 90% |
| Semantic | > 0.80 | < 10% | > 85% |
| Sliding | > 0.70 | < 20% | N/A |
| Fixed | > 0.60 | < 30% | N/A |
Performance¶
| Strategy | Latency (per page) | Memory |
|---|---|---|
| Full | ~150ms | High |
| Semantic | ~80ms | Medium |
| Sliding | ~30ms | Low |
| Fixed | ~10ms | Minimal |
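To get a feel for what these figures mean for a batch job, the per-page latencies can be multiplied out. The sketch below assumes latency scales linearly with page count and that pages are processed one after another. Both are simplifications, and the numbers are the approximate values from the table rather than measurements.

```python
# Approximate per-page latencies from the table above, in milliseconds.
LATENCY_MS_PER_PAGE = {"full": 150, "semantic": 80, "sliding": 30, "fixed": 10}


def estimated_seconds(strategy, pages):
    """Rough sequential processing time, assuming linear scaling."""
    return LATENCY_MS_PER_PAGE[strategy] * pages / 1000


for name in LATENCY_MS_PER_PAGE:
    print(f"{name:<9}{estimated_seconds(name, 500):6.1f} s")
```

For a 500-page corpus this gives roughly 75 s for the full pipeline against 5 s for fixed-length splitting.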
Choosing a Strategy¶
Is quality critical?
├── Yes → Is document well-structured?
│   ├── Yes → full
│   └── No → semantic
└── No → Is speed critical?
    ├── Yes → fixed
    └── No → auto (recommended)
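The same decision tree can be written as a small helper. The function name and its boolean parameters are hypothetical and exist only to mirror the branches above.

```python
def recommend_strategy(quality_critical, well_structured=False, speed_critical=False):
    """Mirror the decision tree: returns 'full', 'semantic', 'fixed', or 'auto'."""
    if quality_critical:
        return "full" if well_structured else "semantic"
    return "fixed" if speed_critical else "auto"


print(recommend_strategy(quality_critical=True, well_structured=True))
print(recommend_strategy(quality_critical=False))
```

As in the tree, `speed_critical` is only consulted when quality is not the priority.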
Programmatic Strategy Selection¶
from cosmic import COSMICChunker, Document

# Load the text to chunk (document.txt is a placeholder path)
with open("document.txt", encoding="utf-8") as f:
    text = f.read()

chunker = COSMICChunker()
doc = Document.from_text(text)

# Check document structure
structure_score = chunker.analyze_structure(doc)

# Choose strategy based on analysis
if structure_score > 0.7:
    strategy = "full"
elif structure_score > 0.4:
    strategy = "semantic"
else:
    strategy = "sliding"

chunks = chunker.chunk_document(doc, strategy=strategy)