Chunking Strategies

COSMIC offers five chunking strategies that trade off chunk quality against processing speed.

Strategy Overview

| Strategy | Quality | Speed | Best For |
|----------|---------|-------|----------|
| `auto` | Adaptive | Adaptive | General use (recommended) |
| `full` | Highest | Slowest | Well-structured documents |
| `semantic` | High | Medium | Clear topic transitions |
| `sliding` | Medium | Fast | Speed-critical applications |
| `fixed` | Baseline | Fastest | Comparisons, simple docs |

Auto Strategy (Recommended)

chunks = chunker.chunk_document(doc, strategy="auto")

Selects a strategy automatically based on the document's structure score:

Score > 0.7: Uses the full COSMIC pipeline
Score 0.4 to 0.7: Uses the semantic-only strategy
Score < 0.4: Uses the fallback chain

This is the recommended default for most use cases.

Full COSMIC Pipeline

chunks = chunker.chunk_document(doc, strategy="full")

Runs the complete six-stage pipeline:

1. Structure Analysis - Detects headings, lists, and tables
2. Semantic Boundary Detection - Computes Discourse Coherence Scoring (DCS) between sentences
3. Domain Classification - MST-based clustering
4. Boundary Fusion - Merges structural and semantic signals
5. LLM Verification - Verifies uncertain boundaries (optional)
6. Reference Linking - Resolves cross-references
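
The Boundary Fusion and LLM Verification stages can be pictured with a short sketch. Everything in it is an assumption made for illustration: the function `fuse_boundaries`, the weighted-sum scheme, and the 0.4/0.7 band are hypothetical and are not taken from the COSMIC implementation. It only shows the general idea that structural and semantic scores are combined per position, and that mid-range scores are the "uncertain" boundaries an optional verification pass would look at.

```python
from dataclasses import dataclass


@dataclass
class FusedBoundary:
    position: int        # sentence index where a new chunk would start
    score: float         # combined confidence in [0, 1]
    needs_review: bool   # True when the score falls in the uncertain band


def fuse_boundaries(structural, semantic, w_struct=0.6, low=0.4, high=0.7):
    """Combine two {position: score} maps into boundary candidates.

    The weight and both thresholds are illustrative assumptions.
    """
    w_sem = 1.0 - w_struct
    fused = []
    for pos in sorted(set(structural) | set(semantic)):
        score = w_struct * structural.get(pos, 0.0) + w_sem * semantic.get(pos, 0.0)
        if score < low:
            continue  # too weak to count as a boundary
        fused.append(FusedBoundary(pos, round(score, 3), needs_review=score < high))
    return fused


structural = {0: 1.0, 12: 0.9, 30: 0.2}   # e.g. scores from detected headings
semantic = {12: 0.8, 21: 0.9, 30: 0.3}    # e.g. scores from coherence drops
fused = fuse_boundaries(structural, semantic)
for boundary in fused:
    print(boundary)
```

In this toy input, a position supported by both signals is accepted outright, a position supported by only one signal either lands in the review band or is dropped, depending on the weights chosen.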

Best for:

Well-structured documents (reports, papers, documentation)
When quality is more important than speed
Documents with clear sections and headings

CLI:

cosmic chunk document.txt --strategy full
cosmic chunk document.txt --strategy full --ollama auto  # With LLM
cosmic chunk document.txt --strategy full --no-llm       # Without LLM

Semantic Only

chunks = chunker.chunk_document(doc, strategy="semantic")

Uses Discourse Coherence Scoring (DCS) without structure analysis.

Pipeline:

1. Semantic Boundary Detection (DCS)
2. Domain Classification
3. Reference Linking

Best for:

Documents without clear structure
Plain text with topic transitions
When faster processing than the full pipeline is needed

CLI:

cosmic chunk document.txt --strategy semantic

Sliding Window

chunks = chunker.chunk_document(doc, strategy="sliding")

Basic similarity-based chunking with configurable overlap.

Approach:

1. Compute embeddings for sentences
2. Slide window and measure similarity
3. Split where similarity drops below threshold
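
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not COSMIC's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the names `sliding_chunks`, `window`, and `threshold` are hypothetical, and overlap between chunks is left out to keep the example short.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sliding_chunks(sentences, window=2, threshold=0.1):
    """Split where similarity between adjacent windows drops below threshold."""
    vectors = [embed(s) for s in sentences]
    chunks, start = [], 0
    for i in range(1, len(sentences)):
        # Left window never reaches back past the start of the current chunk.
        left = sum(vectors[max(start, i - window):i], Counter())
        right = sum(vectors[i:i + window], Counter())
        if cosine(left, right) < threshold:
            chunks.append(sentences[start:i])
            start = i
    if sentences:
        chunks.append(sentences[start:])
    return chunks


sentences = [
    "Cats are small domestic animals.",
    "Many cats sleep most of each day.",
    "Domestic cats hunt mice.",
    "Python is a programming language.",
    "The Python language favors readable code.",
]
chunks = sliding_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With a real embedding model the threshold would need tuning, since dense embeddings rarely produce similarities near zero the way word-count vectors do.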

Best for:

Speed-critical applications
Simple documents
When consistency matters more than precision

CLI:

cosmic chunk document.txt --strategy sliding

Fixed Length

chunks = chunker.chunk_document(doc, strategy="fixed")

Simple token-based splitting at the configured `target_tokens` size.

Approach:

1. Count tokens
2. Split at target boundaries
3. Try to break at sentence boundaries when possible
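
A rough sketch of this approach follows. It makes two simplifying assumptions that may not match COSMIC: tokens are approximated by whitespace-separated words, and sentences are found with a simple punctuation regex. The function name `fixed_chunks` is hypothetical; only `target_tokens` comes from the text above.

```python
import re


def fixed_chunks(text, target_tokens=50):
    """Greedy fixed-length chunking that prefers sentence boundaries.

    Tokens are approximated by whitespace-separated words.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # A sentence longer than the target cannot be kept whole: hard-split it.
        while len(words) > target_tokens:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:target_tokens]))
            words = words[target_tokens:]
        # Close the current chunk if adding this sentence would overflow it.
        if current and len(current) + len(words) > target_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "One two three. Four five six seven. Eight nine. "
    "Ten eleven twelve thirteen fourteen fifteen."
)
chunks = fixed_chunks(text, target_tokens=5)
for chunk in chunks:
    print(repr(chunk))
```

Only the last sentence is cut mid-sentence here, because it alone exceeds the five-token target; every other break falls on a sentence boundary.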

Best for:

Baseline comparisons
Very simple documents
Maximum speed
When semantic boundaries don't matter

CLI:

cosmic chunk document.txt --strategy fixed

Fallback Chain

When a strategy encounters an error, COSMIC automatically falls back to the next simpler one:

Full COSMIC → Semantic-only → Sliding window → Fixed-length

Each level still produces chunks, using a simpler method than the level before it.
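
The fallback behavior can be sketched as a loop over the chain with error handling. This is an illustration only: `chunk_with_fallback` and the stand-in strategy functions are hypothetical, and how COSMIC detects and reports failures internally is not described here.

```python
FALLBACK_ORDER = ["full", "semantic", "sliding", "fixed"]


def chunk_with_fallback(text, strategies, start="full"):
    """Try each strategy from `start` downward; return (strategy_name, chunks)."""
    errors = {}
    for name in FALLBACK_ORDER[FALLBACK_ORDER.index(start):]:
        try:
            return name, strategies[name](text)
        except Exception as exc:  # a failed level hands over to the next one
            errors[name] = exc
    raise RuntimeError(f"all strategies failed: {errors}")


def failing(text):
    """Stand-in for a strategy whose dependencies are unavailable."""
    raise ValueError("model unavailable")


# Hypothetical stand-ins for the real strategies.
strategies = {
    "full": failing,
    "semantic": failing,
    "sliding": lambda text: text.split("\n\n"),
    "fixed": lambda text: [text[i:i + 20] for i in range(0, len(text), 20)],
}

used, chunks = chunk_with_fallback("First topic.\n\nSecond topic.", strategies)
print(used, chunks)
```

Returning the name of the strategy that actually ran makes it possible for callers to log or flag degraded results.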

Strategy Comparison

Quality Metrics

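| Strategy | Coherence | Cross-Concept | Domain Accuracy |
|----------|-----------|---------------|-----------------|
| Full | > 0.85 | < 5% | > 90% |
| Semantic | > 0.80 | < 10% | > 85% |
| Sliding | > 0.70 | < 20% | N/A |
| Fixed | > 0.60 | < 30% | N/A |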

Performance

| Strategy | Latency (per page) | Memory |
|----------|--------------------|--------|
| Full | ~150ms | High |
| Semantic | ~80ms | Medium |
| Sliding | ~30ms | Low |
| Fixed | ~10ms | Minimal |
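
For a rough sense of scale, the per-page figures above can be turned into a back-of-the-envelope estimate. The sketch assumes latency grows linearly with page count, which the table does not state, and `estimate_seconds` is a hypothetical helper rather than part of COSMIC.

```python
# Approximate per-page latencies from the table above, in milliseconds.
LATENCY_MS_PER_PAGE = {"full": 150, "semantic": 80, "sliding": 30, "fixed": 10}


def estimate_seconds(strategy, pages):
    """Estimate total processing time, assuming linear scaling with pages."""
    return LATENCY_MS_PER_PAGE[strategy] * pages / 1000


for strategy in LATENCY_MS_PER_PAGE:
    print(f"{strategy:>8}: {estimate_seconds(strategy, 200):.1f} s for 200 pages")
```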

Choosing a Strategy

Is quality critical?
├── Yes → Is document well-structured?
│         ├── Yes → full
│         └── No → semantic
└── No → Is speed critical?
          ├── Yes → fixed
          └── No → auto (recommended)
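
The decision tree above can also be written as a small function. This is only a sketch: `choose_strategy` and its boolean parameters are hypothetical names, not part of the COSMIC API.

```python
def choose_strategy(quality_critical, well_structured=False, speed_critical=False):
    """Return a strategy name by following the decision tree."""
    if quality_critical:
        return "full" if well_structured else "semantic"
    return "fixed" if speed_critical else "auto"


print(choose_strategy(quality_critical=True, well_structured=True))
print(choose_strategy(quality_critical=False, speed_critical=True))
```

The returned string can be passed as the `strategy` argument of `chunk_document`.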

Programmatic Strategy Selection

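from pathlib import Path

from cosmic import COSMICChunker, Document

chunker = COSMICChunker()
# Load the text to chunk; document.txt is a placeholder for any plain-text input
text = Path("document.txt").read_text(encoding="utf-8")
doc = Document.from_text(text)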

# Check document structure
structure_score = chunker.analyze_structure(doc)

# Choose strategy based on analysis
if structure_score > 0.7:
    strategy = "full"
elif structure_score >= 0.4:  # scores of 0.4 to 0.7 map to semantic, as in auto
    strategy = "semantic"
else:
    strategy = "sliding"

chunks = chunker.chunk_document(doc, strategy=strategy)