
Chunking Strategies

COSMIC offers five chunking strategies that trade chunk quality against processing speed.

Strategy Overview

Strategy   Quality    Speed      Best For
auto       Adaptive   Adaptive   General use (recommended)
full       Highest    Slowest    Well-structured documents
semantic   High       Medium     Clear topic transitions
sliding    Medium     Fast       Speed-critical applications
fixed      Baseline   Fastest    Comparisons, simple docs

Auto

chunks = chunker.chunk_document(doc, strategy="auto")

Automatically selects a strategy based on the document's structure score:

  • Score > 0.7: Uses full COSMIC pipeline
  • Score 0.4-0.7: Uses semantic-only
  • Score < 0.4: Uses fallback chain

This is the recommended default for most use cases.

Full COSMIC Pipeline

chunks = chunker.chunk_document(doc, strategy="full")

Complete 6-stage pipeline:

  1. Structure Analysis - Detects headings, lists, tables
  2. Semantic Boundary Detection - Computes Discourse Coherence Scoring (DCS) between sentences
  3. Domain Classification - MST-based clustering
  4. Boundary Fusion - Merges structural and semantic signals
  5. LLM Verification - Verifies uncertain boundaries (optional)
  6. Reference Linking - Resolves cross-references
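
Stage 4 can be pictured as a weighted combination of two per-boundary confidences. The sketch below is illustrative only: the function name `fuse_boundaries`, the equal weights, and the 0.5 threshold are assumptions made for this example, not COSMIC's actual fusion rule.

```python
from typing import Dict, List


def fuse_boundaries(
    structural: Dict[int, float],
    semantic: Dict[int, float],
    w_structural: float = 0.5,  # assumed weight, not a COSMIC default
    w_semantic: float = 0.5,    # assumed weight, not a COSMIC default
    threshold: float = 0.5,     # assumed cut-off, not a COSMIC default
) -> List[int]:
    """Return candidate positions whose fused score reaches the threshold.

    Keys are candidate boundary positions (index of the sentence that would
    start a new chunk); values are confidences in [0, 1]. A position missing
    from one signal counts as 0.0 for that signal.
    """
    candidates = set(structural) | set(semantic)
    fused = {
        pos: w_structural * structural.get(pos, 0.0)
        + w_semantic * semantic.get(pos, 0.0)
        for pos in candidates
    }
    return sorted(pos for pos, score in fused.items() if score >= threshold)


# A heading before sentence 4; topic shifts of varying strength at 4, 9, 12.
structural = {4: 1.0}
semantic = {4: 0.6, 9: 0.8, 12: 0.3}
print(fuse_boundaries(structural, semantic))  # prints [4]
```

With these weights, a boundary supported by only one signal needs a very high confidence to survive, which is one plausible reason to send borderline cases to an optional verification step.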

Best for:

  • Well-structured documents (reports, papers, documentation)
  • When quality is more important than speed
  • Documents with clear sections and headings

CLI:

cosmic chunk document.txt --strategy full
cosmic chunk document.txt --strategy full --ollama auto  # With LLM
cosmic chunk document.txt --strategy full --no-llm       # Without LLM

Semantic Only

chunks = chunker.chunk_document(doc, strategy="semantic")

Uses Discourse Coherence Scoring (DCS) without structure analysis.

Pipeline:

  1. Semantic Boundary Detection (DCS)
  2. Domain Classification
  3. Reference Linking

Best for:

  • Documents without clear structure
  • Plain text with topic transitions
  • When faster processing than the full pipeline is needed

CLI:

cosmic chunk document.txt --strategy semantic

Sliding Window

chunks = chunker.chunk_document(doc, strategy="sliding")

Basic similarity-based chunking with configurable overlap.

Approach:

  1. Compute embeddings for sentences
  2. Slide window and measure similarity
  3. Split where similarity drops below threshold
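
The three steps above can be sketched in plain Python. This is not COSMIC's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the names `sliding_chunks`, `window`, and `threshold` are hypothetical, and the configurable overlap is left out for brevity.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sliding_chunks(
    sentences: List[str], window: int = 2, threshold: float = 0.1
) -> List[List[str]]:
    """Split wherever the windows before and after a position are dissimilar."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, start = [], 0
    for i in range(1, len(sentences)):
        # The left window never reaches back past the current chunk's start.
        left = sum(vectors[max(start, i - window):i], Counter())
        right = sum(vectors[i:i + window], Counter())
        if cosine(left, right) < threshold:
            chunks.append(sentences[start:i])
            start = i
    chunks.append(sentences[start:])
    return chunks


sentences = [
    "Cats purr when they are happy.",
    "Happy cats also knead blankets.",
    "Interest rates rose last quarter.",
    "Higher rates slow borrowing.",
]
for chunk in sliding_chunks(sentences):
    print(chunk)
```

On this input the only split falls between the second and third sentences, where the two windows share no words.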

Best for:

  • Speed-critical applications
  • Simple documents
  • When consistency matters more than precision

CLI:

cosmic chunk document.txt --strategy sliding

Fixed Length

chunks = chunker.chunk_document(doc, strategy="fixed")

Simple token-based splitting at configured target_tokens.

Approach:

  1. Count tokens
  2. Split at target boundaries
  3. Try to break at sentence boundaries when possible
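
A minimal sketch of this approach, under stated assumptions: whitespace-separated words stand in for tokens (a real tokenizer will count differently), and `fixed_chunks` is a hypothetical helper, not part of COSMIC. Only the `target_tokens` name comes from the text above.

```python
import re
from typing import List


def fixed_chunks(text: str, target_tokens: int = 50) -> List[str]:
    """Greedily pack whole sentences into chunks of at most target_tokens.

    A sentence longer than target_tokens is hard-split, because no sentence
    boundary is available inside it.
    """
    if target_tokens < 1:
        raise ValueError("target_tokens must be at least 1")
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Oversized sentence: flush what we have, then cut at the limit.
        while len(words) > target_tokens:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:target_tokens]))
            words = words[target_tokens:]
        # Start a new chunk if this sentence would overflow the current one.
        if current and len(current) + len(words) > target_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine. Ten."
print(fixed_chunks(text, target_tokens=5))
# prints ['One two three. Four five.', 'Six seven eight nine. Ten.']
```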

Best for:

  • Baseline comparisons
  • Very simple documents
  • Maximum speed
  • When semantic boundaries don't matter

CLI:

cosmic chunk document.txt --strategy fixed

Fallback Chain

When a strategy encounters an error, COSMIC automatically falls back to the next simpler one:

Full COSMIC → Semantic-only → Sliding window → Fixed-length

Each level still produces chunks, using progressively simpler processing.
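
The chain above follows a common try-then-degrade pattern. The sketch below shows that pattern in isolation; `chunk_with_fallback` and the toy strategy functions are hypothetical and say nothing about how COSMIC implements its fallback internally.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Order taken from the chain described above.
FALLBACK_ORDER = ("full", "semantic", "sliding", "fixed")


def chunk_with_fallback(
    text: str,
    strategies: Dict[str, Callable[[str], List[str]]],
    start: str = "full",
    order: Sequence[str] = FALLBACK_ORDER,
) -> Tuple[str, List[str]]:
    """Try `start`, then each simpler strategy, until one succeeds."""
    errors = []
    for name in order[order.index(start):]:
        try:
            return name, strategies[name](text)
        except Exception as exc:  # record the failure and degrade
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def broken(text: str) -> List[str]:
    raise ValueError("model unavailable")


def by_paragraph(text: str) -> List[str]:
    return text.split("\n\n")


strategies = {
    "full": broken,
    "semantic": broken,
    "sliding": by_paragraph,
    "fixed": by_paragraph,
}
used, chunks = chunk_with_fallback("a\n\nb", strategies)
print(used, chunks)  # prints sliding ['a', 'b']
```

Returning the name of the strategy that actually ran makes it possible to log or monitor how often degradation happens.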

Strategy Comparison

Quality Metrics

Strategy   Coherence   Cross-Concept   Domain Accuracy
Full       > 0.85      < 5%            > 90%
Semantic   > 0.80      < 10%           > 85%
Sliding    > 0.70      < 20%           N/A
Fixed      > 0.60      < 30%           N/A

Performance

Strategy   Latency (per page)   Memory
Full       ~150 ms              High
Semantic   ~80 ms               Medium
Sliding    ~30 ms               Low
Fixed      ~10 ms               Minimal
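
To get a feel for what the per-page figures mean for a whole document, the arithmetic can be scripted. This sketch assumes latency scales linearly with page count, which the table does not state; `estimated_seconds` is a hypothetical helper.

```python
# Approximate per-page latencies from the table above, in milliseconds.
LATENCY_MS_PER_PAGE = {"full": 150, "semantic": 80, "sliding": 30, "fixed": 10}


def estimated_seconds(pages: int, strategy: str) -> float:
    """Rough total latency, assuming cost grows linearly with page count."""
    return pages * LATENCY_MS_PER_PAGE[strategy] / 1000


for name in LATENCY_MS_PER_PAGE:
    print(f"{name:<9}{estimated_seconds(200, name):6.1f} s")
```

For a 200-page document this gives roughly 30 s for full, 16 s for semantic, 6 s for sliding, and 2 s for fixed.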

Choosing a Strategy

Is quality critical?
├── Yes → Is document well-structured?
│         ├── Yes → full
│         └── No → semantic
└── No → Is speed critical?
          ├── Yes → fixed
          └── No → auto (recommended)
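
The same decision tree can be written as a small helper when the choice has to be made in code. `choose_strategy` and its parameter names are hypothetical; only the returned strategy names come from this page.

```python
def choose_strategy(
    quality_critical: bool,
    well_structured: bool = False,
    speed_critical: bool = False,
) -> str:
    """Mirror the decision tree: quality first, then speed, else auto."""
    if quality_critical:
        return "full" if well_structured else "semantic"
    return "fixed" if speed_critical else "auto"


print(choose_strategy(quality_critical=True, well_structured=True))   # prints full
print(choose_strategy(quality_critical=False, speed_critical=True))   # prints fixed
print(choose_strategy(quality_critical=False))                        # prints auto
```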

Programmatic Strategy Selection

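from pathlib import Path

from cosmic import COSMICChunker, Document

chunker = COSMICChunker()
# Load the text to chunk; any string source works here
text = Path("document.txt").read_text(encoding="utf-8")
doc = Document.from_text(text)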

# Check document structure
structure_score = chunker.analyze_structure(doc)

# Choose strategy based on analysis
if structure_score > 0.7:
    strategy = "full"
elif structure_score > 0.4:
    strategy = "semantic"
else:
    strategy = "sliding"

chunks = chunker.chunk_document(doc, strategy=strategy)