Configuration

COSMIC can be configured via environment variables, YAML files, or Python code.

Environment Variables

Create a .env file in your project root:

# LLM Provider: "openai", "ollama", or "auto"
COSMIC_LLM_PROVIDER=ollama

# OpenAI-compatible API settings
COSMIC_LLM_URL=http://localhost:8000/v1
COSMIC_LLM_MODEL=default
COSMIC_LLM_API_KEY=your-api-key

# Ollama settings
OLLAMA_HOST=http://localhost:11434
COSMIC_OLLAMA_MODEL=auto

# Embedding device: "cuda", "cpu", or "mps"
COSMIC_EMBEDDING_DEVICE=cuda
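
Before starting a long run, it can help to check that the environment holds sensible values. The sketch below is a standalone pre-flight check, not COSMIC's own loading logic: `read_cosmic_env`, `DEFAULTS`, `VALID_PROVIDERS`, and `VALID_DEVICES` are hypothetical names, and the fallback values are simply the example values from the .env file above rather than confirmed library defaults.

```python
import os

# Hypothetical fallbacks, copied from the example .env values above.
DEFAULTS = {
    "COSMIC_LLM_PROVIDER": "ollama",
    "COSMIC_LLM_URL": "http://localhost:8000/v1",
    "COSMIC_LLM_MODEL": "default",
    "OLLAMA_HOST": "http://localhost:11434",
    "COSMIC_OLLAMA_MODEL": "auto",
    "COSMIC_EMBEDDING_DEVICE": "cuda",
}

# Allowed values as listed in the .env comments.
VALID_PROVIDERS = {"openai", "ollama", "auto"}
VALID_DEVICES = {"cuda", "cpu", "mps"}


def read_cosmic_env(environ=None):
    """Collect COSMIC-related settings from a mapping, defaulting to os.environ."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # The API key has no sensible fallback, so it stays None when unset.
    settings["COSMIC_LLM_API_KEY"] = environ.get("COSMIC_LLM_API_KEY")

    if settings["COSMIC_LLM_PROVIDER"] not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {settings['COSMIC_LLM_PROVIDER']!r}")
    if settings["COSMIC_EMBEDDING_DEVICE"] not in VALID_DEVICES:
        raise ValueError(f"unknown device: {settings['COSMIC_EMBEDDING_DEVICE']!r}")
    return settings
```

Calling `read_cosmic_env()` with no arguments reads the live process environment; passing a plain dict makes the check easy to exercise in tests.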

YAML Configuration

The default configuration lives in configs/default.yaml:

# Discourse Coherence Score weights (must sum to 1.0)
dcs:
  alpha: 0.4     # Topical coherence (embedding similarity)
  beta: 0.35     # Coreference density (entity continuity)
  gamma: 0.25    # Discourse markers (transition signals)
  threshold: 0.5 # Below this = boundary candidate

# Structure analysis thresholds
structure:
  enabled: true
  full_threshold: 0.7    # Above = full pipeline
  semantic_threshold: 0.4 # Above = semantic-only, below = fallback

# Embedding model settings
embedding:
  model_name: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  batch_size: 64
  cache_size: 10000
  device: cuda
  normalize: true

# LLM verification (Stage 5)
llm:
  enabled: true
  provider: ollama  # openai, ollama, auto
  base_url: http://localhost:11434/v1
  model_name: gemma3:latest
  confidence_threshold: 0.8  # Only verify boundaries with confidence below this
  batch_size: 10
  timeout_seconds: 30
  max_context_tokens: 512

# Reference linking (Stage 6)
reference:
  enabled: true
  use_coreference: true
  coreference_model: en_core_web_trf

# Boundary fusion weights
fusion:
  structural_weight: 0.6
  semantic_weight: 0.4
  acceptance_threshold: 0.5

# Chunk size constraints
chunks:
  min_tokens: 100
  max_tokens: 2000
  target_tokens: 500

# Runtime settings
workers: 4
gpu_memory_fraction: 0.5
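
A custom YAML file is easy to get subtly wrong, for example by changing one DCS weight and forgetting to rebalance the others. The sketch below checks an already-parsed configuration dict. `validate_config` is a hypothetical helper, not part of COSMIC. Only the sum-to-1.0 rule is stated in the file above; the ordering checks on thresholds and chunk sizes are assumptions inferred from the comments.

```python
import math


def validate_config(cfg):
    """Return a list of problems found in a parsed configuration dict."""
    problems = []

    # Stated in the YAML comments: the three DCS weights must sum to 1.0.
    dcs = cfg.get("dcs")
    if dcs is not None:
        total = dcs["alpha"] + dcs["beta"] + dcs["gamma"]
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"dcs weights sum to {total:.3f}, expected 1.0")

    # Assumption: the semantic-only cutoff sits below the full-pipeline cutoff.
    structure = cfg.get("structure")
    if structure is not None:
        if structure["semantic_threshold"] >= structure["full_threshold"]:
            problems.append("structure.semantic_threshold must be below full_threshold")

    # Assumption: chunk sizes satisfy min <= target <= max.
    chunks = cfg.get("chunks")
    if chunks is not None:
        if not chunks["min_tokens"] <= chunks["target_tokens"] <= chunks["max_tokens"]:
            problems.append("chunks must satisfy min_tokens <= target_tokens <= max_tokens")

    return problems


# Values copied from the default configuration shown above.
default_cfg = {
    "dcs": {"alpha": 0.4, "beta": 0.35, "gamma": 0.25, "threshold": 0.5},
    "structure": {"enabled": True, "full_threshold": 0.7, "semantic_threshold": 0.4},
    "chunks": {"min_tokens": 100, "max_tokens": 2000, "target_tokens": 500},
}
print(validate_config(default_cfg))  # prints []
```

Returning a list of messages instead of raising on the first failure lets a single pass report every problem in the file.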

Python Configuration

Using COSMICConfig

from cosmic import COSMICChunker, COSMICConfig
from cosmic.core.config import DCSConfig, ChunkConstraints, LLMConfig

config = COSMICConfig(
    # Discourse Coherence Score weights
    dcs=DCSConfig(
        alpha=0.5,     # Topical coherence weight
        beta=0.3,      # Coreference density weight
        gamma=0.2,     # Discourse marker weight
        threshold=0.5, # Boundary detection threshold
    ),

    # Chunk size constraints
    chunks=ChunkConstraints(
        min_tokens=50,
        max_tokens=1024,
        target_tokens=512,
    ),

    # LLM verification settings
    llm=LLMConfig(
        enabled=True,
        provider="ollama",
        base_url="http://localhost:11434/v1",
        model_name="gemma3:latest",
    ),
)

chunker = COSMICChunker(config=config)

Loading from YAML

from pathlib import Path
from cosmic import COSMICChunker, COSMICConfig

# Load from file
config = COSMICConfig.from_yaml(Path("configs/custom.yaml"))
chunker = COSMICChunker(config=config)

# Or pass path directly
chunker = COSMICChunker(config_path=Path("configs/custom.yaml"))

DCS Weight Tuning

The Discourse Coherence Score (DCS) is a weighted sum of three signals:

DCS = α × topical_coherence + β × coreference_density + γ × discourse_signal

Default Weights

Weight	Value	Signal
α (alpha)	0.4	Topical coherence from embedding similarity
β (beta)	0.35	Coreference density measuring entity continuity
γ (gamma)	0.25	Discourse markers indicating transitions
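
To make the formula concrete, the sketch below evaluates it with the default weights and compares the result against the default threshold of 0.5. `dcs_score` is a hypothetical stand-in for illustration, and the three input signal values are made up; COSMIC computes the real signals internally.

```python
def dcs_score(topical, coref, discourse, alpha=0.4, beta=0.35, gamma=0.25):
    """Weighted sum of the three coherence signals (default weights from the table)."""
    return alpha * topical + beta * coref + gamma * discourse


# Made-up signal values for a single sentence gap.
score = dcs_score(topical=0.6, coref=0.4, discourse=0.2)

# 0.4 * 0.6 + 0.35 * 0.4 + 0.25 * 0.2 = 0.24 + 0.14 + 0.05, about 0.43
is_boundary_candidate = score < 0.5  # below the threshold = boundary candidate

print(round(score, 2), is_boundary_candidate)  # prints 0.43 True
```

With these inputs the gap falls below the threshold, so it would be treated as a boundary candidate. Raising alpha makes the score track topical similarity more closely, at the cost of the other two signals.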

Tuning Guidelines

For technical documents:

dcs=DCSConfig(alpha=0.5, beta=0.25, gamma=0.25)

For narrative text:

dcs=DCSConfig(alpha=0.3, beta=0.4, gamma=0.3)

For structured documents (lists, tables):

dcs=DCSConfig(alpha=0.35, beta=0.3, gamma=0.35)
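
When experimenting beyond these three profiles, it is easy to break the sum-to-1.0 constraint. One option is to think in relative proportions and rescale them. `normalize_weights` below is a hypothetical helper sketched for that purpose, not part of COSMIC.

```python
def normalize_weights(alpha, beta, gamma):
    """Rescale three non-negative weights so that they sum to 1.0."""
    if min(alpha, beta, gamma) < 0:
        raise ValueError("weights must be non-negative")
    total = alpha + beta + gamma
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {"alpha": alpha / total, "beta": beta / total, "gamma": gamma / total}


# A 2:1:1 ratio reproduces the technical-document profile above.
weights = normalize_weights(2, 1, 1)
print(weights)  # prints {'alpha': 0.5, 'beta': 0.25, 'gamma': 0.25}
```

Since DCSConfig accepts alpha, beta, and gamma as keyword arguments, the resulting dict can be unpacked as `DCSConfig(**weights)`.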

Chunk Size Configuration

chunks=ChunkConstraints(
    min_tokens=100,   # Minimum chunk size
    max_tokens=2000,  # Maximum chunk size
    target_tokens=500 # Target size for splitting
)

Size Guidelines

Use Case	min_tokens	max_tokens	target_tokens
RAG (short context)	100	512	350
RAG (long context)	200	2000	1000
Summarization	500	4000	2000
Classification	50	500	200
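
The table can be kept in code as a small lookup so that a use case name selects the matching size constraints. `SIZE_PRESETS`, its keys, and `chunk_kwargs` are hypothetical names chosen for this sketch; the numbers are copied from the table.

```python
# Size presets copied from the table above; the keys are illustrative names.
SIZE_PRESETS = {
    "rag_short": {"min_tokens": 100, "max_tokens": 512, "target_tokens": 350},
    "rag_long": {"min_tokens": 200, "max_tokens": 2000, "target_tokens": 1000},
    "summarization": {"min_tokens": 500, "max_tokens": 4000, "target_tokens": 2000},
    "classification": {"min_tokens": 50, "max_tokens": 500, "target_tokens": 200},
}


def chunk_kwargs(use_case):
    """Return a copy of the size preset for a use case, or raise ValueError."""
    try:
        return dict(SIZE_PRESETS[use_case])
    except KeyError:
        known = ", ".join(sorted(SIZE_PRESETS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


print(chunk_kwargs("rag_short"))
# prints {'min_tokens': 100, 'max_tokens': 512, 'target_tokens': 350}
```

The returned dict uses the same keyword names as ChunkConstraints, so it can be unpacked as `ChunkConstraints(**chunk_kwargs("rag_short"))`.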

Disabling Features

Via CLI

cosmic chunk doc.txt --no-llm        # Skip LLM verification
cosmic chunk doc.txt --no-reference  # Skip reference linking

Via Configuration

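from cosmic import COSMICConfig
# Assumption: ReferenceConfig lives alongside LLMConfig in cosmic.core.config
from cosmic.core.config import LLMConfig, ReferenceConfig

config = COSMICConfig(
    llm=LLMConfig(enabled=False),
    reference=ReferenceConfig(enabled=False),
)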

Via YAML

llm:
  enabled: false

reference:
  enabled: false