
Configuration

COSMIC can be configured via environment variables, YAML files, or Python code.

Environment Variables

Create a .env file in your project root:

# LLM Provider: "openai", "ollama", or "auto"
COSMIC_LLM_PROVIDER=ollama

# OpenAI-compatible API settings
COSMIC_LLM_URL=http://localhost:8000/v1
COSMIC_LLM_MODEL=default
COSMIC_LLM_API_KEY=your-api-key

# Ollama settings
OLLAMA_HOST=http://localhost:11434
COSMIC_OLLAMA_MODEL=auto

# Embedding device: cuda, cpu, or mps
COSMIC_EMBEDDING_DEVICE=cuda
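
To illustrate how these variables might be consumed, the following sketch reads them with the standard library. It assumes the .env file has already been loaded into the process environment (for example by a dotenv loader). The function name `read_cosmic_env` and the fallback values are hypothetical and are not COSMIC's own loader.

```python
import os


def read_cosmic_env(env=None):
    """Collect the COSMIC-related variables from a mapping (os.environ by default).

    The fallback values below are illustrative assumptions, not documented defaults.
    """
    env = os.environ if env is None else env
    return {
        "provider": env.get("COSMIC_LLM_PROVIDER", "auto"),
        "llm_url": env.get("COSMIC_LLM_URL", "http://localhost:8000/v1"),
        "llm_model": env.get("COSMIC_LLM_MODEL", "default"),
        "api_key": env.get("COSMIC_LLM_API_KEY"),  # None when unset
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434"),
        "ollama_model": env.get("COSMIC_OLLAMA_MODEL", "auto"),
        "embedding_device": env.get("COSMIC_EMBEDDING_DEVICE", "cpu"),
    }


# Passing a plain dict makes the behaviour easy to check without touching os.environ.
settings = read_cosmic_env({"COSMIC_LLM_PROVIDER": "ollama"})
print(settings["provider"], settings["embedding_device"])
```

Accepting a mapping argument keeps the helper testable, since the real environment is only consulted when nothing is passed in.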

YAML Configuration

The default configuration lives in configs/default.yaml:

# Discourse Coherence Score weights (must sum to 1.0)
dcs:
  alpha: 0.4     # Topical coherence (embedding similarity)
  beta: 0.35     # Coreference density (entity continuity)
  gamma: 0.25    # Discourse markers (transition signals)
  threshold: 0.5 # Below this = boundary candidate

# Structure analysis thresholds
structure:
  enabled: true
  full_threshold: 0.7    # Above = full pipeline
  semantic_threshold: 0.4 # Above = semantic-only, below = fallback

# Embedding model settings
embedding:
  model_name: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  batch_size: 64
  cache_size: 10000
  device: cuda
  normalize: true

# LLM verification (Stage 5)
llm:
  enabled: true
  provider: ollama  # openai, ollama, auto
  base_url: http://localhost:11434/v1
  model_name: gemma3:latest
  confidence_threshold: 0.8  # Verify only boundaries whose confidence is below this
  batch_size: 10
  timeout_seconds: 30
  max_context_tokens: 512

# Reference linking (Stage 6)
reference:
  enabled: true
  use_coreference: true
  coreference_model: en_core_web_trf

# Boundary fusion weights
fusion:
  structural_weight: 0.6
  semantic_weight: 0.4
  acceptance_threshold: 0.5

# Chunk size constraints
chunks:
  min_tokens: 100
  max_tokens: 2000
  target_tokens: 500

# Runtime settings
workers: 4
gpu_memory_fraction: 0.5
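
The structure and fusion thresholds above can be read as simple decision rules. The sketch below is one plausible interpretation, not COSMIC's implementation: the function names are hypothetical, "above" is treated as a strict comparison, and the fused score is assumed to be a weighted sum of a structural and a semantic score that both lie in [0, 1].

```python
def select_pipeline(structure_score, full_threshold=0.7, semantic_threshold=0.4):
    """Pick a processing mode from a structure score, following the YAML comments."""
    if structure_score > full_threshold:
        return "full"
    if structure_score > semantic_threshold:
        return "semantic-only"
    return "fallback"


def accept_boundary(structural, semantic,
                    structural_weight=0.6, semantic_weight=0.4,
                    acceptance_threshold=0.5):
    """Assumed fusion rule: weighted sum compared against the acceptance threshold."""
    fused = structural_weight * structural + semantic_weight * semantic
    return fused >= acceptance_threshold


print(select_pipeline(0.8), select_pipeline(0.5), select_pipeline(0.2))
print(accept_boundary(1.0, 0.0), accept_boundary(0.0, 1.0))
```

Under these assumptions, a boundary supported only by structure is accepted with the default weights, while one supported only by semantics falls below the acceptance threshold.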

Python Configuration

Using COSMICConfig

from cosmic import COSMICChunker, COSMICConfig
from cosmic.core.config import DCSConfig, ChunkConstraints, LLMConfig

config = COSMICConfig(
    # Discourse Coherence Score weights
    dcs=DCSConfig(
        alpha=0.5,     # Topical coherence weight
        beta=0.3,      # Coreference density weight
        gamma=0.2,     # Discourse marker weight
        threshold=0.5, # Boundary detection threshold
    ),

    # Chunk size constraints
    chunks=ChunkConstraints(
        min_tokens=50,
        max_tokens=1024,
        target_tokens=512,
    ),

    # LLM verification settings
    llm=LLMConfig(
        enabled=True,
        provider="ollama",
        base_url="http://localhost:11434/v1",
        model_name="gemma3:latest",
    ),
)

chunker = COSMICChunker(config=config)

Loading from YAML

from pathlib import Path
from cosmic import COSMICChunker, COSMICConfig

# Load from file
config = COSMICConfig.from_yaml(Path("configs/custom.yaml"))
chunker = COSMICChunker(config=config)

# Or pass the path directly
chunker = COSMICChunker(config_path=Path("configs/custom.yaml"))

DCS Weight Tuning

The Discourse Coherence Score (DCS) is a weighted sum of three signals:

DCS = α × topical_coherence + β × coreference_density + γ × discourse_signal

Default Weights

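| Weight    | Value | Signal                                           |
|-----------|-------|--------------------------------------------------|
| α (alpha) | 0.4   | Topical coherence from embedding similarity      |
| β (beta)  | 0.35  | Coreference density measuring entity continuity  |
| γ (gamma) | 0.25  | Discourse markers indicating transitions         |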
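
As a worked example of the formula with the default weights, the sketch below computes a score for one candidate position and compares it with the default threshold of 0.5. The function name `dcs_score` and the three input values are made up for illustration, and the signals are assumed to be normalized to [0, 1].

```python
def dcs_score(topical_coherence, coreference_density, discourse_signal,
              alpha=0.4, beta=0.35, gamma=0.25):
    """Weighted sum of the three DCS signals, using the default weights."""
    return (alpha * topical_coherence
            + beta * coreference_density
            + gamma * discourse_signal)


# Illustrative inputs: 0.4*0.6 + 0.35*0.2 + 0.25*0.4 = 0.24 + 0.07 + 0.10
score = dcs_score(0.6, 0.2, 0.4)

# A score below the threshold marks the position as a boundary candidate.
is_boundary_candidate = score < 0.5
print(round(score, 2), is_boundary_candidate)
```

Raising alpha makes the score track embedding similarity more closely, while raising beta or gamma gives entity continuity or transition markers more influence on where boundaries fall.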

Tuning Guidelines

For technical documents:

dcs=DCSConfig(alpha=0.5, beta=0.25, gamma=0.25)

For narrative text:

dcs=DCSConfig(alpha=0.3, beta=0.4, gamma=0.3)

For structured documents (lists, tables):

dcs=DCSConfig(alpha=0.35, beta=0.3, gamma=0.35)
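
Because the weights are expected to sum to 1.0, it can help to check a custom set before building a configuration. The sketch below validates the three presets above. The helper name `weights_sum_to_one` and the `PRESETS` dictionary are hypothetical, and whether COSMIC itself enforces the constraint is not stated here.

```python
import math

# The three presets from the tuning guidelines, as (alpha, beta, gamma).
PRESETS = {
    "technical": (0.5, 0.25, 0.25),
    "narrative": (0.3, 0.4, 0.3),
    "structured": (0.35, 0.3, 0.35),
}


def weights_sum_to_one(alpha, beta, gamma, tol=1e-9):
    """Return True when the weights sum to 1.0 within a floating-point tolerance."""
    return math.isclose(alpha + beta + gamma, 1.0, abs_tol=tol)


for name, weights in PRESETS.items():
    print(name, weights_sum_to_one(*weights))
```

A tolerance is used rather than `==` because sums of decimal fractions are not always exact in binary floating point.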

Chunk Size Configuration

chunks=ChunkConstraints(
    min_tokens=100,   # Minimum chunk size
    max_tokens=2000,  # Maximum chunk size
    target_tokens=500 # Target size for splitting
)

Size Guidelines

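| Use Case            | min_tokens | max_tokens | target_tokens |
|---------------------|------------|------------|---------------|
| RAG (short context) | 100        | 512        | 350           |
| RAG (long context)  | 200        | 2000       | 1000          |
| Summarization       | 500        | 4000       | 2000          |
| Classification      | 50         | 500        | 200           |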
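
The guidelines above can be kept as a small lookup table and sanity-checked before use. In this sketch the dictionary keys and the `check_constraints` helper are hypothetical, and the ordering rule min <= target <= max is an assumption about what a sensible configuration looks like, not a documented COSMIC requirement.

```python
# Size guidelines from the table above, keyed by illustrative use-case names.
SIZE_GUIDELINES = {
    "rag_short": {"min_tokens": 100, "max_tokens": 512, "target_tokens": 350},
    "rag_long": {"min_tokens": 200, "max_tokens": 2000, "target_tokens": 1000},
    "summarization": {"min_tokens": 500, "max_tokens": 4000, "target_tokens": 2000},
    "classification": {"min_tokens": 50, "max_tokens": 500, "target_tokens": 200},
}


def check_constraints(min_tokens, max_tokens, target_tokens):
    """Return the values as keyword arguments, or raise if they are inconsistent."""
    if not (0 < min_tokens <= target_tokens <= max_tokens):
        raise ValueError(
            f"expected 0 < min <= target <= max, got "
            f"min={min_tokens}, target={target_tokens}, max={max_tokens}"
        )
    return {
        "min_tokens": min_tokens,
        "max_tokens": max_tokens,
        "target_tokens": target_tokens,
    }


kwargs = check_constraints(**SIZE_GUIDELINES["rag_short"])
print(kwargs)
```

The returned dictionary uses the same keyword names as ChunkConstraints, so it could be passed along as `ChunkConstraints(**kwargs)`.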

Disabling Features

Via CLI

cosmic chunk doc.txt --no-llm        # Skip LLM verification
cosmic chunk doc.txt --no-reference  # Skip reference linking

Via Configuration

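from cosmic import COSMICConfig
# ReferenceConfig is assumed to live alongside LLMConfig in cosmic.core.config
from cosmic.core.config import LLMConfig, ReferenceConfig

config = COSMICConfig(
    llm=LLMConfig(enabled=False),
    reference=ReferenceConfig(enabled=False),
)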

Via YAML

llm:
  enabled: false

reference:
  enabled: false