Document

Input document representation for COSMIC.

Class Definition

class Document:
    """
    Represents an input document with sentence segmentation.

    Attributes:
        id: Unique document identifier.
        text: Full document text.
        sentences: List of Sentence objects.
        metadata: Optional metadata dictionary.
    """

Class Methods

from_text

@classmethod
def from_text(
    cls,
    text: str,
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from plain text.

    Args:
        text: Document text content.
        doc_id: Optional unique identifier. Auto-generated if None.
        metadata: Optional metadata dictionary.

    Returns:
        Document instance with sentence segmentation.

    Example:
        doc = Document.from_text(
            "First sentence. Second sentence.",
            doc_id="my-doc",
            metadata={"source": "user"}
        )
    """

from_file

@classmethod
def from_file(
    cls,
    path: Path,
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from a file.

    Args:
        path: Path to the file.
        doc_id: Optional identifier. Defaults to filename.
        metadata: Optional metadata.

    Returns:
        Document instance.

    Supported formats:
        - .txt: Plain text
        - .md: Markdown
    """

from_pages

@classmethod
def from_pages(
    cls,
    pages: list[str],
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from multiple pages.

    Args:
        pages: List of page contents.
        doc_id: Optional identifier.
        metadata: Optional metadata.

    Returns:
        Document with page boundaries preserved.
    """

Properties

| Property | Type | Description |
|---|---|---|
| id | str | Document identifier |
| text | str | Full text content |
| sentences | list[Sentence] | Segmented sentences |
| num_sentences | int | Number of sentences |
| num_pages | int | Number of pages |
| total_chars | int | Total character count |

Sentence Class

@dataclass
class Sentence:
    """
    Represents a single sentence.

    Attributes:
        index: Position in document (0-indexed).
        text: Sentence text.
        char_start: Starting character offset.
        char_end: Ending character offset.
        page: Page number (1-indexed).
    """

Usage Examples

Basic Creation

from cosmic import Document

# From text
doc = Document.from_text("First sentence. Second sentence. Third sentence.")

print(f"ID: {doc.id}")
print(f"Sentences: {doc.num_sentences}")
print(f"Characters: {doc.total_chars}")

With Metadata

doc = Document.from_text(
    text="Your document content here.",
    doc_id="doc-123",
    metadata={
        "source": "api-upload",
        "author": "user@example.com",
        "created": "2024-01-01",
    }
)

print(doc.metadata["source"])  # "api-upload"

From File

from pathlib import Path

doc = Document.from_file(Path("document.txt"))  # plain text
doc = Document.from_file(Path("README.md"))     # Markdown
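
Since from_file documents only .txt and .md as supported formats, it can help to filter paths before loading. The sketch below is a hypothetical pre-check, not part of the COSMIC API: the helper name is_supported is invented for illustration, and the case-insensitive suffix match is an assumption. How from_file itself reacts to other extensions is not documented here.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".txt", ".md"}  # taken from the from_file docstring


def is_supported(path: Path) -> bool:
    """Hypothetical pre-check: True if the extension is a documented format."""
    # Lowercasing the suffix is an assumption; the docs do not mention case.
    return path.suffix.lower() in SUPPORTED_SUFFIXES


candidates = [Path("notes.txt"), Path("README.md"), Path("report.pdf")]
loadable = [p for p in candidates if is_supported(p)]
print([p.name for p in loadable])
```

Each path in loadable could then be passed to Document.from_file.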

Multi-Page Document

pages = [
    "Page 1 content. First paragraph.",
    "Page 2 content. Second paragraph.",
    "Page 3 content. Third paragraph.",
]

doc = Document.from_pages(pages, doc_id="multi-page")
print(f"Pages: {doc.num_pages}")
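
How from_pages records page boundaries is not described here. One plausible approach, sketched below purely as an assumption, is to join the pages with a separator, remember where each page starts, and look up the page for any character offset with a binary search. The names join_pages, page_for_offset, and PAGE_SEPARATOR are hypothetical.

```python
from bisect import bisect_right

PAGE_SEPARATOR = "\n\n"  # hypothetical; the real joiner is not documented


def join_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Join pages into one string and record the offset where each page starts."""
    starts: list[int] = []
    offset = 0
    for page in pages:
        starts.append(offset)
        offset += len(page) + len(PAGE_SEPARATOR)
    return PAGE_SEPARATOR.join(pages), starts


def page_for_offset(starts: list[int], char_offset: int) -> int:
    """Return the 1-indexed page that contains the given character offset."""
    return bisect_right(starts, char_offset)


pages = [
    "Page 1 content. First paragraph.",
    "Page 2 content. Second paragraph.",
]
text, starts = join_pages(pages)

offset = text.index("Second paragraph.")
print(page_for_offset(starts, offset))
```

Under this scheme a sentence's page would follow from its char_start, which matches the 1-indexed page field on Sentence.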

Iterating Sentences

doc = Document.from_text("First. Second. Third.")

for sentence in doc.sentences:
    print(f"[{sentence.index}] {sentence.text}")
    print(f"  Position: {sentence.char_start}-{sentence.char_end}")
    print(f"  Page: {sentence.page}")
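
The character offsets let you map a sentence back to its span in the source text. The sketch below uses a stand-in dataclass with the documented Sentence fields rather than the cosmic class itself, and it assumes char_end is exclusive (the Python slice convention). The reference above does not state which convention applies, so verify this against the library before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceSketch:
    """Stand-in mirroring the documented Sentence fields (not the cosmic class)."""
    index: int
    text: str
    char_start: int
    char_end: int  # assumed exclusive, like a Python slice bound
    page: int = 1


def span_of(document_text: str, sentence: SentenceSketch) -> str:
    """Return the slice of the document text that the offsets point at."""
    return document_text[sentence.char_start:sentence.char_end]


full_text = "First sentence. Second sentence."
second = SentenceSketch(index=1, text="Second sentence.", char_start=16, char_end=32)

recovered = span_of(full_text, second)
print(recovered == second.text)
```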

Accessing Sentences by Index

doc = Document.from_text("One. Two. Three.")

first = doc.sentences[0]
last = doc.sentences[-1]

# Slice: everything except the first and last sentence
middle = doc.sentences[1:-1]

Sentence Segmentation

COSMIC uses spaCy for sentence segmentation. The segmenter:

  • Does not split on abbreviations (Dr., Mr., etc.)
  • Recognizes sentence boundaries
  • Preserves whitespace information
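
To make the abbreviation and offset behavior concrete, here is a deliberately simplified regex-based splitter. It is a stand-in for illustration only, not spaCy's segmenter and not COSMIC's implementation: the abbreviation list, the split_sentences name, and the exclusive char_end convention are all assumptions.

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "etc."}  # illustrative list

# Candidate boundary: sentence-ending punctuation followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text: str) -> list[tuple[str, int, int]]:
    """Return (sentence, char_start, char_end) triples; char_end is exclusive."""
    spans = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        words = text[start:end].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # an abbreviation such as "Dr." does not end the sentence
        spans.append((text[start:end], start, end))
        # Skip inter-sentence whitespace so the next sentence starts on a character.
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((text[start:], start, len(text)))  # trailing text without punctuation
    return spans


sample = "Dr. Smith arrived late. The meeting had started."
for sentence, char_start, char_end in split_sentences(sample):
    print(f"{char_start}-{char_end}: {sentence}")
```

Because offsets index into the original string, the whitespace between two sentences can still be recovered by slicing from one sentence's end to the next sentence's start. A sketch this simple will misjudge a sentence that actually ends in an abbreviation, which is one reason to rely on a real segmenter.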

Thread Safety

Document objects are immutable after creation and safe to share across threads.
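
The pattern this enables can be sketched with a frozen dataclass standing in for Document. How COSMIC enforces immutability internally is not documented here, so FrozenDoc and longest_sentence_length are hypothetical names used only to show read-only sharing across worker threads.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenDoc:
    """Stand-in for an immutable document (not the cosmic class)."""
    id: str
    sentences: tuple[str, ...]


def longest_sentence_length(doc: FrozenDoc) -> int:
    """Read-only work that is safe to run concurrently on a shared object."""
    return max(len(sentence) for sentence in doc.sentences)


shared = FrozenDoc(id="doc-123", sentences=("First.", "Second one.", "Third."))

# Every worker reads the same instance; no locks are needed because nothing mutates it.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(longest_sentence_length, [shared] * 8))

print(results)

try:
    shared.id = "other"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("mutation rejected")
```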

Serialization

# To dictionary
data = {
    "id": doc.id,
    "text": doc.text,
    "metadata": doc.metadata,
}

# Recreate
doc = Document.from_text(
    data["text"],
    doc_id=data["id"],
    metadata=data["metadata"],
)
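
To persist the dictionary, it can be passed through the standard json module. The sketch below uses a plain dict in place of the fields read from a Document, so it runs without COSMIC installed; the field values are illustrative.

```python
import json

# Plain dict standing in for the id, text, and metadata read from a Document.
data = {
    "id": "doc-123",
    "text": "First sentence. Second sentence.",
    "metadata": {"source": "api-upload"},
}

payload = json.dumps(data, ensure_ascii=False)  # str, ready to write to disk or send
restored = json.loads(payload)

print(restored == data)
```

This only works when every metadata value is JSON-serializable. Sentences are not stored, so they are re-segmented when the document is recreated. The dictionary also carries no page information; judging by the documented signatures, a multi-page document would likely need its pages stored separately and rebuilt with from_pages.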