Document¶
Input document representation for COSMIC.
Class Definition¶
class Document:
    """
    Represents an input document with sentence segmentation.
    Attributes:
        id: Unique document identifier.
        text: Full document text.
        sentences: List of Sentence objects.
        metadata: Optional metadata dictionary.
    """
Class Methods¶
from_text¶
@classmethod
def from_text(
    cls,
    text: str,
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from plain text.
    Args:
        text: Document text content.
        doc_id: Optional unique identifier. Auto-generated if None.
        metadata: Optional metadata dictionary.
    Returns:
        Document instance with sentence segmentation.
    Example:
        doc = Document.from_text(
            "First sentence. Second sentence.",
            doc_id="my-doc",
            metadata={"source": "user"}
        )
    """
from_file¶
@classmethod
def from_file(
    cls,
    path: Path,
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from a file.
    Args:
        path: Path to the file.
        doc_id: Optional identifier. Defaults to filename.
        metadata: Optional metadata.
    Returns:
        Document instance.
    Supported formats:
        - .txt: Plain text
        - .md: Markdown
    """
from_pages¶
@classmethod
def from_pages(
    cls,
    pages: list[str],
    doc_id: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> Document:
    """
    Create a Document from multiple pages.
    Args:
        pages: List of page contents.
        doc_id: Optional identifier.
        metadata: Optional metadata.
    Returns:
        Document with page boundaries preserved.
    """
Properties¶
| Property | Type | Description |
|---|---|---|
| `id` | `str` | Document identifier |
| `text` | `str` | Full text content |
| `sentences` | `list[Sentence]` | Segmented sentences |
| `num_sentences` | `int` | Number of sentences |
| `num_pages` | `int` | Number of pages |
| `total_chars` | `int` | Total character count |
Sentence Class¶
@dataclass
class Sentence:
    """
    Represents a single sentence.
    Attributes:
        index: Position in document (0-indexed).
        text: Sentence text.
        char_start: Starting character offset.
        char_end: Ending character offset.
        page: Page number (1-indexed).
    """
Usage Examples¶
Basic Creation¶
from cosmic import Document
# From text
doc = Document.from_text("First sentence. Second sentence. Third sentence.")
print(f"ID: {doc.id}")
print(f"Sentences: {doc.num_sentences}")
print(f"Characters: {doc.total_chars}")
With Metadata¶
doc = Document.from_text(
    text="Your document content here.",
    doc_id="doc-123",
    metadata={
        "source": "api-upload",
        "author": "user@example.com",
        "created": "2024-01-01",
    }
)
print(doc.metadata["source"]) # "api-upload"
From File¶
from pathlib import Path
doc = Document.from_file(Path("document.txt"))
doc = Document.from_file(Path("README.md"))
Multi-Page Document¶
pages = [
    "Page 1 content. First paragraph.",
    "Page 2 content. Second paragraph.",
    "Page 3 content. Third paragraph.",
]
doc = Document.from_pages(pages, doc_id="multi-page")
print(f"Pages: {doc.num_pages}")
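How `from_pages` joins pages and tracks their boundaries is not documented here. As a rough sketch of one way page boundaries could be preserved, the code below joins the pages with a separator and records the character offset at which each page starts, so that any offset can be mapped back to a 1-indexed page number. `PAGE_SEPARATOR`, `join_pages`, and `page_of` are hypothetical names used for illustration only and are not part of COSMIC.

```python
from bisect import bisect_right

# Assumption: the separator COSMIC uses between pages is not documented.
PAGE_SEPARATOR = "\n"


def join_pages(pages: list[str]) -> tuple[str, list[int]]:
    """Return the joined text and the character offset where each page starts."""
    starts = []
    offset = 0
    for page in pages:
        starts.append(offset)
        # The next page begins after this page's text plus the separator.
        offset += len(page) + len(PAGE_SEPARATOR)
    return PAGE_SEPARATOR.join(pages), starts


def page_of(char_offset: int, page_starts: list[int]) -> int:
    """Map a character offset in the joined text to a 1-indexed page number."""
    # bisect_right counts how many pages start at or before the offset.
    return bisect_right(page_starts, char_offset)


pages = [
    "Page 1 content. First paragraph.",
    "Page 2 content. Second paragraph.",
    "Page 3 content. Third paragraph.",
]
text, starts = join_pages(pages)
print(page_of(text.index("Second"), starts))
```

With a mapping like this, a sentence's page can be derived from its starting character offset alone.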
Iterating Sentences¶
doc = Document.from_text("First. Second. Third.")
for sentence in doc:
    print(f"[{sentence.index}] {sentence.text}")
    print(f" Position: {sentence.char_start}-{sentence.char_end}")
    print(f" Page: {sentence.page}")
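The offsets printed above can be read as slice bounds into the document text. The sketch below illustrates that relationship without depending on COSMIC: it uses a stand-in dataclass (`SketchSentence`) and a naive regex splitter (`naive_split`), both hypothetical, in place of COSMIC's spaCy-based segmenter. It also assumes that `char_end` is exclusive, which this reference does not state.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSentence:
    """Stand-in mirroring the documented Sentence fields; not COSMIC's class."""
    index: int
    text: str
    char_start: int
    char_end: int  # assumption: exclusive, like a Python slice bound
    page: int


# A sentence starts at a non-space character and runs lazily to the first
# ., ! or ? that is followed by whitespace or the end of the text.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", flags=re.S)


def naive_split(text: str, page: int = 1) -> list[SketchSentence]:
    """Split text naively and record each sentence's character span."""
    return [
        SketchSentence(
            index=i,
            text=match.group(),
            char_start=match.start(),
            char_end=match.end(),
            page=page,
        )
        for i, match in enumerate(_SENTENCE.finditer(text))
    ]


text = "First. Second. Third."
for s in naive_split(text):
    # Slicing the source text with the offsets recovers the sentence text.
    assert text[s.char_start:s.char_end] == s.text
    print(s.index, s.char_start, s.char_end, s.text)
```

A regex splitter like this mishandles abbreviations such as "Dr."; it is only meant to show how the index and offset fields fit together.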
Accessing Sentences by Index¶
doc = Document.from_text("One. Two. Three.")
first = doc.sentences[0]
last = doc.sentences[-1]
# Slice: sentences at indices 1 and 2 (the end index is exclusive)
second_and_third = doc.sentences[1:3]
Sentence Segmentation¶
COSMIC uses spaCy for sentence segmentation. The segmenter:
- Handles abbreviations (Dr., Mr., etc.)
- Recognizes sentence boundaries
- Preserves whitespace information
Thread Safety¶
Document objects are immutable after creation and safe to share across threads.
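As an illustration of why immutability makes sharing safe, the sketch below passes one frozen object to several worker threads that only read from it. `FrozenDoc` and `lengths_in_parallel` are hypothetical stand-ins written for this example; how COSMIC enforces immutability internally is not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenDoc:
    """Hypothetical immutable stand-in for a document; not COSMIC's class."""
    id: str
    sentences: tuple[str, ...]


def lengths_in_parallel(doc: FrozenDoc, workers: int = 4) -> list[int]:
    """Compute each sentence's length on a thread pool, sharing one object."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Every worker reads the same doc; no locks are needed because
        # nothing can modify it. pool.map preserves input order.
        return list(
            pool.map(lambda i: len(doc.sentences[i]), range(len(doc.sentences)))
        )


doc = FrozenDoc(id="shared", sentences=("First.", "Second.", "Third."))
print(lengths_in_parallel(doc))

try:
    doc.id = "changed"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("doc is read-only")
```

Because no thread can write to the shared object, readers never observe a partially updated state.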