Documents - Ragnerock Docs

Documents are the foundation of Ragnerock. This guide explains how document processing works and how to manage your document library.

Document Types

Ragnerock supports the following document formats:

PDF — Research reports, filings, presentations
Word — .docx and .doc files
Excel — Spreadsheets with structured data
PowerPoint — Slide decks
Text — Plain text and markdown files
HTML — Web pages and articles
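
Before uploading, it can help to check a file's extension against the formats listed above. The following is a minimal sketch, not part of the Ragnerock SDK: `supported_format` is a hypothetical helper, and the extension mapping (for example `.xlsx`, `.pptx`, `.md`, `.htm`) is an assumption based on common conventions rather than a list confirmed by this guide.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to the format families listed above.
# The exact set of extensions Ragnerock accepts is not stated in this guide.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".txt": "Text",
    ".md": "Text",
    ".html": "HTML",
    ".htm": "HTML",
}


def supported_format(file_path: str) -> Optional[str]:
    """Return the format family for a path, or None if the extension is not recognized."""
    return SUPPORTED_EXTENSIONS.get(Path(file_path).suffix.lower())
```

A caller could skip or log any file for which `supported_format` returns `None` instead of sending it for processing.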

Document Processing Pipeline

When you upload a document, Ragnerock processes it through several stages:

1. Ingestion

The document is securely uploaded and stored in your project’s storage. Ragnerock validates the file format and prepares it for processing.

2. Text Extraction

Text and structure are extracted from the document. For PDFs, this includes:

Full text content
Page boundaries
Tables and figures
Headers and footers

3. Chunking

The document is split into semantic chunks optimized for:

Vector embedding generation
Context window limits
Search relevance
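
To make the idea of chunking concrete, here is a simplified sketch that packs whole paragraphs into chunks under a character limit. This is an illustration only: Ragnerock's actual chunking strategy is not described in this guide, and `chunk_paragraphs` and `max_chars` are hypothetical names.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept as its own oversized
    chunk rather than being split mid-sentence.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact trades uniform chunk size for more coherent chunks, which tends to matter for search relevance.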

4. Embedding Generation

Each chunk is converted into a vector embedding, enabling semantic search across your document library.
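
Semantic search over embeddings usually comes down to comparing vectors. The sketch below ranks chunks by cosine similarity to a query vector, using tiny hand-written vectors in place of real embeddings. It is a conceptual illustration under that assumption; the embedding model and similarity measure Ragnerock uses are not specified in this guide, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query: Sequence[float], chunk_vectors: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (chunk_id, score) pairs sorted from most to least similar."""
    scored = [
        (chunk_id, cosine_similarity(query, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy two-dimensional "embeddings" for illustration only.
toy_vectors = {
    "revenue": [1.0, 0.0],
    "weather": [0.0, 1.0],
    "earnings": [0.8, 0.6],
}
ranking = rank_chunks([1.0, 0.0], toy_vectors)
```

With these toy vectors, `ranking` orders the chunks as `revenue`, `earnings`, `weather`.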

Working with Documents

Uploading Documents

Use the SDK to upload documents to your project:

from ragnerock import create_engine, Session, Document

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # Create a document from a local file
    doc = Document(
        file_path="/path/to/report.pdf",
        name="Q4 Earnings Report"
    )
    session.create(doc)

    print(f"Uploaded: {doc.name}")
    print(f"ID: {doc.id}")
    print(f"Status: {doc.status}")

Document Properties

Every document has associated metadata:

with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")

    print(doc.name)        # User-provided name
    print(doc.id)          # Unique identifier (UUID)
    print(doc.status)      # Processing state (a DocumentStatus value)
    print(doc.filesize)    # Size in bytes
    print(doc.created_at)  # Upload timestamp
    print(doc.file_type)   # File type code

Document Status

The status property indicates the processing state:

from ragnerock import DocumentStatus

# Reuses engine, Session, and Document from the upload example
with Session(engine) as session:
    doc = session.get(Document, name="New Upload")

    # Check status
    if doc.status == DocumentStatus.PROCESSING:
        print("Document is still being processed")
    elif doc.status == DocumentStatus.SUCCESS:
        print("Document is ready")
    elif doc.status == DocumentStatus.ERROR:
        print("Processing failed")

Available statuses:

DocumentStatus.PENDING — Queued for processing
DocumentStatus.PROCESSING — Currently being processed
DocumentStatus.SUCCESS — Processing completed successfully
DocumentStatus.ERROR — Processing failed
DocumentStatus.UNKNOWN — Status unavailable
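
Because processing is asynchronous, a common pattern is to poll until a document reaches a terminal status. The helper below is a generic sketch that does not call the Ragnerock SDK: `wait_until_finished` is a hypothetical name, `fetch_status` is a function you supply that returns the current status, and the default status values are placeholder strings. How to re-read a document's status from the SDK is not covered in this guide, so that part is left to the caller.

```python
import time
from typing import Callable, Hashable, Iterable


def wait_until_finished(
    fetch_status: Callable[[], Hashable],
    terminal_statuses: Iterable[Hashable] = ("success", "error"),
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Hashable:
    """Poll fetch_status() until it returns a terminal status, then return it.

    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    The `sleep` parameter exists so tests can avoid real delays.
    """
    terminal = set(terminal_statuses)
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

With the SDK, the terminal statuses would presumably be `DocumentStatus.SUCCESS` and `DocumentStatus.ERROR`, passed in through `terminal_statuses`.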

Listing Documents

Browse all documents in your project:

with Session(engine) as session:
    # List all documents
    for doc in session.list(Document):
        print(f"{doc.name} - {doc.status}")

    # Get all documents at once
    all_docs = session.list(Document).all()
    print(f"Total documents: {len(all_docs)}")

    # Get just the first document
    first_doc = session.list(Document).first()
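
Once you have a list of documents, a quick summary by status makes failed or unfinished uploads easy to spot. This is a small sketch using stand-in objects rather than real SDK documents; `count_by_status` is a hypothetical helper that only assumes each item has a `status` attribute.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_by_status(documents: Iterable[Any]) -> Counter:
    """Count documents per status; each item only needs a .status attribute."""
    return Counter(doc.status for doc in documents)


# Stand-in objects for illustration; in practice, pass the documents you listed.
sample_docs = [
    SimpleNamespace(name="Q4 Earnings Report", status="success"),
    SimpleNamespace(name="Old Report", status="success"),
    SimpleNamespace(name="New Upload", status="error"),
]
summary = count_by_status(sample_docs)
for status, count in summary.items():
    print(f"{status}: {count}")
```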

Accessing Document Content

Documents contain pages and chunks that you can access:

from ragnerock import Page, Chunk

with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")

    # List pages
    for page in doc.list(Page):
        print(f"Page {page.page_number}:")
        print(page.content[:200])
        print("---")

    # List chunks
    for chunk in doc.list(Chunk):
        print(f"Chunk: {chunk.content[:100]}...")

Deleting Documents

Remove documents when no longer needed:

with Session(engine) as session:
    doc = session.get(Document, name="Old Report")
    if doc:
        session.delete(doc)
        print("Document deleted")

Best Practices

Use descriptive names — Make documents easy to identify with clear, consistent naming
Check processing status — Wait for documents to finish processing before running workflows
Monitor for errors — Handle failed document processing appropriately
Organize by project — Use separate projects for different use cases or data sets

Next Steps

Learn about Annotations to extract structured data
Explore the SDK Resources documentation for full API details
See the Quick Start for a complete workflow example