Documents

Understanding document processing and management in Ragnerock.

Documents are the foundation of Ragnerock. This guide explains how document processing works and how to manage your document library.

Document Types

Ragnerock supports a wide variety of document formats:

  • PDF — Research reports, filings, presentations
  • Word — .docx and .doc files
  • Excel — Spreadsheets with structured data
  • PowerPoint — Slide decks
  • Text — Plain text and markdown files
  • HTML — Web pages and articles

Document Processing Pipeline

When you upload a document, Ragnerock processes it through several stages:

1. Ingestion

The document is securely uploaded and stored in your project’s storage. Ragnerock validates the file format and prepares it for processing.
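
The format check described above can be approximated locally before uploading. The following is a minimal sketch, not Ragnerock's own validation: the `SUPPORTED_EXTENSIONS` mapping and the `detect_format` helper are hypothetical names, and the exact set of extensions the service accepts is an assumption based on the format list in this guide.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the format families listed in
# this guide. The exact extensions Ragnerock accepts are an assumption.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".txt": "Text",
    ".md": "Text",
    ".html": "HTML",
}


def detect_format(file_path: str) -> str:
    """Return the format family for a path, or raise ValueError if unsupported."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return SUPPORTED_EXTENSIONS[suffix]
```

Running a check like this before an upload surfaces unsupported files early, without waiting for server-side validation to reject them.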

2. Text Extraction

Text and structure are extracted from the document. For PDFs, this includes:

  • Full text content
  • Page boundaries
  • Tables and figures
  • Headers and footers

3. Chunking

The document is split into semantic chunks optimized for:

  • Vector embedding generation
  • Context window limits
  • Search relevance
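
To make the trade-off between chunk size and context limits concrete, here is a minimal sketch of a much simpler technique: fixed-size word windows with overlap. It is not Ragnerock's semantic chunker, whose algorithm this guide does not describe. The `chunk_text` function and its `max_words` and `overlap` parameters are hypothetical.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help search relevance at the cost of some duplicated text.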

4. Embedding Generation

Each chunk is converted into a vector embedding, enabling semantic search across your document library.
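
Semantic search over embeddings usually comes down to comparing vectors. The sketch below ranks chunks by cosine similarity to a query vector. It assumes cosine similarity as the metric, which this guide does not confirm, and uses tiny hand-written vectors in place of real embeddings. The names `cosine_similarity` and `rank_chunks` are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]]) -> list[str]:
    """Return chunk keys ordered from most to least similar to the query."""
    return sorted(
        chunk_vecs,
        key=lambda key: cosine_similarity(query_vec, chunk_vecs[key]),
        reverse=True,
    )


# Toy three-dimensional "embeddings"; real embeddings have far more dimensions.
toy_chunks = {
    "revenue": [0.9, 0.1, 0.0],
    "headcount": [0.1, 0.9, 0.0],
    "weather": [0.0, 0.0, 1.0],
}
ranking = rank_chunks([1.0, 0.2, 0.0], toy_chunks)
```

Here `ranking` puts `"revenue"` first because its vector points in nearly the same direction as the query, regardless of vector length.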

Working with Documents

Uploading Documents

Use the SDK to upload documents to your project:

from ragnerock import create_engine, Session, Document

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # Create a document from a local file
    doc = Document(
        file_path="/path/to/report.pdf",
        name="Q4 Earnings Report"
    )
    session.create(doc)

    print(f"Uploaded: {doc.name}")
    print(f"ID: {doc.id}")
    print(f"Status: {doc.status}")

Document Properties

Every document has associated metadata:

with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")

    print(doc.name)        # User-provided name
    print(doc.id)          # Unique identifier (UUID)
    print(doc.status)      # processing, success, error, etc.
    print(doc.filesize)    # Size in bytes
    print(doc.created_at)  # Upload timestamp
    print(doc.file_type)   # File type code

Document Status

The status property indicates the processing state:

from ragnerock import DocumentStatus

with Session(engine) as session:
    doc = session.get(Document, name="New Upload")

    # Check status
    if doc.status == DocumentStatus.PROCESSING:
        print("Document is still being processed")
    elif doc.status == DocumentStatus.SUCCESS:
        print("Document is ready")
    elif doc.status == DocumentStatus.ERROR:
        print("Processing failed")

Available statuses:

  • DocumentStatus.PENDING — Queued for processing
  • DocumentStatus.PROCESSING — Currently being processed
  • DocumentStatus.SUCCESS — Processing completed successfully
  • DocumentStatus.ERROR — Processing failed
  • DocumentStatus.UNKNOWN — Status unavailable
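
Because processing is asynchronous, a common pattern is to poll until a document reaches a terminal status. The helper below is a sketch under stated assumptions: `wait_until_processed` is a hypothetical function, not part of the SDK, and statuses are written as plain strings so the example runs on its own. With the SDK, `fetch_status` would be a function that re-fetches the document and returns its status, and the comparison would use `DocumentStatus` members.

```python
import time
from typing import Callable

# Terminal states as plain strings for illustration; with the SDK you would
# compare against DocumentStatus.SUCCESS and DocumentStatus.ERROR instead.
TERMINAL_STATUSES = {"success", "error"}


def wait_until_processed(
    fetch_status: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 2.0,
) -> str:
    """Poll fetch_status() until it returns a terminal status or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        time.sleep(interval)
```

Passing the status lookup in as a callable keeps the polling logic independent of how the document is fetched, and the timeout prevents a stuck document from blocking a workflow indefinitely.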

Listing Documents

Browse all documents in your project:

with Session(engine) as session:
    # List all documents
    for doc in session.list(Document):
        print(f"{doc.name} - {doc.status}")

    # Get all documents at once
    all_docs = session.list(Document).all()
    print(f"Total documents: {len(all_docs)}")

    # Get just the first document
    first_doc = session.list(Document).first()

Accessing Document Content

Documents contain pages and chunks that you can access:

from ragnerock import Page, Chunk

with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")

    # List pages
    for page in doc.list(Page):
        print(f"Page {page.page_number}:")
        print(page.content[:200])
        print("---")

    # List chunks
    for chunk in doc.list(Chunk):
        print(f"Chunk: {chunk.content[:100]}...")

Deleting Documents

Remove documents when they are no longer needed:

with Session(engine) as session:
    doc = session.get(Document, name="Old Report")
    if doc:
        session.delete(doc)
        print("Document deleted")

Best Practices

  1. Use descriptive names — Make documents easy to identify with clear, consistent naming
  2. Check processing status — Wait for documents to finish processing before running workflows
  3. Monitor for errors — Handle failed document processing appropriately
  4. Organize by project — Use separate projects for different use cases or data sets
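
One way to act on the status and error-monitoring practices above is to group documents by status and review the failures. This is a rough sketch: `group_names_by_status` is a hypothetical helper, and the sample objects are stand-ins that only mimic the `name` and `status` attributes shown in this guide. In practice the input would be the result of listing documents in a session.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_names_by_status(documents):
    """Map each status to the names of the documents that currently have it."""
    groups = defaultdict(list)
    for doc in documents:
        groups[str(doc.status)].append(doc.name)
    return dict(groups)


# Stand-in objects with illustrative names and string statuses (assumptions).
sample = [
    SimpleNamespace(name="Q4 Earnings Report", status="success"),
    SimpleNamespace(name="New Upload", status="processing"),
    SimpleNamespace(name="Scanned Filing", status="error"),
]
report = group_names_by_status(sample)
print(report.get("error", []))
```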

Next Steps