Documents
Understanding document processing and management in Ragnerock.
Documents are the foundation of Ragnerock. This guide explains how document processing works and how to manage your document library.
Document Types
Ragnerock supports a wide variety of document formats:
- PDF — Research reports, filings, presentations
- Word — .docx and .doc files
- Excel — Spreadsheets with structured data
- PowerPoint — Slide decks
- Text — Plain text and markdown files
- HTML — Web pages and articles
Document Processing Pipeline
When you upload a document, Ragnerock processes it through several stages:
1. Ingestion
The document is securely uploaded and stored in your project’s storage. Ragnerock validates the file format and prepares it for processing.
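As a rough illustration of format validation, a client-side pre-check could filter files by extension before upload. This is a minimal sketch, not Ragnerock's actual validation: the helper name `is_probably_supported` and the extension set are assumptions derived from the formats named in this guide.

```python
from pathlib import Path

# Assumed extension list, derived from the formats named in this guide.
# The exact set Ragnerock accepts is not stated here.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls",
    ".pptx", ".ppt", ".txt", ".md", ".html", ".htm",
}


def is_probably_supported(file_path: str) -> bool:
    """Return True if the file extension is in the assumed supported set."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this only catches obvious mismatches early; the server-side validation remains the source of truth.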
2. Text Extraction
Text and structure are extracted from the document. For PDFs, this includes:
- Full text content
- Page boundaries
- Tables and figures
- Headers and footers
3. Chunking
The document is split into semantic chunks optimized for:
- Vector embedding generation
- Context window limits
- Search relevance
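To make the idea of chunking concrete, here is a minimal sketch that packs paragraphs into size-limited chunks. It is a simple stand-in technique, not Ragnerock's chunking algorithm, which this guide does not describe; the function name `chunk_paragraphs` and the `max_chars` parameter are hypothetical.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Try to append the paragraph to the chunk being built.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The chunk is full: close it and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps related sentences together, which is one plausible way to balance context window limits against search relevance.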
4. Embedding Generation
Each chunk is converted into a vector embedding, enabling semantic search across your document library.
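The sketch below shows how vector embeddings can support semantic search in principle: chunks are ranked by cosine similarity to a query vector. The toy vectors and the names `cosine_similarity` and `rank_chunks` are illustrative assumptions; this guide does not say which embedding model or similarity measure Ragnerock uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (chunk_id, score) pairs sorted from most to least similar."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In a real system the vectors would come from an embedding model and have hundreds of dimensions, but the ranking step works the same way.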
Working with Documents
Uploading Documents
Use the SDK to upload documents to your project:
from ragnerock import create_engine, Session, Document

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # Create a document from a local file
    doc = Document(
        file_path="/path/to/report.pdf",
        name="Q4 Earnings Report"
    )
    session.create(doc)
    print(f"Uploaded: {doc.name}")
    print(f"ID: {doc.id}")
    print(f"Status: {doc.status}")
Document Properties
Every document has associated metadata:
with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")
    print(doc.name)        # User-provided name
    print(doc.id)          # Unique identifier (UUID)
    print(doc.status)      # processing, success, error, etc.
    print(doc.filesize)    # Size in bytes
    print(doc.created_at)  # Upload timestamp
    print(doc.file_type)   # File type code
Document Status
The status property indicates the processing state:
from ragnerock import DocumentStatus

with Session(engine) as session:
    doc = session.get(Document, name="New Upload")
    # Check status
    if doc.status == DocumentStatus.PROCESSING:
        print("Document is still being processed")
    elif doc.status == DocumentStatus.SUCCESS:
        print("Document is ready")
    elif doc.status == DocumentStatus.ERROR:
        print("Processing failed")
Available statuses:
- DocumentStatus.PENDING: Queued for processing
- DocumentStatus.PROCESSING: Currently being processed
- DocumentStatus.SUCCESS: Processing completed successfully
- DocumentStatus.ERROR: Processing failed
- DocumentStatus.UNKNOWN: Status unavailable
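Because processing is asynchronous, code that depends on a finished document usually needs to wait for a final status. The following is a minimal polling sketch under stated assumptions: `wait_until_processed` is a hypothetical helper, not part of the SDK, and it takes a caller-supplied `fetch_status` callable so that it does not depend on any particular SDK call.

```python
import time
from typing import Callable

# Status names mirror the statuses listed above, written as lowercase strings.
# How DocumentStatus members compare to plain strings is an assumption.
IN_PROGRESS = {"pending", "processing"}


def wait_until_processed(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns a status outside IN_PROGRESS.

    Raises TimeoutError if the status is still in progress at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status not in IN_PROGRESS:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(poll_interval_s)
```

With the SDK calls shown in this guide, `fetch_status` might wrap a fresh `session.get(Document, name=...)` and return its `status`; whether that value compares equal to these lowercase strings should be verified against the SDK.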
Listing Documents
Browse all documents in your project:
with Session(engine) as session:
    # List all documents
    for doc in session.list(Document):
        print(f"{doc.name} - {doc.status}")

    # Get all documents at once
    all_docs = session.list(Document).all()
    print(f"Total documents: {len(all_docs)}")

    # Get just the first document
    first_doc = session.list(Document).first()
Accessing Document Content
Documents contain pages and chunks that you can access:
from ragnerock import Page, Chunk

with Session(engine) as session:
    doc = session.get(Document, name="Q4 Earnings Report")

    # List pages
    for page in doc.list(Page):
        print(f"Page {page.page_number}:")
        print(page.content[:200])
        print("---")

    # List chunks
    for chunk in doc.list(Chunk):
        print(f"Chunk: {chunk.content[:100]}...")
Deleting Documents
Remove documents when no longer needed:
with Session(engine) as session:
    doc = session.get(Document, name="Old Report")
    if doc:
        session.delete(doc)
        print("Document deleted")
Best Practices
- Use descriptive names — Make documents easy to identify with clear, consistent naming
- Check processing status — Wait for documents to finish processing before running workflows
- Monitor for errors — Handle failed document processing appropriately
- Organize by project — Use separate projects for different use cases or data sets
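One way to monitor for errors across a library is to group documents by status and review anything that failed. This sketch uses a hypothetical `DocInfo` stand-in with only `name` and `status` fields rather than real SDK objects, and `group_by_status` is likewise an illustrative helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DocInfo:
    """Hypothetical stand-in for an SDK document: only name and status."""
    name: str
    status: str


def group_by_status(docs: list[DocInfo]) -> dict[str, list[str]]:
    """Map each status to the names of the documents in that state."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        groups[doc.status].append(doc.name)
    return dict(groups)
```

The names under the `"error"` key would then be candidates for re-upload or investigation before running workflows.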
Next Steps
- Learn about Annotations to extract structured data
- Explore the SDK Resources documentation for full API details
- See the Quick Start for a complete workflow example