Annotations
Extract structured data from unstructured documents using AI-powered operators.
Annotations are Ragnerock’s core mechanism for extracting structured data from unstructured documents. This guide covers how annotation schemas work and how to access your extracted data.
What Are Annotations?
Annotations transform unstructured document content into structured, queryable data. The system uses:
- Operators — Define the shape of data to extract (JSON Schema) and AI instructions
- Workflows — Orchestrate multiple operators into processing pipelines
- Annotations — The actual extracted data attached to documents
The result is structured data you can query with SQL, export to your data warehouse, or use in your quantitative models.
Operators (Annotation Schemas)
Operators define what data to extract. Each operator has:
- Name — Identifier for the extraction task
- JSON Schema — The structure of the output data
- Generation Prompt — Instructions for the AI model
- Scope — Granularity level (document, page, paragraph, or sentence)
Viewing Operators
Operators are created in the Ragnerock web application. Use the SDK to list available operators:
```python
from ragnerock import create_engine, Session, Operator

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    for op in session.list(Operator):
        print(f"{op.name}: {op.description}")
        print(f"  Scope: {op.scope.name}")
        print(f"  Schema: {op.jsonschema}")
```
Example Schema
Here’s what an operator’s JSON schema might look like for sentiment analysis:
```json
{
  "type": "object",
  "properties": {
    "overall_sentiment": {
      "type": "string",
      "enum": ["very_negative", "negative", "neutral", "positive", "very_positive"],
      "description": "Overall sentiment of the content"
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1,
      "description": "Confidence score for the sentiment"
    },
    "key_themes": {
      "type": "array",
      "items": {"type": "string"},
      "maxItems": 5,
      "description": "Main themes discussed"
    }
  },
  "required": ["overall_sentiment", "key_themes"]
}
```
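For reference, an annotation produced by this operator would carry data shaped like the following (illustrative values, not real output):

```json
{
  "overall_sentiment": "positive",
  "confidence": 0.87,
  "key_themes": ["revenue growth", "supply chain", "product launches"]
}
```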
Workflows
Workflows chain multiple operators together into processing pipelines. They’re created in the web UI and can be triggered on documents via the SDK.
Running Workflows
```python
from ragnerock import create_engine, Session, Document, Workflow

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # Get a workflow
    workflow = session.get(Workflow, name="Financial Analysis")

    # Get documents to process
    docs = session.list(Document).limit(10).all()

    # Run the workflow
    job = session.run(workflow, documents=docs)

    # Wait for completion
    job.wait(timeout=600)
    print(f"Job status: {job.status}")
```
Listing Workflows
```python
with Session(engine) as session:
    for wf in session.list(Workflow):
        print(f"{wf.name}: {wf.description}")
        print(f"  Active: {wf.is_active}")
        print(f"  Auto-run on upload: {wf.auto_run_on_upload}")
        print(f"  Operators: {[op.operator_name for op in wf.operators]}")
```
Accessing Annotations
After workflows run, annotations are attached to documents. Access them through the SDK:
List Annotations for a Document
```python
from ragnerock import Annotation

with Session(engine) as session:
    doc = session.get(Document, name="Apple 10-K 2024")

    # List all annotations
    for ann in doc.list(Annotation):
        print(f"Schema: {ann.schema_id}")
        print(f"Data: {ann.data}")
        print("---")

    # Filter by operator name
    for ann in doc.list(Annotation, operator="sentiment_analysis"):
        print(f"Sentiment: {ann.data.get('overall_sentiment')}")
        print(f"Themes: {ann.data.get('key_themes')}")
```
Annotation Data
The data field contains the extracted values matching the operator’s JSON schema:
```python
ann = doc.list(Annotation, operator="financial_metrics").first()
if ann:
    print(ann.data)
    # {'revenue': 394328, 'net_income': 96995, 'gross_margin': 0.438}

    # Access individual fields
    print(f"Revenue: ${ann.data.get('revenue')}M")
```
Provenance and Source Tracking
Annotations include references back to their source. Use lazy-loaded properties to trace provenance:
```python
with Session(engine) as session:
    ann = session.get(Annotation, id="...")

    # Get the source document
    source_doc = ann.document
    print(f"Source: {source_doc.name}")

    # Get the source chunk (if chunk-level annotation)
    if ann.chunk:
        print(f"Chunk text: {ann.chunk.content[:200]}")

    # Get the source page (if page-level annotation)
    if ann.page:
        print(f"Page number: {ann.page.page_number}")

    # Get the operator that created this annotation
    operator = ann.operator
    print(f"Created by: {operator.name}")
```
Querying Annotations
Annotations are stored in a queryable format, with each operator's output exposed as its own table. Use SQL to analyze extractions across your entire document library:
```python
with Session(engine) as session:
    result = session.query("""
        SELECT
            document_name,
            overall_sentiment,
            confidence,
            key_themes
        FROM sentiment_analysis
        WHERE overall_sentiment IN ('positive', 'very_positive')
        ORDER BY confidence DESC
        LIMIT 20
    """)

    # As a list of dictionaries
    for row in result.to_dict():
        print(f"{row['document_name']}: {row['overall_sentiment']}")

    # As a pandas DataFrame
    df = result.to_pandas()
    print(df.describe())
```
Aggregations
```python
result = session.query("""
    SELECT
        overall_sentiment,
        COUNT(*) AS count,
        AVG(confidence) AS avg_confidence
    FROM sentiment_analysis
    GROUP BY overall_sentiment
    ORDER BY count DESC
""")
```
Filtering by Date
```python
result = session.query("""
    SELECT document_name, revenue, net_income
    FROM financial_metrics
    WHERE created_at > '2024-01-01'
    ORDER BY revenue DESC
""")
```
Best Practices
- Start specific, then generalize — Begin with a focused schema, expand as needed
- Use enums for categorical data — Enables consistent filtering and aggregation
- Test on sample documents — Validate operator design before batch processing (see the sketch after this list)
- Query incrementally — Start with simple queries, add complexity as needed
- Track provenance — Use annotation relationships for audit trails
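To put the sample-testing advice into practice, here is a minimal sketch that runs a workflow over a handful of documents and prints the resulting annotations for review. It uses only APIs shown earlier in this guide; the workflow name is illustrative:

```python
with Session(engine) as session:
    workflow = session.get(Workflow, name="Financial Analysis")

    # Validate against a small sample before committing to a batch run
    sample = session.list(Document).limit(3).all()
    job = session.run(workflow, documents=sample)
    job.wait(timeout=600)

    # Eyeball the extracted data for each sampled document
    for doc in sample:
        for ann in doc.list(Annotation):
            print(f"{doc.name}: {ann.data}")
```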
Next Steps
- Learn about Documents for upload and processing
- Explore SQL Queries for advanced data analysis
- Use the Research Agent for interactive exploration