Annotations

Extract structured data from unstructured documents using AI-powered operators.

Annotations are Ragnerock’s core mechanism for extracting structured data from unstructured documents. This guide covers how annotation schemas work and how to access your extracted data.

What Are Annotations?

Annotations transform unstructured document content into structured, queryable data. The system uses:

  1. Operators — Define the shape of the data to extract (as a JSON Schema) and the AI instructions for producing it
  2. Workflows — Orchestrate multiple operators into processing pipelines
  3. Annotations — The actual extracted data attached to documents

The result is structured data you can query with SQL, export to your data warehouse, or use in your quantitative models.
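
In outline, the end-to-end flow with the SDK looks like this (each step is covered in detail below; the workflow and table names are the examples used throughout this guide):

from ragnerock import create_engine, Session, Document, Workflow

engine = create_engine("ragnerock://user%40example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # 1. Pick a workflow defined in the web UI
    workflow = session.get(Workflow, name="Financial Analysis")

    # 2. Run it over a batch of documents and wait for extraction to finish
    docs = session.list(Document).limit(10).all()
    session.run(workflow, documents=docs).wait(timeout=600)

    # 3. Query the resulting annotations with SQL
    df = session.query("SELECT * FROM financial_metrics").to_pandas()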

Operators (Annotation Schemas)

Operators define what data to extract. Each operator has:

  • Name — Identifier for the extraction task
  • JSON Schema — The structure of the output data
  • Generation Prompt — Instructions for the AI model
  • Scope — Granularity level (document, page, paragraph, or sentence)

Viewing Operators

Operators are created in the Ragnerock web application. Use the SDK to list available operators:

from ragnerock import create_engine, Session, Operator

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    for op in session.list(Operator):
        print(f"{op.name}: {op.description}")
        print(f"  Scope: {op.scope.name}")
        print(f"  Schema: {op.jsonschema}")

Example Schema

Here’s what an operator’s JSON schema might look like for sentiment analysis:

{
  "type": "object",
  "properties": {
    "overall_sentiment": {
      "type": "string",
      "enum": ["very_negative", "negative", "neutral", "positive", "very_positive"],
      "description": "Overall sentiment of the content"
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1,
      "description": "Confidence score for the sentiment"
    },
    "key_themes": {
      "type": "array",
      "items": {"type": "string"},
      "maxItems": 5,
      "description": "Main themes discussed"
    }
  },
  "required": ["overall_sentiment", "key_themes"]
}
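
Extracted data is constrained to this schema, so you can also validate annotation payloads client-side. A minimal sketch using the third-party jsonschema package (not part of the Ragnerock SDK), against the sentiment schema above:

from jsonschema import ValidationError, validate

sentiment_schema = {
    "type": "object",
    "properties": {
        "overall_sentiment": {
            "type": "string",
            "enum": ["very_negative", "negative", "neutral", "positive", "very_positive"],
        },
        "key_themes": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
    },
    "required": ["overall_sentiment", "key_themes"],
}

payload = {"overall_sentiment": "positive", "key_themes": ["growth", "margins"]}

try:
    validate(instance=payload, schema=sentiment_schema)  # raises on any mismatch
    print("Payload matches the operator schema")
except ValidationError as err:
    print(f"Schema violation: {err.message}")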

Workflows

Workflows chain multiple operators together into processing pipelines. They’re created in the web UI and can be triggered on documents via the SDK.

Running Workflows

from ragnerock import create_engine, Session, Document, Workflow

engine = create_engine("ragnerock://user@example.com:pass@api.ragnerock.com/my_project")

with Session(engine) as session:
    # Get a workflow
    workflow = session.get(Workflow, name="Financial Analysis")

    # Get documents to process
    docs = session.list(Document).limit(10).all()

    # Run the workflow
    job = session.run(workflow, documents=docs)

    # Wait for completion
    job.wait(timeout=600)

    print(f"Job status: {job.status}")

Listing Workflows

with Session(engine) as session:
    for wf in session.list(Workflow):
        print(f"{wf.name}: {wf.description}")
        print(f"  Active: {wf.is_active}")
        print(f"  Auto-run on upload: {wf.auto_run_on_upload}")
        print(f"  Operators: {[op.operator_name for op in wf.operators]}")

Accessing Annotations

After workflows run, annotations are attached to documents. Access them through the SDK:

List Annotations for a Document

from ragnerock import Annotation

with Session(engine) as session:
    doc = session.get(Document, name="Apple 10-K 2024")

    # List all annotations
    for ann in doc.list(Annotation):
        print(f"Schema: {ann.schema_id}")
        print(f"Data: {ann.data}")
        print("---")

    # Filter by operator name
    for ann in doc.list(Annotation, operator="sentiment_analysis"):
        print(f"Sentiment: {ann.data.get('overall_sentiment')}")
        print(f"Themes: {ann.data.get('key_themes')}")

Annotation Data

The data field contains the extracted values matching the operator’s JSON schema:

ann = doc.list(Annotation, operator="financial_metrics").first()

if ann:
    print(ann.data)
    # {'revenue': 394328, 'net_income': 96995, 'gross_margin': 0.438}

    # Access individual fields
    print(f"Revenue: ${ann.data.get('revenue')}M")

Provenance and Source Tracking

Annotations include references back to their source. Use lazy-loaded properties to trace provenance:

with Session(engine) as session:
    ann = session.get(Annotation, id="...")

    # Get the source document
    source_doc = ann.document
    print(f"Source: {source_doc.name}")

    # Get the source chunk (if chunk-level annotation)
    if ann.chunk:
        print(f"Chunk text: {ann.chunk.content[:200]}")

    # Get the source page (if page-level annotation)
    if ann.page:
        print(f"Page number: {ann.page.page_number}")

    # Get the operator that created this annotation
    operator = ann.operator
    print(f"Created by: {operator.name}")

Querying Annotations

Annotations are stored in queryable tables, one per operator (named after the operator, as in sentiment_analysis below). Use SQL to analyze across your document library:

with Session(engine) as session:
    result = session.query("""
        SELECT
            document_name,
            overall_sentiment,
            confidence,
            key_themes
        FROM sentiment_analysis
        WHERE overall_sentiment IN ('positive', 'very_positive')
        ORDER BY confidence DESC
        LIMIT 20
    """)

    # As a list of dictionaries
    for row in result.to_dict():
        print(f"{row['document_name']}: {row['overall_sentiment']}")

    # As a pandas DataFrame
    df = result.to_pandas()
    print(df.describe())

Aggregations

result = session.query("""
    SELECT
        overall_sentiment,
        COUNT(*) as count,
        AVG(confidence) as avg_confidence
    FROM sentiment_analysis
    GROUP BY overall_sentiment
    ORDER BY count DESC
""")

Filtering by Date

result = session.query("""
    SELECT document_name, revenue, net_income
    FROM financial_metrics
    WHERE created_at > '2024-01-01'
    ORDER BY revenue DESC
""")

Best Practices

  1. Start specific, then generalize — Begin with a focused schema, expand as needed
  2. Use enums for categorical data — Enables consistent filtering and aggregation (see the schema sketch after this list)
  3. Test on sample documents — Validate operator design before batch processing
  4. Query incrementally — Start with simple queries, add complexity as needed
  5. Track provenance — Use annotation relationships for audit trails
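
As an example of practice 2, a categorical field defined as an enum can be filtered and grouped reliably at query time. A minimal schema fragment, written as a Python dict, for a hypothetical risk_level field:

# Hypothetical "risk_level" field: the enum keeps values consistent across
# documents, so GROUP BY risk_level works without a normalization step.
risk_schema = {
    "type": "object",
    "properties": {
        "risk_level": {
            "type": "string",
            "enum": ["low", "medium", "high"],
            "description": "Overall risk classification",
        }
    },
    "required": ["risk_level"],
}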

Next Steps