Document extraction with a VLM

Turn PDFs and images into governed graph rows using your AnythingGraph entity schema or an installed playbook. Fetch the extraction contract over REST or MCP, run your vision model, then ingest structured JSON.

  1. Define schema
  2. Fetch extraction spec
  3. VLM extract
  4. Ingest
  5. Query via graph / MCP

Overview

AnythingGraph does not run OCR or vision models inside the platform. You provide the document; your VLM (GPT-4o, Claude, Gemini, or an on-prem model) returns JSON that matches your entity structure. The graph stores validated rows, applies playbook mappings, and exposes the same data to the dashboard and MCP agents.

Each entity field supports AI metadata: description, example, and extraction_hint (where to find the value on a page). Playbooks package record types, relationships, connector mappings, and ingest instructions.
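
As a rough illustration of how this metadata can feed a prompt, the sketch below models one field and renders it as a single instruction line. The key names description, example, and extraction_hint come from the text above; the FieldSpec class and the render_field_line helper are hypothetical and not part of AnythingGraph, and the description value is made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    """One entity field plus its AI metadata (hypothetical container)."""
    field_name: str
    description: str = ""
    example: Optional[str] = None
    extraction_hint: Optional[str] = None


def render_field_line(field: FieldSpec) -> str:
    """Render a field as one prompt line, skipping empty metadata."""
    parts = [field.field_name]
    if field.description:
        parts.append(field.description)
    if field.example is not None:
        parts.append(f"example: {field.example}")
    if field.extraction_hint:
        parts.append(f"where to look: {field.extraction_hint}")
    return " | ".join(parts)


invoice_number = FieldSpec(
    field_name="invoice_number",
    description="Unique invoice identifier",  # illustrative wording
    example="INV-2024-0042",
    extraction_hint="top-right corner, labeled Invoice #",
)
print(render_field_line(invoice_number))
```

Fields with no hint simply produce a shorter line, so partially filled metadata still yields a usable prompt.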

Recommended playbook for invoices: invoice-records-structured. It assumes extraction happens outside AnythingGraph; the playbook validates and graphs the invoice facts.

Prerequisites

  1. Run the stack from the repository root: ./start-all.sh (data-layer :8182, dashboard API :5180, MCP HTTP :3333/mcp when started separately).
  2. Install a playbook in the dashboard (Playbooks → Install), or create custom record types under Entity structure with AI metadata filled in.
  3. Choose how your orchestrator fetches schema:
    • REST — dashboard API (GET /api/entities/:id, GET /api/playbooks/:id).
    • MCP — connect Cursor, Claude Desktop, or your agent runtime to AnythingGraph MCP; the host calls get_entity before vision.

Step 1 — Define or install schema

Option A — Playbook (catalog)

Install a catalog playbook such as invoice-records-structured, procure-to-pay, or document-registry. Playbook JSON lives under dashboard/backend/src/playbook/playbooks/ and defines entities, fields, and example payloads.

Option B — Custom entity structure

In the dashboard, open Entity structure → create or edit a record type → expand AI metadata on each field. Hints such as “top-right corner, labeled Invoice #” improve VLM accuracy.

Step 2 — Fetch the extraction specification

Before calling your VLM, assemble a machine-readable spec: entity names, field names, types, required flags, descriptions, examples, and extraction hints. Use either REST (direct HTTP) or MCP (agent host invokes tools).

Method | Best for | Primary calls
-------|----------|--------------
REST | Custom pipelines, server-side ETL, dashboard API | GET /api/entities/:id, GET /api/playbooks/:id
MCP | Agent + VLM in one session (no manual copy/paste) | list_entities, get_entity

Note: get_graph_query_context and anythinggraph://schema-summary are for graph Q&A, not rich extraction prompts. Always use get_entity (MCP) or GET /api/entities/:id (REST) for field-level metadata.

REST API

List entities, then fetch each definition (includes extraction_hint):

curl -s http://127.0.0.1:5180/api/entities
curl -s http://127.0.0.1:5180/api/entities/1

# Playbook catalog entry and install status (after install, field definitions come from the entities)
curl -s http://127.0.0.1:5180/api/playbooks/invoice-records-structured
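
A minimal Python sketch of the same REST flow, using only the standard library. The response shape is an assumption: the code expects each entity definition to be a JSON object with a name and a fields list whose items carry field_name, type, required, description, example, and extraction_hint. The helper names are hypothetical; adjust the key names to whatever your dashboard API actually returns.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5180"  # dashboard API from Prerequisites

# Keys copied into the spec; the exact names are assumed, not confirmed.
FIELD_KEYS = ("field_name", "type", "required",
              "description", "example", "extraction_hint")


def fetch_json(path: str, base_url: str = BASE_URL):
    """GET a dashboard API path and decode the JSON body."""
    with urlopen(f"{base_url}{path}", timeout=10) as resp:
        return json.load(resp)


def build_extraction_spec(entity_defs: list) -> dict:
    """Reduce full entity definitions to what the VLM prompt needs."""
    entities = []
    for entity in entity_defs:
        fields = [
            {key: field.get(key) for key in FIELD_KEYS}
            for field in entity.get("fields", [])
        ]
        entities.append({"name": entity.get("name"), "fields": fields})
    return {"entities": entities}


def fetch_extraction_spec(entity_ids: list) -> dict:
    """Fetch each entity by id and assemble the spec (needs a running stack)."""
    defs = [fetch_json(f"/api/entities/{eid}") for eid in entity_ids]
    return build_extraction_spec(defs)
```

Calling fetch_extraction_spec([1]) mirrors the curl request for /api/entities/1. Keeping build_extraction_spec separate from the HTTP call makes it easy to try against saved responses.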

MCP (agent host)

Connect Cursor, Claude Desktop, or your orchestrator to http://127.0.0.1:3333/mcp (start the server with cd mcp-service && npm run start:http). Tool sequence:

  1. health_check
  2. get_graph_query_context with optional playbook_id (orientation only; field-level metadata comes from get_entity)
  3. get_entity for each entity in scope

Example MCP host configuration:

{
  "mcpServers": {
    "anythinggraph": {
      "url": "http://127.0.0.1:3333/mcp"
    }
  }
}

Step 3 — Build and run the VLM prompt

Pass the generated extraction spec plus the document (image or PDF pages) to your vision model.

{
  "task": "Extract structured business records from the attached document.",
  "rules": [
    "Output a JSON object with a records array.",
    "Use field_name keys exactly as defined in the schema.",
    "Use null for missing optional fields; do not invent values.",
    "Dates: prefer ISO-8601 (YYYY-MM-DD) when possible."
  ],
  "schema": { "...": "from Step 2 — entities[].fields[]" }
}

Example model output for invoice-records-structured:

{
  "records": [
    {
      "invoice_number": "INV-2024-0042",
      "vendor_name": "Acme Supplies Ltd",
      "total_amount": 1250.0,
      "invoice_date": "2024-03-15"
    }
  ]
}
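
Before ingesting, it can help to check the model output against the rules in the prompt. The sketch below is one way to do that, assuming the spec has been flattened to a list of field dictionaries with field_name and required keys, and that date fields are named through a hypothetical date_fields argument. The validate_records helper is illustrative and not part of AnythingGraph.

```python
from datetime import date


def validate_records(output, fields, date_fields=()):
    """Return a list of problems; an empty list means the output passed."""
    if not isinstance(output, dict) or not isinstance(output.get("records"), list):
        return ["output must be a JSON object with a records array"]

    known = {f["field_name"] for f in fields}
    required = sorted(f["field_name"] for f in fields if f.get("required"))
    problems = []

    for i, record in enumerate(output["records"]):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not a JSON object")
            continue
        # Keys must match the schema's field_name values exactly.
        for key in sorted(set(record) - known):
            problems.append(f"record {i}: unknown key {key!r}")
        # A required field that is absent or null counts as missing.
        for key in required:
            if record.get(key) is None:
                problems.append(f"record {i}: missing required field {key!r}")
        # Null dates are allowed; anything else must parse as an ISO-8601 date.
        for key in date_fields:
            value = record.get(key)
            if value is None:
                continue
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"record {i}: {key!r} is not an ISO-8601 date")
    return problems
```

Because the prompt only prefers ISO-8601 dates, you may want to treat the date message as a warning rather than a hard failure.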

Step 4 — Ingest into AnythingGraph

Playbook webhook (recommended for bulk)

After installing the playbook, POST the records to its webhook:

POST http://127.0.0.1:5180/api/playbooks/invoice-records-structured/webhook
Content-Type: application/json

{
  "records": [
    {
      "invoice_number": "INV-2024-0042",
      "vendor_name": "Acme Supplies Ltd",
      "total_amount": 1250,
      "invoice_date": "2024-03-15"
    }
  ]
}

The connector validates required fields, applies field mappings, routes rows to entities, and sends failures to the landing zone.
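
The same request in Python, as a sketch using only the standard library. The URL and payload shape are taken from the example above; build_webhook_request and post_records are hypothetical helper names, and the response body format is not described here, so the code returns only the HTTP status and the raw text.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5180"  # dashboard API from Prerequisites


def build_webhook_request(playbook_id: str, records: list,
                          base_url: str = BASE_URL) -> Request:
    """Build the POST request without sending it."""
    body = json.dumps({"records": records}).encode("utf-8")
    return Request(
        f"{base_url}/api/playbooks/{playbook_id}/webhook",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_records(playbook_id: str, records: list):
    """Send the records; requires the dashboard API to be running.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    req = build_webhook_request(playbook_id, records)
    with urlopen(req, timeout=30) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Splitting request construction from sending lets you inspect or log the exact payload before anything reaches the connector.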

MCP row insert (single records)

create_entity_row(
  entity_id=<from list_entities>,
  values_json='{"invoice_number":"INV-2024-0042",...}'
)
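
In the call above, values_json is passed as a JSON string rather than a nested object. A small Python sketch of building the tool arguments follows; the entity id 1 is a placeholder, create_entity_row_args is a hypothetical helper, and how you actually invoke the tool depends on your MCP client, which is not shown.

```python
import json


def create_entity_row_args(entity_id: int, values: dict) -> dict:
    """Build arguments for the create_entity_row tool call."""
    return {
        "entity_id": entity_id,
        # Serialize the row once; the example above passes a string here.
        "values_json": json.dumps(values),
    }


args = create_entity_row_args(
    1,  # placeholder; use the id returned by list_entities
    {"invoice_number": "INV-2024-0042", "total_amount": 1250},
)
print(args["values_json"])
```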

SDK

client.dashboard.playbook_webhook("invoice-records-structured", {"records": [...]})

Step 5 — Validate and use the graph

  • Review failed rows in the dashboard Landing zone.
  • Sync RDF cache (Settings → Caching) and open Graph View.
  • Query via MCP ask_graph or SPARQL (sync_rdf_cache then run_sparql).

Architecture

┌─────────────┐     REST or MCP      ┌──────────────────┐
│ Orchestrator│ ──────────────────► │ AnythingGraph    │
│ (your app)  │ ◄── schema / spec   │ data-layer + MCP │
└──────┬──────┘                     └────────▲─────────┘
       │                                   │
       │ document + spec                   │ JSON records
       ▼                                   │
┌─────────────┐                     ┌──────┴───────┐
│ VLM         │                     │ Connector /  │
│ (vision API)│                     │ webhook      │
└─────────────┘                     └──────────────┘

The VLM never talks to LMDB directly. Your orchestrator owns the loop: fetch schema → extract → ingest → optional graph queries.
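
The loop can be sketched in Python with the three external steps injected as callables. Everything here is hypothetical scaffolding: run_extraction, the stub functions, and the "Invoice" entity name exist only to show the control flow, and the real fetch, VLM, and ingest calls are left to your orchestrator.

```python
from typing import Callable


def run_extraction(
    document: bytes,
    fetch_spec: Callable[[], dict],
    extract: Callable[[bytes, dict], dict],
    ingest: Callable[[list], object],
) -> int:
    """Run one document through the loop; return the number of records sent."""
    spec = fetch_spec()                  # REST or MCP
    output = extract(document, spec)     # your VLM call
    records = output.get("records", [])
    if records:
        ingest(records)                  # webhook, MCP row insert, or SDK
    return len(records)


# Stubs standing in for the real calls.
def fake_fetch_spec() -> dict:
    return {"entities": [{"name": "Invoice", "fields": []}]}


def fake_extract(document: bytes, spec: dict) -> dict:
    return {"records": [{"invoice_number": "INV-2024-0042"}]}


sent: list = []
count = run_extraction(b"%PDF-", fake_fetch_spec, fake_extract, sent.extend)
print(count, sent)
```

Injecting the steps keeps the VLM provider and the ingest path swappable without touching the loop itself.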

Troubleshooting

Issue | What to check
------|--------------
REST fetch fails (network) | Dashboard API is running on :5180; CORS is enabled on the dashboard; this page is opened via http://, not file://
Playbook entities not found | Install the playbook first; entity names in LMDB must match playbook entities[].name
MCP tools missing | Start the MCP server with cd mcp-service && npm run start:http; confirm the data-layer is running on :8182
Ingest validation errors | Check the Landing zone, required fields, and field mappings if VLM keys differ from the schema
Empty extraction hints | Playbook catalog JSON may omit hints; add them on entity fields in the dashboard, or use GET /api/entities/:id after editing AI metadata
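
For the port-related checks, a quick TCP probe can confirm whether anything is listening. This is a sketch only: the ports come from the Prerequisites section, is_port_open is a hypothetical helper, and an open port shows that some process accepted the connection, not that the service behind it is healthy.

```python
import socket

# Ports listed in Prerequisites.
SERVICES = {"data-layer": 8182, "dashboard API": 5180, "MCP HTTP": 3333}


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "listening" if is_port_open(port) else "not reachable"
        print(f"{name} (:{port}): {state}")
```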