Document analyzer with cross-doc recall

Ingest a folder of docs, then ask questions across all of them.

Pattern: "I have a folder of PDFs / markdown / call transcripts. I want an AI to answer questions across all of them, with citations back to the source."

mnueron's role: durable storage + fast keyword search + tagged metadata.

Architecture

[Ingest phase, runs once or whenever new docs arrive]
folder of files → for each file:
   → extract text
   → chunk into ~2KB pieces (mnueron's chunker handles this)
   → save each chunk with metadata.file, metadata.page, etc.

[Query phase, runs on every user question]
user question → mnueron bulk_search across all chunks
              → top-k hits with source metadata
              → assemble into LLM prompt
              → LLM responds with citations

Minimal Python

from pathlib import Path
from mnueron import Mnueron
from openai import OpenAI
import pypdf  # pip install pypdf

m = Mnueron()
llm = OpenAI()
NS = "company-docs"

# [Ingest] Run once. Idempotent on source_ref so re-runs do not duplicate.
def ingest(folder: Path):
    for path in folder.glob("**/*.pdf"):
        text = "\n".join(p.extract_text() for p in pypdf.PdfReader(path).pages)
        # mnueron auto-chunks long content into per-paragraph rows
        # tagged with the parent_ref. We just hand it the raw text.
        m.save(
            content=text,
            namespace=NS,
            source="pdf-ingest",
            source_ref=str(path),
            tags=["pdf", path.parent.name],
            metadata={
                "file":    path.name,
                "summarize": True,   # 1-2 sentence summary per chunk
            },
        )

# [Query] One memory recall per question.
def ask(question: str) -> str:
    hits = m.search(question, namespace=NS, k=8)
    cites = "\n".join(
        f"[{i+1}] {h.metadata.get('file','?')}: {h.content[:300]}"
        for i, h in enumerate(hits)
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role":"system", "content":
                "Answer using ONLY the citations below. If the answer is not "
                "in the citations, say so. Cite sources by [n] number.\n\n" + cites},
            {"role":"user", "content": question},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    ingest(Path("./docs"))
    print(ask("What did we decide about the migration timeline?"))

Why mnueron is the right fit here

  • BM25 search + future semantic — keyword precision today, vector reranking when v0.5 lands.
  • Metadata filtersm.search(q, metadata_filter={"file": "Q3-roadmap.pdf"}) to constrain by source.
  • Cross-tool recall — same data is searchable from claude.ai (via Chrome extension), Claude Desktop (via MCP), or your bot.
  • Chunking is automatic — pass a 300-page PDF, mnueron breaks it into searchable rows linked by parent_ref.

When NOT to use this pattern

  • If your "documents" are <100 lines each and you'll never have more than 20 of them, a simple in-memory dict + regex is enough. mnueron is overkill below a few hundred chunks.
  • If you need pixel-perfect citations (page + character offset), you'll need to track those in metadata yourself when ingesting.
Last updated 2026-05-17edit