Pattern: "I have a folder of PDFs / markdown / call transcripts. I want an AI to answer questions across all of them, with citations back to the source."
mnueron's role: durable storage + fast keyword search + tagged metadata.
Architecture
[Ingest phase, runs once or whenever new docs arrive]
folder of files → for each file:
→ extract text
→ chunk into ~2KB pieces (mnueron's chunker handles this)
→ save each chunk with metadata.file, metadata.page, etc.
[Query phase, runs on every user question]
user question → mnueron bulk_search across all chunks
→ top-k hits with source metadata
→ assemble into LLM prompt
→ LLM responds with citations
Minimal Python
from pathlib import Path
from mnueron import Mnueron
from openai import OpenAI
import pypdf # pip install pypdf
m = Mnueron()
llm = OpenAI()
NS = "company-docs"
# [Ingest] Run once. Idempotent on source_ref so re-runs do not duplicate.
def ingest(folder: Path):
for path in folder.glob("**/*.pdf"):
text = "\n".join(p.extract_text() for p in pypdf.PdfReader(path).pages)
# mnueron auto-chunks long content into per-paragraph rows
# tagged with the parent_ref. We just hand it the raw text.
m.save(
content=text,
namespace=NS,
source="pdf-ingest",
source_ref=str(path),
tags=["pdf", path.parent.name],
metadata={
"file": path.name,
"summarize": True, # 1-2 sentence summary per chunk
},
)
# [Query] One memory recall per question.
def ask(question: str) -> str:
hits = m.search(question, namespace=NS, k=8)
cites = "\n".join(
f"[{i+1}] {h.metadata.get('file','?')}: {h.content[:300]}"
for i, h in enumerate(hits)
)
resp = llm.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role":"system", "content":
"Answer using ONLY the citations below. If the answer is not "
"in the citations, say so. Cite sources by [n] number.\n\n" + cites},
{"role":"user", "content": question},
],
)
return resp.choices[0].message.content
if __name__ == "__main__":
ingest(Path("./docs"))
print(ask("What did we decide about the migration timeline?"))
Why mnueron is the right fit here
- BM25 search + future semantic — keyword precision today, vector reranking when v0.5 lands.
- Metadata filters —
m.search(q, metadata_filter={"file": "Q3-roadmap.pdf"})to constrain by source. - Cross-tool recall — same data is searchable from claude.ai (via Chrome extension), Claude Desktop (via MCP), or your bot.
- Chunking is automatic — pass a 300-page PDF, mnueron breaks it into searchable rows linked by parent_ref.
When NOT to use this pattern
- If your "documents" are <100 lines each and you'll never have more than 20 of them, a simple in-memory dict + regex is enough. mnueron is overkill below a few hundred chunks.
- If you need pixel-perfect citations (page + character offset),
you'll need to track those in
metadatayourself when ingesting.