Download PDFs into Corpus Inbox: preserve the original binary alongside Jina-extracted markdown; wire both inbox vectors (UI and agent-chat) so the operator's PDF discoveries land as committable evidence, not just summarized text

Today's inbox flow Jina-fetches every URL and writes the extracted markdown to `clients/<client>/corpus/inbox/<date>_<slug>.md`. For HTML pages this is fine: the markdown is the content. For PDFs it loses information that matters. The original PDF (with its tables, figures, signatures, page numbering, and citable URL) is gone, and the operator is left with whatever text Jina could extract. The DOL workforce-strategy PDF captured in the 2026-06-08 milestone ship is exactly this case: the text is useful, but the binary that needs to be cited downstream isn't preserved.

This plan adds a binary-download primitive to `services/content-ingest`, threads PDF detection and filesystem persistence through `corpus.inbox.add`, and extends the inbox frontmatter contract with a `binary_asset` block. It surfaces 'PDF saved' affordances in both the agent-chat result bubble and the Content Reader UI, which gets a small 'send to inbox instead' toggle as an interim inbox-UI surface until the dedicated microfrontend ships. It also lays down a per-client `.gitattributes` git-lfs discipline so PDFs don't bloat the per-client repo's git history.

PDFs come first; the same scaffolding extends to docx/pptx/xlsx when those become operator pain points.
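As a rough illustration of the persistence step, the sketch below detects a PDF by its magic bytes, writes the binary next to the extracted markdown, and prepends a `binary_asset` frontmatter block. It is a minimal sketch under stated assumptions, not the planned implementation: the function names, the `<stem>.pdf` sibling naming, and every field inside `binary_asset` are hypothetical, and the fetch itself (Jina extraction and the binary download) is assumed to have already produced `markdown` and `body`.

```python
import hashlib
import json
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def is_pdf(body: bytes) -> bool:
    # Trust the payload rather than the Content-Type header. Some producers
    # prepend a few bytes before the header, so scan the first 1024 bytes.
    return PDF_MAGIC in body[:1024]


def build_binary_asset(pdf_path: Path, source_url: str, body: bytes) -> dict:
    # Hypothetical field names; the real frontmatter contract may differ.
    return {
        "path": pdf_path.name,
        "source_url": source_url,
        "media_type": "application/pdf",
        "bytes": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def render_frontmatter(asset: dict) -> str:
    # json.dumps yields double-quoted strings and bare integers, both of
    # which are valid YAML scalars.
    lines = ["---", "binary_asset:"]
    lines += [f"  {key}: {json.dumps(value)}" for key, value in asset.items()]
    lines.append("---")
    return "\n".join(lines) + "\n"


def write_inbox_entry(
    inbox_dir: Path, stem: str, source_url: str, body: bytes, markdown: str
) -> Path:
    # `stem` stands in for the existing <date>_<slug> naming.
    inbox_dir.mkdir(parents=True, exist_ok=True)
    md_path = inbox_dir / f"{stem}.md"
    frontmatter = ""
    if is_pdf(body):
        pdf_path = inbox_dir / f"{stem}.pdf"
        pdf_path.write_bytes(body)  # preserve the original binary
        frontmatter = render_frontmatter(
            build_binary_asset(pdf_path, source_url, body)
        )
    md_path.write_text(frontmatter + markdown, encoding="utf-8")
    return md_path
```

Recording a SHA-256 digest and byte count in the frontmatter lets a later step check that the committed binary still matches what was fetched. For the git-lfs side, running `git lfs track "*.pdf"` in a client repo writes the line `*.pdf filter=lfs diff=lfs merge=lfs -text` to `.gitattributes`.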