v0.1.3 — LLM Citation Parser v1: Multi-Form Tokenizer + Modal-Driven Per-Cluster Review
Ships the first version of the LLM-citation-format converter that takes Google AI's `[1, 2, 3]` and Perplexity's `[1][2]` styles plus their `[N] [Title](url)` reference lists and rewrites them into the Lossless `[^hex]` footnote format. Pure-TypeScript service with token-level parsing, a CLI test harness for non-Obsidian validation, and a modal UI that gives the user per-numeric checkbox control plus per-row Convert buttons. Preserves already-Lossless citations verbatim, partially-converts multi-comma forms when only some members have ref defs, flags orphans and collisions, and refuses to transform anything where the multi-cluster collision rule would corrupt the result. Validated end-to-end against `ChromaDB.md` (the canonical messy test case mixing one Google-AI numeric series with 53 already-Lossless hex citations).
Why Care?
The day's pressing task: convert citations pasted from LLM tools (Google
AI Overviews, Perplexity, Claude, though Claude's prose form remains
out of scope) into the Lossless `[^hex]` footnote format the rest of
the cite-wide plugin operates on. The existing `convertAllCitations`
command does one regex pass over the whole file and groups citations
by their numeric ID. That silently corrupts attribution when the same
`[1]` in two different sections of a file refers to different sources:
both get merged into one citation group with one hex ID, and the
distinction between the sources is lost.
The user's framing was sharper than that:
"It cannot perform on a selection within a file, nor can it analyze for potential issues across copy-pasted content from multiple sources."
The fix isn't a smarter regex. It's a two-phase pipeline with explicit collision detection plus a modal UI that gives the user visibility into every proposed transformation before any disk write. The user picks per-numeric, runs Apply for the batch, or hits Convert on individual rows to incrementally pick off conversions while the modal stays open and re-renders.
The strategic reason this matters beyond cite-wide: the workflow this serves is the user's primary research intake. Pasted LLM outputs are where citations enter the knowledge graph from external tools. If that boundary is lossy or destructive, every downstream system (canonical citations folder, eventual vector DB, RAG pipelines, the Investment Memo Orchestrator) inherits the loss. Getting the import boundary trustworthy is what makes the rest of the company-brain ambition workable.
What Was Built
Commits in order
| # | Commit | Title |
|---|--------|-------|
| 1 | d4567d8 | feat(parser): LLM-citation parser handles multi-form clusters + partial conversion |
| 2 | 622152c | feat(parser): modal-driven LLM citation review with per-numeric selection |
| 3 | b3bb3da | fix(modal): tighter LLM-citations modal layout + per-row Convert button |
| 4 | 3dcf4ca | fix(modal): consolidate Select-All/Unselect-All to single tri-state All checkbox |
src/services/llmCitationParserService.ts — the parser (commit 1)
Pure TypeScript, no Obsidian imports — testable from a plain Node script without spinning up the plugin host.
Token kinds recognized:
| Kind | Pattern | Example | Source |
|------|---------|---------|--------|
| `inline-numeric-single` | `\[\d+\](?!:)` | `[12]` | conventional footnote |
| `inline-numeric-multi-comma` | `\[\d+(,\s*\d+)+\]` | `[1, 2, 3]` | Google AI |
| `inline-numeric-multi-adjacent` | `\[\d+\]\[\d+\]…` | `[1][2][3]` | Perplexity |
| `inline-hex` | `\[\^[a-z0-9]+\](?!:)` | `[^abc123]` | already Lossless |
| `refdef-numeric` | line-anchored `[N] [Title](url)` or `[N]: …` | `[2] [Vector DBs for RAG](https://…)` | LLM ref list |
| `refdef-hex` | line-anchored `[^abc]: …` | `[^abc123]: 2024, …` | already Lossless |
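
As a rough illustration of how the inline patterns in the table could be combined, the sketch below folds the four inline kinds into a single alternation, ordered so the multi-number forms are tried before the single form. The names `tokenizeInline`, `InlineToken`, and `INLINE_SOURCE` are hypothetical and are not taken from `llmCitationParserService.ts`. The adjacent pattern is written as `(?:\[\d+\]){2,}`, which is one reading of the table's `\[\d+\]\[\d+\]…`.

```typescript
// Hypothetical names; only the kind labels and patterns come from the table above.
type InlineKind =
  | "inline-numeric-single"
  | "inline-numeric-multi-comma"
  | "inline-numeric-multi-adjacent"
  | "inline-hex";

interface InlineToken {
  kind: InlineKind;
  raw: string;
  index: number;
  numbers: number[]; // stays empty for inline-hex
}

// Capture groups, in order: 1 = adjacent run, 2 = comma list, 3 = hex, 4 = single.
const INLINE_SOURCE =
  "((?:\\[\\d+\\]){2,})|(\\[\\d+(?:,\\s*\\d+)+\\])|(\\[\\^[a-z0-9]+\\](?!:))|(\\[\\d+\\](?!:))";

function tokenizeInline(line: string): InlineToken[] {
  const pattern = new RegExp(INLINE_SOURCE, "g"); // fresh lastIndex on every call
  const tokens: InlineToken[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(line)) !== null) {
    const raw = m[0];
    const kind: InlineKind = m[1]
      ? "inline-numeric-multi-adjacent"
      : m[2]
      ? "inline-numeric-multi-comma"
      : m[3]
      ? "inline-hex"
      : "inline-numeric-single";
    // Hex IDs may contain digits, so numbers are only extracted for numeric kinds.
    const numbers = kind === "inline-hex" ? [] : (raw.match(/\d+/g) || []).map(Number);
    tokens.push({ kind, raw, index: m.index, numbers });
  }
  return tokens;
}
```

Trying the multi-number alternatives first matters here: with the single pattern in front, `[1][2]` would come out as two unrelated single tokens and the Perplexity form would never be recognized as one cluster.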
Two-phase API:
- `parse(content): ParseResult` tokenizes, builds numeric/hex reference maps, detects orphans (inline without ref defs; ref defs without inline citations), detects collisions (same numeric ID defined more than once, a strong signal of two LLM-output clusters pasted into the same file), and surfaces all of these as a structured `flags` array.
- `transform(content, parseResult, { selectedNumbers?, mapping? })` generates one hex per transformable numeric (collision-checked against the existing hex-ref namespace so existing Lossless hex IDs are never reused), substitutes inline and reference-def occurrences, and preserves all already-`[^hex]` markers verbatim. Multi-comma forms partially convert: if `[1, 2, 3]` has ref defs only for `[2]`, the output is `[1] [^xxx] [3]`; the orphan numerics survive untouched and get a flag.
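
The orphan and collision checks described for the parse phase could look roughly like the sketch below. It is a simplified stand-in, not the shipped parser: `detectFlags` and the `Flag` shape are hypothetical, the two orphan code names follow this changelog, and the collision code name `numeric-collision` is an assumption.

```typescript
// Hypothetical flag shape; the collision code name is assumed, not taken from the parser.
interface Flag {
  code: "numeric-collision" | "orphan-inline-numeric" | "orphan-numeric-ref";
  n: number;
  lines: number[]; // 1-based lines of the ref defs involved (empty for inline orphans)
}

function detectFlags(content: string): Flag[] {
  const refDefLines = new Map<number, number[]>(); // numeric ID -> lines that define it
  const inlineNumbers = new Set<number>();

  content.split("\n").forEach((text, i) => {
    // Line-anchored ref def: "[N] [Title](url)" or "[N]: ..."
    const def = /^\[(\d+)\](?::|\s+\[)/.exec(text);
    if (def) {
      const n = Number(def[1]);
      refDefLines.set(n, (refDefLines.get(n) || []).concat(i + 1));
      return;
    }
    // Any other line is treated as prose: collect numbers from [N], [N, M], and [N][M].
    const inline = /\[(\d+(?:,\s*\d+)*)\]/g;
    let m: RegExpExecArray | null;
    while ((m = inline.exec(text)) !== null) {
      m[1].split(",").forEach((s) => inlineNumbers.add(Number(s.trim())));
    }
  });

  const flags: Flag[] = [];
  refDefLines.forEach((lines, n) => {
    // The same numeric ID defined more than once suggests two pasted clusters.
    if (lines.length > 1) flags.push({ code: "numeric-collision", n, lines });
    if (!inlineNumbers.has(n)) flags.push({ code: "orphan-numeric-ref", n, lines });
  });
  inlineNumbers.forEach((n) => {
    if (!refDefLines.has(n)) flags.push({ code: "orphan-inline-numeric", n, lines: [] });
  });
  return flags;
}
```

Keeping every defining line per numeric ID, rather than overwriting on the second definition, is what makes the collision visible at all; a plain `Map<number, string>` of ref defs would silently keep only the last one.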
Convenience: `proposeHexMapping(parseResult): Map<number, hex>`,
introduced for the modal flow. Pre-computes every transformable
numeric → hex mapping. The modal calls this on open and uses the same
map for every preview, every per-row Convert, and the final Apply, so
the hex shown in the UI is exactly what gets written.
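
A minimal sketch of the pre-computed mapping idea follows, assuming the transformable numbers and the existing hex namespace are already known. `proposeMapping` and `randomHex` are hypothetical names, the six-character length is inferred from the `[^abc123]` example, and the real `proposeHexMapping` takes a `ParseResult` rather than these arguments.

```typescript
// Hypothetical helper: random lowercase hex string of the given length.
function randomHex(length: number): string {
  const alphabet = "0123456789abcdef";
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
  }
  return out;
}

// Build one numeric -> hex mapping up front, never reusing an ID that is
// already present in the document or already handed out in this pass.
function proposeMapping(
  transformable: number[],
  existingHex: Set<string>,
  length: number = 6 // assumed ID length
): Map<number, string> {
  const taken = new Set<string>();
  existingHex.forEach((h) => taken.add(h));

  const mapping = new Map<number, string>();
  for (const n of transformable) {
    let hex = randomHex(length);
    while (taken.has(hex)) hex = randomHex(length); // retry on collision
    taken.add(hex);
    mapping.set(n, hex);
  }
  return mapping;
}
```

Because the map is built once and then passed around, a preview and a later write cannot disagree about which hex a given number receives, which is the property the modal relies on.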
scripts/parse-llm-citations.mjs — the CLI test harness (commit 1)
Standalone Node script that bundles the parser via esbuild's
in-memory build (the parser file uses TypeScript syntax that bare
Node won't run; bundle is in-memory, no extra files hit disk).

    node scripts/parse-llm-citations.mjs <input.md>            # parse-only report
    node scripts/parse-llm-citations.mjs <input.md> -o <out>   # transform & write

Used to validate the parser end-to-end against `ChromaDB.md` before
wiring as an Obsidian command. Stays in the repo as the regression
harness: any future parser change can be sanity-checked here without
toggling the plugin in Obsidian.
Validation against ChromaDB.md (the canonical messy file)
The test file mixes:

- 53 already-Lossless hex reference defs
- 18 numeric reference defs, `[2]` through `[22]` (with gaps for intentionally-orphan numbers)
- 43 inline hex citations (preserved verbatim)
- 6 inline numeric tokens (4 multi-comma forms + 2 singles)

The parser detects all of this correctly: 18 numeric ref defs converted to
hex, 13 inline numeric citations expanded into hex form, 96 hex
citations preserved untouched, 4 orphan-inline-numeric warnings raised
(`[1]`, `[3]`, `[14]`, `[19]` cited inline but never ref-def'd in the
source), 7 orphan-numeric-ref info flags (refs `[4]`–`[10]` defined
but not cited inline; these are LLM-list entries that never made it
into the prose), 24 orphan-hex-ref info flags (canonical sources
defined but not currently used inline). No collisions.
src/modals/LlmCitationsModal.ts — the modal UI (commit 2)
The modal-driven UX was the user's response to the first iteration's failure mode: running headlessly produced a single `Notice` with stats and zero visibility into which clusters got detected, what shape the transformation would take, or whether orphans were correctly being preserved as numeric.
What the modal shows, per row:

- Checkbox (default checked)
- Title: `[N] → [^hex]` in monospace, with per-row inline-occurrence count badge (or "orphan ref — no inline citation" when the ref def has no callers)
- Reference-definition row: badge, line link, 140-char body preview
- One row per inline occurrence: kind badge (`single` / `multi` / `adjacent`), line link, 140-char preview of the source line
- Per-row Convert button on the right edge: converts just that numeric, refreshes parse state from the post-conversion file, and re-renders the modal in place so the user can keep picking off conversions one at a time

Header controls: title with live "(N of M selected)" count, a single tri-state "All" checkbox (consolidating the original Select-All + Unselect-All buttons that kept wrapping the layout — see commit 4), an Apply button that batch-converts every checked row in one write.
Line-link behavior: every line number in the modal is a clickable anchor. Clicking scrolls the editor to that line, places the cursor, and closes the modal so the user can read the full surrounding context.
Flags section: orphans and collisions are surfaced at the bottom of the modal, grouped by code, with up to 5 example messages per code (each with its own line link if the flag carried one).
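
The grouping behind the flags section can be sketched as below: bucket flags by code, keep a total per code, and retain at most five examples each. `ParseFlag`, `FlagGroup`, and `groupFlags` are hypothetical names; the actual flag objects in the plugin may carry different fields.

```typescript
// Hypothetical shapes; the real flag objects may differ.
interface ParseFlag {
  code: string;
  message: string;
  line?: number; // present only when the flag points at a specific line
}

interface FlagGroup {
  code: string;
  total: number;         // how many flags share this code
  examples: ParseFlag[]; // at most maxExamples entries, in first-seen order
}

function groupFlags(flags: ParseFlag[], maxExamples: number = 5): FlagGroup[] {
  const groups = new Map<string, FlagGroup>();
  for (const flag of flags) {
    let group = groups.get(flag.code);
    if (!group) {
      group = { code: flag.code, total: 0, examples: [] };
      groups.set(flag.code, group);
    }
    group.total += 1;
    if (group.examples.length < maxExamples) group.examples.push(flag);
  }
  // Map preserves insertion order, so groups come out in first-seen order.
  const ordered: FlagGroup[] = [];
  groups.forEach((g) => ordered.push(g));
  return ordered;
}
```

Tracking `total` separately from `examples` lets the UI say how many flags of a kind exist even when only the first five are listed.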
Layout and density (commits 3 + 4)
The first modal version inherited too much from existing styles — buttons were stacking vertically and overflowing the right edge, each row was 7-8 lines tall. The fixes:
- Inline-styled overrides for everything specific to this modal, so `styles.css` (which the user is curating separately) doesn't get touched.
- Compact spacing: group margin 2rem → 0.5rem, header padding 1rem → 0.4rem, content padding 1.25rem → 0.4rem, instance padding 0.75rem → 0.25rem. Each row is now ~3-4 lines tall.
- Compact typography: title 1.1rem → 0.9rem in monospace, badge font 0.75rem → 0.65rem, line-info font 0.9rem → 0.78rem.
- Header consolidation: Select-All + Unselect-All → single tri-state "All" checkbox (browser-native indeterminate visual when some-but-not-all are selected). Three buttons in the header had been wrapping in some Obsidian theme contexts; two items + one checkbox don't.
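
The tri-state logic behind the "All" checkbox can be sketched as a pure function of the selection counts. The names `allCheckboxState` and `nextSelection` are hypothetical, and the click behavior (an indeterminate box selects everything) is an assumption rather than something this changelog specifies. In the DOM, the indeterminate visual is set through the `indeterminate` property of the checkbox's `HTMLInputElement`; the sketch leaves the DOM out so it stays testable.

```typescript
type TriState = "checked" | "unchecked" | "indeterminate";

// Derive the header checkbox state from how many rows are currently selected.
function allCheckboxState(selectedCount: number, totalCount: number): TriState {
  if (totalCount === 0 || selectedCount === 0) return "unchecked";
  return selectedCount === totalCount ? "checked" : "indeterminate";
}

// Assumed click behavior: a fully checked box clears the selection,
// while an unchecked or indeterminate box selects every row.
function nextSelection(current: TriState, allNumbers: number[]): number[] {
  return current === "checked" ? [] : allNumbers.slice();
}
```

Deriving the state from counts on every render, instead of storing it, keeps the header consistent with the rows after a per-row Convert removes one of them.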
Command wiring
`parse-llm-citations` (registered in `main.ts` `registerCitationCommands`)
now opens the modal instead of running headless:

    this.addCommand({
      id: 'parse-llm-citations',
      name: 'Parse LLM Citations in Current File',
      editorCallback: (editor: Editor) => {
        new LlmCitationsModal(this.app, editor).open();
      }
    });

What Changed in Approach (the meta-lesson)
| Pattern this rejects | Pattern this adopts |
|----------------------|---------------------|
| Run-on-whole-file with a `Notice` for stats | Modal-driven preview-then-confirm with per-item granularity |
| One regex pass that merges colliding numerics silently | Two-phase parse → transform pipeline that detects collisions explicitly and refuses to corrupt them |
| Multi-form citations like `[1, 2, 3]` are ignored or treated atomically | Multi-form is decomposed into per-number transformations; partial conversion when some are orphan |
| Fresh hex generation on each call | Pre-computed mapping during parse; same mapping flows through preview, per-row Convert, and final Apply |
| Edit `styles.css` for any visual change | Inline-style overrides for modal-specific layout; `styles.css` stays curatable independently |
| Two buttons (Select-All / Unselect-All) per UI section | One tri-state checkbox using browser-native indeterminate state: fewer visual elements, clearer semantics |
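
The multi-form row in the table can be made concrete with a small sketch of per-number decomposition. `convertMultiComma` and `rewriteMultiComma` are hypothetical names, and the mapping is assumed to contain only numbers that have a ref def, so anything missing from it is treated as an orphan and left numeric.

```typescript
// Rewrite one multi-comma token such as "[1, 2, 3]" number by number.
// Numbers present in the mapping become [^hex]; the rest stay as [N].
function convertMultiComma(token: string, mapping: Map<number, string>): string {
  const numbers = (token.match(/\d+/g) || []).map(Number);
  return numbers
    .map((n) => {
      const hex = mapping.get(n);
      return hex !== undefined ? `[^${hex}]` : `[${n}]`;
    })
    .join(" ");
}

// Apply the per-token rewrite to every multi-comma form in a piece of text.
// Single forms like "[7]" are deliberately left alone by this pattern.
function rewriteMultiComma(text: string, mapping: Map<number, string>): string {
  return text.replace(/\[\d+(?:,\s*\d+)+\]/g, (token) => convertMultiComma(token, mapping));
}
```

With a mapping that only knows `2`, the token `[1, 2, 3]` becomes `[1] [^hex] [3]`, matching the partial-conversion shape described for `transform`.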
The generalizing point: "transform" and "review" are different
operations and need different code paths. The existing
`convertAllCitations` is a transform-only operation; this new
`parse-llm-citations` is review-then-transform with the user's hand
on the steering wheel. Pasted-from-LLM citations are exactly the
input class where the user's judgment beats the regex's confidence,
because the input is messier than what the regex was designed for, and
because a wrong transformation here corrupts attribution that's
hard to recover later.
Open Items
- Claude prose-style attributions (no bracket markers, paragraphs like "Erik Brynjolfsson of Stanford…") are out of scope for the deterministic regex parser. Would need a Claude-API-driven extractor (matches the broader spec in `Citation-Acquisition-Pipeline.md`). Deferred until a real test case shows up.
- Heading slugs still drift when an inline `[12]` in a heading becomes `[^hex]`: the auto-generated `#1-pinecone-…` anchor breaks because the bracketed text is now different. Pre-existing issue inherited from `convertAllCitations`; not introduced here. Could be addressed by a follow-up pass that detects ToC links pointing at freshly-regenerated heading slugs.
- The `proposeHexMapping` random-hex generation uses `Math.random()`, which is fine for collision avoidance within one document but technically not deterministic across runs. If repeatable mappings ever matter (regression tests, content-addressed citation IDs), swap in a content-derived hash.
- Per-cluster pattern signatures (Google-AI vs Perplexity vs Claude) are detected at the token level but not surfaced as cluster-level classifications in the modal. The user can read the inline-token kind badges to infer source, but a cleaner UI would group rows by detected cluster pattern. Hold for a future pass.
- CLI test harness coverage: the harness can run parse-only and transform-and-write modes, but doesn't have golden-file regression tests. Adding `scripts/test/llm-citations.test.mjs` with 4-5 input → expected-output pairs would catch regressions cheaply.
- The two sibling spec docs (`Citation-Acquisition-Pipeline.md` and `Citation-Field-Acquisition-Guide.md`) are still untracked in the working tree as of this changelog. The user committed `Lossless-Citation-Standards.md` in `48bbc2d` but left the two sibling docs for later review.
- Force-push to remote still pending. All work since the `4ca2046` changelog is local-only; nothing has been pushed since the force-push that overwrote Tanuj's three Aug-2025 commits. Dependabot won't see the lockfile changes (and won't close the alerts) until the branch is pushed.
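
For the open item about repeatable mappings, one possible shape of a content-derived ID is sketched below. It uses a 32-bit FNV-1a hash over the string's UTF-16 code units, which is a swapped-in technique chosen only because it needs no imports; the changelog does not name a hash, and `contentHex` is a hypothetical helper. What to hash (the ref-def URL, the full ref-def line, or something else) is also left open here.

```typescript
// Deterministic hex ID derived from a source string via 32-bit FNV-1a.
// Simplification: hashes UTF-16 code units rather than UTF-8 bytes.
function contentHex(source: string, length: number = 6): string {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < source.length; i++) {
    hash ^= source.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping at 32 bits
  }
  // Convert to unsigned, left-pad to 8 hex digits, then truncate (length is at most 8).
  const full = ("00000000" + (hash >>> 0).toString(16)).slice(-8);
  return full.slice(0, length);
}
```

A 32-bit hash truncated to six hex digits will still collide occasionally, so a check against the existing hex namespace would remain necessary; the gain is only that the same input yields the same ID on every run.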
Files Touched

    cite-wide/
    ├── package.json                          (version 0.0.1.2 → 0.1.3)
    ├── manifest.json                         (version 0.0.1.1 → 0.1.3)
    ├── versions.json                         (added 0.1.3 → minAppVersion mapping)
    ├── main.ts                               (registered parse-llm-citations command + import)
    ├── README.md                             (added LLM Citation Conversion feature section)
    ├── CLAUDE.md                             (refreshed repo layout, status, open questions)
    ├── src/
    │   ├── services/
    │   │   └── llmCitationParserService.ts   (created: tokenizer, parse, transform, proposeHexMapping)
    │   └── modals/
    │       └── LlmCitationsModal.ts          (created: preview-then-apply UI, per-row Convert)
    ├── scripts/
    │   └── parse-llm-citations.mjs           (created: Node CLI test harness)
    └── changelog/                            (originally created at context-v/changelogs/; relocated to repo-root changelog/ on 2026-05-17)
        └── 2026-05-01_05.md                  (created: this file)

Reference
- Predecessor changelogs: `2026-05-01_01.md` (deps refresh), `2026-05-01_02.md` (type-safety pass), `2026-05-01_03.md` (dedupe by URL), `2026-05-01_04.md` (deps cleanup + Tanuj-intent port).
- Test file: `/Users/mpstaton/content-md/lossless/Tooling/Software Development/Databases/ChromaDB.md`, a mixed-format chaos file with one Google-AI cluster + 53 already-Lossless hex citations.
- Source-format catalog: `context-v/blueprints/Parse-Common-Citation-Formats.md`, with example outputs from Google AI Chat, Perplexity, and Claude.
- Inline citation format spec: `context-v/reminders/Lossless-Citation-Spec.md`, which defines the `[^hex]` placement rules this parser targets.
- Future-state reference: `context-v/blueprints/Citation-Acquisition-Pipeline.md`, the broader MCP-server architecture this parser fits into. The current parser is the deterministic "Phase 2" piece; the AI-driven pieces (Claude prose extraction, publisher classification, etc.) remain unbuilt per that spec.
- Commits: `d4567d8`, `622152c`, `b3bb3da`, `3dcf4ca` on `development`.