cite-wide 0.1.3 Published

v0.1.3 — LLM Citation Parser v1: Multi-Form Tokenizer + Modal-Driven Per-Cluster Review

Ships the first version of the LLM-citation-format converter that takes Google AI's `[1, 2, 3]` and Perplexity's `[1][2]` styles plus their `[N] [Title](url)` reference lists and rewrites them into the Lossless `[^hex]` footnote format. Pure-TypeScript service with token-level parsing, a CLI test harness for non-Obsidian validation, and a modal UI that gives the user per-numeric checkbox control plus per-row Convert buttons. Preserves already-Lossless citations verbatim, partially-converts multi-comma forms when only some members have ref defs, flags orphans and collisions, and refuses to transform anything where the multi-cluster collision rule would corrupt the result. Validated end-to-end against `ChromaDB.md` (the canonical messy test case mixing one Google-AI numeric series with 53 already-Lossless hex citations).

Why Care?

The day's pressing task: convert citations pasted from LLM tools (Google AI Overviews, Perplexity, and Claude, though Claude's prose form remains out of scope) into the Lossless [^hex] footnote format the rest of the cite-wide plugin operates on. The existing convertAllCitations command does one regex pass over the whole file and groups citations by their numeric ID. That silently corrupts attribution when the same [1] refers to different sources in two different sections of a file: both get merged into one citation group with one hex ID, and the distinction between the sources is lost.

The user's framing was sharper than that:

"It cannot perform on a selection within a file, nor can it analyze for potential issues across copy-pasted content from multiple sources."

The fix isn't a smarter regex. It's a two-phase pipeline with explicit collision detection plus a modal UI that gives the user visibility into every proposed transformation before any disk write. The user picks per-numeric, runs Apply for the batch, or hits Convert on individual rows to incrementally pick off conversions while the modal stays open and re-renders.

The strategic reason this matters beyond cite-wide: the workflow this serves is the user's primary research intake. Pasted LLM outputs are where citations enter the knowledge graph from external tools. If that boundary is lossy or destructive, every downstream system (canonical citations folder, eventual vector DB, RAG pipelines, the Investment Memo Orchestrator) inherits the loss. Getting the import boundary trustworthy is what makes the rest of the company-brain ambition workable.

What Was Built

Commits in order

| # | Commit | Title |
|---|--------|-------|
| 1 | d4567d8 | feat(parser): LLM-citation parser handles multi-form clusters + partial conversion |
| 2 | 622152c | feat(parser): modal-driven LLM citation review with per-numeric selection |
| 3 | b3bb3da | fix(modal): tighter LLM-citations modal layout + per-row Convert button |
| 4 | 3dcf4ca | fix(modal): consolidate Select-All/Unselect-All to single tri-state All checkbox |

src/services/llmCitationParserService.ts — the parser (commit 1)

Pure TypeScript, no Obsidian imports — testable from a plain Node script without spinning up the plugin host.

Token kinds recognized:

| Kind | Pattern | Example | Source |
|------|---------|---------|--------|
| inline-numeric-single | `\[\d+\](?!:)` | `[12]` | conventional footnote |
| inline-numeric-multi-comma | `\[\d+(,\s*\d+)+\]` | `[1, 2, 3]` | Google AI |
| inline-numeric-multi-adjacent | `\[\d+\]\[\d+\]…` | `[1][2][3]` | Perplexity |
| inline-hex | `\[\^[a-z0-9]+\](?!:)` | `[^abc123]` | already Lossless |
| refdef-numeric | line-anchored `[N] [Title](url)` or `[N]: …` | `[2] [Vector DBs for RAG](https://…)` | LLM ref list |
| refdef-hex | line-anchored `[^abc]: …` | `[^abc123]: 2024, …` | already Lossless |
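As a rough illustration of how the four inline patterns in the table could be applied together, here is a minimal sketch. The names `tokenizeInline`, `InlineToken`, and `REFDEF_LINE` are hypothetical and not the service's actual exports; skipping reference-definition lines before scanning for inline tokens is an assumption about ordering, not a description of the real tokenizer.

```ts
// Sketch only: names and token shape are illustrative, not the real exports.
type InlineKind =
  | "inline-numeric-single"
  | "inline-numeric-multi-comma"
  | "inline-numeric-multi-adjacent"
  | "inline-hex";

interface InlineToken {
  kind: InlineKind;
  raw: string;
  line: number;      // 1-based line number
  numbers: number[]; // numeric IDs inside the token; empty for inline-hex
}

// Assumption: reference-definition lines are handled by a separate
// line-anchored pass, so they are skipped here.
const REFDEF_LINE =
  /^\s*(?:\[\d+\]:|\[\d+\]\s+\[[^\]]*\]\([^)]*\)|\[\^[a-z0-9]+\]:)/;

// One alternative per kind. The multi forms come first so that "[1][2]"
// is matched as one adjacent token rather than two singles.
const INLINE_TOKEN =
  /((?:\[\d+\]){2,})|(\[\d+(?:,\s*\d+)+\])|(\[\d+\](?!:))|(\[\^[a-z0-9]+\](?!:))/g;

function tokenizeInline(content: string): InlineToken[] {
  const tokens: InlineToken[] = [];
  content.split("\n").forEach((text, i) => {
    if (REFDEF_LINE.test(text)) return;
    INLINE_TOKEN.lastIndex = 0; // a global regex keeps state between calls
    let m: RegExpExecArray | null;
    while ((m = INLINE_TOKEN.exec(text)) !== null) {
      const kind: InlineKind = m[1]
        ? "inline-numeric-multi-adjacent"
        : m[2]
        ? "inline-numeric-multi-comma"
        : m[3]
        ? "inline-numeric-single"
        : "inline-hex";
      const numbers =
        kind === "inline-hex" ? [] : (m[0].match(/\d+/g) || []).map(Number);
      tokens.push({ kind, raw: m[0], line: i + 1, numbers });
    }
  });
  return tokens;
}

const sample = [
  "Chroma is fast [1, 2, 3] and simple [4][5]. See [12] and [^abc123].",
  "[2] [Vector DBs for RAG](https://example.com)",
  "[^abc123]: 2024, Example",
].join("\n");
const sampleTokens = tokenizeInline(sample);
```

In this sketch only the first line of `sample` yields inline tokens; the two reference-definition lines are left for the ref-def pass.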

Two-phase API:

  • parse(content): ParseResult — tokenizes, builds numeric/hex reference maps, detects orphans (inline without ref defs; ref defs without inline citations), detects collisions (same numeric ID defined more than once — a strong signal of two LLM-output clusters pasted into the same file), surfaces all of these as a structured flags array.

  • transform(content, parseResult, { selectedNumbers?, mapping? }) — generates one hex per transformable numeric (collision-checked against the existing hex-ref namespace so existing Lossless hex IDs are never reused), substitutes inline + reference-def occurrences, preserves all already-[^hex] markers verbatim. Multi-comma forms partially convert: if [1, 2, 3] has ref defs only for [2], the output is [1] [^xxx] [3] — the orphan numerics survive untouched and get a flag.
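The partial-conversion rule for multi-comma forms can be sketched as below. `convertMultiComma` is a hypothetical helper, not the service's real function. The space-separated output follows the `[1] [^xxx] [3]` example above; leaving the token completely untouched when none of its members has a mapping is an assumption.

```ts
// Sketch only: hypothetical helper illustrating partial conversion.
interface MultiCommaResult {
  text: string;      // replacement text for the inline token
  orphans: number[]; // members left numeric because no hex mapping exists
}

function convertMultiComma(
  raw: string,
  mapping: Map<number, string>
): MultiCommaResult {
  const numbers = (raw.match(/\d+/g) || []).map(Number);
  const orphans = numbers.filter((n) => !mapping.has(n));

  // Assumption: if nothing is convertible, keep the original token verbatim.
  if (orphans.length === numbers.length) {
    return { text: raw, orphans };
  }

  // Each member becomes its own marker: hex if mapped, numeric otherwise.
  const text = numbers
    .map((n) => (mapping.has(n) ? `[^${mapping.get(n)}]` : `[${n}]`))
    .join(" ");
  return { text, orphans };
}

const partial = convertMultiComma("[1, 2, 3]", new Map([[2, "a1b2c3"]]));
// partial.text is "[1] [^a1b2c3] [3]"; partial.orphans is [1, 3]
```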

Convenience: proposeHexMapping(parseResult): Map<number, hex> — introduced for the modal flow. Pre-computes every transformable numeric → hex mapping. The modal calls this on open and uses the same map for every preview, every per-row Convert, and the final Apply, so the hex shown in the UI is exactly what gets written.
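A minimal sketch of the pre-computed mapping idea, assuming the inputs are already extracted from the parse result. `sketchHexMapping` and `randomHex` are hypothetical names, the default hex length of 6 is an assumption, and the real `proposeHexMapping` takes a `ParseResult` rather than these arguments.

```ts
// Sketch only: hypothetical stand-in for the mapping step.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

function sketchHexMapping(
  transformable: number[],
  existingHex: Set<string>,
  hexLength: number = 6
): Map<number, string> {
  // Copy so the caller's set is not mutated.
  const taken = new Set<string>();
  existingHex.forEach((h) => taken.add(h));

  const mapping = new Map<number, string>();
  for (const n of transformable) {
    // Retry until the candidate is unused by existing refs and by
    // earlier entries in this same mapping.
    let hex = randomHex(hexLength);
    while (taken.has(hex)) {
      hex = randomHex(hexLength);
    }
    taken.add(hex);
    mapping.set(n, hex);
  }
  return mapping;
}

// Computed once, then reused for preview, per-row Convert, and Apply.
const proposed = sketchHexMapping([2, 5, 9], new Set(["abc123"]));
```

Because the map is built once and passed around, the hex shown in a preview cannot differ from the hex that is eventually written.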

scripts/parse-llm-citations.mjs — the CLI test harness (commit 1)

Standalone Node script that bundles the parser via esbuild's in-memory build (the parser file uses TypeScript syntax that bare Node won't run; bundle is in-memory, no extra files hit disk).

node scripts/parse-llm-citations.mjs <input.md>          # parse-only report
node scripts/parse-llm-citations.mjs <input.md> -o <out> # transform & write

Used to validate the parser end-to-end against ChromaDB.md before wiring as an Obsidian command. Stays in the repo as the regression harness — any future parser change can be sanity-checked here without toggling the plugin in Obsidian.
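The two invocation modes differ only in whether `-o` is present. A hedged sketch of how that distinction could be parsed follows; `parseCliArgs` and `CliOptions` are hypothetical names, and the actual script may handle its arguments differently.

```ts
// Sketch only: hypothetical argument handling for the two harness modes.
interface CliOptions {
  input: string;
  output?: string; // present only in transform-and-write mode
}

function parseCliArgs(argv: string[]): CliOptions {
  let input: string | undefined;
  let output: string | undefined;

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "-o") {
      const next = argv[i + 1];
      if (next === undefined) throw new Error("-o requires an output path");
      output = next;
      i++; // skip the path we just consumed
    } else if (input === undefined) {
      input = arg;
    } else {
      throw new Error("unexpected argument: " + arg);
    }
  }

  if (input === undefined) throw new Error("missing <input.md>");
  return output === undefined ? { input } : { input, output };
}

// Parse-only report vs. transform and write:
const reportMode = parseCliArgs(["ChromaDB.md"]);
const writeMode = parseCliArgs(["ChromaDB.md", "-o", "out.md"]);
```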

Validation against ChromaDB.md (the canonical messy file)

The test file mixes:

  • 53 already-Lossless hex reference defs

  • 18 numeric reference defs [2] through [22] (with gaps for intentionally-orphan numbers)

  • 43 inline hex citations (preserved verbatim)

  • 6 inline numeric tokens (4 multi-comma forms + 2 singles)

The parser detects all of this correctly: 18 numeric ref defs converted to hex, 13 inline numeric citations expanded into hex form, 96 hex citations preserved untouched (53 ref defs + 43 inline), 4 orphan-inline-numeric warnings raised ([1], [3], [14], [19] cited inline but never ref-def'd in the source), 7 orphan-numeric-ref info flags (refs [4] through [10] defined but not cited inline; these are LLM-list entries that never made it into the prose), 24 orphan-hex-ref info flags (canonical sources defined but not currently used inline). No collisions.
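The numeric flag categories in that report could be derived along these lines. This is a sketch under stated assumptions: `detectNumericFlags`, the `Flag` shape, the `numeric-collision` code name, and the collision severity are all hypothetical; only the `orphan-inline-numeric` and `orphan-numeric-ref` codes come from the report above.

```ts
// Sketch only: hypothetical flag detection over already-extracted data.
type Severity = "info" | "warning";

interface Flag {
  code: string;
  severity: Severity;
  message: string;
  line?: number;
}

interface NumericRefDef {
  n: number;    // the numeric ID, e.g. 2 for "[2] [Title](url)"
  line: number; // 1-based line of the definition
}

function detectNumericFlags(
  inlineNumbers: number[],
  refDefs: NumericRefDef[]
): Flag[] {
  const flags: Flag[] = [];
  const cited = new Set<number>(inlineNumbers);

  // Collect every definition line per numeric ID.
  const defLines = new Map<number, number[]>();
  for (const def of refDefs) {
    const lines = defLines.get(def.n) || [];
    lines.push(def.line);
    defLines.set(def.n, lines);
  }

  // Cited inline but never defined.
  Array.from(cited)
    .sort((a, b) => a - b)
    .forEach((n) => {
      if (!defLines.has(n)) {
        flags.push({
          code: "orphan-inline-numeric",
          severity: "warning",
          message: `[${n}] cited inline but never defined`,
        });
      }
    });

  defLines.forEach((lines, n) => {
    // Same ID defined more than once: likely two pasted clusters.
    if (lines.length > 1) {
      flags.push({
        code: "numeric-collision", // assumed code name
        severity: "warning",       // assumed severity
        message: `[${n}] defined ${lines.length} times (lines ${lines.join(", ")})`,
        line: lines[1],
      });
    }
    // Defined but never cited inline.
    if (!cited.has(n)) {
      flags.push({
        code: "orphan-numeric-ref",
        severity: "info",
        message: `[${n}] defined but not cited inline`,
        line: lines[0],
      });
    }
  });

  return flags;
}

const demoFlags = detectNumericFlags(
  [1, 2, 3, 5, 2],
  [
    { n: 2, line: 10 },
    { n: 4, line: 11 },
    { n: 5, line: 12 },
    { n: 5, line: 40 },
  ]
);
```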

src/modals/LlmCitationsModal.ts — the modal UI (commit 2)

The modal-driven UX was the user's response to the first iteration's failure mode: running headlessly produced a single Notice with stats and zero visibility into which clusters got detected, how the transformation would shape, or whether orphans were correctly being preserved as numeric.

What the modal shows, per row:

  • Checkbox (default checked)

  • Title: [N] → [^hex] in monospace, with per-row inline-occurrence count badge (or "orphan ref — no inline citation" when the ref def has no callers)

  • Reference-definition row: badge, line link, 140-char body preview

  • One row per inline occurrence: kind badge (single / multi / adjacent), line link, 140-char preview of the source line

  • Per-row Convert button on the right edge — converts just that numeric, refreshes parse state from the post-conversion file, and re-renders the modal in place so the user can keep picking off conversions one at a time

Header controls: title with live "(N of M selected)" count, a single tri-state "All" checkbox (consolidating the original Select-All + Unselect-All buttons that kept wrapping the layout — see commit 4), an Apply button that batch-converts every checked row in one write.

Line-link behavior: every line number in the modal is a clickable anchor. Clicking scrolls the editor to that line, places the cursor, and closes the modal so the user can read the full surrounding context.

Flags section: orphans and collisions are surfaced at the bottom of the modal, grouped by code, with up to 5 example messages per code (each with its own line link if the flag carried one).
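The grouping behind the flags section can be sketched as a small pure function. `groupFlags`, `DisplayFlag`, and `FlagGroup` are hypothetical names; the cap of 5 examples per code is taken from the description above.

```ts
// Sketch only: hypothetical grouping used to render the flags section.
interface DisplayFlag {
  code: string;
  message: string;
  line?: number; // rendered as a line link when present
}

interface FlagGroup {
  code: string;
  total: number;           // how many flags carry this code
  examples: DisplayFlag[]; // at most maxExamples entries
}

function groupFlags(flags: DisplayFlag[], maxExamples: number = 5): FlagGroup[] {
  const groups = new Map<string, FlagGroup>();
  for (const flag of flags) {
    let group = groups.get(flag.code);
    if (!group) {
      group = { code: flag.code, total: 0, examples: [] };
      groups.set(flag.code, group);
    }
    group.total++;
    if (group.examples.length < maxExamples) {
      group.examples.push(flag);
    }
  }
  // Map preserves insertion order, so groups appear in first-seen order.
  return Array.from(groups.values());
}

const manyFlags: DisplayFlag[] = [];
for (let n = 4; n <= 10; n++) {
  manyFlags.push({ code: "orphan-numeric-ref", message: `[${n}] not cited`, line: n });
}
manyFlags.push({ code: "orphan-inline-numeric", message: "[1] never defined" });
const grouped = groupFlags(manyFlags);
```

Keeping `total` separate from `examples` lets the UI show a full count next to a truncated list.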

Layout and density (commits 3 + 4)

The first modal version inherited too much from existing styles — buttons were stacking vertically and overflowing the right edge, each row was 7-8 lines tall. The fixes:

  • Inline-styled overrides for everything specific to this modal, so styles.css (which the user is curating separately) doesn't get touched.

  • Compact spacing: group margin 2rem→0.5rem, header padding 1rem→0.4rem, content padding 1.25rem→0.4rem, instance padding 0.75rem→0.25rem. Each row is now ~3-4 lines tall.

  • Compact typography: title 1.1rem→0.9rem in monospace, badge font 0.75rem→0.65rem, line-info font 0.9rem→0.78rem.

  • Header consolidation: Select-All + Unselect-All → single tri-state "All" checkbox (browser-native indeterminate visual when some-but-not-all are selected). Three buttons in the header had been wrapping in some Obsidian theme contexts; two items + one checkbox don't.
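The tri-state logic behind the "All" checkbox reduces to a small pure function. The sketch below is illustrative: `allCheckboxState` and `onAllClick` are hypothetical names, and the click behavior (select everything unless everything is already selected) is an assumption about the modal, not a confirmed detail.

```ts
// Sketch only: hypothetical state derivation for the "All" checkbox.
interface TriState {
  checked: boolean;
  indeterminate: boolean;
}

function allCheckboxState(selected: number, total: number): TriState {
  if (total === 0 || selected === 0) {
    return { checked: false, indeterminate: false };
  }
  if (selected === total) {
    return { checked: true, indeterminate: false };
  }
  // Some but not all rows selected: browsers draw the dash glyph.
  return { checked: false, indeterminate: true };
}

// Assumed click behavior: a full selection toggles to none; anything
// else (none or partial) becomes a full selection.
function onAllClick(selected: number, total: number): "select-all" | "unselect-all" {
  return total > 0 && selected === total ? "unselect-all" : "select-all";
}

// In the modal this would be applied to a real <input type="checkbox">:
//   box.checked = state.checked;
//   box.indeterminate = state.indeterminate; // settable only from script
const mixed = allCheckboxState(3, 18);
```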

Command wiring

parse-llm-citations (registered in main.ts registerCitationCommands) now opens the modal instead of running headless:

TS
this.addCommand({
    id: 'parse-llm-citations',
    name: 'Parse LLM Citations in Current File',
    editorCallback: (editor: Editor) => {
        new LlmCitationsModal(this.app, editor).open();
    }
});

What Changed in Approach (the meta-lesson)

| Pattern this rejects | Pattern this adopts |
|----------------------|---------------------|
| Run-on-whole-file with a Notice for stats | Modal-driven preview-then-confirm with per-item granularity |
| One regex pass that merges colliding numerics silently | Two-phase parse → transform pipeline that detects collisions explicitly and refuses to corrupt them |
| Multi-form citations like [1, 2, 3] are ignored or treated atomically | Multi-form is decomposed into per-number transformations; partial conversion when some are orphan |
| Fresh hex generation on each call | Pre-computed mapping during parse; same mapping flows through preview, per-row Convert, and final Apply |
| Edit styles.css for any visual change | Inline-style overrides for modal-specific layout; styles.css stays curatable independently |
| Two buttons (Select-All / Unselect-All) per UI section | One tri-state checkbox using browser-native indeterminate state: fewer visual elements, clearer semantics |

The generalizing point: "transform" and "review" are different operations and need different code paths. The existing convertAllCitations is a transform-only operation; this new parse-llm-citations is review-then-transform with the user's hand on the steering wheel. Pasted-from-LLM citations are exactly the input class where the user's judgment beats the regex's confidence — because the input is messier than what regex was designed for, and because a wrong transformation here corrupts attribution that's hard to recover later.

Open Items

  • Claude prose-style attributions (no bracket markers, paragraphs like "Erik Brynjolfsson of Stanford…") are out of scope for the deterministic regex parser. Would need a Claude-API-driven extractor (matches the broader spec in Citation-Acquisition-Pipeline.md). Deferred until a real test case shows up.

  • Heading slugs still drift when an inline [12] in a heading becomes [^hex] — the auto-generated #1-pinecone-… anchor breaks because the bracketed text is now different. Pre-existing issue inherited from convertAllCitations; not introduced here. Could be addressed by a follow-up pass that detects ToC links pointing at freshly-regenerated heading slugs.

  • The proposeHexMapping random-hex generation uses Math.random() — fine for collision avoidance within one document but technically not deterministic across runs. If repeatable mappings ever matter (regression tests, content-addressed citation IDs), swap in a content-derived hash.

  • Per-cluster pattern signatures (Google-AI vs Perplexity vs Claude) are detected at the token level but not surfaced as cluster-level classifications in the modal. The user can read the inline-token kind badges to infer source, but a cleaner UI would group rows by detected cluster pattern. Hold for a future pass.

  • CLI test harness coverage — the harness can run parse-only and transform-and-write modes, but doesn't have golden-file regression tests. Adding scripts/test/llm-citations.test.mjs with 4-5 input → expected-output pairs would catch regressions cheaply.

  • The two sibling spec docs (Citation-Acquisition-Pipeline.md and Citation-Field-Acquisition-Guide.md) are still untracked in the working tree as of this changelog. The user committed Lossless-Citation-Standards.md in 48bbc2d but left the two sibling docs for later review.

  • Force-push to remote still pending. All work since the 4ca2046 changelog is local-only; nothing has been pushed since the force-push that overwrote Tanuj's three Aug-2025 commits. Dependabot won't see the lockfile changes (and won't close the alerts) until the branch is pushed.
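For the `Math.random()` item above, a content-derived hex could be sketched as follows. Everything here is an assumption for illustration: using the reference URL as the hash key, the FNV-1a-style hash, the 6-character length, and the salt-on-collision strategy. `fnv1a32` and `deterministicHex` are hypothetical names.

```ts
// Sketch only: one possible content-derived replacement for random hex.
// FNV-1a-style 32-bit hash. It works on UTF-16 code units rather than
// bytes, so it is not byte-for-byte identical to reference FNV-1a.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

function deterministicHex(
  key: string,
  taken: Set<string>,
  length: number = 6
): string {
  let salt = 0;
  for (;;) {
    // Re-hash with a salt suffix if the candidate is already in use.
    const salted = salt === 0 ? key : key + "#" + salt;
    const full = ("00000000" + fnv1a32(salted).toString(16)).slice(-8);
    const candidate = full.slice(0, length);
    if (!taken.has(candidate)) {
      taken.add(candidate);
      return candidate;
    }
    salt++;
  }
}

// Same key and same starting namespace give the same ID on every run.
const runA = deterministicHex("https://example.com/vector-dbs", new Set<string>());
const runB = deterministicHex("https://example.com/vector-dbs", new Set<string>());
```

The result still depends on which IDs are already taken, so repeatability holds only when the existing hex namespace is the same between runs.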

Files Touched

cite-wide/
├── package.json                               (version 0.0.1.2 → 0.1.3)
├── manifest.json                              (version 0.0.1.1 → 0.1.3)
├── versions.json                              (added 0.1.3 → minAppVersion mapping)
├── main.ts                                    (registered parse-llm-citations command + import)
├── README.md                                  (added LLM Citation Conversion feature section)
├── CLAUDE.md                                  (refreshed repo layout, status, open questions)
├── src/
│   ├── services/
│   │   └── llmCitationParserService.ts       (created — tokenizer, parse, transform, proposeHexMapping)
│   └── modals/
│       └── LlmCitationsModal.ts               (created — preview-then-apply UI, per-row Convert)
├── scripts/
│   └── parse-llm-citations.mjs                (created — Node CLI test harness)
└── changelog/                                 (originally created at context-v/changelogs/; relocated to repo-root changelog/ on 2026-05-17)
    └── 2026-05-01_05.md                       (created — this file)

Reference

  • Predecessor changelogs: 2026-05-01_01.md (deps refresh), 2026-05-01_02.md (type-safety pass), 2026-05-01_03.md (dedupe by URL), 2026-05-01_04.md (deps cleanup + Tanuj-intent port).

  • Test file: /Users/mpstaton/content-md/lossless/Tooling/Software Development/Databases/ChromaDB.md (mixed-format chaos file with one Google-AI cluster + 53 already-Lossless hex citations).

  • Source-format catalog: context-v/blueprints/Parse-Common-Citation-Formats.md — example outputs from Google AI Chat, Perplexity, and Claude.

  • Inline citation format spec: context-v/reminders/Lossless-Citation-Spec.md — defines the [^hex] placement rules this parser targets.

  • Future-state reference: context-v/blueprints/Citation-Acquisition-Pipeline.md — the broader MCP-server architecture this parser fits into. The current parser is the deterministic "Phase 2" piece; the AI-driven pieces (Claude prose extraction, publisher classification, etc.) remain unbuilt per that spec.

  • Commits: d4567d8, 622152c, b3bb3da, 3dcf4ca on development.