cite-wide 0.2.0 Published

v0.2.0 — Ship v1 of LLM Response Parser: Paste Modal + Spec-Conformant Output

Brings the LLM citation parser to a true v1 with three coordinated changes: a Paste LLM Content modal that intercepts at paste time so the colliding-numerics problem never enters the file; a fix for Perplexity's `Title https://url` reference-def format, which the refdef-body heuristic was silently rejecting; and a spec-conformance pass on transformation output that brings inline citation spacing and reference-def body shape into line with the Lossless Citation Spec. The canonical `date / author / publisher / accessed-date` envelope still requires network or AI assistance and is explicitly deferred.

Why Care?

v0.1.3 (2026-05-01_05.md) shipped the LLM citation parser as a review-then-convert modal: detect every [N] and [N] [Title](url) in the active file, propose a hex mapping, let the user pick rows and Apply. That was the post-hoc path — fix what's already in the doc.

This version closes the feature on three remaining gaps that together prevented it from being a real v1:

  1. No upstream control. Pasting raw LLM output into a doc and then running the parser worked, but it left a window where two Perplexity responses (each with its own [1] through [N] series) could be pasted into the same file before the parser ran, and collide. The new Paste modal closes that window: convert at the paste boundary, never let raw LLM output enter the file.

  2. Perplexity refdefs were being silently dropped. The original looksLikeRefDefBody heuristic only accepted markdown-link-at-line-start ([Title](url)) or bare-URL-at-line-start. Perplexity puts the title BEFORE the URL ([1] Title https://url), which matched neither, so Perplexity reference lists weren't being detected as refdefs at all, and conversions silently no-op'd.

  3. Output didn't match the Lossless spec. Even when conversion worked, the output was a faithful token-substitution of the source: text.[^hex] (no space after period, per Perplexity's inline form), [^hex]: Title https://url (raw Perplexity refdef shape, not the Lossless canonical markdown-link). The spec requires single-space-before-citation and [Title](URL) refdef body shape; the user identified both gaps when comparing GitLab.md against Lossless-Citation-Spec.md.

The strategic point: the parser's job isn't just "convert numerics to hexes." It's "produce Lossless-conformant output from any recognized input shape, regardless of which LLM tool generated the input." This version is the first where the output actually meets that bar deterministically — without fetching, without AI calls, purely from the pasted material.

What Was Built

Commits in order

| # | Commit | Title |
|---|---------|-------|
| 1 | 34d3ef8 | feat(modal): paste-time LLM citation conversion at cursor |
| 2 | 02668d3 | fix(parser): recognize Perplexity refdef format ([N] Title https://url) |
| 3 | 0423326 | fix(parser): output now matches Lossless spec (inline spacing + markdown-link refdefs) |

Paste LLM Content modal (commit 1)

New modal src/modals/PasteLlmContentModal.ts and command paste-llm-content ("Paste LLM Content (Convert Citations on Insert)"). The flow:

  1. User triggers the command — modal opens with the textarea auto-focused so ⌘V drops them straight into pasting.

  2. User picks the source provider (Google AI Overviews / Perplexity) via radio. Currently informational metadata; the parser auto-handles both formats. Field is carried for any future provider-specific tweaks.

  3. Pastes the LLM output verbatim into the textarea.

  4. Clicks Parse and Insert.

What happens on insert:

  • The host document's hex namespace is collected (hexRefs keys + inlineHex numbers) and passed to the parser as additionalUsedHexes. Generated hex IDs for the pasted content are guaranteed not to collide with citations already in the host doc.

  • The pasted content runs through the same parse → transform pipeline as the in-file modal.

  • Output is inserted at cursor via editor.replaceSelection().

  • A Notice reports inline + ref def conversion counts; warning-level flags surface to the console.
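The namespace-collection step above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the real implementation reads hexRefs keys and inlineHex numbers from a parse result, whereas this hypothetical `collectHostHexNamespace` helper simply scans the raw document text, and it assumes hex IDs are lowercase hex digits inside `[^...]`.

```typescript
// Hypothetical helper (not from the cite-wide source): gathers every
// footnote-style ID already present in the host document so that newly
// generated IDs can avoid them.
// Assumption: IDs are lowercase hex digits, e.g. [^a1b2c3].
function collectHostHexNamespace(docText: string): Set<string> {
  const used = new Set<string>();
  // Matches both inline markers ([^a1b2c3]) and definitions ([^a1b2c3]: ...).
  const marker = /\[\^([0-9a-f]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(docText)) !== null) {
    const id = match[1];
    if (id) {
      used.add(id);
    }
  }
  return used;
}
```

The resulting set is what would be handed to the parser as additionalUsedHexes before the pasted content is transformed.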

Service refactor to support this: proposeHexMapping(parseResult, additionalUsedHexes?) now takes an optional second parameter — a Set<string> of hex IDs to exclude from generation in addition to those defined in the parsed content itself. The Paste modal passes the host doc's namespace; the in-file modal calls without the second argument and gets the same behavior as before.
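A simplified sketch of how an optional exclusion set might thread through hex generation. The function name `proposeHexMappingSketch`, its flattened parameters, and the `randomHex` generator are all assumptions made for illustration; the real proposeHexMapping takes a parse result, and the source does not describe how IDs are generated.

```typescript
// Hypothetical stand-in for the real ID generator (an assumption).
function randomHex(length: number = 6): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Simplified sketch: maps each numeric citation to a fresh hex ID that is
// absent from the parsed content, from IDs assigned earlier in this call,
// and from the optional caller-supplied exclusion set.
function proposeHexMappingSketch(
  numericIds: number[],
  hexesInParsedContent: Set<string>,
  additionalUsedHexes?: Set<string>,
  generate: () => string = randomHex,
): Map<number, string> {
  const mapping = new Map<number, string>();
  const assigned = new Set<string>();
  for (const n of numericIds) {
    if (mapping.has(n)) continue; // repeated numerics reuse their first mapping
    let candidate = generate();
    // Regenerate until the candidate is free in every namespace.
    while (
      hexesInParsedContent.has(candidate) ||
      assigned.has(candidate) ||
      (additionalUsedHexes !== undefined && additionalUsedHexes.has(candidate))
    ) {
      candidate = generate();
    }
    assigned.add(candidate);
    mapping.set(n, candidate);
  }
  return mapping;
}
```

Omitting the third argument reproduces the in-file behavior: only IDs found in the parsed content itself are excluded.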

Perplexity refdef format fix (commit 2)

Bug: the refdef-body heuristic /^\[.+\]\(.+\)/.test(body) || /^https?:\/\//.test(body) only matched two body shapes — markdown link at line-start or bare URL at line-start. Perplexity's typical reference-list format is:

[1] Understanding GitHub Actions https://docs.github.com/articles/...

Title text first, URL at the end. Body = "Understanding GitHub Actions https://...". Doesn't start with [ (not markdown link). Doesn't start with https?:// (URL is at the end). Heuristic returned false, the line fell through to scanInline, the [1] got tokenized as inline-numeric-single, and no refdef was ever recorded. Result: numericRefs.size === 0, no hex mapping, transform was a no-op, user saw "the output was not transformed."

Fix: loosen to "body contains a URL anywhere" — /https?:\/\//.test(body). The line-anchored [N] prefix in the parent regex already filters out almost everything that isn't a reference definition, so allowing URLs anywhere in the body is safe and catches all three known formats (Google AI markdown-link, Perplexity title-then-URL, Lossless @URL shorthand).
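The before/after behavior can be reproduced with a small sketch. The two body tests use the regexes quoted above; the `REF_DEF_LINE` pattern standing in for the "parent regex" is an assumption, since the source does not show it, and the example URLs are placeholders.

```typescript
// Old heuristic, as quoted: markdown link or bare URL at the start of the body.
function looksLikeRefDefBodyOld(body: string): boolean {
  return /^\[.+\]\(.+\)/.test(body) || /^https?:\/\//.test(body);
}

// Loosened heuristic: a URL anywhere in the body.
function looksLikeRefDefBodyNew(body: string): boolean {
  return /https?:\/\//.test(body);
}

// Assumed shape of the line-anchored parent pattern: "[N]" at line start,
// an optional colon, whitespace, then the body.
const REF_DEF_LINE = /^\[(\d+)\]:?\s+(.+)$/;

function isRefDefLine(line: string, bodyTest: (body: string) => boolean): boolean {
  const match = REF_DEF_LINE.exec(line);
  if (match === null) return false;
  const body = match[2];
  return body !== undefined && bodyTest(body);
}

const perplexityLine = "[1] Understanding GitHub Actions https://example.com/actions";
const googleAiLine = "[1] [Understanding GitHub Actions](https://example.com/actions)";

console.log(isRefDefLine(perplexityLine, looksLikeRefDefBodyOld)); // false
console.log(isRefDefLine(perplexityLine, looksLikeRefDefBodyNew)); // true
console.log(isRefDefLine(googleAiLine, looksLikeRefDefBodyNew));   // true
```

The trade-off is that a prose line beginning with [N] and mentioning a URL would now also be classified as a refdef; the line anchor is what keeps that rare.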

Verified end-to-end on a Perplexity sample (9 ref defs + 19 inline single citations): all detected and transformed; zero flags.

Spec-conformance pass on transform output (commit 3)

The user observed two output deltas vs. Lossless-Citation-Spec.md after running the parser on GitLab.md (a real Perplexity-derived research file):

| Spec rule | Source had | Old output | New output |
|-----------|------------|------------|------------|
| Single space between content and citation | `text.[1]` | `text.[^hex]` | `text. [^hex]` |
| Single space between consecutive citations | `[3] [4][5] [6]` | `[^c] [^d][^e] [^f]` | `[^c] [^d] [^e] [^f]` |
| Word-then-cite needs space too | `changes[2]` | `changes[^h]` | `changes [^h]` |
| Reference body in markdown-link form | `[1] Title https://url` | `[^h]: Title https://url` | `[^h]: [Title](https://url)` |

Two new private methods on llmCitationParserService:

  • normalizeInlineCitationSpacing(line) — iterative pass that inserts a space between any non-whitespace character and an immediately-following [^hex] marker. Catches all three boundary cases (punctuation-then-cite, word-then-cite, cite-then-cite) in a single pattern. Iterates with a safety cap of 100 so chains like text[^a][^b][^c] resolve fully across multiple passes (each pass fixes one boundary, regex re-scans the modified string).

  • reformatRefDefBodyAsMarkdownLink(body) — when transforming a numeric refdef to hex form, restructures the body into [Title](URL) shape if it isn't already. Skip-clause: if the body already contains any markdown link, leave alone — preserves Lossless / Google-AI inputs that arrive correctly shaped, and avoids double-wrapping anything mid-body.
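A minimal sketch of the iterative spacing pass described in the first bullet. The exact pattern used in llmCitationParserService is not shown in this changelog, so the regex below is an assumption; it accepts lowercase alphanumeric IDs so that placeholder IDs such as h2 also match.

```typescript
// Sketch of the spacing pass: inserts a single space between any
// non-whitespace character and an immediately following [^id] marker.
// The regex is non-global on purpose, so each pass fixes exactly one
// boundary and the next pass re-scans the modified string.
function normalizeInlineCitationSpacingSketch(line: string): string {
  const boundary = /(\S)(\[\^[0-9a-z]+\])/;
  const maxPasses = 100; // safety cap, as described above
  let current = line;
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = current.replace(boundary, "$1 $2");
    if (next === current) break; // nothing left to fix
    current = next;
  }
  return current;
}

console.log(normalizeInlineCitationSpacingSketch("text[^a][^b][^c]"));
// prints "text [^a] [^b] [^c]"
```

A marker at the very start of a line, such as a `[^h1]: ...` definition, has no preceding character and is left untouched.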

Wired in: normalizeInlineCitationSpacing runs at the end of the inline-transform per-line block (after multi-comma, adjacent, and single replaces). reformatRefDefBodyAsMarkdownLink runs inside the refdef-numeric branch when emitting the new hex form. Numeric refdefs that don't get a hex (collision, unselected) keep their body verbatim — we don't restructure refdefs we aren't transforming.
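The refdef restructuring step might look roughly like this. It is a sketch under assumptions: it only handles the title-then-trailing-URL shape, and returning a bare URL or a mid-body URL unchanged is a guess, since the source does not say what the real method does in those cases.

```typescript
// Sketch of the body restructuring: "Title https://url" -> "[Title](https://url)".
function reformatRefDefBodyAsMarkdownLinkSketch(body: string): string {
  // Skip clause: a body that already contains a markdown link is left alone.
  if (/\[[^\]]+\]\([^)]+\)/.test(body)) return body;

  // Title text first, URL as the final token on the line.
  const match = /^(.*?)\s*(https?:\/\/\S+)\s*$/.exec(body);
  if (match === null) return body; // no trailing URL: nothing to restructure

  const title = (match[1] ?? "").trim();
  const url = match[2] ?? "";
  // Assumption: with no title text there is nothing to wrap, so keep the body.
  if (title === "" || url === "") return body;

  return `[${title}](${url})`;
}

console.log(reformatRefDefBodyAsMarkdownLinkSketch("Title One https://example.com/one"));
// prints "[Title One](https://example.com/one)"
```

Because the skip clause runs first, inputs that arrive already shaped as markdown links pass through byte-for-byte.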

Verification

Build clean across all three commits (tsc + eslint + esbuild). End-to-end fixture covering all three spacing cases plus refdef restructuring:

INPUT:
  changes[2] [5] [3]                       (word + spaced + spaced)
  growth:[3] [4][5] [6]                    (colon + spaced + adjacent + spaced)
  cases.[1] [2]                            (period + spaced)
  [1] Title One https://example.com/one    (Perplexity refdef)

OUTPUT:
  changes [^h2] [^h5] [^h3]
  growth: [^h3] [^h4] [^h5] [^h6]
  cases. [^h1] [^h2]
  [^h1]: [Title One](https://example.com/one)

User confirmed working on the GitLab.md reconversion.

What Changed in Approach (the meta-lesson)

| Pattern this rejects | Pattern this adopts |
|----------------------|---------------------|
| Token substitution only: preserve surrounding bytes verbatim | Token substitution PLUS a final spec-conformance pass that normalizes whitespace and reshapes refdef bodies |
| Convert in the file after pasting | Convert at the paste boundary; raw LLM output never enters the file |
| Heuristic accepts only the formats we'd written it for | Heuristic accepts any line-anchored [N] whose body contains a URL anywhere; catches all three known LLM-output shapes with one rule |
| Hex generation per-modal-instance, blind to host context | Hex generation can opt into an "exclude these hex IDs too" parameter so the same generator works for in-file and paste-into-file flows |
| Provider radio is decorative UI; parser auto-handles both formats | Provider radio is retained but marked as informational pending future provider-specific tweaks (the natural use is refdef-body restructuring per provider, but auto-detect handles both formats safely without it) |

The generalizable point: a parser's job ends not when the input is recognized but when the output meets the target spec. The v0.1.3 parser was correct on input recognition but produced non-spec-conformant output that required manual cleanup. The v0.2.0 parser does the cleanup itself. That's the difference between "a working parser" and "a feature you can hand to a content team."

Open Items

  • Canonical Lossless reference shape still partially unfilled. The Lossless spec wants [^hex]: 2025, Jan 25. {Author}. [Title](url). {Publisher Name} || [Publisher](publisher_url). Accessed {Month Day, Year}. — date, author, publisher name, accessed date are all missing from the parser's deterministic output because they require network (URL fetch + meta-tag extraction) or AI (publisher classification, author normalization). Explicitly deferred per the user's "3 involves making a tool call that will have to run in the background." The MCP-driven canonical-citations agent spec'd in Citation-Acquisition-Pipeline.md covers this.

  • Provider radio is still informational. The parser auto-handles Google AI and Perplexity forms regardless of selection. Keeping the field for any future provider-specific tweaks (e.g., provider-specific refdef-body extraction strategies, or future format additions like Claude prose-style attribution).

  • Claude prose-style attribution ("Erik Brynjolfsson of Stanford Digital Economy Lab — Canaries in the Coal Mine?…") remains out of scope for the deterministic regex parser. Would need a Claude-API-driven extractor. No real workflow currently bumps into this.

  • Heading slug drift when an inline [12] in a heading becomes [^hex]: the auto-generated #1-pinecone-… anchor breaks because the bracketed text changed. Pre-existing issue inherited from convertAllCitations; not worsened by the new parser. A follow-up pass that detects ToC links pointing at freshly-regenerated heading slugs would handle it.

  • "Normalize Lossless Citation Spacing" as a standalone command. The new spec-conformance pass only runs on lines the parser transforms (i.e., lines with at least one numeric citation). For files already converted under the v0.1.3 parser, the spacing issues are baked in — re-running the parser doesn't fix them because there are no numeric citations left to trigger the transform path. A separate command that runs only the normalizeInlineCitationSpacing pass over the active file would retroactively fix older converted files. Holding for now — surface if users hit it in practice.

  • Force-push to remote still pending. All work since the 4ca2046 changelog is local-only on development. Dependabot won't see the lockfile changes until the branch is pushed. User indicated they'll push manually.

Files Touched

cite-wide/
├── package.json                                       (version 0.1.3 → 0.2.0)
├── manifest.json                                      (version 0.1.3 → 0.2.0)
├── versions.json                                      (added 0.2.0 → minAppVersion mapping)
├── main.ts                                            (registered paste-llm-content command + import)
├── README.md                                          (rewrote LLM Citation Conversion section to cover both commands; added Paste LLM Content command entry)
├── CLAUDE.md                                          (updated Status to v0.2.0; added PasteLlmContentModal to repo layout)
├── src/
│   ├── services/
│   │   └── llmCitationParserService.ts                (proposeHexMapping gained additionalUsedHexes; looksLikeRefDefBody loosened; normalizeInlineCitationSpacing + reformatRefDefBodyAsMarkdownLink added)
│   └── modals/
│       └── PasteLlmContentModal.ts                    (created — paste-time conversion at cursor)
└── changelog/                                         (originally created at context-v/changelogs/; relocated to repo-root changelog/ on 2026-05-17)
    └── 2026-05-02_01.md                               (created — this file)

Reference

  • Predecessor changelogs: 2026-05-01_01.md (deps refresh), 2026-05-01_02.md (type-safety pass), 2026-05-01_03.md (dedupe by URL), 2026-05-01_04.md (deps cleanup + Tanuj-intent port), 2026-05-01_05.md (LLM citation parser v0.1.3 — modal-driven per-cluster review).

  • Spec doc the output now conforms to: context-v/reminders/Lossless-Citation-Spec.md (inline citation spacing rules + reference-section formatting requirements).

  • Original spec for the Paste modal: context-v/specs/Modal-for-Pasting-LLM-Native-Content.md.

  • Source-format catalog: context-v/blueprints/Parse-Common-Citation-Formats.md (example outputs from Google AI Chat, Perplexity, and Claude).

  • Future-state reference: context-v/blueprints/Citation-Acquisition-Pipeline.md — the broader MCP-driven canonical-citations pipeline that would fill the rest of the canonical Lossless reference shape (date / author / publisher / accessed-date). The parser shipped here is the deterministic Phase 2 piece of that broader architecture; the network/AI Phase 5 (content archival + structured extraction) remains unbuilt.

  • Commits: 34d3ef8, 02668d3, 0423326 on development.

  • Test fixtures used during development: /tmp/perplexity-test.md and /tmp/perplexity-spacing-test.md (synthetic mixed-pattern files); GitLab.md in the user's vault (real-world Perplexity-derived research file).