v0.2.0 — Ship v1 of LLM Response Parser: Paste Modal + Spec-Conformant Output
Closes the LLM-citation-parser feature to a true v1 ship. Three coordinated changes: a Paste LLM Content modal that intercepts at paste-time so the colliding-numerics problem never enters the file, a fix for Perplexity's `Title https://url` reference-def format that was being silently rejected by the refdef-body heuristic, and a spec-conformance pass on transformation output that brings inline citation spacing and reference-def body shape into line with the Lossless Citation Spec. Output now matches the spec on inline whitespace and refdef body shape; the canonical `date / author / publisher / accessed-date` envelope still requires network or AI assistance and is explicitly deferred.
Why Care?
v0.1.3 (`2026-05-01_05.md`) shipped the LLM citation parser as a review-then-convert modal: detect every `[N]` and `[N] [Title](url)` in the active file, propose a hex mapping, let the user pick rows and Apply. That was the post-hoc path: fix what's already in the doc.
This version closes the feature on three remaining gaps that together prevented it from being a real v1:
No upstream control. Pasting raw LLM output into a doc and then running the parser worked, but it left a window where two Perplexity responses (each with their own `[1]`–`[N]` series) could be pasted into the same file before the parser ran, and collide. The new Paste modal closes that window: convert at the paste boundary, never let raw LLM output enter the file.
Perplexity refdefs were being silently dropped. The original `looksLikeRefDefBody` heuristic only accepted markdown-link-at-line-start (`[Title](url)`) or bare-URL-at-line-start. Perplexity puts the title BEFORE the URL (`[1] Title https://url`), which matched neither, so Perplexity reference lists weren't being detected as refdefs at all, and conversions silently no-op'd.
Output didn't match the Lossless spec. Even when conversion worked, the output was a faithful token-substitution of the source: `text.[^hex]` (no space after period, per Perplexity's inline form), `[^hex]: Title https://url` (raw Perplexity refdef shape, not the Lossless canonical markdown-link). The spec requires single-space-before-citation and `[Title](URL)` refdef body shape; the user identified both gaps when comparing `GitLab.md` against `Lossless-Citation-Spec.md`.
The strategic point: the parser's job isn't just "convert numerics to hexes." It's "produce Lossless-conformant output from any recognized input shape, regardless of which LLM tool generated the input." This version is the first where the output actually meets that bar deterministically — without fetching, without AI calls, purely from the pasted material.
What Was Built
Commits in order
| # | Commit | Title |
|---|---|---|
| 1 | 34d3ef8 | feat(modal): paste-time LLM citation conversion at cursor |
| 2 | 02668d3 | fix(parser): recognize Perplexity refdef format ([N] Title https://url) |
| 3 | 0423326 | fix(parser): output now matches Lossless spec — inline spacing + markdown-link refdefs |
Paste LLM Content modal (commit 1)
New modal `src/modals/PasteLlmContentModal.ts` and command `paste-llm-content` ("Paste LLM Content (Convert Citations on Insert)"). The flow:
User triggers the command; the modal opens with the textarea auto-focused so `⌘V` drops them straight into pasting.
User picks the source provider (Google AI Overviews / Perplexity) via radio. Currently informational metadata; the parser auto-handles both formats. Field is carried for any future provider-specific tweaks.
Pastes the LLM output verbatim into the textarea.
Clicks Parse and Insert.
What happens on insert:
The host document's hex namespace is collected (`hexRefs` keys + `inlineHex` numbers) and passed to the parser as `additionalUsedHexes`. Generated hex IDs for the pasted content are guaranteed not to collide with citations already in the host doc.
The pasted content runs through the same `parse → transform` pipeline as the in-file modal.
Output is inserted at cursor via `editor.replaceSelection()`.
A Notice reports `inline + ref def` conversion counts; warning-level flags surface to the console.
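The first step of the insert flow, collecting the host document's namespace, could look roughly like the sketch below. This is an illustration only: `collectUsedHexes` is a hypothetical helper that scans raw text, whereas the plugin reads the namespace from its own `hexRefs` and `inlineHex` structures, and the ID alphabet (ASCII letters and digits) is an assumption.

```typescript
// Hypothetical helper, not the plugin's code: gathers every footnote-style ID
// already present in a document. The pattern matches both inline markers
// ("[^id]") and definitions ("[^id]: ..."), since a definition starts with
// the same bracketed marker.
// Assumption: IDs consist of ASCII letters and digits only.
function collectUsedHexes(docText: string): Set<string> {
  const used = new Set<string>();
  const marker = /\[\^([A-Za-z0-9]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(docText)) !== null) {
    used.add(match[1]);
  }
  return used;
}
```

The resulting set is what a caller would hand to the parser as `additionalUsedHexes`; plain numeric markers such as `[1]` are deliberately not collected because they are not part of the hex namespace.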
Service refactor to support this:
`proposeHexMapping(parseResult, additionalUsedHexes?)` now takes an optional second parameter: a `Set<string>` of hex IDs to exclude from generation in addition to those defined in the parsed content itself. The Paste modal passes the host doc's namespace; the in-file modal calls without the second argument and gets the same behavior as before.
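A minimal sketch of how an optional exclusion set can feed into ID generation is shown below. The names `proposeHexMappingSketch` and `randomHex`, the six-character ID length, and the simplified inputs (plain numeric IDs instead of a parse result) are all assumptions made for illustration; the real `proposeHexMapping` signature is the one quoted above.

```typescript
// Hypothetical random-ID generator; the length and alphabet are assumptions.
function randomHex(length: number = 6): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Sketch: map each numeric citation ID to a fresh hex ID that collides with
// neither the IDs defined in the pasted content nor the optional extra set.
function proposeHexMappingSketch(
  numericIds: number[],
  definedInContent: Set<string>,
  additionalUsedHexes?: Set<string>,
  generate: () => string = randomHex,
): Map<number, string> {
  const taken = new Set<string>();
  definedInContent.forEach((h) => taken.add(h));
  if (additionalUsedHexes) {
    additionalUsedHexes.forEach((h) => taken.add(h));
  }

  const mapping = new Map<number, string>();
  for (const n of numericIds) {
    if (mapping.has(n)) continue; // repeated numeric IDs share one hex
    let candidate = generate();
    while (taken.has(candidate)) {
      candidate = generate(); // retry until the candidate is unused
    }
    taken.add(candidate);
    mapping.set(n, candidate);
  }
  return mapping;
}
```

Calling the function without the third argument reproduces the in-file behavior, which is the property the optional parameter is meant to preserve.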
Perplexity refdef format fix (commit 2)
Bug: the refdef-body heuristic `/^\[.+\]\(.+\)/.test(body) || /^https?:\/\//.test(body)` only matched two body shapes: markdown link at line-start or bare URL at line-start. Perplexity's typical reference-list format is:
[1] Understanding GitHub Actions https://docs.github.com/articles/...
Title text first, URL at the end. Body = "Understanding GitHub Actions https://...". Doesn't start with `[` (not markdown link). Doesn't start with `https?://` (URL is at the end). Heuristic returned false, the line fell through to `scanInline`, the `[1]` got tokenized as `inline-numeric-single`, and no refdef was ever recorded. Result: `numericRefs.size === 0`, no hex mapping, transform was a no-op, user saw "the output was not transformed."
Fix: loosen to "body contains a URL anywhere": `/https?:\/\//.test(body)`. The line-anchored `[N]` prefix in the parent regex already filters out almost everything that isn't a reference definition, so allowing URLs anywhere in the body is safe and catches all three known formats (Google AI markdown-link, Perplexity title-then-URL, Lossless `@URL` shorthand).
Verified end-to-end on a Perplexity sample (9 ref defs + 19 inline single citations): all detected and transformed; zero flags.
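The difference between the two rules can be seen side by side in a small sketch. The two body checks use the regexes quoted above; the line-anchored parent pattern `NUMERIC_REFDEF_LINE` and the wrapper `isNumericRefDef` are assumptions, since the changelog does not show the parser's actual parent regex.

```typescript
// Assumed parent pattern: a line that starts with "[N]" followed by a body.
const NUMERIC_REFDEF_LINE = /^\[(\d+)\]\s+(.+)$/;

// Old rule: markdown link at body start, or bare URL at body start.
function looksLikeRefDefBodyOld(body: string): boolean {
  return /^\[.+\]\(.+\)/.test(body) || /^https?:\/\//.test(body);
}

// New rule: the body contains a URL anywhere.
function looksLikeRefDefBody(body: string): boolean {
  return /https?:\/\//.test(body);
}

// Hypothetical wrapper: a line counts as a numeric refdef only if it is
// line-anchored AND its body passes the supplied check.
function isNumericRefDef(
  line: string,
  bodyCheck: (body: string) => boolean,
): boolean {
  const m = NUMERIC_REFDEF_LINE.exec(line);
  return m !== null && bodyCheck(m[2]);
}
```

With these definitions, a Perplexity-style line such as `[1] Some Title https://example.com/a` fails the old check and passes the new one, while prose like `See [1] for details` is still rejected by the line anchor.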
Spec-conformance pass on transform output (commit 3)
The user observed two output deltas vs. `Lossless-Citation-Spec.md` after running the parser on `GitLab.md` (a real Perplexity-derived research file):
| Spec rule | Source had | Old output | New output |
|---|---|---|---|
| Single space between content and citation | `text.[1]` | `text.[^hex]` | `text. [^hex]` |
| Single space between consecutive citations | `[3] [4][5] [6]` | `[^c] [^d][^e] [^f]` | `[^c] [^d] [^e] [^f]` |
| Word-then-cite needs space too | `changes[2]` | `changes[^h]` | `changes [^h]` |
| Reference body in markdown-link form | `[1] Title https://url` | `[^h]: Title https://url` | `[^h]: [Title](https://url)` |
Two new private methods on `llmCitationParserService`:
`normalizeInlineCitationSpacing(line)`: iterative pass that inserts a space between any non-whitespace character and an immediately-following `[^hex]` marker. Catches all three boundary cases (punctuation-then-cite, word-then-cite, cite-then-cite) in a single pattern. Iterates with a safety cap of 100 so chains like `text[^a][^b][^c]` resolve fully across multiple passes (each pass fixes one boundary, regex re-scans the modified string).
`reformatRefDefBodyAsMarkdownLink(body)`: when transforming a numeric refdef to hex form, restructures the body into `[Title](URL)` shape if it isn't already. Skip-clause: if the body already contains any markdown link, leave it alone. This preserves Lossless / Google-AI inputs that arrive correctly shaped, and avoids double-wrapping anything mid-body.
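The iterative spacing pass described above can be sketched as follows. This is a reconstruction from the description, not the service's actual code: the regex, the `maxPasses` parameter name, and the assumed ID alphabet (ASCII letters and digits) are illustrative.

```typescript
// One non-whitespace character immediately followed by a "[^id]" marker.
// Assumption: IDs consist of ASCII letters and digits.
const TIGHT_CITATION = /(\S)(\[\^[A-Za-z0-9]+\])/g;

function normalizeInlineCitationSpacing(
  line: string,
  maxPasses: number = 100,
): string {
  let current = line;
  for (let pass = 0; pass < maxPasses; pass++) {
    // Insert a single space between the character and the marker.
    const next = current.replace(TIGHT_CITATION, "$1 $2");
    if (next === current) break; // fixed point reached, nothing left to fix
    current = next;
  }
  return current;
}
```

More than one pass is needed because matches cannot overlap: in `text[^a][^b][^c]` the closing bracket of `[^a]` is consumed by the first match, so the `[^a][^b]` boundary is only seen once the string is re-scanned. A marker at the very start of a line (a refdef such as `[^h1]: ...`) has no preceding character and is left untouched.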
Wired in: `normalizeInlineCitationSpacing` runs at the end of the inline-transform per-line block (after multi-comma, adjacent, and single replaces). `reformatRefDefBodyAsMarkdownLink` runs inside the `refdef-numeric` branch when emitting the new hex form. Numeric refdefs that don't get a hex (collision, unselected) keep their body verbatim; we don't restructure refdefs we aren't transforming.
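For the refdef side, a rough sketch of the body restructuring with its skip-clause might look like this. It assumes the Perplexity shape `Title https://url` with the URL as the final token; how the real method splits title from URL, and how it treats bare URLs or the Lossless `@URL` shorthand, is not stated in this changelog, so those branches are assumptions.

```typescript
function reformatRefDefBodyAsMarkdownLink(body: string): string {
  // Skip-clause: a markdown link anywhere in the body means leave it alone.
  if (/\[[^\]]+\]\([^)]+\)/.test(body)) return body;

  // Assumed shape: free-text title followed by a URL as the last token.
  const m = /^(.*?)\s*(https?:\/\/\S+)\s*$/.exec(body);
  if (m === null) return body; // no trailing URL: nothing to restructure

  const title = m[1].trim();
  const url = m[2];
  // Assumption: a bare URL with no title text is returned unchanged
  // rather than wrapped with an empty link label.
  if (title === "") return body;

  return `[${title}](${url})`;
}
```

Returning the body unchanged in every uncertain case keeps the sketch conservative: it only rewrites bodies it can split unambiguously.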
Verification
Build clean across all three commits (tsc + eslint + esbuild).
End-to-end fixture covering all three spacing cases plus refdef
restructuring:
INPUT:
changes[2] [5] [3] (word + spaced + spaced)
growth:[3] [4][5] [6] (colon + spaced + adjacent + spaced)
cases.[1] [2] (period + spaced)
[1] Title One https://example.com/one (Perplexity refdef)
OUTPUT:
changes [^h2] [^h5] [^h3]
growth: [^h3] [^h4] [^h5] [^h6]
cases. [^h1] [^h2]
[^h1]: [Title One](https://example.com/one)
User confirmed working on the GitLab.md reconversion.
What Changed in Approach (the meta-lesson)
| Pattern this rejects | Pattern this adopts |
|---|---|
| Token substitution only — preserve surrounding bytes verbatim | Token substitution PLUS a final spec-conformance pass that normalizes whitespace and reshapes refdef bodies |
| Convert in the file after pasting | Convert at the paste boundary; raw LLM output never enters the file |
| Heuristic accepts only the formats we'd written it for | Heuristic accepts any line-anchored [N] whose body contains a URL anywhere — catches all three known LLM-output shapes with one rule |
| Hex generation per-modal-instance, blind to host context | Hex generation can opt into an "exclude these hex IDs too" parameter so the same generator works for in-file and paste-into-file flows |
| Provider radio is decorative UI; parser auto-handles both formats | Provider radio is retained but marked as informational pending future provider-specific tweaks (the natural use is refdef-body restructuring per provider, but auto-detect handles both formats safely without it) |
The generalizable point: a parser's job ends not when the input is recognized but when the output meets the target spec. The v0.1.3 parser was correct on input recognition but produced non-spec-conformant output that required manual cleanup. The v0.2.0 parser does the cleanup itself. That's the difference between "a working parser" and "a feature you can hand to a content team."
Open Items
Canonical Lossless reference shape still partially unfilled. The Lossless spec wants `[^hex]: 2025, Jan 25. {Author}. [Title](url). {Publisher Name} || [Publisher](publisher_url). Accessed {Month Day, Year}.` Date, author, publisher name, and accessed date are all missing from the parser's deterministic output because they require network (URL fetch + meta-tag extraction) or AI (publisher classification, author normalization). Explicitly deferred per the user's "3 involves making a tool call that will have to run in the background." The MCP-driven canonical-citations agent spec'd in `Citation-Acquisition-Pipeline.md` covers this.
Provider radio is still informational. The parser auto-handles Google AI and Perplexity forms regardless of selection. Keeping the field for any future provider-specific tweaks (e.g., provider-specific refdef-body extraction strategies, or future format additions like Claude prose-style attribution).
Claude prose-style attribution ("Erik Brynjolfsson of Stanford Digital Economy Lab — Canaries in the Coal Mine?…") remains out of scope for the deterministic regex parser. Would need a Claude-API-driven extractor. No real workflow currently bumps into this.
Heading slug drift when an inline `[12]` in a heading becomes `[^hex]`: the auto-generated `#1-pinecone-…` anchor breaks because the bracketed text changed. Pre-existing issue inherited from `convertAllCitations`; not worsened by the new parser. A follow-up pass that detects ToC links pointing at freshly-regenerated heading slugs would handle it.
"Normalize Lossless Citation Spacing" as a standalone command. The new spec-conformance pass only runs on lines the parser transforms (i.e., lines with at least one numeric citation). For files already converted under the v0.1.3 parser, the spacing issues are baked in: re-running the parser doesn't fix them because there are no numeric citations left to trigger the transform path. A separate command that runs only the `normalizeInlineCitationSpacing` pass over the active file would retroactively fix older converted files. Holding for now; surface if users hit it in practice.
Force-push to remote still pending. All work since the `4ca2046` changelog is local-only on `development`. Dependabot won't see the lockfile changes until the branch is pushed. User indicated they'll push manually.
Files Touched
cite-wide/
├── package.json (version 0.1.3 → 0.2.0)
├── manifest.json (version 0.1.3 → 0.2.0)
├── versions.json (added 0.2.0 → minAppVersion mapping)
├── main.ts (registered paste-llm-content command + import)
├── README.md (rewrote LLM Citation Conversion section to cover both commands; added Paste LLM Content command entry)
├── CLAUDE.md (updated Status to v0.2.0; added PasteLlmContentModal to repo layout)
├── src/
│   ├── services/
│   │   └── llmCitationParserService.ts (proposeHexMapping gained additionalUsedHexes; looksLikeRefDefBody loosened; normalizeInlineCitationSpacing + reformatRefDefBodyAsMarkdownLink added)
│   └── modals/
│       └── PasteLlmContentModal.ts (created; paste-time conversion at cursor)
└── changelog/ (originally created at context-v/changelogs/; relocated to repo-root changelog/ on 2026-05-17)
    └── 2026-05-02_01.md (created; this file)
Reference
Predecessor changelogs: `2026-05-01_01.md` (deps refresh), `2026-05-01_02.md` (type-safety pass), `2026-05-01_03.md` (dedupe by URL), `2026-05-01_04.md` (deps cleanup + Tanuj-intent port), `2026-05-01_05.md` (LLM citation parser v0.1.3, modal-driven per-cluster review).
Spec doc the output now conforms to: `context-v/reminders/Lossless-Citation-Spec.md` (inline citation spacing rules + reference-section formatting requirements).
Original spec for the Paste modal: `context-v/specs/Modal-for-Pasting-LLM-Native-Content.md`.
Source-format catalog: `context-v/blueprints/Parse-Common-Citation-Formats.md` (example outputs from Google AI Chat, Perplexity, and Claude).
Future-state reference: `context-v/blueprints/Citation-Acquisition-Pipeline.md`, the broader MCP-driven canonical-citations pipeline that would fill the rest of the canonical Lossless reference shape (date / author / publisher / accessed-date). The parser shipped here is the deterministic Phase 2 piece of that broader architecture; the network/AI Phase 5 (content archival + structured extraction) remains unbuilt.
Commits: `34d3ef8`, `02668d3`, `0423326` on `development`.
Test fixtures used during development: `/tmp/perplexity-test.md` and `/tmp/perplexity-spacing-test.md` (synthetic mixed-pattern files); `GitLab.md` in the user's vault (real-world Perplexity-derived research file).