Dedupe Citations by URL — Command, Modal, and URL-Normalizing Service for Consolidating Multi-Source Research Pastes
New 'Dedupe Citations by URL' command for the workflow where the same article gets cited under multiple hex IDs because research from Perplexity / Google AI / Claude was pasted into the same file across a session. The service finds reference-section entries that share an article URL (after normalizing host case, fragment, and tracking params), groups them, and rewrites the file so each unique URL ends up with exactly one canonical hex ID. The modal lists groups in document order with per-occurrence line links, default-checked checkboxes per group so the user can opt out of any false-positive match, and a single 'Apply Dedup' button that writes through app.vault.modify.
Why Care?
The workflow: drafting a piece of writing, asking Perplexity a question,
copy-pasting the response (citations included) into the markdown. An hour
later, asking Google's AI Overview a related question, pasting that.
Later still, asking Claude. Each tool's response carries its own freshly-
generated hex IDs for the citations it returns, but multiple sources
routinely answer with citations to the same underlying articles —
NYT pieces, arXiv preprints, vendor press releases. The file ends up
with three different [^xxxxxx] hex IDs all pointing at the same URL.
Until now, fixing that meant either accepting the duplication or hand-
rewriting every inline [^def456] to match [^abc123] and deleting
the redundant reference-section entries — tedious and error-prone
exactly in the part of the workflow where momentum is highest. The
"unfinished research mid-stage" pattern Cite Wide is designed to
support is the place where this duplication accumulates fastest, so
the dedup tool needs to live inside Cite Wide rather than as an
external script.
The strategic point of the command: URL is the right uniqueness key for a citation. Hex IDs are an internal artifact of the plugin; content creators don't think in them. They think "I cited that NYT article three times across two pasting sessions." Matching on URL directly (with normalization to bridge the trivial differences — fragment hashes, utm_* tracking params, trailing slashes) gives the tool a meaning that lines up with the author's mental model.
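A minimal sketch of the normalization described above, assuming the standard WHATWG `URL` class is available. The name `normalizeUrl`, the tracking-param pattern, and the exact punctuation set are illustrative assumptions, not the service's actual code.

```typescript
// Hypothetical helper: name and exact rules are assumptions for illustration.
const TRACKING_PARAM = /^(utm_.+|fbclid|gclid)$/i;

export function normalizeUrl(raw: string): string {
  // Drop sentence punctuation that often clings to a pasted URL.
  const cleaned = raw.trim().replace(/[.,;:!?]+$/, "");

  let parsed: URL;
  try {
    parsed = new URL(cleaned);
  } catch {
    // Unparseable input: return the lightly cleaned string instead of throwing.
    return cleaned.replace(/\/+$/, "");
  }

  // Collect keys first so deletion does not disturb iteration.
  const keys: string[] = [];
  parsed.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (TRACKING_PARAM.test(key)) parsed.searchParams.delete(key);
  }

  // The URL parser already lowercases the host for http(s) URLs.
  // Rebuilding from parts omits the fragment and lets us trim the path's trailing slash.
  const path = parsed.pathname.replace(/\/+$/, "");
  return `${parsed.protocol}//${parsed.host}${path}${parsed.search}`;
}
```

Under these rules, `https://WWW.NYTimes.com/2024/01/story/?utm_source=x&id=7#comments` and `https://www.nytimes.com/2024/01/story?id=7` compare equal. The sketch ignores credentials in the URL and does not reorder the remaining query parameters, both of which a real implementation may need to decide on.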
What Was Built
src/services/dedupeByUrlService.ts — analyzer + applier
Two public methods, ~290 lines total. Pure functions over string
content; no Obsidian API dependencies, so the service is unit-testable
without a vault.
findDuplicateUrlGroups(content): DuplicateGroup[] — the analyzer.
- Walks every line and matches `^\s*\[\^([a-z0-9]+)\]:\s*(.+)$` to identify reference-section entries (the `:` after the bracket is load-bearing; it is what distinguishes a reference definition from an inline marker).
- Pulls the primary article URL out of each reference line's tail. Convention: the first markdown-link form `[Title](URL)` if present (matches the Jina extraction format we already produce), falling back to the first bare `https?://...` URL otherwise.
- Normalizes the URL: lowercase host, drop fragment, drop tracking params (`utm_*`, `fbclid`, `gclid`), strip trailing slash and trailing punctuation. Survives invalid URLs by returning the lightly-cleaned input.
- Groups reference entries by normalized URL; URLs with 2+ entries are duplicate groups.
- For each group, scans the file for inline `[^hexId]` occurrences (excluding reference-definition lines via a negative lookahead on `:`).
- Picks the canonical hex ID: earliest-inline appearance wins. Falls back to "first reference entry by line number" if no hex in the group has any inline appearances at all.
- Returns groups sorted by the line number of their first occurrence, so the modal renders in document order.
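The analyzer steps above can be sketched roughly as follows. This is an illustration under assumptions: the type shapes, the function names, and the injectable `normalize` parameter are hypothetical and not taken from `dedupeByUrlService.ts`.

```typescript
// Hypothetical shapes; the real DuplicateGroup type may differ.
interface RefEntry { hexId: string; line: number }
interface UrlGroup { url: string; canonicalHexId: string; duplicateHexIds: string[]; firstLine: number }

const REF_DEF = /^\s*\[\^([a-z0-9]+)\]:\s*(.+)$/;
const MD_LINK = /\[[^\]]*\]\((https?:\/\/[^\s)]+)\)/;
const BARE_URL = /https?:\/\/\S+/;

// Prefer the first [Title](URL) link, else the first bare URL.
export function extractPrimaryUrl(tail: string): string | null {
  const link = MD_LINK.exec(tail);
  if (link) return link[1];
  const bare = BARE_URL.exec(tail);
  return bare ? bare[0] : null;
}

export function findDuplicateUrlGroupsSketch(
  content: string,
  normalize: (url: string) => string = (u) => u.trim(), // assumption: caller supplies real normalization
): UrlGroup[] {
  const refsByUrl = new Map<string, RefEntry[]>();
  const firstInline = new Map<string, { line: number; order: number }>();
  let order = 0; // running document-order counter for inline markers

  content.split("\n").forEach((text, index) => {
    const ref = REF_DEF.exec(text);
    if (ref) {
      const url = extractPrimaryUrl(ref[2]);
      if (url !== null) {
        const key = normalize(url);
        const bucket = refsByUrl.get(key) ?? [];
        bucket.push({ hexId: ref[1], line: index + 1 });
        refsByUrl.set(key, bucket);
      }
    }
    // Inline markers only: the (?!:) lookahead skips reference definitions.
    const inline = /\[\^([a-z0-9]+)\](?!:)/g;
    let m: RegExpExecArray | null;
    while ((m = inline.exec(text)) !== null) {
      if (!firstInline.has(m[1])) firstInline.set(m[1], { line: index + 1, order });
      order += 1;
    }
  });

  const groups: UrlGroup[] = [];
  refsByUrl.forEach((refs, url) => {
    if (refs.length < 2) return; // a URL with one entry is not a duplicate
    let canonicalHexId = refs[0].hexId; // fallback: first reference entry by line
    let bestOrder = Infinity;
    let firstLine = refs[0].line;
    for (const ref of refs) {
      const seen = firstInline.get(ref.hexId);
      if (!seen) continue;
      firstLine = Math.min(firstLine, seen.line);
      if (seen.order < bestOrder) {
        bestOrder = seen.order;
        canonicalHexId = ref.hexId;
      }
    }
    const duplicateHexIds = refs.map((r) => r.hexId).filter((h) => h !== canonicalHexId);
    groups.push({ url, canonicalHexId, duplicateHexIds, firstLine });
  });
  return groups.sort((a, b) => a.firstLine - b.firstLine);
}
```

Tracking a running `order` counter rather than just a line number lets two markers on the same line still resolve to the one that appears first.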
applyDedup(content, groupsToDedupe): { content, stats } — the applier.
- Builds two derived structures: a `replacementMap` (oldHex → canonicalHex) and a `referenceLinesToRemove` set.
- One pass through the lines:
  - Lines matching `^\s*\[\^([a-z0-9]+)\]:` whose hex is in the remove set are dropped entirely.
  - Every other line gets a regex pass per replacement, rewriting `\[\^oldHex\](?!:)` → `[^canonicalHex]` (the negative lookahead prevents accidentally rewriting a reference definition we intend to keep, though the previous filter already handled those).
- Returns the new content plus counts (`groupsDeduped`, `inlineReplacements`, `referenceLinesRemoved`) for the post-apply notice.
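A rough sketch of the applier, for illustration only. The input shape `GroupToDedupe` is hypothetical, and the sketch swaps the per-replacement regex passes described above for a single callback-based pass per line, which should give the same result because a canonical hex never appears as a key in the map.

```typescript
// Hypothetical input/output shapes; the real DuplicateGroup and DedupeStats may differ.
interface GroupToDedupe { canonicalHexId: string; duplicateHexIds: string[] }
interface StatsSketch { groupsDeduped: number; inlineReplacements: number; referenceLinesRemoved: number }

export function applyDedupSketch(
  content: string,
  groups: GroupToDedupe[],
): { content: string; stats: StatsSketch } {
  const replacementMap = new Map<string, string>(); // oldHex -> canonicalHex
  const referenceLinesToRemove = new Set<string>(); // hexes whose definition line is dropped
  for (const group of groups) {
    for (const oldHex of group.duplicateHexIds) {
      replacementMap.set(oldHex, group.canonicalHexId);
      referenceLinesToRemove.add(oldHex);
    }
  }

  const stats: StatsSketch = { groupsDeduped: groups.length, inlineReplacements: 0, referenceLinesRemoved: 0 };
  const refDef = /^\s*\[\^([a-z0-9]+)\]:/;
  const kept: string[] = [];

  for (const line of content.split("\n")) {
    const ref = refDef.exec(line);
    if (ref && referenceLinesToRemove.has(ref[1])) {
      stats.referenceLinesRemoved += 1;
      continue; // drop the redundant reference definition
    }
    // Rewrite inline markers; (?!:) leaves surviving definitions untouched.
    const rewritten = line.replace(/\[\^([a-z0-9]+)\](?!:)/g, (marker: string, hex: string) => {
      const canonical = replacementMap.get(hex);
      if (canonical === undefined) return marker;
      stats.inlineReplacements += 1;
      return `[^${canonical}]`;
    });
    kept.push(rewritten);
  }
  return { content: kept.join("\n"), stats };
}
```

Because the function only maps strings to strings, it can be exercised in a unit test without a vault, which matches the "pure functions over string content" goal stated for the service.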
Type definitions exported for the modal to consume:
DuplicateOccurrence, DuplicateGroup, DedupeStats, DedupeResult.
src/modals/DedupeByUrlModal.ts — the UI
Follows the existing CitationModal pattern (95vw modal, cite-wide-*
class vocabulary, scroll-to-line behavior on click) so it slots into
the same look-and-feel.
Per group:
- Checkbox (default checked) controls whether the group gets included on apply. The user unchecks anything they don't want consolidated; the doc explicitly anticipates "the user can uncheck one or more if they feel there has been an error or if they don't fully understand the locations for inline and reference section."
- Header shows the URL and a one-line summary of what will happen ("Keep `[^abc123]`, remove `[^def456]`, `[^ghi789]`").
- Always-expanded body (no toggle) shows every occurrence, inline and reference, in line-number order, with the hex ID badge, the line number as a clickable anchor, and a 120-char preview of the line content.
- "Reference" badge surfaces on reference-section rows; "(keep)" suffix on the canonical hex's badge.
Click on any "Line N" link → editor selects the marker [^hexId] at
that location, scrolls into view, closes the modal.
"Apply Dedup" → calls service, writes via app.vault.modify, shows
notice with the three counts.
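The "Line N" click and the 120-char preview can be reduced to two small pure helpers, sketched below. Both names are hypothetical, and the ellipsis truncation is an assumption about how the preview is cut. The position shape mirrors Obsidian's 0-based `{ line, ch }` editor positions.

```typescript
// Mirrors the { line, ch } shape of an editor position (both 0-based).
interface PosSketch { line: number; ch: number }

// Hypothetical helper: locate the marker on a 1-based display line and
// return the range an editor selection would cover, or null if absent.
export function markerSelection(
  lineText: string,
  displayLineNumber: number,
  hexId: string,
): { from: PosSketch; to: PosSketch } | null {
  const marker = `[^${hexId}]`;
  const ch = lineText.indexOf(marker);
  if (ch === -1) return null;
  const line = displayLineNumber - 1; // the modal shows 1-based line numbers
  return { from: { line, ch }, to: { line, ch: ch + marker.length } };
}

// Hypothetical helper: trim and cap a line for display in the occurrence list.
export function previewLine(text: string, max: number = 120): string {
  const trimmed = text.trim();
  return trimmed.length <= max ? trimmed : trimmed.slice(0, max - 3) + "...";
}
```

In the modal, the returned range would presumably be handed to the editor's selection and scroll-into-view calls before the modal closes; keeping the arithmetic in a pure function makes the off-by-one between display lines and editor lines easy to test.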
main.ts — command registration
id: dedupe-citations-by-url
name: Dedupe Citations by URL

Registered inside `registerCitationCommands`, immediately after `show-citations` so the two modal commands sit together in the palette.
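A hedged sketch of how such a registration might be wired. To stay self-contained it uses small structural stand-ins (`EditorLike`, `CommandHost`) instead of importing from `obsidian`; in the plugin these roles would be played by Obsidian's editor object and the plugin's `addCommand`. The `analyze`, `openModal`, and `notify` parameters and the notice text are hypothetical.

```typescript
// Structural stand-ins so the sketch compiles without the obsidian package.
interface EditorLike { getValue(): string }
interface CommandSpec {
  id: string;
  name: string;
  editorCallback: (editor: EditorLike) => void;
}
interface CommandHost { addCommand(command: CommandSpec): void }

export function registerDedupeCommandSketch<G>(
  host: CommandHost,
  analyze: (content: string) => G[],   // e.g. the service's analyzer
  openModal: (groups: G[]) => void,    // e.g. construct and open the modal
  notify: (message: string) => void,   // e.g. show a notice
): void {
  host.addCommand({
    id: "dedupe-citations-by-url",
    name: "Dedupe Citations by URL",
    editorCallback: (editor) => {
      // Analyze the live editor text, not a cached copy of the file.
      const groups = analyze(editor.getValue());
      if (groups.length === 0) {
        notify("No duplicate citation URLs found."); // assumed wording
        return;
      }
      openModal(groups);
    },
  });
}
```

Passing the analyzer and modal opener in as functions keeps the registration testable with fakes, at the cost of a slightly longer signature.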
What Changed in Approach (the meta-lesson)
| Pattern this command rejects | Pattern this command adopts |
| --- | --- |
| Treat hex ID as the citation's identity | Treat URL as the citation's identity; hex ID is plumbing |
| Match URLs strict-byte-equal, hope authors normalize themselves | Normalize at the comparison boundary (host case, fragment, utm_*, trailing slash) so trivial inconsistencies stop blocking matches |
| Ask the user to confirm every individual rewrite | Group by URL and let the user opt out per group; default to "yes consolidate everything we found" since false-positive groups are rare given the URL match |
| Operate on Obsidian's metadata cache (which lags during editing) | Operate on the live editor.getValue() so what the user sees is what we analyze |
| Mutate via regex.replace and hope for the best | Detect-then-apply split: the analyzer returns immutable groups, the applier consumes them; user-visible UI sits between the two phases |
The generalizable point: for any "merge things that look the same" operation in a content workflow, the matching key should be the authoring-domain identity, not the storage-domain identity. URL is authoring-domain; hex ID is storage-domain. Building the dedup around URL with normalization is what makes the feature feel like the user's mental model rather than a database operation.
Open Items
- Orphan citation files. When the command rewrites `[^def456]` → `[^abc123]` and deletes `def456`'s reference line, the per-citation markdown file `Citations/def456.md` (created earlier by `citationFileService.createCitationFile`) stays on disk. It's still visible in Dataview queries but no inline references point at it. Out of scope for this command, which operates on the open file's text only. A separate "Garbage Collect Orphan Citation Files" command (scan `Citations/`, find files whose hex doesn't appear in any vault `.md`, prompt-then-delete) would handle this cleanly. The blast radius (file deletion vs. text edit) justifies a separate command.
- Inheriting usage stats. The canonical citation file's `usageCount` and `filesUsedIn` are not updated to reflect the usage previously recorded against the eliminated hexes. The right time to do this is during apply, before deleting the duplicate reference line: read each duplicate hex's citation file via `metadataCache.getFileCache`, sum `usageCount`s, union `filesUsedIn` lists, then `processFrontMatter` the canonical file. Defer until the orphan-cleanup command lands so the two citation-file mutations land together.
- Same hex ID with two different reference URLs. Malformed input (the user accidentally wrote two `[^abc123]: ...` lines pointing at different URLs). This hex would land in two groups and dedup application could produce nondeterministic rewrites. Not handled specifically; the simpler story is "fix the malformed reference manually, then run dedup." If we see this in real files, add detect-and-skip with a notice.
- Per-group canonical override. Right now the user can opt out of a whole group but can't pick which hex survives within a group. In practice the "earliest-inline" default is right almost always. If users push back, the modal can grow a radio-button row per group letting them override the canonical pick. Hold for now.
- No styles.css changes. The modal reuses the existing `cite-wide-*` class vocabulary and inline-styles only the checkbox margin. If the dedup-specific visuals need polish (hex badges, "keep" indicator color, etc.), that's a follow-on style pass.
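For the "same hex ID with two different reference URLs" item, a detect step could look roughly like the sketch below. This is not implemented in the command; the function name is hypothetical, and the sketch compares raw URLs without normalization to keep it short.

```typescript
// Hypothetical detector: returns hex IDs whose reference definitions
// point at more than one distinct URL within the same file.
export function findConflictingHexIds(content: string): string[] {
  const refDef = /^\s*\[\^([a-z0-9]+)\]:\s*(.+)$/;
  const firstUrl = /https?:\/\/[^\s)]+/;
  const urlsByHex = new Map<string, string[]>();

  for (const line of content.split("\n")) {
    const ref = refDef.exec(line);
    if (!ref) continue;
    const url = firstUrl.exec(ref[2]);
    if (!url) continue;
    const seen = urlsByHex.get(ref[1]) ?? [];
    if (seen.indexOf(url[0]) === -1) seen.push(url[0]); // record each distinct URL once
    urlsByHex.set(ref[1], seen);
  }

  const conflicts: string[] = [];
  urlsByHex.forEach((urls, hex) => {
    if (urls.length > 1) conflicts.push(hex);
  });
  return conflicts;
}
```

A caller could run this before the analyzer, skip any group that contains a conflicting hex, and surface the skipped IDs in a notice.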
Files Touched
cite-wide/
├── main.ts (modified: added 1 command registration + 1 import)
└── src/
    ├── services/
    │   └── dedupeByUrlService.ts (created: analyzer + applier; ~290 lines)
    └── modals/
        └── DedupeByUrlModal.ts (created: list UI with checkboxes + line-link scroll; ~190 lines)

Build clean: tsc -noEmit -skipLibCheck && eslint . && esbuild production.
Reference
- Workflow this serves: described in this turn's user message. Multi-source research pasting (Perplexity, Google AI, Claude) into the same markdown file across a session, with the "we often start content we do not finish" pattern that lets duplication accumulate before the author returns to clean up.
- UI conventions reused from: `src/modals/CitationModal.ts` (modal sizing, `cite-wide-*` class vocabulary, scroll-to-line pattern, badge styling).
- Type-safety patterns from: `context-v/reminders/Obsidian-Type-Safety.md`. The new code uses no `any`, narrows DOM lookups via `instanceof HTMLElement`, types `Map<K, V>` explicitly, and uses `void`-prefixed promise calls in event handlers.
- Prior changelogs: `2026-05-01_01.md` (deps refresh), `2026-05-01_02.md` (type-safety pass).
- Commit: `4d9a869 feat(dedup): command + modal to consolidate citations sharing a URL`