cite-wide 0.0.1.2 Published

Dedupe Citations by URL — Command, Modal, and URL-Normalizing Service for Consolidating Multi-Source Research Pastes

New 'Dedupe Citations by URL' command for the workflow where the same article gets cited under multiple hex IDs because research from Perplexity / Google AI / Claude was pasted into the same file across a session. The service finds reference-section entries that share an article URL (after normalizing host case, fragment, and tracking params), groups them, and rewrites the file so each unique URL ends up with exactly one canonical hex ID. The modal lists groups in document order with per-occurrence line links, default-checked checkboxes per group so the user can opt out of any false-positive match, and a single 'Apply Dedup' button that writes through app.vault.modify.

Why Care?

The workflow: drafting a piece of writing, asking Perplexity a question, and copy-pasting the response (citations included) into the markdown. An hour later, asking Google's AI Overview a related question and pasting that. Later still, asking Claude. Each tool's response carries its own freshly generated hex IDs for the citations it returns, but multiple sources routinely answer with citations to the same underlying articles: NYT pieces, arXiv preprints, vendor press releases. The file ends up with three different [^xxxxxx] hex IDs all pointing at the same URL.

Until now, fixing that meant either accepting the duplication or hand-rewriting every inline [^def456] to match [^abc123] and deleting the redundant reference-section entries, which is tedious and error-prone exactly in the part of the workflow where momentum is highest. The "unfinished research mid-stage" pattern Cite Wide is designed to support is the place where this duplication accumulates fastest, so the dedup tool needs to live inside Cite Wide rather than as an external script.

The strategic point of the command: URL is the right uniqueness key for a citation. Hex IDs are an internal artifact of the plugin; content creators don't think in them. They think "I cited that NYT article three times across two pasting sessions." Matching on URL directly (with normalization to bridge the trivial differences — fragment hashes, utm_* tracking params, trailing slashes) gives the tool a meaning that lines up with the author's mental model.
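To make the normalization concrete, here is a minimal sketch of the kind of comparison key described above. The function name normalizeUrlSketch, the helper tryParseUrl, and the exact rule set are assumptions for illustration; the service's actual implementation may differ in details such as which punctuation it strips.

```typescript
// Hypothetical helpers: names and exact rules are assumptions, not the shipped code.
const TRACKING_PARAM = /^(utm_.+|fbclid|gclid)$/i;

function tryParseUrl(value: string): URL | null {
  try {
    return new URL(value); // throws on anything that is not an absolute URL
  } catch {
    return null;
  }
}

function normalizeUrlSketch(raw: string): string {
  // Light cleanup first: surrounding whitespace and punctuation trailing a pasted URL.
  const cleaned = raw.trim().replace(/[.,;:!?]+$/, "");
  const url = tryParseUrl(cleaned);
  if (url === null) return cleaned; // invalid URL: fall back to the lightly cleaned input

  // Collect tracking keys first so nothing is deleted while iterating.
  const tracking: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (TRACKING_PARAM.test(key)) tracking.push(key);
  });
  tracking.forEach((key) => url.searchParams.delete(key));

  const path = url.pathname.replace(/\/+$/, ""); // strip trailing slash(es)
  // The URL parser already lowercases the host; the fragment is simply left out.
  return `${url.protocol}//${url.host}${path}${url.search}`;
}
```

Under these rules, https://Example.COM/article/?utm_source=x&id=3#top and https://example.com/article?id=3 produce the same key, while a non-tracking parameter such as id still distinguishes two URLs.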

What Was Built

`src/services/dedupeByUrlService.ts` — analyzer + applier

Two public methods, ~290 lines total. Pure functions over string content; no Obsidian API dependencies, so the service is unit-testable without a vault.

findDuplicateUrlGroups(content): DuplicateGroup[] — the analyzer.

Walks every line, matches ^\s*\[\^([a-z0-9]+)\]:\s*(.+)$ to identify reference-section entries (the : after the bracket is load-bearing; this is what distinguishes a reference definition from an inline marker).
Pulls the primary article URL out of each reference line's tail. Convention: first markdown-link form [Title](URL) if present (matches the Jina extraction format we already produce), falling back to first bare https?://... URL otherwise.
Normalizes the URL — lowercase host, drop fragment, drop tracking params (utm_*, fbclid, gclid), strip trailing slash and trailing punctuation. Survives invalid URLs by returning the lightly-cleaned input.
Groups reference entries by normalized URL; URLs with 2+ entries are duplicate groups.
For each group, scans the file for inline [^hexId] occurrences (excluding reference-definition lines via a negative lookahead on :).
Picks the canonical hex ID: earliest-inline appearance wins. Falls back to "first reference entry by line number" if no hex in the group has any inline appearances at all.
Returns groups sorted by the line number of their first occurrence, so the modal renders in document order.
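The analyzer steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's code: the name findDuplicateGroupsSketch, the RefEntry and SketchGroup shapes, and the choice to pass normalization in as a function are all hypothetical. The sketch also counts distinct hex IDs per URL rather than raw entries, which is a simplification of "2+ entries".

```typescript
// Illustrative shapes only; the service's real exported types may differ.
interface RefEntry { hexId: string; line: number }
interface SketchGroup { url: string; canonicalHexId: string; hexIds: string[]; firstLine: number }

const REF_DEF = /^\s*\[\^([a-z0-9]+)\]:\s*(.+)$/;     // reference definition (note the colon)
const MD_LINK = /\[[^\]]*\]\((https?:\/\/[^\s)]+)\)/; // [Title](URL)
const BARE_URL = /https?:\/\/\S+/;                    // fallback: first bare URL
const INLINE = /\[\^([a-z0-9]+)\](?!:)/g;             // inline marker, not followed by ":"

function findDuplicateGroupsSketch(
  content: string,
  normalize: (url: string) => string = (url) => url, // assumption: a real normalizer plugs in here
): SketchGroup[] {
  const refsByUrl = new Map<string, RefEntry[]>();
  const firstInline = new Map<string, { line: number; order: number }>();
  let order = 0;

  content.split("\n").forEach((text, index) => {
    const line = index + 1;
    const def = REF_DEF.exec(text);
    if (def) {
      const link = MD_LINK.exec(def[2]);
      const bare = BARE_URL.exec(def[2]);
      const raw = link ? link[1] : bare ? bare[0] : undefined;
      if (raw === undefined) return; // reference without a URL: nothing to group on
      const key = normalize(raw);
      const bucket = refsByUrl.get(key) ?? [];
      bucket.push({ hexId: def[1], line });
      refsByUrl.set(key, bucket);
      return;
    }
    // Record only the first inline appearance of each hex ID, in reading order.
    INLINE.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = INLINE.exec(text)) !== null) {
      if (!firstInline.has(match[1])) firstInline.set(match[1], { line, order: order++ });
    }
  });

  const groups: SketchGroup[] = [];
  refsByUrl.forEach((entries, url) => {
    const hexIds = entries.map((e) => e.hexId).filter((h, i, all) => all.indexOf(h) === i);
    if (hexIds.length < 2) return; // a URL cited under one hex ID is not a duplicate
    let canonicalHexId = entries[0].hexId; // fallback: first reference entry by line
    let bestOrder = Infinity;
    let firstLine = entries[0].line;
    for (const hexId of hexIds) {
      const seen = firstInline.get(hexId);
      if (seen === undefined) continue;
      firstLine = Math.min(firstLine, seen.line);
      if (seen.order < bestOrder) { bestOrder = seen.order; canonicalHexId = hexId; }
    }
    groups.push({ url, canonicalHexId, hexIds, firstLine });
  });
  return groups.sort((a, b) => a.firstLine - b.firstLine); // document order for the modal
}
```

Taking the normalizer as a parameter keeps the grouping logic testable on its own; with the default identity function, only byte-identical URLs group together.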

applyDedup(content, groupsToDedupe): { content, stats } — the applier.

Builds two derived structures: a replacementMap (oldHex → canonicalHex) and a referenceLinesToRemove set.
One pass through the lines:
- Lines matching ^\s*\[\^([a-z0-9]+)\]: whose hex is in the remove set are dropped entirely.
- Every other line gets a regex pass per replacement, rewriting \[\^oldHex\](?!:) → [^canonicalHex] (the negative lookahead prevents accidentally rewriting a reference-definition we intend to keep, though the previous filter already handled those).
Returns the new content plus counts (groupsDeduped, inlineReplacements, referenceLinesRemoved) for the post-apply notice.
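A compact sketch of the applier, under assumptions: the name applyDedupSketch and the GroupToDedupe and SketchStats shapes are hypothetical. It also swaps the per-replacement regex passes described above for a single callback-based pass per line, which gives the same result for well-formed groups and avoids chained rewrites when one hex maps to another.

```typescript
// Illustrative shapes only; the service's real exported types may differ.
interface GroupToDedupe { canonicalHexId: string; hexIds: string[] }
interface SketchStats { groupsDeduped: number; inlineReplacements: number; referenceLinesRemoved: number }

const REF_DEF_HEAD = /^\s*\[\^([a-z0-9]+)\]:/;

function applyDedupSketch(
  content: string,
  groups: GroupToDedupe[],
): { content: string; stats: SketchStats } {
  // Derived structures: oldHex -> canonicalHex, plus the reference lines to drop.
  const replacementMap = new Map<string, string>();
  const referenceLinesToRemove = new Set<string>();
  for (const group of groups) {
    for (const hexId of group.hexIds) {
      if (hexId === group.canonicalHexId) continue;
      replacementMap.set(hexId, group.canonicalHexId);
      referenceLinesToRemove.add(hexId);
    }
  }

  const stats: SketchStats = { groupsDeduped: groups.length, inlineReplacements: 0, referenceLinesRemoved: 0 };
  const kept: string[] = [];
  for (const line of content.split("\n")) {
    const def = REF_DEF_HEAD.exec(line);
    if (def && referenceLinesToRemove.has(def[1])) {
      stats.referenceLinesRemoved++;
      continue; // drop the duplicate reference definition entirely
    }
    // The (?!:) lookahead leaves surviving reference definitions untouched.
    const rewritten = line.replace(/\[\^([a-z0-9]+)\](?!:)/g, (marker: string, hexId: string) => {
      const canonical = replacementMap.get(hexId);
      if (canonical === undefined) return marker;
      stats.inlineReplacements++;
      return `[^${canonical}]`;
    });
    kept.push(rewritten);
  }
  return { content: kept.join("\n"), stats };
}
```

Because the function takes and returns plain strings, the modal can pass in only the groups the user left checked and report the three counts afterwards.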

Type definitions exported for the modal to consume: DuplicateOccurrence, DuplicateGroup, DedupeStats, DedupeResult.

`src/modals/DedupeByUrlModal.ts` — the UI

Follows the existing CitationModal pattern (95vw modal, cite-wide-* class vocabulary, scroll-to-line behavior on click) so it slots into the same look-and-feel.

Per group:

Checkbox (default checked) controls whether the group gets included on apply. User unchecks anything they don't want consolidated — the doc explicitly anticipates "the user can uncheck one or more if they feel there has been an error or if they don't fully understand the locations for inline and reference section."
Header shows the URL and a one-line summary of what will happen ("Keep [^abc123], remove [^def456], [^ghi789]").
Always-expanded body (no toggle) shows every occurrence — inline and reference — in line-number order, with the hex ID badge, the line number as a clickable anchor, and a 120-char preview of the line content.
"Reference" badge surfaces on reference-section rows; "(keep)" suffix on the canonical hex's badge.

Click on any "Line N" link → editor selects the marker [^hexId] at that location, scrolls into view, closes the modal.

"Apply Dedup" → calls service, writes via app.vault.modify, shows notice with the three counts.

`main.ts` — command registration

id:   dedupe-citations-by-url
name: Dedupe Citations by URL

Registered inside registerCitationCommands, immediately after show-citations so the two modal commands sit together in the palette.

What Changed in Approach (the meta-lesson)

| Pattern this command rejects | Pattern this command adopts |
| --- | --- |
| Treat hex ID as the citation's identity | Treat URL as the citation's identity; hex ID is plumbing |
| Match URLs strict-byte-equal, hope authors normalize themselves | Normalize at the comparison boundary (host case, fragment, utm_*, trailing slash) so trivial inconsistencies stop blocking matches |
| Ask the user to confirm every individual rewrite | Group by URL and let the user opt out per group; default to "yes consolidate everything we found" since false-positive groups are rare given the URL match |
| Operate on Obsidian's metadata cache (which lags during editing) | Operate on the live `editor.getValue()` so what the user sees is what we analyze |
| Mutate via regex.replace and hope for the best | Detect-then-apply split: the analyzer returns immutable groups, the applier consumes them; user-visible UI sits between the two phases |

The generalizable point: for any "merge things that look the same" operation in a content workflow, the matching key should be the authoring-domain identity, not the storage-domain identity. URL is authoring-domain; hex ID is storage-domain. Building the dedup around URL with normalization is what makes the feature feel like the user's mental model rather than a database operation.

Open Items

Orphan citation files. When the command rewrites [^def456] → [^abc123] and deletes def456's reference line, the per-citation markdown file Citations/def456.md (created earlier by citationFileService.createCitationFile) stays on disk. It's still visible in Dataview queries but no inline references point at it. Out of scope for this command, which operates on the open file's text only. A separate "Garbage Collect Orphan Citation Files" command (scan Citations/, find files whose hex doesn't appear in any vault .md, prompt-then-delete) would handle this cleanly. The blast radius (file deletion vs. text edit) justifies a separate command.
Inheriting usage stats. The canonical citation file's usageCount and filesUsedIn are not updated to reflect the usage previously recorded against the eliminated hexes. The right time to do this is during apply, before deleting the duplicate reference line: read each duplicate hex's citation file via metadataCache.getFileCache, sum usageCounts, union filesUsedIn lists, then processFrontMatter the canonical file. Defer until the orphan-cleanup command lands so the two citation-file mutations land together.
Same hex ID with two different reference URLs. Malformed input (the user accidentally wrote two [^abc123]: ... lines pointing at different URLs). This hex would land in two groups and dedup application could produce nondeterministic rewrites. Not handled specifically — the simpler story is "fix the malformed reference manually, then run dedup." If we see this in real files, add detect-and-skip with a notice.
Per-group canonical override. Right now the user can opt-out of a whole group but can't pick which hex survives within a group. In practice the "earliest-inline" default is right almost always. If users push back, the modal can grow a radio-button row per group letting them override the canonical pick. Hold for now.
No styles.css changes. The modal reuses the existing cite-wide-* class vocabulary and inline-styles only the checkbox margin. If the dedup-specific visuals need polish (hex badges, "keep" indicator color, etc.), that's a follow-on style pass.

Files Touched

cite-wide/
├── main.ts                                              (modified — added 1 command registration + 1 import)
└── src/
    ├── services/
    │   └── dedupeByUrlService.ts                        (created — analyzer + applier; ~290 lines)
    └── modals/
        └── DedupeByUrlModal.ts                          (created — list UI with checkboxes + line-link scroll; ~190 lines)

Build clean: tsc -noEmit -skipLibCheck && eslint . && esbuild production.

Reference

Workflow this serves: multi-source research pasting (Perplexity, Google AI, Claude) into the same markdown file across a session, combined with the "we often start content we do not finish" pattern that lets duplication accumulate before the author returns to clean up.
UI conventions reused from: src/modals/CitationModal.ts — modal sizing, cite-wide-* class vocabulary, scroll-to-line pattern, badge styling.
Type-safety patterns from: context-v/reminders/Obsidian-Type-Safety.md — the new code uses no any, narrows DOM lookups via instanceof HTMLElement, types Map<K, V> explicitly, and uses void-prefixed promise calls in event handlers.
Prior changelogs: 2026-05-01_01.md (deps refresh), 2026-05-01_02.md (type-safety pass).
Commit: 4d9a869 feat(dedup): command + modal to consolidate citations sharing a URL