Lossless Citation Standards — Schema, Audiences, Portability Thesis

How to read this doc. Sections are grouped by audience: the WHY sections (Thesis, Portability Spine) are mandatory context for anyone — human or AI agent — touching the schema. The WHAT sections (Schema, Source-Type Matrix, Audiences) are the actual specification. The HOW is deferred to the two sibling docs.

For an AI agent reading this as prompt context: read top-to-bottom on first pass; re-read the Source-Type Matrix per source you process. Do not invent fields. Do not omit Portability Spine fields. When you don’t know a value, leave the field empty — never fabricate.

Why Care? — The Portability Thesis

Citations in this system are not bibliographic decoration. They are the interop anchor between three knowledge tiers the Lossless Group is consciously building:

Personal knowledge base — what one contributor reads, captures, and references in their own working files.
Organizational knowledge base — the “communist-style” pool where any contributor can use any other contributor’s captured content. Cross-vault, cross-project. All of our content belongs to all of us.
Beyond-human-scale derived layers — vector databases, RAG/KAG retrieval pipelines, fine-tuning corpora. The substrate that lets agentic systems (e.g. the Investment Memo Orchestrator we’re building separately) reason from our accumulated context rather than from live web search alone.

The schema below is designed for the third tier first, even though most contributors only see the first two. A field like first_accessed_at_url looks like trivial bibliographic metadata when you’re writing a blog post. It’s a deduplication key when 300 documents reference the same article through different display URLs and a vector pipeline is trying not to triple-count the same source.

The unifying mental model: treat content as data we intend to keep, port, reuse, and continue to build out. Every field exists because some downstream consumer — perhaps not yet built — will need it.

The Portability Spine

Five fields are load-bearing for the cross-tier story. If a citation has these five, it can move between tiers without identity loss. Everything else is display, classification, or annotation.

Field	Why it’s spine
`internal_uuid`	The stable cross-system anchor. Survives vault renames, file moves, copy-paste between vaults, eventual migration into a structured database. UUID generation is mechanical; the commitment to having one at capture time is the architectural decision.
`first_accessed_at_url`	The original URL we retrieved from. Anchoring and attribution — what we cited the first time. Survives even when the source’s URL drifts (slug rename, redirect, paywall move).
`recently_accessed_at_url`	The current URL when content has drifted. Lets a downstream agent or content-archive process know where to re-fetch. Often equals `first_accessed_at_url`; only diverges when drift is detected.
`downloaded_content_path`	The bit-preserving record. Even if the publisher takes the page down, the source we cited is still on disk. This is what makes our citations honest in a year.
`structured_data_path`	The contract that “the JSON-y form of this source exists somewhere reachable.” Lets a downstream consumer (Investment Memo Orchestrator, vector DB ingest, etc.) skip re-fetching and re-extracting — pull from structured shape directly.

Backfill priority for an existing Citations folder is always: spine first, everything else second. The spine fields unblock all downstream work; the rest is polish.

Naming Conventions (load-bearing — apply consistently)

Surface	Convention	Example
YAML property names	`snake_case` (underscore)	`internal_uuid`, `default_slug`, `publisher_favicon_url`
`default_slug` value	`snake-case` / kebab-case (dash)	`vector-databases-2026-market-analysis`
`tags` values	`Train-Case` (capitalized words, dash-separated, no spaces)	`Vector-Databases`, `RAG-Pipelines`, `Investment-Memo`
`publisher_type` values	`Train-Case` from controlled taxonomy below	`Market-Research-Organization`

Obsidian does not allow spaces in tags. We chose Train-Case (over kebab-case) for tag values so the tag taxonomy reads cleanly when humans browse it; we chose snake_case for properties because it’s the YAML/Python idiom most consumers will expect.

Audiences and Influences

Lossless citations need to render usefully into multiple downstream conventions. The schema is rendering-target-neutral — same data, different format depending on where the citation is consumed. The styles below are what we currently render into; new styles plug in by writing a new renderer over the same schema.

Lossless House Style (default)

Our house style for footnote-form citations in Markdown. See context-v/reminders/Lossless-Citation-Spec.md for the inline-citation rules ([^hexcode] placement, spacing, reference-section formatting). The reference-line shape is:

[^{hexcode}]: 2025, Jan 25. {Author Surname, First Name}. [Title of the source](url). Source Publisher Name || [Source Publisher Name](url). Accessed {Month Day, Year}.

Maps to the schema as: date_published (split into year/month/day), authors[0], title, recently_accessed_at_url (what we link to), publisher (+ optional publisher_url), date_recently_accessed.

Obsidian-Native

Default to GitHub Flavored Markdown + Remark/CommonMark conventions. Footnote syntax with [^hexcode] markers; reference definitions at the bottom of the file. Obsidian’s autocomplete handles enum-like fields without us enforcing them. The Lossless House Style is a tightening of the Obsidian-native baseline.

Academic Style

Renders to APA / MLA / Chicago / etc. Required fields beyond the spine: authors (full list, ordered), date_published (year minimum), title, subtitle (when journal/book has one), publisher (journal name, conference name, or book publisher), edition_or_version (for books, software, datasets). lede is not used; academic styles treat subtitle and abstract as separate concerns.

TODO (human): Confirm which academic style is the default rendering target — APA seems most common in our market work; MLA appears in some humanities contributions. May want a per-document citation_style_override for sources that warrant non-default rendering.

Market Analyst Style

The shape used in published research notes from research firms (Gartner, Forrester, Bloomberg Intelligence, McKinsey/BCG insights pieces, niche trackers like Pitchbook, Preqin, Crunchbase, SimilarWeb). Required: publisher (the firm), publisher_type: "Market-Research-Organization" or similar, title, date_published, often authors (analyst names matter). subtitle common; lede rare. Often paired with api_provider_* fields when the source is delivered via a data-as-a-service API rather than as a public webpage.

Web-Ready Style

Magazine, blog, newsletter, social media, UGC. Required: first_accessed_at_url, title, lede (the magazine-style hook), publisher (the site name, normalized — “The New York Times” not “nytimes.com”), publisher_favicon_url (site renderers use it for visual cues), piece_og_image (for hero treatment). subtitle is rarely used here; lede replaces it.

TODO (human): Decide whether lede is auto-generated from the first paragraph by AI when not present in OG metadata, or only ever populated when the source provides one. Consequences: AI-generated lede mixes our voice with the source’s; source-only lede will be empty for most blog posts.

Publisher and Publication Types

The publisher_type field is a controlled-but-not-enforced Train-Case taxonomy. It drives the Source-Type → Field Applicability Matrix below — different types call for different field subsets.

`publisher_type` value	Definition	Examples
`Academic-Journal`	Peer-reviewed scholarly publication	Nature, JFE, Quarterly Journal of Economics
`Academic-Working-Paper`	Pre-publication or non-peer-reviewed academic	NBER, SSRN, arXiv
`Market-Research-Organization`	For-sale industry research	Gartner, Forrester, IDC, Omdia
`Think-Tank`	Policy / strategy non-profit	Brookings, RAND, Council on Foreign Relations
`Government-Agency`	National or supranational government	BLS, ECB, OECD, World Bank
`Consulting-Firm`	Strategy/management consulting public output	McKinsey Quarterly, BCG Insights, Bain Reports
`Industry-Media`	Trade publications & news	TechCrunch, The Information, Bloomberg, FT
`Individual-Researcher`	Personal blog / Substack of a credentialed individual	Stratechery, Marginal Revolution, Eugene Wei
`Content-Creator`	YouTube, podcast, or platform-native creator	Lex Fridman podcast, AI Explained YouTube
`Social-Media-User`	Single-platform-resident author (no own site)	LinkedIn poster, Twitter/X user, Mastodon user
`Data-as-a-Service-Provider`	Source delivered primarily via API or paid platform	Crunchbase, Preqin, Pitchbook, SimilarWeb, Traxn
`Content-Marketing`	Vendor-published content with brand-promotion intent	Hubspot blog, AWS blog, vendor whitepapers
`UGC-Community`	User-generated discussion	Reddit thread, Quora answer, Stack Overflow post, Hacker News thread
`Other`	Doesn’t fit cleanly elsewhere — flag for review	(used sparingly)

AI agent note: When you cannot confidently classify a source, use Other and surface the ambiguity to a human reviewer. Do not guess between two plausible types.

Schema (full YAML)

The schema is grouped into logical sections below. All fields are optional unless marked Required. Empty strings and empty arrays are valid — leaving fields off is fine. The agent should not fabricate values to make a field present.

Read the YAML block below as an illustrative example, not a per-citation requirement. Different citation types populate different property sets — see the Source-Type → Field Applicability Matrix later in this document for the per-type expectations. A real canonical citation file will fill some sections and leave others entirely off; that is the intended shape, not an incomplete one. The block exists to enumerate every field name and document its purpose in one place, not to suggest every citation should carry every field.

# ─── Identity (Portability Spine pt. 1) ──────────────────────────────────
internal_uuid: "{uuid-v4}"           # Required (always-include). Stable cross-system anchor. UUID v4. Must be unique across the entire knowledge base. Generation is deterministic; persistence is not negotiable.
reference_hexcode: "abc123"          # Required for canonical sources (sources promoted into the Citations folder). 6+ chars from [a-z0-9]. Used in inline `[^reference_hexcode]` markers across vault. Generated by cite-wide plugin.
default_slug: "example-slug"         # Required. snake-case (kebab-case) of the meaningful part of the title. Must be unique within the canonical-citations folder. Human-readable; works as a URL path component.

# ─── Bibliographic core ──────────────────────────────────────────────────
title: "Example Title"               # Required.
subtitle: "Example Subtitle"         # Used by academic, book, and market-analyst sources. Mutually exclusive with `lede` per source — don't fill both.
lede: "Example Lede"                 # Magazine/blog hook. Mutually exclusive with `subtitle` per source.
authors:                             # Empty array OK if anonymous; otherwise list in publication order.
  - "Author One"
  - "Author Two"
date_published: 2024-10-15           # ISO 8601. Missing day common — render gracefully. Missing month rare but possible — same.
edition_or_version: "1.0"            # Books, journals, datasets, software releases. Free-form: "Second Edition", "Version 2.0", "v0.42.1".

# ─── Publisher / source-of-source ────────────────────────────────────────
publisher: "Example Publisher"       # Normalized name. "The New York Times", not "nytimes.com" or "NYT".
publisher_url: "https://example.com" # Canonical site root, not a section page.
publisher_type:                      # Single value from the Train-Case taxonomy. Drives field applicability.
  - "Industry-Media"
publisher_favicon_url: "https://example.com/favicon.ico"  # Web-ready rendering needs this.

# ─── Access & retrieval (Portability Spine pt. 2) ────────────────────────
first_accessed_at_url: "https://example.com/article"        # Required. The URL we *originally* retrieved from. Anchoring + attribution. Spine field.
date_added: 2026-04-15                                      # When this citation entry was created in our system.
recently_accessed_at_url: "https://example.com/article-v2"  # The CURRENT URL if the source has drifted (slug rename, redirect, mirror). Often = first_accessed_at_url; only diverges on drift detection.
date_recently_accessed: 2026-04-20                          # When we last successfully retrieved (by either URL).

# ─── Media (display & social-share assets) ───────────────────────────────
piece_og_image: "https://example.com/article/og-image.jpg"
piece_thumbnail_url: "https://example.com/article/thumbnail.jpg"

# ─── API / structured-data provider ──────────────────────────────────────
api_provider_url: "https://api.example.com/v1/"             # The API root, when the source is API-deliverable.
api_provider_name: "Example API Provider"                   # Normalized. "Google Books API", "Crunchbase API".
api_source_url: "https://api.example.com/v1/article-slug"   # The specific endpoint that returned this source.

# ─── Content archival (Portability Spine pt. 3) ──────────────────────────
downloaded_content_path: "Citations/_archive/abc123.html"   # Spine field. Local path to the bit-preserving copy.
structured_data_path: "Citations/_archive/abc123.json"      # Spine field. Local path to the structured-extraction output.

# ─── Annotation & cross-reference ────────────────────────────────────────
cited_in_files:                      # Tracked automatically by the plugin; do not hand-edit.
  - "file1.md"
  - "file2.md"
tags:                                # Train-Case. Categories, topics, dossier tags, project tags. Hierarchy resolved at consumer level.
  - "Tag-One"
  - "Tag-Two"

# ─── Document-grounded citation fields ───────────────────────────────────
# APPLICABILITY: only populate these when the source is an attached document
# (PDF / plain-text / custom-content) cited via the Anthropic Citations API
# or an equivalent. URL-based citations (web search results from any provider)
# do NOT populate these fields. See Lossless-Citation-Spec.md →
# "Document-Grounded Citations" for the markdown reference-def shape.
# Currently a deferred / future-state concern — most citations will not have these.
cited_excerpt: "The verbatim passage the model quoted from the document"
                                     # The exact text returned in Anthropic's `cited_text` field.
                                     # Insert verbatim — no length cap, no truncation.
cited_location:                      # Position info inside the source document. Pick the shape that matches the source type.
  type: "page"                       # one of: "page" | "char" | "block"
  document_title: "Source Document"  # Display title of the cited document (mirrors `document_title` from the API).
  document_path: "Citations/_attached/source.pdf"
                                     # Local vault-relative path to the document. Default directory: Citations/_attached/
  start_page: 12                     # page-only; 1-indexed inclusive (already converted from API exclusive end)
  end_page: 13                       # page-only; 1-indexed inclusive
  start_char: 1234                   # char-only; 0-indexed inclusive (preserves API convention)
  end_char: 1456                     # char-only; 0-indexed exclusive (preserves API convention)
  start_block: 0                     # block-only; 0-indexed inclusive (preserves API convention)
  end_block: 1                       # block-only; 0-indexed exclusive (preserves API convention)

AI agent note (output contract): Emit exactly the field names above. Do not introduce new fields without updating this spec. Do not fabricate values; emit empty strings or omit fields when the source-of-truth lookup fails. The pipeline doc specifies which lookup paths to try and in what order.

Source-Type → Field Applicability Matrix

For each source type, this matrix lists fields that are expected (E), optional (O), or not applicable (—). The agent uses this to decide which fields to attempt and which to skip.

Field \ Type	Acad-Journal	Acad-WP	Mkt-Research	Think-Tank	Gov-Agency	Industry-Media	Indiv-Researcher	Content-Creator	Social-Media	DaaS-Provider	Content-Marketing	UGC-Community
`internal_uuid`	E	E	E	E	E	E	E	E	E	E	E	E
`reference_hexcode`	E	E	E	E	E	E	E	E	E	E	E	E
`default_slug`	E	E	E	E	E	E	E	E	E	E	E	E
`title`	E	E	E	E	E	E	E	E	E	E	E	E
`subtitle`	E	O	E	O	O	—	—	—	—	—	—	—
`lede`	—	—	—	—	—	E	E	O	—	—	E	—
`authors`	E	E	E	O	—	E	E	E	E	—	O	E
`date_published`	E	E	E	E	E	E	E	E	E	E	E	E
`edition_or_version`	E	O	O	O	O	—	—	—	—	E	—	—
`publisher`	E	E	E	E	E	E	O	E	E	E	E	E
`publisher_url`	E	E	E	E	E	E	E	E	E	E	E	E
`publisher_type`	E	E	E	E	E	E	E	E	E	E	E	E
`publisher_favicon_url`	O	O	O	O	E	E	E	E	E	O	E	E
`first_accessed_at_url`	E	E	E	E	E	E	E	E	E	E	E	E
`recently_accessed_at_url`	O	O	O	O	O	O	O	O	O	O	O	O
`date_added`	E	E	E	E	E	E	E	E	E	E	E	E
`date_recently_accessed`	E	E	E	E	E	E	E	E	E	E	E	E
`piece_og_image`	O	O	O	O	O	E	E	E	E	—	E	O
`piece_thumbnail_url`	O	O	O	O	O	E	E	E	E	—	E	O
`api_provider_url`	O	O	E	—	O	—	—	—	—	E	—	—
`api_provider_name`	O	O	E	—	O	—	—	—	—	E	—	—
`api_source_url`	O	O	E	—	O	—	—	—	—	E	—	—
`downloaded_content_path`	E	E	E	E	E	E	E	O	O	E	O	O
`structured_data_path`	E	E	E	E	E	O	O	—	—	E	—	—
`cited_in_files`	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)	(auto)
`tags`	E	E	E	E	E	E	E	E	E	E	E	E

AI agent note: “Expected” means: the agent must attempt to fill this field and surface a warning if it cannot. “Optional” means: fill if cheaply available; do not chase. “Not applicable” means: do not emit even if data is somehow available (it would be misleading on this source type).

Note on recently_accessed_at_url: Always Optional in this matrix because the agent only fills it when URL drift is detected. On first capture, it equals first_accessed_at_url and may be omitted (renderers default to first_accessed_at_url when this is empty).

Human reviewer note: Override the matrix at the per-source level if a specific source warrants a non-default field set (e.g. a Substack post that’s structurally an academic working paper).

Minimum-Viable Citation

For a citation to be considered valid (i.e. ingestible by downstream tooling without warnings), it must have:

All five Portability Spine fields: internal_uuid, first_accessed_at_url, recently_accessed_at_url (may equal first), downloaded_content_path, structured_data_path. (For ad-hoc citations not yet promoted to canonical, the two _path fields may be empty strings — the validator emits a warning but not an error.)
title
date_added
publisher_type (so the matrix above can be applied for further checks)
tags non-empty (downstream search and clustering rely on tag presence)

A citation that lacks any of these is incomplete — usable in display contexts but not in the canonical archive. The acquisition pipeline (sibling doc) defines the agent’s behavior when a minimum-viable threshold cannot be met.

Relationship to the In-Code `CitationMetadata` Interface

The schema above is the authoritative on-disk shape for any citation promoted into the Citations folder. It is the reference for what canonical citations look like, what fields downstream consumers can rely on, and what shape the acquisition pipeline targets when it fills missing fields.

The cite-wide plugin also defines an in-code TypeScript interface, CitationMetadata, in src/services/citationFileService.ts. That interface is not a competing schema — it is a simplified operational projection used while the plugin is reading, writing, and managing citation files in memory. The current shape:

interface CitationMetadata {
    hexId: string;
    title: string | undefined;
    author: string | undefined;
    url: string | undefined;
    date: string | undefined;
    source: string | undefined;
    tags: string[];
    created: string;
    lastModified: string;
    referenceText: string | undefined;
    usageCount: number;
    filesUsedIn: string[];
}

This shape predates the full schema above and has not yet caught up. It is the gap between current implementation and current thinking — not an alternative tier the schema is meant to coexist with. Treat it as legacy-shaped; do not introduce new code that depends on its narrow shape.

Known gaps to close as the interface evolves

The interface needs to grow toward the on-disk shape. Specific gaps:

In-code field	Replace / extend with	Notes
`author: string`	`authors: string[]`	Plural per the schema; even single-author sources go in an array.
`url: string`	`first_accessed_at_url`, `recently_accessed_at_url`	Splits to capture URL drift.
`date: string`	`date_published`, `date_added`, `date_recently_accessed`	The schema distinguishes three time concerns; the current single `date` collapses them.
`source: string`	`publisher`, `publisher_url`, `publisher_type`, `publisher_favicon_url`	The “source” concept gets unpacked into a small object.
(none)	`internal_uuid`, `default_slug`, `reference_hexcode`	Identity / spine fields not present today.
(none)	`downloaded_content_path`, `structured_data_path`	Spine fields for portability not present today.
(none)	`cited_excerpt`, `cited_location`	Document-grounded fields (deferred — see “Document-grounded citation fields” above).
`created`, `lastModified`, `referenceText`, `usageCount`, `filesUsedIn`	(operational; keep)	Plugin-internal bookkeeping. `filesUsedIn` corresponds to `cited_in_files` in the schema; the rest are operational and don’t appear in the on-disk YAML.

The evolution is incremental — each of these gaps can be closed in a separate change without breaking existing citation files (additive YAML reads back as missing fields, which the schema already permits).

Note: The earlier “Big-bang vs Dual-shape vs Opt-in” migration framing has been retired. The schema is the target; the in-code interface migrates toward it field-by-field as the plugin is touched. New code should use the schema’s field names directly when reading or writing YAML frontmatter, even if the in-memory interface still uses the legacy shape elsewhere.

Open Decisions (TODOs)

These are the spec-level open questions. Each blocks some piece of the implementation but not the rest.

TODO (human): Default academic rendering style — APA / MLA / Chicago / other? Allow per-document citation_style_override?
TODO (human): lede auto-generation policy — only when source provides, or AI-generate when missing? Consequences: AI-generated lede mixes our voice with the source’s; source-only lede will be empty for most blog posts.
TODO (human): tags taxonomy governance — controlled vocabulary or organic growth with periodic consolidation passes? If controlled, where is the canonical list stored?
TODO (human + AI): Migration model from current CitationMetadata — see Coexistence section. Lean: dual-shape.
TODO (AI): Validate the Source-Type Matrix against real sources — first pass was reasoned-from-template; some cells need empirical checking. Particularly Social-Media and UGC-Community rows.
TODO (human): Should recently_accessed_at_url always be present (mirror to first_accessed_at_url on first capture for symmetry), or only present on detected drift (current draft)?
TODO (AI): When subtitle and lede data both seem present in source meta tags, what’s the decision rule? Currently “by source type” via the matrix — but a tricky case is a magazine-style academic paper.

Cross-References

Inline citation format (the markdown layer, distinct from this YAML layer): context-v/reminders/Lossless-Citation-Spec.md
Acquisition pipeline (the agent that fills this schema): Citation-Acquisition-Pipeline.md
Per-field operational guide (cheap path / expensive path / fallback for each field): Citation-Field-Acquisition-Guide.md
Parsing source citations from LLM responses (Perplexity, Google AI, Claude formats): Parse-Common-Citation-Formats.md
Current narrow implementation: src/services/citationFileService.ts (CitationMetadata interface)