# Using APIs to Ingest More Data
Beyond OpenGraph.io — what Jina.ai already gives us (we pay for it), what other fetching services do, and a rough sketch of how a portfolio-company site crawler would actually work.
- **Path:** explorations/Using-APIs-to-Ingest-More-Data.md
- **Authors:** Michael Staton
- **Augmented with:** Claude Code on Claude Opus 4.7 (1M context)
- **Tags:** Exploration · API-Integrations · Jina · Crawling · Portfolio-Companies
## What we can do with Jina.ai
The Jina account is already paid for, so everything below adds only incremental token/request cost.
- **Reader** (`r.jina.ai/<url>`) — fetches any URL as clean, LLM-ready markdown with title / description / content. Handles JS-rendered pages via a headless browser. JSON mode (`Accept: application/json`) returns `{ title, description, content, links }`.
- **Search** (`s.jina.ai/<query>`) — search API that returns the full text of top results, not snippets. Useful for "what does X say about Y" questions.
- **DeepSearch** — agentic multi-hop crawl that keeps refining until it has enough information. Expensive but thorough; the right tool for "research this thing for me."
- **Embeddings + Reranker** — semantic similarity and result re-ranking. Useful if we build a portfolio-company knowledge base and want "find companies similar to this thesis."
- **Segmenter / Classifier** — chunks long content semantically; zero-shot labels pages (e.g. "is this an `/about` page?").
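As a concrete example, Reader's JSON mode is just a GET with an `Accept` header against the `r.jina.ai/<url>` prefix. A minimal sketch in TypeScript (the response field names follow the list above; the `links` shape and the helper name are our assumptions):

```typescript
// Shape of a Reader JSON-mode response, per the fields listed above.
// The exact shape of `links` is an assumption.
interface ReaderPage {
  title: string;
  description: string;
  content: string; // clean, LLM-ready markdown
  links?: Record<string, string>; // outbound links from the page
}

// Build the request for Reader's JSON mode. Reader is a prefix proxy:
// https://r.jina.ai/<target-url>, with JSON selected via the Accept header.
function readerRequest(
  targetUrl: string,
  apiKey?: string
): { url: string; headers: Record<string, string> } {
  const headers: Record<string, string> = { Accept: "application/json" };
  if (apiKey) headers["Authorization"] = `Bearer ${apiKey}`; // paid tier
  return { url: `https://r.jina.ai/${targetUrl}`, headers };
}

// Usage (not executed here):
//   const req = readerRequest("https://example.com/about");
//   const page = (await (await fetch(req.url, { headers: req.headers })).json()) as ReaderPage;
```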
## Other fetching beyond Jina
- **Microlink** — preview cards, free tier. Drop-in OpenGraph-style replacement.
- **Browserless / ScrapingBee** — raw headless browser when we need DOM control or to interact with the page.
- **Firecrawl** — purpose-built site crawler with sitemap walking and structured extraction baked in. Closest match to the portfolio-crawler dream.
## How a portfolio-company crawler would work
Rough shape:
- **Input:** a list of company domains. Could live in an Obsidian folder as one file per company with a `url:` frontmatter field.
- **Discovery pass:** Jina Reader on the homepage in JSON mode → get markdown plus the outbound link list → filter to same-origin and prioritize `/about`, `/team`, `/careers`, `/blog`, `/press`.
- **Fetch pass:** Jina Reader on each prioritized page. 3–8 pages per company keeps cost bounded.
- **Extract pass:** feed the concatenated markdown to Claude with a Zod schema (name, tagline, stage, founders, investors, last funding, customers, tech stack, recent news). Get back structured JSON.
- **Persist:** write back into the company's Obsidian file as frontmatter plus a generated body section, with a `last_crawled` field. Re-run periodically; diff to surface changes.
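The discovery pass reduces to a pure filter-and-rank over Reader's link list. A sketch, with the priority order taken from the step above (function and constant names are ours):

```typescript
// High-signal paths from the discovery pass, ranked in priority order.
const PRIORITY = ["/about", "/team", "/careers", "/blog", "/press"];

// Filter outbound links to same-origin pages, rank the usual
// high-signal paths first, and cap at the per-company page budget.
function prioritizeLinks(homepage: string, links: string[], budget = 5): string[] {
  const origin = new URL(homepage).origin;
  const sameOrigin = links.filter((l) => {
    try {
      return new URL(l, origin).origin === origin; // resolves relative links too
    } catch {
      return false; // skip malformed URLs
    }
  });
  const rank = (l: string): number => {
    const path = new URL(l, origin).pathname;
    const i = PRIORITY.findIndex((p) => path.startsWith(p));
    return i === -1 ? PRIORITY.length : i; // unmatched paths sort last
  };
  return Array.from(new Set(sameOrigin)) // dedupe, keep first-seen order
    .sort((a, b) => rank(a) - rank(b))
    .slice(0, budget);
}
```

Relative links resolve against the homepage origin, so Reader's link list can be passed in as-is.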
## Recommendation and tradeoff
For v1, build it as a separate plugin or command, not bolted onto Metafetch. It is a different shape of problem: multiple pages per target, an LLM in the loop, longer-running jobs. Metafetch stays single-page, OG-only.
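The extract pass hands Claude a schema to fill in. In practice that would be Zod; as a dependency-free sketch, a plain interface with the field names from the extract-pass step (optionality and the sanity-check helper are our assumptions):

```typescript
// Structured record the extract pass asks Claude to produce.
// Field names follow the extract-pass list; which fields are
// optional is an assumption.
interface CompanyRecord {
  name: string;
  tagline: string;
  stage?: string; // e.g. "Series A"
  founders: string[];
  investors: string[];
  last_funding?: string;
  customers: string[];
  tech_stack: string[];
  recent_news: string[];
}

// Cheap sanity check before persisting a record to the Obsidian file.
function isUsable(r: CompanyRecord): boolean {
  return r.name.trim().length > 0 && r.tagline.trim().length > 0;
}
```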
The main tradeoff is cost predictability. We will want a per-company page budget (say ≤ 5 pages × ~2k tokens each), or the Jina + Claude bill becomes unpredictable on a 50-company portfolio. Firecrawl is worth a head-to-head comparison because it bundles crawl plus extract in one API and may end up cheaper per company than Reader plus Claude separately.
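The page budget makes the cost ceiling easy to state up front. A back-of-envelope sketch (the token count per page and the portfolio size are the illustrative figures from above, not quoted rates):

```typescript
// Upper bound on input tokens for one full crawl of the portfolio.
// tokensPerPage is an illustrative assumption, not a measured average.
function crawlTokenCeiling(
  companies: number,
  pagesPerCompany: number,
  tokensPerPage: number
): number {
  return companies * pagesPerCompany * tokensPerPage;
}

// 50 companies × 5 pages × ~2k tokens ≈ 500k input tokens per full crawl,
// before counting Claude's own output tokens.
```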
## Open questions
- Should the portfolio-company schema and command surface be sketched in detail (no code yet) before we commit to a build?
- Firecrawl vs. Reader+Claude: pick one for v1, or run both on a small sample and compare cost and extraction quality?
- Where do crawl results live — inside Obsidian as one file per company, or as a separate JSON store with Obsidian files as views over it?