# Team and People Metadata Ingestion

Six phases to take any company URL → structured TeamRoster JSON the MemoPop app can render as cards. Each phase ships independently and is verifiable on its own.

- Path: plans/Team-and-People-Metadata-Ingestion.md
- Authors: Michael Staton
- Augmented with: Claude Code (Opus 4.7)
- Tags: Team-Extraction · Crawling · Firecrawl · Pydantic · LangChain · MemoPop · Implementation-Plan

Implements: [[Crawl-for-Better-Team-Structured-Output]]
Touches: `apps/memopop-orchestrator/`
## Why this plan exists
Analysts spend 1–2 hours per memo assembling team-card data: name, title, headshot URL, professional socials. The orchestrator already produces team prose but throws away the structured intermediate. This plan adds a first-class TeamRoster artifact, sourced from a real browser crawl, with photo URLs probed for external durability and socials filtered to the professional set. The exploration doc covers tradeoffs and rationale; this doc is the build order.
## Build order, at a glance
- Phase 1 — Skeleton & schema. Dep, env var, Pydantic types, empty CLI that prints help.
- Phase 2 — Discovery + extraction. Walk candidate URLs, call Firecrawl extract, write `team-roster.json`.
- Phase 3 — Photo durability probe. Mark each photo `is_externally_stable`.
- Phase 4 — Photo + social fallback agent. When a photo fails or socials are missing, search.
- Phase 5 — Validation matrix. Run against 5 test URLs covering known failure modes.
- Phase 6 — Refactor `improve_team_section.py` to consume the roster.
Each phase is independently shippable. Stop after any phase if the next one isn’t earning its keep.
## Phase 1 — Skeleton & schema
Goal: Land the dep, the env-var placeholder, the schemas, and a CLI stub that imports cleanly. No network calls yet.
Precondition — ensure the venv exists. The orchestrator’s .venv/ is known to disappear (see CLAUDE.md “Ongoing Troubleshooting”). Before anything else:
```shell
cd apps/memopop-orchestrator
[ -x .venv/bin/python ] || (rm -rf .venv venv && uv venv --python python3.11)
uv pip install -e .  # NOT pip — uv only
```
Note: pydantic>=2.0.0 is already a declared dependency (used in src/server/models.py). No Pydantic install action needed.
Actions:
- Add to `apps/memopop-orchestrator/pyproject.toml` under `[project]` dependencies:

  ```toml
  "firecrawl-py>=2.0.0",  # Web crawling for team roster extraction
  ```

- Add to `apps/memopop-orchestrator/.env.example`:

  ```
  # Team Roster Extraction (optional — required only for cli/extract_team_roster.py)
  FIRECRAWL_API_KEY=your-firecrawl-key-here  # Get key at firecrawl.dev
  ```

- Create `apps/memopop-orchestrator/src/schemas/team_roster.py` with the Pydantic models from the exploration doc: `SocialLinks`, `Photo`, `TeamMember`, `TeamRoster`. Use Pydantic v2 (`BaseModel`, `HttpUrl`, `model_json_schema()`).
- Create `apps/memopop-orchestrator/cli/extract_team_roster.py` with argparse for `--url`, `--name`, `--description`, `--output-dir`, and a `--refresh` flag for skipping cache. Body: import schemas, print parsed args, exit 0. No Firecrawl call yet.
- Reinstall: `cd apps/memopop-orchestrator && uv pip install -e .`
Verify:
- `.venv/bin/python -c "from firecrawl import Firecrawl; print('ok')"` prints `ok`.
- `.venv/bin/python -c "from src.schemas.team_roster import TeamRoster; print(TeamRoster.model_json_schema()['title'])"` prints `TeamRoster`.
- `.venv/bin/python cli/extract_team_roster.py --url https://example.com` exits 0 and echoes parsed args.
- `.env.example` has the placeholder; no real key in any committed file.
## Phase 2 — Discovery + extraction
Goal: From a single root URL, return a populated TeamRoster JSON. Photo durability and fallbacks come later.
Actions:
- Create `apps/memopop-orchestrator/src/scrapers/team_roster_scraper.py`. Define a `Scraper` protocol with two methods:

  ```python
  class Scraper(Protocol):
      def discover_team_pages(self, root_url: str) -> list[str]: ...
      def extract_roster(self, urls: list[str], schema: dict) -> dict: ...
  ```

  Implement `FirecrawlScraper(Scraper)` against firecrawl-py 4.x. Discovery is link-graph-driven, not path-guessing — let Firecrawl tell us what's actually linked from the site:

  - Primary: `firecrawl.map(root_url, search="team", limit=20)` returns URLs from the site's link graph ranked by relevance to "team". Take the top few.
  - Secondary: if `map` returns nothing, `firecrawl.scrape(root_url, formats=["links"])` and read the homepage's actual nav/footer links. Filter where path or anchor text contains `team|people|leadership|founders|about|who-we-are`.
  - Last-resort fallback: probe canonical paths (`/team`, `/about`, `/leadership`, `/people`) with HEAD requests. Only if both above are empty.
  - Return the deduplicated candidate list. Pass all of them into `extract_roster` — Firecrawl handles multi-URL extraction in one call.
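The secondary filter step can be sketched as a pure helper. The function name and signature are illustrative (the real code lives inside `FirecrawlScraper`); only the keyword pattern comes from the plan, and this version filters on the URL string alone, whereas the real step also checks anchor text.

```python
import re

# Keyword pattern mirroring the plan's filter list.
TEAM_HINT = re.compile(r"team|people|leadership|founders|about|who-we-are", re.IGNORECASE)

def filter_team_links(links: list[str]) -> list[str]:
    """Keep links that look team-related; deduplicate, preserving order."""
    seen: set[str] = set()
    out: list[str] = []
    for url in links:
        if TEAM_HINT.search(url) and url not in seen:
            seen.add(url)
            out.append(url)
    return out

print(filter_team_links([
    "https://acme.com/team",
    "https://acme.com/blog",
    "https://acme.com/about",
    "https://acme.com/team",     # duplicate, dropped
    "https://acme.com/careers",
]))
# → ['https://acme.com/team', 'https://acme.com/about']
```

Order-preserving deduplication matters here: `map`'s relevance ranking is the only signal for "take the top few", so the filter must not reshuffle.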
- Wire `cli/extract_team_roster.py`:
  - Load `.env`, fail loudly if `FIRECRAWL_API_KEY` is missing.
  - Instantiate `FirecrawlScraper`, run discovery, then `extract_roster(candidate_urls, TeamRoster.model_json_schema())`.
  - Validate the response with `TeamRoster.model_validate()`. On validation failure, save the raw response to `team-roster.raw.json` and surface the validation error.
  - Compute output dir: `output/{slug}-roster/` (separate from memo artifact dirs — rosters can be reused across memo versions).
  - Write `team-roster.json` and a markdown preview `team-roster.md` (one section per member: name as H3, title, bio, photo URL, socials list).
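The markdown preview format described above can be sketched as a small renderer. This takes a plain dict rather than the real `TeamMember` model, and the key names (`photo_url`, `socials`) are assumptions for illustration, not the schema's actual field names.

```python
def render_member_md(member: dict) -> str:
    """Render one roster member as a markdown section: name as H3,
    then title, bio, photo, and a socials list."""
    lines = [f"### {member['name']}", ""]
    if member.get("title"):
        lines.append(f"**{member['title']}**")
    if member.get("bio"):
        lines.append(member["bio"])
    if member.get("photo_url"):
        lines.append(f"![{member['name']}]({member['photo_url']})")
    for platform, url in (member.get("socials") or {}).items():
        lines.append(f"- {platform}: {url}")
    return "\n".join(lines) + "\n"

print(render_member_md({
    "name": "Ada Lovelace",
    "title": "CTO",
    "socials": {"linkedin": "https://linkedin.com/in/ada"},
}))
```

Keeping the preview a dumb projection of `team-roster.json` means the JSON stays the single source of truth; the `.md` file exists only so an analyst can eyeball the result.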
- Add a `--from-memo <Company>` shortcut that reads `data/{Company}.json` for `url` and `description`, so it stays consistent with the rest of the orchestrator's CLI ergonomics.
- Socials lookup (in-phase, not deferred to Phase 4). A roster of names + titles + photos with empty socials is ~30% of the value. After extract, for each member with missing professional socials, run a search-by-name+title using the existing `src/agents/socials_enrichment.py` helpers (`search_for_social_profile(query, platform)` — Tavily-first, DuckDuckGo fallback). Skip per-member with `--no-enrich`. Apply the professional-only filter (no Facebook/Instagram; YouTube/TikTok only when the channel is clearly a professional/branded creator presence).
Verify:
- Run on a known-easy site (Astro Knots site you control or another small static site): produces `team-roster.json` with ≥3 members.
- Run on a representative VC site (e.g., aixventures.com): ≥1 member has a LinkedIn URL after enrichment, even if the team page didn't expose them.
- `TeamRoster.model_validate(json.load(f))` succeeds.
- No `facebook.com` or `instagram.com` URLs appear in any roster.
- The markdown preview renders without errors in any markdown viewer.
## Phase 3 — Photo durability probe
Goal: Set `Photo.is_externally_stable` correctly for every photo URL the extractor returned. Do not yet replace failing photos — just label them.
Actions:
- Create `apps/memopop-orchestrator/src/scrapers/photo_probe.py` with `async def is_externally_stable(url: str) -> bool` per the exploration doc: HEAD with `User-Agent: MemoPop/1.0`, no Referer, no cookies; fall back to a range-limited GET if HEAD is rejected; require `200` and a `content-type` starting with `image/`.
- In `cli/extract_team_roster.py`, after extraction, run probes concurrently with `asyncio.gather` (cap at 10 simultaneous). Set `member.photo.is_externally_stable` accordingly.
- Surface the count in the CLI output: `8/12 photos externally stable`.
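The gather-with-a-cap pattern looks like the sketch below. The HTTP call is stubbed out (the real probe issues the HEAD/range-GET sequence with httpx or similar); what the sketch shows is the durability predicate and the semaphore cap of 10, both taken from the plan.

```python
import asyncio

def looks_stable(status: int, content_type: str) -> bool:
    """Durability predicate from the plan: 200 and an image/* content-type."""
    return status == 200 and content_type.startswith("image/")

async def probe(url: str, sem: asyncio.Semaphore) -> bool:
    async with sem:
        # Stub in place of the real HEAD request: pretend hosts with
        # "protected" in the URL reject anonymous, referer-less fetches.
        status, ctype = (403, "text/html") if "protected" in url else (200, "image/jpeg")
        return looks_stable(status, ctype)

async def probe_all(urls: list[str]) -> list[bool]:
    sem = asyncio.Semaphore(10)  # cap at 10 simultaneous probes
    return await asyncio.gather(*(probe(u, sem) for u in urls))

results = asyncio.run(probe_all([
    "https://cdn.example.com/a.jpg",
    "https://protected.example.com/b.jpg",
]))
print(results)  # → [True, False]
```

With the cap at 10, a 12-person team is two "waves" of probes, which is how the under-5-seconds verify target stays realistic even with slow hosts.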
Verify:
- Pick a site whose photos are CDN-protected (most Squarespace, some Webflow): `is_externally_stable=False` on those photos.
- Pick a site with public photos: `is_externally_stable=True`.
- Probe runtime under 5 seconds for a 12-person team.
## Phase 4 — Photo fallback agent
Goal: When a photo is not externally stable, an agent runs targeted searches across durable sources and updates the roster. (Social fallback moved into Phase 2’s acceptance criteria — a roster without socials is not “extracted”.)
Actions:
- Create `apps/memopop-orchestrator/src/agents/team_metadata_enrichment.py`. Reuse `socials_enrichment.py`'s `get_platform_domain` and `is_valid_profile_url` helpers — do not duplicate them.
- Photo fallback priority list (try in order, accept the first that probes stable):
  - Wikipedia (`wikipedia.org`) — search `"{name}" "{title}" "{org}"`, look for an infobox image.
  - Crunchbase — only if `CRUNCHBASE_API_KEY` is set; otherwise skip.
  - Conference speaker pages — Tavily search restricted to known speaker-page hosts (TED, web summits, industry-specific).
  - GitHub avatar — only if a GitHub social was found.
  - Public LinkedIn photo: explicitly off by default. Add an `--allow-linkedin-photo` flag if the analyst opts in for a specific run.
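The "try in order, accept the first that probes stable" loop has a simple shape, sketched below with stand-in source functions. Everything here except the priority-order idea is a stub: the real lookups hit Wikipedia, Crunchbase, etc., and `probes_stable` would re-run the Phase 3 probe.

```python
from typing import Callable, Optional

def wikipedia_photo(name: str) -> Optional[str]:
    return None  # stub: no infobox image found for this name

def github_avatar(name: str) -> Optional[str]:
    return f"https://avatars.example.com/{name}"  # stub result

def probes_stable(url: str) -> bool:
    return True  # stub: real version re-runs the Phase 3 durability probe

def find_durable_photo(
    name: str, sources: list[tuple[str, Callable[[str], Optional[str]]]]
) -> Optional[tuple[str, str]]:
    """Return (photo_url, source_label) from the first priority-ordered
    source that yields a URL passing the stability probe."""
    for label, lookup in sources:
        url = lookup(name)
        if url and probes_stable(url):
            return url, label
    return None

hit = find_durable_photo("ada", [("wikipedia", wikipedia_photo), ("github", github_avatar)])
print(hit)  # → ('https://avatars.example.com/ada', 'github')
```

Returning the source label alongside the URL is what lets the roster populate `Photo.source`, which the Phase 4 verify step checks.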
- Social fallback uses the existing `socials_enrichment.search_for_social_profile`. Apply the professional filter at the LLM layer with a clear rubric in the prompt:
  - Always include: LinkedIn, X/Twitter, Bluesky, Medium, GitHub, personal `.dev`/`.com` site.
  - Never include: Facebook, Instagram, personal email, phone.
  - Conditionally include YouTube and TikTok: include if the channel is branded around the person's professional identity AND has ≥1k subscribers OR is referenced from their LinkedIn/personal site. Skip otherwise. Log the rationale into `member.sources`.
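The always/never halves of that rubric are deterministic and worth enforcing in code before anything reaches the LLM, so only the genuinely judgment-based YouTube/TikTok cases cost a model call. A sketch, with domain lists mirroring the rubric (the function name and three-way return are illustrative):

```python
from urllib.parse import urlparse

ALWAYS_OK = {"linkedin.com", "x.com", "twitter.com", "bsky.app", "medium.com", "github.com"}
NEVER_OK = {"facebook.com", "instagram.com"}
NEEDS_LLM = {"youtube.com", "tiktok.com"}

def classify_social(url: str) -> str:
    """Route a candidate social URL: hard accept, hard reject,
    or escalate to the LLM rubric."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in NEVER_OK:
        return "reject"
    if host in ALWAYS_OK:
        return "accept"
    if host in NEEDS_LLM:
        return "ask-llm"  # subscriber-count / professional-identity rubric
    return "accept"       # personal .dev/.com sites pass through

print(classify_social("https://www.instagram.com/someone"))  # → reject
print(classify_social("https://linkedin.com/in/someone"))    # → accept
print(classify_social("https://youtube.com/@someone"))       # → ask-llm
```

A hard pre-filter like this also makes the "no Facebook/Instagram URLs in any `SocialLinks` field" verify criterion a code guarantee rather than a prompt hope.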
- Wire into `cli/extract_team_roster.py` behind a default-on `--enrich`/`--no-enrich` flag.
Verify:
- Run on a Phase 3 site that had failing photos: ≥50% of failed photos get a stable replacement, with `Photo.source` correctly populated.
- Run on a site with sparse socials (most pre-Series-A startups): the LLM successfully skips a personal Instagram and includes a professional X account.
- No Facebook/Instagram URLs appear in any `SocialLinks` field across the test set.
## Phase 5 — Validation matrix
Goal: Honest accuracy data on five representative sites before we commit to Firecrawl long-term.
Actions:
- Pick five test URLs covering:
  - Static marketing site with a clean `/team` page.
  - SPA where `/team` is client-rendered.
  - Site with hot-link-protected photos.
  - Site where `/team` doesn't exist; bios live on `/about`.
  - Site behind a Cloudflare bot challenge (intentionally hard).

  Record the actual URLs in `apps/memopop-orchestrator/tests/fixtures/team_roster_test_urls.yaml`.
- For each URL, capture in a results table:
  - Did discovery find the right page?
  - Member count (extracted vs. ground truth from manual count).
  - Photo durability rate.
  - Social completeness (LinkedIn coverage; total socials per member).
  - Total runtime.
  - Total cost (Firecrawl + LLM).
- Decide: ship Firecrawl, or invest in the Crawl4AI alternative implementation of the same `Scraper` protocol. Capture the call in a follow-up exploration or directly in the changelog.
Verify:
- Results table committed to `apps/memopop-orchestrator/changelog/` as part of the phase-5 ship note.
- Decision is recorded with reasoning, not just numbers.
## Phase 6 — Refactor `improve_team_section.py` to consume the roster
Goal: The prose path stops re-doing discovery. It calls `extract_team_roster` (or loads a cached roster) and runs Phase 2 (per-person deep dive) only on confirmed names.
Actions:
- In `cli/improve_team_section.py`:
  - Before Phase 1, look for `output/{slug}-roster/team-roster.json`. If absent, or `--refresh-roster` is set, invoke `extract_team_roster` programmatically.
  - Replace Phase 1's "spray and pray" Perplexity calls with the loaded roster's `members[].name` and `.title`.
  - Phase 2 (per-person Perplexity deep dives) runs unchanged but only against confirmed names.
  - Phase 3 synthesis prompt gets the structured roster passed in alongside the deep-dive results — fewer hallucinated names in the final prose.
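The load-or-refresh step can be sketched as below. The 30-day age cutoff comes from the open-questions note at the end of this plan; `extract_roster` here is a stand-in stub for invoking `extract_team_roster` programmatically, and the function name is illustrative.

```python
import json
import tempfile
import time
from pathlib import Path

MAX_AGE_S = 30 * 24 * 3600  # default cache window from the open-questions note

def extract_roster(slug: str) -> dict:
    return {"members": []}  # stub for the programmatic extract_team_roster call

def load_or_refresh_roster(slug: str, output_root: Path, refresh: bool = False) -> dict:
    """Return the cached roster if it exists and is fresh; otherwise
    re-extract and write it back to output/{slug}-roster/."""
    path = output_root / f"{slug}-roster" / "team-roster.json"
    fresh = path.exists() and (time.time() - path.stat().st_mtime) < MAX_AGE_S
    if fresh and not refresh:
        return json.loads(path.read_text())
    roster = extract_roster(slug)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(roster))
    return roster

root = Path(tempfile.mkdtemp())
first = load_or_refresh_roster("acme", root)   # cache miss: extract, then write
second = load_or_refresh_roster("acme", root)  # cache hit: read from disk
print(first == second)  # → True
```

Because the roster lives outside the memo version dirs, the same cached file serves every memo regeneration until it ages out or `--refresh-roster` forces a re-crawl.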
- The memo's `2-sections/04-team.md` no longer changes shape; the structured roster is a parallel artifact, not a replacement.
Verify:
- Run `improve_team_section.py` on a company that has a cached roster. Memo prose names match roster names exactly (no spelling drift, no extra people, no missing people).
- Citation count and prose quality are unchanged or better than the pre-refactor baseline.
## Done when
- `firecrawl-py` is in `pyproject.toml`; `FIRECRAWL_API_KEY` is in `.env.example`.
- `cli/extract_team_roster.py --url <root>` produces a valid `TeamRoster` JSON for any of the five validation URLs.
- Photo durability is correctly probed and labelled.
- Photo + social fallback agent fills gaps; no Facebook/Instagram leaks.
- Validation matrix is committed to a changelog entry with the Firecrawl-vs-Crawl4AI decision.
- `improve_team_section.py` consumes the roster instead of re-discovering names.
## Out of scope (for this plan)
- Stealth founders, ICs hidden from the team page (flag in `notes`, no automation).
- Automatic re-crawling on a schedule (manual `--refresh` only).
- Photo de-duplication / face-matching across sources.
- Multi-language team pages (English-only for v1).
## Notes & open questions
Carried forward from the exploration — answer as we go:
- Roster artifact location: proposed `output/{slug}-roster/` (separate from memo version dirs). Confirm during Phase 2.
- Cache invalidation: default 30 days; `--refresh` always re-crawls.
- Should `find_durable_photo.py` be its own CLI? Likely yes after Phase 4 — useful standalone for one-off fixes.
- Confidence scoring: `TeamMember.confidence` will need a clear rubric in Phase 4 — propose a draft in the Phase 4 PR.