Team and People Metadata Ingestion

Implements: [[Crawl-for-Better-Team-Structured-Output]] Touches: apps/memopop-orchestrator/

Why this plan exists

Analysts spend 1–2 hours per memo assembling team-card data: name, title, headshot URL, professional socials. The orchestrator already produces team prose but throws away the structured intermediate. This plan adds a first-class TeamRoster artifact, sourced from a real browser crawl, with photo URLs probed for external durability and socials filtered to the professional set. The exploration doc covers tradeoffs and rationale; this doc is the build order.

Build order, at a glance

Phase 1 — Skeleton & schema. Dep, env var, Pydantic types, empty CLI that prints help.
Phase 2 — Discovery + extraction. Walk candidate URLs, call Firecrawl extract, write team-roster.json.
Phase 3 — Photo durability probe. Mark each photo is_externally_stable.
Phase 4 — Photo + social fallback agent. When a photo fails or socials are missing, search.
Phase 5 — Validation matrix. Run against 5 test URLs covering known failure modes.
Phase 6 — Refactor improve_team_section.py to consume the roster.

Each phase is independently shippable. Stop after any phase if the next one isn’t earning its keep.

Phase 1 — Skeleton & schema

Goal: Land the dep, the env-var placeholder, the schemas, and a CLI stub that imports cleanly. No network calls yet.

Precondition — ensure the venv exists. The orchestrator’s .venv/ is known to disappear (see CLAUDE.md “Ongoing Troubleshooting”). Before anything else:

cd apps/memopop-orchestrator
[ -x .venv/bin/python ] || (rm -rf .venv venv && uv venv --python python3.11)
uv pip install -e .  # NOT pip — uv only

Note: pydantic>=2.0.0 is already a declared dependency (used in src/server/models.py). No Pydantic install action needed.

Actions:

Add to apps/memopop-orchestrator/pyproject.toml under [project] dependencies:
```
"firecrawl-py>=2.0.0",  # Web crawling for team roster extraction
```

Add to apps/memopop-orchestrator/.env.example:

# Team Roster Extraction (optional — required only for cli/extract_team_roster.py)
FIRECRAWL_API_KEY=your-firecrawl-key-here  # Get key at firecrawl.dev

Create apps/memopop-orchestrator/src/schemas/team_roster.py with the Pydantic models from the exploration doc: SocialLinks, Photo, TeamMember, TeamRoster. Use Pydantic v2 (BaseModel, HttpUrl, model_json_schema()).
Create apps/memopop-orchestrator/cli/extract_team_roster.py with argparse for --url, --name, --description, --output-dir, and a --refresh flag for skipping cache. Body: import schemas, print parsed args, exit 0. No Firecrawl call yet.
Reinstall: cd apps/memopop-orchestrator && uv pip install -e .

Verify:

.venv/bin/python -c "from firecrawl import Firecrawl; print('ok')" prints ok.
.venv/bin/python -c "from src.schemas.team_roster import TeamRoster; print(TeamRoster.model_json_schema()['title'])" prints TeamRoster.
.venv/bin/python cli/extract_team_roster.py --url https://example.com exits 0 and echoes parsed args.
.env.example has the placeholder; no real key in any committed file.

Phase 2 — Discovery + extraction

Goal: From a single root URL, return a populated TeamRoster JSON. Photo durability and fallbacks come later.

Actions:

Create apps/memopop-orchestrator/src/scrapers/team_roster_scraper.py. Define a Scraper Protocol with two methods:
```
class Scraper(Protocol):
    def discover_team_pages(self, root_url: str) -> list[str]: ...
    def extract_roster(self, urls: list[str], schema: dict) -> dict: ...
```
Implement FirecrawlScraper(Scraper) against firecrawl-py 4.x. Discovery is link-graph-driven, not path-guessing — let Firecrawl tell us what’s actually linked from the site:
- Primary: firecrawl.map(root_url, search="team", limit=20) returns URLs from the site’s link graph ranked by relevance to “team”. Take the top few.
- Secondary: if map returns nothing, firecrawl.scrape(root_url, formats=["links"]) and read the homepage’s actual nav/footer links. Filter where path or anchor text contains team|people|leadership|founders|about|who-we-are.
- Last-resort fallback: probe canonical paths (/team, /about, /leadership, /people) with HEAD requests. Only if both above are empty.
- Return the deduplicated candidate list. Pass all of them into extract_roster — Firecrawl handles multi-URL extraction in one call.
Wire cli/extract_team_roster.py:
- Load .env, fail loudly if FIRECRAWL_API_KEY is missing.
- Instantiate FirecrawlScraper, run discovery, then extract_roster(candidate_urls, TeamRoster.model_json_schema()).
- Validate the response with TeamRoster.model_validate(). On validation failure, save the raw response to team-roster.raw.json and surface the validation error.
- Compute output dir: output/{slug}-roster/ (separate from memo artifact dirs — rosters can be reused across memo versions).
- Write team-roster.json and a markdown preview team-roster.md (one section per member: name as H3, title, bio, photo URL, socials list).
Add a --from-memo <Company> shortcut that reads data/{Company}.json for url and description so it stays consistent with the rest of the orchestrator’s CLI ergonomics.
Socials lookup (in-phase, not deferred to Phase 4). A roster of names + titles + photos with empty socials is ~30% of the value. After extract, for each member with missing professional socials, run a search-by-name+title using the existing src/agents/socials_enrichment.py helpers (search_for_social_profile(query, platform) — Tavily-first, DuckDuckGo fallback). Skip per-member with --no-enrich. Apply the professional-only filter (no Facebook/Instagram; YouTube/TikTok only when the channel is clearly a professional/branded creator presence).

Verify:

Run on a known-easy site (Astro Knots site you control or another small static site): produces team-roster.json with ≥3 members.
Run on a representative VC site (e.g., aixventures.com): ≥1 member has a LinkedIn URL after enrichment, even if the team page didn’t expose them.
TeamRoster.model_validate(json.load(f)) succeeds.
No facebook.com or instagram.com URLs appear in any roster.
The markdown preview renders without errors in any markdown viewer.

Phase 3 — Photo durability probe

Goal: Set Photo.is_externally_stable correctly for every photo URL the extractor returned. Do not yet replace failing photos — just label them.

Actions:

Create apps/memopop-orchestrator/src/scrapers/photo_probe.py with async def is_externally_stable(url: str) -> bool per the exploration doc: HEAD with User-Agent: MemoPop/1.0, no Referer, no cookies; fall back to range-limited GET if HEAD is rejected; require 200 and content-type starting with image/.
In cli/extract_team_roster.py, after extraction, run probes concurrently with asyncio.gather (cap at 10 simultaneous). Set member.photo.is_externally_stable accordingly.
Surface the count in the CLI output: 8/12 photos externally stable.

Verify:

Pick a site whose photos are CDN-protected (most Squarespace, some Webflow): is_externally_stable=False on those photos.
Pick a site with public photos: is_externally_stable=True.
Probe runtime under 5 seconds for a 12-person team.

Phase 4 — Photo fallback agent

Goal: When a photo is not externally stable, an agent runs targeted searches across durable sources and updates the roster. (Social fallback moved into Phase 2’s acceptance criteria — a roster without socials is not “extracted”.)

Actions:

Create apps/memopop-orchestrator/src/agents/team_metadata_enrichment.py. Reuse socials_enrichment.py’s get_platform_domain and is_valid_profile_url helpers — do not duplicate them.
Photo fallback priority list (try in order, accept the first that probes stable):
- Wikipedia (wikipedia.org) — search "{name}" "{title}" "{org}", look for an infobox image.
- Crunchbase — only if CRUNCHBASE_API_KEY is set; otherwise skip.
- Conference speaker pages — Tavily search restricted to known speaker-page hosts (TED, web summits, industry-specific).
- GitHub avatar — only if a GitHub social was found.
- Public LinkedIn photo: explicitly off by default. Add a --allow-linkedin-photo flag if the analyst opts in for a specific run.
Social fallback uses the existing socials_enrichment.search_for_social_profile. Apply the professional filter at the LLM layer with a clear rubric in the prompt:
- Always include: LinkedIn, X/Twitter, Bluesky, Medium, GitHub, personal .dev / .com site.
- Never include: Facebook, Instagram, personal email, phone.
- Conditionally include YouTube and TikTok: include if the channel is branded around the person’s professional identity AND has ≥1k subscribers OR is referenced from their LinkedIn/personal site. Skip otherwise. Log the rationale into member.sources.
Wire into cli/extract_team_roster.py behind a default-on --enrich/--no-enrich flag.

Verify:

Run on a Phase 3 site that had failing photos: ≥50% of failed photos get a stable replacement, with Photo.source correctly populated.
Run on a site with sparse socials (most pre-Series-A startups): the LLM successfully skips a personal Instagram and includes a professional X account.
No Facebook/Instagram URLs appear in any SocialLinks field across the test set.

Phase 5 — Validation matrix

Goal: Honest accuracy data on five representative sites before we commit to Firecrawl long-term.

Actions:

Pick five test URLs covering:
- Static marketing site with a clean /team page.
- SPA where /team is client-rendered.
- Site with hot-link-protected photos.
- Site where /team doesn’t exist; bios live on /about.
- Site behind Cloudflare bot challenge (intentionally hard). Record the actual URLs in apps/memopop-orchestrator/tests/fixtures/team_roster_test_urls.yaml.
For each URL, capture in a results table:
- Did discovery find the right page?
- Member count (extracted vs. ground truth from manual count).
- Photo durability rate.
- Social completeness (LinkedIn coverage; total socials per member).
- Total runtime.
- Total cost (Firecrawl + LLM).
Decide: ship Firecrawl, or invest in the Crawl4AI alternative implementation of the same Scraper protocol. Capture the call in a follow-up exploration or directly in the changelog.

Verify:

Results table committed to apps/memopop-orchestrator/changelog/ as part of the phase-5 ship note.
Decision is recorded with reasoning, not just numbers.

Phase 6 — Refactor `improve_team_section.py` to consume the roster

Goal: The prose path stops re-doing discovery. It calls extract_team_roster (or loads a cached roster) and runs Phase 2 (per-person deep dive) only on confirmed names.

Actions:

In cli/improve_team_section.py:
- Before Phase 1, look for output/{slug}-roster/team-roster.json. If absent or --refresh-roster, invoke extract_team_roster programmatically.
- Replace Phase 1’s “spray and pray” Perplexity calls with the loaded roster’s members[].name and .title.
- Phase 2 (per-person Perplexity deep dives) runs unchanged but only against confirmed names.
- Phase 3 synthesis prompt gets the structured roster passed in alongside the deep-dive results — fewer hallucinated names in the final prose.
The memo’s 2-sections/04-team.md no longer changes shape; the structured roster is a parallel artifact, not a replacement.

Verify:

Run improve_team_section.py on a company that has a cached roster. Memo prose names match roster names exactly (no spelling drift, no extra people, no missing people).
Citation count and prose quality are unchanged or better than the pre-refactor baseline.

Done when

firecrawl-py is in pyproject.toml; FIRECRAWL_API_KEY is in .env.example.
cli/extract_team_roster.py --url <root> produces a valid TeamRoster JSON for any of the five validation URLs.
Photo durability is correctly probed and labelled.
Photo + social fallback agent fills gaps; no Facebook/Instagram leaks.
Validation matrix is committed to a changelog entry with the Firecrawl-vs-Crawl4AI decision.
improve_team_section.py consumes the roster instead of re-discovering names.

Out of scope (for this plan)

Stealth founders, ICs hidden from the team page (flag in notes, no automation).
Automatic re-crawling on a schedule (manual --refresh only).
Photo de-duplication / face-matching across sources.
Multi-language team pages (English-only for v1).

Notes & open questions

Carried forward from the exploration — answer as we go:

Roster artifact location: Proposed output/{slug}-roster/ (separate from memo version dirs). Confirms during Phase 2.
Cache invalidation: Default 30 days; --refresh always re-crawls.
Should find_durable_photo.py be its own CLI? Likely yes after Phase 4 — useful standalone for one-off fixes.
Confidence scoring: TeamMember.confidence will need a clear rubric in Phase 4 — propose a draft in the Phase 4 PR.

Team and People Metadata Ingestion

Why this plan exists

Build order, at a glance

Phase 1 — Skeleton & schema

Phase 2 — Discovery + extraction

Phase 3 — Photo durability probe

Phase 4 — Photo fallback agent

Phase 5 — Validation matrix

Phase 6 — Refactor improve_team_section.py to consume the roster

Done when

Out of scope (for this plan)

Notes & open questions

Phase 6 — Refactor `improve_team_section.py` to consume the roster