16 KiB
Digest Scrape Queue — Design Spec
Date: 2026-03-19 Status: Approved — ready for implementation planning
Goal
When a user clicks the 📰 Digest chip on a signal banner, the email is added to a persistent digest queue accessible via a dedicated nav tab. The user browses queued digest emails, selects extracted job links to process, and queues them through the existing discovery pipeline as status='pending' jobs in staging.db.
Decisions Made
| Decision | Choice |
|---|---|
| Digest tab placement | Separate top-level nav tab "📰 Digest", between Interviews and Apply |
| Storage | New digest_queue table in staging.db; unique on job_contact_id |
| Table creation | In scripts/db.py init_db() — canonical schema location, not dev-api.py |
| Link extraction | On-demand, backend regex against HTML-stripped plain-text body — no background task needed |
| Extraction UX | Show ranked link list; job-likely pre-checked, others unchecked; user ticks and submits |
| After queueing | Entry stays in digest list for reference; [✕] removes explicitly |
| Failure handling | Digest chip dismisses signal optimistically regardless of POST /api/digest-queue success |
| Duplicate protection | UNIQUE(job_contact_id) in table; POST /api/digest-queue returns { created: false } on duplicate (no 409) |
| Mobile nav | Digest tab does NOT appear in mobile bottom tab bar (all 5 slots occupied; deferred) |
| URL validation | Non-http/https schemes and blank URLs skipped silently in queue-jobs; validation deferred to pipeline |
Data Model
New table: digest_queue
Added to scripts/db.py init_db():
CREATE TABLE IF NOT EXISTS digest_queue (
id INTEGER PRIMARY KEY,
job_contact_id INTEGER NOT NULL REFERENCES job_contacts(id),
created_at TEXT DEFAULT (datetime('now')),
UNIQUE(job_contact_id)
)
init_db() is called at app startup and by dev-api.py startup — adding the CREATE TABLE IF NOT EXISTS there is safe and idempotent.
Backend
New endpoints in dev-api.py
GET /api/digest-queue
Returns all queued entries joined with job_contacts. body is HTML-stripped via _strip_html() before returning (display only — extraction uses a separate raw read, see extract-links):
SELECT dq.id, dq.job_contact_id, dq.created_at,
jc.subject, jc.from_addr, jc.received_at, jc.body
FROM digest_queue dq
JOIN job_contacts jc ON jc.id = dq.job_contact_id
ORDER BY dq.created_at DESC
Response: array of { id, job_contact_id, created_at, subject, from_addr, received_at, body }.
POST /api/digest-queue
Body: { job_contact_id: int }
- Verify
job_contact_idexists injob_contacts→ 404 if not found INSERT OR IGNORE INTO digest_queue (job_contact_id) VALUES (?)- Returns
{ ok: true, created: true }on insert,{ ok: true, created: false }if already present - Never returns 409 — the
createdfield is the duplicate signal
POST /api/digest-queue/{id}/extract-links
Extracts and ranks URLs from the entry's email body. No request body.
Important: this endpoint reads the raw body from job_contacts directly and runs URL_RE against it before any HTML stripping. _strip_html() calls BeautifulSoup.get_text(), which extracts visible text only — it does not preserve href attribute values. A URL that appears only as an href target (e.g., <a href="https://greenhouse.io/acme/1">Click here</a>) would be lost after stripping. Running the regex on raw HTML captures those URLs correctly because URL_RE's character exclusion class ([^\s<>"')\]]) stops at ", so it cleanly extracts href values without matching surrounding markup.
# Fetch raw body from DB — do NOT strip before extraction
row = db.execute(
"SELECT jc.body FROM digest_queue dq JOIN job_contacts jc ON jc.id = dq.job_contact_id WHERE dq.id = ?",
(digest_id,)
).fetchone()
if not row:
raise HTTPException(404, "Digest entry not found")
return {"links": extract_links(row["body"] or "")}
Extraction algorithm:
import re
from urllib.parse import urlparse
JOB_DOMAINS = {
'greenhouse.io', 'lever.co', 'workday.com', 'linkedin.com',
'ashbyhq.com', 'smartrecruiters.com', 'icims.com', 'taleo.net',
'jobvite.com', 'breezy.hr', 'recruitee.com', 'bamboohr.com',
'myworkdayjobs.com', 'careers.', 'jobs.',
}
FILTER_PATTERNS = re.compile(
r'(unsubscribe|mailto:|/track/|pixel\.|\.gif|\.png|\.jpg'
r'|/open\?|/click\?|list-unsubscribe)',
re.I
)
URL_RE = re.compile(r'https?://[^\s<>"\')\]]+', re.I)
def _score_url(url: str) -> int:
parsed = urlparse(url)
hostname = parsed.hostname or ''
path = parsed.path.lower()
if FILTER_PATTERNS.search(url):
return -1 # exclude
for domain in JOB_DOMAINS:
if domain in hostname or domain in path:
return 2 # job-likely
return 1 # other
def extract_links(body: str) -> list[dict]:
if not body:
return []
seen = set()
results = []
for m in URL_RE.finditer(body):
url = m.group(0).rstrip('.,;)')
if url in seen:
continue
seen.add(url)
score = _score_url(url)
if score < 0:
continue
# Title hint: last line of text immediately before the URL (up to 60 chars)
start = max(0, m.start() - 60)
hint = body[start:m.start()].strip().split('\n')[-1].strip()
results.append({'url': url, 'score': score, 'hint': hint})
results.sort(key=lambda x: -x['score'])
return results
Response: { links: [{ url, score, hint }] } — score=2 means job-likely (pre-check in UI), score=1 means other (unchecked).
POST /api/digest-queue/{id}/queue-jobs
Body: { urls: [string] }
- 404 if digest entry not found
- 400 if
urlsis empty - Non-http/https URLs and blank strings are skipped silently (counted as
skipped)
Calls insert_job from scripts/db.py. The actual signature is insert_job(db_path, job) where job is a dict. The status field is not passed — the schema default of 'pending' handles it:
from scripts.db import insert_job
from datetime import datetime
queued = 0
skipped = 0
for url in body.urls:
if not url or not url.startswith(('http://', 'https://')):
skipped += 1
continue
result = insert_job(DB_PATH, {
'url': url,
'title': '',
'company': '',
'source': 'digest',
'date_found': datetime.utcnow().isoformat(),
})
if result:
queued += 1
else:
skipped += 1 # duplicate URL — insert_job returns None on UNIQUE conflict
return {'ok': True, 'queued': queued, 'skipped': skipped}
DELETE /api/digest-queue/{id}
Removes entry from digest_queue. Does not affect job_contacts.
Returns { ok: true }. 404 if not found.
Frontend Changes
Chip handler update (InterviewCard.vue + InterviewsView.vue)
When newLabel === 'digest', the handler fires a third call after the existing reclassify + dismiss calls. Note: sig.id is job_contacts.id — this is the correct value for job_contact_id (the StageSignal.id field maps directly to the job_contacts primary key):
// After existing reclassify + dismiss calls:
if (newLabel === 'digest') {
fetch('/api/digest-queue', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ job_contact_id: sig.id }), // sig.id === job_contacts.id
}).catch(() => {}) // best-effort; signal already dismissed optimistically
}
Signal is removed from local array optimistically before this call (same as current dismiss behavior).
New store: web/src/stores/digest.ts
import { defineStore } from 'pinia'
import { ref } from 'vue'
export interface DigestEntry {
id: number
job_contact_id: number
created_at: string
subject: string
from_addr: string | null
received_at: string
body: string | null
}
export interface DigestLink {
url: string
score: number // 2 = job-likely, 1 = other
hint: string
}
export const useDigestStore = defineStore('digest', () => {
const entries = ref<DigestEntry[]>([])
async function fetchAll() {
const res = await fetch('/api/digest-queue')
entries.value = await res.json()
}
async function remove(id: number) {
entries.value = entries.value.filter(e => e.id !== id)
await fetch(`/api/digest-queue/${id}`, { method: 'DELETE' })
}
return { entries, fetchAll, remove }
})
New page: web/src/views/DigestView.vue
Layout — collapsed entry (default):
┌─────────────────────────────────────────────┐
│ ▸ TechCrunch Jobs Weekly │
│ From: digest@techcrunch.com · Mar 19 │
│ [Extract] [✕] │
└─────────────────────────────────────────────┘
Layout — expanded entry (after Extract):
┌─────────────────────────────────────────────┐
│ ▾ LinkedIn Job Digest │
│ From: jobs@linkedin.com · Mar 18 │
│ [Re-extract] [✕] │
│ ┌──────────────────────────────────────┐ │
│ │ ☑ Senior Engineer — Acme Corp │ │ ← score=2, pre-checked
│ │ greenhouse.io/acme/jobs/456 │ │
│ │ ☑ Staff Designer — Globex │ │
│ │ lever.co/globex/staff-designer │ │
│ │ ─── Other links ────────────────── │ │
│ │ ☐ acme.com/blog/engineering │ │ ← score=1, unchecked
│ │ ☐ linkedin.com/company/acme │ │
│ └──────────────────────────────────────┘ │
│ [Queue 2 selected →] │
└─────────────────────────────────────────────┘
After queueing:
Inline confirmation replaces the link list:
✅ 2 jobs queued for review, 1 skipped (already in pipeline)
Entry remains in the list. [✕] removes it.
Empty state:
🦅 No digest emails queued.
When you mark an email as 📰 Digest, it appears here.
Component state (per entry, keyed by DigestEntry.id):
const expandedIds = ref<Record<number, boolean>>({})
const linkResults = ref<Record<number, DigestLink[]>>({})
const selectedUrls = ref<Record<number, Set<string>>>({})
const queueResult = ref<Record<number, { queued: number; skipped: number } | null>>({})
const extracting = ref<Record<number, boolean>>({})
const queuing = ref<Record<number, boolean>>({})
selectedUrls uses Set<string>. Toggling a URL uses the spread-copy pattern to trigger Vue 3 reactivity — same pattern as expandedSignalIds in InterviewCard.vue:
function toggleUrl(entryId: number, url: string) {
const prev = selectedUrls.value[entryId] ?? new Set()
const next = new Set(prev)
next.has(url) ? next.delete(url) : next.add(url)
selectedUrls.value = { ...selectedUrls.value, [entryId]: next }
}
Router + Nav
Add to web/src/router/index.ts:
{ path: '/digest', component: () => import('../views/DigestView.vue') }
AppNav.vue changes:
Add NewspaperIcon to the Heroicons import (already imported from @heroicons/vue/24/outline), then append to navLinks after Interviews:
import { NewspaperIcon } from '@heroicons/vue/24/outline'
const navLinks = [
{ to: '/', icon: HomeIcon, label: 'Home' },
{ to: '/review', icon: ClipboardDocumentListIcon, label: 'Job Review' },
{ to: '/apply', icon: PencilSquareIcon, label: 'Apply' },
{ to: '/interviews', icon: CalendarDaysIcon, label: 'Interviews' },
{ to: '/digest', icon: NewspaperIcon, label: 'Digest' }, // NEW
{ to: '/prep', icon: LightBulbIcon, label: 'Interview Prep' },
{ to: '/survey', icon: MagnifyingGlassIcon, label: 'Survey' },
]
navLinks remains a static array. The badge count is rendered as a separate reactive expression in the template alongside the Digest link — keep navLinks as-is and add the digest store separately:
// In AppNav.vue <script setup>
import { useDigestStore } from '@/stores/digest'
const digestStore = useDigestStore()
In the template, inside the v-for="link in navLinks" loop, add a badge overlay for the Digest entry:
<span v-if="link.to === '/digest' && digestStore.entries.length > 0" class="nav-badge">
{{ digestStore.entries.length }}
</span>
The Digest nav item does not appear in the mobile bottom tab bar (mobileLinks array) — all 5 slots are occupied. Deferred to a future pass.
DigestView.vue calls digestStore.fetchAll() on onMounted.
Required Tests (tests/test_dev_api_digest.py)
All tests follow the same isolated DB pattern as test_dev_api_interviews.py: use importlib.reload + FastAPI TestClient, seed fixtures directly into the test DB.
| Test | Setup + assertion |
|---|---|
test_digest_queue_add |
Seed a job_contacts row; POST { job_contact_id } → 200, created: true, row in DB |
test_digest_queue_add_duplicate |
Seed + POST twice → second returns created: false, no error, only one row in DB |
test_digest_queue_add_missing_contact |
POST nonexistent job_contact_id → 404 |
test_digest_queue_list |
Seed job_contacts + digest_queue; GET → entries include subject, from_addr, body |
test_digest_extract_links |
Seed job_contacts with body containing a greenhouse.io URL and a tracker URL; seed digest_queue; POST to /extract-links → greenhouse URL present with score=2, tracker URL absent |
test_digest_extract_links_filters_trackers |
Same setup; assert unsubscribe and pixel URLs excluded from results |
test_digest_queue_jobs |
Seed digest_queue; POST { urls: ["https://greenhouse.io/acme/1"] } → queued: 1, skipped: 0; row exists in jobs with source='digest' and status='pending' |
test_digest_queue_jobs_skips_duplicates |
POST { urls: ["https://greenhouse.io/acme/1", "https://greenhouse.io/acme/1"] } — same URL twice in a single call → queued: 1, skipped: 1; one row in DB |
test_digest_queue_jobs_skips_invalid_urls |
POST { urls: ["", "ftp://bad", "https://good.com/job"] } → queued: 1, skipped: 2 |
test_digest_queue_jobs_empty_urls |
POST { urls: [] } → 400 |
test_digest_delete |
Seed + DELETE → 200; second DELETE → 404 |
Files
| File | Action |
|---|---|
scripts/db.py |
Add digest_queue table to init_db() |
dev-api.py |
Add 4 new endpoints; add extract_links() + _score_url() helpers |
web/src/stores/digest.ts |
New Pinia store |
web/src/views/DigestView.vue |
New page |
web/src/router/index.ts |
Add /digest route |
web/src/components/AppNav.vue |
Import digest store; add Digest nav item + reactive badge; desktop nav only |
web/src/components/InterviewCard.vue |
Third call in digest chip handler |
web/src/views/InterviewsView.vue |
Third call in digest chip handler |
tests/test_dev_api_digest.py |
New test file — 11 tests |
What Stays the Same
- Existing reclassify + dismiss two-call path for digest chip — unchanged
insert_jobinscripts/db.py— called as-is, no modification needed- Job Review UI — queued jobs appear there as
status='pending'automatically - Signal banner dismiss behavior — optimistic, unchanged
_strip_html()helper indev-api.py— reused forGET /api/digest-queueresponse body