feat: Lemmy JSON API crawler for signal extraction #10

Closed
opened 2026-04-22 07:03:07 -07:00 by pyr0ball · 0 comments
Owner

Summary

Build a crawler that polls Lemmy communities via the public JSON API (/api/v3/) to detect signal threads worth replying to, and feeds them into the opportunities queue automatically.

Why

Lemmy instances expose a full public REST API with no bot detection. Unlike Reddit, which requires Playwright scraping, Lemmy signal extraction can be lightweight, reliable, and fast: pure HTTP, no browser required.

Scope

  • app/services/lemmy/ module: client wrapping GET /api/v3/community, GET /api/v3/post/list, GET /api/v3/post
  • Configurable community list per campaign (e.g. technology@reddthat.com, selfhosted@lemmy.world)
  • Keyword/pattern matching against post titles + bodies (reuse signal extraction logic from Reddit crawler)
  • Auto-create opportunities via store.create_opportunity() on match
  • Dedup guard: skip posts already in opportunities table by thread_url
  • APScheduler integration: poll on configurable interval (default: hourly)
  • MCP tool: poll_lemmy_community for on-demand scan
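
The match, dedup, and create steps above could fit together roughly as follows. This is a minimal sketch under assumptions, not the project's implementation: `InMemoryStore`, `has_thread`, `compile_patterns`, `matches_signal`, and `ingest` are hypothetical names, the keyword arguments of `create_opportunity()` are guessed, and the post's `ap_id` field is assumed to serve as `thread_url`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the real opportunities store."""
    opportunities: dict = field(default_factory=dict)

    def has_thread(self, thread_url: str) -> bool:
        # Dedup guard: an opportunity already exists for this thread URL.
        return thread_url in self.opportunities

    def create_opportunity(self, thread_url: str, title: str, community: str) -> None:
        # Assumed signature; the real store.create_opportunity() may differ.
        self.opportunities[thread_url] = {"title": title, "community": community}


def compile_patterns(keywords):
    """Compile keyword/regex strings into case-insensitive patterns."""
    return [re.compile(keyword, re.IGNORECASE) for keyword in keywords]


def matches_signal(post: dict, patterns) -> bool:
    """Return True if any pattern matches the post title or body."""
    text = f"{post.get('name', '')}\n{post.get('body') or ''}"
    return any(pattern.search(text) for pattern in patterns)


def ingest(posts, patterns, store, community: str) -> int:
    """Create opportunities for new matching posts; return how many were created."""
    created = 0
    for post in posts:
        thread_url = post["ap_id"]  # assumption: ap_id is the canonical thread URL
        if store.has_thread(thread_url):
            continue  # already queued, skip
        if not matches_signal(post, patterns):
            continue  # no keyword hit in title or body
        store.create_opportunity(
            thread_url=thread_url, title=post["name"], community=community
        )
        created += 1
    return created
```

Checking the dedup guard before matching keeps repeat polls cheap: a post that is already queued is skipped without running the patterns again.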

API reference

Lemmy API v3 is public and unauthenticated for read operations:

  • GET /api/v3/post/list?community_name=<name>&sort=New&limit=50
  • GET /api/v3/post?id=<id> (full post; on recent Lemmy versions comments are fetched separately via GET /api/v3/comment/list?post_id=<id>)
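
As an illustration of the post-list call, here is a small standard-library sketch. The function names, the `User-Agent` string, and the timeout are assumptions for the example, and the response is assumed to have the shape `{"posts": [{"post": {...}, ...}]}`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def post_list_url(instance: str, community: str, limit: int = 50) -> str:
    """Build the post-list URL for one community on one instance."""
    query = urlencode({"community_name": community, "sort": "New", "limit": limit})
    return f"https://{instance}/api/v3/post/list?{query}"


def extract_posts(payload: dict) -> list[dict]:
    """Pull the inner post objects out of a decoded response.

    Assumes the response shape {"posts": [{"post": {...}, ...}, ...]}.
    """
    return [item["post"] for item in payload.get("posts", []) if "post" in item]


def fetch_posts(instance: str, community: str, limit: int = 50,
                timeout: float = 10.0) -> list[dict]:
    """Fetch the newest posts of a community (performs a real HTTP request)."""
    request = Request(
        post_list_url(instance, community, limit),
        headers={
            "Accept": "application/json",
            # Hypothetical identifier; use whatever the project settles on.
            "User-Agent": "lemmy-crawler-sketch/0.1",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_posts(json.load(response))
```

Keeping URL construction and response parsing separate from the network call means both can be unit-tested without touching a live instance.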

Instances to support initially

  • reddthat.com
  • lemmy.world
  • lemmy.ml
  • beehaw.org

Notes

  • No auth needed for public communities
  • Rate limit conservatively (1 req/sec, exponential backoff on 429)
  • Record both the instance and the community in the community field, using the <community>@<instance> convention (e.g. selfhosted@lemmy.world)
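
The backoff and naming notes above might be sketched like this. All helper names and default values other than the one-second base are assumptions for illustration. The sketch covers retrying on 429 only; steady-state pacing at 1 req/sec would be handled by the caller.

```python
import time
from urllib.error import HTTPError


def split_community(ref: str) -> tuple[str, str]:
    """Split '<community>@<instance>' into (community, instance)."""
    community, sep, instance = ref.rpartition("@")
    if not sep or not community or not instance:
        raise ValueError(f"expected '<community>@<instance>', got {ref!r}")
    return community, instance


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   retries: int = 5, cap: float = 60.0) -> list[float]:
    """Exponentially growing delays in seconds, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def call_with_backoff(fetch, *, retries: int = 5, base: float = 1.0,
                      sleep=time.sleep):
    """Call fetch(); on HTTP 429, wait and retry with exponential backoff.

    Other HTTP errors propagate immediately. After `retries` waits, one
    final attempt is made and any error from it is raised to the caller.
    """
    for delay in backoff_delays(base=base, retries=retries):
        try:
            return fetch()
        except HTTPError as err:
            if err.code != 429:
                raise
            sleep(delay)  # rate limited: back off before the next attempt
    return fetch()
```

Passing `sleep` as a parameter lets tests substitute a recorder for `time.sleep`, so the retry schedule can be checked without waiting.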
Reference: Circuit-Forge/magpie#10