feat: GET /api/library/sample-chunks — corpus sampling endpoint for Avocet embed bench #6

Closed
opened 2026-05-06 09:24:15 -07:00 by pyr0ball · 0 comments
Owner

Context

Avocet's embedding model comparison harness (avocet#59) needs to pull a representative sample of text chunks from Pagepiper to use as a comparison corpus. There is currently no endpoint that returns raw page-level chunks without requiring a search query.

Endpoint

GET /api/library/sample-chunks?limit=N

Returns up to N page-level chunks sampled from the database. Random or ROWID order is fine; results are not search-ranked.

Response shape

[
  {
    "chunk_id": "abc123_p4",
    "doc_id": "abc123",
    "page_number": 4,
    "text": "full page text here..."
  },
  ...
]
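A minimal sketch of how the sampling and response shape above could be produced, assuming the chunks live in a SQLite table. The table name `page_chunks`, its column names, and the helpers `make_demo_db` and `sample_chunks` are hypothetical and are not taken from Pagepiper's actual schema or code.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple, Union

Chunk = Dict[str, Union[str, int]]


def make_demo_db(rows: Iterable[Tuple[str, str, int, str]]) -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical page_chunks table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_chunks ("
        "chunk_id TEXT PRIMARY KEY, doc_id TEXT, page_number INTEGER, text TEXT)"
    )
    conn.executemany("INSERT INTO page_chunks VALUES (?, ?, ?, ?)", list(rows))
    return conn


def sample_chunks(conn: sqlite3.Connection, limit: int = 50) -> List[Chunk]:
    """Return up to `limit` chunks in random order, shaped like the response above."""
    # ORDER BY RANDOM() is one option; the issue also allows plain ROWID order.
    rows = conn.execute(
        "SELECT chunk_id, doc_id, page_number, text "
        "FROM page_chunks ORDER BY RANDOM() LIMIT ?",
        (limit,),
    ).fetchall()
    # An empty table yields no rows, so the result is [] rather than an error.
    return [
        {"chunk_id": r[0], "doc_id": r[1], "page_number": r[2], "text": r[3]}
        for r in rows
    ]
```

Serializing the returned list with `json.dumps` gives a JSON array in the shape shown above. `ORDER BY RANDOM()` scans the whole table, which is likely acceptable for a bench tool but may be worth revisiting for very large libraries.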

Notes

  • limit defaults to 50, max 200.
  • No auth required (internal tool, same-host only).
  • Pagepiper must be running and the library non-empty for this to return results.
  • Avocet falls back to paste-mode corpus input if this endpoint is unreachable.
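The notes above can be sketched from the caller's side as follows. This is an illustration only: `clamp_limit`, `fetch_chunks`, and `load_corpus` are hypothetical names, the base URL is whatever address Pagepiper listens on, and clamping out-of-range values (rather than rejecting them) is an assumption the issue does not settle.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional

DEFAULT_LIMIT = 50   # "limit defaults to 50"
MAX_LIMIT = 200      # "max 200"


def clamp_limit(raw: Optional[int]) -> int:
    """Apply the default and the maximum from the notes."""
    if raw is None:
        return DEFAULT_LIMIT
    # Assumption: values outside 1..200 are clamped instead of rejected.
    return max(1, min(raw, MAX_LIMIT))


def fetch_chunks(base_url: str, limit: int) -> List[Dict]:
    """Fetch sampled chunks over HTTP; no auth header, per the notes."""
    url = f"{base_url}/api/library/sample-chunks?limit={limit}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def load_corpus(
    base_url: str,
    pasted_texts: List[str],
    limit: Optional[int] = None,
    fetch: Callable[[str, int], List[Dict]] = fetch_chunks,
) -> List[str]:
    """Use Pagepiper chunks when reachable, otherwise the pasted corpus."""
    try:
        chunks = fetch(base_url, clamp_limit(limit))
    except (OSError, ValueError):
        # URLError and connection failures are OSError subclasses;
        # a malformed JSON body raises a ValueError subclass.
        return list(pasted_texts)
    return [chunk["text"] for chunk in chunks]
```

The `fetch` parameter exists so the fallback path can be exercised in tests without a running Pagepiper instance.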

Acceptance criteria

  • GET /api/library/sample-chunks?limit=20 returns up to 20 chunks, each with chunk_id, doc_id, page_number, and text fields.
  • Returns [] (not 500) if the library is empty.
  • Covered by at least one unit test.
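One way the criteria above might be checked in a unit test is with a small shape validator applied to the parsed response body. The sketch below is hypothetical: `check_sample_response` is not an existing Pagepiper helper, and the field types (strings, plus an integer `page_number`) are inferred from the example response rather than stated in the issue.

```python
import unittest
from typing import Any, List

# Field types inferred from the example response (an assumption).
REQUIRED_FIELDS = {"chunk_id": str, "doc_id": str, "page_number": int, "text": str}


def check_sample_response(payload: Any, limit: int) -> List[str]:
    """Return a list of problems; an empty list means the payload passes."""
    if not isinstance(payload, list):
        return ["response body is not a JSON array"]
    problems = []
    if len(payload) > limit:
        problems.append(f"got {len(payload)} chunks, expected at most {limit}")
    for i, chunk in enumerate(payload):
        if not isinstance(chunk, dict):
            problems.append(f"chunk {i}: not a JSON object")
            continue
        for name, expected in REQUIRED_FIELDS.items():
            if not isinstance(chunk.get(name), expected):
                problems.append(
                    f"chunk {i}: field {name!r} missing or not {expected.__name__}"
                )
    return problems


class SampleChunksShapeTest(unittest.TestCase):
    GOOD = {"chunk_id": "abc123_p4", "doc_id": "abc123", "page_number": 4, "text": "t"}

    def test_empty_library_is_valid(self):
        # An empty library must produce [] rather than an error.
        self.assertEqual(check_sample_response([], limit=20), [])

    def test_well_formed_chunk_passes(self):
        self.assertEqual(check_sample_response([self.GOOD], limit=20), [])

    def test_missing_field_is_reported(self):
        bad = {k: v for k, v in self.GOOD.items() if k != "text"}
        self.assertEqual(len(check_sample_response([bad], limit=20)), 1)
```

Saved as a module, the test case can be run with `python -m unittest`. In a real test the payload would come from calling the endpoint through the application's test client instead of a literal.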

Reference: Circuit-Forge/pagepiper#6