Community recipe submission: dedup detection and variation clustering #119

Closed
opened 2026-04-22 12:14:30 -07:00 by pyr0ball · 0 comments
Owner

Context

Blocks the "create if not exists" sub-feature of kiwi#118 (community subcategory tagging). Before a user can submit a new recipe to the community pool, we need to prevent near-duplicate proliferation and model the variation relationship between recipes.

Problem

Without dedup/clustering, the community pool accumulates 40 versions of "chocolate chip cookies" and the tagging system becomes noise. The corpus already has 3.2M recipes — most dishes a user wants to contribute probably exist already under a different title.

Proposed Approach

Layer 1 — FTS title search (instant, at submission time)

Before accepting a submission, search corpus + community pool by title. Show top 5 matches: "These recipes look similar — is yours different?" The user can tag an existing recipe instead of creating a duplicate.

Layer 2 — Ingredient Jaccard overlap (cheap, in-process)

For the top FTS title hits, compute Jaccard similarity, |intersection| / |union|, on the ingredient_names JSON arrays treated as sets. Jaccard ≥ 0.7 → flag as "very similar". Present the similarity tier in the UI (very similar / somewhat similar / different) to help the user decide.
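
A minimal sketch of the Layer 2 check in Python, assuming ingredient names are compared as case-insensitive, whitespace-trimmed sets. The 0.7 cutoff comes from this issue. The 0.4 cutoff for "somewhat similar", the normalization step, and all function names are illustrative assumptions, not decided values.

```python
from __future__ import annotations

from typing import Iterable

VERY_SIMILAR = 0.7      # threshold stated in this issue
SOMEWHAT_SIMILAR = 0.4  # ASSUMPTION: the issue does not define this cutoff


def normalize(names: Iterable[str]) -> set[str]:
    """Lowercase and trim names so trivial differences do not lower the score."""
    return {n.strip().lower() for n in names if n.strip()}


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |intersection| / |union| of two ingredient lists, compared as sets."""
    set_a, set_b = normalize(a), normalize(b)
    union = set_a | set_b
    if not union:
        return 0.0  # both lists empty: treat as no evidence of similarity
    return len(set_a & set_b) / len(union)


def similarity_tier(score: float) -> str:
    """Map a Jaccard score to the UI tier named in this issue."""
    if score >= VERY_SIMILAR:
        return "very similar"
    if score >= SOMEWHAT_SIMILAR:
        return "somewhat similar"
    return "different"
```

Exact string matching after normalization will still treat "AP flour" and "all-purpose flour" as different ingredients, so scores from a sketch like this are a lower bound on real overlap.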

Layer 3 — Variation clustering (schema)

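Some recipes are legitimately different but belong to the same dish family (NY Style vs Neapolitan Pizza). Community-submitted recipes should be able to declare themselves a variation of a corpus or community recipe via similar_to_ref, qualified by similar_to_source. Because the target can be either a corpus integer ID or a community slug, this is a soft reference rather than a database-enforced foreign key. Browse can then surface or group variations.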
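
One way browse could group variations, sketched in Python under the assumption that each community row exposes slug, similar_to_source, and similar_to_ref. The `Recipe` dataclass and `group_variations` helper are hypothetical names. Only direct links are grouped; chains of variations are not followed.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    slug: str
    title: str
    similar_to_source: Optional[str] = None  # 'corpus' or 'community'
    similar_to_ref: Optional[str] = None     # corpus int ID (as text) or community slug


def group_variations(recipes: list[Recipe]) -> dict[tuple[str, str], list[Recipe]]:
    """Group recipes by the (source, ref) pair they declare themselves a variation of.

    Recipes that declare no parent are left out of the result.
    """
    clusters: dict[tuple[str, str], list[Recipe]] = defaultdict(list)
    for recipe in recipes:
        if recipe.similar_to_source and recipe.similar_to_ref:
            key = (recipe.similar_to_source, recipe.similar_to_ref)
            clusters[key].append(recipe)
    return dict(clusters)
```

Keying on the (source, ref) pair rather than on the ref alone avoids merging a corpus ID with a community slug that happens to have the same text.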

Schema Sketch

CREATE TABLE community_recipes (
    id              BIGSERIAL PRIMARY KEY,
    slug            TEXT NOT NULL UNIQUE,
    title           TEXT NOT NULL,
    ingredient_names JSONB NOT NULL DEFAULT '[]',
    body            JSONB NOT NULL,              -- full recipe content
    pseudonym       TEXT NOT NULL,
    source_product  TEXT NOT NULL DEFAULT 'kiwi',
    similar_to_source TEXT CHECK (similar_to_source IN ('corpus', 'community')),
    similar_to_ref  TEXT,                        -- corpus int ID or community slug
    upvotes         INTEGER NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

Acceptance Criteria

  • Submission flow runs FTS title search and shows similarity results before accepting
  • Jaccard overlap computed for top hits, shown as similarity tier
  • User can choose "tag existing instead" or "mine is different, proceed"
  • Schema supports similar_to_ref variation link
  • Browsing a subcategory can optionally surface variation clusters

Blocks

kiwi#118 "create if not exists" sub-feature only. The corpus-recipe tagging path in kiwi#118 can ship independently.

pyr0ball added this to the Public Launch milestone 2026-04-24 16:09:31 -07:00
pyr0ball added the enhancement label 2026-04-24 16:12:30 -07:00
Reference: Circuit-Forge/kiwi#119