L3/L4 recipe generation: SSE streaming for real-time token output #126

Closed

opened 2026-04-27 16:53:27 -07:00 by pyr0ball · 0 comments

pyr0ball commented

2026-04-27 16:53:27 -07:00

Owner

Problem

L3/L4 recipe suggestions currently block synchronously on LLM inference. On cloud, this means:

User sees a spinner for 30–90 seconds with no feedback
nginx proxy_read_timeout (default 60s, now raised to 180s as a short-term fix) is the only thing standing between a slow inference call and a 504
If inference takes longer than the timeout, the user gets a hard error with no recipe

Current flow (sync)

Frontend POST /recipes/suggest
  → nginx (blocks, waits for full response)
  → FastAPI (blocks in asyncio.to_thread, calls LLM)
  → LLMRouter → Ollama → full recipe text
  → 200 response (38–90s later)

Target flow (SSE streaming)

Frontend fetch POST /recipes/suggest?stream=true (ReadableStream, not EventSource, since we need POST)
  → nginx (proxy_buffering off, tokens flow through immediately)
  → FastAPI StreamingResponse (generator yields tokens)
  → LLMRouter.stream() → Ollama streaming API
  → first token arrives in ~1s, user sees recipe building in real time

What needs to change

Backend

LLMRouter.stream(prompt) — add streaming method to cf-core LLMRouter that yields tokens using Ollama streaming API (/api/generate with stream: true) and OpenAI-compat stream=True
LLMRecipeGenerator.stream_generate() — new method alongside generate() that returns an async generator of partial recipe text
POST /api/v1/recipes/suggest — add ?stream=true query param that returns StreamingResponse with Content-Type: text/event-stream; existing non-stream path stays for backwards compat
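
A minimal sketch of the backend token path, under stated assumptions. Ollama's /api/generate with stream: true returns newline-delimited JSON objects carrying a "response" fragment and a final "done": true. The names ollama_tokens, sse_frames and _fake_ollama are hypothetical, and JSON-encoding each token is an assumption made here so that newlines inside a token cannot break SSE framing; the issue does not fix a wire format.

```python
import asyncio
import json
from typing import AsyncIterator

DONE_SENTINEL = "[DONE]"  # sentinel named in the frontend plan


async def ollama_tokens(lines: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    async for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # Ollama marks the final object with done: true


async def sse_frames(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in an SSE frame and finish with the sentinel."""
    async for token in tokens:
        # Assumption: JSON-encode the token so embedded newlines stay on one data line.
        yield f"data: {json.dumps(token)}\n\n"
    yield f"data: {DONE_SENTINEL}\n\n"


async def _fake_ollama() -> AsyncIterator[bytes]:
    # Hypothetical stand-in for the streamed HTTP response body.
    yield b'{"response": "Preheat", "done": false}\n'
    yield b'{"response": " oven\\n", "done": false}\n'
    yield b'{"response": "", "done": true}\n'


async def _collect() -> list[str]:
    return [frame async for frame in sse_frames(ollama_tokens(_fake_ollama()))]


frames = asyncio.run(_collect())
```

In FastAPI, a generator like sse_frames(...) could be handed to StreamingResponse with media_type="text/event-stream"; how that wires into LLMRouter.stream() and stream_generate() is left open here.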

Frontend

api.ts — add suggestRecipesStream() using fetch + ReadableStream (not EventSource, since we need POST)
recipes.ts store — add streamingSuggestion: string state; accumulate tokens; parse complete recipe when [DONE] sentinel arrives
RecipesView.vue — show live token stream in a card while generating; replace with full card on completion
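
The accumulate-then-parse logic planned for the store can be sketched as follows. It is written in Python only to show the framing logic; the real store is TypeScript. StreamAccumulator is a hypothetical name, and the sketch assumes each data: payload is a JSON-encoded string followed by a blank line, which the issue does not specify. The point it illustrates is that network chunks from a ReadableStream do not align with SSE frames, so an incomplete tail has to be buffered.

```python
import json
from dataclasses import dataclass

DONE_SENTINEL = "[DONE]"


@dataclass
class StreamAccumulator:
    buffer: str = ""    # tail of a frame that has not fully arrived yet
    text: str = ""      # plays the role of streamingSuggestion
    done: bool = False  # set once the [DONE] sentinel is seen

    def feed(self, chunk: str) -> None:
        """Consume one network chunk; frames may be split across chunks."""
        self.buffer += chunk
        while "\n\n" in self.buffer and not self.done:
            frame, self.buffer = self.buffer.split("\n\n", 1)
            for line in frame.split("\n"):
                if not line.startswith("data: "):
                    continue  # ignore SSE comments and other fields
                payload = line[len("data: "):]
                if payload == DONE_SENTINEL:
                    self.done = True
                    break
                # Assumption: payload is a JSON string holding one token.
                self.text += json.loads(payload)


acc = StreamAccumulator()
# The first frame is deliberately split across two chunks.
for chunk in ['data: "Pre', 'heat"\n\ndata: " oven"\n\n', "data: [DONE]\n\n"]:
    acc.feed(chunk)
```

Once done is set, the accumulated text would be parsed into a full recipe and the live card swapped for the final one.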

nginx

nginx.cloud.conf: add proxy_buffering off to the /api/ location so SSE events pass through without nginx buffering the entire response
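
One possible complement, not part of the plan above: nginx also honors an X-Accel-Buffering: no response header, which disables buffering for that single response, so the other /api/ endpoints could keep buffering. A sketch of the headers the streaming response might send, with sse_response_headers as a hypothetical helper name:

```python
def sse_response_headers() -> dict[str, str]:
    """Headers for an SSE response that should not be buffered or cached."""
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",  # keep intermediaries from caching the stream
        "X-Accel-Buffering": "no",    # nginx: disable proxy buffering for this response
    }


headers = sse_response_headers()
```

This only works if the nginx location does not list X-Accel-Buffering in proxy_ignore_headers.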

Short-term workaround (already deployed)

proxy_read_timeout 180s / proxy_send_timeout 180s on both cloud nginx locations
_call_llm skips cold vllm allocations (warm=False) and falls back to LLMRouter → Ollama immediately
LLMRouter fallback excludes vllm backend to avoid double cf-orch allocation

Notes

cf-orch streaming proxy (/proxy/authorize + /proxy/stream) handles the vllm case when vllm IS warm — that path already produces SSE. The gap is Ollama (the active cloud fallback) which needs its own streaming plumbing.
The "recipe_llm async jobs not supported in CLOUD_MODE" warning in the logs refers to the existing async job pattern (background task + poll), which is a different mechanism from SSE streaming. SSE is preferable because the connection stays open and no polling is needed.
LLM recipe generation is the only endpoint that needs streaming; all other endpoints are fast enough for sync.
