L3/L4 recipe generation: SSE streaming for real-time token output #126
Reference: Circuit-Forge/kiwi#126
Problem
L3/L4 recipe suggestions currently block synchronously on LLM inference. On cloud, this means:
- proxy_read_timeout (defaulted to 60s, now bumped to 180s as a short-term fix) is the only thing preventing a 504
Current flow (sync)
Target flow (SSE streaming)
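A minimal end-to-end sketch of the streaming flow, using only the standard library. The `[DONE]` sentinel comes from this issue. Everything else is an assumption made for illustration: `fake_tokens` stands in for `LLMRouter.stream(prompt)`, the `{"token": ...}` JSON payload is one possible frame shape (chosen so newlines inside a token cannot break SSE framing), and `accumulate` mirrors in Python what the frontend store would do in TypeScript.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, Tuple

DONE = "[DONE]"  # end-of-stream sentinel named in this issue


async def fake_tokens(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for LLMRouter.stream(prompt)."""
    for tok in tokens:
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield tok


async def sse_frames(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Server side: wrap each token in one SSE frame, then emit the sentinel."""
    async for tok in tokens:
        payload = json.dumps({"token": tok})  # assumed payload shape
        yield f"data: {payload}\n\n"
    yield f"data: {DONE}\n\n"


def accumulate(raw: str) -> Tuple[str, bool]:
    """Client side: rebuild the text from a buffer of complete SSE frames.

    Returns (text_so_far, done).
    """
    parts = []
    for frame in raw.split("\n\n"):
        if not frame.startswith("data: "):
            continue  # skip blank trailers and anything that is not a data frame
        payload = frame[len("data: "):]
        if payload == DONE:
            return "".join(parts), True
        parts.append(json.loads(payload)["token"])
    return "".join(parts), False


async def demo() -> Tuple[str, bool]:
    frames = [f async for f in sse_frames(fake_tokens(["Tomato", " soup", "\nServes 2"]))]
    return accumulate("".join(frames))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real handler the `sse_frames` generator would be what gets passed to the streaming response; the client would call something like `accumulate` on each received chunk and only parse the full recipe once `done` is true.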
What needs to change
Backend
- LLMRouter.stream(prompt): add streaming method to cf-core LLMRouter that yields tokens using Ollama streaming API (/api/generate with stream: true) and OpenAI-compat stream=True
- LLMRecipeGenerator.stream_generate(): new method alongside generate() that returns an async generator of partial recipe text
- POST /api/v1/recipes/suggest: add ?stream=true query param that returns StreamingResponse with Content-Type: text/event-stream; existing non-stream path stays for backwards compat
Frontend
- api.ts: add suggestRecipesStream() using fetch + ReadableStream (not EventSource, since we need POST)
- recipes.ts store: add streamingSuggestion: string state; accumulate tokens; parse complete recipe when [DONE] sentinel arrives
- RecipesView.vue: show live token stream in a card while generating; replace with full card on completion
nginx
- nginx.cloud.conf: add proxy_buffering off on the /api/ location so SSE flows through without nginx buffering the entire response
Short-term workaround (already deployed)
- proxy_read_timeout 180s / proxy_send_timeout 180s on both cloud nginx locations
- _call_llm skips cold vllm allocations (warm=False) and falls back to LLMRouter → Ollama immediately
- vllm backend to avoid double cf-orch allocation
Notes
- /proxy/authorize + /proxy/stream) handles the vllm case when vllm IS warm; that path already produces SSE. The gap is Ollama (the active cloud fallback) which needs its own streaming plumbing.
- recipe_llm async jobs not supported in CLOUD_MODE warning in logs: the existing async job pattern (background task + poll) is a different mechanism from SSE streaming. SSE is preferable because the connection stays open and no polling is needed.
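For the Ollama gap mentioned above, a rough sketch of the parsing half of that plumbing. With stream: true, Ollama's /api/generate sends one JSON object per line, each carrying a "response" fragment, with "done": true on the last one. The function name `ollama_tokens` and the sample lines are hypothetical, and the HTTP call itself is left out; this only shows how the line stream could be turned into tokens.

```python
import json
from typing import Iterable, Iterator


def ollama_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from the lines of an Ollama /api/generate stream.

    Hypothetical helper: `lines` would come from iterating the HTTP response
    body line by line; here it can be any iterable of bytes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        fragment = chunk.get("response", "")
        if fragment:
            yield fragment
        if chunk.get("done"):
            break  # final object reached; ignore anything after it


# Made-up sample data in the shape described above.
sample = [
    b'{"model":"m","response":"Tom","done":false}\n',
    b'{"model":"m","response":"ato","done":false}\n',
    b'{"model":"m","response":"","done":true}\n',
    b'{"model":"m","response":"ignored","done":false}\n',
]
recipe_text = "".join(ollama_tokens(sample))
```

A streaming method on the router could wrap a generator like this one and forward each fragment as an SSE frame, so the Ollama fallback would produce the same kind of stream as the warm vllm path.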