snipe/config/llm.cloud.yaml
pyr0ball cc997c09e3 refactor: rename CF_ORCH_URL → GPU_SERVER_URL (backward-compat alias kept)
GPU_SERVER_URL is the self-explanatory name a self-hoster can understand
without knowing CircuitForge internals. CF_ORCH_URL continues to work as
a drop-in fallback alias (runner.py, main.py both check GPU_SERVER_URL
first, then CF_ORCH_URL).

Updated everywhere the env var is referenced or documented:
- app/tasks/runner.py
- api/main.py
- app/llm/router.py
- .env.example (alias note added)
- compose.override.yml
- compose.cloud.yml
- config/llm.cloud.yaml
- tests/test_tasks/test_runner.py (primary key updated; 13/13 still pass)

Follows the GPU_SERVER_URL convention established in kiwi (see kiwi
app/core/config.py).

Closes: #55
2026-05-21 15:05:27 -07:00

38 lines
1.2 KiB
YAML

# config/llm.cloud.yaml
# Snipe — LLM config for the managed cloud instance (menagerie)
#
# Mounted read-only into the cloud API container at /app/config/llm.yaml
# (see compose.cloud.yml). Personal fine-tunes and local-only backends
# (claude_code, copilot) are intentionally excluded here.
#
# CF Orchestrator routes both ollama and vllm allocations for VRAM-aware
# scheduling. GPU_SERVER_URL must be set in .env for allocations to resolve;
# if cf-orch is unreachable the backend falls back to its static base_url.
#
# Model choice for query builder: llama3.1:8b
# - Reliable instruction following and JSON output
# - No creative fine-tuning drift (unlike writer models in the pool)
# - Fits comfortably in 8 GB VRAM alongside other services
backends:
ollama:
type: openai_compat
base_url: http://host.docker.internal:11434/v1
api_key: ollama
model: llama3.1:8b
enabled: true
supports_images: false
cf_orch:
service: ollama
ttl_s: 300
anthropic:
type: anthropic
api_key_env: ANTHROPIC_API_KEY
model: claude-haiku-4-5-20251001
enabled: false
supports_images: false
fallback_order:
- ollama
- anthropic