# Docker Profiles

Peregrine uses Docker Compose profiles to start only the services your hardware can support. Choose a profile with `make start PROFILE=`.

---

## Profile Reference

| Profile | Services started | Use case |
|---------|----------------|----------|
| `remote` | `app`, `searxng` | No GPU. LLM calls go to an external API (Anthropic, OpenAI-compatible). |
| `cpu` | `app`, `ollama`, `searxng` | No GPU. Runs local models on CPU — functional but slow. |
| `single-gpu` | `app`, `ollama`, `vision`, `searxng` | One NVIDIA GPU. Covers cover letters, research, and vision (survey screenshots). |
| `dual-gpu` | `app`, `ollama`, `vllm`, `vision`, `searxng` | Two NVIDIA GPUs. GPU 0 = Ollama (cover letters), GPU 1 = vLLM (research). |

---

## Service Descriptions

| Service | Image / Source | Port | Purpose |
|---------|---------------|------|---------|
| `app` | `Dockerfile` (Streamlit) | 8501 | The main Peregrine UI |
| `ollama` | `ollama/ollama` | 11434 | Local model inference — cover letters and general tasks |
| `vllm` | `vllm/vllm-openai` | 8000 | High-throughput local inference — research tasks |
| `vision` | `scripts/vision_service/` | 8002 | Moondream2 — survey screenshot analysis |
| `searxng` | `searxng/searxng` | 8888 | Private meta-search engine — company research web scraping |

---

## Choosing a Profile

### remote

Use `remote` if:
- You have no NVIDIA GPU
- You plan to use Anthropic Claude or another API-hosted model exclusively
- You want the fastest startup (only two containers)

You must configure at least one external LLM backend in **Settings → LLM Backends**.

### cpu

Use `cpu` if:
- You have no GPU but want to run models locally (e.g. for privacy)
- Acceptable for light use; cover letter generation may take several minutes per request

Pull a model after the container starts:

```bash
docker exec -it peregrine-ollama-1 ollama pull llama3.1:8b
```

### single-gpu

Use `single-gpu` if:
- You have one NVIDIA GPU with at least 8 GB VRAM
- Recommended for most single-user installs
- The vision service (Moondream2) starts on the same GPU using 4-bit quantisation (~1.5 GB VRAM)

### dual-gpu

Use `dual-gpu` if:
- You have two or more NVIDIA GPUs
- GPU 0 handles Ollama (cover letters, quick tasks)
- GPU 1 handles vLLM (research, long-context tasks)
- The vision service shares GPU 0 with Ollama

---

## GPU Memory Guidance

| GPU VRAM | Recommended profile | Notes |
|----------|-------------------|-------|
| < 4 GB | `cpu` | GPU too small for practical model loading |
| 4–8 GB | `single-gpu` | Run smaller models (3B–8B parameters) |
| 8–16 GB | `single-gpu` | Run 8B–13B models comfortably |
| 16–24 GB | `single-gpu` | Run 13B–34B models |
| 24 GB+ | `single-gpu` or `dual-gpu` | 70B models with quantisation |

---

## How preflight.py Works

`make start` calls `scripts/preflight.py` before launching Docker. Preflight does the following:

1. **Port conflict detection** — checks whether `STREAMLIT_PORT`, `OLLAMA_PORT`, `VLLM_PORT`, `SEARXNG_PORT`, and `VISION_PORT` are already in use. Reports any conflicts and suggests alternatives.
2. **GPU enumeration** — queries `nvidia-smi` for GPU count and VRAM per card.
3. **RAM check** — reads `/proc/meminfo` (Linux) or `vm_stat` (macOS) to determine available system RAM.
4. **KV cache offload** — if GPU VRAM is less than 10 GB, preflight calculates `CPU_OFFLOAD_GB` (the amount of KV cache to spill to system RAM) and writes it to `.env`. The vLLM container picks this up via `--cpu-offload-gb`.
5. **Profile recommendation** — writes `RECOMMENDED_PROFILE` to `.env`. This is informational; `make start` uses the `PROFILE` variable you specify (defaulting to `remote`).
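
To make step 1 concrete, here is a minimal sketch of how a port conflict check of this kind could work. It is not the code in `scripts/preflight.py`: the function names `port_in_use`, `find_conflicts`, and `suggest_alternative` are hypothetical, and the approach (try to bind each port on the loopback interface) is an assumption about one reasonable way to do it. The port numbers are the defaults listed in the service table.

```python
import socket

# Default ports from the service table; the real values come from .env.
DEFAULT_PORTS = {
    "STREAMLIT_PORT": 8501,
    "OLLAMA_PORT": 11434,
    "VLLM_PORT": 8000,
    "SEARXNG_PORT": 8888,
    "VISION_PORT": 8002,
}


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding host:port fails (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
    return False


def find_conflicts(ports):
    """Return the subset of {name: port} whose port is already taken."""
    return {name: port for name, port in ports.items() if port_in_use(port)}


def suggest_alternative(port, max_tries=100):
    """Return the first free port above `port`, or None if none is found."""
    for candidate in range(port + 1, port + 1 + max_tries):
        if candidate <= 65535 and not port_in_use(candidate):
            return candidate
    return None
```

Calling `find_conflicts(DEFAULT_PORTS)` returns an empty dict when every port is free. For each conflicting entry, `suggest_alternative(port)` gives a candidate value you could write to `.env` instead.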

You can run preflight independently:

```bash
make preflight
# or
python scripts/preflight.py
```

---

## Customising Ports

Edit `.env` before running `make start`:

```bash
STREAMLIT_PORT=8501
OLLAMA_PORT=11434
VLLM_PORT=8000
SEARXNG_PORT=8888
VISION_PORT=8002
```

All containers read from `.env` via the `env_file` directive in `compose.yml`.
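
If you edit these values by hand, a quick sanity check can catch typos before `make start`. The sketch below is a hypothetical helper, not part of Peregrine: `parse_env` and `check_ports` are illustrative names, and the parser handles only plain `KEY=VALUE` lines, which is a simplification of the full `.env` syntax Docker Compose accepts.

```python
PORT_KEYS = (
    "STREAMLIT_PORT",
    "OLLAMA_PORT",
    "VLLM_PORT",
    "SEARXNG_PORT",
    "VISION_PORT",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_ports(env):
    """Return a list of human-readable problems with the port settings."""
    problems = []
    seen = {}  # port number -> first key that claimed it
    for key in PORT_KEYS:
        if key not in env:
            continue  # unset keys are left to the project's defaults
        try:
            port = int(env[key])
        except ValueError:
            problems.append(f"{key}: not an integer: {env[key]!r}")
            continue
        if not 1 <= port <= 65535:
            problems.append(f"{key}: {port} is outside 1-65535")
        elif port in seen:
            problems.append(f"{key}: {port} is already used by {seen[port]}")
        else:
            seen[port] = key
    return problems
```

For example, `check_ports(parse_env(open(".env").read()))` returns an empty list when every configured port is a distinct integer in the valid range.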