Troubleshooting

This page covers common issues you may encounter when self-hosting Honcho, what causes them, and how to fix them.

Startup Failures

Server won’t start: “Missing client for …”

ValueError: Missing client for Deriver: google

Cause: The server validates at startup that all configured LLM providers have API keys. If a provider is referenced in your configuration but the corresponding API key isn’t set, the server refuses to start. Fix: Set the API keys for your configured providers. With default configuration, you need:

LLM_GEMINI_API_KEY=...    # Used by deriver, summary, dialectic minimal/low
LLM_ANTHROPIC_API_KEY=... # Used by dialectic medium/high/max, dream
LLM_OPENAI_API_KEY=...    # Used by embeddings (when EMBED_MESSAGES=true)

See the LLM Setup section for provider configuration. You can change which providers are used in your .env or config.toml (see Configuration Guide).

Server won’t start: “JWT_SECRET must be set”

ValueError: JWT_SECRET must be set if USE_AUTH is true

Cause: You enabled authentication (AUTH_USE_AUTH=true) but didn’t provide a JWT secret. Fix: Generate a secret and set it:

python scripts/generate_jwt_secret.py
# Then set the output as:
AUTH_JWT_SECRET=<generated_secret>

Or disable authentication for local development: AUTH_USE_AUTH=false

Runtime Errors

API returns “An unexpected error occurred” on every request

Cause: This is almost always a database issue. The health endpoint (/health) will return {"status": "ok"} even when the database is unreachable because it doesn’t check the database connection. The actual error appears in the server logs. Common causes and fixes:

Database is unreachable — Check that PostgreSQL is running and the DB_CONNECTION_URI is correct
Migrations haven’t been run — The server starts successfully without tables, but every API call will fail. Run:
```
uv run alembic upgrade head
```
In Docker:
```
docker compose exec api uv run alembic upgrade head
```
pgvector extension not installed — The vector extension must be enabled in your database:
```
CREATE EXTENSION IF NOT EXISTS vector;
```

How to diagnose: Check the server logs for the actual error. Look for:

sqlalchemy.exc.OperationalError — database connection issue
sqlalchemy.exc.ProgrammingError with “relation does not exist” — migrations not run
psycopg.OperationalError — connection refused or authentication failed

Health check passes but API calls fail

The /health endpoint is a lightweight check that confirms the server process is running. It does not verify:

Database connectivity
That migrations have been run
That LLM providers are reachable

To verify full functionality, try creating a workspace:

curl -X POST http://localhost:8000/v3/workspaces \
  -H "Content-Type: application/json" \
  -d '{"name": "test"}'

If this succeeds, your database connection and migrations are working.

Deriver not processing messages

Messages are stored but no observations, summaries, or representations are being generated. Common causes:

Deriver isn’t running — In manual setup, the deriver is a separate process:
```
uv run python -m src.deriver
```
In Docker, it starts automatically via docker compose up.
Deriver can’t reach the database — Check deriver logs for connection errors. The deriver uses the same DB_CONNECTION_URI as the API server.
Missing LLM API key for deriver provider — By default the deriver uses Google Gemini (LLM_GEMINI_API_KEY). Check deriver logs for API errors.
Processing backlog — With DERIVER_WORKERS=1 (default), high message volume can cause a backlog. Increase workers:
```
DERIVER_WORKERS=4
```
Representation Batch Max — By default the deriver buffers representation work until a session has enough tokens for that representation, set via DERIVER_REPRESENTATION_BATCH_MAX_TOKENS. Sub-threshold tails become eligible after DERIVER_REPRESENTATION_BATCH_MAX_AGE_SECONDS (default 1800 seconds), so quiet sessions eventually flush without disabling batching globally. Set the age to 0 for legacy behavior where sub-threshold tails wait indefinitely. See token batching for more details

Alternative Provider Issues

OpenRouter / custom provider not working

If calls to an OpenAI-compatible proxy fail:

Verify the endpoint and key are set. Use transport = "openai" with a base URL override:

LLM_OPENAI_API_KEY=sk-or-v1-...
DERIVER_MODEL_CONFIG__OVERRIDES__BASE_URL=https://openrouter.ai/api/v1

Check model names match the provider’s format. OpenRouter uses vendor/model format (e.g., anthropic/claude-haiku-4-5), not the raw model ID.
Ensure your model supports tool calling. The deriver, dialectic, and dream agents require tool use. Check the provider’s model page for tool calling support.
Check server logs for the actual error. API errors from the upstream provider will appear in Honcho’s logs with the HTTP status code and message body.

vLLM / Ollama not responding

Verify the model server is running and accessible from the Honcho process (or container):

curl http://localhost:8000/v1/models   # vLLM
curl http://localhost:11434/v1/models  # Ollama

In Docker, localhost inside a container doesn’t reach the host. Use host.docker.internal (macOS/Windows) or the host’s network IP:
```
DERIVER_MODEL_CONFIG__OVERRIDES__BASE_URL=http://host.docker.internal:8000/v1
```
Structured output failures — vLLM’s structured output support is limited to certain response formats. If you see JSON parsing errors, check the deriver/dream logs for the raw response. See Deriver produces no observations below.

Deriver produces no observations

If messages are processed (the queue drains, no errors in logs) but peers never accumulate observations — and you’re using an OpenAI-compatible provider — the likely cause is that the provider doesn’t support OpenAI Structured Outputs (json_schema). The OpenAI backend requests json_schema by default; providers like Z.AI GLM and some Ollama/vLLM deployments either reject it or silently ignore it and return prose, which the deriver can’t parse into observations. Fix: set STRUCTURED_OUTPUT_MODE=json_object on the deriver’s model config to request loose JSON mode, which injects the schema into the prompt instead:

DERIVER_MODEL_CONFIG__STRUCTURED_OUTPUT_MODE=json_object

This is a per-model-config setting on the OpenAI transport; set it on whichever features use the affected provider (e.g. DREAM_DEDUCTION_MODEL_CONFIG__STRUCTURED_OUTPUT_MODE).

Thinking budget errors with non-Anthropic providers

If you see errors like thinking budget not supported, invalid parameter, or silent failures where agents produce no output, one of your per-component *_MODEL_CONFIG__THINKING_BUDGET_TOKENS overrides is likely set to a value > 0 with a provider that doesn’t support Anthropic-style extended thinking. The built-in defaults do not set thinking budgets, so this only applies if you added those overrides yourself. Fix: Set *_MODEL_CONFIG__THINKING_BUDGET_TOKENS=0 for every component when using models that don’t support thinking:

DERIVER_MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
SUMMARY_MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DREAM_DEDUCTION_MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DREAM_INDUCTION_MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DIALECTIC_LEVELS__minimal__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DIALECTIC_LEVELS__low__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DIALECTIC_LEVELS__medium__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DIALECTIC_LEVELS__high__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0
DIALECTIC_LEVELS__max__MODEL_CONFIG__THINKING_BUDGET_TOKENS=0

For OpenAI reasoning models, use *_MODEL_CONFIG__THINKING_EFFORT instead of *_MODEL_CONFIG__THINKING_BUDGET_TOKENS.

Database Issues

Connection string format

The connection URI must use the postgresql+psycopg prefix:

# Correct
DB_CONNECTION_URI=postgresql+psycopg://postgres:postgres@localhost:5432/postgres

# Wrong - will fail
DB_CONNECTION_URI=postgresql://postgres:postgres@localhost:5432/postgres
DB_CONNECTION_URI=postgres://postgres:postgres@localhost:5432/postgres

Checking migration status

# See current migration version
uv run alembic current

# See migration history
uv run alembic history

# Upgrade to latest
uv run alembic upgrade head

Cache & Redis

Redis is optional

Redis is used for caching when CACHE_ENABLED=true (default: false). If Redis is unreachable, Honcho gracefully falls back to in-memory caching and logs a warning. This means:

The server and deriver will still start and function normally
Performance may be reduced under high load without Redis
You do not need Redis for local development or testing

Redis connection issues

If you see Redis connection warnings in logs but CACHE_ENABLED=false, they can be safely ignored. If you want caching:

# Start Redis via Docker
docker run -d -p 6379:6379 redis:latest

# Configure Honcho
CACHE_ENABLED=true
CACHE_URL=redis://localhost:6379/0

Docker Issues

Docker build fails with permission errors

The Honcho Dockerfile uses BuildKit mount syntax and creates a non-root app user. Common build failures: 1. BuildKit not enabled The Dockerfile uses RUN --mount=type=cache which requires Docker BuildKit. If you see syntax errors during build:

# Ensure BuildKit is enabled
DOCKER_BUILDKIT=1 docker compose build

Or add to your Docker daemon config (/etc/docker/daemon.json):

{ "features": { "buildkit": true } }

2. Permission denied during build or at runtime (Linux) On Linux, AppArmor or SELinux can block Docker build operations and volume mounts. Symptoms include permission denied errors during COPY, RUN, or when the container tries to access mounted volumes.

# Check if AppArmor is blocking Docker
sudo aa-status | grep docker

# Temporarily test without AppArmor (for diagnosis only)
docker compose down
sudo aa-remove-unknown
docker compose up -d

For SELinux, add :z to volume mounts in docker-compose.yml:

volumes:
  - .:/app:z

3. Volume mount UID mismatch The Dockerfile creates a non-root app user, but docker-compose.yml.example mounts .:/app which overlays the container filesystem with host-owned files. The app user inside the container may not have permission to read them. If you see permission errors at runtime (not build time), you can either:

Run without the source mount (remove - .:/app from volumes — the image already contains the code)
Or fix ownership: sudo chown -R 100:101 . (matches the app user inside the container)

Containers start but API fails

Check container status: docker compose ps
Check API logs: docker compose logs api
Check database logs: docker compose logs database
Ensure migrations ran: docker compose exec api uv run alembic upgrade head

Port conflicts

If port 8000 is already in use:

# Check what's using the port
lsof -i :8000

# Or change the port mapping in docker-compose.yml
ports:
  - "8001:8000"  # Map to a different host port

Rebuilding after code changes

docker compose build --no-cache
docker compose up -d

Getting Help

If your issue isn’t covered here:

Check the logs — most issues are diagnosed from server or deriver logs
GitHub Issues — Report bugs
Discord — Join our community
Configuration — See the Configuration Guide for all available settings

​Startup Failures

​Server won’t start: “Missing client for …”

​Server won’t start: “JWT_SECRET must be set”

​Runtime Errors

​API returns “An unexpected error occurred” on every request

​Health check passes but API calls fail

​Deriver not processing messages

​Alternative Provider Issues

​OpenRouter / custom provider not working

​vLLM / Ollama not responding

​Deriver produces no observations

​Thinking budget errors with non-Anthropic providers

​Database Issues

​Connection string format

​Checking migration status

​Cache & Redis

​Redis is optional

​Redis connection issues

​Docker Issues

​Docker build fails with permission errors

​Containers start but API fails

​Port conflicts

​Rebuilding after code changes

​Getting Help