Cog containers are Docker containers that run an HTTP server for making predictions with your model. You can deploy them anywhere that Docker containers run.
The server inside Cog containers is coglet, a Rust-based prediction server that handles HTTP requests, worker process management, and prediction execution.
This guide assumes you have a model packaged with Cog. If you don't, follow our getting started guide, or use an example model.
First, build your model:
cog build -t my-model
You can serve predictions locally with cog serve:
cog serve
# or, from a built image:
cog serve my-model
Alternatively, start the Docker container directly:
# If your model uses a CPU:
docker run -d -p 5001:5000 my-model
# If your model uses a GPU:
docker run -d -p 5001:5000 --gpus all my-model
The server listens on port 5000 inside the container (mapped to 5001 above).
To view the OpenAPI schema, open localhost:5001/openapi.json in your browser or use cURL to make a request:
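curl http://localhost:5001/openapi.json
To stop the server, pass docker kill the container ID that docker run -d printed (docker ps also lists it). The image name my-model is not a container name unless you started the container with --name:
docker kill <container-id>
To run a prediction on the model, call the /predictions endpoint, passing input in the format expected by your model: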
curl http://localhost:5001/predictions -X POST \
    --header "Content-Type: application/json" \
    --data '{"input": {"image": "https://.../input.jpg"}}'
For more details about the HTTP API, see the HTTP API reference documentation.
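If you call the endpoint from scripts, it can help to wrap the request in a small shell function. The following is a minimal sketch rather than part of Cog: the function names prediction_body and predict are hypothetical, and it assumes the server is reachable at the base URL you pass in.

```shell
# Wrap a JSON object of model inputs in the {"input": ...} envelope
# that the /predictions endpoint expects.
prediction_body() {
  printf '{"input": %s}' "$1"
}

# POST one prediction and print the raw JSON response on stdout.
# $1: base URL (for example http://localhost:5001)
# $2: JSON object of model inputs
predict() {
  curl -sS -X POST "$1/predictions" \
    --header "Content-Type: application/json" \
    --data "$(prediction_body "$2")"
}
```

With the container from the earlier steps running, predict http://localhost:5001 '{"image": "https://.../input.jpg"}' sends the same request as the cURL command above.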
The server exposes a GET /health-check endpoint that returns the current status of the model container. Use this for readiness probes in orchestration systems like Kubernetes.
curl http://localhost:5001/health-check
The response includes a status field with values like STARTING, READY, BUSY, SETUP_FAILED, or DEFUNCT. See the HTTP API reference for full details.
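Outside an orchestrator, a deploy script can poll the same endpoint until the model is ready. The sketch below is one way to do that, not an official tool: health_status and wait_until_ready are hypothetical names, the attempt count and delay are arbitrary defaults, and the status is extracted with grep and sed on the assumption that the top-level status value is the first all-uppercase one in the body. If jq is available, it is a more robust way to read the field.

```shell
# Print the first all-uppercase "status" value from a JSON body on stdin.
# Only uppercase values such as READY match, so lowercase status fields
# elsewhere in the body are ignored. This is a heuristic, not a JSON parser.
health_status() {
  grep -oE '"status"[[:space:]]*:[[:space:]]*"[A-Z_]+"' \
    | head -n 1 \
    | sed 's/.*"\([A-Z_]*\)"$/\1/'
}

# Poll $1/health-check until the status is READY.
# $1: base URL, $2: max attempts (default 60), $3: seconds between attempts (default 1)
# Returns 0 when READY, 1 on SETUP_FAILED or DEFUNCT, 2 if attempts run out.
wait_until_ready() {
  base_url=$1
  max_attempts=${2:-60}
  delay=${3:-1}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(curl -fsS "$base_url/health-check" 2>/dev/null | health_status)
    case $status in
      READY) return 0 ;;
      SETUP_FAILED|DEFUNCT)
        echo "model container is $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "timed out waiting for READY (last status: ${status:-unknown})" >&2
  return 2
}
```

For example, wait_until_ready http://localhost:5001 120 1 && echo "model is ready" waits up to roughly two minutes. STARTING and BUSY are treated as "keep waiting", while the two failure states stop the loop early.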
By default, the server processes one prediction at a time. To enable concurrent predictions, set the concurrency.max option in cog.yaml:
concurrency:
  max: 4
See the cog.yaml reference for more details.
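To check that concurrent predictions work, you can send several requests at once from the shell. This is a rough sketch under stated assumptions: predict_concurrently is a hypothetical helper, the inputs are placeholders for whatever your model accepts, and it does not cover how the server responds when you send more requests than concurrency.max allows.

```shell
# Send one prediction per JSON input, all in parallel, and save each
# response to OUT_DIR/response-N.json.
# $1: base URL, $2: output directory, remaining args: JSON input objects
predict_concurrently() {
  base_url=$1
  out_dir=$2
  shift 2
  i=0
  for input in "$@"; do
    i=$((i + 1))
    # Each request runs as a background job so they overlap in time.
    curl -sS -X POST "$base_url/predictions" \
      --header "Content-Type: application/json" \
      --data "{\"input\": $input}" \
      > "$out_dir/response-$i.json" &
  done
  wait  # block until every background request has finished
}
```

With concurrency.max set to 4, calling the function with four inputs, such as predict_concurrently http://localhost:5001 ./responses '{"image": "https://.../input.jpg"}' repeated for each input, should take about as long as the slowest single prediction if the requests really run side by side.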
You can configure runtime behavior with environment variables:
COG_SETUP_TIMEOUT: Maximum time in seconds for the setup() method (default: no timeout).
See the environment variables reference for the full list.
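When you start the container with docker run, you can pass these variables with Docker's -e flag. The helper below is an illustrative sketch: run_with_setup_timeout is a hypothetical name, and it reuses the port mapping from the earlier examples.

```shell
# Start the model container in the background with a setup timeout.
# $1: image name, $2: timeout in seconds
run_with_setup_timeout() {
  docker run -d -p 5001:5000 -e "COG_SETUP_TIMEOUT=$2" "$1"
}
```

For example, run_with_setup_timeout my-model 300 gives setup() five minutes to finish. The value 300 is only an example, so pick a limit that fits how long your model takes to load.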