2026-06-10 12:46:50 +02:00
# nix-builder-autoscaler
`nix-builder-autoscaler` provides elastic Nix remote builder capacity for
[Guardian Project](https://guardianproject.info) jobs using [buildbot-nix](https://github.com/nix-community/buildbot-nix).

It exists because we don't have the budget for dedicated, always-on build hardware.
Buildbot waits for builder capacity before it starts a `nix-build` job, and idle
builders disappear some time after jobs finish.

The autoscaler launches EC2 nix builder instances, waits until they are reachable
through Tailscale and HAProxy, hands Buildbot a reservation for a ready slot,
and later drains and terminates unused capacity. It uses EC2 Spot by default and
can use on-demand instances for nested virtualization workloads when configured.

The Buildbot instance has a single-master, single-worker configuration, as upstream buildbot-nix expects. HAProxy presents a single logical nix-builder host to buildbot-nix; this design was inspired by [Garnix's yensid](https://web.archive.org/web/20260530230732/https://garnix.io/blog/yensid/).

## Pieces
The project has two main runtime pieces:
1. `agent/`: the autoscaler daemon and `autoscalerctl` CLI.
The daemon owns the slot database, reservation API, scheduler, EC2 runtime,
HAProxy binding, health checks, and metrics.
2. `buildbot-ext/`: the Buildbot integration.
The extension patches Buildbot `*/nix-build` builders with a capacity gate
step at the beginning and a reservation release step at the end. It also
lets the Buildbot host send Nix distributed builds through the HAProxy-backed
builder cluster.
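
The patching described for `buildbot-ext/` can be pictured with a small sketch. It is illustrative only and is not the extension's actual code: builders are modelled as plain lists of step names, and `GATE_STEP`, `RELEASE_STEP`, and `patch_builders` are hypothetical names chosen for this example.

```python
from fnmatch import fnmatch

# Hypothetical step names; the real extension uses Buildbot step objects.
GATE_STEP = "capacity-gate"
RELEASE_STEP = "release-reservation"


def patch_builders(builders: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a copy of `builders` with gate/release steps around nix-build builders."""
    patched = {}
    for name, steps in builders.items():
        if fnmatch(name, "*/nix-build"):
            # Gate first, original steps in the middle, release last.
            patched[name] = [GATE_STEP, *steps, RELEASE_STEP]
        else:
            # Builders that do not match the pattern are left unchanged.
            patched[name] = list(steps)
    return patched


builders = {
    "myproject/nix-build": ["nix-build"],
    "myproject/nix-eval": ["nix-eval"],
}
patched = patch_builders(builders)
```

Only builders whose name matches `*/nix-build` gain the two extra steps; everything else passes through untouched.
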
The `nix/modules/` directory contains NixOS modules that package and wire these
pieces into hosts:
- `services.nix-builder-autoscaler` runs the daemon and can generate the HAProxy
slot configuration.
- `services.buildbot-nix.nix-build-autoscaler` installs the Buildbot extension
and configures Nix remote builder access.
## How it works
Buildbot (via the extension) creates a reservation before a Nix build starts.
The autoscaler assigns that reservation to a ready slot if one exists. If no
ready slot has capacity, the scheduler launches an EC2 instance into an empty
slot, subject to the configured minimum, maximum, warm pool, and timeout
settings.
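
As a rough illustration of that decision, the sketch below prefers a ready slot and otherwise launches into an empty one while staying under a maximum. It is a simplification under stated assumptions: each slot holds at most one reservation, the `empty` state name and the `Slot` and `assign_or_launch` names are hypothetical, and the minimum, warm pool, and timeout settings are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    slot_id: str
    state: str = "empty"  # assumed name for a slot with no instance
    reservation: Optional[str] = None


def assign_or_launch(slots: list[Slot], reservation_id: str, max_slots: int) -> str:
    """Assign a reservation to a ready slot, or start a launch if allowed."""
    # 1. Prefer a ready slot that is not already reserved.
    for slot in slots:
        if slot.state == "ready" and slot.reservation is None:
            slot.reservation = reservation_id
            return f"assigned:{slot.slot_id}"

    # 2. Otherwise launch into an empty slot, respecting the maximum.
    active = sum(1 for s in slots if s.state != "empty")
    if active < max_slots:
        for slot in slots:
            if slot.state == "empty":
                slot.state = "launching"
                slot.reservation = reservation_id
                return f"launching:{slot.slot_id}"

    # 3. No capacity right now; the reservation stays pending.
    return "pending"
```

For example, with one ready slot, one empty slot, and `max_slots=2`, the first reservation is assigned immediately, the second triggers a launch, and a third stays pending.
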
The reconciler moves each slot through the runtime states:
1. `launching`: EC2 accepted the instance launch.
2. `booting`: the instance is running.
3. `binding`: the daemon found the instance's Tailscale IP and enabled its
HAProxy backend slot.
4. `ready`: HAProxy health checks pass and Buildbot can use the slot.
5. `draining` or `terminating`: the slot is being removed after release, idle
timeout, interruption, or failure.
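
The state list above can be read as a small state machine. The following sketch encodes one plausible transition table. The state names come from the list, but the exact set of allowed transitions (for example, that any state may jump straight to `terminating` on failure) is an assumption made for illustration, as are the `ALLOWED` and `advance` names.

```python
# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    "launching": {"booting", "terminating"},
    "booting": {"binding", "terminating"},
    "binding": {"ready", "terminating"},
    "ready": {"draining", "terminating"},
    "draining": {"terminating"},
    "terminating": set(),
}


def advance(current: str, target: str) -> str:
    """Return `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


# Walk the normal path of a slot from launch to termination.
state = "launching"
for nxt in ["booting", "binding", "ready", "draining", "terminating"]:
    state = advance(state, nxt)
```

Keeping the transitions in a single table makes it easy to reject out-of-order updates, such as a `ready` slot going back to `booting`.
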
Buildbot waits until the reservation becomes `ready`, then runs the build
through the configured Nix remote builder alias. When the build finishes, the
release step releases the reservation. Idle slots drain and terminate after the
configured cooldowns.
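
The wait-then-release pattern can be sketched as a context manager that polls until the reservation is `ready` and always releases it afterwards, even if the build fails or the wait times out. This is a minimal sketch, not the extension's implementation: `get_state` and `release` are hypothetical callables standing in for calls to the reservation API, and the `pending` state name is assumed.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


def wait_until_ready(get_state: Callable[[], str], timeout_s: float, poll_s: float) -> None:
    """Poll `get_state` until it returns "ready" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == "ready":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("reservation did not become ready in time")
        time.sleep(poll_s)


@contextmanager
def reserved_slot(
    get_state: Callable[[], str],
    release: Callable[[], None],
    timeout_s: float,
    poll_s: float = 1.0,
) -> Iterator[None]:
    """Wait for capacity, run the body, and always release the reservation."""
    try:
        wait_until_ready(get_state, timeout_s, poll_s)
        yield
    finally:
        release()


# Simulated reservation that becomes ready on the third poll.
states = iter(["pending", "pending", "ready"])
events: list[str] = []
with reserved_slot(lambda: next(states), lambda: events.append("released"),
                   timeout_s=5, poll_s=0):
    events.append("build")
```

Putting the release in a `finally` clause mirrors the role of the release step at the end of the builder: a reservation should not outlive the build that requested it.
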
## Development
Common checks:
```sh
nix flake check
nix build .#nix-builder-autoscaler
nix build .#buildbot-autoscale-ext
nix fmt
```
Useful local CLI commands against a running daemon:
```sh
autoscalerctl status
autoscalerctl slots
autoscalerctl reservations
autoscalerctl drain <slot-id>
autoscalerctl reconcile-now
```
The daemon listens on `/run/nix-builder-autoscaler/daemon.sock` by default.
NixOS deployments should configure the service modules rather than hand-writing
daemon config files.