diff --git a/README.md b/README.md
new file mode 100644
index 0000000..45837c8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,86 @@
+# nix-builder-autoscaler
+
+`nix-builder-autoscaler` provides elastic Nix remote builder capacity for
+[Guardian Project](https://guardianproject.info) jobs using [buildbot-nix](https://github.com/nix-community/buildbot-nix).
+
+We built it because we don't have the budget for dedicated, always-on build hardware.
+
+The idea is that Buildbot waits for builder capacity before it starts a
+`nix-build` job, and idle builders disappear some time after jobs finish.
+
+The autoscaler launches EC2 Nix builder instances, waits until they are reachable
+through Tailscale and HAProxy, hands Buildbot a reservation for a ready slot,
+and later drains and terminates unused capacity. It uses EC2 Spot by default and
+can use on-demand instances for nested virtualization workloads when configured.
+
+The Buildbot instance uses a single-master, single-worker configuration, as upstream buildbot-nix expects. HAProxy presents a single logical nix-builder host to buildbot-nix, an approach inspired by [Garnix's yensid](https://web.archive.org/web/20260530230732/https://garnix.io/blog/yensid/).
+
+## Pieces
+
+The project has two main runtime pieces:
+
+1. `agent/`: the autoscaler daemon and `autoscalerctl` CLI.
+   The daemon owns the slot database, reservation API, scheduler, EC2 runtime,
+   HAProxy binding, health checks, and metrics.
+
+2. `buildbot-ext/`: the Buildbot integration.
+   The extension patches Buildbot `*/nix-build` builders with a capacity gate
+   step at the beginning and a reservation release step at the end. It also
+   lets the Buildbot host send Nix distributed builds through the HAProxy-backed
+   builder cluster.
+
+The `nix/modules/` directory contains NixOS modules that package and wire these
+pieces into hosts:
+
+- `services.nix-builder-autoscaler` runs the daemon and can generate the HAProxy
+  slot configuration.
+- `services.buildbot-nix.nix-build-autoscaler` installs the Buildbot extension
+  and configures Nix remote builder access.
+
+## How it works
+
+Buildbot (via the extension) creates a reservation before a Nix build starts.
+The autoscaler assigns that reservation to a ready slot if one exists. If no
+ready slot has capacity, the scheduler launches an EC2 instance into an empty
+slot, subject to the configured minimum, maximum, warm pool, and timeout
+settings.
+
+The reconciler moves each slot through the runtime states:
+
+1. `launching`: EC2 accepted the instance launch.
+2. `booting`: the instance is running.
+3. `binding`: the daemon found the instance's Tailscale IP and enabled its
+   HAProxy backend slot.
+4. `ready`: HAProxy health checks pass and Buildbot can use the slot.
+5. `draining` or `terminating`: the slot is being removed after release, idle
+   timeout, interruption, or failure.
+
+Buildbot waits until the reservation becomes `ready`, then runs the build
+through the configured Nix remote builder alias. When the build finishes, the
+release step releases the reservation. Idle slots drain and terminate after the
+configured cooldowns.
+
+## Development
+
+Common checks:
+
+```sh
+nix flake check
+nix build .#nix-builder-autoscaler
+nix build .#buildbot-autoscale-ext
+nix fmt
+```
+
+Useful local CLI commands against a running daemon:
+
+```sh
+autoscalerctl status
+autoscalerctl slots
+autoscalerctl reservations
+autoscalerctl drain
+autoscalerctl reconcile-now
+```
+
+The daemon listens on `/run/nix-builder-autoscaler/daemon.sock` by default.
+NixOS deployments should configure the service modules rather than hand-writing
+daemon config files.
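
The slot lifecycle listed under "How it works" in the README added by this diff can be read as a small state machine. The sketch below is a hypothetical Python model of it, not the daemon's code: `SlotState`, `ALLOWED`, and `advance` are illustrative names, and the transition table (including the `empty` state and the assumption that any pre-`ready` state may jump straight to `terminating` on failure) is inferred from the README text rather than taken from the implementation.

```python
from enum import Enum


class SlotState(Enum):
    """Runtime states named in the README, plus an assumed EMPTY state."""

    EMPTY = "empty"  # assumption: a slot with no instance in it
    LAUNCHING = "launching"
    BOOTING = "booting"
    BINDING = "binding"
    READY = "ready"
    DRAINING = "draining"
    TERMINATING = "terminating"


# Assumed transition table: the happy path follows the README's numbered
# list; the extra edges to TERMINATING model interruption or failure.
ALLOWED = {
    SlotState.EMPTY: {SlotState.LAUNCHING},
    SlotState.LAUNCHING: {SlotState.BOOTING, SlotState.TERMINATING},
    SlotState.BOOTING: {SlotState.BINDING, SlotState.TERMINATING},
    SlotState.BINDING: {SlotState.READY, SlotState.TERMINATING},
    SlotState.READY: {SlotState.DRAINING, SlotState.TERMINATING},
    SlotState.DRAINING: {SlotState.TERMINATING},
    SlotState.TERMINATING: {SlotState.EMPTY},
}


def advance(current: SlotState, target: SlotState) -> SlotState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}"
        )
    return target


if __name__ == "__main__":
    # Walk one slot through the full happy path and print each state.
    state = SlotState.EMPTY
    for nxt in (
        SlotState.LAUNCHING,
        SlotState.BOOTING,
        SlotState.BINDING,
        SlotState.READY,
        SlotState.DRAINING,
        SlotState.TERMINATING,
    ):
        state = advance(state, nxt)
        print(state.value)
```

Keeping the permitted transitions in one table makes it easy to compare the model against the documented states: a move such as `ready` back to `booting` is rejected instead of being silently accepted.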