Agent platform for infrastructure
For the last two weeks I have been building an agent platform for infrastructure at work. The idea is simple: give agents access to everything related to infrastructure, but keep operational safety under control.
In practice, it is a service that connects to Kubernetes, Prometheus, GitLab, source code, inventory, Consul, and so on, and exposes a common API for data and infrastructure handles.
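As a rough illustration of what "a common API" over several backends can look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `Handle` and `Platform` names, the connector signature, and the stub connectors are hypothetical and are not the actual service.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Handle:
    """Hypothetical uniform reference to one infrastructure object."""
    source: str   # backend the object came from, e.g. "kubernetes"
    kind: str     # object type, e.g. "deployment" or "alert"
    name: str
    data: Dict[str, Any] = field(default_factory=dict)


# A connector is any callable that maps a free-form query to handles.
Connector = Callable[[str], List[Handle]]


class Platform:
    """Routes read-only queries to registered connectors (illustrative only)."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}

    def register(self, source: str, connector: Connector) -> None:
        self._connectors[source] = connector

    def query(self, source: str, q: str) -> List[Handle]:
        if source not in self._connectors:
            raise KeyError(f"no connector registered for {source!r}")
        return self._connectors[source](q)

    def query_all(self, q: str) -> List[Handle]:
        # Fan out to every backend in a stable (alphabetical) order.
        results: List[Handle] = []
        for source in sorted(self._connectors):
            results.extend(self._connectors[source](q))
        return results


# Stub connectors standing in for real Kubernetes / Prometheus clients.
def fake_kubernetes(q: str) -> List[Handle]:
    return [Handle("kubernetes", "deployment", q, {"replicas": 3})]


def fake_prometheus(q: str) -> List[Handle]:
    return [Handle("prometheus", "alert", f"{q}-high-latency", {"firing": True})]


platform = Platform()
platform.register("kubernetes", fake_kubernetes)
platform.register("prometheus", fake_prometheus)
```

With this shape, an agent asks one question, for example `platform.query_all("checkout")`, and gets back handles from every backend in the same format, instead of learning each tool's own client.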
The result works almost magically. Even with mutating operations heavily restricted and a large backlog of infrastructure unification work still open, MTTD is down to minutes.
Once we enable mutations, agents will be able to run experiments and test hypotheses. The next step after that is monitoring and alerting. Then self-healing, limited at first and with a human in the loop (HITL).
The unexpected part is how little even SOTA agents care about operational safety. Despite all the up-front blast-radius calculations and goal priorities, opus 4.8 xhigh can calmly, for example, roll back migrations without checking whether the rollback breaks data, simply because “well, surely the programmers wrote a proper rollback”.
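One way to stop relying on the agent's optimism is to make the platform refuse risky mutations unless specific checks are on record. The sketch below is only an illustration under stated assumptions: the action names, the evidence labels, and the `authorize` function are hypothetical, not the platform's real policy.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MutationRequest:
    action: str                             # e.g. "rollback_migration"
    target: str
    evidence: FrozenSet[str] = frozenset()  # checks recorded for this request
    human_approved: bool = False


# Hypothetical policy: which checks each action needs before it may run.
REQUIRED_EVIDENCE = {
    "rollback_migration": frozenset({"down_migration_reviewed", "backup_verified"}),
    "restart_pod": frozenset(),
}

# Hypothetical policy: actions that always need a human in the loop.
NEEDS_HUMAN = {"rollback_migration"}


def authorize(req: MutationRequest) -> Tuple[bool, str]:
    """Return (allowed, reason). Unknown actions are denied by default."""
    if req.action not in REQUIRED_EVIDENCE:
        return False, "unknown action: denied by default"
    missing = REQUIRED_EVIDENCE[req.action] - req.evidence
    if missing:
        return False, "missing evidence: " + ", ".join(sorted(missing))
    if req.action in NEEDS_HUMAN and not req.human_approved:
        return False, "human approval required"
    return True, "ok"
```

In a real setup the evidence would have to be produced and verified by the platform itself rather than claimed by the agent; otherwise the gate only moves the assumption somewhere else.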