Canonical is hiring a Site Reliability / GitOps Engineer to design, automate, and maintain infrastructure for large-scale production systems.
Responsibilities:
- Develop and maintain infrastructure as code (IaC) practices
- Automate operations across distributed systems in private and public clouds
- Improve scalability, resilience, and performance of cloud and container systems
- Monitor systems using tools like Prometheus, Grafana, and Elasticsearch
- Troubleshoot issues across the stack, from the kernel to the application level
- Collaborate with engineering teams to improve system architecture and operations
- Contribute to open-source projects by reporting issues and submitting fixes
- Handle escalations and ensure system reliability
Requirements:
- Strong experience with infrastructure as code (IaC), CI/CD, and version control workflows
- Proficiency in Python, demonstrated on large-scale projects
- Solid understanding of Linux systems, networking, and storage technologies
- Experience with cloud computing and distributed systems
- Familiarity with observability tools (Prometheus, Grafana, ELK stack)
- Strong communication skills and ability to work in distributed teams
Nice to Have:
- Experience with Ceph, databases, and advanced Linux storage systems
- Familiarity with Ubuntu or Debian ecosystems
- Experience contributing to open-source projects
Benefits:
- Remote-first global work environment
- Learning and development budget
- Annual compensation review and performance bonuses
- Travel opportunities for team events
- Comprehensive benefits package
Join Canonical to build and operate infrastructure powering millions of users worldwide.