About the role
The Site Reliability Engineer is responsible for the operation, reliability, and performance of platform infrastructure, ensuring high availability and scalability.
Banking | Onsite
Key Responsibilities
- Lead incident response for complex virtualization, storage, or OS-level disruptions, and conduct blameless post-mortems and root cause analysis (RCA).
- Develop and maintain software tools (Python, PowerShell, Java) for automation of infrastructure tasks via CI/CD pipelines.
- Architect and manage AI-first monitoring systems (Grafana, ELK) for predictive failure detection.
- Define and measure infrastructure-specific SLIs and SLOs (e.g., IOPS, disk latency, OS uptime) and manage error budgets.
- Adopt and maintain Infrastructure as Code (IaC) using Terraform and Ansible for consistent deployments.
Requirements
- Degree in Computer Science, Information Technology, or related Engineering field.
- At least 5 years of relevant experience.
- Intermediate-level administration of Windows Server (Active Directory, Clustering) and Red Hat Enterprise Linux (RHEL).
- High proficiency in hypervisors (VMware) and enterprise storage architecture (SAN, NAS, S3).
- Proficiency in Python and PowerShell for automation.
- Hands-on experience with Grafana and the ELK stack (Elasticsearch, Logstash, Kibana) for monitoring.