About the Role
The Lead Software Engineer for AI Ops and Resilience will drive the evolution of IT operations through intelligent automation and hands-on engineering. The role focuses on reimagining IT service management (ITSM) practices and building predictive, self-healing IT capabilities with AI/ML frameworks and modern automation tools. Key duties include leading automation deployments, mentoring engineering teams, and overseeing IT Command Centre operations to ensure high service reliability.
Key Responsibilities
- Reimagine and enhance core ITSM practices (Incident, Problem, Change, and Knowledge Management) using modern development frameworks and automation tools.
- Design, prototype, and implement AI-driven operational tools, including predictive incident detection, automated remediation workflows, and LLM-based knowledge agents.
- Lead the development and deployment of custom automation solutions to improve IT service reliability and reduce manual workload across ITSM domains.
- Collaborate with platform teams, enterprise architects, and developers to conceptualize and build next-generation IT operational capabilities.
- Provide mentorship and guidance to ITSM IPC Engineers, ensuring effective execution and governance of processes aligned with ITIL best practices.
- Act as the primary liaison between internal stakeholders and external service providers for the IT Command Centre and Helpdesk.
- Monitor and manage the performance of vendor-managed services to ensure SLA and KPI compliance.
- Participate in service reviews, audits, and performance assessments while supporting escalation management and root cause analysis efforts.
Requirements
- Bachelor's Degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 5+ years of experience in IT operations or substantial exposure to ITSM processes and tooling.
- Strong understanding of the ITIL framework and ITSM best practices; ITIL v3/v4 certification is preferred.
- Hands-on experience with automation tools, scripting, and AI/ML technologies relevant to IT operations.
- Proficient with ITSM platforms such as ServiceNow, BMC Remedy, or similar tools.
- Demonstrated ability to mentor technical teams and lead cross-functional collaboration.
- Excellent problem-solving, communication, and stakeholder management skills.
- Hands-on software development or scripting experience in Python, JavaScript (Node.js), or similar languages.
- Experience with monitoring and observability platforms such as Splunk, Grafana, ScienceLogic, or equivalent is an advantage.
- Familiarity with CI/CD pipelines, GitOps practices, cloud platforms (AWS, Azure, GCP), and Infrastructure-as-Code (IaC) tools.
- Proficiency with AI/ML frameworks and tools such as TensorFlow, scikit-learn, LangChain, and OpenAI APIs is a strong advantage.