Google

Senior Customer Engineer, AI Infrastructure, Google Cloud

Technology | Singapore | Onsite | Posted 4 weeks ago

About the role

This is a senior-level technical role at Google Cloud focused on designing and optimizing AI infrastructure on TPU and GPU hardware. The candidate will work with major enterprises and startups to deploy large-scale AI training and inferencing solutions, providing deep technical expertise in networking and performance tuning.

Key Responsibilities

  • Design and implement complex, multi-host AI training and inferencing solutions on Google Cloud TPUs, focusing on scalability and performance tuning.
  • Conduct in-depth performance profiling and optimization of customer models and data pipelines specifically for the TPU architecture, identifying and resolving bottlenecks.
  • Advise customers on best practices for integrating their ML operations workflows with the Google Cloud AI platform ecosystem for seamless TPU utilization.
  • Support Google Cloud Sales teams to deploy AI/ML accelerators (e.g., TPU/GPU) at AI innovators, large enterprises, and early-stage AI startups.
  • Guide customer discussions on network topologies and compute/storage, and support the bring-up of server, network, cluster, or cooling deployments.
  • Liaise with the product marketing management and engineering teams to stay on top of industry trends and devise enhancements to Google Cloud products.
  • Visit customer data centers during the bring-up phase of server, network, or cluster deployments.
  • Understand the needs of customers and help shape the future using AI technology.

Requirements

  • Bachelor's degree or equivalent practical experience.
  • 10 years of experience in developing and deploying models using deep learning frameworks (e.g., TensorFlow, PyTorch, or JAX) specifically on TPU hardware.
  • Experience with networking principles, including collective communication and inter-chip interconnects, and their impact on distributed AI training.
  • Experience with lower-level performance tools and techniques (e.g., custom kernel development, XLA compiler familiarity) relevant to optimizing code for Google's TPU chips.
  • Experience with leveraging AI hardware and software stacks and platforms to bring up and deploy AI compute clusters.
  • Knowledge of AI accelerator hardware (e.g., specific GPU generations) to effectively articulate the architectural differentiation and value proposition of cloud TPUs.
  • Knowledge of the AI infrastructure market, including main technology providers, differentiators, and trends.