Site Reliability Engineer-AI Cloud

Date: Jun 25, 2025

Location: Bade, Taiwan, TW

Company: Super Micro Computer

Job Req ID: 26896

About Supermicro:

Supermicro® is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
 

Job Summary:

As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure while ensuring their high availability, performance, and security. You’ll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties and Responsibilities:

Cloud Infra Automation:
Design and deploy infrastructure on bare metal or cloud using Terraform, Ansible, or Helm. Automate workflows with Python or Go.

Platform Reliability:
Maintain and scale GPU clusters, Kubernetes, and AI-optimized storage (Ceph, BeeGFS, Weka) to ensure stability and performance.

Monitoring & Alerting:
Use Prometheus, Grafana, ELK, etc., to monitor system health and trigger alerts on anomalies.

Capacity Planning:
Analyze usage patterns and forecast infrastructure needs for AI workloads.

Incident Management:
Lead root cause analysis for incidents and manage SLOs/SLIs/SLAs to maintain high availability.

CI/CD Integration:
Work with DevOps/MLOps teams on CI/CD pipelines using GitLab, ArgoCD, or similar tools.

Security & Compliance:
Secure Linux systems, manage certificates, and enforce access controls (RBAC, LDAP SSO, TLS, segmentation).

Documentation & Playbooks:
Maintain architecture diagrams, runbooks, and incident playbooks to support knowledge sharing and onboarding.

Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience. 3-7 years of experience in the areas below is preferred.
  • Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
  • Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
  • Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
  • Strong scripting and coding skills (Bash, Python, or Go).
  • Exposure to secure multi-tenant environments and zero trust architectures.
  • Familiarity with network protocols and services (DNS, DHCP, BGP, RoCEv2) and with InfiniBand or high-throughput Ethernet fabrics.
  • Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.


Job Segment: Cloud, Linux, Network, Computer Science, Data Center, Technology