Site Reliability Engineer – Machine Learning Systems

SpringCube

Full time - Senior Associate / Assistant Manager

IT Services & Consulting

Singapore (Onsite)

Published 4 weeks ago

Salary: Disclosed upon interview

Job Description

The SpringCube team curated the following job opportunity to help you in your job search. Explore the position below to find your next career move.

Company Overview
As a global incubator for platforms across commerce, content, entertainment, and enterprise services, ByteDance engages over 2.5 billion users worldwide through products like TikTok. Committed to fostering a safe and positive online environment, ByteDance supports innovation, creativity, and enriching life for people across more than 30 countries with a workforce of over 110,000 employees.

Role Summary
The Machine Learning Systems team supports the development and operation of ByteDance’s distributed ML infrastructure, ensuring stable, reliable, and high-performance systems across global data centers. The role involves managing large-scale AI models and heterogeneous systems that integrate GPU, NPU, RDMA, and storage components. Key responsibilities include system stability, resource management, disaster recovery, and on-call support.

Key Responsibilities

  • Ensure efficiency and stability of ML systems for large model deployment, training, evaluation, and inference.
  • Maintain multi-cloud infrastructure stability and manage resources across multiple data centers and regions.
  • Oversee disaster recovery, cluster machine governance, and operations efficiency.
  • Develop tools to monitor and manage ML infrastructure and services.
  • Provide global on-call support for system and business operations.

Qualifications

  • Minimum Requirements:
    • Bachelor’s degree in Computer Science, Computer Engineering, or a related field.
    • Proficiency in programming languages such as Go, Python, or shell scripting on Linux.
    • Experience with Kubernetes and containers, with at least 3 years in operations and maintenance.
  • Preferred Qualifications:
    • Strong analytical and logical skills, a proactive approach, and team spirit.
    • Proficiency in writing technical documentation.
    • Experience in large-scale ML distributed systems and GPU server operations.

Disclaimer:
SpringCube curates tech job listings from various company websites to support tech professionals in Singapore during these challenging times.

  1. No Endorsement: Job ads on SpringCube do not imply endorsement of their authenticity or quality.
  2. No Client Relationship: This company is not a client of SpringCube unless stated.
  3. Users must click to apply and will be redirected to the employer’s career page.
  4. No Liability: SpringCube is not liable for inaccuracies.