Infra Support Engineer

Fuku · Taiwan

IT - Network / Systems / DB Admin · Entry Level · Full-time · Posted about 2 months ago

About The Role

Infra Support Engineer – GMI Global Infrastructure Team

Preferred Location

  • Taiwan
  • Malaysia

Responsibilities

  • Provide first- and second-line technical support to customers for AI infrastructure, including GPU/CPU nodes, networking, storage, orchestration, and platform services. Support is delivered via ticketing systems, email, Slack, or other messaging platforms.
  • Support GPU cluster delivery, including system provisioning, image deployment, network validation, BIOS/firmware updates, and GPU driver/runtime installation.
  • Monitor system health and service-level indicators using alerts and dashboards; respond to alerts 24x7 as scheduled.
  • Triage incidents by gathering context, verifying scope and impact, and following standard operating procedures and runbooks to perform immediate mitigations.
  • Escalate incidents to global SRE engineers with clear, concise incident notes and relevant logs/traces.
  • Maintain incident logs, update status pages, and communicate timely updates to stakeholders during incidents.
  • Perform routine operational tasks such as log checks, health checks, capacity checks, and simple automated fixes.
  • Participate in postmortems and contribute actionable follow-ups to reduce recurrence of incidents.
  • Help maintain and improve standard operating procedures (SOPs), run periodic runbook validation, and document new procedures.
  • Work collaboratively with developers and SRE teams to improve system reliability.

Qualifications

  • Bachelor’s degree in Computer Science or a related field.
  • Over 2 years of experience in IT operations, server administration, SRE, DevOps, or technical support.
  • Hands-on Linux experience, including shell, kernel, and log management.
  • Basic networking knowledge, including TCP/IP, DNS, HTTP, and VLANs.
  • Familiarity with monitoring, alerting, and logging tools such as Prometheus, Grafana, and Alertmanager.
  • Experience with NVIDIA GPU infrastructure and Kubernetes.
  • Comfortable collecting diagnostics, reading logs, and interpreting traces.
  • Strong troubleshooting mindset and ability to follow runbooks under pressure.
  • Excellent written and verbal communication skills for customer-facing incident handling.
  • Willingness to work shifts and participate in on-call rotations.
  • Bilingual in English and Chinese.

This listing was posted by a verified recruiter at Fuku.