Senior Manager Service Reliability Engineer (24/7 Team)

Công ty Cổ phần Thanh toán số MobiFone
30th Floor, Tower C5 D’Capital, 119 Trần Duy Hưng, Yên Hoà Ward, Hà Nội, Việt Nam
At office
Posted 1 day ago
Job Expertise:
Job Domain: Software Products and Web Services, Financial Services

Top 3 reasons to join us

  • Competitive salary & attractive benefits
  • Career growth and learning opportunities
  • Young, dynamic and collaborative culture

Job description

Job Purpose: Responsible for operating and ensuring the stability of large-scale, mission-critical applications and services running 24/7. Acts as Incident Commander, leading the end-to-end resolution of production incidents, coordinating technical teams to restore services in the shortest possible time, and sustaining the highest SLA commitments to customers and the business. Builds, standardizes and continuously improves the Incident / Problem / Change Management processes in line with ITIL standards, enhancing system reliability and resilience in accordance with the overall strategy of the Technology Division.

Responsibilities

Incident Management & Command

  • Act as Incident Commander with end-to-end accountability for high-severity production incidents (P1/P2).
  • Lead the war-room and coordinate cross-functional teams (Application, Infrastructure, Network, Security, Vendor) throughout incident resolution.
  • Ensure the fastest possible service restoration, minimizing MTTR (Mean Time To Recovery) and reducing impact to customers and business operations.
  • Maintain clear, timely communication with stakeholders (business, leadership, operations) across the full incident lifecycle.
  • Conduct Post-Incident Reviews (PIR), Root Cause Analysis (RCA) and define preventive actions.

Service Operations (24/7)

  • Operate mission-critical / high-availability systems on a continuous 24/7 basis.
  • Establish, monitor and enforce SLA / SLO and compliance across critical services.
  • Manage the Incident / Problem / Change lifecycle following the ITIL/ITSM framework.
  • Improve system stability, reliability and resilience.

24/7 Operations & Shift Model

  • Operate under a 24/7 model organized as shifts/teams, ensuring round-the-clock coverage including nights, weekends and public holidays.
  • Plan, manage and balance shift rosters and on-call schedules to guarantee adequate staffing and seamless handover between shifts.
  • Lead or participate in the duty-officer / on-call rotation as Incident Commander, ready to respond to major incidents at any time.
  • Ensure each shift maintains complete handover logs, runbooks and shift reports for full operational continuity.
  • Willing and able to work in rotating shifts (day/night) and respond outside office hours when major incidents occur.

Monitoring & Observability

  • Design and operate comprehensive monitoring systems (APM, infrastructure monitoring, log/observability).
  • Proactively detect issues through alert tuning and early warning signals.
  • Optimize dashboards and alerting strategy to reduce noise and false alerts.

Problem Management & Continuous Improvement

  • Analyze incident trends and identify recurring issues.
  • Implement automation and runbooks to reduce manual effort and resolution time.
  • Propose architecture / system design improvements to increase availability.

Leadership & Stakeholder Management

  • Manage, lead and mentor the IT Operations / Incident Management team, including shift supervisors and on-call engineers.
  • Act as the bridge between Business / Product / Engineering / Infrastructure.
  • Report directly to senior leadership on overall system health and status.

Your skills and experience

  • 7–12+ years of experience in IT Operations / ITSM / Incident Management.
  • Hands-on experience handling major incidents in a 24/7 operating environment.
  • Background in Banking / Fintech / Payment Gateway / E-commerce or other large-scale systems.
  • Strong understanding of ITIL / ITSM best practices.
  • Willing to work in a 24/7 shift-based model (3 shifts / 4 teams) and participate in on-call duty rotations.
  • Solid infrastructure foundation: Network (TCP/IP, Load Balancing, DNS), Server (Linux/Windows), basic Storage / Database.
  • Proficient with monitoring tools such as Datadog, Dynatrace, Prometheus, Grafana, ELK, Splunk, etc.
  • Experience with distributed systems, microservices, high availability / DR / failover.

Nice-to-have

  • Experience building an Incident Management framework.
  • Cloud experience (AWS / Azure / GCP).
  • ITIL certification or SRE knowledge is an advantage.
  • Prior role as Incident Manager / SRE Lead / IT Operations Manager.

Why you'll love working here

Salary & Allowances

  • 13th-month salary plus annual performance bonus, project incentives and sales incentives (depending on position).
  • Lunch allowance: 730.000 VND/month.
  • Special occasion bonus: 2.500.000 VND/year.
  • Annual leave: up to 20 days/year (depending on level).
  • Health: Social insurance, premium health insurance, yearly health check.
  • Laptop, monitor and any other equipment, accounts and tools needed for work.

Career Growth

  • Yearly salary review and promotion.
  • Diverse career paths: Management or Expert track, with opportunities to rotate between functions.
  • Free learning resources on Udemy, Coursera and O’Reilly; internal workshops, certification sponsorship, and exclusive mentoring from C-level executives.
  • Recognition and awards at team and organizational levels.

Working Environment

  • Open, collaborative workspace that supports both individual focus and teamwork.
  • Young, dynamic, and collaborative working atmosphere.
  • Quarterly and yearly team-building activities and engaging internal events.

Công ty Cổ phần Thanh toán số MobiFone

Company type
IT Product
Company industry
Financial Services
Company size
51-150 employees
Country
Vietnam
Working days
Monday - Friday
Overtime policy
No OT
