Top 3 reasons to join us
- Hybrid and flexible working environment
- Innovative Product
- Growth Opportunities
Job description
We're seeking an experienced Lead DevOps Engineer to spearhead our critical infrastructure transformation as PAVE.ai scales to enterprise level. This role will lead the strategic migration from Google Cloud Platform to AWS while building and managing a high-performing DevOps team. As Lead DevOps Engineer at PAVE.ai, you'll architect enterprise-grade infrastructure, establish site reliability engineering practices, and ensure 99.9%+ uptime for our vehicle inspection platform serving global automotive enterprises. This is a pivotal role that will define our infrastructure strategy and operational excellence as we process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces worldwide.
Cloud Migration Leadership
- Lead and execute the complete migration strategy from GCP to AWS, ensuring zero downtime
- Design and implement AWS enterprise architecture following Well-Architected Framework principles
- Create detailed migration roadmaps with clear milestones, risk assessments, and rollback plans
- Architect hybrid cloud solutions during transition phase to maintain business continuity
- Optimize costs during and after migration while improving performance and reliability
- Document migration processes and create runbooks for knowledge transfer
Team Leadership & Development
- Build and lead a world-class DevOps team, including hiring, mentoring, and performance management
- Define team structure, roles, and responsibilities for 24/7 operational coverage
- Establish DevOps culture and best practices across the engineering organization
- Create career development paths and training programs for team members
- Foster collaboration between DevOps, development, and security teams
- Lead incident response and post-mortem processes to drive continuous improvement
Site Reliability Engineering (SRE)
- Establish and maintain SLIs, SLOs, and SLAs for all critical services
- Design and implement comprehensive monitoring and observability strategies
- Build automated incident detection and response systems
- Ensure 99.9%+ uptime for production systems through proactive reliability engineering
- Implement chaos engineering practices to identify and fix potential failures
- Create capacity planning models to support 10x growth
Infrastructure & Automation
- Design scalable, secure, and cost-effective AWS infrastructure for enterprise workloads
- Implement Infrastructure as Code (IaC) using Terraform/CloudFormation
- Build CI/CD pipelines supporting multiple deployment strategies (blue-green, canary)
- Automate security compliance and governance using AWS native tools
- Implement auto-scaling and self-healing infrastructure
- Design disaster recovery and business continuity strategies
- Develop and enhance logging systems and observability tools (ongoing improvement initiative)
Enterprise Platform Development
- Architect multi-tenant infrastructure supporting enterprise isolation requirements
- Implement enterprise-grade security including VPN, SSO, and zero-trust networking
- Design data residency and compliance solutions for global operations
- Build platform services for logging, monitoring, secrets management, and service mesh
- Create developer self-service platforms to accelerate delivery
- Establish FinOps practices for cloud cost optimization
Strategic Planning
- Develop long-term infrastructure roadmap aligned with business objectives
- Partner with leadership to define technology strategy and investments
- Evaluate and introduce new technologies to improve operational efficiency
- Create business cases for infrastructure investments with ROI analysis
- Establish vendor relationships and manage AWS enterprise support
- Drive infrastructure standardization and consolidation initiatives
Success Metrics
- Complete GCP to AWS migration within 6 months with zero critical incidents
- Achieve and maintain 99.9% uptime across all production services
- Reduce infrastructure costs by 30% while improving performance
- Build and retain a high-performing DevOps team with <10% attrition
- Increase deployment frequency from weekly to multiple times daily
- Reduce MTTR (Mean Time To Recovery) by 50%
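To put the 99.9% uptime target in concrete terms, the short sketch below converts an availability percentage into an allowed-downtime budget. The function name `downtime_budget_minutes` and the 30-day and 365-day windows are illustrative assumptions; the posting does not state how or over what period uptime is measured.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that still meets the target.

    availability_pct: target availability as a percentage, e.g. 99.9
    period_days: length of the measurement window in days (assumed here)
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    # Hypothetical measurement windows, chosen only for illustration.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = downtime_budget_minutes(99.9, days)
        print(f"{label}: {budget:.1f} minutes of allowed downtime")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime per 30-day month, or a little under 9 hours per year, which is the budget that incident response and MTTR improvements have to fit within.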
Your skills and experience
Technical Skills
Cloud Expertise:
- Expert-level AWS knowledge (Solutions Architect Professional preferred)
- Strong GCP experience with migration expertise
- Multi-cloud architecture and management
- AWS services mastery: EC2, ECS/EKS, Lambda, RDS, S3, CloudFront, Route53
- Cloud networking: VPC, Transit Gateway, Direct Connect, Global Accelerator
- Security services: IAM, KMS, WAF, Shield, GuardDuty, Security Hub
DevOps & Automation:
- Infrastructure as Code: Terraform, CloudFormation, AWS CDK
- Configuration management: Ansible, Chef, or Puppet
- CI/CD platforms: Jenkins, GitLab CI, GitHub Actions, AWS CodePipeline
- Container orchestration: Kubernetes (EKS), Docker, Helm
- GitOps practices with ArgoCD or Flux
- Scripting languages: Bash, Go, Python
Site Reliability:
- Monitoring/Observability: Prometheus, Grafana, ELK, Datadog, New Relic
- APM and distributed tracing: OpenTelemetry, Jaeger
- Incident management: PagerDuty, Opsgenie
- SRE practices: Error budgets, SLI/SLO definition, toil reduction
- Performance tuning and capacity planning
- Chaos engineering tools: Gremlin, Chaos Monkey
Leadership Skills
- Proven ability to lead and inspire technical teams
- Experience managing remote and distributed teams
- Strong project management and organizational skills
- Excellent stakeholder management across technical and business teams
- Budget management and cost optimization experience
- Change management expertise for large-scale transformations
Soft Skills
- Excellent written and verbal communication skills in both English and Vietnamese
- Strategic thinking with ability to balance long-term vision with immediate needs
- Strong problem-solving skills with calm demeanor during incidents
- Ability to influence and drive consensus across organizations
- Mentoring mindset with passion for developing talent
- Adaptable to rapidly changing requirements and technologies
Experience
- 7+ years of DevOps/SRE experience with 3+ years in a leadership role
- Proven experience leading large-scale cloud migrations (GCP to AWS preferred)
- Track record of managing DevOps teams of 4+ engineers
- Experience with enterprise B2B SaaS platforms at scale (millions of requests/day)
- Demonstrated success improving system reliability from <99% to 99.9%+
Preferred Qualifications
- AWS Certified DevOps Engineer or Solutions Architect Professional
- Experience with AI/ML workload infrastructure and GPU clusters
- Knowledge of automotive industry compliance and regulations
- Experience with computer vision and image processing pipelines
- Serverless architecture and event-driven systems
- FinOps certification or demonstrated cost optimization achievements
- Experience with regulated environments (SOC2, ISO 27001, GDPR)
- Contributions to open-source DevOps/SRE tools
- Public speaking experience at DevOps/SRE conferences
- Experience scaling startups to enterprise level
Why you'll love working here
1. Competitive Compensation & Perks
- Attractive salary package.
- 15 days of annual leave.
- 13th-month bonus.
- Premium healthcare coverage for you and your family.
- Thoughtful appreciation gifts throughout the year.
2. Growth & Learning Opportunities
- Work on cutting-edge, large-scale products in the car inspection field.
- Clear career paths for both technical experts and aspiring leaders.
- Continuous learning programs to sharpen your skills and grow your career.
- Learn from everything, everywhere—but be a smart copy-paster, not a copycat!
- Be ready to embrace and implement new ideas in a fast-paced environment.
3. An Inspiring Workplace
- Flexible hybrid work model and a strong focus on work-life balance.
- A modern, fully equipped office with a well-stocked pantry.
- Be motivated, creative, and passionate—we can’t ask for more!
- Respect and care for your teammates, your environment, and even yourself.
- Treat yourself well, and while you’re at it, save the Earth too.
4. A Mindset for Growth
- Have the courage to move fast, stay flexible, and take full responsibility for every single line of code.
- Always look back at your work and strive to make it better—nothing is perfect, and that’s where you come in.
- It’s okay to be late sometimes, but make sure you’re fully accountable and aware of your actions.
5. A Dynamic and Open Culture
- We don’t stick rigidly to the game plan, so feel free to add or remove your own “blah blah” from this list. 😉