Industry: Software Products and Web Services
Top 3 reasons to join the company
- Hybrid and flexible working environment
- Innovative Product
- Growth Opportunities
Job description
We're seeking a skilled Site Reliability Engineer to join our DevOps team and ensure the stability and reliability of our enterprise vehicle inspection platform. Reporting to the Lead DevOps Engineer, you'll play a critical role in our GCP to AWS migration while maintaining and improving system reliability. As an SRE at PAVE.ai, you'll implement best practices for monitoring, incident response, and automation to achieve 99.9%+ uptime. You'll work hands-on with AWS infrastructure to build resilient systems that process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces globally.
Key Responsibilities
System Reliability & Stability
- Monitor and maintain production systems to ensure 99.9%+ uptime
- Implement proactive monitoring and alerting to detect issues before they impact customers
- Perform root cause analysis for incidents and implement permanent fixes
- Create and maintain runbooks for common operational procedures
- Participate in 24/7 on-call rotation and incident response
- Conduct regular reliability reviews and implement improvements
AWS Infrastructure Management
- Deploy and manage AWS services including EC2, ECS/EKS, RDS, S3, CloudFront
- Optimize AWS infrastructure for performance, cost, and reliability
- Implement AWS best practices for security, backup, and disaster recovery
- Configure auto-scaling policies and load balancing for high availability
- Manage AWS networking components (VPC, Security Groups, ALB/NLB)
- Support migration efforts from GCP to AWS under Lead DevOps guidance
Monitoring & Observability
- Design and implement comprehensive monitoring solutions using CloudWatch, Prometheus, Grafana
- Set up distributed tracing and application performance monitoring
- Create meaningful dashboards and alerts for service health
- Define and track SLIs (Service Level Indicators) for critical services
- Implement log aggregation and analysis using ELK stack or similar
- Establish baseline metrics and identify performance anomalies
Automation & Infrastructure as Code
- Develop automation scripts to reduce manual operations and toil
- Implement Infrastructure as Code using Terraform and CloudFormation
- Create CI/CD pipelines for reliable and repeatable deployments
- Automate routine tasks such as backups, scaling, and maintenance
- Build self-healing mechanisms for common failure scenarios
- Develop tools to improve developer productivity and deployment velocity
Performance Optimization
- Analyze system performance and identify bottlenecks
- Optimize application and database performance
- Implement caching strategies to reduce latency
- Conduct load testing and capacity planning
- Fine-tune resource allocation and utilization
- Optimize cloud costs without compromising reliability
Incident Management
- Respond to production incidents with urgency and professionalism
- Follow incident management procedures and escalation protocols
- Document incidents and contribute to post-mortem analysis
- Implement preventive measures based on incident learnings
- Improve MTTR (Mean Time To Recovery) through better tooling and processes
- Maintain incident communication with stakeholders
Collaboration & Documentation
- Work closely with development teams to improve application reliability
- Provide guidance on reliability best practices during the design phase
- Document infrastructure, procedures, and troubleshooting guides
- Share knowledge through team presentations and training sessions
- Collaborate on capacity planning and scaling strategies
- Support developers with production debugging and optimization
Success Metrics
- Maintain 99.9%+ uptime for assigned services
- Reduce incident MTTR by 30% within the first year
- Automate 50% of manual operational tasks
- Zero critical security incidents
- Achieve all SLO targets for assigned services
- Complete AWS migration tasks on schedule
Job requirements
Technical Skills
AWS Expertise:
- Strong proficiency with core AWS services (EC2, S3, RDS, VPC, IAM)
- Experience with container services (ECS, EKS, ECR)
- Knowledge of AWS monitoring and logging (CloudWatch, CloudTrail)
- Understanding of AWS security best practices
- Experience with AWS CLI and SDKs
- Familiarity with AWS Well-Architected Framework
SRE & DevOps Tools:
- Infrastructure as Code: Terraform, CloudFormation, or AWS CDK
- Configuration management: Ansible, Chef, or Puppet
- CI/CD tools: Jenkins, GitLab CI, GitHub Actions
- Containerization: Docker, Kubernetes, Helm
- Version control: Git, GitHub/GitLab
- Scripting languages: Python, Bash, or Go
Monitoring & Observability:
- Prometheus, Grafana, or similar metrics platforms
- Log management: ELK Stack, Splunk, or CloudWatch Logs
- APM tools: New Relic, Datadog, or AppDynamics
- Distributed tracing: Jaeger, Zipkin, or AWS X-Ray
- Alert management: PagerDuty, Opsgenie, or similar
Technical Fundamentals:
- Strong Linux/Unix system administration skills
- Networking concepts: TCP/IP, DNS, Load Balancing, CDN
- Database administration: PostgreSQL, MySQL, Redis
- Understanding of distributed systems and microservices
- Knowledge of security principles and best practices
- Experience with performance tuning and optimization
Soft skills:
- Strong problem-solving and troubleshooting abilities
- Excellent written and verbal communication skills in both English and Vietnamese
- Ability to work effectively under pressure during incidents
- Detail-oriented with strong documentation skills
- Team player with a collaborative mindset
- Proactive approach to identifying and solving problems
- Continuous learning mindset for new technologies
Experience
- 2-5 years of experience in DevOps, SRE, or Infrastructure Engineering
- 2+ years of hands-on AWS experience in production environments
- Experience maintaining high-traffic, high-availability systems
- Proven track record of improving system reliability and uptime
- Experience with 24/7 on-call responsibilities and incident management
Preferred Qualifications
- AWS certifications (SysOps Administrator, DevOps Engineer, or Solutions Architect)
- Experience with GCP and cloud migration projects
- Knowledge of SRE practices from Google's SRE book
- Experience with AI/ML infrastructure and GPU workloads
- Familiarity with automotive industry or vehicle inspection systems
- Experience with chaos engineering and failure injection
- Knowledge of compliance frameworks (SOC 2, ISO 27001)
- Experience with serverless architectures (Lambda, API Gateway)
- Contributions to open-source DevOps/SRE projects
- Experience with FinOps and cloud cost optimization
Why you'll love working here
1. Competitive Compensation & Perks
- Attractive salary package.
- 15 days of annual leave.
- 13th-month bonus.
- Premium healthcare coverage for you and your family.
- Thoughtful appreciation gifts throughout the year.
2. Growth & Learning Opportunities
- Work on cutting-edge, large-scale products in the car inspection field.
- Clear career paths for both technical experts and aspiring leaders.
- Continuous learning programs to sharpen your skills and grow your career.
- Learn from everything, everywhere—but be a smart copy-paster, not a copycat!
- Be ready to embrace and implement new ideas in a fast-paced environment.
3. An Inspiring Workplace
- Flexible hybrid work model and a strong focus on work-life balance.
- A modern, fully equipped office with a well-stocked pantry.
- Be motivated, creative, and passionate—we can’t ask for more!
- Respect and care for your teammates, your environment, and even yourself.
- Treat yourself well, and while you’re at it, save the Earth too.
4. A Mindset for Growth
- Have the courage to move fast, stay flexible, and take full responsibility for every single line of code.
- Always look back at your work and strive to make it better—nothing is perfect, and that’s where you come in.
- It’s okay to be late sometimes, but make sure you’re fully accountable and aware of your actions.
5. A Dynamic and Open Culture
- We don’t stick rigidly to the game plan, so feel free to add or remove your own “blah blah” from this list. 😉