Chaos Engineering Expert
Artificial Intelligent Network Solutions - "AINS"
Date: 20 minutes ago
City: Riyadh
Contract type: Full time
Position Title: Chaos Engineering Expert (Resiliency & Performance Engineer)
Location: Riyadh / Hybrid (On-prem + Off-shore)
Employment Type: Full-time
Reports to: Head of Infrastructure / SRE / Platform Engineering
Role Overview
We are seeking an experienced Chaos Engineering Expert to help strengthen the resiliency, performance, and security posture of our hybrid infrastructure. In this role, you will design, execute, and analyze chaos experiments across our on-premises servers, databases, and application services, and work collaboratively with our team to embed resilience into our systems. You will also deliver a maturity report with insights and recommended improvements to system architecture, operational processes, and incident response, enabling us to anticipate and mitigate failures before they impact customers.
______________
Responsibilities
- Design, plan, and implement chaos engineering experiments across all layers of our infrastructure (physical/virtual servers, network, storage, databases, applications, and services).
- Develop hypotheses (failure scenarios), define metrics, and create success criteria for experiments.
- Execute fault-injection and chaos tests in pre-production, staging, or controlled production environments, ensuring minimal risk to business operations.
- Monitor and instrument system behavior during experiments using observability tools.
- Analyze experiment results to identify vulnerabilities, failure modes, and weak points, and derive actionable recommendations.
- Collaborate with DevOps, SRE, DBA, security, network, BCM, and operations teams to remediate issues uncovered by experiments and to meet each system's recovery time objective (RTO).
- Integrate chaos experiments into the CI/CD pipeline or as part of release/reliability practices.
- Build a chaos framework suitable for our hybrid environment.
- Document all experiments, including design, configuration, execution details (drills), results, lessons learned, and corrective actions.
- Develop and maintain runbooks, playbooks, and operational procedures for resilience testing.
- Participate in post-incident reviews, injecting learnings from chaos experiments into incident response and root cause analysis.
Technical Skills & Experience
- 6+ years of experience in site reliability engineering (SRE), performance engineering, and infrastructure engineering.
- Proven track record of designing and executing fault injection, resilience testing, and chaos experiments.
- Deep understanding of on-premises infrastructure: physical and virtual servers, hypervisors, networking, storage.
- Experience with database systems (e.g., SQL, NoSQL) and how they fail and recover.
- Familiarity with application stacks, microservices, and distributed architectures.
- Proficiency in one or more languages used for automation or scripting (e.g., Python, Go, Java, or similar).
- Hands-on experience with tools such as Chaos Monkey, Gremlin, Chaos Mesh, Litmus Chaos, Toxiproxy, AWS Fault Injection Simulator (FIS), Azure Chaos Studio, or similar.
- Strong skills in monitoring, metrics, logging, and tracing (e.g., OpenText SiteScope, Datadog).
- Experience integrating chaos testing into CI/CD pipelines and infrastructure-as-code workflows.
- Good understanding of security vulnerabilities and how fault injection might surface security risks.
- Familiarity with risk assessment, threat modeling, or security hardening practices.
- Ability to work across teams (DevOps, DBAs, Ops, Security) and communicate complex findings in a clear manner.
- Strong documentation skills — proven ability to write detailed experiment designs, results, remediations, and technical playbooks.
- Analytical skills: strong analytical mindset, capable of interpreting results, identifying root causes, and recommending mitigations.
Certifications & Education
- Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- Certification or formal training in chaos engineering (or resilience engineering) is required.
- Chaos Engineering Fundamentals certificate is preferred.
- Certifications in SRE or DevOps (e.g., GitLab, Azure, Google Cloud, Kubernetes) are beneficial.
- Security certifications are a plus as security vulnerability testing is part of chaos experiments.