optimisen:~# projects/2019.02.03/

Continuous Delivery in AWS with Chaos Engineering

Feb 2019 - Nov 2019

Situation:

Fast-growing startup/scale-up in online gaming
Node.js and Java apps hosted on Heroku
An ad hoc set of managed services for data, monitoring, etc.
Manually created infrastructure, with failures and fire-fighting on a daily basis
Minimal CI in Travis

Task:

Solve immediate issues with unreliable infrastructure
Design a new consolidated cloud-native DevOps Platform that is reliable, customizable, secure and cost-efficient
Migrate from Heroku and managed services to the new Platform
Automate CI/CD, at least for applications
Promote DevOps culture in Dev teams

Action:

Worked closely with the CTO on architectural decisions to use:
- AWS as the cloud provider
- Kubernetes, specifically EKS, for application hosting
- native AWS services for data hosting (RDS, SQS, ElastiCache, Elasticsearch, etc.)
- Terraform for IaC (Infrastructure as Code)
- GitLab for CI/CD and Pipeline as Code
Built a DevOps team (5 people) from scratch, interviewing candidates and formalizing the hiring process
Led this new team to build a new platform on AWS with EKS, RDS, ES, SQS, IAM, SSO, SSM, CloudFront, Route53, CloudWatch, etc.
Designed a Continuous Delivery template in GitLab for infrastructure, with 100% automation in Terraform, Bash and Docker
Designed a Continuous Delivery template in GitLab for applications, based on Docker, Helm, k3s and Python
Migrated all micro-services from Heroku to the new Platform
Migrated existing minimal pipelines from Jenkins and Travis to GitLab
Introduced Chaos Engineering with weekly Chaos games, destroying pieces of infrastructure and recreating them from scratch to improve reliability and test Disaster Recovery
Introduced ephemeral environments to support testing of app features that break compatibility
Set up VPC peering and integration with the Nasdaq trading engine, used as SaaS
Promoted DevOps philosophy with modern Agile methodologies, Continuous Delivery, Infrastructure as Code, Pipeline as Code, Lean, Kanban and TDD

Result:

Quick resolution of immediate infrastructure issues, and improvement of MTBF (Mean Time Between Failures) from 1 day to 1 month
Kubernetes-based stateless Platform with MTTR (Mean Time to Recover) of 1 hour, i.e. ability to re-create an environment from scratch fully automatically
Continuous Delivery for applications with a ~20-minute Cycle Time
Developers can author their own Dockerfiles and Helm charts, and take their code all the way to Prod via self-service