From Private Data Centre and Docker Swarm to GCP and Kubernetes
Mar 2020 - Aug 2020
Situation:
UK-wide accommodation booking provider, beginning international expansion
Fixed bare-metal infrastructure in 2 data centres, providing minimal HA (high availability)
Some system components are not replicated, leaving several SPOFs (single points of failure)
Uneven seasonal load, leaving hardware investment underused outside peak periods
“Bus factor” of 1
Delivery tool stack is ~3-5 years old, based on Docker Swarm and Ansible
Most DevOps services are self-managed, and most tools are 2-3 major versions behind
Application monitoring in ELK is barely working
Task:
Pilot migration to GCP (Google Cloud Platform) to benefit from on-demand scaling, cost savings and HA
Design a completely new tech stack based on Kubernetes, Helm, PaaS and managed services
Action:
PoC on a small slice of the system, demonstrating full migration to the new platform
Infrastructure as Code in Terraform with state in Terraform Cloud and automated tests in Kitchen-Terraform
Terraform code for all GCP resources: GKE (Kubernetes), Cloud SQL, Memorystore for Redis and Memcached
CI/CD in GitLab for infrastructure with multiple environments (Dev, Sandbox, Staging, Prod)
CI/CD in GitLab for applications, deploying with Helm to the provisioned Kubernetes clusters alongside Cloud SQL, Memorystore and Solr
Kubernetes manifests and Helm charts converted from the previous docker-compose and Docker Swarm setup
ELK (Elasticsearch, Logstash, Beats, Kibana) - fully refactored the setup and data pipeline for app logging, with gradual upgrades from 4.x to 7.x and performance tuning of sharding, recovery, Curator, etc.
Datadog - configured for monitoring and alerting across all DevOps services
Set up a detailed report on cloud costs in GCP Billing
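The docker-compose to Kubernetes conversion listed above can be illustrated with a minimal sketch. This is not the tooling used in the project: the function name, the example service and its image tag are hypothetical, and only a few compose fields (image, environment, ports, deploy.replicas) are mapped. It assumes the compose service has already been parsed into a Python dict and emits a Deployment as JSON, which kubectl accepts alongside YAML.

```python
import json


def compose_service_to_deployment(name, service, default_replicas=1):
    """Map one docker-compose service dict to a Kubernetes Deployment dict.

    Hypothetical helper for illustration; volumes, networks, healthchecks
    and port ranges are deliberately not handled.
    """
    labels = {"app": name}
    container = {"name": name, "image": service["image"]}

    # compose allows "environment" as a mapping or a list of KEY=VALUE strings
    env = service.get("environment", {})
    if isinstance(env, list):
        env = dict(item.split("=", 1) for item in env)
    if env:
        container["env"] = [{"name": k, "value": str(v)} for k, v in env.items()]

    # "8080:80/tcp" -> container port 80 (the part after the last colon)
    ports = [
        int(str(p).split("/")[0].split(":")[-1])
        for p in service.get("ports", [])
    ]
    if ports:
        container["ports"] = [{"containerPort": p} for p in ports]

    # Swarm-style deploy.replicas, falling back to the default
    replicas = service.get("deploy", {}).get("replicas", default_replicas)

    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical service definition, as it might appear under "services:" in a compose file
web = {
    "image": "example/web:1.0",
    "ports": ["8080:80"],
    "environment": ["MODE=prod"],
    "deploy": {"replicas": 3},
}
manifest = compose_service_to_deployment("web", web)
print(json.dumps(manifest, indent=2))
```

In a Helm chart, values such as the image tag and replica count would typically become template parameters rather than being fixed in the manifest.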
Result:
Estimated 60% infrastructure cost savings on full migration to the new platform
Application CI/CD pipeline with a ~20-minute cycle time to deliver changes all the way to Prod
ELK cluster recovery time after a single-node outage reduced from 3 hours to 6 minutes (a 30x improvement)