About the customer
This client is a global digital freight forwarder that specialises in optimising and managing supply chains in real-time.
They offer a modern, technology-driven platform that provides transparency and efficiency to businesses handling complex logistics and global freight movements.
The challenge
As their SaaS platform experienced exponential growth, the need to scale quickly became critical.
However, their release process was highly manual, relying on custom scripts and Google Cloud VMs that were treated like pets: each environment was configured and maintained by hand, which made it fragile and time-consuming to manage.
The organisation faced several challenges:
- Their infrastructure was struggling to handle the increasing volume of real-time data, leading to performance bottlenecks during peak usage.
- Major patches followed a “hope and pray” strategy, as there was no automated way to spin up replacement machines if something went wrong.
- Releases were unpredictable and complex, with multiple manual steps, resulting in frequent downtime and high operational risk.
- They lacked oversight of the stability of their services and visibility into the underlying metrics needed to control mean time to recovery (MTTR), making it difficult to respond to incidents quickly and efficiently.
The solution
To address the scaling challenges and introduce automation, we transitioned all of their services from manually managed VMs to Docker containers running on Kubernetes. This shift allowed for more efficient and scalable management of their infrastructure, with Kubernetes providing self-healing capabilities to automatically handle container failures and ensure high availability.
Using KEDA, we provided autoscaling that closely tracked demand on the system, without waiting for resource usage to spike first. In effect, this allowed us to pre-allocate resources that we knew would be required in the near future.
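To illustrate the idea of scaling on demand rather than on resource usage, here is a minimal sketch of the replica calculation that a queue-driven autoscaler performs. It assumes the demand signal is a count of pending work items, which the case study does not specify; the function name, parameters, and limits are hypothetical and are not KEDA's API.

```python
import math


def desired_replicas(pending_items: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Return how many replicas are needed so that each one handles
    at most `target_per_replica` pending items (illustrative only)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Scale on the backlog itself, not on CPU or memory usage.
    wanted = math.ceil(pending_items / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 50 items per replica, a backlog of 120 items asks for 3 replicas as soon as the backlog appears, before CPU or memory usage on the existing replicas has had time to rise.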
We no longer feel anxiety when a customer contacts support. It’s liberating to actually feel excited when a customer reaches out to us. In the past this was a sign for panic as the system was likely down.
CTO
We implemented infrastructure as code using Terraform, enabling reliable, consistent, and repeatable infrastructure changes. This dramatically improved their release process by eliminating manual interventions, allowing developers to confidently deploy both applications and infrastructure updates. A GitOps approach on top of Kubernetes ensured that the entire environment, from the infrastructure to the application level, was managed in a declarative, version-controlled way.
To further enhance the observability and reliability of their system, we integrated Prometheus for monitoring, Grafana for real-time visualisation of metrics, and Thanos for long-term storage and high availability of metrics across clusters. This setup gave the team deep insights into system performance and allowed them to proactively monitor the health and stability of their services.
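One figure that this kind of visibility makes measurable is mean time to recovery. As a rough sketch of the arithmetic only, the snippet below averages the gap between detection and resolution over a list of incidents. The incident timestamps are invented for illustration and are not the customer's data, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents),
                timedelta())
    return total / len(incidents)


# Hypothetical incidents, made up purely to show the calculation.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 10)),
]
mttr = mean_time_to_recover(incidents)
```

In practice the detection and resolution times would come from alerting and incident records rather than a hard-coded list.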
As a result, the combination of Kubernetes' self-healing, Terraform's infrastructure-as-code reliability, and the comprehensive monitoring stack of Prometheus, Grafana, and Thanos transformed their ability to scale, release, and monitor their systems efficiently and predictably.
I am now able to actually start finishing these feature requests on our backlog. In the past I was too weighed down with ad-hoc work to find the time. It has made my job a lot more fun.
Lead Developer
Our impact
Initial Setup | With PolitePixels |
---|---|
Infrastructure manually managed through custom scripts and fragile VMs. | Automated, scalable infrastructure with Docker containers and Kubernetes. |
Releases were complex, manual, and prone to failure, resulting in frequent downtime. | Near-instantaneous and reliable releases. |
Performance bottlenecks during peak traffic caused by slow response to increasing demand. | Pre-allocating resources to match demand before traffic spikes. |
Lack of visibility into system health, making it difficult to recover from incidents. | Real-time monitoring, deep insights, and proactive system management. |
Downtime and incidents created panic, with no quick way to recover. | Near elimination of unscheduled downtime. |
Key results
Near-complete elimination of downtime, thanks to Kubernetes' self-healing capabilities and automated scaling to meet application demand, ensuring high availability even during failures.
50% faster deployment times, with the shift to infrastructure as code using Terraform, enabling more frequent and reliable infrastructure changes.
Major increase in system visibility, thanks to the integration of Prometheus, Grafana, and Thanos, allowing for real-time monitoring, deeper insights, and proactive issue resolution.