Infrastructure Management at Scale with Cluster API, MachineDeployments and MachinePools

I have been managing infrastructure at organisations of different shapes and sizes since early 2017. The toolset has really evolved over time, and different solutions have been developed by organisations along the way. Starting out, I saw a number of places calling cloud provider APIs directly and building CLI tooling on top of them. HashiCorp really gained market share here with Terraform. AWS also had CloudFormation, but across the organisations I have worked with or observed over time, most standardised on Terraform. ...

February 20, 2026 · 7 min · Tasdik Rahman

Scaling cluster upgrades for kubernetes

This post is more of a continuation of the talk I gave at the Kubernetes Bangalore September 2022 meetup. Here are the slides, which you can take a peek at to complement this post, if you would like to go through them before reading further. I will not repeat the content which is already there in the slides, and will also update this post with the talk link when the recording gets uploaded. But I do want to delve into how I would attempt to structure the upgrades next. This post is more about the infrastructure upgrade complexities that arise when managing double-digit or more k8s clusters. ...

September 26, 2022 · 5 min · Tasdik Rahman

Musings with client-go of k8s

This post is mostly for documentation purposes for myself, about a few things I ended up noticing while using client-go for deliveryhero/k8s-cluster-upgrade-tool, which used the out-of-cluster client configuration. A couple of things are specific to that setup, like client initialization, but others, like testing interactions via client-go, are more generic. client-go itself shows a couple of examples of client init here; pasting the snippet here for context ...
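Since the excerpt cuts off before the snippet it mentions, here is a minimal sketch of the out-of-cluster client initialization pattern it refers to, closely following the upstream client-go example; the pod listing at the end is only a sanity check, not something the post prescribes:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Default to ~/.kube/config, the usual location for an out-of-cluster setup.
	var kubeconfig *string
	if home := homedir.HomeDir(); home != "" {
		kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
	} else {
		kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
	}
	flag.Parse()

	// Build a rest.Config from the kubeconfig on disk.
	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}

	// Create the typed clientset used to talk to the API server.
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Quick sanity check: list pods across all namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
```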

September 22, 2022 · 6 min · Tasdik Rahman

To self host or to not self host kubernetes cluster(s)

A friend of mine recently asked me what it was like to self-host kubernetes clusters. I was cursing myself for not writing this post earlier (technically, I have written about how we used to do self-hosting before, but not about its pros and cons), as this was not the first time I had been asked this question. So this post is dedicated to my friend, and to others when they chance upon this question. ...

November 27, 2020 · 6 min · Tasdik Rahman

Choosing between one big cluster or multiple smaller kubernetes clusters

This post is a continuation of the discussion I was having with @vineeth: "But why would someone choose one large cluster over multiple small clusters? Aren't multiple clusters already a pattern in enterprises?" — Vineeth Pothulapati (@VineethReddy02), November 20, 2020. The context is that I came across a tweet which demonstrated the ability of kubernetes to scale up to 15k nodes due to recent improvements: "15k nodes cluster 🤯 https://t.co/VMWI7HeYHH" — Tasdik Rahman (@tasdikrahman), November 20, 2020. The discussion was originally around costs and how much it would take to run one such large kubernetes cluster, but it went in a different direction altogether. ...

November 21, 2020 · 10 min · Tasdik Rahman

A few notes on GKE kubernetes upgrades

This post was originally published on Gojek's engineering blog here; this is a cross-post of the same. It is more of a continuation of this tweet: "A few notes on @kubernetes cluster upgrades on GKE (1/n)" — Tasdik Rahman (@tasdikrahman), July 21, 2020. If you are running kubernetes on GKE, chances are you are already doing some form of upgrades for your clusters, given that the upstream release cycle is quarterly, which means a minor version bump every quarter. That is a really high velocity for version releases, but it is not the focus of this post; the focus is on how you can attempt to keep up with this release cycle. ...

July 22, 2020 · 15 min · Tasdik Rahman

Our learnings from Istio’s networking APIs while running it in production

This was originally published by me on Gojek's engineering blog; this post is a repost. We at Gojek have been running Istio 1.4 with a multi-cluster setup for some time now, on top of which we have been piloting a few reasonably high-throughput services in production, serving customer-facing traffic. One of these services hits ~195k requests/minute. In this blog, we'll deep dive into what we have learnt and observed while using Istio's networking APIs. ...

June 17, 2020 · 10 min · Tasdik Rahman

Specifying scheduling rules for your pods on kubernetes

This is more of an extended version of the tweet here: "If you haven't had a look at pod-affinity and anti-affinity, it's a great way which one can use to distribute the pods of their service across zones. https://t.co/iqhbyhruD8 (1/n)" — Tasdik Rahman (@tasdikrahman), February 23, 2020. PodAntiAffinity/PodAffinity went beta back in 2017, in the 1.6 release of k8s, along with node affinity/anti-affinity, taints and tolerations, and custom scheduling. ...
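To make the idea concrete, here is a minimal sketch (mine, not taken from the post) that builds a Deployment with a soft pod anti-affinity rule spreading replicas of a hypothetical my-service across zones; the labels, image name, and the modern topology.kubernetes.io/zone topology key are assumptions for illustration:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	labels := map[string]string{"app": "my-service"} // hypothetical service label

	deploy := appsv1.Deployment{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
		ObjectMeta: metav1.ObjectMeta{Name: "my-service"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "app", Image: "my-service:1.0.0"}},
					Affinity: &corev1.Affinity{
						PodAntiAffinity: &corev1.PodAntiAffinity{
							// Soft rule: prefer that replicas of the same app do not
							// land in the same zone, so a zonal outage cannot take
							// out every pod at once. Older clusters used the
							// failure-domain.beta.kubernetes.io/zone key instead.
							PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
								Weight: 100,
								PodAffinityTerm: corev1.PodAffinityTerm{
									LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
									TopologyKey:   "topology.kubernetes.io/zone",
								},
							}},
						},
					},
				},
			},
		},
	}

	// Print the generated manifest for inspection.
	out, _ := yaml.Marshal(deploy)
	fmt.Println(string(out))
}
```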

May 6, 2020 · 5 min · Tasdik Rahman

Route missing in kubernetes node with kuberouter as the CNI

For anyone evaluating a networking solution for their kubernetes cluster without a lot of moving parts, kuberouter provides pod networking, the ability to enforce network policies, and an IPVS/LVS service proxy, among other things. The problem we faced specifically while running it in our clusters was missing routes upon restart of a node, or sometimes when a node was joining the cluster as a worker node. ...

January 5, 2020 · 2 min · Tasdik Rahman

Various ways of enabling canary deployments in kubernetes

Update: I gave a quick lightning talk on the same topic at DevOpsDays India 2019; the slides can be found below. What canary can be: shaping the traffic so that we can direct a percentage of it to the new pods, promote that deployment to a full scale-out, and gradually phase out the older release. Why canary? Testing on staging doesn't weed out all the possible reasons for something failing, and doing the final testing of a feature on some part of live traffic is not unheard of. Canary is also a precursor to full blue-green deployments. ...
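As one concrete illustration of the idea (my sketch, not the post's example), the plainest way to get a rough traffic percentage is a replica ratio: two Deployments sharing an `app` label behind one Service, with the canary running a small fraction of the replicas. The my-service names, images, and the 9:1 split below are hypothetical:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// deployment builds a Deployment whose pods carry both the shared "app" label
// (matched by the Service) and a "track" label that tells the two releases apart.
func deployment(track, image string, replicas int32) appsv1.Deployment {
	labels := map[string]string{"app": "my-service", "track": track}
	return appsv1.Deployment{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
		ObjectMeta: metav1.ObjectMeta{Name: "my-service-" + track},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "app", Image: image}},
				},
			},
		},
	}
}

func main() {
	// The Service selects only on "app", so it load-balances across both tracks.
	// With 9 stable replicas and 1 canary replica, roughly 10% of requests
	// reach the new release.
	objects := []interface{}{
		corev1.Service{
			TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Service"},
			ObjectMeta: metav1.ObjectMeta{Name: "my-service"},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{"app": "my-service"},
				Ports:    []corev1.ServicePort{{Port: 80}},
			},
		},
		deployment("stable", "my-service:1.0.0", 9),
		deployment("canary", "my-service:1.1.0", 1),
	}
	for _, obj := range objects {
		out, _ := yaml.Marshal(obj)
		fmt.Printf("---\n%s", out)
	}
}
```

Promoting the canary then just means scaling the canary Deployment up and the stable one down; finer-grained splits are where the service-mesh or ingress-level approaches the post covers come in.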

September 12, 2019 · 4 min · Tasdik Rahman