Moving Canary deployments on AWS using ELB to Kubernetes using Traefik

The canary deployment pattern is very similar to blue-green deployments: you deploy a new version of your application to a subset of your application servers. Once you have tested that everything works fine, you route a small percentage of your users to those application servers and gradually increase the traffic until a full rollout is achieved.

One of the many reasons to do this is to test a feature with a percentage of the users of your service. This can be extended further, for example to enable a feature only for users of a particular demographic.

Canary deployments on AWS

In our use case at Razorpay, canary was used for one of the APIs we serve and give out for consumption. Before we were on Kubernetes, the method for canary deployments was to have two separate Auto Scaling Groups: the primary ASG serving the API, and another ASG with a smaller desired count, which we will call the canary ASG for now.

Both ASGs had their own individual ELBs attached to them, sitting in our public subnet. Each ELB had a CNAME DNS record pointing to the public FQDN given out by AWS.
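
For illustration, the two DNS records could look roughly like this in CloudFormation (a sketch only; the hosted zone, record names and ELB FQDNs below are made up):

# Hypothetical Route 53 records; the zone and ELB DNS names are examples.
Resources:
  MainServiceDNS:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: myapp.example.com.
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - main-elb-1234567890.ap-south-1.elb.amazonaws.com
  CanaryDNS:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: canary.myapp.example.com.
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - canary-elb-0987654321.ap-south-1.elb.amazonaws.com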

For simplicity of the diagram, I have not shown the ASGs for the canary and the main service spanning two separate AZs, but that is the recommended way to go forward: in case of an AZ failure, the ELB (with cross-zone load balancing enabled) still has the other set of instances to route to.

The canary ASG would be attached to both the ELBs: its own canary ELB and the main service's ELB.

The capacity (min and desired) of the main service's ASG is larger than that of the canary ASG, and the canary ASG's max is set equal to its desired capacity. The reasoning for this is that if autoscaling kicks in, a regression would not propagate to a larger share of users.
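
To make the setup concrete, here is a minimal CloudFormation sketch of the two ASGs (the names, counts and subnet IDs are assumptions, and the launch configurations and ELBs are assumed to be defined elsewhere in the template):

# Hypothetical ASG definitions; note the canary's max equals its desired count.
Resources:
  MainASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "9"
      DesiredCapacity: "9"
      MaxSize: "18"                  # the main service may scale out
      LaunchConfigurationName: !Ref MainLaunchConfig
      LoadBalancerNames:
        - !Ref MainELB               # attached to the main ELB only
      VPCZoneIdentifier:             # two AZs, as recommended above
        - subnet-aaaa1111
        - subnet-bbbb2222
  CanaryASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      DesiredCapacity: "1"
      MaxSize: "1"                   # pinned: autoscaling cannot widen the blast radius
      LaunchConfigurationName: !Ref CanaryLaunchConfig
      LoadBalancerNames:
        - !Ref MainELB               # gets a share of the main traffic
        - !Ref CanaryELB             # and is reachable directly via the canary DNS record
      VPCZoneIdentifier:
        - subnet-aaaa1111
        - subnet-bbbb2222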

Since our ELB is an Internet-facing load balancer, it gets public IP addresses (one for each AZ). The DNS name of an Internet-facing load balancer is publicly resolvable to the public IP addresses of the nodes of the ELB. Therefore, Internet-facing load balancers can route requests from clients over the Internet.

The load balancer node that receives the request selects an attached instance using the round robin routing algorithm for TCP listeners and the least outstanding requests routing algorithm for HTTP and HTTPS listeners.

Hence, since the canary instances are also registered with the main ELB, they receive their share of traffic in a round-robin fashion. For example, with nine primary instances and one canary instance attached to the main ELB, the canary would receive roughly 10% of the requests.

Replicating the same in Kubernetes

Traefik runs as our L7 load balancer, i.e. as our ingress controller inside Kubernetes, routing traffic to the Kubernetes services of the various microservices running inside our cluster.

Traefik runs as a DaemonSet with hostNetwork: true.

These pods use the host network directly and not the “pod network” (the term “pod network” is a little misleading, as there is no such thing - it basically comes down to routing network packets and namespaces). So we can bind Traefik to port 80 on the host interface. That of course also means that no further pods of the DaemonSet can use this port, and neither can any other service on the worker nodes. But that is what we want here, as Traefik is basically our “external” load balancer for our “internal” services - our tunnel to the rest of the Internet, so to say.

Here is a sample configuration which you can use to deploy Traefik:

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: traefik-ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: traefik-ingress-controller
subjects:
- kind: ServiceAccount
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: traefik-ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - services
      - endpoints
      - secrets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik
  namespace: kube-system
data:
  traefik-config: |-
    defaultEntryPoints = ["http","https"]
    [entryPoints]
      [entryPoints.http]
      address = ":80"
        [entryPoints.http.redirect]
        regex = "^http://(.*)"
        replacement = "https://$1"
      [entryPoints.https]
      address = ":443"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-service
  namespace: kube-system
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      name: http
      port: 80
    - protocol: TCP
      name: admin
      port: 8080
  type: NodePort
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    matchLabels:
      k8s-app: traefik-ingress-lb
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      nodeSelector:
        edge-node-label: ""
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 60
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16-alpine
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: https
          containerPort: 443
          hostPort: 443
        - name: admin
          containerPort: 8080
        securityContext:
          privileged: true
        args:
        - --loglevel=INFO
        - --web
        - --kubernetes
        - --web.metrics.prometheus
        - --web.metrics.prometheus.buckets=0.1,0.3,1.2,5
        - --configFile=/etc/traefik/traefik.toml
        resources:
          limits:
            cpu: 200m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 150Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/traefik
      volumes:
      - name: config-volume
        configMap:
          name: traefik
          items:
          - key: traefik-config
            path: traefik.toml
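
Once these manifests are applied (for example with kubectl apply -f traefik.yaml), every node carrying the edge-node-label label runs a Traefik pod bound directly to ports 80 and 443 on the host's network interface.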

The diagram above shows two ASGs for edge nodes, which host the Traefik DaemonSet pods.

There would be a CNAME DNS record for myapp.example.com pointing to the public FQDN of the common ELB to which both edge ASGs are attached. Traffic would be routed to the attached edge VMs in a round-robin fashion. In addition, the security groups attached to the ASGs can be configured to only allow TCP connections on port 80 (everything else is blocked automatically, as the default is deny).
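
A sketch of such a security group in CloudFormation (resource names and the referenced ELB security group are hypothetical; if TLS terminates on Traefik, port 443 would be opened the same way):

# Hypothetical security group for the edge nodes: only TCP 80 from the ELB.
Resources:
  EdgeNodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP from the public ELB only
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          SourceSecurityGroupId: !Ref ElbSecurityGroup   # everything else: default deny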

Similarly, there would be a DNS record for the canary endpoint.

Traefik listens on port 80 on the host's network for incoming requests, and an ingress object in the app's namespace defines which service the traffic is routed to, based on the hostname.

We can have an ingress object like the following in the namespace myapp for the two services:

# this feature is available only from traefik version 1.7.0 and upwards
# https://github.com/containous/traefik/releases/tag/v1.7.0
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik-external
    traefik.ingress.kubernetes.io/service-weights: |
      myapp: 90%
      myapp-canary: 10%
  name: myapp-ingress
  namespace: myapp
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - backend:
          serviceName: myapp
          servicePort: 80
        path: /
      - backend:
          serviceName: myapp-canary
          servicePort: 80
        path: /

This way, Traefik routes the traffic coming to myapp.example.com to the myapp and myapp-canary services in a 90/10 split, as specified in the service-weights annotation.

You can specify weights for multiple services; I would only be repeating what has already been written here: https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting
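
For instance, the annotation could hypothetically split traffic across three services (the third service name is made up for illustration):

traefik.ingress.kubernetes.io/service-weights: |
  myapp: 70%
  myapp-canary: 20%
  myapp-experimental: 10%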

Another thing to note here is that the service to which you are doing weighted routing for your canary must be in the same namespace as the other service (which is myapp in this case). This was asked in their issue tracker and they pointed out the same: https://github.com/containous/traefik/issues/4043
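
For completeness, here is a minimal sketch of the two Services the ingress references, both in the myapp namespace (the selectors and target port are assumptions about how the deployments are labelled):

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myapp
spec:
  selector:
    app: myapp
    track: stable          # assumed label on the stable deployment's pods
  ports:
    - port: 80
      targetPort: 8080     # assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
  namespace: myapp         # must be the same namespace as myapp
spec:
  selector:
    app: myapp
    track: canary          # assumed label on the canary deployment's pods
  ports:
    - port: 80
      targetPort: 8080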

So you have seen how we can do canary deployments on AWS using traditional ELBs and ASGs, as well as on Kubernetes with an ingress controller.

Credits

The network diagrams were made using draw.io