Tasdik Rahman
2023-10-25T06:43:18+00:00
https://www.tasdikrahman.com
Tasdik Rahman
Oncall in product teams
2023-10-25T00:00:00+00:00
https://www.tasdikrahman.com/2023/10/25/oncall-rotations
<p>I have been oncall for as long as I can remember being in the industry, so far at every organisation I have been part of. Different things have worked at different phases of the organisation and the team's priorities. I thought I would put down some notes on things which I have seen work well.</p>
<h1 id="why-do-we-need-to-have-oncall-rotations">Why do we need to have oncall rotations?</h1>
<p>Simply put, it helps people not burn out from being unofficially oncall all the time, as they otherwise keep getting pulled into the specifics of the systems they are most aware of.</p>
<p>Having a proper rotation which cycles through all engineers in the team uniformly also distributes the team's knowledge, and the toil, around whatever the team gets paged for most often.</p>
<p>While I was at Gojek, I would recommend that graduate engineers joining a team (and I recommended the same to the new graduate engineers in my own team) always shadow production support, as it is a very good avenue for getting onboarded to the services which the team owns and has touchpoints with.</p>
<p>Furthermore, it reduces the bus factor, as knowledge is then forced, in a way, to be distributed across the team over time via docs, debugging sessions and runbooks.</p>
<h1 id="communication-channels">Communication Channels</h1>
<p>To organise comms during an incident, ideally have channels where people can update the status of the incident and discuss what they are doing.</p>
<p>It also helps to have channels where you announce changes like deployments/config changes, to keep track of what changed. If folks have an incident and want to see what recently went into the production systems, they can look there.</p>
<h1 id="oncall-is-for-whom">Who is oncall for?</h1>
<p>Apart from the usual pattern of the product team's engineers being oncall, it is also very useful to have senior engineers who hold a lot of cross-team context in the rotation, along with leaders from the teams, as it again helps distribute the load of oncall.</p>
<h1 id="getting-paged-too-often">Getting paged too often</h1>
<p>This is another post in itself, but reliability and product development pace need not be exclusive of each other. Change, in essence, might be the simplest way to introduce instability into a system, but that's the nature of product development and of preventing software rot.</p>
<p>Change is inevitable in the system, so the idea is to always make change as easy to make, and as easy to revert, as possible.</p>
<p>If the team is getting paged too often, the ideal path is to focus on putting out that fire first to prevent people from burning out, but balance is key.</p>
<p>More reliability comes with a cost.</p>
<p>There are also parts of the system which a specific person knows better, through tribal knowledge, because they built it, or simply because they have spent a huge amount of time with it. That person then inevitably gets paged more because of it. This is a symptom to fix, by distributing that oncall load across the team.</p>
<h1 id="primary-and-secondary">Primary and Secondary</h1>
<p>Layering your oncall is helpful in case the primary misses responding to a page, which then gets auto-escalated to the layer above. If your team can follow a follow-the-sun pattern for oncall, that is potentially the best option for providing coverage for the team.</p>
<h1 id="oncall-as-part-of-sreoperations-teams">Oncall as part of SRE/Operations teams</h1>
<p>There is no fixed checklist which will work for every team, but on a general note, reducing toil for oncall and having a rotation is a good place to start, and then iterate as you go.</p>
<p>I have also <a href="https://www.tasdikrahman.com/2022/02/21/evolution-of-support-for-infrastructure-teams/">written about how being the first responder for a team's requests</a> is ideally also spread across the team over a fixed rotation cycle.</p>
<h1 id="ending-notes">Ending notes</h1>
<p>Oncall is a challenging problem, but inevitable for teams, as long as you are providing something to users. Heroism in oncall, along with toil management, is something you will have to manage in any team nevertheless; as long as you are not completely ignoring these aspects, you are progressing towards a better oncall.</p>
<p>There’s a lot of good literature around the internet on how teams should run oncall, and I would highly recommend reading them. As always, I end up learning something new when I read how teams are operating and learn from them.</p>
Keyboard setups over time
2023-09-27T00:00:00+00:00
https://www.tasdikrahman.com/2023/09/27/keyboard-setups-over-time
<p>Over the years, I have used different keyboards for myself, and my setups have changed over time.</p>
<p>This post is just me going over the setups which I have used over the years, to reflect back on the memories attached to each one.</p>
<h1 id="college-days">College days</h1>
<p>I would use the laptop keyboard itself when I was in college back then; it was an ultrabook with a really tiny keyboard and no CD-ROM drive.
The setup in my room was also not very ergonomic in the first place, as I didn't have a chair which was ergonomic enough for me with the desk. But it was nevertheless the first setup which I had, in all fairness.</p>
<p>To elevate the laptop from its usual level on the table, I would use books to raise it a bit, which also acted as a helper for the laptop's fan; the machine was notorious for heating up really quickly. One reason being that the fan was just too small, and the laptop being that compact might not have really helped with the heat management of the whole machine.</p>
<center><img src="/content/images/2023/09/college_keyboard_setup.jpg" /></center>
<p>This setup was mostly what I used for the rest of college; this image dates back to 2015.</p>
<h1 id="getting-my-first-keyboard">Getting my first keyboard</h1>
<p>Or rather, getting my first set of keyboards. This was the time when I had heard about mechanical keyboards and was excited about having one for myself. In that phase, I got not one but two mechanical keyboards, both of them with Cherry MX Blue switches.</p>
<p>They did make some noise compared to my last setup.</p>
<p>The two keyboards being:</p>
<ul>
<li><a href="https://www.coolermaster.com/catalog/peripheral/keyboards/masterkeys-pro-l/">Cooler Master MasterKeys Pro L</a>, Cherry MX Blue switches.</li>
<li>Keycool 84 Mini, Cherry MX Blue switches.</li>
</ul>
<p>Here’s a picture of me with both of the keyboards, over at possibly the first mechanical keyboard meetup in Bangalore, dating back to 2019, which also was the start of the <a href="https://reddit.com/r/mkindia">/r/mkindia</a> community.</p>
<center><img src="/content/images/2023/09/blr_keyboard_meetup.jpg" /></center>
<center><img src="/content/images/2023/09/blr_keyboard_meetup_1.jpg" /></center>
<center><img src="/content/images/2023/09/mk_community.jpg" /></center>
<h1 id="change-over-from-cherry-mx-blue-to-cherry-mx-brown-switch">Change over from Cherry MX blue to Cherry MX brown switch</h1>
<p>At some point, I decided to try out the brown switch, as I hadn't tried it before. It did feel different from the blue switch, but I got used to it over time.</p>
<p>The keyboard ended up being a <a href="https://www.mechanical-keyboard.org/cm-storm-quickfire-rapid/">CM Storm QuickFire Rapid</a>.</p>
<center><img src="/content/images/2023/09/cm_storm_setup.jpg" /></center>
<h1 id="current-setup-as-of-september-2023">Current setup as of September 2023</h1>
<p>I had mostly been using my Keycool 84 Mini along with my Cooler Master, with a Logitech Pebble mouse. I really wanted to try an ortholinear layout, along with a curved and split setup.</p>
<p>And I ended up with the <a href="https://kinesis-ergo.com/shop/adv360pro/">Kinesis Advantage 360 Pro</a>, along with the <a href="https://www.logitech.com/de-de/products/mice/mx-vertical-ergonomic-mouse.910-005448.html">Logitech MX Vertical</a> mouse.</p>
<p>The keyboard has certainly taken me time to get used to, but I have liked it so far. The initial days of using it were super slow, but I am slowly getting back to touch typing on it.</p>
<center><img src="/content/images/2023/09/kinesis_setup.jpg" /></center>
<p>This is combined with a standing desk, a nice ergonomic chair, and a single monitor with a laptop stand. Looking back, the setup has certainly come a long way.</p>
Renewing your root CA with a new root CA such that the older certs signed by the old root CA are still valid
2023-05-18T00:00:00+00:00
https://www.tasdikrahman.com/2023/05/18/renewing-root-ca-while-preserving-the-certificates-it-has-signed
<h1 id="context">Context</h1>
<p>If you have a root CA which you used to sign certificates, and that root certificate is about to expire, the certificates signed by it will also become invalid once the root CA expires, even if they themselves haven't expired, as every certificate in the chain must remain valid for your certificate to be valid.</p>
<p>For example, the kube-apiserver takes a <code class="language-plaintext highlighter-rouge">--client-ca-file</code> flag when it comes up, where you can pass the root CA.</p>
<h2 id="going-about-this">Going about this</h2>
<p>Now if you end up replacing this cert, you would have to replace all the certs in the clients with certs signed by the new root CA, to allow uninterrupted functioning.</p>
<p>The other option is to append the new root CA inside the same file. This allows certs signed by both the new root CA and the old root CA to work, buying time to update the certs on the clients. Why append in the same file? <a href="https://datatracker.ietf.org/doc/html/rfc1422">RFC 1422</a> mentions that a PEM file can contain multiple certificates.</p>
<p>This also means no configuration changes for whatever is being passed to <code class="language-plaintext highlighter-rouge">--client-ca-file</code>, for example for the kube-apiserver.</p>
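<p>As a rough sketch of that bundling step (all filenames here are made up for illustration), appending really is just concatenating the PEM blocks; anything pointed at the bundle will then accept certs chaining to either root:</p>

```shell
#!/bin/sh
# Illustrative only: generate two throwaway self-signed roots standing in for
# the old and the new CA, then bundle them into one PEM file.
set -e
dir=$(mktemp -d); cd "$dir"
openssl req -x509 -newkey rsa:2048 -nodes -keyout old.key -out old-ca.pem \
  -days 1 -subj "/CN=old-root"
openssl req -x509 -newkey rsa:2048 -nodes -keyout new.key -out new-ca.pem \
  -days 3650 -subj "/CN=new-root"
# the "append" from the text: a CA bundle is just concatenated certificates
cat old-ca.pem new-ca.pem > ca-bundle.pem
grep -c "BEGIN CERTIFICATE" ca-bundle.pem   # prints 2
```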
<h2 id="generating-the-root-ca-again">Generating the root CA again</h2>
<p>In order for things to work as expected, we need to generate the new root CA with the same serial and v3 extensions; this allows the certs signed by the older root CA to verify against the renewed root CA.</p>
<p>The CSR is generated from the old private key and the old root CA:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CACRT=existing-ca.pem
CAKEY=existing-ca-key.pem
NEWCA=renewed-ca.pem
# reuse the serial, as this is how the client certs signed by the original CA
# will keep verifying against the renewed CA
serial=$(openssl x509 -in $CACRT -serial -noout | cut -f2 -d=)
# generate the CSR from the old private key and the existing root CA
openssl x509 -x509toreq -in $CACRT -signkey $CAKEY -out $NEWCA.csr
# v3 extensions for the new root CA; subjectKeyIdentifier=hash recomputes the
# same subject key identifier as the original cert, since the key is unchanged
cat renewed-ca.pem.conf
[ v3_ca ]
basicConstraints = critical, CA:TRUE
keyUsage = critical, keyCertSign, cRLSign
subjectKeyIdentifier = hash
openssl x509 -req -days 3650 -in $NEWCA.csr -set_serial 0x$serial -signkey $CAKEY -out $NEWCA.crt -extfile ./$NEWCA.conf -extensions v3_ca
</code></pre></div></div>
<p>And that's pretty much it; the <code class="language-plaintext highlighter-rouge">$NEWCA.crt</code> can be used now.</p>
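<p>To sanity-check the renewal end to end, here is a self-contained sketch (throwaway names and demo keys in a temp directory, nothing from a real setup). It signs a leaf with an old root, re-issues the root from the same key and subject, and confirms the old leaf still verifies, which is the property the steps above preserve:</p>

```shell
#!/bin/sh
# Demo: a leaf signed by the old root keeps verifying against a root
# re-issued from the same key and subject.
set -e
dir=$(mktemp -d); cd "$dir"

# old root CA, about to "expire"
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out old-ca.pem \
  -days 2 -subj "/CN=demo-root"

# a client cert signed by the old root
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
  -subj "/CN=demo-client"
openssl x509 -req -in leaf.csr -CA old-ca.pem -CAkey ca.key \
  -CAcreateserial -days 2 -out leaf.pem

# re-issue the root from the same key and subject, with a long validity
openssl req -x509 -new -key ca.key -out renewed-ca.pem -days 3650 \
  -subj "/CN=demo-root"

# the old leaf verifies against the renewed root
openssl verify -CAfile renewed-ca.pem leaf.pem
```

<p>The chain verifies because the issuer name and the signing key are unchanged; the serial/extension steps in the main snippet matter when clients also check identifiers on the CA cert itself.</p>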
<h2 id="references">References</h2>
<ul>
<li><a href="https://stackoverflow.com/questions/9234796/does-the-expiration-status-of-an-issuers-certificate-affect-a-subjects-expirat">https://stackoverflow.com/questions/9234796/does-the-expiration-status-of-an-issuers-certificate-affect-a-subjects-expirat</a></li>
<li><a href="https://stackoverflow.com/questions/40061263/what-is-ca-certificate-and-why-do-we-need-it">https://stackoverflow.com/questions/40061263/what-is-ca-certificate-and-why-do-we-need-it</a></li>
<li><a href="https://security.stackexchange.com/questions/20803/how-does-ssl-tls-work">https://security.stackexchange.com/questions/20803/how-does-ssl-tls-work</a></li>
<li><a href="https://serverfault.com/questions/9708/what-is-a-pem-file-and-how-does-it-differ-from-other-openssl-generated-key-file">https://serverfault.com/questions/9708/what-is-a-pem-file-and-how-does-it-differ-from-other-openssl-generated-key-file</a></li>
</ul>
neovim setup for golang 6 months in
2023-04-29T00:00:00+00:00
https://www.tasdikrahman.com/2023/04/29/nvim-setup-for-golang-6-months-in
<h1 id="so-far">So far</h1>
<p>I have been using my nvim golang setup for a bit more than 6 months now, and I am still learning something new every day while I use it.</p>
<p>coc.nvim works really well for me so far. Linting, autocomplete, jumping to definitions and back, code folding, checking references for where a function/method is used: it has all worked out well for me so far.</p>
<h1 id="what-i-am-trying-to-fix-on-my-setup-so-far">What I am trying to fix on my setup so far</h1>
<p>The debugging experience can certainly be improved; I have been trying <a href="https://github.com/leoluz/nvim-dap-go/">nvim-dap-go</a>.</p>
<p>It does work for tests without a lot of nesting; for example, it works great for simple <code class="language-plaintext highlighter-rouge">t.Run()</code> calls which are not overly nested (at least in my usage so far). But I have had trouble using it with some <a href="https://github.com/onsi/ginkgo">ginkgo</a>-style tests, where it is not able to find the parent test when trying to add a debug point; internally it uses treesitter to parse the whole file.</p>
<p>There is an open issue where I added more context: <a href="https://github.com/leoluz/nvim-dap-go/issues/9#issuecomment-1521496733">https://github.com/leoluz/nvim-dap-go/issues/9#issuecomment-1521496733</a>.</p>
<p>My setup for this is very simple, and mostly untouched from the usual config which is out there in the docs:</p>
<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lua <<EOF
local dap = require('dap')
local dapui = require("dapui")
dapui.setup()
require('dap-go').setup {
  -- delve configurations
  delve = {
    -- time to wait for delve to initialize the debug session.
    -- default to 20 seconds
    initialize_timeout_sec = 20,
    -- a string that defines the port to start delve debugger.
    -- default to string "${port}" which instructs nvim-dap
    -- to start the process in a random available port
    port = "${port}"
  },
  dap_configurations = {
    {
      type = "go",
      name = "Attach remote",
      mode = "remote",
      request = "attach",
    },
  },
}
-- You can use nvim-dap events to open and close the windows automatically
-- from: https://github.com/rcarriga/nvim-dap-ui
dap.listeners.after.event_initialized["dapui_config"] = function()
  dapui.open()
end
dap.listeners.before.event_terminated["dapui_config"] = function()
  dapui.close()
end
dap.listeners.before.event_exited["dapui_config"] = function()
  dapui.close()
end
EOF
</code></pre></div></div>
<h1 id="workflow">Workflow</h1>
<p>The flow of usage is pretty much out of the box:</p>
<ul>
<li>add your breakpoints at the places you want with <code class="language-plaintext highlighter-rouge">:lua require'dap'.toggle_breakpoint()</code></li>
<li>then debug a specific test with <code class="language-plaintext highlighter-rouge">:lua require('dap-go').debug_test()</code></li>
<li>to step through the breakpoints, you do a <code class="language-plaintext highlighter-rouge">:lua require('dap').continue()</code></li>
<li>and to terminate the process, <code class="language-plaintext highlighter-rouge">:lua require'dap'.terminate()</code></li>
</ul>
<h1 id="key-maps-being-used">Key maps being used</h1>
<p>I am currently using the following keymaps:</p>
<div class="language-vim highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nmap <span class="p"><</span><span class="k">silent</span><span class="p">></span> <span class="p"><</span>leader<span class="p">></span><span class="nb">tb</span> <span class="p">:</span><span class="k">lua</span> require<span class="s1">'dap'</span><span class="p">.</span>toggle_breakpoint<span class="p">()<</span>CR<span class="p">></span>
nmap <span class="p"><</span><span class="k">silent</span><span class="p">></span> <span class="p"><</span>leader<span class="p">></span><span class="k">tc</span> <span class="p">:</span><span class="k">lua</span> require<span class="s1">'dap'</span><span class="p">.</span><span class="k">continue</span><span class="p">()<</span>CR<span class="p">></span>
nmap <span class="p"><</span><span class="k">silent</span><span class="p">></span> <span class="p"><</span>leader<span class="p">></span>tt <span class="p">:</span><span class="k">lua</span> require<span class="s1">'dap'</span><span class="p">.</span>terminate<span class="p">()<</span>CR<span class="p">></span>
nmap <span class="p"><</span><span class="k">silent</span><span class="p">></span> <span class="p"><</span>leader<span class="p">></span>td <span class="p">:</span><span class="k">lua</span> require<span class="p">(</span><span class="s1">'dap-go'</span><span class="p">).</span>debug_test<span class="p">()<</span>CR<span class="p">></span>
</code></pre></div></div>
My vim setup for golang
2022-10-28T00:00:00+00:00
https://www.tasdikrahman.com/2022/10/28/golang-setup-vim
<p>Ok, not vim, but <a href="https://neovim.io/">nvim</a>.</p>
<h1 id="tldr-what-does-all-this-get-me-in-my-setup">tl;dr what does all this get me in my setup</h1>
<p><a href="https://github.com/tasdikrahman/dotfiles/tree/master/vim">https://github.com/tasdikrahman/dotfiles/tree/master/vim</a></p>
<ul>
<li>jump to definitions</li>
<li>jump to references</li>
<li>jump to symbols</li>
<li>fuzzy file search</li>
<li>code folding</li>
<li>jumping between test and implementation file</li>
<li>testing specific function</li>
<li>real time code linting</li>
</ul>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Cleaning my vim config after some time and ended up removing a bunch of things and starting afresh, a couple of plugins which had been archived but worked all the while were <a href="https://t.co/lmjXtN7HbS">https://t.co/lmjXtN7HbS</a>, switched to <a href="https://t.co/7ZSoKMgwBW">https://t.co/7ZSoKMgwBW</a> as it was the recommended replacement (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1585959045122621440?ref_src=twsrc%5Etfw">October 28, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Cleaned up my <a href="https://www.vim.org/">vim</a> configuration, which I had been trying to set up for golang specifically. I already had <a href="https://github.com/fatih/vim-go">vim-go</a> set up, but wanted to try out <a href="https://github.com/neoclide/coc.nvim">coc.nvim</a> along with <a href="https://github.com/golang/tools/tree/master/gopls">gopls</a>, since it had been around for some time now. There were also a bunch of plugins which had been archived, like <a href="https://github.com/vim-syntastic/syntastic">syntastic</a>.</p>
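<p>As a side note for anyone reproducing this setup: gopls itself is installed via the Go toolchain (this is the standard command from the gopls docs; it drops the binary into <code class="language-plaintext highlighter-rouge">GOBIN</code>/<code class="language-plaintext highlighter-rouge">GOPATH/bin</code>, which needs to be on your PATH for the editor to find it):</p>

```shell
# install the Go language server used by the gopls-based editor setup
go install golang.org/x/tools/gopls@latest
```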
<p>This setup will evolve; there are a couple of things which I will incrementally add over the next few weeks.</p>
<h1 id="impressions-so-far">Impressions so far</h1>
<p>I turned off the LSP settings for ale so that coc.nvim would take over; I haven't removed vim-go completely, as I still use a couple of things from there.</p>
<p>Jumping to definitions after moving over to gopls, with coc.nvim handling jumps to references, definitions and godocs, comes out almost the same as an IDE setup. Being able to fold using <a href="https://github.com/nvim-treesitter/nvim-treesitter">nvim-treesitter</a> adds up really well too, along with <a href="https://github.com/kien/ctrlp.vim">ctrl+p</a>.</p>
<h1 id="changes-and-things-kept-from-existing-vim-go-setup">Changes and things kept from existing vim-go setup</h1>
<p>Turned off vim-go's go auto-imports functionality, as it is now provided by coc.nvim.</p>
<p>The go-build and go-run keybindings which I had are still being used; same for the :GoAlternate keybindings to jump and split panes when going between test and implementation files. The highlight configs are also something I have kept.</p>
<p>I did end up replacing ultisnips, which is one of the plugins presented in vim-go, with coc-go and coc-snippets, as they provide almost the same thing.</p>
<h1 id="what-does-it-lack-as-of-now">What does it lack as of now</h1>
<p>I want to set up delve, as that is what I feel is missing for a fully fledged IDE setup, which is what I was using anyway. Will see how <a href="https://github.com/mfussenegger/nvim-dap">nvim-dap</a> fits in with my current setup, and update this section with what I end up with.</p>
<h1 id="links">Links</h1>
<p>The latest vim config being used by me, and what I did to set it up, can be found at the following link:</p>
<ul>
<li><a href="https://github.com/tasdikrahman/dotfiles/tree/master/vim">https://github.com/tasdikrahman/dotfiles/tree/master/vim</a></li>
</ul>
Working remotely in a geographically distributed team without burning yourself out
2022-10-08T00:00:00+00:00
https://www.tasdikrahman.com/2022/10/08/remote-working-in-geographically-distributed-team-without-burning-yourself-out
<h1 id="context">Context</h1>
<p>Wanted to pick the brains of folks who have worked effectively in such setups, or led such teams, over the years across geographies. Given that I recently took up a fully remote role, with the team I am joining spread across the EU and the US as of now.</p>
<p>This post is just for documentation purposes, for me to look back on how I have fared over the course of this year and the next, as I take on this journey of working in a remote-first company.</p>
<h2 id="takeaways-from-my-discussion-with-a-couple-of-folks">Takeaways from my discussions with a couple of folks</h2>
<ul>
<li>separate machines for work and personal use</li>
<li>respecting your own boundaries: fixed time slots/work hours in which you work during the day.</li>
<li>having 1-1s with folks in your team, and with the people from other teams who work with you.</li>
<li>written-first communication</li>
<li>meeting people in person at the office helps further while working remotely.</li>
<li>talking with your manager to get a sense of how they want to drive the working style of the team, and how they want you to work.</li>
<li>see what you are trying to balance alongside your work hours, see how they align, and make sure you communicate it.</li>
<li>the 1-1 with your manager is your time, since it is for the reportee to convey concerns/ideas to the person they report to.</li>
<li>not checking work devices frequently outside of work hours.</li>
<li>understanding the set of practices already followed by your team, if it has been working remotely for some time, and following them.</li>
</ul>
Scaling cluster upgrades for kubernetes
2022-09-26T00:00:00+00:00
https://www.tasdikrahman.com/2022/09/26/scaling-cluster-upgrades-for-kubernetes
<p>This post is more of a continuation of the talk I gave over at <a href="https://www.meetup.com/kubernetes-openshift-india-meetup/events/288277755/">kubernetes bangalore k8s september 2022 meetup</a>.</p>
<p>Here are the slides, which you can take a peek at to complement this post, if you would like to go through them before reading further.</p>
<script async="" class="speakerdeck-embed" data-id="5c166502e7e0418dad72eb6c65849ca0" data-ratio="1.77725118483412" src="//speakerdeck.com/assets/embed.js"></script>
<h2 id="context">Context</h2>
<p>I will not repeat the content which is already in the slides, and will update this post with the talk link when it gets uploaded. But I do want to delve into how I would attempt to structure the upgrades next. This post is more about the infrastructure upgrade complexities which arise when managing double-digit or more k8s clusters.</p>
<h3 id="what-is-the-bottleneck-at-the-end-of-the-day">What is the bottleneck at the end of the day</h3>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">To just add to the last idea of the main tweet. Having seen and maintained the terraform setup to create/maintain clusters at a couple of places, and this way of doing it doesn't work very well after you start reaching cluster nos in 10's or 100's and more (1/n) <a href="https://t.co/C9cSaE812Q">https://t.co/C9cSaE812Q</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1521420689206648832?ref_src=twsrc%5Etfw">May 3, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>We have state for the cluster, which ends up being managed by whatever we add/delete/create the cluster with; this could be terraform, an eksctl config or CloudFormation, whatever you have chosen.</p>
<p>The way change is introduced in these kinds of setups is, more often than not, a human interacting with them; it could be a terraform plan and apply, for example, with the state of the change then stored in the state file.</p>
<p>Now imagine having to do the same over 10s or 100s of clusters.</p>
<p>I am increasingly leaning towards managing/creating the state of the k8s cluster via some form of control loop, whereas the whole terraform style of state has human interrupts around it. While it can very well be engineered to reduce the interrupts, I feel the terraform style and the control-loop style are fundamentally different ways of operating on state.</p>
<p>This tweet further discusses this idea.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">In general I prefer control-loop style Management of infrastructure, especially at scale. It’s easier to manage drift and be proactive about modifications. One-shot stuff like terraform gets fragile over time and it’s harder to evolve, turns out software is OK too.</p>— Lincoln Stoll (@lstoll) <a href="https://twitter.com/lstoll/status/1521412277152460800?ref_src=twsrc%5Etfw">May 3, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
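<p>Reduced to its bones, the control-loop style is an observe-diff-act cycle that keeps running. This toy sketch (everything is simulated in plain shell; the two functions are hypothetical stand-ins for real cluster API calls) shows the shape, converging once and then becoming a no-op:</p>

```shell
#!/bin/sh
# Toy reconcile loop over one simulated cluster; no real infrastructure here.
desired="1.27"   # declared/desired control-plane version
actual="1.25"    # simulated observed version

get_cluster_version() { echo "$actual"; }                      # observe (stand-in)
upgrade_cluster() { actual="$1"; echo "upgraded to $1"; }      # act (stand-in)

# each tick: observe -> diff against desired -> act only when they differ
for tick in 1 2 3; do
  current=$(get_cluster_version)
  if [ "$current" != "$desired" ]; then
    upgrade_cluster "$desired"
  fi
done
```

<p>This is the same loop shape an in-cluster operator runs, with the observe and act steps talking to real APIs and the desired state coming from a declarative spec; the point is that repeated runs are harmless once actual matches desired.</p>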
<h3 id="ways-to-think-about-efficiently-managing-this-drift-and-human-bottleneck-problem">Ways to think about efficiently managing this drift and human bottleneck problem</h3>
<p>As described in the tweet above, a control loop managing the state of the infrastructure in question makes us think more in the direction in which cluster-api tries to think about managing infrastructure. I have come across <a href="https://github.com/openshift/managed-upgrade-operator">https://github.com/openshift/managed-upgrade-operator</a>, which is the closest OSS implementation I have seen going towards a complete implementation of this style of operation. It is reasonably well documented, and probably how I would lean into solving this problem the next time, for a large-scale cluster sprawl.</p>
<p>There are more examples out there of automating rolling upgrades for node groups, such as <a href="https://github.com/keikoproj/upgrade-manager">https://github.com/keikoproj/upgrade-manager</a>, which takes the control-loop approach for a specific subset of the upgrade problem, in this case the node operations.</p>
<p>There are also <a href="https://github.com/hellofresh/eks-rolling-update">https://github.com/hellofresh/eks-rolling-update</a> and <a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool">https://github.com/deliveryhero/k8s-cluster-upgrade-tool</a> (P.S. written by me), both of which are human-operated. While they do solve a subset of the problem, if not all of it, openshift's implementation is a more human-interrupt-free way to go about the cluster upgrade process, which is what allows cluster upgrades to be executed sustainably at the end of the day, without a lot of toil and effort. Even if we extend this approach to further operations of the upgrade process, like control plane upgrades, upgrades of the managed node groups and of k8s components, we still have to wrangle with the state, modifying it with the tools we have created for managing those specific k8s resources.</p>
<h3 id="closing-thoughts">Closing thoughts</h3>
<p>The control loop strategy would make us question, or at least reconsider, how we create and manage these k8s-specific resources: the control plane for a managed provider or self-hosted cluster, managed or self-hosted node groups, and the k8s components managed by helm. In any case, I foresee teams wanting to consolidate their creation/management process into the control loop automation they already have, in order to standardise over time. More automation means more moving code pieces and with them maintenance creep, but that cost is justifiable if you are measurably reducing the effort of the cluster upgrade process across the x number of clusters you own/manage/run.</p>
<p>Running the whole IaC via atlantis, with terragrunt doing plan/apply in CI, is not an uncommon setup these days. But the churn of cluster upgrades, and the complexity that comes with them, is not the same as managing a couple of managed services via terraform config, where an upgrade or delete is most of the time a one-step plan/apply.</p>
<p>To each their own setup, but in general, reaching a state where the k8s clusters themselves are not pets is a fairly complex problem. If you have already reached that state, your new-cluster creation process is very mature; cleaning clusters up might be the next logical step to think about, after which you can do a blue-green deployment of the new cluster setup. But again, I have not seen this work out successfully so far, so don’t extend the thinking to treating your k8s cluster like a pod. Save the state of the cluster in a way in which you can modify it over time with your tooling.</p>
<p>If you have implemented some other setup for a large sprawl of k8s clusters, I would love to hear more about it.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://github.com/openshift/managed-upgrade-operator/blob/master/docs/design.md">https://github.com/openshift/managed-upgrade-operator/blob/master/docs/design.md</a></li>
<li><a href="https://github.com/openshift/managed-upgrade-operator">https://github.com/openshift/managed-upgrade-operator</a></li>
<li><a href="https://github.com/hellofresh/eks-rolling-update">https://github.com/hellofresh/eks-rolling-update</a></li>
<li><a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool">https://github.com/deliveryhero/k8s-cluster-upgrade-tool</a></li>
</ul>
Musings with client-go of k8s
2022-09-22T00:00:00+00:00
https://www.tasdikrahman.com/2022/09/22/musings-with-client-go-kubernetes
<p>This post is mostly for documentation purposes for myself, about a few things I noticed while using <a href="https://github.com/kubernetes/client-go">client-go</a> for <a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool">deliveryhero/k8s-cluster-upgrade-tool</a>, which uses the out-of-cluster client configuration. A couple of things are specific to that setup, like client init, but others, like testing interactions via client-go, are more generic.</p>
<h2 id="initialization-of-the-config">Initialization of the config</h2>
<p>client-go itself shows a couple of examples of client init <a href="https://github.com/kubernetes/client-go/blob/release-1.21/examples/out-of-cluster-client-configuration/main.go#L44-L62">here</a>; pasting the snippet here for context:</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="k">var</span> <span class="n">kubeconfig</span> <span class="o">*</span><span class="kt">string</span>
<span class="k">if</span> <span class="n">home</span> <span class="o">:=</span> <span class="n">homedir</span><span class="o">.</span><span class="n">HomeDir</span><span class="p">();</span> <span class="n">home</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
<span class="n">kubeconfig</span> <span class="o">=</span> <span class="n">flag</span><span class="o">.</span><span class="n">String</span><span class="p">(</span><span class="s">"kubeconfig"</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">home</span><span class="p">,</span> <span class="s">".kube"</span><span class="p">,</span> <span class="s">"config"</span><span class="p">),</span> <span class="s">"(optional) absolute path to the kubeconfig file"</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">kubeconfig</span> <span class="o">=</span> <span class="n">flag</span><span class="o">.</span><span class="n">String</span><span class="p">(</span><span class="s">"kubeconfig"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="s">"absolute path to the kubeconfig file"</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">flag</span><span class="o">.</span><span class="n">Parse</span><span class="p">()</span>
<span class="c">// use the current context in kubeconfig</span>
<span class="n">config</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">clientcmd</span><span class="o">.</span><span class="n">BuildConfigFromFlags</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="o">*</span><span class="n">kubeconfig</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
<span class="p">}</span>
<span class="c">// create the clientset</span>
<span class="n">clientset</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">kubernetes</span><span class="o">.</span><span class="n">NewForConfig</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="nb">panic</span><span class="p">(</span><span class="n">err</span><span class="o">.</span><span class="n">Error</span><span class="p">())</span>
<span class="p">}</span>
<span class="o">...</span>
</code></pre></div></div>
<p>This would by default initialize the client with the current k8s context you are attached to. For example, in your <code class="language-plaintext highlighter-rouge">~/.kube/config</code> file:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
current-context: foo-cluster
kind: Config
preferences: <span class="o">{}</span>
...
</code></pre></div></div>
<p>with the above, the client will get initialized with the k8s context being <code class="language-plaintext highlighter-rouge">foo-cluster</code>.</p>
<h3 id="initiliazation-happening-by-default-to-the-default-kube-context">Initiliazation happening by default to the default kube context</h3>
<p>What ends up happening underneath is that the method <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L613">BuildConfigFromFlags()</a> goes on to call <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L134">ClientConfig()</a>, which is where the k8s context is deduced, when the call to <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L144">getContext()</a> gets made. The flow defaults to the current context via the call made to <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L462">getContextName()</a>.</p>
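<p>The defaulting behaviour can be boiled down to a few lines. This is a simplified, illustrative re-statement of what getContextName() does (the real method lives on a DirectClientConfig and reads the override struct and the raw kubeconfig; the function below is hypothetical and only mirrors the decision):</p>

```go
package main

import "fmt"

// contextName mirrors, in simplified form, client-go's getContextName():
// an explicit override wins; otherwise the kubeconfig's current-context is used.
func contextName(override, currentContext string) string {
	if override != "" {
		return override
	}
	return currentContext
}

func main() {
	// No override set: falls back to the kubeconfig's current-context.
	fmt.Println(contextName("", "foo-cluster")) // prints "foo-cluster"
	// Override set: it takes precedence over current-context.
	fmt.Println(contextName("bar-cluster", "foo-cluster")) // prints "bar-cluster"
}
```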
<p>This is also where we notice that <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L427">we can add an override</a>, which, if set, selects that particular context instead of the default picked up from the current context already selected in <code class="language-plaintext highlighter-rouge">~/.kube/config</code>. The catch is that this override is not accessible directly from the BuildConfigFromFlags method, which is what we were using initially; checking further, I didn’t find anything else in the package which would help with this (feel free to point it out if I missed something).</p>
<p>So how do we go about overriding this?</p>
<h3 id="initializing-client-go-to-a-user-specified-k8s-context">Initializing client-go to a user specified k8s context</h3>
<p>We would just need to have the <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/overrides.go#L35">Context</a> added to <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/overrides.go#L30">ConfigOverrides</a> when the call to ClientConfig() gets made in <a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L613">BuildConfigFromFlags()</a>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// buildConfigFromFlags returns the config using which the client will be initialized with the k8s context we want to use</span>
<span class="k">func</span> <span class="n">buildConfigFromFlags</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">kubeconfigPath</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">rest</span><span class="o">.</span><span class="n">Config</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">clientcmd</span><span class="o">.</span><span class="n">NewNonInteractiveDeferredLoadingClientConfig</span><span class="p">(</span>
<span class="o">&</span><span class="n">clientcmd</span><span class="o">.</span><span class="n">ClientConfigLoadingRules</span><span class="p">{</span><span class="n">ExplicitPath</span><span class="o">:</span> <span class="n">kubeconfigPath</span><span class="p">},</span>
<span class="o">&</span><span class="n">clientcmd</span><span class="o">.</span><span class="n">ConfigOverrides</span><span class="p">{</span>
<span class="n">CurrentContext</span><span class="o">:</span> <span class="n">context</span><span class="p">,</span>
<span class="p">})</span><span class="o">.</span><span class="n">ClientConfig</span><span class="p">()</span>
<span class="p">}</span>
<span class="c">// KubeClientInit returns back clientSet</span>
<span class="k">func</span> <span class="n">KubeClientInit</span><span class="p">(</span><span class="n">kubeContext</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="o">*</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">Clientset</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">var</span> <span class="n">kubeConfig</span> <span class="o">*</span><span class="kt">string</span>
<span class="k">if</span> <span class="n">home</span> <span class="o">:=</span> <span class="n">homedir</span><span class="o">.</span><span class="n">HomeDir</span><span class="p">();</span> <span class="n">home</span> <span class="o">!=</span> <span class="s">""</span> <span class="p">{</span>
<span class="n">kubeConfig</span> <span class="o">=</span> <span class="n">flag</span><span class="o">.</span><span class="n">String</span><span class="p">(</span><span class="s">"kubeconfig"</span><span class="p">,</span> <span class="n">filepath</span><span class="o">.</span><span class="n">Join</span><span class="p">(</span><span class="n">home</span><span class="p">,</span> <span class="s">".kube"</span><span class="p">,</span> <span class="s">"config"</span><span class="p">),</span> <span class="s">"(optional) absolute path to the kubeconfig file"</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">kubeConfig</span> <span class="o">=</span> <span class="n">flag</span><span class="o">.</span><span class="n">String</span><span class="p">(</span><span class="s">"kubeconfig"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="s">"absolute path to the kubeconfig file"</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">flag</span><span class="o">.</span><span class="n">Parse</span><span class="p">()</span>
<span class="n">config</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">buildConfigFromFlags</span><span class="p">(</span><span class="n">kubeContext</span><span class="p">,</span> <span class="o">*</span><span class="n">kubeConfig</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">&</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">Clientset</span><span class="p">{},</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"error building the config for building the client-set for client-go"</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// create the clientset</span>
<span class="n">clientSet</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">kubernetes</span><span class="o">.</span><span class="n">NewForConfig</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="o">&</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">Clientset</span><span class="p">{},</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"error building the client-set for client-go"</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">clientSet</span><span class="p">,</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The init would then look like:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="n">kubeClient</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">k8s</span><span class="o">.</span><span class="n">KubeClientInit</span><span class="p">(</span><span class="s">"cluster-name"</span><span class="p">)</span>
<span class="o">...</span>
</code></pre></div></div>
<h2 id="testing-client-go-interactions">Testing client-go interactions</h2>
<p>While writing a couple of interactions in <a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool">deliveryhero/k8s-cluster-upgrade-tool</a>, I ended up checking what folks were doing to write specs for interactions with client-go, and the <a href="https://pkg.go.dev/k8s.io/client-go/kubernetes/fake">https://pkg.go.dev/k8s.io/client-go/kubernetes/fake</a> package is already out there, which one can use to test interactions with the client.</p>
<p>My specific cases for testing were quite simple to set up. The first thing you need to take care of is that the method you want to test should take the client as a dependency, which you then inject in the spec. After that, it’s super simple.</p>
<p>For example, in one case I wanted the behaviour where querying for a specific object type with specific attributes returns that object; the method’s handling of that output is then what gets put under test.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="n">client</span> <span class="o">:=</span> <span class="n">fake</span><span class="o">.</span><span class="n">NewSimpleClientset</span><span class="p">(</span><span class="o">&</span><span class="n">appsv1</span><span class="o">.</span><span class="n">DaemonSet</span><span class="p">{</span>
<span class="n">ObjectMeta</span><span class="o">:</span> <span class="n">metav1</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="p">{</span>
<span class="n">Name</span><span class="o">:</span> <span class="s">"aws-node"</span><span class="p">,</span>
<span class="n">Namespace</span><span class="o">:</span> <span class="s">"kube-system"</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">Spec</span><span class="o">:</span> <span class="n">appsv1</span><span class="o">.</span><span class="n">DaemonSetSpec</span><span class="p">{</span>
<span class="n">Template</span><span class="o">:</span> <span class="n">corev1</span><span class="o">.</span><span class="n">PodTemplateSpec</span><span class="p">{</span>
<span class="n">Spec</span><span class="o">:</span> <span class="n">corev1</span><span class="o">.</span><span class="n">PodSpec</span><span class="p">{</span>
<span class="n">Containers</span><span class="o">:</span> <span class="p">[]</span><span class="n">corev1</span><span class="o">.</span><span class="n">Container</span><span class="p">{</span>
<span class="p">{</span>
<span class="n">Image</span><span class="o">:</span> <span class="s">"aws-node:v1.0.0"</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">})</span>
<span class="o">...</span>
</code></pre></div></div>
<p>Now, any interaction we have with this client where we query for this object via the relevant APIs will get this object back.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">myMethod</span><span class="p">(</span><span class="n">k8sClient</span> <span class="n">kubernetes</span><span class="o">.</span><span class="n">Interface</span><span class="p">,</span> <span class="n">myObjects</span><span class="o">...</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// when the subject is under test, this method when passed the fake client with the above initialization would have the aws-node object present</span>
<span class="c">// which would be returned.</span>
<span class="n">daemonSet</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">k8sClient</span><span class="o">.</span><span class="n">AppsV1</span><span class="p">()</span><span class="o">.</span><span class="n">DaemonSets</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
<span class="o">...</span>
<span class="o">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>To give a fuller example, here is a function which returns the container image of the first container for a deployment or a daemonset object:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">GetContainerImageForK8sObject</span><span class="p">(</span><span class="n">k8sClient</span> <span class="n">kubernetes</span><span class="o">.</span><span class="n">Interface</span><span class="p">,</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">k8sObject</span><span class="p">,</span> <span class="n">namespace</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">switch</span> <span class="n">k8sObject</span> <span class="p">{</span>
<span class="k">case</span> <span class="s">"deployment"</span><span class="o">:</span>
<span class="c">// NOTE: Not targeting other api versions for the objects as of now.</span>
<span class="n">deployment</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">k8sClient</span><span class="o">.</span><span class="n">AppsV1</span><span class="p">()</span><span class="o">.</span><span class="n">Deployments</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
<span class="k">if</span> <span class="n">k8sErrors</span><span class="o">.</span><span class="n">IsNotFound</span><span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"Deployment %s in namespace %s not found</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">namespace</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">statusError</span><span class="p">,</span> <span class="n">isStatus</span> <span class="o">:=</span> <span class="n">err</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">k8sErrors</span><span class="o">.</span><span class="n">StatusError</span><span class="p">);</span> <span class="n">isStatus</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"Error getting deployment %s in namespace %s: %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">namespace</span><span class="p">,</span> <span class="n">statusError</span><span class="o">.</span><span class="n">ErrStatus</span><span class="o">.</span><span class="n">Message</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"there was an error while retrieving the container image"</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// NOTE: This assumes there is only one container in the k8s object, which is true for the components for us at moment</span>
<span class="k">return</span> <span class="n">deployment</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Template</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Containers</span><span class="p">[</span><span class="m">0</span><span class="p">]</span><span class="o">.</span><span class="n">Image</span><span class="p">,</span> <span class="no">nil</span>
<span class="k">case</span> <span class="s">"daemonset"</span><span class="o">:</span>
<span class="c">// NOTE: Not targeting other api versions for the objects as of now.</span>
<span class="n">daemonSet</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">k8sClient</span><span class="o">.</span><span class="n">AppsV1</span><span class="p">()</span><span class="o">.</span><span class="n">DaemonSets</span><span class="p">(</span><span class="n">namespace</span><span class="p">)</span><span class="o">.</span><span class="n">Get</span><span class="p">(</span><span class="n">context</span><span class="o">.</span><span class="n">TODO</span><span class="p">(),</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">metav1</span><span class="o">.</span><span class="n">GetOptions</span><span class="p">{})</span>
<span class="k">if</span> <span class="n">k8sErrors</span><span class="o">.</span><span class="n">IsNotFound</span><span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"daemonset %s in namespace %s not found</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">namespace</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">statusError</span><span class="p">,</span> <span class="n">isStatus</span> <span class="o">:=</span> <span class="n">err</span><span class="o">.</span><span class="p">(</span><span class="o">*</span><span class="n">k8sErrors</span><span class="o">.</span><span class="n">StatusError</span><span class="p">);</span> <span class="n">isStatus</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"Error getting daemonset %s in namespace %s: %v</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">namespace</span><span class="p">,</span> <span class="n">statusError</span><span class="o">.</span><span class="n">ErrStatus</span><span class="o">.</span><span class="n">Message</span><span class="p">))</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"there was an error while retrieving the container image"</span><span class="p">)</span>
<span class="p">}</span>
<span class="c">// NOTE: This assumes there is only one container in the k8s object, which is true for the components for us at moment</span>
<span class="k">return</span> <span class="n">daemonSet</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Template</span><span class="o">.</span><span class="n">Spec</span><span class="o">.</span><span class="n">Containers</span><span class="p">[</span><span class="m">0</span><span class="p">]</span><span class="o">.</span><span class="n">Image</span><span class="p">,</span> <span class="no">nil</span>
<span class="k">default</span><span class="o">:</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"please choose between Daemonset or Deployment k8sobject as they are currently supported"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The specific case when the object is of kind <code class="language-plaintext highlighter-rouge">Deployment</code>:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">TestGetContainerImageForK8sObjectWhenK8sObjectIsDeployment</span><span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">testing</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="p">{</span>
<span class="k">type</span> <span class="n">deploymentArgs</span> <span class="k">struct</span> <span class="p">{</span>
<span class="n">k8sObject</span> <span class="kt">string</span>
<span class="n">k8sObjectName</span> <span class="kt">string</span>
<span class="n">kubeContext</span> <span class="kt">string</span>
<span class="n">namespace</span> <span class="kt">string</span>
<span class="n">deployment</span> <span class="o">*</span><span class="n">appsv1</span><span class="o">.</span><span class="n">Deployment</span>
<span class="p">}</span>
<span class="n">tests</span> <span class="o">:=</span> <span class="p">[]</span><span class="k">struct</span> <span class="p">{</span>
<span class="n">name</span> <span class="kt">string</span>
<span class="n">args</span> <span class="n">deploymentArgs</span>
<span class="n">err</span> <span class="kt">error</span>
<span class="n">output</span> <span class="kt">string</span>
<span class="p">}{</span>
<span class="p">{</span>
<span class="n">name</span><span class="o">:</span> <span class="s">"When the object is of type deployment, the objectname is cluster-autoscaler, object exists and returns back the image"</span><span class="p">,</span>
<span class="n">args</span><span class="o">:</span> <span class="n">deploymentArgs</span><span class="p">{</span><span class="n">k8sObject</span><span class="o">:</span> <span class="s">"deployment"</span><span class="p">,</span> <span class="n">k8sObjectName</span><span class="o">:</span> <span class="s">"cluster-autoscaler"</span><span class="p">,</span> <span class="n">kubeContext</span><span class="o">:</span> <span class="s">"test-context"</span><span class="p">,</span> <span class="n">namespace</span><span class="o">:</span> <span class="s">"kube-system"</span><span class="p">,</span>
<span class="n">deployment</span><span class="o">:</span> <span class="o">&</span><span class="n">appsv1</span><span class="o">.</span><span class="n">Deployment</span><span class="p">{</span>
<span class="n">ObjectMeta</span><span class="o">:</span> <span class="n">metav1</span><span class="o">.</span><span class="n">ObjectMeta</span><span class="p">{</span>
<span class="n">Name</span><span class="o">:</span> <span class="s">"cluster-autoscaler"</span><span class="p">,</span>
<span class="n">Namespace</span><span class="o">:</span> <span class="s">"kube-system"</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">Spec</span><span class="o">:</span> <span class="n">appsv1</span><span class="o">.</span><span class="n">DeploymentSpec</span><span class="p">{</span>
<span class="n">Template</span><span class="o">:</span> <span class="n">corev1</span><span class="o">.</span><span class="n">PodTemplateSpec</span><span class="p">{</span>
<span class="n">Spec</span><span class="o">:</span> <span class="n">corev1</span><span class="o">.</span><span class="n">PodSpec</span><span class="p">{</span>
<span class="n">Containers</span><span class="o">:</span> <span class="p">[]</span><span class="n">corev1</span><span class="o">.</span><span class="n">Container</span><span class="p">{</span>
<span class="p">{</span>
<span class="n">Image</span><span class="o">:</span> <span class="s">"cluster-autoscaler:v1.0.0"</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">},</span>
<span class="p">}},</span>
<span class="n">output</span><span class="o">:</span> <span class="s">"cluster-autoscaler:v1.0.0"</span><span class="p">,</span>
<span class="n">err</span><span class="o">:</span> <span class="no">nil</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="n">name</span><span class="o">:</span> <span class="s">"When the object is of type deployment, the objectname is cluster-autoscaler, object doesn't exist, returns back error"</span><span class="p">,</span>
<span class="n">args</span><span class="o">:</span> <span class="n">deploymentArgs</span><span class="p">{</span><span class="n">k8sObject</span><span class="o">:</span> <span class="s">"deployment"</span><span class="p">,</span> <span class="n">k8sObjectName</span><span class="o">:</span> <span class="s">"cluster-autoscaler"</span><span class="p">,</span> <span class="n">kubeContext</span><span class="o">:</span> <span class="s">"test-context"</span><span class="p">,</span> <span class="n">namespace</span><span class="o">:</span> <span class="s">"kube-system"</span><span class="p">,</span>
<span class="n">deployment</span><span class="o">:</span> <span class="o">&</span><span class="n">appsv1</span><span class="o">.</span><span class="n">Deployment</span><span class="p">{}},</span>
<span class="n">output</span><span class="o">:</span> <span class="s">""</span><span class="p">,</span>
<span class="n">err</span><span class="o">:</span> <span class="n">errors</span><span class="o">.</span><span class="n">New</span><span class="p">(</span><span class="s">"Deployment cluster-autoscaler in namespace kube-system not found</span><span class="se">\n</span><span class="s">"</span><span class="p">),</span>
<span class="p">},</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">tt</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">tests</span> <span class="p">{</span>
<span class="n">t</span><span class="o">.</span><span class="n">Run</span><span class="p">(</span><span class="n">tt</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">t</span> <span class="o">*</span><span class="n">testing</span><span class="o">.</span><span class="n">T</span><span class="p">)</span> <span class="p">{</span>
<span class="n">client</span> <span class="o">:=</span> <span class="n">fake</span><span class="o">.</span><span class="n">NewSimpleClientset</span><span class="p">(</span><span class="n">tt</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">deployment</span><span class="p">)</span>
<span class="n">got</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">GetContainerImageForK8sObject</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">tt</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">k8sObjectName</span><span class="p">,</span> <span class="n">tt</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">k8sObject</span><span class="p">,</span> <span class="n">tt</span><span class="o">.</span><span class="n">args</span><span class="o">.</span><span class="n">namespace</span><span class="p">)</span>
<span class="n">assert</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tt</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">got</span><span class="p">)</span>
<span class="n">assert</span><span class="o">.</span><span class="n">Equal</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tt</span><span class="o">.</span><span class="n">err</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">})</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>A full example which I wrote is available, where <a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool/blob/d19b863c9daec01e3f5378ae7a32fb3bf94e86bd/internal/api/k8s/cluster.go#L111"><code class="language-plaintext highlighter-rouge">GetContainerImageForK8sObject()</code></a> is the function under test. It takes the client as a dependency, so in the test case we can pass in the fake client, <a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool/blob/v0.4.0/internal/api/k8s/cluster_test.go#L44-L100">like this</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://github.com/kubernetes/client-go">https://github.com/kubernetes/client-go</a></li>
<li><a href="https://github.com/deliveryhero/k8s-cluster-upgrade-tool">https://github.com/deliveryhero/k8s-cluster-upgrade-tool</a></li>
<li><a href="https://pkg.go.dev/k8s.io/client-go/kubernetes/fake">https://pkg.go.dev/k8s.io/client-go/kubernetes/fake</a></li>
<li><a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L613">BuildConfigFromFlags()</a></li>
<li><a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L134">ClientConfig()</a></li>
<li><a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L462">getContextName()</a></li>
<li><a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/client_config.go#L144">getContext()</a></li>
<li><a href="https://github.com/kubernetes/client-go/blob/1110612dc6e599ae817abbcb762c7c5e87e99a51/tools/clientcmd/overrides.go#L30">ConfigOverrides</a></li>
</ul>
Moving over to www.tasdikrahman.com from tasdikrahman.me
2022-09-17T00:00:00+00:00
https://www.tasdikrahman.com/2022/09/17/moving-over-to-www-tasdikrahman-com-from-tasdikrahman-me
<p>I have been writing on this blog since mid 2015, and not much has changed over the years: the same theme, the same static website generator, the same color scheme.</p>
<h2 id="why-did-i-end-up-moving">Why did I end up moving?</h2>
<p>From what I could gather, the registry operating the .me domain name has the rights to it only till 2023. Not that I am anticipating anything out of the blue, but I felt it just makes sense long term to move to .com.</p>
<p>Plus, the .me domain had been with me since it came free with the GitHub Education pack back in college, which is why tasdikrahman.me existed in the first place; I never ended up moving away from it at the time.</p>
<h2 id="what-did-i-end-up-doing">What did I end up doing</h2>
<h3 id="add-redirect-on-the-old-page">Add redirect on the old page</h3>
<p>Since the site was statically generated, looking around, I found it easiest to add the block</p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt"><meta</span> <span class="na">http-equiv=</span><span class="s">"Refresh"</span> <span class="na">content=</span><span class="s">"0; url='https://www.tasdikrahman.com'"</span> <span class="nt">/></span>
</code></pre></div></div>
<p>to the <code class="language-plaintext highlighter-rouge">head.html</code> of the old website. Since every generated html page would need to carry it, adding it once in <code class="language-plaintext highlighter-rouge">head.html</code> was enough, which I did <a href="https://github.com/tasdikrahman/tasdikrahman.me/commit/f67b2110789427bc9f43b073c66f796512e83737">here</a>.</p>
<p>There are more ways to do the same, one of them being <a href="https://about.txtdirect.org/hosted/">https://about.txtdirect.org/hosted/</a>, which makes use of TXT records and acts as a middle layer to redirect to wherever you want the traffic to go.</p>
<p>I didn’t want to depend on an external service/entity at this point unless really required, hence I didn’t go in that direction, although if you have other suggestions, I would love to hear about them.</p>
<h3 id="adding-redirect-for-tasdikrahmancom-to-redirect-to-wwwtasdikrahmancom">Adding redirect for tasdikrahman.com to redirect to www.tasdikrahman.com</h3>
<p>This was simple and easy: adding a temporary redirect on the domain registrar (Google Domains in this case) was straightforward.</p>
<p>Enabling SSL for the same was again super simple, as GitHub Pages gives the option to enable it straight from the settings of the repo from which the static website is generated.</p>
<h2 id="ending-notes">Ending notes</h2>
<p>I have always loved the simplicity of static website generators. The fact that I haven’t changed anything for so long, and that the write markdown, commit, git push model reduces the barrier to writing, is a huge plus for me.</p>
<p>Github pages exposing the github actions used to build and deploy the pages also makes it easier in case I want to move to another static website generator at some point. Plus this made me realize that the blog has grown over to <a href="https://github.com/tasdikrahman/tasdikrahman.com/actions/runs/3028630771">~100MB</a> as of now.</p>
Stubbing and few other testing tidbits for python
2022-04-15T00:00:00+00:00
https://www.tasdikrahman.com/2022/04/15/stubbing-and-a-few-other-testing-tidbits-for-python
<p>It’s been some time since I wrote some python, and I recently ended up doing a bit of testing for a couple of routines I implemented. This post is mostly me condensing those ideas for python; they are also a carryover from my testing experience in other languages, and show how my ideas on testing have progressed compared to some testing I did in projects a few years back. You can find a couple more posts under <a href="https://www.tasdikrahman.com/blog/tag/testing/">https://www.tasdikrahman.com/blog/tag/testing/</a> where I have delved more into these topics.</p>
<p>As always, I will for sure look at this post at some point and notice improvements as my thoughts on testing progress and mature.</p>
<h2 id="what-about-non-deterministic-tests">What about non-deterministic tests</h2>
<p>For this bit, I picked a piece of code from an old project which I worked on back in college. If you look at the following block:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># picked from here
# https://github.com/tasdikrahman/plino/blob/713ad80524bb4038cb08475b299b02cca3fe7feb/tests/test_plino_app_api_response.py#L37
</span> <span class="k">def</span> <span class="nf">test_api_spam_email_text</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
Unit test to verify the 200 response code and the correct email_class
returned by API when a spam email text is passed
"""</span>
<span class="n">payload</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'email_text'</span><span class="p">:</span> <span class="n">SPAM_TEXT</span>
<span class="p">}</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span><span class="s">'content-type'</span><span class="p">:</span> <span class="s">'application/json'</span><span class="p">}</span>
<span class="n">response</span> <span class="o">=</span> \
<span class="n">requests</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">api_url</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">payload</span><span class="p">),</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">response</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span>
<span class="k">assert</span> <span class="n">r</span><span class="p">[</span><span class="s">'email_class'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'spam'</span>
</code></pre></div></div>
<p>If I look at it now, this is more noise than signal for me: whether the input ends up being classified as what we assert is probabilistic in nature. I would rather keep this as a <a href="https://en.wikipedia.org/wiki/Test_oracle">test oracle</a>, to at least be able to know what is going on for this case.</p>
<p>Another thing to note in this spec is that there is scope creep happening: I am trying to do two things at the same time. One is testing the route <code class="language-plaintext highlighter-rouge">api/v1/classify/</code> and checking the response code for a successful response; the other is testing the domain specific implementation of the classification logic. Mixing these two doesn’t really make sense.</p>
<p>What I would have rather done at this point, is to inject the dependency of the domain specific logic to return the response for which I was testing the api response code for, making this spec deterministic and reducing the noise.</p>
<p>This would have also removed the extra behaviour being tested here, which was unnecessary. The underlying domain logic is the classifier, so the test ends up exercising the 3rd party codebase itself, which is not required. What we would rather want to do is wrap our business logic around the responses which the 3rd party library can give.</p>
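<p>As a minimal sketch of that dependency injection (the handler and names here are hypothetical, not the actual project code): the route handler takes the classifier as a dependency, and the spec swaps in a stub returning a canned label, making the assertion deterministic.</p>

```python
from unittest.mock import Mock

def classify_handler(email_text, classifier):
    # Hypothetical route handler: it delegates the actual
    # classification to the injected dependency.
    label = classifier.classify(email_text)
    return {"status": 200, "email_class": label}

# In the spec, replace the real model with a stub that returns
# a canned label; the route assertion is now deterministic.
stub_classifier = Mock()
stub_classifier.classify.return_value = "spam"

response = classify_handler("WIN A FREE PRIZE!!!", classifier=stub_classifier)
assert response["status"] == 200
assert response["email_class"] == "spam"
```

<p>The spec now only checks the route behaviour; the classifier itself gets its own tests.</p>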
<h2 id="stubbing-responses-in-this-case-stubbing-a-method-which-receives-stdin">Stubbing responses, in this case stubbing a method which receives STDIN</h2>
<p>I will pluck out the irrelevant details from this spec I wrote around <a href="https://docs.python.org/3/library/fileinput.html"><code class="language-plaintext highlighter-rouge">fileinput.input()</code></a>. The context was that a method was using fileinput.input() to read from STDIN, and we needed to test it without actually waiting for STDIN in our test runner.</p>
<p>Here’s a snippet describing how the design of the implementation was changed to prevent the call to fileinput.input() during the test run.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initial design
</span><span class="kn">import</span> <span class="nn">fileinput</span>
<span class="k">class</span> <span class="nc">IO</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">read_stdin</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
read_stdin will read the STDIN and process the data received
:returns: list of sentences read line by line
"""</span>
<span class="n">lines</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cleaned_lines</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">std_in</span> <span class="o">=</span> <span class="n">fileinput</span><span class="p">.</span><span class="nb">input</span><span class="p">()</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">std_in</span><span class="p">:</span>
<span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="c1"># strip newlines
</span> <span class="n">cleaned_lines</span> <span class="o">+=</span> <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">]</span>
<span class="c1"># remove empty strings
</span> <span class="k">return</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">cleaned_lines</span> <span class="k">if</span> <span class="n">x</span><span class="p">.</span><span class="n">strip</span><span class="p">()]</span>
</code></pre></div></div>
<p>Now in our test spec, if we wanted to call this method and assert on the list it returns, we would be stuck waiting on STDIN IO.</p>
<p>Adding an interface on top of this behaviour instead lets us stub out the response which we get from fileinput.input().</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">fileinput</span>
<span class="k">class</span> <span class="nc">IO</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">read_stdin</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""
read_stdin will read the STDIN and process the data received
:returns: list of sentences read line by line
"""</span>
<span class="n">lines</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cleaned_lines</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">std_in</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_stdin</span><span class="p">()</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">std_in</span><span class="p">:</span>
<span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="c1"># strip newlines
</span> <span class="n">cleaned_lines</span> <span class="o">+=</span> <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">]</span>
<span class="c1"># remove empty strings
</span> <span class="k">return</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">cleaned_lines</span> <span class="k">if</span> <span class="n">x</span><span class="p">.</span><span class="n">strip</span><span class="p">()]</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">get_stdin</span><span class="p">():</span>
<span class="k">return</span> <span class="n">fileinput</span><span class="p">.</span><span class="nb">input</span><span class="p">()</span>
</code></pre></div></div>
<p>Now we can stub this method in our spec. Since we already know the behaviour of the stubbed method and what it would give us, we set its return value for the spec, effectively replacing the actual call. The decorator syntax for this is quite clean to read.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">fileinput</span>
<span class="kn">from</span> <span class="nn">unittest</span> <span class="kn">import</span> <span class="n">TestCase</span>
<span class="kn">from</span> <span class="nn">mock</span> <span class="kn">import</span> <span class="n">patch</span>
<span class="c1"># "io" below refers to the project module defining the IO class above, not the stdlib io module</span>
<span class="k">class</span> <span class="nc">TestReadStdin</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
<span class="o">@</span><span class="n">patch</span><span class="p">.</span><span class="nb">object</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">IO</span><span class="p">,</span> <span class="s">"get_stdin"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test_read_stdin</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">stub_get_stdin</span><span class="p">):</span>
<span class="n">content</span> <span class="o">=</span> <span class="s">"""I could not help it, but I began to feel suspicious of this. At any rate, I made up my mind that if it so turned out that we should sleep together, he must undress and get into bed before I did.
Supper over, the company went back to the bar-room, when, knowing not what else to do with myself, I resolved to spend the rest of the evening as a looker on."""</span>
<span class="n">want</span> <span class="o">=</span> <span class="p">[</span>
<span class="s">"I could not help it, but I began to feel suspicious of this. At any rate, I made up my mind that if it so turned out that we should sleep together, he must undress and get into bed before I did."</span><span class="p">,</span>
<span class="s">"Supper over, the company went back to the bar-room, when, knowing not what else to do with myself, I resolved to spend the rest of the evening as a looker on."</span><span class="p">,</span>
<span class="p">]</span>
<span class="c1"># create the temp file
</span> <span class="k">with</span> <span class="n">TestFileContent</span><span class="p">(</span><span class="n">content</span><span class="p">)</span> <span class="k">as</span> <span class="n">valid_file</span><span class="p">:</span>
<span class="n">stub_get_stdin</span><span class="p">.</span><span class="n">return_value</span> <span class="o">=</span> <span class="n">fileinput</span><span class="p">.</span><span class="nb">input</span><span class="p">(</span><span class="n">files</span><span class="o">=</span><span class="n">valid_file</span><span class="p">.</span><span class="n">filename</span><span class="p">)</span>
<span class="n">io_obj</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">IO</span><span class="p">()</span>
<span class="n">got</span> <span class="o">=</span> <span class="n">io_obj</span><span class="p">.</span><span class="n">read_stdin</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">want</span><span class="p">,</span> <span class="n">got</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="testing-for-stdout">Testing for STDOUT</h2>
<p>It’s of value to test the STDOUT produced in certain cases. For example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Printer</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">print_this</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">"{0} - {1}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">key</span><span class="p">),</span> <span class="n">value</span><span class="p">))</span>
</code></pre></div></div>
<p>spec for the same will look like</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">unittest</span> <span class="kn">import</span> <span class="n">TestCase</span>
<span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">unittest.mock</span>
<span class="c1"># import Printer
</span>
<span class="k">class</span> <span class="nc">TestPrinter</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
<span class="o">@</span><span class="n">unittest</span><span class="p">.</span><span class="n">mock</span><span class="p">.</span><span class="n">patch</span><span class="p">(</span><span class="s">"sys.stdout"</span><span class="p">,</span> <span class="n">new_callable</span><span class="o">=</span><span class="n">io</span><span class="p">.</span><span class="n">StringIO</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">assert_stdout</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">test_input_a</span><span class="p">,</span> <span class="n">test_input_b</span><span class="p">,</span> <span class="n">expected_output</span><span class="p">,</span> <span class="n">mock_stdout</span><span class="p">):</span>
<span class="n">print_obj</span> <span class="o">=</span> <span class="n">Printer</span><span class="p">()</span>
<span class="n">print_obj</span><span class="p">.</span><span class="n">print_this</span><span class="p">(</span><span class="n">test_input_a</span><span class="p">,</span> <span class="n">test_input_b</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">mock_stdout</span><span class="p">.</span><span class="n">getvalue</span><span class="p">(),</span> <span class="n">expected_output</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test_print_ranked</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">test_input_a</span> <span class="o">=</span> <span class="p">[</span><span class="s">"foo"</span><span class="p">]</span>
<span class="n">test_input_b</span> <span class="o">=</span> <span class="s">"baz"</span>
<span class="n">want</span> <span class="o">=</span> <span class="s">"foo - baz</span><span class="se">\n</span><span class="s">"</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assert_stdout</span><span class="p">(</span><span class="n">test_input_a</span><span class="p">,</span> <span class="n">test_input_b</span><span class="p">,</span> <span class="n">want</span><span class="p">)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">mock_stdout</code> arg is passed automatically by the <code class="language-plaintext highlighter-rouge">unittest.mock.patch</code> decorator to the <code class="language-plaintext highlighter-rouge">assert_stdout</code> method.</p>
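<p>As a standalone sketch of that behaviour (a toy function, not from the post): the decorator patches sys.stdout with a StringIO and appends the mock as the last positional argument of the decorated function.</p>

```python
import io
from unittest import mock

@mock.patch("sys.stdout", new_callable=io.StringIO)
def capture_hello(mock_stdout):
    # mock_stdout is injected by the decorator; print() now
    # writes into the StringIO instead of the real stdout.
    print("hello")
    return mock_stdout.getvalue()

# print() appends a trailing newline, so it shows up in getvalue().
assert capture_hello() == "hello\n"
```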
<h2 id="io-related-to-files">IO related to files</h2>
<p>Python already provides a great interface for creating temporary files and deleting them after use, compared to golang for example.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Is there a better way than to wrap the os.Remove(filename) with a defer, for a file created via io/ioutil in <a href="https://twitter.com/golang?ref_src=twsrc%5Etfw">@golang</a>?</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1512697739196477440?ref_src=twsrc%5Etfw">April 9, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The following block writes the content to the file up front, while also acting as a context manager for the file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">tempfile</span>

<span class="k">class</span> <span class="nc">TestFileContent</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">content</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="nb">file</span> <span class="o">=</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s">"w"</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="nb">file</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="o">@</span><span class="nb">property</span>
<span class="k">def</span> <span class="nf">filename</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="nb">file</span><span class="p">.</span><span class="n">name</span>
<span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">traceback</span><span class="p">):</span>
<span class="n">os</span><span class="p">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">filename</span><span class="p">)</span>
</code></pre></div></div>
<p>Now this can simply be used like below, where the name of the file can be plucked out via the filename property.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TestReadFiles</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">test_read_files_read_single_file</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">content</span> <span class="o">=</span> <span class="s">"""
Supper over, the company went back to the bar-room, when, knowing not what else to do with myself, I resolved to spend the rest of the evening as a looker on."""</span>
<span class="n">want</span> <span class="o">=</span> <span class="s">"test_output"</span>
<span class="k">with</span> <span class="n">TestFileContent</span><span class="p">(</span><span class="n">content</span><span class="p">)</span> <span class="k">as</span> <span class="n">valid_file</span><span class="p">:</span>
<span class="n">got</span> <span class="o">=</span> <span class="n">test_method</span><span class="p">([</span><span class="n">valid_file</span><span class="p">.</span><span class="n">filename</span><span class="p">])</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">want</span><span class="p">,</span> <span class="n">got</span><span class="p">)</span>
</code></pre></div></div>
<p>If you want to create multiple temporary files in the same context for simpler cleanup:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">with</span> <span class="n">TestFileContent</span><span class="p">(</span><span class="n">content_file_a</span><span class="p">)</span> <span class="k">as</span> <span class="n">file_a</span><span class="p">,</span> <span class="n">TestFileContent</span><span class="p">(</span>
<span class="n">content_file_b</span>
<span class="p">)</span> <span class="k">as</span> <span class="n">file_b</span><span class="p">:</span>
<span class="p">...</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Asserting on file-not-found errors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">test_read_files_file_not_found</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">assertRaises</span><span class="p">(</span><span class="nb">FileNotFoundError</span><span class="p">):</span>
<span class="n">io_obj</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">IO</span><span class="p">()</span>
<span class="n">io_obj</span><span class="p">.</span><span class="n">read_files</span><span class="p">([</span><span class="s">"non-existent-file.txt"</span><span class="p">])</span>
</code></pre></div></div>
<p>If you would like to test the behaviour when the user is not allowed to read the file contents due to a permission error:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">test_read_files_file_read_permission_error</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">content</span> <span class="o">=</span> <span class="s">"""foo"""</span>
<span class="k">with</span> <span class="n">TestFileContent</span><span class="p">(</span><span class="n">content</span><span class="p">)</span> <span class="k">as</span> <span class="n">valid_file</span><span class="p">:</span>
<span class="n">io_obj</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">IO</span><span class="p">()</span>
<span class="c1"># make file not readable to user
</span> <span class="n">os</span><span class="p">.</span><span class="n">chmod</span><span class="p">(</span><span class="n">valid_file</span><span class="p">.</span><span class="n">filename</span><span class="p">,</span> <span class="mo">0o0230</span><span class="p">)</span>
<span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">assertRaises</span><span class="p">(</span><span class="n">PermissionError</span><span class="p">):</span>
<span class="n">io_obj</span><span class="p">.</span><span class="n">read_files</span><span class="p">([</span><span class="n">valid_file</span><span class="p">.</span><span class="n">filename</span><span class="p">])</span>
</code></pre></div></div>
<p>I will most likely add more references for myself here or in another post.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://stackoverflow.com/a/46307456">https://stackoverflow.com/a/46307456</a></li>
<li><a href="https://docs.python.org/3/library/fileinput.html">https://docs.python.org/3/library/fileinput.html</a></li>
<li><a href="https://stackoverflow.com/a/54053967">https://stackoverflow.com/a/54053967</a></li>
</ul>
spf13/cobra not respecting mandatory flags as part of Prerun
2022-02-28T00:00:00+00:00
https://www.tasdikrahman.com/2022/02/28/workaround-for-spf13-cobra-prerun-not-respecting-mandatory-flags
<p>This is just a continuation of the tweet below, expanded into small snippets for context.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Came across this issue where spf13/cobra would not work for mandatory flags set, when PreRun is set for a command while building a cli tool <a href="https://t.co/GmULRFrFfm">https://t.co/GmULRFrFfm</a> (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1498355645392818178?ref_src=twsrc%5Etfw">February 28, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="context">Context</h2>
<p>When a command has both the <code class="language-plaintext highlighter-rouge">PreRun</code> and <code class="language-plaintext highlighter-rouge">Run</code> directives along with a mandatory flag, the flag check you would expect to run before <code class="language-plaintext highlighter-rouge">PreRun</code> is not respected. This post is just a small nudge to save someone from tripping over the same thing, as this behaviour hasn’t been documented on the cobra.dev side so far (unless something has been missed).</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// sample snippet from https://cobra.dev/#prerun-and-postrun-hooks</span>
<span class="o">...</span>
<span class="o">...</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">var</span> <span class="n">rootCmd</span> <span class="o">=</span> <span class="o">&</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">{</span>
<span class="n">Use</span><span class="o">:</span> <span class="s">"root [sub]"</span><span class="p">,</span>
<span class="n">Short</span><span class="o">:</span> <span class="s">"My root command"</span><span class="p">,</span>
<span class="n">PreRun</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">cmd</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">,</span> <span class="n">args</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Logic for PreRun</span>
<span class="p">},</span>
<span class="n">Run</span><span class="o">:</span> <span class="k">func</span><span class="p">(</span><span class="n">cmd</span> <span class="o">*</span><span class="n">cobra</span><span class="o">.</span><span class="n">Command</span><span class="p">,</span> <span class="n">args</span> <span class="p">[]</span><span class="kt">string</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Logic for Run</span>
<span class="p">},</span>
<span class="o">...</span>
<span class="o">...</span>
</code></pre></div></div>
<p>Now in case you have added mandatory flags inside your <code class="language-plaintext highlighter-rouge">init()</code> function</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">...</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
<span class="o">...</span>
<span class="c">// picked from https://cobra.dev/#required-flags</span>
<span class="n">rootCmd</span><span class="o">.</span><span class="n">Flags</span><span class="p">()</span><span class="o">.</span><span class="n">StringVarP</span><span class="p">(</span><span class="o">&</span><span class="n">Region</span><span class="p">,</span> <span class="s">"region"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="s">"AWS region (required)"</span><span class="p">)</span>
<span class="n">rootCmd</span><span class="o">.</span><span class="n">MarkFlagRequired</span><span class="p">(</span><span class="s">"region"</span><span class="p">)</span>
<span class="o">...</span>
<span class="p">}</span>
<span class="o">...</span>
</code></pre></div></div>
<p>Now if you run the CLI without passing the required flag <code class="language-plaintext highlighter-rouge">region</code> (or its shorthand <code class="language-plaintext highlighter-rouge">r</code>), the routines inside <code class="language-plaintext highlighter-rouge">PreRun</code> don’t get run, and the mandatory flag is no longer respected.</p>
<h2 id="workaround-1">Workaround 1</h2>
<p>Move the logic inside <code class="language-plaintext highlighter-rouge">PreRun</code> into <code class="language-plaintext highlighter-rouge">Run</code>; there is no real workaround as of now, and other projects like <a href="https://github.com/keptn/keptn/issues/2729">keptn</a> have done the same. Agreed, this bloats the <code class="language-plaintext highlighter-rouge">Run</code> directive, but until there is a better solution, it’s better than having no validation at all, even if it repeats what the framework has already done for you.</p>
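<p>As a rough illustration of this workaround, here is a minimal sketch. The <code class="language-plaintext highlighter-rouge">validateRegion</code> helper is hypothetical, mirroring the <code class="language-plaintext highlighter-rouge">region</code> flag from the earlier snippet; the cobra wiring is shown in comments so the sketch stays dependency-free:</p>

```go
// Sketch only: validateRegion re-implements the required-flag check
// that cobra skips in this scenario, as plain Go so it stays testable.
package main

import "fmt"

func validateRegion(region string) error {
	if region == "" {
		return fmt.Errorf(`required flag "region" not set`)
	}
	return nil
}

func main() {
	// Inside the cobra command, the Run/RunE directive would call the
	// helper first, roughly:
	//
	//   RunE: func(cmd *cobra.Command, args []string) error {
	//       if err := validateRegion(Region); err != nil {
	//           return err // fail early, as MarkFlagRequired would have
	//       }
	//       // ...the logic moved out of PreRun, then the Run logic...
	//       return nil
	//   },
	fmt.Println(validateRegion("") != nil) // true: empty region rejected
}
```

<p>Keeping the check in a plain function also means the validation can be unit tested without spinning up the command at all.</p>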
<h2 id="workaround-2">Workaround 2</h2>
<p>You can use <code class="language-plaintext highlighter-rouge">cobra.ExactArgs()</code> with positional arguments instead of flags. This looks a bit confusing for the person using the CLI, though: with no named arguments, users have to remember which positional argument goes where and what each position stands for, leaving them with a CLI that requires a mental model to operate.</p>
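<p>A hedged sketch of the positional-argument variant (the command and argument names are illustrative, and the cobra wiring is again shown in comments):</p>

```go
// Sketch of the positional-argument variant; the core argument handling
// is kept as a plain function so it runs without the cobra dependency.
package main

import "fmt"

// regionFromArgs mimics what the Run directive would do with the
// positional arguments once cobra.ExactArgs(1) has let them through:
// args[0] is silently assumed to be the region.
func regionFromArgs(args []string) (string, error) {
	if len(args) != 1 {
		return "", fmt.Errorf("accepts 1 arg(s), received %d", len(args))
	}
	return args[0], nil
}

func main() {
	// With cobra, the equivalent wiring would be roughly:
	//
	//   var rootCmd = &cobra.Command{
	//       Use:  "root [region]",
	//       Args: cobra.ExactArgs(1),
	//       Run: func(cmd *cobra.Command, args []string) {
	//           region := args[0] // the position carries the meaning
	//       },
	//   }
	region, _ := regionFromArgs([]string{"ap-south-1"})
	fmt.Println(region)
}
```

<p>Note how nothing in the interface tells the user that the first argument is a region; that knowledge lives only in the docs and in their head, which is exactly the drawback described above.</p>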
<h2 id="comparison-of-the-two-workarounds-for-the-cli">Comparison of the two workarounds for the cli</h2>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// <span class="k">case</span> of using the mandatory flag without any PreRun and having the validation inside the Run directive <span class="k">for </span>the user to pass the region
<span class="nv">$ </span>my_cli foo-command <span class="nt">-r</span><span class="o">=</span>region-foo
// using the positional argument
<span class="nv">$ </span>my_cli foo-command region-foo
</code></pre></div></div>
<h2 id="ending-notes">Ending notes</h2>
<p>Given these two alternatives, I would prefer the first workaround, simply because the CLI user doesn’t have to remember what the positional arguments stand for, and we can leverage the framework to communicate which flag stands for which attribute and what needs to be passed.</p>
<p>I didn’t find a fix; the ticket tracking <a href="https://github.com/spf13/cobra/issues/655">this</a> is still open. I ended up <a href="https://github.com/spf13/cobra/issues/655#issuecomment-1054509187">enquiring</a> whether this is by design, in case I have missed something, and will check back and update here if there is news.</p>
<p>There was an old <a href="https://github.com/spf13/cobra/issues/206">issue</a> which talks about the introduction of the mandatory flags, but doesn’t go into the <code class="language-plaintext highlighter-rouge">PreRun</code> directives.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://github.com/keptn/keptn/issues/2729">https://github.com/keptn/keptn/issues/2729</a></li>
<li><a href="https://github.com/spf13/cobra/issues/655">https://github.com/spf13/cobra/issues/655</a></li>
<li><a href="https://github.com/spf13/cobra/issues/655#issuecomment-1054509187">https://github.com/spf13/cobra/issues/655#issuecomment-1054509187</a></li>
</ul>
<h2 id="credits">Credits</h2>
<p>Image credits for the post: <a href="https://cobra.dev/">https://cobra.dev/</a></p>
Evolution of support for infrastructure teams
2022-02-21T00:00:00+00:00
https://www.tasdikrahman.com/2022/02/21/evolution-of-support-for-infrastructure-teams
<h2 id="context">Context</h2>
<p>Over time, I have been part of infrastructure teams of different sizes, in organisations of different sizes. This post is a conglomeration of the ideas I have picked up, the things which have worked out (and those which haven’t), and the mental models developed while being part of such infrastructure teams and growing with them.</p>
<p>Another thing I have noticed over time is that the amount of ad-hoc work tends to be slightly higher than in other engineering teams, hence the difference in how it is structured and handled, as we will see over the course of this post.</p>
<p>This post condenses the ideas presented in this thread.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">A little late to the party but here are a few ways, infrastructure teams could function during the organization growth phases and different engineering team sizes over time (1/n) 🧵 <a href="https://t.co/wUC7ujyaU3">https://t.co/wUC7ujyaU3</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1495881281745440772?ref_src=twsrc%5Etfw">February 21, 2022</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="before-we-start">Before we start</h2>
<p>I am deliberately not attaching engineer counts to each stage, because you will start feeling the toil and the influx of support requests eventually regardless, leading you to adopt and try out different models of support over time; this also depends on factors other than team size.</p>
<h2 id="initial-days">Initial days</h2>
<p>For starters, a new infrastructure team will notice ad-hoc requests arriving over DMs, Slack channels, maybe even phone calls, and whatever other mediums of communication you use. This is workable only up to the point where the infrastructure team members get tired of responding to requests on these mediums, which is bound to happen, as the distribution of requests is random at best.</p>
<p>From the engineering teams’ perspective this is normal, since no clear format or expectation has been set, which is what the infrastructure team needs to do first (and may already be doing in some ways). In practice, the person receiving a request might raise it with the rest of the team at standup and discuss it, or, if the task was small enough, simply implement it before the next day.</p>
<p>The first thing we can do in such a case is ask folks to raise support requests on a dedicated Slack channel and mark each one as a feature, bug fix, emergency request, or query. If there’s a ticketing system or kanban board for the infra team, ask them to create the ticket/card there, and let someone from the infra team (possibly the manager or lead) route such requests to team members depending on their priority and availability.</p>
<p>This solves a couple of things: the prioritisation problem, which gets offloaded to a single person rather than the whole team, and the discoverability problem of knowing which support requests the team received and solved.</p>
<p>One thing to note here is that there is most likely no incentive for an infra team member to solve tickets, hence I have only seen this model work when there is a gentle nudge each time from someone keeping track of the tickets’ priorities and statuses.</p>
<h2 id="as-the-team-size-grows">As the team size grows</h2>
<p>The team grows and gains members, but the toil may also have increased over time, which could be a symptom of a couple of things. If you have been tracking the support requests you receive, by this time you will have noticed which kinds of requests come in most frequently and how long they take. The idea is to see what requests your team gets and how much time is being spent on them; with that picture, you can focus on reducing the influx of requests in one of two ways.</p>
<p>The first is documentation, to fix knowledge gaps for the team, helping them gain context and close off queries quickly. This also helps distribute knowledge inside the team. A general rule of thumb: if you do it twice, write it down.</p>
<p>The second is automation for repeated tasks that have fairly predictable branch-offs from the happy path, so the automation can handle those cases when it deviates from the happy path.</p>
<p>This can be the next step after documentation, supplementing that knowledge by putting the operational steps into a tool. But remember that prematurely written automation is no better than having no automation at all.</p>
<p>All code is a liability, so be careful what you write the automation in; it should be something that at least a couple of people in the team can maintain, to de-risk the automation becoming a fragile codebase the team is afraid to touch.</p>
<h2 id="further-growth-and-division-into-sub-teams">Further growth and division into sub-teams</h2>
<p>After a point of growth, you will notice the larger infrastructure team branching off into separate sub-teams for the specific interest groups the team works on. For example, observability, release engineering, infrastructure, security, developer experience, etc. start getting formed as the need for more people to focus on specific problems arises from the growth of the engineering team.</p>
<p>At this point, the original problem of toil still remains, and will remain, for each sub-team to handle; the only difference is that each ticket now has a specific team to be routed to.</p>
<p>It can also become a point of contention which parts are owned by which sub-team, so the ownership of specific components should preferably be pre-decided for each sub-team, beyond the obvious ownership boundaries. This helps resolve the confusion over who owns a certain ticket, and the back and forth that can cause an SLA to be missed.</p>
<p>On SLAs: the teams need to decide what a good turnaround time is for a requested ticket, based on the priority at which it was raised.</p>
<p>There will be cases when requests raised as priority P0/P1 are really P2/P3. A good rule of thumb is to treat anything affecting customers, bleeding money, or involving a security incident as a P0 needing immediate attention; the other priority groups can be discussed and decided upon. With tracking of requests and routing of tickets at least addressed in this first pass, let’s look at SLAs more closely.</p>
<p>You might notice, after a certain engineering org size, with each sub-team having its own section in the quarterly planning doc, that you are falling short on the SLAs for support-ticket turnaround time.</p>
<p>It could be that automation is missing, docs are missing, or simply that the volume of support requests has grown too big for the individual sub-teams to have them routed solely via the TL/manager.</p>
<p>The TL/manager would initially hand a request to someone based on their familiarity with who understands which stack, but ideally the distribution within the sub-team should be uniform, with everyone receiving these requests equally.</p>
<p>If the request workload keeps increasing, it might be time to introduce support rotations inside the sub-teams: dedicated people from each sub-team look at and work on issues for the sprint’s duration, then rotate away afterwards.</p>
<p>This helps in a couple of ways. For starters, it helps prevent burnout, as support becomes democratized within the team. Since not everyone is fielding support tickets by default, the others can focus on strategic problems. The idea is that, beyond the toil work, there should be ample time left for strategic work, which in turn reduces toil, improves the developer experience, and helps create good abstractions.</p>
<h2 id="further-improving-productivity-inside-the-sub-teams">Further improving productivity inside the sub-teams</h2>
<p>You could go further, from having dedicated folks on support within each sub-team to having dedicated folks handling infrastructure engineering support across the sub-teams. This will not show a positive impact immediately; as with every new member of a team, it takes some time, but having an L1 support structure pays back over time for the overall team.</p>
<p>This L1 support team can be the first line of defense for anything related to infrastructure queries across all the sub-teams.</p>
<p>Over time, with pairing and knowledge transfers, the idea is for the L1 support team to handle a good share of queries and solutions with little to no help from the sub-teams. Eventually these L1 folks will have enough context to start working inside specific sub-teams and be absorbed there if the teams would like, creating a funnel of engineers joining via L1 and then becoming sub-team members, and so on.</p>
<p>At this stage, your infrastructure team could well be called mature, keeping support-request SLAs at a favorable number while also working on strategic things.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>As always, the main thing to note here is the reduction of toil over time and keeping track of how much time is spent on tickets/support requests; every other piece of solutioning follows from this quest to reduce toil, as a side outcome of it.</p>
<p>I would love to hear how you folks handle support requests and ad-hoc work in your org.</p>
<h2 id="references">References</h2>
<p>Cover image by Harald Süpfle, CC BY-SA 2.5 <a href="https://creativecommons.org/licenses/by-sa/2.5">https://creativecommons.org/licenses/by-sa/2.5</a>, via Wikimedia Commons, <a href="https://commons.wikimedia.org/wiki/File:Webervogelnst_Auoblodge.JPG">https://commons.wikimedia.org/wiki/File:Webervogelnst_Auoblodge.JPG</a></p>
Building the VM creation API for the org
2021-06-20T00:00:00+00:00
https://www.tasdikrahman.com/2021/06/20/building-the-vm-creation-api-for-the-org
<p>Over the years, there have been a lot of changes in the way people create their virtual machines in their cloud environments.</p>
<p>At a very primitive level, one would simply go about doing it via the cloud provider console. A couple of clicks and lo and behold.</p>
<p>At a larger scale, people end up using automation to create these Virtual machines in the way they want them to be, given the manual nature of work would just start becoming a bottleneck in scaling quickly when required otherwise.</p>
<p>So how did we go about it?</p>
<h2 id="existing-solution">Existing solution</h2>
<p>I have spoken a bit about our current automation using <a href="https://github.com/gojek/proctor">proctor</a>,</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/mE1JZKMhnNs" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>I will not delve more into that, but in short, we already had a way for developers to request a VM via a CLI interface.</p>
<p>As time went by, the automations to create different variations of virtual machines based on language stack kept getting added, helping developers create Virtual machines on demand for their use cases.</p>
<p>This post is about how we ended up making the building block for a lot of automation around Virtual machine creation orchestration inside the org, as part of the platform team.</p>
<h2 id="what-is-the-need-for-a-vm-creation-api-in-the-first-place">What is the need for a VM creation API in the first place?</h2>
<p>Fair enough: everything is working fine and dandy, people are able to create VMs on demand whenever and however they like via proctor, and everyone is happy so far. What’s the issue then?</p>
<p>Then come a couple of deprecation plans for the virtual machine infrastructure, where we had to deprecate whole fleets of VMs running applications on Ubuntu 16.04, which reached end of life in April this year.</p>
<p>It started with adding 20.04 support to all the automation present in the org, to be able to create 20.04 VMs via proctor. That meant adding support to all the proctor automation scripts we had, testing them out, and so on and so forth.</p>
<p>The part to note here is that the automation’s interface for end users was still the same command line tool for creating VMs.</p>
<h2 id="what-we-ended-up-doing">What we ended up doing</h2>
<p>We already had a service written to create kubernetes clusters on demand. Roughly, each kubernetes cluster would simply be a resource for this service, which stored all the cluster-related metadata.</p>
<p>Given the programmatic way of creating kubernetes clusters (I will share more about the challenges of creating and managing k8s clusters this way in another post), we wanted to add another resource on top of this service’s modelling. The service would ultimately just call proctor, sending it POST calls to create VMs, and poll a callback link to check whether the requested resources eventually got created.</p>
<p>One difference from proctor is that here one could ask for multiple VMs to be created for the application the service was creating VMs for.</p>
<p>We also ended up changing the way resources were named, improving on our existing naming scheme (more on this later).</p>
<p>There’s now a central place storing which applications run which version of VMs, how many of them, of what language stack and OS flavor, along with other additional metadata.</p>
<p>We also ended up adding a client for it, which people could use to create VMs of their choice on demand, with all the necessary helper methods (polling, checking resource status, etc.) included.</p>
<h2 id="user-flow">User Flow</h2>
<center><img src="/content/images/2021/06/24BDDA89-7EF4-4E21-B244-919D364A386D.jpeg" /></center>
<p>1) The user sends a <code class="language-plaintext highlighter-rouge">POST</code> call to orchestration service to the VM creation route, sending all the necessary information in the <code class="language-plaintext highlighter-rouge">POST</code> body to create the VM/VM’s for their application. Information like, which application, the environment, the type of VM to pick, the OS version etc</p>
<p>2) The orchestration service validates the request body and the presence of the fields required for creating the VM, e.g. environment, language stack, etc. It also sends a response back to the client including information the user would find useful, such as the VM names. Given the async nature of the flow, we store the state of the VM creation for each resource; the client can query the resource, and the response body includes the state details of the resource.</p>
<p>3) Orchestration service constructs the information required to send to proctor daemon to call the right proctor job, with the correct request body, sending multiple requests to proctor in case multiple VM’s are requested for creation.</p>
<p>4) Proctor daemon runs the automation to create the VM in a background job to provision and configure the VM.</p>
<p>5) Orchestration service polls proctor daemon for the job which was scheduled to create the VM, with a default backoff present.</p>
<p>6 & 7) The proctor job completes; if the poll gets a response within the appropriate time, the status of the job is saved in the orchestration service, which also decides the final status of the resource in the orchestration service DB.</p>
<p>The user can, at this point, make a <code class="language-plaintext highlighter-rouge">GET</code> call on the resource URI to see the updated resource status and check the current state of the state machine.</p>
<h2 id="how-has-it-been-used-so-far">How has it been used so far</h2>
<p>The very first thing this API got used for was automatically creating our version of managed instances, by the productivity team inside the engineering platform. It’s a lovely abstraction they came up with: simply put, people ask for a new application to be created from our developer portal, choosing which tech stack to use, which environment, and whether they want a private or a public reverse proxy. They then get their VMs automatically created for the application, along with the load balancers, and with the DNS records mapped to the respective load balancers.</p>
<p>The VMs would get lazily created, i.e. the first time any deployment happened on them; from then on, the records of the infrastructure which got created would map to the application for its environments.</p>
<p>Compare this to developers having to create all the virtual machines via proctor manually and do the above by themselves; it led to a lot of time saved when creating a new application for a team.</p>
<p>Another problem statement which got easier thanks to this VM creation API was the process of deprecating virtual machines of certain language stacks/OS versions.</p>
<p>Recreating VM fleets became easier and programmatic, and people could now build their own tooling around how they wanted to create VMs, which we first ended up doing for deprecating the VM fleets of the managed services we talked about earlier.</p>
<h2 id="why-didnt-we-add-it-in-proctor-itself">Why didn’t we add it in proctor itself?</h2>
<p>A couple of reasons. We already had a service used to create k8s clusters on demand, and we wanted to add another resource to it for creating virtual machines, making it the central tool to orchestrate resource creation, where a resource could be anything from a virtual machine to a k8s cluster.</p>
<p>We also felt it would simply be a cleaner interface, with proctor fronted by this orchestration service for resource creation, making the service the entrypoint for developers and the like for any automation around resource creation needs, with helper methods added to the client for ease of use.</p>
<p>Could we have built everything back into proctor? Definitely. Would it work the same way as it does now? Mostly, yes.</p>
<p>But we took the call to move the orchestration of every resource going forward to this new service, be it orchestration of new VMs, new k8s clusters, etc. This service would have the modelling to handle the creation, along with the inventory we would get out of the box (which can be solved in other ways, proctor included).</p>
<h2 id="moving-forward">Moving forward</h2>
<p>The end result was a lot of developer time saved which would otherwise have been spent creating VMs and cycling VM fleets, now done with minimal developer intervention.</p>
<p>That, I feel, is the original and broader goal of automation too: to move more and more things out of human hands so that people can move on to other things. And as always, automation is a moving target; you sometimes build abstractions on top of abstractions to ease things for everyone, without compromising on the usual things like the quality of what is delivered and ease of use.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Creating this along with <a href="https://twitter.com/vidit_m100">Vidit</a> and <a href="https://twitter.com/kartik7153">Kartik</a> was such a great experience. Thanks for the learnings.</p>
Handling language stack deprecations: Part 2: Container infrastructure
2021-06-15T00:00:00+00:00
https://www.tasdikrahman.com/2021/06/15/handling-language-stack-deprecations-part-2-container-infrastructure
<p>Given the number of language stacks which different product teams end up using inside the org, variations come in, in the form of different versions being used, or different versions of dependent libraries coming in. This combination quickly adds up to a whole set of container image variations for a particular language, crossed with operating system versions.</p>
<h2 id="what-is-the-problem-then">What is the problem then?</h2>
<p>The problem of tracking what is being used by different product teams and their services arises when we want to know what infrastructure combination each service is running on. Tracking this piece of information is paramount for the central infrastructure team, for a couple of reasons.</p>
<p>If you don’t know what people run, you won’t know what to deprecate. Or, in other words, you won’t know what you support for the product teams.</p>
<h3 id="upgradesdeprecations">Upgrades/Deprecations</h3>
<p>For example, <a href="https://ubuntu.com/about/release-cycle">Ubuntu 16.04 LTS</a> recently reached EOL on April 30th, 2021, at which point it stopped receiving security updates. Creating an inventory of which services are using deprecated OS versions would help in prioritizing and notifying the product teams about the impending upgrade.</p>
<p>This applies equally well to the language stack deprecations needing to be done.</p>
<h3 id="library-additions">Library additions</h3>
<p>Catering to the different library versions and variations required by each language+OS stack variation also crops up. Deprecating or upgrading existing versions would need this tracking as a pre-requisite.</p>
<h2 id="what-did-we-end-up-doing">What did we end up doing</h2>
<p>Given this would need to be solved for all applications inside the org, we started by tracking the container images used by the services onboarded to our internal service registry, planning to handle the services not onboarded to the service registry later.</p>
<center><img src="/content/images/2021/06/vesemir-deployment-flow.jpg" /></center>
<p>Given our deployment flow, the cli which we give out to the developers is the first touchpoint in the CI/CD pipeline. The container images to be used for packaging and deploying the application are specified in the CI/CD pipeline configuration. We wanted to parse this already-present information from one of the scripts used in the deploy/build/test stages.</p>
<p>Given the simplicity of the solution, it was simplest to just parse the specific string line for the container image used for packaging the application and the image being used as the base image.</p>
<p>The deploy scripts are just a wrapper which makes use of the cli to interact with the central service registry to deploy the application, along with added logic to handle the response.</p>
<p>Next was to add a simple route on the service registry which would receive an HTTP request carrying this information every time a deployment happened via the CI/CD UI, leading to this information being updated for each application.</p>
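<p>As a rough sketch of how the parsing and reporting could fit together (the env var names, route, and payload shape here are illustrative assumptions, not our actual conventions):</p>

```python
import re

# Hypothetical pattern for the image lines in the CI/CD pipeline configuration;
# BUILD_IMAGE/BASE_IMAGE are made-up variable names for illustration.
IMAGE_LINE = re.compile(r"^\s*(?P<key>BUILD_IMAGE|BASE_IMAGE)\s*=\s*(?P<image>\S+)\s*$")

def extract_images(ci_config_text):
    """Parse the packaging image and the base image out of the pipeline config."""
    images = {}
    for line in ci_config_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            images[match.group("key")] = match.group("image")
    return images

def report_stack(registry_url, app_name, images, http_post):
    """Report the parsed images to the service registry route on each deploy.
    `http_post` is injected so the deploy-script wrapper stays testable; in
    practice this would be a plain HTTP POST."""
    return http_post(f"{registry_url}/applications/{app_name}/stack",
                     {"application": app_name, "images": images})
```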
<h2 id="current-state">Current state</h2>
<p>As of now, this information-fetching script is run for each and every k8s deployment which we run via our deployment automation.</p>
<p>This has in turn helped us create the inventory of all the language stacks which are actually being used by the
applications which are onboarded to our internal service registry and having deployments orchestrated via the same to our k8s clusters.</p>
<p>For deployments happening via other automation tools (helm, for example), we currently don’t have support for fetching the language stack being used.</p>
<h2 id="challenges-faced">Challenges faced</h2>
<ul>
<li>There were places where we couldn’t gauge the language stack version from the env vars we parse for an application, as there were multiple versions of that env var declared. We didn’t want to handle such cases initially, as we wanted to cover the ideal cases first, leaving these as anomalies.</li>
<li>Some applications hadn’t been deployed in months or years. Asking the devs to deploy was one solution, but there were cases where the services had no clear ownership, and we didn’t want to deploy an application just to fetch this information. We ended up creating automation which would fetch such applications from the source repository, run the parse logic, and then make the HTTP call to the service registry to store the information.</li>
</ul>
<p>The solution was not perfect, but this worked for us to build the initial language stack inventory for containers.</p>
<p>The next thing we did was to plot this information on grafana, to make it easily accessible.</p>
<h2 id="future">Future</h2>
<p>We would then want to onboard the applications which are not present in the service registry, to track their stack versions too. That would give us more coverage on what we are currently supporting for the org, which is the next step.</p>
<h2 id="takeaways">Takeaways</h2>
<ul>
<li>If we don’t know what we are running, we will not know what we are supporting</li>
<li>This becomes paramount whenever we are trying to resolve security issues, patch software, or support existing software</li>
<li>Not having this information means getting surprised by the issues which come along with stack versions you are not aware of.</li>
<li>Piggybacking on existing workflows helped expedite the whole process of fetching the language stack information for an application</li>
</ul>
<p>If you liked what you read here, I have written a bit more about how we did the same for virtual machines <a href="https://www.tasdikrahman.com/2021/02/02/handling-language-stack-deprecations-part-1-virtual-machine-infrastructure/">here</a></p>
Revamping Vesemir: our virtual machine deployment service
2021-06-12T00:00:00+00:00
https://www.tasdikrahman.com/2021/06/12/revamping-our-virtual-machine-deployment-tooling
<p>This is a continuation of the <a href="https://www.tasdikrahman.com/2021/06/10/vesemir-our-virtual-machine-deployment-service/">post</a> which details the workings of vesemir and how it goes about introducing a changeset. Give it a read before continuing, to gather more context on the what and the why.</p>
<p>This post, meanwhile, will focus on how we went about revamping vesemir to increase its reliability and maintainability, and to modernize it.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Continuing the thread around how we did the same for Vesemir. (1/n) <a href="https://t.co/Xol0uRraJv">https://t.co/Xol0uRraJv</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1403746484630159360?ref_src=twsrc%5Etfw">June 12, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h3 id="previous-state-of-affairs">Previous state of affairs</h3>
<p>Trite as it may be, I feel everything in software improves over iterations, and the same is true of vesemir here too. We had a functional piece of software which did things as expected and had been running for quite some time. After the initial set of features, the project remained in a bug-fix-only state for a while.</p>
<p>When our team took ownership of the deployment ecosystem a year back, we started looking for ways to better handle the support requests for deployment issues and to reduce the firefighting, as we were slowly gaining context bit by bit. The deployments for containers didn’t have many issues, that path being stable enough to not let such issues creep in.</p>
<p>For starters, there were a couple of boxes (VM’s) in which vesemir was deployed, which would receive requests for deploying applications. We also noticed some teams independently cloning the initial vesemir VM’s, adding their own set of changes, and then using their copy of vesemir to deploy their applications.</p>
<p>Centralized logging was something we needed to add. Otherwise, when a deployment failed, devs would have to look inside the specific vesemir VM boxes to see what was going on (a very huge source of time sink for devs in our team and the Level 1 support folks).</p>
<p>Monitoring was missing on the boxes: if and when CPU would spike (and continue to stay that way), we would not know about it, and the deployments being processed would slow down and take more time. This would cause deployment failures, a bit of which is explained in this <a href="https://www.tasdikrahman.com/2021/06/06/bug-which-would-cause-some-deployments-to-get-triggered-again-and-again/">post</a>, as the workers picking up deployment requests had a finite time to process each request.</p>
<p>Integration testing of changes before deploying them to the production set of vesemir VM’s was very manual and haphazard, prone to human error. It involved removing one of the vesemir boxes from behind the HAproxy, deploying the changes to this box, testing it all along, and if everything went well, doing the same for the other boxes. Again, a huge time sink.</p>
<p>The way we deployed vesemir to the vesemir VM’s was very manual too: someone would have to remove a box from the HAproxy manually, pull the changes to the box, and then restart the service, supervised by systemd. The whole process of deploying would take around ~10-15 mins for the whole set of vesemir boxes.</p>
<p>Before starting with anything else, we wanted to solve these issues for the basic sanity of vesemir.</p>
<h3 id="what-did-we-fix-first">What did we fix first</h3>
<p>As with everything, prioritization is crucial: breaking down the tasks into tactical and strategic fixes, so as to fix what was burning first.</p>
<p>Fixes which would immediately give us value and help us reduce the firefighting were prioritised first, to give us breathing space.</p>
<p>Fixes and features which would deliver long-term value over a larger segment of users (re-architecting the setup, providing better reliability, refactoring to help us maintain the codebase better) were kept for the end.</p>
<h3 id="baseline-fixes">Baseline fixes</h3>
<p>To solve the onboarding issue for other devs in the team and to reduce the backlog of support requests our team received for deployment failures, we created a playbook of the different failure modes of vesemir and their solutions. We gave this to our platform support team (more on this model later), who would then look at the tickets and solve the customers’ deployment failure issues. If there was something they couldn’t help with, we would come in as the second line of defence. This helped us immensely in freeing up time to solve the initial burning issues we were seeing with vesemir.</p>
<p>Our team picked up adding logging support to vesemir, with logs pushed to a centralized logging platform. This allowed us, the support team, as well as the developers of the product teams to look into what really caused an issue, giving us that initial bit of information about what went wrong for further debugging.</p>
<p>Along with it, we also consolidated the vesemir VM’s and put them behind an HAproxy box to front the service and load-balance the incoming requests, which was missing earlier.</p>
<p>We also picked up adding monitoring and alerting for the vesemir VM’s, giving us visibility into what was happening. This also allowed us to see the CPU going all the way up to 100%, without any significant increase in deployment requests being received.</p>
<p><a href="https://twitter.com/viditganpi/">Vidit</a> has written a <a href="https://www.gojek.io/blog/how-we-reduced-skyrocketing-cpu-usage">great article</a> on solving this, where we ended up going from python2 to python3, which fixed the CPU spiking (and staying that way) altogether, helping us reduce the number of VM’s we had been using.</p>
<p>We replicated the production setup as an integration setup, giving us an exact replica of production. The integration environment of the deployment orchestrator (also our service registry) was updated to point to this new integration environment of vesemir. This let us test end to end by sending deploy requests from the cli which we give to the developers, allowing us to catch errors before they landed on production.</p>
<p>There were also a couple of major bug fixes, which helped decrease the defect rate and increase the reliability of deployments going through.</p>
<p>Deleting dead code, adding missing tests, a linter, and integration testing scenarios, along with refactoring the codebase, helped us maintain the codebase better and helped new devs get onboarded to it much faster than before.</p>
<p>We also started persisting deployment metadata in the vesemir database: which application is getting deployed, the number of VM’s being deployed to, the time the deployment takes, the chef tags used to search for the application, etc., helping us track metrics for deployments.</p>
<h3 id="deploying-vesemir-by-the-service-which-is-used-for-deploying-to-kubernetes">Deploying vesemir by the service which is used for deploying to kubernetes</h3>
<p>The next big chunk of work we wanted to tackle was automating the manual deployment process of vesemir itself, which was not only time consuming for a developer in our team but also error prone whenever someone did it.</p>
<p>This is where normandy came into the picture. More on normandy in another post, but it’s a service written on top of the kubernetes go client (client-go), effectively used for doing CRUD operations on the kubernetes resources required by an application when it is deployed to kubernetes cluster(s).</p>
<p>The immediate benefit of deploying vesemir via normandy was that what used to take 10-15 mins of manually deploying vesemir to the VM’s could now be done via the same deployment tooling which we give out to the product developers, with the same deployment UX and with rollback available via the same interface, helping us dogfood our own service.</p>
<p>We ended up adding support for python stack based deployments via our deployment platform as part of this effort, which came as a strategic win for us.</p>
<p>Another big win was that we now knew vesemir’s exact dependencies, what it used and what it didn’t, which made its infrastructure reproducible and immutable.</p>
<h3 id="outcome">Outcome</h3>
<p>The biggest outcome of this revamp, came in terms of reducing the defect rate by ~18% from what it was earlier.</p>
<p>The monitoring, logging and reliability changes, along with the better testing mechanisms introduced, reduced devs’ hesitation to dive deeper into vesemir’s codebase, taking it from a codebase people feared to introduce changes into to one where a changeset could be introduced in a couple of minutes, with immutability baked in.</p>
<p>A lot of the toil involved in maintaining, adding features to, and testing vesemir was reduced, which helped with developer productivity.</p>
<p>We also upgraded from an EOL’d version of python which was no longer receiving any security fixes.</p>
<h3 id="why-didnt-we-just-rewrite-vesemir">Why didn’t we just rewrite vesemir?</h3>
<p>There were a couple of things we considered before deciding not to rewrite vesemir.</p>
<h4 id="cons-of-a-rewrite">Cons of a rewrite</h4>
<ul>
<li>the new piece would be missing what the vesemir codebase has been:
<ul>
<li>well tested for its behaviour by end-users.</li>
<li>in use for a couple of years.</li>
<li>patched with a lot of bug fixes which have been found and fixed over time with usage.</li>
</ul>
</li>
<li>it would not take into consideration the dev effort required to maintain two systems at the same time during such a migration to the new tool.</li>
</ul>
<h4 id="pros-of-a-rewrite">Pros of a rewrite</h4>
<ul>
<li>Completely new codebase, following basic sanity checks and best practices, right from the start.</li>
<li>A rewrite in a language more familiar to the rest of our team’s codebases (the majority being in ruby/golang), rather than vesemir being the single service written in python.</li>
<li>Though it is not a given that the shortcomings solved in the rewrite would outweigh the new bugs which would come in as part of it.</li>
</ul>
<h3 id="takeaways">Takeaways</h3>
<p>I learned a bunch during the course of the whole revamp, while creating the project plan and cutting out user stories for it, but the biggest learning for me was prioritization: deciding which problem is more important to solve and would create more impact, helping the end users (the product developers) achieve the end result of introducing the changeset for their service in a reliable manner.</p>
<p>This was also the time when fresh grads had joined our team; mentoring them to ramp up on the codebase and the operational side of things helped me deepen my own understanding of our domain.</p>
<p>It was also a good, new experience dealing with a really hairy, legacy piece of software, working around it while enhancing it, all without causing issues to the customers (the product team devs).</p>
<p>Thanks for reading this piece. More on normandy, our kubernetes deployment service and the other part of our deployment platform, in another post.</p>
<h3 id="links">Links</h3>
<ul>
<li><a href="https://www.gojek.io/blog/how-we-reduced-skyrocketing-cpu-usage">https://www.gojek.io/blog/how-we-reduced-skyrocketing-cpu-usage</a></li>
<li><a href="https://www.tasdikrahman.com/2021/06/10/vesemir-our-virtual-machine-deployment-service/">https://www.tasdikrahman.com/2021/06/10/vesemir-our-virtual-machine-deployment-service/</a></li>
<li><a href="https://unsplash.com/photos/8lvHMctQMrU">Cover image credits</a></li>
</ul>
Vesemir: Our virtual machine deployment service
2021-06-10T00:00:00+00:00
https://www.tasdikrahman.com/2021/06/10/vesemir-our-virtual-machine-deployment-service
<p>This post is a continuation of the tweet here</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The build and deployment pipeline for each org will be different in some way or the other, given each co will have it’s own requirements. This thread talks a bit about our virtual machine deployment pipeline (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1403038789052735491?ref_src=twsrc%5Etfw">June 10, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This was also cross-posted originally on the gojek tech blog: <a href="https://www.gojek.io/blog/introducing-vesemir-gojeks-virtual-machine-deployment-service">https://www.gojek.io/blog/introducing-vesemir-gojeks-virtual-machine-deployment-service</a></p>
<p>The build and deployment pipeline for each org will be different in some way or the other, given each co will have its own requirements. Even in my last org, the way our team enabled other teams to ship code/config changes was pretty different from the way we do it in my current org.</p>
<h2 id="background">Background</h2>
<p>Our deployment platform comprises a central service registry; two separate services to deploy to virtual machines and kubernetes respectively; and a cli to interact with the service registry, which is given out to the org’s developers so they can do numerous things for their applications in a self-serve manner. One of those things is the ability to deploy their applications via the CLI.</p>
<p>The idea is to abstract the deployment process away from the developers as much as possible, while also giving them the means to debug, with meaningful logs shown to them so they can figure out what went wrong when a deployment doesn’t work as expected.</p>
<p>For this post, I will specifically talk about the virtual machine deployment service we have in place, called vesemir, and how we went about revamping its infrastructure along with refactoring it to keep it healthy and maintainable for others to work on its codebase.</p>
<p>Given GCP doesn’t have a service similar to <a href="https://aws.amazon.com/codedeploy/">AWS codedeploy</a>, vesemir came out of that same requirement.</p>
<h3 id="what-does-it-really-do">What does it really do?</h3>
<p>Vesemir is an in-house python service which, in a gist, is a wrapper on top of the chef APIs: it receives a request to deploy a service in a particular cluster (a GCP project), filters out the VM’s where it has to deploy the changeset, and then goes about deploying it, either one VM at a time or at the level of concurrency specified in the request.</p>
<p>It helps us deploy to integration and production environments, 300+ times every day.</p>
<h3 id="how-does-it-do-it">How does it do it?</h3>
<p>The bits of the initial request payload which are of most relevance are:</p>
<ul>
<li>application name</li>
<li>environment</li>
<li>team</li>
<li>chef tags to be used for filtering the application VM’s</li>
<li>haproxy metadata (degradation time specified, cookbook/recipe/tags to filter the haproxy boxes, haproxy backend)</li>
<li>concurrency (number of VM’s to deploy at a time)</li>
</ul>
<p>Once vesemir receives this information, it queries the chef server via <a href="https://github.com/coderanger/pychef">pychef</a> with the search tags passed in; the query gets constructed and sent to the central chef server, and the hostnames and their IP’s then get parsed out of the response.</p>
<p>This takes care of the first bit of the problem: knowing which VM’s are to be touched for deploying the changeset.</p>
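<p>A minimal sketch of that filtering step. The row shape below mimics chef node search results (pychef surfaces node attributes as nested dicts), but the exact attribute paths are my assumption, not vesemir’s actual code:</p>

```python
def parse_nodes(search_rows):
    """Pull (hostname, ip) pairs out of chef node search results,
    skipping rows which are missing either attribute."""
    hosts = []
    for row in search_rows:
        automatic = row.get("automatic", {})
        fqdn, ip = automatic.get("fqdn"), automatic.get("ipaddress")
        if fqdn and ip:
            hosts.append((fqdn, ip))
    return hosts
```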
<p>Along with this bit, the <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/playbook_executor.py#L41">playbook.PlaybookExecutor</a> gets initialised with the ansible playbook (more on this later) which we execute on the hosts found, the inventory file, and the variable manager named argument taking in the extra variables to be used by the templatised playbook.</p>
<p>We write the ansible inventory to a temporary file using <a href="https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile">NamedTemporaryFile</a>, which gets deleted after the whole request/response lifecycle of a deployment. This file gets fed into the <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/inventory/manager.py#L118">InventoryManager</a>, where we pass the hostname file created above as <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/inventory/manager.py#L137">sources</a>.</p>
<p>The ansible playbook which gets fed into <code class="language-plaintext highlighter-rouge">playbook.PlaybookExecutor</code> is a static playbook with jinja templating, where we feed the options via the <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/vars/manager.py#L76">VariableManager</a> into <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/vars/manager.py#L85"><code class="language-plaintext highlighter-rouge">extra_vars</code></a>.</p>
<p>The options named var takes in details like which user and other authz/authn details ansible should use to ssh into the hosts, along with options like <code class="language-plaintext highlighter-rouge">forks</code>, which sets the concurrency for the number of boxes on which we want to run the playbook at a time.</p>
<p>The next step is to execute the <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_executor.py#L84">run</a> method on the initialized <code class="language-plaintext highlighter-rouge">playbook.PlaybookExecutor</code> object. The important bit after the run is the collection of stats we pick up from the <a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_queue_manager.py#L52">TaskQueueManager</a>, which is what we then use to check on which nodes the playbook ran successfully, whether any hosts were unreachable, and so on.</p>
<p>This piece of information is then used to form a response to be given back to the client which has called vesemir.</p>
<p>As for the playbook and what it does: in a gist, it first disables the server it is about to deploy to in the haproxy backend for the application servers, introduces the changeset, restarts the service, and enables the VM back in the HAproxy backend, with weights if provided in the request; an option to sleep for a bit, again controlled by the request, is also available before the VM is inserted back with 100% weight.</p>
<p>The playbook then loops over the hosts returned by the chef search query, executing the playbook tasks on each host.</p>
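<p>Put together, the sequence above could look roughly like the following templated playbook. The module choices, variable names, and task layout are my illustration of the described flow, not the actual production playbook:</p>

```yaml
- hosts: app_servers
  serial: "{{ concurrency | default(1) }}"   # number of VMs deployed to at a time
  tasks:
    - name: Disable this VM in the haproxy backend
      haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: "{{ haproxy_backend }}"
      delegate_to: "{{ haproxy_host }}"

    - name: Introduce the changeset and restart the service  # artifact pull/unpack elided
      service:
        name: "{{ app_name }}"
        state: restarted

    - name: Optional sleep before taking full traffic again
      pause:
        seconds: "{{ post_deploy_sleep | default(0) }}"

    - name: Enable this VM back in the haproxy backend at full weight
      haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: "{{ haproxy_backend }}"
      delegate_to: "{{ haproxy_host }}"
```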
<center><img src="/content/images/2021/06/vesemir-deployment-flow.jpg" /></center>
<h2 id="pros-of-this-deployment-pattern">Pros of this deployment pattern</h2>
<p>The obvious one: since we control the deployment platform tooling, we have a level of flexibility not possible with vendors. Everything else is more or less an outcome of this flexibility.</p>
<h2 id="shortcomings-of-this-deployment-pattern">Shortcomings of this deployment pattern</h2>
<p>Given the changeset is introduced into each VM in this way, immutability is not possible. This is in contrast to the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMI</a>-based approach, though AWS codedeploy would do something similar by injecting the codebase into the target VMs.</p>
<p>There are more moving parts in the system, which depend on external dependencies. For example, ansible is a very important part of how vesemir deploys the changeset, right from piggybacking on the ssh interface it provides, to the templatising of the steps of execution to be done.</p>
<p>Maintenance of a custom tool, which has its own lifecycle, maintenance burden, bugs and feature requests.</p>
<h2 id="takeaways">Takeaways</h2>
<p>There are numerous ways in which people end up deploying their codebases, but having maintained vesemir for some time now at my current org, and having seen both sides, immutable deployments via container images and deploying changesets via vesemir, it’s not very hard to see the selling point of immutability.</p>
<p>Thanks for reading this piece so far, more on how we ended up revamping vesemir in another post.</p>
<h3 id="links">Links</h3>
<ul>
<li><a href="https://github.com/ansible/ansible/blob/v2.5.1/lib/ansible/executor/playbook_executor.py">playbook.PlaybookExecutor class - https://github.com/ansible/ansible/blob/v2.5.1/lib/ansible/executor/playbook_executor.py</a></li>
<li><a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/vars/manager.py#L76">VariableManager class - https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/vars/manager.py#L76</a></li>
<li><a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_executor.py#L84">TaskExecutor class - https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_executor.py#L84</a></li>
<li><a href="https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_queue_manager.py#L52">TaskQueueManager class - https://github.com/ansible/ansible/blob/v2.5.0/lib/ansible/executor/task_queue_manager.py#L52</a></li>
<li><a href="https://github.com/coderanger/pychef">https://github.com/coderanger/pychef</a></li>
<li><a href="https://unsplash.com/photos/ejqfevj3Xv8">Cover image credits</a></li>
</ul>
Bug which would cause some deployments to get triggered again and again
2021-06-06T00:00:00+00:00
https://www.tasdikrahman.com/2021/06/06/bug-which-would-cause-some-deployments-to-get-triggered-again-and-again
<p>This post is a continuation of this tweet here.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">We recently encountered a bug in our deployment flow, which we were completely oblivious to. (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1401225434835001345?ref_src=twsrc%5Etfw">June 5, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Bugs are present in every system, waiting to be discovered. As such, this one was no different.</p>
<h2 id="what-did-the-bug-do">What did the bug do?</h2>
<p>It would cause an application to be deployed again and again, even though it had been triggered to be deployed only once.</p>
<h2 id="context-about-the-system-involved">Context about the system involved</h2>
<p>The part which handles requests to deploy an application to a set of VM’s is a custom service which we have written and maintained over the years. The exact workings of it are a subject for another discussion, but the gist is that it has an API which listens for requests, does the intended work of deploying a changeset to the set of VM’s for the application, and returns an appropriate response.</p>
<p>Depending on whether the deployment was successful or not, further information gets added to the response body, for the client to decipher whether the deployment was a success.</p>
<p>The deployment service runs behind gunicorn, with a couple of sync workers handling the incoming requests; the only other thing of note is that the workers are configured with a <a href="https://github.com/benoitc/gunicorn/issues/1493#issuecomment-321331753">timeout configuration</a>, which acts as a request timeout.</p>
<p>The current setup being sync, there is also a timeout configuration on the reverse proxy which sits between the client and the service which deploys an application to VM’s. The proxy timeout was set to the same value as the gunicorn worker timeout.</p>
<p>Orchestration of the whole deployment, right from sending the details of the deployment request (which app to deploy, which environment, along with other metadata), is handled by a separate service, which also doubles up as the internal service registry (more on this later).</p>
<p>The deploy request which this broker service receives eventually ends up being queued as a background job. One detail here is that the job processing framework used retries failures with an exponential backoff by default.</p>
<p>This means the retry keeps happening a number of times, which ends up being a not so good number for the nature of the job being handled, and is not the intended behaviour for this flow. After the max number of retries, the job gets pushed to the dead set.</p>
<h2 id="immediate-fix">Immediate fix</h2>
<p>After killing this specific job manually to immediately resolve the issue, we went a bit further into checking why the job processing logic for the deployment job was erroring out, leading to the retry.</p>
<h2 id="what-caused-it">What caused it</h2>
<p>Now the way the service handling deployments works is such that, out of a set of application VM’s, where it has to deploy, it will do it either 1 at a time or the level of concurrency being passed in the deployment request.</p>
<p>More on the series of steps taken while deploying a changeset later. The point to note here is that deployment time grows with the number of VM’s the application has in the particular environment being deployed to.</p>
<p>What ended up happening was that the orchestrator would wait for a response from the deployment service, but after x minutes (the gunicorn worker timeout, set to the same value as the reverse proxy connection timeout), the request would be abruptly closed by the proxy.</p>
<p>All of this, even before a valid response could be received by the orchestrator from the deployment service.</p>
<p>The block handling the response from the deployment service did have a guard for HTTP errors raised while making the request, but it came after the block which checked the status of the deployment from the received response.</p>
<p>Given the timeout configuration, and the HTTP error flow not being handled before the response-parsing flow, the deployment job would error out, leading it to be retried by the background job processor.</p>
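<p>The fix in spirit is to handle the transport failure before ever touching the response body. A minimal sketch, assuming a plain HTTP call (function and field names here are illustrative, not the actual service code):</p>

```python
import json
from urllib import request, error

def check_deployment(url: str) -> dict:
    """Guard against HTTP/transport errors *before* parsing the deployment
    status, so a proxy-closed connection doesn't error out the whole job."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            body = resp.read()
    except error.URLError:
        # Timeouts and abruptly closed connections land here instead of
        # bubbling up and triggering a blind retry of the deployment job.
        return {"status": "unknown", "reason": "http_error"}
    return json.loads(body)
```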
<h2 id="what-did-we-do-to-prevent-this-from-happening-again">What did we do to prevent this from happening again</h2>
<p>The first fix was to increase the response timeout on the reverse proxy to be comfortably more than the gunicorn worker timeout, so that if the worker timed out, its response would still be sent through without the connection being closed by the proxy.</p>
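<p>As a sketch of the relationship between the two timeouts (assuming nginx as the reverse proxy, which the post does not actually name; the numbers are illustrative):</p>

```nginx
# Reverse proxy: give it comfortably more headroom than the worker,
# so gunicorn's own timeout response reaches the client.
location / {
    proxy_pass http://deployment_service;
    proxy_read_timeout 330s;  # > the gunicorn worker timeout below
}

# gunicorn side (CLI equivalent): gunicorn app:app --workers 4 --timeout 300
```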
<p>The second thing we did was to add a check to the deployment flow: before initiating a deployment, verify whether it is already in a terminal state (failed/succeeded), which was easy to do given the state transitions being maintained.</p>
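<p>The guard itself can be as small as this hedged sketch (state names are illustrative):</p>

```python
# Deployments that already reached a terminal state are skipped, which
# makes a retried job a harmless no-op instead of a duplicate deploy.
TERMINAL_STATES = {"failed", "succeeded"}

def should_run(deployment: dict) -> bool:
    return deployment.get("state") not in TERMINAL_STATES
```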
<p>We could also have made use of the background job processor’s <code class="language-plaintext highlighter-rouge">max_retry</code> setting here, by setting it to 0 so that the job would not be retried at all if it failed once. Another option would have been to use <code class="language-plaintext highlighter-rouge">discard_on</code>: <a href="https://edgeapi.rubyonrails.org/classes/ActiveJob/Exceptions/ClassMethods.html#method-i-discard_on">https://edgeapi.rubyonrails.org/classes/ActiveJob/Exceptions/ClassMethods.html#method-i-discard_on</a></p>
<p>Although this was not the root cause, an obligatory plug of the fallacy that the <a href="https://web.archive.org/web/20171107014323/http://blog.fogcreek.com/eight-fallacies-of-distributed-computing-tech-talk/">“network is reliable”</a>.</p>
<h2 id="links">Links</h2>
<ul>
<li><a href="https://github.com/benoitc/gunicorn/issues/1493#issuecomment-321331753">https://github.com/benoitc/gunicorn/issues/1493#issuecomment-321331753</a></li>
<li><a href="https://docs.gunicorn.org/en/stable/settings.html#timeout">https://docs.gunicorn.org/en/stable/settings.html#timeout</a></li>
<li><a href="https://unsplash.com/photos/ZmFjEJq-x9k">Credits for the cover photo to Hitanshu</a></li>
</ul>
Handling language stack deprecations: Part 1: Virtual Machine infrastructure
2021-02-02T00:00:00+00:00
https://www.tasdikrahman.com/2021/02/02/handling-language-stack-deprecations-part-1-virtual-machine-infrastructure
<p>This post is a continuation of this <a href="https://twitter.com/tasdikrahman/status/1355129674192490498?s=20">tweet thread</a></p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Language deprecation for stacks can be a task if you are on VMs, added to that the confusion on what version of that stack runs, in your inventory if it's not small. Summarizing what we ended up doing to bring visibility & giving people the ability to migrate themselves (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1355129674192490498?ref_src=twsrc%5Etfw">January 29, 2021</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="compute-vms">Compute VM’s</h2>
<p>Given the nature of VMs and how they are run and created in our compute infrastructure environment, managing, upgrading and adding fixes to them becomes a task in itself. With no control plane to manage the lifecycle of these VM’s, the task is manual at best, even though there is automation to delete and create VMs on demand (more on the VM creation API which we created in a different post).</p>
<p>If the workloads were on kubernetes, the deprecation step would be as simple as a redeploy of the application after the base images had been updated with the changes required to deprecate the language stack.</p>
<p>The confusion around which language stack version ran on which VM’s of our inventory only added to the problem statement, which in itself is clear: deprecate a group of VM’s running a particular language stack without causing any disruption to the workloads running on them.</p>
<h2 id="where-to-start">Where to start?</h2>
<p>To begin with, the automation would need fixing to disallow the creation of the compute infrastructure with the language stack which is out for deprecation. This would help remove the moving target of VMs which would get created with the automation you have. Having a finite number helps reduce the effort at the end.</p>
<h2 id="creating-the-inventory">Creating the inventory</h2>
<p>Along with that, the next natural step would be the creation of the inventory of which set of VM’s are running the version of language stack which you would want to deprecate.</p>
<p>If the number of VM’s is in the few hundreds combined for the org, one can simply check the <a href="https://en.wikipedia.org/wiki/Infrastructure_as_code">IAC</a> configuration checked into their git repo’s, which was applied to create the infrastructure.</p>
<p>A manual effort, but for starters it works just fine given the smaller number of VM’s.</p>
<p>In contrast, if the inventory runs into thousands of VM’s, the same approach is impractical at best. Furthermore, the whole idea of maintaining the IAC for such a set of VM’s is counter-intuitive given the amount of config which would have to be maintained.</p>
<p>Compute at present is created on demand. A dev needing a compute VM can simply go ahead and use proctor (our in house automation orchestrator) to create a VM for them (more on this in this <a href="https://www.youtube.com/watch?v=mE1JZKMhnNs">talk</a>).</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/mE1JZKMhnNs" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>But what about inventory? While creating the VM, we add tags (both on the configuration management tool and on the cloud provider resource which is created).</p>
<p>This allows someone to query the VMs by the team/group they belong to, and makes the whole job as simple as querying your configuration management tool for the VM’s matching whatever filter you give it. That is, provided your automation adds the language stack version as a tag while creating the VMs; in our case, we didn’t have these tags for the language stack versions.</p>
<p>Which brings us to the situation where it’s not clear which VM’s are actually running which version of a language stack, since the number of language stack versions running on the VM’s would vary as support for different versions was added over time.</p>
<p>How did we go about it? Given the number of VM’s, it was best to have this generated in an automated manner, simply because of the impracticality of someone checking and creating it manually. Added to that, as the VM infrastructure keeps growing over time, the data would go stale really quickly.</p>
<p>As this was also something which would have to be repeated for another language stack/os version/etc again sometime in the future, it was best to have some sort of a repeatable and predictable process which could be leveraged to do this in a faster manner.</p>
<p>The solution as of now is an airflow job which runs on a schedule, and whose end result is a set of artifacts which it creates/updates.</p>
<p>The artifact in our case is a google sheet with entries for all VM’s running the particular language stack, along with metadata, for eg: ownership details, os version, production/integration, what kind of compute VM it is, etc.</p>
<p>What the airflow job does, apart from this, is keep updating an RDBMS with information similar to what it inserted in the google sheet.</p>
<p>The RDBMS got imported as a data source on grafana, on top of which we gave the end users a view of the VM’s under their ownership for that particular language stack. This allowed them to have a simple interface to the VM set, without the need for them to know any internal details.</p>
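<p>The core of such a job can be sketched as below (a hedged stand-in: sqlite3 plays the role of the RDBMS, and the column names are illustrative; the real job also writes the same rows to the google sheet):</p>

```python
import sqlite3

def upsert_inventory(conn: sqlite3.Connection, rows: list) -> None:
    """Idempotently upsert VM inventory rows; Grafana then reads this table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS vm_inventory (
               hostname TEXT PRIMARY KEY,
               team TEXT,
               stack_version TEXT,
               env TEXT)""")
    conn.executemany(
        "INSERT OR REPLACE INTO vm_inventory "
        "VALUES (:hostname, :team, :stack_version, :env)",
        rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
upsert_inventory(conn, [
    {"hostname": "app-01", "team": "payments",
     "stack_version": "ruby-2.3", "env": "production"},
])
```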
<h2 id="how-does-having-different-view-layers-for-this-inventory-help">How does having different view layers for this inventory help?</h2>
<p>The dev folks can simply look at the VM hostname/IP and, using the automation they are familiar with, create a new VM to use in place of the older one, giving them the power to do the whole activity without being blocked on anyone.</p>
<p>To keep track of each product group/team’s progress, we also kept a count of the number of VM’s they had on that particular language stack at the end of each week.</p>
<p>This number would then be sent to the teams/product groups as an email, powered by another scheduled airflow job, along with a grafana dashboard for viewing their set of instances and the docs required for doing the whole activity.</p>
<p>The following dashboard, for example, shows the trend line over time for the number of VM’s on a particular language stack across different teams/product groups.</p>
<center><img src="/content/images/2021/02/vm-inventory-trendline.jpg" /></center>
<h2 id="more-on-the-automation-used">More on the automation used</h2>
<p>Coming back to the specifics of the airflow automation: the job which creates/updates the google sheet and inserts VM details into the RDBMS.</p>
<p>It makes use of the configuration management tool’s language specific client which is used in the script, to first filter the VM’s based on a specific automation template which gets attached to the VM when it gets registered to the configuration management tool.</p>
<p>This first narrows down the VMs which would be running some version of the language stack. The script then ssh’s into this VM list in batches and plucks out the details of the language stack by running a command specific to that stack.</p>
<p>The command prints out the version information, which the script captures from the VM before logging out. A few other helper commands are run as part of the whole script, which pluck out the relevant metadata and create/update the necessary artifacts.</p>
<p>The helper library created here is flexible enough to take as input the query you want to run on the VM to capture the relevant information, which helps with future inventory management.</p>
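<p>A hedged sketch of that probe step (command names are illustrative; the real implementation goes through the configuration management tool’s client and runs over batches of hosts):</p>

```python
import subprocess

def ssh_probe(host: str, probe_cmd: str) -> list:
    """Build the ssh invocation for a stack-specific probe, e.g. `ruby --version`."""
    return ["ssh", "-o", "BatchMode=yes", host, probe_cmd]

def capture_output(argv: list, timeout: int = 30) -> str:
    """Run the command and capture whatever version string it prints."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()
```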
<p>Would love to hear what you folks do for this repetitive chore and how you do it.</p>
<p>If you liked what you read here, I have written a bit more about how we did the same for containers <a href="https://www.tasdikrahman.com/2021/02/02/handling-language-stack-deprecations-part-1-virtual-machine-infrastructure/">here</a></p>
Maintaining aptly - The debian package manager
2020-12-23T00:00:00+00:00
https://www.tasdikrahman.com/2020/12/23/maintaining-aptly-the-debian-package-manager
<p>This post is a continuation of this <a href="https://twitter.com/tasdikrahman/status/1326536874375090176">tweet thread</a></p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Sad to see aptly slowly <a href="https://t.co/zkXgsruAGi">https://t.co/zkXgsruAGi</a> rotting, but works really well till the last 1.4.0 build as a debian package repository for your needs. (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1326536874375090176?ref_src=twsrc%5Etfw">November 11, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><a href="https://www.aptly.info/">Aptly</a> is a debian package repository. Our specific use case is pushing out application specific debian packages, which are then pulled onto the app boxes while deploying a new SHA/version of the application. More on this in another post. What this post will concentrate on are a few things we discovered while maintaining aptly, with stored packages running into multiple TBs of storage.</p>
<h2 id="use-an-ssd">Use an SSD</h2>
<p>This is a must, since aptly serves the debian packages straight from the filesystem. If you are not using an SSD, you will definitely see a slowdown in the package fetch/insert steps.</p>
<h2 id="add-your-cleanup-scripts-early">Add your cleanup scripts early</h2>
<p>Make it known to the application owners that you will only keep, say, the last 10-15 packages in the filesystem. Given that a deb package roughly takes a few hundred MiBs (a very rough estimate, this can vary for you), you will reach a point where your initial storage runs out over time.</p>
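<p>The retention policy itself is simple to sketch (hedged: this only shows selecting what to prune by modification time; the actual removal should go through aptly’s own repo/package commands so the index stays consistent):</p>

```python
from pathlib import Path

def packages_to_prune(pkg_dir: Path, keep: int = 15) -> list:
    """Return the .deb files beyond the newest `keep`, as prune candidates."""
    debs = sorted(pkg_dir.glob("*.deb"),
                  key=lambda p: p.stat().st_mtime,
                  reverse=True)
    return debs[keep:]
```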
<p>Throwing disk and compute at it can be done and is all fine, but the former will reach a limit when your filesystem itself can’t support it any further (ext4 logically supports up to 1 exbibyte (1 EiB), but the point is to not end up in a situation where cleanup becomes a problem).</p>
<p>Even then, resize2fs (version 1.42.9) will simply fail if you try growing the disk beyond 16TB, saying that the new size is too large to be expressed in 32 bits. Adding to that, if you are running an older kernel, eg 3.19.x, which doesn’t handle 64-bit ext4 filesystems properly, you would be left in a puddle here.</p>
<h2 id="build-time-slows-down-over-time">Build time slows down over time</h2>
<p>Over time, the index of the packages held inside aptly grows, which considerably slows down the package publish step for your package consumers, which in most environments is a huge productivity kill. Imagine having to wait x minutes every now and then while trying to push a commit and deploy it over to the app boxes. Multiply this by the number of developers in the team/company, and those are a lot of people hours right there.</p>
<p>The particular step which slows the whole process down is the publish step for a distribution.</p>
<h2 id="how-to-prevent-this-api-slowdown-in-the-publish-step">How to prevent this API slowdown in the publish step?</h2>
<p>The publish step <a href="https://www.aptly.info/doc/aptly/publish/repo/">https://www.aptly.info/doc/aptly/publish/repo/</a> has an option/flag called <code class="language-plaintext highlighter-rouge">--skip-contents</code>, which essentially skips generating the index of the contents stored. We had tried storing this in the aptly app config, but it did not seem to work.</p>
<p>We then checked the <a href="https://github.com/aptly-dev/aptly/blob/24a027194ea8818307083396edb76565f41acc92/api/publish.go#L232">codebase</a> for the specific route, <code class="language-plaintext highlighter-rouge">/publish/:prefix/:distribution</code>, which used to take the most time and for which we wanted to set the above setting.</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// https://github.com/aptly-dev/aptly/blob/24a027194ea8818307083396edb76565f41acc92/api/publish.go#L232</span>
<span class="c">// PUT /publish/:prefix/:distribution</span>
<span class="k">func</span> <span class="n">apiPublishUpdateSwitch</span><span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">gin</span><span class="o">.</span><span class="n">Context</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">param</span> <span class="o">:=</span> <span class="n">parseEscapedPath</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">Params</span><span class="o">.</span><span class="n">ByName</span><span class="p">(</span><span class="s">"prefix"</span><span class="p">))</span>
    <span class="n">storage</span><span class="p">,</span> <span class="n">prefix</span> <span class="o">:=</span> <span class="n">deb</span><span class="o">.</span><span class="n">ParsePrefix</span><span class="p">(</span><span class="n">param</span><span class="p">)</span>
    <span class="n">distribution</span> <span class="o">:=</span> <span class="n">c</span><span class="o">.</span><span class="n">Params</span><span class="o">.</span><span class="n">ByName</span><span class="p">(</span><span class="s">"distribution"</span><span class="p">)</span>
    <span class="k">var</span> <span class="n">b</span> <span class="k">struct</span> <span class="p">{</span>
        <span class="n">ForceOverwrite</span> <span class="kt">bool</span>
        <span class="n">Signing</span> <span class="n">SigningOptions</span>
        <span class="n">SkipContents</span> <span class="o">*</span><span class="kt">bool</span>
        <span class="n">SkipCleanup</span> <span class="o">*</span><span class="kt">bool</span>
        <span class="n">Snapshots</span> <span class="p">[]</span><span class="k">struct</span> <span class="p">{</span>
            <span class="n">Component</span> <span class="kt">string</span> <span class="s">`binding:"required"`</span>
            <span class="n">Name</span> <span class="kt">string</span> <span class="s">`binding:"required"`</span>
        <span class="p">}</span>
        <span class="n">AcquireByHash</span> <span class="o">*</span><span class="kt">bool</span>
    <span class="p">}</span>
    <span class="k">if</span> <span class="n">c</span><span class="o">.</span><span class="n">Bind</span><span class="p">(</span><span class="o">&</span><span class="n">b</span><span class="p">)</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
        <span class="k">return</span>
    <span class="p">}</span>
    <span class="o">...</span>
    <span class="o">...</span>
</code></pre></div></div>
<p>We anticipated that the var <code class="language-plaintext highlighter-rouge">b</code> would get its de-serialised content for <code class="language-plaintext highlighter-rouge">SkipContents</code> when we started passing the value <code class="language-plaintext highlighter-rouge">skip-contents: true</code> (as we saw in the <a href="https://www.aptly.info/doc/aptly/publish/repo/">docs</a>) in the <code class="language-plaintext highlighter-rouge">PUT</code> call, as part of the final publish step. But this also seemed to not work.</p>
<p>After digging a bit more, <a href="https://twitter.com/viditganpi/">Vidit</a> and <a href="https://twitter.com/kartik7153/">Kartik</a> discovered from the test suite that we were passing the wrong key for <a href="https://github.com/aptly-dev/aptly/blob/24a027194ea8818307083396edb76565f41acc92/system/t12_api/publish.py#L49">setting this value</a>: the key to pass was <code class="language-plaintext highlighter-rouge">SkipContents: true</code>. After making the change, the publish step was fast again; this was the step which had been taking all the time as the package index grew, making the whole package upload slow.</p>
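<p>Put together, the corrected call looks roughly like this hedged sketch (endpoint shape per the route above; the <code class="language-plaintext highlighter-rouge">Signing</code> options shown are illustrative):</p>

```python
import json
from urllib import request

def publish_update_request(base_url: str, prefix: str, distribution: str) -> request.Request:
    """Build the PUT /api/publish/:prefix/:distribution request with
    SkipContents set in the JSON body, so the contents index is skipped."""
    payload = {"Signing": {"Skip": True}, "SkipContents": True}
    return request.Request(
        f"{base_url}/api/publish/{prefix}/{distribution}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT")
```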
<h2 id="alternative-ways-of-tackling-the-package-index-step-slowdown">Alternative ways of tackling the package index step slowdown</h2>
<p>One route which can be taken is to create a new aptly instance and start pushing your debian package files to it, rather than to the older, slower one.</p>
<p>How would the client pick up packages from both these apt sources? Multiple sources can be specified in files under <code class="language-plaintext highlighter-rouge">/etc/apt/sources.list.d/</code>, which are looked up while searching for a package. For newer packages, the client will pick them up from the newer apt source (the new aptly instance), and for the older ones, from the older apt source (the older aptly instance).</p>
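<p>For illustration (hostnames and file names here are made up):</p>

```
# /etc/apt/sources.list.d/aptly-old.list
deb http://aptly-old.internal.example/ stable main

# /etc/apt/sources.list.d/aptly-new.list
deb http://aptly-new.internal.example/ stable main
```

apt consults both files; packages only present in the new instance resolve from there, while the older ones continue to resolve from the old instance.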
<p>This will immediately solve the problem of slower builds, with the side effect of having a clean newer setup where you can start enforcing the standard of keeping only foo number of packages from now on.</p>
<p>The other solution is to either self host an alternative like Pulp3, or use a paid package manager, which would take some of these bits off your plate.</p>
<p>Another thing to note here is that aptly hasn’t had a commit on its master for quite some time, along with no new release being put out for a while now. There has been an open <a href="https://github.com/aptly-dev/aptly/issues/920">issue</a> regarding the question of whether it’s maintained anymore. Although I personally feel it’s feature complete for the set of features we have been using, and runs without any fuss whatsoever for the most part, this is definitely something you should consider while weighing your options.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://www.aptly.info/doc/aptly/publish/repo/">https://www.aptly.info/doc/aptly/publish/repo/</a></li>
<li><a href="https://github.com/aptly-dev/aptly/blob/24a027194ea8818307083396edb76565f41acc92/system/t12_api/publish.py#L49">https://github.com/aptly-dev/aptly/blob/24a027194ea8818307083396edb76565f41acc92/system/t12_api/publish.py#L49</a></li>
<li><a href="https://github.com/aptly-dev/aptly/issues/920">https://github.com/aptly-dev/aptly/issues/920</a></li>
</ul>
<h2 id="credits">Credits</h2>
<ul>
<li>Post header image credits: www.aptly.info</li>
</ul>
What to avoid while doing PR reviews
2020-12-14T00:00:00+00:00
https://www.tasdikrahman.com/2020/12/14/what-to-avoid-while-doing-PR-reviews
<p>As with time this doc will change, but I am jotting my thoughts down here on what I feel I would like to avoid while I review PR’s.</p>
<h2 id="code-formattingstyle-suggestions">Code formatting/style suggestions</h2>
<p>I believe it’s best left to the machine to do this instead of a human fixating their attention on it, given it takes away precious reviewer time which could be diverted to the crux of the changes the submission tries to introduce. A code formatter should ideally take this step off the human reviewer’s plate. An opinionated code-formatter/linter/style checker is the best option to have. An example of this is <a href="https://golang.org/cmd/gofmt/">gofmt</a>, which weeds out code formatting issues right in the build/test step. <a href="https://github.com/rubocop-hq/rubocop">rubocop</a> is another great example.</p>
<p>What a tool cannot enforce should be added to a team guide for reviewing PR’s. Opining about certain styles and blocking the merge of a PR is not the best way to deal with changes being introduced; worse yet, if the person is new or not from the team, it only adds to the friction of contributing to the codebase, which is not a good sign.</p>
<p>If there is a style in the PR which the reviewer is very concerned about, but it was not caught by the automated tooling or mentioned in the team style guide, they should ideally leave it be for this changeset and raise it with the team to be voted upon for the next PR’s.</p>
<p>An obvious thing to avoid here is going against the language guidelines themselves; I would rather stick to the language guidelines (<a href="https://www.python.org/dev/peps/pep-0008/">pep8</a> for example) unless absolutely required. What this enables is that someone new joining the team has fewer unfamiliar things to get onboarded to, making them productive faster in terms of moving around the codebase, finding things, or making sense of the reasoning behind certain choices.</p>
<p>Wrote a small thread around the same here.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Was talking to a friend of mine who was pining about that their PR's would sometimes get a lot of nits in terms of style guides. (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1338507609758859264?ref_src=twsrc%5Etfw">December 14, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="going-over-the-whole-design-again">Going over the whole design again</h2>
<p>A higher level design change is not something which I will suggest on a PR; this needs to be caught right before someone starts implementing foo feature/refactoring. It’s a massive waste of time for the whole team to go over something the team never had consensus upon. This is why I really like the approach of an <a href="https://en.wikipedia.org/wiki/Request_for_Comments">RFC</a>-like process, where a big enough changeset involves a discussion and consensus being formed, so that things don’t come as surprises at the implementation stage for either the reviewer or the driver of the changeset.</p>
<p>What this will also immediately do is also make others familiar with what you are trying to propose and point out any mistakes which might not get caught by you in the early design phase, which they might have seen/experienced.</p>
<h2 id="idealistic-changeset">Idealistic changeset</h2>
<p>This is something which I would like to prevent in the codebase: if adoption of foo feature is hard, it probably doesn’t make sense to add it when no one is going to use it. There should be some strategy for finding users for the problem you are solving and reducing the friction for adoption. If no one uses the feature and it provides no added value in the codebase, it’s as good as dead code to me.</p>
<p>If the feature is not gonna get used, why bother adding it at that moment, when the attention could instead be diverted to more burning issues/solving customer problems which give more leverage?</p>
<p>Again, going via an RFC approach might be one way where this would get caught as others would be able to point out the shortcomings in planning on how this changeset is going to get adopted.</p>
<p>For example, there was this one change made in the codebase of the client of an API which gets distributed to developers internally, for a toolset our team provides. The change effectively introduced parsing of the API response in the client, tying it closely to the exact semantics of the payload it would receive from the server. Trickling such logic down to the client meant that any changes to the response had to be backward compatible. And when we did end up introducing a change to the response payload, we had to keep this versioned API backward compatible. While I agree that versioning exists for this specific purpose, I feel it could have been avoided if the client had been decoupled from the exact response semantics and kept naive, with the server instead being smart enough to send the appropriate data over to the client.</p>
<h2 id="accepting-a-large-changeset-which-introduces-too-many-changes">Accepting a large changeset which introduces too many changes</h2>
<p>With all honesty, I really don’t feel that someone can effectively go through a PR which changes 50 different files and has a huge LOC changeset without spending copious amounts of time dedicated to just reviewing it. I personally have felt that such changes are super hard to review and considerably increase the PR review time. In the worst case, just to unblock the team member, you end up trusting their changeset after going through it at a very high level and merging their PR, which means the PR did not go through the usual, more rigorous review process.</p>
<p>There is no right size for a PR, but ideally smaller changes which have one purpose are easier to review. Say, for example, refactoring a flow to remove redundancy by re-using another module from the same codebase, or adding a small feature which does not require a lot of work.</p>
<p>If the feature/changeset requires a lot of changes, it usually means the scope was not broken down properly, and that ends up being reflected in the PR.</p>
<p>Over time people will notice what a big enough changeset is for their specific services and will start breaking PR’s down; if that is not happening, it definitely needs to be brought to notice by other members of the team.</p>
<p>Another common thing, as pointed out by <a href="https://twitter.com/hashfyre/">Joy</a>, is that we sometimes tend to add something extra on top of the original scope of the PR. This becomes a problem when the extra change is too big or unrelated to the original scope, affecting the review as a side effect of the reviewer trying to hold two different contexts/intents at once. Keeping the intent to just one thing makes the reviewer’s job easier.</p>
<h2 id="being-pedantic-about-coverage">Being pedantic about coverage</h2>
<p><a href="https://en.wikipedia.org/wiki/Code_coverage">Code coverage</a> is not really a good metric to measure software and its stability. While good (less buggy) software tends to have high coverage, it’s not necessarily going to solve all the problems you would encounter. I have written a bit more <a href="https://www.tasdikrahman.com/2020/10/07/why-I-chose-to-do-tdd-for-my-side-project/">here</a> about why even close to 100% code coverage would not make your software bug free.</p>
<p>What I would like to see is ways in which the changeset can be tested, to gain more confidence in what is being introduced. The easier it is to test, the better, as it reduces the feedback cycle.</p>
<h2 id="ending-notes">Ending notes</h2>
<p>These are some of the prompts which I usually try following, which allow a changeset to get accepted in a timely manner without compromising too much on quality, but I would love to hear what you think about it.</p>
<h2 id="credits">Credits</h2>
<p>Thanks to <a href="https://twitter.com/hashfyre/">Joy</a> and <a href="https://twitter.com/captn3m0/">nemo</a> for proof reading the post.</p>
To self host or to not self host kubernetes cluster(s)
2020-11-27T00:00:00+00:00
https://www.tasdikrahman.com/2020/11/27/to-self-host-or-to-not-self-host-your-kubernetes-cluster
<p>A friend of mine asked this to me recently, about how was it to <a href="https://en.wikipedia.org/wiki/Self-hosting">self host</a> <a href="https://kubernetes.io">kubernetes</a> clusters. And I was cursing myself about why I did not write this post earlier (I mean, technically I have written about how we used to do self hosting <a href="https://www.tasdikrahman.com/2019/04/04/self-hosting-highly-available-kubernetes-cluster-aws-google-cloud-azure/">before</a>, but not the pros and cons of it), as this was not the first time I had been asked this question. So this post is dedicated to my friend and to others when they chance upon this question.</p>
<p>Just for context, self host here does not only mean, <a href="https://www.tasdikrahman.com/2019/04/04/self-hosting-highly-available-kubernetes-cluster-aws-google-cloud-azure/">kubernetes inside kubernetes</a>, but in a broader sense, would mean, managing the control plane of the kubernetes cluster too, along with the worker nodes (which is the usual case these days with the cloud vendor k8s options). You would preferably be using solutions like <a href="https://github.com/kubernetes/kops">kops</a>, or <a href="https://github.com/poseidon/typhoon">typhoon</a> here to self host your kubernetes clusters. But there are a bunch of really great tools out there these days apart from these two, which I personally like.</p>
<p>I had to dig out this tweet which I wrote sometime back, and it is more or less a good summary.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">If you can, don't try managing your own <a href="https://twitter.com/kubernetesio?ref_src=twsrc%5Etfw">@kubernetesio</a> clusters. It can become a huge engineering effort in itself very quickly. If the core business product is not around providing infrastructure to others, using <a href="https://twitter.com/hashtag/kubernetes?src=hash&ref_src=twsrc%5Etfw">#kubernetes</a> solutions provided by cloud providers is not a bad idea</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1110927284926447617?ref_src=twsrc%5Etfw">March 27, 2019</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h3 id="dont-self-host-unless-you-have-a-very-good-reason-to">Don’t self host unless you have a very good reason to</h3>
<p>The obvious answer, if you ask why, is that in all honesty it&rsquo;s a huge engineering effort to even run and manage vendor managed kubernetes clusters, let alone self hosted ones. I have already written quite a lot on a bunch of the reasons. One particular problem from which there&rsquo;s no running away is kubernetes upgrades, which I have covered in detail <a href="https://www.tasdikrahman.com/2020/07/22/a-few-notes-on-gke-kubernetes-upgrades/">here</a>, so you can understand the complexity of what one such upgrade entails.</p>
<p>There are also a bunch of things I have mentioned in this <a href="https://www.tasdikrahman.com/2020/11/21/choosing-between-one-big-cluster-or-multiple-smaller-kubernetes-clusters/">post</a> about the complexities of managing a kubernetes cluster, vendor managed or self-hosted alike, and how the effort multiplies as one moves towards multiple clusters and different cluster sizes. If you have read the above two posts, this post might start to feel a bit repetitive at this point, so I will not delve into the exact details, as I have covered them in depth there.</p>
<h3 id="but-tasdik-i-need-to-customize-my-kubernetes-installation-i-will-lose-that-power-if-i-go-ahead-with-a-vendor">But Tasdik, I need to customize my kubernetes installation, I will lose that power if I go ahead with a vendor</h3>
<p>Vendor managed kubernetes clusters already provide a lot of flexibility in terms of customisation these days. But even then, if you are really trying to use or set something esoteric which is important to you and required for your workloads, do keep the maintenance of the cluster in mind. If you do end up self-hosting, you will need to know how to operate the cluster, along with the problems and the corner/edge cases which come with it.</p>
<p>An event like switching out your <a href="https://chrislovecnm.com/kubernetes/cni/choosing-a-cni-provider/">CNI</a>, or moving from kube-dns to <a href="https://kubernetes.io/docs/tasks/administer-cluster/coredns/">core-dns</a>, for example, is best abstracted away if possible.</p>
<p>I get that people do run other things like postgresql or mysql on VMs, but postgresql doesn&rsquo;t need an upgrade every few months; it will just about run for years without an upgrade, without too many hiccups for the most part.</p>
<p>And not to forget, kubernetes is a really fast moving project. <a href="https://github.com/kubernetes/kubernetes/releases/tag/v1.4.6">v1.4.6</a> was released around November 2016. It was also close to the version (~1.8 to be precise) on which we started running our production workloads back in 2017 at one org. Come November 2020, we already have <a href="https://github.com/kubernetes/kubernetes/releases/tag/v1.18.12">v1.18.12</a>, and LTS is not a thing as of now in the kubernetes world. And not to mention the amount of change which happens in each release. As with any fast moving project, things are bound to change; <a href="https://www.infoq.com/podcasts/joe-beda-kubernetes-cncf/">Joe</a> mentions here that kubernetes should be boring, and kubernetes will take some time to become something like that. On a related note, this is a very good piece on that <a href="https://mcfunley.com/choose-boring-technology">topic</a>.</p>
<p>There&rsquo;s also a whole host of things to take care of in case you decide to manage your own <a href="https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/">etcd clusters</a>, along with the control plane components, if you self host. There&rsquo;s great learning in all this for sure, but the idea is first to provide a reliable piece of infrastructure to your customers (assuming you work in the operations/platform team, your customers are your developers). Unless the team is really rock solid in its k8s knowledge and able to keep up with self hosting, this path can very quickly turn into the wrong decision for your team.</p>
<p>And not to forget the automation which needs to be maintained to bring up or upgrade parts, if not all, of the self-hosted cluster. The automation can vary based on the solution you picked to orchestrate the creation of your cluster, and could be anything from shell scripts to terraform modules. The point is that if it&rsquo;s not a standard solution out there and you have hand-rolled a lot of automation yourself, the added overhead of maintaining that automation shows up over time.</p>
<h3 id="ending-notes">Ending notes</h3>
<p>Unless providing kubernetes as a PaaS is your bread and butter and your main business, where you compete against others and absolutely have to innovate and stay on top of the game, in most cases you are better off just using a cloud vendor managed offering.</p>
<p>We ended up self-hosting back in the day, as the cloud vendor&rsquo;s k8s offering was not present in our region, and compute constraints which required us to be in that particular region and no other made us go the self-hosting route.</p>
<p>Having managed (both self hosted and vendor managed) and built platforms on top of multiple k8s clusters (cloud vendor managed control planes in the latter case) for the better part of three and a half years now, one thing I love about k8s for sure is the power it gives: the ease of deployments (<a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/">rolling deployments</a> out of the box), the immutable nature of deployments (containerized workloads), the programmatic interface for starters (<a href="https://github.com/kubernetes/client-go">kubernetes/client-go</a>, need I even say more), and I am not even scratching the surface.</p>
<p>But at the same time, the right tradeoffs do need to be made. Kubernetes absolutely solves some very core problems, but if you&rsquo;re still in the phase of building your product, without much bandwidth, I would highly recommend thinking twice before going the self-hosting route, or even kubernetes at all, so that it does not become a distraction from the core task, which is reliable infrastructure. You might just be better off with plain simple <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html">ASGs</a> and <a href="https://cloud.google.com/compute/docs/instance-groups">instance groups</a> with <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMIs</a>. Fewer moving parts are good to start with. But again, it depends on your team; if everyone has a good amount of production experience with k8s, then why not? Even then, though, a vendor managed k8s installation would again be my personal pick.</p>
<p>If you are in a very specialised environment where you absolutely know what you are doing and you must, this post is obviously not for you, in which case I would love to hear more about your learnings :).</p>
<p>Coincidentally, my last post was also about the trade-offs in a specific situation similar to this one, when dealing with kubernetes. I hope you have found this piece informative.</p>
<h3 id="links">Links</h3>
<ul>
<li><a href="https://www.tasdikrahman.com/2020/07/22/a-few-notes-on-gke-kubernetes-upgrades/">https://www.tasdikrahman.com/2020/07/22/a-few-notes-on-gke-kubernetes-upgrades/</a></li>
<li><a href="https://www.tasdikrahman.com/2019/04/04/self-hosting-highly-available-kubernetes-cluster-aws-google-cloud-azure/">https://www.tasdikrahman.com/2019/04/04/self-hosting-highly-available-kubernetes-cluster-aws-google-cloud-azure/</a></li>
<li><a href="https://www.tasdikrahman.com/2020/11/21/choosing-between-one-big-cluster-or-multiple-smaller-kubernetes-clusters/">https://www.tasdikrahman.com/2020/11/21/choosing-between-one-big-cluster-or-multiple-smaller-kubernetes-clusters/</a></li>
</ul>
<h3 id="credits">Credits</h3>
<ul>
<li>Thanks to <a href="https://captnemo.in/">nemo</a> for proofreading and the folks at <a href="https://kubernetes.io/">kubernetes</a> for the post image.</li>
</ul>
Choosing between one big cluster or multiple smaller kubernetes clusters
2020-11-21T00:00:00+00:00
https://www.tasdikrahman.com/2020/11/21/choosing-between-one-big-cluster-or-multiple-smaller-kubernetes-clusters
<p>This post is a continuation of the discussion I was having with <a href="https://twitter.com/VineethReddy02">@vineeth</a>.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">But why would someone choose one large cluster over multiple small clusters? Aren't multiple clusters already a pattern in enterprises?</p>— Vineeth Pothulapati (@VineethReddy02) <a href="https://twitter.com/VineethReddy02/status/1329656769900003329?ref_src=twsrc%5Etfw">November 20, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The context: I came across a tweet which demonstrated the ability of kubernetes to scale up to 15k nodes, thanks to recent improvements.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">15k nodes cluster 🤯 <a href="https://t.co/VMWI7HeYHH">https://t.co/VMWI7HeYHH</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1329615066405093379?ref_src=twsrc%5Etfw">November 20, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The discussion was originally around costs and how much it would take to run one such large kubernetes cluster, but it went in a different direction altogether.</p>
<p>So what should the decision be? Rather than prescribe one, I would like to discuss a few things about both sides of the coin. This post is not a recommendation on what one should do; the idea is to guide you towards a more informed decision given the data points, trade-offs and constraints that you have.</p>
<p>Before we start: &ldquo;big&rdquo; here is relative, but for the sake of this conversation, let&rsquo;s say a big cluster is one with more than ~50 worker nodes (that&rsquo;s a big cluster for me right now; more on the why of it in a bit), and a small cluster, for the context of this post, has say 5-10 worker nodes.</p>
<h4 id="basic-house-keeping">Basic house keeping</h4>
<p>Have separate cluster(s) for staging and production, similar to what you would already have been doing for your compute infrastructure on VMs.</p>
<h4 id="multi-tenancy">Multi tenancy</h4>
<p>Hard multi-tenancy is something which might not work as expected on k8s as of now, even with <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">netpols</a>/<a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/">namespaces</a> and upcoming mechanisms like <a href="https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc">Hierarchical namespaces (aka HNC)</a>, which is still in incubation at the time of writing.</p>
<p>That’s what I have last checked, but please correct me here if you came across something which tells otherwise.</p>
<p>So if your workloads do have a hard requirement for this, multiple k8s clusters would be the way to go here.</p>
<h4 id="upgrade-charter">Upgrade charter</h4>
<p>One operational aspect which is important to note here is the burden of upgrades. Keeping up with upgrades, with the release cycle of kubernetes is not a trivial task. I have written about our experiences in this thread sometime back, to give you an idea of what really goes into one such upgrade.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">A few notes on <a href="https://twitter.com/kubernetes?ref_src=twsrc%5Etfw">@kubernetes</a> cluster upgrades on GKE (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1285619368353726465?ref_src=twsrc%5Etfw">July 21, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Even on a managed platform provided by the cloud vendors (where they maintain the control plane for you), it&rsquo;s still an operationally heavy task, and we are not even talking about upgrades on clusters self hosted and managed via tools such as <a href="https://github.com/kubernetes/kops/">kops</a>.</p>
<p>Multiply this effort across multiple such clusters, and you will end up needing people dedicated to just doing this. There is no such thing as an <a href="https://en.wikipedia.org/wiki/Long-term_support">LTS</a> release as of now, unless you are fine with running super old kubernetes installations which have CVEs reported against them (and fixed only in upcoming releases), or you are OK with doing big bang upgrades. Neither might be a great idea to begin with. And if you feel that running a super old installation will fly with your cloud provider, you are in for a rude shock: they can literally force upgrade your cluster (yes, they do it).</p>
<p>All the toil which goes into cluster upgrades might have been a factor in the initiation of this discussion <a href="https://github.com/kubernetes/sig-release/issues/1290">here</a>, where folks discuss modifying the kubernetes release cadence. I for one <a href="https://github.com/kubernetes/sig-release/issues/1290#issuecomment-709774428">did vote</a> for moving to 3 releases a year instead of 4. But there are also arguments against slower releases, which might make the next release bloated with features/deprecations, going against the philosophy of small incremental changes; that discussion is for another post though.</p>
<p>If someone ends up with a large cluster, the upgrade is going to be equally operationally heavy, if not more so, than upgrading a bunch of smaller clusters, as the operations to be done remain more or less the same.</p>
<p>What might increase, as a side effect of the cluster being huge, is the time someone has to babysit the whole upgrade operation, compared to what they would spend on a smaller cluster (given that upgrade automation is not present, or not mature enough to take the human out of the loop). Upgrades have to be done one ASG/node pool at a time, and take time even on a smaller cluster; imagine having to do it on a large cluster with hundreds of nodes in each node pool. I am sure there are ways in which someone would have improvised here, but I am yet to come across something like that.</p>
<p>The eventual state is to have automation which does a bunch of domain specific operations, where the human only gets paged/notified once the upgrade is done and dusted, or when some failure occurs which the decision tree wasn&rsquo;t able to resolve itself. The <a href="https://github.com/keikoproj/">Keiko</a> project&rsquo;s <a href="https://github.com/keikoproj/upgrade-manager">upgrade manager</a> is one such tool which comes to mind, apart from <a href="https://github.com/hellofresh/eks-rolling-update">https://github.com/hellofresh/eks-rolling-update</a>.</p>
<p>And if there are non-standardized workloads in the cluster (folks hand applying yamls, no track of what objects are in the cluster), there are various ways an upgrade can either get stuck, for example when the <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/">PodDisruptionBudgets</a> of pods cannot be satisfied while a node gets cycled, or end up causing an outage. This makes the whole operation much saner for a focused cluster running a known set of teams&rsquo; applications.</p>
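<p>As a concrete illustration of how this can bite (names and numbers here are hypothetical): if a deployment runs only 2 replicas but its PodDisruptionBudget demands that 2 stay available, a node drain during the upgrade can never evict those pods, and the rollout stalls.</p>

```yaml
# Hypothetical example: with only 2 replicas and minAvailable: 2, any
# eviction during a node drain would violate the budget, so the drain
# gets stuck until someone intervenes.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: foo-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: foo
```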
<p>This is partly due to the fact that if the cluster is owned by team x, you can at least ask them to be on call with you during the upgrade process, so that they can shadow the person doing the upgrade and monitor the business metrics of the team&rsquo;s products being served out of the cluster being upgraded.</p>
<p>Hence, upgrade strategy is something to consider seriously while choosing the multiple clusters vs single cluster strategy.</p>
<h4 id="api-deprecations">API deprecations</h4>
<p>APIs get deprecated; for example, <a href="https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/">1.16</a> was a release with a bunch of changes which required cluster operators to upgrade their automation/helm charts to be compatible with those changes, if they had plans to upgrade from some version x to v1.16.</p>
<p>There&rsquo;s no one to blame here in the case of API deprecations; an object maturing and getting promoted to a more stable API is a natural progression. To enjoy the benefits of a stable kubernetes object, it only makes sense to move to the stable API rather than stay stuck on one which is less stable or getting deprecated in the next release.</p>
<p>A larger org with bandwidth, managing multiple clusters for teams, will eventually build automation over time to reduce the toil.</p>
<p>But if you&rsquo;re a small org, with a small group of folks managing the k8s clusters, the manual toil will be quite high. The reasons are obvious: the lack of bandwidth means they won&rsquo;t be able to automate the redundant tasks required for the upgrade. Even if they manage to write some automation, it will go stale over time if maintaining it is not prioritised as the domain changes. Plus, an average operations team will also have developer requests coming their way; prioritising all of this along with tasks such as maintaining your k8s clusters is definitely a hard ask to begin with.</p>
<p>Although I have not personally used it, I have heard good things about <a href="https://github.com/doitintl/kube-no-trouble">kube-no-trouble</a>, which takes a stab at flagging deprecated APIs in a cluster.</p>
<h4 id="access-management">Access management</h4>
<p>Giving the right kind of access to the developers/operational folks/xyz person in the team is a necessary problem to solve, unless you are giving admin privileges only to the operational folks or one group of people. Even then, solving the same problem across multiple clusters requires automation and a proper mechanism.</p>
<p>What is to be done when a person leaves or joins? How can access be granted in a granular manner, say RBAC roles which give read access but no delete access? If you have ever been in a position where someone from the team deleted the <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">deployment object</a> of your service and you ended up with an outage, you know. This can happen to anyone, but reducing the blast radius is not a bad idea.</p>
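<p>To make the read-but-no-delete idea concrete, here is a sketch of an RBAC Role (the namespace and names are hypothetical) which lets its subjects view deployments but grants no delete verb:</p>

```yaml
# Hypothetical read-only role: get/list/watch on deployments, nothing else.
# Bind it to a user or group with a RoleBinding in the same namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-foo
  name: deployment-viewer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
```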
<p>If your workloads are sensitive (payments data etc.), you would require access to such environments to be tightly audited and managed.</p>
<p>There are a few tools out there which would help you do so, like <a href="https://github.com/appvia/krane">krane</a> and <a href="https://github.com/liggitt/audit2rbac">audit2rbac</a>, which help in this process (thanks to <a href="https://rmenn.in">rahul</a> for pointing them out to me).</p>
<h4 id="deployment-automation">Deployment automation</h4>
<p>In all honesty, hand applying yaml files will only go so far. It works, but it is a recipe for disaster over the long run, creating more spaghetti in the cluster and causing more problems than it solves. Problems like: who applied resource x, who changed resource x, who is using resource x? And so the list goes on. Multiple clusters or a single large cluster doesn&rsquo;t matter in this context, but what about someone editing something they are not supposed to, in some other team&rsquo;s product?</p>
<p>Namespace as a service for teams is a model I have heard of people following, where all access is tied to specific roles scoped to a particular namespace.</p>
<p>Continuous Delivery is hardly a requirement for anyone (mostly?), but the ability to reliably deploy something is a hard requirement in most cases.</p>
<p>Having some form of CI, to safely modify the respective resources via a tool/process, will prevent a lot of surprises in production. Audit trail logs in the CI pipeline for someone to see what happened, and automated rollbacks too? That would be a sweet spot, I would say. (We have built our deployment platform along similar practices at my current org, but more on that in another post.)</p>
<p>There are multiple tools out there which allow fine grained RBAC rules, CI/CD and progressive delivery to your kubernetes clusters, <a href="https://flagger.app/">flagger</a> and <a href="https://argoproj.github.io/argo-cd/">argo</a> to name a few.</p>
<h4 id="concentration-risk">Concentration risk</h4>
<p>If we look at the flip-side, having one big cluster, concentrates the risk of failure.</p>
<p>If you are not on a vendor managed kubernetes installation, what happens when the control plane goes for a toss? One person having all the context is not scalable; what happens when the person with the operational know-how to fix problem x is not around to handle the pager?</p>
<p>What about zonal failures affecting the cluster? Do we have checks and balances to handle such an event?</p>
<p>What if the <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits">requests and limits</a> for memory and CPU were not set properly for the foo service in the foo namespace, and it ended up hogging more CPU and memory, degrading the baz team&rsquo;s baz service? That&rsquo;s not a great place to be in. Overprovisioning pods for teams/services, then gradually checking the usage trendline over time and setting the requests and limits accordingly, is one way to keep sanity in the cluster and size it accordingly. More on <a href="https://kubernetes.io/docs/concepts/policy/resource-quotas/">resource quotas</a> in another post.</p>
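<p>For reference, requests and limits are set per container in the pod spec; after watching the usage trendline, the sizing might look something like this (the values here are purely illustrative):</p>

```yaml
# Illustrative sizing: requests drive scheduling decisions,
# limits cap how much a container can hog.
containers:
- name: foo
  image: foo:1.0.0
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
```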
<p>Having one cluster per product group is also a model which people follow, helping de-risk the effects of a failure so that they do not spread to other products/product groups. But then the operational problem/complexity of managing multiple clusters arises.</p>
<h4 id="ending-notes">Ending notes</h4>
<p>Either of the two options, one large cluster or multiple smaller clusters, is an opinionated way to run clusters, or for that matter any compute infrastructure. What works best in one context might not work out in another; someone else&rsquo;s best practice might turn into a nightmare for another org/team to manage and run.</p>
<p>Given that most, if not all, of these problems can be solved, the right trade-offs can be made when deciding on a solution, and I hope this discussion helps you make the right decision in your context.</p>
<p>P.S. A good amount of discussion also happened on <a href="https://reddit.com/r/kubernetes">/r/kubernetes</a>, here in <a href="https://www.reddit.com/r/kubernetes/comments/jy85ly/choosing_between_one_big_cluster_or_multiple/">this thread</a></p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">https://kubernetes.io/docs/concepts/services-networking/network-policies/</a></li>
<li><a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/">https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/</a></li>
<li><a href="https://blog.gojekengineering.com/how-we-upgrade-kubernetes-on-gke-91812978a055">https://blog.gojekengineering.com/how-we-upgrade-kubernetes-on-gke-91812978a055</a></li>
<li><a href="https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/">https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/</a></li>
<li><a href="https://github.com/kubernetes/sig-release/issues/1290">https://github.com/kubernetes/sig-release/issues/1290</a></li>
<li><a href="https://github.com/kubernetes-sigs/multi-tenancy">https://github.com/kubernetes-sigs/multi-tenancy</a></li>
<li><a href="https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc">https://github.com/kubernetes-sigs/multi-tenancy/blob/master/incubator/hnc</a></li>
<li><a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits">https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits</a></li>
<li><a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md">https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md</a></li>
</ul>
Testing rake tasks with rspec
2020-10-20T00:00:00+00:00
https://www.tasdikrahman.com/2020/10/20/testing-rake-tasks-with-rspec
<p>This blog post is a continuation of this thread.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">On trying to write a spec for one of the rake tasks, when trying to invoke the same rake tasks within the same <a href="https://twitter.com/rspec?ref_src=twsrc%5Etfw">@rspec</a> contexts, for different flows, weirdly the tests failed if I ran the whole suite, but would pass if I ran them separately.</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1293581788455952384?ref_src=twsrc%5Etfw">August 12, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>So for example</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ./lib/tasks/foo_task.rake</span>
<span class="n">desc</span> <span class="s1">'Foo task'</span>
<span class="n">namespace</span> <span class="ss">:task</span> <span class="k">do</span>
<span class="n">task</span> <span class="ss">:my_task</span><span class="p">,</span> <span class="p">[</span><span class="ss">:foo</span><span class="p">,</span> <span class="ss">:bar</span><span class="p">]</span> <span class="o">=></span> <span class="p">[</span><span class="ss">:baz</span><span class="p">]</span> <span class="k">do</span> <span class="o">|</span><span class="n">task</span><span class="p">,</span> <span class="n">args</span><span class="o">|</span>
<span class="o">...</span>
<span class="c1"># does my_task</span>
<span class="o">...</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Now if we try writing a spec for it</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ./spec/tasks/foo_task_spec.rb</span>
<span class="nb">require</span> <span class="s1">'rails_helper'</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">load_tasks</span>
<span class="n">describe</span> <span class="s2">"task_my_task"</span> <span class="k">do</span>
<span class="n">context</span> <span class="s2">"foo case"</span> <span class="k">do</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg1</span><span class="p">)</span> <span class="p">{</span><span class="s2">"foo"</span><span class="p">}</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg2</span><span class="p">)</span> <span class="p">{</span><span class="s2">"baz"</span><span class="p">}</span>
<span class="n">it</span> <span class="s2">"it does foo behaviour"</span> <span class="k">do</span>
<span class="no">Rake</span><span class="o">::</span><span class="no">Task</span><span class="p">[</span><span class="s2">"task:my_task"</span><span class="p">].</span><span class="nf">invoke</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span> <span class="n">arg2</span><span class="p">)</span>
<span class="c1"># assert the expected behaviour here related for foo case</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">context</span> <span class="s2">"baz case"</span> <span class="k">do</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg1</span><span class="p">)</span> <span class="p">{</span><span class="s2">"bazbee"</span><span class="p">}</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg2</span><span class="p">)</span> <span class="p">{</span><span class="s2">"foobee"</span><span class="p">}</span>
<span class="n">it</span> <span class="s2">"it does baz behaviour"</span> <span class="k">do</span>
<span class="no">Rake</span><span class="o">::</span><span class="no">Task</span><span class="p">[</span><span class="s2">"task:my_task"</span><span class="p">].</span><span class="nf">invoke</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span> <span class="n">arg2</span><span class="p">)</span>
<span class="c1"># assert the expected behaviour here related for baz case</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>If we run the specs for the specific contexts where the rake task is invoked one at a time, all works well; but when we run the specs for all the contexts in the test file together, the first rake task invocation will run, and the rest of them will start failing.</p>
<p>Which is confusing. It turns out that a rake task can be invoked only once in a given process; subsequent calls to <code class="language-plaintext highlighter-rouge">invoke</code> are no-ops. I am not sure of the history behind this or the reasoning why this is the case; I couldn&rsquo;t find it. (Let me know if you were able to find it, I would be happy to learn its history.)</p>
<h3 id="how-to-make-it-work">How to make it work</h3>
<p>To work around this, the task needs to be explicitly re-enabled before the next spec is run.</p>
<p>Adding an <code class="language-plaintext highlighter-rouge">after(:each)</code> block works for starters: inside it, re-enable the task you are testing in your spec, which means that after each spec which exercises the task, this routine is called. So something like</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ./spec/tasks/foo_task_spec.rb</span>
<span class="nb">require</span> <span class="s1">'rails_helper'</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">load_tasks</span>
<span class="n">describe</span> <span class="s2">"task_my_task"</span> <span class="k">do</span>
<span class="n">after</span><span class="p">(</span><span class="ss">:each</span><span class="p">)</span> <span class="k">do</span>
<span class="no">Rake</span><span class="o">::</span><span class="no">Task</span><span class="p">[</span><span class="s2">"task:my_task"</span><span class="p">].</span><span class="nf">reenable</span>
<span class="k">end</span>
<span class="n">context</span> <span class="s2">"foo case"</span> <span class="k">do</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg1</span><span class="p">)</span> <span class="p">{</span><span class="s2">"foo"</span><span class="p">}</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg2</span><span class="p">)</span> <span class="p">{</span><span class="s2">"baz"</span><span class="p">}</span>
<span class="n">it</span> <span class="s2">"it does foo behaviour"</span> <span class="k">do</span>
<span class="no">Rake</span><span class="o">::</span><span class="no">Task</span><span class="p">[</span><span class="s2">"task:my_task"</span><span class="p">].</span><span class="nf">invoke</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span> <span class="n">arg2</span><span class="p">)</span>
<span class="c1"># assert the expected behaviour here related for foo case</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">context</span> <span class="s2">"baz case"</span> <span class="k">do</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg1</span><span class="p">)</span> <span class="p">{</span><span class="s2">"bazbee"</span><span class="p">}</span>
<span class="n">let</span><span class="p">(</span><span class="ss">:arg2</span><span class="p">)</span> <span class="p">{</span><span class="s2">"foobee"</span><span class="p">}</span>
<span class="n">it</span> <span class="s2">"it does baz behaviour"</span> <span class="k">do</span>
<span class="no">Rake</span><span class="o">::</span><span class="no">Task</span><span class="p">[</span><span class="s2">"task:my_task"</span><span class="p">].</span><span class="nf">invoke</span><span class="p">(</span><span class="n">arg1</span><span class="p">,</span> <span class="n">arg2</span><span class="p">)</span>
<span class="c1"># assert the expected behaviour here related for baz case</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">reenable</code> first resets the task’s <code class="language-plaintext highlighter-rouge">already_invoked</code> state, allowing the task to then be executed again with all its dependencies.</p>
<p>There are <code class="language-plaintext highlighter-rouge">execute</code> and <code class="language-plaintext highlighter-rouge">invoke</code> methods too, which are mentioned in this <a href="https://stackoverflow.com/a/32382929">SO answer</a>. I had no particular reason for going with <code class="language-plaintext highlighter-rouge">reenable</code> over them.</p>
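<p>The effect of <code class="language-plaintext highlighter-rouge">reenable</code> is easy to see outside rspec, with plain rake. A minimal sketch (the <code class="language-plaintext highlighter-rouge">greet</code> task name is made up for illustration): the second <code class="language-plaintext highlighter-rouge">invoke</code> is a no-op until the task is re-enabled.</p>

```ruby
require 'rake'

# Track how many times the task body actually runs.
run_count = 0

# Define a throwaway task; "greet" is a hypothetical name for this sketch.
Rake::Task.define_task(:greet) { run_count += 1 }

Rake::Task['greet'].invoke # runs the task body
Rake::Task['greet'].invoke # no-op: already_invoked is now true
puts run_count             # 1

# reenable resets already_invoked, so the next invoke runs the body again.
Rake::Task['greet'].reenable
Rake::Task['greet'].invoke
puts run_count             # 2
```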
<p>Running all the specs would work now.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://relishapp.com/rspec/rspec-core/v/2-2/docs/hooks/before-and-after-hooks">https://relishapp.com/rspec/rspec-core/v/2-2/docs/hooks/before-and-after-hooks</a></li>
<li><a href="https://ruby-doc.org/stdlib-2.0.0/libdoc/rake/rdoc/Rake/Task.html#method-i-reenable">https://ruby-doc.org/stdlib-2.0.0/libdoc/rake/rdoc/Rake/Task.html#method-i-reenable</a></li>
<li><a href="https://stackoverflow.com/questions/32381017/running-rake-tasks-in-rspec-multiple-times-returns-nil">https://stackoverflow.com/questions/32381017/running-rake-tasks-in-rspec-multiple-times-returns-nil</a></li>
</ul>
<h3 id="credits">Credits</h3>
<ul>
<li>Picture credits to <a href="https://rspec.info/">https://rspec.info/</a></li>
</ul>
A few things about database migrations
2020-10-18T00:00:00+00:00
https://www.tasdikrahman.com/2020/10/18/a-few-things-about-database-migrations
<p>This blog post is a continuation of these two threads.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">A few things about database schema changes. (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1317490663613624320?ref_src=twsrc%5Etfw">October 17, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">This is where <a href="https://twitter.com/rails?ref_src=twsrc%5Etfw">@rails</a> active record migrations really shine. I find it's UX super clean. (1/n)<a href="https://t.co/vA6Jb345yc">https://t.co/vA6Jb345yc</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1317695102924451840?ref_src=twsrc%5Etfw">October 18, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The schema of your relational database will change over time as your application evolves. Introducing these changes from the dev setup -> integration/UAT -> production env in a clear, consistent and repeatable manner definitely adds value, helping you keep every environment reproducible.</p>
<h2 id="ways-to-introduce-changes-to-your-database">Ways to introduce changes to your database</h2>
<p>If someone introduces schema changes manually in these environments, tracking those changes and reproducing them becomes a chore.</p>
<p>What about auditing what ran, what was changed, and who introduced it? What happens when a schema change you were not aware of was made in the integration env (fill in any other environment here for that matter), is making your current build fail, and you have already wasted time trying to debug it?</p>
<p>The time lost debugging such issues could be utilized somewhere else. The other thing to note here is that using a single database for the integration environment, where multiple developers deploy their codebases and introduce changes to the schema, is one way or the other gonna bite the team down the line.</p>
<p>The ability to recreate the structure of your database consistently across local and dev environments will let you iterate faster. Having all the DDL and DML scripts, with a sequence, checked into your VCS allows you to at least track what has been run.</p>
<h2 id="tracking-how-your-schema-is-evolving">Tracking how your schema is evolving</h2>
<p>But now, how do you track which DDL/DML scripts ran for a particular environment and its database?</p>
<p>A very simple implementation to solve this is described <a href="http://blog.cherouvim.com/a-table-that-should-exist-in-all-projects-with-a-database/">here</a>: you have a table specifically to track this, which records the sequence number of the DDL/DML script which last ran, the time it was run, and a small description of what it does.</p>
<p>And whenever someone runs any DDL/DML command, they also insert a row into this table, which helps keep track of the last known status of the database for that environment.</p>
<p>There are much more mature & well tested ways to do this, like Active Record migrations in <a href="https://rubyonrails.org">rails</a>, <a href="https://flywaydb.org/">Flyway</a> in the Java world, <a href="https://github.com/golang-migrate/migrate/">golang-migrate</a> in golang, et al. The root of it is having a way to know what ran where & a repeatable way to set up the schema of your database across different environments.</p>
<p>This is where rails Active Record migrations really shine. I find its UX super clean.</p>
<p>All migration files, along with the schema of your database, are checked into the VCS with your business logic, each prefixed with a UTC timestamp; that timestamp is what gets recorded in the <code class="language-plaintext highlighter-rouge">schema_migrations</code> table, which Active Record maintains inside your database.</p>
<p>The idea is the same. Track the status of which DDL/DML scripts have run for your application.</p>
<p>Want to see what’s the status of your database and what has run on it? <code class="language-plaintext highlighter-rouge">rails db:migrate:status</code> lists all the migration scripts, along with whether each one has already been applied or not.</p>
<p>Want to roll back migrations which were run on your database? <code class="language-plaintext highlighter-rouge">rails db:rollback STEP=n</code> rolls back the last n migrations which have already been applied, n being any integer number of migration files you want to roll back by.</p>
<p>Want to redo a migration which was rolled back, or run it again? <code class="language-plaintext highlighter-rouge">rails db:migrate:redo VERSION=<UTC-timestamp-prefix-of-migration-file></code>.</p>
<p>A similar approach can be found in <a href="https://github.com/golang-migrate/migrate">golang-migrate</a>, where each schema change is introduced with an SQL script numbered sequentially, starting from 1 for example (the version can be any 64-bit unsigned integer), with filenames ending in a suffix of up.sql and down.sql.</p>
<p>Each new schema change is added in a new SQL file, numbered accordingly. Users then wire the helper methods provided by golang-migrate into the cli interface of their app, both to apply the migrations and to roll them back.</p>
<p>From both examples we can infer that the central theme is the same: keep track of what ran and what has not, along with a way, either via timestamps or via a simple counter, to keep track of the order of the migration scripts.</p>
<p>Another thing to notice here is that both of them encourage you/give you mechanisms to write each schema change such that it is possible to reverse it back to the previous state.</p>
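<p>To make that central theme concrete, here is a minimal sketch in plain Ruby of the bookkeeping such tools do: an ordered list of versioned migrations, a record of which versions have been applied, and a way to roll back. The <code class="language-plaintext highlighter-rouge">Migrator</code> class and the migration versions are made up for illustration; real tools persist the applied versions in a table like <code class="language-plaintext highlighter-rouge">schema_migrations</code> instead of in memory.</p>

```ruby
# A hypothetical in-memory migrator, sketching what rails/Flyway/
# golang-migrate track against a schema_migrations-style table.
Migration = Struct.new(:version, :up, :down)

class Migrator
  def initialize(migrations)
    # Keep migrations ordered by version, like timestamp/sequence prefixes.
    @migrations = migrations.sort_by(&:version)
    @applied = [] # stand-in for the schema_migrations table
  end

  def migrate
    @migrations.each do |m|
      next if @applied.include?(m.version) # already ran, skip it
      m.up.call
      @applied << m.version # record what ran, and in which order
    end
  end

  def rollback(steps = 1)
    steps.times do
      version = @applied.pop or break # most recent first
      @migrations.find { |mig| mig.version == version }.down.call
    end
  end

  # Like rails db:migrate:status: every known version plus its state.
  def status
    @migrations.map { |m| [m.version, @applied.include?(m.version) ? 'up' : 'down'] }
  end
end

# Usage: two fake migrations mutating a "schema" hash instead of a real DB.
schema = {}
migrator = Migrator.new([
  Migration.new(20_201_018, -> { schema[:users] = true }, -> { schema.delete(:users) }),
  Migration.new(20_201_019, -> { schema[:posts] = true }, -> { schema.delete(:posts) })
])
migrator.migrate
migrator.rollback(1) # undoes :posts, like rails db:rollback STEP=1
```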
<h2 id="constraints-which-can-be-put-on-the-database">Constraints which can be put on the database</h2>
<p>Rails allows one to introduce validations on models before persisting the object to the database. But it’s also important to have the same validation, wherever you can, on your schema itself, letting both the ORM and the database enforce it.</p>
<p>The model.<a href="https://apidock.com/rails/ActiveRecord/Persistence/update">update</a> method does this for you. It runs the validations and the required callbacks, and sets updated_at/updated_on, whenever you are trying to update the attributes of the object.</p>
<p>This immediately gives you the 1st-order check to prevent inserting something you shouldn’t have, with the 2nd-order check being in your database schema itself, present as your final guard against dirty entries. While I don’t remember anyone stating it as a rule of thumb to have both, it’s not uncommon to stumble upon this either. You may argue that it goes against <a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself">DRY</a>, but I feel having the final validation in the schema of the database acts as the final source of truth you can always fall back on, and there’s no downside to it.</p>
<p>A very simple example of this: when you put a check on a column in your database to never be a null value, even if your model has a validation protecting against inserting null, it can be backed by a database constraint at the same time, so the database enforces this at the end too, along with the ORM.</p>
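<p>The two layers of defence can be sketched in plain Ruby. This is a hypothetical illustration, not Active Record itself: <code class="language-plaintext highlighter-rouge">FakeTable</code> stands in for the database with its NOT NULL constraint, and <code class="language-plaintext highlighter-rouge">User</code> stands in for the model with its ORM-level validation.</p>

```ruby
class NotNullViolation < StandardError; end

# Stand-in for the database: the schema declares :email as NOT NULL.
class FakeTable
  NOT_NULL_COLUMNS = [:email].freeze

  def insert(row)
    NOT_NULL_COLUMNS.each do |col|
      raise NotNullViolation, "#{col} may not be null" if row[col].nil?
    end
    row
  end
end

# Stand-in for the model, with an ORM-style validation.
class User
  attr_reader :email

  def initialize(email)
    @email = email
  end

  def valid?
    !email.nil?
  end

  def save(table)
    return false unless valid?  # 1st-order check: the model validation
    table.insert(email: email)  # 2nd-order check: the DB constraint
    true
  end
end

table = FakeTable.new
puts User.new(nil).save(table)     # false, caught by the validation
puts User.new('a@b.c').save(table) # true

# Even if a code path skips the validations entirely,
# the schema still refuses the bad row:
begin
  table.insert(email: nil)
rescue NotNullViolation
  # the database acted as the final guard
end
```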
<p>There are also methods like <a href="https://apidock.com/rails/ActiveRecord/Persistence/update_column">update_column</a>, which will straight up write the attribute to the database, skipping all the validations, callbacks etc. Don’t use it in your DML script, as part of a migration file, unless you have a good reason to.</p>
<h2 id="rolling-out-migrations-for-your-applications">Rolling out migrations for your applications</h2>
<p>How do you roll out migration changes to the end application? Since the state changes along with the schema, I feel it is similar to doing a new deployment of your application.</p>
<p>For example, if you are dropping a column from a table, the code changes should go first: remove every reference to that column, then deploy. Do your e2e tests, check that everything is working, and then, in the next deployment, drop the column.</p>
<p>As with all changes, I feel iterative and small changes are always easier to make sense of, or to debug if an issue arises, given the changelog will be easier to grok through and pin-point what could have introduced the issue.</p>
<p>As I learn more on this, I would love to hear about resources to read on this topic, and the processes/techniques you folks follow which have worked well for you.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://web.archive.org/web/20111115150123/http://blog.stelligent.com:80/integrate-button/2010/07/database-integration-in-your-build-scripts.html">https://web.archive.org/web/20111115150123/http://blog.stelligent.com:80/integrate-button/2010/07/database-integration-in-your-build-scripts.html</a></li>
<li><a href="https://odetocode.com/blogs/scott/archive/2008/01/30/three-rules-for-database-work.aspx">https://odetocode.com/blogs/scott/archive/2008/01/30/three-rules-for-database-work.aspx</a></li>
<li><a href="https://odetocode.com/blogs/scott/archive/2008/01/31/versioning-databases-the-baseline.aspx">https://odetocode.com/blogs/scott/archive/2008/01/31/versioning-databases-the-baseline.aspx</a></li>
<li><a href="http://blog.cherouvim.com/a-table-that-should-exist-in-all-projects-with-a-database/">http://blog.cherouvim.com/a-table-that-should-exist-in-all-projects-with-a-database/</a></li>
<li><a href="https://dba.stackexchange.com/questions/2/how-can-a-group-track-database-schema-changes">https://dba.stackexchange.com/questions/2/how-can-a-group-track-database-schema-changes</a></li>
<li><a href="https://github.com/golang-migrate/migrate/blob/master/MIGRATIONS.md">https://github.com/golang-migrate/migrate/blob/master/MIGRATIONS.md</a></li>
<li><a href="https://edgeguides.rubyonrails.org/active_record_migrations.html">https://edgeguides.rubyonrails.org/active_record_migrations.html</a></li>
<li><a href="https://www.martinfowler.com/articles/evodb.html">https://www.martinfowler.com/articles/evodb.html</a></li>
<li><a href="https://thoughtbot.com/blog/validation-database-constraint-or-both">https://thoughtbot.com/blog/validation-database-constraint-or-both</a></li>
</ul>
<h3 id="credits">Credits</h3>
<ul>
<li>Cover image credits to Shripal Dapthary. <a href="https://unsplash.com/photos/3BbEYiIV7bo">Source</a></li>
</ul>
The making of bhola - your cert expiration overseer - Part 1
2020-10-08T00:00:00+00:00
https://www.tasdikrahman.com/2020/10/08/the-making-of-bhola-your-cert-expiration-overseer-part-1
<p>You might have seen me writing a bit about bhola already on <a href="https://twitter.com/tasdikrahman">twitter</a>, where I wrote a little about why I have been building <a href="https://github.com/tasdikrahman/bhola">bhola</a>. This post is more of a continuation of that tweet and of what I envision it to be moving forward.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Do you sometimes wake up, with a call by someone from your team, telling you some SSL cert has expired? Do you keep track of SSL cert expirations on your to do notes or excel sheets? Would you like to be on top of such x509 cert renewals? <a href="https://t.co/MVFRZCUlZN">https://t.co/MVFRZCUlZN</a> is for you (1/n) <a href="https://t.co/pj8JHJEkje">pic.twitter.com/pj8JHJEkje</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1306945863369936896?ref_src=twsrc%5Etfw">September 18, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="what-was-the-inspiration">What was the inspiration</h2>
<p>Do you sometimes wake up, with a call by someone from your team, telling you some SSL cert has expired? Do you keep track of SSL cert expirations on your to do notes or excel sheets? Would you like to be on top of such x509 cert renewals?</p>
<p>All of this is directly due to</p>
<ul>
<li>No visibility about when the certificate is expiring</li>
<li>No alerts in the form of an email or text message when the certificate is expiring</li>
</ul>
<p>Then <a href="https://github.com/tasdikrahman/bhola">bhola</a> is for you!</p>
<h2 id="what-does-bhola-do">What does bhola do?</h2>
<p><a href="https://github.com/tasdikrahman/bhola/releases/tag/v0.1.0">v0.1</a> of bhola gives you a dead simple API which you can use to ask bhola to track domains which have certs attached. It automatically checks for cert expiration in the background, keeping note of when each cert is expiring.</p>
<p>The operator can set a buffer period, which bhola then uses to check whether a cert is within the threshold number of days of expiring, and if so, marks it as needing renewal asap.</p>
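<p>The buffer-period check itself boils down to very little code. A minimal sketch in plain Ruby (the method name and the 30-day default are made up for illustration, not bhola’s actual implementation):</p>

```ruby
require 'time'

# Hypothetical sketch of a buffer-period check: a cert "needs renewal"
# once its expiry falls within the operator-configured buffer window.
def certificate_expiring?(not_after, buffer_days: 30)
  seconds_in_day = 24 * 60 * 60
  Time.now + (buffer_days * seconds_in_day) >= not_after
end

# A cert expiring 10 days from now is inside a 30-day buffer...
soon = Time.now + (10 * 24 * 60 * 60)
puts certificate_expiring?(soon, buffer_days: 30)  # true

# ...while one expiring in 90 days is not.
later = Time.now + (90 * 24 * 60 * 60)
puts certificate_expiring?(later, buffer_days: 30) # false
```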
<p>Want to check what domains it is already tracking? bhola comes with a very simple, bare minimum UI which one can use to see what domains are being tracked, along with other metadata details, like who the issuer of the cert is, when it was issued, and when it is expiring.</p>
<p>What does bhola need to run? Just a good old postgres to keep track of the domains, and that’s it. Nothing shiny. Plain and simple.</p>
<p>Further on, <a href="https://github.com/tasdikrahman/bhola/releases/tag/v0.2.0">v0.2</a> of bhola adds support for sending notifications to slack, as part of evolving from just being a dashboard into something which can preemptively alert the operator about an expiring certificate.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">It will alert for all the domains, which have already expired/are about to expire within the buffer period which you have set & send notification to your slack channel via webhook endpoint, periodically checking in the interval set by the operator, for expiration. (2/n) <a href="https://t.co/IdpDGxJQtr">pic.twitter.com/IdpDGxJQtr</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1312414866133512192?ref_src=twsrc%5Etfw">October 3, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Furthermore, smaller improvements like a 1-step dev setup, a docker-compose setup and container images available for docker make reproducing the setup for bhola easier than it was as of milestone 0.1.</p>
<p>Not that it matters much, but <a href="https://rubyonrails.org/">rails</a> has been a fun framework to work on, especially with <a href="https://rspec.info/">rspec</a>, practicing <a href="https://en.wikipedia.org/wiki/Behavior-driven_development">BDD</a> and <a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a> has been a great experience.</p>
<h3 id="how-does-one-even-insert-a-domain-to-be-tracked-as-of-now">How does one even insert a domain to be tracked as of now?</h3>
<h5 id="example-request">Example request</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl --location --request POST 'localhost:3000/api/v1/domains' \
--header 'Content-Type: application/json' \
--data-raw '{
"fqdn": "https://expired.badssl.com"
}'
</code></pre></div></div>
<h5 id="example-response">Example response</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"data": {
"fqdn": "expired.badssl.com",
"certificate_expiring": true,
"certificate_issued_at": "2016-08-08T21:17:05.000Z",
"certificate_expiring_at": "2018-08-08T21:17:05.000Z",
"certificate_issuer": "/C=US/ST=California/L=San Francisco/O=BadSSL/CN=BadSSL Intermediate Certificate Authority"
},
"errors": []
}
</code></pre></div></div>
<h4 id="querying-the-domains-stored">Querying the domains stored</h4>
<h5 id="example-request-1">Example request</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl --location --request GET 'localhost:3000/api/v1/domains' \
--header 'Accept: application/json'
</code></pre></div></div>
<h5 id="example-response-1">Example response</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"data": [
{
"fqdn": "www.tasdikrahman.com",
"certificate_expiring": false,
"certificate_issued_at": "2020-05-06T00:00:00.000Z",
"certificate_expiring_at": "2022-04-14T12:00:00.000Z",
"certificate_issuer": "/C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert SHA2 High Assurance Server CA"
},
{
"fqdn": "expired.badssl.com",
"certificate_expiring": true,
"certificate_issued_at": "2016-08-08T21:17:05.000Z",
"certificate_expiring_at": "2018-08-08T21:17:05.000Z",
"certificate_issuer": "/C=US/ST=California/L=San Francisco/O=BadSSL/CN=BadSSL Intermediate Certificate Authority"
}
],
"errors": []
}
</code></pre></div></div>
<h3 id="are-there-other-ways-to-do-such-domain-expiration-checks">Are there other ways to do such domain expiration checks?</h3>
<p>Yes, there are other ways to do this.</p>
<p>If you are on <a href="https://github.com/jetstack/cert-manager/">cert-manager</a>, then you can make use of <a href="https://grafana.com/grafana/dashboards/11001">https://grafana.com/grafana/dashboards/11001</a>, adding alerts on top of the dashboard should not be very complicated.</p>
<p>There is an exporter, <a href="https://github.com/ribbybibby/ssl_exporter">https://github.com/ribbybibby/ssl_exporter</a>, which will do the scraping for you and expose the expiration date of the cert as the metric <code class="language-plaintext highlighter-rouge">ssl_cert_not_after</code>, on top of which you can then add an alert.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scrape_configs:
  - job_name: "ssl"
    metrics_path: /probe
    static_configs:
      - targets:
          - example.com:443
          - prometheus.io:443
    # rewrite each target into the probe's ?target= param and point
    # the scrape at the ssl_exporter itself (default port 9219)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9219
</code></pre></div></div>
<p>Where <code class="language-plaintext highlighter-rouge">example.com</code> and <code class="language-plaintext highlighter-rouge">prometheus.io</code> would be your probe targets, in this example.</p>
<p>Thanks to <a href="https://twitter.com/hashfyre/">Joy</a>, for pointing the above out to me.</p>
<p>Then there is a <a href="https://www.robustperception.io/get-alerted-before-your-ssl-certificates-expire">guide</a> on how to do this with prometheus, via the <a href="https://github.com/prometheus/blackbox_exporter">blackbox exporter</a>.</p>
<p>And the good old <code class="language-plaintext highlighter-rouge">openssl</code> command will always be there:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> | openssl s_client <span class="nt">-servername</span> www.tasdikrahman.com <span class="nt">-connect</span> www.tasdikrahman.com:443 2>/dev/null | openssl x509 <span class="nt">-noout</span> <span class="nt">-dates</span>
<span class="nv">notBefore</span><span class="o">=</span>Aug 19 13:59:14 2020 GMT
<span class="nv">notAfter</span><span class="o">=</span>Nov 17 13:59:14 2020 GMT
</code></pre></div></div>
<h3 id="how-does-me-running-bhola-help-then">How does me running bhola help then?</h3>
<p>The above tools will work, no doubt about it; if you are already on these systems, you can definitely leverage them in ways similar to those shown above. But even then, redundancy is never a bad thing to have.</p>
<p>Adding to it, one plus which bhola has is the validation it runs before tracking an endpoint: it will not track endpoints which are invalid or don’t have certs attached, which helps in keeping the entries sane.</p>
<p>If you are using <a href="https://letsencrypt.org/">letsencrypt</a>, given that the certificates expire within <a href="https://community.letsencrypt.org/t/pros-and-cons-of-90-day-certificate-lifetimes/4621">3 months</a> and that they also send <a href="https://letsencrypt.org/docs/expiration-emails/">email notifications</a>, you will likely have some automation in place to renew your certificates. Having bhola as an external system monitoring your domains would be an extra guard against that automation failing silently or the emails getting missed.</p>
<p>Furthermore, bhola panders more to the userbase who are just in search of something they can run to start tracking and getting alerts for their domains, rather than tinkering with tools they may be unfamiliar with. This reduces the friction in prioritizing alerting on domain expirations, compared to first having to run <a href="https://github.com/prometheus/prometheus">prometheus</a> (if they aren’t already), and it suits those who don’t yet have the right level of automation maturity for their certificate renewals. If you/your org are already at a level where the alerting and cert renewal part is done and dusted via one of the ways described above, or via some other way, then, if I may say so, you are in a minority and not the norm.</p>
<p>Albeit there is an overhead with bhola, which is running a rails service and a postgres db. The decision then rests with the operators, on what kind of solution they would be comfortable maintaining. People should always have multiple options for such tools to pick from, depending on their comfort level and the amount of overhead they want in their system.</p>
<h3 id="assumptions-made-by-bhola">Assumptions made by bhola</h3>
<ul>
<li>bhola assumes that the dns name being inserted resolves to a single IP; in case you are doing dns loadbalancing on a single FQDN with multiple IPs behind it, it may try connecting to whichever IP gets returned first.</li>
<li>bhola will not register a domain to be tracked if it can’t reach it. It would be apt to place bhola somewhere in your network from where it can resolve your dns endpoints with ease; in case the domains you are trying to track resolve to a private IP, make sure bhola can reach them.</li>
<li>bhola will not register a domain to be tracked if it doesn’t have an SSL cert attached.</li>
</ul>
<h3 id="what-bhola-will-not-be">What bhola will not be</h3>
<ul>
<li>will not generate certificates for you by being the intermediate broker</li>
<li>will not install the certificates for its clients</li>
<li>will not provide a UI to generate/install/replace the certs for its clients</li>
</ul>
<h3 id="whats-next">What’s next?</h3>
<p>I envision bhola to be a 1-stop service for tracking your domain expirations, for starters, and there are quite a few things which I want to see in it in the future. Some of them being</p>
<ul>
<li>Ability for it to associate domains and alerts with users, this will allow bhola to be multi-tenant.
<ul>
<li>The idea is to have a system in place where, if someone wants to enable this feature, it can be turned on by just enabling a feature flag when they start the webserver.</li>
<li>while accessing the api to do insertion for tracking domains, add authz/authn.</li>
</ul>
</li>
<li>Ability to delete domains associated with a user.</li>
<li>Ability to sign up using email id.</li>
<li>Ability to send alert notifications of all the domains associated to the user in their specified email id.</li>
<li>Not an immediate goal, but I want to host this on my own infrastructure, as a public facing endpoint.</li>
</ul>
<p>While I have not spread the above into specific milestones, I would mostly pick up the first one for milestone 0.3.</p>
<p>As bhola is completely open source, would love to hear what you feel can be added to make <a href="https://github.com/tasdikrahman/bhola">bhola</a> better than before.</p>
Why I chose to do TDD for my new side project
2020-10-07T00:00:00+00:00
https://www.tasdikrahman.com/2020/10/07/why-I-chose-to-do-tdd-for-my-side-project
<p>This post is more of a continuation to this tweet</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">One thing which I tried doing differently this time with one of my side projects is to do TDD from the start. Someone may ask why? It's just a side project no? (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1312622418230296576?ref_src=twsrc%5Etfw">October 4, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I have been building <a href="https://github.com/tasdikrahman/bhola">bhola</a> in my free time, and one thing which I tried doing differently this time with it, was to practice <a href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a> from the start.</p>
<h2 id="but-why">But why?</h2>
<p>Someone may ask why? It’s just a side project, no? True, yes, it is. But let me explain why I tried this out.</p>
<p>One reason is that, for some of my past side projects, when someone created an issue or submitted a PR, I wouldn’t necessarily remember everything I did, or why I did x instead of y, back when I authored it (more on how this can be improved later).</p>
<p>Taking the liberty to quote <a href="https://twitter.com/AjeyGore">Ajey</a>.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">What ever code you write, it will be out of context in 18 months, write tests along with it, so at least people know what you meant</p>— Ajey Gore (@AjeyGore) <a href="https://twitter.com/AjeyGore/status/865555853423673344?ref_src=twsrc%5Etfw">May 19, 2017</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Coming back to, say, reviewing a bugfix/feature PR. Having no coverage for the specific routines which were modified means I either have to rely on my gut feeling, or test the change by pulling it down.</p>
<p>This in turn does two things. For one, it creates a form of resistance: to even review the PR, I would have to manually test things and check that the changes don’t introduce any regression and that the feature works as expected. Which means I get swamped by everything needed to just review something, the requests pile up one by one, and I end up in a position where multiple stale PR’s are just lying there. (If you have ever experienced this with any of my repositories, I sincerely apologise, I will strive to be better.)</p>
<p>The 2nd thing, a by-product of this, is that I am doing the testing for these changes manually, which means I spend, say, x amount of time on it which could have been used for something else.</p>
<p>This x amount of time varies wildly, depending on many factors. Some being: how good are the docs, which would allow one to replicate the setup quickly (another reason why I really love 1-step dev setup commands)? Familiarity with the codebase, so as to remember all the cases, corner cases included, so that you don’t miss them.</p>
<p>It’s natural for someone to not remember minute details of the codebase, when they are looking at it again after weeks/months/years. Naturally, they will need some time to again get acclimatized to the codebase which they had interacted/authored.</p>
<p>This is where tests for the routines bring in value. It’s your 1st level of safety net which you have spread out to weed out changes which would break your expected flow/behaviour.</p>
<h2 id="will-this-solve-all-my-problems">Will this solve all my problems?</h2>
<p>As luck would have it, I have an example from this very side project, where the coverage was high and covered the specific flow, but the code would ultimately fail when run!</p>
<p>There’s one specific flow where the service reads an env var via <a href="https://github.com/laserlemon/figaro">Figaro</a>. The value is a plain boolean of <code class="language-plaintext highlighter-rouge">true</code>. Now, to simulate this in the spec, what I did was simply stub the call to the method of the figaro lib to return the value I wanted for the flow. The problem being that I was stubbing the wrong type for the value! When Figaro reads an env var, it reads it as a string rather than a boolean (or any other type for that matter; everything is read as a plain string), which is where I was going wrong. This in turn also affected the way the implementation would happen.</p>
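<p>The underlying gotcha is not specific to Figaro: the process environment only holds strings, so a boolean-looking env var has to be compared against the string <code class="language-plaintext highlighter-rouge">'true'</code>. A quick plain-Ruby illustration (the env var name here is just an example):</p>

```ruby
# Environment variables are always strings, even when they "look" boolean.
ENV['SEND_EXPIRY_NOTIFICATIONS_TO_SLACK'] = 'true'

value = ENV['SEND_EXPIRY_NOTIFICATIONS_TO_SLACK']
puts value.class     # String, never TrueClass

# Comparing against the boolean silently never matches...
puts value == true   # false
# ...so the check has to compare against the string form.
puts value == 'true' # true
```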
<p>Here’s a small snippet from the changelog of <a href="https://github.com/tasdikrahman/bhola/pull/65">https://github.com/tasdikrahman/bhola/pull/65</a> for reference, to give you an idea of what I am describing and what I changed to fix it, in both the spec and the implementation.</p>
<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/app/jobs/check_certificate_job.rb b/app/jobs/check_certificate_job.rb
index 93fb16b..76f995b 100644
</span><span class="gi">+++ b/app/jobs/check_certificate_job.rb
</span><span class="gd">--- a/app/jobs/check_certificate_job.rb
</span><span class="p">@@ -10,7 +10,7 @@</span> class CheckCertificateJob < ApplicationJob
Domain.all.each do |domain|
if domain.certificate_expiring?
Rails.logger.info("#{domain.fqdn} is expiring within the buffer period")
<span class="gi">+ if (Figaro.env.send_expiry_notifications_to_slack == 'true') && !Figaro.env.slack_webhook_url.empty?
</span><span class="gd">- if (Figaro.env.send_expiry_notifications_to_slack == true) && !Figaro.env.slack_webhook_url.empty?
</span> message = "Your #{domain.fqdn} is expiring at #{domain.certificate_expiring_not_before}, please renew your cert"
slack_notifier = SlackNotifier.new(Figaro.env.slack_webhook_url)
begin
<span class="gh">diff --git a/spec/jobs/check_certificate_job_spec.rb b/spec/jobs/check_certificate_job_spec.rb
index f4fa49b..71c40c8 100644
</span><span class="gi">+++ b/spec/jobs/check_certificate_job_spec.rb
</span><span class="gd">--- a/spec/jobs/check_certificate_job_spec.rb
</span><span class="p">@@ -48,7 +48,7 @@</span> RSpec.describe CheckCertificateJob, type: :job do
it 'will not call SlackNotifier#notify' do
allow_any_instance_of(Domain).to receive(:certificate_expiring?).and_return(true)
<span class="gi">+ allow(Figaro).to receive_message_chain(:env, :send_expiry_notifications_to_slack).and_return('false')
</span><span class="gd">- allow(Figaro).to receive_message_chain(:env, :send_expiry_notifications_to_slack).and_return(false)
</span> allow(Figaro).to receive_message_chain(:env, :slack_webhook_url).and_return(slack_webhook_url)
expect_any_instance_of(SlackNotifier).not_to receive(:notify).with(anything)
</code></pre></div></div>
<p>So as you can see, following the above practices does not necessarily mean you will create bug-free software.</p>
<p>Bhola had <a href="https://github.com/tasdikrahman/bhola/pull/65/checks?check_run_id=1203022679">~99.63%</a> coverage at the time this bug was present in it, but it didn’t stop it from having this bug.</p>
<p>100% code coverage doesn’t mean that your software is bug free/free of issues. The only real test is when your software is getting used by someone. That is where it should behave/perform as expected of it. There’s <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">no silver bullet</a>.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Being proud of 100% test coverage is like being proud of reading every word in the newspaper. Some are more important than others.</p>— Kent Beck (@KentBeck) <a href="https://twitter.com/KentBeck/status/812703192437981184?ref_src=twsrc%5Etfw">December 24, 2016</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="so-whats-the-use-then">So what’s the use then?</h2>
<p>Having high coverage does mean that you can refactor without fear, and have a faster feedback cycle than before, i.e. than testing the changes manually.</p>
<p>The 2nd level of safety net can be end-to-end integration tests for your codebase, which would run with each commit, the same way your unit tests would run with each commit.</p>
<p>The value of these two safety nets is that you will be able to ship with more confidence, compared to not having them at all.</p>
<h2 id="why-i-chose-to-do-tdd-here">Why I chose to do TDD here?</h2>
<p>There’s a lot of literature around this, but personally I feel it allows me to think in terms of the contract and how a routine should behave, as the behaviour is what we really want to test for a routine, rather than its exact mechanics.</p>
<p>To add to it, the tests act as documentation when I go through them, telling me how a particular routine behaves under different scenarios. TDD also encourages baby steps and a fast feedback loop towards getting something functional as soon as possible.</p>
<p>For reference, a few years ago, I wrote this thing called <a href="https://github.com/tasdikrahman/plino">plino</a>(spam filtering as an API) back in college days. I wasn’t aware of the testing literature back then (still learning), but what I ended up writing was an <a href="https://github.com/tasdikrahman/plino/blob/master/tests/test_plino_app_api_response.py">integration test</a> for the api.</p>
<p>It has absolutely no coverage for the other routines which are present. It’s just luck that the codebase is small and someone will be able to quickly grok it and understand what is happening; the same thing happening in a larger codebase does affect maintenance.</p>
<p>If I have to compare it with bhola, I ended up having <a href="https://github.com/tasdikrahman/bhola/blob/master/spec/services/slack_notifier_spec.rb">coverage for even a small routine which just does a POST to an external API</a>. Someone might think it’s overkill: why do we need all this if barely anyone is using it?</p>
<p>Another question which comes up: in the end, it’s the functioning lines of code that the consumer of the software interacts with, not the tests. So why write tests? Would skipping them mean taking a hit on maintainability? I feel the answer is yes.</p>
<p>As for <a href="https://github.com/tasdikrahman/bhola">bhola</a>, I feel I would definitely have more confidence and a faster feedback cycle when adding changes to it in future.</p>
<p>If you liked this piece, I have written a few more under <a href="https://www.tasdikrahman.com/blog/tag/testing">#testing</a> and <a href="https://www.tasdikrahman.com/blog/tag/tdd">#tdd</a>.</p>
Backpacking trip to Alleppey and Kochi
2020-08-31T00:00:00+00:00
https://www.tasdikrahman.com/2020/08/31/backpacking-trip-to-kerala-alleppey-fort-kochin
<blockquote>
<p>I did this trip last year, in August 2019. Finishing this was long overdue.</p>
</blockquote>
<p>I was trying to follow up with the manager of <a href="https://www.tripadvisor.in/Restaurant_Review-g297628-d7925789-Reviews-Khawa_Karpo-Bengaluru_Bangalore_District_Karnataka.html">Khawa Karpo</a> as Sushant and Rajat finished the dinner we were sharing to call it a day. We all rushed as I grabbed my takeaway, to get a <a href="https://en.wikipedia.org/wiki/Rickshaw">rick</a> to catch my bus which was due to leave in about 15 minutes or so.</p>
<p>Didn’t try bargaining too much and just went ahead with the exorbitant charge for the meagre ~1km which was to be traversed.</p>
<p>Reaching the bus stop, Deepak was already waiting for me to board the bus whose tickets we had booked 2-3 days ago.</p>
<p>Fast forward a few more tense moments, till we managed to switch seats with someone so that both of us could sit together; we settled into our seats for good and off we went, the bus zooming through the traffic as we edged closer to leaving the city behind. Zoning out, looking out of the window at the city dwellers leaving office and going back in their shiny cars and two-wheelers, made me realise that most of us are part of the same race, each day, when we try getting back to our lives after work. The only difference today was that I was escaping from the humdrum of the city to something new. That new being God’s own Country, Kerala.</p>
<p>As I devoured my takeaway from Khawa Karpo, I felt giddily happy, as I have always wanted to visit Kerala; it is something I had wanted to do while I was on my <a href="https://www.tasdikrahman.com/2019/03/22/solo-backpacking-trip-to-hampi-gokarna-goa-budget/">solo trip</a> along the south-western edge some years back. And this was my moment, I guess.</p>
<h2 id="alleppey">Alleppey</h2>
<p>And what better timing could it have been. I was visiting during the legendary <a href="https://en.wikipedia.org/wiki/Nehru_Trophy_Boat_Race">Nehru Trophy Boat Race</a>, the cherry on top being that <a href="https://en.wikipedia.org/wiki/Sachin_Tendulkar">Sachin Tendulkar</a> was gonna visit too as one of the chief guests.</p>
<p>Morning shone bright on our faces, waking us up to the serene view in front of us: the roads drenched with fresh bits of rainfall, the trees growing on the side of the road, making way for us as we swished past them.</p>
<p>All of this, while checking Google Maps constantly to see when we needed to get off the bus to be nearest to the hostel where we were gonna crash in Alleppey.</p>
<p>Similar to the Gokarna trip, I managed to get bunk beds for me and Deepak in Zostel Alleppey, and off we went walking with our rucksacks towards the hostel, looking at Google Maps every now and then to check that we were not headed in the wrong direction.</p>
<p>Luckily, we got a rick to deliver us straight in front of the hostel. It was still super early for us to checkin to our rooms, but our host was kind enough to let us keep our rucksacks aside and let us use the washrooms.</p>
<p>I love how zostels have always ended up placing their hostels at some really gorgeous places. This being no different, it was right next to the Alleppey beach!</p>
<center><img src="/content/images/2020/07/kerala-blog-1.jpg" /></center>
<p>Another early bird was <a href="https://khatiwarasamiksha.wordpress.com/">Samiksha</a>, who had arrived maybe a day or two earlier than us to the zostel and we ended up striking a conversation with her and our host about the things which we had planned on doing.</p>
<p>As <a href="https://twitter.com/anirudh2403">Anirudh</a> had bailed out the last minute of the whole thing, we had an extra ticket spare with us for the Nehru tropy, which samiksha was ready to grab as we were laying down our plan for the day, to her.</p>
<p>After dilly-dallying for a bit on whether to have breakfast right then or later, we went ahead with skipping it till we reached the venue, as we had heard from our host that the whole thing gets very crowded and the seats are assigned on a first come, first served basis, which made us not risk getting a distant seat in the stands.</p>
<p>As we got closer to the venue, famished from walking more than a km or two while guessing our way to the main entrance of the race (a distance we had highly underestimated when we got down from our rick), we caved in to our hunger pangs and decided to try out a small restaurant which was right next to the road we had to follow to get to the entrance of the event.</p>
<p>To describe this road: it had an adjacent canal running next to it, which had a lot of boats, ranging from the size of a small canoe to that of a decently sized ferry which would be able to carry at least 20-30 people. Their operators tried to woo us with their tariffs, offering to chaperone one around the canals as well as the backwaters, bundling meals along with the whole deal.</p>
<p>Coming back to what we ate, I ordered some appam for myself, of which I had another serving, along with some chicken dosa.</p>
<p>Filling up our tummies gave us the energy to walk faster towards the venue and take our seats.</p>
<h3 id="nehru-trophy">Nehru Trophy</h3>
<center><img src="/content/images/2020/07/kerala-blog-2.jpg" /></center>
<p>The atmosphere was electric to say the least: everyone cheering their own team, clear evidence of a few rivalries between teams, the sound of the vuvuzelas echoing through the stands, the vendors shuttling around trying to keep up with the demands of the buyers, rain droplets hitting your countenance as you cheer for the rowers from each team as they row with the mission of winning. All of this happening in the middle of the majestic backwaters.</p>
<center><img src="/content/images/2020/07/kerala-blog-3.jpg" /></center>
<p>The race was also accompanied by a lovely cultural show put up by folks performing on top of huge boats. Absolutely brilliant!</p>
<center><img src="/content/images/2020/07/kerala-blog-4.jpg" /></center>
<p>One thing which I was really happy to see was that the organisers had asked people to buy 10-rupee tokens to get their plastic bottles stamped; people could then return their bottles, show the tokens they had gotten for them, and get their token money back. A good way to incentivise people not to throw away plastic bottles.</p>
<p>We left at around lunch time, passing through the crowd at the main gates, in our search for another restaurant where we would vanquish our thirst and pacify our tummies.</p>
<p>And the stubbornness to eat something authentic was really an itch we all had, so off we went on an aimless search for something which was cheap as well as local.</p>
<p>We landed up at <a href="https://goo.gl/maps/spHQSjxY2AqXUTzs6">Subash Hotel</a>, not very far off from where we started searching for a hotel. And lo and behold, it was also coincidentally a <a href="https://en.wikipedia.org/wiki/Palm_wine">Toddy</a> (palm wine) shop. The host, seeing all three of us a bit uncomfortable at first as we had Samiksha with us, told us not to worry at all, showing us around the kitchen and introducing his family who were helping him run the shop. The sight was a delightful one, where we could see freshly made tapioca, fried fish chips and of course toddy. We ended up settling down in one corner of the restaurant and our host helped us order a bit of everything which we saw in the kitchen. Me and Deepak didn’t have the Toddy, but Samiksha managed to gulp down the whole Toddy bottle by herself. Our host sheepishly had a look at both me and Deepak and smiled at Samiksha when both our glasses were empty and she was drinking straight out of the jar.</p>
<center><img src="/content/images/2020/07/kerala-blog-5.jpg" /></center>
<p>The road adjacent to the shop also had a small canal running beside it, which gradually merged into the Kollam-Kottapuram waterway, and what a sight it was to just look at the fresh green fields and the coconut trees; we could see people occasionally drive (row) past us as we washed our hands after our fulfilling meal.</p>
<center><img src="/content/images/2020/07/kerala-blog-6.jpg" /></center>
<h3 id="alleppey-light-house">Alleppey Light house</h3>
<p>We ended up taking a rick back to our hostel as our legs had given up after all the running around since morning. As we zoomed past the roads, we also crossed the Alleppey beach lighthouse, which has stood since the time of erstwhile Travancore, before India’s independence. The Alleppey lighthouse catered to one of the busiest ports and trade centres on the southern coast, after the arrival of the Dutch, Portuguese and English traders.</p>
<p>Taken by this beauty, and as our hostel was not too far off, we got off at the lighthouse to watch the sun set from right next to the beach.</p>
<center><img src="/content/images/2020/07/kerala-blog-7.jpg" /></center>
<p>The beach was not very crowded and we ended up taking a small spot right next to the pillars. Pillars? Yes, pillars. Or what was left of them. If you’re wondering what these pillars were doing here, my best guess would be that they were the remnants of the port where ships used to dock and load/unload cargo. All the while the watchful lighthouse gazed over us, spraying light over the horizon as it moved its shower in circles.</p>
<p>We didn’t waste time in removing our shoes, keeping our socks aside and feeling the sand on our toes, the waves gushing seaweed onto our feet and us trying not to tumble over into the sea from the sheer strength of the water gushing towards us.</p>
<center><img src="/content/images/2020/07/kerala-blog-8.jpg" /></center>
<p>The pillars had rusted from all the years of standing still, beaten down by the waves and the saline water treating them all day long. With wild moss growing along their bodies, it almost felt like they were by this point part of the sea and an extension of it, not something mankind had put there.</p>
<p>We stayed around till the sun set and all that would guide us back were the shining stars above us; the hostel was barely a few hundred metres away and off we went trudging down the beach with the occasional dog running around us.</p>
<p>Dog tired as we went back to our rooms, crashing back into our beds after taking a shower and quickly falling asleep. This marked the end of the first day for me and Deepak in Alleppey.</p>
<p>Me and Deepak had planned on visiting the beach early in the morning and that’s what we did after we woke up: we wore our sandals and off we went to just walk on the beach as the waves splashed onto the shore, taking back a bit of it with each visit they made.</p>
<center><img src="/content/images/2020/07/kerala-blog-9.jpg" /></center>
<p>The boats were parked at the shore after the fishermen had come back from their hunt in the sea, and they awaited their masters as they stood there, noses facing the sea, getting ready for their next shift.</p>
<p>While walking around, we encountered a bunch of puppies cuddled together to keep themselves warm in the perennial breeze of the sea. On our coming closer, they all ran towards their mother, who was probably in search of some food.</p>
<p>Which also reminded us that we needed to get some breakfast too! I didn’t order myself much; a sandwich and an omelette along with a serving of watermelon was all I had for breakfast, as I was too occupied with planning what we were going to do during the course of the day. As usual for me, we made the plans on which place to visit and what to do on the go. But not to digress, it has always turned out alright whenever I did so, which is why I wasn’t too worried, as long as we didn’t sit inside our rooms doing nothing.</p>
<p>Bugging our host on things to do, we managed to land ourselves a scooter for hire for the whole day. The catch was that Samiksha was also to accompany us for the day, which would mean we had to land another scooter for rent from our host. Unluckily, the last scooter had been taken, but even if we had gotten another scooter, only Deepak amongst the three of us had an actual license for a two-wheeler, so we could not have made much use of it. I know, I know. But I do have a license now. Not to digress, this meant that we had to do a triple-seater pillion ride wherever we were going.</p>
<h3 id="alleppey-coir-musueum">Alleppey Coir Musueum</h3>
<p>And that’s what we did, not very proud of it. With Deepak driving and both me and Samiksha huddling together in the back with our raincoats, holding an umbrella, cutting through the wind, trying to save us all from getting more wet, off we went towards the Coir Museum, which was a good ~8 kms from our hostel. Luckily, it was a straight road and we covered the distance after a bit of zooming around, trying to maintain our balance and not let the wind take away the umbrella.</p>
<center><img src="/content/images/2020/07/kerala-blog-10.jpg" /></center>
<p>The proctor of the hall was also kind enough to answer our questions about who used to work here and how many people would be working on such machines on average. She also mentioned that the govt had incentivised studying coir courses and set up hostels for folks to come and learn the art while earning a small stipend too, which I felt was really helpful to attract more people to take this up while also sustaining themselves from the stipend money.</p>
<p>I had only heard about coir beds until that point, but I had no clue that one could do so many things using coir. We saw actual beds, boats, houses, umbrellas, wall art, carpets, eco-friendly plant beds and what not, all of this being made with coconut as the raw material.</p>
<p>Traditionally, coir was spun by bare hands, by simply twisting fibres between the palms; the introduction of ratts (spinning wheels) in the 19th century significantly increased productivity.</p>
<p>During the mid-19th century, a few Europeans with experience with jute products in Bengal arrived in Alappuzha with two Bengali technicians to explore the prospects of coir yarn; on seeing their success, a lot of other industrialists set up their shops there. The establishment of such organised coir production facilities helped Alleppey become the unchallenged Coir Capital of the world.</p>
<h3 id="marari-beach">Marari beach</h3>
<p>The next stop was Marari beach, an odd ~7-8 kms from the place we were at. We could have gone the usual way, which is the highway, but since we were doing a tripsy, we didn’t wanna take the risk of getting a ticket for it. Plus, what’s the fun in taking the highway when we had the alternative of going via the countryside?</p>
<p>And this was probably one of the best decisions of the whole day. The roads were certainly very narrow at places, but the plus side was that we could see everyone going about their daily lives, with us being the witness to it. Coir, as we saw, was a huge industry; we saw a bunch of godowns operating out of small/mid-sized houses, from where they would probably process the fibre obtained from coconuts. The houses were built in a distinct style, where one could see the porch and the borders of the roof having a certain angle, which is very common in places with heavy rainfall. The columns were made of wood, shaped like cylinders, with the midriff a bit broader than both the ends. And the verandah would usually be huge.</p>
<p>There came a point in the middle of all this where we had to cross a huge puddle of water. Imagine three people holding on to each other, trying to cross this puddle and not fall into it at the same time. Poor Deepak had to balance the scooter as well as both me and Samiksha, while we both were laughing it out as the poor chap took us across.</p>
<center><img src="/content/images/2020/07/kerala-blog-11.jpg" /></center>
<p>After circling around the beach for a bit (thanks to us putting in the wrong landmark), we finally reached the beach and it was serene. It has been the only beach where I have actually noticed the sea cutting through land and delving inside the mainland, the waves crashing on the shore, gnawing away at it, and taking bits and parts of whatever they could scavenge inwards towards the mainland.</p>
<p>It was quite the sight.</p>
<center><img src="/content/images/2020/07/kerala-blog-12.jpg" /></center>
<p>The beach was luckily not that crowded when I reached there, plus also due to the fact that it was lunch time which added to the small number of people who were there.</p>
<center><img src="/content/images/2020/07/kerala-blog-13.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-14.jpg" /></center>
<p>After leaving Marari beach, its sand still on our toes, we ended up in a restaurant on the side of the road which we found out of the blue. The food was amazing, although it helped that we had not eaten anything properly since the morning and were famished by the time we had left Marari. But not to take away from the restaurant’s food, it was quite good. On that note, we also managed to clean our feet and remove all the sand we had gathered on our toes in the washroom of the restaurant after we ate.</p>
<h3 id="kumarakoram">Kumarakoram</h3>
<p>As usual, we did not have a clue what to do next. We were sort of split between going for a houseboat in the backwaters or going to the bird sanctuary which Samiksha had been mentioning. In the end we decided to do whatever was open by the time we reached the place.</p>
<p>We took off on our scooty to find someone who would be able to give us a quick boat ride. Yes, it was that random. We were literally scouring our way around <a href="https://goo.gl/maps/ypftTyRqLFJfwqoa8">Aryakara</a>, trying to find a party which would take us under their shelter, as it had also started drizzling quite hard. Imagine three people on an old beat-down scooty, trying to just move around the coast. This is when we stumbled upon a small narrow alley leading to the backwaters; we could also hazily see a boat anchored on the side and a few people standing there.</p>
<p>It started raining heavily and we had to park the scooter and hastily make a run for the shed. What we had stumbled upon was a govt outpost for jetties which were used to transport people from one side of the backwaters to the other. We all looked at each other and decided, sure, we can try this out: it was not very far from the time when it would get dark, and our search for a houseboat was not leading us to anything special in particular, so we decided to hop into the next boat which was coming in. The fun part of it? We could also take our scooter inside with us, as the trawler had provisions for it, and so did the few other people who got in with us.</p>
<p>The tickets were subsidised and if I recall correctly, they cost less than 50 INR individually, along with our scooter, which is pretty cheap if you ask me. The next stop was Kumarakom. As the trawler drifted past the sea plants, the view was breathtaking!</p>
<center><img src="/content/images/2020/07/kerala-blog-15.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-16.jpg" /></center>
<p>It must have been an odd 15-20 mins by the time we reached the coast of Kumarakom; the trawler took its own sweet time to reach the last stop where we were supposed to get down. We could see the distance we had traversed: it was only trees and the swamp on our sides which gave way to the backwaters, and at the horizon we could see the fine line of trees of Aryakara which we had left behind.</p>
<center><img src="/content/images/2020/07/kerala-blog-17.jpg" /></center>
<p>On getting down from the boat and collecting our scooter, we headed straight to the small shops where we could see the owners of the houseboats, but sadly, the conversations were unwieldy. We had already crossed the time until which they ply their services and they rightfully didn’t budge on that. Dejected, we headed back to our parked scooter and decided that we would take the scooter back all the way to our hostel.</p>
<p>Now we are talking about a good ~40 kms, give or take! On a scooter! While doing a tripsy!</p>
<p>But we had already made up our minds and off we went. It was certainly not comfortable, to say the least. Add to that the fact that we had literally been doing this from the time the day had started and we had embarked on our little adventure.</p>
<center><img src="/content/images/2020/07/kerala-blog-18.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-19.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-20.jpg" /></center>
<p>We crossed numerous boathouses, resorts, empty houses, and fishermen coming back with their catch while on our way back to our hostel. And when we finally reached, I just wanted to crash on my bed; even the hard cushion felt like an expensive high-end mattress at the end of the day.</p>
<p>Just when the grasp of slumber was closing in, hearing my own grumbling tummy woke me back up from what little sleep that I was about to get. After freshening up, I pulled Deepak too outside of his room to sit in the common room.</p>
<p>It just so happened that a few folks ended up jamming together while we lazed away at the sidelines: our host and a few of his friends, and some of the fellow hostelers.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/7z7F3QfI3MI" frameborder="0" allowfullscreen=""></iframe>
<p>I ended up singing while Samiksha did the strings, and Sakshi was vlogging her trip video. Thanks Sakshi for capturing this. Do give a watch to her original <a href="https://www.youtube.com/watch?v=DbC14shIDxs">video</a>, which I used to trim this part out.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/KcUNtQGxEio" frameborder="0" allowfullscreen=""></iframe>
<p>After all this, we went in search of a restaurant, at a time which I would easily consider quite late, given everything was more or less closed. On our way, we crossed a wedding ceremony, where we could see people dancing and being jolly. Now I have never done this before, but everyone more or less chimed in to agree upon asking the folks inside if they had any food left! We literally ended up gatecrashing the thing and a few of us ended up dancing with the folks inside (Sakshi captures this in her vlog).</p>
<center><img src="/content/images/2020/07/kerala-blog-21.jpg" /></center>
<p>There was a lot of banter and tomfoolery which followed when the younger folks saw us with cameras, or maybe they were just really having fun in the absence of the elders there to control what they were doing. Either way, we let them all be and went ahead in search of a restaurant, which we did find at the end of the whole thing (Sakshi has captured this too in her vlog, alright, I will not repeat this again).</p>
<p>We headed straight back to our hostel rooms, crashing for the night after the long day.</p>
<p>I ended up waking early in the morning, and waking Deepak along with me too, to just run around the beach and spend some time there, which we both had been planning to do. It was the best early-morning beach view that I have ever had. We got to see small crabs trying to dig through the sand, making a run for the sea as soon as they saw us. And just about that time, it also started raining quite heavily. We did not have any gear to protect us from the rain, which made us run towards a shelter to get some respite, as we watched the fishermen come back with their day’s catch. It was just beautiful, seeing the clouds pour their hearts out, flooding the sidelines of the porch where we were standing. One thing which really stood out was a pair of swimmers in full gear, cutting past the sea, further away from the shore, swimming against the waves, their skill and practice clearly showing as they moved farther and farther away from the shore; we lost sight of them the moment they crossed the iron columns of the poles which used to be the erstwhile dock.</p>
<center><img src="/content/images/2020/07/kerala-blog-22.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-23.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-24.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-25.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-26.jpg" /></center>
<p>Did I forget to mention that, there was a local dog which had been accompanying us all the time since we had almost started from the hostel towards the light house?</p>
<p>This dog also managed to take away my flip-flops and make me run after it for a good distance before it gave them back to me. While the intention was clear, that the dog wanted to play early in the morning with this not-so-hostile human being which it had found strolling around on the beach, it also managed to playfully bite (mouthing, if you may) as we walked along the shore (more on this later), much to our amusement.</p>
<p>Heading back to our hostel, we quickly took a shower and picked up our packed backpacks, heading for the train station, before which we bade goodbye to Samiksha and whoever was awake when we were leaving.</p>
<p>We hadn’t eaten anything, which made us look for a decent place to fill our tummies; we found a small eatery run by the railways. On settling down, we ordered a few dosas and a cup of tea each while we waited for the train to arrive. The train did arrive, but little did we know that it would halt for only about two minutes. And there we were, hastily stuffing food into our mouths while trying to clear the bill, much to the amusement of everybody around us.</p>
<center><img src="/content/images/2020/07/kerala-blog-27.jpg" /></center>
<h3 id="kochi">Kochi</h3>
<p>The next leg of the trip was our visit to Kochi. Train tickets were pretty cheap, at less than 50 INR each. The train itself was a passenger one, which meant it was a bit slower and halted at quite a few places, but that also gave us a chance to admire the countryside: the lush green meadows, the swamps and the local livestock grazing on grass.</p>
<p>We reached Ernakulam about an hour or two after we started; from there we had to either take a cab to Fort Kochi or take a bus, and we chose the latter. The only problem was that we had a tough time finding a bus that would take us there, and with the clock ticking past 2pm and our heavy rucksacks on our backs, we were rethinking the whole idea and considering just taking a cab and being done with it. But to our good luck, we did get a bus, and one which was not very crowded.</p>
<p>The ride to Fort Kochi was roughly 10km, which took us about another half an hour.</p>
<center><img src="/content/images/2020/07/kerala-blog-28.jpg" /></center>
<p>As luck would have it, the hostel where we were going to stay was hardly a few metres from the bus stop we got off at. And you might have guessed it: it was a Zostel again! We had to dump our bags in the common area as the host was out for lunch, quickly freshened up in one of the spare washrooms around the common area, and off we went in search of some food.</p>
<p>We settled for a restaurant right next to the fishing nets, which was a bit pricey, but we were really famished and didn’t mind paying a premium for some decent food. Anyone visiting should definitely try the prawns and the like.</p>
<center><img src="/content/images/2020/07/kerala-blog-29.jpg" /></center>
<p>Once we were done with our meal, it was a no-brainer to go for a walk along Fort Kochi and the fishing nets.</p>
<center><img src="/content/images/2020/07/kerala-blog-30.jpg" /></center>
<p>If you haven’t seen the Chinese fishing nets before, they are quite something to watch being operated by the fishermen, which is exactly what we did. They are huge, made with bamboo shoots, and just about the most complicated fishing nets I have ever seen; I am sure they are more efficient than most of the nets we have around in Assam, where fishermen use the simple nets you would imagine a fisherman to have. This was entirely different from anything I had personally seen before, and a couple of fishermen are required to operate one.</p>
<p>Moving on, the boulevard which leads up to the edge of the fort is bordered by very old mansions. I say mansions because you have to see one to gauge the sprawling lawns and experience the architecture, which definitely dates back to the time Kochi was inhabited by the Portuguese and the Dutch. And there are a bunch of houses like this along the road; it is not a singular sight.</p>
<center><img src="/content/images/2020/07/kerala-blog-31.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-32.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-33.jpg" /></center>
<p>We headed straight back to our hostel rooms after finishing this small walk around the fishing nets, and Deepak decided he would sleep the afternoon out.</p>
<p>A <a href="https://en.wikipedia.org/wiki/Kathakali">Kathakali</a> performance was about to start near the place where we were staying, at the Kerala Kathakali Centre. This was probably the first time I was going to watch it live, with the performers doing their preparation in front of the audience, and it was definitely something I didn’t want to miss out on.</p>
<p>As we all settled down in the small but cosy auditorium, the crowd was trying to find seats as close to the stage as possible, to be nearer to the whole spectacle.</p>
<center><img src="/content/images/2020/07/kerala-blog-34.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-35.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-36.jpg" /></center>
<p>The whole thing was just breathtaking, right from the expressions to the preparation, to the dexterity and experience of the performers, which made it all look so flawless. All in all, this is a great way of preserving their culture while also showcasing it to people from other cultures.</p>
<p>After it ended, I headed back to the hostel room to finally give my feet some much-required rest. Crashing on my bunk bed with the fan over my head, I fell asleep for a good few hours before being woken up by Deepak to head out for some dinner.</p>
<p>The next day, after waking up, we hired a scooter for the whole day, which we planned to make full use of before we left Kochi, and off we went to get some breakfast. The next stop was <a href="https://www.tripadvisor.in/Restaurant_Review-g297633-d3738019-Reviews-Pepper_House_Cafe-Kochi_Cochin_Ernakulam_District_Kerala.html">Pepper House</a>, one of the few old spice houses from the time these buildings were actually used for storing spices; no surprises on what Pepper House used to store.</p>
<p>The view was lovely and the service was exemplary; although a bit pricey compared to the other eateries around, I would say it was definitely worth it.</p>
<center><img src="/content/images/2020/07/kerala-blog-37.jpg" /></center>
<p>After checking out the local art gallery and library inside Pepper House, we headed out for the narrow streets of Mattancherry, very close to Jew Town. The roads are lined with antique shops selling everything from really old antiques, paintings and souvenirs to collectibles created by the locals.</p>
<center><img src="/content/images/2020/07/kerala-blog-38.jpg" /></center>
<center><img src="/content/images/2020/07/kerala-blog-39.jpg" /></center>
<p>The 16th-century Jewish synagogue had a lot of rich culture and history associated with it, which was definitely fascinating. The best part of all this was that there were still a few local Jewish folks living near the synagogue.</p>
<p>We visited the naval museum after this, which had a bunch of historical writings on the naval history of Kochi. We also stumbled upon an officer from Bangalore who asked us to tell him more about Ulsoor lake and the ASC centre not far from it; such a small world. The funny part is, we got so lost in all the reading that we did not notice the museum was closing (the host at the gate did tell us it closed at 5pm, but we didn’t really keep track of time), and we almost got locked inside and had to literally rush back towards the entrance of the museum bunker (yes, they have converted actual bunkers into museums, how cool is that!)</p>
<p>After this, we decided to head back to our hostel rooms, but first we got an early dinner at one of the restaurants around the art galleries near the fishing nets. We didn’t fuss around too much and ate quickly, before heading back to drop the scooter off at the garage we had picked it up from.</p>
<p>Reaching our hostel, we packed our stuff and bade goodbye to the host, before hopping into our taxi to Ernakulam Junction, from where our train back to Bangalore was to leave. Trivia about this ticket: both of our tickets got confirmed only a day before our departure.</p>
<p>Reaching Bangalore early the next morning brought us back to reality: the trip was finally over.</p>
<p>I have very fond memories of Kerala. It was a lovely trip and I had a lot of fun writing this.</p>
<p>Until next time!</p>
A few notes on GKE kubernetes upgrades
2020-07-22T00:00:00+00:00
https://www.tasdikrahman.com/2020/07/22/a-few-notes-on-gke-kubernetes-upgrades
<blockquote>
<p>This post was originally published on <a href="https://blog.gojekengineering.com/how-we-upgrade-kubernetes-on-gke-91812978a055">Gojek’s engineering blog, here</a>; this is a cross-post of the same.</p>
</blockquote>
<p>This post is more of a continuation of this tweet:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">A few notes on <a href="https://twitter.com/kubernetes?ref_src=twsrc%5Etfw">@kubernetes</a> cluster upgrades on GKE (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1285619368353726465?ref_src=twsrc%5Etfw">July 21, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>If you are running kubernetes on <a href="https://cloud.google.com/kubernetes-engine">GKE</a>, chances are you are already doing some form of upgrades for your clusters, given that the upstream release cycle is quarterly, which means a minor version bump every quarter. That is a really high velocity for version releases, but the cycle itself is not the focus of this post; the focus is on how you can attempt to keep up with it.</p>
<p>Quite a few things here are GKE specific, but at the same time a lot of it applies to any kubernetes cluster in general, whether self hosted or managed.</p>
<p>That being said, let’s quickly set context on what exactly we mean when we say a kubernetes cluster.</p>
<h1 id="components-of-a-kubernetes-cluster">Components of a kubernetes cluster</h1>
<p>Any kubernetes cluster consists of master and worker nodes: two sets of nodes for the different kinds of workloads which run on them.</p>
<p>The master nodes in the case of GKE are managed by Google Cloud itself. What does that entail? It means components like the api-server, controller-manager, etcd and scheduler will not have to be managed by you, which would otherwise have been an added operational burden.</p>
<center><img src="/content/images/2020/07/components-of-kubernetes.png" /></center>
<p>I will not go into each and every component in detail, as the docs do good justice to what they do, but to summarise: the scheduler schedules your pods to nodes, the controller-manager consists of a set of controllers which reconcile the existing state of the cluster with the state stored in etcd, and the api-server is the entry point to the cluster, through which every component interacts with the others.</p>
<h2 id="how-should-i-create-a-cluster">How should I create a cluster</h2>
<p>We use <a href="https://www.terraform.io/">terraform</a> along with gitops to manage the state of everything related to GCP, although I have heard good things about <a href="https://www.pulumi.com/">pulumi</a>. Use whatever works for you at the end of the day, but the power of being able to declaratively configure the state of your infrastructure cannot be overstated.</p>
<p>We have a bunch of cluster creation modules inside our private terraform repository, which makes creating a GKE cluster literally just a call to the module with some sane defaults plus the custom arguments which vary per cluster. After a git commit and push, the next thing one sees is the terraform plan, right in the comfort of CI; if all looks good, a terraform apply runs as the next step in the same pipeline stage.</p>
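<p>Our modules are private, but the shape of such a call looks roughly like this (the module name, source path and every argument below are illustrative, not our actual setup):</p>

```hcl
# Hypothetical module invocation; sane defaults live inside the module,
# only the per-cluster bits are passed in.
module "payments_cluster" {
  source = "../modules/gke-cluster"

  name         = "payments-production"
  region       = "asia-southeast1" # regional, not zonal
  node_count   = 3
  machine_type = "n1-standard-4"
}
```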
<p>With some context on how we manage the terraform state of the cluster, let’s move on to a few defaults which we set.</p>
<p>By default, one should always choose <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters">regional clusters</a>. The advantage is that GKE will then maintain replicas of your control plane across zones, making it resilient to zonal failures. Since the api-server is the entry point for all communication and interaction, losing it means losing control of/access to your cluster, although your workloads will continue to run unaffected (if they don’t depend on the api-server or the k8s control plane; more on this in a bit).</p>
<p>Components like Istio, the prometheus operator, and even good old kubectl, which depend on the api-server, may momentarily stop functioning while the control plane is being upgraded if your cluster is not a regional one.</p>
<p>With regional clusters though, I haven’t personally seen any service degradation, downtime or latency increase while the master upgrades itself.</p>
<h2 id="master-upgrades-come-first-before-upgrading-anything">Master upgrades come first before upgrading anything</h2>
<p>The reason is that the control plane needs to be upgraded first, and only then the worker nodes.</p>
<p>When the master nodes are upgraded (you will not see these nodes in GKE, but they run somewhere as VMs/borg pods et al., whatever abstraction Google is using), the components running on them, such as the controller-manager, scheduler, etcd and the api-server, are upgraded to the k8s version you are setting.</p>
<p>Master upgrades must happen before one can move on to the worker node upgrades. The master upgrade process is quite opaque, as GKE manages it for you rather than the cluster operator, which does not give you much visibility into what exactly is happening. Nevertheless, if you want to learn what happens inside, you can try <a href="https://github.com/poseidon/typhoon">typhoon</a> and upgrade the control plane of a cluster brought up with it, which is what I used to live-upgrade the control plane of a self hosted k8s cluster in my <a href="https://www.youtube.com/watch?v=3WgqFoo9eek&feature=youtu.be">devopsdays India 2018 talk</a>.</p>
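<p>On GKE, kicking off the master upgrade is a single command (the cluster name, version and region below are placeholders):</p>

```shell
# Upgrade only the control plane of a hypothetical regional cluster.
# GKE performs the rollout itself; you only pick the target version.
gcloud container clusters upgrade my-cluster \
  --master \
  --cluster-version 1.16.13-gke.401 \
  --region asia-southeast1
```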
<h2 id="gke-cluster-master-upgraded-what-next">GKE cluster master upgraded, what next</h2>
<p>The next obvious thing after the GKE master upgrade is to upgrade your worker nodes. In the case of GKE you have node pools, which in turn manage the nodes belonging to them.</p>
<p>Why different node pools? Separate node pools can be used to segregate the workloads which run on the nodes of each pool. For example, one node pool can be tainted to run only prometheus pods, and the prometheus deployment object can then tolerate that taint to get scheduled on those nodes.</p>
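<p>The taint/toleration pairing described above looks roughly like this (pool, cluster and label names are illustrative; the deployment is trimmed to the relevant fields):</p>

```yaml
# Create the pool with a taint (GKE supports this at pool creation):
#   gcloud container node-pools create monitoring-pool \
#     --cluster my-cluster --node-taints dedicated=prometheus:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # Only pods carrying this toleration can land on the tainted nodes.
      tolerations:
        - key: dedicated
          operator: Equal
          value: prometheus
          effect: NoSchedule
      containers:
        - name: prometheus
          image: prom/prometheus:v2.19.0
```

<p>Note that the toleration only <em>allows</em> scheduling on the tainted pool; to guarantee the pods land there you would pair it with a nodeSelector or node affinity on the pool’s labels.</p>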
<h2 id="what-consists-of-the-worker-nodes">What consists of the worker nodes</h2>
<p>This is the part of the compute infra which you actually get to interact with on GKE.</p>
<p>These are the node pools, where your workloads will run at the end of the day.</p>
<p>The components which make up the worker nodes (excluding your workloads) are:</p>
<ul>
<li>kube-proxy</li>
<li>container-runtime (docker for example)</li>
<li>kubelet</li>
</ul>
<p>Again, without going very deep into what each one does: on a very high level, kube-proxy is responsible for translating your service’s clusterIP (and nodePort) to podIPs.</p>
<p>The kubelet is the process which listens to the api-server for instructions to schedule/delete pods on the node it runs on. These instructions are in turn translated into the API calls which the container runtime (docker or podman, for example) understands.</p>
<p>These three components are again managed by GKE. Whenever you upgrade your nodes, kube-proxy and the kubelet get upgraded; the container runtime need not receive an update at the same time. GKE has its own mechanism for this, but on a very high level it does so by changing the image versions of the control plane pods.</p>
<p>We haven’t seen downtime or service degradation from these components being upgraded on the cluster.</p>
<p>One good thing to note here is that the worker nodes can run a few versions behind the master nodes; the exact combination can be tested on your staging clusters to gain confidence before the production upgrade. For example, I have seen nodes on 1.11.x run just fine with the master on 1.13.x, though the recommendation is at most a two minor version skew.</p>
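<p>The skew rule can be expressed as a small sanity check (a hypothetical helper for illustration, not anything GKE provides):</p>

```python
def skew_ok(master: str, node: str, max_skew: int = 2) -> bool:
    """True if the node's minor version trails the master's by at most max_skew."""
    m_major, m_minor = (int(x) for x in master.split(".")[:2])
    n_major, n_minor = (int(x) for x in node.split(".")[:2])
    # Node may trail the master, never lead it, and majors must match.
    return m_major == n_major and 0 <= m_minor - n_minor <= max_skew

# Master on 1.13.x with nodes on 1.11.x is within the two-version skew;
# nodes on 1.10.x would be one version too far behind.
print(skew_ok("1.13.7", "1.11.9"))
print(skew_ok("1.13.7", "1.10.13"))
```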
<h2 id="anything-one-should-check-while-upgrading-to-a-certain-version">Anything one should check while upgrading to a certain version?</h2>
<p>Since the major <a href="https://github.com/kubernetes/kubernetes/releases">release cycle</a> for kubernetes is quarterly, one thing operators definitely have to check is the release notes and the <a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/">changelog</a> for each version bump, as they usually contain quite a few API removals and major changes.</p>
<h2 id="what-happens-if-my-cluster-is-regional-while-upgrading-it">What happens if my cluster is regional while upgrading it</h2>
<p>If your cluster is regional, the node upgrade happens zone by zone. You can control the number of nodes which get upgraded at once using the node pool’s surge configuration; turning autoscaling off for the node pool during the upgrade is also recommended.</p>
<p>If surge upgrades are enabled, a surge node with the upgraded version is created, and GKE waits until the kubelet registers itself with the api-server and reports the node healthy, at which point the api-server can direct the kubelet on the surge node to schedule workload pods.</p>
<p>In a regional cluster, another node from the same zone is then picked: the node is cordoned and drained, its workloads are rescheduled, and at the end the node gets deleted and removed from the node pool.</p>
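<p>Both knobs live on the node pool and can be set from the CLI; the names and values below are examples, not recommendations for every workload:</p>

```shell
# One surge node at a time, no extra unavailable nodes during the rollout.
gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --region asia-southeast1 \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0

# Pause autoscaling on the pool for the duration of the upgrade.
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --node-pool default-pool \
  --no-enable-autoscaling
```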
<h2 id="release-channels">Release channels</h2>
<p>Setting a <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels">release channel</a> is highly recommended; we set it to stable for our production clusters, and the same for our integration clusters. With it set, the nodes will always run the same kubernetes version as the master nodes (excluding the small window while the master is being upgraded).</p>
<p>There are three release channels, depending on how fast you want to keep up with the kubernetes versions released upstream:</p>
<ul>
<li>rapid</li>
<li>regular (default)</li>
<li>stable</li>
</ul>
<p>Setting a maintenance window allows you to control when these upgrade operations kick in. Once set, the cluster cannot be upgraded/downgraded manually, so if one really wants that granular manual control, they can choose not to set it.</p>
<p>I haven’t personally downgraded a master version; please try it on a staging cluster if you really need to. And if you look at the docs, downgrading the master is not really <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#downgrading_limitations">recommended</a>.</p>
<p>Downgrading a node pool version is not possible, but you can always create a new node pool with the desired kubernetes version and delete the older one.</p>
<h2 id="networking-gotchas-while-upgrading-to-a-version-114x-or-above">Networking gotchas while upgrading to a version 1.14.x or above</h2>
<p>If you are running a version lower than 1.14.x without the ip-masq-agent running, and your destination address range falls under the CIDRs</p>
<ul>
<li>10.0.0.0/8</li>
<li>172.16.0.0/12</li>
<li>192.168.0.0/16</li>
</ul>
<p>then the packets in the egress traffic will be masqueraded, which means the destination will see the node IP.</p>
<p>The default behaviour from 1.14.x onwards (and on <a href="https://cloud.google.com/container-optimized-os/docs">COS</a>) is that packets flowing from the pods stop getting NAT’d. This can cause disruption if you have not whitelisted the pod address range at the destination.</p>
<p>One way out is to deploy the <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent">ip-masq-agent</a> and list the destination CIDRs, e.g. 10.0.0.0/8 (if that is where a destination component like postgres lies), under nonMasqueradeCIDRs in its config. The packets will then use the podIP as the source address when the destination (postgres) receives the traffic, not the nodeIP.</p>
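<p>The agent reads its config from a ConfigMap in kube-system; a minimal sketch (the CIDR list is an example, adjust to your own network):</p>

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    # Traffic to these destinations keeps the podIP as its source.
    nonMasqueradeCIDRs:
      - 10.0.0.0/8
    resyncInterval: 60s
```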
<h2 id="can-i-upgrade-multiple-node-pools-at-once">Can I upgrade multiple node pools at once?</h2>
<p>No, you can’t; GKE doesn’t allow it. And even while upgrading a single node pool, you have no control over which node gets picked for upgrade next.</p>
<p>So far we have discussed what happens when the node pools and the master nodes get upgraded by GKE.</p>
<h2 id="would-there-be-a-downtime-for-my-services-when-we-do-an-upgrade">Would there be a downtime for my services when we do an upgrade?</h2>
<p>Let’s start with the master components. If you have a regional cluster, the upgrade happens zone by zone, so even a service making use of the k8s api-server will not get affected, although you should definitely replicate the same on your staging setup, assuming both have similar config.</p>
<p>Coming to the worker nodes. Well it depends.</p>
<h2 id="how-do-i-preventminimise-downtime-for-my-services-deployed">How do I prevent/minimise downtime for my services deployed?</h2>
<p>For stateless applications, the simplest thing is to increase the replicas to match the number of zones your nodes are present in. But pods are not necessarily scheduled across nodes in each zone; kubernetes doesn’t handle this by default, but gives you the primitives to handle it.</p>
<p>If you want to distribute pods across zones, you can apply <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity">podAntiAffinity</a> in the deployment spec for your service, with the topologyKey set to <code class="language-plaintext highlighter-rouge">failure-domain.beta.kubernetes.io/zone</code>, for the scheduler to try scheduling them across zones. (You can read a more detailed post on the scheduling rules you can specify, which I wrote some time back, <a href="https://www.tasdikrahman.com/2020/05/06/specifying-scheduling-rules-for-pods-with-podaffinity-and-podantiaffinity-on-kubernetes/">here</a>.)</p>
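<p>Trimmed to the relevant fields, the rule in a deployment spec looks like this (the <code class="language-plaintext highlighter-rouge">app</code> label is a hypothetical one for your service):</p>

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" makes this a best-effort spread rather than a hard rule,
          # so scheduling still succeeds if a zone has no capacity.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-service
                topologyKey: failure-domain.beta.kubernetes.io/zone
```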
<p>Distribution across zones makes your service resilient to zonal failures. The reason we increase the replicas beyond 1 is that when a node gets upgraded, it is cordoned and drained, and the pods get bumped out of it.</p>
<p>If a service has only 1 replica and it is scheduled on the node which is set for upgrade by GKE, then while the scheduler finds the pod a new node there is no other pod serving requests, which causes a temporary downtime.</p>
<p>One thing to note here: if you are using a <a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">PodDisruptionBudget (PDB)</a> and your running replicas equal the minAvailable specified in the PDB rule, the upgrade will simply not proceed. This is because the node cannot drain the pod(s) without violating the budget, which it respects, hence you have to either:</p>
<ul>
<li>increase the pods such that the running pods are > minAvailable specified in your PDB</li>
<li>remove the PDB rule specified.</li>
</ul>
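<p>For reference, a PDB that blocks an upgrade when exactly two replicas are running would look like this (names and labels are illustrative):</p>

```yaml
apiVersion: policy/v1beta1   # policy/v1 on k8s >= 1.21
kind: PodDisruptionBudget
metadata:
  name: my-service
spec:
  # With replicas == 2, draining any pod would violate this budget
  # and the node drain stalls; keep running replicas strictly above it.
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service
```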
<p>For <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">statefulSets</a>, you might have to take a small downtime while upgrading: as the pod of the stateful set gets bumped off its node, the PVC claim is made again by the replacement pod when it gets scheduled on another node.</p>
<h2 id="but-tasdik-these-steps-of-upgrade-are-so-mundane">But Tasdik these steps of upgrade are so mundane</h2>
<p>Agreed, it is mundane, but nothing stops anyone from having a tool do these things for you; have a look at <a href="https://github.com/hellofresh/eks-rolling-update">eks-rolling-update</a>. GKE is way easier in comparison, which leaves fewer touch points and cases where things can go wrong:</p>
<ul>
<li>the PDB budget being the showstopper for your upgrade if you don’t pay attention</li>
<li>replicas being 1 or so for services</li>
<li>quite a few replicas being in pending or crashloopbackoff</li>
<li>statefulsets being an exception and needing hand-holding.</li>
</ul>
<p>For most of the above, you can initially start with a fixed process (playbook) which one runs through for each cluster during an upgrade; even though the task is mundane, one then knows which checks to follow and how to verify the sanity of the cluster after the upgrade is done.</p>
<p>Replicas set to 1 is just plain naive; let your deployment tool have a sane default of a minimum of 3 replicas (3 zones in 1 region, assuming you have podAntiAffinity set and a best-effort attempt gets logged by the scheduler).</p>
<p>Pods in the pending state mean you are requesting cpu/memory which is not available on any node in the node pools, which in turn means you are either not sizing your pods correctly or a few deployments are hogging resources; either way, it’s a smell that you do not have enough visibility into your cluster.</p>
<p>For statefulsets, I don’t think you can avoid taking a downtime. So that’s there.</p>
<p>After all the upgrades are done, you can backfill the upgraded version numbers and other details back into the terraform config in your git repo.</p>
<p>Once you have rinsed and repeated the steps above, you can very well start automating a few things.</p>
<h2 id="what-we-have-automated">What we have automated</h2>
<p>We have automated the analysis of what pods are running in the cluster; we extract this information into a spreadsheet. Information like:</p>
<ul>
<li>replicas of the pods</li>
<li>age of the pod</li>
<li>status of the pods</li>
<li>which pods are in pending/crashloopbackoff</li>
<li>node cpu/mem utilization</li>
</ul>
<p>The same script inserts the team ownership details of each service by querying our service registry and storing that info.</p>
<p>All of the above details are at your fingertips, just by running the script on the command line after switching to your cluster’s context.</p>
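<p>The script itself is internal, but a first cut of the same inventory can be pulled with kubectl and jq (a sketch of the idea, not our actual tool):</p>

```shell
# One line per pod: namespace, name, phase, start time.
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | [.metadata.namespace, .metadata.name, .status.phase, .metadata.creationTimestamp]
  | @tsv'

# Node-level CPU/memory utilisation (needs the metrics API / metrics-server):
kubectl top nodes
```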
<p>As of now, operations like:</p>
<ul>
<li>upgrading the master nodes to a certain version</li>
<li>disabling surge upgrades/autoscaling the nodes</li>
<li>upgrading the node pool(s)</li>
<li>reenabling surge upgrades/autoscaling</li>
<li>setting maintenance window and release channel if already not set</li>
</ul>
<p>are all done via the CLI.</p>
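<p>Some of those items map directly to gcloud invocations; for example, setting the channel and a recurring window (cluster name, dates and recurrence below are placeholders):</p>

```shell
# Pin the cluster to the stable release channel.
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --release-channel stable

# Recurring weekend maintenance window (RFC 5545 style recurrence).
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --maintenance-window-start 2020-08-01T02:00:00Z \
  --maintenance-window-end 2020-08-01T06:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'
```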
<p>The next parts would be to automate the sequence in which these operations are done and codify the learnings and edge cases to the tool.</p>
<p>This is a bit mundane, no doubt, but this laundry has to be done; there’s no running away from it. Until the whole or at least a major chunk of it is automated, what we currently do in our team is rotate people through doing cluster upgrades.</p>
<p>One person gets added to the roster, while another person, already on the roster from the previous week, drives the upgrades for the week while giving context to the person who has just joined.</p>
<p>This helps with quick context sharing, and the person who has just joined gets to upgrade the clusters by following the playbooks, filling their gaps as we go forward.</p>
<p>The important part is that you always come out of the week with something improved, some automation added, some docs written, while also explicitly allocating dev time for automation in your sprint.</p>
<h2 id="ending-notes">Ending notes</h2>
<p>All in all, GKE gives us a really stable base, which in turn allows us to focus on building the platform on top of it rather than managing the underlying system, and on improving developer productivity by building tooling on top of the primitives k8s gives you.</p>
<p>Compare this to running your own k8s cluster on top of VMs: there is a massive overhead of managing/upgrading/replacing the components and nodes of a self managed cluster, which at times requires dedicated folks to hand-hold it.</p>
<p>So if you really have the liberty, a managed solution is the way to go. Take this from someone who has managed self hosted k8s clusters in prod: I will be honest, it’s definitely not easy, and if you can, it should be delegated so you can focus on other problems.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters">https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters</a></li>
<li><a href="https://github.com/kubernetes/kubernetes/releases">https://github.com/kubernetes/kubernetes/releases</a></li>
<li><a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/">https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/</a></li>
<li><a href="https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels">https://cloud.google.com/kubernetes-engine/docs/concepts/release-channels</a></li>
<li><a href="https://cloud.google.com/container-optimized-os/docs">https://cloud.google.com/container-optimized-os/docs</a></li>
<li><a href="https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent">https://cloud.google.com/kubernetes-engine/docs/how-to/ip-masquerade-agent</a></li>
<li><a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity">https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity</a></li>
<li><a href="https://www.tasdikrahman.com/2020/05/06/specifying-scheduling-rules-for-pods-with-podaffinity-and-podantiaffinity-on-kubernetes/">https://www.tasdikrahman.com/2020/05/06/specifying-scheduling-rules-for-pods-with-podaffinity-and-podantiaffinity-on-kubernetes/</a></li>
<li><a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">https://kubernetes.io/docs/concepts/workloads/pods/disruptions/</a></li>
<li><a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/</a></li>
<li><a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/">https://kubernetes.io/docs/tasks/run-application/configure-pdb/</a></li>
<li><a href="https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#downgrading_limitations">https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#downgrading_limitations</a></li>
</ul>
<h2 id="credits">Credits</h2>
<ul>
<li>Image credits to <a href="https://en.m.wikipedia.org/wiki/File:Chrome_Vanadium_Adjustable_Wrench.jpg">wikipedia</a> and <a href="https://kubernetes.io">kubernetes.io</a></li>
</ul>
Structured logging in Rails
2020-07-07T00:00:00+00:00
https://www.tasdikrahman.com/2020/07/07/structured-logging-in-rails
<blockquote>
<p>This post was originally published on <a href="https://blog.gojekengineering.com/structured-logging-in-rails-75e9a8c5370b">Gojek’s engineering blog, here</a>; this post is a cross-post of the same.</p>
</blockquote>
<p>If you are on Rails, you will have noticed that the logs you get by default are quite verbose and spread across multiple lines, even when the context is processing just one simple controller action.</p>
<p>In this post, I will discuss how one can sanitize these logs without losing information, along with how you can add extra fields to each log line to make full use of the querying features of your logging platform.</p>
<p>Setting up the mechanism to push the logs from your Rails app to your respective logging platform is not in scope of this post.</p>
<p>Let’s take the example of a simple health check controller which you would add just to make the health check pass for your Rails app deployed on Kubernetes.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># app/controllers/health_check_controller.rb</span>
<span class="k">class</span> <span class="nc">HealthCheckController</span> <span class="o"><</span> <span class="no">ActionController</span><span class="o">::</span><span class="no">Base</span>
<span class="k">def</span> <span class="nf">ping</span>
<span class="n">render</span> <span class="ss">json: </span><span class="p">{</span> <span class="ss">success: </span><span class="kp">true</span><span class="p">,</span> <span class="ss">errors: </span><span class="kp">nil</span><span class="p">,</span> <span class="ss">data: </span><span class="s1">'pong'</span> <span class="p">},</span> <span class="ss">status: :ok</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>and for the config (shown for the development environment in this example)</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config/environments/development.rb</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span>
<span class="n">config</span><span class="p">.</span><span class="nf">log_tags</span> <span class="o">=</span> <span class="p">[</span><span class="ss">:request_id</span><span class="p">]</span>
<span class="n">config</span><span class="p">.</span><span class="nf">log_level</span> <span class="o">=</span> <span class="ss">:debug</span>
<span class="k">end</span>
</code></pre></div></div>
<p>A simple route for the <code class="language-plaintext highlighter-rouge">GET</code> verb, to call the <code class="language-plaintext highlighter-rouge">#ping</code> action in the <code class="language-plaintext highlighter-rouge">HealthCheckController</code></p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config/routes.rb</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">routes</span><span class="p">.</span><span class="nf">draw</span> <span class="k">do</span>
<span class="n">get</span> <span class="s1">'/ping'</span><span class="p">,</span> <span class="ss">to: </span><span class="s1">'health_check#ping'</span>
<span class="k">end</span>
</code></pre></div></div>
<p>This is what the logs would look like for the route <code class="language-plaintext highlighter-rouge">ping</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[my-app-fbf8d7bfc-wk5cd] I, [2020-07-01T07:19:05.007174 #1] INFO -- : [ec0ad0ba-dfb2-419b-bd81-5feb7dacb308] Processing by HealthCheckController#ping as HTML
[my-app-fbf8d7bfc-wk5cd] I, [2020-07-01T07:19:05.007874 #1] INFO -- : [ec0ad0ba-dfb2-419b-bd81-5feb7dacb308] Completed 200 OK in 0ms (Views: 0.2ms)
[my-app-fbf8d7bfc-wk5cd] I, [2020-07-01T07:19:05.290929 #1] INFO -- : [86332306-62e4-412e-a690-eee8253ab1c8] Started GET "/ping" for 10.177.3.1 at 2020-07-01 07:19:05 +0000
[my-app-fbf8d7bfc-wk5cd] I, [2020-07-01T07:19:05.292363 #1] INFO -- : [86332306-62e4-412e-a690-eee8253ab1c8] Processing by HealthCheckController#ping as HTML
</code></pre></div></div>
<p>You can notice a few things here,</p>
<ul>
<li>the logs are spread over multiple lines, adding to the difficulty in debugging the whole request/response flow.
<ul>
<li>given you would be pushing to your logging platform, let’s say EFK, which would allow you to do full-text search, the configuration to include the request id in each log line would come in handy. A little better than not having anything at all. (Have a look at <a href="https://github.com/BaritoLog">BaritoLog</a>, our in-house EFK platform, if you haven’t.)</li>
<li>if you are not pushing to any logging platform, then you would be debugging this by tailing the logs of the Rails app somewhere: if deployed to Kubernetes, doing a <a href="https://github.com/johanhaleby/kubetail">kubetail</a>, or if on VMs, sitting inside the VM and tailing the logs of each and every application server instance. The first step here, then, would obviously be to have centralized logging.</li>
</ul>
</li>
<li>extremely verbose, which hinders debugging.</li>
<li>no clear way to parse the logs, as the log format is non-standard (assuming you hadn’t added <code class="language-plaintext highlighter-rouge">config.log_tags = [:request_id]</code>, which is not present by default)</li>
</ul>
<h2 id="ways-to-improve-this">Ways to improve this</h2>
<p>Taking things one at a time, you can compact the logs so that they stay meaningful rather than verbose, by using something like <a href="https://github.com/roidrage/lograge">lograge</a>. There are a bunch of alternatives, like <a href="https://logger.rocketjob.io/">https://logger.rocketjob.io/</a> and <a href="https://github.com/shadabahmed/logstasher">https://github.com/shadabahmed/logstasher</a>, but I went ahead with lograge since it has been around for some time now and has larger adoption. Our use case was just to inject certain fields into our logs along with JSON-formatted logging, and lograge worked very well for that.</p>
<p>Add lograge to your <code class="language-plaintext highlighter-rouge">Gemfile</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
gem 'lograge', '~> 0.11.2'
...
</code></pre></div></div>
<p>and run <code class="language-plaintext highlighter-rouge">$ bundle</code>, after which you need to set up the configuration as follows</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config/environments/development.rb</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span>
<span class="n">config</span><span class="p">.</span><span class="nf">lograge</span><span class="p">.</span><span class="nf">formatter</span> <span class="o">=</span> <span class="no">Lograge</span><span class="o">::</span><span class="no">Formatters</span><span class="o">::</span><span class="no">Json</span><span class="p">.</span><span class="nf">new</span>
<span class="n">config</span><span class="p">.</span><span class="nf">lograge</span><span class="p">.</span><span class="nf">enabled</span> <span class="o">=</span> <span class="kp">true</span>
<span class="n">config</span><span class="p">.</span><span class="nf">lograge</span><span class="p">.</span><span class="nf">base_controller_class</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'ActionController::Base'</span><span class="p">]</span>
<span class="n">config</span><span class="p">.</span><span class="nf">lograge</span><span class="p">.</span><span class="nf">custom_options</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">event</span><span class="o">|</span>
<span class="p">{</span>
<span class="ss">request_time: </span><span class="no">Time</span><span class="p">.</span><span class="nf">now</span><span class="p">,</span>
<span class="ss">application: </span><span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">class</span><span class="p">.</span><span class="nf">parent_name</span><span class="p">,</span>
<span class="ss">process_id: </span><span class="no">Process</span><span class="p">.</span><span class="nf">pid</span><span class="p">,</span>
<span class="ss">host: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:host</span><span class="p">],</span>
<span class="ss">remote_ip: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:remote_ip</span><span class="p">],</span>
<span class="ss">ip: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:ip</span><span class="p">],</span>
<span class="ss">x_forwarded_for: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:x_forwarded_for</span><span class="p">],</span>
<span class="ss">params: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:params</span><span class="p">].</span><span class="nf">except</span><span class="p">(</span><span class="s1">'controller'</span><span class="p">,</span> <span class="s1">'action'</span><span class="p">).</span><span class="nf">to_json</span><span class="p">,</span>
<span class="ss">rails_env: </span><span class="no">Rails</span><span class="p">.</span><span class="nf">env</span><span class="p">,</span>
<span class="ss">exception: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:exception</span><span class="p">]</span><span class="o">&</span><span class="p">.</span><span class="nf">first</span><span class="p">,</span>
<span class="ss">request_id: </span><span class="n">event</span><span class="p">.</span><span class="nf">payload</span><span class="p">[</span><span class="ss">:headers</span><span class="p">][</span><span class="s1">'action_dispatch.request_id'</span><span class="p">],</span>
<span class="p">}.</span><span class="nf">compact</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">config.lograge.base_controller_class = ['ActionController::Base']</code>: this part assumes that each controller inherits from <code class="language-plaintext highlighter-rouge">ActionController::Base</code>; you can include any other base class whose controllers don’t inherit from the base controller, in order for lograge to pick them up.</p>
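<p>For example, if some of your controllers inherit from <code class="language-plaintext highlighter-rouge">ActionController::API</code> instead, a hedged sketch of the config fragment would be:</p>

```ruby
# config/environments/development.rb (fragment, hypothetical)
# List every base class your controllers inherit from,
# so lograge instruments all of them.
config.lograge.base_controller_class = ['ActionController::Base', 'ActionController::API']
```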
<p>along with</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ApplicationController</span> <span class="o"><</span> <span class="no">ActionController</span><span class="o">::</span><span class="no">Base</span>
<span class="n">protect_from_forgery</span> <span class="ss">with: </span><span class="n">exception</span>
<span class="k">def</span> <span class="nf">append_info_to_payload</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
<span class="k">super</span>
<span class="n">payload</span><span class="p">[</span><span class="ss">:host</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">host</span>
<span class="n">payload</span><span class="p">[</span><span class="ss">:remote_ip</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">remote_ip</span>
<span class="n">payload</span><span class="p">[</span><span class="ss">:ip</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">ip</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>now if you do a</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -I localhost:3000/ping | grep -i "request"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
X-Request-Id: 4967a677-ab86-4a10-8a01-ea520951cf46
</code></pre></div></div>
<p>and check the logs of your rails app</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"INFO"</span><span class="p">,</span><span class="nl">"progname"</span><span class="p">:</span><span class="kc">null</span><span class="p">,</span><span class="nl">"message"</span><span class="p">:</span><span class="s2">"{</span><span class="se">\"</span><span class="s2">method</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">GET</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">path</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">/ping</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">format</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">*/*</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">controller</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">HealthCheckController</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">action</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">ping</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">status</span><span class="se">\"</span><span class="s2">:200,</span><span class="se">\"</span><span class="s2">duration</span><span class="se">\"</span><span class="s2">:8.36,</span><span class="se">\"</span><span class="s2">view</span><span class="se">\"</span><span class="s2">:0.22,</span><span class="se">\"</span><span class="s2">db</span><span class="se">\"</span><span class="s2">:0.0,</span><span class="se">\"</span><span 
class="s2">request_time</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">2020-07-07 16:10:17 +0530</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">application</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">MyApplication</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">process_id</span><span class="se">\"</span><span class="s2">:7869,</span><span class="se">\"</span><span class="s2">params</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">{}</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">rails_env</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">development</span><span class="se">\"</span><span class="s2">,</span><span class="se">\"</span><span class="s2">request_id</span><span class="se">\"</span><span class="s2">:</span><span class="se">\"</span><span class="s2">4967a677-ab86-4a10-8a01-ea520951cf46</span><span class="se">\"</span><span class="s2">}"</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Now the whole log line captures a lot of metadata you can use while debugging. You can also notice that one of the keys present in the log line is the <code class="language-plaintext highlighter-rouge">request_id</code>: this gives you a unique trace id for your particular request, which you can feed into a full-text search on your logging platform, if it supports one.</p>
<p>To extend this further, you can capture the <code class="language-plaintext highlighter-rouge">request_id</code> and pass it along each controller’s flow; to capture it, simply call <code class="language-plaintext highlighter-rouge">request.request_id</code>. Now that you have the <code class="language-plaintext highlighter-rouge">request_id</code>, you can include it whenever you call <code class="language-plaintext highlighter-rouge">Rails.logger.{info|debug}</code>, so that the request ids generated are also propagated to the custom logs your application adds. The biggest advantage: once every <code class="language-plaintext highlighter-rouge">Rails.logger.{info|debug}</code> call in your core flows logs this <code class="language-plaintext highlighter-rouge">request_id</code>, you can just put the <code class="language-plaintext highlighter-rouge">request_id</code> in the search param of your logging platform, and it will return every log line which carries that key.</p>
<p>You would also be able to capture additional information like the host, remote_ip, and ip of the Rails app servicing the request, by simply calling <code class="language-plaintext highlighter-rouge">request.host</code>, <code class="language-plaintext highlighter-rouge">request.remote_ip</code>, <code class="language-plaintext highlighter-rouge">request.ip</code>, and <code class="language-plaintext highlighter-rouge">request.request_id</code>.</p>
<h3 id="why-not-use-lograge-for-even-the-custom-logger">Why not use lograge for even the custom logger?</h3>
<p>Lograge doesn’t support <a href="https://github.com/roidrage/lograge/issues/136">this</a>.</p>
<h3 id="what-should-i-do-now-to-add-structured-logging-for-my-custom-logs">What should I do now to add structured logging for my custom logs</h3>
<p>You can add a custom logger for your application and initialize it in your application config for the environments wherever you want it.</p>
<p>The only thing you need to add is</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># app/logger/log_formatter.rb</span>
<span class="k">class</span> <span class="nc">LogFormatter</span> <span class="o"><</span> <span class="o">::</span><span class="no">Logger</span><span class="o">::</span><span class="no">Formatter</span>
<span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="n">severity</span><span class="p">,</span> <span class="n">time</span><span class="p">,</span> <span class="n">program_name</span><span class="p">,</span> <span class="n">message</span><span class="p">)</span>
<span class="n">message</span> <span class="o">=</span> <span class="s1">''</span> <span class="k">if</span> <span class="n">message</span><span class="p">.</span><span class="nf">blank?</span>
<span class="n">severity</span> <span class="o">=</span> <span class="s1">''</span> <span class="k">if</span> <span class="n">severity</span><span class="p">.</span><span class="nf">blank?</span>
<span class="p">{</span>
<span class="ss">level: </span><span class="n">severity</span><span class="p">,</span>
<span class="ss">progname: </span><span class="n">program_name</span><span class="p">,</span>
<span class="ss">message: </span><span class="n">message</span><span class="p">,</span>
<span class="p">}.</span><span class="nf">to_json</span> <span class="o">+</span> <span class="s2">"</span><span class="se">\r\n</span><span class="s2">"</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>initialize it in the application config (development in this example)</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config/environments/development.rb</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">application</span><span class="p">.</span><span class="nf">configure</span> <span class="k">do</span>
<span class="n">config</span><span class="p">.</span><span class="nf">log_formatter</span> <span class="o">=</span> <span class="no">LogFormatter</span><span class="p">.</span><span class="nf">new</span>
<span class="k">end</span>
</code></pre></div></div>
<p>and just to test the above out</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># app/controllers/health_check_controller.rb</span>
<span class="k">class</span> <span class="nc">HealthCheckController</span> <span class="o"><</span> <span class="no">ActionController</span><span class="o">::</span><span class="no">Base</span>
<span class="k">def</span> <span class="nf">ping</span>
<span class="no">Rails</span><span class="p">.</span><span class="nf">logger</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="s2">"bazbar, request_id: </span><span class="si">#{</span><span class="n">request</span><span class="p">.</span><span class="nf">request_id</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">render</span> <span class="ss">json: </span><span class="p">{</span> <span class="ss">success: </span><span class="kp">true</span><span class="p">,</span> <span class="ss">errors: </span><span class="kp">nil</span><span class="p">,</span> <span class="ss">data: </span><span class="s1">'pong'</span> <span class="p">},</span> <span class="ss">status: :ok</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
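<p>If you want to see the formatter’s behaviour outside a Rails app, here is a minimal plain-Ruby sketch using the stdlib <code class="language-plaintext highlighter-rouge">Logger</code> (note that <code class="language-plaintext highlighter-rouge">blank?</code> comes from ActiveSupport, so plain string coercion is used here; the class name is hypothetical):</p>

```ruby
require 'json'
require 'logger'

# Plain-Ruby version of the formatter above, runnable without Rails.
class PlainLogFormatter < ::Logger::Formatter
  def call(severity, _time, program_name, message)
    {
      level: severity,
      progname: program_name,
      message: message.to_s,
    }.to_json + "\r\n"
  end
end

logger = Logger.new($stdout)
logger.formatter = PlainLogFormatter.new
logger.info('bazbar, request_id: 4967a677-ab86-4a10-8a01-ea520951cf46')
# prints: {"level":"INFO","progname":null,"message":"bazbar, request_id: 4967a677-ab86-4a10-8a01-ea520951cf46"}
```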
<h2 id="how-will-my-logs-look-like-after-this">What will my logs look like after this?</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{"level":"INFO","progname":null,"message":"bazbar, request_id: 4967a677-ab86-4a10-8a01-ea520951cqw3"}
{"level":"INFO","progname":null,"message":"{\"method\":\"GET\",\"path\":\"/ping\",\"format\":\"*/*\",\"controller\":\"HealthCheckController\",\"action\":\"ping\",\"status\":200,\"duration\":8.36,\"view\":0.22,\"db\":0.0,\"request_time\":\"2020-07-07 16:10:17 +0530\",\"application\":\"MyApplication\",\"process_id\":7869,\"params\":\"{}\",\"rails_env\":\"development\",\"request_id\":\"4967a677-ab86-4a10-8a01-ea520951cqw3\"}"}
</code></pre></div></div>
<p>As you can see, the logs from both</p>
<ul>
<li>the controller logs which lograge emits</li>
<li>the logs which <code class="language-plaintext highlighter-rouge">Rails.logger.info</code> emits</li>
</ul>
<p>now have</p>
<ul>
<li>a JSON format</li>
<li>the <code class="language-plaintext highlighter-rouge">request_id</code> appended to the message key</li>
</ul>
<p>That’s all folks</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://github.com/roidrage/lograge">https://github.com/roidrage/lograge</a></li>
<li><a href="https://www.paperplanes.de/2012/3/14/on-notifications-logsubscribers-and-bringing-sanity-to-rails-logging.html">https://www.paperplanes.de/2012/3/14/on-notifications-logsubscribers-and-bringing-sanity-to-rails-logging.html</a></li>
<li><a href="https://github.com/BaritoLog">https://github.com/BaritoLog</a></li>
<li><a href="https://stripe.com/blog/canonical-log-lines">https://stripe.com/blog/canonical-log-lines</a></li>
<li><a href="https://github.com/roidrage/lograge/issues/136">https://github.com/roidrage/lograge/issues/136</a></li>
<li><a href="https://medium.com/better-programming/ruby-on-rails-single-line-logging-5a76852de1d2/">https://medium.com/better-programming/ruby-on-rails-single-line-logging-5a76852de1d2/</a></li>
</ul>
<h2 id="credits">Credits</h2>
<ul>
<li>Image credits <a href="https://pixabay.com/photos/logs-timber-wood-logging-lumber-690888/">https://pixabay.com/photos/logs-timber-wood-logging-lumber-690888/</a> and <a href="https://en.wikipedia.org/wiki/Ruby_on_Rails">https://en.wikipedia.org/wiki/Ruby_on_Rails</a></li>
</ul>
Our learnings from Istio’s networking APIs while running it in production
2020-06-17T00:00:00+00:00
https://www.tasdikrahman.com/2020/06/17/our-learnings-from-istios-networking-apis-while-running-it-in-production
<blockquote>
<p>This was originally published by me on <a href="https://blog.gojekengineering.com/our-learnings-from-istios-networking-apis-while-running-it-in-production-74704979107d">Gojek’s engineering blog</a>; this post is a repost.</p>
</blockquote>
<p>We at Gojek have been running <a href="https://istio.io/">Istio</a> 1.4 with a multi-cluster setup for some time now, on top of which, we have been piloting a few reasonably high throughput services in production, serving customer-facing traffic.</p>
<blockquote>
<p>One of these services hits ~195k requests/minute.</p>
</blockquote>
<p>In this blog, we’ll deep dive into what we have learnt and observed by using Istio’s networking APIs.</p>
<h3 id="how-we-do-what-we-do">How we do what we do</h3>
<p>To help visualise the process better, let’s consider a workload that can be thought of as a logical unit (VMs, k8s pods, etc.) which is the source of traffic. A <strong>workload</strong> comprises a service and an Envoy proxy sidecar. Simply put, 2 workloads would comprise 2 sets of service + proxy.</p>
<p>A <strong>service</strong> inside this workload is something present in the service registry, which is addressable over the network. Services define a name, which is typically a valid DNS hostname, a set of labeled network endpoints, ports, and protocols. The service registry could be the k8s service registry/consul, etc.</p>
<p>A <strong>gateway</strong> is a proxy that receives traffic on specific ports; it can be a logical or a physical proxy in the network that defines L3-L6 behaviours. The sidecar (in this case, the Istio proxy) is also a gateway in that sense. Similar to workloads, a gateway also represents a source of traffic.</p>
<p>If you’re on a cloud provider like GCP, you can create a service object of type LoadBalancer for the ingress gateway, and it’s not a bad idea to have a static IP for this.</p>
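<p>A minimal sketch of such a Service object (the name, namespace, selector, and IP here are all hypothetical, and <code class="language-plaintext highlighter-rouge">loadBalancerIP</code> assumes the static IP has already been reserved with the cloud provider):</p>

```yaml
# Hypothetical ingress gateway Service with a pre-reserved static IP
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10   # reserved static IP
  selector:
    istio: ingressgateway
  ports:
  - name: http2
    port: 80
  - name: https
    port: 443
```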
<p>With the above in picture, the networking APIs of Istio which we have dabbled with are <a href="https://istio.io/docs/reference/config/networking/virtual-service/">VirtualService</a>, <a href="https://istio.io/docs/reference/config/networking/destination-rule/">DestinationRule</a>, and <a href="https://istio.io/docs/reference/config/networking/gateway/">Gateway</a>.</p>
<h3 id="making-use-of-these-networking-apis">Making use of these networking APIs</h3>
<p>Here are a few combinations which can be used to achieve respective results:</p>
<ul>
<li>Ingress (Gateway + VirtualService)</li>
<li>Traffic splitting, TCP routing (VirtualService)</li>
<li>Canarying, Blue-green (VirtualService + DestinationRule)</li>
<li>Loadbalancing Config (DestinationRule)</li>
<li>Egress to external Services (ServiceEntry)</li>
</ul>
<p>One setup which we use is to create a default gateway for the cluster, which would handle all Ingress traffic for the services in the cluster. It’s also possible to have multiple gateways.</p>
<p>If one wants Ingress and nothing else, defining a VirtualService object for the service enables routing traffic to it, given the A record pointing to the static IP of the gateway LB has already been created.</p>
<p>For capabilities like traffic shaping, an extra object of kind DestinationRule has to be defined. This allows us to specify multiple subsets, which in turn are referenced in the VirtualService spec to specify the weights for the different versions of the service. Traffic is then routed based on the desired weights and rules defined in the VirtualService spec.</p>
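<p>As an illustration, a weighted split across two subsets could be sketched like this (the service name, hostnames, gateway name, labels, and weights are all hypothetical):</p>

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service.default.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service.example.com
  gateways:
  - default-gateway          # the cluster-wide ingress gateway described above
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
        subset: v1
      weight: 90
    - destination:
        host: my-service.default.svc.cluster.local
        subset: v2
      weight: 10
```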
<h3 id="how-does-traffic-get-shaped">How does traffic get shaped?</h3>
<p>Traffic shaping happens in this form</p>
<center><img src="/content/images/2020/06/traffic-flow-1.png" /></center>
<p>although it may appear as below</p>
<center><img src="/content/images/2020/06/traffic-flow-2.png" /></center>
<p>When the traffic hits the gateway pods, the envoy proxy sidecar attached to the gateway would already know the pod IPs of the destination service. In simple words, the VirtualService and DestinationRules spec get translated to the envoy config.</p>
<h3 id="but-what-about-the-orchestration-of-crud-operations-of-these-resources">But… What about the orchestration of CRUD operations of these resources?</h3>
<p>Looking at this from an orchestration perspective, if one is to have zero disruptions in the requests being serviced by the end service, the ordering of the creation of these resources also matters.</p>
<p>For starters, orchestrating the CRUD of these resources can happen by putting them behind different CI pipelines; using <a href="https://helm.sh/">Helm</a> would also work.</p>
<p>We use the <a href="https://github.com/istio/client-go">istio/client-go</a> to manage the lifecycle of the Istio resources, using our in-house deployment toolchain, powered by our service called Normandy (more on this in another blog post). This gets orchestrated by another service, which is our internal service registry, where we also store our service-related information and cluster mapping, along with other relevant information.</p>
<p>A big advantage of this approach of not letting a developer orchestrate the rules themselves is that we directly remove any room for errors that would occur because of changes introduced manually. Leaving this to the tool in hand (Normandy and our service registry, in this case) allows us to put our logic inside a tool. This enables us to standardize the way we do CRUD over the networking APIs of Istio, as well as the Kubernetes objects. Hence, any bugs we find make their way back into the tool, standardizing and enhancing it further.</p>
<p>If only a VirtualService is being used to direct traffic to the service, the orchestration is pretty straightforward:</p>
<blockquote>
<p>Create the k8s service object and deployment object for the service, after which the VirtualService object is created. No changes to the gateway.</p>
</blockquote>
<p>The orchestration for deployment for a service would be:</p>
<blockquote>
<p>Deployment object → VirtualService update/create</p>
</blockquote>
<p>While using VirtualService + DestinationRule, the orchestration would be to:</p>
<blockquote>
<p>Do the CRUD of DestinationRule first, and then the CRUD of VirtualService. The above ordering is important.</p>
</blockquote>
<p>When there is already a VirtualService and DestinationRule existing for the service, and if the VirtualService is updated first, the subset being pointed to would not be present in the DestinationRule, causing 5xx errors till the DestinationRule is updated.</p>
<p>Everything seems in place now. There’s one subset in each VirtualService and DestinationRule for the service and things are working as usual without any 5xx errors while the orchestration is being done. But there are some unanswered questions.</p>
<blockquote>
<p>How does the above handle the case when there is more than one subset, for instance while doing some sort of traffic shaping? How is a zero-downtime deployment achieved?</p>
</blockquote>
<p>We didn’t have a use case with more than two subsets, as at any given point, we would only need to shape traffic to 2 versions of the service, but having more than 2 subsets is definitely possible.</p>
<p>Simply generating a new set of DestinationRule and VirtualService, and updating the existing ones, in that order, is an approach that would work. However, this would cause timeouts for the service for a very brief period; depending on the load, one might not notice anything, or one might encounter 5xx errors.</p>
<p>In a high throughput service, this definitely is a problem. More than that, the very process of a deployment causing 5xx’s feels wrong and seems like the system is not behaving the way it should. In such cases, it is the orchestration layer causing the problem.</p>
<p>Here’s an approach that would work:</p>
<p>1️⃣ orchestrateDestinationRule ⇨ 2️⃣ orchestrateVirtualService ⇨
3️⃣ cleanupDestinationRule</p>
<p>Step 1: Append the newer subset to the DestinationRule without removing the older subset. In this case, one will be appending the labels which distinguish the newer deployment object for the newer version of the service.</p>
<p>Step 2: Update the VirtualService with the newer subset.</p>
<p>Step 3: Clean up the subsets in the DestinationRule which are not required anymore.</p>
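The three steps could be sketched as follows. This is purely illustrative Python operating on plain dicts that stand in for the DestinationRule and VirtualService specs; the real orchestration goes through istio/client-go, and the names here are hypothetical:

```python
def orchestrate_release(dr, vs, new_subset):
    """Swap a VirtualService to a new subset without dropping traffic.

    `dr` and `vs` are plain dicts standing in for the DestinationRule
    and VirtualService specs, e.g. dr = {"subsets": ["foo-123"]} and
    vs = {"route": [{"subset": "foo-123", "weight": 100}]}.
    """
    # Step 1: append the new subset to the DestinationRule, keeping the
    # old one so routes referencing it stay valid during the switch.
    if new_subset not in dr["subsets"]:
        dr["subsets"].append(new_subset)

    # Step 2: point the VirtualService at the new subset.
    vs["route"] = [{"subset": new_subset, "weight": 100}]

    # Step 3: clean up subsets no longer referenced by the VirtualService.
    referenced = {r["subset"] for r in vs["route"]}
    dr["subsets"] = [s for s in dr["subsets"] if s in referenced]
    return dr, vs
```

Running this against a service on subset foo-123 with a new subset foo-456 leaves only foo-456 in both objects, with the DestinationRule never missing a subset that the VirtualService points at.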
<p>Remember there’s a good chance one might hit the case of 5xx again even if the above steps have been followed, which is explained <a href="https://istio.io/docs/ops/best-practices/traffic-management/#avoid-503-errors-while-reconfiguring-service-routes">here</a>.</p>
<h3 id="why-does-the-5xx-arise">Why does the 5xx arise?</h3>
<p>The Envoy proxy sidecars need a small amount of time to receive the updated DestinationRule and VirtualService config (in the case above) from Pilot and sync it. When requests come in, chances are the Envoy sidecar of the service pod has not been updated yet, so some of the proxy sidecars have the correct updated config while others still hold the old one.</p>
<h3 id="how-to-tackle-this">How to tackle this?</h3>
<p>A small hack would be to add a small sleep for ‘x’ amount of time while orchestrating, which would look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>orchestrateDestinationRule ⇨ sleep(x) ⇨ orchestrateVirtualService ⇨ sleep(x) ⇨ cleanupDestinationRule
</code></pre></div></div>
<h3 id="how-do-we-arrive-at-the-x-duration">How do we arrive at the ‘x’ duration?</h3>
<p>The exposed metric <code class="language-plaintext highlighter-rouge">pilot_proxy_convergence_time</code> can be tracked to see how long, on average over a period of time, it takes to sync the updated configs to each proxy sidecar. If the value appears to be on the higher side, try scaling out the pilot replicas horizontally and check if that number comes down.</p>
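Since the metric is exposed as a Prometheus histogram, one way to derive ‘x’ is to look at a high percentile of it over a recent window. A sketch of doing that via the Prometheus HTTP API (the Prometheus URL here is an assumption; adjust the quantile and window for your setup):

```python
import json
import urllib.parse
import urllib.request

# p99 of proxy config convergence over the last 5 minutes; 'x' can then
# be set slightly above this value.
QUERY = ("histogram_quantile(0.99, "
         "sum(rate(pilot_proxy_convergence_time_bucket[5m])) by (le))")


def p99_convergence_seconds(prom_url="http://prometheus:9090"):
    """Run QUERY as a Prometheus instant query and return its value."""
    url = prom_url + "/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url) as resp:  # hypothetical Prometheus URL
        body = json.load(resp)
    # Instant queries return a vector of {"metric": ..., "value": [ts, "v"]}.
    return float(body["data"]["result"][0]["value"][1])
```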
<p>This is a hacky solution, but it currently works on 1.4, as there is no other way to do this with istio/client-go. Ideally, we would watch the status of objects like DestinationRule and VirtualService, and check that they have been propagated to each proxy before moving on to orchestrating the next set of resources.</p>
<p>There is an option in the CLI to do a <code class="language-plaintext highlighter-rouge">istioctl experimental wait</code>, which blocks until the specified resource has been distributed.</p>
<p>For later versions of Istio, the Istio object status would carry the information on whether the object has been reconciled with every proxy or not, which would help someone using the client-go to poll on this and proceed only once the resource has been propagated everywhere. This would be present in 1.6 directly, which also comes with a single binary for the control plane of Istio.</p>
<p>There is an issue tracking this too. Find it <a href="https://github.com/istio/istio/issues/23956">here</a>.</p>
<p>Keeping the above discussion around orchestration in mind, the following cases would arise (accompanying the pseudocode) if you are:</p>
<ul>
<li>Trying to do normal deployments.</li>
<li>Doing Canary deployments with one version set as primary and the other version as candidate version. Note that in this case, as an identifier, we will use a deployment ID to distinguish between the two versions.</li>
</ul>
<p>For this example, we will assume that the service is named foo; below are the DestinationRule and VirtualService specs with 2 subsets:</p>
<script src="https://gist.github.com/tasdikrahman/b30416bc8b6e521f016cd1209b5f5d43.js"></script>
<script src="https://gist.github.com/tasdikrahman/b20bb925b0a9977859dab861f13a2d38.js"></script>
<p><strong>Note</strong>: In the pseudocode ahead in the cases, the list object inside the DestinationRuleSpec and the VirtualService would represent the subset values, instead of having the whole spec presented, for the sake of simplicity of explanation.</p>
<h4 id="case-1">Case 1</h4>
<p>When it is a normal deployment, where the deployment, DestinationRule, and VirtualService for the resource don’t exist.</p>
<h5 id="existing-resources">Existing resources:</h5>
<ul>
<li>Deployment Object: Not present</li>
<li>DestinationRule: Not present</li>
<li>VirtualService: Not present</li>
</ul>
<h5 id="orchestration-required-in-this-order">Orchestration required (in this order):</h5>
<ul>
<li>Create Deployment: foo-123</li>
<li>Create DestinationRule: [deployment-id: 123]</li>
<li>Create VirtualService: [subset: foo-123, weight: 100]</li>
</ul>
<h4 id="case-2">Case 2</h4>
<p>When it is a normal deployment, where the deployment, DestinationRule, and VirtualService for the resource already exist.</p>
<h5 id="existing-resources-1">Existing resources:</h5>
<ul>
<li>Deployment Object: foo-123</li>
<li>DestinationRule: [deployment-id:123]</li>
<li>VirtualService: [subset: foo-123, weight: 100]</li>
</ul>
<h5 id="orchestration-required-in-this-order-1">Orchestration required (in this order):</h5>
<ul>
<li>Create — Deployment — foo-456</li>
<li>Update — DestinationRule — [deployment-id:123, deployment-id:456]</li>
<li>Update — VirtualService — [subset: 456, weight: 100]</li>
<li>Update — DestinationRule — [deployment-id: 456]</li>
<li>Delete — Deployment — foo-123</li>
</ul>
<h4 id="case-3">Case 3</h4>
<p>When it is a Canary deployment.</p>
<h5 id="existing-resources-2">Existing resources:</h5>
<ul>
<li>Deployment Object: foo-123</li>
<li>DestinationRule: [deployment-id: 123]</li>
<li>VirtualService: [subset: 123, weight: 100]</li>
</ul>
<h5 id="orchestration-required-in-this-order-2">Orchestration required (in this order):</h5>
<ul>
<li>Create — Deployment — foo-456</li>
<li>Update — DestinationRule — [deployment-id: 123, deployment-id: 456]</li>
<li>Update — VirtualService — [{subset: 123, weight: 90}, {subset: 456, weight: 10}]</li>
</ul>
<h4 id="case-4">Case 4</h4>
<p>When we are promoting a Canary deployment, with 456 being promoted.</p>
<h5 id="existing-resources-3">Existing resources:</h5>
<ul>
<li>Deployment — foo-123, foo-456</li>
<li>DestinationRule: [deployment-id:123, deployment-id:456]</li>
<li>VirtualService: [{subset: 123, weight: 90}, {subset: 456, weight: 10}]</li>
</ul>
<h5 id="orchestration-required-in-this-order-3">Orchestration required (in this order):</h5>
<ul>
<li>Update — VirtualService — [subset: 456, weight: 100]</li>
<li>Update — DestinationRule — [deployment-id: 456]</li>
<li>Delete — Deployment — [foo-123]</li>
</ul>
<h4 id="case-5">Case 5</h4>
<p>When we are rolling back a Canary deployment, with 456 being rolled back.</p>
<h5 id="existing-resources-4">Existing resources:</h5>
<ul>
<li>Deployment — foo-123, foo-456</li>
<li>DestinationRule: [deployment-id: 123, deployment-id: 456]</li>
<li>VirtualService: [{subset: 123, weight: 90}, {subset: 456, weight: 10}]</li>
</ul>
<h5 id="orchestration-required-in-this-order-4">Orchestration required (in this order):</h5>
<ul>
<li>Update — VirtualService — [subset: 123, weight: 100]</li>
<li>Update — DestinationRule — [deployment-id:123]</li>
<li>Delete — Deployment — [foo-456]</li>
</ul>
<h4 id="case-6">Case 6</h4>
<p>When we are increasing the weight of Canary deployment, with 456 being rolled to a higher percentage (Example: 20%).</p>
<h5 id="existing-resources-5">Existing resources:</h5>
<ul>
<li>Deployment — foo-123, foo-456</li>
<li>DestinationRule: [deployment-id:123, deployment-id: 456]</li>
<li>VirtualService: [{subset: 123, weight: 90},{subset: 456, weight: 10}]</li>
</ul>
<h5 id="orchestration-required-in-this-order-5">Orchestration required (in this order):</h5>
<ul>
<li>Update: VirtualService: [{subset: 123, weight: 80}, {subset: 456, weight: 20}]</li>
</ul>
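The weight update in Case 6 is the simplest of the operations. As an illustrative sketch on plain dicts (hypothetical names, not the actual tooling), assuming only two subsets are live at a time as described earlier:

```python
def set_canary_weight(routes, candidate, pct):
    """Give `candidate` pct% of traffic; the other subset gets the rest.

    `routes` mirrors the VirtualService route list, e.g.
    [{"subset": "123", "weight": 90}, {"subset": "456", "weight": 10}].
    Assumes exactly two live subsets, as in the cases above.
    """
    assert 0 <= pct <= 100
    for route in routes:
        route["weight"] = pct if route["subset"] == candidate else 100 - pct
    return routes
```

Calling it with candidate <code>456</code> and <code>pct=20</code> turns the 90/10 split above into 80/20, matching the orchestration in Case 6.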
<p>One thing to point out is that, after the CRUD operations on DestinationRule and VirtualService, one needs to add the wait time discussed above in case v1.4 of Istio is being run. From v1.6, whether the resource has reconciled on each and every proxy can be polled directly on the resource object, as described above.</p>
<p>This translates to polling <code class="language-plaintext highlighter-rouge">status.conditions[0].status</code> directly on the VirtualService and DestinationRule resource objects, till we get the value True for the type <code class="language-plaintext highlighter-rouge">Reconcile</code>.</p>
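Such a polling loop could look like the following sketch, where <code>get_status</code> is a hypothetical callable that fetches the object’s status stanza (via client-go, or something like <code>kubectl get -o json</code>):

```python
import time


def wait_until_reconciled(get_status, timeout=60.0, interval=2.0):
    """Poll status.conditions[0].status until it reports "True".

    `get_status` is a hypothetical fetcher returning the resource's
    `status` field as a dict. Raises TimeoutError if the config is not
    propagated to every proxy within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        conditions = status.get("conditions") or []
        if conditions and conditions[0].get("status") == "True":
            return True
        time.sleep(interval)
    raise TimeoutError("resource not reconciled on all proxies in time")
```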
<p>That’s all for now!</p>
<h3 id="references">References</h3>
<ul>
<li>The workload example is borrowed from the presentation given by Shriram Rajagopalan <a href="https://docs.google.com/presentation/d/1gtYNr8te2qeGgWrtH4S073JHmpbkKUkDHWvuLWTSenE/">here</a></li>
<li><a href="https://istio.io/docs/ops/best-practices/traffic-management/#avoid-503-errors-while-reconfiguring-service-routes">https://istio.io/docs/ops/best-practices/traffic-management/#avoid-503-errors-while-reconfiguring-service-routes</a></li>
<li><a href="https://github.com/istio/istio/issues/23956">https://github.com/istio/istio/issues/23956</a></li>
<li><a href="https://twitter.com/tasdikrahman/status/1267357405181968384">https://twitter.com/tasdikrahman/status/1267357405181968384</a></li>
<li><a href="https://istio.io/docs/reference/config/networking/virtual-service/">https://istio.io/docs/reference/config/networking/virtual-service/</a></li>
<li><a href="https://istio.io/docs/reference/config/networking/destination-rule/">https://istio.io/docs/reference/config/networking/destination-rule/</a></li>
<li><a href="https://istio.io/docs/reference/config/networking/gateway/">https://istio.io/docs/reference/config/networking/gateway/</a></li>
</ul>
AddTrust Root expiration fix
2020-05-31T00:00:00+00:00
https://www.tasdikrahman.com/2020/05/31/addtrust-root-expiration-fix
<p>With the root cert expiring for <a href="https://support.sectigo.com/articles/Knowledge/Sectigo-AddTrust-External-CA-Root-Expiring-May-30-2020">Sectigo</a>, older Linux distributions are not properly ignoring the expired cert in the chain.</p>
<p>I have seen this affect boxes running Ubuntu 16.04, but there would be others too. I didn’t notice anything on Debian 10 (buster).</p>
<p>As people have pointed out <a href="https://www.reddit.com/r/linux/comments/gshh70/sectigo_root_ca_expiring_may_not_be_handled_well/">around</a>,
this is an OpenSSL 1.0.2 bug. So even a system update wouldn’t help the situation, as fixing it properly would require
an actual distro upgrade.</p>
<p>Programs which don’t depend on OpenSSL (like Go binaries) won’t be affected by this. Services/clients on Ruby/JRuby, for example, will have problems similar to curl.</p>
<p>The same goes for programs which do certificate pinning in their clients. I personally saw a SaaS vendor dole out a fix for this yesterday for their Python client. These clients would see external calls to other endpoints failing.</p>
<p>There is also a great Twitter thread on this by <a href="https://twitter.com/sleevi_/status/1266647545675210753">Ryan</a>.</p>
<h2 id="how-curl-fails-in-a-typical-affected-client-box">How curl fails in a typical affected client box</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl https://myremoteserver.com
curl: (60) server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none
More details here: http://curl.haxx.se/docs/sslcerts.html
curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.
</code></pre></div></div>
<h2 id="what-should-i-do-if-i-am-affected-by-this">What should I do if I am affected by this?</h2>
<ul>
<li>the option to upgrade your distro is more or less out of the picture if you are currently affected by this, as you would need a tactical fix; a distro upgrade is more of a long-term strategic fix.</li>
<li><a href="https://www.agwa.name/blog/post/fixing_the_addtrust_root_expiration">Andrew</a> mentioned a fix which involves putting a ! before the mozilla/AddTrust root cert entry</li>
<li><a href="https://twitter.com/kingslyj">kingsly</a> suggested this fix where we do a</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -n > /usr/share/ca-certificates/mozilla/AddTrust_Low-Value_Services_Root.crt;\
echo -n > /usr/share/ca-certificates/mozilla/AddTrust_Public_Services_Root.crt; \
echo -n > /usr/share/ca-certificates/mozilla/AddTrust_Qualified_Certificates_Root.crt;\
echo -n > /usr/share/ca-certificates/mozilla/AddTrust_External_Root.crt;\
update-ca-certificates
</code></pre></div></div>
<p>The above empties out the contents of the files. You could further use <a href="https://linux.die.net/man/1/chattr">chattr +i on the files</a> to change the file attributes such
that they are not modified from the state we have set them to; the downside of changing this attribute is that these files
would not get updated when you do a system update.</p>
<ul>
<li>A less disruptive approach, for debian based systems as pointed out by <a href="https://github.com/shanipribadi">shani</a> would be to do a</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo sed -i -e 's/^mozilla\/AddTrust_External_Root.crt/!mozilla\/AddTrust_External_Root.crt/' /etc/ca-certificates.conf
sudo update-ca-certificates --fresh
</code></pre></div></div>
Specifying scheduling rules for your pods on kubernetes
2020-05-06T00:00:00+00:00
https://www.tasdikrahman.com/2020/05/06/specifying-scheduling-rules-for-pods-with-podaffinity-and-podantiaffinity-on-kubernetes
<p>This is more of an extended version of the tweet below.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">If you haven't had a look at pod-affinity and anti-affinity, it's a great way which one can use to distribute the pods of their service across zones. <a href="https://t.co/iqhbyhruD8">https://t.co/iqhbyhruD8</a> (1/n)</p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/1231635358363729920?ref_src=twsrc%5Etfw">February 23, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>PodAntiAffinity/PodAffinity were released in beta back in 2017, in the <a href="https://kubernetes.io/blog/2017/03/advanced-scheduling-in-kubernetes/">1.6 release</a> of k8s, along with node affinity/anti-affinity, taints and tolerations, and custom scheduling.</p>
<p>Depending on your use case, you can definitely use a few of them for specific types of workloads, to achieve certain outcomes for things running in your k8s cluster.</p>
<p>I will talk about how you can distribute your pods across zones for your service using pod affinity and anti-affinity.</p>
<p>One can use <code class="language-plaintext highlighter-rouge">preferredDuringSchedulingIgnoredDuringExecution</code> for podAntiAffinity, so that the scheduler is not fixated on the constraints you put; rather, it makes a best-case effort to schedule the pods based on your constraints.</p>
<p>podAntiAffinity uses the pod labels to tell the scheduler: if pods with a particular label (eg: <code class="language-plaintext highlighter-rouge">app: foo</code>) are already scheduled on a node, then it should try to schedule the pod on a node different from the one where pods with the above label are already scheduled.</p>
<p>The pods can be spread across zones by specifying <code class="language-plaintext highlighter-rouge">topologyKey: failure-domain.beta.kubernetes.io/zone</code> under <code class="language-plaintext highlighter-rouge">podAffinityTerm</code>. This assumes that the nodes are already spread across different zones; to prevent a catastrophe arising from a zonal failure, having a regional cluster is a good idea.</p>
<p>If you have multiple rules under <code class="language-plaintext highlighter-rouge">podAffinityTerm</code>, you can assign weights to them to tell the scheduler the importance of each rule; if you have just one rule, you can assign it a weight of <code class="language-plaintext highlighter-rouge">100</code>.</p>
<p>A very simple example of podAntiAffinity being specified for a deployment, where we let the scheduler take a best-effort approach:</p>
<script src="https://gist.github.com/tasdikrahman/b706b8e9c57d8b7ec3e311ab0dacb188.js"></script>
<p>You will notice that the pods here get scheduled on the same node, while the scheduler tries its best to spread them out. If you had, say, 3 nodes in 3 different zones, the rule above would have tried spreading the pods out across all three zones, giving you resiliency over zone failures.</p>
<p>For the purpose of this example, I have a single node cluster in place.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ k get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready master 97s v1.14.10
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ k get pods -owide -l 'app=store'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-cache-566bcff79-2qrzr 1/1 Running 0 110s 10.244.0.6 kind-control-plane <none> <none>
redis-cache-566bcff79-grhwk 1/1 Running 0 110s 10.244.0.7 kind-control-plane <none> <none>
redis-cache-566bcff79-ktmxv 1/1 Running 0 110s 10.244.0.5 kind-control-plane <none> <none>
</code></pre></div></div>
<p>One thing to note here is that you can specify multiple <code class="language-plaintext highlighter-rouge">matchExpressions</code> inside <code class="language-plaintext highlighter-rouge">podAffinityTerm.labelSelector</code>.</p>
<p>So if you wanted the scheduler to take into consideration another label like <code class="language-plaintext highlighter-rouge">app_type: server</code>, you can do something like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
- key: app_type
operator: In
values:
- server
</code></pre></div></div>
<p>Both the match expressions will be <code class="language-plaintext highlighter-rouge">AND</code>ed while the scheduler is evaluating the rules.</p>
<p>Now, if we want this rule to be enforced while scheduling, we can use <code class="language-plaintext highlighter-rouge">requiredDuringSchedulingIgnoredDuringExecution</code> instead of <code class="language-plaintext highlighter-rouge">preferredDuringSchedulingIgnoredDuringExecution</code>.</p>
<script src="https://gist.github.com/tasdikrahman/927f44bf2efb1dacea51f316d4c53e42.js"></script>
<p>You would notice that the other two pods go into the <code class="language-plaintext highlighter-rouge">Pending</code> state, since there is only one node to spread out over. Given we are enforcing a required-during-scheduling rule, the scheduler keeps trying to schedule the other two pods onto a node which doesn’t already have a pod with the label <code class="language-plaintext highlighter-rouge">app: store</code>, and it isn’t able to find one as there is only one node.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ k get pods -owide -l 'app=store'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
redis-cache-66b88fd4fc-7dc65 0/1 Pending 0 6s <none> <none> <none> <none>
redis-cache-66b88fd4fc-sbh5z 0/1 Pending 0 6s <none> <none> <none> <none>
redis-cache-66b88fd4fc-wszsq 1/1 Running 0 6s 10.244.0.8 kind-control-plane <none> <none>
</code></pre></div></div>
<p>Checking the events of one of the pending pods shows the same:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18s (x3 over 3m16s) default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
</code></pre></div></div>
<p>“IgnoredDuringExecution” means that the pod will still run if labels on a node change and the affinity rules are no longer met. There were plans to have <code class="language-plaintext highlighter-rouge">requiredDuringSchedulingRequiredDuringExecution</code>, which would evict pods from nodes as soon as they no longer satisfy the node affinity rule(s), but I haven’t seen that in the release docs, so I am not sure when it is scheduled to be added.</p>
<h3 id="why-not-use-node-anti-affinity">Why not use node anti-affinity?</h3>
<p>Node affinity/anti-affinity can be used to schedule pods onto a specified set of nodes, but it does not make a best effort to distribute pods evenly across nodes.</p>
<h3 id="side-effects-of-having-anti-affinity-rules">Side effects of having anti-affinity rules?</h3>
<p>Inter-pod affinity and anti-affinity require a substantial amount of processing which can slow down scheduling in large clusters significantly. The docs do not recommend using them in clusters larger than several hundred nodes.</p>
<p>An empty <code class="language-plaintext highlighter-rouge">topologyKey</code> is interpreted as “all topologies” (“all topologies” here is limited to the combination of <code class="language-plaintext highlighter-rouge">kubernetes.io/hostname</code>, <code class="language-plaintext highlighter-rouge">failure-domain.beta.kubernetes.io/zone</code>, and <code class="language-plaintext highlighter-rouge">failure-domain.beta.kubernetes.io/region</code>). Each can be used to spread the pods using different constraints.</p>
<h3 id="how-is-it-different-from-taints">How is it different from taints?</h3>
<blockquote>
<p>Taints allow a Node to repel a set of Pods.</p>
</blockquote>
<p>So in order for a pod to be scheduled on a node, the pod has to <code class="language-plaintext highlighter-rouge">tolerate</code> that taint. This comes in useful for workloads where, say, you only want to schedule pods of a certain type on a set of nodes. A straightforward use case is the control plane nodes of k8s, where you would not want your workload pods to get scheduled, or nodes where you only want to run <a href="https://prometheus.io/">prometheus</a> dedicatedly, given it can take quite a bit of memory.</p>
<h3 id="using-a-custom-scheduler">Using a custom scheduler</h3>
<p>Albeit I haven’t used it, I feel it’s something you should do only if you have a very good understanding of what’s going to happen if you pluck out the default scheduler from your workload’s podspec. In most cases, the default scheduler just works fine, unless you are doing something really out of the ordinary.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity">https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity</a></li>
</ul>
A Few Notes on Etcd Maintenance
2020-04-24T00:00:00+00:00
https://www.tasdikrahman.com/2020/04/24/a-few-notes-on-etcd-maintenance
<blockquote>
<p>This was originally published by me on <a href="https://blog.gojekengineering.com/a-few-notes-on-etcd-maintenance-c06440011cbe">Gojek’s engineering blog</a>; this post is a repost.</p>
</blockquote>
<p>If you have worked around managing <a href="https://kubernetes.io/">Kubernetes</a> clusters on your infrastructure — instead of going with a managed version provided by cloud providers — chances are that you already are managing an <a href="https://etcd.io/">etcd</a> cluster. In case you are new to it, this post is for you.</p>
<p>We’ll get the basics out of the way first, and define what Etcd is.</p>
<blockquote>
<p>Etcd is just a distributed key-value store which uses the <a href="https://raft.github.io/">raft</a> consensus algorithm underneath to provide a fully replicated, highly available key-value store.</p>
</blockquote>
<p>A telling point of its stability is the fact that Kubernetes (the API server) uses it as its key-value store for storing the state of the whole cluster. It uses Etcd’s ‘watch’ function to monitor this data and to reconfigure itself when changes occur. The store holds values representing the actual and ideal state of the cluster, and the ‘watch’ function can initiate a response when they diverge.</p>
<p>As to choosing Etcd over other databases like Redis, Zookeeper, or Consul: I feel that is out of scope for this blog post, considering there are a lot of posts out there which detail this. This post lists down a few of the maintenance activities which we currently run for our production Etcd databases to keep them healthy.</p>
<h3 id="monitoring">Monitoring</h3>
<p>One of the first things you can do is add a few basic alerts, which will serve as the foundation for our maintenance activities.</p>
<script src="https://gist.github.com/tasdikrahman/26ac5d37a86a9360d4c361226308017b.js"></script>
<p>I will be only repeating myself here if I list down any more alert rules, as the official alert docs <a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml">here</a> mention most of the important alerts. The one rule to notice apart from leader election and process being down is the DB size. This rule is not mentioned in the official rules doc, as this is something which people tune based on their DB size.</p>
<p>The metrics emitted by Etcd are in <a href="https://prometheus.io/">Prometheus</a> format. One thing we add on our side while emitting is the environment label, depending on which environment the Etcd VM is running in, as you can see in the alert above. You can remove that while trying it out on Alertmanager.</p>
<p>To avoid running out of space for writes, the keyspace history has to be compacted. Once the quota is reached, this will be obvious from the errors received by your clients using Etcd.</p>
<p>Track <code class="language-plaintext highlighter-rouge">etcd_mvcc_db_total_size_in_bytes</code>, as etcd also emits <code class="language-plaintext highlighter-rouge">etcd_debugging_mvcc_db_total_size_in_bytes_gauge</code> in some versions. The reason to track the former is that the latter can get dropped in later releases. It’s not a bad idea to make sure not to track anything with <em>debugging</em> in the metric name.</p>
<h3 id="compaction">Compaction</h3>
<p>As the space quota is limited, one would need to clear out the keyspace history, which can be achieved through compaction.</p>
<blockquote>
<p>Compaction truncates the per-key change log.</p>
</blockquote>
<p>Using one of these auto-compaction modes is usually recommended:</p>
<p>For periodic compactions, pass <code class="language-plaintext highlighter-rouge">--auto-compaction-retention</code> to the Etcd process while starting; eg: <code class="language-plaintext highlighter-rouge">--auto-compaction-retention=1</code> would run compaction every hour. The mode picked here would be periodic; this is equivalent to passing <code class="language-plaintext highlighter-rouge">--auto-compaction-mode=periodic</code>.</p>
<p>The other mode of compaction is revision-based, which is similar to passing <code class="language-plaintext highlighter-rouge">--auto-compaction-mode=revision</code>. We don’t use this as the use case for us is having a large keyspace rather than having a huge number of revisions for a key-value pair.</p>
<p>There’s a single revision counter which starts at 0 on etcd. Each change made to the keys and their values in the database is manifested as a new revision, created as the latest version of that key and with the incremented revision counter’s value associated with it.</p>
<p>For the revision-based auto-compaction mode, the behaviour should be similar to when one runs something like <code class="language-plaintext highlighter-rouge">$ etcdctl compact 4</code>, after which the revisions prior to the compaction revision become unavailable. So in this case, a <code class="language-plaintext highlighter-rouge">$ etcdctl get --rev=3 some-key</code> would fail.</p>
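To make those semantics concrete, here is a toy model of the revision counter and compaction. This is not etcd’s implementation (real etcd retains the latest version of each key across a compaction, among many other things); it only illustrates why reads at compacted revisions fail:

```python
class ToyMVCC:
    """Toy model of etcd's single revision counter and compaction."""

    def __init__(self):
        self.rev = 0        # single revision counter, starts at 0
        self.history = {}   # revision -> (key, value)
        self.compacted = 0  # revisions <= this are no longer readable

    def put(self, key, value):
        self.rev += 1  # every change bumps the counter
        self.history[self.rev] = (key, value)
        return self.rev

    def get(self, key, rev):
        # Mirrors `etcdctl get --rev=<rev> <key>` failing after compaction.
        if rev <= self.compacted:
            raise KeyError("required revision has been compacted")
        for r in range(rev, 0, -1):  # latest write to `key` at or before rev
            entry = self.history.get(r)
            if entry and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def compact(self, rev):
        # Truncate the per-key change log up to `rev` (simplified).
        for r in [r for r in self.history if r <= rev]:
            del self.history[r]
        self.compacted = rev
```

After five puts (revisions 1 through 5) and a compact at revision 4, a read at revision 3 fails while a read at revision 5 still works.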
<p>If you need more headroom, one can pass <code class="language-plaintext highlighter-rouge">--quota-backend-bytes</code> with the desired space quota while starting up the Etcd process. The default is 2GB, and it can be raised to a maximum of <a href="https://www.ibm.com/support/knowledgecenter/SSBS6K_3.2.0/manage_cluster/manage_etcd_clusters.html">8GB</a>, but it is ideal to keep this size low.</p>
<h3 id="defragmentation">Defragmentation</h3>
<p>Compaction is not enough, as the internal DB exhibits fragmentation after compaction, leaving gaps in the backend database, which would still cause disk space to be consumed. This is fixed by running defragmentation on the Etcd DB.</p>
<blockquote>
<p>Defragmentation operation removes the free space holes from storage</p>
</blockquote>
<p>Combining compaction and defragmentation, along with the right set of monitoring can build the base for maintenance of your etcd cluster/node.</p>
<p>One thing to note here is that defragmentation should be run rather infrequently, as there is always an unavoidable pause. Defragmenting a live member blocks the system from reading and writing data while it rebuilds its state.</p>
<p>This is why it is recommended to keep the DB size smaller — the default again being 2GB. The larger the DB, the more time it will take to defragment. Depending on how critical Etcd is to your app, you can take this into consideration.</p>
<h3 id="would-an-ha-setup-help-here">Would an HA setup help here?</h3>
<p>An <a href="https://en.wikipedia.org/wiki/High_availability">HA</a> Etcd setup won’t help in reducing the pause while defragmenting, as Etcd leader re-election will only happen when the defrag takes longer than the leader election timeout, which shouldn’t happen if your DB is below 2GB and compaction runs regularly.</p>
<p>In a clustered setup, defrag can be run per node, or directly on the whole cluster by running <code class="language-plaintext highlighter-rouge">$ etcdctl defrag --cluster</code>, which covers all members.</p>
<h3 id="running-defrag-operations">Running defrag operations</h3>
<p>Personally, I prefer a systemd.timer to run the defrag operation over a traditional cronjob, as one gets logs by default for each and every trigger via journalctl, which is far more intuitive for someone logging onto the Etcd box.</p>
<script src="https://gist.github.com/tasdikrahman/d064d35d734d802e80a20e5138c839b9.js"></script>
<script src="https://gist.github.com/tasdikrahman/b450236232c048dd9aa0d1f173d7db18.js"></script>
<p>Using the above two systemd services, you can run the defrag operations based on when you would want it to run.</p>
<script src="https://gist.github.com/tasdikrahman/d2236b54c4d17f1df37cdfbdebc042a2.js"></script>
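In case the embedded gists don’t render, a minimal sketch of such a timer/service pair might look like the following (unit names, paths, and the calendar schedule are illustrative, not the exact units from the gists):

```ini
# /etc/systemd/system/etcd-defrag.service (sketch)
[Unit]
Description=Defragment the local etcd member

[Service]
Type=oneshot
ExecStart=/usr/bin/etcdctl defrag

# /etc/systemd/system/etcd-defrag.timer (sketch)
[Unit]
Description=Run etcd-defrag.service on a schedule

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Enabling the timer with `systemctl enable --now etcd-defrag.timer` and inspecting past runs via `journalctl -u etcd-defrag.service` gives the log trail mentioned above.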
<h3 id="provisioning-etcd">Provisioning Etcd</h3>
<p>We use the <a href="https://github.com/chef-cookbooks/etcd/">Etcd community cookbook</a> to provision the Etcd servers, and haven’t noticed any issues so far.</p>
<script src="https://gist.github.com/tasdikrahman/6eccff66192b81e6f264391caa7bdb9f.js"></script>
<p>The sample systemd service is for a single-node Etcd database. Depending on whether you require an Etcd cluster or don’t want the etcdv2 API to be available — among other things — the parameters to your <code class="language-plaintext highlighter-rouge">ExecStart</code> will change. (This is well-documented in the cookbook repo).</p>
<p>We haven’t tried running Etcd on Kubernetes. Although StatefulSets are slowly gaining traction, we didn’t feel running this on top of Kubernetes was something we wanted to do. That being said, we have heard good things about <a href="https://github.com/coreos/etcd-operator">etcd-operator</a>, although the project has been archived.</p>
<h3 id="so-how-do-we-automate-provision-of-etcd">So how do we automate provisioning of Etcd?</h3>
<p>We do it via a <a href="https://github.com/gojek/proctor">proctor script automation</a>, paired with a chef cookbook on top of <a href="https://github.com/chef-cookbooks/etcd">https://github.com/chef-cookbooks/etcd</a> with a few extra things, to create the defragmentation related systemd services.</p>
<h3 id="a-few-optimizations-you-can-do">A few optimizations you can do</h3>
<p>It is preferable to have the storage device attached to the Etcd VM in the same network, as Etcd is extremely sensitive to storage latency, and some operations block until most nodes have accepted a change. If latency crosses certain critical thresholds, one can end up with leader-election storms, where no leader keeps its lease long enough to actually make changes. It’s preferable to never use remote block storage for Etcd, as there are just too many places it could go wrong in weird ways.</p>
<p>Layering multiple layers of fault tolerance that aren’t aware of each other (in this case the remote storage) might lead you to end up with no fault tolerance at all.</p>
<p>Since Etcd is highly I/O dependent, it makes sense to attach an SSD as the disk type for the Etcd instance, as low disk-write latency is critical here.</p>
<p>Sizing your VM type depending on your workload is crucial to not have performance issues (more discussion <a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md">here</a>).</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md</a></li>
<li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml</a></li>
<li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md</a></li>
<li><a href="https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/grafana.json">https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/grafana.json</a></li>
<li><a href="https://freedesktop.org/software/systemd/man/systemd.timer.html">https://freedesktop.org/software/systemd/man/systemd.timer.html</a></li>
</ul>
<p>Thanks a ton to <a href="https://twitter.com/youngnick">@youngnick</a> for being super active on #etcd channel on Kubernetes Slack and helping us dig deeper, to <a href="https://twitter.com/Shubham5830">Shubham</a> for seeing this through with <a href="https://twitter.com/@tasdikrahman">me</a>, and my teammates for sticking with us.</p>
Introducing Kingsly — The Cert Manager
2020-04-22T00:00:00+00:00
https://www.tasdikrahman.com/2020/04/22/introducing-kingsly-the-cert-manager
<blockquote>
<p>This was originally published under <a href="https://blog.gojekengineering.com/introducing-kingsly-the-cert-manager-ced40746aa65">Gojek’s engineering blog</a> by me, this post is a repost.</p>
</blockquote>
<p>There’s one thing all devices connected to the Internet have in common — they rely on protocols called SSL/TLS to protect information in transit.</p>
<blockquote>
<p>SSL/TLS are cryptographic protocols designed to provide secure communication over insecure infrastructure.</p>
</blockquote>
<p>Any communication over the public internet should be encrypted, for which we need SSL certificates. There are many cases for public communication in GOJEK as well. Some of them are listed below:</p>
<ul>
<li>Client VPN connection</li>
<li>Inbound connections from mobile frontend to backend</li>
<li>Portals exposed over the public Internet.</li>
<li>Service to service communication over the public Internet</li>
</ul>
<p>While the industry is moving towards a <a href="https://cloud.google.com/beyondcorp/">zero trust network</a>, certificate management has been a big pain for us.</p>
<p>This post details how we built Kingsly, GOJEK’s open source certificate management tool.</p>
<h3 id="managing-ssltls-certs-until-yesterday">Managing SSL/TLS certs until yesterday</h3>
<blockquote>
<p>A certificate is a digital document that contains a public key, some information about the entity associated with it, and a digital signature from the certificate issuer.</p>
</blockquote>
<p>In other words, it’s a shell that allows us to exchange, store and use public keys. With that, certificates become the basic building block of the Public Key Infrastructure.</p>
<p>We use letsencrypt heavily at GOJEK to generate SSL/TLS certificates for numerous use cases, ranging from our HAproxy load balancers and envoy proxies to IPSec VPNs and a lot of other places.</p>
<p>As of the end of 2018, the whole setup of renewing the certificates (with letsencrypt, renewal is basically regenerating the certificate) installed at different places was a manual process.</p>
<p>All letsencrypt SSL certificates, including renewals, are valid for no more than 90 days from their issue date. Thirty days before a certificate expires, you begin receiving renewal notices.</p>
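This 30-day window can also be checked mechanically; as a sketch, openssl’s <code class="language-plaintext highlighter-rouge">-checkend</code> flag reports whether a certificate expires within a given number of seconds (the paths and subject below are throwaway examples):

```shell
# Generate a throwaway 90-day self-signed cert, then ask whether it
# expires within the next 30 days (exit 0 means it will NOT expire).
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=example.test" \
  -keyout /tmp/key.pem -out /tmp/cert.pem 2>/dev/null

if openssl x509 -checkend $((30 * 86400)) -noout -in /tmp/cert.pem; then
  echo "certificate OK for at least 30 more days"
else
  echo "renew now"
fi
```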
<center><img src="/content/images/2020/04/kingsly_1.png" /></center>
<p>Seeing the above, someone from our team would renew these certificates manually and do the needful.</p>
<p>Although this served our purpose for the time being, it was becoming difficult and time-consuming for one person to manage this piece of infrastructure with the increasing number of IPSec VPNs.</p>
<p>Managing your own PKI infrastructure is a hard problem in itself and leaving it to manual processes always leaves room for human error.</p>
<p>The Kernel team, which I am part of, focuses on solving infrastructure problems, improving systems resiliency and developer productivity.</p>
<blockquote>
<p>Hence the motto: Productivity through Automation</p>
</blockquote>
<h3 id="problems-with-our-current-approach">Problems with our current approach</h3>
<p>We have faced a lot of issues around cert management:</p>
<ul>
<li>Using a wildcard certificate. If such a certificate gets compromised, all the subdomains are affected.</li>
<li>Renewal of expired certificates.</li>
<li>Certificate inventory. Keeping a list of hundreds of subdomains and tracking expiry is an operational nightmare. Installing letsencrypt locally on all the VMs makes it difficult to keep the inventory in one central place.</li>
<li>Manual, non-standardized generation and renewal of SSL certs leaves room for human error.</li>
<li>Certs shared over email/slack.</li>
<li>No audit trail.</li>
<li>100+ certs scattered across the org, with little or no visibility into when each expires until someone gets an email.</li>
</ul>
<p>We needed a solution to this problem: something very basic, without a high learning curve.</p>
<h3 id="the-features-we-wanted-in-our-tool">The features we wanted in our tool</h3>
<ul>
<li>Certificates stored in a central manner</li>
<li>APIs exposed to create/renew the certificate</li>
<li>Automatic renewal a certain period before a cert expires</li>
<li>Centralised tracking and notification</li>
<li>Common API for internal users</li>
</ul>
<h3 id="enter-kingsly">Enter Kingsly</h3>
<p>It all started last December, when our team was contemplating what to ship for our internal company hackathon.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">It's going to be a looooong night here. 20+ teams taking part in our internal hackathon. <a href="https://twitter.com/gojektech?ref_src=twsrc%5Etfw">@gojektech</a> <br /><br />Let the games, sorry, hacks begin 😉 <a href="https://t.co/6yRjfo4GX6">pic.twitter.com/6yRjfo4GX6</a></p>— Gojek Tech (@gojektech) <a href="https://twitter.com/gojektech/status/1070276401885052928?ref_src=twsrc%5Etfw">December 5, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Our second internal hackathon in Bangalore starts today! 😎 <br /><br />We're pulling an all nighter starting at 4pm today to 12pm tomorrow noon. 😈<br /><br />01101000 01100001 01100011 01101011 <a href="https://t.co/BBN3Filu5D">pic.twitter.com/BBN3Filu5D</a></p>— Gojek Tech (@gojektech) <a href="https://twitter.com/gojektech/status/1070205884947750912?ref_src=twsrc%5Etfw">December 5, 2018</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This tool was a perfect candidate for the night as it was both:</p>
<ul>
<li>a pain point</li>
<li>something which we really wanted to automate</li>
</ul>
<p>By the end of the night, we had it in a working condition and generating SSL certificates was as simple as doing a</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -X POST http://kingsly.host/v1/cert_bundles \
-u admin:password \
-H 'Content-Type: application/json' \
-d '{
"top_level_domain":"your-domain.com",
"sub_domain": "your-sub-domain"
}'
</code></pre></div></div>
<p>and the response from the Kingsly server would be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'{
"private_key":"-----BEGIN RSA PRIVATE KEY-----\nFOO...\n-----END RSA PRIVATE KEY-----\n",
"full_chain":"-----BEGIN CERTIFICATE-----\nBAR...\n-----END CERTIFICATE-----\n"
}'
</code></pre></div></div>
<p>A simple JSON response which can be digested by a client the way it wants.</p>
<h3 id="we-had-a-framework-for-the-product-now-what">We had a framework for the product. Now what?</h3>
<p>We didn’t want this to end up as a hackathon project that remains a desolate creature in one’s source code repository. So we decided to extend Kingsly and make it into something which fit the initial set of requirements we had in mind for it.</p>
<h3 id="moar-features">Moar features!!</h3>
<p>After the cert generation part, the next feature which we wanted was automatic renewals of those certs.</p>
<p>The client (the IPSec client which we built) would keep polling the Kingsly server, fetch the certificates, and check whether the current cert inside the IPSec VPN differed from the one the server returned. If it did, the client would replace the certs in the IPSec VPN box with the ones it received from the Kingsly server. It would then continue with the flow of tasks needed for the IPSec VPN to pick up the new certs, something which used to be done manually by a human.</p>
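The reconciliation step can be sketched roughly as below; the paths and names are hypothetical, and the real client fetches the candidate cert over HTTP from the Kingsly API and reloads the VPN service after a swap:

```shell
#!/bin/sh
# Sketch of a kingsly-certbot style check: replace the installed cert
# only when the server returned a different one.
install_if_changed() {
  new_cert="$1"       # cert fetched from the Kingsly server
  installed="$2"      # cert currently used by the IPSec VPN
  if ! cmp -s "$new_cert" "$installed"; then
    cp "$new_cert" "$installed"
    echo "certificate rotated"      # a service reload would follow here
  else
    echo "certificate unchanged"
  fi
}

printf 'cert-v2' > /tmp/new.pem
printf 'cert-v1' > /tmp/installed.pem
install_if_changed /tmp/new.pem /tmp/installed.pem   # certificate rotated
install_if_changed /tmp/new.pem /tmp/installed.pem   # certificate unchanged
```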
<center><img src="/content/images/2020/04/kingsly_2.png" /></center>
<h3 id="who-gets-to-request-generation-of-certs">Who gets to request generation of certs?</h3>
<p>This was still an unsolved problem in the whole equation. We would need to devise a way to deny requests from unauthorised clients.</p>
<p>One thing which we were sure about was: authentication and authorization should not be handled by the application.</p>
<p>A very simple solution was to put the Kingsly web server behind an HAproxy. Here, we would have a frontend rule pointing to a backend having an ACL with a list of IPs which would be allowed through.</p>
<p>This solved the initial problem of only allowing the clients which were whitelisted.</p>
<script src="https://gist.github.com/tasdikrahman/e65c36f8156eeef855c5fc24196655bb.js"></script>
<p>But the other problem would be maintaining an updated list of all the allowed clients. On top of that, we were trying to automate a process, so settling for something which eventually needed manual intervention didn’t quite fit our vision. We started exploring other possibilities and stumbled upon IAP, which fit the bill for our use case.</p>
<blockquote>
<p>Identity Aware Proxy (IAP) is the GCP provided solution for user as well as service authentication.</p>
</blockquote>
<p>IAP is essentially an OAuth 2.0 implementation over a proxy.</p>
<p>For authentication with IAP, certain required headers need to be added to every request which goes out of the client box. We created a proxy service which adds these authentication details to each request. For service-to-service authentication, it uses OAuth 2.0 with the credentials of service accounts.</p>
<center><img src="/content/images/2020/04/kingsly_3.png" /></center>
<h3 id="why-use-iap-for-auth">Why use IAP for auth?</h3>
<ul>
<li>Central authorization layer</li>
<li>Application level access control</li>
<li>Allows individual and group-based access policies</li>
<li>Enforces HTTPS</li>
<li>Mandatorily redirects HTTP requests to HTTPS</li>
<li>Attaches client identity to request headers for further processing by downstream applications/proxy</li>
</ul>
<h3 id="theres-more-to-come">There’s more to come</h3>
<ul>
<li>Build support in the kingsly-certbot client for HAproxy, envoy proxy and the other places where we require x.509 certificates that will interact with Kingsly.</li>
<li>Add better monitoring to verify that certificates were installed correctly.</li>
<li>Allow authorization on top of Kingsly, to open it up to developers to request SSL certificates and pave the road towards SSL/TLS enabled applications.</li>
<li>CRD to generate certs for applications inside our k8s clusters. We will be evaluating the use cases similar to how we did for Kingsly before going ahead with it.</li>
</ul>
<p>As the saying goes, there’s no silver bullet in software, but Kingsly checks off the list of requirements we had from such a tool.</p>
<blockquote>
<p>Kingsly is open source! Check the links below 🖖</p>
</blockquote>
<h3 id="links">Links</h3>
<ul>
<li><a href="https://github.com/gojekfarm/kingsly">https://github.com/gojekfarm/kingsly</a> — the API server which handles cert generation and renewals.</li>
<li><a href="https://github.com/gojekfarm/kingsly/tree/master/docs/deploy/k8s">https://github.com/gojekfarm/kingsly/tree/master/docs/deploy/k8s</a> — kingsly-server and worker deployment docs</li>
<li><a href="https://github.com/gojekfarm/kingsly-certbot">https://github.com/gojekfarm/kingsly-certbot</a> — the IPSec VPN client for kingsly</li>
<li><a href="https://github.com/gojekfarm/kingsly-certbot-cookbook">https://github.com/gojekfarm/kingsly-certbot-cookbook</a> — chef cookbook to setup certbot while provisioning an IPSec VPN.</li>
<li><a href="https://github.com/gojekfarm/iap_auth">https://github.com/gojekfarm/iap_auth</a> — Proxy for talking to an IAP enabled service.</li>
<li><a href="https://github.com/gojekfarm/iap_authenticator">https://github.com/gojekfarm/iap_authenticator</a> — Ruby Gem to create authentication token for services running behind IAP</li>
<li><a href="https://github.com/gojekfarm/iap-auth-cookbook">https://github.com/gojekfarm/iap-auth-cookbook</a> — chef cookbook to setup up the iap-auth proxy binary on an IPSec VPN box.</li>
</ul>
<p>Kingsly was also presented by me over at DevConf, India</p>
<script async="" class="speakerdeck-embed" data-id="3e0b36f62907438094dde34369646e50" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
Route missing in kubernetes node with kuberouter as the CNI
2020-01-05T00:00:00+00:00
https://www.tasdikrahman.com/2020/01/05/routes-missing-kubernetes-kube-router-networkd
<p>For anyone evaluating a networking solution for their kubernetes cluster without a lot of moving parts, <a href="https://www.kube-router.io/">kuberouter</a> provides <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/">pod networking</a>, the ability to enforce <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">network policies</a>, and an IPVS/LVS service proxy, among other things.</p>
<p>The problem which we faced specifically while running this in our clusters was missing routes upon restart of a node, or sometimes when a node was joining the cluster as a worker node.</p>
<p>For us, the issue would surface as the <a href="https://github.com/uswitch/kiam">kiam</a> pod (which we were using for identity management for pods inside the k8s clusters) going into <a href="https://stackoverflow.com/questions/44702715/kubernetes-pod-fails-with-crashloopbackoff">CrashLoopBackOff</a>, as described by me in the github issue <a href="https://github.com/uswitch/kiam/issues/49">https://github.com/uswitch/kiam/issues/49</a>, because DNS resolution would fail (more on that later).</p>
<p>We were using the latest version of <a href="http://coreos.com/">Coreos</a>, but we found out that version 1576.5.0 of Coreos was not plagued by this problem.</p>
<p>This has been described in detail in the <a href="https://github.com/cloudnativelabs/kube-router/issues/370">github issue</a>.</p>
<p>The problem was a race condition caused by <a href="https://www.freedesktop.org/software/systemd/man/systemd-networkd.service.html">systemd-networkd.service</a> trying to manage the tunnels and modifying the routes, which caused the missing routes. Whenever networkd restarted, all the tunnels would go away with it.</p>
<p>It is best described by <a href="https://github.com/cloudnativelabs/kube-router/issues/370#issuecomment-399850110">Niel here</a>.</p>
<p>To fix this, drop a file into the networkd dir <code class="language-plaintext highlighter-rouge">/etc/systemd/network/</code> so that networkd starts ignoring those interfaces and doesn’t manage them, as described by <a href="https://github.com/cloudnativelabs/kube-router/issues/370#issuecomment-463967949">Lomkju</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Match]
Name=tun* kube-bridge kube-dummy-if
[Link]
Unmanaged=yes
</code></pre></div></div>
<p>It was tested to be working on the following coreos version, as mentioned by <a href="https://github.com/cloudnativelabs/kube-router/issues/370#issuecomment-463967949">Lomkju</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Container Linux by CoreOS 1967.6.0 (Rhyolite)
Kernel: 4.14.96-coreos-r1
</code></pre></div></div>
Various ways of enabling canary deployments in kubernetes
2019-09-12T00:00:00+00:00
https://www.tasdikrahman.com/2019/09/12/ways-to-do-canary-deployments-kubernetes-traefik-istio-linkerd
<p><strong>Update</strong> I gave a quick lightning talk on the same topic @ DevopsDays India, 2019. The slides can be found below.</p>
<script async="" class="speakerdeck-embed" data-id="0b063af43dd54ea794077749a347b58e" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<h3 id="what-canary-can-be">What canary can be</h3>
<p>Shaping the traffic so that we direct a % of it to the new pods, then promoting that deployment to a full scaleout while gradually phasing out the older release.</p>
<h3 id="why-canary">Why canary?</h3>
<p>Testing on staging doesn’t weed out all the possible reasons for something failing; doing the final testing of a feature on some part of live traffic is not unheard of.
Canary is also a precursor to enabling full blue-green deployments.</p>
<h4 id="why">Why?</h4>
<p>If you don’t use feature flags for your services, canary testing becomes paramount to test out features.</p>
<h4 id="approaches-to-enable-canary">Approaches to enable canary</h4>
<h5 id="using-bare-bone-deployments-in-k8s">Using Bare bone deployments in k8s</h5>
<h6 id="how">How?</h6>
<p>Referring to figure 1, create two sets of deployments, v1 and v2; both would be separate deployment objects with separate label selectors. Both the v1 and v2 deployments would be exposed via the same svc object, which would point to their pods.</p>
<h6 id="advantages">Advantages</h6>
<ul>
<li>Plain and simple.</li>
<li>Can be done without any plugins/extra stuff in the vanilla k8s we get in GKE.</li>
</ul>
<h6 id="disadvantages">Disadvantages</h6>
<p>Traffic will be a function of replicas, and cannot be customized. For example, if the traffic is split between two deployments with v1 having 3 replicas and v2 having 1 replica, the canary will receive 25% of the traffic.</p>
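As a concrete sketch of this bare-bones approach (names and images here are hypothetical), the Service selects only on the shared <code class="language-plaintext highlighter-rouge">app</code> label, so its endpoints span both deployments in the 3:1 replica ratio described above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: helloworld
spec:
  selector:
    app: helloworld          # matches pods from both deployments
  ports:
  - port: 80
    targetPort: 9898
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld-v1
spec:
  replicas: 3                # stable: ~75% of traffic
  selector:
    matchLabels:
      app: helloworld
      version: v1
  template:
    metadata:
      labels:
        app: helloworld
        version: v1
    spec:
      containers:
      - name: app
        image: example/helloworld:1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld-v2
spec:
  replicas: 1                # canary: ~25% of traffic
  selector:
    matchLabels:
      app: helloworld
      version: v2
  template:
    metadata:
      labels:
        app: helloworld
        version: v2
    spec:
      containers:
      - name: app
        image: example/helloworld:2.0
```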
<h5 id="using-istio">Using Istio</h5>
<h6 id="how-1">How?</h6>
<p>In an istio enabled cluster, we need to set the routing rules to configure the traffic distribution.</p>
<p>Similar to the approach above, we have two deployments and svc objects for the same service, called v1 and v2.</p>
<p>The rule will look something like</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> - <span class="o"><<</span><span class="no">EOF</span><span class="sh">
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: helloworld
spec:
hosts:
- helloworld
http:
- route:
- destination:
host: helloworld
subset: v1
weight: 90
- destination:
host: helloworld
subset: v2
weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: helloworld
spec:
host: helloworld
subsets:
- name: v1
labels:
version: v1
- name: v2
labels:
version: v2
</span><span class="no">EOF
</span></code></pre></div></div>
<h6 id="advantages-1">Advantages</h6>
<ul>
<li>Has been out there in the wild for quite some time. i.e Battle tested</li>
<li><a href="https://medium.com/google-cloud/automated-canary-deployments-with-flagger-and-istio-ac747827f9d1">Flagger</a> can be added to have automated canary promotion.</li>
<li>GKE has an add on feature which can be used to install istio in our clusters</li>
<li>Traffic routing and replica deployment are two completely orthogonal independent functions</li>
<li>Focused canary testing. For example, instead of exposing the canary to an arbitrary number of users, if you wanted only the users from some-company-name.com on the canary version, leaving the other users unaffected, you can do that too, with a rule that matches a header cookie as below.</li>
</ul>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
spec:
hosts:
- helloworld
http:
- match:
- headers:
cookie:
regex: <span class="s2">"^(.*?;)?(email=[^;]*@some-company-name.com)(;.*)?$"</span>
...
</code></pre></div></div>
<ul>
<li>Tracing comes as a side benefit</li>
</ul>
<h6 id="disadvantages-1">Disadvantages</h6>
<ul>
<li>Another add-on to manage inside the cluster, if you go the route of installing istio yourself</li>
</ul>
<h5 id="using-linkerd">Using Linkerd</h5>
<h6 id="how-2">How?</h6>
<p>Linkerd has a canary CRD that controls how a rollout should occur.</p>
<p>It automatically creates two sets of deployments for a deployment named <code class="language-plaintext highlighter-rouge">podinfo</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
podinfo ClusterIP 10.7.252.86 <none> 9898/TCP 96m
podinfo-canary ClusterIP 10.7.245.17 <none> 9898/TCP 23m
podinfo-primary ClusterIP 10.7.249.63 <none> 9898/TCP 23m
</code></pre></div></div>
<p>And then you do a rollout to start the traffic shaping to <code class="language-plaintext highlighter-rouge">podinfo-canary</code>.
A detailed post on how to do this is <a href="https://linkerd.io/2/tasks/canary-release/">here</a>.</p>
<h6 id="advantages-2">Advantages</h6>
<ul>
<li>Flagger can be integrated for automated canary promotion/demotion</li>
<li>Battle tested, and has been in use by virtue of being the first service mesh</li>
</ul>
<h6 id="disadvantages-2">Disadvantages</h6>
<ul>
<li>Another component inside the k8s cluster to be maintained</li>
</ul>
<h5 id="using-traefik">Using Traefik</h5>
<h6 id="how-3">How?</h6>
<p>It requires the same setup of two deployment and svc objects for the service which needs canary enabled for it, and makes use of the ingress object in k8s to define the traffic split between the services.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
traefik.ingress.kubernetes.io/service-weights: |
my-app: 99%
my-app-canary: 1%
name: my-app
spec:
rules:
- http:
paths:
- backend:
serviceName: my-app
servicePort: 80
path: /
- backend:
serviceName: my-app-canary
servicePort: 80
path: /
</code></pre></div></div>
<h6 id="advantages-3">Advantages</h6>
<ul>
<li>Easy to set up with no frills, as it’s just an ingress controller in the k8s cluster (like contour/ nginx-ingress-controller)</li>
<li>Doesn’t need the pods to be scaled.</li>
<li>Support for tracing included.</li>
</ul>
<h6 id="disadvantages-3">Disadvantages</h6>
<ul>
<li>No inbuilt process to shift weights from v1 to v2, or to revert traffic in case of increased error rates; i.e. no clear-cut way to integrate with flagger for automated canary promotion/demotion</li>
</ul>
<h4 id="references">References</h4>
<ul>
<li><a href="https://martinfowler.com/bliki/CanaryRelease.html">https://martinfowler.com/bliki/CanaryRelease.html</a></li>
<li><a href="https://martinfowler.com/bliki/FeatureToggle.html">https://martinfowler.com/bliki/FeatureToggle.html</a></li>
</ul>
<p>Canary using istio</p>
<ul>
<li><a href="https://istio.io/blog/2017/0.1-canary/">https://istio.io/blog/2017/0.1-canary/</a></li>
</ul>
<p>Bare bones canary on k8s</p>
<ul>
<li><a href="https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments">https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments</a></li>
</ul>
<p>Canary using traefik</p>
<ul>
<li><a href="https://blog.containo.us/canary-releases-with-traefik-on-gke-at-holidaycheck-d3c0928f1e02?gi=c58435e35526">https://blog.containo.us/canary-releases-with-traefik-on-gke-at-holidaycheck-d3c0928f1e02?gi=c58435e35526</a></li>
<li><a href="https://www.tasdikrahman.com/2018/10/25/canary-deployments-on-AWS-and-kubernetes-using-traefik/">https://www.tasdikrahman.com/2018/10/25/canary-deployments-on-AWS-and-kubernetes-using-traefik/</a></li>
<li><a href="https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting">https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting</a></li>
</ul>
<p>Canary using linkerd</p>
<ul>
<li><a href="https://linkerd.io/2/tasks/canary-release/">https://linkerd.io/2/tasks/canary-release/</a></li>
<li><a href="https://linkerd.io/2/features/traffic-split/">https://linkerd.io/2/features/traffic-split/</a> Done using https://flagger.app/</li>
<li><a href="https://www.tarunpothulapati.com/posts/traffic-splitting-linkerd/">https://www.tarunpothulapati.com/posts/traffic-splitting-linkerd/</a>
Excerpt: “Flagger combines traffic shifting and L7 metrics to do canary deployments, etc. It will slowly increase the weight to the newer version, based on the metrics, and if there is any problem (e.g. failed requests), it would roll back. If not it will continue increasing the weight until all the requests are routed to the newer version. Tools like Flagger can be built on top of SMI and they work on all the meshes that implement it.”</li>
</ul>
Handling signals for applications running in kubernetes
2019-04-24T00:00:00+00:00
https://www.tasdikrahman.com/2019/04/24/handling-singals-for-applications-in-kubernetes-docker
<p>When the power goes off on a device running a linux-based system, one can think of ways in which this event can be handled by the applications running on it. One thing to note is that when you pull the power cable, the power doesn’t really go off immediately.</p>
<p>But processes need to be notified of this event so that they can handle it and save the state of the application (if any).</p>
<p>A few possible ways of doing so would be to</p>
<ul>
<li>Save everything in RAM to disk and then restore the contents stored on disk back to RAM on the next startup; the problem with this approach is that storing to and retrieving from disk is slow.</li>
<li>Use a file to store the state of power, where 0 would denote that the power is on and 1 would mean that the power has gone off, but this approach would mean that the processes running in the system should keep track of this bit stored in the file.</li>
<li>The kernel sends a signal like <code class="language-plaintext highlighter-rouge">SIGTERM</code> to the process and leaves it to the process on how it handles this signal.</li>
</ul>
<p>How a container gets stopped may or may not be very important, depending on your application of course.</p>
<p>In this blog post, I will very briefly discuss how docker, and kubernetes at large, and the containers orchestrated via them handle signals.</p>
<h2 id="what-is-a-signal">What is a signal</h2>
<blockquote>
<p>A signal is a software interrupt and a way to communicate the state of a process(es) to another process, the OS and the hardware.</p>
</blockquote>
<p>By interrupt, we mean that whenever a signal is received by a process, it will stop doing whatever it is doing and handle the signal, by either doing something about it or ignoring it.</p>
<p>A few signals and what they intend to do (<a href="https://www.usna.edu/Users/cs/aviv/classes/ic221/s16/lec/19/lec.html">credits www.usna.edu</a>) are listed below.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Signal Value Action Comment
<span class="nt">----------------------------------------------------------------------</span>
SIGHUP 1 Term Hangup detected on controlling terminal
or death of controlling process
SIGINT 2 Term Interrupt from keyboard
SIGQUIT 3 Core Quit from keyboard
SIGILL 4 Core Illegal Instruction
SIGABRT 6 Core Abort signal from abort<span class="o">(</span>3<span class="o">)</span>
SIGFPE 8 Core Floating point exception
SIGKILL 9 Term Kill signal
SIGSEGV 11 Core Invalid memory reference
SIGPIPE 13 Term Broken pipe: write to pipe with no
readers
SIGALRM 14 Term Timer signal from alarm<span class="o">(</span>2<span class="o">)</span>
SIGTERM 15 Term Termination signal
SIGUSR1 30,10,16 Term User-defined signal 1
SIGUSR2 31,12,17 Term User-defined signal 2
SIGCHLD 20,17,18 Ign Child stopped or terminated
SIGCONT 19,18,25 Cont Continue <span class="k">if </span>stopped
SIGSTOP 17,19,23 Stop Stop process
SIGTSTP 18,20,24 Stop Stop typed at <span class="nb">tty
</span>SIGTTIN 21,21,26 Stop <span class="nb">tty </span>input <span class="k">for </span>background process
SIGTTOU 22,22,27 Stop <span class="nb">tty </span>output <span class="k">for </span>background process
</code></pre></div></div>
<p>Each signal has a default value and action defined. You can take a look at <code class="language-plaintext highlighter-rouge">sys/signal.h</code> and
find out more about each and every other signal defined in it.</p>
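<p>If you just want the name-to-number mapping on your own machine, the shell can print it (the exact output format varies between shells and platforms):</p>

```shell
# List all signal names (and, in most shells, their numbers)
# supported on the current system.
kill -l
```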
<h2 id="passing-the-signal-to-a-process-from-the-command-line">Passing the signal to a process from the command line</h2>
<p>When you press <code class="language-plaintext highlighter-rouge">ctrl+c</code>, it’s the same as sending a <code class="language-plaintext highlighter-rouge">SIGINT</code> signal; when you press <code class="language-plaintext highlighter-rouge">ctrl+z</code>, it’s the same as sending <code class="language-plaintext highlighter-rouge">SIGTSTP</code>; and typing <code class="language-plaintext highlighter-rouge">fg</code> or <code class="language-plaintext highlighter-rouge">bg</code> sends a <code class="language-plaintext highlighter-rouge">SIGCONT</code> signal.</p>
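<p>The same signals can be sent explicitly with <code class="language-plaintext highlighter-rouge">kill(1)</code> to any PID, not just the foreground job; a rough equivalent of the shortcuts above, using a dummy <code class="language-plaintext highlighter-rouge">sleep</code> process:</p>

```shell
sleep 100 &           # a long-running dummy process
pid=$!

kill -STOP "$pid"     # like ctrl+z: suspend the process
kill -CONT "$pid"     # like fg/bg: resume it
kill -TERM "$pid"     # ask it to terminate
wait "$pid" 2>/dev/null || true

kill -0 "$pid" 2>/dev/null || echo "process $pid is gone"
```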
<h2 id="passing-signals-to-containers">Passing signals to containers</h2>
<p>When you issue a <code class="language-plaintext highlighter-rouge">docker stop</code>, docker sends <code class="language-plaintext highlighter-rouge">SIGTERM</code> to the process running as PID 1 inside the container and waits 10 seconds; if the process hasn’t terminated within that time frame, docker sends a <code class="language-plaintext highlighter-rouge">SIGKILL</code>, which terminates the process outright.</p>
<p>A <code class="language-plaintext highlighter-rouge">docker kill</code> gives the container process no opportunity to stop gracefully; it kills it straight away.</p>
<h2 id="how-kubernetes-handles-signals">How kubernetes handles signals</h2>
<p>When you do a</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl delete pods mypod
</code></pre></div></div>
<p>it will send a <code class="language-plaintext highlighter-rouge">SIGTERM</code> and then wait a set number of seconds before sending a <code class="language-plaintext highlighter-rouge">SIGKILL</code> to the process. This period is known as the termination grace period of the pod and can be configured in the <a href="https://kubernetes.io/docs/api-reference/v1.9/#podspec-v1-core">podSpec</a></p>
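<p>As an illustrative sketch (the pod and image names are placeholders), bumping the grace period above the default looks like this in the podSpec:</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  terminationGracePeriodSeconds: 60   # default is 30 seconds
  containers:
  - name: myapp
    image: myorg/myapp:1.0.0          # placeholder image
```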
<p>If your process doesn’t handle <code class="language-plaintext highlighter-rouge">SIGTERM</code>, then it will get <code class="language-plaintext highlighter-rouge">SIGKILL</code>ed. Pods which are killed this way are immediately removed from etcd and the API, without waiting for the process to actually terminate on the node.</p>
<p>If you have anything which needs a graceful shutdown, you need to implement a handler for <code class="language-plaintext highlighter-rouge">SIGTERM</code>.</p>
<p>If there is more than one container in a pod, each of them will be sent a <code class="language-plaintext highlighter-rouge">SIGTERM</code>, and one would want to implement the right strategy for how they should be terminated.</p>
<p>One mistake which is pretty common and easy to miss is using the non-exec (shell) form of <a href="https://docs.docker.com/engine/reference/builder/#cmd"><code class="language-plaintext highlighter-rouge">CMD</code></a>. Your process then runs as a child process of the shell rather than as PID 1; effectively it runs as <code class="language-plaintext highlighter-rouge">/bin/sh -c myapplication</code></p>
<p>The problem with this is that the shell will never forward the signal to the child process, which is your application process, so it will never get to run the <code class="language-plaintext highlighter-rouge">SIGTERM</code> handler you have written.</p>
<p>In that case, your process would end up getting <code class="language-plaintext highlighter-rouge">SIGKILL</code>ed. Instead, one should use the exec form of CMD</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CMD <span class="o">[</span><span class="s2">"myapplication"</span><span class="o">]</span>
</code></pre></div></div>
<p>or use the exec form of <a href="https://docs.docker.com/engine/reference/builder/#entrypoint"><code class="language-plaintext highlighter-rouge">ENTRYPOINT</code></a></p>
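<p>To make the difference concrete, here are the two forms side by side (<code class="language-plaintext highlighter-rouge">myapplication</code> is a placeholder binary):</p>

```dockerfile
# Shell form: /bin/sh -c is PID 1 and does not forward SIGTERM -- avoid
CMD myapplication

# Exec form: myapplication itself runs as PID 1 and receives SIGTERM
CMD ["myapplication"]
```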
<h2 id="passing-custom-signals-to-container-processes">Passing custom signals to container processes</h2>
<p>Let’s take an example of a sample application, which has a dockerfile like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM alpine:3.5
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
RUN apk add --no-cache bash
ENTRYPOINT ["/entrypoint.sh"]
</code></pre></div></div>
<p>And the contents of your <code class="language-plaintext highlighter-rouge">entrypoint.sh</code> are</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">trap</span> <span class="s2">"echo TERM"</span> TERM
<span class="nb">trap</span> <span class="s2">"echo HUP"</span> HUP
<span class="nb">trap</span> <span class="s2">"echo INT"</span> INT
<span class="nb">trap</span> <span class="s2">"echo QUIT"</span> QUIT
<span class="nb">trap</span> <span class="s2">"echo USR1; sleep 2; exit 0"</span> USR1
<span class="nb">trap</span> <span class="s2">"echo USR2"</span> USR2
ps aux
<span class="nb">tail</span> <span class="nt">-f</span> /dev/null
</code></pre></div></div>
<p>Now if you build this image and run the container in the background, then try stopping it with <code class="language-plaintext highlighter-rouge">docker stop</code>, you will notice that the container process takes 10 seconds before it dies, at which point it gets <code class="language-plaintext highlighter-rouge">SIGKILL</code>ed</p>
<p>To avoid that, we do a signal rewrite using <a href="https://github.com/Yelp/dumb-init">dumb-init</a>, which runs as PID 1 inside the docker container. You can <a href="https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/">read here</a> about why running your application process as PID 1 is usually not a good idea.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM alpine:3.5
COPY entrypoint.sh /entrypoint.sh
RUN <span class="nb">chmod</span> +x /entrypoint.sh
RUN apk add <span class="nt">--no-cache</span> bash
<span class="c"># Change 1: Download dumb-init</span>
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN <span class="nb">chmod</span> +x /usr/local/bin/dumb-init
<span class="c"># Change 2: Make it the entrypoint. The arguments are optional</span>
ENTRYPOINT <span class="o">[</span><span class="s2">"/usr/local/bin/dumb-init"</span>,<span class="s2">"--rewrite"</span>,<span class="s2">"15:10"</span>,<span class="s2">"--"</span><span class="o">]</span>
CMD <span class="o">[</span><span class="s2">"/entrypoint.sh"</span><span class="o">]</span>
</code></pre></div></div>
<p>Now if you issue a <code class="language-plaintext highlighter-rouge">docker stop</code>, you will notice that the process receives the <code class="language-plaintext highlighter-rouge">USR1</code> signal (the rewritten <code class="language-plaintext highlighter-rouge">SIGTERM</code>), handles it, and then exits, which is what we wanted.</p>
<h2 id="signal-rewriting-for-gracefully-terminating-apache2-and-nginx">Signal rewriting for gracefully terminating apache2 and nginx</h2>
<p>If you want to initiate a graceful shutdown of an nginx server, you should send a <code class="language-plaintext highlighter-rouge">SIGQUIT</code>.
None of the Docker commands issue a <code class="language-plaintext highlighter-rouge">SIGQUIT</code> by default. So the solution for kubernetes is to
rewrite the signal <code class="language-plaintext highlighter-rouge">SIGTERM</code> to a <code class="language-plaintext highlighter-rouge">SIGQUIT</code> for nginx.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## for full source, check https://github.com/Yelp/casper/blob/master/Dockerfile.opensource</span>
FROM ubuntu:xenial
<span class="c"># Manually install dumb-init as it's not in the public APT repo</span>
RUN wget https://github.com/Yelp/dumb-init/releases/download/v1.2.1/dumb-init_1.2.1_amd64.deb
RUN dpkg <span class="nt">-i</span> dumb-init_<span class="k">*</span>.deb
<span class="c">## Your application requirements</span>
<span class="c"># Rewrite SIGTERM(15) to SIGQUIT(3) to let Nginx shut down gracefully</span>
CMD <span class="o">[</span><span class="s2">"dumb-init"</span>, <span class="s2">"--rewrite"</span>, <span class="s2">"15:3"</span>, <span class="s2">"/code/start.sh"</span><span class="o">]</span>
</code></pre></div></div>
<p>This will deliver a <code class="language-plaintext highlighter-rouge">SIGQUIT</code> to nginx so it can handle termination gracefully.</p>
<p>Gracefully terminating apache2 requires a <code class="language-plaintext highlighter-rouge">SIGWINCH</code> signal instead, which can be achieved by rewriting to signal <code class="language-plaintext highlighter-rouge">28</code>.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## your dockerfile contents before this</span>
CMD <span class="o">[</span><span class="s2">"dumb-init"</span>, <span class="s2">"--rewrite"</span>, <span class="s2">"15:28"</span>, <span class="s2">"/code/start.sh"</span><span class="o">]</span>
</code></pre></div></div>
<h2 id="references">References</h2>
<ul>
<li><a href="https://lasr.cs.ucla.edu/vahab/resources/signals.html">https://lasr.cs.ucla.edu/vahab/resources/signals.html</a></li>
<li><a href="https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/">https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/</a></li>
<li><a href="https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods">https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods</a></li>
<li><a href="https://httpd.apache.org/docs/2.4/stopping.html">https://httpd.apache.org/docs/2.4/stopping.html</a></li>
<li><a href="https://www.usna.edu/Users/cs/aviv/classes/ic221/s16/lec/19/lec.html">https://www.usna.edu/Users/cs/aviv/classes/ic221/s16/lec/19/lec.html</a></li>
<li><a href="http://jkamenik.github.io/blog/2017/10/21/docker-details---dumb-init/">http://jkamenik.github.io/blog/2017/10/21/docker-details—dumb-init/</a></li>
<li><a href="http://nginx.org/en/docs/control.html">http://nginx.org/en/docs/control.html</a></li>
<li><a href="http://httpd.apache.org/docs/2.2/stopping.html#gracefulstop">http://httpd.apache.org/docs/2.2/stopping.html#gracefulstop</a></li>
<li><a href="https://github.com/Yelp/casper/issues/21">https://github.com/Yelp/casper/issues/21</a></li>
<li><a href="https://unix.stackexchange.com/questions/80044/how-signals-work-internally">https://unix.stackexchange.com/questions/80044/how-signals-work-internally</a></li>
<li><a href="https://stackoverflow.com/questions/37374310/how-critical-is-dumb-init-for-docker">https://stackoverflow.com/questions/37374310/how-critical-is-dumb-init-for-docker</a></li>
</ul>
Container Image Structuring for container runtimes
2019-04-10T00:00:00+00:00
https://www.tasdikrahman.com/2019/04/10/docker-rkt-containerd-container-image-structuring-for-container-runtimes
<p>While you might have read posts about <a href="https://news.ycombinator.com/item?id=16036268">docker</a> being dead, given its adoption, that’s not <a href="https://sysdig.com/blog/2018-docker-usage-report/">really the case</a>.</p>
<p>While we have other container runtimes like <a href="https://github.com/opencontainers/runc">runc</a>,
<a href="https://blog.docker.com/2017/08/what-is-containerd-runtime/">containerd</a>,
<a href="https://coreos.com/rkt/">rkt</a> and some others, docker is still what a lot of
folks running containers use as their container runtime.</p>
<p>What this post describes is one of the many approaches to structuring your container images, keeping reusability, security and best practices in mind, and keeping the images as lightweight as possible. At the time of writing, this is still used to run production container workloads at my last company.</p>
<h3 id="prelude">Prelude</h3>
<p>Before going ahead, just so that we are on the same page.</p>
<blockquote>
<p>A Container image is a filesystem tree that includes all of the requirements for running a container,
as well as metadata describing the content. You can think of it as a packaging technology.</p>
</blockquote>
<blockquote>
<p>A container is composed of two things: a writable filesystem layer on top of a container image,
and a traditional linux process. Multiple containers can run on the same machine and share the OS
kernel with other containers, each running as an isolated processes in the user space. Containers
take up less space than VMs (application container images are typically tens of MBs in size), and
start almost instantly.</p>
</blockquote>
<p><a href="https://web.archive.org/web/20180128191607/http://docs.projectatomic.io/container-best-practices/">Source: project atomic, container best practices</a></p>
<h3 id="introduction">Introduction</h3>
<p><a href="http://martinfowler.com/bliki/ImmutableServer.html">Immutable Server</a> pattern is something which
we used to follow in my last company. <a href="https://medium.com/netflix-techblog/how-we-build-code-at-netflix-c5d9bd727f15">Netflix has written at length</a>
on how they do it. More on how we used to do it in another post.</p>
<p>I will not go into the how and why of immutable infra in this blog post, as that is something which
deserves its own post.</p>
<p>Docker fits right in if you follow the above pattern for your infrastructure.</p>
<p>That is, you bake the whole application using <a href="https://www.packer.io/">packer</a> or something similar, including config inside the <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMI</a>, and then add that AMI to the launch config for the <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html">ASG</a>, so that the newer instance which comes up when the ASG scales up is an exact copy of the instances already present in the ASG.</p>
<p>What you have is repeatable infra in short, with the above pattern. And you start treating servers as
<a href="https://news.ycombinator.com/item?id=7311704">cattle and not pets</a>.</p>
<h3 id="the-layering-of-container-images">The layering of container images</h3>
<center><img src="/content/images/2019/04/container-image-layering.png" /></center>
<p>We used to follow a layered approach of immutable infrastructure, where we would have a base layer.</p>
<h3 id="base-layer">Base Layer</h3>
<p>contains a fresh copy of an operating system (<a href="https://alpinelinux.org/">alpine Linux</a> in this case)
and would include core system tools (e.g. bash, coreutils, curl, <a href="https://github.com/Yelp/dumb-init">dumb-init</a> et al)
and the tools necessary to install packages and make updates to the image over time.</p>
<p>As for the Intermediate container images, each would use the base layer, hence inheriting from the
base image.</p>
<h3 id="intermediate-layers">Intermediate Layers</h3>
<ul>
<li>Language runtime
<ul>
<li>python-27</li>
<li>php-7.1</li>
<li>go-1.8.3</li>
</ul>
</li>
<li>Web server
<ul>
<li>apache2</li>
<li>nginx</li>
</ul>
</li>
<li>Combination of (specific web server + specific language runtime)
<ul>
<li>python-27-{ nginx/apache2 }</li>
<li>php-7.1-{ nginx/apache2 }</li>
<li>go-1.8.3-{ nginx/apache2 }</li>
</ul>
</li>
</ul>
<p><em>Note: The above intermediate layers are just to show you an example, you can
replace it with your use case.</em></p>
<p>Dependency managers like pip/composer/golang-dep would go in this layer for the next layer
to make use of, after which their caches can be cleared.</p>
<p>For example in the case of</p>
<ul>
<li>pip : <code class="language-plaintext highlighter-rouge">rm -rf ~/.cache/pip/*</code></li>
<li>composer : <code class="language-plaintext highlighter-rouge">composer clear-cache</code></li>
<li>apk: <code class="language-plaintext highlighter-rouge">rm -rf /var/cache/apk/*</code>, that is if <code class="language-plaintext highlighter-rouge">--no-cache</code> is not being passed whilst doing <code class="language-plaintext highlighter-rouge">apk add &lt;package&gt;</code></li>
<li>go-dep: the old cache might be useful for debugging if something goes wrong, so whether or not to remove <code class="language-plaintext highlighter-rouge">$GOPATH/pkg/dep</code> can be debated</li>
</ul>
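<p>As a sketch, a hypothetical python intermediate layer would install dependencies and clear the pip cache in the same <code class="language-plaintext highlighter-rouge">RUN</code> instruction, so the cache never persists into a filesystem layer (the image name and paths here are placeholders):</p>

```dockerfile
FROM tasdikrahman/base-layer:0.1.0
RUN apk add --no-cache python py2-pip
COPY requirements.txt /tmp/requirements.txt
# install dependencies and drop the cache in the same layer
RUN pip install -r /tmp/requirements.txt \
    && rm -rf ~/.cache/pip/*
```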
<h3 id="an-example-of-such-a-setup">An example of such a setup</h3>
<ul>
<li>Base layer</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM gliderlabs/alpine:3.4
ENV ALPINE_VER=3.4
ENV ALPINE_SHA=45ba65c1116aaf668f7ab5f2b3ae2ef4b00738be
RUN apk update && \
apk add xorriso git xz curl tar iptables cpio bash && \
rm -rf /var/cache/apk/*
RUN apk add -U --repository http://dl-cdn.alpinelinux.org/alpine/edge/testing aufs-util
RUN addgroup -g 2999 docker
</code></pre></div></div>
<p>after which you would create the container image from this Dockerfile, which for the sake of this example you would name <code class="language-plaintext highlighter-rouge">base-image</code></p>
<ul>
<li>Intermediate Layer</li>
</ul>
<p>To create a Java-based intermediate layer</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM tasdikrahman/base-layer:0.1.0
ENV LANG=C.UTF-8
RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jre-8u131-linux-x64.tar.gz' -H 'Cookie: oraclelicense=accept-securebackup-cookie' \
&& chown root:root jre-8u131-linux-x64.tar.gz \
&& tar -xzf jre-8u131-linux-x64.tar.gz \
&& rm jre-8u131-linux-x64.tar.gz \
&& mv jre1.8.0_131 /usr/local/lib
WORKDIR /usr/local/lib/jre1.8.0_131
ENV JAVA_HOME /usr/local/lib/jre1.8.0_131
ENV PATH $JAVA_HOME/bin:$PATH
RUN apk del --no-cache curl tar # wget ca-certificates
</code></pre></div></div>
<ul>
<li>Application Layer</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM tasdikrahman/java-base:0.1.0
# Your application specific requirements etc.
</code></pre></div></div>
<h3 id="application-image-layer">Application Image layer</h3>
<p>This is where the container image would contain dependencies specific to the application and other
required tooling, inheriting other things from the previous layers.</p>
<h3 id="security">Security</h3>
<ul>
<li>
<p><a href="https://engineeringblog.yelp.com/2016/01/dumb-init-an-init-for-docker.html">Dumb-init</a>
should be specified as the entrypoint for the application container image which is yet
to be followed in some remaining container images so that /entrypoint.sh is executed as
CMD as an argument to dumb-init. <a href="https://news.ycombinator.com/item?id=11802993">More on why have something like dumb-init as PID 1</a></p>
</li>
<li>
<p>If the service does not need <code class="language-plaintext highlighter-rouge">root</code> privileges, create a new user and switch the user with <code class="language-plaintext highlighter-rouge">USER</code>
directive in the application container image.</p>
</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN groupadd -r myapp && useradd -r -g myapp myapp
USER myapp
</code></pre></div></div>
<ul>
<li>Adding better security vulnerability testing
<ul>
<li><a href="https://github.com/jmccann/drone-clair">Drone Clair Plugin</a></li>
</ul>
</li>
</ul>
<h3 id="keeping-the-size-of-the-docker-image-small">Keeping the size of the docker image small</h3>
<p>At each layer</p>
<ul>
<li>the necessary package managers should be cleared of their cache</li>
<li>Unnecessary filesystem layers should not be created</li>
<li>Unwanted packages/libs should not be added.</li>
</ul>
<p>Ideally, the above division should always be maintained, and any new requirement should
go into the layer most appropriate for it</p>
<ul>
<li>
<p>Use <code class="language-plaintext highlighter-rouge">.dockerignore</code> wherever necessary: when building the image, docker first has to prepare
the context, gathering all files which would be used in the process. The default context contains
all files in the directory, including things like the .git directory, which can get pretty big
owing to the .git/objects subdirectory.</p>
</li>
<li>Optimizing <code class="language-plaintext highlighter-rouge">COPY</code> and <code class="language-plaintext highlighter-rouge">RUN</code> directives by putting least frequently changed things on the top of the Dockerfile, which would help us enable caching better.</li>
<li>Whenever possible, chaining commands together and sorting multi-line arguments alphanumerically, which will help avoid duplication of packages and make the list much easier to update. This also makes it a lot easier to read and review. Adding a space before a backslash (\) helps as well.</li>
</ul>
<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get update && apt-get install -y \
bzr \
cvs \
git \
mercurial \
subversion
</code></pre></div></div>
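<p>A minimal <code class="language-plaintext highlighter-rouge">.dockerignore</code> along these lines (the entries are illustrative) might look like:</p>

```text
# keep the build context small
.git
*.log
tmp/
node_modules/
```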
<h3 id="good-to-have">Good to have</h3>
<ul>
<li>Linting, which can enforce standardization across the container images; a possible solution would be https://github.com/projectatomic/dockerfile_lint</li>
</ul>
<h3 id="references">References</h3>
<ul>
<li>http://docs.projectatomic.io/container-best-practices/</li>
<li>https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/</li>
<li>https://opensource.googleblog.com/2018/01/container-structure-tests-unit-tests.html</li>
<li>https://github.com/Yelp/dumb-init</li>
</ul>
Self hosting kubernetes
2019-04-04T00:00:00+00:00
https://www.tasdikrahman.com/2019/04/04/self-hosting-highly-available-kubernetes-cluster-aws-google-cloud-azure
<p><a href="https://kubernetes.io/">kubernetes</a> has been around for some time now. At the time of writing this article, <a href="https://github.com/kubernetes/kubernetes/releases/tag/v1.14.0">v1.14.0</a> is the latest release and with each new release, they have a bunch of new features.</p>
<p>This post is about the initial setup for getting the kubernetes cluster up and running, and assumes that you are already familiar with what kubernetes is, with a rough idea of what the control plane components are and what they do. I gave a talk on the same subject of self-hosting kubernetes at <a href="https://devopsdaysindia.org/">DevOpsDays India, 2018</a></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/3WgqFoo9eek" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>You can find the slides of the talk below.</p>
<script async="" class="speakerdeck-embed" data-id="40484a078640415a872c2857fd7aaf89" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<p>Although there are other tools like <a href="https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/">kubeadm</a> which would help you to <a href="https://thenewstack.io/kubernetes-now-does-self-hosting-with-kubeadm/">self-host</a> kubernetes, I would like to show how <a href="https://github.com/kubernetes-incubator/bootkube">bootkube</a> does it.</p>
<h2 id="what-is-self-hosting-kubernetes">What is self-hosting kubernetes?</h2>
<blockquote>
<p>It runs all required and optional components of a Kubernetes cluster on top of Kubernetes itself. The kubelet manages itself or is managed by the system init and all the Kubernetes components can be managed by using Kubernetes APIs.</p>
</blockquote>
<p><em>source: <a href="https://coreos.com/tectonic/docs/latest/troubleshooting/bootkube_recovery_tool.html">CoreOS tectonic docs</a></em></p>
<p>In a nutshell, static Kubernetes runs control plane components as <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> services on the host. It’s simple to reason about and the repo has educational docs, but in practice, it’s fairly static (hard to re-configure). Self-hosted Kubernetes runs control plane components as pods. A one-time bootstrapping process is done to set up that control plane. Configuring hosts becomes much more minimal, requiring only a running Kubelet. This favours performing rolling upgrades through Kubernetes, the cluster system, and provisioning immutable host infrastructure. A node’s only job is to be a “dumb” member of the larger cluster.</p>
<center><img src="/content/images/2019/04/k8s-self-hosted.png" /></center>
<p>The cluster which you see above is a self-hosted cluster, hosted on <a href="https://digitalocean.com">digitalocean</a> using <a href="https://github.com/poseidon/typhoon">typhoon</a>. For reference, you can check the <a href="https://github.com/tasdikrahman/infra/tree/master/aws/ap-south-1/homelab">terraform</a> config to bring up your own cluster using typhoon.</p>
<p>What you can see above is how you can use <a href="https://kubernetes.io/docs/reference/kubectl/overview/">kubectl</a> to interact with the different components of the kubernetes cluster and how they are abstracted in terms of the native kubernetes objects like <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">deployments</a>, <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">daemonset</a></p>
<h2 id="is-this-something-new">Is this something new?</h2>
<center><img src="/content/images/2019/04/k8s-self-hosted-github-proposal.png" /></center>
<p>This has been in discussion for quite some time and a lot of people have already done it in <a href="https://coreos.com/tectonic/">production</a></p>
<h2 id="why">Why?</h2>
<p>What are the usual properties of the <a href="https://kubernetes.io/docs/concepts/#kubernetes-control-plane">k8s control plane</a> objects.</p>
<ul>
<li>Highly available</li>
<li>Should be able to tolerate node failures</li>
<li>Scale up and down with requirements</li>
<li>Rollback and upgrades</li>
<li>Monitoring and alerting</li>
<li>Resource allocation</li>
<li>RBAC</li>
</ul>
<h2 id="how-is-self-hosted-kubernetes-addressing-them">How is self-hosted kubernetes addressing them</h2>
<ul>
<li><strong>Small Dependencies</strong>: self-hosted should reduce the number of components required, on the host, for a Kubernetes cluster to be deployed to a Kubelet. This should greatly simplify the perceived complexity of Kubernetes installation.</li>
<li><strong>Deployment consistency</strong>: self-hosted reduces the number of files that are written to disk or managed via configuration management or manual installation via SSH. Our hope is to reduce the number of moving parts relying on the host OS to make deployments consistent in all environments.</li>
<li><strong>Introspection</strong>: internal components can be debugged and inspected by users using existing Kubernetes APIs like kubectl logs</li>
<li><strong>Cluster Upgrades</strong>: Related to introspection, the components of a Kubernetes cluster are now subject to control via Kubernetes APIs. Upgrades of Kubelets are possible via new daemon sets, API servers can be upgraded using daemon sets and potentially deployments in the future, and flags of add-ons can be changed by updating deployments, etc.</li>
<li><strong>Easier Highly-Available Configurations</strong>: Using Kubernetes APIs will make it easier to scale up and monitor an HA environment without complex external tooling. Because of the complexity of these configurations, tools that create them without self-hosting often have to implement significantly complex logic.</li>
<li><strong>Streamlined, cluster lifecycle management</strong>: you can manage things using your favourite tool like kubectl</li>
</ul>
<p>Let’s try explaining the above one at a time.</p>
<h4 id="small-dependencies">Small dependencies</h4>
<center><img src="/content/images/2019/04/small-deps.png" /></center>
<ul>
<li>Forget about masters for a second; for worker nodes, the components above are all you need for them to connect to the cluster.</li>
<li>Everything is running inside kubernetes.</li>
<li>Master nodes might have systemd units and some other specialised scripts to run.</li>
<li>There is no distinction between the nodes.</li>
</ul>
<center><img src="/content/images/2019/04/k8s-disctinction.png" /></center>
<p>The nodes are only differentiated on the basis of k8s <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/">labels</a> which are attached to them, and labels are the only way one can distinguish between the master and the other kinds of nodes.</p>
<p>This is done so that the scheduler can schedule only the required workloads on the nodes of a particular kind. For example, the API server should only run on the master nodes, and hence it will use the label attached to the master nodes, tolerate the taint, and get scheduled on them.</p>
<p>Adding labels to a node is as simple as doing a</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl label node node1 master=true
</code></pre></div></div>
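<p>A control-plane pod would then pin itself to such nodes with a nodeSelector and tolerate the master taint. A hedged fragment (the label and taint names are assumptions; they vary between installers):</p>

```yaml
spec:
  nodeSelector:
    master: "true"                          # matches the label above
  tolerations:
  - key: node-role.kubernetes.io/master     # taint name varies by installer
    operator: Exists
    effect: NoSchedule
```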
<h4 id="introspection">Introspection</h4>
<center><img src="/content/images/2019/04/introspection.png" /></center>
<p>All the control plane objects run as one or the other kubernetes objects, what happens then is you get the power of doing something like <code class="language-plaintext highlighter-rouge">kubectl logs ..</code> on the particular object to get the logs.</p>
<p>You can go ahead and send these logs to your logging platform just like how you would do to your application logs. Making them searchable and gaining more visibility on the control plane objects.</p>
<h4 id="cluster-upgrades">Cluster Upgrades</h4>
<center><img src="/content/images/2019/04/upgrades.png" /></center>
<p>Doing a cluster upgrade for a particular component is as simple as doing a <code class="language-plaintext highlighter-rouge">kubectl set-image</code> on the particular control plane object.</p>
<p>First comes the API server. A <a href="https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/upgrading.md">certain flow</a> needs to be followed while doing upgrades, but other than that, this is really all there is to upgrades</p>
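<p>Such an upgrade, sketched with placeholder object and image names, is a single command:</p>

```shell
# bump the API server image on its daemonset; names are illustrative
kubectl set image daemonset/kube-apiserver \
    kube-apiserver=k8s.gcr.io/hyperkube:v1.14.0 \
    --namespace kube-system
```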
<h4 id="easier-highly-available-configurations">Easier highly available configurations</h4>
<center><img src="/content/images/2019/04/easier-ha.png" /></center>
<p>As the core control plane objects run as Kubernetes objects like deployments, you can increase their replica count and scale them up/down to whatever number you want.</p>
<p>Don’t expect a performance improvement from scaling up the scheduler/controller-manager, though: each of them takes a lock on etcd and elects a leader, so only one of them is active at any given point. The extra replicas buy you availability, not throughput.</p>
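<p>Scaling the control plane components for availability is then just a sketch like the following (the deployment names and namespace are assumptions and depend on how the manifests were written):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl scale deployment/kube-scheduler -n kube-system --replicas=3
$ kubectl scale deployment/kube-controller-manager -n kube-system --replicas=3
</code></pre></div></div>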
<h4 id="streamlined-cluster-lifecycle-management">Streamlined cluster lifecycle management</h4>
<p>Like any of your apps, your control plane objects can be applied to your cluster in a similar way.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kubectl apply -f kube-apiserver.yaml
$ kubectl apply -f controller-manager.yaml
$ kubectl apply -f flannel.yaml
$ kubectl apply -f my-app.yaml
</code></pre></div></div>
<p>This gives you familiar ground to stand on. Of course, you should avoid doing the above by hand in prod and instead have some kind of automation run the <code class="language-plaintext highlighter-rouge">kubectl apply</code> for you.</p>
<h2 id="how-does-all-this-work">How does all this work?</h2>
<p>There are three main problems to solve here</p>
<center><img src="/content/images/2019/04/main-problems.png" /></center>
<h4 id="bootstrapping">Bootstrapping</h4>
<ul>
<li>The control plane runs as daemonsets and deployments, making use of secrets and configmaps</li>
<li>But … we need a control plane to apply these deployments and daemonsets on</li>
</ul>
<center><img src="/content/images/2019/04/chicken.png" /></center>
<h2 id="how-to-solve-this">How to solve this?</h2>
<p><a href="https://github.com/kubernetes-incubator/bootkube">Bootkube</a> can be used to create the temporary control plane which can be then used to inject the control plane objects.</p>
<h2 id="how-does-bootkube-work">How does bootkube work?</h2>
<p>A very rough way of describing the set of operations:</p>
<ul>
<li>Bootkube starts and drops api-server, scheduler, controller-manager, and etcd pod manifests into <code class="language-plaintext highlighter-rouge">/etc/kubernetes/manifests</code>.</li>
<li>This static (and temporary) control plane starts, then bootkube injects the daemonsets, deployments, secrets, &amp; configmaps which make up the self-hosted control plane.</li>
<li>Bootkube waits for the self-hosted control plane to start, and when all required pods are running, bootkube deletes the static manifests in <code class="language-plaintext highlighter-rouge">/etc/kubernetes/manifests</code> and exits, leaving us with a self-hosted cluster.</li>
</ul>
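<p>In terms of commands, the steps above roughly map to generating the assets and then running bootkube once on a controller node (the flags here are from memory and may differ across bootkube versions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ bootkube render --asset-dir=assets --api-servers=https://&lt;controller-ip&gt;:6443
$ bootkube start --asset-dir=assets
</code></pre></div></div>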
<center><img src="/content/images/2019/04/bootkube-01.png" /></center>
<center><img src="/content/images/2019/04/bootkube-02.png" /></center>
<center><img src="/content/images/2019/04/bootkube-03.png" /></center>
<p>So you have manifests in two directories</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">bootstrap-manifests</code>, which holds the temporary control plane manifests that bootkube brings up</li>
<li><code class="language-plaintext highlighter-rouge">manifests</code>, which holds the manifests applied when the API server pivots from the temporary control plane to the self-hosted one injected by bootkube before it exits</li>
</ul>
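<p>An illustrative layout of the asset directory (the file names here are hypothetical):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>assets/
├── bootstrap-manifests/    # temporary control plane, run as static pods
│   ├── bootstrap-apiserver.yaml
│   ├── bootstrap-controller-manager.yaml
│   └── bootstrap-scheduler.yaml
└── manifests/              # self-hosted control plane, injected by bootkube
    ├── kube-apiserver.yaml
    ├── kube-controller-manager.yaml
    └── kube-scheduler.yaml
</code></pre></div></div>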
<center><img src="/content/images/2019/04/manifests.png" /></center>
<h3 id="does-this-even-work">Does this even work?</h3>
<center><img src="/content/images/2019/04/controller-manager.png" /></center>
<p>What you see above are the logs of a controller node when the cluster is brought up. The initial <code class="language-plaintext highlighter-rouge">docker ps -a</code> shows the bootstrap control plane objects which were brought up by bootkube.</p>
<p>The second <code class="language-plaintext highlighter-rouge">docker ps -a</code> shows the list of containers which are part of the pods when all the other manifests in the <code class="language-plaintext highlighter-rouge">manifests</code> dir got applied.</p>
<h3 id="tools-to-self-host-k8s-clusters">Tools to self host k8s clusters</h3>
<p>Although I have used Typhoon to bring up my Kubernetes clusters, check out <a href="https://gardener.cloud/">Gardener</a>, which can also help you achieve self-hosted Kubernetes.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cluster-lifecycle/self-hosted-kubernetes.md">SIG-lifecycle Spec on self hosted kubernetes</a></li>
<li><a href="https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/design-principles.md">bootkube: Design principles</a></li>
<li><a href="https://blog.rmenn.in/post/how-bootkube-works/">bootkube: How does is work</a></li>
<li><a href="https://github.com/kubernetes-incubator/bootkube/blob/master/Documentation/upgrading.md">bootkube: Upgrading the kubernetes cluster</a></li>
<li><a href="https://groups.google.com/forum/#!msg/kubernetes-sig-cluster-lifecycle/p9QFxw-7NKE/jeYJF1hBAwAJ">SIG lifecycle google groups early discussions on self hosting</a></li>
<li><a href="https://github.com/poseidon/typhoon">Typhoon</a></li>
<li><a href="https://www.youtube.com/watch?v=jIZ8NaR7msI">Self hosting kubernetes: how and why</a></li>
</ul>
Solo Backpacking trip to Hampi, Gokarna and Goa
2019-03-22T00:00:00+00:00
https://www.tasdikrahman.com/2019/03/22/solo-backpacking-trip-to-hampi-gokarna-goa-budget
<p>It was a cold Friday night, and I had just come back home after wrapping up the farewell party thrown by my colleagues from my last company. I was feeling worn out from all the activities of the day, but I kept reminding myself that I had to start packing for the impromptu solo backpacking trip I had planned before joining my next gig in the coming 10 days.</p>
<center><img src="/content/images/2019/03/packing.jpg" /></center>
<h3 id="where-was-i-planning-to-go">Where was I planning to go?</h3>
<iframe src="https://uploads.knightlab.com/storymapjs/f8903d09026b4be3c6e94eca06c8e73e/hampi-gokarna-goa/index.html" frameborder="0" width="100%" height="800"></iframe>
<h3 id="hampi-the-imperial-capital-of-vijayanagar-a-14th-century-empire">Hampi the imperial capital of Vijayanagar, a 14th century empire</h3>
<p>Hampi is the forgotten empire; there was a time when diamonds were sold on the streets before its fall. Such was the grandeur.</p>
<p>Hampi rose into prominence in the early 14th century when the Kampili Kings rose to power. In 1327, the kingdom was attacked by Muhammad-bin-Tughluq, who took two brothers, Bukka and Harihara, as prisoners along with thousands of other people. These brothers tricked the Sultan into setting them free and returned to Kampili to set up a kingdom of their own with its capital at Vijayanagara. Thus the Vijayanagara Empire was founded by Harihara I and Bukka I of the Sangama dynasty in 1336. The Sangama dynasty was followed by the Saluvas and the Tuluvas, each of whom added to Vijayanagara’s architectural beauty.</p>
<p>At one point Hampi was also one of the biggest trading centres of the world. Vijayanagar brought a lot of wealth, fame and splendour to Hampi. In those times, most markets in Hampi were always crowded and swarming with buyers and also merchants.</p>
<p>The empire encompassed almost the whole of modern South India.</p>
<p>It is quite a unique experience I would say. On one hand, you experience a culture rich in history(every street and corner is steeped in history) and on the other hand on crossing the river, you enter into another world — Virupapur Gadde or more popularly known as the Hippie Island. Contradictory indeed! You will find the street lined with cafes which cater to foreign tourists from all over the world. Tourists from Israel are a common sight.</p>
<p>The easiest way to reach Hampi is to take a bus till Hospet, after which you can take a city bus to the town of Hampi, which barely costs you anything. If you are feeling lucky, you can book a Karnataka Sarige class KSRTC bus from Bangalore’s Majestic bus stand till Hospet like I did. A fair warning though: if you have a big rucksack with you and are looking for a good night’s sleep (which I feel everyone does), you are better off booking a better bus service for yourself. But nonetheless, it’s a very good service, considering it charged me just about 230 INR for a distance of close to 300 km.</p>
<center><img src="/content/images/2019/03/hampi-first-impressions.jpg" /></center>
<p>For staying around Hampi, you can book one of the lodges, which are nothing but the cafes around the <a href="https://en.wikipedia.org/wiki/Virupaksha_Temple,_Hampi">Virupaksha temple</a>; the prices are quite cheap. I booked myself a room for a meagre 450 INR for two nights in one of the nearby lodges. The one I was staying in had bathrooms which were not well maintained at all. I was lucky that there was a public bathroom near the lodge, where I used to get done with my business. But given the price of the lodge, I wouldn’t have expected any better.</p>
<p>There is one particular food joint just right next to the lodge, which used to serve <a href="https://en.wikipedia.org/wiki/Dosa">Dosas</a> and I used to get breakfast and dinner, along with some tea. The smell of the batter getting cooked on the stove and it getting served on your plate with some hot sambar. Heavenly!</p>
<p>The plan for the very first day was to get a cycle (they can be rented at around 100/200 INR for the whole day) and cover the area near Virupaksha temple and the road which leads to the southern parts of the ancient city.</p>
<center><img src="/content/images/2019/03/hampi-virupaksha.jpg" /></center>
<p>It’s part of the group of monuments at Hampi, designated as a UNESCO world heritage site.</p>
<p>The day on which I visited, it just so happened that there was a marriage ceremony going on. There was a huge crowd which is as expected. Lakshmi, the temple elephant would be seen blessing the visitors with her trunk. I would later get to know that it’s quite common for weddings to happen here.</p>
<p>Right next to the temple is the Hemakuta hill, which leads to other temples in the vicinity. The hill is encircled on three sides by massive fortification. To the north is the enclosure wall of Virupaksha temple. The complex has three gateways. More than 30 shrines stand on this hill. These vary from elaborate structures with multiple sanctums to rudimentary, single-celled constructions. Most of these temples have a stepped pyramidal type of superstructure.</p>
<center><img src="/content/images/2019/03/hemakuta-hill.jpg" /></center>
<center><img src="/content/images/2019/03/sunset-point.jpg" /></center>
<center><img src="/content/images/2019/03/krishna-temple.jpg" /></center>
<p>There is a part of the empire, which used to be the palace grounds of the king. It had a part where the sacred water tanks are situated. The sacred tanks were related to various rituals and functional aspects of the temples and the people surrounding the temples. The tanks were considered to be sacred places by the people of Hampi in ancient times.</p>
<p>There are some water tanks that are not related to the temples. Some of the water tanks are situated within the Royal Enclosure and they were built for the use of the members of the royal family of Vijayanagara. There were a few large public water tanks as well that were for use of the general people.</p>
<p>The pushkaranis in Hampi were an integral part of the people’s lives during the time of the Vijayanagara Empire. Since the temples were an important part of the social and cultural lives of the people of ancient Hampi, the water tanks also gained significance among the people.</p>
<p>In many cases, the water tanks served as the venue for the annual boat festivals. During such festivals, the images of Gods and Goddesses were taken out of the temples for a coracle ride on the water tanks. Sometimes the images were placed on the pavilions that can be seen in the middle of some of the pushkaranis.</p>
<center><img src="/content/images/2019/03/pushkarani.jpg" /></center>
<p>The one you see in the pictures is the pushkarani of the royal enclosure. The key attraction of the tank is the symmetrical layout of the steps. It is a 5-tiered tank where each tier comprises a few steps.</p>
<center><img src="/content/images/2019/03/drainage.jpg" /></center>
<p>The soldiers used to eat their meals in these giant stone utensils, which are about 5 times the size of an average eating utensil you can find in a regular household these days. The city had its own drainage system even back then, which is something very rare in civilisations of that time.</p>
<center><img src="/content/images/2019/03/hampi-cycling.jpg" /></center>
<p>The structure which you see on the top is called the Mahanavami Dibba, a three-tiered stone platform rising to a height of 8 meters, located to the north-east of the royal enclosure. It was one of the most important ceremonial structures of royal use, built in granite and subsequently encased in sculptured schist stone. It is dated circa 16th century AD.</p>
<p>There are references to the use of the platform by the royal family, for important festivals like Mahanavami, by Abdur Razak and Domingo Paes.</p>
<p>Right behind the Virupaksha temple, is the massive Tungabhadra river flowing. It’s quite a sight during sunset. The common folks of the village cross the river to the other side of the river using the circular boats which look really cute. More on that in a little bit.</p>
<center><img src="/content/images/2019/03/virupaksha-backyard.jpg" /></center>
<center><img src="/content/images/2019/03/thungabhadra-river.jpg" /></center>
<p>One of the best parts of travelling is that I get to have conversations with folks whom I generally wouldn’t have gotten to have a chat with. This time being no different.</p>
<p>I met Yang, who hails from Japan. When I met him, he already had been travelling for quite some time. He had been hiking across India for quite some time now and his next step is going to Kochi. He had quit his job in an investment bank in Tokyo last year and had been travelling ever since.</p>
<center><img src="/content/images/2019/03/yang.jpg" /></center>
<p>The Virupaksha temple, as it looks like during sunset.</p>
<center><img src="/content/images/2019/03/virupaksha-temple.jpg" /></center>
<h2 id="day-2---hampi">Day 2 - Hampi</h2>
<p>The second day came pretty quickly, usual things. Getting breakfast at the same place. All prepped for the trek towards the eastern side of Hampi.</p>
<center><img src="/content/images/2019/03/day-2-hampi.jpg" /></center>
<p>The hike was a 3-kilometre trek from my lodge to the Vitthala group of temples.</p>
<center><img src="/content/images/2019/03/day-2-vitthala.jpg" /></center>
<p>The path to the temple is not that hard, IMO, if you are alright with walking under the scorching sun with little to zero human presence along the way. You walk alongside the river banks in some parts and get to witness the ruins which were left behind when the empire fell.</p>
<p>The other way of reaching the Vitthala temple from Virupaksha is to take a 12 km auto ride, which would set you back by some 800 INR or so depending on how well you can bargain with the auto drivers.</p>
<p>Whichever suits you well. I wanted to take that hike as my lodge owner had mentioned it should be doable given I had lots of water.</p>
<center><img src="/content/images/2019/03/stone-chariot.jpg" /></center>
<p>The highlight of the temple is the stone chariot at the entrance, a reproduction of a processional wooden chariot; it is perhaps the most stunning achievement, typical of the Vijayanagar period. It houses an image of Garuda, the vehicle of Lord Vishnu.</p>
<center><img src="/content/images/2019/03/vitthala-temple-chariot.jpg" /></center>
<p>Outside the temple, to the east, is a huge bazaar, measuring 945 meters in length and 40 meters in width leading to a sacred tank known as lokapavani.</p>
<center><img src="/content/images/2019/03/hampi-market.jpg" /></center>
<p>If you keep walking in that direction, you will reach what is called the <a href="http://hampi.in/talarigatta-gate">Talarigatta gate</a>, which is located strategically to the north/north-east of the Vijayanagara city, leading to the river, and known as ‘Talarigatta’ (toll collection point). The gateway, which is narrow, is built into the fortification wall which enclosed the capital city. It’s a two-storied structure with a provision for a guard pavilion in the first storey, the latter having beautifully cut plaster decoration.</p>
<center><img src="/content/images/2019/03/talarigatta.jpg" /></center>
<p>Just after that, I hitchhiked with Pasha to the nearest main road to Kamalapur, which is the nearest town between Hampi and Nimbapura; it was around 4 kms. Pasha has a soft drink shop right outside the Vitthala temple and was going out to deposit the empty glasses with the distributor when I met him. He has two kids, named Aminah and Aliyah, whom he adores. He asked me about my family during the ride and described how he loved Bombay and the Haji Ali Dargah. He complained about the traffic there though.</p>
<p>Here’s a picture of Pasha and me, which I managed to get with him just after he dropped me.</p>
<center><img src="/content/images/2019/03/pasha.jpg" /></center>
<p>I got dropped very near to the museum, so the natural thing was to just go and visit it. It was interesting to see a lot of sculptures and old coins from the empire. And for a change, I was not out in the scorching sun.</p>
<p>As I was heading back to the main Hampi city, I had to find my way back. I had a few options. The easiest was to take an auto and head back to the city. Or wait for the red city bus to come along and halt there at that intersection. I waited for the bus to come and it felt like eons, but it didn’t arrive.</p>
<p>While I was waiting there, I met this kid called Mahadev, whom I had met just the day before when I visited Virupaksha. He was one of the many small kids selling maps of Hampi in front of the temple, and I could immediately recognise him. He was waiting for the same bus as I was, on his way back home after a regular day at school.</p>
<center><img src="/content/images/2019/03/mahadev.jpg" /></center>
<p>And while we waited, we talked about his family. His father was an auto driver here and his mom used to sell food at her stall near the Virupaksha temple grounds. He had an elder sister who was studying in the next town. I asked him what he liked about this place. He said it was the people, and that he liked meeting new people and getting to know them.</p>
<p>After waiting for a long while and sharing some apples which I was carrying, Mahadev decided he would get on the back of a motorcycle to reach the town. He generously offered me his ride when a kind gentleman offered to give him a lift, but I asked him to take it as I was hoping to get something similar.</p>
<p>As luck would have it, a generous gentleman offered me a ride back into the city. He struck up a conversation immediately after we started towards the city, asking me where I was from and how I was liking the place.</p>
<p>The next thing on the list was to get on the circular boats which I saw while hiking along the river. The owner of the boating service was quoting somewhere around 1200 INR for around an hour. I managed to bring it down to some 400 INR after a lot of cajoling.</p>
<center><img src="/content/images/2019/03/circular-boat.jpg" /></center>
<p>Later that night, I was lazing around the same local tea shop where I used to hang around. I ended up eating at the same place. Took a bus back to Hospet after which I had to get on a bus from Hospet on to my next destination.</p>
<h3 id="another-uncomfortable-bus-ride">Another uncomfortable bus ride</h3>
<p>The bus ride to <a href="https://en.wikipedia.org/wiki/Gokarna,_Karnataka">Gokarna</a> was not exactly the best ride, but it was pretty cheap. I was travelling in the state bus again, no surprises here that I couldn’t get any sleep again.</p>
<p>I met Yang, yet again while on my way to Gokarna. And we had a chat again about where we were headed and what are we gonna do next.</p>
<center><img src="/content/images/2019/03/yang-day2.jpg" /></center>
<p>The trip to Gokarna from Hampi is an overnight one, and when I woke up in the morning, I was pleasantly surprised by the gentle breeze on my face, flowing in through the tiny crevice of the window which slid open as the bus crawled over the bridge.</p>
<p>I got down at the Gokarna city bus stand and decided to walk all the way up to the hostel which I had booked for the stay for one night. It was a kilometre of a hike before I could reach my hostel.</p>
<p>I was welcomed by the hosts at the hostel, which was right next to the ridge from where you could have a clear view of the <a href="https://www.tripadvisor.com.sg/LocationPhotoDirectLink-g651646-d2651456-i55075183-Gokarna_Beach-Gokarna_Dakshina_Kannada_District_Karnataka.html">main beach</a>; the view was absolutely gorgeous!</p>
<center><img src="/content/images/2019/03/gokarna-zostel.jpg" /></center>
<p>The dorm I was sharing had folks who were there for a quick getaway before attending a friend’s wedding, an army officer who was on her way back home before going back to Leh where she was commissioned, and Amazon, who was from <a href="https://en.wikipedia.org/wiki/Sussex">Sussex</a> and on a break before heading back home to join his new gig, along with Michael, who worked as a police officer in Australia.</p>
<center><img src="/content/images/2019/03/gokarna-day-1.jpg" /></center>
<p>I lazed around the dining area of the hostel, right next to the ridge, where a couple of folks and I shared where we had been travelling and sang songs while someone played the guitar, all of us sipping on milkshakes.</p>
<p>The afternoon was sliding over and we started to plan around a hike around the beaches of Gokarna.</p>
<p>We made it down from the hills from where the hostel was situated and headed to the main beach, where we boarded onto the speed boat which was waiting for us to take us to God’s own beach.</p>
<center><img src="/content/images/2019/03/gokarna-boat.jpg" /></center>
<p>The boat ride was roughly a 30-minute ride till we reached the destination and we were greeted with occasional dolphins who would swim right next to us and come out of the water every now and then.</p>
<center><img src="/content/images/2019/03/gods-own-beach.jpg" /></center>
<p>Just after you get down at any of the beaches which are farther away from the main roads, you would find a bunch of folks who would have holed up in makeshift tents, playing soccer, cooking food in the open kitchen which they have created. Most of them being from Israel.</p>
<p>We started our trek to the beach right next to God’s own beach and the beaches along the way, all the way to the main Gokarna beach.</p>
<p>Kudle Beach is popular with those that are staying in town but want to spend the day at the beach. At the left end of the Gokarna beach, a narrow path goes up a hill, where you cross a (Rama) temple en route. This temple also has a natural water spring which according to the locals never stops running. The water is quite drinkable. After climbing up some stairs, you will find flat ground and some breath-taking views of Gokarna beach as you turn around to see the distance you covered. As you move along, about 10 minutes walk from this place, the flat ground leads to a narrow lane, which goes down to Kudle beach, the second of Gokarna’s beaches. This beach looks very unkempt, desolate and dirty in off-seasons. You will hardly find a soul here then.</p>
<p>One of the beaches which we crossed was called Om beach; it gets its name from the fact that if you are standing at the ridge just before it, you can actually see the shape of <a href="https://en.wikipedia.org/wiki/Om">Om</a> being formed by the beach. On the way, we crossed the famous Namaste cafe and some others which had cropped up.</p>
<p>At the end of the Om beach, there is a path going up the hill. Here one has to get around a hillock (about a 20-minute walk) to reach Half-moon beach. Take this trail, and when you reach a fork, take a right for the coast route and left for the forest route. They will both take you to the same place. Half-moon beach is so named because its shape resembles a half-moon.</p>
<p>At the end of the Half-moon beach, a small trail leads to Paradise beach, also known as Full-moon beach. It’s around a 20-minute walk from Half-moon beach. The thing to remember here is that after crossing the first set of rocks, one should not try to climb the hill; rather, try getting around it. It’s a much easier climb. The steep climb up the hill will take you to the next village, Bellekan. This is the last of the Gokarna beaches.</p>
<center><img src="/content/images/2019/03/om-beach.jpg" /></center>
<center><img src="/content/images/2019/03/om-beach-2.jpg" /></center>
<p>Gokarna is a place where a lot of foreigners come during their season time. By season time, they mean when it’s too cold back in their home countries. I met a couple from Russia who told me how they come here every winter.</p>
<center><img src="/content/images/2019/03/om-beach-3.jpg" /></center>
<center><img src="/content/images/2019/03/om-beach-4.jpg" /></center>
<center><img src="/content/images/2019/03/om-beach-5.jpg" /></center>
<p>By the time we reached the hilltop of the main Gokarna beach, it was already time for sunset. And I swear, it has been one of the most alluring sunsets I have witnessed in my life so far!</p>
<center><img src="/content/images/2019/03/gokarna-yoga.jpg" /></center>
<p>We could see, the sun getting gobbled up by the sea. The orange it was, shining bright. Spreading joy and heat, there were a few folks who were meditating, someone was drawing something and busy scribbling something on their scratch pad and here I was looking at the distant sea blankly. Just doing nothing but staring at it. I still remember that moment as I write this down.</p>
<center><img src="/content/images/2019/03/gokarna-hilltop.jpg" /></center>
<p>To be honest, I didn’t want the sunset to end and wanted it to just continue the whole day, all of us sitting there taking in the view to our heart’s content.</p>
<p>We headed back to the hostel to get a quick shower after which we got some dinner at a local restaurant which was serving <a href="https://en.wikipedia.org/wiki/Malvani_cuisine">konkani cuisine</a>.</p>
<p>Just after the dinner, we got to know about the festival which was happening at the Kotitheertha temple tank in the centre of the town and this whole section of the town was lit up with diyas and lights. Everyone was out on the streets trying to get a glimpse of the festival.</p>
<center><img src="/content/images/2019/03/Kotitheertha.jpg" /></center>
<p>The day ended with all of us crashing in our bunk beds, for me it was also the last night in Gokarna as I packed up my bags.</p>
<center><img src="/content/images/2019/03/gokarna-last-day.jpg" /></center>
<p>Morning came in quick and I had to rush to the local bus stop to grab a bus to Gokarna beach railway station. I only just made it in time to the station, where I was to board the train to my next stop.</p>
<h3 id="next-stop-the-historic-city-of-margao-goa">Next stop the historic city of Margao, Goa</h3>
<p>I was joined on the train journey by Michael, whom I had met back at the hostel where I was staying. The train ride was a meagre 30 INR and took roughly 3 hours to reach the destination.</p>
<center><img src="/content/images/2019/03/goa-train.jpg" /></center>
<p>After reaching Panjim, which was the last stop of the train, Michael and I parted ways as he was headed further north of Candolim, which was where I was going.</p>
<p>Local transport is a little hard to find in Goa, I noticed; most tourists rent a car or a scooter and cover the whole city with that, while the rest are left to get around with whatever other options remain.</p>
<p>I reached my hostel and freshened up a bit before heading straight to the beach right behind my hostel, it was a meagre 5-minute walk back to the beach and I was greeted by quite an old restaurant called Mama Cecilia’s shack. Ordered some shrimps and just sat down on the beachside chairs to relax and watch the sunset while dogs were running around in the water, playing with their owners and people jogging.</p>
<center><img src="/content/images/2019/03/goa-day-1.jpg" /></center>
<p>Candolim beach is another relatively busy soft sandy beach which, due to erosion, can be narrow in some places. Hawkers selling mostly cheap clothes, along with masseurs offering massages, make the beach a lively spot. A polite but firm no always works ~ though these people can be very interesting to talk to.</p>
<p>Hands down, the best restaurant in Candolim is After 7. Great selection of food and wine (mainly European) served in a beautiful garden. Great service, very romantic. The chap that owns the place ‘Leo’, super friendly, and very passionate about his restaurant. Opp Pedro Martina Resort, 1/274b, Gaura Vaddo.</p>
<p>Candolim itself is a small (but populated) village and can be covered by a brief walk in 20 mins. Walking or bicycling along the roads is not a great pleasure as it’s crowded with cars, extremely noisy and dusty. Proper walking paths are yet to come.</p>
<center><img src="/content/images/2019/03/candolim.jpg" /></center>
<center><img src="/content/images/2019/03/sunset-goa.jpg" /></center>
<p>I crashed in my bunk bed as I settled in for the night after lazing around and having conversations with the folks in the common area of the hostel, where I met a guy from China who had been living in India for some months.</p>
<p>He mentioned that he used to work for a startup back in China and described how he had no work-life balance when he was there and got burned out. He planned to stay a few more months, travelling to Varanasi and Rajasthan before heading back.</p>
<p>The next morning was quite pleasant; I woke up and had some breakfast at Mama Cecilia’s. The shrimps are to die for! I then headed to the <a href="https://en.wikipedia.org/wiki/Basilica_of_Bom_Jesus">Basilica de Bom Jesus</a>.</p>
<p>Famous throughout the Roman Catholic world, the imposing Basilica de Bom Jesus contains the tomb and mortal remains of St Francis Xavier, the so-called Apostle of the Indies. St Francis Xavier’s missionary voyages throughout the East became legendary. His ‘incorrupt’ body is in the mausoleum to the right, in a glass-sided coffin amid a shower of gilt stars.</p>
<p>Constructed between 1594 and 1605 AD, this church has a main altar, four side altars, two chapels, a sacristy and a choir.</p>
<center><img src="/content/images/2019/03/basilca.jpg" /></center>
<p>On the main altar, the whole back wall is designed, like the facade, with numerous carvings in wood: pillars, friezes and arabesques, all gilt in pure gold. Above the altar and tabernacle stands a giant statue of St. Ignatius of Loyola in priestly vestments, nearly three meters high.</p>
<p>The church is dedicated to Jesus — in Portuguese Bom Jesus meaning Good Jesus.</p>
<p>The day ended with me heading back to the hostel and packing my bags yet again for my final destination, which was home indeed.</p>
<p>I boarded a bus back to Bangalore, this one being an Airavat, as I felt I really needed to rest for a bit; the last few days had been quite something for me.</p>
<p>Reaching home was another level of satisfaction. I had finally completed the trip, an impromptu one, and I was very happy that I made it.</p>
<p>I did something very similar a few years back when I did my <a href="https://www.tasdikrahman.com/2016/12/22/solo-backpacking-trip-himachal-pradesh-shimla-manali-kullu-kasol-manikaran-bhuntar-triund-dharamshala-mcleodganj-budget-india/">backpacking trip to Himachal</a> while I was in college. This one felt very similar, except that it was solo.</p>
<p>7 days of backpacking across the southern edge of India had come to an end after completing close to 1400kms.</p>
<p>As you might have noticed, I had to rush through the last part of the trip. The initial plan was to spend two nights in every place, but as luck would have it, the trip was cut short as I had to be present in Bangalore for DevOpsDays India, where I would be giving <a href="https://www.youtube.com/watch?v=3WgqFoo9eek&feature=youtu.be">a talk</a>.</p>
<p>I would love to visit the southern coast again, covering Kerala and Mangalore; Cochin and Aleppey come to mind.</p>
<p>Until next time!</p>
Object Comparison
2019-03-21T00:00:00+00:00
https://www.tasdikrahman.com/2019/03/21/object-equality-object-oriented-programming
<p>When do you say that two objects are equal?</p>
<p>Let’s take the examples below, with Ruby as the language.</p>
<h3 id="comparing-primitive-objects">Comparing primitive objects</h3>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">001</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="mi">1</span> <span class="o">==</span> <span class="mi">1</span>
<span class="o">=></span> <span class="kp">true</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">002</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="s1">'tasdik'</span> <span class="o">==</span> <span class="s1">'tasdik'</span>
<span class="o">=></span> <span class="kp">true</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">003</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
</code></pre></div></div>
<h3 id="comparing-custom-objects">Comparing custom objects</h3>
<p>But what if you have a custom class with attributes of its own? You can’t really compare its instances using
<code class="language-plaintext highlighter-rouge">==</code> by default.</p>
<p>For example:</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">001</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="k">module</span> <span class="nn">Money</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">002</span><span class="p">:</span><span class="mi">1</span><span class="o">></span> <span class="k">class</span> <span class="nc">Wallet</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">003</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="nb">attr_accessor</span> <span class="ss">:rupee</span><span class="p">,</span> <span class="ss">:paise</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">004</span><span class="p">:</span><span class="mi">2</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">005</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="ss">rupee: </span><span class="mi">0</span><span class="p">,</span> <span class="ss">paise: </span><span class="mi">0</span><span class="p">)</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">006</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="vi">@rupee</span> <span class="o">=</span> <span class="n">rupee</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">007</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="vi">@paise</span> <span class="o">=</span> <span class="n">paise</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">00</span><span class="mi">8</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">00</span><span class="mi">9</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">010</span><span class="p">:</span><span class="mi">1</span><span class="o">></span> <span class="k">end</span>
<span class="o">=></span> <span class="ss">:initialize</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">011</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">012</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">office_wallet</span> <span class="o">=</span> <span class="no">Money</span><span class="o">::</span><span class="no">Wallet</span><span class="p">.</span><span class="nf">new</span>
<span class="o">=></span> <span class="c1">#<Money::Wallet:0x00007fb4328889c8 @rupee=0, @paise=0></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">013</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">personal_wallet</span> <span class="o">=</span> <span class="no">Money</span><span class="o">::</span><span class="no">Wallet</span><span class="p">.</span><span class="nf">new</span>
<span class="o">=></span> <span class="c1">#<Money::Wallet:0x00007fb430991290 @rupee=0, @paise=0></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">014</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">015</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">office_wallet</span> <span class="o">==</span> <span class="n">personal_wallet</span>
<span class="o">=></span> <span class="kp">false</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">016</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
</code></pre></div></div>
<p>As we can see, the two objects have the same attributes, rupee and paise, with the same values.
Logically they should be the same thing, and the <code class="language-plaintext highlighter-rouge">==</code> comparison should have returned <code class="language-plaintext highlighter-rouge">true</code>.</p>
<h3 id="object-equality-in-ruby">Object equality in ruby</h3>
<p><code class="language-plaintext highlighter-rouge">==</code> is the general comparison operator in Ruby.</p>
<p>At the <code class="language-plaintext highlighter-rouge">Object</code> level, <code class="language-plaintext highlighter-rouge">==</code> returns <code class="language-plaintext highlighter-rouge">true</code> only if obj and other are the very same object. Typically, this method is overridden
in descendant classes to provide class-specific meaning.</p>
<p>This is the most common comparison, and thus the most fundamental place where you (as the author of a class) get to
decide if two objects are “equal” or not.</p>
<p>The double-equals method should implement a general notion of equality for an object, which usually means that you
should compare the objects’ attributes rather than checking whether they are the same object in memory.</p>
<h3 id="equivalence-relation">Equivalence relation</h3>
<p>The general contract when we are overriding the == operator in ruby is that it should implement an
<a href="https://en.wikipedia.org/wiki/Equivalence_relation">equivalence relation</a>.</p>
<p>Which has the following properties.</p>
<ul>
<li>Reflexive: for any non-null reference value <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">x == x</code> must return <code class="language-plaintext highlighter-rouge">true</code>.</li>
<li>Symmetric: for any non-null reference values <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>, <code class="language-plaintext highlighter-rouge">x == y</code> must return <code class="language-plaintext highlighter-rouge">true</code> if and only if <code class="language-plaintext highlighter-rouge">y == x</code> returns <code class="language-plaintext highlighter-rouge">true</code>.</li>
<li>Transitive: for any non-null reference values <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code> and <code class="language-plaintext highlighter-rouge">z</code>, if <code class="language-plaintext highlighter-rouge">x == y</code> and <code class="language-plaintext highlighter-rouge">y == z</code> return <code class="language-plaintext highlighter-rouge">true</code>, then <code class="language-plaintext highlighter-rouge">x == z</code> must return <code class="language-plaintext highlighter-rouge">true</code>.</li>
<li>Consistent: for any non-null reference values <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>, multiple invocations of <code class="language-plaintext highlighter-rouge">x == y</code> must consistently return <code class="language-plaintext highlighter-rouge">true</code> or consistently return <code class="language-plaintext highlighter-rouge">false</code>.</li>
<li>For any non-null reference value <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">x == nil</code> must return <code class="language-plaintext highlighter-rouge">false</code>.</li>
</ul>
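<p>These properties can be checked mechanically. A minimal sketch, using a hypothetical <code>Point</code> value class (not from this post) whose <code>==</code> compares attributes:</p>

```ruby
# Hypothetical value class, used only to illustrate the equivalence contract.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Attribute-based equality; returns false for nil or objects of other classes.
  def ==(other)
    other.instance_of?(Point) && x == other.x && y == other.y
  end
end

a = Point.new(1, 2)
b = Point.new(1, 2)
c = Point.new(1, 2)

a == a                            # reflexive: true
(a == b) == (b == a)              # symmetric: true
(a == b) && (b == c) && (a == c)  # transitive: true
a == nil                          # false
```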
<p>If any of the above rules are violated, the equality operator you have overridden will behave erratically.</p>
<h3 id="overriding-eql-and-hash-methods-too">Overriding eql? and hash methods too</h3>
<p>Once you have overridden the <code class="language-plaintext highlighter-rouge">==</code> operator, you must also override the <code class="language-plaintext highlighter-rouge">eql?</code> and <code class="language-plaintext highlighter-rouge">hash</code> methods, even if you have never seen these methods being called anywhere. The reason is that these are the methods the Hash object
uses to compare your object when you use it as a Hash key. Hashes have to be fast at figuring
out whether a key is already in there, and to do this they avoid comparing every single object: they
“cluster” objects into groups using the value returned by your object’s <code class="language-plaintext highlighter-rouge">hash</code> method and then, within a
cluster, compare the objects themselves using <code class="language-plaintext highlighter-rouge">eql?</code>.</p>
<p>When searching for a key in a Hash, they first call <code class="language-plaintext highlighter-rouge">hash</code> on the key to figure out which group it would be in, and then
compare the key with all the other keys in that group using the <code class="language-plaintext highlighter-rouge">eql?</code> method.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">001</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="k">module</span> <span class="nn">Money</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">002</span><span class="p">:</span><span class="mi">1</span><span class="o">></span> <span class="k">class</span> <span class="nc">Wallet</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">003</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="nb">attr_accessor</span> <span class="ss">:rupee</span><span class="p">,</span> <span class="ss">:paise</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">004</span><span class="p">:</span><span class="mi">2</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">005</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="ss">rupee: </span><span class="mi">0</span><span class="p">,</span> <span class="ss">paise: </span><span class="mi">0</span><span class="p">)</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">006</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="vi">@rupee</span> <span class="o">=</span> <span class="n">rupee</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">007</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="vi">@paise</span> <span class="o">=</span> <span class="n">paise</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">00</span><span class="mi">8</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">00</span><span class="mi">9</span><span class="p">:</span><span class="mi">2</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">010</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">def</span> <span class="nf">==</span><span class="p">(</span><span class="n">other</span><span class="p">)</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">011</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="k">if</span> <span class="n">other</span><span class="p">.</span><span class="nf">nil?</span> <span class="o">||</span> <span class="o">!</span><span class="n">other</span><span class="p">.</span><span class="nf">instance_of?</span><span class="p">(</span><span class="no">Money</span><span class="o">::</span><span class="no">Wallet</span><span class="p">)</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">012</span><span class="p">:</span><span class="mi">4</span><span class="o">></span> <span class="kp">false</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">013</span><span class="p">:</span><span class="mi">4</span><span class="o">></span> <span class="k">else</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">014</span><span class="p">:</span><span class="mi">4</span><span class="o">></span> <span class="n">rupee</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="nf">rupee</span> <span class="o">&&</span> <span class="n">paise</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="nf">paise</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">015</span><span class="p">:</span><span class="mi">4</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">016</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">017</span><span class="p">:</span><span class="mi">2</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">01</span><span class="mi">8</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">alias</span> <span class="nb">eql?</span> <span class="o">==</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">01</span><span class="mi">9</span><span class="p">:</span><span class="mi">2</span><span class="o">*</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">020</span><span class="p">:</span><span class="mi">2</span><span class="o">*</span> <span class="k">def</span> <span class="nf">hash</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">021</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="p">[</span><span class="n">rupee</span><span class="p">,</span> <span class="n">paise</span><span class="p">].</span><span class="nf">hash</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">022</span><span class="p">:</span><span class="mi">3</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">023</span><span class="p">:</span><span class="mi">2</span><span class="o">></span> <span class="k">end</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">024</span><span class="p">:</span><span class="mi">1</span><span class="o">></span> <span class="k">end</span>
<span class="o">=></span> <span class="kp">nil</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">025</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">office_wallet</span> <span class="o">=</span> <span class="no">Money</span><span class="o">::</span><span class="no">Wallet</span><span class="p">.</span><span class="nf">new</span>
<span class="o">=></span> <span class="c1">#<Money::Wallet:0x00007f96c411ad20 @rupee=0, @paise=0></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">026</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">personal_wallet</span> <span class="o">=</span> <span class="no">Money</span><span class="o">::</span><span class="no">Wallet</span><span class="p">.</span><span class="nf">new</span>
<span class="o">=></span> <span class="c1">#<Money::Wallet:0x00007f96c40063d0 @rupee=0, @paise=0></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">027</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">02</span><span class="mi">8</span><span class="p">:</span><span class="mi">0</span><span class="o">></span> <span class="n">office_wallet</span> <span class="o">==</span> <span class="n">personal_wallet</span>
<span class="o">=></span> <span class="kp">true</span>
<span class="n">irb</span><span class="p">(</span><span class="n">main</span><span class="p">):</span><span class="mo">02</span><span class="mi">9</span><span class="p">:</span><span class="mi">0</span><span class="o">></span>
</code></pre></div></div>
<p>Not the most elegant solution of a <code class="language-plaintext highlighter-rouge">hash</code> method, but I hope you get the idea.</p>
<h3 id="a-better-hash-method">A better hash method?</h3>
<p>The requirement is that</p>
<ul>
<li>two objects that are equal must have the same hash value</li>
<li>but two objects with the same hash value may or may not be equal</li>
</ul>
<p>Which brings up the question: can we have hash functions which never have collisions?</p>
<p>There have been discussions about whether it is possible for a function to exist, which <a href="https://crypto.stackexchange.com/questions/8765/is-there-a-hash-function-which-has-no-collisions">will never produce hash
collisions</a>.
But it’s a hard problem to solve.</p>
<p>So when you override the equals method in Ruby, you should also override the <code class="language-plaintext highlighter-rouge">hash</code> and <code class="language-plaintext highlighter-rouge">eql?</code> methods.</p>
<h3 id="object-equality-in-java">Object equality in Java</h3>
<p>Similar to Ruby, when you implement an equivalence relation in Java, you have to override the</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">equals</code></li>
<li><code class="language-plaintext highlighter-rouge">hashCode</code></li>
</ul>
<p>methods.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kp">public</span> <span class="n">boolean</span> <span class="n">equals</span><span class="p">(</span><span class="no">Object</span> <span class="n">object</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">this</span> <span class="o">==</span> <span class="n">object</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="kp">true</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">object</span> <span class="o">==</span> <span class="n">null</span> <span class="o">||</span> <span class="n">getClass</span><span class="p">()</span> <span class="o">!=</span> <span class="n">object</span><span class="p">.</span><span class="nf">getClass</span><span class="p">()</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="kp">false</span><span class="p">;</span>
<span class="p">}</span>
<span class="no">MyClass</span> <span class="n">other</span> <span class="o">=</span> <span class="p">(</span><span class="no">MyClass</span><span class="p">)</span> <span class="n">object</span><span class="p">;</span>
<span class="k">return</span> <span class="n">this</span><span class="p">.</span><span class="nf">x</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="nf">x</span> <span class="o">&&</span> <span class="n">this</span><span class="p">.</span><span class="nf">y</span> <span class="o">==</span> <span class="n">other</span><span class="p">.</span><span class="nf">y</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>One of the most common mistakes people make while overriding the equals method is that, instead of accepting the
<code class="language-plaintext highlighter-rouge">Object</code> class as the parameter type, they use the class they are defining equals in itself, which overloads the method rather than overriding it.</p>
<h3 id="references">References</h3>
<ul>
<li>https://www.harukizaemon.com/blog/2005/12/28/how-to-write-eql-in-ruby/</li>
<li>https://ruby-doc.org/core-2.5.3/Hash.html#class-Hash-label-Hash+Keys</li>
<li>https://javarevisited.blogspot.com/2011/02/how-to-write-equals-method-in-java.html</li>
<li>https://stackoverflow.com/questions/7156955/whats-the-difference-between-equal-eql-and</li>
<li>https://stackoverflow.com/questions/1931604/whats-the-right-way-to-implement-equality-in-ruby</li>
<li>https://mauricio.github.io/2011/05/30/ruby-basics-equality-operators-ruby.html</li>
</ul>
What should and should not be tested in unit tests?
2019-03-13T00:00:00+00:00
https://www.tasdikrahman.com/2019/03/13/what-should-and-should-not-be-tested-in-unit-tests
<blockquote>
<p>I have written about <a href="https://www.tasdikrahman.com/2019/03/13/f-i-r-s-t-principles-of-testing/">F.I.R.S.T principles</a> of testing and <a href="https://www.tasdikrahman.com/2019/02/08/why-should-I-follow-test-driven-development/">TDD as a school of thought</a></p>
</blockquote>
<p>Probably an extreme opinion, but this is how <a href="https://blog.codinghorror.com/i-pity-the-fool-who-doesnt-write-unit-tests/">Jeff Atwood</a> puts it</p>
<blockquote>
<p>I Pity The Fool Who Doesn’t Write Unit Tests</p>
</blockquote>
<h3 id="but-what-should-you-test">But what should you test?</h3>
<p>This I generally try to follow</p>
<ul>
<li>Test the common case of everything you can. This will tell you when that code breaks after you make some change (which is, in my opinion, the single greatest benefit of automated unit testing).</li>
<li>Test the edge cases of any unusually complex code that you think will probably have errors.</li>
<li>Whenever you find a bug, write a test case to cover it before fixing it.</li>
<li>Add edge-case tests to less critical code whenever someone has time to kill.</li>
</ul>
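<p>The third point, writing a test to cover a bug before fixing it, can be sketched with Minitest (which ships with Ruby); the <code>parse_amount</code> helper here is a hypothetical example, not code from this post:</p>

```ruby
require "minitest/autorun"

# Hypothetical helper: imagine a bug report said "100.5" was being
# truncated to 100 because the old implementation used str.to_i.
def parse_amount(str)
  Float(str) # the fix; the regression test below was written first
end

class ParseAmountTest < Minitest::Test
  # This test failed against the buggy str.to_i version (red),
  # and now pins the corrected behaviour so it cannot regress.
  def test_keeps_the_fractional_part
    assert_in_delta 100.5, parse_amount("100.5")
  end
end
```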
<p>This will not only help you deliver and release faster, but will also make you more confident about your own codebase.</p>
<blockquote>
<p>Writing tests and having 100% code coverage does not necessarily mean that your code is bug free, but I feel it’s certainly better than having no tests at all.</p>
</blockquote>
<h3 id="what-should-you-not-test">What should you not test?</h3>
<ul>
<li>The code is trivial. A getter that returns 0 doesn’t need to be tested, and changes will be covered by tests for its consumers.</li>
<li>The code simply passes through into a stable API; you can assume that the standard library works properly.</li>
<li>The code needs to interact with other deployed systems; an integration test is called for instead.</li>
<li>The measure of success/failure is so difficult to quantify as to not be reliably measurable, such as steganography being unnoticeable to humans.</li>
<li>The test itself is an order of magnitude more difficult to write than the code.</li>
<li>The code is throw-away or placeholder code. If there’s any doubt, test.</li>
</ul>
<h3 id="references">References</h3>
<ul>
<li>https://softwareengineering.stackexchange.com/a/147075/169827</li>
<li>https://softwareengineering.stackexchange.com/a/754/169827</li>
<li>https://blog.codinghorror.com/i-pity-the-fool-who-doesnt-write-unit-tests/</li>
</ul>
F.I.R.S.T principles of testing
2019-03-13T00:00:00+00:00
https://www.tasdikrahman.com/2019/03/13/f-i-r-s-t-principles-of-testing
<h1 id="first-principles-of-testing-stand-for">The F.I.R.S.T principles of testing stand for</h1>
<ul>
<li>Fast</li>
<li>Isolated/Independent</li>
<li>Repeatable</li>
<li>Self-validating</li>
<li>Thorough</li>
</ul>
<blockquote>
<p>Bugs are introduced in the parts of code, which we usually don’t pay attention to, or places which are too hard to understand.</p>
</blockquote>
<h3 id="fast">Fast</h3>
<p>The developer shouldn’t hesitate to run the unit tests at any point of their development cycle, even if there are thousands of them. They should run and show you the desired output in a matter of seconds.</p>
<h3 id="isolated">Isolated</h3>
<p>Any given unit test should be independent of everything else, both in its environment variables and its setup, so that its results are not influenced by any other factor.</p>
<p>Should follow the <a href="https://xp123.com/articles/3a-arrange-act-assert/">3 A’s of testing: Arrange, Act, Assert</a></p>
<p>In some literature, it’s also called <a href="https://martinfowler.com/bliki/GivenWhenThen.html">Given, when, then</a>.</p>
<h3 id="arrange">Arrange</h3>
<p>All the data should be provided to the test when you’re about to run it, and it should not depend on the environment you are running the tests in.</p>
<h3 id="act">Act</h3>
<p>Invoke the actual method under test.</p>
<h3 id="assert">Assert</h3>
<p>At any given point, a unit test should only assert one logical outcome; multiple physical asserts can make up that one logical assert, as long as they all act on the state of the same object.</p>
<p>Preferably, don’t perform any actions after the assert call.</p>
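<p>A minimal sketch of the three A’s together in Minitest, using Ruby’s Array as a stand-in stack (a hypothetical example, not code from this post):</p>

```ruby
require "minitest/autorun"

class StackTest < Minitest::Test
  def test_pop_returns_the_last_pushed_element
    # Arrange: the test builds all of its own data, no environment needed.
    stack = []
    stack.push(1)
    stack.push(2)

    # Act: invoke the one method under test.
    result = stack.pop

    # Assert: one logical outcome, and nothing after the assert.
    assert_equal 2, result
  end
end
```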
<h3 id="repeatable">Repeatable</h3>
<p>Tests should be repeatable and deterministic; their results shouldn’t change when run in different environments.
Each test should set up its own data and should not depend on any external factors to run.</p>
<h3 id="self-validating">Self-validating</h3>
<p>You shouldn’t need to check manually whether the test passed or not.</p>
<h3 id="thorough">Thorough</h3>
<ul>
<li>should cover all the happy paths</li>
<li>should try to cover all the edge cases where the author feels the function could fail</li>
<li>should test for illegal arguments and variables</li>
<li>should test for security and other issues</li>
<li>should test for large values, and what a large input would do to the program</li>
<li>should try to cover every use-case scenario and not just aim for 100% code coverage</li>
</ul>
<h2 id="references">References:</h2>
<ul>
<li>https://github.com/ghsukumar/SFDC_Best_Practices/wiki/F.I.R.S.T-Principles-of-Unit-Testing</li>
<li>https://martinfowler.com/bliki/GivenWhenThen.html</li>
<li>https://xp123.com/articles/3a-arrange-act-assert/</li>
</ul>
Test-driven development as a school of thought
2019-02-08T00:00:00+00:00
https://www.tasdikrahman.com/2019/02/08/why-should-I-follow-test-driven-development
<p><a href="https://www.wsj.com/articles/SB10001424053111903480904576512250915629460">Software is eating the world</a>, and so is the world of software development constantly changing.</p>
<p>One way of developing a project would involve</p>
<ul>
<li>analysts figure out the business requirements, sitting on them for a few weeks if not months</li>
<li>these requirements are handed to the architects, who break the problem down into manageable chunks</li>
<li>the chunks themselves are then given out to the teams which will be delivering the specific modules.</li>
</ul>
<p>Nothing but a typical <a href="https://en.wikipedia.org/wiki/Waterfall_model">waterfall model</a> scenario, coming with its obvious pros and cons.</p>
<p>TDD, or test-driven development, follows a similar approach: gathering the requirements, doing the analysis and modelling the desired system before writing any code for it.</p>
<p>To write a test, you will be gathering what the input should be and what the output should be (at a very high level).</p>
<p>Once you have some knowledge about what needs to be done, you will be able to write those fine-grained requirements in the form of tests.</p>
<p>While writing those tests, you will figure out the gaps and misunderstandings in the requirements before you’ve committed them to the project in the form of executable code.</p>
<p>What you would gain immediately after adopting TDD</p>
<ul>
<li>you get a better understanding of the project that you’re going to build.</li>
</ul>
<p>This is because of the fact that, you are only implementing the features that are required immediately in the first iteration.</p>
<ul>
<li>you will have more confidence while refactoring your code</li>
</ul>
<p>The problem comes when you want to refactor certain areas of your code base, and this is where TDD shines if you ask me. You can be confident about the functionality, because your unit tests only test the end behaviour of your program, not the implementation details of how it is achieved.</p>
<p>This way, you can easily go with the <a href="https://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html">Red Green refactor</a> cycle.</p>
<ul>
<li>immediate feedback on whether what you’re working on is fine or not, hence fewer bugs</li>
</ul>
<p>If you’re covering all the edge cases as part of writing the tests, you are already sorted in having a system which has minimal bugs in terms of functionality.</p>
<ul>
<li>loosely coupled, highly modular code.</li>
</ul>
<p>Something which I have been trying to follow lately is to not check in any production code, before writing the tests for it.</p>
<p>Write the specs, run the spec, let it fail (red). Implement the interface that you’re testing (make the code green), and then refactor.</p>
<p>Some of the things which I noticed worked well while writing tests were</p>
<ul>
<li>Avoid writing procedural style tests</li>
<li>Use test doubles while testing, and make sure you are running single units of code in isolation.</li>
<li>Follow the <a href="https://martinfowler.com/bliki/GivenWhenThen.html">given when then</a> style of formatting your tests</li>
<li>Unit tests need true isolation, and they shouldn’t be hitting databases or opening sockets when you are testing something.</li>
<li>Use only one assertion per test.</li>
</ul>
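The practices above can be sketched in a small Python <code class="language-plaintext highlighter-rouge">unittest</code> example. The <code class="language-plaintext highlighter-rouge">charge()</code> function and the gateway client are hypothetical names invented for illustration; the point is the shape of the tests: a test double keeps the unit isolated (no sockets, no database), the comments follow the given/when/then structure, and each test makes a single assertion.

```python
import unittest
from unittest import mock

# Hypothetical unit under test: charge() validates input and
# delegates the actual work to a payment gateway client.
def charge(gateway, amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return gateway.create_charge(amount)

class TestCharge(unittest.TestCase):
    def test_charge_delegates_to_gateway(self):
        # Given: a test double standing in for the real gateway
        gateway = mock.Mock()
        gateway.create_charge.return_value = "charge-id-1"
        # When: we charge a positive amount
        result = charge(gateway, 100)
        # Then: one assertion about the observable behaviour
        self.assertEqual(result, "charge-id-1")

    def test_charge_rejects_illegal_amount(self):
        # Given/When/Then: illegal arguments should fail loudly
        with self.assertRaises(ValueError):
            charge(mock.Mock(), -5)
```

Run with <code class="language-plaintext highlighter-rouge">python -m unittest</code>; both tests start red until <code class="language-plaintext highlighter-rouge">charge()</code> is implemented.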
<p>Testing is like security: you can never be 100% sure you’ve got it right, but it surely adds to the confidence in what you’ve built.</p>
<p>Say you have written some 20 tests and all of them pass from the very first run; you can’t be sure whether they are actually exercising the code. But if you started with all of them in the red state and made them green one by one, you would have much more confidence in the system you have built.</p>
<p>This back and forth does take time. But it’s a process which tries to address the concern of having doubts about whether what you built is a resilient system or not.</p>
<h2 id="resources">Resources</h2>
<ul>
<li>http://misko.hevery.com/2008/08/14/procedural-language-eliminated-gotos-oo-eliminated-ifs/</li>
<li>https://martinfowler.com/bliki/GivenWhenThen.html</li>
<li>https://stackoverflow.com/questions/920992/unit-test-adoption</li>
<li>http://www.natpryce.com/articles/000714.html</li>
<li>https://javaranch.com/journal/200603/EvilUnitTests.html</li>
</ul>
Moving Canary deployments on AWS using ELB to kubernetes using Traefik
2018-10-25T00:00:00+00:00
https://www.tasdikrahman.com/2018/10/25/canary-deployments-on-AWS-and-kubernetes-using-traefik
<p><a href="https://martinfowler.com/bliki/CanaryRelease.html">Canary deployment</a> pattern is very similar to <a href="https://martinfowler.com/bliki/BlueGreenDeployment.html">Blue green deployments</a>, where you are deploying a certain version of your application to a subset of your application servers. If everything is alright and you have tested out that everything is working fine, you route a certain percentage of your users to those application servers and gradually keep increasing the traffic till a full rollout is achieved.</p>
<p>One of the many reasons to do this can be to test a certain feature out with a percentage of users who use your service. This can be further extended to enabling a service to users of a particular demographic.</p>
<h2 id="canary-deployments-on-aws">Canary deployments on AWS</h2>
<p>Canary, in our use case at <a href="https://www.razorpay.com">Razorpay</a>, was used for one of the APIs we served and gave out for consumption. Before we were on <a href="https://kubernetes.io">kubernetes</a>, the method for canary deployments was to have two separate <a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html">Autoscaling Groups</a>: a primary ASG serving the particular API, and another ASG with a smaller desired count, which we will call the canary ASG for now.</p>
<p>Now both</p>
<ul>
<li>primary ASG</li>
<li>canary ASG</li>
</ul>
<p>had their own individual <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html">ELBs</a> attached, sitting in our public subnet. Both ELBs had a CNAME DNS record pointing to their public FQDN given out by AWS.</p>
<center><img src="/content/images/2018/10/vpc-diagram.png" /></center>
<p>For simplicity of the diagram, I have not drawn the ASGs for the canary and the main service across 2 separate AZs, but that is the recommended way to go: in case of an AZ failure, you have the other set of ASG instances for the ELB to route to (with cross-zone load balancing enabled).</p>
<p>The canary ASG would be attached to both the</p>
<ul>
<li>main ELB for the service</li>
<li>canary’s separate ELB</li>
</ul>
<p>The capacity (min and desired) of the ASG for the main service is larger than that of the canary ASG, and the canary ASG’s max is set equal to its desired count, so autoscaling can never add canary instances. The reasoning is that any regression should not propagate to a larger number of users if autoscaling kicks in.</p>
<p>Since our ELB is an Internet-facing load balancer, it gets public IP addresses(one for each AZ). The DNS name of an Internet-facing load balancer is publicly resolvable to the public IP addresses of the nodes of the ELB. Therefore, Internet-facing load balancers can route requests from clients over the Internet.</p>
<p>The load balancer node that receives the request selects an attached instance using the <a href="https://community.cisco.com/t5/routing/what-is-round-robin-routing/td-p/1400406">round robin routing</a> algorithm for TCP listeners and the least outstanding requests routing algorithm for HTTP and HTTPS listeners.</p>
<p>Hence, the canary instances would also get traffic in a round robin fashion.</p>
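To make the round-robin maths concrete, here is a tiny sketch (the instance counts are illustrative, not our actual capacities): with round-robin across all instances attached to the main ELB, the canary’s share of traffic is simply its fraction of the attached instances.

```python
# With round-robin load balancing, each instance attached to the main ELB
# receives an equal share of requests, so the canary's traffic share is
# just its fraction of the attached instance count.
def canary_traffic_share(primary_instances, canary_instances):
    total = primary_instances + canary_instances
    return canary_instances / total

# e.g. 9 primary instances + 1 canary instance: 10% of traffic hits canary
share = canary_traffic_share(9, 1)
print(f"{share:.0%}")  # -> 10%
```

This is also why the canary ASG’s max is pinned: adding canary instances would silently grow that fraction.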
<h2 id="replicating-the-same-in-kubernetes">Replicating the same in kubernetes</h2>
<p><a href="https://traefik.io/">traefik</a> runs as our L7 load balancer, or as our <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-controllers">ingress controller</a> inside kubernetes to route traffic to our <a href="https://kubernetes.io/docs/concepts/services-networking/service/">kubernetes services</a> for various microservices running inside our cluster.</p>
<p>traefik would be running on <code class="language-plaintext highlighter-rouge">hostNetwork: true</code> as <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSet</a></p>
<p>These pods will use the host network directly and not the “<a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/">pod network</a>” (the term “pod network” is a little misleading, as there is no such thing; it basically comes down to routing network packets and namespaces). So we can bind the Traefik ports on the host interface on port <code class="language-plaintext highlighter-rouge">80</code>. That also means, of course, that no further pods of the DaemonSet can use this port, and neither can any other service on the worker nodes. But that’s what we want here, as Traefik is basically our “external” load balancer for our “internal” services: our tunnel to the rest of the internet, so to say.</p>
<p>Sample configuration which you can use to deploy traefik</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterRoleBinding</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">rbac.authorization.k8s.io/v1</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">roleRef</span><span class="pi">:</span>
<span class="na">apiGroup</span><span class="pi">:</span> <span class="s">rbac.authorization.k8s.io</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterRole</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">subjects</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">kind</span><span class="pi">:</span> <span class="s">ServiceAccount</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">kube-system</span>
<span class="nn">---</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterRole</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">rbac.authorization.k8s.io/v1</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">rules</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">apiGroups</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">resources</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">pods</span>
<span class="pi">-</span> <span class="s">services</span>
<span class="pi">-</span> <span class="s">endpoints</span>
<span class="pi">-</span> <span class="s">secrets</span>
<span class="na">verbs</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">get</span>
<span class="pi">-</span> <span class="s">list</span>
<span class="pi">-</span> <span class="s">watch</span>
<span class="pi">-</span> <span class="na">apiGroups</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">extensions</span>
<span class="na">resources</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">ingresses</span>
<span class="na">verbs</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">get</span>
<span class="pi">-</span> <span class="s">list</span>
<span class="pi">-</span> <span class="s">watch</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ConfigMap</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">kube-system</span>
<span class="na">data</span><span class="pi">:</span>
<span class="na">traefik-config</span><span class="pi">:</span> <span class="pi">|-</span>
<span class="s">defaultEntryPoints = ["http","https"]</span>
<span class="s">[entryPoints]</span>
<span class="s">[entryPoints.http]</span>
<span class="s">address = ":80"</span>
<span class="s">[entryPoints.http.redirect]</span>
<span class="s">regex = "^http://(.*)"</span>
<span class="s">replacement = "https://$1"</span>
<span class="s">[entryPoints.https]</span>
<span class="s">address = ":443"</span>
<span class="s">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ServiceAccount</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">kube-system</span>
<span class="nn">---</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Service</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-service</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">selector</span><span class="pi">:</span>
<span class="na">k8s-app</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">protocol</span><span class="pi">:</span> <span class="s">TCP</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">http</span>
<span class="na">port</span><span class="pi">:</span> <span class="m">80</span>
<span class="pi">-</span> <span class="na">protocol</span><span class="pi">:</span> <span class="s">TCP</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">admin</span>
<span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">type</span><span class="pi">:</span> <span class="s">NodePort</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">DaemonSet</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">namespace</span><span class="pi">:</span> <span class="s">kube-system</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">k8s-app</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">selector</span><span class="pi">:</span>
<span class="na">matchLabels</span><span class="pi">:</span>
<span class="na">k8s-app</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">updateStrategy</span><span class="pi">:</span>
<span class="na">rollingUpdate</span><span class="pi">:</span>
<span class="na">maxUnavailable</span><span class="pi">:</span> <span class="m">1</span>
<span class="na">template</span><span class="pi">:</span>
<span class="na">metadata</span><span class="pi">:</span>
<span class="na">labels</span><span class="pi">:</span>
<span class="na">k8s-app</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">spec</span><span class="pi">:</span>
<span class="na">nodeSelector</span><span class="pi">:</span>
<span class="na">edge-node-label</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">serviceAccountName</span><span class="pi">:</span> <span class="s">traefik-ingress-controller</span>
<span class="na">terminationGracePeriodSeconds</span><span class="pi">:</span> <span class="m">60</span>
<span class="na">hostNetwork</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">containers</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">image</span><span class="pi">:</span> <span class="s">traefik:v1.7.16-alpine</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik-ingress-lb</span>
<span class="na">ports</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">http</span>
<span class="na">containerPort</span><span class="pi">:</span> <span class="m">80</span>
<span class="na">hostPort</span><span class="pi">:</span> <span class="m">80</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">admin</span>
<span class="na">containerPort</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">securityContext</span><span class="pi">:</span>
<span class="na">privileged</span><span class="pi">:</span> <span class="no">true</span>
<span class="na">args</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">--loglevel=INFO</span>
<span class="pi">-</span> <span class="s">--web</span>
<span class="pi">-</span> <span class="s">--kubernetes</span>
<span class="pi">-</span> <span class="s">--web.metrics.prometheus</span>
<span class="pi">-</span> <span class="s">--web.metrics.prometheus.buckets=0.1,0.3,1.2,5</span>
<span class="pi">-</span> <span class="s">--configFile=/etc/traefik/traefik.toml</span>
<span class="na">resources</span><span class="pi">:</span>
<span class="na">limits</span><span class="pi">:</span>
<span class="na">cpu</span><span class="pi">:</span> <span class="s">200m</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">300Mi</span>
<span class="na">requests</span><span class="pi">:</span>
<span class="na">cpu</span><span class="pi">:</span> <span class="s">100m</span>
<span class="na">memory</span><span class="pi">:</span> <span class="s">150Mi</span>
<span class="na">volumeMounts</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
<span class="na">mountPath</span><span class="pi">:</span> <span class="s">/etc/traefik</span>
<span class="na">volumes</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">config-volume</span>
<span class="na">configMap</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">traefik</span>
<span class="na">items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">traefik-config</span>
<span class="na">path</span><span class="pi">:</span> <span class="s">traefik.toml</span>
</code></pre></div></div>
<center><img src="/content/images/2018/10/canary-k8s-traefik-vpc.png" /></center>
<p>The diagram above shows two ASG’s for edge nodes, which will host the traefik daemonset(s).</p>
<p>There would be a CNAME DNS record for <code class="language-plaintext highlighter-rouge">myapp.example.com</code> which points to the public FQDN of the common ELB to which both the edge ASGs are attached. Traffic would be routed to the attached edge VMs in a round-robin fashion. Before that, the security groups attached to the ASGs can be configured to only allow TCP connections on port 80 (everything else is blocked automatically, as the default is deny).</p>
<p>Similarly, there would be a DNS record for the canary.</p>
<p>Traefik would be listening on port 80 on the host’s network for incoming requests, and there would be an ingress object in the namespace of the app which would define which service to route the traffic based on the hostname.</p>
<p>We can have an ingress object like the following in the namespace <code class="language-plaintext highlighter-rouge">myapp</code> for the services</p>
<ul>
<li>myapp</li>
<li>myapp-canary</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># this feature is available only from traefik version 1.7.0 and upwards
# https://github.com/containous/traefik/releases/tag/v1.7.0
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik-external
    traefik.ingress.kubernetes.io/service-weights: |
      myapp: 90%
      myapp-canary: 10%
  name: myapp-ingress
  namespace: myapp
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - backend:
          serviceName: myapp
          servicePort: 80
        path: /
      - backend:
          serviceName: myapp-canary
          servicePort: 80
        path: /
status:
  loadBalancer: {}
</code></pre></div></div>
<p>This way, traefik would route the traffic coming to <code class="language-plaintext highlighter-rouge">myapp.example.com</code> to the services</p>
<ul>
<li>myapp: 90% of the traffic would be routed here.</li>
<li>myapp-canary: 10% of the traffic would be routed here.</li>
</ul>
<p>You can specify weights for multiple services in the same way; rather than repeat the docs, see https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting for the details.</p>
<p>Another thing to note here is that the service you are doing weighted routing to for your canary should be in the same namespace as the other service (<code class="language-plaintext highlighter-rouge">myapp</code> in this case). This was asked in their issue tracker, and they pointed out the same: https://github.com/containous/traefik/issues/4043.</p>
<p>So we have seen how to do canary deployments on AWS using traditional ELBs and ASGs, as well as on kubernetes with an ingress controller.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html">https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html</a></li>
<li><a href="https://martinfowler.com/bliki/CanaryRelease.html">https://martinfowler.com/bliki/CanaryRelease.html</a></li>
<li><a href="https://github.com/containous/traefik/pull/3112">https://github.com/containous/traefik/pull/3112</a> PR where weighted traffic was added to traefik</li>
<li><a href="https://docs.traefik.io/configuration/backends/kubernetes/">https://docs.traefik.io/configuration/backends/kubernetes/</a></li>
<li><a href="https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting">https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting</a></li>
</ul>
<h2 id="credits">Credits</h2>
<p>The Network diagrams were made using <a href="https://draw.io">draw.io</a></p>
Monoliths are just fine
2018-10-24T00:00:00+00:00
https://www.tasdikrahman.com/2018/10/24/monoliths-are-just-fine
<p>A lot of great material has already been <a href="https://martinfowler.com/articles/microservices.html">written</a> out there around what microservices are and what they are not.</p>
<p>What I would try putting down here is what I saw as we grew from a monolith to a microservices architecture over the period of time back here at <a href="https://www.razorpay.com">Razorpay</a>. Please take it with a grain of salt when you read this as this is going to be opinionated.</p>
<blockquote>
<p>Microservices are something which you grow into, not something you start with</p>
</blockquote>
<p>Having a microservice(s) based architecture just for the sake of having one is kind of like using <a href="https://kubernetes.io">kubernetes</a> to deploy a service or two on it. In the end, you will only overcomplicate things by having multiple things to manage. And trust me, <a href="https://codeengineered.com/blog/2017/kubernetes-is-hard/">kubernetes is hard</a>. I say that with at least a year and a half of running production workloads at Razorpay on self-hosted kubernetes and having scaled them to what it is now. It’s complicated and can be hard to get right at first. But makes life much easier when you have multiple services to manage at the very least.</p>
<p>If you have a look at Shopify, they are one huge Ruby on Rails shop. And the last time I checked their scale is pretty huge. So the argument that microservices are the only way to scale is not entirely true.</p>
<p>There are reasons why I feel monoliths are the way one should start with, because</p>
<blockquote>
<p>Microservices are complicated to manage</p>
</blockquote>
<p>If you’re just starting to write your app, you need to get market validation fast and get the features out before your competition does, instead of having to worry about 10 other different things. Your customers/investors probably won’t care about how your service works or what the underlying tooling is, unless they are techies curious about how it works internally; at the end of the day, they just want things to work!</p>
<p>This is the part where you are working on your prototype, and you don’t need the extra overhead of worrying about how the 10 different microservices you wrote interact with each other. At the end of the day the tooling and the architecture matter, but not if you have not even attained that critical mass. This is a little similar to the fight over which language to choose over another: of course there would be obvious choices for some specific jobs, but you can always start with the one which you and the rest of the team are most comfortable with, to <a href="http://www.paulgraham.com/avg.html">beat the averages</a>.</p>
<p>Plus, the overhead of managing health checks, monitoring, and graceful shutdown for all the microservices is something you need to wrap your head around when starting with microservices, apart from writing your main business logic.</p>
<p>They bring a lot of baggage that you might never have seen before (i.e. big learning curve) and the myth of isolated changes is just that, a myth. Unless it is some low-level thing, you cannot change it without impacting other services and this is no different than a monolith. It just replaces internal calls between services of your monolith with flaky and slower network calls.</p>
<blockquote>
<p>Architect your monolith right</p>
</blockquote>
<p>When you start off with your monolith, there would be places where you see things repeating, and that’s where you separate those out into functions or modules. But at the start, you should not be afraid to repeat yourself rather than trying to modularise/abstract out everything, as you would not like to end up with <a href="https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/">leaky abstractions</a>.</p>
<p>When you understand the boundaries and functions well enough, that’s when you would start modularizing your codebase. So as to have clear distinctions of which part of the codebase does what. This helps in a few things, specific teams being able to work on specific codebases and clear distinction between core functionality and helper functions.</p>
<p>When you are done dividing your codebase into modules, using a message queue for asynchronous communication between the modules would make more sense. This will enable you to debug faster, distribute the work of different components to different teams for things to start with.</p>
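As a toy illustration of that asynchronous-communication idea (the module names are invented, and an in-process <code class="language-plaintext highlighter-rouge">queue.Queue</code> stands in for a real broker such as RabbitMQ or Kafka):

```python
import queue
import threading

# Two modules of the same monolith communicating asynchronously through
# a queue: the producer publishes events instead of calling the other
# module directly, so each side can be debugged and owned separately.
events = queue.Queue()

def orders_module():
    # Core module publishes an event describing what happened.
    events.put({"type": "order_created", "order_id": 42})
    events.put(None)  # sentinel: no more events

def notifications_module(handled):
    # A separate module consumes events at its own pace.
    while (event := events.get()) is not None:
        handled.append(event["order_id"])

handled = []
consumer = threading.Thread(target=notifications_module, args=(handled,))
consumer.start()
orders_module()
consumer.join()
print(handled)  # -> [42]
```

Swapping the in-process queue for a real broker later is exactly the kind of seam that makes a future split into services cheaper.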
<p>With all the above in check, you already have a lot of things sorted which would not become a technical debt when trying to move to microservices.</p>
<p>From what I have noticed is, over-optimising from the start is just gonna bog you down. The priority at the start should be to</p>
<ul>
<li>Get it working.</li>
<li>Get feedback about your product.</li>
<li>Fix the bugs when they are reported or when you find any.</li>
<li>If things go down, look why they went wrong and fix them. This is where you will learn from your experience.</li>
<li>Repeat the whole process.</li>
</ul>
<p>I think of it this way: your product is not a program like the <a href="http://man7.org/linux/man-pages/man1/ls.1.html">ls</a> command, which is feature complete. You need to constantly iterate upon it, but even before that, you have to give out a working model for people out there to use.</p>
<blockquote>
<p>Moving to a microservice?</p>
</blockquote>
<p>The argument of moving to a microservice can be made when</p>
<ul>
<li>you have divided work among teams, and there is friction and miscommunication when different teams contribute to common parts of the codebase. Even if you manage to land a change, it’s quite possible that what you added caused a regression somewhere, and at this point your automated tests in CI should ideally catch it. But when you need a clear separation of ownership with multiple teams working on things, microservices can make sense.</li>
<li>you would want to rewrite a piece of codebase into something more performant.</li>
<li>feature velocity</li>
<li>autonomy of teams to iterate faster with their own choice of technology</li>
<li>make deployments quicker, with an impact on how stories are structured.</li>
</ul>
<p>The jump from monolith to service-oriented thinking is a huge one. But the jump from a few services to more is much easier.</p>
<p>Most of the times, I’ve found a push to microservices within an organization to be due to some combination of:</p>
<p>1) Business is pressuring tech teams to deliver faster, and they cannot, so they blame current system (derogatory name: monolith) and present microservices as solution. Note, this is the same tired argument from years ago when people would refer to legacy systems/legacy code as the reason for not being able to deliver.</p>
<p>2) Inexperienced developers proposing microservices because they think it sounds much more fun than working on the system as it is currently designed.</p>
<p>3) Technical people trying to avoid addressing the lack of communication and leadership in the organization by implementing technical solutions. This is common in the case where tech teams end up trying to “do microservices” as a way to reduce merge conflicts or other such difficulties that are ultimately a problem of human interaction and lack of leadership. Technology does not solve these problems.</p>
<p>4) Inexperienced developers not understanding the immense costs of coordination/operations/administration that come along with a microservices architecture.</p>
<p>5) Some people read about microservices on the engineering blog of one of the major tech companies, and those people are unaware that such blogs are a recruiting tool of said company. Many (most?) of those posts are specifically designed to be interesting and present the company as doing groundbreaking stuff in order to increase inbound applicant demand and fill seats. Those posts should not be construed as architectural advice or <em>best practices</em></p>
<p>In the end, it’s absolutely the case that a movement to microservices should be evolutionary and in direct response to technical requirements. For nearly every company out there, a horizontally-scaled monolith will be much simpler to maintain and extend than some web of services, each of which can be horizontally scaled on its own.</p>
<p>IMHO, the monoliths vs microservices debate is akin to monorepos vs multi-repos: they are both strategies used to share work when your organization grows. Both can work well, depending on your tooling and organization.</p>
<p>But do not forget that those abstractions layers you add, while very useful (say, for release velocity), might also be a direct application of <a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway’s Law</a></p>
<p>Which means that refactoring some code might sometimes require refactoring your organisation; if you lack the ability to do that incrementally, you might converge to an ossified system that stops evolving.</p>
<blockquote>
<p>So, Microservices or Monoliths?</p>
</blockquote>
<p>It depends.</p>
<h2 id="references">References</h2>
<ul>
<li>https://martinfowler.com/bliki/MonolithFirst.html</li>
<li>https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/</li>
<li>http://www.paulgraham.com/avg.html</li>
</ul>
Pillars of Observability
2018-10-01T00:00:00+00:00
https://www.tasdikrahman.com/2018/10/01/pillars-of-observability
<p>I haven’t written much for most of this year; I hope that changes as we head towards the end of it. This year has been very fruitful in terms of learnings and I can’t wait to share what I have learned.</p>
<p>This post is basically an introduction to what I have understood by the term observability, applied to your infrastructure and the services hosted on top of it.</p>
<p>There are 3 pillars of observability:</p>
<ul>
<li>
<p>metrics</p>
</li>
<li>
<p>logs</p>
</li>
<li>
<p>tracing</p>
</li>
</ul>
<p>While all of them partially overlap, each has a different purpose.</p>
<h3 id="metrics">Metrics</h3>
<p>These are numbers describing how many events of some kind happened in a time range, like:</p>
<ul>
<li>number of successful/failed/overall requests</li>
<li>cumulative duration of requests</li>
<li>bucketed histogram of requests’ durations</li>
</ul>
<p>The main point of metrics is that they are small: you can gather them cheaply and store them for a long period of time. They give you an overall view of the whole system, but without per-event detail.</p>
<p>So, metrics answer the question “how does my system’s performance change over time?”.</p>
<p>I would add visualization to it too, as it goes hand in hand with metrics.</p>
<p>Historically people have used statsd along with graphite as the storage backend. I personally prefer <a href="https://prometheus.io/">prometheus</a>, which is an open-source, metrics-based monitoring system. It does one thing and does it pretty well, with a simple yet powerful data model and a query language which lets you analyse how your applications and infrastructure are performing.</p>
<p>I believe in the philosophy that</p>
<blockquote>
<p>if something moves, you should track it and graph it, just in case it decides to make a run for it.</p>
</blockquote>
<p>Prometheus is pretty mature in the ecosystem, has a lot of contributors from the OSS community, and ships with native <a href="https://kubernetes.io">kubernetes</a> metrics support. Aggregating metrics from k8s objects is pretty straightforward with the service discovery prometheus provides.</p>
<p>There are also a lot of exporters, some written by the prometheus team and some supported by the community. And if you want to instrument your own code, you can use any of the <a href="https://prometheus.io/docs/instrumenting/clientlibs/">client libraries</a> to do so. Network metrics, application metrics, you name it.</p>
<p>Heck, a friend of mine wrote an <a href="https://git.captnemo.in/nemo/prometheus-act-exporter">exporter for an ISP provider</a>, you can actually go ahead and check the <a href="https://grafana.bb8.fun/d/_u2-GHSik/main-dashboard?refresh=5s&orgId=1">metrics here</a></p>
<p><a href="https://grafana.com/">grafana</a> can be used to visualize the data scraped by prometheus (which can be added as a data source to grafana); note that prometheus works on a pull model.</p>
<p><a href="https://www.influxdata.com/">InfluxDB</a> is another time-series database which works in a similar manner; it also has an enterprise plan for HA, which prometheus currently achieves using <a href="https://github.com/improbable-eng/thanos">thanos</a>.</p>
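<p>To make the three kinds of metrics listed at the top of this section concrete, here is a minimal stdlib-only Python sketch of a request counter, a cumulative duration sum and a bucketed histogram. This is a toy, not the prometheus client library, and the bucket bounds are made up:</p>

```python
import bisect
from collections import Counter

# A stdlib-only sketch (not a real metrics client) of the three shapes
# listed above: a counter, a cumulative sum, and a bucketed histogram.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # upper bounds, in seconds

requests_total = Counter()   # status code -> number of requests
request_seconds_sum = 0.0    # cumulative duration of all requests
histogram = Counter()        # bucket upper bound -> number of requests

def observe(status, duration):
    """Record one request, the way a client library's observe() would."""
    global request_seconds_sum
    requests_total[status] += 1
    request_seconds_sum += duration
    # Find the first bucket whose upper bound fits this duration;
    # anything slower falls into the implicit +Inf bucket.
    i = bisect.bisect_left(BUCKETS, duration)
    try:
        bound = BUCKETS[i]
    except IndexError:
        bound = float("inf")
    histogram[bound] += 1

for status, duration in [("200", 0.03), ("200", 0.2), ("500", 1.7)]:
    observe(status, duration)

print(requests_total["200"])            # 2
print(round(request_seconds_sum, 2))    # 1.93
```

Because each sample only increments a handful of numbers, this stays cheap no matter how much traffic flows through, which is exactly why metrics can be kept for so long.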
<h3 id="logs">Logs</h3>
<p>These are single events that happened in the system, e.g. a single request. A log line will often contain some of the information that is in the metrics (like request duration), but it will also contain more context, like the IP of the requester and the precise time of the request. The problem with logs is that they are much larger than metrics, so you cannot store them for as long, and due to their size you often need to reduce the number of logs you send.</p>
<p>So, logs answer the question “what happened in my system?”.</p>
<p>You can either self-host your logging stack or go with a SaaS solution.</p>
<p>Logs are a critical part of the infra; without them, developers are just shooting in the dark when trying to debug their applications, which makes logging mission critical. If you have a small team, I would personally suggest going with a SaaS solution: as your volumes grow, you would otherwise have to invest a lot of time fixing the logging stack when it breaks. The most popular self-hosted approach is fluentd pushing your logs to Elasticsearch with kibana as the frontend, together known as <a href="https://docs.fluentd.org/v0.12/articles/docker-logging-efk-compose">EFK</a>, which is actually a pretty decent way to go about it.</p>
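<p>Whichever backend you pick, logs become far easier to search and index when each line is structured. Here is a small stdlib-only Python sketch that emits JSON log lines; the field names are my own choice, not any standard schema:</p>

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, so a shipper like fluentd
    can forward it to Elasticsearch without any extra parsing."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # context attached via the extra= kwarg on the log call
            "request_id": getattr(record, "request_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Each call emits one machine-parseable line instead of free-form text.
log.info("request served", extra={"request_id": "abc-123"})
```

The `request_id` field here is illustrative, but carrying some correlation ID on every line is what lets you pull out all the logs for a single request later.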
<p>But managing your ES clusters can get nasty if you are hosting them on a public cloud. CPU spikes during log ingestion are pretty common from what I have noticed. Again, it depends on whether you would rather trade developer time fixing it or move to a SaaS product.</p>
<p>If you are operating at a very large scale, it obviously makes more sense to host the logging infra inside your own infrastructure, as the costs of a SaaS product would be enormous.</p>
<p>Facebook recently open-sourced their distributed logging platform, <a href="https://logdevice.io/">LogDevice</a>. One of their engineers mentioned in an <a href="https://news.ycombinator.com/item?id=17976930">HN comment</a> that</p>
<blockquote>
<p>LogDevice ingests over 1TB/s of uncompressed data at Facebook. The maximum limit as defined by default in the code for the number of storage nodes in a cluster is 512.</p>
</blockquote>
<p>I guess you get the point I am trying to put across here.</p>
<h3 id="tracing">Tracing</h3>
<p>This one is for monitoring how a single request behaves across the system: you get information on how long the request spent in the LB, the backend, the DB, external services, and so on. As you can see, this overlaps with both metrics and logs, but not exactly. The main difference is that while metrics and logs can have different formats in each service, tracing needs to be uniform, because your system needs to be “request ID aware”. Also, traces can be very large, which is why in general you do not trace all requests but only a sample of them, to reduce internal traffic.</p>
<p>So, tracing answers the question “how are my system’s components interacting with each other?”.</p>
<p><a href="http://opentracing.io/">Opentracing</a> is just a standard; it needs implementations like <a href="https://www.jaegertracing.io/docs/1.6/">Jaeger</a> to do the actual tracing. Tracing tools are about following a request as it moves between different systems.</p>
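<p>The “request ID aware” part can be sketched without any tracing library: generate one trace ID at the edge, propagate it to every span, and only sample a fraction of requests. A toy Python sketch (this is not Jaeger’s or Opentracing’s actual API, and the 1-in-10 sampling rule is deliberately simplistic):</p>

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # collected spans; a real tracer would export these to a backend

@contextmanager
def span(trace_id, name):
    """Time how long one component spent handling this request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        if trace_id is not None:  # unsampled requests carry no trace ID
            spans.append({
                "trace_id": trace_id,
                "name": name,
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

def handle_request(i):
    # Head-based sampling: decide once, at the edge, for the whole trace.
    # (A toy deterministic 1-in-10 rule; real tracers usually sample randomly.)
    trace_id = uuid.uuid4().hex if i % 10 == 0 else None
    with span(trace_id, "lb"):
        with span(trace_id, "backend"):
            with span(trace_id, "db"):
                pass  # pretend to query the database

for i in range(100):
    handle_request(i)

print(len(spans))  # 30: 10 sampled requests, three spans each
```

Because every span of a sampled request carries the same `trace_id`, a backend can reassemble the lb/backend/db timings into one end-to-end picture of that request.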
<p>I will be writing another post on metrics with prometheus shortly. Until next time!</p>
<h2 id="references">References</h2>
<ul>
<li>http://code.flickr.com/blog/2008/10/27/counting-timing/</li>
<li>https://codeascraft.com/2011/02/15/measure-anything-measure-everything/</li>
</ul>
Trip to Taiwan, 2017
2017-10-07T00:00:00+00:00
https://www.tasdikrahman.com/2017/10/07/Trip-to-Taiwan-2017-Pycon-Taiwan-2017
<p>Dum Chai had become one of my favorites, thanks to the after work-hour chill time with colleagues.
That day being no different, we were in our usual routine.</p>
<p>But I was in for a surprise: I was finally getting my visa that day, after much hassle. All the mindless haggling and tireless paperwork which I had to run through at the last moment. Oh man, that was something.</p>
<center><img src="/content/images/2017/09/taiwan_trip_1.jpg" /></center>
<h2 id="exactly-a-week-later">Exactly a week later</h2>
<p>It was the first week of June and I was already in the last month of my internship at Cisco, Bangalore. For a change, my mom that day was not telling me to pack my bags (which I hadn’t, till the last night before travel).</p>
<center><img src="/content/images/2017/09/taiwan_trip_clothes.jpg" /></center>
<p>My flight was quite late at night, so I had time to devour some good dinner and chill with some good music, which I did.</p>
<center><img src="/content/images/2017/09/taiwan_trip_2.jpg" /></center>
<p>Boredom was nothing to be worried about, as I had Joyce by my side the whole time. She works for a firm based out of China and travels frequently to Indian cities. We had discussions ranging from the <a href="https://en.wikipedia.org/wiki/Great_Firewall">Great Firewall</a> to the various provinces of China, and the cultural differences and similarities we had. It was pretty interesting, and we exchanged emails; she promised to be my guide when I next visit China.</p>
<center><img src="/content/images/2017/09/taiwan_trip_3.jpg" /></center>
<p>I had a transit via Bangkok, which I reached quite early that morning. The layover passed quickly as I changed terminals to catch my connecting flight to Taipei.</p>
<p>It was just around sunrise, and trust me, even though I was sleep deprived, the sunrise was quite something. Dunno if I did it justice while capturing it.</p>
<center><img src="/content/images/2017/09/taiwan_trip_4.jpg" /></center>
<p>I slept almost all the way to Taipei, not that I wanted to, but it was almost forced by my body. The effect of not sleeping at all the previous night, I guess.</p>
<center><img src="/content/images/2017/09/taiwan_trip_5.jpg" /></center>
<p>Arrival at Taipei airport was in the afternoon, and hunger set in.</p>
<center><img src="/content/images/2017/09/taiwan_trip_7.jpg" /></center>
<p>Luckily there was a Subway near the exit terminal, and it made for quite the lunch. Now came the time to actually get to my hotel room.</p>
<p>I should remind you that very few people in Taiwan actually speak English, and if you are not comfortable asking strangers questions without a guide, you are pretty much doomed.</p>
<center><img src="/content/images/2017/09/taiwan_trip_6.jpg" /></center>
<p>I was staying in the northern part of Taiwan, and it’s extremely pretty. You get a very good mix of traditional restaurants, new-age restaurants and all the swanky stuff. I also found quite a lot of national banks littered around the place I was staying.</p>
<center><img src="/content/images/2017/09/taiwan_trip_8.jpg" /></center>
<p>If you are starting out from the airport towards Nangang, you have a few choices on how to travel. People suggested I take a cab, but what the heck, right? I took the metro, where I had to change at two stations with absolutely no English on the direction signs. I dunno how I did not get lost.</p>
<p>And boy, was my room clean!</p>
<center><img src="/content/images/2017/09/taiwan_trip_9.jpg" /></center>
<p>The fluffy bed, the cosy cushions, the soft carpet: it was amazing. Even the bathroom had a nice bathtub with some good marble. Trust me, this was the most sophisticated commode I have ever come across; I literally had to use Google Translate for it.</p>
<center><img src="/content/images/2017/09/taiwan_trip_10.jpg" /></center>
<p>I had this itch to roam around a little, still a sleepyhead, but I couldn’t help it. This city is so beautiful! I had the most amazing noodles at a local Taiwanese resto. It’s very different in the sense that the noodles are much thicker than the normal noodles people have back in India. And you get a pretty generous helping too.</p>
<center><img src="/content/images/2017/09/taiwan_trip_12.jpg" /></center>
<p>On the way back to the hotel, I stopped by a local store. It was quite intriguing that you could find almost everything there. Some locals were having their regular sip of coffee and chilling around; I found a guy studying in the corner too, possibly from a university.</p>
<center><img src="/content/images/2017/09/taiwan_trip_13.jpg" /></center>
<p>Came back, crashed on my bed, played some good music and before I knew it, I was fast asleep.</p>
<p>The morning after, I could really feel the pain in my shoulders from carrying two bags across the city, and trust me, it’s not a pleasant feeling.</p>
<p>Roamed a little bit more before leaving for the venue where the conference was gonna happen.</p>
<center><img src="/content/images/2017/09/taiwan_trip_11.jpg" /></center>
<p>The venue was a research institution surrounded by woods. The campus was huge and filled with lush greenery. It was a pretty place.</p>
<p>Here’s a small vlog from my first day back there.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CuSWjAaysiM" frameborder="0" allowfullscreen=""></iframe>
<h2 id="pycon-taiwan">PyCon Taiwan</h2>
<p>The next morning was to be the first day of PyCon Taiwan, where I was slotted to give a talk later on the third day of the conference. I was excited as well as scared as this was gonna be my first international conference. Add that it was gonna be my first PyCon Talk.</p>
<p>Compared with the last 4 editions of PyCon India that I have attended, this was quite different in that it had a lot more diversity in the demographics people were coming from. We had people coming from China, South Korea, Japan, Australia and the rest of Asia Pacific. It was amazing to see how PyCon brought all of us together.</p>
<center><img src="/content/images/2017/09/taiwan_trip_14.jpg" /></center>
<p>After the talks ended, we all headed to Marriott, where we were to have the Speaker Night. I swear, I didn’t have such good sea food anywhere else. Sorry Goa, but that’s the truth.</p>
<center><img src="/content/images/2017/09/taiwan_trip_15.jpg" /></center>
<p>Had a lot of good conversations about open source, work culture back in Taiwan and Asia at large and how it impacted one’s health, work life balance and what not. We all made promises of staying in touch. And yes, some of us still are.</p>
<center><img src="/content/images/2017/09/taiwan_trip_16.jpg" /></center>
<p>I crashed over my bed and that was the end of the tiresome day for me.</p>
<p>Morning was a breeze; I had some great sleep after dozing off while talking to folks back home and some friends. Breakfast was equally as good as yesterday’s dinner.</p>
<center><img src="/content/images/2017/09/taiwan_trip_17.jpg" /></center>
<p>Full after such a good breakfast, I left for the 2nd day of the conference. All the talks I attended were top notch, and the speakers had really put in a lot of effort to prepare them.</p>
<p>In the alleys outside the talk halls, I struck up some great conversations.</p>
<p>After the talks for the day ended, and some great dinner, we were all in for a treat: a great performance by my good friend Adrian. Trust me, he has his way with music. Crazy show!</p>
<center><img src="/content/images/2017/09/taiwan_trip_18.jpg" /></center>
<p>After the show ended, we decided to visit Taipei 101, formerly the world’s tallest building.</p>
<p>We got a cab and off we went. Taipei was more than an acquaintance now. I would remember how cars actually stopped at the zebra crossing when the signal was green for pedestrians; the rules were actually being followed.</p>
<p>The streets were bustling like never before; we could see a huge influx of local residents as well as tourists. Street performers, speed painters and what not. The one who drew the most attention was a blind man playing the violin, with his dog sitting watchfully beside him.</p>
<p>Xinyi District is one of the well-off parts of the city. Lots of big-name shops and brands; you name it, you have it.</p>
<center><img src="/content/images/2017/09/taiwan_trip_19.jpg" /></center>
<p>Most of the lower levels of Taipei 101 are filled with big-name shops and eateries, and some of the levels above with offices, a famous one being Google’s. Too bad I couldn’t go up and see the balancing weights made to counter earthquakes. A real piece of engineering.</p>
<center><img src="/content/images/2017/09/taiwan_trip_20.jpg" /></center>
<p>On the way back, we passed through the City Square, where there seemed to be some kind of festival going on, or maybe a carnival. Well, I really don’t know what to call it. There were skimpily dressed girls dancing on stages which had poles, men clicking photos with them, and in one case a dude literally joining them on stage. Not posting the pictures for those reasons.</p>
<p>And to my surprise, we met Andrew Godwin and Russell Keith-Magee, who shared the same surprised look on seeing the carnival. And off we walked to the nearest metro station which would take us back to our hotel. Both Andrew and Russell were kind enough to indulge in a great conversation during this time. Russell talked about funding open source projects to sustain oneself while continuing to contribute to them.</p>
<p>Andrew and I had a long chat about his motivations towards Django channels and what excites him about open source. I was amazed by the humbleness of both.</p>
<center><img src="/content/images/2017/09/taiwan_trip_21.jpg" /></center>
<p>We settled for the night at our respective rooms and I slept like a log. Needed some good sleep as I had my talk slotted for tomorrow. I was nervous, scared and excited all at the same time.</p>
<p>The day finally came, and I think I did okayish giving my talk. I will not say it didn’t have any rough edges, but I guess I have a lot of room for improvement.</p>
<center><img src="/content/images/2017/09/taiwan_trip_32.jpg" /></center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/_Yif7EEOCy8" frameborder="0" allowfullscreen=""></iframe>
<p>The day ended with me leaving the venue after all the talks, the next stop being the famous night markets of Taiwan, which I had heard and read a lot about. I ended up at the Raohe Street Night Market, one of the oldest night markets, in Songshan District.</p>
<p>Took the metro to Songshan Station, and the place was as lively as it had been described.</p>
<p>You would find all kinds of cuisine there. It was a non-vegetarian’s delight, to say the least.</p>
<p>Squids, octopus, dried fish and fruits, numerous kinds of noodles and what not.</p>
<center><img src="/content/images/2017/09/taiwan_trip_23.jpg" /></center>
<center><img src="/content/images/2017/09/taiwan_trip_24.jpg" /></center>
<center><img src="/content/images/2017/09/taiwan_trip_25.jpg" /></center>
<p>Next up was Lungshan Temple of Manka in Wanhua District. The temple was built in Taipei in 1738 by settlers from Fujian during Qing rule in honor of Guanyin. This temple has served as a municipal, guild and self-defence centre, as well as a house of worship.</p>
<p>You could feel calmness setting into your mind as you listened to the slow chants of the people reciting their prayers to the deity.</p>
<p>I sat there at a corner for some time, just to relax a while, to forget the bustling noise of the city and all the fatigue I had.</p>
<center><img src="/content/images/2017/09/taiwan_trip_22.jpg" /></center>
<center><img src="/content/images/2017/09/taiwan_trip_26.jpg" /></center>
<center><img src="/content/images/2017/09/taiwan_trip_27.jpg" /></center>
<p>There was another night market just a little up from the temple; this one was filled with massage parlours and some really old restaurants which were literally serving snake soup and the like. Nope, I didn’t have any of that. Was not feeling up to it.</p>
<center><img src="/content/images/2017/09/taiwan_trip_28.jpg" /></center>
<center><img src="/content/images/2017/09/taiwan_trip_29.jpg" /></center>
<p>On the way back, I passed through the Longshan metro station, an area known for its fortune tellers. You would find one shop at every other corner of the street. It was crazy; I was literally being hounded by people asking me to come in. This is serious business for some people out there. I got some souvenirs on the way back to the airport from a kind lady who was very inquisitive about what I was doing in Taipei. When she learnt that I had come to give a talk at a tech conference, she was very surprised that I was there all alone while being so young. Not sure how one reacts to that. Anyway, we had a good small chat about both our cultures, and it was very interesting to know her perspective.</p>
<center><img src="/content/images/2017/09/taiwan_trip_30.jpg" /></center>
<p>My flight to Bangkok was set for departure early in the morning; I had nothing but a lot of time to kill at the airport, as I reached at around 10pm.</p>
<p>While charging my phone, I met a girl from Macau pursuing her bachelor’s in Computer Science in Taiwan. It was interesting to learn that back home they have a real scarcity of good universities, and most people have to leave to get an education. She had just come back from a short vacation at home. Just as we were discussing the night markets of Taiwan, a guy joined the conversation and added his view on how this is what actually built the culture of Taiwan, how it put food and money on the table of the average Taiwanese citizen who had opened a shop at one of the night markets.</p>
<p>Fast forward a few hours, and I had the pleasure of talking to his wife, who was from the Philippines, and his little baby boy. He talked about his life in Taiwan, how he had to run away from South Africa after losing his job as a teacher back in the early 1990s. He talked at length about how he left everything behind to come here, how he missed his family, and how he could not meet his dying father, who was suffering from cancer back in South Africa.</p>
<p>Those were the sad parts; he got really emotional talking about the immigration problems his wife was facing in Taiwan and how bad the living conditions were back in the Philippines.</p>
<p>As a matter of fact, his flight to the Philippines was slotted for the next morning, as he was going to meet his wife and son. I was very happy that he was able to reconcile with his difficult past and look forward to his family’s future.</p>
<center><img src="/content/images/2017/09/taiwan_trip_31.jpg" /></center>
<p>The night ended and we bade goodbye as I left to catch my flight to Bangkok. While I was sitting and latching my seat belt, I thought back on the last few days in Taiwan. It was a great journey indeed, and I am very grateful to the people of Taiwan for being so accepting and kind.</p>
<p>Until next time Taiwan :)</p>
GSoC 2017 with oVirt - Ending Notes
2017-09-03T00:00:00+00:00
https://www.tasdikrahman.com/2017/09/03/Ending-notes-GSoC-2017-oVirt
<h2 id="start-of-coding-period">Start of Coding Period</h2>
<blockquote>
<p>This was also the time when I started my first contributions to oVirt. I surely remember getting just started with Ansible which was to be used in the project with oVirt.</p>
</blockquote>
<p>Lukas was very patient in helping me understand the parts I would be using in the project, and supported me at every step.</p>
<p>Right from the application period in March 2017, every week has been a new learning experience for me.</p>
<p>I had a lot of side projects, but never had I worked with an organisation this big, with this many developers behind it. It was an experience in its own right.</p>
<p>During phase 2, I was travelling due to relocation to a different state, from Bangalore to Dehradun. Quite a pretty place, if you ask me.</p>
<p>I would sometimes work out of the local coffee shop, and trust me, internet can be a problem once you cross your FUP.</p>
<blockquote>
<p>There were times when I got badly stuck on a bug or a problem which I couldn’t get my head around, which also taught me that if you stick with problems long enough, they can be solved.</p>
</blockquote>
<p>This was also the time when I got into a habit of tracking everything related to side projects or GSoC.</p>
<center><img src="/content/images/2017/09/gsoc_final_post_1.jpeg" /></center>
<p>The above is back from my 1st evaluation.</p>
<center><img src="/content/images/2017/09/gsoc_final_post_2.jpeg" /></center>
<p>Continuing the tradition for the last month</p>
<p>Not the cleanest, but works for me.</p>
<p>I was lucky to be given a chance to speak at PyCon Taiwan 2017.</p>
<center><img src="/content/images/2017/09/gsoc_final_post_3.jpeg" /></center>
<p>(From left: Subhankar, the founders of Function Space, and me on the extreme right)</p>
<p>From attending my first conference in 2014 to giving a talk at one of the PyCons, it sure has been a journey.</p>
<center><img src="/content/images/2017/09/gsoc_final_post_4.jpeg" /></center>
<h2 id="ending-notes">Ending notes</h2>
<p>As the coding period ends, I would like to thank my mentor Lukáš Svatý for being there for my silly questions and giving me direction; without him nothing would have been possible. Thanks also to the community members and people in the oVirt project: without their guidance, I would be lost.</p>
<p>Here is a consolidated link to all my contributions and blog posts so far.</p>
<script src="https://gist.github.com/9931d1ad3b59ff350d4b1b08340be3a0.js"> </script>
<h2 id="takeaways-for-me">Takeaways for me</h2>
<ul>
<li>Experience of working remotely with a group of extremely talented people</li>
<li>Code formatting. Readability is important</li>
<li>Writing documentation which others can understand.</li>
<li>Reusing preexisting code for my own needs (looking at you engine-remote-db playbook)</li>
<li>Identifying refactoring needs. Did some extensive rewrites to the remote DWH PR</li>
<li>Writing code with performance in mind.</li>
</ul>
<h2 id="future-work">Future Work</h2>
<p>Continuing the work for Metrics Playbook deployment</p>
<p>As I write the last post for phase 3, I look back at the things contributing to oVirt has taught me, things which will stay with me even after this. And I am glad that I submitted that proposal that day :)</p>
<p>Until next time!</p>
Second Phase - GSoC, work on 3 VM setup of oVirt installation
2017-07-30T00:00:00+00:00
https://www.tasdikrahman.com/2017/07/30/Second-Phase-GSoC-2017-oVirt
<p>It has been a week since the Phase 2 results came out, and we proceed to the last and final phase of GSoC. I couldn’t blog regularly in the last phase for many reasons, which I want to change this phase. But anyhow, this was collectively the output of my work in Phase 2.</p>
<h3 id="setup">Setup</h3>
<p>This approach will follow a 3 box VM setup.</p>
<p>For clarity’s sake, the VMs can be assumed for now to be:</p>
<ul>
<li>VM A: <code class="language-plaintext highlighter-rouge">engine.ovirt.org</code>: which stores the <code class="language-plaintext highlighter-rouge">engine</code> db and the main engine installation</li>
<li>VM B: <code class="language-plaintext highlighter-rouge">dwhservice.ovirt.org</code>: which will host the dwh service</li>
<li>VM C: <code class="language-plaintext highlighter-rouge">dwhdb.ovirt.org</code>: which will store the <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> db</li>
</ul>
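<p>For reference, the <code class="language-plaintext highlighter-rouge">inventory</code> file passed via <code class="language-plaintext highlighter-rouge">-i inventory</code> below could look something like the following. This is only a sketch: the playbook shown next targets just the <code class="language-plaintext highlighter-rouge">engine</code> group, and the other group names are purely illustrative, not the exact file used:</p>

```ini
# Maps the three VMs above to Ansible groups (group names for VM B
# and VM C are illustrative; only [engine] is used by the playbook).
[engine]
engine.ovirt.org

[dwh_service]
dwhservice.ovirt.org

[dwh_db]
dwhdb.ovirt.org
```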
<h5 id="on-vm-a">On VM A</h5>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">engine</span>
<span class="na">vars</span><span class="pi">:</span>
<span class="na">ovirt_engine_type</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt-engine'</span>
<span class="na">ovirt_engine_version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">4.1'</span>
<span class="na">ovirt_engine_organization</span><span class="pi">:</span> <span class="s1">'</span><span class="s">dwhmanualenginetest.ovirt.org'</span>
<span class="na">ovirt_engine_admin_password</span><span class="pi">:</span> <span class="s1">'</span><span class="s">secret'</span>
<span class="na">ovirt_rpm_repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm'</span>
<span class="na">ovirt_engine_organization</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">remotedwh.ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_configure</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-common</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-install-packages</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-setup</span>
</code></pre></div></div>
<p>Running this would be</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">--skip-tags</span> skip_yum_install_ovirt_engine_dwh,skip_yum_install_ovirt_engine_dwh_setup
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">--skip-tags</span> skip_yum_install_ovirt_engine_dwh,skip_yum_install_ovirt_engine_dwh_setup
<span class="o">[</span>WARNING]: While constructing a mapping from /Users/tasdikrahman/development/gsoc/ovirt-ansible/site.yml, line 4, column 5,
found a duplicate dict key <span class="o">(</span>ovirt_engine_organization<span class="o">)</span><span class="nb">.</span> Using last defined value only.
PLAY <span class="o">[</span>engine] <span class="k">*****************************************************************************************************************</span>
TASK <span class="o">[</span>Gathering Facts] <span class="k">********************************************************************************************************</span>
ok: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : complain <span class="k">if </span>no ovirt <span class="nb">source </span>is specified] <span class="k">****************************************************************</span>
skipping: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : <span class="nb">install </span>libselinux-python <span class="k">for </span>ansible] <span class="k">*******************************************************************</span>
ok: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : creating directory repo-backup <span class="k">in </span>yum.repos.d] <span class="k">***********************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : create repository backup] <span class="k">********************************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : copy repository files] <span class="k">***********************************************************************************</span>
TASK <span class="o">[</span>ovirt-common : <span class="nb">install </span>rpm repository package] <span class="k">**************************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-common : create repository files] <span class="k">*********************************************************************************</span>
TASK <span class="o">[</span>ovirt-engine-install-packages : yum <span class="nb">install </span>engine] <span class="k">*********************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : check <span class="k">if </span>ovirt-engine running <span class="o">(</span>health page<span class="o">)]</span> <span class="k">*******************************************************</span>
FAILED - RETRYING: check <span class="k">if </span>ovirt-engine running <span class="o">(</span>health page<span class="o">)</span> <span class="o">(</span>2 retries left<span class="o">)</span><span class="nb">.</span>
FAILED - RETRYING: check <span class="k">if </span>ovirt-engine running <span class="o">(</span>health page<span class="o">)</span> <span class="o">(</span>1 retries left<span class="o">)</span><span class="nb">.</span>
fatal: <span class="o">[</span>engine.ovirt.org]: FAILED! <span class="o">=></span> <span class="o">{</span><span class="s2">"attempts"</span>: 2, <span class="s2">"changed"</span>: <span class="nb">false</span>, <span class="s2">"content"</span>: <span class="s2">""</span>, <span class="s2">"failed"</span>: <span class="nb">true</span>, <span class="s2">"msg"</span>: <span class="s2">"Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>"</span>, <span class="s2">"redirected"</span>: <span class="nb">false</span>, <span class="s2">"status"</span>: <span class="nt">-1</span>, <span class="s2">"url"</span>: <span class="s2">"http://engine.ovirt.org/ovirt-engine/services/health"</span><span class="o">}</span>
...ignoring
TASK <span class="o">[</span>ovirt-engine-setup : copy default answerfile] <span class="k">***************************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : copy custom answer file] <span class="k">***************************************************************************</span>
skipping: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : update setup packages] <span class="k">*****************************************************************************</span>
skipping: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : run engine-setup with answerfile] <span class="k">******************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : check state of database] <span class="k">***************************************************************************</span>
ok: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : check state of engine] <span class="k">*****************************************************************************</span>
ok: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : restart of ovirt-engine service] <span class="k">*******************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : Open port 5432 <span class="k">for </span>opening connection to remote DWH <span class="k">if </span>not being setup on this host] <span class="k">***************</span>
changed: <span class="o">[</span>engine.ovirt.org] <span class="o">=></span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>5432<span class="o">)</span>
TASK <span class="o">[</span>ovirt-engine-setup : check health status of page] <span class="k">***********************************************************************</span>
FAILED - RETRYING: check health status of page <span class="o">(</span>12 retries left<span class="o">)</span><span class="nb">.</span>
FAILED - RETRYING: check health status of page <span class="o">(</span>11 retries left<span class="o">)</span><span class="nb">.</span>
FAILED - RETRYING: check health status of page <span class="o">(</span>10 retries left<span class="o">)</span><span class="nb">.</span>
FAILED - RETRYING: check health status of page <span class="o">(</span>9 retries left<span class="o">)</span><span class="nb">.</span>
ok: <span class="o">[</span>engine.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-setup : clean tmp files] <span class="k">***********************************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
RUNNING HANDLER <span class="o">[</span>ovirt-engine-setup : Reload firewalld] <span class="k">***********************************************************************</span>
changed: <span class="o">[</span>engine.ovirt.org]
PLAY RECAP <span class="k">********************************************************************************************************************</span>
engine.ovirt.org : <span class="nv">ok</span><span class="o">=</span>16 <span class="nv">changed</span><span class="o">=</span>10 <span class="nv">unreachable</span><span class="o">=</span>0 <span class="nv">failed</span><span class="o">=</span>0
</code></pre></div></div>
<h5 id="on-vm-c">On VM C</h5>
<p>On <code class="language-plaintext highlighter-rouge">dwhdb.ovirt.org</code></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh root@dwhdb.ovirt.org
<span class="o">[</span>root@dwhdb ~]# <span class="nb">hostname
</span>dwhdb.ovirt.org
<span class="o">[</span>root@dwhdb ~]# yum <span class="nb">install </span>postgresql-server <span class="nt">-y</span> <span class="o">></span> /dev/null
<span class="o">[</span>root@dwhdb ~]# su <span class="nt">-l</span> postgres <span class="nt">-c</span> <span class="s2">"/usr/bin/initdb --locale=en_US.UTF8 --auth='ident' --pgdata=/var/lib/pgsql/data/"</span>
The files belonging to this database system will be owned by user <span class="s2">"postgres"</span><span class="nb">.</span>
This user must also own the server process.
The database cluster will be initialized with locale <span class="s2">"en_US.UTF8"</span><span class="nb">.</span>
The default database encoding has accordingly been <span class="nb">set </span>to <span class="s2">"UTF8"</span><span class="nb">.</span>
The default text search configuration will be <span class="nb">set </span>to <span class="s2">"english"</span><span class="nb">.</span>
fixing permissions on existing directory /var/lib/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database <span class="k">in</span> /var/lib/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects<span class="s1">' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
Success. You can now start the database server using:
/usr/bin/postgres -D /var/lib/pgsql/data
or
/usr/bin/pg_ctl -D /var/lib/pgsql/data -l logfile start
[root@dwhdb ~]#
[root@dwhdb ~]# systemctl start postgresql.service
[root@dwhdb ~]# systemctl enable postgresql.service
Created symlink from /etc/systemd/system/multi-user.target.wants/postgresql.service to /usr/lib/systemd/system/postgresql.service.
[root@dwhdb ~]# su - postgres
-bash-4.2$ psql
psql (9.2.18)
Type "help" for help.
postgres=# create role ovirt_engine_history with login encrypted password '</span>password<span class="s1">';
CREATE ROLE
postgres=# create database ovirt_engine_history owner ovirt_engine_history template template0 encoding '</span>UTF8<span class="s1">' lc_collate '</span>en_US.UTF-8<span class="s1">' lc_ctype '</span>en_US.UTF-8<span class="s1">';
postgres=# \q
-bash-4.2$ exit
[root@dwhdb ~]# tail -3 /var/lib/pgsql/data/pg_hba.conf
host ovirt_engine_history ovirt_engine_history xxx.xx.3.0/24 md5 # engine
host ovirt_engine_history ovirt_engine_history xxx.xx.11.0/24 md5 # dwhservice
[root@dwhdb ~]#
[root@dwhdb ~]# sed -n '</span>59,66p<span class="s1">' /var/lib/pgsql/data/postgresql.conf
listen_addresses = '</span><span class="k">*</span><span class="s1">' # what IP address(es) to listen on;
# comma-separated list of addresses;
# defaults to '</span>localhost<span class="s1">'; use '</span><span class="k">*</span><span class="s1">' for all
# (change requires restart)
#port = 5432 # (change requires restart)
# Note: In RHEL/Fedora installations, you can'</span>t <span class="nb">set </span>the port number here<span class="p">;</span>
<span class="c"># adjust it in the service file instead.</span>
max_connections <span class="o">=</span> 150 <span class="c"># (change requires restart)</span>
<span class="o">[</span>root@dwhdb ~]# systemctl start firewalld
<span class="o">[</span>root@dwhdb ~]# firewall-cmd <span class="nt">--zone</span><span class="o">=</span>public <span class="nt">--add-port</span><span class="o">=</span>5432/tcp <span class="nt">--permanent</span>
success
<span class="o">[</span>root@dwhdb ~]# firewall-cmd <span class="nt">--reload</span>
success
<span class="o">[</span>root@dwhdb ~]# iptables <span class="nt">-S</span> | <span class="nb">grep </span>5432
<span class="nt">-A</span> IN_public_allow <span class="nt">-p</span> tcp <span class="nt">-m</span> tcp <span class="nt">--dport</span> 5432 <span class="nt">-m</span> conntrack <span class="nt">--ctstate</span> NEW <span class="nt">-j</span> ACCEPT
<span class="o">[</span>root@dwhdb ~]#
</code></pre></div></div>
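The `postgresql.conf` and `pg_hba.conf` edits above can also be sketched non-interactively. This is a rough shell sketch run against scratch copies (the subnet values are the ones used later in this walkthrough), so it is safe to try anywhere without touching a live `/var/lib/pgsql/data/`:

```shell
# Work on scratch copies instead of the live config so this is safe to test.
conf=$(mktemp)
hba=$(mktemp)
echo 'max_connections = 100' > "$conf"

# postgresql.conf: raise the connection limit and listen on all interfaces,
# matching the two manual edits shown above.
sed -i 's/max_connections = 100/max_connections = 150/' "$conf"
echo "listen_addresses = '*'" >> "$conf"

# pg_hba.conf: allow md5 auth from the engine and dwhservice subnets.
cat >> "$hba" <<'EOF'
host ovirt_engine_history ovirt_engine_history 139.162.45.0/24 md5 # engine
host ovirt_engine_history ovirt_engine_history 139.162.61.0/24 md5 # dwhservice
EOF

grep -c '^host' "$hba"   # prints 2: two remote-access rules added
```

On the real host the same edits go into `/var/lib/pgsql/data/postgresql.conf` and `/var/lib/pgsql/data/pg_hba.conf`, followed by a `systemctl restart postgresql.service`.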
<p>So now that we have both the <code class="language-plaintext highlighter-rouge">engine</code> and <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> databases in place, we need to set up the VM that will run the dwh service.</p>
<h4 id="dwh-service-vm">dwh service VM</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh root@dwhservice.ovirt.org
<span class="o">[</span>root@dwhservice ~]# <span class="nb">hostname
</span>dwhservice.ovirt.org
<span class="o">[</span>root@dwhservice ~]# yum <span class="nb">install</span> <span class="nt">-y</span> http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm <span class="o">></span> /dev/null
<span class="o">[</span>root@dwhservice ~]# yum <span class="nb">install</span> <span class="nt">-y</span> ovirt-engine-dwh-setup <span class="o">></span> /dev/null
http://ftp.jaist.ac.jp/pub/Linux/Fedora/epel/7/x86_64/repodata/fe30bd7c1f6f8d6f4007e9096a0c0fdd305c550f8c136ba86793577edc9dc571-updateinfo.xml.bz2: <span class="o">[</span>Errno 14] HTTP Error 404 - Not Found
Trying other mirror.
To address this issue please refer to the below knowledge base article
https://access.redhat.com/articles/1320623
If above article doesn<span class="s1">'t help to resolve this issue please create a bug on https://bugs.centos.org/
warning: /var/cache/yum/x86_64/7/ovirt-4.1-epel/packages/libtommath-0.42.0-5.el7.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 352c64e5: NOKEY
warning: /var/cache/yum/x86_64/7/ovirt-4.1/packages/ovirt-engine-lib-4.1.4.2-1.el7.centos.noarch.rpm: Header V4 RSA/SHA1 Signature, key ID fe590cb7: NOKEY
Importing GPG key 0xFE590CB7:
Userid : "oVirt <infra@ovirt.org>"
Fingerprint: 31a5 d783 7fad 7cb2 86cd 3469 ab8c 4f9d fe59 0cb7
Package : ovirt-release41-4.1.4-1.el7.centos.noarch (installed)
From : /etc/pki/rpm-gpg/RPM-GPG-ovirt-4.1
Importing GPG key 0x352C64E5:
Userid : "Fedora EPEL (7) <epel@fedoraproject.org>"
Fingerprint: 91e9 7d7c 4a5e 96f1 7f3e 888f 6a2f aea2 352c 64e5
From : https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-7
[root@dwhservice ~]# engine-setup
[ INFO ] Stage: Initializing
[ INFO ] Stage: Environment setup
Configuration files: ['</span>/etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf<span class="s1">']
Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20170802052209-bba7e4.log
Version: otopi-1.6.2 (otopi-1.6.2-1.el7.centos)
[ INFO ] Stage: Environment packages setup
[ INFO ] Stage: Programs detection
[ INFO ] Stage: Environment customization
--== PRODUCT OPTIONS ==--
Configure Data Warehouse on this host (Yes, No) [Yes]:
--== PACKAGES ==--
[ INFO ] Checking for product updates...
[ INFO ] No product updates found
--== NETWORK CONFIGURATION ==--
Host fully qualified DNS name of this server [dwhservice.ovirt.org]:
[WARNING] Failed to resolve dwhservice.ovirt.org using DNS, it can be resolved only locally
Setup can automatically configure the firewall on this system.
Note: automatic configuration of the firewall may overwrite current settings.
Do you want Setup to configure the firewall? (Yes, No) [Yes]:
The following firewall managers were detected on this system: firewalld
Firewall manager to configure (firewalld): firewalld
[ INFO ] firewalld will be configured as firewall manager.
Host fully qualified DNS name of the engine server []: engine.ovirt.org
Setup will need to do some actions on the remote engine server. Either automatically, using ssh as root to access it, or you will be prompted to manually perform each such action.
Please choose one of the following:
1 - Access remote engine server using ssh as root
2 - Perform each action manually, use files to copy content around
(1, 2) [1]:
ssh port on remote engine server [22]:
root password on remote engine server engine.ovirt.org:
--== DATABASE CONFIGURATION ==--
Where is the DWH database located? (Local, Remote) [Local]: Remote
ATTENTION
Manual action required.
Please create database for ovirt-engine use. Use the following commands as an example:
create role ovirt_engine_history with login encrypted password '</span><password><span class="s1">';
create database ovirt_engine_history owner ovirt_engine_history
template template0
encoding '</span>UTF8<span class="s1">' lc_collate '</span>en_US.UTF-8<span class="s1">'
lc_ctype '</span>en_US.UTF-8<span class="s1">';
Make sure that database can be accessed remotely.
DWH database host [localhost]: dwhdb.ovirt.org
DWH database port [5432]:
DWH database secured connection (Yes, No) [No]:
DWH database name [ovirt_engine_history]:
DWH database user [ovirt_engine_history]:
DWH database password:
Please provide the following credentials for the Engine database.
They should be found on the Engine server in '</span>/etc/ovirt-engine/engine.conf.d/10-setup-database.conf<span class="s1">'.
Engine database host []: engine.ovirt.org
Engine database port [5432]:
Engine database secured connection (Yes, No) [No]:
Engine database name [engine]:
Engine database user [engine]:
Engine database password:
--== OVIRT ENGINE CONFIGURATION ==--
--== STORAGE CONFIGURATION ==--
--== PKI CONFIGURATION ==--
--== APACHE CONFIGURATION ==--
--== SYSTEM CONFIGURATION ==--
--== MISC CONFIGURATION ==--
Please choose Data Warehouse sampling scale:
(1) Basic
(2) Full
(1, 2)[2]:
--== END OF CONFIGURATION ==--
[ INFO ] Stage: Setup validation
--== CONFIGURATION PREVIEW ==--
Firewall manager : firewalld
Update Firewall : True
Host FQDN : dwhservice.ovirt.org
Engine database secured connection : False
Engine database user name : engine
Engine database name : engine
Engine database host : engine.ovirt.org
Engine database port : 5432
Engine database host name validation : False
DWH installation : True
DWH database secured connection : False
DWH database host : dwhdb.ovirt.org
DWH database user name : ovirt_engine_history
DWH database name : ovirt_engine_history
DWH database port : 5432
DWH database host name validation : False
Please confirm installation settings (OK, Cancel) [OK]:
[ INFO ] Stage: Transaction setup
[ INFO ] Stopping dwh service
[ INFO ] Stage: Misc configuration
[ INFO ] Stage: Package installation
[ INFO ] Stage: Misc configuration
[ INFO ] Creating/refreshing DWH database schema
[ INFO ] Generating post install configuration file '</span>/etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf<span class="s1">'
[ INFO ] Stage: Transaction commit
[ INFO ] Stage: Closing up
--== SUMMARY ==--
[ INFO ] Starting dwh service
Please restart the engine by running the following on engine.ovirt.org :
# service ovirt-engine restart
This is required for the dashboard to work.
--== END OF SUMMARY ==--
[ INFO ] Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20170802052209-bba7e4.log
[ INFO ] Generating answer file '</span>/var/lib/ovirt-engine/setup/answers/20170802052510-setup.conf<span class="s1">'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ INFO ] Execution of setup completed successfully
[root@dwhservice ~]# service restart ovirt-engine-dwh
The service command supports only basic LSB actions (start, stop, restart, try-restart, reload, force-reload, status). For other actions, please try to use systemctl.
[root@dwhservice ~]#
[root@dwhservice ~]# service ovirt-engine-dwhd restart
Redirecting to /bin/systemctl restart ovirt-engine-dwhd.service
</span></code></pre></div></div>
<p>Now restart the service with <code class="language-plaintext highlighter-rouge">systemctl restart ovirt-engine</code> on the <code class="language-plaintext highlighter-rouge">testengine.ovirt.org</code> VM and go to <a href="https://testengine.ovirt.org/ovirt-engine/">https://testengine.ovirt.org/ovirt-engine/</a></p>
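One way to confirm the engine actually came back up after the restart is to poll the same health endpoint that the setup itself checks (the path is taken from the playbook output above; the host name is the hypothetical test VM from this walkthrough). A small retry helper keeps this readable:

```shell
# Generic retry helper: run a command up to N times with a pause between tries.
retry() {
  n=$1; shift
  for i in $(seq "$n"); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# Example usage (assumes the engine VM is reachable from here):
# retry 12 curl -sf http://testengine.ovirt.org/ovirt-engine/services/health
```

This mirrors what the `check health status of page` task does with its own retries.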
<h3 id="automatinng-it-using-ansible">Automating it using Ansible</h3>
<p>The architecture of the overall system is similar to the one above.</p>
<p>Set up the <code class="language-plaintext highlighter-rouge">testengine.ovirt.org</code> VM the way we did in the first step.</p>
<p>Next, we set up the PostgreSQL instance which will host our <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> database.</p>
<h5 id="on-vm-c-1">On VM C</h5>
<p>This VM hosts the <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> database, which will be used by the <code class="language-plaintext highlighter-rouge">dwhservice.ovirt.org</code> VM.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">dwhdb</span>
<span class="na">vars</span><span class="pi">:</span>
<span class="c1"># the below vars are explained in `install-postgresql/defaults/main.yml` and </span>
<span class="c1"># also the other configurable variables are placed there</span>
<span class="na">engine_vm_network_cidr</span><span class="pi">:</span> <span class="s1">'</span><span class="s">139.162.45.0/24'</span> <span class="c1"># Network where the Engine VM lies</span>
<span class="na">dwhservice_vm_network_cidr</span><span class="pi">:</span> <span class="s1">'</span><span class="s">139.162.61.0/24'</span>
<span class="na">ovirt_engine_dwh_db_password</span><span class="pi">:</span> <span class="s1">'</span><span class="s">password'</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-remote-dwh-setup/install-postgresql</span>
</code></pre></div></div>
<p>Running this gives:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory
PLAY <span class="o">[</span>dwhdb] <span class="k">******************************************************************************************************************</span>
TASK <span class="o">[</span>Gathering Facts] <span class="k">********************************************************************************************************</span>
ok: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : check PostgreSQL service] <span class="k">********************************************</span>
fatal: <span class="o">[</span>testdwhdb.ovirt.org]: FAILED! <span class="o">=></span> <span class="o">{</span><span class="s2">"changed"</span>: <span class="nb">false</span>, <span class="s2">"failed"</span>: <span class="nb">true</span>, <span class="s2">"msg"</span>: <span class="s2">"Could not find the requested service postgresql: host"</span><span class="o">}</span>
...ignoring
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : <span class="nb">install </span>postgresql] <span class="k">**************************************************</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : Check <span class="k">if </span>the db is initialized] <span class="k">**************************************</span>
ok: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : Initialize the postgresql db] <span class="k">****************************************</span>
<span class="o">[</span>WARNING]: Consider using <span class="s1">'become'</span>, <span class="s1">'become_method'</span>, and <span class="s1">'become_user'</span> rather than running su
changed: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : Start postgresql.service] <span class="k">********************************************</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : creating directory <span class="k">for </span>sql scripts <span class="k">in</span> /tmp/ansible-sql] <span class="k">**************</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org]
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : copy SQL scripts] <span class="k">****************************************************</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org] <span class="o">=></span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>ovirt-engine-dwh-db-user-create.sql<span class="o">)</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org] <span class="o">=></span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>ovirt-engine-dwh-db-create.sql<span class="o">)</span>
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : create engine DWH DB and user] <span class="k">***************************************</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org] <span class="o">=></span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>ovirt-engine-dwh-db-user-create.sql<span class="o">)</span>
changed: <span class="o">[</span>testdwhdb.ovirt.org] <span class="o">=></span> <span class="o">(</span><span class="nv">item</span><span class="o">=</span>ovirt-engine-dwh-db-create.sql<span class="o">)</span>
TASK <span class="o">[</span>ovirt-engine-remote-dwh-setup/install-postgresql : Adding engine and dwhservice vm IP<span class="s1">'s in the dwhdb conf to be acessed remotely] ***
changed: [testdwhdb.ovirt.org] => (item=host ovirt_engine_history ovirt_engine_history 139.162.45.0/24 md5 # engine)
changed: [testdwhdb.ovirt.org] => (item=host ovirt_engine_history ovirt_engine_history 139.162.61.0/24 md5 # dwhservice)
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : Edit the config file /var/lib/pgsql/data/postgresql.conf] ************
changed: [testdwhdb.ovirt.org] => (item=sed -i -- '</span>s/max_connections <span class="o">=</span> 100/max_connections <span class="o">=</span> 150/g<span class="s1">' /var/lib/pgsql/data/postgresql.conf)
changed: [testdwhdb.ovirt.org] => (item=sed -i "60ilisten_addresses = '</span><span class="k">*</span><span class="s1">'" /var/lib/pgsql/data/postgresql.conf)
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : Enable firewalld and open up port 5432] ******************************
changed: [testdwhdb.ovirt.org] => (item=systemctl start firewalld)
changed: [testdwhdb.ovirt.org] => (item=firewall-cmd --zone=public --add-port=5432/tcp --permanent)
changed: [testdwhdb.ovirt.org] => (item=firewall-cmd --reload)
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : Restart postgresql for the loading the newer configs] ****************
changed: [testdwhdb.ovirt.org]
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : check PostgreSQL service] ********************************************
changed: [testdwhdb.ovirt.org]
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : clean tmp files] *****************************************************
changed: [testdwhdb.ovirt.org]
TASK [ovirt-engine-remote-dwh-setup/install-postgresql : Enable postgresql.service to start at boot time] *********************
ok: [testdwhdb.ovirt.org]
PLAY RECAP ********************************************************************************************************************
testdwhdb.ovirt.org : ok=16 changed=12 unreachable=0 failed=0
</span></code></pre></div></div>
<h5 id="on-vm-b">On VM B</h5>
<p>Now the VM which hosts the <code class="language-plaintext highlighter-rouge">dwhservice</code> needs to be set up.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">dwhservice</span>
<span class="na">vars</span><span class="pi">:</span>
<span class="na">ovirt_engine_type</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt-engine'</span>
<span class="na">ovirt_engine_version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">4.1'</span>
<span class="na">ovirt_rpm_repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm'</span>
<span class="na">ovirt_engine_host_root_passwd</span><span class="pi">:</span> <span class="s1">'</span><span class="s">pycon2017'</span> <span class="c1"># the root password of the host where ovirt-engine is installed</span>
<span class="na">ovirt_engine_firewall_manager</span><span class="pi">:</span> <span class="s1">'</span><span class="s">firewalld'</span>
<span class="na">ovirt_engine_host_fqdn</span><span class="pi">:</span> <span class="s1">'</span><span class="s">testengine.ovirt.org'</span> <span class="c1"># FQDN of the ovirt-engine installation host, should be resolvable from the new DWH host</span>
<span class="na">ovirt_engine_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">testengine.ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">testdwhdb.ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_password</span><span class="pi">:</span> <span class="s1">'</span><span class="s">password'</span>
<span class="na">ovirt_engine_history_db_on_dwhservice_host</span><span class="pi">:</span> <span class="s">False</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-common</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-install-packages</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-remote-dwh-setup</span>
</code></pre></div></div>
<p>Running this gives:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">--skip-tags</span> skip_yum_install_ovirt_engine,skip_yum_install_ovirt_engine_dwh <span class="nt">-vvv</span>
</code></pre></div></div>
<p>Now restart <code class="language-plaintext highlighter-rouge">ovirt-engine</code> on VM A with <code class="language-plaintext highlighter-rouge">systemctl restart ovirt-engine</code>.</p>
<center><img src="/content/images/2017/07/successfull_deployment_3_vm.png" /></center>
<h3 id="issues-faced">Issues Faced</h3>
<p>The very first issue I faced was that there was no proper documentation for this in the oVirt downstream docs (as of now). That forced me to research things on my own, which was good in a way, and I raised a report about it on Bugzilla: <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1475706">https://bugzilla.redhat.com/show_bug.cgi?id=1475706</a>.
There was another snag like the previous one, where an OTOPI variable was not being logged in the answer files.</p>
<h3 id="reference-commits">Reference Commits</h3>
<ul>
<li><a href="https://github.com/rhevm-qe-automation/ovirt-ansible/pull/137/commits/d5152c388d2de3bdc4d955561d4a190a9187b62">Adds remote postgresql installation playbook</a></li>
<li><a href="https://github.com/rhevm-qe-automation/ovirt-ansible/pull/137/commits/8dd39bdd46e1a73e9ccf8d09bf7e96de2f3d38cc">Adds role for the dwh service setup along with the remaining three VMS’s</a></li>
</ul>
<h3 id="future-work">Future Work</h3>
<ul>
<li>Clean up the PRs opened so far.</li>
<li>Work on a test upgrade playbook.</li>
<li>Role for oVirt metrics deployment.</li>
<li>Role for direct upgrade.</li>
</ul>
<h3 id="references">References</h3>
<ul>
<li><a href="http://www.ovirt.org/documentation/install-guide/appe-Preparing_a_Remote_PostgreSQL_Database_for_Use_with_the_oVirt_Engine/">http://www.ovirt.org/documentation/install-guide/appe-Preparing_a_Remote_PostgreSQL_Database_for_Use_with_the_oVirt_Engine/</a></li>
<li><a href="https://www.linode.com/docs/databases/postgresql/how-to-install-postgresql-relational-databases-on-centos-7">https://www.linode.com/docs/databases/postgresql/how-to-install-postgresql-relational-databases-on-centos-7</a></li>
<li><a href="https://www.postgresql.org/docs/8.1/static/sql-createrole.html">https://www.postgresql.org/docs/8.1/static/sql-createrole.html</a></li>
<li><a href="https://support.rackspace.com/how-to/postgresql-creating-and-dropping-roles/">https://support.rackspace.com/how-to/postgresql-creating-and-dropping-roles/</a></li>
</ul>
<p>I have to say, I got stuck a lot while trying this out, and there have been numerous times when I was just about to give up. But my mentor was always there to answer my questions and clear my doubts. Thanks for that, Lukas :)</p>
<p>More to come</p>
Week 3 and 4, GSoC 2017 - dozens of cloud vm's, ansibling, finding bugs, testing
2017-06-28T00:00:00+00:00
https://www.tasdikrahman.com/2017/06/28/Week-3-and-4-delving-deeper-GSoC-2017-oVirt
<p>At last I have got the hang of IRC, and I declare my love for <a href="https://irssi.org/">irssi</a>. The combination of <a href="https://tmux.github.io/">tmux</a> and irssi is a boon for me. I tried different clients like <a href="http://www.weechat.org/">weechat</a> and <a href="http://www.epicsol.org//">EPIC</a>, but I found irssi more appealing. Quite frankly, use either of these two if you are looking for a way to chat on IRC from the terminal.</p>
<p>My current irssi installation runs on my VPS hosted on Linode. For logging my chats and conversations, I have set up the logrotate daemon to log channels and private chats into their respective directories.</p>
<p>So what does it look like? Definitely a huge shift from the web chat interface, and far more customisable.</p>
<center><img src="/content/images/2017/06/irssi_tmux.png" /></center>
<p>I love the current setup. The one thing I think is missing is a notification when I get a message or someone mentions my name in a channel. I will have to look that up.</p>
<p>You can try <a href="http://limechat.net/mac/">LimeChat</a> if you are on a Mac.</p>
<p>By the way, I will be around with the handle <code class="language-plaintext highlighter-rouge">tasdikrahman</code>, hanging out in #ovirt on OFTC as well as on freenode, mostly in #ansible and #gsoc.</p>
<p>Anyway, that’s it for my newfound IRC love.</p>
<p>Talking about work, I have been working on my existing PR, incorporating the feedback which my mentor and other members of the infra team have given.</p>
<p>And right now, I am working on adding playbooks for configuring a remote DWH on an ovirt-engine setup.</p>
<h2 id="prelude">Prelude</h2>
<p>Going by the docs, the DWH is a history database that allows users to create reports over a static API using business intelligence suites, enabling you to monitor the system. It contains the ETL (Extract, Transform, Load) process created using Talend Open Studio, and DB scripts to create a working history DB.</p>
<p>This history database(<code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> to be precise) can be utilized by any application to extract a range of information at the data center, cluster, and host levels.</p>
<p>Simply put, it’s a BI tool for your ovirt installation.</p>
<center><img src="/content/images/2017/06/ovirt-arch.png" /></center>
<p>oVirt Engine uses PostgreSQL 8.4.x as the database platform to store information about the state of the virtualization environment, its configuration and performance. At install time, ovirt-engine creates a PostgreSQL database called <code class="language-plaintext highlighter-rouge">engine</code>. You have the option to install this either on the same host or on a different (remote) host.</p>
<p>The <code class="language-plaintext highlighter-rouge">ovirt-engine-dwh</code> package creates a second database called <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code>, which contains historical configuration information and statistical metrics collected every minute over time from the engine operational database. Tracking the changes to the database provides information on the objects in the database, enabling the user to analyze activity, enhance performance, and resolve difficulties.</p>
<p>You can track <a href="http://www.ovirt.org/documentation/data-warehouse/Tracking_configuration_history/">configuration data</a>, <a href="http://www.ovirt.org/documentation/data-warehouse/Recording_statistical_history/">statistical data</a> into your <code class="language-plaintext highlighter-rouge">ovirt_engine_history</code> database based on your needs.</p>
<h2 id="but-why-should-someone-run-the-dwh-service-on-a-different-host">But why should someone run the DWH service on a different host?</h2>
<p>Quite obviously, running a whole lot of services on one host can get very memory heavy. And what happens when someone tries to squeeze a lot out of a single instance?</p>
<p>Performance drops as the processes fight over resources. So it’s very logical to distribute your services across different hosts if possible.</p>
<center><img src="/content/images/2017/06/ovirt-two-box-arch.jpg" /></center>
<p>The above architecture shows the simple setup, with the ovirt-engine and DWH all on the same host.</p>
<p>Now, even when you are trying to separate out the load by distributing the services running on a host, oVirt offers you further flexibility here.</p>
<p>Some of the options being:</p>
<ol>
<li><code class="language-plaintext highlighter-rouge">engine</code> db itself being on a different host rather than on the host where ovirt-engine is running.</li>
<li>DWH being installed on a different host than the one where ovirt-engine is installed (a feature available only from oVirt Engine 3.5 onwards).</li>
<li>DWH being installed on a different host and the DWH db being on a 3rd host.</li>
</ol>
<p>The first one is covered by the ansible role <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/tree/master/roles/ovirt-engine-remote-db">ovirt-engine-remote-db</a>, which lets you do so.</p>
<p>I tried tackling the 2nd one on the list in my 3rd week.</p>
<center><img src="/content/images/2017/06/ovirt-one-box-arch.jpg" /></center>
<p>Had the initial discussions about <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/issues/9">issue #9</a> with Lukas and we narrowed down to the end goals and tasks for that particular issue on github.</p>
<p>The docs are pretty clear on the process itself, and a bit of poring over them made clear what was to be done.</p>
<p>As I had mentioned in my <a href="https://medium.com/@tasdikrahman/week-1-and-2-gsoc-2017-travel-code-good-food-1f91d34344b9">earlier post</a>: don’t automate something which you haven’t achieved manually yet.</p>
<p>That applied here too.</p>
<p>For starters, I rebuilt 2x2gig Centos7 boxes on Linode which were hardened using my own custom ansible playbook called <a href="https://github.com/tasdikrahman/ansible-bootstrap-server">ansible-bootstrap-server</a>. I had a good lesson about security when <a href="https://edgecoders.com/learnings-from-analyzing-my-compromised-server-linode-cd3be62dc286">one of my servers got compromised</a> so this one was a must.</p>
<h2 id="some-assumptions-for-the-two-systems">Some assumptions for the two systems</h2>
<center><img src="/content/images/2017/06/vm-assumptions.jpg" /></center>
<p>Expanding on the first one, the two machines should be able to reach each other using the FQDNs of the respective VMs in question.</p>
<p>To add to the above,</p>
<ul>
<li>Ports 80, 443 and 5432 (<code class="language-plaintext highlighter-rouge">postgres</code>) should be open on both these hosts.</li>
</ul>
<p>You can open the above ports with <code class="language-plaintext highlighter-rouge">firewalld</code> as follows (assuming you are <code class="language-plaintext highlighter-rouge">root</code>):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>firewall-cmd <span class="nt">--zone</span><span class="o">=</span>public <span class="nt">--add-port</span><span class="o">=</span>80/tcp <span class="nt">--permanent</span>
<span class="nv">$ </span><span class="c"># similarly add rules for ports 443 and 5432</span>
<span class="nv">$ </span>firewall-cmd <span class="nt">--reload</span> <span class="c"># for the rules to take effect</span>
<span class="nv">$ </span>iptables <span class="nt">-S</span> <span class="c"># to check whether the changes have taken effect</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">iptables</code> would spit out every rule defined on your system, which would get ugly real quick. A cleaner way would be to do a</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@dwhtest-3-engine ~]# iptables-save | <span class="nb">grep </span>443
<span class="nt">-A</span> IN_public_allow <span class="nt">-p</span> tcp <span class="nt">-m</span> tcp <span class="nt">--dport</span> 443 <span class="nt">-m</span> conntrack <span class="nt">--ctstate</span> NEW <span class="nt">-j</span> ACCEPT
<span class="o">[</span>root@dwhtest-3-engine ~]#
</code></pre></div></div>
<p>After this, the machines are all set for configuration.</p>
<h2 id="manual-installation">Manual installation</h2>
<p>Install and set up ovirt-engine on machine A and ovirt-engine-dwh on machine B, and verify that dwhd on B collects data from the engine on A.</p>
<p>On A:</p>
<h4 id="installing-dwh-on-a-remote-host-from-the-one-having-ovirt-engine">Installing DWH on a remote host from the one having ovirt-engine</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@dwhtest-3-engine ~]# <span class="nb">hostname
</span>dwhtest-3-engine.ovirt.org
<span class="o">[</span>root@dwhtest-3-engine ~]# yum <span class="nb">install </span>http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm
<span class="o">[</span>root@dwhtest-3-engine ~]# yum <span class="nt">-y</span> <span class="nb">install </span>ovirt-engine
<span class="o">[</span>root@dwhtest-3-engine ~]# engine-setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Initializing
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment setup
Configuration files: <span class="o">[</span><span class="s1">'/etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf'</span>, <span class="s1">'/etc/ovirt-engine-setup.conf.d/10-packaging.conf'</span><span class="o">]</span>
Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20170625105958-cyvseu.log
Version: otopi-1.6.2 <span class="o">(</span>otopi-1.6.2-1.el7.centos<span class="o">)</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment packages setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Programs detection
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment customization
<span class="nt">--</span><span class="o">==</span> PRODUCT OPTIONS <span class="o">==</span><span class="nt">--</span>
Configure Engine on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Configure Image I/O Proxy on this host? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Configure WebSocket Proxy on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Please note: Data Warehouse is required <span class="k">for </span>the engine. If you choose to not configure it on this host, you have to configure it on a remote host, and <span class="k">then </span>configure the engine on this host so that it can access the database of the remote Data Warehouse host.
Configure Data Warehouse on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]: No
Configure VM Console Proxy on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
<span class="nt">--</span><span class="o">==</span> PACKAGES <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Checking <span class="k">for </span>product updates...
<span class="o">[</span> INFO <span class="o">]</span> No product updates found
<span class="nt">--</span><span class="o">==</span> NETWORK CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Host fully qualified DNS name of this server <span class="o">[</span>dwhtest-3-engine.ovirt.org]:
<span class="o">[</span>WARNING] Failed to resolve dwhtest-3-engine.ovirt.org using DNS, it can be resolved only locally
Setup can automatically configure the firewall on this system.
Note: automatic configuration of the firewall may overwrite current settings.
Do you want Setup to configure the firewall? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
The following firewall managers were detected on this system: firewalld
Firewall manager to configure <span class="o">(</span>firewalld<span class="o">)</span>: firewalld
<span class="o">[</span> INFO <span class="o">]</span> firewalld will be configured as firewall manager.
<span class="nt">--</span><span class="o">==</span> DATABASE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Where is the Engine database located? <span class="o">(</span>Local, Remote<span class="o">)</span> <span class="o">[</span>Local]:
Setup can configure the <span class="nb">local </span>postgresql server automatically <span class="k">for </span>the engine to run. This may conflict with existing applications.
Would you like Setup to automatically configure postgresql and create Engine database, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
<span class="nt">--</span><span class="o">==</span> OVIRT ENGINE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Engine admin password:
Confirm engine admin password:
<span class="o">[</span>WARNING] Password is weak: it is based on a dictionary word
Use weak password? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]: Yes
Application mode <span class="o">(</span>Virt, Gluster, Both<span class="o">)</span> <span class="o">[</span>Both]:
<span class="nt">--</span><span class="o">==</span> STORAGE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Default SAN wipe after delete <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]:
<span class="nt">--</span><span class="o">==</span> PKI CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Organization name <span class="k">for </span>certificate <span class="o">[</span>ovirt.org]:
<span class="nt">--</span><span class="o">==</span> APACHE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Setup can configure the default page of the web server to present the application home page. This may conflict with existing applications.
Do you wish to <span class="nb">set </span>the application as the default page of the web server? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Setup can configure apache to use SSL using a certificate issued from the internal CA.
Do you wish Setup to configure that, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
<span class="nt">--</span><span class="o">==</span> SYSTEM CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Configure an NFS share on this server to be used as an ISO Domain? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]:
<span class="nt">--</span><span class="o">==</span> MISC CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> END OF CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Setup validation
<span class="o">[</span>WARNING] Cannot validate host name settings, reason: resolved host does not match any of the <span class="nb">local </span>addresses
<span class="o">[</span>WARNING] Warning: Not enough memory is available on the host. Minimum requirement is 4096MB, and 16384MB is recommended.
Do you want Setup to <span class="k">continue</span>, with amount of memory less than recommended? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]: Yes
<span class="nt">--</span><span class="o">==</span> CONFIGURATION PREVIEW <span class="o">==</span><span class="nt">--</span>
Application mode : both
Default SAN wipe after delete : False
Firewall manager : firewalld
Update Firewall : True
Host FQDN : dwhtest-3-engine.ovirt.org
Configure <span class="nb">local </span>Engine database : True
Set application as default page : True
Configure Apache SSL : True
Engine database secured connection : False
Engine database user name : engine
Engine database name : engine
Engine database host : localhost
Engine database port : 5432
Engine database host name validation : False
Engine installation : True
PKI organization : ovirt.org
DWH installation : False
Configure <span class="nb">local </span>DWH database : False
Configure Image I/O Proxy : True
Configure VMConsole Proxy : True
Configure WebSocket Proxy : True
Please confirm installation settings <span class="o">(</span>OK, Cancel<span class="o">)</span> <span class="o">[</span>OK]:
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction setup
<span class="o">[</span> INFO <span class="o">]</span> Stopping engine service
<span class="o">[</span> INFO <span class="o">]</span> Stopping ovirt-fence-kdump-listener service
<span class="o">[</span> INFO <span class="o">]</span> Stopping Image I/O Proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stopping vmconsole-proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stopping websocket-proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Stage: Package installation
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Upgrading CA
<span class="o">[</span> INFO <span class="o">]</span> Initializing PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating PostgreSQL <span class="s1">'engine'</span> database
<span class="o">[</span> INFO <span class="o">]</span> Configuring PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating CA
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing Engine database schema
<span class="o">[</span> INFO <span class="o">]</span> Configuring Image I/O Proxy
<span class="o">[</span> INFO <span class="o">]</span> Setting up ovirt-vmconsole proxy helper PKI artifacts
<span class="o">[</span> INFO <span class="o">]</span> Setting up ovirt-vmconsole SSH PKI artifacts
<span class="o">[</span> INFO <span class="o">]</span> Configuring WebSocket Proxy
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing Engine <span class="s1">'internal'</span> domain database schema
<span class="o">[</span> INFO <span class="o">]</span> Generating post <span class="nb">install </span>configuration file <span class="s1">'/etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction commit
<span class="o">[</span> INFO <span class="o">]</span> Stage: Closing up
<span class="o">[</span> INFO <span class="o">]</span> Starting engine service
<span class="o">[</span> INFO <span class="o">]</span> Restarting ovirt-vmconsole proxy service
<span class="nt">--</span><span class="o">==</span> SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Restarting httpd
Please use the user <span class="s1">'admin@internal'</span> and password specified <span class="k">in </span>order to login
The engine requires access to the Data Warehouse database.
Data Warehouse was not <span class="nb">set </span>up. Please <span class="nb">set </span>it up on some other machine and configure access to it on the engine.
Web access is enabled at:
http://dwhtest-3-engine.ovirt.org:80/ovirt-engine
https://dwhtest-3-engine.ovirt.org:443/ovirt-engine
Internal CA 1F:E4:07:AF:E2:63:27:8F:4A:E2:A1:8D:F2:63:9B:BA:29:F7:3D:21
SSH fingerprint: 38:ca:86:6d:82:50:ca:c7:9c:03:ed:bc:0b:3a:e5:33
<span class="o">[</span>WARNING] Warning: Not enough memory is available on the host. Minimum requirement is 4096MB, and 16384MB is recommended.
<span class="nt">--</span><span class="o">==</span> END OF SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20170625105958-cyvseu.log
<span class="o">[</span> INFO <span class="o">]</span> Generating answer file <span class="s1">'/var/lib/ovirt-engine/setup/answers/20170625112418-setup.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Pre-termination
<span class="o">[</span> INFO <span class="o">]</span> Stage: Termination
<span class="o">[</span> INFO <span class="o">]</span> Execution of setup completed successfully
<span class="o">[</span>root@dwhtest-3-engine ~]#
</code></pre></div></div>
<p>So ovirt-engine is set up on machine A.</p>
<p>The only difference here is that when running <code class="language-plaintext highlighter-rouge">engine-setup</code> on this host, you answer No to configuring Data Warehouse:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
Configure Data Warehouse on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]: No
...
</code></pre></div></div>
<p>The other thing to note in the above installation is the <code class="language-plaintext highlighter-rouge">answerfile</code> which was generated by the installation process.</p>
<h2 id="answerfile">Answerfile?</h2>
<p>Tools like <code class="language-plaintext highlighter-rouge">engine-setup</code>, <code class="language-plaintext highlighter-rouge">engine-cleanup</code> and <code class="language-plaintext highlighter-rouge">ovirt-engine-rename</code>, which are based on OTOPI (a key-value pair store of data and configuration), all generate something called an answer file upon completion of their run (be it successful or erroneous).</p>
<p>This file is placed under <code class="language-plaintext highlighter-rouge">/var/lib/ovirt-engine/setup/answers/</code> with a <code class="language-plaintext highlighter-rouge">*.conf</code> extension.</p>
<center><img src="/content/images/2017/06/state0-to-state1.jpg" /></center>
<p>So we have a system A which is in some state S0. “S0” in this sense includes basically everything relevant: versions of relevant packages, enabled repos, history (such as previous runs of these tools, other data accumulated over time, etc.), other manual configuration, and so on.</p>
<p>The expected way to use these answer files is:</p>
<ol>
<li>Have a system A in some state S0</li>
<li>Run one of the tools interactively, answer its questions as needed, let it create an answer file Ans1. Basically answering all the places where the program was waiting for <code class="language-plaintext highlighter-rouge">STDIN</code>.</li>
<li>System A is now in state S1.</li>
<li>Have some other system B in state S0, that you want to bring to state S1.</li>
<li>Run the same tool there with <code class="language-plaintext highlighter-rouge">--config-append=Ans1</code></li>
</ol>
<blockquote>
<p>When used this way, the tools should run unattended. If they still ask questions, it’s generally considered a bug.</p>
</blockquote>
<p>This allows and makes way for automation.</p>
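<p>Concretely, bringing system B to state S1 then comes down to replaying the recorded answers. A sketch (the timestamped filename is the one generated by my interactive run earlier; yours will differ):</p>

```bash
# On system B: unattended setup, replaying the answers recorded on system A
engine-setup --config-append=/var/lib/ovirt-engine/setup/answers/20170625112418-setup.conf
```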
<p>Manually editing such an answer file is generally not supported/expected and should not be needed. You might do that to achieve special non-standard goals. If you do, you should thoroughly verify that it works for you, and use it in a controlled environment: same known initial state, same versions of relevant stuff, etc.</p>
<h2 id="leveraging-the-things-seen-in-this-answerfile">Leveraging the things seen in this answerfile</h2>
<p>We already have in place an ansible role, <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/tree/master/roles/ovirt-engine-setup">ovirt-engine-setup</a>, for installing ovirt-engine along with DWH on the same host.</p>
<p>Having a look at an existing answerfile, take <code class="language-plaintext highlighter-rouge">ovirt-engine-setup/templates/answerfile_4.0_basic.txt.j2</code> for example.</p>
<p>As it’s based on OTOPI, I had for reference the newly generated answer file from the successful install on machine A.</p>
<p>What I really wanted was a <code class="language-plaintext highlighter-rouge">diff</code> between these two answer files: one generated when DWH was installed along with the engine on the same host, and the other when the DWH would be configured on a remote host.</p>
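<p>Getting such a diff is just <code class="language-plaintext highlighter-rouge">diff</code> over the two files. A minimal self-contained illustration, with trimmed stand-in files containing only a couple of the DWH-related keys (the real answer files live under <code class="language-plaintext highlighter-rouge">/var/lib/ovirt-engine/setup/answers/</code> and are much longer):</p>

```shell
# Stand-in answer files for illustration; real ones come from engine-setup runs.
cat > /tmp/ans-dwh-local.conf <<'EOF'
OVESETUP_DWH_CORE/enable=bool:True
OVESETUP_DWH_DB/host=str:localhost
EOF

cat > /tmp/ans-dwh-remote.conf <<'EOF'
OVESETUP_DWH_CORE/enable=bool:False
OVESETUP_DWH_DB/host=none:None
EOF

# diff exits non-zero when the files differ, so tack on || true in scripts
diff /tmp/ans-dwh-local.conf /tmp/ans-dwh-remote.conf || true
```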
<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">46c46
</span><span class="gd">< OVESETUP_DWH_CORE/enable=bool:True
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_CORE/enable=bool:False
</span><span class="p">49c49
</span><span class="gd">< OVESETUP_DWH_DB/secured=bool:False
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_DB/secured=none:None
</span><span class="p">52,53c52,54
</span><span class="gd">< OVESETUP_DWH_DB/host=str:localhost
< OVESETUP_DWH_DB/user=str:ovirt_engine_history
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_DB/host=none:None
> OVESETUP_DWH_DB/user=none:None
> OVESETUP_DWH_DB/password=none:None
</span><span class="p">55c56
</span><span class="gd">< OVESETUP_DWH_DB/database=str:ovirt_engine_history
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_DB/database=none:None
</span><span class="p">57c58
</span><span class="gd">< OVESETUP_DWH_DB/port=int:5432
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_DB/port=none:None
</span><span class="p">60,61c61,62
</span><span class="gd">< OVESETUP_DWH_DB/securedHostValidation=bool:False
< OVESETUP_DWH_PROVISIONING/postgresProvisioningEnabled=bool:True
</span><span class="p">---
</span><span class="gi">> OVESETUP_DWH_DB/securedHostValidation=none:None
> OVESETUP_DWH_PROVISIONING/postgresProvisioningEnabled=bool:False
</span></code></pre></div></div>
<p>This gave me a clear idea of what I needed to do with the existing answer file to make it suitable for configuring a remote DWH.</p>
<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/roles/ovirt-engine-setup/templates/answerfile_4.0_basic.txt.j2 b/roles/ovirt-engine-setup/templates/answerfile_4.0_basic.txt.j2
index 728a307..d60d03f 100644
</span><span class="gd">--- a/roles/ovirt-engine-setup/templates/answerfile_4.0_basic.txt.j2
</span><span class="gi">+++ b/roles/ovirt-engine-setup/templates/answerfile_4.0_basic.txt.j2
</span><span class="p">@@ -29,16 +29,29 @@</span> OVESETUP_DB/port=int:{ {ovirt_engine_db_port} }
OVESETUP_DB/filter=none:None
OVESETUP_DB/restoreJobs=int:2
OVESETUP_DB/securedHostValidation=bool:False
<span class="gi">+
+OVESETUP_DWH_DB/secured=none:None
+OVESETUP_DWH_DB/host=none:None
+OVESETUP_DWH_DB/user=none:None
+OVESETUP_DWH_DB/password=none:None
+OVESETUP_DWH_DB/database=none:None
+OVESETUP_DWH_DB/port=none:None
+
+OVESETUP_DWH_DB/dumper=str:pg_custom
</span> OVESETUP_DWH_DB/filter=none:None
OVESETUP_DWH_DB/restoreJobs=int:2
<span class="gi">+
+OVESETUP_DWH_DB/securedHostValidation=none:None
+
</span> OVESETUP_ENGINE_CORE/enable=bool:True
OVESETUP_CORE/engineStop=none:None
OVESETUP_SYSTEM/memCheckEnabled=bool:False
<span class="p">@@ -65,13 +78,21 @@</span> OVESETUP_CONFIG/imageioProxyConfig=bool:True
OVESETUP_PROVISIONING/postgresProvisioningEnabled=bool:True
OVESETUP_APACHE/configureRootRedirection=bool:True
OVESETUP_APACHE/configureSsl=bool:True
<span class="gi">+
+OVESETUP_DWH_CORE/enable=bool:False
+
</span> OVESETUP_DWH_CONFIG/scale=str:1
OVESETUP_DWH_CONFIG/dwhDbBackupDir=str:/var/lib/ovirt-engine-dwh/backups
OVESETUP_DWH_DB/restoreBackupLate=bool:True
OVESETUP_DWH_DB/disconnectExistingDwh=none:None
OVESETUP_DWH_DB/performBackup=none:None
<span class="gi">+
+OVESETUP_DWH_PROVISIONING/postgresProvisioningEnabled=bool:False
+
</span> OVESETUP_DB/password=str:{ {ovirt_engine_db_password} }
OVESETUP_RHEVM_DIALOG/confirmUpgrade=bool:True
OVESETUP_VMCONSOLE_PROXY_CONFIG/vmconsoleProxyConfig=bool:True
</code></pre></div></div>
<p>The variable <code class="language-plaintext highlighter-rouge">ovirt_engine_dwh_db_configure</code> would be a <code class="language-plaintext highlighter-rouge">bool</code> defined inside our play yaml file, telling <code class="language-plaintext highlighter-rouge">ovirt-engine-setup</code> not to configure DWH on that host.</p>
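<p>For illustration, such a flag could gate the template along these lines (a hypothetical Jinja2 sketch, not the actual role code):</p>

```
{% if ovirt_engine_dwh_db_configure %}
OVESETUP_DWH_CORE/enable=bool:True
{% else %}
OVESETUP_DWH_CORE/enable=bool:False
OVESETUP_DWH_PROVISIONING/postgresProvisioningEnabled=bool:False
{% endif %}
```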
<p>This modification sets us up for the first part.</p>
<h2 id="configuring-the-remote-dwh-manually-first">Configuring the remote DWH manually first</h2>
<p>If you try to access the web admin panel on machine A, it would show you an error like this.</p>
<center><img src="/content/images/2017/06/admin-panel-error-without-dwh.png" /></center>
<p>On machine B:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@dwhtest-3-dwh ~]# yum <span class="nt">-y</span> update <span class="o">></span> /dev/null
<span class="o">[</span>root@dwhtest-3-dwh ~]# yum <span class="nt">-y</span> <span class="nb">install </span>ovirt-engine-dwh-setup <span class="o">></span> /dev/null
<span class="o">[</span>root@dwhtest-3-dwh ~]# engine-setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Initializing
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment setup
Configuration files: <span class="o">[</span><span class="s1">'/etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf'</span><span class="o">]</span>
Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20170625114213-l4cku9.log
Version: otopi-1.6.2 <span class="o">(</span>otopi-1.6.2-1.el7.centos<span class="o">)</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment packages setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Programs detection
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment customization
<span class="nt">--</span><span class="o">==</span> PRODUCT OPTIONS <span class="o">==</span><span class="nt">--</span>
Configure Data Warehouse on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
<span class="nt">--</span><span class="o">==</span> PACKAGES <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Checking <span class="k">for </span>product updates...
<span class="o">[</span> INFO <span class="o">]</span> No product updates found
<span class="nt">--</span><span class="o">==</span> NETWORK CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Host fully qualified DNS name of this server <span class="o">[</span>dwhtest-3-dwh.ovirt.org]:
<span class="o">[</span>WARNING] Failed to resolve dwhtest-3-dwh.ovirt.org using DNS, it can be resolved only locally
Setup can automatically configure the firewall on this system.
Note: automatic configuration of the firewall may overwrite current settings.
Do you want Setup to configure the firewall? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
The following firewall managers were detected on this system: firewalld
Firewall manager to configure <span class="o">(</span>firewalld<span class="o">)</span>: firewalld
<span class="o">[</span> INFO <span class="o">]</span> firewalld will be configured as firewall manager.
Host fully qualified DNS name of the engine server <span class="o">[]</span>: dwhtest-3-engine.ovirt.org
Setup will need to <span class="k">do </span>some actions on the remote engine server. Either automatically, using ssh as root to access it, or you will be prompted to manually perform each such action.
Please choose one of the following:
1 - Access remote engine server using ssh as root
2 - Perform each action manually, use files to copy content around
<span class="o">(</span>1, 2<span class="o">)</span> <span class="o">[</span>1]:
ssh port on remote engine server <span class="o">[</span>22]:
root password on remote engine server dwhtest-3-engine.ovirt.org:
<span class="nt">--</span><span class="o">==</span> DATABASE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Where is the DWH database located? <span class="o">(</span>Local, Remote<span class="o">)</span> <span class="o">[</span>Local]:
Setup can configure the <span class="nb">local </span>postgresql server automatically <span class="k">for </span>the DWH to run. This may conflict with existing applications.
Would you like Setup to automatically configure postgresql and create DWH database, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
Please provide the following credentials <span class="k">for </span>the Engine database.
They should be found on the Engine server <span class="k">in</span> <span class="s1">'/etc/ovirt-engine/engine.conf.d/10-setup-database.conf'</span><span class="nb">.</span>
Engine database host <span class="o">[]</span>: dwhtest-3-engine.ovirt.org
Engine database port <span class="o">[</span>5432]:
Engine database secured connection <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]:
Engine database name <span class="o">[</span>engine]:
Engine database user <span class="o">[</span>engine]:
Engine database password:
<span class="nt">--</span><span class="o">==</span> OVIRT ENGINE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> STORAGE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> PKI CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> APACHE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> SYSTEM CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="nt">--</span><span class="o">==</span> MISC CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Please choose Data Warehouse sampling scale:
<span class="o">(</span>1<span class="o">)</span> Basic
<span class="o">(</span>2<span class="o">)</span> Full
<span class="o">(</span>1, 2<span class="o">)[</span>2]:
<span class="nt">--</span><span class="o">==</span> END OF CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Setup validation
<span class="nt">--</span><span class="o">==</span> CONFIGURATION PREVIEW <span class="o">==</span><span class="nt">--</span>
Firewall manager : firewalld
Update Firewall : True
Host FQDN : dwhtest-3-dwh.ovirt.org
Engine database secured connection : False
Engine database user name : engine
Engine database name : engine
Engine database host : dwhtest-3-engine.ovirt.org
Engine database port : 5432
Engine database host name validation : False
DWH installation : True
DWH database secured connection : False
DWH database host : localhost
DWH database user name : ovirt_engine_history
DWH database name : ovirt_engine_history
DWH database port : 5432
DWH database host name validation : False
Configure <span class="nb">local </span>DWH database : True
Please confirm installation settings <span class="o">(</span>OK, Cancel<span class="o">)</span> <span class="o">[</span>OK]:
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction setup
<span class="o">[</span> INFO <span class="o">]</span> Stopping dwh service
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Stage: Package installation
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Initializing PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating PostgreSQL <span class="s1">'ovirt_engine_history'</span> database
<span class="o">[</span> INFO <span class="o">]</span> Configuring PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing DWH database schema
<span class="o">[</span> INFO <span class="o">]</span> Generating post <span class="nb">install </span>configuration file <span class="s1">'/etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction commit
<span class="o">[</span> INFO <span class="o">]</span> Stage: Closing up
<span class="nt">--</span><span class="o">==</span> SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Starting dwh service
Please restart the engine by running the following on dwhtest-3-engine.ovirt.org :
<span class="c"># service ovirt-engine restart</span>
This is required <span class="k">for </span>the dashboard to work.
<span class="nt">--</span><span class="o">==</span> END OF SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20170625114213-l4cku9.log
<span class="o">[</span> INFO <span class="o">]</span> Generating answer file <span class="s1">'/var/lib/ovirt-engine/setup/answers/20170625115725-setup.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Pre-termination
<span class="o">[</span> INFO <span class="o">]</span> Stage: Termination
<span class="o">[</span> INFO <span class="o">]</span> Execution of setup completed successfully
<span class="o">[</span>root@dwhtest-3-dwh ~]#
</code></pre></div></div>
<p>Restarting the ovirt-engine service on machine A brings the whole setup to life.</p>
<h3 id="debugging-tips">Debugging tips</h3>
<p>If, for any reason, the admin panel is still giving you an error:</p>
<ul>
<li>Try checking the output of <code class="language-plaintext highlighter-rouge">iptables -S</code> to see whether you are allowing incoming connections on port <code class="language-plaintext highlighter-rouge">5432</code> on both machines.</li>
<li>Check whether both hosts are able to resolve their public IPs using the provided FQDNs.</li>
<li>Try entering those passwords slowly.</li>
<li>If all else fails, the logs of each <code class="language-plaintext highlighter-rouge">engine-setup</code> run will come to your rescue.</li>
</ul>
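<p>The first two checks can be sketched in a few lines of shell (<code class="language-plaintext highlighter-rouge">check_host</code> is a hypothetical helper of my own; swap in your engine and DWH FQDNs):</p>

```shell
# Sketch of the DNS part of the checklist above; the firewall check
# (iptables -S | grep 5432) still has to be run by hand on each host.
check_host() {
  if getent hosts "$1" >/dev/null; then
    echo "$1: resolves"
  else
    echo "$1: does NOT resolve"
  fi
}

check_host dwhtest-3-engine.ovirt.org
check_host dwhtest-3-dwh.ovirt.org
```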
<h2 id="automating-it">Automating it</h2>
<center><img src="/content/images/2017/06/ansible-playbook-flow.jpg" /></center>
<p>To test my newly created <code class="language-plaintext highlighter-rouge">ovirt-engine-install-remote-dwh</code> role, I provisioned two 2 GB CentOS 7 VMs on Linode.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasrahma at TASRAHMA-M-C2MT <span class="k">in</span> ~/development/gsoc/ovirt-ansible <span class="o">(</span>remote-dwh-fresh-engine-install●●<span class="o">)</span>
<span class="nv">$ </span>ssh root@dwhmanualenginetest.ovirt.org
root@dwhmanualenginetest.ovirt.org<span class="s1">'s password:
[root@dwhmanualenginetest ~]# hostname
dwhmanualenginetest.ovirt.org
[root@dwhmanualenginetest ~]# systemctl start firewalld
[root@dwhmanualenginetest ~]# firewall-cmd --zone=public --add-port=80/tcp --permanent
success
[root@dwhmanualenginetest ~]# firewall-cmd --zone=public --add-port=443/tcp --permanent
success
[root@dwhmanualenginetest ~]# firewall-cmd --zone=public --add-port=5432/tcp --permanent
success
[root@dwhmanualenginetest ~]# firewall-cmd --reload
success
[root@dwhmanualenginetest ~]# exit
logout
Connection to dwhmanualenginetest.ovirt.org closed.
tasrahma at TASRAHMA-M-C2MT in ~/development/gsoc/ovirt-ansible (remote-dwh-fresh-engine-install●●)
$ ssh root@dwhmanualdwhtest.ovirt.org
root@dwhmanualdwhtest.ovirt.org'</span>s password:
<span class="o">[</span>root@li1462-178 ~]# <span class="nb">hostname
</span>dwhmanualdwhtest.ovirt.org
<span class="o">[</span>root@li1462-178 ~]# vi /etc/hosts
<span class="o">[</span>root@li1462-178 ~]# systemctl start firewalld
<span class="o">[</span>root@li1462-178 ~]# firewall-cmd <span class="nt">--zone</span><span class="o">=</span>public <span class="nt">--add-port</span><span class="o">=</span>80/tcp <span class="nt">--permanent</span>
success
<span class="o">[</span>root@li1462-178 ~]# firewall-cmd <span class="nt">--zone</span><span class="o">=</span>public <span class="nt">--add-port</span><span class="o">=</span>443/tcp <span class="nt">--permanent</span>
success
<span class="o">[</span>root@li1462-178 ~]# firewall-cmd <span class="nt">--zone</span><span class="o">=</span>public <span class="nt">--add-port</span><span class="o">=</span>5432/tcp <span class="nt">--permanent</span>
success
<span class="o">[</span>root@li1462-178 ~]# firewall-cmd <span class="nt">--reload</span>
success
<span class="o">[</span>root@li1462-178 ~]# <span class="nb">exit
logout
</span>Connection to dwhmanualdwhtest.ovirt.org closed.
</code></pre></div></div>
<p>With these done, the servers are ready for my new ansible roles.</p>
<p>First comes the installation of the <code class="language-plaintext highlighter-rouge">ovirt-engine</code> host.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">engine</span>
<span class="na">vars</span><span class="pi">:</span>
<span class="na">ovirt_engine_type</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt-engine'</span>
<span class="na">ovirt_engine_version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">4.1'</span>
<span class="na">ovirt_engine_admin_password</span><span class="pi">:</span> <span class="s1">'</span><span class="s">secret'</span>
<span class="na">ovirt_rpm_repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm'</span>
<span class="na">ovirt_engine_organization</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">remotedwh.ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_configure</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-common</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-install-packages</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-setup</span>
</code></pre></div></div>
<p>Running this playbook is done like so:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">--skip-tags</span> skip_yum_install_ovirt_engine_dwh,skip_yum_install_ovirt_engine_dwh_setup
</code></pre></div></div>
<p>After that, you have to configure your remote DWH installation against the previous host, the one carrying the ovirt-engine installation; again, there are two possibilities here.</p>
<p>On your machine B, which would be <code class="language-plaintext highlighter-rouge">dwhmanualdwhtest.ovirt.org</code> in this case:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">dwh-remote</span>
<span class="na">vars</span><span class="pi">:</span>
<span class="na">ovirt_engine_type</span><span class="pi">:</span> <span class="s1">'</span><span class="s">ovirt-engine'</span>
<span class="na">ovirt_engine_version</span><span class="pi">:</span> <span class="s1">'</span><span class="s">4.1'</span>
<span class="na">ovirt_rpm_repo</span><span class="pi">:</span> <span class="s1">'</span><span class="s">http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm'</span>
<span class="na">ovirt_engine_host_root_passwd</span><span class="pi">:</span> <span class="s1">'</span><span class="s">admin123'</span> <span class="c1"># the root password of the host where ovirt-engine is installed</span>
<span class="na">ovirt_engine_firewall_manager</span><span class="pi">:</span> <span class="s1">'</span><span class="s">firewalld'</span>
<span class="na">ovirt_engine_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">dwhmanualenginetest.ovirt.org'</span> <span class="c1"># FQDN of the ovirt-engine installation host, should be resolvable from the new DWH host</span>
<span class="na">ovirt_engine_host_fqdn</span><span class="pi">:</span> <span class="s1">'</span><span class="s">dwhmanualenginetest.ovirt.org'</span>
<span class="na">ovirt_engine_dwh_db_host</span><span class="pi">:</span> <span class="s1">'</span><span class="s">localhost'</span>
<span class="na">ovirt_dwh_on_dwh</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-common</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-install-packages</span>
<span class="pi">-</span> <span class="na">role</span><span class="pi">:</span> <span class="s">ovirt-engine-remote-dwh-setup</span>
</code></pre></div></div>
<p>Running this playbook is done like so:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">--skip-tags</span> skip_yum_install_ovirt_engine,skip_yum_install_ovirt_engine_dwh
</code></pre></div></div>
<p>After this, you have to restart ovirt-engine on the engine host with a <code class="language-plaintext highlighter-rouge">service ovirt-engine restart</code>.</p>
<p>And with that, you have the whole setup.</p>
<h2 id="possibilty-of-the-database-of-dwh-on-3rd-machine">Possibility of the DWH database on a 3rd machine</h2>
<p>I have not covered this case yet in the new playbook that I wrote. I was able to figure out how to do it manually, but the specific changes to make this option work using our roles are still pending.</p>
<h2 id="next-goals">Next goals</h2>
<ul>
<li>The very first thing to do now would be to modify the existing <code class="language-plaintext highlighter-rouge">ovirt-engine-setup-remote-dwh</code> role that I wrote to include the case of putting the DWH database on a 3rd machine.</li>
<li>A new playbook that migrates the DWH from the engine’s local machine to a remote one, for setups that already have an engine installed with a local DWH, is also on the task list.</li>
<li>Writing tests for the new additions</li>
</ul>
<h2 id="pitfalls">Pitfalls</h2>
<p>The most arcane error that I faced during this was when I was trying to automate the configuration of the remote DWH with the engine host.</p>
<center><img src="/content/images/2017/06/arcane-otopi-var-missing-ansible-error.png" /></center>
<p>From a first glance at the error traceback, it was clear that there was some error reading the engine hostname from the OTOPI based answerfile that I had generated.</p>
<p>I cross-checked again and again against the answerfile generated from the manual installation of the remote DWH, as described in the process above, but to no avail.</p>
<p>Both of them looked just the same. I was not leaving anything out from the original answerfile; they were basically identical minus the hostnames and passwords.</p>
<p>To be frank, I really didn’t know what was missing out here.</p>
<p>On a closer look at the manual input given during installation and the answerfile generated from it, I noticed that the engine host FQDN that we provide is not logged in the OTOPI answerfile.</p>
<p>Cross-checking with the answerfile generated on a fresh host, sure enough, there was a prompt for the engine host FQDN, which confirmed my hunch.</p>
<center><img src="/content/images/2017/06/confirmed-hunch-prompt.png" /></center>
<p>This had to be the case! I mean, there was no other explanation. I asked Lukas, and he said this might well be a bug and suggested that I file one on Bugzilla, which I did at <a href="https://bugzilla.redhat.com/show_bug.cgi?id=1465859">https://bugzilla.redhat.com/show_bug.cgi?id=1465859</a>. That led me to my first ever bug report :)</p>
<p>Anyway, now I just had to figure out which OTOPI variable was being logged when I entered this FQDN. Checking the logs for that particular engine-setup run showed me the variable.</p>
<p>It was <code class="language-plaintext highlighter-rouge">OVESETUP_ENGINE_CONFIG/fqdn</code>. I placed this in my template answerfile and added the remote engine host value to it.</p>
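<p>For reference, the line I ended up putting in the template answerfile followed the same <code class="language-plaintext highlighter-rouge">key=type:value</code> notation that OTOPI uses elsewhere (the FQDN here is the engine host from my test setup):</p>

```
OVESETUP_ENGINE_CONFIG/fqdn=str:dwhmanualenginetest.ovirt.org
```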
<p>And sure enough, it ran successfully :)</p>
<p>Crazy day! But that feeling when you successfully debug something is always priceless, no matter how many times you have done it before.</p>
<p>Until next time!</p>
<h2 id="links">Links</h2>
<ul>
<li><a href="https://github.com/rhevm-qe-automation/ovirt-ansible/compare/master...tasdikrahman:remote-dwh-fresh-engine-install">Github branch for the work being done on issue #9 namely <code class="language-plaintext highlighter-rouge">remote-dwh-fresh-engine-install</code></a></li>
<li><a href="https://github.com/rhevm-qe-automation/ovirt-ansible/issues/9">https://github.com/rhevm-qe-automation/ovirt-ansible/issues/9</a></li>
<li><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1465859">https://bugzilla.redhat.com/show_bug.cgi?id=1465859</a></li>
<li><a href="https://www.ovirt.org/documentation/architecture/architecture/">https://www.ovirt.org/documentation/architecture/architecture/</a></li>
<li><a href="http://www.ovirt.org/documentation/data-warehouse/Data_Warehouse_Guide/">http://www.ovirt.org/documentation/data-warehouse/Data_Warehouse_Guide/</a></li>
<li><a href="http://www.ovirt.org/documentation/data-warehouse/Migrating_Data_Warehouse_to_a_Separate_Machine/">http://www.ovirt.org/documentation/data-warehouse/Migrating_Data_Warehouse_to_a_Separate_Machine/</a></li>
<li><a href="http://www.ovirt.org/develop/release-management/features/engine/separate-dwh-host/">http://www.ovirt.org/develop/release-management/features/engine/separate-dwh-host/</a></li>
<li><a href="http://www.ovirt.org/develop/release-management/features/engine/migration-of-local-dwh-reports-to-remote/">http://www.ovirt.org/develop/release-management/features/engine/migration-of-local-dwh-reports-to-remote/</a></li>
<li><a href="https://www.ovirt.org/documentation/how-to/hosted-engine/">https://www.ovirt.org/documentation/how-to/hosted-engine/</a></li>
<li><a href="https://www.ovirt.org/develop/developer-guide/engine/engine-setup/">https://www.ovirt.org/develop/developer-guide/engine/engine-setup/</a></li>
</ul>
Week 1 and 2, GSoC 2017 - Travel, Code, Good food
2017-06-13T00:00:00+00:00
https://www.tasdikrahman.com/2017/06/13/Week-1-and-2-ovirt-engine-rename-GSoC-2017-oVirt-Redhat
<p>So it’s been quite some time since I wrote a thing or two about what I have been doing over the last 2 weeks.</p>
<p>This month has been a roller coaster ride if you ask me. Many reasons to it.</p>
<p>Some of the reasons being that I travelled to <a href="http://www.tasdikrahman.com/2017/06/12/PyCon-Taiwan-2017-Taipei/">PyCon Taiwan</a>, which marked my first international trip and also my first PyCon talk. I got my uni results, and heck, I scored a perfect 10 in the last sem! The final version of <a href="https://github.com/tasdikrahman/trumporate/">trumporate’s</a> UI is almost done, and Rituraj and I just have to put some final touches to it (a blogpost on the whole development process is pending; you can find the first <a href="http://www.tasdikrahman.com/2017/05/06/Making-of-trumporate-using-markovipy-generating-sentences-using-markov-chains-part-1/">one here</a>).</p>
<p>As for the GSoC work. I have been working closely with Lukas on the <code class="language-plaintext highlighter-rouge">engine-rename</code> role which has been assigned as one of the first tasks for the first review.</p>
<h2 id="approach">Approach</h2>
<blockquote>
<p>Don’t automate something which you haven’t achieved manually yet!</p>
</blockquote>
<p>I read this somewhere; the speaker was an employee of <a href="https://www.chef.io/chef/">Chef Software</a> at one of the Chef conferences. Now don’t ask me what I was doing watching Chef videos. I really don’t remember what led me there. Thanks to YouTube’s recommendation engine!</p>
<p>But nevertheless, I think the quote is quite true, even in the literal sense.</p>
<p>I mean, if you are trying to automate a task which consists of several small tasks, you ought to have gone through the process of doing it manually at least once. Especially when you are new to the tool or process that you are trying to automate, new to the automation tool itself, or both.</p>
<p>The reasoning behind this is very simple: you are bound to hit hiccups when trying to do the above, and you will get frustrated trying to debug the errors that you get while automating the process.</p>
<p>I mean, what would you pinpoint your error to? The process which you are following, is that wrong? Or the syntax and way-of-doing-things which your automation tool follows?</p>
<p>Hard to say if you ask me.</p>
<p>So with this in mind, I first tried renaming an ovirt-engine installation manually.</p>
<p>Nothing fancy; you can find the exact steps in one of the oVirt <a href="https://www.ovirt.org/documentation/how-to/networking/changing-engine-hostname/">documentation pages</a>.</p>
<p>The command is very simple (assuming you are inside your ovirt-engine installation)</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/share/ovirt-engine/setup/bin/ovirt-engine-rename <span class="se">\</span>
<span class="nt">--newname</span><span class="o">=</span>ovirte1n.home.local <span class="se">\</span>
<span class="nt">--otopi-environment</span><span class="o">=</span><span class="s2">"OSETUP_RENAME/forceIgnoreAIAInCA=bool:'True' </span><span class="se">\</span><span class="s2">
OVESETUP_CORE/engineStop=bool:'True' </span><span class="se">\</span><span class="s2">
OSETUP_RENAME/confirmForceOverwrite=bool:'False'"</span>
</code></pre></div></div>
<p>The above tries to rename the engine setup to the new name with minimal (read: no) external interference.</p>
<p>Much like <code class="language-plaintext highlighter-rouge">expect</code> would do.</p>
<p>So the value of the optional argument <code class="language-plaintext highlighter-rouge">--newname</code> is the new engine name that you want to replace the older one with. The original name was given back at the time of the engine installation, but now you want it renamed.</p>
<p>When the <code class="language-plaintext highlighter-rouge">engine-setup</code> command is run in a clean environment, the command generates a number of certificates and keys that use the fully qualified domain name of the Manager supplied during the setup process. If the fully qualified domain name of the Manager must be changed later on (for example, due to migration of the machine hosting the Manager to a different domain), the records of the fully qualified domain name must be updated to reflect the new name. The <code class="language-plaintext highlighter-rouge">ovirt-engine-rename</code> command automates this task.</p>
<p>The <code class="language-plaintext highlighter-rouge">ovirt-engine-rename</code> command updates records of the fully qualified domain name of the Manager in the following locations:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/etc/ovirt-engine/engine.conf.d/10-setup-protocols.conf
/etc/ovirt-engine/imageuploader.conf.d/10-engine-setup.conf
/etc/ovirt-engine/isouploader.conf.d/10-engine-setup.conf
/etc/ovirt-engine/logcollector.conf.d/10-engine-setup.conf
/etc/pki/ovirt-engine/cert.conf
/etc/pki/ovirt-engine/cert.template
/etc/pki/ovirt-engine/certs/apache.cer
/etc/pki/ovirt-engine/keys/apache.key.nopass
/etc/pki/ovirt-engine/keys/apache.p12
</code></pre></div></div>
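<p>After a rename, a quick way to sanity-check these files is to grep them for the old FQDN. A hedged sketch (<code class="language-plaintext highlighter-rouge">check_stale</code> and the old name are my own inventions):</p>

```shell
# Report whether any of the given files still mention the old engine
# FQDN after a rename; missing files are simply ignored.
check_stale() {
  old="$1"; shift
  if grep -q "$old" "$@" 2>/dev/null; then
    echo "old FQDN still present"
  else
    echo "no stale references found"
  fi
}

check_stale "old-engine.example.org" \
  /etc/ovirt-engine/engine.conf.d/10-setup-protocols.conf \
  /etc/pki/ovirt-engine/cert.conf
```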
<p>Once the command is executed (assuming you got no errors), and with no surprises, the engine has been renamed. But there are still some things left to be done.</p>
<p>The next thing is to change the hostname. This is usually done by editing <code class="language-plaintext highlighter-rouge">/etc/hostname</code> and rebooting.</p>
<p>You can go with the <code class="language-plaintext highlighter-rouge">hostnamectl</code> command if you are on a RHEL-based system. Even the <code class="language-plaintext highlighter-rouge">hostname</code> command works if you only need a temporary change. If you are using DNS to pull out the hostname, then prepare the relevant DNS and/or <code class="language-plaintext highlighter-rouge">/etc/hosts</code> records for the new name.</p>
<p>Assuming you were accessing the web portal before doing the engine-rename part, you will notice that you are not able to access the web interface anymore.</p>
<p>I was testing all this out on a 2 GB Linode CentOS 7 box (love these guys). Right now it’s all manual provisioning; I have to take a look at <a href="https://www.terraform.io/">terraform</a> sometime later.</p>
<p>The current workflow for me would be to create a new VM and then provision it using the <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/tree/master/roles/ovirt-engine-setup"><code class="language-plaintext highlighter-rouge">ovirt-engine-setup</code></a> role, which sets up the ovirt-engine for you. I have written a <a href="http://www.tasdikrahman.com/2017/05/24/Installing-ovirt-4.1-on-centos-7-using-ansible-linode-Google-Summer-of-Code-oVirt-2017/">blog post</a> here if you are curious about how to do so using an ansible role.</p>
<p>Once that is done, your path for the role to rename the engine is quite clear.</p>
<p>For my personal setup, I now had to change the entry in my local dev box’s <code class="language-plaintext highlighter-rouge">/etc/hosts</code> to the new hostname which I gave the ovirt-engine.</p>
<p>Now open the new DNS name in your browser, and you should be able to see the web UI up.</p>
<h2 id="automating-it">Automating it</h2>
<p>Now as we have the roadmap to what needs to be done, I needed to write an ansible-role for it.</p>
<p>A very easy edge case to guess here is that the new engine name is the same as the one already present. How do I check that?</p>
<p>Checking the config files would be the answer for that. I needed to look under <code class="language-plaintext highlighter-rouge">/etc/ovirt-engine</code>, namely at the files below:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/etc/ovirt-engine/engine.conf.d/10-setup-protocols.conf
/etc/ovirt-engine/ovirt-vmconsole-proxy-helper.conf.d/10-setup.conf
/etc/ovirt-engine/isouploader.conf.d/10-engine-setup.conf
/etc/ovirt-engine/logcollector.conf.d/10-engine-setup.conf
</code></pre></div></div>
<p>The engine name is recorded in these files.</p>
<p>What I wanted to have in place was a check that the new name is not the same as the name already out there.</p>
<p>So if I did a <code class="language-plaintext highlighter-rouge">grep</code> recursively on the <code class="language-plaintext highlighter-rouge">/etc/ovirt-engine</code> directory with the <code class="language-plaintext highlighter-rouge">new-name</code>, it would either return something (in which case the engine name is already at the required state) or nothing at all (in which case the new name really is new and not the one already present).</p>
<p>Registering this grep output to a file at <code class="language-plaintext highlighter-rouge">/tmp/engine-rename-logs-current.grep</code> and then comparing it, using <code class="language-plaintext highlighter-rouge">diff</code>, with a <code class="language-plaintext highlighter-rouge">template</code> file that stores the expected grep result does the job.</p>
<p>If the diff with this expected file returns nothing, we are already at our required state.</p>
<p>If not, the <code class="language-plaintext highlighter-rouge">engine-rename</code> command is run.</p>
<p>After which a recursive grep is done again on <code class="language-plaintext highlighter-rouge">/etc/ovirt-engine</code>, with the new engine name as the search parameter.</p>
<p>This result is registered and compared to the expected grep result. If this passes, we now have our ovirt-engine at the desired state, i.e. the name of the engine was renamed successfully.</p>
<p>After which we check the engine health at the end of the role.</p>
<p>The handlers that were notified when the temporary files were created are executed by the playbook at this point, deleting them from the file system.</p>
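<p>The whole grep-then-rename decision can be simulated in plain shell (a self-contained sketch; a temp directory stands in for the real <code class="language-plaintext highlighter-rouge">/etc/ovirt-engine</code>, and the rename command is only echoed):</p>

```shell
# Idempotency check: run ovirt-engine-rename only if the new name is
# not already present under the engine's config directory.
NEW_NAME="ovirte1n.home.local"
CONF_DIR="$(mktemp -d)"   # stand-in for /etc/ovirt-engine
echo "ENGINE_FQDN=$NEW_NAME" > "$CONF_DIR/10-setup-protocols.conf"

if grep -rq "$NEW_NAME" "$CONF_DIR"; then
  echo "already renamed, nothing to do"
else
  echo "would run: ovirt-engine-rename --newname=$NEW_NAME"
fi
```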
<h2 id="testing-all-the-above">Testing all the above</h2>
<center><img src="/content/images/2017/06/brace-yourselves-testing-is-coming.jpg" /></center>
<p>Now testing these changes out is the real thing.</p>
<p>We are using <code class="language-plaintext highlighter-rouge">docker</code> containers to test out the roles. I would say this approach is quite fast and gives us the <code class="language-plaintext highlighter-rouge">works-everywhere</code> advantage.</p>
<p>We need two containers, one for the <code class="language-plaintext highlighter-rouge">engine</code> and one for the <code class="language-plaintext highlighter-rouge">remote-db</code>, for which we are using <a href="http://chrismeyersfsu.me/">Chris Meyer’s</a> <a href="https://github.com/chrismeyersfsu/provision_docker/">provision_docker</a> tooling.</p>
<p>The first role in any test would be to provision the docker containers and make sure that they are reachable.</p>
<p><code class="language-plaintext highlighter-rouge">ovirt-ansible/tests/containers-deploy.yml</code> is included as the first role to be executed in all of the <code class="language-plaintext highlighter-rouge">test-$.yml</code> files.</p>
<p>Internally, the above calls the role named <code class="language-plaintext highlighter-rouge">provision_docker</code> twice, to provision the docker inventory groups passed to it as key:value pairs, which are nothing but the <code class="language-plaintext highlighter-rouge">engine</code> and <code class="language-plaintext highlighter-rouge">remote-db</code> containers specified inside the <code class="language-plaintext highlighter-rouge">tests/inventory</code> file.</p>
<p>The other roles follow.</p>
<p>The <code class="language-plaintext highlighter-rouge">ovirt-engine-rename</code> role would come just before we run the <code class="language-plaintext highlighter-rouge">ovirt-engine-cleanup</code> role.</p>
<p>Adding mine just above that will do the trick.</p>
<blockquote>
<p>The particular PR which covers the above <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/pull/132/files">https://github.com/rhevm-qe-automation/ovirt-ansible/pull/132/files</a></p>
</blockquote>
<h2 id="few-gotchas">A few gotchas</h2>
<p>The command <code class="language-plaintext highlighter-rouge">$ hostnamectl set-hostname test.ovirt.org</code>, which would be used to change the hostname, was giving weird errors inside the CI, and the error traceback apparently didn’t give much away.</p>
<center><img src="/content/images/2017/06/dbus-error.png" /></center>
<p>I mean what does <code class="language-plaintext highlighter-rouge">Success</code> mean here? Quite the irony if you ask me. Hah</p>
<p>Searching around didn’t turn up much. Meanwhile, Lukas thought there was some issue with DBus, maybe due to some permissions change just before the particular task.</p>
<p>Lukas suggested that these tests were not specific to <code class="language-plaintext highlighter-rouge">ovirt</code> and that I try running them on a local setup. I had another 2 GB Linode server for testing, but it was all the same: the tests failed with the same error again, which suggested that the issue was with the command itself being run inside the docker container.</p>
<p>After some fooling around with this issue, I remembered that even the <code class="language-plaintext highlighter-rouge">hostname</code> command can be used to change the name of the dev box temporarily.</p>
<p>Tried that and it worked!</p>
<center><img src="/content/images/2017/06/travis-pass-log.png" /></center>
<p>Another thing which was making the tests fail were the <code class="language-plaintext highlighter-rouge">ansible-lints</code>.</p>
<p>I had to replace <code class="language-plaintext highlighter-rouge">shell</code> module usage with the <code class="language-plaintext highlighter-rouge">command</code> module in most places. It turns out these two modules are closely related but not identical.</p>
<p>In most use cases both modules lead to the same result. Here are the main differences between them.</p>
<ul>
<li>With the <code class="language-plaintext highlighter-rouge">command</code> module, the command is executed without being processed through a shell. As a consequence, some variables like <code class="language-plaintext highlighter-rouge">$HOME</code> are not available, and stream operations like <code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>, <code class="language-plaintext highlighter-rouge">|</code> and <code class="language-plaintext highlighter-rouge">&amp;</code> will not work.</li>
<li>The <code class="language-plaintext highlighter-rouge">shell</code> module runs a command through a shell, by default <code class="language-plaintext highlighter-rouge">/bin/sh</code>. This can be changed with the <code class="language-plaintext highlighter-rouge">executable</code> option. Piping and redirection are therefore available here.</li>
<li>The <code class="language-plaintext highlighter-rouge">command</code> module is more secure, because it is not affected by the user’s environment.</li>
</ul>
<p>So by default, the <code class="language-plaintext highlighter-rouge">command</code> module is all one needs for most things.</p>
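<p>The distinction can be mirrored in plain Python with <code class="language-plaintext highlighter-rouge">subprocess</code> (an analogy I am drawing here, not Ansible itself): executing an argv list directly behaves like the <code class="language-plaintext highlighter-rouge">command</code> module, while <code class="language-plaintext highlighter-rouge">shell=True</code> behaves like the <code class="language-plaintext highlighter-rouge">shell</code> module.</p>

```python
import subprocess

# Like Ansible's `command` module: the argv list is executed directly,
# with no shell in between, so $HOME is NOT expanded.
no_shell = subprocess.run(["echo", "$HOME"], capture_output=True, text=True)
print(no_shell.stdout.strip())  # prints the literal string: $HOME

# Like Ansible's `shell` module: /bin/sh interprets the string,
# so variable expansion (and pipes, redirection) would work.
with_shell = subprocess.run("echo $HOME", shell=True, capture_output=True, text=True)
print(with_shell.stdout.strip())  # prints your actual home directory
```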
<p>Also, there were some places in my PR where I had no choice but to use the <code class="language-plaintext highlighter-rouge">shell</code> module. For those, I had to add the <code class="language-plaintext highlighter-rouge">skip_ansible_lint</code> tag so that <code class="language-plaintext highlighter-rouge">ansible-lint</code> would skip those tasks.</p>
<p>And it’s always good to see your CI server mailing you things like this one.</p>
<center><img src="/content/images/2017/06/travis-pass.png" /></center>
<p>So far it’s been a great time for me and I have learned tons. Thanks for reading till this point.</p>
<p>Cheerio!</p>
<p><strong>EDIT:</strong></p>
<p><a href="https://medium.com/@ykaul/why-not-look-at-http-docs-ansible-com-ansible-hostname-module-html-to-change-the-host-name-b3c68f1481ec">Yaniv Kaul</a> suggested that I try renaming the engine using the <a href="http://docs.ansible.com/ansible/hostname_module.html">hostname module</a>, but the module failed with the same DBUS error as the previous one.</p>
<p>Not a big change per se for testing it out.</p>
<div class="language-patch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/roles/ovirt-engine-rename/tasks/main.yml b/roles/ovirt-engine-rename/tasks/main.yml
index 8c96082..ed28298 100644
</span><span class="gd">--- a/roles/ovirt-engine-rename/tasks/main.yml
</span><span class="gi">+++ b/roles/ovirt-engine-rename/tasks/main.yml
</span><span class="p">@@ -49,7 +49,9 @@</span>
# to get the hostname, if not let it remain
- name: Changing the system host name to reflect the new engine-name
# command: hostnamectl set-hostname
<span class="gd">- command: hostname { { ovirt_engine_rename_new_fqdn } }
</span><span class="gi">+ # command: hostname { { ovirt_engine_rename_new_fqdn } }
+ hostname:
+ name: '{ { ovirt_engine_rename_new_fqdn } }'
</span> tags:
- skip_ansible_lint
</code></pre></div></div>
<p>Had a chat with Lukas about it, and he suggested that most systems use DNS to pull out the hostname, so that task in the ovirt-engine-rename playbook would not be necessary and is commented out in most cases.</p>
<p>Will be having a look at <a href="https://lago.readthedocs.io/en/latest/">Lago</a> and ovirt-system-test, as Yaniv Kaul suggested, for end to end testing.</p>
<h2 id="older-blog-post">Older blog posts</h2>
<ul>
<li><a href="https://medium.com/@tasdikrahman/using-ansible-playbooks-to-install-ovirt-4-1-on-centos-7-linode-9dadd90143af">https://medium.com/@tasdikrahman/using-ansible-playbooks-to-install-ovirt-4-1-on-centos-7-linode-9dadd90143af</a></li>
<li><a href="https://medium.com/@tasdikrahman/installing-ovirt-4-1-on-centos-7-digitalocean-fbb4fa037c11">https://medium.com/@tasdikrahman/installing-ovirt-4-1-on-centos-7-digitalocean-fbb4fa037c11</a></li>
<li><a href="https://medium.com/@tasdikrahman/community-bonding-period-gsoc-2017-with-ovirt-org-5d0faac95257">https://medium.com/@tasdikrahman/community-bonding-period-gsoc-2017-with-ovirt-org-5d0faac95257</a></li>
<li><a href="https://medium.com/@tasdikrahman/hello-ovirt-gsoc-2017-6c18100f21dd">https://medium.com/@tasdikrahman/hello-ovirt-gsoc-2017-6c18100f21dd</a></li>
<li><a href="https://medium.com/@tasdikrahman/testing-your-ansible-roles-using-travis-ci-d186749a8edf">https://medium.com/@tasdikrahman/testing-your-ansible-roles-using-travis-ci-d186749a8edf</a></li>
</ul>
PyCon Taiwan 2017, Taipei
2017-06-12T00:00:00+00:00
https://www.tasdikrahman.com/2017/06/12/PyCon-Taiwan-2017-Taipei
<p>I have been going to PyCon India since the 2014 edition, and so far have only missed PyCon Pune, the new regional PyCon, which was held at the start of this year.</p>
<p>Every year? Yes. But why? I dunno, maybe because I love python too much? Also partly because it’s the only time in the year where I get to meet all my friends in the Python community.</p>
<p>But this year was a little different.</p>
<p>This was the first year when I was going to a PyCon being held in a different country altogether and I would be lying if I said I was not excited about it.</p>
<p>Another reason to be excited was</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Will be speaking at <a href="https://twitter.com/PyConTW">@PyConTW</a>, 2017. Couldn't be more excited to meet you guys <a href="https://twitter.com/data__wizard">@data__wizard</a> <a href="https://twitter.com/uranusjr">@uranusjr</a> <a href="https://twitter.com/andrewgodwin">@andrewgodwin</a> <a href="https://twitter.com/freakboy3742">@freakboy3742</a> 😄 <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/870256422226862080">June 1, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<blockquote>
<p>Yes. This was the first time I would be speaking in PyCon! I was going to share the stage with some great developers and that was also a scary thought in itself.</p>
</blockquote>
<p>Crazy right? I know and I was pumped!</p>
<blockquote>
<p>I mean what if I fumble up, make some dumb mistake or what if I get a stage fear while presenting?</p>
</blockquote>
<p>Genuine questions, but I was not thinking of any of the above when I started my trip from Bangalore.</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">BLR ✈️ BKK ✈️ TPE. I thought jet lag wasn't a thing. I was wrong. <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a> <a href="https://twitter.com/hashtag/python?src=hash">#python</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/872789142010122240">June 8, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>Travelling for a whole day straight does take a toll on your body. And oh boy, the body aches!</p>
<h2 id="day-1">Day 1</h2>
<p>The keynote was given by <a href="http://www.csie.ntu.edu.tw/~htlin/">Hsuan-Tien Lin</a>, who is currently working at Appier. His talk revolved around the choices one should make while building an AI system.</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">It's a full house at <a href="https://twitter.com/PyConTW">@PyConTW</a> 🙌 <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a> <a href="https://twitter.com/hashtag/day1?src=hash">#day1</a> <a href="https://t.co/mV2XTdtZ2L">pic.twitter.com/mV2XTdtZ2L</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873214548534206465">June 9, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>This was followed by another keynote by <a href="https://speakerdeck.com/willingc/the-state-of-python-for-education">Carol Willing</a> which revolved around python for education highlighting recent innovations from Science, Data Science, web, and electronics (Raspberry Pi, MicroPython and CircuitPython) as well as a brief recap of innovative ideas from PyCon in Portland.</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">"The State of Python for Education" by our keynote speaker <a href="https://twitter.com/WillingCarol">@WillingCarol</a> 😄 <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a> <a href="https://t.co/fGfP5BTSoT">pic.twitter.com/fGfP5BTSoT</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873051990317608961">June 9, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>The talks I attended that day were <a href="https://tw.pycon.org/2017/events/talk/325227434406314071/">Python module in rust</a> and <a href="https://speakerdeck.com/dawny33/understanding-serverless-architectures">Understanding Serverless Architecture</a>. The rust talk was positioned more as an intro to the language itself and to how it is somewhat similar to Python in terms of packaging and distribution of modules. I noticed quite a few other similarities too; Rust’s closures, for instance, are very similar to Python’s lambdas. It got really interesting when the speaker went into the details of how we could use rust code inside python, just as one usually uses cffi, Cython or ctypes to achieve the same.</p>
<p>The serverless talk delved into how one could leverage this new architectural style, which means not managing a server at all! Crazy stuff, I would say, and a great talk by my friend <a href="https://twitter.com/data__wizard/">Rohit</a>.</p>
<p>Day 1 ended with all the speakers and organisers for PyCon Taiwan heading to the speaker night being held at Courtyard Taipei. And the food was delightful.</p>
<h2 id="day-2">Day 2</h2>
<p><a href="https://twitter.com/andrewgodwin">Andrew Godwin</a> gave his keynote, which revolved around how most software is never designed with failure in mind, and how the consequences can range from slightly annoying to life-threatening. He gave an insight into how one industry, aviation, has already tackled failure incredibly effectively, and what lessons we can learn from it and apply to our own code, both local and distributed.</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Skynet? Nah. It's with us 😀 <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a> <a href="https://t.co/OIkvXU1eZK">pic.twitter.com/OIkvXU1eZK</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873053362765217792">June 9, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>Some other talks which I attended were <a href="https://www.slideshare.net/penvirus/global-interpreter-lock-episode-iii-cat-lt-devzero-gil">Global Interpreter Lock: Episode III - cat < /dev/zero > GIL</a> and <a href="https://www.slideshare.net/ssuser2cbb78/pycon-tw-2017-why-do-projects-fail-lets-talk-about-the-story-of-sinonpy">Why do projects fail? Let’s talk about the story of Sinon.PY</a></p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/data__wizard">@data__wizard</a> here at <a href="https://twitter.com/PyConTW">@PyConTW</a> with his Serverless talk. You go boy! 😀 <a href="https://twitter.com/hashtag/pycontw?src=hash">#pycontw</a> <a href="https://t.co/pySudXYnf4">pic.twitter.com/pySudXYnf4</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873091601484070912">June 9, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>Later that day we had PyNight, which started with a brief performance by my friend Adrian, and he plays really well!</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">PyNight solo by Adrian over here. And this guy has some serious 🎹 skills I must say! <a href="https://twitter.com/hashtag/PyConTW?src=hash">#PyConTW</a> <a href="https://t.co/2QmBbmvZCg">pic.twitter.com/2QmBbmvZCg</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873491014148685824">June 10, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>I finally got to have a chat with Andrew and Russell, and it was a fanboy moment for me.</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Had a great convo about OSS revenue models, personal projects, community building & Django Channels w/ <a href="https://twitter.com/andrewgodwin">@andrewgodwin</a> <a href="https://twitter.com/freakboy3742">@freakboy3742</a> <a href="https://twitter.com/hashtag/PyConTW?src=hash">#PyConTW</a> <a href="https://t.co/NgcQGRFQDn">pic.twitter.com/NgcQGRFQDn</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/873551798501482497">June 10, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<h2 id="day-3">Day 3</h2>
<p>As we headed into the last day of the conference, we started with a keynote by <a href="https://twitter.com/freakboy3742">Russell</a>, where he discussed the lessons we can learn from the past and present of open source communities, so that we can ensure we have a strong, healthy Python community going into the future.</p>
<p>The only talk I could attend that day was <a href="https://www.slideshare.net/masahitojp/the-benefits-of-type-hintss">“enjoy type hinting and its benefits”</a>.</p>
<p>I was mostly making some last minute changes to the slides for my talk slotted for the afternoon session.</p>
<center><img src="/content/images/2017/06/tasdik-pycontw-1.jpg" /></center>
<center><img src="/content/images/2017/06/tasdik-pycontw-2.jpg" /></center>
<blockquote>
<p>I was happy that I had finally given my talk! :)</p>
</blockquote>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Talk slides for my talk at <a href="https://twitter.com/PyConTW">@PyConTW</a> titled "Diving deep on how imports work in Python" <a href="https://t.co/i7UYC2oKWC">https://t.co/i7UYC2oKWC</a> <a href="https://twitter.com/hashtag/PyConTW?src=hash">#PyConTW</a> <a href="https://twitter.com/hashtag/Python?src=hash">#Python</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/874185974908977152">June 12, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
<p>And it was a wrap up for PyCon Taiwan 2017.</p>
<p>I had a great time and made amazing friends: Adrian, Tzu, Keith, Mike, Yahsin, and I am missing a lot of names in this list.</p>
<p>You will be missed Taiwan!</p>
<center><blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">So it's a wrap up for <a href="https://twitter.com/PyConTW">@PyConTW</a>. Had a great time with the community back there. Missing you guys already! TPE ✈️ BKK ✈️ BLR <a href="https://twitter.com/hashtag/PyConTW?src=hash">#PyConTW</a> <a href="https://t.co/2xlQd5Mwz8">pic.twitter.com/2xlQd5Mwz8</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/874049878371872768">June 11, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script></center>
Implementing Role Based Access Control
2017-06-01T00:00:00+00:00
https://www.tasdikrahman.com/2017/06/01/Implementing-role-based-access-Control-easyrbac-python
<blockquote>
<p>You can find the python module for implementing RBAC0 here at <a href="https://github.com/tasdikrahman/easyrbac">https://github.com/tasdikrahman/easyrbac</a></p>
</blockquote>
<h2 id="main-idea-behind-it">Main Idea behind it</h2>
<p>Say I have some 100 users in my system. For each user, I need some form of ACL the system can use to decide whether they are authorised to perform various actions on resources. Meaning, actors should be able to perform only those actions for which they have authorisation.</p>
<p>How do you solve that?</p>
<p>Do you remember your English class <em>teacher</em>? I sure do. She used to tell all kinds of interesting facts in and around Indian history. Anyways.</p>
<p>Do you see the word teacher here? What is that?</p>
<p>A role? So whoever was a <em>teacher</em> had the <code class="language-plaintext highlighter-rouge">role</code> of a teacher in the school?</p>
<p>What permissions/privileges did they have on the resources? Did they need special permissions/authorisations assigned on a per-teacher basis? (Yes, occasionally: let’s say the CS teacher gets access to the CS labs at any time, but that would be an exception.) More or less, they had a lot of responsibilities, or so to speak privileges, in common among themselves.</p>
<p>So it would be common sense to group them (teachers) together and create an entity (a role, in this case).</p>
<p>How does this help?</p>
<p>Now instead of defining some 100 rules for 100 teachers, I can create a <code class="language-plaintext highlighter-rouge">Role</code> called <code class="language-plaintext highlighter-rouge">teacher</code> and create users which would be assigned the role of a teacher.</p>
<p>This way I can easily manage the permissions for all the 100 teachers without getting repetitive as now I would just need to edit rules at the role level and not the 100 Users which were assigned the role of a teacher.</p>
<p>I also get the freedom to easily delete a role from a user in a cleaner manner. Imagine writing individual ACL policies for every user out there. Horror right?</p>
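<p>To make this concrete, here is a minimal, hypothetical sketch in Python (plain dicts and sets, not any real ACL library) of why rules at the role level beat rules at the user level:</p>

```python
# Hypothetical sketch: permissions hang off roles; users only carry role names.
ROLE_PERMISSIONS = {
    "teacher": {"enter_staff_room", "grade_assignments"},
}

# 100 teachers share a single rule set instead of 100 copies of it.
users = {f"teacher_{i}": ["teacher"] for i in range(100)}

def is_allowed(user, action):
    # A user may perform an action if any of their roles grants it.
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in users[user])

assert is_allowed("teacher_42", "grade_assignments")
assert not is_allowed("teacher_42", "inspect_school")

# Revoking a privilege for all 100 teachers is one edit at the role level:
ROLE_PERMISSIONS["teacher"].discard("enter_staff_room")
assert not is_allowed("teacher_7", "enter_staff_room")
```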
<p>Analogous to this is <code class="language-plaintext highlighter-rouge">iptables</code>: that model does not scale well once you have more than a couple of users in your system. In practice, managing your existing rules becomes a circus and a huge man-hour sink. It also increases the chance of human error while doing so. Editing the rules or removing a user from them? Even harder.</p>
<p>This is what <code class="language-plaintext highlighter-rouge">ufw</code> solves for you.</p>
<h2 id="introduction-to-rbac">Introduction to RBAC</h2>
<p>RBAC is Role Based Access Control: a powerful complement to traditional access control strategies, the most manageable model (who is allowed to do what, and how), and one of the most popular access control mechanisms, as it greatly reduces the workload of security administrators.</p>
<p>Here, the role acts as an authorised intermediary: permissions are assigned to roles, and a user obtains access rights by being assigned (playing) one or more roles.</p>
<center><img src="/content/images/2017/06/rbac_model.jpg" /></center>
<p>Professor Sandhu has done a great job of explaining everything. Do check out his article, which I have pointed to at the end.</p>
<p>So I would use RBAC for its</p>
<h2 id="advantages">Advantages</h2>
<ul>
<li>Easy to manage</li>
<li>Easy to classify according to work needs</li>
<li>To grant the minimum privilege</li>
</ul>
<h2 id="rbac-mainstream-model">RBAC mainstream model</h2>
<p>Layered RBAC basic model</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">RBAC0</code>: the core RBAC model</li>
<li><code class="language-plaintext highlighter-rouge">RBAC1</code>: RBAC0 plus role inheritance (RH)</li>
<li><code class="language-plaintext highlighter-rouge">RBAC2</code>: RBAC0 plus constraints</li>
<li><code class="language-plaintext highlighter-rouge">RBAC3</code>: contains all of the above levels and is the complete model</li>
</ul>
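<p>As a rough illustration of what RBAC1 adds on top of RBAC0, here is a small sketch (my own illustration, not easyrbac’s API) where a role’s effective permissions include those inherited from its parent roles:</p>

```python
# Illustrative sketch of RBAC1-style role inheritance; the Role class here
# is hypothetical and not easyrbac's current API.
class Role:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.permissions = set()

    def effective_permissions(self):
        # A role's effective permissions are its own plus its parents',
        # collected recursively up the hierarchy.
        perms = set(self.permissions)
        for parent in self.parents:
            perms |= parent.effective_permissions()
        return perms

staff = Role("staff")
staff.permissions.add("read:timetable")

teacher = Role("teacher", parents=[staff])  # teacher inherits from staff
teacher.permissions.add("write:grades")

assert teacher.effective_permissions() == {"read:timetable", "write:grades"}
```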
<h2 id="easyrbac">easyrbac</h2>
<p>The easiest way for me to grok and retain all I had read was to make something out of it, and <code class="language-plaintext highlighter-rouge">easyrbac</code> was born. I have implemented <code class="language-plaintext highlighter-rouge">RBAC0</code> for this release. The next release will focus on <code class="language-plaintext highlighter-rouge">RBAC1</code>, which adds role inheritance.</p>
<p><code class="language-plaintext highlighter-rouge">easyrbac</code> has a very simple API to interact around and create <code class="language-plaintext highlighter-rouge">Roles</code> and <code class="language-plaintext highlighter-rouge">Users</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">easyrbac</span> <span class="kn">import</span> <span class="n">Role</span><span class="p">,</span> <span class="n">User</span>
<span class="n">everyone_role</span> <span class="o">=</span> <span class="n">Role</span><span class="p">(</span><span class="s">'everyone'</span><span class="p">)</span>
<span class="n">admin_role</span> <span class="o">=</span> <span class="n">Role</span><span class="p">(</span><span class="s">'admin'</span><span class="p">)</span>
<span class="n">everyone_user</span> <span class="o">=</span> <span class="n">User</span><span class="p">(</span><span class="n">roles</span><span class="o">=</span><span class="p">[</span><span class="n">everyone_role</span><span class="p">])</span>
<span class="n">admin_user</span> <span class="o">=</span> <span class="n">User</span><span class="p">(</span><span class="n">roles</span><span class="o">=</span><span class="p">[</span><span class="n">admin_role</span><span class="p">,</span> <span class="n">everyone_role</span><span class="p">])</span>
</code></pre></div></div>
<p>Allocating resource access permissions to users:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">acl</span> <span class="o">=</span> <span class="n">AccessControlList</span><span class="p">()</span>
<span class="n">acl</span><span class="p">.</span><span class="n">resource_read_rule</span><span class="p">(</span><span class="n">everyone_role</span><span class="p">,</span> <span class="s">'GET'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/info'</span><span class="p">)</span>
<span class="n">acl</span><span class="p">.</span><span class="n">resource_delete_rule</span><span class="p">(</span><span class="n">admin_role</span><span class="p">,</span> <span class="s">'DELETE'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/'</span><span class="p">)</span>
<span class="c1"># checking READ operation on resource for user `everyone_user`
</span><span class="k">for</span> <span class="n">user_role</span> <span class="ow">in</span> <span class="p">[</span><span class="n">role</span><span class="p">.</span><span class="n">get_name</span><span class="p">()</span> <span class="k">for</span> <span class="n">role</span> <span class="ow">in</span> <span class="n">everyone_user</span><span class="p">.</span><span class="n">get_roles</span><span class="p">()]:</span>
<span class="k">assert</span> <span class="n">acl</span><span class="p">.</span><span class="n">is_read_allowed</span><span class="p">(</span><span class="n">user_role</span><span class="p">,</span> <span class="s">'GET'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/info'</span><span class="p">)</span> <span class="o">==</span> <span class="bp">True</span>
<span class="c1"># checking WRITE operation on resource for user `everyone_user`
# Since no rule has been defined for this particular resource, it will disallow any such operation by default.
</span><span class="k">for</span> <span class="n">user_role</span> <span class="ow">in</span> <span class="p">[</span><span class="n">role</span><span class="p">.</span><span class="n">get_name</span><span class="p">()</span> <span class="k">for</span> <span class="n">role</span> <span class="ow">in</span> <span class="n">everyone_user</span><span class="p">.</span><span class="n">get_roles</span><span class="p">()]:</span>
<span class="k">assert</span> <span class="n">acl</span><span class="p">.</span><span class="n">is_write_allowed</span><span class="p">(</span><span class="n">user_role</span><span class="p">,</span> <span class="s">'WRITE'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/info'</span><span class="p">)</span> <span class="o">==</span> <span class="bp">False</span>
<span class="c1"># checking DELETE operation on resource for user `admin_user`
</span><span class="k">for</span> <span class="n">user_role</span> <span class="ow">in</span> <span class="p">[</span><span class="n">role</span><span class="p">.</span><span class="n">get_name</span><span class="p">()</span> <span class="k">for</span> <span class="n">role</span> <span class="ow">in</span> <span class="n">admin_user</span><span class="p">.</span><span class="n">get_roles</span><span class="p">()]:</span>
<span class="k">if</span> <span class="n">user_role</span> <span class="o">==</span> <span class="s">'admin'</span><span class="p">:</span> <span class="c1"># as a user can have more than one role assigned to them
</span> <span class="k">assert</span> <span class="n">acl</span><span class="p">.</span><span class="n">is_delete_allowed</span><span class="p">(</span><span class="n">user_role</span><span class="p">,</span> <span class="s">'DELETE'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/'</span><span class="p">)</span> <span class="o">==</span> <span class="bp">True</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">assert</span> <span class="n">acl</span><span class="p">.</span><span class="n">is_delete_allowed</span><span class="p">(</span><span class="n">user_role</span><span class="p">,</span> <span class="s">'DELETE'</span><span class="p">,</span> <span class="s">'/api/v1/employee/1/'</span><span class="p">)</span> <span class="o">==</span> <span class="bp">False</span>
</code></pre></div></div>
<h2 id="future-work">Future work</h2>
<ul>
<li>Adding hierarchical roles, which represent parent-child relations</li>
<li>Adding this on top of Bottle/Flask</li>
</ul>
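<p>As a taste of what the Bottle/Flask integration could look like, here is a purely hypothetical sketch; <code class="language-plaintext highlighter-rouge">StubACL</code> and <code class="language-plaintext highlighter-rouge">require_read</code> are illustrative stand-ins, not easyrbac’s actual API:</p>

```python
from functools import wraps

class StubACL:
    """Stand-in for an ACL object; real easyrbac exposes a richer API."""
    def __init__(self, allowed_roles):
        self.allowed_roles = set(allowed_roles)

    def is_read_allowed(self, role, method, resource):
        # Simplified lookup: any listed role may read the resource.
        return role in self.allowed_roles

def require_read(acl, resource):
    # Decorator that gates a view on a read-permission check.
    def decorator(view):
        @wraps(view)
        def wrapper(user_roles, *args, **kwargs):
            if not any(acl.is_read_allowed(r, "GET", resource) for r in user_roles):
                return "403 Forbidden"
            return view(user_roles, *args, **kwargs)
        return wrapper
    return decorator

@require_read(StubACL(["everyone"]), "/api/v1/employee/1/info")
def employee_info(user_roles):
    return "employee #1 info"

print(employee_info(["everyone"]))  # employee #1 info
print(employee_info(["guest"]))     # 403 Forbidden
```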
<h2 id="literature-material">Literature material</h2>
<ul>
<li><a href="http://profsandhu.com/articles/advcom/adv_comp_rbac.pdf">http://profsandhu.com/articles/advcom/adv_comp_rbac.pdf</a></li>
<li><a href="http://www.comp.nus.edu.sg/~tankl/cs5322/readings/rbac1.pdf">http://www.comp.nus.edu.sg/~tankl/cs5322/readings/rbac1.pdf</a></li>
</ul>
<h2 id="github-repo">Github repo</h2>
<ul>
<li><a href="https://github.com/tasdikrahman/easyrbac">https://github.com/tasdikrahman/easyrbac</a></li>
</ul>
<p>Cheerio</p>
Learnings from analyzing my compromised server (Linode)
2017-05-25T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/25/Learnings-from-analyzing-my-compromised-server-linode-Google-Summer-of-Code-oVirt-2017
<blockquote>
<p>DISCLAIMER: All views presented are personal and not those of my employers or anyone else for that matter. On no occasion do I blame Linode for this security breach. It happened because I did not follow the best practices, which you will read about below; don’t repeat my mistakes.</p>
</blockquote>
<p>Yesterday, I was having a great day!</p>
<p>I had the daily goal of walking 6000 steps done, thanks to my swanky new Mi Band 1 (having bade goodbye to my Sony SWR10). I wrote about installing <code class="language-plaintext highlighter-rouge">ovirt-engine</code> using an ansible-playbook in a post <a href="http://www.tasdikrahman.com/2017/05/24/Installing-ovirt-4.1-on-centos-7-using-ansible-linode-Google-Summer-of-Code-oVirt-2017/">yesterday</a>. For a change, my mom wasn’t telling me to get a haircut, and she suggested I start packing some proper clothes (she means washed and ironed) for my upcoming <a href="https://tw.pycon.org/2017/en-us/events/talk/342865744498786414/">trip to Taiwan</a>, which would be my first international trip.</p>
<p>It takes some getting used to, waking up to your significant other’s video call early in the morning. And I tell you, if you had some bad sleep like me yesterday, you would hardly be in the mood for it. Sorry, sunshine.</p>
<p>Anyways. So about yesterday, I finally decided to shift from DigitalOcean to Linode. As evident from my tweet yesterday.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tasdikrahman">@tasdikrahman</a> <a href="https://twitter.com/linode">@linode</a> Now, I'm not trying to entice you further, but you could use the code 'linode10' and get some credit to start with. It's nice to have.</p>— Feeling OK (@FeelingSohSoh) <a href="https://twitter.com/FeelingSohSoh/status/867382204430766080">May 24, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Thanks for the credits Linode. Appreciate it :)</p>
<p>Fast forward some hours: I have a 4 GB CentOS 7 box up and running in a Singapore datacenter. After some failed attempts, I got my <code class="language-plaintext highlighter-rouge">ansible-playbook</code> to run on the remote machine, which installed ovirt-engine on it.</p>
<h2 id="happy-period-quickly-turning-to-a-bad-one">Happy period quickly turning to a bad one</h2>
<p>Everything is fine and dandy, and I am watching a basketball match with my friends after office hours. Four minutes left to the final whistle, a cracking match on, everyone playing like a pro. A very close call between the two teams, but one gets the better of the other.</p>
<p>Returning home, I get this buzz on my phone. Turns out it’s an email from Linode. Daym, I thought, was I billed already?</p>
<center><img src="/content/images/2017/05/outbound-network-linode-mail.png" /></center>
<p>Trust me on this, I was really not sure what to make of this for the first two minutes after I read the email.</p>
<p>I opened the Linode admin panel to check what my server was up to. The CPU graph had jumped off the charts.</p>
<center><img src="/content/images/2017/05/CPU-chart-linode-1.png" /></center>
<p>Same was the case with the network graph</p>
<center><img src="/content/images/2017/05/network-chat-linode-1.png" /></center>
<p>Looking at the network logs suggested a high amount of outbound traffic coming from my server, further confirming the Linode support ticket that I got.</p>
<center><img src="/content/images/2017/05/network-io-linode-1.png" /></center>
<p>I ssh’d inside my server to see what was going on.</p>
<center><img src="/content/images/2017/05/failed-ssh-tries.png" /></center>
<p>I will be damned. I don’t remember sleep typing my password continuously for that long!</p>
<p>Let me tell you, you don’t do a <code class="language-plaintext highlighter-rouge">cat /var/log/secure</code> at this point, as the file would just be spat at you continuously with no end in sight.</p>
<p>Did a <code class="language-plaintext highlighter-rouge">head</code> (even a <code class="language-plaintext highlighter-rouge">tail</code> would do) on it instead. Going through the start of the file, everything was fine until I started seeing extremely short intervals between consecutive failed attempts. This confirmed my hunch that some script kiddie was trying to brute-force the <code class="language-plaintext highlighter-rouge">root</code> login.</p>
<center><img src="/content/images/2017/05/var-log-secure-failed-attempts.png" /></center>
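<p>If you want to quantify an attack like this, a small pipeline does it. The sketch below runs against a fabricated three-line sample (the IPs are documentation addresses, not the real attacker); on a live box you would feed it <code class="language-plaintext highlighter-rouge">/var/log/secure</code> instead.</p>

```shell
#!/bin/sh
# Count failed SSH logins per source IP, busiest first.
# awk's $(NF-3) picks the IP whether or not the line says "invalid user".
count_failures() {
  grep 'Failed password' | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn
}

# Demo on a fabricated sample; replace the heredoc with
#   count_failures < /var/log/secure
# on a real server.
count_failures <<'EOF'
May 25 02:11:01 host sshd[123]: Failed password for root from 198.51.100.7 port 4242 ssh2
May 25 02:11:02 host sshd[124]: Failed password for root from 198.51.100.7 port 4243 ssh2
May 25 02:11:03 host sshd[125]: Failed password for invalid user admin from 203.0.113.9 port 999 ssh2
EOF
```

It prints each source IP with its attempt count, so the brute-forcing host jumps straight out of the noise.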
<p>I know, I should have disabled root login from the start and used SSH keys to access my server. But I delayed it until the next day. My fault.</p>
<blockquote>
<p>The logical thing now would be to start <code class="language-plaintext highlighter-rouge">iptables</code> (or) <code class="language-plaintext highlighter-rouge">ufw</code> and block outbound traffic as well as inbound except required stuff. Then take all the logs and look at them.</p>
</blockquote>
<h2 id="yum-install-breaks">yum install breaks</h2>
<center><img src="/content/images/2017/05/ufw-notfound.png" /></center>
<p>Now, many Linux distributions ship a firewall frontend; Ubuntu, for example, ships with <code class="language-plaintext highlighter-rouge">ufw</code> by default. CentOS, as it turns out, does not ship <code class="language-plaintext highlighter-rouge">ufw</code> at all (it comes with <code class="language-plaintext highlighter-rouge">firewalld</code> on top of <code class="language-plaintext highlighter-rouge">iptables</code> instead), but at the time my hunch was that the perpetrator must have removed it altogether.</p>
<p>No problemo, I could do a <code class="language-plaintext highlighter-rouge">yum install ufw</code> right?</p>
<center><img src="/content/images/2017/05/yum-timeout.png" /></center>
<p>I try installing it, and the network calls the server makes to the package mirrors keep timing out. The same goes for other packages like <code class="language-plaintext highlighter-rouge">strace</code>, <code class="language-plaintext highlighter-rouge">tcpdump</code> et al.</p>
<center><img src="/content/images/2017/05/yum-timeout-1.png" /></center>
<p>The connection is really sluggish even though I am on a network with very low latency.</p>
<h2 id="when-ufw-fails-go-back-to-good-old-iptables">when <code class="language-plaintext highlighter-rouge">ufw</code> fails go back to good old <code class="language-plaintext highlighter-rouge">iptables</code></h2>
<p><code class="language-plaintext highlighter-rouge">ufw</code> stands for <strong>uncomplicated firewall</strong>; it exists so that you don’t have to deal directly with the low-level intricacies of <code class="language-plaintext highlighter-rouge">iptables</code>.</p>
<p>With <code class="language-plaintext highlighter-rouge">ufw</code>, blocking all incoming traffic from a given IP address would have been as simple as</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>ufw deny from <ip-address>
</code></pre></div></div>
<p>But since I don’t have it, falling back to <code class="language-plaintext highlighter-rouge">iptables</code>.</p>
<center><img src="/content/images/2017/05/iptables-drop-ip.png" /></center>
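<p>For reference, the kind of rules I ran looked roughly like the following. The IP address is a placeholder, the commands need root, and on a remote box you have to be careful that the default-deny does not cut off your own SSH session.</p>

```shell
# Drop everything coming from the attacking IP (placeholder address):
iptables -A INPUT -s 198.51.100.7 -j DROP

# Default-deny outbound traffic, allowing only what is needed:
iptables -A OUTPUT -o lo -j ACCEPT                                  # loopback
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT   # replies, incl. my ssh session
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT                      # DNS lookups
iptables -P OUTPUT DROP                                             # drop the rest
```

The ESTABLISHED,RELATED rule is the important one: it lets already-accepted inbound connections keep answering while cutting off anything the malware tries to initiate outbound.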
<h2 id="checking-all-outgoing-requestsconnections-from-the-server">Checking all outgoing requests/connections from the server</h2>
<p>For that I did a <code class="language-plaintext highlighter-rouge">$ netstat -nputw</code></p>
<p>This lists all <code class="language-plaintext highlighter-rouge">UDP (u)</code>, <code class="language-plaintext highlighter-rouge">TCP (t)</code> and <code class="language-plaintext highlighter-rouge">RAW (w)</code> connections (not just listening sockets, since <code class="language-plaintext highlighter-rouge">l</code> and <code class="language-plaintext highlighter-rouge">a</code> are omitted) in numeric form (<code class="language-plaintext highlighter-rouge">n</code>, which prevents possibly long-running reverse-DNS queries) and includes the program (<code class="language-plaintext highlighter-rouge">p</code>) associated with each connection.</p>
<center><img src="/content/images/2017/05/netstat.png" /></center>
<p>What has a <code class="language-plaintext highlighter-rouge">cd</code> command got to do with making network connections?</p>
<p>Did a <code class="language-plaintext highlighter-rouge">$ ps aux</code> to check all the system processes, and the same thing stood out here too.</p>
<center><img src="/content/images/2017/05/ps-aux-cd.png" /></center>
<p>It was taking up 95% of the CPU! <code class="language-plaintext highlighter-rouge">cd</code> is a shell builtin, not even a standalone binary; when was the last time you saw it show up in <code class="language-plaintext highlighter-rouge">ps</code> at all, let alone doing that?</p>
<p>Digging further, I wanted to see which files were being opened by our, at this point I may say impostor, <code class="language-plaintext highlighter-rouge">cd</code> program.</p>
<center><img src="/content/images/2017/05/lsof-cd.png" /></center>
<p>The perpetrator had been running his malicious program under <code class="language-plaintext highlighter-rouge">/usr/bin</code>, which obviously meant he had gained root access to my server. I could think of no other way he could have placed it under <code class="language-plaintext highlighter-rouge">/usr/bin</code> otherwise.</p>
<p>If you suspect someone else is logged in too, you can easily check that with a</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>netstat <span class="nt">-nalp</span> | <span class="nb">grep</span> ":22"
</code></pre></div></div>
<p>To double check that he did get in, I ran a</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /var/log/secure<span class="k">*</span> | <span class="nb">grep </span>ssh | <span class="nb">grep </span>Accept
</code></pre></div></div>
<p>And not so surprisingly,</p>
<center><img src="/content/images/2017/05/he-got-in.png" /></center>
<p>At this point I don’t have anything like <code class="language-plaintext highlighter-rouge">strace</code> or <code class="language-plaintext highlighter-rouge">gdb</code> to attach to it and see what this program is doing, given, as mentioned earlier, the inability to install anything using <code class="language-plaintext highlighter-rouge">yum</code> over the network.</p>
<p>Compiling from source?</p>
<p><code class="language-plaintext highlighter-rouge">wget</code> was nowhere to be found!</p>
<p>I could only think of killing the process and seeing what happened. I did a <code class="language-plaintext highlighter-rouge">$ kill -9 3618</code> to kill it and checked <code class="language-plaintext highlighter-rouge">ps aux</code> again, looking for the program, only to find it back as a new process.</p>
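<p>In hindsight, a process that survives <code class="language-plaintext highlighter-rouge">kill -9</code> usually has a supervisor or a persistence hook somewhere. A sketch of what could be checked before killing it (demonstrated here on the current shell, since the malicious PID is long gone; the persistence locations are the usual CentOS suspects):</p>

```shell
#!/bin/sh
# Inspect a suspicious process via ps and /proc before killing it.
pid=$$   # demo: the current shell; substitute the suspicious PID (3618 in my case)

ps -o ppid=,comm= -p "$pid"                # who is the parent process?
ls -l "/proc/$pid/exe"                     # the binary actually being executed
tr '\0' ' ' < "/proc/$pid/cmdline"; echo   # the real command line, NUL-separated on disk

# Common persistence spots a respawning process may be abusing:
cat /etc/crontab /var/spool/cron/* 2>/dev/null
ls /etc/cron.d /etc/rc.d/init.d 2>/dev/null
systemctl list-unit-files --state=enabled 2>/dev/null | head
```

If the parent turns out to be another unfamiliar process, that parent is the thing to investigate (and kill) first.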
<h2 id="conjuring-3">Conjuring 3</h2>
<center><img src="/content/images/2017/05/cd-resurfaces.png" /></center>
<p>Again killing it and checking the processes. This time I don’t see it.</p>
<p>But I was wrong; there was another process hogging the CPU in the same fashion <code class="language-plaintext highlighter-rouge">cd</code> had been.</p>
<center><img src="/content/images/2017/05/resurface-as-ifconfig.png" /></center>
<p>Tbh, this was bat-shoot-crazy stuff for me, and I was feeling the adrenaline pump. Sleep was a lost cause even though it was around 3 a.m. This was way too exciting to be left for the morning!</p>
<p>I checked <code class="language-plaintext highlighter-rouge">ps aux</code> twice, just to double-check whether anything along those lines had resurfaced. And there was nothing this time to send a <code class="language-plaintext highlighter-rouge">SIGKILL</code> to.</p>
<h2 id="relief">Relief?</h2>
<p>I try installing packages through <code class="language-plaintext highlighter-rouge">yum</code> and it’s the same old story of network timeouts.</p>
<p>You might also be wondering why I did not check the <code class="language-plaintext highlighter-rouge">bash history</code>. Trust me, it was clean! I could only see the things I had typed myself. Whatever it was, it was smart. Then again, leaving nothing behind is just the basics of covering your tracks.</p>
<h2 id="nmap-away-all-the-things">nmap away all the things</h2>
<p>Cause why not?</p>
<p>Ran a quick scan on the server using</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>nmap ovirtlinode.gsoc.org
Starting Nmap 7.40 <span class="o">(</span> https://nmap.org <span class="o">)</span> at 2017-05-25 14:35 IST
Nmap scan report <span class="k">for </span>ovirtlinode.gsoc.org
Host is up <span class="o">(</span>0.035s latency<span class="o">)</span><span class="nb">.</span>
Not shown: 978 filtered ports
PORT STATE SERVICE
22/tcp open ssh
25/tcp closed smtp
80/tcp open http
113/tcp closed ident
179/tcp closed bgp
443/tcp open https
1723/tcp closed pptp
2000/tcp closed cisco-sccp
2222/tcp open EtherNetIP-1
5432/tcp open postgresql
6000/tcp closed X11
6001/tcp closed X11:1
6002/tcp closed X11:2
6003/tcp closed X11:3
6004/tcp closed X11:4
6005/tcp closed X11:5
6006/tcp closed X11:6
6007/tcp closed X11:7
6009/tcp closed X11:9
6025/tcp closed x11
6059/tcp closed X11:59
6100/tcp open synchronet-db
</code></pre></div></div>
<p>Checking the UDP connections</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>nmap <span class="nt">-sS</span> <span class="nt">-sU</span> <span class="nt">-T4</span> <span class="nt">-A</span> <span class="nt">-v</span> ovirtlinode.gsoc.org
Starting Nmap 7.40 <span class="o">(</span> https://nmap.org <span class="o">)</span> at 2017-05-25 14:37 IST
NSE: Loaded 143 scripts <span class="k">for </span>scanning.
NSE: Script Pre-scanning.
Initiating NSE at 14:37
Completed NSE at 14:37, 0.00s elapsed
Initiating NSE at 14:37
Completed NSE at 14:37, 0.00s elapsed
Initiating Ping Scan at 14:37
Scanning ovirtlinode.gsoc.org <span class="o">(</span>172.104.36.181<span class="o">)</span> <span class="o">[</span>4 ports]
Completed Ping Scan at 14:37, 0.01s elapsed <span class="o">(</span>1 total hosts<span class="o">)</span>
Initiating SYN Stealth Scan at 14:37
Scanning ovirtlinode.gsoc.org <span class="o">(</span>172.104.36.181<span class="o">)</span> <span class="o">[</span>1000 ports]
Discovered open port 80/tcp on 172.104.36.181
Discovered open port 443/tcp on 172.104.36.181
Discovered open port 22/tcp on 172.104.36.181
Discovered open port 6100/tcp on 172.104.36.181
Discovered open port 5432/tcp on 172.104.36.181
Discovered open port 2222/tcp on 172.104.36.181
Completed SYN Stealth Scan at 14:37, 4.22s elapsed <span class="o">(</span>1000 total ports<span class="o">)</span>
Initiating UDP Scan at 14:37
Scanning ovirtlinode.gsoc.org <span class="o">(</span>172.104.36.181<span class="o">)</span> <span class="o">[</span>1000 ports]
Completed UDP Scan at 14:37, 4.19s elapsed <span class="o">(</span>1000 total ports<span class="o">)</span>
Initiating Service scan at 14:37
</code></pre></div></div>
<p>Did not get much out of it.</p>
<p>So this brings me to the point where the payload the perpetrator had planted went dormant. I did not see any other activity that would draw my attention again.</p>
<p>I let my server run for the night (stupid call but I was just plain old curious) to check in the morning what was the status.</p>
<p>Turned out there had still been a fair amount of outbound calls made during that time.</p>
<center><img src="/content/images/2017/05/still-outboung-io.png" /></center>
<p>The CPU graph showed the same pattern.</p>
<center><img src="/content/images/2017/05/cpu-graph-morn.png" /></center>
<p>I turned off the server for a brief period of time after this. The CPU graphs can be seen below during that period.</p>
<center><img src="/content/images/2017/05/final-cpu-graph.png" /></center>
<p>There were no outbound network calls after that.</p>
<center><img src="/content/images/2017/05/final-network-io.png" /></center>
<h2 id="aftermath">Aftermath</h2>
<p>I turned off the server for good. There’s no going back to it; I will be rebuilding it from a fresh image.</p>
<p>My public key happened to be on the server. This is not something one should immediately worry about. Going from public key to private key is exceptionally hard; that’s how public key cryptography works. (By “exceptionally” I mean that it’s designed to resist well-funded government efforts; if it keeps the NSA from cracking you, it’s surely good enough to stop your average Joe.)</p>
<p>But just to be on the safer side, I regenerated my ssh keys, deleted the old ones at the places it was being used and updated it with the new ones.</p>
<p>Just to be extra sure of everything, I checked the activity logs of the services which did use the older ssh keys. No unusual activities.</p>
<center><img src="/content/images/2017/05/github-logs.png" /></center>
<h2 id="learnings">Learnings</h2>
<ul>
<li>disable root password login</li>
</ul>
<p>The very first thing I should have done after provisioning the server was this. It would in fact have stopped the perpetrator from logging in as root and causing any havoc.</p>
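<p>Concretely, on CentOS 7 this comes down to two lines in <code class="language-plaintext highlighter-rouge">/etc/ssh/sshd_config</code>. A sketch (make sure key-based login is confirmed working before disabling passwords, or you will lock yourself out):</p>

```shell
# In /etc/ssh/sshd_config:
#   PermitRootLogin no
#   PasswordAuthentication no   # only once key-based login works

# Or, scripted against the stock CentOS 7 config:
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
systemctl reload sshd   # applies to new connections only; existing sessions stay up
```

Because the reload only affects new connections, you can verify a fresh key-based login from a second terminal before closing the session you edited from.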
<ul>
<li>obfuscate the port <code class="language-plaintext highlighter-rouge">sshd</code> uses</li>
</ul>
<p>Change it to something less common, as the default <code class="language-plaintext highlighter-rouge">22</code> is known to the perpetrator just as well as to you. ~As someone rightfully said, security lies in obscurity.~</p>
<p>But if you think security through obscurity makes you safe, think again.</p>
<p>Security by obscurity is a beginner’s mistake. It is one thing to keep secrets and make exploitation harder; it is another to rely on secrets for your security. Financial institutions did that for a long time until they learned it does not work.</p>
<p>It’s better to use known-good practices, protocols and techniques than to come up with your own, reinventing the wheel and trying to keep it secret. Reverse engineering has been done on everything from space rockets to smart toasters. It usually takes about 30 minutes to fingerprint an OS version along with its libraries no matter how you protect it; just looking at a ping RTT can identify the OS, for example.</p>
<p>Finally, nobody ever got blamed for using best practices. But if you try to outsmart the system, it will all be your fault.</p>
<ul>
<li>
<p>Use your public key to ssh into the machine instead of password login as suggested.</p>
</li>
<li>
<p>Take regular backups/snapshots of your server. That way, if something funny does happen, you can always restore it to a previous state.</p>
</li>
<li>
<p>I kept a really easy-to-guess, dictionary-based password. I have to admit it: this is simply the stupidest thing one can do. And yes, I did it. My only reasoning was that this was a throwaway server, but that is no excuse for not following security practices.</p>
</li>
</ul>
<p>KEEP A STRONG PASSWORD! Even if there is a brute-force attack on your server, a strong password is very hard to crack.</p>
<p>Bruce Schneier <a href="http://www.schneier.com/essay-148.html">has a nice essay</a> on the subject. I won’t repeat what he has said, so take a look at it.</p>
<p>Research suggests that password complexity requirements like upper case/numbers/symbols cause users to make easy-to-predict substitutions, and to create simpler passwords overall because complex ones are harder to remember.</p>
<p>Also read <a href="https://www.owasp.org/index.php/Authentication_Cheat_Sheet#Implement_Proper_Password_Strength_Controls">OWASP’s cheat sheet</a> on what they have to say about passwords.</p>
<p>There was a discussion on <a href="https://security.stackexchange.com/questions/29836/what-are-good-requirements-for-a-password">Security StackExchange</a> over this too.</p>
<ul>
<li>protect ssh with fail2ban</li>
</ul>
<p>Fail2ban can mitigate this problem by creating rules that automatically alter your iptables firewall configuration based on a predefined number of unsuccessful login attempts. This will allow your server to respond to illegitimate access attempts without intervention from you.</p>
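<p>A minimal version of that setup on CentOS 7 might look like the following sketch (fail2ban comes from EPEL; the jail values are illustrative, not recommendations):</p>

```shell
# Install fail2ban (from EPEL on CentOS 7):
yum install -y epel-release fail2ban

# /etc/fail2ban/jail.local -- illustrative values:
#   [sshd]
#   enabled  = true
#   maxretry = 5       # ban after 5 failed attempts...
#   findtime = 600     # ...within a 10-minute window
#   bantime  = 3600    # ban the offending IP for one hour

systemctl enable fail2ban
systemctl start fail2ban
fail2ban-client status sshd    # shows the jail's currently banned IPs
```

With a jail like this, the brute-force attempts from earlier in this post would have earned the attacker an automatic iptables ban after the fifth failure.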
<ul>
<li>Check files in the common attack locations:</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> /tmp <span class="nt">-la</span>
<span class="nv">$ </span><span class="nb">ls</span> /var/tmp <span class="nt">-la</span>
<span class="nv">$ </span><span class="nb">ls</span> /dev/shm <span class="nt">-la</span>
</code></pre></div></div>
<p>Check these to have a look for something which should not be there.</p>
<p>Unfortunately, I had already rebooted my server at that point, which caused me to lose this info.</p>
<ul>
<li>
<p>Keep a pristine copy of critical system files (such as ls, ps, netstat, md5sum) somewhere, with an md5sum of them, and compare them to the live versions regularly. Rootkits will invariably modify these files. Use these copies if you suspect the originals have been compromised.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">aide</code> or <code class="language-plaintext highlighter-rouge">tripwire</code> will tell you of any files that have been modified, assuming their databases have not themselves been tampered with.</p>
</li>
<li>
<p>Configure syslog to send your logfiles to a remote log server where they can’t be tampered with by an intruder. Watch these remote logfiles for suspicious activity.</p>
</li>
<li>Read your logs regularly; use <code class="language-plaintext highlighter-rouge">logwatch</code> or <code class="language-plaintext highlighter-rouge">logcheck</code> to synthesize the critical information.</li>
<li>Know your servers. Know what kinds of activities and logs are normal.</li>
<li>Use tools like <a href="http://www.chkrootkit.org/download/">chkrootkit</a> to check for rootkits regularly.</li>
</ul>
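<p>The pristine-copy idea above can be sketched like this. The paths are illustrative; in practice the baseline belongs on read-only or offline media, since a checksum file the intruder can rewrite proves nothing.</p>

```shell
#!/bin/sh
# Take a baseline of critical binaries right after provisioning...
mkdir -p /root/baseline
md5sum /bin/ls /bin/ps > /root/baseline/critical.md5

# ...and verify against it when you suspect a compromise.
# md5sum -c prints OK per unchanged file and FAILED for any modified one,
# and exits non-zero if anything failed.
md5sum -c /root/baseline/critical.md5
```

A trojaned <code class="language-plaintext highlighter-rouge">ls</code> or <code class="language-plaintext highlighter-rouge">ps</code> shows up immediately as a FAILED line, which is exactly the class of tampering the rootkit discussion above is about.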
<h2 id="closing-notes">Closing notes</h2>
<blockquote>
<p>How do you know if your Linux server has been hacked?</p>
</blockquote>
<p>You don’t!</p>
<p>I know, I know — but it’s the paranoid, sad truth, really ;) There are plenty of hints of course, but if the system was targeted specifically — it might be impossible to tell. It’s good to understand that nothing is ever completely secure.</p>
<p>If your system was compromised, meaning someone has root access on your host, you cannot trust anything you see, because beyond the more obvious methods like modifying ps, ls, etc., one can simply attack kernel-level system calls to subvert I/O itself. None of your system tools can be trusted to reveal the truth.</p>
<p>Simply put, you can’t trust anything your terminal tells you. Period.</p>
<p><strong>I would like to thank everyone at Linode. Their effort and dedication to maintaining our servers and their security, while ensuring an affordable price, enables us to succeed. Thanks guys</strong></p>
<p><strong>If I have made any mistakes, or you think I should have done something differently, or I have not described something as well as I should have, please feel free to point it out. I am just starting out in the infosec scene :)</strong></p>
<p>Cheers and stay secured!</p>
<p>And I just have to post this image.</p>
<center><img src="/content/images/2017/05/today-was-a-good-day-meme.jpg" /></center>
<blockquote>
<p>Catch the discussion over <a href="https://news.ycombinator.com/item?id=14416982">HN</a></p>
</blockquote>
<h2 id="further-read">Further read</h2>
<ul>
<li><a href="https://www.webair.com/community/can-check-server-hacked/">https://www.webair.com/community/can-check-server-hacked/</a></li>
<li><a href="https://serverfault.com/questions/218005/how-do-i-deal-with-a-compromised-server">https://serverfault.com/questions/218005/how-do-i-deal-with-a-compromised-server</a></li>
<li><a href="https://security.stackexchange.com/questions/7443/how-do-you-know-your-server-has-been-compromised">https://security.stackexchange.com/questions/7443/how-do-you-know-your-server-has-been-compromised</a></li>
<li><a href="https://wiki.centos.org/HowTos/Network/IPTables">https://wiki.centos.org/HowTos/Network/IPTables</a></li>
<li><a href="https://stackoverflow.com/questions/884525/how-to-investigate-what-a-process-is-doing">https://stackoverflow.com/questions/884525/how-to-investigate-what-a-process-is-doing</a></li>
<li><a href="https://jorge.fbarr.net/2014/01/19/introduction-to-strace/">https://jorge.fbarr.net/2014/01/19/introduction-to-strace/</a></li>
<li><a href="https://stackoverflow.com/questions/3972534/dsa-what-can-a-hacker-do-with-just-a-public-key">https://stackoverflow.com/questions/3972534/dsa-what-can-a-hacker-do-with-just-a-public-key</a></li>
<li><a href="https://security.stackexchange.com/questions/44094/isnt-all-security-through-obscurity">https://security.stackexchange.com/questions/44094/isnt-all-security-through-obscurity</a></li>
<li><a href="https://stackoverflow.com/questions/533965/why-is-security-through-obscurity-a-bad-idea">https://stackoverflow.com/questions/533965/why-is-security-through-obscurity-a-bad-idea</a></li>
</ul>
Using Ansible Playbooks to Install oVirt 4.1 on centOS 7 (Linode)
2017-05-24T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/24/Installing-ovirt-4.1-on-centos-7-using-ansible-linode-Google-Summer-of-Code-oVirt-2017
<p>In my <a href="http://www.tasdikrahman.com/2017/05/21/Installing-ovirt-4.1-on-centos-7-digitalocean-Google-Summer-of-Code-oVirt-2017/">previous post</a>, I played around how to install <code class="language-plaintext highlighter-rouge">ovirt-engine</code> on a remote VM.</p>
<p>Turns out we can automate the whole process using ansible playbooks!</p>
<center><img src="/content/images/2017/05/060_net_00_featured.jpg" /></center>
<p>Secondly, I have thought of shifting from <a href="https://digitalocean.com/">digitalocean</a> to <a href="https://linode.com">linode</a>. Why?</p>
<p>Well, first off, it’s not that I don’t like DigitalOcean. It’s just that the prices for the VMs I am provisioning are getting too high for me.</p>
<p>For a 4GB VM with 2 cores and 60GB of SSD, I get some 4TB of network I/O, which is generous. All this adds up to a damage of $40/month, or roughly $0.060/hour.</p>
<p>Compare this to Linode’s offerings: the same 4GB VM with 48GB of SSD storage, 2 CPU cores and 3TB of transfer costs $20/month (roughly $0.03/hour).</p>
<p>That’s like half of what I was paying DigitalOcean for my servers! That said, DigitalOcean’s ease of use is pretty good, and the one reason I would still pick it is latency: the closest Linode datacenter to me is in Singapore, while DigitalOcean has a datacenter in Bangalore.</p>
<p>Enough of me ranting around.</p>
<h2 id="show-me-the-code">Show me the Code</h2>
<p>The playbook is already up on <a href="https://github.com/rhevm-qe-automation/ovirt-ansible/">github</a>, which holds ansible roles for quite a few things.</p>
<p>Having a quick look at them</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-la</span> roles
total 0
drwxr-xr-x 13 tasdik tasdik 442 May 23 14:28 <span class="nb">.</span>
drwxr-xr-x 20 tasdik tasdik 680 May 24 11:05 ..
<span class="nt">-rw-r--r--</span> 1 tasdik tasdik 0 May 23 14:28 ansible.cfg
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-collect-logs
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-common
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-engine-backup
drwxr-xr-x 7 tasdik tasdik 238 May 23 14:28 ovirt-engine-cleanup
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-engine-config
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-engine-install-packages
drwxr-xr-x 7 tasdik tasdik 238 May 23 14:28 ovirt-engine-remote-db
drwxr-xr-x 7 tasdik tasdik 238 May 23 14:28 ovirt-engine-setup
drwxr-xr-x 7 tasdik tasdik 238 May 23 14:28 ovirt-guest-agent
drwxr-xr-x 6 tasdik tasdik 204 May 23 14:28 ovirt-iso-uploader-conf
</code></pre></div></div>
<p>I will walk through the role <code class="language-plaintext highlighter-rouge">ovirt-engine-setup</code> in this post.</p>
<p>Provision your server and <code class="language-plaintext highlighter-rouge">ssh</code> into it.</p>
<p>I have created an entry in <code class="language-plaintext highlighter-rouge">/etc/hosts</code> on my local dev box for the VM where I am going to install <code class="language-plaintext highlighter-rouge">ovirt-engine</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /etc/hosts | <span class="nb">grep </span>gsoc
xxx.xxx.xx.xxx ovirtlinode.gsoc.org
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">xxx.xxx.xx.xxx</code> would be the public IP of your server.</p>
<p>You have to change the <code class="language-plaintext highlighter-rouge">hostname</code> of your server. I will explain why in a bit.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh root@ovirtlinode.gsoc.org
root@ovirtlinode.gsoc.org<span class="s1">'s password:
[root@ovirtlinode ~]# hostnamectl set-hostname ovirtlinode.gsoc.org
[root@ovirtlinode ~]# hostname
ovirtlinode.gsoc.org
[root@ovirtlinode ~]#
</span></code></pre></div></div>
<p>Now from your local dev box</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/rhevm-qe-automation/ovirt-ansible/
<span class="nv">$ </span><span class="nb">cd </span>ovirt-ansible/
<span class="nv">$ </span><span class="nb">touch </span>site.yml inventory
</code></pre></div></div>
<p>The contents of the two files above should be along these lines:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>inventory
<span class="o">[</span>all:vars]
<span class="nv">ovirt_engine_type</span><span class="o">=</span>ovirt-engine
<span class="nv">ovirt_engine_version</span><span class="o">=</span>4.1
<span class="c"># Make sure that link to release rpm is working!!!</span>
<span class="nv">ovirt_rpm_repo</span><span class="o">=</span>http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm
<span class="nv">ovirt_engine_organization</span><span class="o">=</span>ovirtlinode.gsoc.org
<span class="nv">ovirt_engine_admin_password</span><span class="o">=</span>secret
<span class="o">[</span>engine]
ovirtlinode.gsoc.org <span class="nv">ansible_ssh_user</span><span class="o">=</span>root <span class="nv">ansible_ssh_pass</span><span class="o">=</span>secretpassword
<span class="err">$</span>
<span class="nv">$ </span><span class="nb">cat </span>site.yml
<span class="nt">---</span>
- hosts: engine
roles:
- role: ovirt-common
- role: ovirt-engine-install-packages
- role: ovirt-engine-setup
</code></pre></div></div>
<p>A quick note on the variable <code class="language-plaintext highlighter-rouge">ansible_ssh_pass=secretpassword</code>: this is not advised! Don’t do this, as your ssh password is now stored in a plain-text file which can be read by anyone. And god forbid you accidentally add this file to version control and push it.</p>
<p>Rule of thumb is to always use ssh keys to run playbooks!</p>
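<p>With a key, the <code class="language-plaintext highlighter-rouge">[engine]</code> entry from the inventory above would look something like this instead (the key path is illustrative; <code class="language-plaintext highlighter-rouge">ansible_ssh_private_key_file</code> is the inventory variable Ansible reads for this):</p>

```shell
$ cat inventory
[engine]
ovirtlinode.gsoc.org ansible_ssh_user=root ansible_ssh_private_key_file=~/.ssh/ovirt_gsoc_key
```

No secret ends up in the file, so accidentally committing the inventory leaks nothing usable.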
<p>After this, we need to check the ansible fact <code class="language-plaintext highlighter-rouge">ansible_fqdn</code>. This one is important, as we will be accessing the admin panel of
ovirt-engine using this FQDN. Otherwise our ansible playbook won’t work.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible <span class="nt">-m</span> setup <span class="nt">-i</span> inventory engine <span class="nt">-a</span> <span class="s1">'filter=ansible_fqdn'</span>
ovirtlinode.gsoc.org | SUCCESS <span class="o">=></span> <span class="o">{</span>
<span class="s2">"ansible_facts"</span>: <span class="o">{</span>
<span class="s2">"ansible_fqdn"</span>: <span class="s2">"ovirtlinode.gsoc.org"</span>
<span class="o">}</span>,
<span class="s2">"changed"</span>: <span class="nb">false</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Running the playbook, assuming you are on <code class="language-plaintext highlighter-rouge">ovirt-ansible/</code> dir</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook site.yml <span class="nt">-i</span> inventory <span class="nt">-vvv</span>
</code></pre></div></div>
<p>This might take a few minutes to get done. Get a coffee or read some xkcd while it does its thing.</p>
<p>You can now access the admin panel through the url <a href="https://ovirtlinode.gsoc.org/ovirt-engine/webadmin/">https://ovirtlinode.gsoc.org/ovirt-engine/webadmin/</a></p>
<p>The url will obviously differ for you, depending on the FQDN you decided to use during the installation process.</p>
<p>Cheerio!</p>
Installing oVirt 4.1 on centOS 7 (DigitalOcean)
2017-05-21T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/21/Installing-ovirt-4.1-on-centos-7-digitalocean-Google-Summer-of-Code-oVirt-2017
<p>Was trying to install oVirt engine on a VM deployed on DigitalOcean. My learnings from it are documented here.</p>
<h2 id="installing-ovirt-engine">Installing oVirt Engine</h2>
<p>I will concentrate on just installing oVirt Engine, as I had my fair share of problems while doing so.</p>
<p>The VM I am installing it on is a 4GB centOS 7 box with 80GB of SSD to spare for. Also, make sure you read through the whole requirements <a href="http://www.ovirt.org/documentation/quickstart/quickstart-guide/#ovirt-engine">mentioned on the official docs</a> while going forward with this.</p>
<p>A quick run-through of what oVirt is: if you have used VMware’s vSphere, this Red Hat-backed product is a competitor to it.</p>
<p>The oVirt platform consists of at least one oVirt Engine and one or more Nodes.</p>
<ul>
<li>oVirt Engine provides a graphical user interface to manage the physical and logical resources of the oVirt infrastructure.</li>
<li>oVirt Nodes run the virtual machines.</li>
</ul>
<p>oVirt Engine is the control center of the oVirt environment. It allows you to define hosts, configure data centers, add storage, define networks, create virtual machines, manage user permissions and use templates from one central location.</p>
<h2 id="installation">Installation</h2>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# yum update
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# yum <span class="nb">install </span>http://resources.ovirt.org/pub/yum-repo/ovirt-release41.rpm
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# yum <span class="nt">-y</span> <span class="nb">install </span>ovirt-engine
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# engine-setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Initializing
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment setup
Configuration files: <span class="o">[</span><span class="s1">'/etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf'</span>, <span class="s1">'/etc/ovirt-engine-setup.conf.d/10-packaging.conf'</span><span class="o">]</span>
Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20170520124451-wvuuny.log
Version: otopi-1.6.1 <span class="o">(</span>otopi-1.6.1-1.el7.centos<span class="o">)</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment packages setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Programs detection
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment setup
<span class="o">[</span> INFO <span class="o">]</span> Stage: Environment customization
<span class="nt">--</span><span class="o">==</span> PRODUCT OPTIONS <span class="o">==</span><span class="nt">--</span>
Configure Engine on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Configure Image I/O Proxy on this host? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Configure WebSocket Proxy on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Please note: Data Warehouse is required <span class="k">for </span>the engine. If you choose to not configure it on this host, you have to configure it on a remote host, and <span class="k">then </span>configure the engine on this host so that it can access the database of the remote Data Warehouse host.
Configure Data Warehouse on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Configure VM Console Proxy on this host <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
<span class="nt">--</span><span class="o">==</span> PACKAGES <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Checking <span class="k">for </span>product updates...
<span class="o">[</span> INFO <span class="o">]</span> No product updates found
<span class="nt">--</span><span class="o">==</span> NETWORK CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Host fully qualified DNS name of this server <span class="o">[</span>centos-4gb-blr1-ovirt-engine]: ovirt.gsoc.org
<span class="o">[</span>WARNING] Failed to resolve ovirt.gsoc.org using DNS, it can be resolved only locally
Setup can automatically configure the firewall on this system.
Note: automatic configuration of the firewall may overwrite current settings.
Do you want Setup to configure the firewall? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
The following firewall managers were detected on this system: firewalld
Firewall manager to configure <span class="o">(</span>firewalld<span class="o">)</span>:
<span class="o">[</span> ERROR <span class="o">]</span> Invalid value
Firewall manager to configure <span class="o">(</span>firewalld<span class="o">)</span>:
<span class="o">[</span> ERROR <span class="o">]</span> Invalid value
Firewall manager to configure <span class="o">(</span>firewalld<span class="o">)</span>: firewalld
<span class="o">[</span> INFO <span class="o">]</span> firewalld will be configured as firewall manager.
<span class="nt">--</span><span class="o">==</span> DATABASE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Where is the DWH database located? <span class="o">(</span>Local, Remote<span class="o">)</span> <span class="o">[</span>Local]:
Setup can configure the <span class="nb">local </span>postgresql server automatically <span class="k">for </span>the DWH to run. This may conflict with existing applications.
Would you like Setup to automatically configure postgresql and create DWH database, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
Where is the Engine database located? <span class="o">(</span>Local, Remote<span class="o">)</span> <span class="o">[</span>Local]:
Setup can configure the <span class="nb">local </span>postgresql server automatically <span class="k">for </span>the engine to run. This may conflict with existing applications.
Would you like Setup to automatically configure postgresql and create Engine database, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
<span class="nt">--</span><span class="o">==</span> OVIRT ENGINE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Engine admin password:
Confirm engine admin password:
<span class="o">[</span>WARNING] Password is weak: it is based on a dictionary word
Use weak password? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]: Yes
Application mode <span class="o">(</span>Virt, Gluster, Both<span class="o">)</span> <span class="o">[</span>Both]:
<span class="nt">--</span><span class="o">==</span> STORAGE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Default SAN wipe after delete <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]:
<span class="nt">--</span><span class="o">==</span> PKI CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Organization name <span class="k">for </span>certificate <span class="o">[</span>gsoc.org]:
<span class="nt">--</span><span class="o">==</span> APACHE CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Setup can configure the default page of the web server to present the application home page. This may conflict with existing applications.
Do you wish to <span class="nb">set </span>the application as the default page of the web server? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>Yes]:
Setup can configure apache to use SSL using a certificate issued from the internal CA.
Do you wish Setup to configure that, or prefer to perform that manually? <span class="o">(</span>Automatic, Manual<span class="o">)</span> <span class="o">[</span>Automatic]:
<span class="nt">--</span><span class="o">==</span> SYSTEM CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Configure an NFS share on this server to be used as an ISO Domain? <span class="o">(</span>Yes, No<span class="o">)</span> <span class="o">[</span>No]:
<span class="nt">--</span><span class="o">==</span> MISC CONFIGURATION <span class="o">==</span><span class="nt">--</span>
Please choose Data Warehouse sampling scale:
<span class="o">(</span>1<span class="o">)</span> Basic
<span class="o">(</span>2<span class="o">)</span> Full
<span class="o">(</span>1, 2<span class="o">)[</span>1]:
<span class="nt">--</span><span class="o">==</span> END OF CONFIGURATION <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Setup validation
<span class="o">[</span>WARNING] Cannot validate host name settings, reason: resolved host does not match any of the <span class="nb">local </span>addresses
<span class="o">[</span>WARNING] Less than 16384MB of memory is available
<span class="nt">--</span><span class="o">==</span> CONFIGURATION PREVIEW <span class="o">==</span><span class="nt">--</span>
Application mode : both
Default SAN wipe after delete : False
Firewall manager : firewalld
Update Firewall : True
Host FQDN : ovirt.gsoc.org
Configure <span class="nb">local </span>Engine database : True
Set application as default page : True
Configure Apache SSL : True
Engine database secured connection : False
Engine database user name : engine
Engine database name : engine
Engine database host : localhost
Engine database port : 5432
Engine database host name validation : False
Engine installation : True
PKI organization : gsoc.org
DWH installation : True
DWH database secured connection : False
DWH database host : localhost
DWH database user name : ovirt_engine_history
DWH database name : ovirt_engine_history
DWH database port : 5432
DWH database host name validation : False
Configure <span class="nb">local </span>DWH database : True
Configure Image I/O Proxy : True
Configure VMConsole Proxy : True
Configure WebSocket Proxy : True
Please confirm installation settings <span class="o">(</span>OK, Cancel<span class="o">)</span> <span class="o">[</span>OK]:
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction setup
<span class="o">[</span> INFO <span class="o">]</span> Stopping engine service
<span class="o">[</span> INFO <span class="o">]</span> Stopping ovirt-fence-kdump-listener service
<span class="o">[</span> INFO <span class="o">]</span> Stopping dwh service
<span class="o">[</span> INFO <span class="o">]</span> Stopping Image I/O Proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stopping vmconsole-proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stopping websocket-proxy service
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Stage: Package installation
<span class="o">[</span> INFO <span class="o">]</span> Stage: Misc configuration
<span class="o">[</span> INFO <span class="o">]</span> Upgrading CA
<span class="o">[</span> INFO <span class="o">]</span> Initializing PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating PostgreSQL <span class="s1">'engine'</span> database
<span class="o">[</span> INFO <span class="o">]</span> Configuring PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating PostgreSQL <span class="s1">'ovirt_engine_history'</span> database
<span class="o">[</span> INFO <span class="o">]</span> Configuring PostgreSQL
<span class="o">[</span> INFO <span class="o">]</span> Creating CA
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing Engine database schema
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing DWH database schema
<span class="o">[</span> INFO <span class="o">]</span> Configuring Image I/O Proxy
<span class="o">[</span> INFO <span class="o">]</span> Setting up ovirt-vmconsole proxy helper PKI artifacts
<span class="o">[</span> INFO <span class="o">]</span> Setting up ovirt-vmconsole SSH PKI artifacts
<span class="o">[</span> INFO <span class="o">]</span> Configuring WebSocket Proxy
<span class="o">[</span> INFO <span class="o">]</span> Creating/refreshing Engine <span class="s1">'internal'</span> domain database schema
<span class="o">[</span> INFO <span class="o">]</span> Generating post <span class="nb">install </span>configuration file <span class="s1">'/etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Transaction commit
<span class="o">[</span> INFO <span class="o">]</span> Stage: Closing up
<span class="o">[</span> INFO <span class="o">]</span> Starting engine service
<span class="o">[</span> INFO <span class="o">]</span> Starting dwh service
<span class="o">[</span> INFO <span class="o">]</span> Restarting ovirt-vmconsole proxy service
<span class="nt">--</span><span class="o">==</span> SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Restarting httpd
Please use the user <span class="s1">'admin@internal'</span> and password specified <span class="k">in </span>order to login
Web access is enabled at:
http://ovirt.gsoc.org:80/ovirt-engine
https://ovirt.gsoc.org:443/ovirt-engine
Internal CA 80:BD:83:EA:FB:EB:F8:BD:F5:22:98:F2:90:57:03:92:62:B2:5A:62
SSH fingerprint: ee:43:05:9e:95:cc:7e:bb:6a:6e:aa:28:0d:4d:c1:08
<span class="o">[</span>WARNING] Less than 16384MB of memory is available
<span class="nt">--</span><span class="o">==</span> END OF SUMMARY <span class="o">==</span><span class="nt">--</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20170520124451-wvuuny.log
<span class="o">[</span> INFO <span class="o">]</span> Generating answer file <span class="s1">'/var/lib/ovirt-engine/setup/answers/20170520131443-setup.conf'</span>
<span class="o">[</span> INFO <span class="o">]</span> Stage: Pre-termination
<span class="o">[</span> INFO <span class="o">]</span> Stage: Termination
<span class="o">[</span> INFO <span class="o">]</span> Execution of setup completed successfully
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]#
</code></pre></div></div>
<p>The most important thing to note above in the <code class="language-plaintext highlighter-rouge">engine-setup</code> command is the FQDN for the VM. You need this for accessing the Engine administration portal.</p>
<p>Take a look at this specific part</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
<span class="o">[</span> INFO <span class="o">]</span> Restarting httpd
Please use the user <span class="s1">'admin@internal'</span> and password specified <span class="k">in </span>order to login
Web access is enabled at:
http://ovirt.gsoc.org:80/ovirt-engine
https://ovirt.gsoc.org:443/ovirt-engine
...
</code></pre></div></div>
<p>To resolve the admin portal’s FQDN, we first need to add an entry to <code class="language-plaintext highlighter-rouge">/etc/hosts</code> inside the VM.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# <span class="nb">cat</span> /etc/hosts
<span class="c"># Your system has configured 'manage_etc_hosts' as True.</span>
<span class="c"># As a result, if you wish for changes to this file to persist</span>
<span class="c"># then you will need to either</span>
<span class="c"># a.) make changes to the master file in /etc/cloud/templates/hosts.redhat.tmpl</span>
<span class="c"># b.) change or remove the value of 'manage_etc_hosts' in</span>
<span class="c"># /etc/cloud/cloud.cfg or cloud-config from user-data</span>
<span class="c"># The following lines are desirable for IPv4 capable hosts</span>
127.0.0.1 centos-4gb-blr1-ovirt-engine centos-4gb-blr1-ovirt-engine
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4
<span class="c"># The following lines are desirable for IPv6 capable hosts</span>
::1 centos-4gb-blr1-ovirt-engine centos-4gb-blr1-ovirt-engine
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
xxx.xx.xx.xxx ovirt.gsoc.org
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]#
</code></pre></div></div>
<p>Here, xxx.xx.xx.xxx is the public IP of my VM.</p>
<p>So if you now do a <code class="language-plaintext highlighter-rouge">ping</code> to <code class="language-plaintext highlighter-rouge">ovirt.gsoc.org</code>, the local resolver is now able to resolve the FQDN successfully</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]# ping ovirt.gsoc.org
PING ovirt.gsoc.org <span class="o">(</span>xxx.xx.xx.xxx<span class="o">)</span> 56<span class="o">(</span>84<span class="o">)</span> bytes of data.
64 bytes from ovirt.gsoc.org <span class="o">(</span>xxx.xx.xx.xxx<span class="o">)</span>: <span class="nv">icmp_seq</span><span class="o">=</span>1 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.034 ms
64 bytes from ovirt.gsoc.org <span class="o">(</span>xxx.xx.xx.xxx<span class="o">)</span>: <span class="nv">icmp_seq</span><span class="o">=</span>2 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.041 ms
64 bytes from ovirt.gsoc.org <span class="o">(</span>xxx.xx.xx.xxx<span class="o">)</span>: <span class="nv">icmp_seq</span><span class="o">=</span>3 <span class="nv">ttl</span><span class="o">=</span>64 <span class="nb">time</span><span class="o">=</span>0.033 ms
^C
<span class="nt">---</span> ovirt.gsoc.org ping statistics <span class="nt">---</span>
3 packets transmitted, 3 received, 0% packet loss, <span class="nb">time </span>1999ms
rtt min/avg/max/mdev <span class="o">=</span> 0.033/0.036/0.041/0.003 ms
<span class="o">[</span>root@centos-4gb-blr1-ovirt-engine ~]#
</code></pre></div></div>
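If you want to verify the same thing programmatically rather than with `ping`, a minimal Python sketch like the one below works. It simply asks the system resolver (which consults `/etc/hosts` first on a default CentOS setup); `ovirt.gsoc.org` is the FQDN used in this walkthrough, so substitute your own.

```python
import socket


def resolves_to(hostname):
    """Return the IPv4 address `hostname` resolves to, or None if it doesn't resolve."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None


if __name__ == "__main__":
    # ovirt.gsoc.org is the FQDN from this walkthrough; replace with yours.
    print(resolves_to("ovirt.gsoc.org"))
```

If this prints `None`, the `/etc/hosts` entry (or your DNS) is not in place yet.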
<p>Now on the local dev box, I did the same thing with my <code class="language-plaintext highlighter-rouge">/etc/hosts</code></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /etc/hosts
...
<span class="c"># DO server</span>
xxx.xx.xx.xxx ovirt.gsoc.org
...
</code></pre></div></div>
<p>Now it’s resolvable from my local dev box too</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ping ovirt.gsoc.org
PING ovirt.gsoc.org <span class="o">(</span>xxx.xx.xx.xxx<span class="o">)</span>: 56 data bytes
64 bytes from xxx.xx.xx.xxx: <span class="nv">icmp_seq</span><span class="o">=</span>0 <span class="nv">ttl</span><span class="o">=</span>59 <span class="nb">time</span><span class="o">=</span>5.322 ms
^C
<span class="nt">---</span> ovirt.gsoc.org ping statistics <span class="nt">---</span>
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev <span class="o">=</span> 5.322/5.322/5.322/0.000 ms
</code></pre></div></div>
<h2 id="admin-portal">Admin portal</h2>
<p>Now, on your dev box, check the admin portal by going to the URL https://ovirt.gsoc.org:443/ovirt-engine</p>
<center><img src="/content/images/2017/05/ovirt-admin-landingpage_840x610.png" /></center>
<p>Let’s take the admin panel for a spin</p>
<center><img src="/content/images/2017/05/ovirt-admin-panel.png" /></center>
<p>Isn’t she pretty?</p>
<p>I had been fighting with some dumb defaults for almost an hour. It’s 5 a.m., so I need to get some sleep. Stay tuned!</p>
<h2 id="debugging-tips">Debugging tips</h2>
<ul>
<li>make sure the remote host where you are installing <code class="language-plaintext highlighter-rouge">ovirt-engine</code> is resolvable from your machine.</li>
<li>check if the engine is running (check <code class="language-plaintext highlighter-rouge">/var/log/ovirt-engine/engine.log</code> and the <code class="language-plaintext highlighter-rouge">ovirt-engine</code> service)</li>
<li>check firewall on the machine.</li>
<li>try to reach the web server from the virtual machine itself, to be sure it’s not a network issue between you and the VM; <code class="language-plaintext highlighter-rouge">ping your.vm.com</code> should resolve to its IP address.</li>
</ul>
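To complement the checklist above, here is a small Python sketch (not part of oVirt’s own tooling, just a convenience) that tests whether the engine’s HTTP/HTTPS ports are reachable at all; `ovirt.gsoc.org` is the hypothetical FQDN from this post.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Replace ovirt.gsoc.org with your engine's FQDN.
    for port in (80, 443):
        print(port, "open" if port_open("ovirt.gsoc.org", port) else "closed")
```

If the ports show closed from your dev box but open from the VM itself, suspect the firewall configuration from `engine-setup`.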
<h2 id="links">Links</h2>
<ul>
<li><a href="http://www.ovirt.org/documentation/quickstart/quickstart-guide/">http://www.ovirt.org/documentation/quickstart/quickstart-guide/</a></li>
</ul>
Community bonding period, GSoC 2017 with oVirt org
2017-05-20T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/20/Community-Bonding-period-Google-Summer-of-Code-oVirt-2017
<p>A lot has happened over the last few weeks.</p>
<p>Chelsea won the Premier League, and that too with a comfortable lead; dominance is something which we definitely had this season. But I must say, West Brom did put up a good show.</p>
<p>The same day, the <a href="https://en.wikipedia.org/wiki/WannaCry_ransomware_attack">WannaCry ransomware</a> started its havoc. If you are affected, be sure to check out <a href="https://github.com/gentilkiwi/wanakiwi">wanakiwi</a> and <a href="https://github.com/gentilkiwi/wanadecrypt">wanadecrypt</a>. Benjamin does a wonderful job of explaining the intrinsic details of the tools <a href="https://blog.comae.io/wannacry-decrypting-files-with-wanakiwi-demo-86bafb81112d">in this blog post</a>.</p>
<p><a href="https://techcrunch.com/2017/05/20/btc2k/">Bitcoin just surged past $2,000 for the first time</a>. Sadly I don’t own any.</p>
<p>Also, we are just a week away from the official coding time period for GSoC 2017 which starts on 30th May.</p>
<h2 id="recent-developments-in-the-ovirt-community">Recent developments in the oVirt community</h2>
<p>I have been subscribed to the <a href="http://ovirt.org">oVirt</a> devel mailing lists for quite some time now, interacting on minor discussions to get a feel of the whole community. The #ovirt IRC channel is something which I haven’t been quite active on.</p>
<p>Some recent developments are lined up for our upcoming 4.2 release; the stable one for the moment is 4.1. Check out the <a href="http://www.ovirt.org/blog/">oVirt blog</a> for what we have been up to lately.</p>
<p>Our infra team made some changes to our Gerrit UI to make it fit better with the other oVirt services, and it looks great by the way. Check it out here: <a href="https://gerrit-staging.phx.ovirt.org/">https://gerrit-staging.phx.ovirt.org/</a></p>
<p>If you have any suggestions for the new UI, you can create a JIRA ticket here: <a href="https://ovirt-jira.atlassian.net/browse/OVIRT-912">https://ovirt-jira.atlassian.net/browse/OVIRT-912</a></p>
<p>Also, if you like living your life on the edge. Check out the nightly builds here <a href="https://www.ovirt.org/develop/dev-process/install-nightly-snapshot/">https://www.ovirt.org/develop/dev-process/install-nightly-snapshot/</a>.</p>
<p>We are also getting some enhanced OVA support, which includes:</p>
<ol>
<li>Support for uploading OVA.</li>
<li>Support for exporting a VM/template as OVA.</li>
<li>Support for importing OVA that was generated by oVirt (today, we only support those that are VMware-compatible).</li>
<li>Support for downloading OVA.</li>
</ol>
<p>That’s about it for this time. Happy weekend :)</p>
Making of Trumporate: Building markovipy - Part 1
2017-05-06T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/06/Making-of-trumporate-using-markovipy-generating-sentences-using-markov-chains-part-1
<h2 id="do-you-even-read-comics">Do you even read comics?</h2>
<p>Kiddin. By the way, I love reading <a href="http://www.calvinandhobbes.co.uk/">Calvin and Hobbes</a>. I keep re-reading their well-worn collections, maybe for the n-th time. The thing which keeps me hooked is maybe the blunt truthfulness of the strip.</p>
<p>Haven’t read any?</p>
<p>I like this one because of its simplicity.</p>
<center><img src="/content/images/2017/05/simplicity-calvin.jpg" /></center>
<p>This one makes me smile all the time.</p>
<center><img src="/content/images/2017/05/calvin-smile.jpg" /></center>
<p><a href="https://xkcd.com/">xkcd</a> is the only thing that comes close in comic strips which I visit frequently.</p>
<h2 id="psst-let-me-tell-you-something">Psst. Let me tell you something</h2>
<p>The picture which you saw, titled “Calvin and Markov”, was generated using <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov Chains</a>, as explained <a href="http://www.joshmillard.com/2015/07/06/calvin-and-markov/">here by the author</a>. So it was not written by a human but generated programmatically using a corpus.</p>
<h2 id="markov-what">Markov What?</h2>
<p>Quite simply put, it theorises that</p>
<blockquote>
<p>the probability of each event depends only on the event that occurred just before it.</p>
</blockquote>
<p>When the probability of some event is conditional on previous events, we say they are dependent events.</p>
<p>Let me put it this way.</p>
<p>You could relate to this from the fact that most things in the physical world depend on what happened before them.</p>
<p>Now imagine a process with a short-term memory of one event, where each outcome depends on the previous one. This can be visualised using a hypothetical machine which contains two cups, which we call states. In one cup we have a 50-50 mix of light versus dark beads, while in the other we have more dark than light. One cup, call it state zero, represents a dark bead having previously occurred; the other, state one, represents a light bead having previously occurred.</p>
<p>To run our machine, we simply start in a random state and make a selection. Based on the outcome, we output a zero if the bead is dark or a one if it is light, and then move to the corresponding state before selecting again.</p>
<p>With this two-state machine, we can identify four possible transitions. If we are in state zero and a dark bead occurs, we loop back to the same state and select again. If a light bead is selected, we jump over to state one, which can also loop back on itself, or jump back to state zero if a dark bead is chosen. The probability of a light versus dark selection is clearly not independent here, since it depends on the previous outcome.</p>
<p>But Markov proved that as long as every state in the machine is reachable, when you run these machines in a sequence, they reach equilibrium. That is, no matter where you start, once you begin the sequence, the number of times you visit each state converges to some specific ratio, or a probability.</p>
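This equilibrium claim is easy to check empirically. Below is a quick Python simulation of the two-cup machine; the transition probabilities are made up for illustration, and the visit ratio converges to the same stationary distribution regardless of the starting state.

```python
import random

# Hypothetical transition probabilities for the two-cup machine:
# from state 0 (mostly dark cup) we draw a light bead with probability 0.2,
# from state 1 (mostly light cup) we draw a light bead with probability 0.6.
P_LIGHT = {0: 0.2, 1: 0.6}  # probability of moving to state 1 from each state


def simulate(steps, seed=42):
    """Run the machine for `steps` selections and count visits to each state."""
    rng = random.Random(seed)
    state = rng.choice([0, 1])
    visits = {0: 0, 1: 0}
    for _ in range(steps):
        state = 1 if rng.random() < P_LIGHT[state] else 0
        visits[state] += 1
    return visits


visits = simulate(100_000)
ratio = visits[1] / (visits[0] + visits[1])
# Solving pi_1 = 0.2*pi_0 + 0.6*pi_1 with pi_0 + pi_1 = 1 gives pi_1 = 1/3,
# so the printed ratio hovers around 0.333.
print(ratio)
```

Change the seed or the starting state and the ratio still settles near 1/3, which is exactly the convergence Markov proved.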
<p>Quite naturally, it rains more often on a cloudy day, and you don’t drown while you are in your bed: most outcomes are conditional on the state that precedes them.</p>
<p>This helps in calculating the conditional probability and can be applied to various scenarios.</p>
<center><img src="/content/images/2017/05/markovdiag.png" /></center>
<h2 id="why-am-i-so-interested-in-it">Why am I so interested in it?</h2>
<p>Maybe because it comes at the intersection of math and linguistics. Or maybe I wanted something new to fool around with. Take your pick. You wouldn’t be wrong either way.</p>
<p>Going down the rabbit hole, I wanted to build something of my own with this new-found knowledge from the read-ups.</p>
<center><img src="/content/images/2017/05/markovipy.png" /></center>
<p>I built Markovipy, a Markov Text Generator which can be used to randomly generate (somewhat) realistic sentences, using words from a source text. Words are joined together in sequence, with each new word being selected based on how often it follows the previous word in the source document.</p>
<p>The results are often just nonsense, but at times can be strangely poetic. The sentences below were generated from the text of “Hamlet” by Shakespeare:</p>
<blockquote>
<p>If his occulted guilt, Do not it selfe vnkennell in one speech, It is most retrograde to our desire And we beseech you, bend you to remaine Heere in the cheere and comfort of our eye, Our cheefest Courtier Cosin, and our whole Kingdome To be contracted in one brow of woe Yet so farre hath Discretion fought with Nature, That we with wisest sorrow thinke on him, Together with remembrance of our selues.</p>
</blockquote>
<p>Here I was using a chain length of 3, which represents the number of words taken into account when choosing the next word. The chain length defaults to 1 (which is fastest), but increasing it may generate more realistic text, albeit slightly more slowly.</p>
<p>Depending on the text, increasing the chain length past 6 or 7 words probably won’t do much good – at that point you’re usually plucking out whole sentences anyway, so using a Markov model is kind of redundant.</p>
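A minimal generator along these lines can be sketched as follows. This is not markovipy’s actual implementation, just the idea: build a mapping from each chain-length-sized tuple of words to the words that follow it in the corpus, then walk that mapping until you hit a period.

```python
import random
from collections import defaultdict


def build_chain(words, chain_length=1):
    """Map each tuple of `chain_length` consecutive words to its followers."""
    chain = defaultdict(list)
    for i in range(len(words) - chain_length):
        key = tuple(words[i:i + chain_length])
        chain[key].append(words[i + chain_length])
    return chain


def generate(words, chain_length=1, max_words=30, seed=None):
    """Walk the chain from a random starting key until a period or max_words."""
    rng = random.Random(seed)
    chain = build_chain(words, chain_length)
    out = list(rng.choice(list(chain)))
    while len(out) < max_words:
        followers = chain.get(tuple(out[-chain_length:]))
        if not followers:        # dead end: nothing ever followed this key
            break
        nxt = rng.choice(followers)
        out.append(nxt)
        if nxt.endswith("."):    # sentence finished
            break
    return " ".join(out)


# Toy corpus; markovipy uses full Gutenberg texts instead.
corpus = "the cat sat on the mat. the dog sat on the rug.".split()
print(generate(corpus, chain_length=1, seed=7))
```

Every word it emits follows its predecessor somewhere in the corpus, which is why a larger chain length starts reproducing whole source sentences.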
<p>I got the corpus texts off Project Gutenberg. The usual copyright headers had to be removed so that they could serve as useful sample input, but naturally all the rights and restrictions of a Gutenberg book still apply.</p>
<p>The API looks something like this</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">markovipy.markovipy</span> <span class="kn">import</span> <span class="n">MarkoviPy</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="n">obj</span> <span class="o">=</span> <span class="n">MarkoviPy</span><span class="p">(</span><span class="s">"/Users/tasrahma/development/projects/markovipy/corpus/shakespeare/hamlet_utf8.txt"</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">obj</span><span class="p">.</span><span class="n">generate_sentence</span><span class="p">()</span>
<span class="s">'If his occulted guilt, Do not it selfe vnkennell in one speech, It is most retrograde to our desire And we beseech you, bend you to remaine Heere in the cheere and comfort of our eye, Our cheefest Courtier Cosin, and our whole Kingdome To be contracted in one brow of woe Yet so farre hath Discretion fought with Nature, That we with wisest sorrow thinke on him, Together with remembrance of our selues.'</span>
<span class="o">>>></span> <span class="n">obj</span><span class="p">.</span><span class="n">generate_sentence</span><span class="p">()</span>
<span class="s">'Fare you well my Lord Ham.'</span>
<span class="o">>>></span> <span class="n">obj</span><span class="p">.</span><span class="n">generate_sentence</span><span class="p">()</span>
<span class="s">'To thinke, my Lord? Ham.'</span>
</code></pre></div></div>
<h2 id="future-improvements">Future improvements</h2>
<p>As with every project, there is always space to improve and here are some things which can be done with markovipy</p>
<ul>
<li>Specify the number of sentences to be generated when the API is called. As of now, only one sentence gets generated, up to the period.</li>
<li>I am storing the mappings of possible words in memory as of now; this could be shifted to Redis.</li>
<li>Or if you can suggest something, I would be very happy to take a look. Create an issue on its GitHub page: <a href="https://github.com/tasdikrahman/markovipy/issues/">https://github.com/tasdikrahman/markovipy/issues/</a></li>
</ul>
<p>I would like to end this article with the final Calvin and Hobbes strip, which ran on 31st December, 1995, the same year and month that I was born :)</p>
<center><img src="/content/images/2017/05/calvin-hobbes-final-strip-dec-31-1995.jpg" /></center>
<p>Thanks for your time. Cheers!</p>
<p><strong>DISCLAIMER: All Calvin and Hobbes Comic strips and images copyright to Bill Watterson and publishers. All written © CalvinAndHobbes.co.uk</strong></p>
<h2 id="further-read">Further read</h2>
<ul>
<li><a href="https://github.com/tasdikrahman/markovipy/">https://github.com/tasdikrahman/markovipy/</a></li>
<li><a href="http://www.joshmillard.com/2015/07/06/calvin-and-markov/">http://www.joshmillard.com/2015/07/06/calvin-and-markov/</a></li>
<li><a href="http://setosa.io/ev/markov-chains/">http://setosa.io/ev/markov-chains/</a></li>
<li><a href="https://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html">https://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html</a></li>
</ul>
Hello oVirt, GSoC 2017
2017-05-04T00:00:00+00:00
https://www.tasdikrahman.com/2017/05/04/Got-accepted-for-Google-Summer-of-Code-oVirt-2017
<p>So Chelsea won their last Premier League match, against Everton, hands down last Sunday. A very comfortable win, I would say.</p>
<p>Pedro’s 25-yard stunner, Gary Cahill’s close-range finish and Willian’s tap-in kept us ahead of the Toffees. Courtois got his much-deserved clean sheet. Wouldn’t be wrong to say that we had a great day.</p>
<p>We are just 3 wins away from the Premier League title, with Tottenham right behind our backs.</p>
<p>What other better news can there be for me?</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Accepted for <a href="https://twitter.com/gsoc">@gsoc</a> 2017! Will write software for <a href="https://twitter.com/ovirt">@ovirt</a> under <a href="https://twitter.com/redhatopen">@redhatopen</a>. Couldn't be any happier 😄 <a href="https://twitter.com/hashtag/GSOC?src=hash">#GSOC</a> <a href="https://twitter.com/hashtag/gsoc2017?src=hash">#gsoc2017</a> <a href="https://twitter.com/hashtag/RedHat?src=hash">#RedHat</a> <a href="https://twitter.com/hashtag/opensource?src=hash">#opensource</a></p>— Tasdik Rahman (@tasdikrahman) <a href="https://twitter.com/tasdikrahman/status/860188737376067587">May 4, 2017</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="google-summer-of-code">Google Summer of Code</h2>
<p>Well turns out there is. I got accepted for GSoC 2017 under <a href="http://ovirt.org">oVirt</a> and will be working with <a href="https://cz.linkedin.com/in/lukas-svaty-4236b573">Lukas Svaty</a> among others this summer.</p>
<blockquote>
<p>Building things which ease human effort is something which I deeply believe in.</p>
</blockquote>
<p>I used to read a lot of <a href="https://xkcd.com/">xkcd</a>, so I ended up making <a href="https://github.com/tasdikrahman/xkcd-dl">xkcd-dl</a>, a cross-platform comic downloader for readers. I liked playing classic FPS games, so I ended up writing <a href="https://github.com/tasdikrahman/spaceShooter">spaceShooter</a>, a cross-platform game using pygame which ended up on HN, Product Hunt and what not. I used to spend most of my time at the terminal and had a lot of <code class="language-plaintext highlighter-rouge">.txt</code> files lying around holding random links or pieces of text picked up while surfing. I thought I should keep them in a unified place, with tags and a searchable interface, so I hacked together <a href="https://github.com/tasdikrahman/tnote">tnote</a>, which does just that. And so on for my other projects.</p>
<blockquote>
<p>There was a need and hence I built it!</p>
</blockquote>
<p>GSoC is all about coming together with the community and building things which would be used by thousands (if not millions), and oVirt would allow me to live by that very statement.</p>
<p>oVirt is a product very similar to <a href="http://www.vmware.com/in/products/vsphere.html">vSphere</a> offered by <a href="http://www.vmware.com/">VMware</a>; the two are competitors of sorts in that space, both being server virtualisation tools.</p>
<p>The open source project is the upstream version of the commercial product, as is the case with other Red Hat projects: <a href="https://getfedora.org/">Fedora</a> is the upstream of <a href="https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux">RHEL</a>, and <a href="https://www.centos.org/">CentOS</a> is a downstream rebuild of RHEL.</p>
<h2 id="about-my-project-specifically">About my project specifically</h2>
<p>So I have been fooling around with a configuration management tool called <a href="https://www.ansible.com/">Ansible</a> for some months now (which would be evident from my feed if you stumble upon my blog once in a while). Apparently, my interests coincided with one of the project proposals at oVirt. So there you go.</p>
<p>If you are curious about the whole project proposal, here is <a href="https://summerofcode.withgoogle.com/projects/#6188534896525312">the link to it</a> for you to read more about it :)</p>
<p>All in all I cannot be more excited about this!</p>
<p>Cheerio!</p>
Testing your ansible roles using travis-CI
2017-04-06T00:00:00+00:00
https://www.tasdikrahman.com/2017/04/06/Testing-your-ansible-roles-using-travis-CI
<p><strong>NOTE</strong>: <strong>The ansible playbook written here can be found at <a href="https://github.com/tasdikrahman/ansible-bootstrap-server">tasdikrahman/ansible-bootstrap-server</a></strong></p>
<h2 id="continous-integration">Continuous Integration</h2>
<p>Simply put, every commit that you make to a shared repository is verified by an automated build. This helps in detecting errors early on.</p>
<p>If you are new to this development style, there are <a href="https://martinfowler.com/articles/continuousIntegration.html">plenty</a> <a href="http://softwareengineering.stackexchange.com/questions/198471/simple-explanation-of-continuous-integration">of</a> <a href="https://www.thoughtworks.com/continuous-integration">places</a> which will help you understand it. The practice in itself is quite old. CI/CD, anyone?</p>
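<p>As a toy sketch of the idea (the check here is a stand-in of my own, not a real command), a CI gate boils down to running the checks on every commit and turning their exit status into a green or red build:</p>

```shell
# Toy CI gate: run the checks, report green/red from the exit status.
run_tests() {
  # Stand-in for a real check, e.g. an ansible-playbook --syntax-check run.
  true
}

if run_tests; then
  echo "commit is green"
else
  echo "commit is red"
fi
```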
<p>But I am not writing this to explain what CI is, right?</p>
<h2 id="so-you-made-an-ansible-playbook">So you made an ansible playbook?</h2>
<p>I have talked about infrastructure as code in some of my earlier blog posts. Automatically provisioning your complete server(s) in minutes is something every org is trying to achieve, or already has.</p>
<p>If you follow TDD principles, you will know exactly where I am taking this conversation.</p>
<p>Here is the directory structure for the ansible role I am testing this out on:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ansible-bootstrap-server
├── .travis.yml
├── ansible.cfg
├── play.yml
├── roles
│   ├── basic_server_hardening
│   │   ├── defaults
│   │   │   └── main.yml
│   │   ├── handlers
│   │   │   └── main.yml
│   │   └── tasks
│   │       └── main.yml
│   ├── create_new_user
│   │   ├── defaults
│   │   │   └── main.yml
│   │   └── tasks
│   │       └── main.yml
│   ├── install_minimal_packages
│   │   └── tasks
│   │       └── main.yml
│   ├── update
│   │   └── tasks
│   │       └── main.yml
│   └── vimserver
│       ├── defaults
│       │   └── main.yml
│       ├── files
│       │   └── vimrc_server
│       └── tasks
│           └── main.yml
└── tests
    ├── inventory
    └── test.yml
</code></pre></div></div>
<p><strong>If you want to understand how the files are organised. I have written about it in <a href="http://www.tasdikrahman.com/2017/03/19/Organising-tasks-in-roles-using-Ansible/">“Organising tasks in roles using Ansible”</a></strong></p>
<h2 id="writings-tests">Writing tests</h2>
<p>I will be running the tests inside the Travis build environment.</p>
<p>Why Travis?</p>
<p>I simply like it more! But there are many other good CI/CD providers, namely CircleCI, Bamboo and some more. Choose whatever fits your organisation or personal appeal best.</p>
<p>So unlike when I run <code class="language-plaintext highlighter-rouge">ansible-playbook</code> from a controller node against remote hosts, here I will be running the playbook against <code class="language-plaintext highlighter-rouge">localhost</code>. This part should be obvious by now.</p>
<p>Let’s have a look at <code class="language-plaintext highlighter-rouge">tests/test.yml</code></p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
- hosts: localhost
  connection: local
  become: true
  roles:
    - { role: ../roles/update }
    - { role: ../roles/install_minimal_packages }
    - { role: ../roles/create_new_user }
    - { role: ../roles/basic_server_hardening }
    - { role: ../roles/vimserver }
</code></pre></div></div>
<p>Let’s break it down,</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">hosts: localhost</code> tells ansible which host (or group of hosts) this playbook targets.</li>
<li><code class="language-plaintext highlighter-rouge">connection: local</code> tells ansible to run the tasks on the system itself rather than <code class="language-plaintext highlighter-rouge">ssh</code> onto some remote machine to execute the playbook.</li>
<li><code class="language-plaintext highlighter-rouge">become: true</code> makes the tasks run as the <code class="language-plaintext highlighter-rouge">root</code> user.</li>
</ul>
<p>and the <code class="language-plaintext highlighter-rouge">roles</code> part is where we would be organising our roles to be executed sequentially.</p>
<h2 id="testing-against-different-versions-of-ansible">Testing against different versions of Ansible</h2>
<p>For Travis to build your code, it requires a <code class="language-plaintext highlighter-rouge">.travis.yml</code> file in the root directory of your project.</p>
<p>In this particular example, here are its contents.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
sudo: required
dist: trusty
language: python
python: "2.7"
# Doc: https://docs.travis-ci.com/user/customizing-the-build#Build-Matrix
env:
  - ANSIBLE_VERSION=latest
  - ANSIBLE_VERSION=2.2.2.0
  - ANSIBLE_VERSION=2.2.1.0
  - ANSIBLE_VERSION=2.2.0.0
  - ANSIBLE_VERSION=2.1.5
  - ANSIBLE_VERSION=2.1.4
  - ANSIBLE_VERSION=2.1.3
  - ANSIBLE_VERSION=2.1.2
  - ANSIBLE_VERSION=2.1.1.0
  - ANSIBLE_VERSION=2.1.0.0
  - ANSIBLE_VERSION=2.0.2.0
  - ANSIBLE_VERSION=2.0.1.0
  - ANSIBLE_VERSION=2.0.0.2
  - ANSIBLE_VERSION=2.0.0.1
  - ANSIBLE_VERSION=2.0.0.0
  - ANSIBLE_VERSION=1.9.6
branches:
  only:
    - master
before_install:
  - sudo apt-get update -qq
install:
  # Install Ansible.
  - if [ "$ANSIBLE_VERSION" = "latest" ]; then pip install ansible; else pip install ansible==$ANSIBLE_VERSION; fi
  - if [ "$ANSIBLE_VERSION" = "latest" ]; then pip install ansible-lint; fi
script:
  # Check the role/playbook's syntax.
  - ansible-playbook -i tests/inventory tests/test.yml --syntax-check
  # Run the role/playbook with ansible-playbook.
  - ansible-playbook -i tests/inventory tests/test.yml -vvvv --skip-tags update,copy_host_ssh_id
  # Check whether the user was created, via id's exit status (its
  # "no such user" error goes to stderr, so grepping stdout won't work).
  - id -u tasdik && (echo "user found" && exit 0) || (echo "user not found" && exit 1)
</code></pre></div></div>
<p>The interesting thing to note here is the list of values under <code class="language-plaintext highlighter-rouge">env</code>. Travis expands these environment variables to create multiple build environments, one after another.</p>
<p>In this case we have 16 <code class="language-plaintext highlighter-rouge">env</code> entries, so there will be 16 separate builds, one for each ansible version this playbook is tested against.</p>
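<p>If it helps to visualise that expansion, here is a small shell simulation of my own (Travis does this server-side; the version list below is truncated):</p>

```shell
# Simulate Travis expanding the env list into one build per entry.
install_cmd() {
  if [ "$1" = "latest" ]; then
    echo "pip install ansible"
  else
    echo "pip install ansible==$1"
  fi
}

# One isolated build would be started per ANSIBLE_VERSION value.
for ANSIBLE_VERSION in latest 2.2.2.0 1.9.6; do
  echo "build: $(install_cmd "$ANSIBLE_VERSION")"
done
```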
<p>So for the line</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
env:
  - ANSIBLE_VERSION=latest
.
</code></pre></div></div>
<p>The build config will look something like this, where you can see the <code class="language-plaintext highlighter-rouge">"env": "ANSIBLE_VERSION=latest"</code> entry:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "sudo": "required",
  "dist": "trusty",
  "language": "python",
  "python": "2.7",
  "env": "ANSIBLE_VERSION=latest",
  "before_install": [
    "sudo apt-get update -qq"
  ],
  "install": [
    "if [ \"$ANSIBLE_VERSION\" = \"latest\" ]; then pip install ansible; else pip install ansible==$ANSIBLE_VERSION; fi",
    "if [ \"$ANSIBLE_VERSION\" = \"latest\" ]; then pip install ansible-lint; fi"
  ],
  "script": [
    "ansible-playbook -i tests/inventory tests/test.yml --syntax-check",
    "ansible-playbook -i tests/inventory tests/test.yml -vvvv --skip-tags update,copy_host_ssh_id",
    "id -u tasdik && (echo \"user found\" && exit 0) || (echo \"user not found\" && exit 1)"
  ],
  "group": "stable",
  "os": "linux"
}
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">ansible-playbook -i tests/inventory tests/test.yml --syntax-check</code>:</li>
</ul>
<p>checks the playbook for syntax errors, as is obvious from the command itself; this helps catch mistakes early, before the actual run.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">ansible-playbook -i tests/inventory tests/test.yml -vvvv --skip-tags update,copy_host_ssh_id</code>:</li>
</ul>
<p>is the line which actually runs the playbook.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">id -u tasdik && (echo "user found" && exit 0) || (echo "user not found" && exit 1)</code>:</li>
</ul>
<p>checks whether a user named <code class="language-plaintext highlighter-rouge">tasdik</code> exists after the playbook has completed its execution. The check relies on the exit status of <code class="language-plaintext highlighter-rouge">id</code>, since its “no such user” error goes to stderr rather than stdout.</p>
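<p>The same check, packaged as a small function you can try locally (the <code class="language-plaintext highlighter-rouge">check_user</code> name and the missing-user name below are mine, for illustration):</p>

```shell
# Check whether a user exists using id's exit status, not its output
# (the "no such user" error goes to stderr, so grepping stdout fails).
check_user() {
  if id -u "$1" >/dev/null 2>&1; then
    echo "user found"
  else
    echo "user not found"
  fi
}

check_user root               # exists on any Linux box
check_user no_such_user_xyz   # hypothetical missing user
```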
<p>I have skipped explaining some parts of my <code class="language-plaintext highlighter-rouge">.travis.yml</code>. You can learn more about <a href="https://docs.travis-ci.com/user/customizing-the-build">build configuration in Travis from the docs</a>.</p>
<h2 id="skipping-tasks-inside-build">Skipping tasks inside build</h2>
<p>Travis builds will fail if your job exceeds certain limits while it is running. You can find the exact specs and timings in the <a href="https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts">Travis docs on build timeouts</a>.</p>
<p>Some roles in particular had tasks which make network calls that are unnecessary to exercise in a Travis build, so I needed to skip them in the builds.</p>
<p>Enter ansible <code class="language-plaintext highlighter-rouge">tags</code>.</p>
<p>They provide an elegant way to skip, or only run, tasks carrying the specified tags.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
"ansible-playbook -i tests/inventory tests/test.yml -vvvv --skip-tags update,copy_host_ssh_id"
.
</code></pre></div></div>
<p>The above run executes the playbook <code class="language-plaintext highlighter-rouge">tests/test.yml</code> against the hosts described in <code class="language-plaintext highlighter-rouge">tests/inventory</code>, skipping the tags <code class="language-plaintext highlighter-rouge">update,copy_host_ssh_id</code> which I have specified on my tasks.</p>
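<p>For the curious, a tagged task looks something like this; the task below is a hypothetical sketch of what an <code class="language-plaintext highlighter-rouge">update</code> role might contain, not a copy of mine:</p>

```yaml
# Hypothetical task in roles/update/tasks/main.yml; the `update` tag
# lets a CI run skip it via `--skip-tags update`.
- name: Upgrade all packages (slow and network-heavy)
  apt:
    upgrade: dist
    update_cache: yes
  tags:
    - update
```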
<p>I will be trying out testing ansible roles on docker next. Stay tuned.</p>
<p>Happy ansibling!</p>
<p><strong>Further reading</strong></p>
<ul>
<li><a href="http://docs.ansible.com/ansible/playbooks_tags.html">http://docs.ansible.com/ansible/playbooks_tags.html</a></li>
<li><a href="https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts">https://docs.travis-ci.com/user/customizing-the-build#Build-Timeouts</a></li>
<li><a href="https://docs.travis-ci.com/user/customizing-the-build">https://docs.travis-ci.com/user/customizing-the-build</a></li>
<li><a href="https://www.jeffgeerling.com/blog/testing-ansible-roles-travis-ci-github">https://www.jeffgeerling.com/blog/testing-ansible-roles-travis-ci-github</a></li>
</ul>
Organising tasks in roles using Ansible
2017-03-19T00:00:00+00:00
https://www.tasdikrahman.com/2017/03/19/Organising-tasks-in-roles-using-Ansible
<p><strong>NOTE</strong>: <strong>The ansible playbook written here can be found at <a href="https://github.com/tasdikrahman/ansible-playbooks/tree/master/digitalocean">tasdikrahman/ansible-playbooks</a></strong></p>
<p>Roles are nothing but a further abstraction to make your playbook more modular. If you have played around with the <code class="language-plaintext highlighter-rouge">ansible-playbook</code> command, you might have noticed yourself repeating the same tasks time and again.</p>
<p>Ansible roles provide a way to reuse tasks (or other roles, for that matter). Think of it as a concept very similar to writing object-oriented code.</p>
<h2 id="need-for-roles">Need for roles?</h2>
<p>I realised that I was doing the same thing over and over again whenever I had to spin up a new droplet (instance, for the EC2 people). Things like:</p>
<ul>
<li>Updating and upgrading the existing packages</li>
<li>Installing some bare essentials (e.g. git, vim, ncdu)</li>
<li>Creating a non-root user with admin privileges</li>
<li>Enabling a basic firewall</li>
<li>Some common chores</li>
</ul>
<p>Hence I found myself writing <a href="https://github.com/tasdikrahman/ansible-playbooks/tree/master/digitalocean">tasdikrahman/ansible-playbooks</a>.</p>
<p>Take this structure for example.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree digitalocean
digitalocean
├── README.md
├── play.yml
└── roles
    ├── bootstrap_server
    │   └── tasks
    │       └── main.yml
    ├── create_new_user
    │   └── tasks
    │       └── main.yml
    ├── update
    │   └── tasks
    │       └── main.yml
    └── vimserver
        ├── files
        │   └── vimrc_server
        └── tasks
            └── main.yml
</code></pre></div></div>
<p>Let’s break it down further.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">digitalocean</code> : The root dir which will contain the <code class="language-plaintext highlighter-rouge">roles</code> dir which further contains roles for tasks</li>
<li><code class="language-plaintext highlighter-rouge">play.yml</code> : A normal playbook in <code class="language-plaintext highlighter-rouge">.yml</code> format which stitches together the roles that we have created inside the <code class="language-plaintext highlighter-rouge">roles</code> dir</li>
<li><code class="language-plaintext highlighter-rouge">bootstrap_server</code> : contains the task file(s) needed for the necessary tasks declared inside the <code class="language-plaintext highlighter-rouge">tasks/main.yml</code>. The roles <code class="language-plaintext highlighter-rouge">create_new_user</code> et al. follow the same pattern.</li>
<li><code class="language-plaintext highlighter-rouge">tasks</code> : This directory contains all of the tasks that would normally be in a playbook. These can reference files and templates contained in their respective directories without using a path.</li>
</ul>
<p>Similar to the <code class="language-plaintext highlighter-rouge">tasks</code> dir inside roles, there are several more directories, each used for a specific purpose.</p>
<p>Those being:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">files</code>: This directory contains regular files that need to be transferred to the hosts you are configuring for this role. This may also include script files to run.</li>
<li><code class="language-plaintext highlighter-rouge">handlers</code>: All handlers that were in your playbook previously can now be added into this directory.</li>
<li><code class="language-plaintext highlighter-rouge">meta</code> : This directory can contain files that establish role dependencies. You can list roles that must be applied before the current role can work correctly.</li>
<li><code class="language-plaintext highlighter-rouge">templates</code>: You can place all files that use variables to substitute information during creation in this directory.</li>
<li><code class="language-plaintext highlighter-rouge">vars</code>: Variables for the roles can be specified in this directory and used in your configuration files.</li>
</ul>
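<p>To make the <code class="language-plaintext highlighter-rouge">handlers</code> directory concrete, here is a hypothetical <code class="language-plaintext highlighter-rouge">handlers/main.yml</code> sketch of my own (not from the repo above); tasks elsewhere in the role trigger it by name with <code class="language-plaintext highlighter-rouge">notify</code>:</p>

```yaml
# Hypothetical roles/some_role/handlers/main.yml: a task elsewhere runs
# this by declaring `notify: restart ssh` after changing sshd's config.
---
- name: restart ssh
  service:
    name: ssh
    state: restarted
```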
<h2 id="playyml"><code class="language-plaintext highlighter-rouge">play.yml</code></h2>
<p>The contents of my <code class="language-plaintext highlighter-rouge">play.yml</code> are ordered so that the roles get executed in sequence. This is a very handy feature which allows you to configure software that depends on other software.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
- hosts: testdroplets
  roles:
    - update
    - bootstrap_server
    - role: create_new_user
      username: tasdik
    - role: vimserver
      username: tasdik
</code></pre></div></div>
<p>The roles are placed as key-values inside the <code class="language-plaintext highlighter-rouge">roles</code> list here. The <code class="language-plaintext highlighter-rouge">username</code> variable is passed to the roles <code class="language-plaintext highlighter-rouge">create_new_user</code> and <code class="language-plaintext highlighter-rouge">vimserver</code> as the value with which the new user should be created.</p>
<p>You can also put the <code class="language-plaintext highlighter-rouge">username</code> variable inside individual roles dir, which would look something like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>└── roles
    └── create_new_user
        ├── vars
        │   └── main.yml
        └── tasks
            └── main.yml
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">vars/main.yml</code> would contain</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
username: tasdik
</code></pre></div></div>
<p>But I feel this would become cumbersome for the playbook in question: I would have to put the same <code class="language-plaintext highlighter-rouge">vars/main.yml</code> inside the <code class="language-plaintext highlighter-rouge">vimserver</code> role, which would be a duplicate of this file.</p>
<p>For that reason, I am passing the variable at the play level for each of the ansible roles.</p>
<h2 id="how-does-jinja2-come-into-play-here">How does Jinja2 come into play here?</h2>
<p>Have a look at the file</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>cat digitalocean/roles/vimserver/tasks/main.yml
---
- name: Place vimrc_server on ~/.vimrc
  copy:
    src: vimrc_server
    dest: /home/{{ username }}/.vimrc
    mode: 0644
    owner: "{{ username }}"
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">username</code> variable is substituted with the value you provided to the role at runtime. The reason we put quotes around it is that ansible’s YAML parser requires them when the value starts with the Jinja variable.</p>
<p>You needn’t put the quotes when the variable isn’t at the start of the value, as in, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: Copy .ssh/id_rsa.pub from the host box to the remote box for user username
  become: true
  copy:
    src: ~/.ssh/id_rsa.pub
    dest: /home/{{ username }}/.ssh/authorized_keys
</code></pre></div></div>
<p><strong>NOTE: In the actual playbook, <code class="language-plaintext highlighter-rouge">username</code> is wrapped in double curly braces, which is Jinja2’s substitution syntax; my blog’s own templating engine kept swallowing them when rendering these snippets.</strong></p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>Ansible uses Jinja2 as its templating engine of choice (also the choice of the biggies, Flask and Django), so you will be in familiar territory if you have dabbled in those.</p>
<p>You can finally run this playbook using</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ansible-playbook play.yml <span class="nt">-vvv</span>
</code></pre></div></div>
<p>Happy ansibling!</p>
Introduction to Configuration Management using Ansible
2017-02-28T00:00:00+00:00
https://www.tasdikrahman.com/2017/02/28/Ansible-Introduction-to-Configuration-Management
<h2 id="need-for-configuration-management">Need for Configuration management</h2>
<p>There are many devs/sysadmins out there who manage their servers by logging in through ssh, making their changes, and logging out again. Sounds like you?</p>
<p>Well hey, you are not alone!</p>
<p>But do you feel that this can create snowflake servers?</p>
<p>Servers which are impossible to recreate because we missed out on some minute detail which the other dev had known.</p>
<blockquote>
<p>But Tasdik. This wouldn’t happen if we have a very good documentation process giving a step by step guide on how to do so!</p>
</blockquote>
<p>Good! I will say you have followed the very good engineering practice of documenting each and every process that you do! But in most fast-moving environments, this is not the case!</p>
<h2 id="enter-configuration-management">Enter Configuration management</h2>
<p>Thankfully, we have a good range of configuration management tools out there, like <a href="http://cfengine.com/">CFEngine</a>, <a href="http://ansible.com/">Ansible</a> and <a href="http://getchef.com/chef">Chef</a>, to name a few.</p>
<h2 id="isnt-this-devops-thing-some-buzzword-out-there">Isn’t this DevOps thing some buzzword out there?</h2>
<p>Let me tell you a first hand experience of mine which really made me think about provisioning tools.</p>
<p>I am currently working on a project which deals with <code class="language-plaintext highlighter-rouge">OpenCV</code> 3 with some python dependencies thrown around. We have deployed the demo app on a humble 512MB RAM, 20 gig SSD DigitalOcean droplet which runs Xenial Xerus (64-bit).</p>
<p>The general process of getting the development env up and running for OpenCV involves</p>
<ul>
<li>Updating and upgrading the dist packages</li>
<li>Installing developer toolchains for compiling and installing the <code class="language-plaintext highlighter-rouge">C++</code> source code</li>
<li>Installing tons of libraries for image and video formats</li>
<li>Getting the actual source code of OpenCV and OpenCV contrib which we are going to use</li>
<li>Configuring the particular version of OpenCV required by us</li>
<li>Compiling it</li>
</ul>
<p>And the list goes on!</p>
<h2 id="but-you-can-still-write-a-shell-script-for-it-right">But you can still write a shell script for it. Right?</h2>
<p>Yes! You absolutely can. No doubt.</p>
<p>I will go one step further and say that it DOES take time and resources to learn a config management tool and gain enough working knowledge to be productive with it.</p>
<p>But here are some points in favor for config mgmt. tools</p>
<ul>
<li>They can be re-used, distributed. Chef has cookbooks, juju has charms, Perl has CPAN, Java has Maven.</li>
<li>The DSL’s do take away some freedom, but they are much cleaner.</li>
<li><strong>Idempotency</strong>: you can safely re-run it any number of times; each time it will reach the desired state, remain there, or move closer to it.</li>
<li>Scalable!</li>
<li>OS Agnostic.</li>
<li>Version Controlling - In short, maintaining Infrastructure as Code.</li>
<li>Easy to write</li>
</ul>
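<p>The idempotency point is worth a tiny demonstration. The sketch below is my own toy stand-in for something like Ansible’s <code class="language-plaintext highlighter-rouge">lineinfile</code> module: no matter how many times it runs, the file converges to the same desired state.</p>

```shell
# Toy idempotent step: ensure a config line exists, appending it only
# when it is missing, so re-runs change nothing.
CONF=$(mktemp)

ensure_line() {
  grep -qxF "$1" "$CONF" || echo "$1" >> "$CONF"
}

ensure_line "PermitRootLogin no"
ensure_line "PermitRootLogin no"   # second run is a no-op

grep -c "PermitRootLogin no" "$CONF"   # the line appears exactly once
```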
<p>This is an example <code class="language-plaintext highlighter-rouge">ansible</code> task which updates the <code class="language-plaintext highlighter-rouge">apt</code> cache for the target server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
- name: Updating the apt-cache
  become: yes
  apt: update_cache=true
</code></pre></div></div>
<p>Looks good?</p>
<h2 id="why-ansible">Why Ansible?</h2>
<p>I just wanted to get started with some or the other tool! That’s it.</p>
<p>I am pretty sure that <code class="language-plaintext highlighter-rouge">chef</code>, <code class="language-plaintext highlighter-rouge">puppet</code>, <code class="language-plaintext highlighter-rouge">salt stack</code> and the like are equally good and will serve your use case. So do check them out too and have a feel around the different hammers around the market.</p>
<p>But one great thing about Ansible is that you don’t need a PKI architecture or some special communication protocol for managing its nodes. It (your manager node) just uses plain old SSH to communicate with its nodes, although older versions of ansible used to communicate via paramiko, a python SSH-2 implementation.</p>
<h2 id="a-simple-comparison">A simple comparison</h2>
<p>Nothing much, just a simple <code class="language-plaintext highlighter-rouge">apache2</code> virtual hosts setup for your VPS.</p>
<h4 id="manual-install">Manual install</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install apache2
$ cd /etc/apache2
/etc/apache2 $ sudo cp /files/awesome-app sites-available/awesome-app.conf
/etc/apache2 $ sudo chmod 640 sites-available/awesome-app.conf
/etc/apache2 $ sudo rm sites-enabled/default sites-enabled/default-ssl
/etc/apache2 $ sudo ln -s /etc/apache2/sites-available/awesome-app /etc/apache2/sites-enabled/awesome-app
/etc/apache2 $ sudo service apache2 restart
</code></pre></div></div>
<h4 id="shell-script">Shell script</h4>
<p>You can put all the necessary commands above and put it inside a <code class="language-plaintext highlighter-rouge">provision.sh</code> file and then call it using <code class="language-plaintext highlighter-rouge">$ sh provision.sh</code></p>
<h4 id="ansible-playbook">Ansible playbook</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- hosts: web
  become: yes # for escalated privileges
  tasks:
    - name: Installs apache web server
      apt: pkg=apache2 state=installed update_cache=true
    - name: Push default virtual host configuration
      copy: src=files/awesome-app dest=/etc/apache2/sites-available/awesome-app mode=0640
    - name: Disable the default virtualhost
      file: dest=/etc/apache2/sites-enabled/default state=absent
      notify:
        - restart apache
    - name: Disable the default ssl virtualhost
      file: dest=/etc/apache2/sites-enabled/default-ssl state=absent
      notify:
        - restart apache
    - name: Activates our virtualhost
      file: src=/etc/apache2/sites-available/awesome-app dest=/etc/apache2/sites-enabled/awesome-app state=link
      notify:
        - restart apache
  handlers:
    - name: restart apache
      service: name=apache2 state=restarted
</code></pre></div></div>
<p>The above configuration is just a simple POC showing the relative ease of using a tool like <code class="language-plaintext highlighter-rouge">Ansible</code>. You can do a lot more: provision full-blown distributed servers from scratch, load balance them by putting something like HAProxy in front, and what not!</p>
<p>So it’s up to you to decide what suits you best.</p>
<p>Cheers!</p>
<p>On a side note, I automated installing and setting up the dev environment for OpenCV 3 with Python 3 using Ansible. As usual, the code lives on <a href="https://github.com/tasdikrahman/opencv3-ansible-vagrant-playbook">GitHub</a>.</p>
More than 18 stops, a little less than 1800kms, Backpacking Trip To Himachal Pradesh
2016-12-22T00:00:00+00:00
https://www.tasdikrahman.com/2016/12/22/solo-backpacking-trip-himachal-pradesh-shimla-manali-kullu-kasol-manikaran-bhuntar-triund-dharamshala-mcleodganj-budget-india
<h2 id="trip-itinerary">Trip itinerary</h2>
<p>We all have that one trip with friends which gets cancelled no matter what, right (Goa, anybody)?</p>
<iframe src="https://uploads.knightlab.com/storymapjs/f8903d09026b4be3c6e94eca06c8e73e/himachal-trip/index.html" frameborder="0" width="100%" height="800"></iframe>
<p>Thankfully, this time our planned trip actually happened. It wasn’t all that smooth, though. But as the saying goes:</p>
<blockquote>
<p>What is an ocean if it doesn’t challenge the sailors!</p>
</blockquote>
<p>Thanks to the recent <a href="http://www.wsj.com/articles/indias-demonetization-debacle-1481851086">demonetization drive</a> by the Indian govt., we had to delay our plans for the trip by some days.</p>
<h2 id="shimla">Shimla</h2>
<p>I started from Dehradun on the night of 28th of November. Left for Delhi in the middle of the night, after much cajoling and giving reassurances to my mum that I wouldn’t be doing bat-sh*t crazy stuff on my trip. Which was not entirely fulfilled as you will see after some time.</p>
<p>The bus ride was full of surprises, as air seeped in like water through the narrow gaps in the metal sheets. And how could I miss the creaking noise of the suspension! All this meant I got little to no sleep throughout the night. Add to that how cold it was outside.</p>
<p>I was dropped off well before my destination, the reasons cited being excessive fog and traffic. Panic? Rushing to the nearest metro station was the answer. I reached Kashmere Gate, from where we had our next bus to Shimla. Devaraj tagged along from there.</p>
<p>At least this bus was a decent one. I caught up on the sleep I had missed, while Devaraj blasted EDM on his earphones!</p>
<p>We reached quite late. Late by the standards of Shimla! It was 8:30 p.m. and we could not see even a dog roaming around! The next shocker came when we got to know that the hotel we had booked was another two-hour journey from where we were.</p>
<p>No wonder it was cheap when we booked! Heck it was in the middle of the jungle!</p>
<p>We cancelled the reservation and asked the taxi driver(as it was the only means of transport available at that time) to take us to a decent hotel within our specified budget.</p>
<p>250 bucks was what he took for just 2 kms. Even airlines charge less than that! Heck, a normal bus ticket from Dehradun to Delhi costs around 230 INR, and that is a distance of around 240 kms. Had the sun not set, I would have called it daylight robbery!</p>
<p>The sweet talking manager showed us our rooms which had a great view of shimla from the window. We settled down and changed. He continued babbling about the package which he was providing us. “The cheapest rates in the country”, that’s what he said and I quote it. We said we would think about it and let him know in the morning! So logically, this means we did not accept his terms right? So far so good. Read along.</p>
<center><img src="/content/images/2016/12/room_shimla.jpg" /></center>
<p>We had dinner at the local Chinese cuisine shop after which we roamed around the famous mall road. Handicrafts were priced off the roof. No wonder, given the amount of tourists which flocked this city.</p>
<center><img src="/content/images/2016/12/room_shimla_1.jpg" /></center>
<p>Morning gave us a rude shock. This guy (the same sweet-talking manager) was trying to dupe us! As we were checking out, he said he had already booked a car for our two-day tour of Shimla, the car charges being 2.5k INR per day, excluding the room charges. Angry? Oh, a lot!</p>
<p>He kept asking 3x the room amount for the car that he had supposedly booked. Now mind you that we are not two drunk dudes who would just randomly say yes to any BS someone would present to us. This day being no different.</p>
<p>After much shouting and drama, we settled with him for the initial booking amount of the room for the day.</p>
<center><img src="/content/images/2016/12/hanuman.jpg" /></center>
<p>Just when we were thinking that this trip was going downhill, we had our impromptu rafting trip.</p>
<p>And boy, it was amazing! 25 kms of raw nature. I cannot articulate the feeling. Three hours later, after countless rapids and being drenched to the bone, we reached the spot from where we would be picked up by the trucks. We changed our wet clothes for dry ones and hopped into our ride back. I slept like a baby.</p>
<center><img src="/content/images/2016/12/rafting_1.jpg" /></center>
<p>The day ended on a high with some adventure sports at Kufri, which I missed owing to my reluctance to wake up from my slumber (read: drained out).</p>
<center><img src="/content/images/2016/12/kufri.jpg" /></center>
<p>Luckily, the second hotel room we booked had a nice steam bath. Yes, a steamed bathroom. Add to that wooden planks and comfy blankets.</p>
<h2 id="manali">Manali</h2>
<p>The two nights spent in Shimla went just fine. And on the third day, we left for Manali early in the morning.</p>
<center><img src="/content/images/2016/12/manali1.jpg" /></center>
<p>Before I write any further, I confess that we were among those who didn’t know that Kullu and Manali were two different places altogether. Hey, I never said I am good at geography!</p>
<p>So before Manali, you arrive at Kullu if you go by road from Shimla.</p>
<blockquote>
<p>For a change, not everyone was trying to loot us. And it was a little warmer than Shimla. We got a decent room for ourselves and settled for the night.</p>
</blockquote>
<p>Manali was a little over 9kms from our place and we left early the next morning.</p>
<center><img src="/content/images/2016/12/manali2.jpg" /></center>
<p>It was a mix of Shimla’s mall road and valleys that you find around in Pune. We visited the local Tibetan monasteries, bought a lot of souvenirs and some indigenous handicrafts.</p>
<center><img src="/content/images/2016/12/manali3.jpg" /></center>
<p>We returned back to Kullu the same night. ATMs were scarce. Further, finding ATMs which had any money in them was like finding water in a desert. But we survived the night.</p>
<center><img src="/content/images/2016/12/manali5.jpg" /></center>
<p>Kheerganga is a good destination for trekking if you are in the mood for any of it. But as we didn’t plan it, we left it out for another reason(read on)</p>
<center><img src="/content/images/2016/12/manali4.jpg" /></center>
<p>Van Vihar should have had a “big” prefixed to its name. Minus the fact that it was not snowing, I liked it more than Shimla.</p>
<center><img src="/content/images/2016/12/manali6.jpg" /></center>
<h2 id="kasol">Kasol</h2>
<p>Along the lines of Shimla, two nights was all we spent at Kullu.</p>
<center><img src="/content/images/2016/12/kasol1.jpg" /></center>
<p>The next morning, we paid off the hotel manager. And asked him how to reach Kasol. And boy was he happy on hearing that word.</p>
<p>Imagine a tall unshaven guy, wearing a tattered blue denim with a pair of Quechua’s. Pair that along with a black hoodie and a monkey cap not of the ordinary types.</p>
<p>We first had to reach Bhuntar and then take a bus to Kasol. We reached Kasol just before sunset and got ourselves a good room. It was a real steal at the price we got it for.</p>
<p>You could find the essence of Israeli culture almost everywhere. The shops, the cafes. All had menu cards, graffiti and wall posters in Hebrew.</p>
<center><img src="/content/images/2016/12/kasol2.jpg" /></center>
<p>You ask why our room was a good deal? It was situated right beside the Parvati river. We had the fabled Dhauladhar mountain range right in front of us. Add the fact that we were right in the middle of the Parvati valley. Picturesque? You bet!</p>
<center><img src="/content/images/2016/12/kasol4.jpg" /></center>
<p>The main Kasol is a small place and you can literally cover it within an hour by foot. All the shops and cafes are situated around a T-shaped area.</p>
<p>This place is a must for anybody who is drawn to the <code class="language-plaintext highlighter-rouge">hippie culture</code>. You will find Israelis buying groceries, talking in fluent Hindi. People roam around wearing long beaded jewelry, colorful khaki pants (imagine a purple tinge, for example), flower-print tops, and unkempt, dirty long hair (no offence) filled with beads.</p>
<p>Some Beatles and Bob Dylan music, and there you are, back in the late-1960s hippie culture of the United States.</p>
<center><img src="/content/images/2016/12/kasol3.jpg" /></center>
<h2 id="manikaran">Manikaran</h2>
<p>Manikaran is a little over 7-8 kms from Kasol. It was a little bigger than our previous stop. The only place worth visiting here is the Ram temple. It has a natural hot spring in its vicinity, where you can find people overjoyed, playing in it!</p>
<center><img src="/content/images/2016/12/manikaran1.jpg" /></center>
<p>And no, we didn’t take a dip inside it. Why? Short on time. I must warn you that the irregular bus timings here will really make you mad.</p>
<p>A one night stay at Kasol was enough for the whole thing!</p>
<center><img src="/content/images/2016/12/manikaran3.jpg" /></center>
<h2 id="dharamshala">Dharamshala</h2>
<p>This trip from Manikaran was the most grueling of bus trips till now.</p>
<p>Manikaran to Kasol was the first leg, which went swiftly. Next stop was Bhuntar. After that we had to catch a bus for Mandi. All this consumed a lot more than our patience and time. We reached Mandi just a little after 4 o’clock.</p>
<p>Such was our luck that we just missed a direct bus for Dharamshala a few minutes before arriving. We were famished! Our stomachs were empty since the raw Israeli breakfast that we had back at Kasol(it was Devaraj’s idea).</p>
<center><img src="/content/images/2016/12/kasol5.jpg" /></center>
<p>We had some momo’s at a local shop near the bus stand and came back asking for directions.</p>
<p>The alternatives at hand were to either catch a bus for a place near Dharamshala’s vicinity and then pray that we got another bus from there at around 9p.m!</p>
<p>Or we could wait for another direct bus to Dharamshala. The catch here was that we would be reaching there not anytime before 12a.m. Oh, did I mention we stopped believing in any advance room bookings since the Shimla incident?</p>
<p>And we could have always booked a room at Mandi and waited for the next morning and hope that everything would pan out perfectly while we slept our bums out.</p>
<p>Given how reckless we are, we chose the 6 o’clock bus. If you have ever gotten into a BEST bus (the local state transport in Bombay), ours was worse than that. I could hardly position my legs comfortably. Sleeping was out of the question. But we had to endure. Did we have any other choice?</p>
<p>Fast forward some uncomfortable hours, and I was jolted by a thud at my feet. Next thing I know, a man was literally down at my feet, a tinge of alcohol rising from his breath. It became apparent that it was not the violent jerks (which the driver didn’t spare the passengers) that had thrown him off his seat.</p>
<p>His fellow mate (or what I assumed him to be) swooped him up from my feet before I could even keep my bag aside to help. He placed him back in his seat.</p>
<p>We reached Dharamshala’s ISBT a few minutes shy of 12:30 a.m. In search of a hotel, we started roaming the streets. There was nobody to be found, let alone dogs. Well, I didn’t expect anybody to be out of their houses at that unearthly hour, given the experience at Shimla where the whole of Mall Road closes by 9:30 p.m.</p>
<p>After some hurried searching, we got a decent room and settled down. The squeamish feeling I had had since Mandi hadn’t gone away, so I didn’t want anything to eat. But Devaraj was feeling uncomfortably hungry. He went out in search of some food at a time when most people are having their first cycle of REM sleep!</p>
<p>To both of our surprise, he did find a local <code class="language-plaintext highlighter-rouge">dhaba</code> serving <code class="language-plaintext highlighter-rouge">roti sabzi and dal</code>. While Devaraj was gobbling down his supper, I read on that our honorable Chief Minister, Jayalalitha had passed away that day! News of which came as a shocker for both of us.</p>
<p>Sleep came naturally and both of us greeted our families on phone with a good afternoon as it was past 12 when we woke up the next day.</p>
<h2 id="sidhpur">Sidhpur</h2>
<p>Next stop was the famed Norbulingka institute.</p>
<center><img src="/content/images/2016/12/dharamshala1.jpg" /></center>
<p>I would say it is one of the best places for someone who wants to have a closer look at Tibetan culture, Tibetan history, and how Tibetans live their daily lives.</p>
<center><img src="/content/images/2016/12/dharamshala3.jpg" /></center>
<p>You have workshops which teach you about the traditional wood carving practices.</p>
<center><img src="/content/images/2016/12/dharamshala5.jpg" /></center>
<p>The 3-D wall paintings which you can see on monasteries, the bronze buddha idols which are hand crafted to perfection in the local workshop. It was beautiful!</p>
<center><img src="/content/images/2016/12/dharamshala2.jpg" /></center>
<p>The whole compound resonated rich Tibetan architecture. They had a Museum which had exhibits displaying how different sects of the Buddhist communities dressed. All the way from the common man to the wives of kings and their family.</p>
<center><img src="/content/images/2016/12/dharamshala4.jpg" /></center>
<p>We thought about dropping by the souvenir shop inside to find something in our budget but it was the exact opposite! But I won’t complain as this institute is self sustained and lives off these earnings. Very well maintained. I give them that.</p>
<center><img src="/content/images/2016/12/dharamshala7.jpg" /></center>
<p>We came back to our room and had our dinner after some time and went to sleep.</p>
<center><img src="/content/images/2016/12/dharamshala6.jpg" /></center>
<h2 id="mcleodganj">McLeodGanj</h2>
<p>Way back in the 1930s, McLeodGanj was literally left in ruins when it was hit by a massive earthquake. It came back on the scene when the exiled Tibetan govt was given refuge in its hills.</p>
<p>The honorable Dalai Lama and his house of ministers all reside in and around that area. I found that Tibetan souvenirs, if they are to be bought, were much cheaper here in comparison to Kasol/Shimla/Manali. This was to be expected, as a large population of exiled Tibetan refugees stays here. So if you want to shop, this is the place to do so!</p>
<p>We hopped around some shops and had our refreshments, keeping in mind the affordable shops to buy from on our way back. And we left for Triund.</p>
<h2 id="triund">Triund</h2>
<p>As planned, our trek to Triund started when we reached McLeodGanj. Heck we had to give up on Kheerganga for this one.</p>
<p>We started at around 12:30 p.m and expected the 7 km uphill trek to be completed within 3-4 hours. At least that’s what we thought.</p>
<p>The path to the top of the hill is not a straight climb but full of ridges and small rocks. Mind you we had our rucksacks and in my case it was no less than 10 kilos(thanks a lot to the carefree shopping we did in Manali and elsewhere!) which I had to balance with the body weight.</p>
<p>Loosely placed rocks, rubble, dry grass mixed with the dung of the local herders’ cattle. You name it! One slip and down you go.</p>
<center><img src="/content/images/2016/12/triund2.jpg" /></center>
<p>The initial wider and plainer track gave way to a much narrower and harder path. We had moments with an extremely beautiful view of McLeodGanj in front of us, and there were times when Devaraj would almost collapse (black out, if you may) due to fatigue. He has a heart condition which he hadn’t told me about before. But nevertheless.</p>
<center><img src="/content/images/2016/12/triund3.jpg" /></center>
<p>There were 3-4 shops placed (2 were closed the day we visited) at the edge of the path where you could buy chocolates, water bottles and Maggi. They sold things at 2x the MRP but I think this is justified.</p>
<p>The only means of taking goods up the hill is donkeys. Each donkey costs 800 INR per trip from the campsite to the hill top.</p>
<center><img src="/content/images/2016/12/triund1.jpg" /></center>
<p>The shops have dustbins where they store all the garbage, which is handed back to the same donkeys on their U-turn to be taken downhill. The charge for this is cheap: 9 INR per bag. Enough with the tariffs.</p>
<p>So we reached the famous <code class="language-plaintext highlighter-rouge">magic view cafe</code>. Well, it’s not exactly the typical CCD or Barista that we frequent, but apparently it’s the oldest <code class="language-plaintext highlighter-rouge">chai</code> point on the way to Triund’s top.</p>
<p>After some refreshments and countless pit stops, we did reach the top. I mean finally!</p>
<p>We reached just before sunset, and the view! It was ethereal. The sun’s rays falling on the Dhauladhar mountain range. The dogs running around and playing with each other, occasionally coming to the lap of the tea-stall owner and wagging their tails for some biscuits.</p>
<center><img src="/content/images/2016/12/triund4.jpg" /></center>
<center><img src="/content/images/2016/12/triund6.jpg" /></center>
<p>Pain, fatigue and thirst were long lost friends when we sat down and drank the last few gulps of water left in our bottles. I felt saintly!</p>
<center><img src="/content/images/2016/12/triund11.jpg" /></center>
<blockquote>
<p>Devaraj was as usual puffing on to one of his Marlboro’s while I checked my cell phone for any coverage. Zero signal! But what else do you expect at 2875 meters above sea level!</p>
</blockquote>
<center><img src="/content/images/2016/12/triund8.jpg" /></center>
<p>We got our tents fixed, settled our bags in them, and got out to join the others at the bonfire. We had a very diverse group of people in our vicinity: some came from Mumbai, a group of students from Delhi University, and some from Banaras.</p>
<center><img src="/content/images/2016/12/triund7.jpg" /></center>
<p>As we devoured our modest plates of <code class="language-plaintext highlighter-rouge">Dal Chawal</code> (which cost a bomb!), we talked about our experiences and pasts, our travel shenanigans, and discussed our plans in life like we were childhood friends reunited after a long time!</p>
<p>Time for sleep came a lot early for me as I was feeling really cold and fatigued. I slept with 3 layers of clothes, a muffler, a bunny cap and a sleeping bag on top of it! And still I was shaking vigorously due to the cold for some time.</p>
<p>So here we were watching “Harry Potter and the Deathly Hallows” on Devaraj’s phone, sharing his earpieces and finishing the last of the Lays packet we had on us, and before I knew it I was fast asleep.</p>
<blockquote>
<p>I don’t know what the time was, but I woke up to some heavy breathing nearby. I thought it was Devaraj vaping but he was fast asleep.</p>
</blockquote>
<p>Turned out the dog decided to sleep right next to my side of the tent. I could literally feel his hot and heavy breathing, continuously near me. I told Devaraj that a black bear (the chaiwala told us that they frequented the path uphill in search of food sometimes) came to visit us in the morning! And he was gullible enough to even say OMG for a second.</p>
<p>It must have been around 6 o’clock in the morning when we opened the tent sheets, and this was in front of us.</p>
<center><img src="/content/images/2016/12/triund9.jpg" /></center>
<p>You will actually have to be there to feel what it felt like. On one side we had the Dhauladhar mountain range in front of us, which looked like a golden cone crumpled all the way from the top, and on the other side we saw the city of McLeodGanj through the misty clouds.</p>
<p>You might wonder if there are any houses up on the hilltop. Well, no, just some tea stalls put together with large stones, and the tents.</p>
<p>So would there be any bathrooms? Of course not! There is not even a makeshift loo for you. So what’s the solution?</p>
<blockquote>
<p>You have to figure out a spot for yourself and take a dump. While doing so, you have to hope that nobody comes around in search of their own perfect spot. I don’t know how it compares to taking a dump near a railway track, as I haven’t done that.</p>
</blockquote>
<p>The climax comes when the spot you are squatting over is on a cliff. So if you accidentally slip, you are not coming back down those 2780 meters alive! Hope you have your insurance done?</p>
<p>After a bowl of Maggi, packing our rucksacks, and bidding goodbye to the friends we had made, we left the summit.</p>
<p>From Dharamshala, Devaraj got on a bus to Delhi and I boarded one to Chandigarh, which I reached at around 11:30 p.m. Mum again suggested staying over for the night, but I badly wanted to reach home.</p>
<p>With all the bad decisions so far, this was no better. I reached Dehradun at around 3:30a.m.</p>
<p>No Taxi, no auto. No nothing! Oh no, I didn’t walk all the way to my house!</p>
<p>I did wait for the night to get over at the bus stand. Boarded the first city bus and got off near my house.</p>
<p>Best trip till now for me!</p>
<center><img src="/content/images/2016/12/triund10.jpg" /></center>
<p>So that was much about it. If you have read till the end, I sincerely thank you for being so patient!</p>
<p>By the way, I have hardly pinned any photos from the trip here. So if you want a deeper look, <a href="https://instagram.com/tasdikrahman/">my friends on Instagram</a> will surely complain about my never-ending feed of this trip’s photographs!</p>
<p>Godspeed!</p>
Demystifying how imports work in python, ChennaiPy
2016-10-24T00:00:00+00:00
https://www.tasdikrahman.com/2016/10/24/demystifying-how-imports-work-in-python-ChennaiPy
<h2 id="22nd-october-2016">22nd October, 2016</h2>
<p>The clock goes overboard and tries waking me up, which it has been unsuccessfully attempting for the last two months. Thanks, ma, for the lovely gift. But I woke up to the sweet melody of my roommate’s snoring.</p>
<p>Anyway, I was still intoxicated by last night’s coffee. I remembered I still had to finish the slides of the talk
that I had to give over @ <a href="http://www.meetup.com/Chennaipy/events/234639862/">ChennaiPy, October Meetup’16</a>.</p>
<p>So finally, I did complete the slides, and off we went to the IMSc campus in Chennai, where the meetup was to be held. This was my second time there.</p>
<h2 id="deep-dive-into-dictionary">Deep dive into dictionary</h2>
<p>The first talk was given by the very talented <a href="https://twitter.com/makernaren/">Naren</a>, who works as a backend engineer @ <a href="http://vue.ai/">Mad Street Den</a>. He explained how the internals of dictionaries are implemented as hash tables in Python.</p>
<h2 id="demystifying-how-imports-work-in-python">Demystifying how imports work in python</h2>
<p>Fun fact: this was the talk whose slides I had been preparing earlier that morning. All in all, it was a very good feeling to give a talk in front of a large crowd.</p>
<p>Here are the slides.</p>
<script async="" class="speakerdeck-embed" data-id="df1b0dd2c89b44678015f3565c876881" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p><strong>Code snippets:</strong> <a href="https://github.com/tasdikrahman/talks">https://github.com/tasdikrahman/talks</a></p>
<h2 id="introduction-to-selenium">Introduction to selenium</h2>
<p>Mayur did a great job in introducing us to testing using selenium. Was a fun talk overall.</p>
<center><img src="/content/images/2016/10/chennaipy1.jpg" /></center>
<p>That’s me by the way. Thanks to the bad camera quality <em>sigh</em>.</p>
<center><img src="/content/images/2016/10/chennaipy2.jpeg" /></center>
<center><img src="/content/images/2016/10/chennaipy3.jpeg" /></center>
<center><img src="/content/images/2016/10/chennaipy5.jpeg" /></center>
<p>The lovely crowd for the meetup. Cheers to them!</p>
Pycon India 2016, New Delhi
2016-10-19T00:00:00+00:00
https://www.tasdikrahman.com/2016/10/19/Pycon-India-2016
<p>This year was no different: I attended PyCon India yet again (my 3rd one). The only difference was that this time it was held in New Delhi instead of Bangalore.</p>
<p>Met many of my old friends, made some new ones, interacted with some really interesting people and should I say met some legendary guys/gals too.
All in all it was just like the previous year. Felt just like home!</p>
<h2 id="talks">Talks</h2>
<p><a href="https://twitter.com/punchagan">Puneet</a> gave a really interesting talk about how to generate tests rather than actually write them. <a href="https://twitter.com/anandology">Anand</a> was at his best as usual, and his talk on decorators was truly delightful. All in all, every talk had something to take home.</p>
<p>On top of it, we had <a href="https://amueller.me/">Andreas Mueller</a> with us this time. He is a core contributor to and maintainer of the <a href="http://scikit-learn.org/">scikit-learn</a> machine learning library.</p>
<p><a href="https://in.linkedin.com/in/levkonst">Lev</a> was there too. He is the community manager for the <a href="https://radimrehurek.com/gensim/">gensim library</a>. <a href="https://twitter.com/ghoseb">BG</a> from helpshift. <a href="https://www.linkedin.com/in/vanlindberg">Van</a> from Rackspace International.
And the list just goes on!</p>
<h2 id="ending-note">Ending note</h2>
<p>I could write another two paragraphs about it but as I have my mid sems starting tomorrow, I will keep it short and try opening those notes
I borrowed.</p>
<p>Some pictures to drool by.</p>
<center><img src="/content/images/2016/10/pycon1.jpg" /></center>
<center><img src="/content/images/2016/10/pycon2.jpg" /></center>
<p>If you are looking for me, I am the unkempt haired guy with the black wingify T-shirt</p>
<center><img src="/content/images/2016/10/pycon3.jpg" /></center>
<p>Pycharm spreading some of their love</p>
<center><img src="/content/images/2016/10/pycon4.jpg" /></center>
<p>That was Lev and his Random Forest T. ML guys would get the reference here.</p>
<center><img src="/content/images/2016/10/pycon5.jpg" /></center>
<p>And that is Van. What a humble guy. Given that he is a Vice President @ RackSpace.</p>
<p>The only thing I regretted was not proposing a talk for this year’s PyCon. But I hope I will do so for next year’s. See you there then!</p>
My internship experience at Wingify (VWO team), New Delhi
2016-08-13T00:00:00+00:00
https://www.tasdikrahman.com/2016/08/13/My-internship-experience-at-Wingify-new-delhi-visual-website-optimiser-vwo
<p>As I sit here at the Delhi airport waiting for my flight back to Chennai, I just could not stop myself from thinking about my time as an intern at Wingify, which ended last week. Here’s what I wrote down after getting carried away with several cups of coffee (thanks for luring me in with that smell, Costa Coffee).</p>
<p>So here it is then!</p>
<h2 id="day-1">Day 1</h2>
<p>It was 5 o’clock in the morning and I was quite drowsy, the reason being the all-nighter I had pulled the night before for the last exam of our end sems. <strong>phew</strong></p>
<p>Here I was at the Delhi airport just a day after my semester exams, ready to start with my internship. Talk about eagerness here!</p>
<p>A small part of me was also happy that I was moving out of Chennai! (at least for some time)</p>
<p>Joined them the next day in their main office at the heart of NSP, Delhi.</p>
<center><img src="/content/images/2016/08/wingify_rig_mine.jpg" /></center>
<p>Now I was naturally excited to work in a company which had grown and become one of the best startups in India in such a short span of time. On top of that, this was my first internship in a well-established product based start-up and I was hoping that I could learn all that I could and perform in accordance to their standards.</p>
<center><img src="/content/images/2016/08/wingify_officespace_2.jpg" /></center>
<p>I was introduced to Ankit Jain (Lead software Engineer at Wingify) and Ajay Sharma (Senior Software Engineer) by my HR. We had a brief chat where I was told I would be working with the Backend Development team for VWO, their flagship product.</p>
<center><img src="/content/images/2016/08/wingify_officespace_1.jpg" /></center>
<p>Talking about VWO, it’s the world’s easiest A/B testing tool. And we are quite good (read “The Best”) at it! The month before I joined, we had monthly sales crossing a little over 1 million dollars.</p>
<p>After getting my development environment up and ready, I was given my first project.</p>
<h3 id="integration-of-statsd-and-graphite-project-1">Integration of Statsd and graphite (Project #1)</h3>
<p>StatsD collects and aggregates metrics and then ships them off to Graphite which stores the time-series data and enables us to render graphs based on these data.</p>
<p>Graphite consists of three parts.</p>
<ul>
<li>
<p>carbon: a daemon that listens for time-series data.</p>
</li>
<li>
<p>whisper: a simple database library for storing time-series data.</p>
</li>
<li>
<p>webapp: a (Django) webapp that renders graphs on demand.</p>
</li>
</ul>
<p>Setting up the overall stack was a bit archaic, but I finally got it right, and the metrics for our internal service were being graphed correctly by Graphite. And they looked pretty too!</p>
<p>Coming back: the service on which we integrated StatsD and Graphite runs on several servers, so while plotting the graphs we wanted to know which server each stat was being pushed from into StatsD’s buckets. That was much about it.</p>
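<p>As a minimal sketch of that idea (not the actual service code; the metric name, the hostname-prefix scheme, and the helper names are illustrative assumptions, though 8125 is StatsD’s default UDP port), pushing a counter namespaced by the server’s hostname looks roughly like this:</p>

```python
import socket

def format_metric(name, value, metric_type="c", prefix=None):
    """Build a StatsD wire-format line, e.g. 'web01.api.requests:1|c'.

    Prefixing the bucket with the server's hostname is one simple way to
    tell apart stats pushed from different servers on the same graph.
    """
    bucket = f"{prefix}.{name}" if prefix else name
    return f"{bucket}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the StatsD daemon (no ack, no retry)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Namespace the counter by this machine's hostname before shipping it off.
server = socket.gethostname()
send_metric(format_metric("api.requests", 1, "c", prefix=server))
```

Because the transport is UDP, sending a metric never blocks or fails loudly, which is exactly why StatsD is safe to sprinkle through hot code paths.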
<h3 id="bumblebee---an-experimental-slack-bot-vwo-project-2">Bumblebee - An experimental slack bot VWO (Project #2)</h3>
<p>Wingify has this culture of organizing hackathons at the end of every month, where people from the engineering team come together to hack on something which they want to see at VWO.</p>
<p>To be honest, I was quite clueless about what to build for the first half an hour or so, and after a little nudge from Ankit I decided upon Bumblebee. Bumblebee makes use of the beautiful VWO API to provide functionalities (if not all) to the VWO account holder right from the comfort of their Slack channel. You can get details of all the campaigns in your account, check their status (whether they are running, paused et al.), update the status to start/stop/pause a particular campaign, share your campaign with someone else, and some more things.</p>
<p>It was written in python, and Ankit was kind enough to let me open source it. Here is the link for the curious.</p>
<p><a href="https://github.com/wingify/bumblebee">https://github.com/wingify/bumblebee</a></p>
<h3 id="optimization-much-project-3">Optimization much? (Project #3)</h3>
<p>My 3rd project revolved around optimizing an internal service for performance. I implemented roughly 3 approaches, and the last one improved performance by up to 23.6%. I could have tried to push it further, but sadly the end of my internship was looming around the corner, so I stopped there. And that was the 3rd and last project I did as an intern at Wingify.</p>
<h2 id="how-was-my-experience">How was my experience?</h2>
<p>My experience? I loved it there!</p>
<ul>
<li>
<p>Solving hard engineering problems. Check</p>
</li>
<li>
<p>Extremely talented engineering team. Check</p>
</li>
<li>
<p>Approachable mentors. Check</p>
</li>
</ul>
<p>Heck, here I am with Abhishek Batra, the Android guru here, and Abhishek Garg, our DevOps pro. Turns out we all got along quite well with each other and with the rest of the team, and yes, these guys are quite senior to me :)</p>
<center><img src="/content/images/2016/08/wingify_tasdik_abhishek.jpg" /></center>
<ul>
<li>
<p>Awesome Work Culture. Check</p>
</li>
<li>
<p>Delhi :P. Check</p>
</li>
</ul>
<p>Jokes apart, I made some really good friends back there and learned a ton from everyone. I am proud that I was part of a team which is building something that people love and that has an impact on thousands of customers.</p>
<h2 id="so-what-now">So what now?</h2>
<p>Looking back at the time when I received a call from Nupur about my acceptance as an intern at Wingify, I was debating whether to join over the 6-7 odd other companies that had accepted my application for a summer internship.</p>
<center><img src="/content/images/2016/08/wingify_table_tennis.jpg" /></center>
<p>After the two months I have spent here at Wingify, I now believe I did just the right thing in choosing Wingify over the others!</p>
<p>And did I mention that they gave me a pre-placement offer :) ?</p>
<p>This was taken on my last day at office. And boy was I sad!</p>
<center><img src="/content/images/2016/08/wingify_newdelhi_tasdik.jpg" /></center>
<p>Ankit threw a huge party at a pub in Rajouri garden for all the interns along with the other guys and my last day turned out to be the best day in Delhi.</p>
<center><img src="/content/images/2016/08/wingify_2016_summer_interns.jpg" /></center>
<p>Until next time Delhi!</p>
<p><em>This post was cross posted at <a href="http://team.wingify.com/tasdik-talks-about-his-internship-experience-at-wingify">Wingify’s team blog too</a></em></p>
Decorators 101 - A gentle introduction
2016-07-21T00:00:00+00:00
https://www.tasdikrahman.com/2016/07/21/Decorators-101
<h2 id="decorators-you-say">Decorators you say</h2>
<p>If you are familiar with <code class="language-plaintext highlighter-rouge">python</code>, chances are that you have already seen the decorator syntax. It comes off as a simple concept when being used, but when you try to get your head around the underlying details, you may find yourself in a fix, and are probably asking yourself</p>
<blockquote>
<p>How the heck does it work?</p>
</blockquote>
<p>Python does a very good job of abstracting away the intricacies, so much so that we almost take it for granted. Remember the routes in <code class="language-plaintext highlighter-rouge">flask</code>?</p>
<p>Adding a route is as simple as doing a</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/index/'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">hello</span><span class="p">():</span>
<span class="k">return</span> <span class="s">"hello there"</span>
</code></pre></div></div>
<p>Where the <code class="language-plaintext highlighter-rouge">@</code> symbol denotes the decorator syntax.</p>
<p>But before diving into decorators, discussing functions first seems appropriate.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">foo</span>
<span class="o"><</span><span class="n">function</span> <span class="n">foo</span> <span class="n">at</span> <span class="mh">0x7f59dc601aa0</span><span class="o">></span>
<span class="o">>>></span> <span class="n">foo</span><span class="p">(</span><span class="s">"tasdik"</span><span class="p">)</span>
<span class="s">'hello tasdik'</span>
<span class="o">>>></span> <span class="n">bar</span> <span class="o">=</span> <span class="n">foo</span>
<span class="o">>>></span> <span class="n">bar</span>
<span class="o"><</span><span class="n">function</span> <span class="n">foo</span> <span class="n">at</span> <span class="mh">0x7f59dc601aa0</span><span class="o">></span>
<span class="o">>>></span> <span class="n">bar</span><span class="p">(</span><span class="s">"body double"</span><span class="p">)</span>
<span class="s">'hello body double'</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>If you have been dabbling in python, this should look familiar.</p>
<blockquote>
<p>As you can see, functions can be assigned to other names</p>
</blockquote>
<p>Everything in python is a first class object, including <code class="language-plaintext highlighter-rouge">functions</code>, <code class="language-plaintext highlighter-rouge">classes</code> and everything else you thought could not be an object. Jokes apart, this paradigm is quite different from many other programming languages, but it has its own advantages.</p>
<p>Continuing with this idea of treating everything in python as a first class object: functions can also be passed to other functions!
Not sure about that? Here you go</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">a</span><span class="o">+</span><span class="n">b</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">diff</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">a</span><span class="o">-</span><span class="n">b</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">operation</span><span class="p">(</span><span class="n">func</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">func</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">operation</span><span class="p">(</span><span class="nb">sum</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="mi">30</span>
<span class="o">>>></span> <span class="n">operation</span><span class="p">(</span><span class="n">diff</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="o">-</span><span class="mi">10</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>We have passed around the function name to the function <code class="language-plaintext highlighter-rouge">operation()</code> as we pass around normal values.</p>
<p>Now what if I told you functions can be returned as return values for other functions! Let’s see how we do that</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">value</span><span class="p">):</span>
<span class="p">...</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">2</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">bar</span><span class="p">():</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">value</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">bar</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">bar</span> <span class="o">=</span> <span class="n">foo</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">bar</span><span class="p">.</span><span class="n">__closure__</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">cell_contents</span>
<span class="mi">10</span>
<span class="o">>>></span> <span class="n">bar</span><span class="p">()</span>
<span class="mi">10</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>Now that’s not too surprising, I suppose. The only gotcha here is that the inner function has access to the enclosing function’s variables. That is why the variable <code class="language-plaintext highlighter-rouge">value</code> is still accessible inside <code class="language-plaintext highlighter-rouge">bar()</code>, even after <code class="language-plaintext highlighter-rouge">foo()</code> has returned.</p>
<p>The <code class="language-plaintext highlighter-rouge">bar</code> function returned by <code class="language-plaintext highlighter-rouge">foo</code> demonstrates the closure property beautifully here, as it stores the value that was passed to <code class="language-plaintext highlighter-rouge">foo</code>.</p>
<p>Talking about closures, this can be used very cleverly in some cases</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">val1</span> <span class="o">=</span> <span class="n">foo</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">val1</span>
<span class="o"><</span><span class="n">function</span> <span class="n">bar</span> <span class="n">at</span> <span class="mh">0x7f59dc601de8</span><span class="o">></span>
<span class="o">>>></span> <span class="n">val1</span><span class="p">()</span>
<span class="mi">10</span>
<span class="o">>>></span> <span class="n">val2</span> <span class="o">=</span> <span class="n">foo</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">val2</span><span class="p">()</span>
<span class="mi">20</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>You can see the special behaviour demonstrated by <code class="language-plaintext highlighter-rouge">foo()</code> here: each function it returns remembers the value passed in, between calls. This property can be utilized to implement other features. It’s somewhat similar to a public and private interface, where <code class="language-plaintext highlighter-rouge">foo()</code> acts as the public function and the inner <code class="language-plaintext highlighter-rouge">bar()</code> as the private one.</p>
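<p>To make the idea concrete, here is a small sketch of a closure used to keep state between calls, a pattern often used for counters and accumulators:</p>

```python
def make_counter():
    """Return a function that remembers how many times it has been called."""
    count = 0
    def counter():
        nonlocal count  # reach into the enclosing scope (Python 3)
        count += 1
        return count
    return counter

clicks = make_counter()
clicks()  # 1
clicks()  # 2
```

<p>Here <code class="language-plaintext highlighter-rouge">count</code> plays the private part: there is no way to touch it from the outside except through the interface <code class="language-plaintext highlighter-rouge">make_counter</code> chooses to expose.</p>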
<h2 id="so-how-do-i-write-one">So how do I write one</h2>
<p>Simply put, decorators are nothing but functions which take another function and modify its behaviour without changing the original code</p>
<p>Confused? Let’s write one</p>
<p>This is most useful when we have a function and want to modify its output without touching its original source code, whether because we are not allowed to, or because it’s simply not possible. Whatever the reason may be, decorators are here to the rescue.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello there {}!"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<p>{}</p>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">wrap</span>
<span class="p">...</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="n">tagify_tasdik</span> <span class="o">=</span> <span class="n">tagify</span><span class="p">(</span><span class="n">greet</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tagify_tasdik</span><span class="p">(</span><span class="s">"tasdik"</span><span class="p">)</span>
<span class="s">'<p>hello there tasdik!</p>'</span>
</code></pre></div></div>
<p>So we just decorated the return value of a function!</p>
<h2 id="but-where-is-that--syntax-you-were-talking-all-along">But where is that <code class="language-plaintext highlighter-rouge">@</code> syntax you were talking all along?</h2>
<p>Don’t worry, here is an example, keeping in mind what we have discussed so far and the example above.</p>
<p>We don’t always have to do <code class="language-plaintext highlighter-rouge">tagify_tasdik = tagify(greet)</code> for decorating our function. Python provides some <code class="language-plaintext highlighter-rouge">syntactic sugar</code>
for doing the same.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<p>{}</p>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">wrap</span>
<span class="p">...</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="o">@</span><span class="n">tagify</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello there {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">(</span><span class="s">"foo"</span><span class="p">)</span>
<span class="s">'<p>hello there foo</p>'</span>
<span class="o">>>></span>
</code></pre></div></div>
<h2 id="chaining-one-or-more-decorators">Chaining one or more decorators</h2>
<p>As the title suggests, let’s say we want to decorate our function further, we can chain the decorators to get the desired output.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">p_tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<p>{}</p>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">content</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">wrap</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">h1_tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<h1>{}</h1>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">content</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">wrap</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">div_tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">wrap</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<div>{}</div>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">content</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">wrap</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="o">@</span><span class="n">div_tagify</span>
<span class="p">...</span> <span class="o">@</span><span class="n">h1_tagify</span>
<span class="p">...</span> <span class="o">@</span><span class="n">p_tagify</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello there {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">(</span><span class="s">"tasdik"</span><span class="p">)</span>
<span class="s">'<div><h1><p>hello there tasdik</p></h1></div>'</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>But wait a second! What do we have here</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">greet</span><span class="p">.</span><span class="n">__name__</span>
<span class="s">'wrap'</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>As you can see, the function’s name got changed to that of the wrapper decorating it, and this can cause a huge pain when you are debugging your programs.</p>
<p>But as usual, we have <code class="language-plaintext highlighter-rouge">functools</code> to the rescue</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">wraps</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">p_tagify</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="o">@</span><span class="n">wraps</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">decorate</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<p>{}</p>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">func</span><span class="p">(</span><span class="n">content</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">decorate</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="o">@</span><span class="n">p_tagify</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello there {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">(</span><span class="s">"tasdik"</span><span class="p">)</span>
<span class="s">'<p>hello there tasdik</p>'</span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">.</span><span class="n">__name__</span>
<span class="s">'greet'</span>
<span class="o">>>></span>
</code></pre></div></div>
<h2 id="passing--arguments-to-decorators">Passing Arguments to decorators</h2>
<p>Now wouldn’t it be nice if you could pass arguments to decorators, to tagify the content however you wished? This would reduce the 3 functions to 1. (Remember the decorator chaining example?)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">wraps</span>
<span class="o">>>></span>
<span class="o">>>></span> <span class="k">def</span> <span class="nf">tag</span><span class="p">(</span><span class="n">tag_name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">tag_decorator</span><span class="p">(</span><span class="n">func</span><span class="p">):</span>
<span class="p">...</span> <span class="o">@</span><span class="n">wraps</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">func_wrapper</span><span class="p">(</span><span class="n">content</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"<{0}>{1}</{0}>"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">tag_name</span><span class="p">,</span> <span class="n">func</span><span class="p">(</span><span class="n">content</span><span class="p">))</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">func_wrapper</span>
<span class="p">...</span> <span class="k">return</span> <span class="n">tag_decorator</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="o">@</span><span class="n">tag</span><span class="p">(</span><span class="s">"p"</span><span class="p">)</span>
<span class="p">...</span> <span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
<span class="p">...</span> <span class="k">return</span> <span class="s">"hello there {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">(</span><span class="s">"tasdik"</span><span class="p">)</span>
<span class="s">'<p>hello there tasdik</p>'</span>
<span class="o">>>></span> <span class="n">greet</span><span class="p">.</span><span class="n">__name__</span>
<span class="s">'greet'</span>
<span class="o">>>></span>
</code></pre></div></div>
<p>So I hope you now have a good idea about how decorators work in python.</p>
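<p>One last note: the wrappers above hard-code a single <code class="language-plaintext highlighter-rouge">name</code> argument. Real-world decorators usually accept <code class="language-plaintext highlighter-rouge">*args</code> and <code class="language-plaintext highlighter-rouge">**kwargs</code> so they work with any signature. A sketch of a generic timing decorator (the <code class="language-plaintext highlighter-rouge">last_elapsed</code> attribute is my own invention for illustration):</p>

```python
import time
from functools import wraps

def timed(func):
    """Decorator that times func, regardless of its signature."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        # stash the duration on the wrapper for later inspection
        wrapper.last_elapsed = time.time() - start
        return result
    return wrapper

@timed
def greet(name, punctuation="!"):
    return "hello there {}{}".format(name, punctuation)

greet("tasdik")       # 'hello there tasdik!'
greet("tasdik", ".")  # 'hello there tasdik.'
```

<p>Because the wrapper simply forwards <code class="language-plaintext highlighter-rouge">*args</code> and <code class="language-plaintext highlighter-rouge">**kwargs</code>, the same decorator works on functions with positional arguments, keyword arguments, or both.</p>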
Margo: An opinionated Slack Bot for SRMSE's Slack channel
2016-06-25T00:00:00+00:00
https://www.tasdikrahman.com/2016/06/25/Margo-An-opiniated-Python-based-slack-bot-for-SRM-Search-Engine
<h2 id="bots-before-and-now">Bots: Before and Now</h2>
<p>When was the last time you had a conversation with a computer? Most of us never have, or maybe you did but didn’t like the experience! The day isn’t far off when customer support will be provided by dedicated bots built using cutting edge ML techniques and backed by state of the art NLP research.</p>
<p>If you have been following the latest trends in the software industry, bots and VR are <em>the thing</em>, as echoed by a lot of big shot companies and tech evangelists. I mean, just look at <a href="https://techcrunch.com/2016/04/07/rise-of-the-bots-x-ai-raises-23m-more-for-amy-a-bot-that-arranges-appointments/">these</a> <a href="https://techcrunch.com/2016/03/17/facebooks-messenger-in-a-bot-store/">crazy</a> <a href="https://techcrunch.com/2016/05/11/kik-already-has-over-6000-bots-reaching-300-million-registered-users/">articles</a> <a href="https://techcrunch.com/2016/05/10/facebook-chatbot-analytics/">on</a> <a href="https://techcrunch.com/2016/05/07/bots-messenger-and-the-future-of-customer-service/">techcrunch</a>. Bots have been the talk of the town for some time now, and developers are taking fair advantage of this rise in attention to promote their own bots in the market.</p>
<h2 id="what-got-me-into-bots">What got me into bots</h2>
<p>Well to be honest. I am quite a regular to <a href="https://techcrunch.com/">techcrunch’s</a> website. And when I got to know that they had released a <a href="https://telegram.org/">telegram</a> bot for their service, it didn’t take me too long to setup the bot for my telegram ID.</p>
<p>I was quite cynical at first about how the experience would turn out. But surprisingly, the interface was really intuitive and made no assumptions about what the user knew or didn’t.</p>
<p>The bot helps you stay on top of the topics, stories and people you care about the most. You can subscribe to different topics, authors or sections of the site, and the bot will send you news articles on a daily basis or when you explicitly ask for it.</p>
<p>In short I was really impressed and didn’t regret installing the bot.</p>
<p><img src="https://tctechcrunch2011.files.wordpress.com/2016/03/mar-15-2016-1020.gif" alt="techcrunch" /></p>
<p><em>image courtesy: TechCrunch</em></p>
<h2 id="so-whats-the-deal-with-margo">So what’s the deal with Margo</h2>
<p>Back at <a href="http://srmsearchengine.in/">SRM Search Engine</a>, we manage our own single dedicated server for providing our search service, and sometimes it goes down for one reason or another: the insane amount of data that we crunch on it, or playing around with a new technology which requires some downtime.</p>
<p>The idea for <a href="https://github.com/tasdikrahman/margo">Margo</a> came to me while having a chat with one of my college mates on our slack channel. Oh no, we were not talking about bots! Something totally different, but well it came out of the blue to me!</p>
<p>Weekend ahead! Weather in Delhi (ok. it was raining that day!), along with some aloo tikkas. What a bliss!</p>
<p>Reading the docs for the <a href="https://api.slack.com/">Slack API</a> while working on the bot, I had a working prototype ready in about 2 hours. The next hour was spent refactoring the app.</p>
<p>Here’s a glimpse to what it does</p>
<p><img src="https://raw.githubusercontent.com/tasdikrahman/margo/master/assets/demo.gif" alt="Margo demo" /></p>
<p>Pretty basic for now. I plan to automate the pinging process, but the current deployment of the bot forbids me from doing so. You see, I have deployed it to a basic dyno on <a href="https://heroku.com/">heroku</a>. The thing is that the dyno goes to sleep if it does not receive any <code class="language-plaintext highlighter-rouge">HTTP requests</code> after some time. Moreover, basic dynos have a fixed 6 hour downtime. So yeah, as I am pretty much broke right now, I cannot afford a Digital Ocean/Linode server. But what the heck, at least it works for now!</p>
<p>More functionality is on the way.</p>
<blockquote>
<p>Here’s a link to the project : <a href="https://github.com/tasdikrahman/margo">Margo</a></p>
</blockquote>
<p>All in all, I enjoyed building <a href="https://github.com/tasdikrahman/margo">Margo</a>, for this was a side project after some weeks (several, if you may) of break. The reason being, I hadn’t had much time to indulge in side projects due to my commitments as an intern at <a href="http://wingify.com/">Wingify</a>. And man, I am loving it here!</p>
<hr />
<p>Back there, I just finished working on an internal backend service: a <a href="https://www.rabbitmq.com/">rabbitMQ</a> consumer which handles the messages consumed from numerous queues (based on the type of the queue) and processes them accordingly. I did some refactoring of the service and then integrated <a href="https://statsd.readthedocs.io/en/v3.2.1/index.html">statsd</a> with it to graph the IO operations it performs. <a href="https://graphite.readthedocs.io">Graphite</a>, along with its components <a href="https://graphite.readthedocs.io/en/latest/carbon-daemons.html">carbon</a> and <a href="https://graphite.readthedocs.io/en/latest/whisper.html">whisper</a>, was used to visualize the data in a human readable graph format. The last step was to deploy the setup to a test server on Digital Ocean.</p>
<p>Maybe what I just wrote is quite abstract, but I plan on writing a blog post about the experience; let’s see when I get time to write it!</p>
<hr />
<p>So this weekend, some of my friends and I headed over to the <a href="http://meetup.com/Bot-Builder-Delhi/">Bot builder workshop meetup, Delhi</a> for the fun of it. We had Beerud Sheth, the CEO of Gupshup, give a talk about how bots are the next big thing.</p>
<p>All in all we had a really good time and met some really interesting guys.</p>
<p><img src="https://raw.githubusercontent.com/tasdikrahman/www.tasdikrahman.com/gh-pages/content/images/2016/6/gupshup.jpg" alt="Gupshup" /></p>
<p><img src="https://raw.githubusercontent.com/tasdikrahman/www.tasdikrahman.com/gh-pages/content/images/2016/6/innova8.jpg" alt="innov8" /></p>
<p>Cheers!</p>
Simple lessons learned while building things - My open source journey so far
2016-04-24T00:00:00+00:00
https://www.tasdikrahman.com/2016/04/24/Simple-lessons-learned-while-building-things
<h2 id="reinventing-the-wheel-is-sometimes-a-good-idea">Reinventing the wheel is sometimes a good idea</h2>
<p>One of the stock critiques of any new project is that it’s been done before. You’re working on a new module, format, etc.: what about this existing one?</p>
<ol>
<li>
<p>Contributing to existing projects is often impossible if your vision is different from those of the maintainers, your changes are too large, or they’re absent.</p>
</li>
<li>
<p>Even if a problem is “solved” by an existing project, it can often be solved better, faster, in fewer lines of code, or with more documentation.</p>
</li>
<li>
<p>Sometimes writing the thing is the best way to learn about the solution. It’s best to develop the skill of reading code and understanding it, but usually the existing project is not written clearly.</p>
</li>
</ol>
<h2 id="few-open-source-projects-attain-escape-velocity">Few open source projects attain escape velocity</h2>
<p>Most projects attract neither users nor contributors and are forgotten within a year of their development.</p>
<p>Some lucky projects acquire users.</p>
<p>Only the very luckiest projects get contributors, <strong>probably less than 1% of projects</strong>. These have more than one maintainer who really feels continuing ownership and has the time to contribute, and the open source dream can be achieved.</p>
<p>Most projects will cost you time. Your responsibilities to maintain them and their users will add up. The more things you create, the more time you spend helping users and doing their bidding, and less time you have to create new things.</p>
<h2 id="the-ultimate-skill-is-knowing-what-to-do">The ultimate skill is knowing what to do</h2>
<p>Your ability to write code quickly or answer issues or anything else doesn’t matter if you don’t know what to do, and people don’t start out knowing what to do. The experienced builder knows what to prioritize when, and this is what makes them effective.</p>
<p>Optimize for performance when you need to. Write minimal docs and expand them once the project is ready. Avoid implementing anything that isn’t necessary. If you expand the mission of a project, do so knowing that it’ll cost time and could sacrifice focus.</p>
<p>Most importantly, don’t build things that are impossible or that you’re sure will turn out bad just because that’s what the plan entails: change the plan and build possible, good things.</p>
<h2 id="read-more-than-you-write">Read more than you write</h2>
<p>Learn to dig into codebases, read code, and understand other people’s styles, and you’ll save yourself thousands of annoying back-and-forth conversations.</p>
<p>Research is underrated. Computer Science history is short, but many problems have established, proven solutions. Knowing those solutions and the principles with which they were solved will give you the power of the ancients.</p>
<p>Here’s a link to my github profile for the curious <a href="https://github.com/tasdikrahman">@tasdikrahman</a></p>
<p>Until next time then!</p>
Extraction of text from image using tesseract-ocr engine
2016-04-04T00:00:00+00:00
https://www.tasdikrahman.com/2016/04/04/extraction-of-text-from-image-using-tesseract-ocr-engine
<p>This post was long overdue!</p>
<p>We have been working on building a food recommendation system for some time, and this phase involved getting the menu items out of the menu images.</p>
<p>We pored over zomato’s site looking for menus, and all we found was images in the name of menus.</p>
<center><img src="http://www.tasdikrahman.com/content/images/2016/4/menu_img.jpg" /></center>
<p>This is not what we wanted!</p>
<blockquote>
<p>We want the menu items to be in text format, so that we can easily track which restaurants are serving which dish and analyze the reviews to see which restaurant serves it best.</p>
</blockquote>
<h3 id="scrape-them-all">Scrape them all!</h3>
<p>The first step was to scrape the images of the hotels.</p>
<p>The very first apps which came to our mind when we thought about food were no other than <a href="https://zomato.com">zomato</a> and <a href="http://burrp.com">burrp</a> (brownie points for those for whom these were the names echoing in their minds).</p>
<p>Zomato kept blocking our <code class="language-plaintext highlighter-rouge">crawlers</code> from time to time. So we found <strong>burrp</strong> to be a boon in this sense.</p>
<blockquote>
<p>You can find some of the webscrapers here <a href="https://github.com/foodoh/web_scrapers">(https://github.com/foodoh/web_scrapers)</a></p>
</blockquote>
<h3 id="enter-ocr">Enter OCR</h3>
<p>So the obvious choice was to apply <code class="language-plaintext highlighter-rouge">image processing</code> techniques to extract the text inside these images.</p>
<p>We thought we would get okayish results using the <a href="https://github.com/tesseract-ocr">tesseract-ocr engine</a> for this purpose. If you haven’t heard about it, <code class="language-plaintext highlighter-rouge">tesseract</code> is maintained by <a href="http://google.com">google</a> and provides a decent API for getting the job done!</p>
<p>We ran some of our images through it without any pre-processing and waited for the results. But we were in for a rude shock: not only were we getting bad results, some of them were outright garbage text.</p>
<p>Tell me if you can comprehend any of this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> M hm mkflfiffi {MWfl m lax-g;
lug-BI I 1. I I.“ n I
III"- - ‘I I Mb. I I
I‘M“ - ‘I I “IQ-h I .
in“ . 1. I “my I I
III-“mun. - 1I I I‘M n I
lOl-I I I Ila-Inn." P! I
Imus-n I Ia—l— I I
'wm ll Intimi— I I
mm lulu - I III-m - I
'60.“ II I II._~ . o.
'm“ : Iain—Lita 1| I III—I.— - .-
uhfinI—II I . I I w“ n I
nun—n.- . "h." III-nun— a I
m um:
I n
H .
C -
I I
I n
C I
I 1!
</code></pre></div></div>
<p>This is supposed to be the text list of menu items extracted from this image</p>
<center><img src="http://www.tasdikrahman.com/content/images/2016/4/bad_img_ocr.jpg" /></center>
<p>Sucks right?</p>
<p>But some results were turning out fine. Take for instance this image <a href="https://github.com/foodoh/ocrd_menus/blob/master/menu_images/1947-fine-indian-cuisine-banashankari-listing/1947-fine-indian-cuisine-banashankari-listing_1.jpg">(link)</a></p>
<center><img src="http://www.tasdikrahman.com/content/images/2016/4/good_img_ocr.jpg" /></center>
<p>Result for this <a href="https://github.com/foodoh/ocrd_menus/blob/master/tesseract_menu_data/first_600_all_menus/1947-fine-indian-cuisine-banashankari-listing.txt">(link)</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>777 fl
SOUP
Tomato Soup 80
Sweet Corn so
SHURUWAAT
Paneer Tirang! Tikka 210
Paneer Tikka La] Mirch 2 w
Paneer Malai Tikka 210
Paneer Kali Mirch 210
Pafieer Peshawari 210
Tandoori Baby Corn 190
Chalpali Baby Corn 190
Alan Nazakat Ke 170
Chaman Kl Seekh 180
A100 Ke Tukde 17o:
Makai Ki mm 190
Kuuiiniri Seekh lao
Tandoori Kumbh 2w
Lahnri Suhzi Seekh 180
Veg Kui-kurc 190
Makai Maxi Scckh 190
— SUBZI KI BAHAAR
A100 Gobi Adarakhi 170
Pindi Choic 11o
Bhendi Do Pyazza 170
Sarson Ka Saag 170
Shin-i121 W313 Emma 170
Had Makai Khas 180
Lasnoni Vegeiahle 180
Lagaan Ki Subzi 180
Kashmiri Dum A100 180
Veg Angare 180
Subzi Zaykcdar 180
Vegomble Lahari 12m
Diwan] Handi 130
Vegetable Kali Mirch 180
Suhzi La Jawab 190
Sahzi Jam Pahechani 190
Jafrani Kufta 190
Panzer Knthmari Kufia 190
</code></pre></div></div>
<p>Decent enough for me.</p>
<h3 id="grayscaling-the-images">Grayscaling the images</h3>
<p>Now, after some reading, we found out that grayscaling the images would increase the OCR accuracy.</p>
<p>A simple <code class="language-plaintext highlighter-rouge">PIL</code> program for that</p>
<script src="https://gist.github.com/tasdikrahman/06239ec6986ce3d05b4dfd00cc038372.js"></script>
<p>This improved the accuracy to a certain extent. Here is a sample grayscaled image for you <a href="https://github.com/foodoh/scraped_menu_items/blob/master/light_cleaned_images/100-ft-boutique-bar-restaurant-indiranagar-listing/100-ft-boutique-bar-restaurant-indiranagar-listing_1_cleaned.jpg">(link)</a></p>
<center><img src="http://www.tasdikrahman.com/content/images/2016/4/greyscaled_img.jpg" /></center>
<blockquote>
<p>Some of the cleaning scripts lie here <a href="https://github.com/foodoh/image_cleansing/">(https://github.com/foodoh/image_cleansing/)</a></p>
</blockquote>
<h3 id="automating-the-task-of-ocr">Automating the task of OCR</h3>
<p>Now, tesseract provides a <code class="language-plaintext highlighter-rouge">CLI</code> interface for interacting with it. But how would you automate this? I am not gonna sit there and type</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nv">$ </span>tesseract myscan.png out
</code></pre></div></div>
<p>for each and every image scraped!</p>
<blockquote>
<p>Enter python!</p>
</blockquote>
<p>As always, <code class="language-plaintext highlighter-rouge">python</code> comes to the rescue. I wrote a simple script which walked the image directories, looping over each and every image for each hotel and running <code class="language-plaintext highlighter-rouge">tesseract-ocr</code> on them.</p>
<p>Each hotel’s text menu was stored in a separate file, named after the hotel’s normalized name.</p>
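<p>A sketch of what that automation loop looks like (the directory layout and the <code>normalize</code> helper are assumptions for illustration, not the exact script):</p>

```python
import os
import subprocess


def normalize(name):
    """Hypothetical name cleanup: lowercase, spaces to hyphens."""
    return name.strip().lower().replace(" ", "-")


def ocr_all(image_root, text_root):
    """Run tesseract over every hotel's image directory, appending the
    recognised text for all its images into one file per hotel."""
    os.makedirs(text_root, exist_ok=True)
    for hotel in sorted(os.listdir(image_root)):
        hotel_dir = os.path.join(image_root, hotel)
        out_file = os.path.join(text_root, normalize(hotel) + ".txt")
        with open(out_file, "a") as out:
            for image in sorted(os.listdir(hotel_dir)):
                # "stdout" as the output base makes tesseract print the text
                result = subprocess.run(
                    ["tesseract", os.path.join(hotel_dir, image), "stdout"],
                    capture_output=True, text=True,
                )
                out.write(result.stdout)


if __name__ == "__main__":
    ocr_all("menu_images", "tesseract_menu_data")
```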
<p>You can find most of the scripts used for this automation here</p>
<ul>
<li><a href="https://github.com/foodoh/automation_scripts">(https://github.com/foodoh/automation_scripts)</a> and</li>
<li><a href="https://github.com/foodoh/image_cleansing">(https://github.com/foodoh/image_cleansing)</a></li>
</ul>
<p>Stay tuned for more!</p>
Trying out Oculus Rift: Development kit 2
2016-03-13T00:00:00+00:00
https://www.tasdikrahman.com/2016/03/13/Oculus-Rift-Development-kit-two-review
<p>I woke up and found myself inside the elevator. A rather spooky one I would say. Questions like <strong>“Why was I inside an elevator?”</strong> would come to your mind. I am coming to that.</p>
<h3 id="prologue">Prologue</h3>
<p>I was all alone inside the dimly lit elevator. A sudden jerk and the next thing I know is that the lift is going up.</p>
<p>I look around and it seems all normal to me. Next, the lift stops at the 4th floor. The door opens and I look outside the elevator door. I was expecting to see somebody as I didn’t press any of the buttons. Door closes automatically and up we go again.</p>
<p>We stop again at the 10th floor. I again don’t remember myself pressing anything, but what the heck right. The door opens and the next thing I see makes me jump two steps closer to the back of the elevator. What was it you ask?</p>
<center><img src="/content/images/2016/3/elev.jpg" /></center>
<h3 id="did-you-say-a-tricyle">Did you say a tricycle?</h3>
<p>It was a <strong>tricycle</strong>! The ones which we all used to have when we were in playschool! I assured myself that it must have been left by some kid playing around and that he/she must be nearby the door. I try peeping out but all I see is utter darkness. Out of nowhere, a basketball is thrown inside the lift and I swear to god I shouted like anything!</p>
<p>The door closes again. The lift starts moving up and suddenly there is complete blackness inside. It takes me a moment to realize that there has been a power cut. I try looking for my cellphone in my pockets, but I can’t feel anything there.</p>
<p>After some restless moments, the power comes back. The lift starts moving up. I notice something really strange on the lift’s display: the numerals which show the floor the lift is currently on start flickering. I don’t pay attention to it, thinking it might be some technical glitch. But what spooks me out is when the display goes haywire and shows text which looks more like some foreign language to me!</p>
<p>Now I am frantically trying to press the buttons but all in vain. The lift stops and the door opens. I look up and see the same tricycle at the same place where it was on the floor beneath. Talk about coincidences when you don’t have any. Poor brain trying to calm you down I guess. Anyways.</p>
<h3 id="hello-is-anybody-there">Hello! Is anybody there?</h3>
<p>The door closes after a brief 3-4 seconds and off we go again. I am contemplating about how the heck did I land up inside this lift and while I am thinking all this, the door opens again and I freeze.</p>
<p>I look up expecting the same old tricycle but I see none which does little to relieve me. I move forward to leave the elevator and the next thing I know is a dark shadow landing right in front of me! I mean right on my freckin’ face.</p>
<p>As I compose myself, I see a man standing right in front of me. A black suit, red tie. I watch the masked man as I take fumbling steps towards the back of the elevator. I watch in horror as the figure slowly grows in size, reaching a height at which his head literally touches the elevator ceiling.</p>
<p>In the flick of a second, the masked man lunges at me and I fall to the ground to dodge him.</p>
<hr />
<h3 id="and-then-i-took-off-the-oculus-rift-over-from-my-head">And then I took off the <strong>Oculus Rift</strong> over from my head!</h3>
<blockquote>
<p>It all happened so fast that I was left panting for more!</p>
</blockquote>
<p>Sorry for the long intro, but it would be a shame if I did not convey the level of immersiveness of the Oculus through an experience. You have to try this thing out however you can, and I cannot even begin to tell you how much I enjoyed it!</p>
<center><img src="/content/images/2016/3/elev_man.jpg" /></center>
<p>It’s been two weeks or so since <a href="http://www.srmmilan.com/">Milan 2k16</a> (our college cultural fest) passed by. And I thank the guys over at <strong>Nvidia</strong> (one of the many great sponsors that we had) and the guys at <strong>TGN</strong> for bringing super cool gadgets and gaming rigs over to our college.</p>
<p>Enjoyed the fest and all the events in it, but this was the cherry on top for me.</p>
<hr />
<p>Here’s me being dumbfounded by the next big thing after the internet!</p>
<center><img src="/content/images/2016/3/oculus_milan.jpg" /></center>
Making of space Shooter using pygame
2016-02-02T00:00:00+00:00
https://www.tasdikrahman.com/2016/02/02/Making-of-space-Shooter-using-pygame
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css" />
<p>I procrastinated enough in writing this post, so here it goes. <code class="language-plaintext highlighter-rouge">Pygame</code> treated me well. So well that I was able to create a decent <code class="language-plaintext highlighter-rouge">2-D</code> game in a day!</p>
<p>So here is my breakdown of it.</p>
<h2 id="creating-the-basic-rectangles">Creating the basic rectangles</h2>
<center><img src="http://i.imgur.com/50qgY67.jpg" /></center>
<p>I used the <a href="https://github.com/tasdikrahman/pygame-boilerplate">pygame-boilerplate</a> which I made in the process of making this game.</p>
<p>It’s nothing groundbreaking. Provides just a basic starting ground for you to base your pygame projects. Saves you some gruntwork.</p>
<h2 id="adding-the-enemy-sprites">Adding the enemy sprites</h2>
<center><img src="http://i.imgur.com/HorSt1T.jpg" /></center>
<p>At this point, collisions needed to be added: when the player collides with a mob sprite, the game needs to end.</p>
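<p>The collision check itself is pretty much a one-liner in pygame; a minimal sketch (the <code>Block</code> sprite here is a stand-in for the game’s real player and mob classes):</p>

```python
import pygame


class Block(pygame.sprite.Sprite):
    """Bare sprite with just a rect, enough for rect-based collision."""
    def __init__(self, x, y, size=30):
        super().__init__()
        self.rect = pygame.Rect(x, y, size, size)


player = Block(100, 100)
mobs = pygame.sprite.Group(Block(110, 110), Block(300, 300))

running = True
# dokill=False keeps the hit mobs in the group; pass True to remove them
hits = pygame.sprite.spritecollide(player, mobs, False)
if hits:
    running = False  # player hit a mob: end the game loop
```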
<h2 id="adding-the-icons-for-all">Adding the icons for all</h2>
<center><img src="http://i.imgur.com/QV57Zqb.jpg" /></center>
<p>Adding the icons for the sprites was not that hard. I got the icons from <a href="http://opengameart.org/">opengameart.org</a>, more particularly from the <a href="http://opengameart.org/content/space-shooter-redux">Space shooter content</a> pack from <a href="http://opengameart.org/users/kenney">@kenney</a>.</p>
<p>License for them is <code class="language-plaintext highlighter-rouge">Public Domain</code>. This pack is a gem of a package; I mean, you get everything you need in it!</p>
<p>The sound effects came next which included the explosions for the mob sprites as well as the player.</p>
<p>How about adding sound effects when shooting the missiles? Done deal!</p>
<h2 id="finishing-it-up">Finishing it up</h2>
<center><img src="http://i.imgur.com/1Zraayf.jpg" /></center>
<p>Done with the sound effects and the explosion animations, I was left with adding things like high scores, player lives, and a health bar.</p>
<p>Something which I had not planned for were power-ups like shields. Dealing with GitHub feature requests, anybody?</p>
<p>So this is what the main menu looks like</p>
<center><img src="http://i.imgur.com/3MzfmbT.jpg" /></center>
<p>In the end, I loved making this game a lot and I hope you make something much cooler than this. Do share it when you do.</p>
<p>Here’s the git repo if you are interested in taking a look at the source code.</p>
<hr />
<p><i class="fa fa-github-alt fa-2x"></i> <a href="https://github.com/tasdikrahman/spaceShooter">tasdikrahman/spaceShooter</a></p>
<h2 id="wanna-play">Wanna play?</h2>
<p>Have a nostalgic trip back to your childhood playing it! You can download it for your preferred system.</p>
<p>Best part: it requires no installation! Just unzip it and you are good to go.</p>
<table>
<thead>
<tr>
<th style="text-align: center"><i class="fa fa-linux fa-2x"></i></th>
<th style="text-align: center"><a href="https://github.com/tasdikrahman/spaceShooter/releases/download/v0.0.3/spaceShooter-v0.0.3_linux.zip">Download for linux based systems</a></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><i class="fa fa-windows fa-2x"></i></td>
<td style="text-align: center"><a href="https://github.com/tasdikrahman/spaceShooter/releases/download/v0.0.3/spaceShooter-v0.0.3_windows.zip">Download for windows based systems</a></td>
</tr>
</tbody>
</table>
<!-- <a class="btn btn-lg btn-success" href="https://github.com/tasdikrahman/spaceShooter/releases/download/v0.0.3/spaceShooter-v0.0.3_windows.zip">
<i class="fa fa-flag fa-2x pull-left"></i> Space Shooter - Windows <br>Version 0.0.3</a>
<a class="btn btn-lg btn-success" href="https://github.com/tasdikrahman/spaceShooter/releases/download/v0.0.3/spaceShooter-v0.0.3_linux.zip">
<i class="fa fa-flag fa-2x pull-left"></i> Space Shooter - linux <br>Version 0.0.3</a> -->
<p><strong>Support for MAC OS coming soon!</strong></p>
<hr />
<center><a href="https://github.com/tasdikrahman/spaceShooter"><img src="/content/images/2016/1/spaceShooter.gif" /></a></center>
<p>Happy coding!</p>
Say Hi to peewee
2016-01-29T00:00:00+00:00
https://www.tasdikrahman.com/2016/01/29/Using-peewee-as-an-ORM
<p>Once upon a time, when we had to interact with databases, we had to write bare-bones <code class="language-plaintext highlighter-rouge">SQL</code> (“sequel”, if you may), or Structured Query Language. A language spoken by many common databases, like</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">SQLite</code></li>
<li><code class="language-plaintext highlighter-rouge">MySQL</code></li>
<li><code class="language-plaintext highlighter-rouge">MariaDB</code> to name a few.</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">SQL</code> is amazing, but for day-to-day tasks it is pretty daunting and one of the easier ways to shoot yourself squarely in the foot.</p>
<p>Now I am sure that you don’t want that to happen!</p>
<h2 id="enter-orm">Enter ORM</h2>
<p>ORM is an acronym for <code class="language-plaintext highlighter-rouge">Object Relational Mapping</code>. Okay, but what does an ORM do?</p>
<blockquote>
<p>It turns objects in your code into rows in your database and vice versa.</p>
</blockquote>
<p>They do this by defining a model. A model represents a <code class="language-plaintext highlighter-rouge">table</code> in the database; the models are nothing but <code class="language-plaintext highlighter-rouge">classes</code> whose attributes represent the <code class="language-plaintext highlighter-rouge">columns</code>.</p>
<p>I stumbled upon <a href="https://en.wikipedia.org/wiki/Peewee_ORM">Peewee</a>, a lightweight ORM tool for python.</p>
<table>
<tbody>
<tr>
<td><a href="https://github.com/coleifer/peewee">Github/peewee</a></td>
<td><a href="http://docs.peewee-orm.com/">Documentation</a></td>
</tr>
</tbody>
</table>
<p>Now you might ask, why <code class="language-plaintext highlighter-rouge">peewee</code> and not any other ORM like <code class="language-plaintext highlighter-rouge">SQLAlchemy</code> or <code class="language-plaintext highlighter-rouge">Storm</code> for the matter?</p>
<p>Well my reason for that would be the closeness to the <code class="language-plaintext highlighter-rouge">Django</code> style declaration models. This would be immensely helpful for anybody who is gonna be learning/working with <code class="language-plaintext highlighter-rouge">Django</code>. Plus it’s lightweight!</p>
<p>Here is a quick demo for how to use it.</p>
<h2 id="peewee-a-gentle-intro">Peewee, A gentle intro</h2>
<p>Let’s try to model a <code class="language-plaintext highlighter-rouge">Student</code> database</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">peewee</span> <span class="kn">import</span> <span class="o">*</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">SqliteDatabase</span><span class="p">(</span><span class="s">'student.db'</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Student</span><span class="p">(</span><span class="n">Model</span><span class="p">):</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">CharField</span><span class="p">(</span><span class="n">max_length</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span> <span class="n">unique</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">points</span> <span class="o">=</span> <span class="n">IntegerField</span><span class="p">(</span><span class="n">default</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Meta</span><span class="p">:</span>
<span class="n">database</span> <span class="o">=</span> <span class="n">db</span></code></pre></figure>
<p>The model has been created.</p>
<p>Let’s enter some students to the database</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">students</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'username'</span><span class="p">:</span> <span class="s">'tasdik'</span><span class="p">,</span>
<span class="s">'points'</span><span class="p">:</span> <span class="mi">200</span>
<span class="p">},</span>
<span class="p">{</span><span class="s">'username'</span><span class="p">:</span> <span class="s">'kellogs'</span><span class="p">,</span>
<span class="s">'points'</span><span class="p">:</span> <span class="mi">400</span>
<span class="p">},</span>
<span class="p">{</span><span class="s">'username'</span><span class="p">:</span> <span class="s">'john'</span><span class="p">,</span>
<span class="s">'points'</span><span class="p">:</span> <span class="mi">500</span>
<span class="p">},</span>
<span class="p">{</span><span class="s">'username'</span><span class="p">:</span> <span class="s">'doe'</span><span class="p">,</span>
<span class="s">'points'</span><span class="p">:</span> <span class="mi">600</span>
<span class="p">},</span>
<span class="p">{</span><span class="s">'username'</span><span class="p">:</span> <span class="s">'foo'</span><span class="p">,</span>
<span class="s">'points'</span><span class="p">:</span> <span class="mi">1000</span>
<span class="p">},</span>
<span class="p">]</span></code></pre></figure>
<p>How about we make a separate function for adding the students to the database?</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">add_student</span><span class="p">():</span>
<span class="k">for</span> <span class="n">student</span> <span class="ow">in</span> <span class="n">students</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">Student</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">username</span><span class="o">=</span><span class="n">student</span><span class="p">[</span><span class="s">'username'</span><span class="p">],</span>
<span class="n">points</span><span class="o">=</span><span class="n">student</span><span class="p">[</span><span class="s">'points'</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">except</span> <span class="n">IntegrityError</span><span class="p">:</span>
<span class="n">student_record</span> <span class="o">=</span> <span class="n">Student</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">username</span><span class="o">=</span><span class="n">student</span><span class="p">[</span><span class="s">'username'</span><span class="p">])</span>
<span class="k">if</span> <span class="n">student</span><span class="p">[</span><span class="s">'points'</span><span class="p">]</span> <span class="o">!=</span> <span class="n">student_record</span><span class="p">.</span><span class="n">points</span><span class="p">:</span>
<span class="n">student_record</span><span class="p">.</span><span class="n">points</span> <span class="o">=</span> <span class="n">student</span><span class="p">[</span><span class="s">'points'</span><span class="p">]</span>
<span class="n">student_record</span><span class="p">.</span><span class="n">save</span><span class="p">()</span></code></pre></figure>
<h2 id="checking-whether-there-is-anything-in-there">Checking whether there is anything in there</h2>
<p>Firing up the interpreter and running <code class="language-plaintext highlighter-rouge">sqlite3</code></p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="err">$</span> <span class="n">sqlite3</span> <span class="n">students</span><span class="p">.</span><span class="n">db</span>
<span class="c1">-- Loading resources from /home/tasdik/.sqliterc</span>
<span class="n">SQLite</span> <span class="k">version</span> <span class="mi">3</span><span class="p">.</span><span class="mi">8</span><span class="p">.</span><span class="mi">6</span> <span class="mi">2014</span><span class="o">-</span><span class="mi">08</span><span class="o">-</span><span class="mi">15</span> <span class="mi">11</span><span class="p">:</span><span class="mi">46</span><span class="p">:</span><span class="mi">33</span>
<span class="n">Enter</span> <span class="nv">".help"</span> <span class="k">for</span> <span class="k">usage</span> <span class="n">hints</span><span class="p">.</span>
<span class="n">sqlite</span><span class="o">></span> <span class="p">.</span><span class="n">tables</span>
<span class="n">student</span>
<span class="n">sqlite</span><span class="o">></span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">student</span><span class="p">;</span>
<span class="n">id</span> <span class="n">username</span> <span class="n">points</span>
<span class="c1">---------- ---------- ----------</span>
<span class="mi">1</span> <span class="n">tasdik</span> <span class="mi">200</span>
<span class="mi">2</span> <span class="n">kellogs</span> <span class="mi">400</span>
<span class="mi">3</span> <span class="n">john</span> <span class="mi">500</span>
<span class="mi">4</span> <span class="n">doe</span> <span class="mi">600</span>
<span class="mi">5</span> <span class="n">foo</span> <span class="mi">1000</span>
<span class="n">sqlite</span><span class="o">></span> <span class="p">.</span><span class="n">exit</span></code></pre></figure>
<p>Querying the database becomes as simple as a breeze too!</p>
<h2 id="the-whole-thing-put-together">The whole thing put together</h2>
<p>So there you go</p>
<script src="https://gist.github.com/99adec66eda0754d853c.js"> </script>
<h2 id="some-notes---if-you-may">Some notes - If you may:</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">model</code> - A code object that represents a database table</li>
<li><code class="language-plaintext highlighter-rouge">SqliteDatabase</code> - The class from Peewee that lets us connect to an SQLite database</li>
<li><code class="language-plaintext highlighter-rouge">Model</code> - The Peewee class that we extend to make a model</li>
<li><code class="language-plaintext highlighter-rouge">CharField</code> - A Peewee field that holds onto characters. It’s a varchar in SQL terms</li>
<li><code class="language-plaintext highlighter-rouge">max_length</code> - The maximum number of characters in a CharField</li>
<li><code class="language-plaintext highlighter-rouge">IntegerField</code> - A Peewee field that holds an integer</li>
<li><code class="language-plaintext highlighter-rouge">default</code> - A default value for the field if one isn’t provided</li>
<li><code class="language-plaintext highlighter-rouge">unique</code> - Whether the value in the field can be repeated in the table</li>
<li><code class="language-plaintext highlighter-rouge">.connect()</code> - A database method that connects to the database</li>
<li><code class="language-plaintext highlighter-rouge">.create_tables()</code> - A database method to create the tables for the specified models.</li>
<li><code class="language-plaintext highlighter-rouge">safe</code> - Whether or not to throw errors if the table(s) you’re attempting to create already exist</li>
<li><code class="language-plaintext highlighter-rouge">.create()</code> - creates a new instance all at once</li>
<li><code class="language-plaintext highlighter-rouge">.select()</code> - finds records in a table</li>
<li><code class="language-plaintext highlighter-rouge">.save()</code> - updates an existing row in the database</li>
<li><code class="language-plaintext highlighter-rouge">.get()</code> - finds a single record in a table</li>
<li><code class="language-plaintext highlighter-rouge">.delete_instance()</code> - deletes a single record from the table</li>
<li><code class="language-plaintext highlighter-rouge">.order_by()</code> - specify how to sort the records</li>
<li><code class="language-plaintext highlighter-rouge">.update()</code> - also something we didn’t use. Offers a way to update a record without <code class="language-plaintext highlighter-rouge">.get()</code> and <code class="language-plaintext highlighter-rouge">.save()</code></li>
<li><code class="language-plaintext highlighter-rouge">.where()</code> - method that lets us filter our .select() results</li>
<li><code class="language-plaintext highlighter-rouge">.contains()</code> - method that specifies the input should be inside the specified field</li>
</ul>
<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Student.update(points=student['points']).where(Student.username == student['username']).execute()
</code></pre></div></div>
<h2 id="references">References:</h2>
<ul>
<li><a href="http://peewee.readthedocs.org/en/latest/peewee/querying.html">Peewee query methods</a></li>
<li><a href="http://stackoverflow.com/questions/53428/what-are-some-good-python-orm-solutions">Stackoverflow: List of Python ORM tools</a></li>
<li><a href="https://en.wikipedia.org/wiki/List_of_object-relational_mapping_software">Wiki: ORM software lists</a></li>
<li><a href="https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior">strftime options</a></li>
</ul>
Getting started with Pygame
2016-01-17T00:00:00+00:00
https://www.tasdikrahman.com/2016/01/17/Pygame-Introduction
<h2 id="pygame-intro">Pygame intro</h2>
<p>As with every other kid out there, I spent long hours sitting in front of the computer playing games like <a href="https://en.wikipedia.org/wiki/Super_Mario">Super Mario</a>, <a href="https://en.wikipedia.org/wiki/Dangerous_Dave">Dangerous Dave</a> and the likes. So when I got to know about <a href="http://www.pygame.org/">Pygame</a>, I got the itch to create something along the lines of those games.</p>
<p>Of course, there are better game engines written in other languages like <code class="language-plaintext highlighter-rouge">C++</code>, but I liked <code class="language-plaintext highlighter-rouge">python</code>. So I mean, what the heck, right?</p>
<p>Let’s get started then, shall we?</p>
<h2 id="installation">Installation</h2>
<h3 id="ubuntudebian">Ubuntu/Debian</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>python-pygame</code></pre></figure>
<h3 id="os-x">OS X</h3>
<p>If you DON’T have <code class="language-plaintext highlighter-rouge">homebrew</code> installed, then</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>ruby <span class="nt">-e</span> <span class="s2">"</span><span class="si">$(</span>curl <span class="nt">-fsSL</span> https://raw.githubusercontent.com/Homebrew/install/master/install<span class="si">)</span><span class="s2">"</span></code></pre></figure>
<p>If you have it installed, then</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>brew <span class="nb">install </span>caskroom/cask/brew-cask
<span class="nv">$ </span>brew cask <span class="nb">install </span>xquartz
<span class="nv">$ </span>brew <span class="nb">install </span>python3
<span class="nv">$ </span>brew <span class="nb">install </span>python
<span class="nv">$ </span>brew linkapps python3
<span class="nv">$ </span>brew linkapps python
<span class="nv">$ </span>brew <span class="nb">install </span>git
<span class="nv">$ </span>brew <span class="nb">install </span>sdl sdl_image sdl_ttf portmidi libogg libvorbis
<span class="nv">$ </span>brew <span class="nb">install </span>sdl_mixer <span class="nt">--with-libvorbis</span>
<span class="nv">$ </span>brew tap homebrew/headonly
<span class="nv">$ </span>brew <span class="nb">install </span>smpeg
<span class="nv">$ </span>brew <span class="nb">install </span>mercurial
<span class="nv">$ </span>pip3 <span class="nb">install </span>hg+http://bitbucket.org/pygame/pygame</code></pre></figure>
<h2 id="see-if-it-works">See if it works?</h2>
<p>Open the terminal</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>python
Python 2.7.8 <span class="o">(</span>default, Jun 18 2015, 18:54:19<span class="o">)</span>
<span class="o">[</span>GCC 4.9.1] on linux2
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">>>></span> import pygame
<span class="o">>>></span></code></pre></figure>
<p>If that raises no error, you are good to go.</p>
<h2 id="a-simple-pygame-boilerplate">A simple Pygame Boilerplate</h2>
<p>I made a dead simple <code class="language-plaintext highlighter-rouge">Pygame</code> boilerplate for creating a base for your <code class="language-plaintext highlighter-rouge">pygame</code> programs.</p>
<p>Here is the Github link for you.</p>
<ul>
<li>link : <a href="https://github.com/tasdikrahman/pygame-boilerplate">tasdikrahman/pygame-boilerplate</a></li>
</ul>
<p>A sneak peek</p>
<center><img src="http://i.imgur.com/p7PxMZl.jpg" /></center>
<h2 id="what-can-you-do-with-this">What can you do with this?</h2>
<p>I created <a href="https://github.com/tasdikrahman/spaceShooter">Space Shooter</a> using this boilerplate. Here’s a demo screen for you</p>
<center><img src="http://i.imgur.com/I5mTBFB.png" /></center>
<h2 id="references">References</h2>
<ul>
<li><a href="http://askubuntu.com/questions/399824/how-to-install-pygame">http://askubuntu.com/questions/399824/how-to-install-pygame</a></li>
<li><a href="https://inventwithpython.com/pygame/chapter1.html">https://inventwithpython.com/pygame/chapter1.html</a></li>
<li><a href="http://kidscancode.org/blog/2015/09/pygame_install/">http://kidscancode.org/blog/2015/09/pygame_install/</a></li>
</ul>
Unicode strings in python, a gentle intro
2015-12-08T00:00:00+00:00
https://www.tasdikrahman.com/2015/12/08/Working-with-unicode
<h2 id="summary">Summary</h2>
<p>In this post I will try to explain how to handle unicode strings in <code class="language-plaintext highlighter-rouge">python 2 and 3</code>.</p>
<p>I had long overlooked the way I handled strings in my projects, but I felt the gravity of handling <code class="language-plaintext highlighter-rouge">strings</code> properly when I was working on <a href="https://github.com/tasdikrahman/vocabulary/">vocabulary</a>, a side project of mine.</p>
<p>There was this one feature in it where the <code class="language-plaintext highlighter-rouge">module</code> had to return the <code class="language-plaintext highlighter-rouge">pronunciation</code> for a given word. I wrote the logic to parse the content and thought I had it all figured out, but then I was facing <a href="https://github.com/tasdikrahman/vocabulary/#known-issues">this issue</a>.</p>
<p>Let’s start shall we?</p>
<h2 id="ascii-strings">ASCII strings</h2>
<p>So let’s start with <code class="language-plaintext highlighter-rouge">ASCII</code> strings. Have a look at <a href="../../../../content/unicode/hi.txt">hi.txt</a></p>
<p>Let’s see what it holds</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span><span class="nb">cat </span>hi.txt
hi</code></pre></figure>
<p>Nice and easy. It contains two characters, <code class="language-plaintext highlighter-rouge">h</code> and <code class="language-plaintext highlighter-rouge">i</code>.</p>
<p>Size?</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span><span class="nb">du</span> <span class="nt">-a</span> <span class="nt">-b</span> hi.txt
2 hi.txt</code></pre></figure>
<p>This means that the file is of <code class="language-plaintext highlighter-rouge">2 bytes</code>. Now what do these <code class="language-plaintext highlighter-rouge">2 bytes</code> hold inside them? Let’s do a <code class="language-plaintext highlighter-rouge">hexdump</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span>hexdump hi.txt
0000000 6968
0000002</code></pre></figure>
<p>If you look at the <a href="http://www.asciitable.com/">ASCII table</a> and check the hex representations, you will see that the letter <code class="language-plaintext highlighter-rouge">h</code> is represented by <code class="language-plaintext highlighter-rouge">68</code> and <code class="language-plaintext highlighter-rouge">i</code> by <code class="language-plaintext highlighter-rouge">69</code></p>
<p>Let’s see how <code class="language-plaintext highlighter-rouge">python2</code> handles this. Firing up the <code class="language-plaintext highlighter-rouge">interpreter</code></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'hi.txt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="p">...</span> <span class="n">content</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">content</span>
<span class="s">'hi'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s">'str'</span><span class="o">></span>
<span class="o">>>></span> </code></pre></figure>
<p>Now I probably should reiterate the fact that</p>
<blockquote>
<p>Every character in a string is a single byte</p>
</blockquote>
<p>And that the ASCII table translates each byte value to a unique character. The file contains an <code class="language-plaintext highlighter-rouge">ASCII</code> string of exactly two characters, so it makes sense. Let’s dig a little further.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="nb">len</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="mi">2</span>
<span class="o">>>></span> <span class="n">content</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="s">'h'</span>
<span class="o">>>></span> <span class="n">content</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="s">'i'</span>
<span class="o">>>></span></code></pre></figure>
<p>So this confirms that <code class="language-plaintext highlighter-rouge">content[0]</code> contains <code class="language-plaintext highlighter-rouge">h</code> and <code class="language-plaintext highlighter-rouge">content[1]</code> contains <code class="language-plaintext highlighter-rouge">i</code></p>
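<p>As a quick sanity check, the same byte values from the hexdump can be recovered inside Python itself:</p>

```python
# hex value of each character, matching the hexdump above
assert hex(ord('h')) == '0x68'
assert hex(ord('i')) == '0x69'

# the file's two bytes are exactly these two values
assert 'hi'.encode('ascii') == b'\x68\x69' == b'hi'
```
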
<h2 id="enter-unicode">Enter Unicode</h2>
<p>So how many characters is this one-byte-per-character scheme able to represent? Doing the math, 256 (<code class="language-plaintext highlighter-rouge">2^8</code>) is the maximum number of distinct values a single byte can hold, so that is the ceiling on how many characters the table can represent. Just as a heads up, <code class="language-plaintext highlighter-rouge">Chinese</code> has a lot more than <code class="language-plaintext highlighter-rouge">256</code> characters. So how would you handle <code class="language-plaintext highlighter-rouge">Chinese</code> as well as the characters on your keyboard?</p>
<p>Have a look at <a href="../../../../content/unicode/chinese.txt">chinese.txt</a></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span><span class="nb">cat </span>chinese.txt
hi猫</code></pre></figure>
<p>So it contains three characters, namely <code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">i</code> and <code class="language-plaintext highlighter-rouge">猫</code>. Size?</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span><span class="nb">du</span> <span class="nt">-a</span> <span class="nt">-b</span> chinese.txt
5 chinese.txt</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">5 bytes</code>. Let’s see what each byte contains</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span>hexdump chinese.txt
0000000 68 69 e7 8c ab
0000005</code></pre></figure>
<p>The relevant things to note here are the five <code class="language-plaintext highlighter-rouge">hexadecimal</code> numbers <code class="language-plaintext highlighter-rouge">68</code>, <code class="language-plaintext highlighter-rouge">69</code>, <code class="language-plaintext highlighter-rouge">e7</code>, <code class="language-plaintext highlighter-rouge">8c</code> and <code class="language-plaintext highlighter-rouge">ab</code></p>
<p>So five numbers, 5 bytes. Good so far? Now how do we interpret these numbers? We will have a look at the <a href="https://en.wikipedia.org/wiki/UTF-8">Unicode UTF-8</a> table.</p>
<p>In this table, <code class="language-plaintext highlighter-rouge">68</code> is the character <code class="language-plaintext highlighter-rouge">h</code>, <code class="language-plaintext highlighter-rouge">69</code> is the character <code class="language-plaintext highlighter-rouge">i</code>, and the three-byte sequence <code class="language-plaintext highlighter-rouge">e7</code>, <code class="language-plaintext highlighter-rouge">8c</code>, <code class="language-plaintext highlighter-rouge">ab</code> is the character <code class="language-plaintext highlighter-rouge">猫</code>. To recap, <code class="language-plaintext highlighter-rouge">h</code> is one byte, <code class="language-plaintext highlighter-rouge">i</code> is one byte, but <code class="language-plaintext highlighter-rouge">猫</code> is three bytes.</p>
<p>A point to note here is that, the Unicode UTF-8 table is a superset of the ASCII table, so that’s the reason <code class="language-plaintext highlighter-rouge">h</code> and <code class="language-plaintext highlighter-rouge">i</code> are represented by the same characters in both.</p>
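<p>In Python 3 this superset property is easy to check directly (a small sketch, not part of the original walkthrough):</p>

```python
# ASCII characters keep their one-byte encodings under UTF-8
assert 'h'.encode('utf-8') == 'h'.encode('ascii') == b'\x68'
assert 'i'.encode('utf-8') == 'i'.encode('ascii') == b'\x69'

# the Chinese character needs three bytes, matching the hexdump
assert '猫'.encode('utf-8') == b'\xe7\x8c\xab'
```
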
<h2 id="handling-unicode-strings-in-python2">Handling unicode strings in <code class="language-plaintext highlighter-rouge">python2</code></h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'chinese.txt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="p">...</span> <span class="n">content</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="n">content</span>
<span class="s">'hi</span><span class="se">\xe7\x8c\xab</span><span class="s">'</span>
<span class="o">>>></span> <span class="nb">len</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="mi">5</span>
<span class="o">>>></span></code></pre></figure>
<p>What was all that? <code class="language-plaintext highlighter-rouge">h</code> and <code class="language-plaintext highlighter-rouge">i</code> are represented just fine, but when it comes to the Chinese character, it shows me hexadecimal numbers. And how does it return <code class="language-plaintext highlighter-rouge">5</code> as the string length, when we know perfectly well that there are just <code class="language-plaintext highlighter-rouge">3</code> characters in that file?</p>
<p>It turns out that the python <code class="language-plaintext highlighter-rouge">str</code> doesn’t store a <code class="language-plaintext highlighter-rouge">string</code> but a stream of <code class="language-plaintext highlighter-rouge">bytes</code> in it. Digging further.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">>>></span> content[0]
<span class="s1">'h'</span>
<span class="o">>>></span> content[1]
<span class="s1">'i'</span>
<span class="o">>>></span> content[2]
<span class="s1">'\xe7'</span>
<span class="o">>>></span> content[3]
<span class="s1">'\x8c'</span>
<span class="o">>>></span> content[4]
<span class="s1">'\xab'</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">hi</code> is returned perfectly fine as those are ASCII characters, but the Chinese character is represented by its UTF-8 byte sequence. Since the <code class="language-plaintext highlighter-rouge">str</code> object in <code class="language-plaintext highlighter-rouge">python2</code> just stores a sequence of bytes, it has no way of knowing to group these 3 bytes to represent the Chinese character. So we see them as hexadecimal numbers.</p>
<p>So how should we deal with this?</p>
<h2 id="decode-to-the-rescue"><code class="language-plaintext highlighter-rouge">decode()</code> to the rescue</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">>>></span> utf_content <span class="o">=</span> content.decode<span class="o">(</span><span class="s1">'utf-8'</span><span class="o">)</span>
<span class="o">>>></span> utf_content
u<span class="s1">'hi\u732b'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="o">(</span>utf_content<span class="o">)</span>
<<span class="nb">type</span> <span class="s1">'unicode'</span><span class="o">></span>
<span class="o">>>></span> len<span class="o">(</span>utf_content<span class="o">)</span>
3
<span class="o">>>></span> utf_content[0]
u<span class="s1">'h'</span>
<span class="o">>>></span> utf_content[1]
u<span class="s1">'i'</span>
<span class="o">>>></span> utf_content[2]
u<span class="s1">'\u732b'</span>
<span class="o">>>></span></code></pre></figure>
<p>So <code class="language-plaintext highlighter-rouge">decode('utf-8')</code> tells python to interpret the bytes in <code class="language-plaintext highlighter-rouge">content</code> as <code class="language-plaintext highlighter-rouge">UTF-8</code> and build a <code class="language-plaintext highlighter-rouge">unicode</code> object from them. I know, the name is confusing as hell. But let’s leave that for another day.</p>
<p>If we call the <code class="language-plaintext highlighter-rouge">print</code> statement now, let’s see what we get</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">>>></span> print utf_content
hi猫
<span class="o">>>></span></code></pre></figure>
<p>So there you go.</p>
<h2 id="word-of-caution">Word of caution</h2>
<p>Weird things happen in <code class="language-plaintext highlighter-rouge">python2</code> if you assume that <code class="language-plaintext highlighter-rouge">str</code> is a <code class="language-plaintext highlighter-rouge">string</code>. To be safe, convert the <code class="language-plaintext highlighter-rouge">str</code> object to a <code class="language-plaintext highlighter-rouge">unicode</code> object immediately by doing a <code class="language-plaintext highlighter-rouge">decode('utf-8')</code>. Then work with your <code class="language-plaintext highlighter-rouge">unicode</code> object and not the <code class="language-plaintext highlighter-rouge">str</code>, or else you will have some real pain handling the issues, like I had in <a href="https://github.com/tasdikrahman/vocabulary#known-issues">vocabulary</a></p>
<blockquote>
<p>In python2, a unicode object type represents real strings whereas the str object is a sequence of bytes.</p>
</blockquote>
<p>So when you are done processing your <code class="language-plaintext highlighter-rouge">unicode</code> object and you want to write it down to a file or a database, first convert it back to a sequence of <code class="language-plaintext highlighter-rouge">bytes</code> (a <code class="language-plaintext highlighter-rouge">str</code> object) using the <code class="language-plaintext highlighter-rouge">encode()</code> method.</p>
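<p>In Python 3 terms (where the explicit <code class="language-plaintext highlighter-rouge">encode()</code> step still applies when you open a file in binary mode), the round trip might be sketched like this; the file name is just an example:</p>

```python
import os
import tempfile

text = 'hi猫'                # a real (unicode) string
data = text.encode('utf-8')  # back to a sequence of bytes

# write the bytes out, then read them back and decode
path = os.path.join(tempfile.gettempdir(), 'unicode-demo.txt')
with open(path, 'wb') as f:
    f.write(data)
with open(path, 'rb') as f:
    assert f.read().decode('utf-8') == text
os.remove(path)
```
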
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">str_content</span> <span class="o">=</span> <span class="n">utf_content</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">)</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">str_content</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s">'str'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">str_content</span>
<span class="s">'hi</span><span class="se">\xe7\x8c\xab</span><span class="s">'</span>
<span class="o">>>></span> <span class="n">content</span> <span class="o">==</span> <span class="n">str_content</span>
<span class="bp">True</span>
<span class="o">>>></span> </code></pre></figure>
<p>Now you will be able to write this content to a file or database; doing so directly with a <code class="language-plaintext highlighter-rouge">unicode</code> object would have given you some weird errors.</p>
<p>Okay, okay. I will show that to you</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'myfile.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="p">...</span> <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">utf_content</span><span class="p">)</span>
<span class="p">...</span>
<span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
<span class="n">File</span> <span class="s">"<stdin>"</span><span class="p">,</span> <span class="n">line</span> <span class="mi">2</span><span class="p">,</span> <span class="ow">in</span> <span class="o"><</span><span class="n">module</span><span class="o">></span>
<span class="nb">UnicodeEncodeError</span><span class="p">:</span> <span class="s">'ascii'</span> <span class="n">codec</span> <span class="n">can</span><span class="s">'t encode character u'</span>\<span class="n">u732b</span><span class="s">' in position 2: ordinal not in range(128)
>>></span></code></pre></figure>
<p>Now doing the same with the <code class="language-plaintext highlighter-rouge">str</code> object</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">>>></span> with open<span class="o">(</span><span class="s1">'myfile.txt'</span>, <span class="s1">'w'</span><span class="o">)</span> as f:
... f.write<span class="o">(</span>str_content<span class="o">)</span>
...
<span class="o">>>></span></code></pre></figure>
<h2 id="handling-unicode-strings-in-python3">Handling unicode strings in <code class="language-plaintext highlighter-rouge">python3</code></h2>
<p>Python3 makes handling of unicode strings easy.</p>
<p>One of the significant changes being that, <code class="language-plaintext highlighter-rouge">str</code> now stores unicode <code class="language-plaintext highlighter-rouge">strings</code> and not a sequence of <code class="language-plaintext highlighter-rouge">bytes</code></p>
<p>Let’s see how it handles the <a href="../../../../content/unicode/chinese.txt">chinese.txt</a> file</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik at Acer <span class="k">in</span> ~/unicode
<span class="nv">$ </span>python3
Python 3.4.2 <span class="o">(</span>default, Jun 19 2015, 11:34:49<span class="o">)</span>
<span class="o">[</span>GCC 4.9.1] on linux
Type <span class="s2">"help"</span>, <span class="s2">"copyright"</span>, <span class="s2">"credits"</span> or <span class="s2">"license"</span> <span class="k">for </span>more information.
<span class="o">>>></span> with open<span class="o">(</span><span class="s1">'chinese.txt'</span><span class="o">)</span> as f:
... content <span class="o">=</span> f.read<span class="o">()</span>
...
<span class="o">>>></span> <span class="nb">type</span><span class="o">(</span>content<span class="o">)</span>
<class <span class="s1">'str'</span><span class="o">></span>
<span class="o">>>></span> len<span class="o">(</span>content<span class="o">)</span>
3
<span class="o">>>></span> content[0]
<span class="s1">'h'</span>
<span class="o">>>></span> content[1]
<span class="s1">'i'</span>
<span class="o">>>></span> content[2]
<span class="s1">'猫'</span></code></pre></figure>
<p>So everything works out of the box (going with the batteries-included philosophy of <code class="language-plaintext highlighter-rouge">python</code>).</p>
<p>Now what if I wanted to interpret the contents of it as <code class="language-plaintext highlighter-rouge">bytes</code>?</p>
<p>You can do so by passing the mode <code class="language-plaintext highlighter-rouge">rb</code> when opening the file</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'chinese.txt'</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="p">...</span> <span class="n">content</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>
<span class="p">...</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
<span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">bytes</span><span class="s">'>
>>> content
b'</span><span class="n">hi</span>\<span class="n">xe7</span>\<span class="n">x8c</span>\<span class="n">xab</span><span class="s">'</span></code></pre></figure>
<p>So now you have got the default behaviour of <code class="language-plaintext highlighter-rouge">python2</code>.</p>
<p>Converting it into <code class="language-plaintext highlighter-rouge">utf-8</code></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="o">>>></span> <span class="n">content</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">)</span>
<span class="s">'hi猫'</span></code></pre></figure>
<p>So to sum it up</p>
<blockquote>
<p>In <code class="language-plaintext highlighter-rouge">python3</code>, <code class="language-plaintext highlighter-rouge">str</code> represents a <code class="language-plaintext highlighter-rouge">unicode</code> string while the <code class="language-plaintext highlighter-rouge">bytes</code> type represents a sequence of <code class="language-plaintext highlighter-rouge">bytes</code></p>
</blockquote>
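<p>That summary can be checked in a couple of lines of Python 3:</p>

```python
s = 'hi猫'
b = s.encode('utf-8')

assert isinstance(s, str) and len(s) == 3    # three characters
assert isinstance(b, bytes) and len(b) == 5  # five bytes
assert b.decode('utf-8') == s                # lossless round trip
```
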
<p>For further reading, I would really, really suggest you have a look at the content written by these authors</p>
<ul>
<li><a href="http://www.joelonsoftware.com/articles/Unicode.html">Joel Spolsky</a></li>
<li><a href="http://nedbatchelder.com/text/unipain.html">Ned Batchelder</a></li>
<li><a href="http://pgbovine.net/unicode-python.htm">Philip Guo</a></li>
</ul>
<p>on this topic</p>
Submitting python package to pypi
2015-11-10T00:00:00+00:00
https://www.tasdikrahman.com/2015/11/10/submitting-package-to-pypi
<p>Recently I had written a <a href="https://github.com/tasdikrahman/pyzipcode-cli/">thin wrapper around getziptastic’s API</a> and I wanted that to be available as a <a href="https://pypi.python.org/pypi">pypi package</a>.</p>
<h2 id="what-is-pypi">What is <code class="language-plaintext highlighter-rouge">PyPI</code>?</h2>
<p>From the official website:</p>
<h4 id="pypi--the-python-package-index"><code class="language-plaintext highlighter-rouge">PyPI</code> — the Python Package Index</h4>
<p>The Python Package Index is a repository of software for the Python programming language.
Written something cool? Want others to be able to install it with easy_install or pip? Put your code on PyPI. It’s a big list of python packages that you absolutely must submit your package to for it to be easily one-line installable.</p>
<p>The good news is that submitting to PyPI is simple in theory: just sign up and upload your code, all for free. The bad news is that in practice it’s a little bit more complicated than that. The other good news is that I’ve written this guide, and that if you’re stuck, you can always refer to the official documentation.</p>
<h2 id="create-your-accounts">Create your accounts</h2>
<p>On <a href="http://pypi.python.org/pypi?%3Aaction=register_form">PyPI Live</a> and also on <a href="http://testpypi.python.org/pypi?%3Aaction=register_form">PyPI Test</a>. You must create an account in order to be able to upload your code. I recommend using the same email/password for both accounts, just to make your life easier when it comes time to push.</p>
<h2 id="create-a-pypirc-configuration-file">Create a <code class="language-plaintext highlighter-rouge">.pypirc</code> configuration file</h2>
<p>This file holds your information for authenticating with PyPI, both the live and the test versions. This file should be placed in your home directory. So do a</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>nano ~/.pypirc</code></pre></figure>
<p>Where you can use your favorite text editor in place of <code class="language-plaintext highlighter-rouge">nano</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">[</span>distutils] <span class="c"># this tells distutils what package indexes you can push to</span>
index-servers <span class="o">=</span>
pypi
pypitest
<span class="o">[</span>pypi]
repository: https://pypi.python.org/pypi
username: user_name
password: your_password
<span class="o">[</span>pypitest]
repository: https://testpypi.python.org/pypi
username: user_name
password: your_password
<span class="o">[</span>server-login]
repository: https://testpypi.python.org/pypi
username: user_name
password: your_password</code></pre></figure>
<p>This is just to make your life easier, so that when it comes time to upload you don’t have to type/remember your username and password.</p>
<h2 id="prepare-your-package">Prepare your package</h2>
<p>Every package on PyPI needs to have a file called <code class="language-plaintext highlighter-rouge">setup.py</code> at the root of the directory. If you’re using a markdown-formatted README file you’ll also need a <code class="language-plaintext highlighter-rouge">setup.cfg</code> file. Also, you’ll want a <code class="language-plaintext highlighter-rouge">LICENSE.txt</code> file describing what can be done with your code. So since I’ve been working on a library called <code class="language-plaintext highlighter-rouge">pyzipcode-cli</code>, my directory structure would look like this:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nb">.</span>
├── LICENSE.txt
├── pyzipcode-cli
│ ├── countries.json
│ ├── __init__.py
│ └── pyzipcode-cli.py
├── README.md
├── requirements.txt
├── setup.cfg
├── setup.py
└── usage.gif</code></pre></figure>
<p>Here’s a breakdown of what goes in which file:</p>
<h4 id="setuppy"><code class="language-plaintext highlighter-rouge">setup.py</code></h4>
<p>This is metadata about your library.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">#!/usr/bin/env python</span>
try:
import os
from setuptools import setup, find_packages
except ImportError:
from distutils.core import setup
setup<span class="o">(</span>
name <span class="o">=</span> <span class="s1">'pyzipcode-cli'</span>,
version <span class="o">=</span> <span class="s1">'0.0.12'</span>,
author <span class="o">=</span> <span class="s1">'Tasdik Rahman'</span>,
author_email <span class="o">=</span> <span class="s1">'tasdikrahman@zoho.com'</span>,
<span class="c"># packages = ['pyzipcode_cli'], </span>
description <span class="o">=</span> <span class="s2">"a thin wrapper around getziptastic's API v2"</span>,
url <span class="o">=</span> <span class="s1">'https://github.com/tasdikrahman/pyzipcode-cli'</span>,
license <span class="o">=</span> <span class="s1">'MIT'</span>,
install_requires <span class="o">=</span> <span class="o">[</span>
<span class="s2">"docopt==0.6.1"</span>,
<span class="s2">"requests==2.8.1"</span>
<span class="o">]</span>,
<span class="c">### adding package data to it </span>
<span class="nv">packages</span><span class="o">=</span>find_packages<span class="o">(</span><span class="nv">exclude</span><span class="o">=[</span><span class="s1">'contrib'</span>, <span class="s1">'docs'</span>, <span class="s1">'tests'</span><span class="o">])</span>,
<span class="nv">package_data</span><span class="o">={</span>
<span class="s1">'pyzipcode_cli'</span>: <span class="o">[</span><span class="s1">'*.json'</span><span class="o">]</span>,
<span class="o">}</span>,
<span class="c">###</span>
download_url <span class="o">=</span> <span class="s1">'https://github.com/tasdikrahman/pyzipcode-cli/tarball/0.0.12'</span>,
classifiers <span class="o">=</span> <span class="o">[</span>
<span class="s1">'Intended Audience :: Developers'</span>,
<span class="s1">'Topic :: Software Development :: Build Tools'</span>,
<span class="s1">'License :: OSI Approved :: MIT License'</span>,
<span class="c"># Specify the Python versions you support here. In particular, ensure</span>
<span class="c"># that you indicate whether you support Python 2, Python 3 or both.</span>
<span class="s1">'Programming Language :: Python :: 2.7'</span>,
<span class="s1">'Programming Language :: Python :: 3.4'</span>,
<span class="o">]</span>,
keywords <span class="o">=</span> <span class="o">[</span><span class="s1">'api'</span>, <span class="s1">'geo-location'</span>, <span class="s1">'zipcode'</span>,<span class="s1">'devtools'</span>, <span class="s1">'Development'</span>, <span class="s1">'ziptastic'</span><span class="o">]</span>,
entry_points <span class="o">=</span> <span class="o">{</span>
<span class="s1">'console_scripts'</span>: <span class="o">[</span>
<span class="s1">'pyzipcode = pyzipcode_cli.core:main'</span>
<span class="o">]</span>,
<span class="o">}</span>
<span class="o">)</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">download_url</code> is a link to a hosted file with your repository’s code. Github will host this for you, but only if you create a git tag.</p>
<p>In your repository, type:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>git tag 0.1 <span class="nt">-m</span> <span class="s2">"Adds a tag so that we can put this on PyPI."</span></code></pre></figure>
<p>Then, type <code class="language-plaintext highlighter-rouge">git tag</code> to show a list of tags — you should see 0.1 in the list.</p>
<p>Type</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">git push <span class="nt">--tags</span> origin master</code></pre></figure>
<p>to update your code on Github with the latest tag information. Github creates tarballs for download at <code class="language-plaintext highlighter-rouge">https://github.com/{username}/{module_name}/tarball/{tag}</code>.</p>
<h4 id="setupcfg"><code class="language-plaintext highlighter-rouge">setup.cfg</code></h4>
<p>This tells PyPI where your README file is.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="o">[</span>metadata]
description-file <span class="o">=</span> README.md</code></pre></figure>
<p>This is necessary if you’re using a markdown README file. At upload time, you may still get some errors about the lack of a README — don’t worry about it. If you don’t have to use a markdown <code class="language-plaintext highlighter-rouge">README</code> file, I would recommend using reStructuredText (reST) instead.</p>
<h4 id="licensetxt"><code class="language-plaintext highlighter-rouge">LICENSE.txt</code></h4>
<p>This file will contain whichever license you want your code to have. I tend to use the <a href="prodicus.mit-license.org">MIT license</a>.</p>
<h2 id="upload-your-package-to-pypi-test">Upload your package to PyPI Test</h2>
<p>Run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>python setup.py register <span class="nt">-r</span> pypitest</code></pre></figure>
<p>This will attempt to register your package against PyPI’s test server, just to make sure you’ve set up everything correctly.</p>
<p>Then, run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>python setup.py sdist upload <span class="nt">-r</span> pypitest</code></pre></figure>
<p>You should get no errors, and should also now be able to see your library in the test PyPI repository.</p>
<h2 id="upload-to-pypi-live">Upload to PyPI Live</h2>
<p>Once you’ve successfully uploaded to PyPI Test, perform the same steps but point to the live PyPI server instead. To register, run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>python setup.py register <span class="nt">-r</span> pypi</code></pre></figure>
<p>Then, run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>python setup.py sdist upload <span class="nt">-r</span> pypi</code></pre></figure>
<p>and you’re done!</p>
<h2 id="some-shameless-promotion">Some shameless promotion</h2>
<p>If you want to try my package out, here are the links:</p>
<ul>
<li><a href="https://github.com/tasdikrahman/pyzipcode-cli/">github link to the repo</a></li>
<li><a href="https://pypi.python.org/pypi/pyzipcode-cli/">https://pypi.python.org/pypi/pyzipcode-cli/</a></li>
</ul>
<p>My cool looking badge :D</p>
<p><a href="https://badge.fury.io/py/pyzipcode-cli"><img src="https://badge.fury.io/py/pyzipcode-cli.svg" alt="PyPI version" /></a></p>
<h2 id="references">References:</h2>
<ul>
<li><a href="http://peterdowns.com/posts/first-time-with-pypi.html">http://peterdowns.com/posts/first-time-with-pypi.html</a></li>
<li><a href="https://the-hitchhikers-guide-to-packaging.readthedocs.org/en/latest/quickstart.html#lay-out-your-project">https://the-hitchhikers-guide-to-packaging.readthedocs.org/en/latest/quickstart.html#lay-out-your-project</a></li>
<li><a href="http://stackoverflow.com/questions/3658084/how-to-send-a-package-to-pypi?rq=1">http://stackoverflow.com/questions/3658084/how-to-send-a-package-to-pypi?rq=1</a></li>
<li><a href="http://packages.python.org/an_example_pypi_project/">http://packages.python.org/an_example_pypi_project/</a></li>
<li><a href="https://wiki.python.org/moin/CheeseShopTutorial">https://wiki.python.org/moin/CheeseShopTutorial</a></li>
<li><a href="http://docs.python.org/distutils/index.html">http://docs.python.org/distutils/index.html</a></li>
<li><a href="http://zetcode.com/articles/packageinpython/">http://zetcode.com/articles/packageinpython/</a></li>
<li><a href="http://stackoverflow.com/questions/18787036/difference-between-entry-points-console-scripts-and-scripts-in-setup-py">http://stackoverflow.com/questions/18787036/difference-between-entry-points-console-scripts-and-scripts-in-setup-py</a></li>
<li><a href="http://stackoverflow.com/questions/23324353/pros-and-cons-of-script-vs-entry-point-in-python-command-line-scripts?lq=1">http://stackoverflow.com/questions/23324353/pros-and-cons-of-script-vs-entry-point-in-python-command-line-scripts?lq=1</a></li>
</ul>
Creating a gif of the current window
2015-11-07T00:00:00+00:00
https://www.tasdikrahman.com/2015/11/07/creating-gif-from-screencast
<h2 id="backdrop">BackDrop:</h2>
<p>Everybody, at some point in their life as a netizen, would have seen something like this:</p>
<p><img src="http://i.stack.imgur.com/0B664.gif" alt="http://i.stack.imgur.com/0B664.gif" /></p>
<p>How about we create one?</p>
<p>Well, recently when I was building a Calculator app, I wanted a <code class="language-plaintext highlighter-rouge">gif</code> image in the <code class="language-plaintext highlighter-rouge">README.md</code> to show the usage of the app.</p>
<blockquote>
<p>I wrote a <a href="http://www.tasdikrahman.com/2015/11/06/Building-a-calculator/">How hard can building a Calculator be, right?</a> post for the same some time back.</p>
</blockquote>
<p>Did some googling and found <code class="language-plaintext highlighter-rouge">byzanz-record</code> to be the perfect tool for the job.</p>
<h2 id="installation">Installation</h2>
<p>Beginning with Ubuntu 14.04, it is available in the <code class="language-plaintext highlighter-rouge">universe</code> repository:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>byzanz</code></pre></figure>
<p>If you are on a system older than that</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span><span class="nb">sudo </span>add-apt-repository ppa:fossfreedom/byzanz
<span class="nv">$ </span><span class="nb">sudo </span>apt-get update <span class="o">&&</span> <span class="nb">sudo </span>apt-get <span class="nb">install </span>byzanz</code></pre></figure>
<h2 id="usage">Usage:</h2>
<p>We are going to use this tool from the command line itself, as GUIs would just slow down the process. For that we need just four values:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">--x=<your_value></code></li>
<li><code class="language-plaintext highlighter-rouge">--y=<your_value></code></li>
<li><code class="language-plaintext highlighter-rouge">--width=<your_value></code></li>
<li><code class="language-plaintext highlighter-rouge">--height=<your_value></code></li>
</ul>
<p>Now how do we get those values? Run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>xwininfo</code></pre></figure>
<p>and point at the window which you want to record. It will return the required values, and a little extra information too!</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span>xwininfo
xwininfo: Please <span class="k">select </span>the window about which you
would like information by clicking the
mouse <span class="k">in </span>that window.
xwininfo: Window <span class="nb">id</span>: 0x740003f <span class="s2">"Calculator"</span>
Absolute upper-left X: 984
Absolute upper-left Y: 509
Relative upper-left X: 0
Relative upper-left Y: 0
Width: 284
Height: 169
Depth: 24
Visual: 0x20
Visual Class: TrueColor
Border width: 0
Class: InputOutput
Colormap: 0x22 <span class="o">(</span>installed<span class="o">)</span>
Bit Gravity State: NorthWestGravity
Window Gravity State: NorthWestGravity
Backing Store State: NotUseful
Save Under State: no
Map State: IsViewable
Override Redirect State: no
Corners: +984+509 <span class="nt">-98</span>+509 <span class="nt">-98-90</span> +984-90
<span class="nt">-geometry</span> 284x169-88-80
tasdik@Acer:~/Desktop/pyCalc<span class="err">$</span></code></pre></figure>
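<p>If you record often, the four values can be scraped out of <code class="language-plaintext highlighter-rouge">xwininfo</code>’s report programmatically. Here is a small sketch; the sample text below is trimmed from the output above, and the field names are exactly what <code class="language-plaintext highlighter-rouge">xwininfo</code> prints:</p>

```python
import re

# Trimmed sample of xwininfo's report (from the run shown above).
xwininfo_output = """
Absolute upper-left X:  984
Absolute upper-left Y:  509
Width: 284
Height: 169
"""

def extract_geometry(text):
    """Pull x, y, width and height out of an xwininfo report."""
    fields = {
        "x": r"Absolute upper-left X:\s+(\d+)",
        "y": r"Absolute upper-left Y:\s+(\d+)",
        "width": r"Width:\s+(\d+)",
        "height": r"Height:\s+(\d+)",
    }
    return {name: int(re.search(pattern, text).group(1))
            for name, pattern in fields.items()}

geo = extract_geometry(xwininfo_output)
cmd = ("byzanz-record --duration=45 --x={x} --y={y} "
       "--width={width} --height={height} out.gif".format(**geo))
print(cmd)
```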
<p>There now, I have got my values</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span>byzanz-record <span class="nt">--duration</span><span class="o">=</span>45 <span class="nt">--x</span><span class="o">=</span>984 <span class="nt">--y</span><span class="o">=</span>509 <span class="nt">--width</span><span class="o">=</span>290 <span class="nt">--height</span><span class="o">=</span>170 out.gif</code></pre></figure>
<p>This will immediately start recording the <code class="language-plaintext highlighter-rouge">Calculator</code> window for 45 seconds. So be sure to finish whatever you want to do before that.</p>
<p>Here’s what I got from the above command:</p>
<p><img src="https://raw.githubusercontent.com/tasdikrahman/pyCalc/master/pyCalc_usage.gif" alt="https://raw.githubusercontent.com/tasdikrahman/pyCalc/master/pyCalc_usage.gif" /></p>
<h2 id="references">References:</h2>
<ul>
<li><a href="http://askubuntu.com/questions/107726/how-to-create-animated-gif-images-of-a-screencast">http://askubuntu.com/questions/107726/how-to-create-animated-gif-images-of-a-screencast</a></li>
<li><a href="http://askubuntu.com/questions/4428/how-to-create-a-screencast">http://askubuntu.com/questions/4428/how-to-create-a-screencast</a></li>
</ul>
Converting python script into an executable
2015-11-07T00:00:00+00:00
https://www.tasdikrahman.com/2015/11/07/converting-python-script-into-executable
<h2 id="backdrop">BackDrop:</h2>
<p>So recently I was building a Calculator using <code class="language-plaintext highlighter-rouge">tkinter</code>.</p>
<blockquote>
<p>I wrote a <a href="http://www.tasdikrahman.com/2015/11/06/Building-a-calculator/">blog post</a> for the same some time back.</p>
</blockquote>
<p>Now I thought, how awesome it would be if I could distribute it to my friends and let them use it. The problem was that some of them, not being CS grads, would not know heads or tails about how to run it!</p>
<p>So I thought the best way and the easiest way was to convert the <code class="language-plaintext highlighter-rouge">pyCalc.py</code> into a <code class="language-plaintext highlighter-rouge">.exe</code> file.</p>
<p>That way both my purposes were solved.</p>
<ul>
<li>non-CS people would get an interface to run it which was familiar to them. Heck, even a granny who knows how to use Chrome to watch cookery shows can use it now. Ok, that was a little bit too much</li>
<li>They didn’t have to install anything in this process</li>
</ul>
<h2 id="enter-pyinstaller">Enter <a href="https://github.com/pyinstaller/pyinstaller/">pyinstaller</a></h2>
<blockquote>
<p>Note: Before installing PyInstaller on Windows, you will need to install <code class="language-plaintext highlighter-rouge">PyWin32</code>. You do not need to do this for GNU/Linux or Mac OS X systems.</p>
</blockquote>
<p>To install it, you just have to do</p>
<p><code class="language-plaintext highlighter-rouge">$ sudo pip install pyinstaller</code> for <code class="language-plaintext highlighter-rouge">python2.*</code></p>
<p>or</p>
<p><code class="language-plaintext highlighter-rouge">$ sudo pip3 install pyinstaller</code> for <code class="language-plaintext highlighter-rouge">python3.*</code></p>
<p>If you are behind a proxy server, just add <code class="language-plaintext highlighter-rouge">-E</code> flag like this <code class="language-plaintext highlighter-rouge">sudo -E pip3 ..</code></p>
<h2 id="creating-the-executable">Creating the executable</h2>
<p>I have my <code class="language-plaintext highlighter-rouge">pyCalc.py</code> which I want to make an executable of, in</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span>tree
<span class="nb">.</span>
└── pyCalc.py
0 directories, 1 file
tasdik@Acer:~/Desktop/pyCalc<span class="err">$</span></code></pre></figure>
<p>To build the executable</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span>pyinstaller <span class="nt">--onefile</span> <span class="nt">--windowed</span> pyCalc.py</code></pre></figure>
<p>Yes, it’s that easy!</p>
<p>If you are not haunted by any errors, you should see two folders placed in <code class="language-plaintext highlighter-rouge">pyCalc</code>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span>tree
<span class="nb">.</span>
├── build
│ └── pyCalc
│ ├── base_library.zip
│ ├── localpycos
│ │ ├── pyimod01_os_path.pyc
│ │ ├── pyimod02_archive.pyc
│ │ ├── pyimod03_importers.pyc
│ │ └── struct.pyo
│ ├── out00-Analysis.toc
│ ├── out00-EXE.toc
│ ├── out00-PKG.pkg
│ ├── out00-PKG.toc
│ ├── out00-PYZ.pyz
│ ├── out00-PYZ.toc
│ ├── out00-Tree.toc
│ ├── out01-Tree.toc
│ └── warnpyCalc.txt
├── dist
│ └── pyCalc
├── pyCalc.py
└── pyCalc.spec
4 directories, 17 files
tasdik@Acer:~/Desktop/pyCalc<span class="nv">$ </span></code></pre></figure>
<p>For a successful build, the final executable, <code class="language-plaintext highlighter-rouge">pyCalc</code>, and any associated files, will be placed in the <code class="language-plaintext highlighter-rouge">dist</code> directory, which will be created if it doesn’t exist.</p>
<p>Let me briefly describe the options that are being used:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">--onefile</code> is used to package everything into a single executable. If you do not specify this option, the libraries, etc. will be distributed as separate files alongside the main executable.</li>
<li><code class="language-plaintext highlighter-rouge">--windowed</code> prevents a console window from being displayed when the application is run. If you’re releasing a non-graphical application (i.e. a console application), you do not need to use this option.</li>
<li><code class="language-plaintext highlighter-rouge">pyCalc.py</code> is the main source file of the application. The basename of this script will be used as the name of the executable; however, you may specify an alternative executable name using the <code class="language-plaintext highlighter-rouge">--name</code> option.</li>
</ul>
<p>See the PyInstaller Manual for more configuration information.</p>
<p>Sadly, we can’t make <code class="language-plaintext highlighter-rouge">windows</code> executables with <code class="language-plaintext highlighter-rouge">pyinstaller</code> from here! Earlier versions supported it, but the newer versions do not.</p>
<h2 id="alternatives-to-pyinstaller">Alternatives to <code class="language-plaintext highlighter-rouge">pyinstaller</code></h2>
<ul>
<li><a href="http://www.py2exe.org/">http://www.py2exe.org/</a></li>
<li><a href="http://nuitka.net/">http://nuitka.net/</a></li>
<li><a href="http://wiki.python.org/moin/Py2Exe">http://wiki.python.org/moin/Py2Exe</a></li>
<li><a href="http://cx-freeze.sourceforge.net/">http://cx-freeze.sourceforge.net/</a></li>
</ul>
<p>Till then. Goodbye!</p>
<h2 id="references">References:</h2>
<ul>
<li><a href="https://mborgerson.com/creating-an-executable-from-a-python-script">https://mborgerson.com/creating-an-executable-from-a-python-script</a></li>
<li><a href="http://stackoverflow.com/questions/5458048/how-to-make-a-python-script-standalone-executable-to-run-without-any-dependency">http://stackoverflow.com/questions/5458048/how-to-make-a-python-script-standalone-executable-to-run-without-any-dependency</a></li>
</ul>
How hard can building a Calculator be right?
2015-11-06T00:00:00+00:00
https://www.tasdikrahman.com/2015/11/06/Building-a-calculator
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css" />
<h2 id="backdrop">BackDrop:</h2>
<p>So I was reading about this wonderful module called <code class="language-plaintext highlighter-rouge">tkinter</code> some days back. And man was I not sucked into it!</p>
<p>Sure, we do have some very good GUI modules like <code class="language-plaintext highlighter-rouge">PyQt</code>, <code class="language-plaintext highlighter-rouge">wxPython</code> and <code class="language-plaintext highlighter-rouge">PySide</code>, and I don’t deny the fact that some are way better than <code class="language-plaintext highlighter-rouge">tkinter</code>. But one advantage which makes <code class="language-plaintext highlighter-rouge">tkinter</code> stand out from the crowd is that it comes pre-packaged with Python itself.</p>
<p>What I mean by that is you don’t have to install anything extra to write GUIs with it, or to run my Calculator program, for instance.</p>
<h2 id="dabbling-with-it">Dabbling with it:</h2>
<p>I naturally got interested in it, so after reading the docs for some time I thought: why not apply it by making something small? That got me thinking about what to make with it!</p>
<p>After some rambling around, I settled on the thought of making a Calculator using <code class="language-plaintext highlighter-rouge">Tkinter</code>, aiming for a file search program on my second go.</p>
<h2 id="taking-a-piece-of-paper">Taking a piece of paper:</h2>
<p>So the first task was to decide whether to use <code class="language-plaintext highlighter-rouge">OOP</code> for this project or not. I figured that adding <code class="language-plaintext highlighter-rouge">class</code>es would only complicate it at the first go.</p>
<p>Second was to decide which functions/operators should it have initially, upon which initial design should be made.</p>
<p>The problem which took me the longest to figure out was how to parse the input entered by the user.
Take this input as an example.</p>
<p><img src="https://raw.githubusercontent.com/tasdikrahman/www.tasdikrahman.com/master/images/calcBlog_1.jpg" alt="calcBlog_1" /></p>
<p>Now to get <code class="language-plaintext highlighter-rouge">80</code> as the first operand, one has to press <code class="language-plaintext highlighter-rouge">8</code> and then <code class="language-plaintext highlighter-rouge">8</code> again. Now you and I know that it is <code class="language-plaintext highlighter-rouge">80</code>. But how do you make the program understand that?</p>
<p>I figured out that, it was best that I dumped the whole input into a function called <code class="language-plaintext highlighter-rouge">calculate</code> where I got the whole of the entered input through <code class="language-plaintext highlighter-rouge">display.get()</code>, <code class="language-plaintext highlighter-rouge">display</code> being an <code class="language-plaintext highlighter-rouge">Entry()</code> object.</p>
<h3 id="how-was-i-dumping-what-i-clicked-into-entry-widget">How was I dumping what I clicked into <code class="language-plaintext highlighter-rouge">Entry()</code> widget?</h3>
<p>I used two functions for that, <code class="language-plaintext highlighter-rouge">get_variables(num)</code> for <code class="language-plaintext highlighter-rouge">operand</code>s and <code class="language-plaintext highlighter-rouge">get_operation(operator)</code> for <code class="language-plaintext highlighter-rouge">operator</code>s. Inside each, I have a <code class="language-plaintext highlighter-rouge">global</code> variable whose value gets incremented each time control transfers to either of these functions. I use this <code class="language-plaintext highlighter-rouge">global</code> variable to keep track of the position of the next data item (be it an operand or an operator) to be inserted into the <code class="language-plaintext highlighter-rouge">Entry</code> widget.</p>
<p>Here is what I did for <code class="language-plaintext highlighter-rouge">get_variables()</code></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_variables</span><span class="p">(</span><span class="n">num</span><span class="p">):</span>
<span class="s">"""Gets the user input for operands and puts it inside the entry widget"""</span>
<span class="k">global</span> <span class="n">i</span>
<span class="n">display</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">num</span><span class="p">)</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span></code></pre></figure>
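<p>To make the mechanism concrete, here is a plain-Python stand-in with no Tkinter involved: <code class="language-plaintext highlighter-rouge">display</code> is modelled as a list, and <code class="language-plaintext highlighter-rouge">get_operation</code> is my guess at the mirror-image function, since only <code class="language-plaintext highlighter-rouge">get_variables</code> is shown here:</p>

```python
display = []  # stand-in for the Tkinter Entry() widget
i = 0         # global index tracking the next insertion position

def get_variables(num):
    """Insert an operand at the next free position (mirrors the post's version)."""
    global i
    display.insert(i, str(num))
    i += 1

def get_operation(operator):
    """Insert an operator at the next free position (hypothetical counterpart)."""
    global i
    display.insert(i, operator)
    i += 1

# Pressing 8, 0, +, 5 builds up "80+5" one character at a time,
# which is how "80" emerges from two separate button presses.
get_variables(8)
get_variables(0)
get_operation("+")
get_variables(5)
print("".join(display))  # 80+5
```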
<p>So for a typical <code class="language-plaintext highlighter-rouge">Button</code>, lets take <code class="language-plaintext highlighter-rouge">7</code> here. I have</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">seven</span> <span class="o">=</span> <span class="n">Button</span><span class="p">(</span><span class="n">root</span><span class="p">,</span> <span class="n">text</span> <span class="o">=</span> <span class="s">"7"</span><span class="p">,</span> <span class="n">command</span> <span class="o">=</span> <span class="k">lambda</span> <span class="p">:</span> <span class="n">get_variables</span><span class="p">(</span><span class="mi">7</span><span class="p">),</span> <span class="n">font</span><span class="o">=</span><span class="n">FONT_LARGE</span><span class="p">)</span>
<span class="n">seven</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">row</span> <span class="o">=</span> <span class="mi">4</span><span class="p">,</span> <span class="n">column</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span></code></pre></figure>
<p>Well one problem solved!</p>
<h3 id="adding-and---undo-button">Adding and <code class="language-plaintext highlighter-rouge"><-</code> (undo) button</h3>
<p>Now what if you pressed something wrong? You don’t want to press the <code class="language-plaintext highlighter-rouge">AC</code> button and clear the whole of the entered text, as you would then have to waste your time typing it all over again. How do we achieve that?</p>
<p>I figured out that the <code class="language-plaintext highlighter-rouge">whole_string</code> stores the whole of the input and what I wanted with the <code class="language-plaintext highlighter-rouge"><-</code> button was an undo of what I did last.</p>
<p>So I just had to remove the last character of the string <code class="language-plaintext highlighter-rouge">whole_string</code>. For that:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">new_string</span> <span class="o">=</span> <span class="n">whole_string</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">new_string</span><span class="p">)</span>
<span class="n">clear_all</span><span class="p">()</span>
<span class="n">display</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">new_string</span><span class="p">)</span></code></pre></figure>
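<p>In isolation (plain strings, no widget involved), the undo step is just a slice; the values below are my own example:</p>

```python
# Suppose the user typed a stray extra key: "80+55" instead of "80+5".
whole_string = "80+55"
new_string = whole_string[:-1]  # drop the last character
print(new_string)  # 80+5
```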
<p>That did what <code class="language-plaintext highlighter-rouge"><-</code> was supposed to do.</p>
<h3 id="how-do-i-seperate-the-operands-from-the-operators">How do I separate the operands from the operators?</h3>
<p>At first, I thought I should hardcode it for each and every operator. Say you have <code class="language-plaintext highlighter-rouge">+</code> inside the <code class="language-plaintext highlighter-rouge">whole_string</code> which stores the value of <code class="language-plaintext highlighter-rouge">display.get()</code>. Then you do a</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">if</span> <span class="s">'+'</span> <span class="ow">in</span> <span class="n">whole_string</span><span class="p">:</span>
<span class="c1"># Now to split the contents into `operands`, I did a `var1, var2 = whole_string.split(operator)`.
</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span> <span class="o">=</span> <span class="n">whole_string</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"+"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">var1</span><span class="p">)</span> <span class="o">+</span> <span class="nb">int</span><span class="p">(</span><span class="n">var2</span><span class="p">)</span></code></pre></figure>
<p>But you notice that this fails at the slightest of sneezes. Enter more than 2 operands and the two-variable unpacking breaks. You would be left writing <code class="language-plaintext highlighter-rouge">if-else</code> clauses for the rest of this project.</p>
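<p>The failure is easy to reproduce at the REPL; with three operands, <code class="language-plaintext highlighter-rouge">split()</code> hands back three pieces and the two-variable unpacking raises a <code class="language-plaintext highlighter-rouge">ValueError</code>:</p>

```python
whole_string = "1+2+3"           # three operands, two '+' operators
parts = whole_string.split("+")
print(parts)                     # ['1', '2', '3']

try:
    var1, var2 = parts           # expects exactly two pieces
except ValueError as err:
    print("unpacking failed:", err)
```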
<p>So I dropped this approach for good!</p>
<p>After some frustrating 3 hours of messing around and deleting 3 <code class="language-plaintext highlighter-rouge">branches</code>, I finally came to this.</p>
<blockquote>
<p>Why not use <code class="language-plaintext highlighter-rouge">parser</code> to parse the expression and evaluate it?</p>
</blockquote>
<p>So this is what I did</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">calculate</span><span class="p">():</span>
<span class="n">whole_string</span> <span class="o">=</span> <span class="n">display</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">formulae</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">expr</span><span class="p">(</span><span class="n">whole_string</span><span class="p">).</span><span class="nb">compile</span><span class="p">()</span> <span class="c1">## returns a `parser` object
</span> <span class="n">result</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">formulae</span><span class="p">)</span> <span class="c1">## evaluates the parsed expression
</span> <span class="n">clear_all</span><span class="p">()</span>
<span class="n">display</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">result</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
<span class="n">clear_all</span><span class="p">()</span>
<span class="n">display</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">"Error!"</span><span class="p">)</span></code></pre></figure>
<p>This solved my problem, and it works for most of the test cases I have checked it with, but by far the best approach would be to write my own <code class="language-plaintext highlighter-rouge">parser</code>.</p>
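<p>For the curious, here is a minimal sketch of what such a hand-rolled parser could look like, built on the <code class="language-plaintext highlighter-rouge">ast</code> module. This is my own illustration, not the pyCalc code; note that the stdlib <code class="language-plaintext highlighter-rouge">parser</code> module used above was deprecated and removed in Python 3.10, so on modern Pythons something along these lines is needed anyway:</p>

```python
import ast
import operator

# Map AST operator nodes to their arithmetic functions.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expression):
    """Evaluate +, -, *, / expressions without handing arbitrary code to eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))

print(safe_eval("80+5*2"))  # 90
```

<p>Unlike bare <code class="language-plaintext highlighter-rouge">eval()</code>, this rejects anything that is not plain arithmetic.</p>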
<p>Will refactor to include it in my next release.</p>
<h2 id="download-it">Download it!</h2>
<p><a href="https://github.com/tasdikrahman/pyCalc/releases/download/v1.0/pyCalc_v1" target="_blank"><i class="fa fa-linux fa-2x"></i></a></p>
<h2 id="fork-this-project">Fork this project</h2>
<p>Feel free to fork this project and make changes to it.</p>
<div><a href="https://github.com/tasdikrahman/pyCalc" class="btn btn-info">Github repo</a></div>
<h2 id="filing-bugs">Filing bugs</h2>
<p>If you find some bugs,
please file an issue on the github page</p>
<div><a href="https://github.com/tasdikrahman/pyCalc/issues/new" class="btn btn-danger">Report a bug</a></div>
<p>Till then. Goodbye!</p>
Setting up DBPedia Spotlight on your local server
2015-10-24T00:00:00+00:00
https://www.tasdikrahman.com/2015/10/24/Setting-up-DBPedia-on-your-local-server
<h2 id="intro-">Intro :</h2>
<p>DBpedia Spotlight can be queried over the API which they provide, but it is not always convenient or possible to do so.</p>
<p>So setting it up locally was the best solution.</p>
<p>You can run DBpedia Spotlight from the comfort of your own machine in one or many of the following ways:</p>
<ul>
<li><strong>Web Service</strong> (no installation needed). We offer a WADL service descriptor, so with Eclipse or Netbeans you can automagically create a client to call our Web Service. See: Web Service</li>
<li><strong>JAR</strong>. We offer a jar with all dependencies included. You can just download it and run from command line. See: Run from a Jar</li>
<li><strong>Maven</strong>. Our build is mavenized, which means that you can use the scala plugin to run our classes from command line. See: Build from Source with Maven</li>
<li><strong>Ubuntu/Debian package</strong>. We are also starting to share our downloads as debian packages so that anybody can just install DBpedia Spotlight directly from apt-get, Synaptic or their favorite package manager. See: Debian-Package-Installation:-How-To</li>
<li><strong>WAR Files / Tomcat</strong>. DBpedia Spotlight is also built as a WAR file. You can use it through Apache Tomcat.</li>
</ul>
<p>We would be doing it the <strong>JAR</strong> way.</p>
<h2 id="get-set-go">Get set. Go</h2>
<p>Requirements</p>
<ul>
<li>Java 1.6+</li>
<li>RAM of appropriate size for the spotter lexicon you need</li>
</ul>
<p>First, we will install a pre-packaged, lightweight deployment to get started.</p>
<p><strong>Lucene</strong></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys2@sys2:~<span class="nv">$ </span>wget http://spotlight.dbpedia.org/download/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip
<span class="nt">--2015-10-22</span> 10:28:16-- http://spotlight.dbpedia.org/download/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip
Resolving spotlight.dbpedia.org <span class="o">(</span>spotlight.dbpedia.org<span class="o">)</span>... 134.155.95.15
Connecting to spotlight.dbpedia.org <span class="o">(</span>spotlight.dbpedia.org<span class="o">)</span>|134.155.95.15|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip <span class="o">[</span>following]
<span class="nt">--2015-10-22</span> 10:28:17-- http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.6/dbpedia-spotlight-quickstart-0.6.5.zip
Resolving wifo5-04.informatik.uni-mannheim.de <span class="o">(</span>wifo5-04.informatik.uni-mannheim.de<span class="o">)</span>... 134.155.95.17
Connecting to wifo5-04.informatik.uni-mannheim.de <span class="o">(</span>wifo5-04.informatik.uni-mannheim.de<span class="o">)</span>|134.155.95.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 156569414 <span class="o">(</span>149M<span class="o">)</span> <span class="o">[</span>application/zip]
Saving to: ‘dbpedia-spotlight-quickstart-0.6.5.zip’
100%[<span class="o">====================================================================================================>]</span> 15,65,69,414 1.14MB/s <span class="k">in </span>5m 51s
2015-10-22 10:34:09 <span class="o">(</span>435 KB/s<span class="o">)</span> - ‘dbpedia-spotlight-quickstart-0.6.5.zip’ saved <span class="o">[</span>156569414/156569414]
sys2@sys2:~<span class="err">$</span>
sys2@sys2:~<span class="nv">$ </span>unzip dbpedia-spotlight-quickstart-0.6.5.zip
Archive: dbpedia-spotlight-quickstart-0.6.5.zip
creating: dbpedia-spotlight-quickstart-0.6.5/data/
creating: dbpedia-spotlight-quickstart-0.6.5/data/index/
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/_1.cfs
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/_2.cfs
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/_3.cfs
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/segments.gen
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/segments_5
inflating: dbpedia-spotlight-quickstart-0.6.5/data/index/similarity-thresholds.txt
creating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/
creating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-chunker.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-chunker.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-location.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-location.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-organization.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-organization.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-person.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-ner-person.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-pos-maxent.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-pos-maxent.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-sent.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-sent.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-token.bin
extracting: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/english/en-token.zip
inflating: dbpedia-spotlight-quickstart-0.6.5/data/opennlp/README.txt
inflating: dbpedia-spotlight-quickstart-0.6.5/data/pos-en-general-brown.HiddenMarkovModel
inflating: dbpedia-spotlight-quickstart-0.6.5/data/spotter.dict
inflating: dbpedia-spotlight-quickstart-0.6.5/dbpedia-spotlight-0.6.5-jar-with-dependencies.jar
inflating: dbpedia-spotlight-quickstart-0.6.5/run.sh
inflating: dbpedia-spotlight-quickstart-0.6.5/server.properties
inflating: dbpedia-spotlight-quickstart-0.6.5/apache-2.0.txt
inflating: dbpedia-spotlight-quickstart-0.6.5/lingpipe-license-1.txt
sys2@sys2:~<span class="nv">$ </span><span class="nb">cd </span>dbpedia-spotlight-quickstart-0.6.5/
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5<span class="nv">$ </span></code></pre></figure>
<p><strong>Statistical</strong></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys2@sys2:~<span class="nv">$ </span>wget http://spotlight.sztaki.hu/downloads/version-0.1/en.tar.gz
<span class="nt">--2015-10-22</span> 11:12:58-- http://spotlight.sztaki.hu/downloads/version-0.1/en.tar.gz
Resolving spotlight.sztaki.hu <span class="o">(</span>spotlight.sztaki.hu<span class="o">)</span>... 193.225.89.3
Connecting to spotlight.sztaki.hu <span class="o">(</span>spotlight.sztaki.hu<span class="o">)</span>|193.225.89.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2257455959 <span class="o">(</span>2.1G<span class="o">)</span> <span class="o">[</span>application/x-gzip]
Saving to: ‘en.tar.gz’
100%[<span class="o">==================================================================================================>]</span> 2,25,74,55,959 935KB/s <span class="k">in </span>44m 59s
2015-10-22 11:57:58 <span class="o">(</span>817 KB/s<span class="o">)</span> - ‘en.tar.gz’ saved <span class="o">[</span>2257455959/2257455959]
sys2@sys2:~<span class="nv">$ </span>
sys2@sys2:~<span class="nv">$ </span>wget http://spotlight.sztaki.hu/downloads/version-0.1/dbpedia-spotlight.jar
<span class="nt">--2015-10-22</span> 12:13:39-- http://spotlight.sztaki.hu/downloads/version-0.1/dbpedia-spotlight.jar
Resolving spotlight.sztaki.hu <span class="o">(</span>spotlight.sztaki.hu<span class="o">)</span>... 193.225.89.3
Connecting to spotlight.sztaki.hu <span class="o">(</span>spotlight.sztaki.hu<span class="o">)</span>|193.225.89.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116325524 <span class="o">(</span>111M<span class="o">)</span> <span class="o">[</span>application/java-archive]
Saving to: ‘dbpedia-spotlight.jar’
100%[<span class="o">====================================================================================================>]</span> 11,63,25,524 415KB/s <span class="k">in </span>3m 27s
2015-10-22 12:17:07 <span class="o">(</span>549 KB/s<span class="o">)</span> - ‘dbpedia-spotlight.jar’ saved <span class="o">[</span>116325524/116325524]
sys2@sys2:~<span class="nv">$ </span><span class="nb">tar </span>xvf en.tar.gz
en/
en/model/
en/model/res.mem
en/model/res.mem_
en/model/tokens.mem
en/model/context.mem
en/model/sf.mem
en/model/candmap.mem
en/model.properties
en/stopwords.list
en/spotter_thresholds.txt
en/opennlp/
en/opennlp/pos-maxent.bin
en/opennlp/token.bin
en/opennlp/chunker.bin
en/opennlp/sent.bin
sys2@sys2:~<span class="err">$</span>
sys2@sys2:~<span class="err">$</span></code></pre></figure>
<h2 id="add-the-data-corpus">Add the Data corpus</h2>
<p>We can run the model right now, but I will do so after adding the larger data corpus.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5<span class="nv">$ </span><span class="nb">cd </span>data
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span>wget http://spotlight.dbpedia.org/download/release-0.5/context-index-compact.tgz
<span class="nt">--2015-10-22</span> 12:25:58-- http://spotlight.dbpedia.org/download/release-0.5/context-index-compact.tgz
Resolving spotlight.dbpedia.org <span class="o">(</span>spotlight.dbpedia.org<span class="o">)</span>... 134.155.95.15
Connecting to spotlight.dbpedia.org <span class="o">(</span>spotlight.dbpedia.org<span class="o">)</span>|134.155.95.15|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.5/context-index-compact.tgz <span class="o">[</span>following]
<span class="nt">--2015-10-22</span> 12:25:59-- http://wifo5-04.informatik.uni-mannheim.de/downloads/release-0.5/context-index-compact.tgz
Resolving wifo5-04.informatik.uni-mannheim.de <span class="o">(</span>wifo5-04.informatik.uni-mannheim.de<span class="o">)</span>... 134.155.95.17
Connecting to wifo5-04.informatik.uni-mannheim.de <span class="o">(</span>wifo5-04.informatik.uni-mannheim.de<span class="o">)</span>|134.155.95.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12017976481 <span class="o">(</span>11G<span class="o">)</span> <span class="o">[</span>application/x-gzip]
Saving to: ‘context-index-compact.tgz’
100%[<span class="o">=================================================================================================>]</span> 12,01,79,76,481 1.16MB/s <span class="k">in </span>4h 10m
2015-10-22 16:36:37 <span class="o">(</span>781 KB/s<span class="o">)</span> - ‘context-index-compact.tgz’ saved <span class="o">[</span>12017976481/12017976481]
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span><span class="nb">tar </span>zxvf context-index-compact.tgz
index-withSF-withTypes-compressed/
index-withSF-withTypes-compressed/_at.cfs
index-withSF-withTypes-compressed/_66.cfs
index-withSF-withTypes-compressed/segments_9t
index-withSF-withTypes-compressed/_99.cfs
index-withSF-withTypes-compressed/similarity-thresholds.txt
index-withSF-withTypes-compressed/segments.gen
index-withSF-withTypes-compressed/_33.cfs
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span><span class="nb">mv </span>index-withSF-withTypes-compressed index
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span>wget http://spotlight.dbpedia.org/download/release-0.4/surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span><span class="nb">gunzip </span>surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary.gz
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span><span class="nb">mv </span>surface_forms-Wikipedia-TitRedDis.uriThresh75.tsv.spotterDictionary spotter.dict
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span></code></pre></figure>
<h2 id="testing-the-installation-">Testing the installation</h2>
<p>To test the installation, run:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span>curl http://sys2:2222/rest/annotate <span class="nt">-H</span> <span class="s2">"Accept: text/xml"</span> <span class="nt">--data-urlencode</span> <span class="s2">"text=The earliest authenticated human remains in South Asia date to about 30,000 years ago.[26] Nearly contemporaneous Mesolithic rock art sites have been found in many parts of the Indian subcontinent, including at the Bhimbetka rock shelters in Madhya Pradesh.[27] Around 7000 BCE, the first known Neolithic settlements appeared on the subcontinent in Mehrgarh and other sites in western Pakistan.[28] These gradually developed into the Indus Valley Civilisation,[29] the first urban culture in South Asia;[30] it flourished during 2500–1900 BCE in Pakistan and western India along the river valleys of Indus and Sarasvati.[31] Centred on cities such as Mohenjo-daro, Harappa, Rakhigarhi, Dholavira, and Kalibangan, and relying on varied forms of subsistence, the civilisation engaged robustly in crafts production and wide-ranging trade."</span> <span class="nt">--data</span> <span class="s2">"confidence=0"</span> <span class="nt">--data</span> <span class="s2">"support=0"</span>
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="nv">$ </span>curl http://sys2:2222/rest/annotate <span class="nt">-H</span> <span class="s2">"Accept: text/xml"</span> <span class="nt">--data-urlencode</span> <span class="s2">"text=The earliest authenticated human remains in South Asia date to about 30,000 years ago.[26] Nearly contemporaneous Mesolithic rock art sites have been found in many parts of the Indian subcontinent, including at the Bhimbetka rock shelters in Madhya Pradesh.[27] Around 7000 BCE, the first known Neolithic settlements appeared on the subcontinent in Mehrgarh and other sites in western Pakistan.[28] These gradually developed into the Indus Valley Civilisation,[29] the first urban culture in South Asia;[30] it flourished during 2500–1900 BCE in Pakistan and western India along the river valleys of Indus and Sarasvati.[31] Centred on cities such as Mohenjo-daro, Harappa, Rakhigarhi, Dholavira, and Kalibangan, and relying on varied forms of subsistence, the civilisation engaged robustly in crafts production and wide-ranging trade."</span> <span class="nt">--data</span> <span class="s2">"confidence=0"</span> <span class="nt">--data</span> <span class="s2">"support=0"</span>
<?xml <span class="nv">version</span><span class="o">=</span><span class="s2">"1.0"</span> <span class="nv">encoding</span><span class="o">=</span><span class="s2">"utf-8"</span>?>
<Annotation <span class="nv">text</span><span class="o">=</span><span class="s2">"The earliest authenticated human remains in South Asia date to about 30,000 years ago.[26] Nearly contemporaneous Mesolithic rock art sites have been found in many parts of the Indian subcontinent, including at the Bhimbetka rock shelters in Madhya Pradesh.[27] Around 7000 BCE, the first known Neolithic settlements appeared on the subcontinent in Mehrgarh and other sites in western Pakistan.[28] These gradually developed into the Indus Valley Civilisation,[29] the first urban culture in South Asia;[30] it flourished during 2500–1900 BCE in Pakistan and western India along the river valleys of Indus and Sarasvati.[31] Centred on cities such as Mohenjo-daro, Harappa, Rakhigarhi, Dholavira, and Kalibangan, and relying on varied forms of subsistence, the civilisation engaged robustly in crafts production and wide-ranging trade."</span> <span class="nv">confidence</span><span class="o">=</span><span class="s2">"0.0"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"0"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">""</span> <span class="nv">sparql</span><span class="o">=</span><span class="s2">""</span> <span class="nv">policy</span><span class="o">=</span><span class="s2">"whitelist"</span><span class="o">></span>
<Resources>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/South_Asia"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"2850"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/book/book_subject,Freebase:/book,Freebase:/location/location,Freebase:/location,Freebase:/organization/organization_scope,Freebase:/organization,Freebase:/location/region,Freebase:/people/ethnicity,Freebase:/people"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"South Asia"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"44"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.13776831328868866"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Mesolithic"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"681"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">""</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Mesolithic"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"114"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.13642436265945435"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Indian_subcontinent"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"1497"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:Continent,DBpedia:PopulatedPlace,DBpedia:Place,Schema:Place,Schema:Continent,Freebase:/location/region,Freebase:/location,Freebase:/location/location"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Indian subcontinent"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"177"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.141897514462471"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Madhya_Pradesh"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"2950"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:Settlement,DBpedia:PopulatedPlace,DBpedia:Place,Schema:Place,Freebase:/location/in_state,Freebase:/location,Freebase:/location/statistical_region,Freebase:/location/dated_location,Freebase:/location/location,Freebase:/book/author,Freebase:/book,Freebase:/location/administrative_division"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Madhya Pradesh"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"242"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.13563162088394165"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Common_Era"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"1247"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">""</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"BCE"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"274"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.1675749570131302"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.20302364934710979"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Neolithic"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"2903"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">""</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Neolithic"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"295"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.11610618233680725"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Pakistan"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"23561"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:Country,DBpedia:PopulatedPlace,DBpedia:Place,Schema:Place,Schema:Country,Freebase:/location/country,Freebase:/location,Freebase:/organization/organization_member,Freebase:/organization,Freebase:/biology/breed_origin,Freebase:/biology,Freebase:/location/statistical_region,Freebase:/military/military_combatant,Freebase:/military,Freebase:/people/ethnicity,Freebase:/people,Freebase:/location/dated_location,Freebase:/law/court_jurisdiction_area,Freebase:/law,Freebase:/sports/sport_country,Freebase:/sports,Freebase:/government/governmental_jurisdiction,Freebase:/government,Freebase:/olympics/olympic_participating_country,Freebase:/olympics,Freebase:/location/location,Freebase:/meteorology/cyclone_affected_area,Freebase:/meteorology,Freebase:/book/book_subject,Freebase:/book,Freebase:/sports/sports_team_location"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Pakistan"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"385"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.0921529158949852"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.6893670279187828"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Indus_Valley_Civilization"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"424"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/time/event,Freebase:/time,Freebase:/book/book_subject,Freebase:/book"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Indus Valley Civilisation"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"434"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.20086094737052917"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/South_Asia"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"2850"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/book/book_subject,Freebase:/book,Freebase:/location/location,Freebase:/location,Freebase:/organization/organization_scope,Freebase:/organization,Freebase:/location/region,Freebase:/people/ethnicity,Freebase:/people"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"South Asia"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"492"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.13776831328868866"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Common_Era"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"1247"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">""</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"BCE"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"539"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.1675749570131302"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.20302364934710979"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Pakistan"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"23561"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:Country,DBpedia:PopulatedPlace,DBpedia:Place,Schema:Place,Schema:Country,Freebase:/location/country,Freebase:/location,Freebase:/organization/organization_member,Freebase:/organization,Freebase:/biology/breed_origin,Freebase:/biology,Freebase:/location/statistical_region,Freebase:/military/military_combatant,Freebase:/military,Freebase:/people/ethnicity,Freebase:/people,Freebase:/location/dated_location,Freebase:/law/court_jurisdiction_area,Freebase:/law,Freebase:/sports/sport_country,Freebase:/sports,Freebase:/government/governmental_jurisdiction,Freebase:/government,Freebase:/olympics/olympic_participating_country,Freebase:/olympics,Freebase:/location/location,Freebase:/meteorology/cyclone_affected_area,Freebase:/meteorology,Freebase:/book/book_subject,Freebase:/book,Freebase:/sports/sports_team_location"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Pakistan"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"546"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.0921529158949852"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.6893670279187828"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/South_Asia"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"2850"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/book/book_subject,Freebase:/book,Freebase:/location/location,Freebase:/location,Freebase:/organization/organization_scope,Freebase:/organization,Freebase:/location/region,Freebase:/people/ethnicity,Freebase:/people"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"India"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"567"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.1619909405708313"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.8368753166673841"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Indus_Valley_Civilization"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"424"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/time/event,Freebase:/time,Freebase:/book/book_subject,Freebase:/book"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Indus"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"600"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.20086094737052917"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Sarasvati_River"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"60"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"Freebase:/geography/river,Freebase:/geography,Freebase:/location/location,Freebase:/location,Freebase:/geography/geographical_feature,Freebase:/geography/body_of_water,Freebase:/religion/deity,Freebase:/religion"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Sarasvati"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"610"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.08038105070590973"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"0.9255801653640217"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Mohenjo-daro"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"136"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:WorldHeritageSite,DBpedia:Place,Schema:Place,Freebase:/protected_sites/listed_site,Freebase:/protected_sites,Freebase:/location/location,Freebase:/location"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Mohenjo-daro"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"651"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.19224941730499268"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
<Resource <span class="nv">URI</span><span class="o">=</span><span class="s2">"http://dbpedia.org/resource/Harappa"</span> <span class="nv">support</span><span class="o">=</span><span class="s2">"163"</span> <span class="nv">types</span><span class="o">=</span><span class="s2">"DBpedia:Settlement,DBpedia:PopulatedPlace,DBpedia:Place,Schema:Place,Freebase:/location/location,Freebase:/location,Freebase:/location/statistical_region,Freebase:/location/citytown,Freebase:/location/dated_location"</span> <span class="nv">surfaceForm</span><span class="o">=</span><span class="s2">"Harappa"</span> <span class="nv">offset</span><span class="o">=</span><span class="s2">"665"</span> <span class="nv">similarityScore</span><span class="o">=</span><span class="s2">"0.17811940610408783"</span> <span class="nv">percentageOfSecondRank</span><span class="o">=</span><span class="s2">"-1.0"</span>/>
</Resources>
</Annotation>
sys2@sys2:~/dbpedia-spotlight-quickstart-0.6.5/data<span class="err">$</span></code></pre></figure>
<p>So there you go.</p>
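If you would rather consume the annotations programmatically than eyeball the raw XML, the response can be parsed with nothing but the standard library. A small sketch (the function name and the trimmed sample are mine, not part of Spotlight) that pulls out the surface forms and their DBpedia URIs:

```python
import xml.etree.ElementTree as ET

def parse_annotation(xml_text):
    """Return (surfaceForm, URI) pairs from a /rest/annotate XML response."""
    root = ET.fromstring(xml_text)
    return [(r.get("surfaceForm"), r.get("URI"))
            for r in root.iter("Resource")]

# A trimmed sample of the kind of response shown above.
sample = """<?xml version="1.0" encoding="utf-8"?>
<Annotation text="..." confidence="0.0" support="0">
  <Resources>
    <Resource URI="http://dbpedia.org/resource/South_Asia" surfaceForm="South Asia" offset="44"/>
    <Resource URI="http://dbpedia.org/resource/Mesolithic" surfaceForm="Mesolithic" offset="114"/>
  </Resources>
</Annotation>"""

for surface, uri in parse_annotation(sample):
    print(surface, "->", uri)
```

The same function works on the full response returned by your local server once it is up.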
<h2 id="references">References</h2>
<ul>
<li>
<p><a href="https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR">https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR</a></p>
</li>
<li>
<p><a href="https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation">https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Installation</a></p>
</li>
<li>
<p><a href="https://dbpedia-spotlight.github.io/demo/">https://dbpedia-spotlight.github.io/demo/</a></p>
</li>
<li>
<p><a href="https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki">https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki</a></p>
</li>
</ul>
<p>Till then. Goodbye!</p>
Running CGI Scripts with CGIHTTPServer
2015-10-20T00:00:00+00:00
https://www.tasdikrahman.com/2015/10/20/Running-CGI-Scripts-with-CGIHTTPServer
<h2 id="intro-">Intro</h2>
<p>So why should we be interested in CGI, which stands for Common Gateway Interface? Well, try to recall the websites you have visited in the last hour. Some of them were static and some were dynamic; the latter means the contents of the website kept changing in real time.</p>
<p>CGI scripting is helpful when you want to generate content from data residing in a database. This is not only convenient but also saves a lot of time.</p>
<p>In my previous article, I showed how to run CGI scripts on an <code class="language-plaintext highlighter-rouge">apache2</code> webserver. Here is the link, if you want to take a look:</p>
<blockquote>
<p><strong><a href="http://www.tasdikrahman.com/2015/09/30/Running-CGI-Sripts-on-Apache2-Ubuntu/">Running-CGI-Sripts-on-Apache2-Ubuntu</a></strong></p>
</blockquote>
<h2 id="cgihttpserver">CGIHTTPServer</h2>
<p>I hope that you have Python installed on your system. Just to be sure:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>python <span class="nt">--version</span>
Python 2.7.8
tasdik@Acer:~<span class="err">$</span></code></pre></figure>
<p>For the sake of understanding, we will make use of the super simple web server shipped by default with <code class="language-plaintext highlighter-rouge">python</code>, instead of a full-blown web server.</p>
<p>CGI scripts are executable files inside the <code class="language-plaintext highlighter-rouge">cgi-bin</code> or <code class="language-plaintext highlighter-rouge">htdocs</code> directory, which the web server executes; the output of the program is captured from standard output and sent back to the browser by the server.</p>
<p>All our executable scripts will reside in the <code class="language-plaintext highlighter-rouge">cgi-bin</code> directory.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd </span>cgi_demo/
tasdik@Acer:~/cgi_demo<span class="nv">$ </span>tree
<span class="nb">.</span>
├── cgi-bin
│ └── retrieval.py
└── forms.html
1 directory, 2 files
tasdik@Acer:~/cgi_demo<span class="nv">$ </span><span class="nb">chmod</span> +x cgi-bin/retrieval.py</code></pre></figure>
<p><strong>NOTE:</strong> don’t forget to make your CGI script executable.</p>
<p><code class="language-plaintext highlighter-rouge">/forms.html</code></p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="cp"><!DOCTYPE html></span>
<span class="nt"><html></span>
<span class="nt"><head></span>
<span class="nt"><title></span>A simple form demonstration<span class="nt"></title></span>
<span class="nt"></head></span>
<span class="nt"><body></span>
<span class="nt"><div</span> <span class="na">style=</span><span class="s">"text-align:center;"</span><span class="nt">></span>
<span class="nt"><h1></span>User login<span class="nt"></h1></span>
<span class="nt"><form</span> <span class="na">action=</span><span class="s">"/cgi-bin/retrieval.py"</span> <span class="na">method=</span><span class="s">"get"</span><span class="nt">></span>
username : <span class="nt"><input</span> <span class="na">type=</span><span class="s">"text"</span> <span class="na">name=</span><span class="s">"username"</span> <span class="na">style=</span><span class="s">"text-align:center;"</span><span class="nt">></span>
<span class="nt"><br><br></span>
password : <span class="nt"><input</span> <span class="na">type=</span><span class="s">"password"</span> <span class="na">name=</span><span class="s">"password"</span> <span class="na">style=</span><span class="s">"text-align:center;"</span><span class="nt">></span>
<span class="nt"><br><br><br></span>
<span class="nt"><input</span> <span class="na">type=</span><span class="s">"submit"</span> <span class="na">value=</span><span class="s">"Submit"</span><span class="nt">></span>
<span class="nt"></form></span>
<span class="nt"></div></span>
<span class="nt"></body></span>
<span class="nt"></html></span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">/cgi-bin/retrieval.py</code></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">cgi</span><span class="p">,</span> <span class="nn">cgitb</span>
<span class="n">cgitb</span><span class="p">.</span><span class="n">enable</span><span class="p">()</span>  <span class="c1">## show tracebacks from the cgi script in the browser</span>
<span class="n">form</span> <span class="o">=</span> <span class="n">cgi</span><span class="p">.</span><span class="n">FieldStorage</span><span class="p">()</span>
<span class="c1">## getting the data from the form fields, defaulting to an empty string</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">form</span><span class="p">.</span><span class="n">getvalue</span><span class="p">(</span><span class="s">'username'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">form</span><span class="p">.</span><span class="n">getvalue</span><span class="p">(</span><span class="s">'password'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Content-type:text/html</span><span class="se">\r\n\r\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<html>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<head><title>User entered</title></head>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<body>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<h1>User has entered</h1>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<b>Username : </b>"</span> <span class="o">+</span> <span class="n">username</span> <span class="o">+</span> <span class="s">"<br>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"<br><b>Password : </b>"</span> <span class="o">+</span> <span class="n">password</span> <span class="o">+</span> <span class="s">"<br>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</body>"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</html>"</span><span class="p">)</span></code></pre></figure>
<h2 id="running-the-webserver">Running the webserver</h2>
<p>We will be running our webserver in the <code class="language-plaintext highlighter-rouge">cgi_demo</code> directory.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/cgi_demo<span class="nv">$ </span>python <span class="nt">-m</span> CGIHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...</code></pre></figure>
<p>At this point, our web server is up and running. To check on it:</p>
<p>Go to the link <strong><code class="language-plaintext highlighter-rouge">http://localhost:8000/forms.html</code></strong> in your browser.</p>
<p>You will be presented with a form to enter <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">password</code>. At this point, if you have a look at the terminal, you will see:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/cgi_demo<span class="nv">$ </span>python <span class="nt">-m</span> CGIHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
127.0.0.1 - - <span class="o">[</span>21/Oct/2015 14:26:32] <span class="s2">"GET /forms.html HTTP/1.1"</span> 200 -</code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">200</code> is the response code for the request made by the browser for the file <code class="language-plaintext highlighter-rouge">forms.html</code>, which was present, so it was served back by the server to the browser.</p>
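<p>You can see the same status code without a browser. Here is a small self-contained sketch (not from the original post, and using the Python 3 module names) that starts a throwaway server in a background thread and checks the response code of a request:</p>

```python
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.request import urlopen

# serve the current directory on an ephemeral port, in the background
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# "/" resolves to a directory listing, so the server answers 200
resp = urlopen("http://127.0.0.1:%d/" % server.server_address[1])
print(resp.status)  # 200
server.shutdown()
```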
<p>After filling up the form and submitting it, the form calls the <code class="language-plaintext highlighter-rouge">retrieval.py</code> program and passes the data entered by the user to it using a <code class="language-plaintext highlighter-rouge">GET</code> request.</p>
<p>You will be redirected to a page with a URL looking something like</p>
<p><strong><code class="language-plaintext highlighter-rouge">http://localhost:8000/cgi-bin/retrieval.py?username=tasdik&password=admin123</code></strong></p>
<p>You will notice that the form data is appended to the URL of the program itself. This is the standard way the <code class="language-plaintext highlighter-rouge">GET</code> method passes data to a program.</p>
<p>The data part starts after the <code class="language-plaintext highlighter-rouge">?</code>.</p>
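<p>The query string after the <code class="language-plaintext highlighter-rouge">?</code> can be pulled apart with the standard library, which is roughly what <code class="language-plaintext highlighter-rouge">cgi.FieldStorage</code> does for you under the hood. A quick sketch using the Python 3 <code class="language-plaintext highlighter-rouge">urllib.parse</code> names:</p>

```python
from urllib.parse import urlparse, parse_qs

url = "http://localhost:8000/cgi-bin/retrieval.py?username=tasdik&password=admin123"

query = urlparse(url).query   # 'username=tasdik&password=admin123'
data = parse_qs(query)        # each value is a list, since a key can repeat

print(data["username"][0])    # tasdik
print(data["password"][0])    # admin123
```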
<p>After you are redirected, you will notice a change in the terminal window from where you had started the web server:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/cgi_demo<span class="nv">$ </span>python <span class="nt">-m</span> CGIHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
127.0.0.1 - - <span class="o">[</span>21/Oct/2015 17:17:32] <span class="s2">"GET /forms.html HTTP/1.1"</span> 200 -
127.0.0.1 - - <span class="o">[</span>21/Oct/2015 17:17:37] <span class="s2">"GET /cgi-bin/retrieval.py?username=tasdik&password=admin123 HTTP/1.1"</span> 200 -</code></pre></figure>
<p>The program <code class="language-plaintext highlighter-rouge">retrieval.py</code> was served successfully to the browser.</p>
<p>That’s all for this article. In the next one, I will write about how to talk to an <code class="language-plaintext highlighter-rouge">sqlite3</code> database using CGI scripts.</p>
<p>Till then. Goodbye!</p>
Running CGI Scripts on Apache2
2015-09-30T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/30/Running-CGI-Scripts-on-Apache2-Ubuntu
<h2 id="intro">Intro</h2>
<p>Have you ever wanted to create a webpage or process user input from a web-based form using Python? These tasks can be accomplished through the use of Python CGI (Common Gateway Interface) scripts with an Apache web server. CGI scripts are called by a web server when a user requests a particular URL or interacts with the webpage (such as clicking a “Submit” button). After the CGI script is called and finishes executing, the output is used by the web server to create a webpage displayed to the user.</p>
<p>If you just want to test the waters in the CGI world, you might want to test your scripts on the simple server shipped with Python by default.</p>
<p>I have written an article on that. Here’s the link.</p>
<p><strong><a href="http://www.tasdikrahman.com/2015/10/20/Running-CGI-Sripts-with-CGIHTTPServer/">Running-CGI-Sripts-with-CGIHTTPServer</a></strong></p>
<h2 id="configuring-the-apache2-web-server-to-run-cgi-scripts">Configuring the Apache2 Web server to run CGI scripts</h2>
<p>I am assuming that you are using <code class="language-plaintext highlighter-rouge">apache2</code> version <code class="language-plaintext highlighter-rouge">2.4.*</code>, as in <code class="language-plaintext highlighter-rouge">Apache2.4</code>, the configuration was cleaned up considerably, and things in the default site definition have been moved to configuration files in conf-available. Among other things, this also includes the CGI-related configuration lines seen in the default site of older versions. These have been moved to <code class="language-plaintext highlighter-rouge">/etc/apache2/conf-available/serve-cgi-bin.conf</code>, which contains:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
</code></pre></div></div>
<p>Just to check whether you have <code class="language-plaintext highlighter-rouge">apache2</code> installed on your system and it’s up and running, do a</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span>apache2 <span class="nt">-v</span>
Server version: Apache/2.4.10 <span class="o">(</span>Ubuntu<span class="o">)</span>
Server built: Mar 5 2015 18:13:03
tasdik@Acer:~<span class="err">$</span>
</code></pre></div></div>
<p>I am currently on version <code class="language-plaintext highlighter-rouge">2.4</code>. If you don’t get any output from the prompt, it’s likely that you don’t have it installed on your system. Just do a</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>apache2
</code></pre></div></div>
<p>Anyway, you just need to make changes to two configuration files:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/etc/apache2/apache2.conf</code></li>
<li><code class="language-plaintext highlighter-rouge">/etc/apache2/conf-available/serve-cgi-bin.conf</code></li>
</ul>
<p>On the terminal</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span><span class="nb">mkdir</span> /var/www/cgi-bin
tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd</span> /var/www/cgi-bin/
tasdik@Acer:/var/www/cgi-bin<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/apache2/apache2.conf
</code></pre></div></div>
<p>And add the following at the end</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">###################################################################</span>
<span class="c">######### Adding capability to run CGI-scripts #################</span>
ServerName localhost
ScriptAlias /cgi-bin/ /var/www/cgi-bin/
Options +ExecCGI
AddHandler cgi-script .cgi .pl .py
</code></pre></div></div>
<p>This maps the <code class="language-plaintext highlighter-rouge">cgi-bin</code> URL prefix to the specified directory, and the <code class="language-plaintext highlighter-rouge">AddHandler</code> line tells the server to treat all files with <code class="language-plaintext highlighter-rouge">.cgi</code>, <code class="language-plaintext highlighter-rouge">.pl</code> and <code class="language-plaintext highlighter-rouge">.py</code> extensions as CGI scripts.</p>
<p>Now for the second conf file</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/apache2/conf-available/serve-cgi-bin.conf
</code></pre></div></div>
<p>The final file should look something like this</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><IfModule mod_alias.c>
<IfModule mod_cgi.c>
Define ENABLE_USR_LIB_CGI_BIN
</IfModule>
<IfModule mod_cgid.c>
Define ENABLE_USR_LIB_CGI_BIN
</IfModule>
<IfDefine ENABLE_USR_LIB_CGI_BIN>
<span class="c">#ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/</span>
<span class="c">#<Directory "/usr/lib/cgi-bin"></span>
<span class="c"># AllowOverride None</span>
<span class="c"># Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch</span>
<span class="c"># Require all granted</span>
<span class="c">#</Directory></span>
<span class="c">## cgi-bin config</span>
ScriptAlias /cgi-bin/ /var/www/cgi-bin/
<Directory <span class="s2">"/var/www/cgi-bin/"</span><span class="o">></span>
AllowOverride None
Options +ExecCGI
</Directory>
</IfDefine>
</IfModule>
<span class="c"># vim: syntax=apache ts=4 sw=4 sts=4 sr noet</span>
</code></pre></div></div>
<p>Now restart the <code class="language-plaintext highlighter-rouge">apache2</code></p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>service apache2 restart
</code></pre></div></div>
<h2 id="creating-a-simple-cgi-script">Creating a simple CGI script</h2>
<p>We already created the folder that will hold the CGI scripts, so let’s add a script to it.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd</span> /var/www/cgi-bin
tasdik@Acer:/var/www/cgi-bin<span class="nv">$ </span><span class="nb">touch </span>hello.py
<span class="c">## make it executable for you and others (Apache runs as another user)</span>
tasdik@Acer:/var/www/cgi-bin<span class="nv">$ </span><span class="nb">chmod </span>a+x hello.py
<span class="c">## Put the following content just for testing inside `hello.py`</span>
tasdik@Acer:/var/www/cgi-bin<span class="nv">$ </span><span class="nb">sudo </span>nano hello.py
</code></pre></div></div>
<p>Add the following to <code class="language-plaintext highlighter-rouge">hello.py</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python
</span>
<span class="kn">import</span> <span class="nn">cgitb</span>
<span class="n">cgitb</span><span class="p">.</span><span class="n">enable</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Content-type:text/html</span><span class="se">\r\n\r\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'<html>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'<head>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'<title>Hello World - First CGI Program</title>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</head>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'<body>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'<h2>Hello World! This is my first CGI program</h2>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</body>'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</html>'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="run-the-script">Run the Script:</h2>
<p>Open your browser and enter the following link</p>
<p><a href="http://localhost/cgi-bin/hello.py">http://localhost/cgi-bin/hello.py</a></p>
<p>And the script should run just fine.</p>
<h2 id="debugging-">Debugging :</h2>
<p>If the script is not running, you can check the logs stored in</p>
<p><code class="language-plaintext highlighter-rouge">/var/log/apache2/error.log</code></p>
<p>You can also refer to the official reference here: <a href="http://httpd.apache.org/docs/2.0/howto/cgi.html">http://httpd.apache.org/docs/2.0/howto/cgi.html</a></p>
<p>Hope it helped!</p>
<h2 id="references">References:</h2>
<ul>
<li><a href="http://httpd.apache.org/docs/2.0/howto/cgi.html">http://httpd.apache.org/docs/2.0/howto/cgi.html</a></li>
<li><a href="http://askubuntu.com/questions/14763/where-are-the-apache-and-php-log-files">http://askubuntu.com/questions/14763/where-are-the-apache-and-php-log-files</a></li>
<li><a href="http://askubuntu.com/questions/679961/apache2-4-10-on-ubuntu-returning-internal-server-error-on-running-cgi-scripts/680039#680039">http://askubuntu.com/questions/679961/apache2-4-10-on-ubuntu-returning-internal-server-error-on-running-cgi-scripts/680039#680039</a></li>
</ul>
My Ramblings with Oracle-11g
2015-09-27T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/27/My-Ramblings-about-Oracle-11g
<h2 id="intro-">Intro :</h2>
<p>I have recently started using <code class="language-plaintext highlighter-rouge">Oracle 11g</code> and boy, doesn’t it come with enough problems already! It may just be my undying allegiance to <code class="language-plaintext highlighter-rouge">mysql</code> and <code class="language-plaintext highlighter-rouge">sqlite</code>, but whatever the case may be.</p>
<p>Here are some of the things which I found would be useful to a first time user of <code class="language-plaintext highlighter-rouge">Oracle 11g</code>. I will try to update it when I find something useful.</p>
<blockquote>
<p><strong>If you haven’t already installed <code class="language-plaintext highlighter-rouge">Oracle 11g</code>. <a href="http://www.tasdikrahman.com/2015/09/26/Install-and-Configure-Oracle-11g-on-Ubuntu-14.10/">I wrote a small article on how to do so sometime back</a>.</strong></p>
</blockquote>
<h2 id="creating-a-new-user-">Creating a new user :</h2>
<p>Now we can use the default users <code class="language-plaintext highlighter-rouge">SYSTEM</code> or <code class="language-plaintext highlighter-rouge">SYS</code>, but I prefer creating my own here.</p>
<p>To do that</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">tasdik</span><span class="o">@</span><span class="n">Acer</span><span class="p">:</span><span class="o">~</span><span class="err">$</span> <span class="n">sqlplus</span>
<span class="k">SQL</span><span class="o">*</span><span class="n">Plus</span><span class="p">:</span> <span class="n">Release</span> <span class="mi">11</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span> <span class="n">Production</span> <span class="k">on</span> <span class="n">Sat</span> <span class="n">Sep</span> <span class="mi">26</span> <span class="mi">14</span><span class="p">:</span><span class="mi">25</span><span class="p">:</span><span class="mi">09</span> <span class="mi">2015</span>
<span class="n">Copyright</span> <span class="p">(</span><span class="k">c</span><span class="p">)</span> <span class="mi">1982</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="n">Oracle</span><span class="p">.</span> <span class="k">All</span> <span class="n">rights</span> <span class="n">reserved</span><span class="p">.</span>
<span class="n">Enter</span> <span class="k">user</span><span class="o">-</span><span class="n">name</span><span class="p">:</span> <span class="k">SYSTEM</span>
<span class="n">Enter</span> <span class="n">password</span><span class="p">:</span>
<span class="n">Connected</span> <span class="k">to</span><span class="p">:</span>
<span class="n">Oracle</span> <span class="k">Database</span> <span class="mi">11</span><span class="k">g</span> <span class="n">Express</span> <span class="n">Edition</span> <span class="n">Release</span> <span class="mi">11</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="mi">64</span><span class="nb">bit</span> <span class="n">Production</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">create</span> <span class="k">user</span> <span class="n">lab</span> <span class="n">identified</span> <span class="k">by</span> <span class="n">lab</span><span class="p">;</span>
<span class="k">User</span> <span class="n">created</span><span class="p">.</span>
<span class="o">##</span> <span class="k">check</span> <span class="n">the</span> <span class="k">new</span> <span class="k">user</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">select</span> <span class="n">username</span> <span class="k">from</span> <span class="n">dba_users</span> <span class="p">;</span>
<span class="n">USERNAME</span>
<span class="c1">------------------------------</span>
<span class="n">LAB</span>
<span class="n">SYS</span>
<span class="k">SYSTEM</span>
<span class="n">ANONYMOUS</span>
<span class="n">APEX_PUBLIC_USER</span>
<span class="n">APEX_040000</span>
<span class="n">OUTLN</span>
<span class="n">XS</span><span class="err">$</span><span class="k">NULL</span>
<span class="n">FLOWS_FILES</span>
<span class="n">MDSYS</span>
<span class="n">USERNAME</span>
<span class="c1">------------------------------</span>
<span class="n">CTXSYS</span>
<span class="n">XDB</span>
<span class="n">HR</span>
<span class="mi">14</span> <span class="k">rows</span> <span class="n">selected</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">grant</span> <span class="k">create</span> <span class="k">session</span> <span class="k">to</span> <span class="n">lab</span> <span class="p">;</span>
<span class="k">Grant</span> <span class="n">succeeded</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">grant</span> <span class="k">connect</span> <span class="k">to</span> <span class="n">lab</span> <span class="p">;</span>
<span class="k">Grant</span> <span class="n">succeeded</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">grant</span> <span class="k">all</span> <span class="k">privileges</span> <span class="k">to</span> <span class="n">lab</span> <span class="p">;</span>
<span class="k">Grant</span> <span class="n">succeeded</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="n">exit</span>
<span class="n">Disconnected</span> <span class="k">from</span> <span class="n">Oracle</span> <span class="k">Database</span> <span class="mi">11</span><span class="k">g</span> <span class="n">Express</span> <span class="n">Edition</span> <span class="n">Release</span> <span class="mi">11</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="mi">64</span><span class="nb">bit</span> <span class="n">Production</span></code></pre></figure>
<p>Now connect to the new user <code class="language-plaintext highlighter-rouge">lab</code></p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">tasdik</span><span class="o">@</span><span class="n">Acer</span><span class="p">:</span><span class="o">~</span><span class="err">$</span> <span class="n">sqlplus</span>
<span class="k">SQL</span><span class="o">*</span><span class="n">Plus</span><span class="p">:</span> <span class="n">Release</span> <span class="mi">11</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span> <span class="n">Production</span> <span class="k">on</span> <span class="n">Sat</span> <span class="n">Sep</span> <span class="mi">26</span> <span class="mi">14</span><span class="p">:</span><span class="mi">34</span><span class="p">:</span><span class="mi">10</span> <span class="mi">2015</span>
<span class="n">Copyright</span> <span class="p">(</span><span class="k">c</span><span class="p">)</span> <span class="mi">1982</span><span class="p">,</span> <span class="mi">2011</span><span class="p">,</span> <span class="n">Oracle</span><span class="p">.</span> <span class="k">All</span> <span class="n">rights</span> <span class="n">reserved</span><span class="p">.</span>
<span class="n">Enter</span> <span class="k">user</span><span class="o">-</span><span class="n">name</span><span class="p">:</span> <span class="n">lab</span>
<span class="n">Enter</span> <span class="n">password</span><span class="p">:</span>
<span class="n">Connected</span> <span class="k">to</span><span class="p">:</span>
<span class="n">Oracle</span> <span class="k">Database</span> <span class="mi">11</span><span class="k">g</span> <span class="n">Express</span> <span class="n">Edition</span> <span class="n">Release</span> <span class="mi">11</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">2</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="mi">64</span><span class="nb">bit</span> <span class="n">Production</span>
<span class="o">##</span> <span class="n">clear</span> <span class="n">the</span> <span class="n">screen</span>
<span class="k">SQL</span><span class="o">></span> <span class="n">cl</span> <span class="n">scr</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">select</span> <span class="k">table_name</span> <span class="k">from</span> <span class="n">user_tables</span> <span class="p">;</span>
<span class="k">no</span> <span class="k">rows</span> <span class="n">selected</span>
<span class="o">##</span> <span class="k">as</span> <span class="k">no</span> <span class="n">tables</span> <span class="n">have</span> <span class="n">been</span> <span class="n">created</span> <span class="k">by</span> <span class="n">the</span> <span class="k">user</span> <span class="nv">`lab`</span> <span class="n">till</span> <span class="n">now</span>
<span class="o">##</span> <span class="n">so</span> <span class="n">lets</span> <span class="k">create</span> <span class="k">some</span><span class="p">,</span> <span class="n">shall</span> <span class="n">we</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">department</span> <span class="p">(</span>
<span class="mi">2</span> <span class="n">dept_name</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span>
<span class="mi">3</span> <span class="n">building</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">15</span><span class="p">),</span>
<span class="mi">4</span> <span class="n">budget</span> <span class="nb">numeric</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="k">check</span> <span class="p">(</span><span class="n">budget</span> <span class="o">></span> <span class="mi">0</span><span class="p">),</span>
<span class="mi">5</span> <span class="k">primary</span> <span class="k">key</span> <span class="p">(</span><span class="n">dept_name</span><span class="p">)</span>
<span class="mi">6</span> <span class="p">)</span> <span class="p">;</span>
<span class="k">Table</span> <span class="n">created</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">select</span> <span class="k">table_name</span> <span class="k">from</span> <span class="n">user_tables</span> <span class="p">;</span>
<span class="k">TABLE_NAME</span>
<span class="c1">------------------------------</span>
<span class="n">DEPARTMENT</span>
<span class="o">##</span> <span class="n">you</span> <span class="n">can</span> <span class="k">do</span> <span class="n">whatever</span> <span class="n">you</span> <span class="n">can</span> <span class="n">normally</span> <span class="k">do</span> <span class="k">with</span> <span class="n">this</span> <span class="k">user</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">drop</span> <span class="k">table</span> <span class="n">department</span> <span class="p">;</span>
<span class="k">Table</span> <span class="n">dropped</span><span class="p">.</span>
<span class="k">SQL</span><span class="o">></span> <span class="k">select</span> <span class="k">table_name</span> <span class="k">from</span> <span class="n">user_tables</span> <span class="p">;</span>
<span class="k">no</span> <span class="k">rows</span> <span class="n">selected</span>
<span class="k">SQL</span><span class="o">></span> </code></pre></figure>
<h2 id="to-delete-a-user-">To Delete a User :</h2>
<p>After dropping the user, you need to take each related tablespace offline and drop it. For example, if you had a user named ‘SAMPLE’ and two tablespaces called ‘SAMPLE’ and ‘SAMPLE_INDEX’, you’d need to do the following:</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="k">DROP</span> <span class="k">USER</span> <span class="n">SAMPLE</span> <span class="k">CASCADE</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="n">TABLESPACE</span> <span class="n">SAMPLE</span> <span class="n">OFFLINE</span><span class="p">;</span>
<span class="k">DROP</span> <span class="n">TABLESPACE</span> <span class="n">SAMPLE</span> <span class="k">INCLUDING</span> <span class="n">CONTENTS</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="n">TABLESPACE</span> <span class="n">SAMPLE_INDEX</span> <span class="n">OFFLINE</span><span class="p">;</span>
<span class="k">DROP</span> <span class="n">TABLESPACE</span> <span class="n">SAMPLE_INDEX</span> <span class="k">INCLUDING</span> <span class="n">CONTENTS</span><span class="p">;</span>
</pre></td></tr></tbody></table></code></pre></figure>
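The recipe above follows a fixed pattern: one `DROP USER ... CASCADE`, then an offline/drop pair per tablespace. As a sketch, a small Python helper (hypothetical, not part of the post) could generate that statement sequence for any user:

```python
def drop_user_statements(user, tablespaces):
    """Build the statement sequence from the recipe above: drop the
    user, then take each related tablespace offline and drop it."""
    stmts = [f"DROP USER {user} CASCADE;"]
    for ts in tablespaces:
        stmts.append(f"ALTER TABLESPACE {ts} OFFLINE;")
        stmts.append(f"DROP TABLESPACE {ts} INCLUDING CONTENTS;")
    return stmts

for stmt in drop_user_statements("SAMPLE", ["SAMPLE", "SAMPLE_INDEX"]):
    print(stmt)
```

Running this for ‘SAMPLE’ reproduces the five statements shown above, in order.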
<h2 id="there-is-no-such-thing-as---if-exists--">There is no such thing as <code class="language-plaintext highlighter-rouge">IF EXISTS</code>:</h2>
<p>I mean, what! I find it really irritating that <code class="language-plaintext highlighter-rouge">Oracle</code> does not support this feature, as <strong>every other major RDBMS</strong> implements it.</p>
<p>Let’s try it out.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SQL</span><span class="o">></span> <span class="k">select</span> <span class="k">table_name</span> <span class="k">from</span> <span class="n">user_tables</span> <span class="p">;</span>
<span class="k">no</span> <span class="k">rows</span> <span class="n">selected</span>
<span class="k">SQL</span><span class="o">></span></code></pre></figure>
<p>So we don’t have any <code class="language-plaintext highlighter-rouge">relations</code> right now. Let’s try to drop <code class="language-plaintext highlighter-rouge">tasdik</code>, which does not exist, the way we would in <code class="language-plaintext highlighter-rouge">mysql</code>.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="k">SQL</span><span class="o">></span> <span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">tasdik</span> <span class="p">;</span>
<span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">tasdik</span>
<span class="o">*</span>
<span class="n">ERROR</span> <span class="k">at</span> <span class="n">line</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">ORA</span><span class="o">-</span><span class="mi">00933</span><span class="p">:</span> <span class="k">SQL</span> <span class="n">command</span> <span class="k">not</span> <span class="n">properly</span> <span class="n">ended</span>
<span class="k">SQL</span><span class="o">></span></code></pre></figure>
<p>If you try to <code class="language-plaintext highlighter-rouge">DROP</code> a relation which does not exist, the statement simply fails with an error like the one above.</p>
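The usual workaround in Oracle is to wrap the drop in an anonymous PL/SQL block that swallows ORA-00942 (“table or view does not exist”). As a sketch, a hypothetical Python helper could generate that block; the helper name, and whatever driver you would execute the result with, are my assumptions, not something from the post:

```python
def drop_table_if_exists_sql(table_name):
    """Build an anonymous PL/SQL block that drops table_name,
    ignoring ORA-00942 if the table is not there."""
    return (
        "BEGIN\n"
        f"   EXECUTE IMMEDIATE 'DROP TABLE {table_name}';\n"
        "EXCEPTION\n"
        "   WHEN OTHERS THEN\n"
        "      IF SQLCODE != -942 THEN\n"
        "         RAISE;\n"
        "      END IF;\n"
        "END;"
    )

print(drop_table_if_exists_sql("tasdik"))
```

You would feed the returned string to your database cursor; running the generated block twice is harmless, which is exactly what <code>IF EXISTS</code> gives you elsewhere.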
<p>Well, <code class="language-plaintext highlighter-rouge">sqlite</code> has its own limitation: you can <code class="language-plaintext highlighter-rouge">add</code> an <code class="language-plaintext highlighter-rouge">attribute</code> to a <code class="language-plaintext highlighter-rouge">relation</code> with <code class="language-plaintext highlighter-rouge">alter</code>, but you cannot rename or <code class="language-plaintext highlighter-rouge">delete</code> one. So nobody is perfect here.</p>
<h2 id="no-support-for-cursor-keys-">No support for cursor keys:</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>sqlplus
SQL<span class="k">*</span>Plus: Release 11.2.0.2.0 Production on Sun Sep 27 19:12:41 2015
Copyright <span class="o">(</span>c<span class="o">)</span> 1982, 2011, Oracle. All rights reserved.
Enter user-name: lab
Enter password:
Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
SQL> ^[[A^[[A^[[A</code></pre></figure>
<p>As you can see, whenever you press the cursor keys, there is garbage on the screen. I was not sure why this was happening; it turns out sqlplus does not support the arrow keys on <code class="language-plaintext highlighter-rouge">*nix</code> based systems (good news for those on <code class="language-plaintext highlighter-rouge">windows</code>).</p>
<p>There is a workaround, though: a third-party utility called <code class="language-plaintext highlighter-rouge">rlwrap</code>.</p>
<blockquote>
<p>rlwrap is a readline wrapper, a small utility that uses the GNU readline library to allow the editing of keyboard input for any other command. It maintains a separate input history for each command, and can TAB-expand words using all previously seen words and/or a user-specified file. So you will be able to use arrows and also get a command history as a bonus.</p>
</blockquote>
<p>After you have installed the utility run sqlplus the following way:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>rlwrap sqlplus
SQL<span class="k">*</span>Plus: Release 11.2.0.2.0 Production on Sun Sep 27 19:04:38 2015
Copyright <span class="o">(</span>c<span class="o">)</span> 1982, 2011, Oracle. All rights reserved.
Enter user-name: lab
Enter password:
Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
SQL></code></pre></figure>
<p>And the arrow keys will work just fine.</p>
<h2 id="references">References:</h2>
<ul>
<li><a href="http://stackoverflow.com/questions/22386976/create-a-user-with-all-privileges-in-oracle">http://stackoverflow.com/questions/22386976/create-a-user-with-all-privileges-in-oracle</a></li>
<li><a href="http://stackoverflow.com/questions/205736/get-list-of-all-tables-in-oracle">http://stackoverflow.com/questions/205736/get-list-of-all-tables-in-oracle</a></li>
<li><a href="http://stackoverflow.com/questions/969245/how-to-delete-a-user-in-oracle-10-including-all-its-tablespace-and-datafiles">http://stackoverflow.com/questions/969245/how-to-delete-a-user-in-oracle-10-including-all-its-tablespace-and-datafiles</a></li>
<li><a href="http://stackoverflow.com/questions/9890636/arrow-keys-are-not-functional-in-sqlplus">http://stackoverflow.com/questions/9890636/arrow-keys-are-not-functional-in-sqlplus</a></li>
</ul>
Install and Configure Oracle-11g on Ubuntu-14.10
2015-09-26T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/26/Install-and-Configure-Oracle-11g-on-Ubuntu-14.10
<h2 id="intro-">Intro :</h2>
<p>Well, as luck would have it, we were having our database labs in <code class="language-plaintext highlighter-rouge">ORACLE 11g</code> as part of the coursework. And I must admit, I still have my love for <code class="language-plaintext highlighter-rouge">mysql</code>.</p>
<p>I installed it on my system (after a lot of trials!) this way.</p>
<blockquote>
<p><a href="http://www.tasdikrahman.com/2015/09/27/My-Ramblings-about-Oracle-11g/">Next part of this article</a></p>
</blockquote>
<h2 id="pre-requisites">Pre-requisites:</h2>
<p>Java should be installed and its environment variable set. Check it by doing:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>java <span class="nt">-version</span>
java version <span class="s2">"1.7.0_79"</span>
OpenJDK Runtime Environment <span class="o">(</span>IcedTea 2.5.5<span class="o">)</span> <span class="o">(</span>7u79-2.5.5-0ubuntu0.14.10.2<span class="o">)</span>
OpenJDK 64-Bit Server VM <span class="o">(</span>build 24.79-b02, mixed mode<span class="o">)</span>
tasdik@Acer:~<span class="nv">$ </span><span class="nb">echo</span> <span class="nv">$JAVA_HOME</span>
/usr/lib/jvm/java-7-openjdk-amd64
tasdik@Acer:~<span class="nv">$ </span></code></pre></figure>
<p>If not, you can set it by adding the path of the installed <code class="language-plaintext highlighter-rouge">jvm</code> to <code class="language-plaintext highlighter-rouge">/etc/environment</code>.</p>
<p>Mine looks like this</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>tasdik@Acer:~<span class="nv">$ </span><span class="nb">cat</span> /etc/environment
<span class="nv">PATH</span><span class="o">=</span><span class="s2">"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"</span>
<span class="nv">http_proxy</span><span class="o">=</span><span class="s2">"http://172.16.0.16:8080/"</span>
<span class="nv">https_proxy</span><span class="o">=</span><span class="s2">"https://172.16.0.16:8080/"</span>
<span class="nv">ftp_proxy</span><span class="o">=</span><span class="s2">"ftp://172.16.0.16:8080/"</span>
<span class="nv">socks_proxy</span><span class="o">=</span><span class="s2">"socks://172.16.0.16:8080/"</span>
<span class="nv">JAVA_HOME</span><span class="o">=</span><span class="s2">"/usr/lib/jvm/java-7-openjdk-amd64"</span>
tasdik@Acer:~<span class="nv">$ </span>
</pre></td></tr></tbody></table></code></pre></figure>
<h2 id="installing-oracle-11g-r2-express-edition">Installing Oracle 11g R2 Express Edition</h2>
<p>A couple of extra things are needed</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>alien libaio1 unixodbc</code></pre></figure>
<p>Download Oracle 11g; you need to make an account for that (yeah, I know!).</p>
<p>Link : <a href="http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html">http://www.oracle.com/technetwork/database/database-technologies/express-edition/downloads/index.html</a></p>
<p>Place and unzip the <code class="language-plaintext highlighter-rouge">zip </code> file in the directory of your choice. I put mine in ‘Documents’.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd </span>Documents/
tasdik@Acer:~/Documents<span class="nv">$ </span>unzip oracle-xe-11.2.0-1.0.x86_64.rpm.zip
<span class="c">## A new directory was added</span>
tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd </span>Documents/Disk1/</code></pre></figure>
<p>Now we need to convert the rpm (a Red Hat package) to a .deb package.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>alien <span class="nt">--scripts</span> <span class="nt">-d</span> oracle-xe-11.2.0-1.0.x86_64.rpm</code></pre></figure>
<p>This will take a while, so open another <code class="language-plaintext highlighter-rouge">terminal</code> window</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>nano /sbin/chkconfig</code></pre></figure>
<p>And then add the following to it</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="code"><pre><span class="c">#!/bin/bash</span>
<span class="c"># Oracle 11gR2 XE installer chkconfig hack for Ubuntu</span>
<span class="nv">file</span><span class="o">=</span>/etc/init.d/oracle-xe
<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="sb">`</span><span class="nb">tail</span> <span class="nt">-n1</span> <span class="nv">$file</span> | <span class="nb">grep </span>INIT<span class="sb">`</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'### BEGIN INIT INFO'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Provides: OracleXE'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Required-Start: $remote_fs $syslog'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Required-Stop: $remote_fs $syslog'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Default-Start: 2 3 4 5'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Default-Stop: 0 1 6'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'# Short-Description: Oracle 11g Express Edition'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="nb">echo</span> <span class="s1">'### END INIT INFO'</span> <span class="o">>></span> <span class="nv">$file</span>
<span class="k">fi
</span>update-rc.d oracle-xe defaults 80 01
<span class="c">#EOF</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Save it and give the required execution privileges</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo chmod </span>755 /sbin/chkconfig</code></pre></figure>
<p>After this, we have to create the file <code class="language-plaintext highlighter-rouge">/etc/sysctl.d/60-oracle.conf</code> to set the additional kernel parameters. Open the file by executing the following statement.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/sysctl.d/60-oracle.conf</code></pre></figure>
<p>Copy and paste the following into the file. <code class="language-plaintext highlighter-rouge">kernel.shmmax</code> is the maximum size of a shared memory segment in bytes: <code class="language-plaintext highlighter-rouge">536870912 / 1024 / 1024 = 512 MB</code>.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="c"># Oracle 11g XE kernel parameters</span>
fs.file-max<span class="o">=</span>6815744
net.ipv4.ip_local_port_range<span class="o">=</span>9000 65000
kernel.sem<span class="o">=</span>250 32000 100 128
kernel.shmmax<span class="o">=</span>536870912
</pre></td></tr></tbody></table></code></pre></figure>
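The shmmax arithmetic above can be sanity-checked in a couple of lines (just a check of the conversion, nothing Oracle-specific):

```python
# kernel.shmmax is given in bytes; convert to MB to confirm 512 MB
shmmax_bytes = 536870912
mb = shmmax_bytes // (1024 * 1024)
print(mb)  # 512
```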
<p>Load the kernel parameters:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>service procps start</code></pre></figure>
<p>The changes may be verified again by executing:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>sysctl <span class="nt">-q</span> fs.file-max
fs.file-max <span class="o">=</span> 6815744
tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span></code></pre></figure>
<p>After this, execute the following statements to make some more required changes:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo ln</span> <span class="nt">-s</span> /usr/bin/awk /bin/awk
tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">mkdir</span> /var/lock/subsys
tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">touch</span> /var/lock/subsys/listener</code></pre></figure>
<h2 id="install-the-deb-file">Install the .deb file</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/Documents/Disk1<span class="nv">$ </span><span class="nb">sudo </span>dpkg <span class="nt">--install</span> oracle-xe_11.2.0-2_amd64.deb</code></pre></figure>
<p>After that, execute the following to avoid getting an ORA-00845: MEMORY_TARGET error. Note: replace “size=3804m” with the size of your (virtual) machine’s RAM in MBs.</p>
<blockquote>
<p>Note : Close Chrome before executing the following three lines as it crashes when doing so</p>
</blockquote>
<p>To check your system RAM (allocated), do a</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>free <span class="nt">-tom</span>
total used free shared buffers cached
Mem: 3804 3465 338 910 81 1642
Swap: 3943 30 3913
Total: 7748 3496 4252
tasdik@Acer:~<span class="nv">$ </span></code></pre></figure>
<p>My RAM is <code class="language-plaintext highlighter-rouge">3804</code> MB here. Then do the following:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo rm</span> <span class="nt">-rf</span> /dev/shm
tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo mkdir</span> /dev/shm
tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>mount <span class="nt">-t</span> tmpfs shmfs <span class="nt">-o</span> <span class="nv">size</span><span class="o">=</span>3804m /dev/shm</code></pre></figure>
<p>Create the file /etc/rc2.d/S01shm_load.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/rc2.d/S01shm_load</code></pre></figure>
<p>And then add the following</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre><span class="c">#!/bin/sh</span>
<span class="k">case</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="k">in
</span>start<span class="p">)</span> <span class="nb">mkdir</span> /var/lock/subsys 2>/dev/null
<span class="nb">touch</span> /var/lock/subsys/listener
<span class="nb">rm</span> /dev/shm 2>/dev/null
<span class="nb">mkdir</span> /dev/shm 2>/dev/null
mount <span class="nt">-t</span> tmpfs shmfs <span class="nt">-o</span> <span class="nv">size</span><span class="o">=</span>3804m /dev/shm <span class="p">;;</span>
<span class="k">*</span><span class="p">)</span> <span class="nb">echo </span>error
<span class="nb">exit </span>1 <span class="p">;;</span>
<span class="k">esac</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Give permissions to it</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo chmod </span>755 /etc/rc2.d/S01shm_load</code></pre></figure>
<h2 id="configuring-oracle-11g-r2-express-edition">Configuring Oracle 11g R2 Express Edition</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo</span> /etc/init.d/oracle-xe configure</code></pre></figure>
<p>Go on choosing the defaults. I chose not to start <code class="language-plaintext highlighter-rouge">Oracle on startup</code>, so do accordingly.</p>
<p>Setting the environment vars</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>gedit /etc/bash.bashrc</code></pre></figure>
<p>And add the following</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="c">#### for oracle 11g</span>
<span class="nb">export </span><span class="nv">ORACLE_HOME</span><span class="o">=</span>/u01/app/oracle/product/11.2.0/xe
<span class="nb">export </span><span class="nv">ORACLE_SID</span><span class="o">=</span>XE
<span class="nb">export </span><span class="nv">NLS_LANG</span><span class="o">=</span><span class="sb">`</span><span class="nv">$ORACLE_HOME</span>/bin/nls_lang.sh<span class="sb">`</span>
<span class="nb">export </span><span class="nv">ORACLE_BASE</span><span class="o">=</span>/u01/app/oracle
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$ORACLE_HOME</span>/lib:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$ORACLE_HOME</span>/bin:<span class="nv">$PATH</span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Save it and <code class="language-plaintext highlighter-rouge">source</code> it</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">source</span> /etc/bash.bashrc
<span class="c">## check the environment variables</span>
tasdik@Acer:~<span class="nv">$ </span><span class="nb">echo</span> <span class="nv">$ORACLE_HOME</span>
/u01/app/oracle/product/11.2.0/xe
tasdik@Acer:~<span class="nv">$ </span></code></pre></figure>
<p>I recommend rebooting the system at this point.</p>
<h2 id="after-a-system-reboot">After a System Reboot</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>service oracle-xe start</code></pre></figure>
<p>A file named oraclexe-gettingstarted.desktop is placed on your desktop. To make this file executable, navigate to your desktop.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd</span> ~/Desktop
tasdik@Acer:~/Desktop<span class="nv">$ </span><span class="nb">sudo chmod </span>a+x oraclexe-gettingstarted.desktop</code></pre></figure>
<h2 id="running-sqlplus">Running <code class="language-plaintext highlighter-rouge">sqlplus</code></h2>
<p>You have to unset <code class="language-plaintext highlighter-rouge">http_proxy</code> and <code class="language-plaintext highlighter-rouge">no_proxy</code>.</p>
<p>To do that</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">unset </span>http_proxy
tasdik@Acer:~<span class="nv">$ </span><span class="nb">unset </span>no_proxy
tasdik@Acer:~<span class="nv">$ </span>sqlplus
SQL<span class="k">*</span>Plus: Release 11.2.0.2.0 Production on Sat Sep 26 11:34:06 2015
Copyright <span class="o">(</span>c<span class="o">)</span> 1982, 2011, Oracle. All rights reserved.
Enter user-name: SYSTEM
Enter password:
Connected to:
Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
SQL> </code></pre></figure>
<p>There you go!</p>
<h2 id="ending-notes">Ending notes:</h2>
<p>Remember to do a</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>service oracle-xe start</code></pre></figure>
<p>at startup, if you chose not to start <code class="language-plaintext highlighter-rouge">Oracle 11g</code> at system startup.</p>
Workaround for deleting Columns in SQLite
2015-09-18T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/18/Workaround-to-Delete-column-in-SQLite-Database
<h2 id="intro-">Intro :</h2>
<p>SQLite has limited <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> support that you can use to add a column to the end of a table or to change the name of a table. If you want to make more complex changes in the structure of a table, you will have to recreate the table. You can save existing data to a temporary table, drop the old table, create the new table, then copy the data back in from the temporary table.</p>
<p>From the docs.</p>
<blockquote>
<p>It is not possible to rename a column, remove a column, or add or remove constraints from a table.</p>
</blockquote>
<p>source : <a href="http://www.sqlite.org/lang_altertable.html">http://www.sqlite.org/lang_altertable.html</a></p>
<p>You can always create a new table and then drop the older one.</p>
<p>I will try to explain <a href="http://www.sqlite.org/faq.html#q11">this workaround</a> with an example.</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">sqlite</span><span class="o">></span> <span class="p">.</span><span class="k">schema</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">person</span><span class="p">(</span>
<span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="n">first_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="n">last_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="n">age</span> <span class="nb">INTEGER</span><span class="p">,</span>
<span class="n">height</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
<span class="n">sqlite</span><span class="o">></span> <span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">person</span> <span class="p">;</span>
<span class="n">id</span> <span class="n">first_name</span> <span class="n">last_name</span> <span class="n">age</span> <span class="n">height</span>
<span class="c1">---------- ---------- ---------- ---------- ----------</span>
<span class="mi">0</span> <span class="n">john</span> <span class="n">doe</span> <span class="mi">20</span> <span class="mi">170</span>
<span class="mi">1</span> <span class="n">foo</span> <span class="n">bar</span> <span class="mi">25</span> <span class="mi">171</span> </code></pre></figure>
<p>Now you want to remove the column <code class="language-plaintext highlighter-rouge">height</code> from this table.</p>
<p>Create another table called <code class="language-plaintext highlighter-rouge">new_person</code></p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">sqlite</span><span class="o">></span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">new_person</span><span class="p">(</span>
<span class="p">...</span><span class="o">></span> <span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="p">...</span><span class="o">></span> <span class="n">first_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="p">...</span><span class="o">></span> <span class="n">last_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="p">...</span><span class="o">></span> <span class="n">age</span> <span class="nb">INTEGER</span>
<span class="p">...</span><span class="o">></span> <span class="p">)</span> <span class="p">;</span>
<span class="n">sqlite</span><span class="o">></span> </code></pre></figure>
<p>Now copy the data from the old table</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">sqlite</span><span class="o">></span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">new_person</span>
<span class="p">...</span><span class="o">></span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">first_name</span><span class="p">,</span> <span class="n">last_name</span><span class="p">,</span> <span class="n">age</span> <span class="k">FROM</span> <span class="n">person</span> <span class="p">;</span>
<span class="n">sqlite</span><span class="o">></span> <span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">new_person</span> <span class="p">;</span>
<span class="n">id</span> <span class="n">first_name</span> <span class="n">last_name</span> <span class="n">age</span>
<span class="c1">---------- ---------- ---------- ----------</span>
<span class="mi">0</span> <span class="n">john</span> <span class="n">doe</span> <span class="mi">20</span>
<span class="mi">1</span> <span class="n">foo</span> <span class="n">bar</span> <span class="mi">25</span>
<span class="n">sqlite</span><span class="o">></span></code></pre></figure>
<p>Now Drop the <code class="language-plaintext highlighter-rouge">person</code> table and rename <code class="language-plaintext highlighter-rouge">new_person</code> to <code class="language-plaintext highlighter-rouge">person</code></p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">sqlite</span><span class="o">></span> <span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">person</span> <span class="p">;</span>
<span class="n">sqlite</span><span class="o">></span> <span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">new_person</span> <span class="k">RENAME</span> <span class="k">TO</span> <span class="n">person</span> <span class="p">;</span>
<span class="n">sqlite</span><span class="o">></span></code></pre></figure>
<p>So now if you do a <code class="language-plaintext highlighter-rouge">.schema</code>, you will see</p>
<figure class="highlight"><pre><code class="language-sql" data-lang="sql"><span class="n">sqlite</span><span class="o">></span><span class="p">.</span><span class="k">schema</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="nv">"person"</span><span class="p">(</span>
<span class="n">id</span> <span class="nb">INTEGER</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="n">first_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="n">last_name</span> <span class="nb">TEXT</span><span class="p">,</span>
<span class="n">age</span> <span class="nb">INTEGER</span>
<span class="p">);</span></code></pre></figure>
<p>So that’s how you remove a column from an SQLite database.</p>
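The whole workaround can be exercised end-to-end from Python’s standard-library <code>sqlite3</code> module; this sketch reproduces the exact steps above with the same example table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original table, including the `height` column we want to drop
cur.execute("CREATE TABLE person(id INTEGER PRIMARY KEY, first_name TEXT, "
            "last_name TEXT, age INTEGER, height INTEGER)")
cur.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)",
                [(0, "john", "doe", 20, 170), (1, "foo", "bar", 25, 171)])

# The workaround: copy into a new table without `height`, drop, rename
cur.executescript("""
    CREATE TABLE new_person(id INTEGER PRIMARY KEY, first_name TEXT,
                            last_name TEXT, age INTEGER);
    INSERT INTO new_person SELECT id, first_name, last_name, age FROM person;
    DROP TABLE person;
    ALTER TABLE new_person RENAME TO person;
""")

cols = [row[1] for row in cur.execute("PRAGMA table_info(person)")]
print(cols)  # ['id', 'first_name', 'last_name', 'age']
```

<code>PRAGMA table_info</code> confirms that the renamed table no longer carries the <code>height</code> column while the two rows survive the copy.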
Install Hadoop(Multi Node)
2015-09-13T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/13/Setup-Hadoop-on-Ubuntu-Multi-Node-Cluster
<h2 id="intro">Intro:</h2>
<p>In this article, I will describe the required steps for setting up a distributed, multi-node Apache Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Ubuntu Linux.</p>
<p>I am using <code class="language-plaintext highlighter-rouge">nano</code> as the text editor for this article, but you can use any other text editor like <code class="language-plaintext highlighter-rouge">vi</code>, <code class="language-plaintext highlighter-rouge">sublime</code> or <code class="language-plaintext highlighter-rouge">atom</code>.</p>
<blockquote>
<p>I suggest you set up hadoop on a single node before going multi-node directly, as it will be easier to debug any errors. I have explained <a href="http://www.tasdikrahman.com/2015/09/10/Install-Hadoop-1.0.3-Single-Node-on-Ubuntu-14.04.2/">how to set up single-node hadoop on ubuntu</a>.</p>
</blockquote>
<p><strong>Note</strong> :</p>
<p>I will be demonstrating with 3 nodes, but you can add more as you like.</p>
<h2 id="installation">Installation</h2>
<blockquote>
<p>Install hadoop on all three nodes <a href="http://www.tasdikrahman.com/2015/09/10/Install-Hadoop-1.0.3-Single-Node-on-Ubuntu-14.04.2/">using this tutorial</a>.</p>
</blockquote>
<h2 id="done-lets-continue-then">Done? Let’s continue then!</h2>
<p>The very first step would be to stop <code class="language-plaintext highlighter-rouge">hadoop</code> on all three machines.</p>
<p>For that, you just need to do</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys9:~/hadoop<span class="nv">$ </span>bin/stop-all.sh</code></pre></figure>
<p>on each of the nodes</p>
<p>Now check whether the hadoop processes have stopped on all the nodes.</p>
<p><strong>For sys9</strong></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys9:~/hadoop<span class="nv">$ </span>jps
12792 Jps
hadoop@sys9:~/hadoop<span class="err">$</span></code></pre></figure>
<p><strong>For sys10</strong></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys10:~/hadoop<span class="nv">$ </span>jps
12637 Jps
hadoop@sys10:~/hadoop<span class="err">$</span></code></pre></figure>
<p><strong>For sys8</strong></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span>jps
2186 Jps
hadoop@sys8:~<span class="nv">$ </span></code></pre></figure>
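<p>Eyeballing <code class="language-plaintext highlighter-rouge">jps</code> on each node works, but a tiny filter makes the check mechanical. This is a sketch with a hypothetical helper name, <code class="language-plaintext highlighter-rouge">check_stopped</code>, which reads <code class="language-plaintext highlighter-rouge">jps</code> output on stdin and flags any Hadoop 1.x daemon still alive.</p>

```shell
# Sketch: grep jps output for leftover Hadoop 1.x daemons.
# check_stopped is a hypothetical helper; it reads jps output on stdin.
check_stopped() {
  if grep -Eq 'NameNode|DataNode|SecondaryNameNode|JobTracker|TaskTracker'; then
    echo "hadoop daemons still running"
    return 1
  fi
  echo "all hadoop daemons stopped"
}
# usage on each node: jps | check_stopped
```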
<p>Now we have to choose which node will be the master, so that the other two nodes can be the slaves.</p>
<p>In my case, I will choose <code class="language-plaintext highlighter-rouge">sys8</code> to be my <code class="language-plaintext highlighter-rouge">master</code> node and the others to be <code class="language-plaintext highlighter-rouge">slaves</code></p>
<h2 id="master-node-configuration-">Master node Configuration :</h2>
<p>Add the IP addresses of the slaves to <code class="language-plaintext highlighter-rouge">/etc/hosts</code> on <code class="language-plaintext highlighter-rouge">sys8</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/hosts</code></pre></figure>
<p>It should look something like this after you have added it.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">### master node</span>
hadoop@sys8:~<span class="nv">$ </span><span class="nb">cat</span> /etc/hosts
192.168.103.26 sys9 <span class="c">## slave</span>
192.168.103.28 sys10 <span class="c">## slave</span>
hadoop@sys8:~<span class="nv">$ </span></code></pre></figure>
<p>For <code class="language-plaintext highlighter-rouge">sys9</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">### for slave node</span>
hadoop@sys9:~<span class="nv">$ </span><span class="nb">cat</span> /etc/hosts
192.168.103.24 sys8 <span class="c">## master</span>
192.168.103.26 sys9 <span class="c">## slave</span>
192.168.103.28 sys10 <span class="c">## slave</span>
hadoop@sys9:~<span class="nv">$ </span></code></pre></figure>
<p>For <code class="language-plaintext highlighter-rouge">sys10</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c">### for slave node</span>
hadoop@sys10:~<span class="nv">$ </span><span class="nb">cat</span> /etc/hosts
192.168.103.24 sys8 <span class="c">## master</span>
192.168.103.26 sys9 <span class="c">## slave</span>
192.168.103.28 sys10 <span class="c">## slave</span>
hadoop@sys10:~<span class="err">$</span></code></pre></figure>
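<p>Since every node needs essentially the same host entries, it can help to generate them from one place instead of editing three files by hand. A minimal sketch, using the IPs and hostnames from this post (<code class="language-plaintext highlighter-rouge">gen_hosts</code> is a hypothetical helper):</p>

```shell
# Sketch: emit the cluster's /etc/hosts entries from a single source,
# so every node gets an identical copy. IPs/hostnames are the ones
# used in this post.
gen_hosts() {
  cat <<'EOF'
192.168.103.24 sys8
192.168.103.26 sys9
192.168.103.28 sys10
EOF
}
# on each node (as root): gen_hosts >> /etc/hosts
```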
<h2 id="changing-the-hadoop-configurations-">Changing the hadoop configurations :</h2>
<p>Change to the <code class="language-plaintext highlighter-rouge">hadoop</code> configuration directory first</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">cd</span> ~/hadoop/conf
hadoop@sys8:~/hadoop/conf<span class="nv">$ </span> </code></pre></figure>
<h3 id="editing-confcore-sitexml-all-nodes">Editing <code class="language-plaintext highlighter-rouge">conf/core-site.xml</code> (all nodes)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span><span class="nb">sudo </span>nano core-site.xml </code></pre></figure>
<p>After editing, it should look like</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="configuration.xsl"?></span>
<span class="c"><!-- Put site-specific property overrides in this file. --></span>
<span class="nt"><configuration></span>
<span class="nt"><property></span>
<span class="nt"><name></span>hadoop.tmp.dir<span class="nt"></name></span>
<span class="nt"><value></span>/app/hadoop/tmp<span class="nt"></value></span>
<span class="nt"><description></span>A base for other temporary directories.<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"><property></span>
<span class="nt"><name></span>fs.default.name<span class="nt"></name></span>
<span class="nt"><value></span>hdfs://192.168.103.24:54310<span class="nt"></value></span>
<span class="nt"><description></span>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"></configuration></span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Do the same on nodes <code class="language-plaintext highlighter-rouge">sys9</code> and <code class="language-plaintext highlighter-rouge">sys10</code>.</p>
<p><strong>Note</strong>:</p>
<p>Here <code class="language-plaintext highlighter-rouge">192.168.103.24</code> is the IP address of the master node;
we have simply replaced <code class="language-plaintext highlighter-rouge">localhost</code> with it.</p>
<h3 id="editing-confmapred-sitexml-all-nodes">Editing <code class="language-plaintext highlighter-rouge">conf/mapred-site.xml</code> (All nodes)</h3>
<p>After editing, the file should be</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="configuration.xsl"?></span>
<span class="c"><!-- Put site-specific property overrides in this file. --></span>
<span class="nt"><configuration></span>
<span class="nt"><property></span>
<span class="nt"><name></span>mapred.job.tracker<span class="nt"></name></span>
<span class="nt"><value></span>192.168.103.24:54311<span class="nt"></value></span>
<span class="nt"><description></span>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"></configuration></span>
</pre></td></tr></tbody></table></code></pre></figure>
<p>Do the same on nodes <code class="language-plaintext highlighter-rouge">sys9</code> and <code class="language-plaintext highlighter-rouge">sys10</code>.</p>
<p><strong>Note</strong>:</p>
<p>The file <code class="language-plaintext highlighter-rouge">conf/hdfs-site.xml</code> remains the same on all the nodes.</p>
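<p>The only difference between the single-node and multi-node configs is that <code class="language-plaintext highlighter-rouge">localhost</code> becomes the master's IP, so the edit can be scripted per node. A hedged sketch; <code class="language-plaintext highlighter-rouge">point_to_master</code> is a hypothetical helper and assumes the port numbers and master IP used in this post:</p>

```shell
# Sketch: point a node's config at the master instead of localhost.
# Edits core-site.xml and mapred-site.xml in place under the given
# conf directory; MASTER_IP is this cluster's master (sys8).
MASTER_IP=192.168.103.24
point_to_master() {  # usage: point_to_master ~/hadoop/conf
  sed -i "s|hdfs://localhost:54310|hdfs://${MASTER_IP}:54310|" "$1/core-site.xml"
  sed -i "s|localhost:54311|${MASTER_IP}:54311|" "$1/mapred-site.xml"
}
```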
<h3 id="editing-the-confmasters-master-node-only">Editing the <code class="language-plaintext highlighter-rouge">conf/masters</code> (master node only)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span><span class="nb">sudo </span>nano masters</code></pre></figure>
<p>It should look like</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span><span class="nb">cat </span>masters
192.168.103.24
hadoop@sys8:~/hadoop/conf<span class="err">$</span></code></pre></figure>
<h3 id="editing-the-confslaves-master-node-only">Editing the <code class="language-plaintext highlighter-rouge">conf/slaves</code> (master node only)</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span><span class="nb">sudo </span>nano slaves</code></pre></figure>
<p>It should look like</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span><span class="nb">cat </span>slaves
192.168.103.24
192.168.103.26
192.168.103.28
hadoop@sys8:~/hadoop/conf<span class="err">$</span></code></pre></figure>
<p>If you have additional slaves, add each one to the <code class="language-plaintext highlighter-rouge">conf/slaves</code> file on a new line</p>
<hr />
<p>So we have essentially added the IPs of all the nodes to the <code class="language-plaintext highlighter-rouge">slaves</code> file.</p>
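<p>A malformed <code class="language-plaintext highlighter-rouge">conf/slaves</code> file (duplicates, stray hostnames among the IPs) is a common reason <code class="language-plaintext highlighter-rouge">start-all.sh</code> misbehaves, so a quick lint pass is cheap insurance. A sketch with a hypothetical <code class="language-plaintext highlighter-rouge">lint_slaves</code> helper:</p>

```shell
# Sketch: lint conf/slaves -- flag duplicate entries or lines that are
# not bare IPv4 addresses, which commonly trip up start-all.sh.
lint_slaves() {  # usage: lint_slaves ~/hadoop/conf/slaves
  if [ "$(sort "$1" | uniq -d | wc -l)" -ne 0 ]; then
    echo "duplicate entries"; return 1
  fi
  if grep -Evq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$1"; then
    echo "non-IP line present"; return 1
  fi
  echo "slaves file looks sane"
}
```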
<h3 id="generate-the-ssh-keys-for-the-master-again">Generate the SSH keys for the master again</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span>ssh-keygen
Generating public/private rsa key pair.
Enter file <span class="k">in </span>which to save the key <span class="o">(</span>/home/hadoop/.ssh/id_rsa<span class="o">)</span>:
/home/hadoop/.ssh/id_rsa already exists.
Overwrite <span class="o">(</span>y/n<span class="o">)</span>? y
Enter passphrase <span class="o">(</span>empty <span class="k">for </span>no passphrase<span class="o">)</span>:
Enter same passphrase again:
Your identification has been saved <span class="k">in</span> /home/hadoop/.ssh/id_rsa.
Your public key has been saved <span class="k">in</span> /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
<span class="nb">fc</span>:73:90:c9:c9:cc:b7:13:e7:75:55:0b:94:b5:18:01 hadoop@sys8
The key<span class="s1">'s randomart image is:
+--[ RSA 2048]----+
| Eo=+..|
| .+ +|
| . o.|
| . = + .|
| S X o . o|
| . o = ..|
| o + . |
| o . |
| |
+-----------------+
hadoop@sys8:~/hadoop/conf$ </span></code></pre></figure>
<h3 id="add-this-key-to-all-the-nodes-including-the-master">Add this key to all the nodes (Including the master)</h3>
<p>For <code class="language-plaintext highlighter-rouge">sys9</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span>ssh-copy-id hadoop@sys9
The authenticity of host <span class="s1">'sys9 (192.168.103.26)'</span> can<span class="s1">'t be established.
ECDSA key fingerprint is e9:d3:5f:de:5b:0d:93:17:18:16:b8:9b:39:39:fa:62.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@sys9'</span>s password:
Number of key<span class="o">(</span>s<span class="o">)</span> added: 1
Now try logging into the machine, with: <span class="s2">"ssh 'hadoop@sys9'"</span>
and check to make sure that only the key<span class="o">(</span>s<span class="o">)</span> you wanted were added.
hadoop@sys8:~/hadoop/conf<span class="err">$</span></code></pre></figure>
<p>For <code class="language-plaintext highlighter-rouge">sys10</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span>ssh-copy-id hadoop@sys10
The authenticity of host <span class="s1">'sys10 (192.168.103.28)'</span> can<span class="s1">'t be established.
ECDSA key fingerprint is 1b:e2:6c:bc:2d:41:12:4d:79:e1:60:5c:08:74:32:9a.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@sys10'</span>s password:
Permission denied, please try again.
hadoop@sys10<span class="s1">'s password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '</span>hadoop@sys10<span class="s1">'"
and check to make sure that only the key(s) you wanted were added.
hadoop@sys8:~/hadoop/conf$</span></code></pre></figure>
<p>For <code class="language-plaintext highlighter-rouge">sys8</code> itself!</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop/conf<span class="nv">$ </span>ssh-copy-id hadoop@sys8
The authenticity of host <span class="s1">'sys8 (192.168.103.24)'</span> can<span class="s1">'t be established.
ECDSA key fingerprint is 1b:e2:6c:bc:2d:41:12:4d:79:e1:60:5c:08:74:32:9a.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@sys8'</span>s password:
Permission denied, please try again.
hadoop@sys8<span class="s1">'s password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '</span>hadoop@sys8<span class="s1">'"
and check to make sure that only the key(s) you wanted were added.
hadoop@sys8:~/hadoop/conf$</span></code></pre></figure>
<h2 id="format-the-apptmp-directory-contents-master-node">Format the <code class="language-plaintext highlighter-rouge">/app/hadoop/tmp/</code> directory contents (master node)</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo rm</span> <span class="nt">-rf</span> /app/hadoop/tmp/<span class="k">*</span>
hadoop@sys8:~<span class="err">$</span></code></pre></figure>
<h2 id="format-the-namenode-master-node">Format the namenode (master node)</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
</pre></td><td class="code"><pre>hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/hadoop namenode <span class="nt">-format</span>
Warning: <span class="nv">$HADOOP_HOME</span> is deprecated.
15/09/13 23:11:22 INFO namenode.NameNode: STARTUP_MSG:
/<span class="k">************************************************************</span>
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host <span class="o">=</span> sys8/192.168.103.24
STARTUP_MSG: args <span class="o">=</span> <span class="o">[</span><span class="nt">-format</span><span class="o">]</span>
STARTUP_MSG: version <span class="o">=</span> 1.2.1
STARTUP_MSG: build <span class="o">=</span> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 <span class="nt">-r</span> 1503152<span class="p">;</span> compiled by <span class="s1">'mattf'</span> on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG: java <span class="o">=</span> 1.7.0_79
<span class="k">************************************************************</span>/
15/09/13 23:11:22 INFO util.GSet: Computing capacity <span class="k">for </span>map BlocksMap
15/09/13 23:11:22 INFO util.GSet: VM <span class="nb">type</span> <span class="o">=</span> 64-bit
15/09/13 23:11:22 INFO util.GSet: 2.0% max memory <span class="o">=</span> 932184064
15/09/13 23:11:22 INFO util.GSet: capacity <span class="o">=</span> 2^21 <span class="o">=</span> 2097152 entries
15/09/13 23:11:22 INFO util.GSet: <span class="nv">recommended</span><span class="o">=</span>2097152, <span class="nv">actual</span><span class="o">=</span>2097152
15/09/13 23:11:22 INFO namenode.FSNamesystem: <span class="nv">fsOwner</span><span class="o">=</span>hadoop
15/09/13 23:11:22 INFO namenode.FSNamesystem: <span class="nv">supergroup</span><span class="o">=</span>supergroup
15/09/13 23:11:22 INFO namenode.FSNamesystem: <span class="nv">isPermissionEnabled</span><span class="o">=</span><span class="nb">true
</span>15/09/13 23:11:22 INFO namenode.FSNamesystem: dfs.block.invalidate.limit<span class="o">=</span>100
15/09/13 23:11:22 INFO namenode.FSNamesystem: <span class="nv">isAccessTokenEnabled</span><span class="o">=</span><span class="nb">false </span><span class="nv">accessKeyUpdateInterval</span><span class="o">=</span>0 min<span class="o">(</span>s<span class="o">)</span>, <span class="nv">accessTokenLifetime</span><span class="o">=</span>0 min<span class="o">(</span>s<span class="o">)</span>
15/09/13 23:11:22 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length <span class="o">=</span> 0
15/09/13 23:11:22 INFO namenode.NameNode: Caching file names occuring more than 10 <span class="nb">times
</span>15/09/13 23:11:22 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 112 bytes saved <span class="k">in </span>0 seconds.
15/09/13 23:11:22 INFO namenode.FSEditLog: closing edit log: <span class="nv">position</span><span class="o">=</span>4, <span class="nv">editlog</span><span class="o">=</span>/app/hadoop/tmp/dfs/name/current/edits
15/09/13 23:11:22 INFO namenode.FSEditLog: close success: <span class="nb">truncate </span>to 4, <span class="nv">editlog</span><span class="o">=</span>/app/hadoop/tmp/dfs/name/current/edits
15/09/13 23:11:23 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
15/09/13 23:11:23 INFO namenode.NameNode: SHUTDOWN_MSG:
/<span class="k">************************************************************</span>
SHUTDOWN_MSG: Shutting down NameNode at sys8/192.168.103.24
<span class="k">************************************************************</span>/
hadoop@sys8:~/hadoop<span class="nv">$ </span>
</pre></td></tr></tbody></table></code></pre></figure>
<h2 id="start-the-name-node-in-master-node">Start the name node (in master node)</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre>hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/start-all.sh
Warning: <span class="nv">$HADOOP_HOME</span> is deprecated.
starting namenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-sys8.out
192.168.103.24: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-sys8.out
192.168.103.28: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-sys10.out
192.168.103.26: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-sys9.out
192.168.103.24: starting secondarynamenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-sys8.out
starting jobtracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-sys8.out
192.168.103.26: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-sys9.out
192.168.103.28: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-sys10.out
192.168.103.24: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-sys8.out
hadoop@sys8:~/hadoop<span class="err">$</span>
</pre></td></tr></tbody></table></code></pre></figure>
<h2 id="check-jps-in-all-systems">Check JPS (in all systems)</h2>
<p><code class="language-plaintext highlighter-rouge">sys8</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span>jps
3526 Jps
2817 NameNode
3258 JobTracker
3153 SecondaryNameNode
2976 DataNode
3442 TaskTracker
hadoop@sys8:~/hadoop<span class="err">$</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">sys9</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys9:~/hadoop<span class="nv">$ </span>jps
2440 TaskTracker
2519 Jps
hadoop@sys9:~/hadoop<span class="err">$</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">sys10</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys10:~/hadoop<span class="nv">$ </span>jps
2549 Jps
2469 TaskTracker
hadoop@sys10:~/hadoop<span class="err">$</span></code></pre></figure>
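<p>When checking three machines, a one-line predicate over the <code class="language-plaintext highlighter-rouge">jps</code> output is handier than scanning it by eye. A sketch with a hypothetical <code class="language-plaintext highlighter-rouge">has_daemon</code> helper that reads <code class="language-plaintext highlighter-rouge">jps</code> output on stdin:</p>

```shell
# Sketch: assert that a given daemon appears in jps output.
# has_daemon is a hypothetical helper; it reads jps output on stdin.
has_daemon() {  # usage: jps | has_daemon NameNode
  if grep -qw "$1"; then
    echo "$1 running"
  else
    echo "$1 NOT running"
    return 1
  fi
}
# e.g. on the master: jps | has_daemon JobTracker
```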
<h2 id="check-the-web-interface-in-your-browser">Check the web interface in your browser</h2>
<p>You can also check Hadoop’s web interface in your browser</p>
<p>using the URL: <code class="language-plaintext highlighter-rouge">http://192.168.103.24:50030/</code></p>
<figure>
<a href="/images/machine_list_hadoop.jpg"><img src="/images/machine_list_hadoop.jpg" alt="" /></a>
<figcaption>List of all nodes in the Hadoop Cluster</figcaption>
</figure>
<hr />
<figure>
<a href="/images/machine_list_hadoop.jpg"><img src="/images/machine_list_hadoop.jpg" alt="" /></a>
<figcaption>MapReduce user interface on the master node</figcaption>
</figure>
<p>That’s all, folks!</p>
Install Hadoop(Single node)
2015-09-10T00:00:00+00:00
https://www.tasdikrahman.com/2015/09/10/Install-Hadoop-1.0.3-Single-Node-on-Ubuntu-14.04.2
<h2 id="on-a-starting-note-">On a starting note :</h2>
<p>I am assuming that you have a fresh Ubuntu install on your system as this will cut down a lot of frustration trying to debug why Hadoop is not running.</p>
<p>I am using <code class="language-plaintext highlighter-rouge">nano</code> as the text editor for this article but you can use any other text-editor like <code class="language-plaintext highlighter-rouge">vi</code>, <code class="language-plaintext highlighter-rouge">sublime</code> or <code class="language-plaintext highlighter-rouge">atom</code> for doing the same</p>
<blockquote>
<p>Next article : <a href="http://www.tasdikrahman.com/2015/09/13/Setup-Hadoop-on-Ubuntu-Multi-Node-Cluster/">Install Hadoop Multi Node 1.0.3 on Ubuntu 14.04.2</a></p>
</blockquote>
<h2 id="installation">Installation</h2>
<h3 id="create-a-dedicated-user">Create a dedicated user</h3>
<p>Create a separate user named <code class="language-plaintext highlighter-rouge">hadoop</code> to keep the configuration files and the installation files separate from the rest of the system</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys8@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>adduser hadoop</code></pre></figure>
<p>You will then be prompted for the new UNIX password and user details</p>
<p>Add this user to the sudoers group</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys8@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>adduser hadoop <span class="nb">sudo</span></code></pre></figure>
<p>Switch to the newly created user</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">sys8@sys8:~<span class="nv">$ </span>su - hadoop</code></pre></figure>
<p>Install Java, which is available in Ubuntu’s default repositories</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>default-jdk</code></pre></figure>
<h3 id="configure-ssh">Configure SSH:</h3>
<p>The next step is to enable SSH to localhost</p>
<p>Install <code class="language-plaintext highlighter-rouge">openssh-server</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>openssh-server</code></pre></figure>
<p>Generate the keys</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span>ssh-keygen <span class="nt">-t</span> rsa <span class="nt">-P</span> <span class="s2">""</span>
Generating public/private rsa key pair.
Enter file <span class="k">in </span>which to save the key <span class="o">(</span>/home/hadoop/.ssh/id_rsa<span class="o">)</span>:
Created directory <span class="s1">'/home/hadoop/.ssh'</span><span class="nb">.</span>
Your identification has been saved <span class="k">in</span> /home/hadoop/.ssh/id_rsa.
Your public key has been saved <span class="k">in</span> /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
68:b0:84:c0:3f:16:41:38:d9:7e:d6:63:a3:a0:28:f5 hadoop@sys8
The key<span class="s1">'s randomart image is:
+--[ RSA 2048]----+
|o =o. |
| * + |
| = + . |
| .B = * |
|..o.* = S |
|o. Eo |
|. |
| |
| |
+-----------------+
hadoop@sys8:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys</span></code></pre></figure>
<p>Copy the keys over to enable passwordless ssh</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span>ssh-copy-id <span class="nt">-i</span> <span class="nv">$HOME</span>/.ssh/id_rsa.pub hadoop@localhost
The authenticity of host <span class="s1">'localhost (127.0.0.1)'</span> can<span class="s1">'t be established.
ECDSA key fingerprint is 17:f4:fc:aa:88:4d:51:b1:08:ae:df:75:2f:07:37:26.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.</span></code></pre></figure>
<p>Test the ssh connection to the localhost</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span>ssh hadoop@localhost
Welcome to Ubuntu 14.04.3 LTS <span class="o">(</span>GNU/Linux 3.19.0-28-generic x86_64<span class="o">)</span>
<span class="k">*</span> Documentation: https://help.ubuntu.com/
60 packages can be updated.
24 updates are security updates.
hadoop@sys8:~<span class="nv">$ </span><span class="nb">exit
logout
</span>Connection to localhost closed.
hadoop@sys8:~<span class="err">$</span></code></pre></figure>
<h3 id="apache-hadoop-installation">Apache Hadoop Installation:</h3>
<p>Download Hadoop from Apache’s archive,</p>
<p>but first change to the hadoop user’s home directory</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">pwd</span>
/home/hadoop
hadoop@sys8:~<span class="nv">$ </span>wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
<span class="nt">--2015-09-12</span> 18:51:49-- https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Resolving archive.apache.org <span class="o">(</span>archive.apache.org<span class="o">)</span>... 140.211.11.131, 192.87.106.229, 2001:610:1:80bc:192:87:106:229
Connecting to archive.apache.org <span class="o">(</span>archive.apache.org<span class="o">)</span>|140.211.11.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63851630 <span class="o">(</span>61M<span class="o">)</span> <span class="o">[</span>application/x-gzip]
Saving to: ‘hadoop-1.2.1.tar.gz’
100%[<span class="o">======================================================================================================>]</span> 6,38,51,630 1.82MB/s <span class="k">in </span>42s
2015-09-12 18:52:32 <span class="o">(</span>1.45 MB/s<span class="o">)</span> - ‘hadoop-1.2.1.tar.gz’ saved <span class="o">[</span>63851630/63851630]</code></pre></figure>
<p>Extract and rename the file</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo tar </span>xzf hadoop-1.2.1.tar.gz
hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo mv </span>hadoop-1.2.1 hadoop</code></pre></figure>
<p>Set the user permissions</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo chown</span> <span class="nt">-R</span> hadoop:hadoop hadoop</code></pre></figure>
<hr />
<h2 id="configuration">Configuration</h2>
<h3 id="update-etcprofile">Update <code class="language-plaintext highlighter-rouge">/etc/profile</code></h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/profile</code></pre></figure>
<p>Add the following lines at the end</p>
<p><strong>NOTE</strong>:</p>
<p>This assumes you have installed <code class="language-plaintext highlighter-rouge">default-jdk</code> from the repos; change <code class="language-plaintext highlighter-rouge">JAVA_HOME</code> accordingly for any other Java version</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="nb">export </span><span class="nv">HADOOP_HOME</span><span class="o">=</span><span class="s2">"/home/hadoop/hadoop/"</span>
<span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="s2">"/usr/lib/jvm/java-1.7.0-openjdk-amd64"</span>
<span class="nb">unalias </span>fs &> /dev/null
<span class="nb">alias </span><span class="nv">fs</span><span class="o">=</span><span class="s2">"hadoop fs"</span>
<span class="nb">unalias </span>hls &> /dev/null
<span class="nb">alias </span><span class="nv">hls</span><span class="o">=</span><span class="s2">"fs -ls"</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$PATH</span><span class="s2">:</span><span class="nv">$HADOOP_HOME</span><span class="s2">/bin"</span>
</pre></td></tr></tbody></table></code></pre></figure>
<hr />
<h2 id="edit-the-configuration-files-for-hadoop">Edit the configuration files for Hadoop</h2>
<h3 id="edit-hadoop-envsh">Edit <code class="language-plaintext highlighter-rouge">hadoop-env.sh</code></h3>
<p>Change to the hadoop folder and add the Java home path</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~<span class="nv">$ </span><span class="nb">cd</span> ~/hadoop
hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">ls
</span>bin CHANGES.txt docs hadoop-core-1.2.1.jar hadoop-test-1.2.1.jar ivy.xml LICENSE.txt sbin webapps
build.xml conf hadoop-ant-1.2.1.jar hadoop-examples-1.2.1.jar hadoop-tools-1.2.1.jar lib NOTICE.txt share
c++ contrib hadoop-client-1.2.1.jar hadoop-minicluster-1.2.1.jar ivy libexec README.txt src
hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo </span>nano conf/hadoop-env.sh</code></pre></figure>
<p>Add the following</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># The java implementation to use. Required.</span>
<span class="nb">export </span><span class="nv">JAVA_HOME</span><span class="o">=</span><span class="s2">"/usr/lib/jvm/java-1.7.0-openjdk-amd64"</span>
<span class="nb">export </span><span class="nv">HADOOP_OPTS</span><span class="o">=</span><span class="nt">-Djava</span>.net.preferIPv4Stack<span class="o">=</span><span class="nb">true</span></code></pre></figure>
<p><strong>Note</strong>:</p>
<p>The second line disables IPv6 for Hadoop specifically, not for the whole system</p>
<h3 id="create-the-tmp-folder-for-hadoop">Create the <code class="language-plaintext highlighter-rouge">tmp</code> folder for Hadoop</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo mkdir</span> <span class="nt">-p</span> /app/hadoop/tmp
<span class="c">### add the permissions</span>
hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo chown </span>hadoop:hadoop /app/hadoop/tmp
hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo chmod </span>750 /app/hadoop/tmp</code></pre></figure>
<h3 id="edit-confcore-sitexml">Edit <code class="language-plaintext highlighter-rouge">conf/core-site.xml</code></h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo </span>nano conf/core-site.xml</code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">&lt;configuration&gt;</code> tags will initially be empty.</p>
<p>It should look something like this after editing</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="configuration.xsl"?></span>
<span class="c"><!-- Put site-specific property overrides in this file. --></span>
<span class="nt"><configuration></span>
<span class="nt"><property></span>
<span class="nt"><name></span>hadoop.tmp.dir<span class="nt"></name></span>
<span class="nt"><value></span>/app/hadoop/tmp<span class="nt"></value></span>
<span class="nt"><description></span>A base for other temporary directories.<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"><property></span>
<span class="nt"><name></span>fs.default.name<span class="nt"></name></span>
<span class="nt"><value></span>hdfs://localhost:54310<span class="nt"></value></span>
<span class="nt"><description></span>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"></configuration></span>
</pre></td></tr></tbody></table></code></pre></figure>
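<p>Not part of the original setup, but a quick way to confirm a property landed in the file is to grep it back out. Here this runs against a minimal stand-in copy of <code class="language-plaintext highlighter-rouge">core-site.xml</code> with the same <code class="language-plaintext highlighter-rouge">&lt;name&gt;</code>/<code class="language-plaintext highlighter-rouge">&lt;value&gt;</code> layout as above:</p>

```shell
# Write a minimal stand-in core-site.xml and pull the fs.default.name
# value back out with grep/sed (naive: assumes <value> follows <name>).
cd "$(mktemp -d)"
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF
grep -A1 '<name>fs.default.name</name>' core-site.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'   # prints hdfs://localhost:54310
```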
<h3 id="edit-confmapred-sitexml">Edit <code class="language-plaintext highlighter-rouge">conf/mapred-site.xml</code></h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo </span>nano conf/mapred-site.xml</code></pre></figure>
<p>It should look something like this after editing</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="configuration.xsl"?></span>
<span class="c"><!-- Put site-specific property overrides in this file. --></span>
<span class="nt"><configuration></span>
<span class="nt"><property></span>
<span class="nt"><name></span>mapred.job.tracker<span class="nt"></name></span>
<span class="nt"><value></span>localhost:54311<span class="nt"></value></span>
<span class="nt"><description></span>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"></configuration></span>
</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="edit-confhdfs-sitexml">Edit <code class="language-plaintext highlighter-rouge">conf/hdfs-site.xml</code></h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span><span class="nb">sudo </span>nano conf/hdfs-site.xml</code></pre></figure>
<p>It should look something like this after editing</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0"?></span>
<span class="cp"><?xml-stylesheet type="text/xsl" href="configuration.xsl"?></span>
<span class="c"><!-- Put site-specific property overrides in this file. --></span>
<span class="nt"><configuration></span>
<span class="nt"><property></span>
<span class="nt"><name></span>dfs.replication<span class="nt"></name></span>
<span class="nt"><value></span>1<span class="nt"></value></span>
<span class="nt"><description></span>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
<span class="nt"></description></span>
<span class="nt"></property></span>
<span class="nt"></configuration></span>
</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="format-the-hdfs-file">Format the HDFS file</h3>
<p><strong>Note</strong> : Run this command only once, i.e. now. Re-formatting the namenode later will wipe out everything stored in HDFS.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/hadoop namenode <span class="nt">-format</span>
15/09/12 19:01:48 INFO namenode.NameNode: STARTUP_MSG:
/<span class="k">************************************************************</span>
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host <span class="o">=</span> sys8/127.0.1.1
STARTUP_MSG: args <span class="o">=</span> <span class="o">[</span><span class="nt">-format</span><span class="o">]</span>
STARTUP_MSG: version <span class="o">=</span> 1.2.1
STARTUP_MSG: build <span class="o">=</span> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 <span class="nt">-r</span> 1503152<span class="p">;</span> compiled by <span class="s1">'mattf'</span> on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG: java <span class="o">=</span> 1.7.0_79
<span class="k">************************************************************</span>/
15/09/12 19:01:49 INFO util.GSet: Computing capacity <span class="k">for </span>map BlocksMap
15/09/12 19:01:49 INFO util.GSet: VM <span class="nb">type</span> <span class="o">=</span> 64-bit
15/09/12 19:01:49 INFO util.GSet: 2.0% max memory <span class="o">=</span> 932184064
15/09/12 19:01:49 INFO util.GSet: capacity <span class="o">=</span> 2^21 <span class="o">=</span> 2097152 entries
15/09/12 19:01:49 INFO util.GSet: <span class="nv">recommended</span><span class="o">=</span>2097152, <span class="nv">actual</span><span class="o">=</span>2097152
15/09/12 19:01:49 INFO namenode.FSNamesystem: <span class="nv">fsOwner</span><span class="o">=</span>hadoop
15/09/12 19:01:49 INFO namenode.FSNamesystem: <span class="nv">supergroup</span><span class="o">=</span>supergroup
15/09/12 19:01:49 INFO namenode.FSNamesystem: <span class="nv">isPermissionEnabled</span><span class="o">=</span><span class="nb">true
</span>15/09/12 19:01:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit<span class="o">=</span>100
15/09/12 19:01:49 INFO namenode.FSNamesystem: <span class="nv">isAccessTokenEnabled</span><span class="o">=</span><span class="nb">false </span><span class="nv">accessKeyUpdateInterval</span><span class="o">=</span>0 min<span class="o">(</span>s<span class="o">)</span>, <span class="nv">accessTokenLifetime</span><span class="o">=</span>0 min<span class="o">(</span>s<span class="o">)</span>
15/09/12 19:01:49 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length <span class="o">=</span> 0
15/09/12 19:01:49 INFO namenode.NameNode: Caching file names occuring more than 10 <span class="nb">times
</span>15/09/12 19:01:49 INFO common.Storage: Image file /app/hadoop/tmp/dfs/name/current/fsimage of size 112 bytes saved <span class="k">in </span>0 seconds.
15/09/12 19:01:49 INFO namenode.FSEditLog: closing edit log: <span class="nv">position</span><span class="o">=</span>4, <span class="nv">editlog</span><span class="o">=</span>/app/hadoop/tmp/dfs/name/current/edits
15/09/12 19:01:49 INFO namenode.FSEditLog: close success: <span class="nb">truncate </span>to 4, <span class="nv">editlog</span><span class="o">=</span>/app/hadoop/tmp/dfs/name/current/edits
15/09/12 19:01:49 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
15/09/12 19:01:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/<span class="k">************************************************************</span>
SHUTDOWN_MSG: Shutting down NameNode at sys8/127.0.1.1
<span class="k">************************************************************</span>/</code></pre></figure>
<hr />
<h2 id="start-the-single-node">Start the single node</h2>
<p><strong>NOTE</strong> :</p>
<p><code class="language-plaintext highlighter-rouge">bin/start-all.sh</code> is deprecated so we will use <code class="language-plaintext highlighter-rouge">bin/start-dfs.sh</code> and then <code class="language-plaintext highlighter-rouge">bin/start-mapred.dfs</code></p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/start-dfs.sh
starting namenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-namenode-sys9.out
localhost: starting datanode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-datanode-sys9.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-sys9.out
hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/start-mapred.sh
starting jobtracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-sys9.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-sys9.out</code></pre></figure>
<h3 id="check-if-everything-is-running-fine">Check if everything is running fine:</h3>
<p>Run <code class="language-plaintext highlighter-rouge">jps</code> for that and the output should look something like this</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span>jps
11508 TaskTracker
10938 NameNode
11868 Jps
11250 SecondaryNameNode
11347 JobTracker
11085 DataNode
hadoop@sys8:~/hadoop<span class="nv">$ </span></code></pre></figure>
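<p>That visual check can also be scripted. The sketch below feeds canned <code class="language-plaintext highlighter-rouge">jps</code>-style output through a loop instead of querying a live cluster, so the daemon list and PIDs are just the ones from the transcript above:</p>

```shell
# Verify that every expected Hadoop daemon appears in jps-style output.
expected="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"
jps_output="11508 TaskTracker
10938 NameNode
11250 SecondaryNameNode
11347 JobTracker
11085 DataNode"
missing=0
for d in $expected; do
  # -q: quiet, -w: match whole words only
  echo "$jps_output" | grep -qw "$d" || { echo "missing: $d"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all daemons running"
```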
<hr />
<h2 id="stopping-hadoop">Stopping hadoop</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">hadoop@sys8:~/hadoop<span class="nv">$ </span>bin/stop-all.sh</code></pre></figure>
<p>That’s all Folks!</p>
ROS Jade : Installation
2015-08-28T00:00:00+00:00
https://www.tasdikrahman.com/2015/08/28/Install-ROS-Jade-on-Ubuntu-14.10
<h3 id="what-the-heck-is-ros-anyway-">What the heck is ROS anyway :</h3>
<p>Robot Operating System (ROS) is a collection of software frameworks for robot software development, providing operating system-like functionality on a heterogeneous computer cluster.</p>
<p>So in layman terms, it just helps us build robot applications.</p>
<blockquote>
<p>Previous post on ROS : <a href="http://www.tasdikrahman.com/2015/08/28/Configure-ROS-Jade-on-Ubuntu-14.10/">Configuring ROS - Jade on Ubuntu 14.10</a></p>
</blockquote>
<h3 id="note-">Note :</h3>
<p>ROS Jade ONLY supports Trusty (14.04), Utopic (14.10) and Vivid (15.04) for debian packages.
If you are on any other version of Ubuntu or have a different flavor of linux installed, I suggest you head over
to <a href="http://wiki.ros.org/">wiki.ros.org</a>.</p>
<p>Note that this guide is written keeping in mind that we are on Ubuntu 14.10!</p>
<h3 id="requirements-">Requirements :</h3>
<ul>
<li>Supported OS :
<ul>
<li>Ubuntu Trusty(14.04)</li>
<li>Ubuntu Utopic (14.10)</li>
<li>Ubuntu Vivid (15.04)</li>
</ul>
</li>
<li>Minimum requirements :
<ul>
<li>C++03</li>
<li>C++11 features are not used, but code should compile when -std=c++11 is used</li>
<li>Python 2.7</li>
<li>Python 3.3 not required, but testing against it is recommended</li>
<li>Lisp SBCL 1.1.14</li>
<li>CMake 2.8.12</li>
<li>Boost 1.54</li>
</ul>
</li>
</ul>
<h3 id="installation-">Installation :</h3>
<ol>
<li><strong>Configure your Ubuntu repositories</strong> :
<ul>
<li>Configure your Ubuntu repositories to allow “restricted,” “universe,” and “multiverse.” You can follow <a href="https://help.ubuntu.com/community/Repositories/Ubuntu">the Ubuntu guide</a> for completing this work.</li>
</ul>
</li>
<li>
<p><strong>Set up your sources.list</strong></p>
<p>This can be done by running the following command in the terminal</p>
</li>
</ol>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>sh <span class="nt">-c</span> <span class="s1">'echo "deb http://packages.ros.org/ros/ubuntu $(lsb_release -sc) main" > /etc/apt/sources.list.d/ros-latest.list'</span>
</code></pre></figure>
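<p>To see what that command actually writes, note that <code class="language-plaintext highlighter-rouge">$(lsb_release -sc)</code> expands to your distro codename. Here <code class="language-plaintext highlighter-rouge">trusty</code> (Ubuntu 14.04) stands in for whatever your machine would report:</p>

```shell
# Expand the sources.list line by hand; `trusty` is a stand-in for
# the output of `lsb_release -sc`.
codename=trusty
echo "deb http://packages.ros.org/ros/ubuntu $codename main"
# prints: deb http://packages.ros.org/ros/ubuntu trusty main
```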
<ol start="3">
<li><strong>Set up your keys</strong></li>
</ol>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-key adv <span class="nt">--keyserver</span> hkp://pool.sks-keyservers.net <span class="nt">--recv-key</span> 0xB01FA116
</code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Note that if you are behind a proxy, this command wont work. You will have to workaround that I guess(use a dongle maybe!)
</code></pre></div></div>
<ol start="4">
<li><strong>Installation</strong></li>
</ol>
<p>Make sure that the Debian package index is up-to-date. To do so, just run</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get update
</code></pre></figure>
<p>and you are good to go</p>
<h4 id="for-those-running-14042">For those running 14.04.2</h4>
<p>If you are on 14.04.2, you have to manually fix the dependency issues.</p>
<blockquote>
<p>DO NOT INSTALL THE BELOW PACKAGES IF YOU ARE ON 14.04, THIS WILL DESTROY YOUR X-SERVER. IN SHORT, YOU WON’T BE ABLE TO SEE THE GUI OF YOUR OS AGAIN!</p>
</blockquote>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>xserver-xorg-dev-lts-utopic mesa-common-dev-lts-utopic libxatracker-dev-lts-utopic libopenvg1-mesa-dev-lts-utopic libgles2-mesa-dev-lts-utopic libgles1-mesa-dev-lts-utopic libgl1-mesa-dev-lts-utopic libgbm-dev-lts-utopic libegl1-mesa-dev-lts-utopic
</code></pre></figure>
<blockquote>
<p>DO NOT INSTALL THE ABOVE PACKAGES IF YOU ARE ON 14.04, THIS WILL DESTROY YOUR X-SERVER. IN SHORT, YOU WON’T BE ABLE TO SEE THE GUI OF YOUR OS AGAIN!</p>
</blockquote>
<p>Alternatively you can try,</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>libgl1-mesa-dev-lts-utopic
</code></pre></figure>
<p>to fix the dependency issues.</p>
<h4 id="for-those-running-1404-1410-and-1504">For those running 14.04, 14.10 and 15.04</h4>
<p>After that, you can run the command to install the <strong>Desktop-Full Install</strong>, which is the <strong>recommended</strong> one, by doing:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"> tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>ros-jade-desktop-full
</code></pre></figure>
<p>If everything is good up to this point, you should see the package manager downloading the required packages.</p>
<h3 id="get-yourself-a-coffee-or-two"><strong>Get yourself a coffee or two</strong></h3>
<p>because it can take a helluva time depending on the speed of your internet connection.</p>
<h3 id="initialize-rosdep">Initialize rosdep</h3>
<p>Before you can use ROS, you will need to initialize rosdep. rosdep enables you to easily install system
dependencies for source you want to compile and is required to run some core components in ROS.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>rosdep init
tasdik@Acer:~<span class="nv">$ </span>rosdep update</code></pre></figure>
<h3 id="environment-setup">Environment setup</h3>
<p>It’s convenient if the ROS environment variables are automatically added to your bash session every time
a new shell is launched:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"source /opt/ros/jade/setup.bash"</span> <span class="o">>></span> ~/.bashrc
tasdik@Acer:~<span class="nv">$ </span><span class="nb">source</span> ~/.bashrc</code></pre></figure>
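<p>One caveat with the plain <code class="language-plaintext highlighter-rouge">echo >> ~/.bashrc</code> approach: running it twice appends the line twice. A guarded variant (demonstrated here against a temp file, not the real <code class="language-plaintext highlighter-rouge">~/.bashrc</code>) only appends when the line is missing:</p>

```shell
# Idempotent append: add the source line only if it is not already
# present, so repeated runs don't stack duplicates.
rc=$(mktemp)
line='source /opt/ros/jade/setup.bash'
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"   # second run is a no-op
grep -cxF "$line" "$rc"   # prints 1, not 2
```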
<h3 id="getting-rosinstall">Getting rosinstall</h3>
<p><a href="http://wiki.ros.org/rosinstall">rosinstall</a> is a frequently used command-line tool in ROS that is
distributed separately. It enables you to easily download many source trees for ROS packages with one command.</p>
<p>For ubuntu users, just run</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>python-rosinstall</code></pre></figure>
<h3 id="finally">Finally</h3>
<p>The next step would be to <a href="http://www.tasdikrahman.com/2015/08/28/Configure-ROS-Jade-on-Ubuntu-14.10/">configure ROS, the steps of which can be found here</a></p>
<p>Now that you have installed ROS on your system, you can look forward to the
<a href="http://wiki.ros.org/ROS/Tutorials">ROS tutorials</a></p>
<p>Till then Goodbye!</p>
<h3 id="references-">References :</h3>
<p>I could not have written this guide if the documentation had not been so crisp and straight forward.
Kudos to the ROS development team!</p>
<ul>
<li><a href="http://wiki.ros.org/jade/Installation/Ubuntu">http://wiki.ros.org/jade/Installation/Ubuntu</a></li>
</ul>
ROS Jade : Configuration
2015-08-28T00:00:00+00:00
https://www.tasdikrahman.com/2015/08/28/Configure-ROS-Jade-on-Ubuntu-14.10
<h3 id="configuring-ros">Configuring ROS</h3>
<p>If you have not installed ROS, I have written a short guide describing the process. It can be found here:</p>
<blockquote>
<p><a href="https://www.tasdikrahman.com/2015/08/28/Install-ROS-Jade-on-Ubuntu-14.10/">Install ROS - Jade on Ubuntu 14.10</a></p>
</blockquote>
<p>The first step is to check whether the environment variables for ROS are set up properly. Do a</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">export</span> | <span class="nb">grep </span>ROS</code></pre></figure>
<p>Check whether <code class="language-plaintext highlighter-rouge">ROS_ROOT</code> and <code class="language-plaintext highlighter-rouge">ROS_PACKAGE_PATH</code> are set.</p>
<p>It should look something like this</p>
<figure>
<a href="/images/path_ros.jpg"><img src="/images/path_ros.jpg" alt="" /></a>
<figcaption>ROS environment variables set in the shell</figcaption>
</figure>
<p>If not, just do a</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">source</span> /opt/ros/&lt;distro&gt;/setup.bash</code></pre></figure>
<p>where <code class="language-plaintext highlighter-rouge">&lt;distro&gt;</code> is <del>the distribution name of your OS</del> the version of ROS installed (Jade Turtle in our case).</p>
<h3 id="creating-ros-workspace">Creating ROS Workspace</h3>
<p>I will be using <a href="http://wiki.ros.org/catkin">catkin</a> to create my workspace.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/catkin_ws/src
tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd</span> ~/catkin_ws/src
tasdik@Acer:~<span class="nv">$ </span>catkin_init_workspace</code></pre></figure>
<p>Even though the folder is empty (we just have a file named <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>), we can still build the workspace.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span>catkin_make
tasdik@Acer:~<span class="nv">$ </span><span class="nb">cd</span> ~/catkin_ws/</code></pre></figure>
<p>Do an <code class="language-plaintext highlighter-rouge">ls</code> and you can see that we now have folders like <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">devel</code>. Inside the <code class="language-plaintext highlighter-rouge">devel</code> folder, there are several <code class="language-plaintext highlighter-rouge">*.sh</code> files.</p>
<h2 id="source-the-setup-file">Source the setup file</h2>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/catkin_ws<span class="nv">$ </span><span class="nb">source </span>devel/setup.bash</code></pre></figure>
<p>To make sure everything is done correctly so far, do</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~/catkin_ws<span class="nv">$ </span><span class="nb">echo</span> <span class="nv">$ROS_PACKAGE_PATH</span>
/opt/ros/jade/share:/opt/ros/jade/stacks
tasdik@Acer:~/catkin_ws<span class="err">$</span></code></pre></figure>
<p>If that is the output at your prompt, you are good to go.</p>
<p>Till then Goodbye!</p>
<h3 id="references-">References :</h3>
<ul>
<li><a href="http://wiki.ros.org/ROS/Tutorials/InstallingandConfiguringROSEnvironment">http://wiki.ros.org/ROS/Tutorials/InstallingandConfiguringROSEnvironment</a></li>
</ul>
Apache2 : Virtual Hosts
2015-08-21T00:00:00+00:00
https://www.tasdikrahman.com/2015/08/21/Setup-Website-on-localhost-using-LAMP-Stack
<h2 id="what-do-we-need">What do we need?</h2>
<p>The two most common tools for this are the Apache and nginx servers.</p>
<h3 id="notes">Notes:</h3>
<p>You’ll need to edit a few system configuration files. If you’re uncomfortable with vim, replace <code class="language-plaintext highlighter-rouge">vim</code> with <code class="language-plaintext highlighter-rouge">nano</code> or <code class="language-plaintext highlighter-rouge">gedit</code> in the following commands. For example, <code class="language-plaintext highlighter-rouge">sudo vim</code> would become <code class="language-plaintext highlighter-rouge">sudo -H gedit</code> or <code class="language-plaintext highlighter-rouge">sudo nano</code>.</p>
<p>Once you’re done setting it up, have a look at “How to avoid using sudo when working in <code class="language-plaintext highlighter-rouge">/var/www</code>?”.
A more detailed guide is available from the Ubuntu LTS Server Guide.</p>
<h3 id="first-install-apache">First, install Apache:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>apt-get <span class="nb">install </span>apache2</code></pre></figure>
<p>The Apache configuration files are located in <code class="language-plaintext highlighter-rouge">/etc/apache2</code>. You’ll typically be interested in:</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">/etc/apache2/sites-available</code> - contains the Virtual Host definitions. Definitions are enabled and disabled using the <code class="language-plaintext highlighter-rouge">a2ensite</code> and <code class="language-plaintext highlighter-rouge">a2dissite</code>commands. The enabled site definitions are linked to <code class="language-plaintext highlighter-rouge">/etc/apache2/sites-enabled</code>.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">/etc/apache2/conf-available</code> - contains custom configuration files. They are enabled and disabled using the <code class="language-plaintext highlighter-rouge">a2enconf</code> and <code class="language-plaintext highlighter-rouge">a2disconf</code> commands.
The enabled site configuration files are linked to <code class="language-plaintext highlighter-rouge">/etc/apache2/conf-enabled</code>.</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">/var/www/html</code> - the default directory that Apache serves.</p>
</li>
<li>
<p>For most instructions, I’ll assume we are in <code class="language-plaintext highlighter-rouge">/etc/apache2</code>.</p>
</li>
</ul>
<h3 id="virtualhost-setup">VirtualHost setup</h3>
<p>Let us create a new site. There’s a default configuration available in <code class="language-plaintext highlighter-rouge">sites-available/000-default.conf</code>. We will make a copy of this, and work on it:</p>
<p>This is where you should be:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:/etc/apache2<span class="nv">$ </span><span class="nb">ls
</span>apache2.conf conf-available conf-enabled envvars magic mods-available mods-enabled ports.conf sites-available sites-enabled
tasdik@Acer:/etc/apache2<span class="nv">$ </span></code></pre></figure>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo cp </span>sites-available/000-default.conf sites-available/my-name.conf
tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>nano sites-available/my-name.conf</code></pre></figure>
<p>It should look something like this</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
</pre></td><td class="code"><pre><VirtualHost <span class="k">*</span>:80>
<span class="c"># The ServerName directive sets the request scheme, hostname and port that</span>
<span class="c"># the server uses to identify itself. This is used when creating</span>
<span class="c"># redirection URLs. In the context of virtual hosts, the ServerName</span>
<span class="c"># specifies what hostname must appear in the request's Host: header to</span>
<span class="c"># match this virtual host. For the default virtual host (this file) this</span>
<span class="c"># value is not decisive as it is used as a last resort host regardless.</span>
<span class="c"># However, you must set it for any further virtual host explicitly.</span>
ServerName myname.com
ServerAdmin webmaster@localhost
DocumentRoot /var/www/my-name
<span class="c"># Available loglevels: trace8, ..., trace1, debug, info, notice, warn,</span>
<span class="c"># error, crit, alert, emerg.</span>
<span class="c"># It is also possible to configure the loglevel for particular</span>
<span class="c"># modules, e.g.</span>
<span class="c">#LogLevel info ssl:warn</span>
ErrorLog <span class="k">${</span><span class="nv">APACHE_LOG_DIR</span><span class="k">}</span>/error.log
CustomLog <span class="k">${</span><span class="nv">APACHE_LOG_DIR</span><span class="k">}</span>/access.log combined
<span class="c"># For most configuration files from conf-available/, which are</span>
<span class="c"># enabled or disabled at a global level, it is possible to</span>
<span class="c"># include a line for only one particular virtual host. For example the</span>
<span class="c"># following line enables the CGI configuration for this host only</span>
<span class="c"># after it has been globally disabled with "a2disconf".</span>
<span class="c">#Include conf-available/serve-cgi-bin.conf</span>
</VirtualHost>
<span class="c"># vim: syntax=apache ts=4 sw=4 sts=4 sr noet</span>
</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="save-the-file-and-enable-it">Save the file, and enable it:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>a2ensite my-name</code></pre></figure>
<p>Now, we need to set up the directory for the site:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo mkdir</span> /var/www/my-name</code></pre></figure>
<p>We’ll set permissions for convenience:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo chown</span> <span class="nv">$USER</span>:www-data /var/www/my-name
tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo chmod </span>g+s /var/www/my-name</code></pre></figure>
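<p>The <code class="language-plaintext highlighter-rouge">g+s</code> (setgid) bit is what makes files created inside the directory inherit its group (<code class="language-plaintext highlighter-rouge">www-data</code> here) instead of the creating user’s group. A throwaway demo of the mode bits:</p>

```shell
# chmod g+s sets the setgid bit, which shows up as a leading 2 in the
# octal mode; demonstrated on a temp directory.
d=$(mktemp -d)
chmod 775 "$d"
chmod g+s "$d"
stat -c '%a' "$d"   # prints 2775: the 2 is the setgid bit
```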
<p>Add a few HTML files here.</p>
<p>Since the virtual host is to run locally, we need to map myname.com to a local address. To do this, we need to edit <code class="language-plaintext highlighter-rouge">/etc/hosts</code>:</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>nano /etc/hosts</code></pre></figure>
<p>It should look something like this</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>127.0.0.1 localhost
127.0.1.1 Acer
127.0.0.2 myname.com myname
<span class="c"># The following lines are desirable for IPv6 capable hosts</span>
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="save-and-then-restart-apache">Save, and then restart Apache:</h3>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">tasdik@Acer:~<span class="nv">$ </span><span class="nb">sudo </span>service apache2 restart</code></pre></figure>
<p>Now, you can browse to <code class="language-plaintext highlighter-rouge">http://myname.com</code> or <code class="language-plaintext highlighter-rouge">http://myname</code>, and the contents of <code class="language-plaintext highlighter-rouge">/var/www/my-name</code> will be displayed.</p>