Command line tools are awesome. Every time I process data with unix tools, I can't help but be grateful for how easy the entire job turns out to be. Almost any kind of processing can be done with straight out-of-the-box unix utilities, without having to write any code. Command line solutions are elegant, efficient, and time-saving.
Need to do some simple analytics on your web logs? A combination of grep, sort, and uniq is probably all that you need to get the job done. Need to do some complex heavy lifting with a slightly more exotic dataset than server logs? Throw a bit of xargs into the mix.
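As a quick sketch of the log-analytics case, here is what such a pipeline might look like. The file name access.log and its single-space-separated "IP METHOD PATH" format are made up purely for illustration:

```shell
# Create a tiny sample log; this format is hypothetical, just for the demo.
printf '1.2.3.4 GET /a\n5.6.7.8 GET /b\n1.2.3.4 GET /c\n' > access.log

# Pull out the IP column, then count unique values and rank them by frequency.
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn
```

Point the cut invocation at whichever column matters for your logs and the same three-stage count-and-rank pattern applies unchanged.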
Deploying a Spark cluster is overkill in many situations, and if you find yourself processing a few GBs of data every other day, getting well acquainted with unix tools is one of the kindest things you can do for yourself. Big data tools like Spark even provide APIs to integrate processing jobs with command line applications.
Every once in a while though, I find myself in need of some extra functionality that requires me to write a bit of code. Let's take the hypothetical example of a large file containing a json record on every line. I'd like to keep only the records that contain a particular keyword, extract a couple of fields from each, join the two fields together, and write the new records out to a file.
The first part can be achieved with a simple invocation of grep, but what about the complex json handling? We'll definitely have to implement that functionality ourselves. This is where a quick and dirty python script comes in handy.
#!/usr/bin/env python3
import json
import sys

# Read json records from stdin, one per line
for line in sys.stdin:
    record = json.loads(line)
    field1 = record['firstname']
    field2 = record['lastname']
    output = field1 + " " + field2
    print(output)
Saving this file (as pythontool, say) in a directory on your PATH gives us easy access to the script and essentially turns it into a command line tool. (Don't forget to change the file permissions with chmod +x to make it executable!)
The nice thing is that we can simply plug this script into a unix pipe composition and it will work just like any other command line tool. If we want to make it more flexible, we could use something like the argparse module to pass in arguments specifying which fields we want to join together.
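A minimal sketch of that argparse idea, assuming a tool name (jsonjoin.py) and a positional fields argument that I've made up for illustration:

```shell
# Write a hypothetical argparse-driven version of the script; the name
# jsonjoin.py and its "fields" argument are invented for this sketch.
cat > jsonjoin.py <<'EOF'
#!/usr/bin/env python3
import argparse
import json
import sys

parser = argparse.ArgumentParser(description='Join fields from json records')
parser.add_argument('fields', nargs='+', help='field names to join together')
args = parser.parse_args()

for line in sys.stdin:
    record = json.loads(line)
    print(" ".join(record[f] for f in args.fields))
EOF

# Now the fields to join are chosen at the command line:
echo '{"firstname": "Ada", "lastname": "Lovelace"}' | python3 jsonjoin.py firstname lastname
# prints "Ada Lovelace"
```

The same binary then handles any pair (or triple) of fields without editing the source, which keeps the tool reusable across datasets.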
grep KEYWORD records.json | pythontool | sort | uniq -c | sort -n
The above command prints the unique names found in our json records along with their respective counts, sorted in ascending order of frequency.
If we keep the Unix philosophy in mind, we can build flexible, modular tools that play well with the other tools on the command line.