JSON on the Command Line

Most of the apps I work on get data from a server encoded as JSON, but playing around with JSON in Excel or the command line isn’t straightforward. At least that’s what I thought until I really started looking around for options. I’ve tried doing this in the past, and maybe I just didn’t Google the right things, because I had more success this time! 😁

I wanted to see if I could transform some data from Slack, and figured it would be an opportunity to try doing it in a shell. Specifically, I tried to get the distribution of messages sent by each user. I’ve ignored identifying user data so that I didn’t draw conclusions about anyone in particular, because there are a lot of factors that could lead to some people sending more messages than others. With that out of the way, let’s get started.

Overview

The tools I use here are written in a few different languages that you’ll need to have installed first, so some of the simplicity you’d expect from doing this in a shell is lost to the dependencies. There are almost certainly other ways to do what I’ve done here, and I’d love to know about ways to improve it. You’ll need Go, Node.js, Homebrew and likely some other things. I won’t cover how to get all of this set up, but searching for those things will bring you to some great resources. Any program I mention without saying how to install it should already be available on any *nix or OS X computer.

What you’ll do:

  1. Get the data from Slack.
  2. Reformat it so that it’s easier to work with on the command line.
  3. Look at different ways to count the data.
  4. Plot the data for a particular month in a chart.

At any point you can try changing the parameters to these commands to see how that changes the data that’s output. There’s no wrong way to do this!

Fetching channel archives

I used slack-dump to get the archives of a particular channel. As usual, things are sometimes in flux: when I was working on this I had to use @harperreed’s fork because it had a fix for some dependency changes. By the time you try it, those changes might already be merged in.

go get github.com/joefitzgerald/slack-dump

Run this to get a .zip containing channel.json, users.json and a directory of dated json files for each day of activity in the channel. If you feel like it you can delete users.json so you’re just working with user IDs.

slack-dump -t=SLACK_TOKEN CHANNEL

Then cd into the channel directory containing all of the daily JSON files for the rest of this.

Appending a date field to each message in all of the json files

This will make it easier once you convert the JSON to CSV. The messages do have a timestamp, but this was quicker. Fiddle with parsing the timestamp if you need more info, but I was looking for coarse month and all-time data.

This command gets the date from each filename and then inserts that as a property into each message in the file’s JSON. This is one of the things that I didn’t think would be possible from the command line.

First you’ll need the json program:

npm install -g json

And then run the command:

ls -1 | sed 's/\..*//' | sort | uniq | ( while read filename ; do echo $filename; json -I -e "this.date='$filename'" -f "$filename.json"; done )

Breaking it down:

  • ls -1 (that’s the number one) lists the files in the directory one per line. When its output is piped, ls does this anyway, but it doesn’t hurt to be explicit
  • sed 's/\..*//' strips the file extensions from the files, leaving just the date in “YYYY-MM-DD” format
  • sort sorts the dates
  • uniq leaves only unique entries. It determines this based on entries beside each other, which is why it helps to sort first
  • ( while read filename ; do echo $filename; json -I -e "this.date='$filename'" -f "$filename.json"; done ) opens a subshell, gets the date as the filename and then uses the json program to set each message’s date property to the filename
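If you want to see what the filename half of that pipeline produces before letting it loose on real data, you can try it in a scratch directory (the filenames here are made up):

```shell
# Scratch directory with fake daily dump files (hypothetical names).
dir=$(mktemp -d)
cd "$dir"
touch 2015-08-28.json 2015-08-29.json 2015-08-28.json.bak
# Same extraction as above: strip everything from the first dot onward.
dates=$(ls -1 | sed 's/\..*//' | sort | uniq)
echo "$dates"
```

The .bak file collapses to the same date as its sibling, which is exactly the kind of duplication that sort | uniq cleans up.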

Merging the JSON files

In order to work with the data for a month, it’s best if the messages for each day are in a single file. We can use the json program again for this; it’s pretty versatile.

cat *.json | json -g  > merged.json
  • cat *.json concatenates all of the JSON files in this directory to standard output
  • json -g groups the adjacent JSON arrays (each day file contains an array of messages) into a single array. Some of json’s argument names are a bit ambiguous; in this case you want group, not merge.
  • > merged.json redirects the merged output into a new file
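If you’re curious what grouping does but don’t have the json tool handy, here’s a crude approximation for single-line arrays: join adjacent arrays by turning each ][ boundary into a comma. (This is not robust JSON handling; json -g does it properly. The filenames and contents here are made up.)

```shell
# Two fake day files, each holding a single-line JSON array.
dir=$(mktemp -d)
cd "$dir"
printf '[{"user":"U1"},{"user":"U2"}]\n' > day1.json
printf '[{"user":"U3"}]\n' > day2.json
# Strip newlines, then turn every "][" boundary into a comma.
# Crude: this breaks if any string value happens to contain "][".
merged=$(cat day1.json day2.json | tr -d '\n' | sed 's/]\[/,/g')
echo "$merged"
```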

Convert to CSV

At this point the JSON is in a better structure for what we want to do, but it’s possible to take advantage of more command line programs if the data is in CSV format. It’s easier than I thought to convert this from the command line, but when I was first looking for solutions I came across http://konklone.io/json/, which would do great in a pinch or if you aren’t as comfortable with the command line.

You’ll need two more programs for this part:

brew install jq
go get github.com/jehiah/json2csv
jq -c '.[] | {date: .date, user: .user}' merged.json | json2csv -k date,user -o merged.csv
  • jq -c '.[] | {date: .date, user: .user}' merged.json will transform all of the message objects to a new object with just date and user fields. The -c argument outputs each new object on its own line, which is what json2csv prefers
  • json2csv -k date,user -o merged.csv converts the output from jq to CSV and puts it in a new file

Result:

2015-08-28,U0XXXXXX1
2015-08-28,U0XXXXXX1
2015-08-29,U0XXXXXX2
2015-09-01,U0XXXXXX2
...

Convert to tab-delimited data

This is optional, but I found a lot of the other commands I was running would end up substituting tabs or spaces for commas anyway, so I just put it in a file ahead of time.

cat merged.csv | tr ',' '\t' > merged
  • tr ',' '\t' will translate any occurrences of commas into tabs.

Result:

2015-08-28 U0XXXXXX1
2015-08-28 U0XXXXXX1
2015-08-29 U0XXXXXX2
2015-09-01 U0XXXXXX2
...
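Note that tr substitutes character by character, so it would mangle a field that itself contained a comma; dates and user IDs are safe. A quick check on one made-up row:

```shell
# One fake CSV row; tr swaps every comma for a tab.
row=$(printf '2015-08-28,U0XXXXXX1\n' | tr ',' '\t')
echo "$row"
```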

Now that we have this merged CSV data there are a few ways we can work with it:

1. Message count by user by day

cat merged | sort -k1,1 -k2,2 | uniq -c | sort -k2,2 -k1,1 -r
  • sort -k1,1 -k2,2 sorts by the first column (date) and then the second column (user ID)
  • uniq -c combines similar adjacent rows and adds a count column at the start of each line. This will be the number of messages each user sent per day.
  • sort -k2,2 -k1,1 -r sorts by the second column (date), then the first column (count) and then reverses the order so that the latest date is at the top and each user’s message count is in descending order.

Result:

21 2015-03-31 U0XXXXXX1
6  2015-03-31 U0XXXXXX2
6  2015-03-31 U0XXXXXX3
...
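To see the count-and-sort steps on their own, here’s the same pipeline fed a few fake tab-separated rows (made-up dates and IDs):

```shell
# Two messages from U1 and one from U2 on the same fake day.
rows=$(printf '2015-03-31\tU1\n2015-03-31\tU1\n2015-03-31\tU2\n')
counted=$(echo "$rows" | sort -k1,1 -k2,2 | uniq -c | sort -k2,2 -k1,1 -r)
echo "$counted"
```

uniq -c pads its counts to a fixed width, which is why the plain (non-numeric) sort on the count column still comes out in the right order here.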

2. Message count by user by month

This doesn’t build off of the last step but instead converts the dates to be only the year and month. Note that this gets put in a new file for “by month” data that we’ll use to make a chart later.

cat merged | awk '{ printf "%s-%s %s\n", substr($1, 1, 4), substr($1, 6, 2), $2 }' > user_messages_by_month
  • awk '{ printf "%s-%s %s\n", substr($1, 1, 4), substr($1, 6, 2), $2 }' reformats the dates as YYYY-MM and ignores the day

Result:

2015-08 U0XXXXXX1
2015-08 U0XXXXXX1
2015-08 U0XXXXXX2
2015-09 U0XXXXXX2
...
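awk’s substr is 1-indexed, which is easy to trip over: substr($1, 1, 4) grabs the year and substr($1, 6, 2) the month. A quick check on one fake row:

```shell
# Reformat a made-up date/user row to YYYY-MM.
line=$(printf '2015-08-28\tU0XXXXXX1\n' \
  | awk '{ printf "%s-%s %s\n", substr($1, 1, 4), substr($1, 6, 2), $2 }')
echo "$line"
```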

This is the same command as counting by day, but run against the month data. Piping through tee both prints the result and saves it to user_count_by_month, which the plotting step uses later.

cat user_messages_by_month | sort -k1,1 -k2,2 | uniq -c | sort -k2,2 -k1,1 -r | tee user_count_by_month

Result:

163 2015-04 U0XXXXXX1
77  2015-04 U0XXXXXX2
38  2015-04 U0XXXXXX3
...

3. Message count by user for all time

This will count and sort the message count for users for the entire history of the channel.

cat merged | cut -f 2 | sort | uniq -c | sort -r | awk '{ if($2!="") print $1 "\t" $2 }'
  • cut -f 2 selects the second field (user ID) from the data
  • sort sorts the user IDs. There are still duplicates at this point so those rows will be adjacent after this.
  • uniq -c combines the matching adjacent rows and adds a count column to the start of each line.
  • sort -r sorts those rows by the count column and reverses the order so it’s descending.
  • awk '{ if($2!="") print $1 "\t" $2 }' prints only the rows with user IDs.

Result:

851 U0XXXXXX1
410 U0XXXXXX2
191 U0XXXXXX3
...
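As an aside, the same all-time tally can be done in a single awk pass, which skips the first sort entirely. This is an alternative to the pipeline above, not what it does; the rows here are made up:

```shell
# Count messages per user in one pass over tab-separated date/user
# rows, skipping rows with no user, then sort counts numerically.
totals=$(printf '2015-08-28\tU1\n2015-08-29\tU1\n2015-08-29\tU2\n' \
  | awk -F'\t' '$2 != "" { n[$2]++ } END { for (u in n) print n[u] "\t" u }' \
  | sort -rn)
echo "$totals"
```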

Plot each user’s message count for a particular month

brew install gnuplot

I had some issues with gnuplot after running this, something about liblua, which I was able to fix by running brew link lua and brew reinstall gnuplot. I’m not sure if this was because of a failed install of lua from my past or if it would happen to anyone else.

Now create a file called plot.gp and add the following code to it:

set terminal png
set xtics rotate by -45
set output "plot.png"
plot "<cat" using 1:xticlabels(3) with boxes
  • set terminal png tells gnuplot to render a PNG image; gnuplot has a bunch of options for output, including printing straight to standard output
  • set xtics rotate by -45 rotates the x-axis labels by -45 degrees, since otherwise they were overlapping
  • set output "plot.png" saves the created PNG file with that filename
  • plot "<cat" using 1:xticlabels(3) with boxes will create a box chart with data from standard input instead of a file.
cat user_count_by_month | grep '2015-04' | gnuplot plot.gp; open plot.png
  • cat user_count_by_month prints the file contents to standard output
  • grep '2015-04' prints only those lines that contain 2015-04, the particular month that’ll be plotted
  • gnuplot plot.gp runs the plot script against the piped-in data; the semicolon just ends this command so that open runs afterwards
  • open plot.png opens the created PNG file

The resulting plot should open up in Preview and look something like this, although with different values for each bar. I’ve also redacted user IDs.
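If you want to dress the chart up, gnuplot also has settings for a title, axis labels and box styling. A variant of plot.gp (optional, same output file; the title text is just an example):

```gnuplot
set terminal png
set xtics rotate by -45
set output "plot.png"
set title "Messages per user, 2015-04"
set ylabel "Message count"
set boxwidth 0.8
set style fill solid
plot "<cat" using 1:xticlabels(3) with boxes notitle
```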

You can see here the distribution of the number of messages that each user in this channel had posted during a particular month. And that’s what we set out to accomplish! It took a fair number of steps to do this, but because it was done on the command line it is easy to see how the different tools were composed to accomplish the task.
