March 17, 2015

Reshaping data with underscore.js and groupBy

I never had much need for groupBy until now. While porting some old code, I decided to see how easy it would be to replace several for loops in one of our views with something more understandable.

Starting Data

Our starting data will be this sample below, where all unnecessary data has been removed. There is no error checking in the code, and it assumes the data is well formed.

var baseData = [
    { date: "20150201", status: "ok",    location: { country: "china" }  },
    { date: "20150203", status: "ok",    location: { country: "china" }  },
    { date: "20150201", status: "error", location: { country: "germany"} }
]

We want to transform this data into two forms for display:

  1. Grouped by date, discarding the date afterwards
  2. Grouped by country, keeping the country as the key

Creating a "reduce" function to sum the data

We will call this function after groupBy. It needs to take a list of data, and return a single object. I like to use a template object to cut down on existence checks in the reduce function.

Reduce calls the function it is given once for each item in the list, passing along the running result (the "memo") each time. We don't have to use reduce; a simple for loop would work too.

// we will make a new copy of this template for each call to reduceDataList
var templateData = { ok_count: 0, error_count: 0 };

var reduceDataList = function(list) {
    // give the inner function a name, to help with stack traces
    // return the result of reduce, otherwise callers get undefined
    return _(list).reduce(function __reduceDataList(memo, item) {
        if (item.status === "ok")    memo.ok_count += 1;
        if (item.status === "error") memo.error_count += 1;
        return memo;
    }, angular.copy(templateData));  // or _.clone / jQuery.extend; just copy it!
};

Here is the for loop version, which may look simpler. Probably faster too :D

var reduceDataList = function(list) {
    var memo = { ok_count: 0, error_count: 0 };
    // declare i and item with var so they don't leak into the global scope
    for (var i = 0; i < list.length; i++) {
        var item = list[i];
        if (item.status === "ok")    memo.ok_count += 1;
        if (item.status === "error") memo.error_count += 1;
    }
    return memo;
};

Running the whole list through the function should give this result:

reduceDataList(baseData) ==>
  {
      ok_count: 2,
      error_count: 1
  }

Now add the group by. In the first case, we don't care about the location, which lets us use the property name to do grouping. We also don't care about the group-by key, which is the "date" value.

var byDate = _(baseData)  // wrap in underscore (or lodash) object
    .chain()              // allow chaining of calls to _
    .groupBy("date")      // group by a property name (no nested properties)
    .map(reduceDataList)  // pass each group of items to the reduce function
    .value();             // return the value array
    
[ 
    { ok_count: 1, error_count: 1 },   // items from date: 20150201
    { ok_count: 1, error_count: 0 }    // items from date: 20150203
]
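
To see what map receives here, it can help to look at the intermediate object that groupBy produces. The sketch below mimics that step in plain JavaScript. groupByProperty is a hypothetical stand-in written for illustration, not underscore's implementation.

```javascript
// Same sample data as above.
var baseData = [
    { date: "20150201", status: "ok",    location: { country: "china" } },
    { date: "20150203", status: "ok",    location: { country: "china" } },
    { date: "20150201", status: "error", location: { country: "germany" } }
];

// Hypothetical helper: collects items into arrays keyed by item[propertyName].
function groupByProperty(list, propertyName) {
    var groups = {};
    for (var i = 0; i < list.length; i++) {
        var key = list[i][propertyName];
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
        }
        groups[key].push(list[i]);
    }
    return groups;
}

var groupedByDate = groupByProperty(baseData, "date");
// groupedByDate["20150201"] holds the first and third items,
// groupedByDate["20150203"] holds the second item.
```

Each of those arrays is what gets passed to reduceDataList, and the date keys are simply dropped by map.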

More complicated grouping

Grouping and turning the data into a named dictionary is a little more of a pain. We need to call reduceDataList manually, and overwrite the values in the object we get back from groupBy. We also care about the keys this time, so we can't ignore them like in the first example.

How is the map function called?

Take an example user:

var user = {
    'fav_colors': ['red', 'green'],
    'age': 32
};

If you call _.map(user, myFunction), your myFunction will be called once for every property on the object, with the value, the key, and the object itself as arguments. The order of the calls is not guaranteed, though, and may vary.

myFunction(32, 'age', user);
myFunction(['red', 'green'], 'fav_colors', user);
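
The call pattern above can be reproduced with a small plain-JavaScript sketch. mapOverProperties is a hypothetical helper written only to show the arguments each call receives; underscore's own map does more (for example, it also handles arrays).

```javascript
var user = {
    'fav_colors': ['red', 'green'],
    'age': 32
};

// Hypothetical helper: calls fn(value, key, object) once per own property
// and collects the return values into an array.
function mapOverProperties(obj, fn) {
    var results = [];
    var keys = Object.keys(obj);
    for (var i = 0; i < keys.length; i++) {
        results.push(fn(obj[keys[i]], keys[i], obj));
    }
    return results;
}

// Record the key that each call receives.
var seenKeys = mapOverProperties(user, function(value, key, obj) {
    return key;
});
// seenKeys contains 'fav_colors' and 'age', one entry per property.
```

Note that the result is always an array, even though the input was an object.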

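This is how we use it in our example, where we want to group by the country. Since country is nested inside location, we can't use a simple property name, so we pass groupBy a function instead. A chained map would hand back an array of its return values rather than the keyed object, so we keep the object from groupBy and overwrite its values with each, which calls our function with the same (value, key, object) arguments as map.

var byCountry = _.groupBy(baseData, function(item) {
    return item.location.country;
});

_.each(byCountry, function(groupedItemList, countryName, obj) {
    // overwrite the value with our newly calculated object
    obj[countryName] = reduceDataList(groupedItemList);
});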

Assuming all goes right, we should see the following value for byCountry:

// byCountry
{
    "china": { ok_count: 2, error_count: 0 },
    "germany": { ok_count: 0, error_count: 1 }
}
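
As a sanity check on the numbers, the same per-country counts can be computed without any library. This is only a sketch and, like the rest of the post, it assumes the data is well formed. countByCountry is a hypothetical name, not part of underscore.

```javascript
// Same sample data as above.
var baseData = [
    { date: "20150201", status: "ok",    location: { country: "china" } },
    { date: "20150203", status: "ok",    location: { country: "china" } },
    { date: "20150201", status: "error", location: { country: "germany" } }
];

// Hypothetical helper: groups by the nested country and counts statuses in one pass.
function countByCountry(list) {
    var result = {};
    for (var i = 0; i < list.length; i++) {
        var item = list[i];
        var country = item.location.country;
        // start each country from a fresh set of zeroed counters
        if (!Object.prototype.hasOwnProperty.call(result, country)) {
            result[country] = { ok_count: 0, error_count: 0 };
        }
        if (item.status === "ok")    result[country].ok_count += 1;
        if (item.status === "error") result[country].error_count += 1;
    }
    return result;
}

var counts = countByCountry(baseData);
// counts.china   is { ok_count: 2, error_count: 0 }
// counts.germany is { ok_count: 0, error_count: 1 }
```

The groupBy version reads more declaratively, while this one trades that for a single pass over the list.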