Gathering California Sales Tax Rates With Node.js Part 2: Parsing CSV

This is part 2 in the series describing how the data for the California sales tax rates page. Now that we’ve downloaded the California sales tax CSV file, it’s time to parse the file.

Desired Results

The goal of parsing is to convert each row from a comma separated set of fields into a Javascript object with properties for each of the fields. Let’s look at the format of csv file we are parsing.  The file looks something like:

All Cities,,
City,Tax,County
Acampo,7.750%,San Joaquin
Acton,8.750%,Los Angeles
Adelaida,7.250%,San Luis Obispo
...
Zamora,7.250%,Yolo
Zenia,7.250%,Trinity
,,
"Rates effective July 1, 2012",,
"""*"" next to city indicates incorporated city",,

If we converted it by hand to a Javascript object, it would look something like:

var cities = [ { city: 'Acampo',  tax: '7.75%', county: 'San Joaquin' },
               { city: 'Acton',   tax: '8.75%', county: 'Los Angeles' },
               ...];

Our goal is to have the program generate this structure for us.

Splitting the CSV into fields

Let’s start with the ‘csv’ part of the problem by breaking up one row into the comma separated parts.  We’ll start with the Acampo row:

var rowString = 'Acampo,7.750%,San Joaquin';

To split this string on commas, we’ll use the appropriately named String.split() method:

var rowParts = rowString.split(',');

At this point, rowParts contains:

[ 'Acampo', '7.750%', 'San Joaquin' ]

Now that we have the parts, we can build up an object that contains these fields:

var city = { city: rowParts[0], tax: rowParts[1], county: rowParts[2] };

Throw all these together and print out the object at the end to make sure it’s what we expect:

var rowString = 'Acampo,7.750%,San Joaquin';
var rowParts = rowString.split(',');

var city = { city: rowParts[0], tax: rowParts[1], county: rowParts[2] };

console.log(city);

And node will print out:

{ city: 'Acampo', tax: '7.750%', county: 'San Joaquin' }

Another look at building the object

Our code works great and for California is fairly small since we only have 3 fields. For other data sets (and other state’s csv files) can have many more columns. Instead of building up the object manually, we can use the column headers to help build up the object. Recall the 2nd line of the csv file is:

City,Tax,County

Which is a comma separated list of the names (ignoring case) of the properties we set on our object. We’ll split those headers in an array the same way as before:

var headers = 'City,Tax,County'.split(',');

We could leave the case alone and update reference the fields as city.Tax instead of city.tax, but it’s not hard to update the names to be lowercase by using the String.toLowerCase() method. The trusty for loop would make short work of this problem, but let’s be fancy and use the Array.map(Funtion) method to apply toLowerCase to each item in the array. map will return a new array built up from the lower cased items:

headers = headers.map(function(header) {
                        return header.toLowerCase();
                      });

Now we have 2 arrays:

rowParts: [ 'Acampo', '7.750%', 'San Joaquin' ]
headers:  [ 'city', 'tax', 'county' ]

Using a simple loop we can zip these two arrays together to build one object where each field’s name (or key) comes from headers and value comes from rowParts. We’ll throw that in a helper function to make it easy to apply to other places. Note that we switch to using the object[key] syntax instead of object.key syntax since the key is variable:

var keyValuePairsToObject = function(keys, values) {
  var object = {};
  for (var i = 0; i < keys.length; i++) {
    object[keys[i]] = values[i];
  }
  return object;
};

var city = keyValuePairsToObject(headers, rowParts);

Combining everything together:

var headers = 'City,Tax,County'.split(',');
headers = headers.map(function(header) {
                        return header.toLowerCase();
                      });

var rowString = 'Acampo,7.750%,San Joaquin';
var rowParts = rowString.split(',');

var keyValuePairsToObject = function(keys, values) {
  var object = {};
  for (var i = 0; i < keys.length; i++) {
    object[keys[i]] = values[i];
  }
  return object;
};

var city = keyValuePairsToObject(headers, rowParts);

console.log(city);

Produces the same result as before:

{ city: 'Acampo', tax: '7.750%', county: 'San Joaquin' }

Beyond Acampo

It’s time to apply everything we’ve built so far to generate an array of all the rates for California. We will fill in the content of the httpGet callback from the last post:

httpGet('www.boe.ca.gov','/sutax/files/city_rates.csv', function(data) {
  // Process the data
});

First thing to do is split up the text of the file into rows. We use the same split() method as before but we’ll split on the new line character sequence ('\r\n'):

  var rows = data.split('\r\n');

We don’t need the first line of the csv file since it is neither a city or our headers. We can shift() the rows array to remove it:

  rows.shift(); // Shift off the 1st line from the file

Instead of hardcoding the headers we can get it by getting the first line left in rows. Since we don’t need it to be in the list once we pull it out, we can use shift here as well.

  var headers = rows.shift().split(',');
  headers = headers.map(function(header) {
                          return header.toLowerCase();
                        });

The next item in rows is the first city – Acampo. We can start gathering our rates into a cities array now. We loop over the rows and build our objects for each row. The Array.push method adds the object to the end of cities. We’ll stop when we hit an item that does not have a rate to catch the last couple lines of the file that do not represent cities.

  var cities = [];

  for (var i = 0; i < rows.length; i++) {
    var rowString = rows[i];
    var rowParts = rowString.split(',');

    var city = keyValuePairsToObject(headers, rowParts);

    // Stop if we hit an item with no rate (last couple lines of the file)
    if (!city.tax) {
      break;
    }

    cities.push(city)
  }

The complete program so far:

var http = require('http');

var httpGet = function(host, path, callback) {
  // fill in the options for the specified path
  var options = {
    host: host, // 'www.example.com'
    path: path, // '/path/to/file.txt'
    port: 80
  }
  http.get(options, function(res) {
    var data = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) {
      // Append the chunk to the data downloaded so far
      data += chunk;
    });
    res.on('end', function(){
      // All done downloading. Give the data to the caller.
      callback(data);
    });
  }).on('error', function(e) {
    throw Error('Error getting http://' + host + path + ' ' + e);
  });
};

// Convert parallel arrays of key/value pairs into an object
var keyValuePairsToObject = function(keys, values) {
  var object = {};
  for (var i = 0; i < keys.length; i++) {
    object[keys[i]] = values[i];
  }
  return object;
};

httpGet('www.boe.ca.gov','/sutax/files/city_rates.csv', function(data) {
  var rows = data.split('\r\n');
  rows.shift(); // Shift off the 1st line from the file

  var headers = rows.shift().split(',');
  headers = headers.map(function(header) {
                          return header.toLowerCase();
                        });

  var cities = [];

  for (var i = 0; i < rows.length; i++) {
    var rowString = rows[i];
    var rowParts = rowString.split(',');

    var city = keyValuePairsToObject(headers, rowParts);

    // Stop if we hit an item with no rate (last couple lines of the file)
    if (!city.tax) {
      break;
    }

    cities.push(city)
  }
  console.log(cities);
});

Running it shows we’ve built our array as planned:

[ { city: 'Acampo', tax: '7.750%', county: 'San Joaquin' },
  { city: 'Acton', tax: '8.750%', county: 'Los Angeles' },
  ...
  { city: 'Zamora', tax: '7.250%', county: 'Yolo' },
  { city: 'Zenia', tax: '7.250%', county: 'Trinity' } ]

Wrapping up

In this post we took a raw string of csv text downloaded from the internet, parsed it, and built up a convenient Javascript representation of the data. Unfortunately, the resulting data is heavily tied to the original format of the csv file. As we gather the data for more states, we need to standardize the resulting data into a consistent format within the program so we can build one view on top of the 50 different State data sources. Join us in our next post as we dive into formatting this data in a consistent way.

Leave a comment