Power BI Analytics Challenge: Part 2 – Processing Data from a PDF


Note: Power BI now has a native PDF connector. This post was originally written in May 2017. Please use this for reference only.

After selecting an initial challenge category for the Power BI Analytics Challenge, it was time to get to work. I chose to visualize data from the American Time Use Survey, which is available from the Bureau of Labor Statistics. The ATUS data consists of numerous CSV files, but another challenge is making sense of column names and their meanings. In addition, data dictionaries available from BLS are in PDF format, which is not easily imported into Power BI.

As a first step, I used Tabula to quickly extract the tables from the Activity Lexicon. The lexicon is a PDF file that contains the dictionary of activity codes associated with various activities in other ATUS files. With Tabula, it’s a few steps to import the lexiconnoex0315.pdf file, autodetect tables, and export to CSV.

[Screenshot: Tabula with the ATUS Activity Lexicon PDF loaded and tables auto-detected]

[Screenshot: Exporting the detected tables from Tabula to CSV]

Once the Activity Lexicon data is available in CSV format, all data wrangling can be done in Power BI. There are a few problems to solve, however, before the lexicon is in a usable format.

The primary issue is that the hierarchy columns (Major category and First and second tier categories) have gaps that need to be filled. The major category and first/second tier categories should display on the same rows as their child activity codes. The numeric activity codes (e.g., “010101” for Sleeping) are the lowest level and will be the keys that join the Activity Lexicon to other ATUS files such as the Activity Summary (atussum.csv).

Here are the steps I took to get to a tidy, usable format. Everything was done through the Power BI Desktop user interface, and I then went back and defined my FilePath in the Advanced Editor. This allows anyone to substitute their own local file path more easily if they use the sample script at the bottom of this post.

[Screenshot: FilePath defined in the Power Query Advanced Editor]

I provided the location of my FilePath and the Source CSV file. This simply used Get Data, then Text/CSV, and I browsed to my Tabula CSV output file.
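In the Advanced Editor, that source step looks roughly like this (the folder path and export file name below are placeholders; substitute your own):

```
let
    // Placeholder path; point this at the folder holding your Tabula export
    FilePath = "C:\ATUS\",
    Source = Csv.Document(File.Contents(FilePath & "lexiconnoex0315.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv])
in
    Source
```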

At this stage, the data appeared like so:

[Screenshot: the raw lexicon data as first imported]

I then removed the top row and promoted the header.
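These two UI actions correspond to the following steps in the query's let expression (step names here are my own):

```
RemovedTopRows = Table.Skip(Source, 1),
PromotedHeaders = Table.PromoteHeaders(RemovedTopRows, [PromoteAllScalars = true])
```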

[Screenshot: removing the top row and promoting the header row]

I then used Filter Rows on the first column to remove the two recurring page header rows (“ATUS 2003-2015 Activity coding lexicon” and “Major category”).
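A sketch of that filter step, assuming the first column was promoted to the header “Major category” (your Tabula export's header may differ):

```
FilteredHeaders = Table.SelectRows(PromotedHeaders,
    each [Major category] <> "ATUS 2003-2015 Activity coding lexicon"
     and [Major category] <> "Major category")
```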

[Screenshot: filtering out the recurring page header rows]

The original CSV from Tabula uses empty double quotes for blank values instead of only commas. This is a problem for Power BI when it comes to filling the gaps in the first two category columns, because the Fill function only fills over explicit null values, not empty strings. I therefore had to replace the blank “” values with null. In the Replace Values dialog, typing the word null as the replacement produces an actual null rather than the literal text.
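Expressed in M, that replacement can be applied across every column in one step (a sketch; step names are my own):

```
// Tabula exports blanks as empty strings; Fill Down needs real nulls
ReplacedBlanks = Table.ReplaceValue(FilteredHeaders, "", null,
    Replacer.ReplaceValue, Table.ColumnNames(FilteredHeaders))
```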

[Screenshot: replacing blank “” values with null]

At this stage, I was ready to use Fill Down with the first two columns.
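The corresponding step, assuming the two category columns carry these headers (adjust to match your export):

```
FilledDown = Table.FillDown(ReplacedBlanks,
    {"Major category", "First and second tier categories"})
```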

[Screenshot: Fill Down applied to the two category columns]

After filling the gaps in the two category columns, I could use Filter Rows to remove the blank records from the 6-digit activity code column.
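As a let-expression step, with the activity code column name assumed:

```
FilteredCodes = Table.SelectRows(FilledDown,
    each [#"6-digit activity code"] <> null)
```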

[Screenshot: filtering out rows with no 6-digit activity code]

As a final step, I Removed Columns. The last two columns are for notes, and I chose not to import them.
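A sketch of that removal step; the two notes column names here are hypothetical, so check your own headers:

```
RemovedNotes = Table.RemoveColumns(FilteredCodes, {"Notes 1", "Notes 2"})
```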

[Screenshot: removing the two notes columns]

Note that while I removed the notes, that does not make them any less important. Over time, these ATUS activity codes have evolved. Rather than sort through and manually account for codes that only exist in the notes and not in the 6-digit activity code column, I made the decision to only report on activities from more reliable later years rather than go back to 2003. This means that later on, when it comes to importing other ATUS CSV files, I will always filter them to 2008 data or later. There are still some small gaps where codes in the response data will not match the separate activity code table, but this seems to be minimal after 2008. If this were not a personal project, or if I chose to focus on overall duration rather than duration by individual codes or categories, I would try to better account for this history. For an “analytics challenge” that has more flexibility, though, I’m fine with this decision.

[Screenshot: the notes columns from the Activity Lexicon]

With everything filled and filtered, the Activity Lexicon is now in better shape to use for an American Time Use Survey report in Power BI.

Here is my M / Power Query script. Feel free to use it in your own efforts if needed.
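Assembled end to end, the query looks roughly like this. The file path, step names, and column names are placeholders and assumptions; your Tabula export's headers may differ, so treat this as a sketch of the steps above rather than a drop-in script:

```
let
    // Placeholder path; substitute your own folder
    FilePath = "C:\ATUS\",
    Source = Csv.Document(File.Contents(FilePath & "lexiconnoex0315.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
    // Drop the leading row, then promote the next row to headers
    RemovedTopRows = Table.Skip(Source, 1),
    PromotedHeaders = Table.PromoteHeaders(RemovedTopRows, [PromoteAllScalars = true]),
    // Remove the two recurring page header rows
    FilteredHeaders = Table.SelectRows(PromotedHeaders,
        each [Major category] <> "ATUS 2003-2015 Activity coding lexicon"
         and [Major category] <> "Major category"),
    // Tabula exports blanks as empty strings; Fill Down needs real nulls
    ReplacedBlanks = Table.ReplaceValue(FilteredHeaders, "", null,
        Replacer.ReplaceValue, Table.ColumnNames(FilteredHeaders)),
    // Fill the category hierarchy down onto each activity-code row
    FilledDown = Table.FillDown(ReplacedBlanks,
        {"Major category", "First and second tier categories"}),
    // Keep only rows that carry a 6-digit activity code
    FilteredCodes = Table.SelectRows(FilledDown,
        each [#"6-digit activity code"] <> null),
    // Drop the two trailing notes columns (names assumed)
    RemovedNotes = Table.RemoveColumns(FilteredCodes, {"Notes 1", "Notes 2"})
in
    RemovedNotes
```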

-David
