November 1, 2013
Lying with charts for fun & profit

A few months ago I discovered that Wikipedia provides detailed hourly data dumps of how many pageviews each article gets, and the former political science major in me quickly sprang into action. I wanted to look at article traffic for candidates during the run-up to the 2012 election; I figured I would find all sorts of interesting patterns and glean new insight into American politics and information-seeking behavior. It was going to be great. As usual, I was wrong.

Before I could even investigate the data, I had to jump through a few hoops. The hourly dumps include EVERY Wikimedia page in one giant tab-separated list, so you’re talking about terabytes of data in total just to grab a very short list of presidential and Senate candidates. It also turns out that, shockingly, some major-party Senate candidates from the 2012 election don’t even have Wikipedia articles. To further muck things up, because daylight saving time ends during the campaign, you have to do some time-shifting to get all the hours to line up.
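For the curious, here is a minimal sketch of those two wrangling steps. It is not the exact code I used. It assumes each dump line starts with a project code, a page title, and a view count separated by whitespace, and that the dump timestamps are in UTC; the function names `filter_titles` and `to_eastern` are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: dump timestamps are in UTC. US daylight saving time ended at
# 2:00 a.m. EDT on November 4, 2012, which is 06:00 UTC.
DST_END_UTC = datetime(2012, 11, 4, 6)


def filter_titles(lines, wanted, project="en"):
    """Yield (title, views) for the handful of articles we care about.

    Assumed line layout: project, title, view count, then any other
    fields, separated by whitespace (tabs or spaces).
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[0] == project and fields[1] in wanted:
            yield fields[1], int(fields[2])


def to_eastern(utc_hour):
    """Shift a naive UTC hour to Eastern wall-clock time.

    Hard-coded for the 2012 campaign window only: UTC-4 before the
    DST switch, UTC-5 after it.
    """
    hours = -4 if utc_hour < DST_END_UTC else -5
    return utc_hour + timedelta(hours=hours)
```

Note that the two UTC hours on either side of the switch both come out as 1:00 a.m. Eastern on November 4, which is exactly the sort of thing that makes hourly series misalign. For anything beyond this one campaign, a real time zone database is a better bet than a hard-coded cutoff.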

Once the data wrangling is done, if you plot the hourly pageviews for Romney (in red) and Obama (in blue) as a stacked area chart, it looks like this:

image

You see certain spikes there that line up with key live events.

image

OK, so this is mildly interesting.  The story here seems to be that people run to their computers to look up the candidates during the debates, on election day, and during the conventions, when something is happening right at that moment on TV.  The disparity between the activity during the GOP convention and the Democratic convention makes some sense, since Obama is more of a known quantity.  And if you zoom in on the conventions, you see that everyone is looking up Romney during the GOP convention, but it’s about 50/50 for the Democratic convention:

image

image

But what if we take the same data and aggregate it by day instead of by hour?

image

Now the story looks quite different.  The conventions and debates are really just blips.  All the action is on election day.  Actually, most of it is the day AFTER election day, East Coast time, because the big traffic rush comes during Obama’s victory speech, which took place after midnight Eastern Time.
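Here is a rough sketch of that hourly-to-daily rollup, to show how the day boundary decides which day gets the credit. The function name `daily_totals` is hypothetical, the fixed UTC-5 offset assumes Eastern Standard Time (in effect for election week), and the view counts in the usage comment are invented, not real data.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_totals(hourly, utc_offset_hours=-5):
    """Sum (utc_hour, views) pairs into per-day totals in local time.

    utc_offset_hours is a fixed offset; -5 is Eastern Standard Time.
    """
    totals = Counter()
    for utc_hour, views in hourly:
        local = utc_hour + timedelta(hours=utc_offset_hours)
        totals[local.date()] += views
    return dict(totals)


# Invented numbers: one hour late on election night, one after midnight.
# hourly = [(datetime(2012, 11, 7, 3), 100),   # 10 p.m. EST, Nov. 6
#           (datetime(2012, 11, 7, 6), 500)]   # 1 a.m. EST, Nov. 7
# daily_totals(hourly) puts 100 views on Nov. 6 and 500 on Nov. 7.
```

Run the same pairs with `utc_offset_hours=0` and both hours land on November 7, so the choice of time zone alone changes which day looks biggest.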

We could also plot the data as cumulative traffic instead:

image

Now it mostly just looks like a slow and steady climb, with Romney getting somewhat more traffic up until election day, when Obama’s numbers get a gentle bump.
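The cumulative version is just a running total of the same series. A minimal sketch, with a hypothetical helper name and invented numbers:

```python
from itertools import accumulate


def cumulative(views):
    """Return the running total of a sequence of per-period view counts."""
    return list(accumulate(views))


# Invented daily counts with one big election-day spike at the end.
# cumulative([10, 50, 20, 400]) gives [10, 60, 80, 480]
```

A spike that dominates the daily chart becomes one steeper step in the running total, which is why the cumulative chart reads as a steady climb.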

These three charts are in some sense showing the same data, but the immediate takeaways are quite different.

As another quick example, let’s look at a line chart of the same pageview data for 2012 Senate candidates:

image

This looks a bit different.  There are two massive spikes, and everything else is tiny by comparison.  It turns out both one-hour spikes belong to Elizabeth Warren, the now-senator from Massachusetts, who spoke at the Democratic convention.

image

This chart seems to tell the story that Warren had two breakout moments where lots of people were looking into her online, and the rest of the Senate field was quiet (including Ted Cruz, who spoke at the Republican convention but didn’t draw nearly the same amount of traffic).  But what about the little sawtoothed pile that starts around August 20?

image

If we try aggregating by day, as with the presidential election, we get the answer:

image

Oh right, that guy.  When Todd Akin, the Republican Senate candidate from Missouri, made his ill-advised comments, a lot of people apparently ran to their computers to look up who he was.  But unlike Warren’s convention speech, it wasn’t a second-screen, live-TV sort of moment.  It was news that spread more gradually, over the course of about two days.

We also see that many other candidates got some attention on election day.  The person with the biggest daily peak turns out not to be Warren, but rather Tammy Baldwin from Wisconsin, now the first openly gay US senator.  She didn’t make waves during the campaign, but her historic election brought a bunch of curious Wikipedia viewers after the polls closed.

image

Had the “days” been grouped on a cutoff besides midnight Eastern Time, so that the late-night election speeches and results were grouped in with the day before, we would have seen yet another story.  We could also look at total pageviews by candidate and get a different impression:

image
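Going back to the day-cutoff point for a moment: moving the boundary is a one-line change, which is part of what makes it easy to overlook. A sketch, assuming timestamps already shifted to Eastern time; the helper name `day_bucket` and the 4 a.m. cutoff are illustrative choices, not something I settled on for the charts above.

```python
from datetime import datetime, timedelta


def day_bucket(local_time, cutoff_hour=0):
    """Return the date a local timestamp is counted under.

    With cutoff_hour=0 the day rolls over at midnight. With cutoff_hour=4,
    anything before 4 a.m. is counted with the previous day.
    """
    return (local_time - timedelta(hours=cutoff_hour)).date()


# A 1 a.m. Eastern hour on the night of the election:
# day_bucket(datetime(2012, 11, 7, 1))                 -> November 7
# day_bucket(datetime(2012, 11, 7, 1), cutoff_hour=4)  -> November 6
```

With a 4 a.m. cutoff, the late-night speeches and results would be counted as part of election day itself rather than the day after.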

And let’s not forget that Wikipedia traffic is far from a great proxy for information-seeking behavior generally.  It suffers from all kinds of biases.

So which of these charts is the accurate one?  Which one tells the story?  All of them?  None of them?

The lesson, as usual: data does not speak for itself.  It’s something you can mold into different forms, all of them “true,” none of them the whole truth.  The way you slice and scale things matters.  Context matters.  Even something as prosaic as time zones can have a big impact on what story comes out of your work.  Always think carefully about what your data is and is not telling you.

A more detailed version of the presidential pageview chart is available here.

2:51pm  |   URL: http://tmblr.co/Zu2sptzCe-cw
  
Filed under: opennews 
Posted by veltman