Last weekend at the Mozilla Festival, a group of journalists sat down to solve a murder mystery on the command line.
Each person got a set of folders containing text data files full of information about the mean streets of Terminal City. The files listed who lived there, the vehicles they owned, the clubs they belonged to, the streets they lived on, and so forth. The formats varied: some files were tab-separated tables, some were plain text, and some had instructional header or footer rows.
More importantly, 99.9% of the text was junk. It was gibberish, or excerpts from Alice in Wonderland, or names of random 2012 Olympic athletes. But buried at key points in these large files were actual clues that, when followed, would eventually lead you to the identity of the murderer. With so much nonsense text to sift through, the only way to crack the case in a reasonable amount of time was to use the command line to quickly search, filter, and inspect the data.
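To make "search, filter, and inspect" concrete, here is a minimal sketch. The file name `people.txt`, its contents, and the `CLUE` marker are all made up for illustration; they are not the real mystery files. It builds a tiny junk-filled file, peeks at it, counts its lines, and pulls out the one line that matters.

```shell
# Hypothetical data: mostly junk, with one marked clue (not the real mystery files).
workdir=$(mktemp -d)
printf 'name\tstreet\n' > "$workdir/people.txt"
printf 'alice was beginning to get very tired\n' >> "$workdir/people.txt"
printf 'CLUE: the suspect drives a blue car\n' >> "$workdir/people.txt"
printf 'twas brillig and the slithy toves\n' >> "$workdir/people.txt"

# Inspect: look at the first two lines to see how the file is laid out.
head -n 2 "$workdir/people.txt"

# Inspect: count the lines so you know how much text you are dealing with.
line_count=$(wc -l < "$workdir/people.txt" | tr -d ' ')
echo "lines: $line_count"

# Search: print only lines containing the marker, prefixed with their line numbers.
clue_lines=$(grep -n 'CLUE' "$workdir/people.txt")
echo "$clue_lines"

# Clean up the temporary directory.
rm -rf "$workdir"
```

On a four-line file this is overkill, but the same three commands behave the same way on a file with hundreds of thousands of lines.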
I thought this might be a stickier way to teach the basics of the command line than drily walking through a lot of examples, because it more closely mimics a real-world data journalism scenario: you inherit a big dump of messy data without any context. There’s too much data to hold in your head, and you don’t even really know what’s in it, or how the files are structured. You have to probe and get your bearings, and then you have to be careful with your inquiries, spot-checking and double-checking results as you go. You can only see one slice of the big picture at a time.
The key thing about this whodunit exercise is that it’s freeform. You don’t have instructions to follow; you have a situation, and it’s up to you to experiment and find a path to the solution, once you figure out what the solution would even look like. There are many different ways you could find the answer. Some might be more efficient but trickier to implement; others might be simple and stepwise, and so easier to follow and modify. This is an important part of getting comfortable with the command line: understanding that it consists of small pieces that each do one thing well, and that you can combine them in endless ways to get what you need.
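As a rough illustration of combining small pieces, the sketch below answers the same question two different ways and then chains a few commands into a ranking. The file `vehicles.tsv` and everything in it are invented for the example, not taken from the actual mystery.

```shell
# Hypothetical tab-separated data: owner name, then vehicle make.
workdir=$(mktemp -d)
printf 'Ann\tHonda\nBob\tToyota\nCy\tHonda\nDee\tHonda\nEd\tToyota\nFay\tMazda\n' > "$workdir/vehicles.tsv"

# Route 1: one command. grep -c counts the lines that match.
honda_count_a=$(grep -c 'Honda' "$workdir/vehicles.tsv")

# Route 2: several small steps. Keep column 2, keep matching lines, count them.
honda_count_b=$(cut -f 2 "$workdir/vehicles.tsv" | grep 'Honda' | wc -l | tr -d ' ')

echo "route 1: $honda_count_a, route 2: $honda_count_b"

# A longer chain: tally every make and rank the tallies, largest first.
# cut picks the column, sort groups identical values together,
# uniq -c counts each group, and sort -rn orders the counts numerically, descending.
ranking=$(cut -f 2 "$workdir/vehicles.tsv" | sort | uniq -c | sort -rn)
echo "$ranking"

# Clean up the temporary directory.
rm -rf "$workdir"
```

Route 1 is shorter, but route 2 is easier to adapt: swap the `grep` pattern or the column number and you have a different question answered.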
Why worry about teaching journalists the command line in the first place? I can think of a few reasons why it comes in handy even if you have no plans to become a developer:
- A lot of really useful tools for journalists end up stranding you on the command line. You hear that piece of software X is exactly what you need to convert that weird file, or build a certain type of chart, or make a map, so you go try to download it. But you end up on a GitHub page with installation instructions that are way over your head and involve fifteen different steps on the command line.
In a perfect world you could just copy and paste the commands from the documentation and cross your fingers and hope it all works. In the real world, those tools almost never just “work,” and the documentation usually leaves out some important details. So you get some weird error message during setup, or output you weren’t expecting. You’ll be stuck unless you have some idea of how the commands are structured and what you might need to change.
- Command line tools are often a lot more efficient at processing text than desktop software or even custom scripts, and this starts to matter if you have a massive dataset. You can open a 5MB file in Excel, but not a 5GB one. If you’re a data journalist and you encounter a really huge quantity of data, using the command line for filtering, searching, and cleaning can save you a lot of headaches.
- It’s useful to stop thinking of “data” as a special category, something you only interact with delicately and indirectly, with a piece of software like Excel as your liaison. Data is text, and text is data. Virtually any sort of data a journalist encounters can be treated as just a big pile of text, and once you understand that, you can get more creative in how you interrogate and modify it, because it all boils down to searching and replacing, reading text in and spitting it back out.
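To show what "data is text" can look like in practice, here is a small sketch that cleans a messy file using nothing but search and replace, then counts matching rows. The file `messy.csv` and its contents are hypothetical. These commands read their input a line at a time, which is why the same approach can cope with files too large to open in a spreadsheet.

```shell
# Hypothetical messy data: inconsistent capitalization and stray spaces.
workdir=$(mktemp -d)
printf 'name,city\nAnn,  NEW YORK\nBob,new york\nCy,Boston\n' > "$workdir/messy.csv"

# Read text in, spit text back out:
# tr lower-cases every character, then sed deletes any spaces that follow a comma.
cleaned=$(tr '[:upper:]' '[:lower:]' < "$workdir/messy.csv" | sed 's/, */,/')
printf '%s\n' "$cleaned"

# With the text normalized, counting matching rows is a plain text search.
ny_count=$(printf '%s\n' "$cleaned" | grep -c ',new york$')
echo "rows in new york: $ny_count"

# Clean up the temporary directory.
rm -rf "$workdir"
```

Before the cleanup step, a search for `new york` would have found only one of the two rows.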
As for the mystery, you can give it a try yourself (you only need the file clmystery.zip). This version was kind of a rush job, with not nearly as much hardboiled, Sam Spade flavor as I would have liked, but pretty soon I’ll start working on the next case and hopefully introduce more advanced commands like sed and awk. Get to work, gumshoes!
- veltman posted this