FOR THE SAKE OF VANITY

by andre briggs

Data Just Flies Through the Air

Michael Mann’s 1995 film Heat is one of my favorite films. One of the many scenes that sticks out in my head is an exchange between Robert De Niro’s character (McCauley) and Tom Noonan’s character (Kelso). Long story short, Kelso wants a revenue-sharing deal on a bank heist that McCauley’s crew will perform. McCauley is skeptical about the accuracy and source of the information Kelso is supplying him:

McCauley: How do you get this information?

Kelso: It just comes to you. This stuff just flies through the air. They send this information out, I mean it’s just beamed out all over the fuckin place. You just gotta know how to grab it. See I know how to grab it.

Heat (1995)

Recently I’ve been on a kick to create some data visualizations [1]. After getting into Node.js I’ve been inspired to dabble in JavaScript more. I figure that the most interesting information is the data around me that I’m generating. I feel that as humans we are drawn to seeing patterns and outliers. There’s so much happening around us that never gets mentioned because much of it is inaccessible to us. But there are things we can control and curate. I came up with a quick list of potential sources one can extract data from for some cursory analysis:

Music

Applications like iTunes contain a plethora of metrics, like most and least played tracks, and categories by genre, artist, artist features, etc. Some options beyond the obvious could include discovering the most common words in song titles or creating a histogram of track lengths. It doesn’t end there. These days one has access to lyrics online. Are there certain lyrics, phrases, or words that you may have an affinity or aversion to? Do a join on your local iTunes data with online lyric information. Perhaps there’s a formula to making a song that fits you?
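As a rough sketch of the first two ideas, here is one way to count title words and bucket track lengths in Python. The `tracks` list is made-up sample data and the stop-word set is an arbitrary assumption; real input would come from whatever export your music library offers.

```python
from collections import Counter
import re

# Hypothetical sample of (title, length in seconds); not real library data.
tracks = [
    ("Love in the City", 215),
    ("City Lights", 187),
    ("The Long Road Home", 342),
    ("Love Again", 201),
]

# Assumed list of filler words to ignore; tune it to your own library.
STOP_WORDS = {"the", "in", "a", "of", "and"}

def common_title_words(tracks, top=3):
    """Return the most frequent non-filler words across all titles."""
    words = Counter()
    for title, _ in tracks:
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOP_WORDS:
                words[word] += 1
    return words.most_common(top)

def length_histogram(tracks, bucket_seconds=60):
    """Count tracks per length bucket, keyed by the bucket's start in seconds."""
    buckets = Counter()
    for _, seconds in tracks:
        buckets[(seconds // bucket_seconds) * bucket_seconds] += 1
    return dict(sorted(buckets.items()))

print(common_title_words(tracks, top=2))
print(length_histogram(tracks))
```

The same word counter could be pointed at lyrics instead of titles once you have them joined to your tracks.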

Photos

If you’re using desktop software like iPhoto you may be able to link the facial recognition database to your photos. One could determine if there is a high occurrence of certain people in a certain location time after time. Correlations like that may be a signal for some other type of task completion. Locations, time, people, and even colors are all rich data sets to play with.

Email

When are you most likely to receive email on any given day? Can we make our email polling interval adaptive to when you most actively receive email? So much information can be mined from your personal email accounts. One of the perennial examples of machine learning is spam classification in email. Brush up on your ML skills by trying to discover other tasks.
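Here is a minimal sketch of the adaptive polling idea, assuming you have already pulled the received timestamps out of your mailbox as ISO 8601 strings. The sample timestamps and the 5 and 30 minute intervals are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical received-at timestamps; real ones would come from your mailbox.
received = [
    "2014-03-03T09:15:00",
    "2014-03-03T09:40:00",
    "2014-03-04T09:05:00",
    "2014-03-04T14:20:00",
    "2014-03-05T14:45:00",
    "2014-03-05T22:10:00",
]

def busiest_hours(timestamps, top=2):
    """Return the hours of the day (0-23) that receive the most mail."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [hour for hour, _ in hours.most_common(top)]

def polling_interval_minutes(hour, timestamps, top=2, fast=5, slow=30):
    """Poll often during the busiest hours and rarely otherwise."""
    return fast if hour in busiest_hours(timestamps, top) else slow

print(busiest_hours(received))
print(polling_interval_minutes(9, received))
print(polling_interval_minutes(22, received))
```

A two-level interval is the simplest possible choice; scaling the interval by each hour's share of mail would be a natural next step.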

SMS

Who is most likely to respond to a text from you fastest and vice versa? Use a burst detection algorithm to discover with whom you are most likely to get into a longer back and forth. Is this driven by time of day? Who is most likely to send you late night texts on Friday and Saturday evenings? Pulling text message data is tricky. If you use iMessage you could probably pull it from chat logs on your Mac.
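The sketch below shows one way to approach both questions, assuming the messages are already extracted as (contact, direction, timestamp) tuples. The sample data is made up, and the "burst" here is a simple gap-threshold run, a stand-in for a proper burst detection algorithm rather than an implementation of one.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

# Hypothetical messages: direction is "out" for sent, "in" for received.
messages = [
    ("alice", "out", "2014-03-07T22:00:00"),
    ("alice", "in", "2014-03-07T22:01:00"),
    ("alice", "out", "2014-03-07T22:02:00"),
    ("alice", "in", "2014-03-07T22:05:00"),
    ("bob", "out", "2014-03-07T12:00:00"),
    ("bob", "in", "2014-03-07T14:00:00"),
]

def median_reply_seconds(messages):
    """Median delay between a message you sent and the contact's next reply."""
    by_contact = defaultdict(list)
    for contact, direction, ts in messages:
        by_contact[contact].append((datetime.fromisoformat(ts), direction))
    result = {}
    for contact, thread in by_contact.items():
        thread.sort()
        delays = [
            (later[0] - earlier[0]).total_seconds()
            for earlier, later in zip(thread, thread[1:])
            if earlier[1] == "out" and later[1] == "in"
        ]
        if delays:
            result[contact] = median(delays)
    return result

def longest_burst(messages, max_gap=timedelta(minutes=10)):
    """Longest run of messages per contact with no pause longer than max_gap."""
    by_contact = defaultdict(list)
    for contact, _, ts in messages:
        by_contact[contact].append(datetime.fromisoformat(ts))
    longest = {}
    for contact, times in by_contact.items():
        times.sort()
        best = run = 1
        for earlier, later in zip(times, times[1:]):
            run = run + 1 if later - earlier <= max_gap else 1
            best = max(best, run)
        longest[contact] = best
    return longest

print(median_reply_seconds(messages))
print(longest_burst(messages))
```

Filtering the input by weekday and hour before calling either function would answer the late night Friday and Saturday question.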

Wearables

Honestly I think the consumer “wearables” data that is available right now is to be taken with a grain of salt. Measures such as steps and calories can vary a lot by device and by person. Use this data to complement other data sets.

Weather

By itself weather data isn’t that interesting. There are many companies that provide the standalone data. It would be interesting to see if people use SMS more on or preceding a sunny day. This is certainly an area where the composition of data from other fields makes it intriguing.
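A small sketch of that kind of composition: join daily message counts to daily weather by date and compare the averages. Both dictionaries are invented sample data, and the condition labels are an assumption about what a weather provider might give you.

```python
from statistics import mean

# Hypothetical daily SMS counts and weather conditions keyed by date.
daily_counts = {
    "2014-03-01": 40,
    "2014-03-02": 10,
    "2014-03-03": 30,
    "2014-03-04": 20,
}
daily_weather = {
    "2014-03-01": "sunny",
    "2014-03-02": "rain",
    "2014-03-03": "sunny",
}

def mean_messages_by_condition(daily_counts, daily_weather):
    """Average message count per weather condition, skipping unmatched days."""
    grouped = {}
    for day, count in daily_counts.items():
        condition = daily_weather.get(day)
        if condition is not None:
            grouped.setdefault(condition, []).append(count)
    return {condition: mean(counts) for condition, counts in grouped.items()}

print(mean_messages_by_condition(daily_counts, daily_weather))
```

To look at the day preceding a sunny day, shift the weather dates back by one before joining. With this little data a difference in averages suggests a question rather than an answer.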

Movies

Have a favorite movie? Scripts are available online. Parse it, create taxonomies from characters. Establish family trees. Maybe you like films from the ’80s about loner protagonists who are estranged from their families. I guess this is essentially what Netflix is doing.
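As a first step toward parsing a script, here is a sketch that counts speaking turns per character. It assumes the common screenplay layout where a character cue is an all-caps line above the dialogue; the excerpt and its character names are hypothetical, and real scripts will need more cleanup.

```python
from collections import Counter

# Hypothetical excerpt in a typical screenplay layout; not from a real film.
script = """\
INT. KITCHEN - DAY

ANNA
Where did you get this?

BEN
It was in the mail.

ANNA
Nobody sends mail anymore.
"""

def speaking_turns(script):
    """Count all-caps cue lines, skipping INT./EXT. scene headings."""
    turns = Counter()
    for line in script.splitlines():
        cue = line.strip()
        # An all-caps line is treated as a character cue. Shouted dialogue
        # such as "NO!" would be miscounted by this simple rule.
        if cue.isupper() and not cue.startswith(("INT.", "EXT.")):
            turns[cue] += 1
    return turns

print(speaking_turns(script).most_common())
```

Recording which characters speak in the same scene would be the starting point for the taxonomies and family trees.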

When trying to get inspired for some data analysis task one shouldn’t have to worry about a lack of data. The focus should be on knowing your approach and having a clear idea of your goals. The worst part of data analysis is getting answers that just lead to more questions. Much like a recursive function, you have to find your base case to know when to exit.


  1. To get to the point of having interesting data, some analysis must occur first. Some of this analysis may use machine learning techniques or just data massaging. Assume that this is all completed.