That American Life

If you listen to podcasts, you are almost certainly familiar with This American Life. The radio show, which chooses a topic each episode and then gathers diverse stories about that topic from a range of contributors, is frequently (if not always) the most popular podcast in the iTunes store. From their website:

Our show is heard by 2.2 million listeners each week on over 500 public radio stations in the U.S., with another 2.5 million people downloading each episode as a podcast. We’re usually one of the five top podcasts on iTunes. We’re also heard on radio stations in Ireland and Germany, and all across Canada and Australia.

I've been a fan since the first time I heard them on the radio one evening on my drive home from school. Ira Glass, the host of the show, has often referred to the experience of waiting in your car even after reaching your destination in order to continue listening. I had exactly that experience the first time I heard it.

The show has been on the air since 1995 and has made its vast archive of episodes available to stream for free on its website. In addition to these streams, the show publishes transcripts of almost every episode. These transcripts are ripe for recreational analysis.

What Did You Build?

I've put together the beginnings of a data pipeline that downloads and parses transcripts of each episode for further inspection and experimentation. Each transcript contains enough information for me to produce CSVs for each episode with the following fields:

1. Episode Number
2. Episode Name
3. Act Number
4. Act Name
5. Speaker Role
6. Speaker Name
7. Statement Start Time
8. Statement Text

For example, here is a sample from the transcript of Episode 1:

"1","New Beginnings","3","Act Two: Act Two","interviewer","Ira Glass","00:29:53.94","Hey, is Barry there?"
"1","New Beginnings","3","Act Two: Act Two","subject","Receptionist","00:29:55.38","Pardon me?"
"1","New Beginnings","3","Act Two: Act Two","interviewer","Ira Glass","00:29:55.87","Is Barry there?"
"1","New Beginnings","3","Act Two: Act Two","subject","Receptionist","00:29:56.99","Yes, he is. He's on another call. Do you wish to hold, or I could take a message, or you can leave one on his voicemail?"
"1","New Beginnings","3","Act Two: Act Two","interviewer","Ira Glass","00:30:04.42","It's his son."
"1","New Beginnings","3","Act Two: Act Two","subject","Receptionist","00:30:05.70","Uh-huh."
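
To give a sense of how rows like these can be consumed, here is a minimal Python sketch that reads a few of the sample lines with the standard csv module. The snake_case column names, the to_seconds and read_statements helpers, and the assumption that the files carry no header row are mine, chosen for illustration; they are not necessarily how the pipeline itself is written.

```python
import csv
import io
from collections import Counter

# Column names mirror the eight fields listed above.
# The snake_case spellings are illustrative, not taken from the pipeline.
FIELDS = [
    "episode_number", "episode_name", "act_number", "act_name",
    "speaker_role", "speaker_name", "start_time", "text",
]

# Three rows copied from the Episode 1 sample.
SAMPLE = '''"1","New Beginnings","3","Act Two: Act Two","interviewer","Ira Glass","00:29:53.94","Hey, is Barry there?"
"1","New Beginnings","3","Act Two: Act Two","subject","Receptionist","00:29:55.38","Pardon me?"
"1","New Beginnings","3","Act Two: Act Two","interviewer","Ira Glass","00:29:55.87","Is Barry there?"
'''


def to_seconds(timestamp):
    """Convert an HH:MM:SS.ff timestamp into seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def read_statements(handle):
    """Yield one dict per statement, assuming the CSV has no header row."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["start_seconds"] = to_seconds(row["start_time"])
        yield row


statements = list(read_statements(io.StringIO(SAMPLE)))

# Count how many statements each speaker makes in the sample.
lines_per_speaker = Counter(s["speaker_name"] for s in statements)
print(lines_per_speaker)
```

Because every field is quoted, commas inside a statement (as in "Hey, is Barry there?") stay within a single column, so the same reader should work on a full episode file opened with open(path, newline="").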

What Are You Going To Do With This?

At the very least, I'll be sharing the output. I'll publish everything my data pipeline produces on GitHub, share the code that implements it, and eventually post experiments and analysis built on the data set.

How Can I Participate?

Download my dataset and read about what I'm doing with it! Initially, I'll just be looking at interesting trends and statistics, but I'll also be writing about how I've implemented the pipeline, what kinds of problems I'm running into, and how I'm solving them.