Link to the last RSS article here: Introducing AMOS 17 and SPSS 17. - Ed.

PROC TWEET: Using SAS to Analyze Twitter

By Patrick McLeod, Research and Statistical Support Services Consultant

This month’s Benchmarks Online article brings together something we’re all familiar with (SAS) and something that is a bit more recent (Twitter). Twitter is a micro-blogging service that is free to sign up for, free to use and, most importantly for the purposes of this paper, free to access. The idea behind Twitter is that users post short missives (limited to 140 characters) and share them with their followers (other Twitter users who are following that user) and with the public in general. These posts, known as tweets, can be tagged by topic using a simple convention: a hash tag (#) followed by a topic name. For instance, #nbafinals is a Twitter topic about the NBA Finals between the Lakers and Magic.

#iranelection

For this article, we’re going to import a trending Twitter topic (#iranelection, a topic covering tweets about the contested Presidential election in Iran) and examine the occurrences of two hot sub-topics concerning Twitter’s scheduled maintenance later this evening.

To provide a bit of background on why I’m examining this: last Friday, June 12, 2009, there was a Presidential election in Iran. For a variety of reasons, the results of this election have been questioned by the Green Party and other opposition parties fielding candidates in Iran and by many organizations and individuals in the international community. Iranian citizens using Twitter were providing an almost-instantaneous stream of information to the outside world at a point in time when international journalists were (and continue to be) restricted from travelling outside of Tehran, Iran’s capital.
These tweets gave the outside world an unparalleled look at what was going on as protests against the election began and were suppressed by the Iranian government.

Twitter announced earlier today that the service would be unavailable for an hour and a half later this evening for scheduled maintenance. This announcement sparked two trending sub-topics in the #iranelection topic, #twitterfail and #nomaintenance. Both sub-topics expressed dismay at Twitter being unavailable for any amount of time, particularly considering that information coming out of Iran right now is mostly coming from Twitter.

To look at these two sub-topics, we’re going to use SAS 9.2, SAS XML Mapper, a small macro, PROC PRINT and PROC SGPLOT. I based this example on Chris Hemedinger’s blog post at the SAS Dummy on using SAS to mine Twitter for an informal poll of the VP debate winner from last Fall’s Presidential election. You can find it here.

The first step in this process is to create an XML feed for the topic #iranelection. This can easily be done on Twitter’s home page ( http://www.twitter.com ). Under the trending topic sidebar, click on the topic you would like to examine. A new page showing only tweets with that topic will appear. On this page, click on the link that reads RSS feed for this topic. This will open another page that provides the Atom feed for the topic in question. In this case, my feed URL for #iranelection is http://search.twitter.com/search.atom?q=%23iranelection .

It’s worth taking a moment to note that taking data from topic feeds when they are very active (as #iranelection is at this time) poses a unique set of considerations. For one, new tweets can be added to a topic so quickly that the time frame your data covers can change considerably depending on how active the topic is. If you want to capture a wider time frame for an active topic, you will need to get more pages of tweets than for a slower topic.
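To make the idea of "more pages" concrete, here is a minimal sketch of how a list of page URLs for the feed might be built in a DATA step. It is not the program used for this article. The `rpp` (results per page) and `page` query parameters are assumptions about the search feed's URL options, and the data set name `feed_urls` and the page count of 6 are hypothetical choices for illustration.

```sas
/* Sketch only: build one feed URL per results page.              */
/* ASSUMPTIONS: the rpp= and page= query parameters, the data set */
/* name feed_urls, and the number of pages are illustrative.      */
data feed_urls;
  length url $ 200;
  do page = 1 to 6;
    /* Single quotes keep the macro processor away from % and &. */
    url = cats('http://search.twitter.com/search.atom?q=%23iranelection',
               '&rpp=100&page=', page);
    output;
  end;
run;

/* Show the URLs that would be requested. */
proc print data=feed_urls noobs;
run;
```

Each URL could then be assigned to a fileref and read through the XML engine with a map built in SAS XML Mapper. The busier the topic, the more pages are needed to cover the same span of time.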
I ran my SAS program, which uses the XML Mapper to grab data from the Atom feed above, at 4:42pm Central Time. All 541 tweets that were captured were issued no earlier than 4:40pm Central Time. Using a Perl regular expression match, I selected all the tweets in the dataset that contained the #twitterfail and #nomaintenance hash tags and generated a horizontal bar graph of the results. Here it is:
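For readers who want to see the shape of that selection and charting step, the following is a minimal sketch, not the program that produced the graph. The data set `tweets`, its variable `title`, and the four sample rows are hypothetical stand-ins for whatever the XML map actually produces. `PRXMATCH` is SAS's Perl regular expression matching function, and `HBAR` in PROC SGPLOT charts frequencies by default.

```sas
/* Hypothetical stand-in for the imported feed: one row per tweet. */
data tweets;
  infile datalines truncover;
  input title $char140.;
  datalines;
Please delay the upgrade #iranelection #nomaintenance
#twitterfail if the site goes dark tonight #iranelection
Protest updates from Tehran #iranelection
#NoMaintenance keep the feed open #iranelection
;
run;

/* Write one row per (tweet, sub-topic) match; /i ignores case. */
data tagged;
  set tweets;
  length tag $ 16;
  if prxmatch('/#twitterfail\b/i', title) then do;
    tag = '#twitterfail';
    output;
  end;
  if prxmatch('/#nomaintenance\b/i', title) then do;
    tag = '#nomaintenance';
    output;
  end;
run;

/* List the matching tweets, then chart the count per hash tag. */
proc print data=tagged noobs;
run;

proc sgplot data=tagged;
  hbar tag;
run;
```

A tweet that carries both hash tags contributes one row to each bar, and tweets with neither are dropped from `tagged`.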
The number of Twitter users using the #nomaintenance hash tag in these two minutes of tweets is far greater than the number of Twitter users using the #twitterfail hash tag.

There is quite a bit of mining that could be done using Twitter’s Atom feeds. I hope that this example of how one might employ SAS, its XML Mapper and a well-formed XML-based feed like Atom will open up some new possibilities for you when it comes to combining the most current data the web can offer with some simple statistical analysis.

Until next time, happy computing!

Code for this article: