Benchmarks Online

    

Research and Statistical Support - University of North Texas

RSS Matters

The previous RSS Matters article was Introducing AMOS 17 and SPSS 17. - Ed.

PROC TWEET: Using SAS to Analyze Twitter

By Patrick McLeod, Research and Statistical Support Services Consultant

This month’s Benchmarks Online article brings together something we’re all familiar with (SAS) and something a bit more recent (Twitter). Twitter is a micro-blogging service that is free to sign up for, free to use and, most importantly for the purposes of this article, free to access. Users post short messages of up to 140 characters and share them with their followers (other Twitter users who are following that user) and with the public in general. These posts, known as tweets, can be marked as belonging to a topic by including a hash tag: a hash sign (#) followed by a topic name. For instance, #nbafinals is the Twitter topic for the NBA Finals between the Lakers and the Magic.

#iranelection

For this article, we’re going to import a trending Twitter topic (#iranelection, which covers tweets about the contested presidential election in Iran) and examine the occurrences of two hot sub-topics concerning Twitter’s scheduled maintenance later this evening. To provide a bit of background on why I’m examining this: last Friday, June 12, 2009, there was a presidential election in Iran. For a variety of reasons, the results of this election have been questioned by the opposition Green Movement, by other opposition groups that fielded candidates in Iran, and by many organizations and individuals in the international community. Iranian citizens using Twitter have been providing an almost instantaneous stream of information to the outside world at a time when international journalists were (and continue to be) restricted from travelling outside of Tehran, Iran’s capital. These tweets gave the outside world an unparalleled look at what was going on as protests against the election began and were suppressed by the Iranian government.

Twitter announced earlier today that the service would be unavailable for an hour and a half this evening for scheduled maintenance. The announcement sparked two trending sub-topics within #iranelection: #twitterfail and #nomaintenance. Both express dismay at Twitter being unavailable for any amount of time, particularly since most of the information coming out of Iran right now is coming through Twitter.

To look at these two sub-topics, we’re going to use SAS 9.2, the SAS XML Mapper, a small macro, PROC PRINT and PROC SGPLOT. I based this example on Chris Hemedinger’s post on his blog, The SAS Dummy, about using SAS to mine Twitter for an informal poll on who won the vice-presidential debate during last fall’s presidential campaign.

The first step in this process is to get an XML feed for the topic #iranelection. This can easily be done from Twitter’s home page ( http://www.twitter.com ). In the trending topics sidebar, click on the topic you would like to examine. A new page showing only tweets on that topic will appear. On this page, click on the link that reads RSS feed for this topic. This opens another page that gives you the Atom feed for the topic in question. In this case, my feed URL for #iranelection is http://search.twitter.com/search.atom?q=%23iranelection .

It’s worth taking a moment to note that taking data from topic feeds when they are very active (as #iranelection is at this time) poses its own set of considerations. For one, new tweets can be added to a topic so quickly that the time frame your data covers can change considerably depending on how active the topic is. If you want to capture a wider time frame for an active topic, you will need to get more pages of tweets than you would for a slower topic.
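The "%23" in the feed URL is the URL-encoded form of the "#" character. As a minimal sketch, assuming you want to build the same kind of feed URL for a different topic, the DATA step below encodes a hash tag with the SAS URLENCODE function and writes the resulting URL to the log. The variable names are illustrative, and the step only builds the string; it does not contact Twitter.

```sas
/* Sketch: build a search feed URL for an arbitrary hash tag.      */
/* The variable names (topic, query, feedurl) are illustrative.    */
data _null_;
  length topic $ 40 query $ 120 feedurl $ 200;
  topic = '#iranelection';

  /* URLENCODE escapes reserved characters, so "#" should come out */
  /* as "%23"; STRIP keeps trailing blanks from being encoded too  */
  query = urlencode(strip(topic));

  /* CATS concatenates and removes leading and trailing blanks */
  feedurl = cats('http://search.twitter.com/search.atom?q=', query);

  put feedurl=;
run;
```

Changing the value of topic is all that is needed to produce the URL for another hash tag.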

I ran my SAS program, which uses an XML map generated by the XML Mapper to read data from the Atom feed above, at 4:42pm Central Time. All 541 tweets that were captured were posted no earlier than 4:40pm Central Time. Using Perl regular expression matching (the PRXMATCH function), I selected all the tweets in the data set that contained the #twitterfail and #nomaintenance hash tags and generated a horizontal bar graph of the results. Here it is:

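[Figure: maintenance_tweets.png, a horizontal bar graph of tweet counts for the #twitterfail and #nomaintenance hash tags]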

The number of tweets using the #nomaintenance hash tag in these two minutes is far greater than the number using the #twitterfail hash tag.
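The two-minute window mentioned above can be checked directly from the tweets' timestamps. The following is a minimal sketch, assuming the tweets sit in a data set with a numeric datetime column named published, which is how the XML map for this article defines it. The data set work.sample and its three rows are invented stand-ins so that the step runs on its own.

```sas
/* Sketch: summarize the time window covered by a set of tweets.   */
/* work.sample and its three timestamps are made-up illustration   */
/* data; with real data, point PROC SQL at work.feed instead.      */
data work.sample;
  format published datetime19.;
  published = '15JUN2009:16:40:05'dt; output;
  published = '15JUN2009:16:41:30'dt; output;
  published = '15JUN2009:16:41:58'dt; output;
run;

proc sql;
  select count(*)       as tweets,
         min(published) as earliest format=datetime19.,
         max(published) as latest   format=datetime19.
    from work.sample;
quit;
```

The earliest and latest values bound the time frame, which is useful to know before comparing counts across topics that move at different speeds.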

There is quite a bit of mining that could be done using Twitter’s Atom feeds. I hope that this example of how one might employ SAS, its XML Mapper and a well-formed XML-based feed format like Atom will open up some new possibilities for you when it comes to using the most current data that the web can offer in conjunction with some simple statistical analysis. Until next time, happy computing!

Code for this article:

filename twsearch temp;

/** this is the XML map that will convert the RSS search feed
    into a SAS data set **/
data _null_;
  infile datalines truncover;
  file twsearch;
  input line $1000.;
  put line;
datalines4;
<?xml version="1.0" encoding="windows-1252"?>
<!-- ############################################################ -->
<!-- 2008-10-03T11:35:31 -->
<!-- SAS XML Libname Engine Map -->
<!-- Generated by XML Mapper, 902000.2.1.20080911191346_v920 -->
<!-- ############################################################ -->
<SXLEMAP name="SXLEMAP" version="1.2">
    <!-- ############################################################ -->
    <TABLE name="entry">
        <TABLE-PATH syntax="XPath">/feed/entry</TABLE-PATH>
        <COLUMN name="id">
            <PATH syntax="XPath">/feed/entry/id</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>37</LENGTH>
        </COLUMN>
        <COLUMN name="published">
            <PATH syntax="XPath">/feed/entry/published</PATH>
            <TYPE>numeric</TYPE>
            <DATATYPE>datetime</DATATYPE>
            <FORMAT width="19">IS8601DT</FORMAT>
            <INFORMAT width="19">IS8601DT</INFORMAT>
        </COLUMN>
        <COLUMN name="link">
            <PATH syntax="XPath">/feed/entry/link</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>32</LENGTH>
        </COLUMN>
        <COLUMN name="title">
            <PATH syntax="XPath">/feed/entry/title</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>140</LENGTH>
        </COLUMN>
        <COLUMN name="content">
            <PATH syntax="XPath">/feed/entry/content</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>261</LENGTH>
        </COLUMN>
        <COLUMN name="updated">
            <PATH syntax="XPath">/feed/entry/updated</PATH>
            <TYPE>numeric</TYPE>
            <DATATYPE>datetime</DATATYPE>
            <FORMAT width="19">IS8601DT</FORMAT>
            <INFORMAT width="19">IS8601DT</INFORMAT>
        </COLUMN>
        <COLUMN name="author">
            <PATH syntax="XPath">/feed/entry/author</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>32</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>
;;;;

 

/** this is the data set that will hold the "tweet" content **/
data work.feed;
  length fail $ 10;
run;

/** this macro makes it simple to get several "pages" worth of tweets **/
/** note: the num parameter is not referenced in the feed URL below, so
    every call requests the same URL; consecutive calls may therefore
    return overlapping tweets **/
%macro getpage(num);
  %let feed="http://search.twitter.com/search.atom?q=%23iranelection";
  filename twit URL &feed
  /** if you need to specify a proxy server to get to the internet **/
  /**   proxy="http://myproxy.com"  **/
  ;

  /** use the XML library engine **/
  libname tf XML xmlfileref=twit xmlmap=twsearch;

  /** append the entries just read to the accumulating data set **/
  data work.feed;
    set work.feed tf.entry;
  run;
%mend;

 

/** call the getpage macro for pages 1 through 60 **/
%macro getpages(n);
  %local i;
  %do i = 1 %to &n;
    %getpage(&i);
  %end;
%mend getpages;

%getpages(60);

 

/** if the tweet contains the #twitterfail topic, it gets a tick **/
/** if the tweet contains the #nomaintenance topic, it gets a tick **/
/** note: a tweet containing both topics is labeled "nomaint" only,
    because the second assignment overwrites the first **/
data work.feed;
  set work.feed;
  if prxmatch('/(#twitterfail)/',lowcase(title)) >0 then fail="twitfail";
  if prxmatch('/(#nomaintenance)/',lowcase(title)) >0 then fail="nomaint";
run;

/** list the label assigned to each tweet **/
proc print data=work.feed;
  var fail;
run;

/** horizontal bar chart of the number of tweets per label **/
title "Tweets On Maintenance";
ods graphics / width=800 height=600;
proc sgplot data=work.feed;
  hbar fail;
  xaxis label="#iranelection hash tags on maintenance";
run;
quit;
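The tagging step above assigns a single label per tweet, so a tweet that carries both hash tags is counted under only one of them. One possible alternative, offered as a sketch and not as the method used for the graph in this article, is to keep a separate 0/1 flag for each hash tag and sum the flags. The data set work.sample, the flag names and the four sample titles are invented for illustration.

```sas
/* Sketch: count each hash tag independently with 0/1 flags.       */
/* The four sample titles are invented for illustration only.      */
data work.sample;
  infile datalines truncover;
  input title $140.;

  /* the /i modifier makes the match case-insensitive */
  has_twitterfail   = (prxmatch('/#twitterfail/i',   title) > 0);
  has_nomaintenance = (prxmatch('/#nomaintenance/i', title) > 0);
datalines;
Please delay the downtime #iranelection #nomaintenance
Bad timing #iranelection #twitterfail #nomaintenance
News from Tehran #iranelection
Not now #IranElection #NoMaintenance
;
run;

/* the SUM of a 0/1 flag is the number of tweets carrying that tag */
proc means data=work.sample sum maxdec=0;
  var has_twitterfail has_nomaintenance;
run;
```

With these four sample rows, the second tweet contributes to both sums, which the single-label approach cannot express.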

 

 



Originally published, June 2009 -- Please note that information published in Benchmarks Online is likely to degrade over time, especially links to various Websites. To make sure you have the most current information on a specific topic, it may be best to search the UNT Website - http://www.unt.edu . You can also search Benchmarks Online - http://www.unt.edu/benchmarks/archives/back.htm as well as consult the UNT Helpdesk - http://www.unt.edu/helpdesk/ Questions and comments should be directed to
benchmarks@unt.edu

