Cpsc470 Data Mining fun times: 2009

Thursday, April 30, 2009

Portfolio 9: Final Project

For the Final Project, our group decided to take the examples shown in Chapter 6 of classifying Blogs to the next level. The goal was to create some sort of interface that could take in a file (email or blog most likely) and accurately label it as Spam or Not Spam.

The first thing we needed to do was find an adequate source of training files. I found one in the BlogSplog study: a compilation of 3000 HTML pages, 1400 of which (700 in each category) had been labeled as Spam or Not Spam by the UMBC research team, with the rest left to be identified by our Bayesian classifier. First my team edited the training function in the spamclass.py file included in Chapter 6 to take an input file instead of a string. The changed method is shown below:

def train(self,item,cat):
  features = self.getfeatures(item)

  # Increment the count for every feature with this category
  for f in features:
    self.incf(f,cat)

  # Increment the count for this category
  self.incc(cat)
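
The train() method itself only calls self.getfeatures(item), so the file reading presumably lives in the feature extractor. Here is a minimal sketch of what such an extractor could look like, written for current Python 3. The name getwords_from_file and the word-length limits are assumptions for illustration, loosely modeled on the word splitter from Chapter 6, not the code we actually ran:

```python
import re


def getwords_from_file(path):
    """Return a dict mapping each distinct lowercase word in the file to 1."""
    # errors="ignore" skips bytes that are not valid UTF-8, which scraped HTML often contains.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        text = handle.read()
    # Split on runs of non-word characters and keep only mid-length tokens.
    words = [w.lower() for w in re.split(r"\W+", text) if 2 < len(w) < 20]
    return dict((w, 1) for w in words)
```

With an extractor like this passed to the classifier's constructor, train() can stay exactly as shown and still accept a filename.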

This method could then be passed a filename and it would train based on that file and the category it was passed (spam, in our case). We then added a function that could train on multiple files, as long as their names ended in a sequence number:

def sampletrain(cl,basefile,numfiles,gory):
   for i in range(1,numfiles+1):
       filename =  basefile + str(i) + '.txt'
       #print filename
       cl.train(filename,gory)

This was useful for our preliminary tests on the training data. Later, I added a more sophisticated function that read in all the data from the BlogSplog dataset. I had to write the function to read the specific format of this data, which was:

[HTML NAME] [FILE LOCATION] [1 (not spam) or -1 (spam)]

Here is that function:

def openfiles(cl):
    data = open('/home/wboyd/blogsplog.txt', 'r')
    lines = data.readlines();
    for i in lines:
        thisline = i.split(" ");
        filename = thisline[1];
        if thisline[2] == "1\n":
            spamtype = 'not-spam';
        else:
            spamtype = 'spam';
        filename = "/home/wboyd/"+filename
        cl.train(filename, spamtype);
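
One fragile spot in openfiles() is the comparison against "1\n": a last line with no trailing newline, or a file with Windows line endings, would be filed as spam. The sketch below shows a more forgiving way to parse one line of the index, in current Python 3. The helper name parse_label_line and the base_dir parameter are hypothetical, and it assumes the three fields contain no spaces:

```python
def parse_label_line(line, base_dir="/home/wboyd/"):
    """Turn one '[name] [location] [label]' line into (path, category), or None."""
    # split() with no argument drops the trailing newline and tolerates repeated spaces.
    fields = line.split()
    if len(fields) < 3:
        return None  # blank or malformed line
    location, label = fields[1], fields[2]
    # In the dataset format described above, 1 means not spam and -1 means spam.
    category = "not-spam" if label == "1" else "spam"
    return base_dir + location, category
```

Each non-None result could then be handed to cl.train(path, category) just as in the loop above.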

So now I had a function that could train on all of the data from the BlogSplog data set. I figured that with this trained data set, I could create an interface that let a user paste in the URL of a Google blog search results page, and it would classify each blog as spam or not spam based on the short description of that blog. The hurdle to overcome was figuring out how to actually read in the HTML content from the URL. This meant that I had to open a web page with Python and parse through the HTML to the point showing the blurb. I did this using the sgmllib module. I had to create a new class that inherits from its parser class and overrides some of its methods to make it work for this example. Here is the code:

class MyParser(sgmllib.SGMLParser):   # inherit from the SGML parser class

    # Quick note about this awesome library: whenever a tag is encountered, the parser calls
    # a method like start_a(), where a is the tag name. This lets you change what happens
    # when certain tags are encountered to suit your needs. All of this code is an example of
    # implementation inheritance.

    count = 1
    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.blurbchunk = []
        self.combinedblurb = []
        self.inside_td_element = 0

    def start_a(self, attributes):
        for name, value in attributes:
            if name == "href":
                if len(attributes) > 1:
                    if attributes[1][0] == "id":
                        if attributes[1][1] == "p-"+str(self.count): 
                            self.hyperlinks.append(value)
                            self.count = self.count + 1

    def start_td(self, attributes):
        if (len(attributes) > 0):
            for name, value in attributes:
                if name == "class":
                    if value == "j":
                        self.inside_td_element = 1

    def end_td(self):
        tempList = []
        tempList.extend(self.blurbchunk)
        if len(tempList) > 0:
            self.combinedblurb.append(tempList)
            del self.blurbchunk[:]   # empty the chunk list for the next blurb
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            self.blurbchunk.append(data)
                
    def get_hyperlinks(self):
        print self.hyperlinks
        print "\n"
        print self.combinedblurb

Getting this to work was kind of difficult because I had to learn how Python deals with inheritance. However, once I found out how Google named the tags that encompassed the blog descriptions, I was able to access this data with the following commands in the interactive shell:

>>> import urllib, sgmllib, MyParser
>>> f = urllib.urlopen("http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=seo&btnG=Search+Blogs")
>>> s = f.read()
>>> parser = MyParser.MyParser()
>>> parser.parse(s)
>>> parser.get_hyperlinks()
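
The sgmllib module only exists in Python 2; it was removed in Python 3. For anyone trying this on a current interpreter, here is a rough equivalent built on the standard html.parser module. The class name BlurbParser and the sample HTML are made up for illustration, and the td class "j" and the "p-N" link ids are simply the markers the code above looks for, which may no longer match any real page:

```python
from html.parser import HTMLParser


class BlurbParser(HTMLParser):
    """Collect result links (id="p-N") and the text of <td class="j"> cells."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []
        self.blurbs = []
        self._chunks = []       # text pieces of the blurb currently being read
        self._in_blurb = False
        self._next_id = 1       # results are expected as p-1, p-2, ...

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs and attrs.get("id") == "p-%d" % self._next_id:
            self.hyperlinks.append(attrs["href"])
            self._next_id += 1
        elif tag == "td" and attrs.get("class") == "j":
            self._in_blurb = True

    def handle_endtag(self, tag):
        if tag == "td" and self._in_blurb:
            # Join the pieces into one string per blurb and reset for the next cell.
            self.blurbs.append("".join(self._chunks).strip())
            self._chunks = []
            self._in_blurb = False

    def handle_data(self, data):
        if self._in_blurb:
            self._chunks.append(data)


sample = (
    '<a href="http://example.com/1" id="p-1">One</a>'
    '<table><tr><td class="j">First <b>blurb</b></td><td>other</td></tr></table>'
    '<a href="http://example.com/2" id="p-2">Two</a>'
    '<table><tr><td class="j">Second</td></tr></table>'
)
parser = BlurbParser()
parser.feed(sample)
parser.close()
```

Unlike the sgmllib version, this one joins each blurb into a single string instead of keeping a list of chunks, which makes the output easier to hand to another program.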

If you look back at the code, you will see that the get_hyperlinks() function prints an array of the HREFs to the blog pages and an array of blurbs, in the same order they appeared on the Google page. So now I was able to open any Google blog search URL and get the blurbs and the hyperlinks. Next I needed to write these blurbs to files and call the classify method on each file. This is where I started to incorporate PHP into my example. However, just before I got to the PHP shell_exec commands, I realized that I couldn't keep a namespace holding my class instances the way I was used to in the IDLE shell. This posed a significant problem: each shell_exec call starts a fresh Python process, so I would have had to re-train with the getwords() function every time a user entered a new Google blog search URL. That wouldn't be acceptable, since training on 1400 files takes around 10 minutes even on the rosemary servers.

The solution is to 'pickle' your class instances. This writes the instance to a binary file, from which it can be reloaded quickly. Loading isn't instantaneous, but I found it to be 10-20 times faster than retraining every time, a significant improvement to say the least. Here is how I pickled my instances:

>>> import spamclass, cPickle as pickle
>>> cl = spamclass.naivebayes(spamclass.getwords)
>>> spamclass.openfiles(cl)
>>> f1 = file('temp.pk1', 'wb');
>>> pickle.dump(cl, f1)

Note: I read online that cPickle is a reimplementation of pickle written in C that can be up to a thousand times faster but lacks a couple of advanced features (such as subclassing the Pickler and Unpickler classes), so I used it here since I didn't need those features.

Now I could access the temp.pk1 file using the pickle load() function as shown below:

>>> import spamclass, cPickle as pickle
>>> f1 = file('temp.pk1', 'rb');
>>> cl = pickle.load(f1)
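
The sessions above never close the file, which in a long-running shell can leave the pickle unflushed. Here is a small sketch of the same save and load steps for current Python 3, where the file() builtin is gone and the plain pickle module picks up its C implementation automatically. The helper names save_model and load_model are my own, and the dict below is only a stand-in for a trained classifier:

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Write any picklable object to `path`; the file is closed when the block exits."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Read back an object written by save_model()."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a trained classifier: per-category feature counts.
counts = {"spam": {"viagra": 12, "cheap": 7}, "not-spam": {"python": 9}}
model_path = os.path.join(tempfile.mkdtemp(), "temp.pk1")
save_model(counts, model_path)
restored = load_model(model_path)
```

Only load pickles you created yourself, since unpickling a file from an untrusted source can run arbitrary code.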

Pretty neat, huh? And it was pretty easy once I figured it all out. The next step was to dynamically save files based on the descriptions using PHP. Here is how I did it:

if(isset($_POST['url'])){       // Check to see if they put in a url
  $fileName = "pythonexec.py";

  // We need to dynamically change the name of the URL that we are checking.
  // To do this, we need to open up our python exec and rewrite everything.

  $fh = fopen($fileName, 'w') or die("can't open file");
  fwrite($fh, "import urllib, sgmllib, MyParser\n");

  // This is where we dynamically change the URL to whatever the person typed.
  fwrite($fh, "f = urllib.urlopen(\"".$_POST['url']."\")\n");
  fwrite($fh, "s = f.read()\n");
  fwrite($fh, "parser = MyParser.MyParser()\n");
  fwrite($fh, "parser.parse(s)\n");
  fwrite($fh, "parser.get_hyperlinks()\n");
  fclose($fh);

  // Execute the python executable and store the results. The next few lines
  // just break up the two arrays for formatting purposes.
  $results = shell_exec("python pythonexec.py");
  $resultSplit = explode("[[", $results);
  $links = explode(",", $resultSplit[0]);
  $blurbSplit = explode("],", $resultSplit[1]);
}

// Some HTML code that basically just has an input form.

$i = 0;
// For each of the results we just overwrite this test file and then classify it.
// Could be changed in the future to store previous searches to add
// some additional functionality

foreach($blurbSplit as $newfile){

  $fa = fopen("testfile1.txt", 'w') or die("cant open file");
  fwrite($fa, $newfile);
  fclose($fa);
  $theresult = shell_exec("python classify.py");

  // Show the HREF location.  This should have been formatted as an <> tag
  echo $links[$i]."<>";

  // Show the result, spam or not spam.
  echo trim(stripslashes($theresult))."<>";
  $i++;
}
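
Writing the posted URL straight into pythonexec.py means anything a visitor types ends up inside a Python source file, quotes and all. One alternative, sketched here under my own assumptions, is to keep a fixed script and pass the URL as a command-line argument, checking it before use. The function names validate_url and main are hypothetical, and the fetching and parsing step is left out:

```python
from urllib.parse import urlparse


def validate_url(url):
    """Return the URL unchanged if it looks like an http(s) address, else raise ValueError."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an http(s) URL: %r" % url)
    return url


def main(argv):
    """Entry point for a fixed script; argv is expected to be [script_name, url]."""
    if len(argv) != 2:
        print("usage: fetch_blurbs.py URL")
        return 2
    try:
        url = validate_url(argv[1])
    except ValueError as error:
        print(error)
        return 1
    print(url)  # fetching the page and running the parser would go here
    return 0
```

On the PHP side the script would then be called once with the URL wrapped in escapeshellarg(), so the Python file never has to be rewritten per request.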

So that's it! Now the user can paste a Google blog search URL into the form, and the page lists the same links that show in the Google blog search and classifies each one as spam or not spam. What's cool is that it actually is relatively accurate. Unfortunately I ran out of time, or I could have made the interface much more appealing. All of the Python work needed to actually get this running, including the pickling and the inheritance, was much more difficult and complex than the final product suggests.

So for the future, I would definitely like to spice up the final page. Some other functions in spamclass could also have been incorporated into the page to accommodate the idea that one person's spam is another person's not spam.

Here is an example of the tool working:
These were the results from the search "Viagra". The URL of the blog search was:
http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=viagra&btnG=Search+Blogs

http://politics.theatlantic.com/2009/04/pennsylvania_senator_arlen_specters_switch.php
spam

http://www.quizilla.com/poems/9372855/buying-viagra
spam

http://www.zengermillerlibrary.com/2009/04/30/where-to-order-viagra/ 
spam

http://www.worldhealthlife.com/2009/04/30/nano-technology-combination-viagra-drug.html
spam

http://forum.treonauts.com/palm-smartphones/centro/11233-viagra-sale-online.html 
spam

http://www.tuzluca-mikail.com/2009/04/headache-pills-made-of-rat-poison-and-viagra-made-of-chalk-we-reveal-the-chilling-truth-about-internet-drugs/ 
spam

http://www.fwi.co.uk/Articles/2009/04/29/115364/opicos-nitro-jet-is-like-viagra-for-osr.html 
not-spam

http://warwicknews.blogvis.com/2009/04/30/viagra-skin-rub-viagra-%E2%80%98type-drug-%C2%BB-blue-luminary-chronicles/ 
spam

http://iaenatacionlibres.blogspot.com/2009/04/viagra-100mg-x-90-pills-us-15995.html 
spam

http://www.littleredbook.cn/2009/04/29/chinese-viagra/ 
spam

If you check the one labeled as not-spam you will see that it indeed is not a spam website.

However it is not perfect, because I would probably not label this result as spam:
http://www.worldhealthlife.com/2009/04/30/nano-technology-combination-viagra-drug.html

Here is a link to the tool if you want to try it out:
http://rosemary.umw.edu/~wboyd/datamining/blogtest.php

THE END

Thursday, February 19, 2009

Portfolio 3: last.fm

Our first idea was to create a program using the command prompt. Unfortunately we didn't really leave ourselves enough time before the class presentation to really flesh out a nice interface, though we were able to load up the API and get some of the functions working.

While my group worked on trying to get this command prompt demo going I started writing a PHP page to achieve a similar goal through an interactive website. Fortunately I know PHP relatively well and found the incorporation of the last.fm API easy after looking at a couple examples included in the download file.

I found that I really liked the API. The functions were all very similar, easily replicated in multiple places. The only trouble I had was with a couple API discrepancies, for example, the API doc states that you can search for similar tracks with just a track name and the artist was optional. I found this wasn't actually true, or I simply couldn't get it to work. Here is an example of how I figured out how to access the multidimensional array returned by the similarity functions:


if ($_POST['track'] != null){
 $methodVars = array(
  'track' => $_POST['track'],
  'artist' => $_POST['trackArtist'],
 );
 
 $tracks = $trackClass->getSimilar($methodVars);
 //$similar = array();
 echo "Top 50 Similar Tracks for ".$_POST['track'].": 
";
 echo "<ul id=\"navlist\">";
 $count = 0;
 foreach($tracks as $key){
  echo "<li class=\"navitem\">";
  echo "<div>";
  print "< src="'">";
  print "<>".$key['name']."";
  print "<>".$key['artist']['name']."";
  echo "</div>";
  echo "</li>";
  $count++;
  if ($count == 50){
   break;
  }
 }
 echo "</ul>";

So this segment of code just checks for a post variable called track, which would indicate that the user has searched for similar tracks. I decided to just break out of the loop after 50 results because I was pressed for time and didn't want like 6 million similar tracks, and for some reason couldn't think of a better way to do it in the moment. The formatting to make the page look decent is just some CSS applied to a couple of HTML tags, such as the list and so on. Here is some of the CSS I wrote:

ul#navlist {
 position: absolute;
 top: 14em;
 left: 15%;
 background-color: gray;
 list-style: none;
 padding-top: 20px;
 margin: 0;
 max-width: 70%;
}

li.navitem {
 display: inline;
 float: left;
 font: 12px arial;
 margin: 8px;
 width: 100px;
 height: 135px;
}

li.navitem a {
 text-decoration: none;
 color: white;
}

li.navitem a.artist {
 text-decoration: none;
 color: black;
}

li.navitem img {
 max-height: 64px;
 max-width: 64px;
}

li.navitem a:hover {
 color: #00FF99;
}

li.navitem a.artist:hover {
 color: #00FF99;
}



This basically just makes every list item appear as a box on the page, within a bigger UL box. This is accomplished with the float and display properties. The hover rules just make the links more appealing.

I think the cool thing about this is I might actually use it. On last.fm I have to dig through so much fluff just to see what else I could possibly be interested in based on my interests. This tool is simple, quick and dirty.

If I could write a message to the API creator I would tell them how much it sucks that their array output isn't standardized. Sometimes it's [images], sometimes [image], sometimes [images-medium][artist] or [track][image-medium]. This isn't a showstopper, just aggravating (the output doesn't behave as you expect after getting used to the system).

Overall experience: I liked the API, found it remarkably easy to use. Great examples and cool use of inheritance in their PHP code, something not too often seen in PHP, since it seems pseudo OO to me. (objects do not persist past a single page, making singleton classes and things of this nature kind of difficult to realize.)

PEACE

Monday, February 2, 2009

Assignment 1 with Python 3.0

My data mining textbooks have finally arrived, thanks Amazon (only 8 days after supposed shipping date.)

So I decided to download Python 3.0 instead of 2.whatever because I am very adventurous. I realize now that this means I will have to check Google for many alternate shell functions, since according to the 3.0 documentation, simple ones such as reload() have been removed. For example:

>>> reload(recommendations)
Traceback (most recent call last):
File "", line 1, in
reload(recommendations)
NameError: name 'reload' is not defined

and then an alternative suggested by some uninformed nitwit:

>>> imp.reload(recommendations)
Traceback (most recent call last):
File "", line 1, in
imp.reload(recommendations)
NameError: name 'imp' is not defined
>>> recommendations.reload()
Traceback (most recent call last):
File "", line 1, in
recommendations.reload()
AttributeError: 'module' object has no attribute 'reload'
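
Looking at that traceback again, the NameError only says that imp had not been imported yet, so the suggestion may not have been wrong after all. In current Python 3 the function lives in importlib. The sketch below creates a throwaway module (scratch_recs, a made-up name standing in for recommendations), edits it on disk, and reloads it without restarting the shell:

```python
import importlib
import os
import sys
import tempfile

# Skip .pyc caching so the rewritten source is always recompiled in this demo.
sys.dont_write_bytecode = True

workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
module_path = os.path.join(workdir, "scratch_recs.py")

with open(module_path, "w") as handle:
    handle.write("VERSION = 1\n")
scratch_recs = importlib.import_module("scratch_recs")
first = scratch_recs.VERSION

# Simulate editing the file in an editor, then pick up the change in place.
with open(module_path, "w") as handle:
    handle.write("VERSION = 2  # edited on disk\n")
importlib.invalidate_caches()
importlib.reload(scratch_recs)
second = scratch_recs.VERSION
```

In a shell session the equivalent would be import importlib followed by importlib.reload(recommendations) after each edit.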

But after several tries I figured out I just need to restart the shell and re-import the file whenever I edit it. Boo.

My first attempt with the Euclidean Distance Score:
>>> import recommendations
>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.29429805508554946
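
For reference, that number is what you get from 1 / (1 + sqrt(sum of squared differences)). Here is a self-contained sketch of that formula. The two rating dictionaries are my transcription of the book's sample critics data for just these two people, so treat them as an assumption; without the square root the same ratings would give 1 / (1 + 5.75), roughly 0.148:

```python
from math import sqrt

# Assumed sample ratings for the two critics compared above.
critics = {
    "Lisa Rose": {
        "Lady in the Water": 2.5, "Snakes on a Plane": 3.5, "Just My Luck": 3.0,
        "Superman Returns": 3.5, "You, Me and Dupree": 2.5, "The Night Listener": 3.0,
    },
    "Gene Seymour": {
        "Lady in the Water": 3.0, "Snakes on a Plane": 3.5, "Just My Luck": 1.5,
        "Superman Returns": 5.0, "The Night Listener": 3.0, "You, Me and Dupree": 3.5,
    },
}


def sim_distance(prefs, person1, person2):
    """Euclidean-distance similarity in (0, 1]; returns 0 when nothing is shared."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum(
        (prefs[person1][item] - prefs[person2][item]) ** 2 for item in shared
    )
    return 1 / (1 + sqrt(sum_of_squares))


score = sim_distance(critics, "Lisa Rose", "Gene Seymour")
```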

Okay, I have Pearson's method working correctly with no problems, getting the same number as the book:

>>> print recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')
SyntaxError: invalid syntax (, line 1)
>>> print(recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour'))
0.396059017191

Apparently print is a function in Python 3.0, so it needs parentheses.

The next few functions were very easy to replicate in the recommendations.py file, copying from the book.

The first attempt at the Manhattan distance produced undesired results: the score is supposed to end up between 0 and 1, and I got -2. I realized this was because I was not taking the absolute value of the difference between ratings, and so was dividing 1 by -.5.

I fixed this problem and I am getting the sum of the differences correctly:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.181818181818
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.25
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.666666666667

I was using 1 / (1 + Manhattan distance sum) to scale the result between 0 and 1. You can see from my output that this isn't a very good way to do it, since a difference total of only .5 produces a score of just .6666. I changed it to

return len(si)/(len(si)+pow(difference_sum,2))

upon testing:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.4
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.952380952381

You'll notice that this doesn't work either, because once the difference dips below 1, squaring it makes it even smaller and the result is deceptively close to 1. After some more experimentation I decided to leave it like this:

difference_sum=sum(abs(prefs[person1][item]-prefs[person2][item])for item in si)
scaled_down= (len(si)+difference_sum)/len(si)
return 1/(pow(scaled_down,2))

which produces the following results:

>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.444444444444
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.826446280992
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.326530612245

I liked the way these results turned out. The score for a difference sum of .5 is close to one, but not too close considering how few movies are being rated. I also like that when comparing the final scores to each other, they appear to be spaced in roughly the same way as their respective difference sums.
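
To compare the three scalings side by side, here is a small sketch that takes the difference sum and the number of shared movies directly. The function names are mine, and the shared counts used in the examples (6 and 5) are inferred from the outputs above rather than read from the data:

```python
def score_inverse(difference_sum):
    """First attempt: 1 / (1 + d)."""
    return 1.0 / (1.0 + difference_sum)


def score_squared(difference_sum, shared_count):
    """Second attempt: n / (n + d^2), which over-rewards differences below 1."""
    return shared_count / (shared_count + difference_sum ** 2)


def score_scaled(difference_sum, shared_count):
    """Final version: 1 / ((n + d) / n)^2."""
    scaled_down = (shared_count + difference_sum) / shared_count
    return 1.0 / scaled_down ** 2


# (difference sum, shared movie count) pairs taken from the sessions above.
for d, n in [(0.5, 5), (3.0, 6), (4.5, 6)]:
    print(d, round(score_inverse(d), 4), round(score_squared(d, n), 4), round(score_scaled(d, n), 4))
```

Dividing by the number of shared movies means the same difference sum counts for less when it is spread over more ratings, which seems to be why the final version feels better behaved than the first two.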

The end.
