Cpsc470 Data Mining fun times: Portfolio 9: Final Project

For the Final Project, our group decided to take the Chapter 6 examples of classifying blogs to the next level. The goal was to create an interface that could take in a file (most likely an email or a blog) and accurately label it as Spam or Not Spam.

The first thing that we needed to do was find an adequate source of training files. I found one in the BlogSplog study, a compilation of 3000 HTML pages. Of these, 1400 (700 in each category) had been labeled as Spam or Not Spam by the UMBC research team, and the rest were meant to be identified by our Bayesian classifier. First, my team edited the training function in the spamclass.py file included in Chapter 6 so that it takes an input file instead of a string. The changed method is shown below:

def train(self,item,cat):
  features = self.getfeatures(item)

  # Increment the count for every feature with this category
  for f in features:
    self.incf(f,cat)

  # Increment the count for this category
  self.incc(cat)

This method could then be passed a filename, and it would train on that file with the category it was given (spam, in our instance). We then added a function that could train on multiple files, provided their names ended in a sequential integer:

def sampletrain(cl,basefile,numfiles,gory):
   for i in range(1,numfiles+1):
       filename =  basefile + str(i) + '.txt'
       #print filename
       cl.train(filename,gory)
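
For train() to accept a filename, the feature extractor has to open the file itself. That change is not shown above, so here is a minimal sketch of what a file-based extractor might look like. The name getwords_from_file is hypothetical, the word-splitting rule is only in the spirit of the Chapter 6 extractor, and the sketch is written for Python 3 rather than the Python 2 used in the rest of this post.

```python
import re

def getwords_from_file(path):
    """Return a dict of the distinct lowercase words found in the file at `path`."""
    # errors="ignore" keeps stray bytes in scraped HTML from aborting training
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        text = handle.read()

    # Split on runs of non-word characters, then drop very short or very long tokens
    words = re.split(r"\W+", text)
    return {w.lower(): 1 for w in words if 2 < len(w) < 20}
```

Passing a function like this as the classifier's feature function would let cl.train(filename, category) work on files without any other change to train().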

This was useful for our preliminary tests on the training data. Later, I added a more sophisticated function that read in all the data from the BlogSplog data set. I had to write the function around the specific format of its index file, which was:

[HTML NAME] [FILE LOCATION] [1 (not spam) or -1 (spam)]

Here is that function:

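def openfiles(cl):
    # Each line of the index: [HTML NAME] [FILE LOCATION] [1 or -1]
    data = open('/home/wboyd/blogsplog.txt', 'r')
    for line in data:
        thisline = line.split()
        if len(thisline) < 3:
            continue  # skip blank or malformed lines
        filename = "/home/wboyd/" + thisline[1]
        # split() already dropped the trailing newline, so compare against "1"
        if thisline[2] == "1":
            spamtype = 'not-spam'
        else:
            spamtype = 'spam'
        cl.train(filename, spamtype)
    data.close()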
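
The line-parsing step is the part most likely to go wrong, so it can help to pull it into its own function and try it on a few sample lines before starting a long training run. This is only a sketch: parse_index_line is a hypothetical helper, the sample file names in the usage note are made up, and it is written for Python 3.

```python
def parse_index_line(line, base_dir="/home/wboyd/"):
    """Turn one index line into (path, category), or None if the line is malformed."""
    fields = line.split()
    if len(fields) < 3:
        return None

    location, label = fields[1], fields[2]
    if label == "1":
        category = "not-spam"
    elif label == "-1":
        category = "spam"
    else:
        return None  # unknown label, so skip rather than guess

    return base_dir + location, category
```

For example, parse_index_line("page1.html blogs/page1.html -1") gives ("/home/wboyd/blogs/page1.html", "spam"), and a blank line gives None.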

So now I had a function that could train on all of the data from the BlogSplog data set. I figured that with this trained classifier, I could create an interface that let a user paste in the URL of a Google blog search results page, and it would classify each blog as spam or not spam based on the short description of that blog. The hurdle in this process was figuring out how to actually read in the HTML content from the URL. This meant that I had to open a web page with Python and parse through the HTML to the point showing the blurb. I did this using the sgmllib module. I had to create a new class using inheritance to override some of the parser's behavior to make it work for this example. Here is the code:

class MyParser(sgmllib.SGMLParser):   # inherit from the SGML parser class

    # Quick note about this awesome library: whenever a tag is encountered, the code calls
    # a function like start_a(), where a is the name of the tag (here <a>).  This allows you
    # to change what happens when certain tags are encountered to suit your needs.  All of
    # this code is an example of implementation inheritance.

    count = 1
    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.blurbchunk = []
        self.combinedblurb = []
        self.inside_td_element = 0

    def start_a(self, attributes):
        for name, value in attributes:
            if name == "href":
                if len(attributes) > 1:
                    if attributes[1][0] == "id":
                        if attributes[1][1] == "p-"+str(self.count): 
                            self.hyperlinks.append(value)
                            self.count = self.count + 1

    def start_td(self, attributes):
        if (len(attributes) > 0):
            for name, value in attributes:
                if name == "class":
                    if value == "j":
                        self.inside_td_element = 1

    def end_td(self):
        tempList = []
        tempList.extend(self.blurbchunk)
        if len(tempList) > 0:
            self.combinedblurb.append(tempList)
            del self.blurbchunk[:]   # empty the chunk list in place for the next cell
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            self.blurbchunk.append(data)
                
    def get_hyperlinks(self):
        # Print the links, a blank line, then the blurbs
        print self.hyperlinks
        print "\n"
        print self.combinedblurb
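
The class above depends on sgmllib, which only exists in Python 2. For anyone trying this on Python 3, here is a rough sketch of the same blurb-collecting idea using the standard html.parser module instead. It is not the code I ran: BlurbParser is a hypothetical name, it only handles the td cells with class "j" (not the hyperlinks), and it joins each cell's text into one string instead of keeping a list of chunks.

```python
from html.parser import HTMLParser

class BlurbParser(HTMLParser):
    """Collect the text found inside <td class="j"> cells."""

    def __init__(self):
        super().__init__()
        self.blurbs = []           # one joined string per matching cell
        self._chunks = []          # text pieces of the cell currently being read
        self._inside_cell = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and ("class", "j") in attrs:
            self._inside_cell = True

    def handle_endtag(self, tag):
        if tag == "td" and self._inside_cell:
            text = "".join(self._chunks).strip()
            if text:
                self.blurbs.append(text)
            self._chunks = []
            self._inside_cell = False

    def handle_data(self, data):
        if self._inside_cell:
            self._chunks.append(data)

parser = BlurbParser()
parser.feed('<table><tr><td class="j">Cheap <b>pills</b> here</td>'
            '<td>ignored</td></tr></table>')
parser.close()
print(parser.blurbs)
```

The idea is the same as with sgmllib: the base class walks the HTML and calls your overridden methods, so all you write is what should happen at each tag.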

Getting this to work was kind of difficult because I had to learn how Python deals with inheritance. However, once I found out how Google named the tags that encompassed the blog descriptions, I was able to access this data with the following commands at the Python prompt:

>>> import urllib, sgmllib, MyParser
>>> f = urllib.urlopen("http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=seo&btnG=Search+Blogs")
>>> s = f.read()
>>> parser = MyParser.MyParser()
>>> parser.parse(s)
>>> parser.get_hyperlinks()

If you look back up to the code, you will see that the get_hyperlinks() function prints a list of the HREFs to the blog pages and a list of blurbs, ordered the same way they were shown on the Google page. So now I was able to open up any Google blog search URL and get the blurbs and the hyperlinks. Next I needed to write these blurbs to files and call the classify method on each file. This is where I start to incorporate PHP into my example. However, just before I was able to use PHP shell exec commands, I realized that I couldn't have a namespace that kept my class instances alive like I was used to with the IDLE interactive shell. This posed a significant problem: each PHP request starts a fresh Python process, so I would have had to retrain the classifier (using the getwords() function) every time a user entered a new Google blog search URL. This wouldn't be acceptable, since training on 1400 files takes around 10 minutes even on the rosemary servers.

The solution is to 'pickle' your class instances. What this does is write your instance to a binary file where it can be stored and reopened quickly. Loading isn't instantaneous, but I found it to be 10-20 times faster than retraining every time, a significant improvement to say the least. Here is how I pickled my instances:

>>> import spamclass, cPickle as pickle
>>> cl = spamclass.naivebayes(spamclass.getwords)
>>> spamclass.openfiles(cl)
>>> f1 = open('temp.pk1', 'wb')
>>> pickle.dump(cl, f1)
>>> f1.close()

Note: cPickle is a C implementation of the pickle module that the Python documentation describes as up to 1000 times faster. The trade-off is that its Pickler and Unpickler cannot be subclassed. I didn't need that feature, so I used cPickle here.

Now I could access the temp.pk1 file using the pickle load() function as shown below:

>>> import spamclass, cPickle as pickle
>>> f1 = open('temp.pk1', 'rb')
>>> cl = pickle.load(f1)
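
For reference, the same save-then-load round trip on Python 3 might look like the sketch below. There is no separate cPickle module there; the plain pickle module uses its C implementation automatically when it is available. The dictionary is only a stand-in for a trained classifier, its feature counts are made-up illustrative numbers, and the file name classifier_demo.pkl is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier: just the kind of counts it would hold.
trained = {
    "feature_counts": {"viagra": {"spam": 12, "not-spam": 1}},
    "category_counts": {"spam": 700, "not-spam": 700},
}

path = os.path.join(tempfile.gettempdir(), "classifier_demo.pkl")

# Save once after training; the with-block closes the file for us.
with open(path, "wb") as out:
    pickle.dump(trained, out, protocol=pickle.HIGHEST_PROTOCOL)

# Load in a later run instead of retraining.
with open(path, "rb") as src:
    restored = pickle.load(src)

os.remove(path)  # clean up the demo file
```

One thing to keep in mind when pickling a real class instance: the script that loads it must be able to import the class definition (here, spamclass), because the pickle stores the instance's data, not the code.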

Pretty neat, huh? And it was pretty easy once I figured it all out. The next step was to fetch the descriptions and save them to files dynamically using PHP. Here is how I did it:

if (isset($_POST['url'])) {       // Check to see if they put in a url
    $fileName = "pythonexec.py";

    // We need to dynamically change the name of the URL that we are checking.
    // To do this, we need to open up our python exec and rewrite everything.

    $fh = fopen($fileName, 'w') or die("can't open file");
    fwrite($fh, "import urllib, sgmllib, MyParser\n");

    // This is where we dynamically change the URL to whatever the person typed.
    // The URL is written into the script unescaped, so it should be validated first.
    fwrite($fh, "f = urllib.urlopen(\"".$_POST['url']."\")\n");
    fwrite($fh, "s = f.read()\n");
    fwrite($fh, "parser = MyParser.MyParser()\n");
    fwrite($fh, "parser.parse(s)\n");
    fwrite($fh, "parser.get_hyperlinks()\n");
    fclose($fh);

    // Execute the python script and store the results; the next few lines
    // just break up the two arrays for formatting purposes.
    $results = shell_exec("python pythonexec.py");
    $resultSplit = explode("[[", $results);
    $links = explode(",", $resultSplit[0]);
    $blurbSplit = explode("],", $resultSplit[1]);
}
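
Splitting the printed Python lists on "[[" and "," works for the pages I tried, but it would break on any blurb that happens to contain those characters. One alternative, which I did not implement, would be to have the Python side print JSON and let PHP decode it with json_decode(). A minimal sketch of the Python half, assuming one blurb string per link and a hypothetical helper named format_results (Python 3):

```python
import json

def format_results(hyperlinks, blurbs):
    """Pair each link with its blurb and serialize the pairs as one JSON string."""
    pairs = [{"link": link, "blurb": blurb}
             for link, blurb in zip(hyperlinks, blurbs)]
    return json.dumps(pairs)

# Commas and brackets inside a blurb survive the round trip untouched
print(format_results(["http://example.com/post"],
                     ["Pills, cheap [limited offer]"]))
```

With this approach the calling page would receive one well-defined string and would not need to know how Python formats its lists.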

// Some HTML code that basically just has an input form.

$i = 0;
// for each of the results we just overwrite this test file and then classify it.
// Could be changed in the future to store previous searches to add
// some additional functionality

foreach ($blurbSplit as $newfile) {
    $fa = fopen("testfile1.txt", 'w') or die("can't open file");
    fwrite($fa, $newfile);
    fclose($fa);
    $theresult = shell_exec("python classify.py");

    // Show the HREF location.  This should have been formatted as an <a> tag
    echo $links[$i]."<br>";

    // Show the result, spam or not spam.
    echo trim(stripslashes($theresult))."<br>";
    $i++;
}
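
The classify.py script is not shown here; it loads the pickled classifier and classifies testfile1.txt. To give an idea of what a naive Bayes classifier computes at that step, here is a self-contained toy version. It is not the spamclass code: TinyBayes is a hypothetical class, the training words are made up, and it uses simple add-one smoothing rather than the weighting scheme from Chapter 6 (Python 3).

```python
class TinyBayes:
    def __init__(self):
        self.feature_counts = {}    # (feature, category) -> documents containing it
        self.category_counts = {}   # category -> documents seen

    def train(self, features, category):
        for feature in set(features):
            key = (feature, category)
            self.feature_counts[key] = self.feature_counts.get(key, 0) + 1
        self.category_counts[category] = self.category_counts.get(category, 0) + 1

    def score(self, features, category):
        docs_in_category = self.category_counts[category]
        total_docs = sum(self.category_counts.values())
        result = docs_in_category / total_docs          # prior for the category
        for feature in set(features):
            seen = self.feature_counts.get((feature, category), 0)
            # add-one smoothing so an unseen word does not zero out the score
            result *= (seen + 1) / (docs_in_category + 2)
        return result

    def classify(self, features):
        # Pick the category with the highest score
        return max(self.category_counts, key=lambda c: self.score(features, c))

cl = TinyBayes()
cl.train(["cheap", "pills"], "spam")
cl.train(["pills", "online"], "spam")
cl.train(["farm", "news"], "not-spam")
cl.train(["crop", "news"], "not-spam")
print(cl.classify(["cheap", "pills"]))
```

The counting in train() mirrors the incf/incc calls in the train method near the top of this post: one counter per feature and category pair, and one counter per category.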

So that's it! Now the user can paste a Google blog search URL into the form, and the page lists the same links that show in the Google blog search and classifies each one as spam or not spam. What's cool is that it actually is relatively accurate. Unfortunately I ran out of time, or I could have made the interface much more appealing. All of the Python work needed to get this running, including the pickling and the inheritance, was much more difficult and complex than the final product suggests.

So for the future, I would definitely like to spice up the final page. Some other functions in the spamclass module could also have been incorporated into the page to accommodate the idea that one person's spam is another person's not spam.

Here is an example of the tool working.
These were the results from the search "Viagra". The URL of the blog search
was:
http://blogsearch.google.com/blogsearch?hl=en&ie=UTF-8&q=viagra&btnG=Search+Blogs

http://politics.theatlantic.com/2009/04/pennsylvania_senator_arlen_specters_switch.php
spam

http://www.quizilla.com/poems/9372855/buying-viagra
spam

http://www.zengermillerlibrary.com/2009/04/30/where-to-order-viagra/ 
spam

http://www.worldhealthlife.com/2009/04/30/nano-technology-combination-viagra-drug.html
spam

http://forum.treonauts.com/palm-smartphones/centro/11233-viagra-sale-online.html 
spam

http://www.tuzluca-mikail.com/2009/04/headache-pills-made-of-rat-poison-and-viagra-made-of-chalk-we-reveal-the-chilling-truth-about-internet-drugs/ 
spam

http://www.fwi.co.uk/Articles/2009/04/29/115364/opicos-nitro-jet-is-like-viagra-for-osr.html 
not-spam

http://warwicknews.blogvis.com/2009/04/30/viagra-skin-rub-viagra-%E2%80%98type-drug-%C2%BB-blue-luminary-chronicles/ 
spam

http://iaenatacionlibres.blogspot.com/2009/04/viagra-100mg-x-90-pills-us-15995.html 
spam

http://www.littleredbook.cn/2009/04/29/chinese-viagra/ 
spam

If you check the one labeled as not-spam, you will see that it is indeed not a spam website.

However, the tool is not perfect. I would probably not have labeled this result as spam:
http://www.worldhealthlife.com/2009/04/30/nano-technology-combination-viagra-drug.html

Here is a link to the tool if you want to try it out:
http://rosemary.umw.edu/~wboyd/datamining/blogtest.php

THE END

Thursday, April 30, 2009
