Cpsc470 Data Mining fun times: February 2009

Thursday, February 19, 2009

Portfolio 3: last.fm

Our first idea was to build a command-line program. Unfortunately we didn't leave ourselves enough time before the class presentation to flesh out a nice interface, though we were able to load the API and get some of the functions working.

While my group worked on getting this command-line demo going, I started writing a PHP page to achieve a similar goal through an interactive website. Fortunately I know PHP relatively well, and incorporating the last.fm API was easy after looking at a couple of the examples included in the download.

I found that I really liked the API. The functions were all very similar and easily replicated in multiple places. The only trouble I had was with a couple of API discrepancies. For example, the API doc states that you can search for similar tracks with just a track name and that the artist is optional. I found this wasn't actually true, or I simply couldn't get it to work. Here is an example of how I accessed the multidimensional array returned by the similarity functions:


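if (isset($_POST['track']) && $_POST['track'] != '') {
 // Both fields come from the search form; the artist turned out to be required.
 $methodVars = array(
  'track' => $_POST['track'],
  'artist' => $_POST['trackArtist'],
 );

 // getSimilar() returns an array of tracks, each of which is itself an array.
 $tracks = $trackClass->getSimilar($methodVars);
 echo "Top 50 Similar Tracks for " . htmlspecialchars($_POST['track']) . ":";
 echo '<ul id="navlist">';
 $count = 0;
 foreach ($tracks as $key) {
  echo '<li class="navitem">';
  echo '<div>';
  print '<img src="">'; // the image URL expression is missing from the post as published
  print '<a>' . $key['name'] . '</a>'; // link targets are missing as well
  print '<a class="artist">' . $key['artist']['name'] . '</a>';
  echo '</div>';
  echo '</li>';
  $count++;
  if ($count == 50) {
   break; // stop after the first 50 results
  }
 }
 echo '</ul>';
}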

So this segment of code just checks for a POST variable called track, which would indicate that the user has searched for similar tracks. I decided to just break out of the loop after 50 items because I was pressed for time and didn't want like 6 million similar tracks, and for some reason couldn't think of a better way to do it in the moment. The formatting that makes the page look decent is just some CSS applied to a couple of HTML tags, such as the list and so on. Here is some of the CSS I wrote:

ul#navlist {
 position: absolute;
 top: 14em;
 left: 15%;
 background-color: gray;
 list-style: none;
 padding-top: 20px;
 margin: 0;
 max-width: 70%;
}

li.navitem {
 display: inline;
 float: left;
 font: 12px arial;
 margin: 8px;
 width: 100px;
 height: 135px;
}

li.navitem a {
 text-decoration: none;
 color: white;
}

li.navitem a.artist {
 text-decoration: none;
 color: black;
}

li.navitem img {
 max-height: 64px;
 max-width: 64px;
}

li.navitem a:hover {
 color: #00FF99;
}

li.navitem a.artist:hover {
 color: #00FF99;
}

This basically just makes every list item appear as a box on the page, within a bigger UL box. This is accomplished with the float and display properties. The hover rules just make the links more appealing.

I think the cool thing about this is that I might actually use it. On last.fm I have to dig through so much fluff just to see what else I could possibly be interested in based on my tastes. This tool is simple, quick and dirty.

If I could write a message to the API creator, I would tell them how much it sucks that their array output isn't standardized. Sometimes it's [images], sometimes [image], sometimes [images-medium][artist] or [track][image-medium]. This isn't a showstopper, just aggravating: the output doesn't behave the way you expect after getting used to the system.
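One way to live with keys that move around is a small lookup helper that tries each known location in turn. The sketch below is in Python rather than PHP, and nearly everything in it is hypothetical: the function name, the sample dictionaries, and the URLs are made up for illustration. Only the key names come from the variations listed above.

```python
def find_image(item, key_paths):
    """Return the first value reachable through any of the key paths, else None."""
    for path in key_paths:
        node = item
        for key in path:
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                break  # this path does not exist in the item; try the next one
        else:
            # Every key in this path was present.
            return node
    return None


# Key paths taken from the variations mentioned in the post.
IMAGE_PATHS = [
    ("image",),
    ("images",),
    ("images-medium", "artist"),
    ("track", "image-medium"),
]

# Made-up response fragments, not real API output.
flat = {"name": "Some Track", "image": "http://example.invalid/a.jpg"}
nested = {"track": {"image-medium": "http://example.invalid/b.jpg"}}
missing = {"name": "No Art"}

print(find_image(flat, IMAGE_PATHS))
print(find_image(nested, IMAGE_PATHS))
print(find_image(missing, IMAGE_PATHS))
```

The display code then only has to handle one case (a URL or nothing) instead of branching on which call produced the data.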

Overall experience: I liked the API and found it remarkably easy to use. Great examples and a cool use of inheritance in their PHP code, something not seen too often in PHP, which seems pseudo-OO to me (objects do not persist past a single page, which makes singleton classes and things of that nature kind of difficult to realize).

PEACE

Monday, February 2, 2009

Assignment 1 with Python 3.0

My data mining textbooks have finally arrived. Thanks, Amazon (only 8 days after the supposed shipping date).

So I decided to download Python 3.0 instead of 2.whatever because I am very adventurous. I realize now that this means I will have to check Google for alternatives to many shell functions since, according to the 3.0 documentation, simple built-ins such as reload() have been removed. For example:

>>> reload(recommendations)
Traceback (most recent call last):
  File "", line 1, in <module>
    reload(recommendations)
NameError: name 'reload' is not defined

and then an alternative suggested online, which I tried without first running import imp:

>>> imp.reload(recommendations)
Traceback (most recent call last):
  File "", line 1, in <module>
    imp.reload(recommendations)
NameError: name 'imp' is not defined
>>> recommendations.reload()
Traceback (most recent call last):
  File "", line 1, in <module>
    recommendations.reload()
AttributeError: 'module' object has no attribute 'reload'
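For reference, the NameError in the second attempt only means that the imp module was never imported in that shell session. In Python 3.0 the old built-in lives on as imp.reload, and newer releases (3.4 and up) expose it as importlib.reload. Here is a minimal sketch of the importlib route. It uses a throwaway module called demo_recs, a hypothetical stand-in for recommendations.py, written to a temporary directory so the example runs on its own:

```python
import importlib
import pathlib
import sys
import tempfile

# Skip .pyc files so the reload always reads the edited source.
sys.dont_write_bytecode = True

# Write a throwaway module (hypothetical stand-in for recommendations.py).
workdir = tempfile.mkdtemp()
module_path = pathlib.Path(workdir) / "demo_recs.py"
module_path.write_text("VERSION = 1\n")
sys.path.insert(0, workdir)

import demo_recs

first = demo_recs.VERSION

# Simulate editing the file in an editor, then reload it without restarting.
module_path.write_text("VERSION = 22\n")
importlib.invalidate_caches()
demo_recs = importlib.reload(demo_recs)
second = demo_recs.VERSION

print(first, second)
```

In an interactive session the same idea is two lines: import importlib once, then call importlib.reload(recommendations) after each edit.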

But after several tries I figured out that I can just restart the shell and re-import the file whenever I edit it. Boo. My first attempt with the Euclidean distance score:
>>> import recommendations
>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.29429805508554946
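That number is consistent with a score of the form 1 / (1 + sqrt(sum of squared differences)). Below is a minimal sketch of such a function, not necessarily the exact code in recommendations.py. The ratings are an assumption: they are the two critics' entries as found in the book's sample data, trimmed to the pair used here.

```python
from math import sqrt

# Assumed slice of the book's critics data (only the two critics needed).
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}


def sim_distance(prefs, person1, person2):
    """Euclidean-distance similarity: 1.0 for identical ratings, toward 0 as they diverge."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0  # nothing in common, so no basis for a score
    sum_of_squares = sum(
        (prefs[person1][item] - prefs[person2][item]) ** 2 for item in shared
    )
    return 1 / (1 + sqrt(sum_of_squares))


score = sim_distance(critics, 'Lisa Rose', 'Gene Seymour')
print(score)
```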

Okay, I have Pearson's method working with no problems and am getting the same number as the book:

>>> print recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')
SyntaxError: invalid syntax (, line 1)
>>> print(recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour'))
0.396059017191

Apparently print is a function in Python 3.0, so the call needs parentheses.
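For comparison, here is a sketch of a Pearson correlation score written for Python 3, with print() called as a function. It is one standard way to compute the coefficient, not necessarily line-for-line what the book has, and the ratings are again an assumed slice of the book's sample data.

```python
from math import sqrt

# Assumed slice of the book's critics data (only the two critics needed).
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}


def sim_pearson(prefs, p1, p2):
    """Pearson correlation over the items both people have rated."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][item] for item in shared)
    sum2 = sum(prefs[p2][item] for item in shared)
    sum1_sq = sum(prefs[p1][item] ** 2 for item in shared)
    sum2_sq = sum(prefs[p2][item] ** 2 for item in shared)
    product_sum = sum(prefs[p1][item] * prefs[p2][item] for item in shared)
    numerator = product_sum - (sum1 * sum2 / n)
    denominator = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if denominator == 0:
        return 0  # one person gave every shared item the same rating
    return numerator / denominator


r = sim_pearson(critics, 'Lisa Rose', 'Gene Seymour')
print(r)  # print is a function in Python 3, so the parentheses are required
```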

The next few functions were very easy to replicate in the recommendations.py file, copying from the book.

The first attempt at the Manhattan distance produced undesired results: the score is supposed to end up between 0 and 1, and I got -2. I realized this was because I was not taking the absolute value of the difference between ratings, so the differences summed to -1.5 and I was dividing 1 by -.5.
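The arithmetic behind that -2 can be reproduced in a few lines. This is only a sketch of the bug, not the original code; the variable names are made up, and the ratings are an assumed slice of the book's sample data for Lisa Rose and Gene Seymour.

```python
# Assumed ratings for the two critics, from the book's sample data.
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

shared = [movie for movie in lisa if movie in gene]

# Buggy version: signed differences partly cancel and can go negative.
signed_sum = sum(lisa[movie] - gene[movie] for movie in shared)
signed_score = 1 / (1 + signed_sum)

# Fixed version: absolute differences can only add up.
absolute_sum = sum(abs(lisa[movie] - gene[movie]) for movie in shared)
absolute_score = 1 / (1 + absolute_sum)

print(signed_sum, signed_score)      # -1.5 -2.0
print(absolute_sum, absolute_score)  # 4.5 and roughly 0.1818
```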

I fixed this problem and I am getting the sum of the differences correctly:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.181818181818
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.25
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.666666666667

I was using 1 / (1 + Manhattan distance sum) to scale the result between 0 and 1. You can see from my output that this isn't a very good way to do it, since a difference total of only .5 produces a score of just .6666. I changed it to

return len(si)/(len(si)+pow(difference_sum,2))

upon testing:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.4
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.952380952381

You'll notice that this doesn't work either: once the difference sum dips below 1, squaring it makes it even smaller, so the result is deceptively close to 1. After some more experimentation I decided to leave it like this:

difference_sum=sum(abs(prefs[person1][item]-prefs[person2][item])for item in si)
scaled_down= (len(si)+difference_sum)/len(si)
return 1/(pow(scaled_down,2))

which produces the following results:

>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.444444444444
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.826446280992
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.326530612245

I liked the way these results turned out. The score for a difference sum of .5 is close to one, but not too close considering how few movies are being rated. I also like that the final scores are spaced in rough proportion to their respective difference sums.
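To see the three scalings side by side, here is a small sketch. The function names are made up for illustration. The difference sums and shared-movie counts are the ones implied by the outputs above (6 shared movies for the two Lisa Rose pairs, 5 for Gene Seymour and Jack Matthews).

```python
def scale_reciprocal(total, n):
    """First attempt: 1 / (1 + sum of absolute differences)."""
    return 1 / (1 + total)


def scale_squared_sum(total, n):
    """Second attempt: n / (n + sum squared); flatters sums below 1."""
    return n / (n + total ** 2)


def scale_final(total, n):
    """Final version: 1 / ((n + sum) / n) squared."""
    return 1 / ((n + total) / n) ** 2


# (difference sum, number of shared movies) for each pair in the post.
pairs = {
    ('Gene Seymour', 'Jack Matthews'): (0.5, 5),
    ('Lisa Rose', 'Mick LaSalle'): (3.0, 6),
    ('Lisa Rose', 'Gene Seymour'): (4.5, 6),
}

for names, (total, n) in pairs.items():
    scores = [f(total, n) for f in (scale_reciprocal, scale_squared_sum, scale_final)]
    print(names, total, [round(s, 4) for s in scores])
```

Dividing by the number of shared movies before squaring is what keeps a small difference sum from being squashed toward zero the way it is in the second attempt.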

The end.
