Cpsc470 Data Mining fun times: Assignment 1 with Python 3.0

My data mining textbooks have finally arrived, thanks Amazon (only 8 days after the supposed shipping date).

So I decided to download Python 3.0 instead of 2.whatever because I am very adventurous. I realize now that this means I will have to check Google for replacements for many shell functions, since according to the 3.0 documentation, simple built-ins such as reload() have been removed. For example:

>>> reload(recommendations)
Traceback (most recent call last):
File "", line 1, in
reload(recommendations)
NameError: name 'reload' is not defined

and then an alternative suggested online, which failed for me because I had not run import imp first:

>>> imp.reload(recommendations)
Traceback (most recent call last):
File "", line 1, in
imp.reload(recommendations)
NameError: name 'imp' is not defined
>>> recommendations.reload()
Traceback (most recent call last):
File "", line 1, in
recommendations.reload()
AttributeError: 'module' object has no attribute 'reload'
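
For what it's worth, an edited module can be reloaded without restarting the shell, as long as the module that provides reload has been imported first. In Python 3.0 that means running import imp before calling imp.reload(recommendations); from Python 3.4 onward the same function lives in importlib. Below is a minimal sketch using importlib on a throwaway module. The name demo_module and the temporary directory are made up for the demo and are not part of the assignment.

```python
import importlib
import os
import sys
import tempfile

# Skip writing .pyc files so the reload always re-reads the edited source.
sys.dont_write_bytecode = True

# Create a throwaway module (hypothetical name: demo_module) in a temp directory.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "demo_module.py")
with open(module_path, "w") as f:
    f.write("VALUE = 1\n")

# Make the temp directory importable and import the module once.
sys.path.insert(0, workdir)
importlib.invalidate_caches()
import demo_module
first = demo_module.VALUE

# "Edit" the file, then reload it in place instead of restarting the shell.
with open(module_path, "w") as f:
    f.write("VALUE = 2\n")
importlib.reload(demo_module)
second = demo_module.VALUE

print(first, second)
```

In the shell session from this post, the equivalent would be two lines: the import, then the reload call on recommendations.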

But after several tries I figured out that I can just restart the shell and import the file again whenever I edit it. Boo. My first attempt with the Euclidean Distance Score:
>>> import recommendations
>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.29429805508554946

Okay, I have Pearson's method working correctly with no problems, getting the same number as the book:

>>> print recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')
SyntaxError: invalid syntax (, line 1)
>>> print(recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour'))
0.396059017191

Apparently print is a function in Python 3.0 rather than a statement, so the parentheses are required.
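
A small sketch of what that change looks like in practice. Because print is an ordinary function in Python 3, it accepts keyword arguments such as sep and file, and the old statement form no longer parses at all. The text being printed is just a placeholder.

```python
import io

# Python 3: print is a function, so it takes arguments in parentheses
# and accepts keyword options such as sep and file.
buffer = io.StringIO()
print("score", 0.25, sep=" = ", file=buffer)
printed = buffer.getvalue()

# The Python 2 statement form is rejected by the Python 3 parser.
try:
    compile("print 'score'", "<demo>", "exec")
    old_form_ok = True
except SyntaxError:
    old_form_ok = False

print(printed, end="")
print("old statement form accepted:", old_form_ok)
```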

The next few functions were very easy to replicate in the recommendations.py file, copying from the book.

The first attempt at the Manhattan distance produced undesired results: the score is supposed to end up between 0 and 1, and I got -2. I realized this was because I was not taking the absolute value of the difference between ratings, so the positive and negative differences partly cancelled out and I was dividing 1 by -.5.
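
A short sketch of that bug, as I understand it. The two rating lists are an assumption: they are the book's sample ratings for Lisa Rose and Gene Seymour as far as I can tell, listed in the same movie order for both critics, and they do reproduce the -2 and the 4.5 from this post.

```python
# Ratings for the six shared movies, same movie order in both lists.
# Assumed values: the book's sample ratings for Lisa Rose and Gene Seymour.
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]

# Buggy version: signed differences cancel each other out.
signed_sum = sum(a - b for a, b in zip(lisa, gene))
buggy_score = 1 / (1 + signed_sum)

# Fixed version: absolute differences can only add up.
absolute_sum = sum(abs(a - b) for a, b in zip(lisa, gene))
fixed_score = 1 / (1 + absolute_sum)

print("signed sum:", signed_sum, "score:", buggy_score)
print("absolute sum:", absolute_sum, "score:", fixed_score)
```

With signed differences the sum can go negative, which pushes 1 + sum below zero and the score out of the 0 to 1 range.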

I fixed this problem and I am getting the sum of the differences correctly:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.181818181818
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.25
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.666666666667

I was using 1 / (1 + Manhattan distance sum) to scale the result between 0 and 1. You can see from my output that this isn't a very good way to do it, since a difference total of only .5 produces a score of just .6666. I changed it to

return len(si)/(len(si)+pow(difference_sum,2))

upon testing:
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.4
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.952380952381

You'll notice that this doesn't work either: once the difference dips below 1, squaring it makes it even smaller, so the result is deceptively close to 1. After some more experimentation I decided to leave it like this:

difference_sum = sum(abs(prefs[person1][item] - prefs[person2][item]) for item in si)
scaled_down = (len(si) + difference_sum) / len(si)
return 1 / pow(scaled_down, 2)

which produces the following results:

>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Mick LaSalle'))
the difference sum was: 3.0
0.444444444444
>>> print(recommendations.sim_manhattan(recommendations.critics,'Gene Seymour','Jack Matthews'))
the difference sum was: 0.5
0.826446280992
>>> print(recommendations.sim_manhattan(recommendations.critics,'Lisa Rose','Gene Seymour'))
the difference sum was: 4.5
0.326530612245

I liked the way these results turned out. The score for a difference sum of .5 is close to one, but not too close considering how few movies are being rated. I also like that when the final scores are compared to each other, they are spaced out roughly in proportion to their respective difference sums.
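
To compare the three scaling attempts side by side, here is a rough sketch that takes only a difference sum and the number of shared movies. The function names are made up for this comparison, and the shared-movie counts (6, 6 and 5) are inferred from the outputs above rather than printed anywhere in the session.

```python
def score_v1(difference_sum):
    # First attempt: 1 / (1 + distance).
    return 1 / (1 + difference_sum)


def score_v2(difference_sum, shared_count):
    # Second attempt: squares the distance, which flatters sums below 1.
    return shared_count / (shared_count + difference_sum ** 2)


def score_v3(difference_sum, shared_count):
    # Final version: scale the distance by the number of shared movies first.
    scaled_down = (shared_count + difference_sum) / shared_count
    return 1 / scaled_down ** 2


# (pair, difference sum, shared movie count); counts are inferred, see above.
pairs = [
    ("Lisa Rose / Gene Seymour", 4.5, 6),
    ("Lisa Rose / Mick LaSalle", 3.0, 6),
    ("Gene Seymour / Jack Matthews", 0.5, 5),
]

rows = [(name, score_v1(d), score_v2(d, n), score_v3(d, n)) for name, d, n in pairs]
for name, v1, v2, v3 in rows:
    print(f"{name:30s} v1={v1:.4f}  v2={v2:.4f}  v3={v3:.4f}")
```

One thing this makes visible is that only the last two versions depend on how many movies the two critics share, so the same difference sum counts for less when it is spread over more movies.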

The end.

Monday, February 2, 2009
