For data files see the code directory or snarf the assignment. You are not given any already-written Python code.
get_data
that returns two values: a list and dictionary. You must write two
modules: one for reading data about books/ratings, one for
movie/ratings. An overview an example of get_data
follows. More details are in the details pages.
You'll need to think about how to read this data. There are many ways to
do so, but in your code you won't know the total number of movies
until after you've read the entire file. You'll need to return the
list of movies from get_data
, you can create this
list using sets and list functions. You'll either need to read the
file twice, or store the data as it is read in some way so you can
create the dictionary you'll return after reading the entire file.
So, you can either read the file twice or store
(movie,rating)
tuples or lists for each line read and
process these pairs in creating the dictionary you return. For
example, after creating a list of 150 movies you can find the
index of any movie using
code like movielist.index("Eclipse")
to find the index of
the movie "Eclipse".
You'll need to think about how to read, store, and ultimately return a
list and a dictionary such that rdict[x]
is a list of
N int values if there are N movies for each rater x that's a key
in the dictionary returned.
Recommender.py
that you write. You'll also need to
call these functions and document the results you get. You must
write
functions with exactly these signatures (parameters and return values).
averages(items,ratings)
-- returns a list of tuples where each
tuple includes an item being rated and the average rating for
all those who've rated the item. The list should be sorted so
that the highest rated item is first. Each tuple will contain a
string (item) and a float (average) with the string first.
The parameters are the list of items and the dictionary that are
returned by a reading module's get_data
method. In
calculating averages you should not count raters who give a
value of 0 meaning "not rated". You should divide by (n+1)
where n is the number of non-zero raters, this ensures that
you won't get a division-by-zero error for an item that no one
rates.
similarities(name, ratings)
-- returns a list of
two-tuples where each tuple contains a rater-name (string) and a
similarity-index (int). The list is sorted with the most-similar
rater first. Similarity should be calculated for the user whose
name is a parameter using dot-products as described below. The
rater
whose name is the parameter shuld not be evaluated as how similar she is
to herself, i.e., the list returned should have one less element than
the number of elements in ratings since the rater is not judged
as similar to himself.
A similarity measure can be calculated by finding the dot-product of two rating-lists. For example, for the rating lists [-3,0,5,3] and [-1,3,0,5] the similarity is -3*-1 + 0*3 + 5*0 + 3*5 where each corresponding element of the lists are multiplied and summed. This yields a similar measure of 3+15 = 18. For the lists [-3,0,5,3] and [3,0,-3,3] the similarity measure is -3*3 + 0*0 + 5*-3 + 3*3 = -9 + -15 + 9 = -15. The rater with [-1,3,0,5] is closer to [-3,0,5,3] than is the rater with [3,0,-3,3] since the measures are 18 and -15, respectively. The idea is that two negative or two positive ratings make users closer than do a negative and a positive rating.
The arithmetic result of summing the corresponding products is called the dot-product and is actually related to a measure of the angle between two ratings in a mathematical ratings space.
recommended(closelist,items,ratings,n)
-- returns a list of
recommended items. The parameter items
is the list
of items returned by get_data
as is the dictionary
ratings
. The parameter
closelist
is the list returned by
similarities
, and n
is a number that
indicates how many ratings from close
should be
used.
The idea is to weight the ratings of similar raters more than the ratings of those with whom you don't agree. Consider these ratings, for example for a user whose ratings are [5,3,-5].
[1,5,-3] [5,-3,5] [1,3,0]The similarity measures are
1*5 + 5*3 + -3*-5 = 35 5*5 + -3*3 + 5*-5 = -9 1*5 + 3*3 + 0*-5 = 14So we should weight the first set of ratings most and the second set of ratings least because of how similar these raters are to us and our ratings.
We do this by accumulating a weighted sum as follows:
35 * [1,5,-3] = [ 35, 175,-105] -9 * [5,-3,5] = [-45, 27, -45] 14 * [1,3,0] = [ 14, 42, 0] -------------------------------- [ 4, 244,-150]This means that the best choice for us is the second item whose score is 244, the next is the first item whose score is 4, and the least-recommended is the last item whose score is -150.
The list returned is sorted from most-recommended to least recommended
and is a list of tuples where the first element is the name of
an item and the second element is the score (an int) for that
item. Scores are calculated using n
entries from
the list close
, so that if n==1 we use only the
closest rater's ratings and if n==len(close) we use them all.