Compsci 06, Spring 2012, Recommender Systems

See the howto pages for details on creating projects, files, and so on. The pages here describe in broad strokes what this assignment is about.

A video describing the Netflix contest and prize.

Collaborative filtering and content-based filtering are two kinds of recommender systems. They help users find and choose anything from books to movies to restaurants to courses, based on their own preferences compared with the preferences of others.

In 2009 Netflix awarded one million dollars to a group that had developed a better recommender system than Netflix's in-house system. This NY Times Magazine article describes the competition, the winning teams, and how the movie Napoleon Dynamite caused problems for the algorithms and ranking/rating systems developed by contest participants.

In this assignment, adapted from a Nifty Assignment developed by Michelle Craig, you'll develop a program to test three different algorithms for recommending items based on the responses made by others. You'll be practicing reading data from files, using Python dictionaries and lists, and sorting data to find good matches.

The assignment comes in two conceptual parts:

We're providing two sources of data initially, but we'll solicit ratings from Duke students that you'll process as well. Sometimes ratings are stored in a single file, sometimes in more than one file. You'll need to write a separate Python module to deal with each data source, then use what these modules return to develop ratings.

Recommendations of Ratings

This first set of recommendations for Prof. Astrachan comes from Netflix. As you can see, these recommendations are based on two movies seen and then all the data Netflix has on similar movies.

[Image: Netflix recommendations]

This next set of recommendations is also for Prof. Astrachan for books as reported to him from Amazon, based mostly on purchases for books in Kindle/epub format that he reads when traveling.

[Image: Amazon recommendations]

Overview

Detailed information is supplied in the Howto. More information will be added to that document (or linked here) as more data, including Duke-specific data/ratings, become available. For this assignment you will write a module Recommender.py and two modules for reading data: BookReader.py and MovieReader.py. Each of these three modules is described below and in the Howto.

Reading Data

You must write a Python module for each data source. Each module has a function named get_data whose parameters are one or more filenames. The function get_data returns two things: a list of the items being rated (we will call this itlist) and a dictionary of ratings (we will call this rdict). The keys in rdict are the names of the raters, e.g., students who completed a survey or rated things. The value associated with each name is a list of integers, the ratings for each item in itlist. If the name "owen" is a key in rdict, then rdict["owen"] is a list of len(itlist) integers, with rdict["owen"][i] the rating that owen gives to itlist[i]. More details are in the Howto document.
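To make the relationship between itlist and rdict concrete, here is a minimal sketch. The item names, the rater "student1", the rating values, and the helper functions check_ratings and ratings_for are all made up for illustration; the real data and rating scale are described in the Howto.

```python
# Hypothetical sample data; real values come from a reader module's get_data.
itlist = ["Item A", "Item B", "Item C"]
rdict = {
    "owen": [5, 1, 3],      # owen's rating for each item in itlist, in order
    "student1": [1, 3, 5],  # hypothetical second rater
}


def check_ratings(itlist, rdict):
    """Return True if every rater has exactly one integer rating per item."""
    return all(
        len(ratings) == len(itlist)
        and all(isinstance(r, int) for r in ratings)
        for ratings in rdict.values()
    )


def ratings_for(name, itlist, rdict):
    """Pair each item with the rating that the rater `name` gave it."""
    return list(zip(itlist, rdict[name]))


if __name__ == "__main__":
    print(check_ratings(itlist, rdict))
    # rdict["owen"][i] is owen's rating of itlist[i]
    for item, rating in ratings_for("owen", itlist, rdict):
        print(item, rating)
```

A check like this may be handy while debugging a reader module, since a list of the wrong length means the ratings no longer line up with the items.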

We're providing two sets of data/ratings initially. One is book ratings. These come in two files: books.txt and bookratings.txt. The other is movie ratings. These come in one file: movieratings.txt. The formats are described in the Howto.

You should create modules MovieReader.py and BookReader.py. Each has a function get_data: the one in MovieReader.py takes one parameter (a filename), and the one in BookReader.py takes two.

Ratings/Recommender

In the module Recommender.py you must import a reading module, obtain the list and dictionary from the module's get_data function, and then report on at least three different ratings:

For more details on each of these, see the Howto document. You must create a module Recommender.py that can be used to provide information related to any ratings, but initially for either the books or movies provided.
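As a rough sketch of how code in Recommender.py might consume the two values returned by get_data, the example below computes a per-item average and sorts the items by it. Everything here is an assumption made for illustration: fake_get_data stands in for a real reader module, the data is invented, and the average is only an example calculation rather than a statement of what the Howto requires. It also counts every rating, which may not be right if the real data uses a special value to mean "not rated".

```python
def fake_get_data():
    """Hypothetical stand-in for a reader module's get_data function."""
    itlist = ["Item A", "Item B", "Item C"]
    rdict = {
        "owen": [5, 3, 1],
        "student1": [1, 5, 3],
    }
    return itlist, rdict


def average_ratings(itlist, rdict):
    """Return (item, average) pairs sorted from highest to lowest average."""
    averages = []
    for i, item in enumerate(itlist):
        # Collect every rater's score for item i.
        scores = [ratings[i] for ratings in rdict.values()]
        avg = sum(scores) / len(scores) if scores else 0.0
        averages.append((item, avg))
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    items, ratings = fake_get_data()
    for item, avg in average_ratings(items, ratings):
        print(item, avg)
```

Because average_ratings only depends on the list and the dictionary, the same function would work unchanged with the output of either reader module.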

README/Analysis

You should submit a README file as usual. You must also submit your analysis of what you've done as a text file, a .doc, or a .pdf. The analysis should include the following: