How to build a simple recommendation engine
July 28th, 2005
Building a simple recommendation engine for your website can set you apart from your competition. Traditionally, only technically advanced sites like Amazon were able to offer this service, but with ever-increasing processing power and simple open source software, you too can now add recommendations to your website.
A little over a year ago I wanted to add recommendations to my mountain biking website and I searched the net to find a simple open source solution (or even some pseudo-code to get me started). Unfortunately the only things I found were technical research papers on the subject and some fancy proprietary software that I could never afford. In my mind I knew it had to be relatively simple so I set about mapping the idea out in pseudo-code. What follows is pseudo-code you can use to build your own recommendation engine. I personally chose to use PHP and MySQL but this can certainly be done on any platform.
Let me reiterate that I am laying out the construction of a simple recommendation engine rather than the complete algorithm for recommendation bliss. I’m sure there are more technically savvy ways to accomplish this and I’m by no means a PHP master. However, I have found this method to work well on my own sites and I think it will work well for 90% of the web tinkerers reading this article.
Next up you’ll need a table to identify the items your users will be recommending (you probably already have this in some form or another). Your table might contain information on products, local night clubs, or mountain bike trails. Each needs to be uniquely identified with a primary key (usually an integer). From here, there’s only one more table to create.
The final table will be a junction table combining the user and information tables. This table will need at least four columns to keep track of the following details:
- user id of recommender
- information table id (night club id, product id, etc.)
- the rating (usually on a scale of 1 to 5, with 5 meaning “best”) and
- a primary key (this is a given).
If your data is product related and you actually sell products on your site, you might be able to skip the rating column altogether and base recommendations simply on what users have purchased (where a user id & product id entry implies purchase).
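The three tables described above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module rather than the PHP and MySQL setup mentioned earlier, and every table and column name (users, items, ratings, and so on) is a hypothetical choice, as is the UNIQUE constraint that stops a user from rating the same item twice.

```python
import sqlite3

# Hypothetical schema: the names below are illustrative assumptions.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE ratings (
    id      INTEGER PRIMARY KEY,                      -- primary key
    user_id INTEGER NOT NULL REFERENCES users(id),    -- recommender
    item_id INTEGER NOT NULL REFERENCES items(id),    -- item being rated
    rating  INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    UNIQUE (user_id, item_id)                         -- assumption: one rating per user per item
);
"""


def create_schema(conn):
    """Create the user, item, and junction tables on the given connection."""
    conn.executescript(SCHEMA)


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO items (id, title) VALUES (1, 'Trail A')")
conn.execute("INSERT INTO ratings (user_id, item_id, rating) VALUES (1, 1, 5)")
```

For a purchase-based variant, the rating column could presumably be dropped so that a row in the junction table simply means "this user bought this item."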
Once you’ve collected this information in the junction table, it’s time to get down and dirty with the recommendation algorithm. First off, you need to collect the user ids of all users who have rated a particular item highly (my cutoff is 4 and above; if you are instead trying to surface other things people will dislike, you might choose 2 and below). If your users are very precise or you have lots of data, you might only consider the maximum rating (5). Suppose we have a user who rates item A a 3. This user does not meet the threshold for a “favorable” vote and is therefore not consulted when building recommendations for this item (we assume this user has different tastes, since they do not agree with our other users that this is a good item). Collect the qualifying user ids in an array for use in the next step.
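As a rough sketch of this first step, the Python below filters a list of ratings down to the users who gave one item a favorable score. The data layout (a list of (user_id, item_id, rating) tuples), the sample values, and the function name users_who_liked are all assumptions made for illustration; in practice this would more likely be a query against the junction table.

```python
# Made-up sample data: (user_id, item_id, rating) tuples.
RATINGS = [
    (1, "A", 5), (1, "B", 2), (1, "C", 4),
    (2, "A", 5), (2, "B", 3), (2, "C", 5), (2, "D", 4),
    (3, "A", 3), (3, "D", 1),
]


def users_who_liked(ratings, item_id, cutoff=4):
    """Return the ids of users who rated item_id at or above cutoff."""
    return [user for user, item, rating in ratings
            if item == item_id and rating >= cutoff]


# User 3 rated item A a 3, which is below the cutoff, so they are left out.
fans = users_who_liked(RATINGS, "A")
```

Raising cutoff to 5 gives the stricter "maximum rating only" variant.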
With the user array in hand, we want to find out how each of those users rated the other items. So for example, if we are interested in providing a recommendation for item A and user 1 gave item A a 5, item B a 2, and item C a 4, we would grab the ids and ratings of items B and C as possible recommendations and place them in an array:
rec['B']['score'] = 2
rec['B']['votes'] = 1
rec['C']['score'] = 4
rec['C']['votes'] = 1
Now suppose that user 2 also gave item A a 5 but gave B a 3, C a 5, and D a 4. Now our array looks like this:
rec['B']['score'] = 2 + 3 = 5
rec['B']['votes'] = 1 + 1 = 2
rec['C']['score'] = 4 + 5 = 9
rec['C']['votes'] = 1 + 1 = 2
rec['D']['score'] = 4
rec['D']['votes'] = 1
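The running totals above can be reproduced with a short Python sketch. The tuple layout for ratings and the function name tally are assumptions for illustration; the rec structure mirrors the score and votes entries shown in the listing.

```python
from collections import defaultdict


def tally(ratings, item_id, fans):
    """Sum the scores and count the votes that fans gave to every other item."""
    rec = defaultdict(lambda: {"score": 0, "votes": 0})
    fan_set = set(fans)
    for user, item, rating in ratings:
        # Skip users outside the fan list, and skip the item we started from.
        if user in fan_set and item != item_id:
            rec[item]["score"] += rating
            rec[item]["votes"] += 1
    return dict(rec)


# The two users from the walkthrough: (user_id, item_id, rating).
RATINGS = [
    (1, "A", 5), (1, "B", 2), (1, "C", 4),
    (2, "A", 5), (2, "B", 3), (2, "C", 5), (2, "D", 4),
]

rec = tally(RATINGS, "A", fans=[1, 2])
```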
After a little math, we compute the average score for each recommendation and order our array from best match to worst:
rec['C']['avg'] = 4.5
rec['D']['avg'] = 4
rec['B']['avg'] = 2.5
Now we can present our recommendations to the user. Since our cutoff for likeability was 4 and over, we only present those items that have an average score of 4 or higher: items C and D. Item C is the best match, item D is the next best, and item B is not a match at all. You may want to weight items with more “votes” more heavily than single-vote recommendations, but this matters less once your recommendation table reaches a decent size. If many of your users rate a large number of items, you can also limit the results you return to the top 5 or whatever number you deem sufficient.
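A minimal Python sketch of the averaging, cutoff, and ordering steps might look like this. The function name rank and its parameters are hypothetical. Using the vote count only to break ties between equal averages is one simple stand-in for the vote weighting mentioned above, not the only way to do it.

```python
def rank(rec, cutoff=4.0, limit=5):
    """Average each item's score, keep those at or above cutoff, best first."""
    rows = [(item, t["score"] / t["votes"], t["votes"])
            for item, t in rec.items() if t["votes"] > 0]
    rows = [row for row in rows if row[1] >= cutoff]
    # Sort by average, then by vote count (assumption: ties favor more votes).
    rows.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return [(item, avg) for item, avg, _votes in rows[:limit]]


# Totals from the worked example.
rec = {
    "B": {"score": 5, "votes": 2},
    "C": {"score": 9, "votes": 2},
    "D": {"score": 4, "votes": 1},
}

recommendations = rank(rec)
```

With these totals, item B averages 2.5 and is dropped, leaving C and then D.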
Another way to describe these recommendations is to say “Users who enjoyed A also enjoyed C and D.” This actually better explains the relationship between the items, especially given the simple algorithm used to join the items.
There are several variations you can make to this scheme to improve the results you return. For my own use, I have limited the number of users I poll to 10 users who have rated a particular item (selected at random). This is to keep the script from slowing my pages since I have some items that have been rated by hundreds of unique users. I also limit the recommendations to items in similar “categories.” Specifically, I only return bike trail recommendations for trails in the same state as the original trail. Although many of my users have rated trails in multiple states, it is unlikely that information seekers will be interested in a trail in California that is similar to the one in Georgia that they’ve just ridden.
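Both variations can be sketched in a few lines of Python. The helper names sample_fans and same_category, and the idea of passing categories as a plain dictionary from item id to category, are assumptions for illustration rather than a description of my actual scripts.

```python
import random


def sample_fans(fans, max_users=10, rng=random):
    """Return at most max_users fans, chosen at random without repeats."""
    fans = list(fans)
    if len(fans) <= max_users:
        return fans
    return rng.sample(fans, max_users)


def same_category(candidates, categories, item_id):
    """Keep only candidates that share item_id's category (e.g. its state)."""
    target = categories.get(item_id)
    return [c for c in candidates
            if c != item_id and categories.get(c) == target]


# Hypothetical data: 100 fans, and trails labeled by state.
picked = sample_fans(range(100), rng=random.Random(0))
trail_states = {"A": "GA", "C": "GA", "D": "CA", "E": "GA"}
nearby = same_category(["C", "D", "E"], trail_states, "A")
```

Sampling happens before the other users' ratings are tallied, so the cost of building a recommendation stays roughly constant no matter how popular an item is.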
A recommendation engine is fairly easy to construct using a few simple tables and arrays and can be a distinctive and compelling offering for your users. Once you’ve got the basics down it’s easy to improve this idea to give you truly meaningful results.