Practical Recommender Systems


Author

Kim Falk

Year
2019

Review

Recommendations have become an important part of most products. As such, Product Managers need a good understanding of the underlying technology. Although the theory can be complex, having an understanding of the tradeoffs and limitations of your system is crucial.

It’s been a while since I’ve done any matrix factorisation or gradient descent. Part II of this book was a challenge for me, but reading beyond my comprehension from time to time expands my circle of competence. Part I, however, is entirely comprehensible and a great introduction to recommenders.


Key Takeaways

The 20% that gave me 80% of the value.

  • Recommendations work best when you have lots of data and a lot of things (e.g. products or content) to recommend
  • Types of data involved:
    • Historical data: behaviour data
    • Content metadata
    • Contextual data: device, location, alone or in group, time of day/week/year
    • Demographic: age, gender, nationality, income, politics
  • How to describe a system: Domain, purpose, context, personalisation, opinions, privacy, trustworthiness, interfaces and algorithms
  • There’s a tradeoff between accuracy and explainability (or interpretation)
  • Data is really important. Garbage in, garbage out
  • Two types of feedback:
    • Explicit: ratings or likes
    • Implicit: activity (deduced from monitoring behaviour)
      • Implicit ratings (based on usage data) are often more trustworthy
  • Browser Behaviour events can demonstrate interest
  • Storage is cheap, save as much data as you can
  • What to choose:
    • High user involvement → collaborative filtering
    • Low user involvement (or 1 time visits) → content-based
  • Recommenders only work if you can identify the users; forcing sign-up/login can help
  • Conversion path is the sequence of pages and actions a user takes from the time of arrival on the landing page until conversion
  • You can compute implicit ratings by converting web behaviour into content ratings.
  • User-item preferences are often expressed in a user-item matrix.
    • row for each user
    • column for each item
    • ratings in each cell
  • The rating binds the user, the item and their sentiment toward it
  • Expect empty cells to be common (the sparsity problem). Many recommender systems will try to predict values for empty cells.
  • To calculate an implicit rating: take all events recorded between each user and item, and determine a number that indicates how much they like the item
  • Remember that date/time/day might change what’s best to recommend
  • You can introduce time-decay to your algorithm to give recent signals higher importance
  • Term Frequency - Inverse Document Frequency (TF-IDF)
    • Less frequent items provide more value
    • A user buying a popular item doesn’t say much about taste
  • Start with non-personalised recommendations, then graduate to personalised
  • Content should be ordered based on what you think most users are interested in
    • Use recency to keep your product dynamic
    • For some categories like films, most-recent-first is a sensible rule
  • Seeded recommendations use an item as an input to find others (e.g. frequently bought together)
  • Creating Frequently Bought Together
    • Do affinity analysis to determine links between products
      • Called frequent sets when products are seen in the same basket together
    • Most products are bought with popular items (e.g. milk and bananas in grocery)
    • Instead, find the products that are bought together but not with everything else
      • To do this, divide the number of times the items are bought together by the number of times the first one appears
  • The cold start: when you don’t have knowledge of your users, so can’t personalise well
    • Can apply to users and items
    • Gray sheep → users who have individual tastes (no other users have bought what they have)
  • If asking a user to rate items - use active learning (the art of selecting items to rate that will teach you the most about preferences)
  • Cold items are best handled using content based filtering
  • Create user segments, and see how quickly you can classify people into a segment
    • Try demographic recommendations / personalised
    • Try and get more data (e.g. facebook account)
  • Be careful to show only items that are appropriate based on items they’re already viewing
  • Segments can be created using cluster analysis
    • Cluster: a group with similar traits
    • You can find hidden traits and similarities in the data
  • You can use meta-data from an item to find similar items
    • Grouping and abstraction is really helpful…
      • Art by artist
      • News by topic
      • Articles by author
      • Song to artist
      • Product to brand
    • Need to get the abstraction or generalisation just right
      • too general and it will lose its value
      • too specific and it won’t enable you to connect any more items
  • Personalised recommendations almost always contain calculations of similarity
    • Finding items like the ones you like
    • Finding users who like what you like (e.g. have similar taste)
  • Jaccard similarity: a measure that indicates how close two sets are to each other
    • Also called the Jaccard index or Jaccard similarity coefficient; the Jaccard distance is 1 minus the similarity
  • To find similarity between two items, calculate how many users bought both items, then divide by how many users bought either one (or both).
  • Calculating k-means clusters (segmenting users can help save compute)
  • Scenario: Find a user that’s similar to me, so you can recommend things they like that I haven’t yet come across. You need…
    • Enough items in common to determine that your tastes match
    • They also need a set of items you haven’t seen / come across
  • In ML pipelines, do as much work as possible before the user arrives expecting recommendations
  • If you have more users than items → use item-based filtering. If you have more items than users → use user-based filtering.
  • Data needs to be well connected to make recommendations
    • If no users have rated content, no recommendations will be made
    • If your tastes don’t overlap with others, users won’t receive good recommendations
  • Steps to implement collaborative filtering:
    1. Find all the items that are rated by several users
    2. Calculate the number of users who are connected to one or more of those items
    3. Other users won’t receive recommendations
  • Your system doesn’t need any domain knowledge to do collaborative filtering
  • The collaborative filtering tradeoff
    • Narrow the number of users and you risk not finding any content to recommend
    • Make it too wide and you risk recommendations containing content outside of the users taste
  • Collaborative filtering requires data, which you won’t have for new users or items
    • Hacks:
      • Ask new users to rate a few items when they arrive
      • Create a new arrivals list to showcase new products
      • Use explore/exploit methods to inject new items into, say, every 6th recommendation slot
  • Decisions to make at each step of collaborative filtering (when → what decisions):
    • Ratings: Only the positive ones? Only the most recent ones? How should you normalise them?
    • Similarity: How many users need to rate an item-pair for its similarity to be calculated? Should you use only positive similarities?
    • Neighbourhood: What method do you use to select a neighbourhood? How big should it be?
    • Predicting: Should you use classification or regression? Should you use a weighted average?
    • Returning: Should you return them all? Or just the ones above a threshold?
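The predicting step above is often a similarity-weighted average of the neighbourhood's ratings. A minimal sketch, with invented users, similarities and ratings:

```python
# Sketch: predict a user's rating for an item as the similarity-weighted
# average of ratings from the user's neighbourhood (all data invented).

def predict_rating(item, neighbours, similarity, ratings):
    """neighbours: users similar to the target user.
    similarity[u]: similarity between the target user and u.
    ratings[u]: dict of u's ratings keyed by item."""
    num, den = 0.0, 0.0
    for u in neighbours:
        if item in ratings[u]:
            num += similarity[u] * ratings[u][item]
            den += abs(similarity[u])
    return num / den if den else None   # None: no neighbour rated it

sims = {"bob": 0.9, "eve": 0.5}
r = {"bob": {"matrix": 5.0}, "eve": {"matrix": 3.0}}
print(predict_rating("matrix", ["bob", "eve"], sims, r))
```

The closer neighbour ("bob") pulls the prediction toward their rating, which is the point of weighting by similarity rather than taking a plain average.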
  • Cons of collaborative filtering:
    • Sparsity - most data sets aren’t dense enough to recommend items other than the most popular
    • Gray sheep - hard to find recommendations for users with unique tastes
    • Number of ratings - you need lots of user ratings
    • Similarity - it’s content-agnostic, based on behaviour, and will skew to popular content
  • The best recommender systems also account for diversity, context, evidence, freshness, and novelty
  • All data applications are ‘living’ → they require maintenance and monitoring
  • Implementation:
    1. Start with the simplest algorithm first → then add complexity if needed
    2. Test on a small data set
    3. Code and test it
    4. Offline algorithm evaluation
    5. Introduce to humans (control group)
    6. Release to a small % of users
  • If the key success metric is lagging (retention of subscribers) use a leading indicator (users watching more content)
  • Understanding my taste = minimising prediction error
  • The Matthew effect: “the rich get richer and the poor get poorer.”
    • If popular items are always favoured, you’re also creating what we call a filter bubble.
  • Coverage: The better the diversity, the better the coverage
    • An ideal recommender system would enable users to navigate the full catalog
    • Coverage has two sides:
      • how much of the catalogue is it capable of recommending (catalogue coverage)
      • how many users can it recommend something to (user coverage)
  • Serendipity: giving the user a pleasant surprise, so that all their visits aren’t more of the same. Serendipity is subjective and difficult to calculate.
    • Don’t constrain the recommender’s returns too much. Constraining means less serendipity.
  • Measuring error metrics:
    • Measure the difference between the ratings in your historical data set with what the recommender algorithm outputs
      • The difference between a user’s rating and the recommender’s prediction of that rating
        • mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
          • Mean Absolute Error (MAE): takes the absolute value of each of the differences and finds the average of those. Absolute values are used so a high guess doesn’t cancel out a low guess.
          • Root Mean Squared Error (RMSE): puts big penalties on big errors. Big errors count more than several smaller ones.
            • If it’s important that none of your users gets bad recommendations, then you should use RMSE.
  • Measuring decision-support metrics
    • True positive (TP)— Item recommended and consumed by the user
    • False positive (FP)— Item was recommended but the user didn’t consume it.
    • False negative (FN)— The recommender didn’t include the item in a recommendation and the user consumed it.
    • True negative (TN)— It wasn’t recommended and the user didn’t consume it.
    • Precision— What fraction of the recommended items the user consumed:
    • Precision = \frac{TruePositive}{TruePositive + FalsePositive}
    • Recall— What, out of all the items that the user consumed, was recommended
    • Recall = \frac{TruePositive}{TruePositive + FalseNegative}
    • It’s considered more important to optimise precision, while not putting too much importance on whether the user gets all the possible relevant items (the recall).
    • Decision-support comes from information filtering, originally used for calculating the quality of search results
    • In the world of recommenders, the measurement is restricted to look only at the k top elements (precision at k)
      • P@k is the number of relevant items among the first k positions, divided by k
      • P@k(u) = \frac{\#\,relevant\ content\ in\ the\ top\ k\ positions}{k}
      • Mean Average Precision (MAP): averages the precision at k over the relevant positions, taking rank into account
      • Discounted Cumulative Gain (DCG): finds each relevant item and penalises it the farther down the list it appears. Much like MAP, but where MAP only works with binary relevance, DCG looks at levels of relevancy, so each item needs a relevancy score.
      • NDCG: Normalises DCG vs the optimal ranking, and is therefore a more universal benchmark across systems
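The set-based metrics above are simple to compute from the recommended and consumed items. A minimal sketch (item names invented):

```python
# Sketch: decision-support metrics from recommended vs. consumed items.

def precision(recommended, consumed):
    tp = len(set(recommended) & set(consumed))          # true positives
    return tp / len(recommended) if recommended else 0.0

def recall(recommended, consumed):
    tp = len(set(recommended) & set(consumed))
    return tp / len(consumed) if consumed else 0.0

def precision_at_k(ranked, consumed, k):
    # Only the top k positions of the ranked list count.
    return len(set(ranked[:k]) & set(consumed)) / k

recs = ["a", "b", "c", "d"]   # ranked recommendations
seen = ["b", "d", "e"]        # what the user actually consumed
print(precision(recs, seen))         # 2 of 4 recommendations consumed
print(recall(recs, seen))            # 2 of 3 consumed items recommended
print(precision_at_k(recs, seen, 2)) # only "b" appears in the top 2
```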
  • Some recommenders don’t distinguish between ratings made now and a year ago. A rating’s age might affect its ability to predict
  • You can use content-based filtering to create…
    • similar items recommendations
    • personal recommendations based on taste
  • Systems based on text first extract keywords (removing those that produce noise)
    • Use the important words you find to create a list of categories (also called topics)
    • LDA algorithm (Latent Dirichlet Allocation):
      • Latent → topics found don’t compare to any category that you know
      • Dirichlet → is the way documents are described using these topics
      • Allocation → words are allocated into topics
  • Content-based filtering is about extracting knowledge from the content
  • Content analyser: creates a model based on the content, creating a profile for each item
    • Feature extraction: extracting things that are important for the algorithm to work
    • Mapping data into something a machine can use (vector)
  • User profiler: creates a user profile (sometimes just a consumed items per user)
  • Item retriever: retrieves relevant items found by comparing user profiles to item profiles
  • Finding important words with TF-IDF
    • TF: Term frequency:
      • tf(word,document) = how many times does the word appear in document
      • Or tf(word,document) = 1 + log(word frequency)
    • The more times a word is present in a document, the higher the chance that it’s important.
    • A word is more likely to be important if it’s present in only a few documents
    • Inverse document frequency (idf in the equation)
      • The number of all your documents divided by the number of documents that contain the word (commonly taken with a log)
    • Together, the product is tf-idf, which is defined like this:
      • tf-idf(word, document) = tf(word, document) * idf(word, document)
  • When you find words that provide large TF-IDF returns, you can add those to the list of features
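The tf and idf definitions above can be sketched over a toy corpus. This variant uses raw counts for tf and a log for idf; the documents are invented:

```python
import math

# Sketch: tf-idf over a toy corpus (log-based idf variant).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenised = [d.split() for d in docs]

def tf(word, doc):
    return doc.count(word)                      # raw term frequency

def idf(word):
    n_containing = sum(1 for d in tokenised if word in d)
    return math.log(len(tokenised) / n_containing)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "cat" is rare (one document), "the" is common (two documents), so
# "cat" scores higher despite appearing less often:
print(tf_idf("cat", tokenised[0]))
print(tf_idf("the", tokenised[0]))
```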
  • Topic Modelling using the LDA
    • LDA (Short for latent Dirichlet allocation)
      • Latent: hidden (or found by the algorithm)
      • Dirichlet: a distribution of distributions
      • Allocation: all words are allocated to the latent topics
    • An LDA is a generative model
  • LDA’s use
    • Content → Topics → Words
      • Content can be described by a series of topics, topics can be described by what words are used when describing the topic
      • The goal is to connect words with topics and topics with documents!
    • Gibbs sampling begins with randomly adding topics to documents, and words to topics → it makes adjustments until it converges on something (that hopefully makes sense)
  • Most of the implementation work in making an LDA function is how you process the text you’re inputting.
  • If you want to control the topics a document contains, you can inject words into the descriptions
  • Key parameters for LDA:
    • k (number of topics): if too low, documents look similar; if too large, you might end up with no similar documents
    • High alpha: distributes each document over many topics
    • High beta: leads to topics being more similar
  • Once you have the LDA model you can predict the similarity of any two documents
    • The probability distribution can be seen as vectors, so you can use cosine similarity
    • You can also compare a document that wasn’t used to create the model (and get around the cold start problem)
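As noted above, topic distributions can be treated as vectors and compared with cosine similarity. A minimal sketch with made-up three-topic distributions:

```python
import math

# Sketch: cosine similarity between documents' LDA topic distributions.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [0.7, 0.2, 0.1]   # mostly topic 0
doc2 = [0.6, 0.3, 0.1]   # a similar mix
doc3 = [0.1, 0.1, 0.8]   # mostly topic 2
print(cosine(doc1, doc2))   # high: the mixes are close
print(cosine(doc1, doc3))   # low: different dominant topics
```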
  • To create personalised recommendations, create a user profile
    • Iterate through each of the items the user likes
    • For each item, find similar products
    • Order according to the product of similarity and the users’ ratings from the original list of consumed items
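The three steps above can be sketched as follows; the item similarities and ratings are invented:

```python
# Sketch: content-based personal recommendations from a user profile.

item_similarity = {          # item -> [(similar item, similarity)]
    "matrix": [("inception", 0.9), ("titanic", 0.2)],
    "aliens": [("inception", 0.6), ("predator", 0.8)],
}
user_ratings = {"matrix": 5, "aliens": 4}   # items the user consumed

scores = {}
for item, rating in user_ratings.items():        # step 1: liked items
    for candidate, sim in item_similarity[item]: # step 2: similar items
        if candidate not in user_ratings:        # don't re-recommend
            # step 3: accumulate similarity * rating
            scores[candidate] = scores.get(candidate, 0) + sim * rating

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

An item similar to several highly rated items ("inception" here) accumulates score from each of them and rises to the top.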
  • Pros and cons of content-based filtering:
    • Pros
      • New items are easy to add (just create the item feature vector)
      • You don’t require much traffic, as it’s the content descriptions that drive it
      • Popularity isn’t considered
    • Cons
      • Conflates liking with importance.
      • No serendipity, it’s specialised
      • Limited understanding of content, so system can misunderstand what a user likes
  • By looking at the behavioural data of users, you can find categories or topics that let you explain users’ tastes with much less granularity than with each movie but more detail than genres. These factors will enable us to position the movies that users like closer to each other.
  • Hidden genres = latent factors
    • Latent factors are defined by an algorithm
    • They can represent a user’s taste, though it might be hard to understand what the factors mean
  • You do dimension reduction by factorising the rating matrix
    • UV-decomposition lets you decompose the rating matrix into hidden features: one matrix for users and one for items
  • Often only 1% of the cells in the rating matrix have values. Two ways to solve that with imputation…
    • You can calculate the mean of each item (or user) and fill in this mean where there are zeros in each row (or column) of the matrix.
    • You can normalise each row, such that all elements are centred around zero, so the zeros will become the average
  • Baseline predictors are a better alternative.
    • Baseline predictors extract the biases for the items and the users to make a better prediction
    • If a user tends to rate below the average, adjust their predicted ratings down
    • If an item tends to be rated above the average, adjust its predicted ratings up
  • The SVD method allows you to add new users and items into the system
    • To add a new user means to add another row to the rating matrix, which in turn means that the decomposed user matrix would also have another row
    • You need user ratings for a new item before you can add it
    • You still need to recalculate regularly (daily or weekly): the reduction is done to extract topics, and these aren’t updated when you fold in a new user or item; new users and items are placed relative to the topics that are already there
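A baseline predictor of the kind described above can be sketched as the global mean plus a user bias and an item bias. The ratings are invented and there is no regularisation:

```python
# Sketch: baseline predictor b_ui = mu + b_u + b_i, where mu is the
# global mean, b_u the user bias and b_i the item bias.

ratings = {  # (user, item) -> rating
    ("ann", "matrix"): 5, ("ann", "titanic"): 3,
    ("bob", "matrix"): 4, ("bob", "titanic"): 2,
}
mu = sum(ratings.values()) / len(ratings)       # global mean

def bias(key_index, key):
    """Average deviation from mu for a user (key_index=0) or item (=1)."""
    vals = [r for k, r in ratings.items() if k[key_index] == key]
    return sum(vals) / len(vals) - mu

def baseline(user, item):
    return mu + bias(0, user) + bias(1, item)

print(baseline("ann", "matrix"))   # generous rater + popular item
```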
  • The three different types of overall hybrid recommenders
    • Monolithic: takes components of various recommenders and glues them together in new ways
    • Ensemble: runs different recommenders and combines the results into one
    • Mixed: runs a number of recommenders and returns all of their results
  • A mixed hybrid returns the union of all the results, where you have a hierarchy of recommenders
    • Consider the recommender as a scale of personalisation
    • Make the first as personalised as possible and then continue until you’re using item popularity to give recommendations
    • Often the most personalised recommender produces only one or two recommendations, and the next produces several more; that way you always have a good quantity of recommendations with as much quality as possible
  • Feature-weighted linear stacking (FWLS)
    • Makes the blend weights functions instead of constants
    • Meta features: weights as functions
    • This makes your algorithm more flexible. E.g. use content-based recommendations if the user has rated only a few items, and collaborative filtering if the user has interacted with many. These functions are called meta functions or feature-weight functions.
      • Let the machine decide what to do. You can say that you want the content-based recommender to provide 90% of the recommendation if the user has rated fewer than three items; otherwise it should be 50/50.
      • Blending the output of recommenders using weights that are functions makes the blend extremely flexible.
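The weights-as-functions idea can be sketched with one hand-written weight function, using the illustrative 90%/50-50 rule mentioned above (scores and thresholds invented):

```python
# Sketch: blend two recommenders with a weight that is a function of the
# user, rather than a constant (a simplified take on FWLS).

def content_weight(n_ratings):
    # Lean on content-based recommendations while the user is still cold.
    return 0.9 if n_ratings < 3 else 0.5

def blended_score(item, user_n_ratings, content_score, collab_score):
    w = content_weight(user_n_ratings)
    return w * content_score + (1 - w) * collab_score

# Cold user: content-based dominates. Warm user: an even split.
print(blended_score("matrix", 1, content_score=0.8, collab_score=0.4))
print(blended_score("matrix", 10, content_score=0.8, collab_score=0.4))
```

In full feature-weighted linear stacking the weight functions themselves are learned from data rather than hand-written like this one.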
  • Learning to rank (LTR) is a way to combine different kinds of data sources, such as popularity, distance, or recommender system outputs
  • You’re looking for input sources (features) that will give you an ordering of the objects
  • Foursquare’s algorithm to rank venues near you
    • Higher values should denote a shorter distance and a higher average venue rating.
    • Invert distances (short distance has higher value)
    • Set a max distance (10 minute walk)
    • Rescale the data so that everything is between 0 and 1
    • For more information on rescaling, see Wikipedia
    • You want to teach the machine to rank these items based on the input of ratings and distance.
      • You want the system to learn weights (w0 and w1), which produce a value such that the items get ordered as you’d expect on the page
  • How do you optimise the function if you don’t know what it’s supposed to return?
    • You use the check-in feature
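The venue-scoring idea above can be sketched as a weighted sum of rescaled features. The weights w0 and w1, the walk cutoff, and the venues are invented; in the real system the weights would be learned from check-ins:

```python
# Sketch: rank venues by a weighted sum of an inverted, capped distance
# feature and a rescaled rating feature.

MAX_DISTANCE = 800.0   # metres, roughly a 10-minute walk (assumed)

def distance_feature(metres):
    metres = min(metres, MAX_DISTANCE)       # cap at the max distance
    return 1.0 - metres / MAX_DISTANCE       # nearer -> closer to 1

def rating_feature(avg_rating):
    return avg_rating / 10.0                 # rescale 0..10 to 0..1

def score(metres, avg_rating, w0=0.6, w1=0.4):
    return w0 * distance_feature(metres) + w1 * rating_feature(avg_rating)

# A close, decent venue outranks a far, excellent one with these weights:
print(score(100, 7.0))
print(score(750, 9.5))
```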

Deep Summary

Longer form notes, typically condensed, reworded and de-duplicated.

Part 1: Getting ready for recommender systems

Chapter 1: What is a recommender?

  • A recommender system provides relevant content to the user based on knowledge of the user, the content and interactions between the two
  • Recommendations work best when you have lots of data and a lot of things (e.g. products or content) to recommend
  • There’s a spectrum of personalisation
    • Non-Personalised (e.g. Top 10)
    • Semi-Personalised (e.g. location)
    • Personalised
  • User profiles enable you to move from households to people
  • Explicit user ratings shouldn’t override everything
    • often aren’t as good at predicting what people will do
    • users might not want to share that data
    • have to ask for help creating a taste profile
  • Implicit ratings (based on usage data) are often more trustworthy
  • Factors to consider:
    • Explicit ratings
    • User behaviour
    • Freshness
    • Boosting (certain content/products)
    • Showing variety
  • Two levels of intelligence to recommendations:
    • Does it make the list? (E.g. is it in Popular on Netflix?)
    • What position on the list? (How the Popular on Netflix list is presented to me)
  • Types of data involved:
    • Historical data: behaviour data
    • Content metadata
    • Contextual data: device, location, alone or in group, time of day/week/year
    • Demographic: Age, gender, nationality, income, politics
  • How to describe a system:
  • Domain, purpose, context, personalisation, opinions, privacy, trustworthiness, interfaces and algorithms
    • Domain: type of content recommended
    • Purpose: purpose for the end user and the provider. North Star Metric?
    • Context: the environment the user receives the recommendation in
    • Personalisation level: non-personalised / semi-personalised / personalised. Seeded recommendations may be based on the item being viewed (customers also bought)
    • Whose opinions: experts or the masses
    • Privacy & trustworthiness: protecting user privacy; are the recommendations trusted?
    • Interface:
      • Kinds of inputs/outputs:
        • Explicit input → user gives ratings or preferences
        • Implicit input → how the user interacts with the system
        • Outputs → predictions, recommendations, filtering
      • Types of presentation:
        • Organic: recommendations are a natural part of the page
        • White-box: recommenders are explainable
        • Black-box: you can’t explain why a recommendation is made
    • Algorithms:
      • Collaborative filtering: employs usage data, e.g. finding similar users and content they like. Needs lots of data.
      • Content-based: uses content metadata and user profiles, e.g. movie metadata. Needs good descriptions.
      • Hybrid: a mix of the two types
  • There’s a tradeoff between accuracy and explainability (or interpretation)
  • Can you predict what a user needs right now?
  • Can you give people a personal service?

Chapter 2: User behaviour and how to collect it

  • Data is really important. Garbage in, garbage out
  • Two types of feedback:
    • Explicit: ratings or likes
    • Implicit: activity (deduced from monitoring behaviour)
  • Behavioural evidence is more powerful than a user’s declared taste
    • May say they love french movies, but actually watch more super hero movies
  • Browser Behaviour events can demonstrate interest (in order)
    • View + page duration
    • Scroll a list or browse a category
    • Mouse over for details
    • Expansion clicks (view more details)
    • Search terms
    • Save for later
    • Buy or consume
      • Content: Start, stop, speed, play to end, replay
    • or they can demonstrate the opposite (stop watching after three minutes and never revisit)
  • What to choose:
    • High user involvement → collaborative filtering
    • Low user involvement / 1 time visits → content-based
  • Visitor ratings
    • Submit preferences
    • Save a rating
    • Negative ratings
    • Voting (TripAdvisor, Hacker News)
  • Recommenders only work if you can identify the users; forcing sign-up/login can help
  • Storage is cheap, save as much data as you can

Chapter 3: Monitoring the system

  • Create a basic funnel dashboard:
    • Visitors → Conversion → Revenue
    • Show KPIs over a historical period
  • Conversion rate = number of goal achievements / number of visits
    • you can measure conversion by users or visits
    • Amazon would choose visit, a car company might choose a customer
  • Conversion path is the sequence of pages and actions a user takes from the time of arrival on the landing page until conversion

Chapter 4: Ratings and how to calculate them

  • You can compute implicit ratings by converting web behaviour into content ratings.
    • Data for implicit ratings is usually easier to collect than explicit ratings from the user
  • Before starting consider:
    • Your site’s purpose
    • The events that happen ahead of consumption
    • How often those events occur
  • User-item preferences are often expressed in a user-item matrix.
    • row for each user
    • column for each item
    • ratings in each cell
  • The rating binds the user, the item and their sentiment toward it
  • Expect empty cells to be common (the sparsity problem). Many recommender systems will try to predict values for empty cells.
  • Explicit ratings don’t always reveal their true preferences.
  • To calculate an implicit rating: take all events recorded between each user and item, and determine a number that indicates how much they like the item
    • Give each behaviour event a value (positive or negative)
    • Aggregate the values to create a score that predicts likeliness to buy
    • Events clustered in the same session might mean less than those spread across sessions
    • Amazon can get by with just using purchase events as they have so many customers
      • others should expect to use all browsing history
  • Remember that date/time/day might change what’s best to recommend
  • You can introduce time-decay to your algorithm to give recent signals higher importance
  • Normalise the values so they’re all in the same range
  • Once you have a User-item Matrix with your values you can use a naive Bayes classifier to predict how likely the behaviour is to lead to a buy
  • Term Frequency - Inverse Document Frequency (TF-IDF)
    • Less frequent items provide more value
    • A user buying a popular item doesn’t say much about taste
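The implicit-rating recipe in this chapter (score each event, decay old signals, aggregate) can be sketched as follows. The event weights and half-life are invented:

```python
# Sketch: turn logged events into an implicit rating using per-event
# weights and exponential time decay (weights and half-life invented).

EVENT_WEIGHTS = {"view": 1.0, "save": 3.0, "buy": 10.0}
HALF_LIFE_DAYS = 30.0   # a signal loses half its weight every 30 days

def implicit_rating(events):
    """events: list of (event_type, days_ago) for one user-item pair."""
    score = 0.0
    for event_type, days_ago in events:
        decay = 0.5 ** (days_ago / HALF_LIFE_DAYS)
        score += EVENT_WEIGHTS[event_type] * decay
    return score

recent = [("view", 1), ("buy", 2)]
stale = [("view", 200), ("buy", 210)]
print(implicit_rating(recent))   # recent activity scores high
print(implicit_rating(stale))    # the same events, decayed away
```

In practice you would then normalise these scores into a common range, as the chapter suggests, before putting them in the user-item matrix.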

Chapter 5: Non-Personalised Recommendations

  • Start with non-personalised recommendations, then graduate to personalised
    • Something to surface before you ‘know your users’
    • Examples:
      • Most popular: humans are flock animals
      • People who bought x also bought y (seeded recommendation)
      • Frequently bought together (seeded recommendation)
      • Most liked
  • Content should be ordered based on what you think most users are interested in
    • Use recency to keep your product dynamic
    • For some categories like films, most-recent-first is a sensible rule
    • For discount vouchers, highest-discount-first is a good option
  • On deploying:
    • calculate as much offline as you can
    • fail gracefully (fall back to something simple, or a hard coded list)
    • keep it separate from other parts of your website
  • Seeded recommendations use an item as an input to find others (e.g. frequently bought together)
  • Creating Frequently Bought Together
    • Do affinity analysis to determine links between products
      • Called frequent sets when products are seen in the same basket together
    • Most products are bought with popular items (e.g. milk and bananas in grocery)
    • Instead, find the products that are bought together but not with everything else
      • To do this, divide the number of times the items are bought together by the number of times the first one appears
  • Hold multiple databases (with version numbers) of recommendations. This allows you to fall back.
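The frequently-bought-together calculation above amounts to an association-rule confidence over baskets. A minimal sketch with invented baskets:

```python
# Sketch: confidence(X -> Y) = count(baskets with X and Y) / count(X),
# which damps links to items that are simply bought with everything.

baskets = [
    {"milk", "bananas", "nappies"},
    {"milk", "beer", "nappies"},
    {"milk", "bananas"},
    {"beer", "nappies"},
]

def confidence(x, y):
    with_x = [b for b in baskets if x in b]
    both = [b for b in with_x if y in b]
    return len(both) / len(with_x)

# nappies -> beer is a stronger link than nappies -> bananas here:
print(confidence("nappies", "beer"))
print(confidence("nappies", "bananas"))
```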

Chapter 6: The User (and content) who came in from the cold

  • The cold start: when you don’t have knowledge of your users, so can’t personalise well
    • Can apply to users and items
    • Gray sheep → users who have individual tastes (no other users have bought what they have)
  • Boost cold products → make an effort to place them where they’ll get noticed
  • Cold visitor tactics:
    • Ask them for a few explicit ratings
      • If asking a user to rate items - use active learning (the art of selecting items to rate that will teach you the most about preferences)
    • Look into their search terms
    • How soon you decide to surface personal recommendations is a trade-off between quality and time
    • You can use recommenders that fall back to most-popular content for cold visitors
  • Cold items are best handled using content based filtering
  • Machine Learning isn’t magic, it can only find signals that actually exist in the data.
  • Create user segments, and see how quickly you can classify people into a segment
    • Try demographic recommendations / personalised
    • Try and get more data (e.g. facebook account)
  • Create association rules based on early behaviour actions (and business logic)
  • Be careful to show only items that are appropriate based on items they’re already viewing
  • Segments can be created using cluster analysis
    • Cluster: a group with similar traits
    • You can find hidden traits and similarities in the data
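Cluster analysis for user segments can be sketched with a tiny k-means loop. The features and user data are invented, and a real system would use a library implementation:

```python
import random

# Minimal k-means sketch: segment users by two behavioural features.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(dim) / n for dim in zip(*cluster))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to nearest centre
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]   # recompute centres
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# users described by (visits per week, avg. session minutes) - invented
users = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (9, 9)]
centroids, clusters = kmeans(users, k=2)
print(centroids)
```

Each resulting centroid describes a segment (here, roughly "light" vs. "heavy" users); a new visitor is classified by the nearest centroid.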
  • You can use meta-data from an item to find similar items
    • Grouping and abstraction is really helpful…
      • Art by artist
      • News by topic
      • Articles by author
      • Song to artist
      • Product to brand
    • Need to get the abstraction or generalisation just right
      • too general and it will lose its value
      • too specific and it won’t enable you to connect any more items

Part II: Recommender Algorithms

Chapter 7: Finding similarities among users and content

  • Personalised recommendations almost always contain calculations of similarity
    • Finding items like the ones you like
    • Finding users who like what you like (e.g. have similar taste)
  • Similarity: Given two items, create a function that scores their similarity between 0 and 1
    • 0 = nothing in common, 1 = identical objects
  • Jaccard similarity is used to compare sets (e.g. a set of movies a user has bought)
  • Choosing a similarity measure by data type:
    • Unary data (e.g. Person A likes films B and X; Person B likes films A and B) → Jaccard similarity
    • Binary data (e.g. Person A dislikes film B and likes film H; Person B likes film B and dislikes film Q) → Jaccard similarity
    • Quantitative data (e.g. Person A rated film B 5/10; Person B rated film B 8/10) → Pearson or cosine similarity
  • Jaccard similarity: a measure that indicates how close two sets are to each other
    • Also called the Jaccard index or Jaccard similarity coefficient; the Jaccard distance is 1 minus the similarity
  • Think of each piece of content as being a bag that contains all the users who watched it. Here’s an example of two sets:
    • Movie A: Person A, Person B, Person D, Person G
    • Movie B: Person A, Person B, Person K, Person Y
  • Or you could represent the same data in a ‘Unary user-item’ matrix
| Watched / Bought | Film A | Film B | Film C |
| --- | --- | --- | --- |
| Person A | 1 | 0 | 1 |
| Person B | 1 | 0 | 0 |
| Person C | 1 | 0 | 1 |
| Person D | 1 | 1 | 0 |
  • To find similarity between two items, calculate how many users bought both items, then divide by how many users bought either one (or both).
$Similarity_{Jaccard}(i,j) = \frac{\#\,\text{users who bought both items}}{\#\,\text{users who bought either } i \text{ or } j}$
  • How that’s calculated:
| Watched / Bought | Film A | Film B | Similar or not? |
| --- | --- | --- | --- |
| Person A | 1 | 0 | 0 |
| Person B | 1 | 0 | 0 |
| Person C | 1 | 0 | 0 |
| Person D | 1 | 1 | 1 |
| Sum |  |  | 1 |
  • Here 1 user bought both films, out of 4 who bought either, so the Jaccard similarity is 1/4 = 0.25
  • Similarity is domain specific, so think of it as a relative measure
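The Jaccard calculation above can be sketched in a few lines of Python (the two sets are the example users for Movie A and Movie B; names are illustrative):

```python
def jaccard_similarity(a, b):
    """Share of users in both sets out of users in either: 0 = nothing in common, 1 = identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# the two example movies from above
movie_a = {"Person A", "Person B", "Person D", "Person G"}
movie_b = {"Person A", "Person B", "Person K", "Person Y"}
print(jaccard_similarity(movie_a, movie_b))  # 2 shared users out of 6 in total
```

Two users (Person A and Person B) appear in both sets out of six distinct users, giving 2/6 ≈ 0.33.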
  • Measuring distance with $L_p$-norms
  • Calculating the $L_1$ norm (Manhattan distance or taxicab geometry)
    • If two users have rated a movie from 1-10, you can measure the distance between those ratings as $|r_{userA,i} - r_{userB,i}|$ for each movie $i$
    • You can then normalise that back to a scale of 0 to 1
    • You can then do that across all movies
    Calculating mean absolute error (MAE)
    • MAE is the average of the $L_1$ norm; you're dividing by n to get the average distance between ratings
    • $MAE = \frac{1}{n}\sum_{i=1}^n |r_{userA,i} - r_{userB,i}|$
    Calculating the $L_2$ norm (Euclidean norm)
    • If the $L_1$ norm was the distance travelled by a taxicab between two points, the $L_2$ norm is the distance as the crow flies
    $distance(userA, userB) = \|r_{userA} - r_{userB}\|_2 = \sqrt{\sum_{i=1}^n |r_{userA,i} - r_{userB,i}|^2}$
    • The Euclidean norm is often used as a measure of how well machine learning models perform. Taking the square root of the averaged squared differences gives the root mean squared error (RMSE)
    $Root\,Mean\,Squared\,Error\,(RMSE) = \sqrt{\frac{1}{n} \sum_{i=1}^n (r_{userA,i} - r_{userB,i})^2}$
    • You can turn this into a similarity by taking 1 over (1 + the sum of squared differences)
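As a sketch, the distance measures above (and the distance-to-similarity trick) look like this in Python; the rating vectors are made up for illustration:

```python
import numpy as np

def manhattan(a, b):
    """L1 norm: the taxicab distance between two rating vectors."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def euclidean(a, b):
    """L2 norm: the straight-line (as the crow flies) distance."""
    return float(np.sqrt(((np.asarray(a, float) - np.asarray(b, float)) ** 2).sum()))

def rmse(a, b):
    """Root mean squared error: square root of the average squared difference."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.sqrt((d ** 2).mean()))

def distance_similarity(a, b):
    """Turn the sum of squared differences into a 0..1 similarity."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return 1.0 / (1.0 + (d ** 2).sum())

user_a, user_b = [4, 7, 9], [5, 5, 8]   # two users' ratings of the same three movies
print(manhattan(user_a, user_b), euclidean(user_a, user_b), rmse(user_a, user_b))
```

For these vectors the Manhattan distance is 4, the Euclidean distance is √6, and the RMSE is √2.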
    Calculating cosine similarity (add normalisation to capture that users rate differently)
    • You can look at the rows of the rating matrix as vectors in space, and then look at the angle between them
    | Rating | Film A | Film B |
    | --- | --- | --- |
    | Person A | 3 | 5 |
    | Person B | 4 | 1 |
    | Person C | 2 | 5 |
    • You could plot these as positions on a 2D graph (as there are only 2 films) but it can work in multiple dimensions too
      • The Rating of Film A can be the x axis, Rating of Film B can be the y axis
      • You then plot the people on the chart as per their film ratings
      • You measure the angle from (0,0), where the x and y axes intersect, to each of the users. The closer the users' angles, the more similar their preferences (tastes)
      • image
    • Amazon finds item-item collaborative filtering more efficient, as they have more customers than items. There are fewer similarities to calculate when done for items
    • You can calculate the angle using the following cosine function:
    sim(i,j)=rirjr12rj2=uri,urj,uuri2,uurj2,usim(i,j)=\frac{r_i·r_j}{||r_1||_2||r_j||_2}=\frac{\sum_ur_i,_ur_j,_u}{\sqrt{\sum_ur_i^2,_u}\sqrt{\sum_ur_j^2,_u}}
    • But because this is a comparison of user ratings, we need to take into account that users use rating scales differently (some users might rate higher, have lots of 10s and no 1s).
      • Badrul Sarwar et al. came up with an adjusted cosine similarity to offset this drawback by subtracting each user's average rating
      • $sim(i,j) = \frac{\sum_u (r_{i,u} - \bar{r}_u)(r_{j,u} - \bar{r}_u)}{\sqrt{\sum_u (r_{i,u} - \bar{r}_u)^2} \, \sqrt{\sum_u (r_{j,u} - \bar{r}_u)^2}}$
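A minimal sketch of plain and adjusted cosine similarity, using the small rating table above; the per-user mean-centring is the Sarwar et al. adjustment:

```python
import numpy as np

def cosine_sim(a, b):
    """Angle-based similarity between two rating vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjusted_cosine(ratings, i, j):
    """Item-item similarity on a user x item matrix, after subtracting
    each user's average rating (the Sarwar et al. adjustment)."""
    r = np.asarray(ratings, float)
    centred = r - r.mean(axis=1, keepdims=True)   # remove per-user rating bias
    return cosine_sim(centred[:, i], centred[:, j])

ratings = [[3, 5], [4, 1], [2, 5]]      # Persons A, B, C rating Films A and B
print(cosine_sim([3, 5], [4, 1]))       # raw cosine of Person A vs Person B
print(adjusted_cosine(ratings, 0, 1))   # Film A vs Film B after centring
```

After centring, the two film columns point in exactly opposite directions, so the adjusted cosine comes out at -1.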
    Calculating Pearson’s correlation coefficient
    image
    • If you plot ratings on the y axis and items on the x axis you can start to see which users have similar taste
    • Pearson’s correlation coefficient looks at the points and measures how different each point is on average
      • Scale: Opposite = -1 → Identical = 1
      • No connection = 0
    • Pearson also controls for user rating differences by subtracting the average
    sim(i,j)=eU(ri,uri)(rj,urˉj)ˉeU(ri,urˉi)2eU(rj,urˉj)2sim(i,j)=\frac{\sum_{e\in U(r_{i,u}-\bar{r_i)(r_{j,u}-\bar{r}_j)}}}{\sqrt{\sum_e\in U(r_{i,u}-\bar{r}_i)^2}\sqrt{\sum_e\in U(r_{j,u}-\bar{r}_j)^2}}
    • this is for the set of users which have rated both i and j
  • A user's rating should always be viewed in relation to that user's other ratings. To normalise ratings, subtract the user's average rating from each of their ratings
    • Normalising ratings around zero like this means you’ll get positive and negative numbers
  • Pearson correlation is similar to cosine. Adjusted cosine controls for normalisation, which makes it the same function as Pearson's… so the only real difference is
    • Pearson uses → items that two users have rated
    • Adjusted cosine uses → all rated items by either or both (setting ratings to zero when one of the users doesn’t rate an item)
    • ⚠️
      This zero effect will push similarity toward zero if there is little overlap between users
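Pearson's coefficient can be sketched directly from the definition; the example rating vectors are illustrative:

```python
import numpy as np

def pearson_sim(x, y):
    """Pearson correlation over co-rated items: -1 = opposite taste, 0 = no connection, 1 = identical."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()        # subtract each average rating
    denom = np.sqrt((xd ** 2).sum()) * np.sqrt((yd ** 2).sum())
    return float((xd * yd).sum() / denom) if denom else 0.0

print(pearson_sim([1, 2, 3], [2, 4, 6]))   # same taste on a different scale -> 1.0
print(pearson_sim([1, 2, 3], [6, 4, 2]))   # opposite taste -> -1.0
```

Note how the first pair gets a perfect score even though the raw ratings differ: subtracting the means removes the scale difference between the two raters.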
Calculating k-means clusters (segmenting users can help save compute)
  • Calculating similarity between users or items is the achilles heel of neighbourhood collaborative filtering algorithms.
    • The algorithm needs a similarity score from each user to all others (or each item to all other items)
    • Divide your data set into smaller groups to reduce the amount of calculations
      • If you cluster users into 4 groups → you can first look up the group the user is in → then you don’t have to compare them to the other 3 groups (reducing time and compute you’re spending creating user-user similarity)
    • Clustering is also called segmentation
    • k-means clustering is an unsupervised ML algorithm.
    • You need to give it a parameter for it to run (it’s a parametric algorithm)
    • You provide the number of clusters it should find (k)
    • It will then find k points (called centroids) that minimise the sum of the distances between all items and their centroids
  • You can use clusters to…
    • Find similar users
      1. Pre-compute clusters
      2. Check what cluster they are in
      3. Find the most similar people in that cluster
    • Place new users in groups
      • Find what centroid they are closest to, which will determine their group
  • Scenario: Find a user that’s similar to me, so you can recommend things they like that I haven’t yet come across. You need…
    • Enough items in common to determine that your tastes match
    • They also need a set of items you haven’t seen / come across
    • In a diagram…
      image
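The clustering idea above can be sketched with a toy k-means (naive initialisation from the first k points; a real system would use scikit-learn or k-means++), including placing a new user by nearest centroid:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: returns (centroids, labels)."""
    pts = np.asarray(points, float)
    centroids = pts[:k].copy()   # naive init: the first k points
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = pts[labels == c].mean(axis=0)
    return centroids, labels

def assign_new_user(user, centroids):
    """Place a new user in the group of the closest centroid."""
    return int(np.linalg.norm(centroids - np.asarray(user, float), axis=1).argmin())

# two obvious user groups in a 2-D taste space (made-up data)
users = [[1, 1], [8, 8], [1.2, 0.8], [0.9, 1.1], [8.2, 7.9], [7.8, 8.1]]
centroids, labels = kmeans(users, k=2)
print(labels, assign_new_user([7.9, 8.0], centroids))
```

Once the clusters are pre-computed, a new user only has to be compared against k centroids instead of against every other user.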

Chapter 8: Collaborative filtering (neighbourhood-based collaborative filtering)

  • Collaborative filtering is the art of using ratings for generating recommendations. It was developed to fight information overload
  • Assume that people keep their tastes over time → if you agree with somebody in the past, you’ll likely agree with them again in future
  • Two ways to do neighbourhood:
    • Find users with similar tastes and recommend things they like that you haven’t seen
    • Find items similar to items that you already like (item based filtering)
  • User-item matrix
| Rating | Film A | Film B | Film C |
| --- | --- | --- | --- |
| Person A | 5 | 3 | 1 |
| Person B | 4 | 5 |  |
| Person C | 3 | 5 | 3 |
| Person D | 5 |  | 4 |
  • The task is to calculate predictions for the empty cells, using the data already in the matrix
  • It's normal to have more empty cells than filled ones
  • In ML pipelines you do as much as possible before the user visits expecting recommendations
    • Most users don't have many item ratings. Adding a new rating can change the calculation of a user's taste
    • Don’t pre-compute which users are similar, but you can compute what items are similar
    • If you have more users than items → use item-based filtering
    • If you have more items than users → use user-based filtering
  • User similarity is good for serendipity (connecting you with dissimilar content that you'll rate highly), whereas item similarity will only provide similar items
  • Data needs to be well connected to make recommendations
    • If no users have rated content, no recommendations will be made
    • If your tastes don’t overlap with others, users won’t receive good recommendations
  • Steps to implement collaborative filtering:
    1. Find all the items that are rated by several users
    2. Calculate the number of users who are connected to one or more of those items
    3. Other users won’t receive recommendations
  • Your system doesn’t need any domain knowledge to do collaborative filtering
  • Item similarity is stable, so it can be calculated beforehand (or offline)
  • Normalise your rating matrix before using it
  • You can calculate the cosine similarity of each item-pair
    1. Cosine similarity of each item-pair:
       | Cosine Similarity | Film A | Film B | Film C |
       | --- | --- | --- | --- |
       | Film A | 1 | 0.63 | -0.7 |
       | Film B | 0.63 | 1 | 0.78 |
       | Film C | -0.7 | 0.78 | 1 |
    2. In this example, people who rate Film A highly will rate Film C low
    3. People who rate Film B highly will rate Film C highly
  • Be careful if calculating similarity between users that have too few items in common
    • Always require a minimum of overlapping users (number of users who have rated both films)
  • The collaborative filtering tradeoff
    • Narrow the number of users and you risk not finding any content to recommend
    • Make it too wide and you risk recommendations containing content outside of the users taste
  • How to select a neighbourhood:
    • Neighbourhood = a set of items that are similar to the content you’re looking at
    • Techniques:
      • K-means clustering → not great if your user is on the border of the cluster, or if the cluster has a strange shape
      • Top-N → define a number N as the number of neighbours that should be in the neighbourhood, and then say that all items have N fellow items in the neighbourhood
        • Top-N can force the system to use points that are dissimilar (bad for recommendations)
      • Threshold → only want items of a certain standard to be in the neighbourhood by requiring that the similarity needs to be above a specific constant
        • calibrate a threshold to show more niche or more popular items
  • Calculating predicted ratings (classification or regression)
    1. $Pred(u,i) = \bar{r}_u + \frac{\sum_{j \in S_i} \left( sim(i,j) \times r_{u,j} \right)}{\sum_{j \in S_i} sim(i,j)}$
    2. $\bar{r}_u$ is the average rating of user $u$
    3. $r_{u,j}$ is the active user $u$'s (normalised) rating of item $j$
    4. $S_i$ is the set of items in the neighbourhood
    5. $Pred(u,i)$ is the predicted rating for user $u$ of item $i$
    6. $sim(i,j)$ is the similarity between item $i$ and item $j$
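The prediction formula can be sketched as follows; the neighbour ratings are mean-centred here, per the normalisation advice above, and all numbers are made up:

```python
import numpy as np

def predict_rating(user_mean, neighbour_sims, neighbour_ratings):
    """Similarity-weighted average over the item neighbourhood; the neighbour
    ratings are mean-centred so the user's rating scale is added back only once."""
    sims = np.asarray(neighbour_sims, float)
    rates = np.asarray(neighbour_ratings, float)
    return float(user_mean + (sims * (rates - user_mean)).sum() / sims.sum())

# the user averages 3.5; two neighbours of the target item were rated 5 and 4,
# with similarities 0.9 and 0.6 to the target item
print(predict_rating(3.5, [0.9, 0.6], [5, 4]))  # about 4.6
```

Because both neighbour ratings sit above the user's average, the predicted rating lands above it too, pulled hardest by the more similar neighbour.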
  • Cold start problems:
    • Collaborative filtering requires data, which you won’t have for new users or items
      • Hacks:
        • Ask new users to rate a few items when they arrive
        • Create a new arrivals list to showcase new products
        • Use explore/exploit methods to inject new data in say every 6th recommendation slot
  • In summary
    • Find similar items to the ones the active user likes
    • Calculate predicted ratings for these items
    • Use predictions to calculate recommendations
  • Decisions you’ll have to make along the way
| When | What decisions |
| --- | --- |
| Ratings | Only the positive ones? Only the most recent ones? How should you normalise them? |
| Similarity | How many users need to rate an item-pair for its similarity to be calculated? Should you use only positive similarities? |
| Neighbourhood | What method to select a neighbourhood? How big should it be? |
| Predicting | Should you use classification or regression? Should you use a weighted average? |
| Returning | Should you return them all? Or just the ones above a threshold? |
  • Pros and cons of collaborative filtering:
    • Cons:
      • Sparsity - most data sets aren’t dense enough to recommend items other than the most popular
      • Gray sheep - hard to find recommendations for users with unique tastes
      • Number of ratings - you need lots of user ratings
      • Similarity - it’s content-agnostic, based on behaviour, and will skew to popular content

Chapter 9: Evaluating and testing your recommender

  • The Netflix Prize frames the recommendation problem as accurately predicting ratings.
    • The best recommender systems also account for diversity, context, evidence, freshness, and novelty
  • Be clear on why you’re implementing a recommender… be clear about what you want to achieve
  • All data applications are ‘living’ → they require maintenance and monitoring
  • Implementation:
    1. Start with the simplest algorithm first → then add complexity if needed
    2. Test on a small data set
    3. Coded and tested
    4. Offline algorithm evaluation
    5. Introduce to humans (control group)
    6. Release to a small % of users
  • Be clear about your success metrics.
  • If the key success metric is lagging (retention of subscribers) use a leading indicator (users watching more content)
  • Frame hypothesis: the goal of the test (e.g. Recommender B produces recommendations that are clicked more often than recommendations from Recommender A)
  • It can make more sense to say that a recommendation algorithm is successful if a user clicks one or more times on the recommendations. Success depends on the domain.
  • How can you make the customers happy?
    • The site understands my tastes
    • The site gives me a nice variety of recommendations
    • The site surprises me
    • The site covers everything in the catalogue (business goal)
  • Understanding my taste = minimising prediction error
  • The Matthew effect: “the rich get richer and the poor get poorer.”
    • If popular items are always favoured, you’re also creating what we call a filter bubble.
  • Coverage: The better the diversity, the better the coverage
    • An ideal recommender system would enable users to navigate the full catalog
    • Coverage has two sides:
      • how much of the catalogue is it capable of recommending (catalogue coverage)
      • how many users can it recommend something to (user coverage)
    • You can calculate user coverage by iterating over all users, and seeing what it returns
      • item-item collaborative filtering → covered 13% of movies
      • content-based recommenders might be better
  • Serendipity: Serendipity is about giving the user that sensation, so that all their visits aren’t more of the same. Serendipity is subjective and difficult to calculate.
    • Don’t constrain the recommender’s returns too much. Constraining means less serendipity.
  • Types of evaluation:
    • To do a true evaluation, you need a complete ground truth set
      • A data set containing information about all combinations of users and content.
      • You need to assume the user-item combinations are the truth and representative of all users
  • The 3 testing scenarios:
    • offline experiments
    • controlled user experiments
    • online experiments
  • Offline evaluation uses data that you regard as truthful. Split the data into two parts, feed one part to the recommender, and use the other part to verify that it predicts ratings for the items that were hidden from it
    • a recommender can pass an offline evaluation with flying colours but still fail miserably in production
  • Measuring error metrics:
    • Measure the difference between the ratings in your historical data set with what the recommender algorithm outputs
      • The difference between a user’s rating and the recommender’s prediction of that rating
        • mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
          • Mean Absolute Error (MAE): takes the absolute value of each of the differences and finds the average of those. Absolute values are used so a high guess doesn’t cancel out a low guess.
          • Root Mean Squared Error (RMSE): puts big penalties on big errors. Big errors count more than several smaller ones.
            • If it’s important that none of your users gets bad recommendations, then you should use RMSE.
    • Adjustments for skewed data:
      • Weight all users evenly. So those that rate only a few are still considered
      • Split popular items and long-tail items into two separate datasets
  • Measuring decision-support metrics
    • True positive (TP)— Item recommended and consumed by the user
    • False positive (FP)— Item was recommended but the user didn’t consume it.
    • False negative (FN)— The recommender didn’t include the item in a recommendation and the user consumed it.
    • True negative (TN)— It wasn’t recommended and the user didn’t consume it.
    • Precision— What fraction of the recommended items the user consumed:
    • Precision=TruePositiveTruePositive+FalsePositivePrecision = \frac{TruePositive}{TruePositive + FalsePositive}
    • Recall— What, out of all the items that the user consumed, was recommended
    • Recall=TruePositiveTruePositive+FalseNegativeRecall = \frac{TruePositive}{TruePositive + FalseNegative}
    • It’s considered more important to optimise precision, while not putting too much importance on whether the user gets all the possible relevant items (the recall).
    • Decision-support comes from information filtering, originally used for calculating the quality of search results
    • In the world of recommenders, the measurement is restricted to the top k elements (precision at k)
      • The top k elements are measured by counting the relevant items among the first k items
      • $P@k(u) = \frac{\#\,\text{relevant content in the top } k \text{ positions}}{k}$
      • Mean Average Precision(MAP): calculates the average of each precision at k, takes rank into account
      • Discounted Cumulative Gain: finds each relevant item and then penalises the items the farther down it is in the list. It’s much like the MAP that you looked at earlier. But where the MAP only works with relevance, the DCG looks at levels of relevancy. So each item needs a relevancy score.
      • NDCG: Normalises DCG vs the optimal ranking, and is therefore a more universal benchmark across systems
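The decision-support metrics above are straightforward to sketch; the ranked list and consumed set are illustrative:

```python
def precision_recall(recommended, consumed):
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    recommended, consumed = set(recommended), set(consumed)
    tp = len(recommended & consumed)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(consumed) if consumed else 0.0
    return precision, recall

def precision_at_k(ranked, consumed, k):
    """Fraction of the top-k recommendations that the user actually consumed."""
    return sum(1 for item in ranked[:k] if item in set(consumed)) / k

ranked = ["a", "b", "c", "d", "e"]     # recommendations, best first
consumed = {"a", "c", "f"}             # what the user actually consumed
print(precision_recall(ranked, consumed))      # 2 of 5 recommended, 2 of 3 consumed
print(precision_at_k(ranked, consumed, k=3))   # 2 of the top 3 are relevant
```

Here "f" was consumed but never recommended (a false negative, hurting recall), while "b", "d", and "e" were recommended but ignored (false positives, hurting precision).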
  • Preparing data:
    • Filter out users with only a small number of interactions
    • If you have too much data → do stratified sampling
  • Split the data into training, test and validation
    • The test set is where you calculate how good the algorithm is at predicting the produced ratings
    • It makes sense to split most recommendation data by time
    • You can also split user ratings (not users) between sets
  • Some recommenders don’t distinguish between ratings done now, or a year ago. A ratings age might have an effect on their ability to predict
  • Online Evaluation
    • Controlled Experiments → A/B testing
    • Determine if the recommendations are good, show two different types of recommendations to the user, and see which seems to work better
    • You can use explore/exploit or multi-armed bandits to control rollout
  • Keep feedback loops in mind. If the recommender works it might hinder diversity.
  • Find a way to inject new items into the loop (Consumed → in training data → model → in recommendation)

Chapter 10: Content-based filtering

  • Earlier chapters created recommendations using only the interactions between users and content; content-based filtering instead works from the content itself
  • You can use content-based filtering to create…
    • similar items recommendations
    • personal recommendations based on taste
  • Finding content visitors think is similar is key. Typically systems will use tags or textual descriptions.
  • Systems based on text first extract keywords (removing those that produce noise)
    • Use the important words you find to create a list of categories (also called topics)
    • LDA algorithm (Latent Dirichlet Allocation):
      • Latent → topics found don’t compare to any category that you know
      • Dirichlet → is the way documents are described using these topics
      • Allocation → words are allocated into topics
  • Content-based filtering is about extracting knowledge from the content
  • You need the following modules:
    • Content analyser: creates a model based on the content, creating a profile for each item
      • Feature extraction: extracting things that are important for the algorithm to work
      • Mapping data into something a machine can use (vector)
    • User profiler: creates a user profile (sometimes just a consumed items per user)
    • Item retriever: retrieves relevant items found by comparing user profiles to item profiles
  • Metadata can be facts or tags (facts are things like 'film produced in 2021'; tags can mean different things to different people, e.g. 'uplifting')
    • One of the biggest showstoppers for developers trying to use content-based recommenders is that they can’t get the data about the items.
      • You could build it yourself or you could hire people to go through the content and tag
  • To represent basic film tags, you can make a simple table:
    | Year | Starring Ben Affleck | Action | Explosions |
    | --- | --- | --- | --- |
    | 2016 | 0 | 1 | 5 |
    • Expect a long list of features
    • A feature needs to appear in enough of the entries to be relevant (an actor that's in only one film won't help - you need to find things in common)
  • Normalise the data to be between 0 and 1 (scale things like year ‘2016’ too)
  • TF-IDF → what words are in the article? How often do they appear?
  • If you’re using descriptions, use a bag-of-words model to split descriptions into an array of words (and remove words from a stopword list)
    • you can also look at removing high frequency words that appear in everything
  • Finding important words with TF-IDF
    • TF: Term frequency:
      • tf(word,document) = how many times does the word appear in document
      • Or tf(word,document) = 1 + log(word frequency)
    • The more times a word is present in a document, the higher the chance that it’s important.
    • It's more likely to be important if the word is only present in a few documents
    • Inverse term document frequency (idf in the equation)
      • The number of all your documents divided by the number of documents that contain the word.
    • Together, the product is tf-idf, which is defined like this:
      • $tf\text{-}idf(word, document) = tf(word, document) \times idf(word, document)$
  • When you find words that provide large TF-IDF returns, you can add those to the list of features
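A bare-bones tf-idf sketch using raw term counts (the documents are made up; a real pipeline would also strip stopwords first):

```python
import math

def tf_idf(word, document, corpus):
    """Raw term count times log(N / number of documents containing the word).
    Documents are lists of lower-cased tokens."""
    tf = document.count(word)
    containing = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

docs = [
    "the raid is an action film".split(),
    "the film about a quiet garden".split(),
    "the action and more action all day".split(),
]
print(tf_idf("action", docs[0], docs))  # in 2 of 3 docs -> a useful feature
print(tf_idf("the", docs[0], docs))     # in every doc -> idf = 0, no signal
```

Words that appear everywhere score zero regardless of their frequency, which is exactly why tf-idf is a handy filter before building a feature list.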
  • Most people now use LDA (not TF-IDF), but many use TF-IDF to clean the input to LDA
  • Topic Modelling using the LDA
    • LDA (Short for latent Dirichlet allocation)
      • Latent: hidden (or found by the algorithm)
      • Distribution of distributions
      • Allocation: all words are allocated to the latent topics
    • An LDA is a generative model
  • LDA’s use
    • Content → Topics → Words
      • Content can be described by a series of topics, topics can be described by what words are used when describing the topic
      • The goal is to connect words with topics and topics with documents!
    • Gibbs sampling begins with randomly adding topics to documents, and words to topics → it makes adjustments until it converges on something (that hopefully makes sense)
  • Most of the implementation work in making an LDA function is how you process the text you’re inputting.
  • If you want to control the topics a document contains, you can inject words into the descriptions
  • Key parameters for LDA:
    • k (number of topics) if too low, then documents look similar. If too large, you might end up with no similar documents.
    • High Alpha: distribute each document over many topics
    • High Beta: leads to topics being more similar
  • Once you have the LDA model you can predict the similarity of any two documents
    • The probability distribution can be seen as vectors, so you can use cosine similarity
    • You can also compare a document that wasn’t used to create the model (and get around the cold start problem)
  • To create personalised recommendations, create a user profile
    • Iterate through each of the items the user likes
    • For each item, find similar products
    • Order according to the product of similarity and the users’ ratings from the original list of consumed items
  • Pros and cons of content-based filtering:
    • Pros
      • New items are easy to add (just create the item feature vector)
      • You don’t require much traffic - as it’s the content descriptions that’s driving it
      • Popularity isn’t considered
    • Cons
      • Conflates liking with importance.
      • No serendipity, it’s specialised
      • Limited understanding of content, so system can misunderstand what a user likes

Chapter 11: Finding hidden genres with matrix factorisation

  • Hidden genres = latent factors
    • Latent factors are defined by an algorithm
    • They can represent a user's taste, but it might be hard to understand what the factors mean
    • The Netflix prize was won by an ensemble recommender algorithm too complicated to put into production
  • SVD = Singular Value Decomposition… you can add new users easily, but calculating for a large data set is slow upfront
    • Reduce the amount of data.
    • Do dimension reduction on high-dimensional data.
    • You want to reduce the data so that only the directions that provide more information are retained
    • If the algorithm succeeds, then the vectors pointing in those directions are said to be latent factors.
    • By looking at the behavioral data of users, you can find categories or topics that let you explain users’ tastes with much less granularity than with each movie but more detail than genres. These factors will enable us to position the movies that users like closer to each other.
    • You do dimension reduction by factorizing the rating matrix
      • Helps you find important factors so that you have users and items in the same space, and you can recommend films that are interesting to the user by finding which are nearby. You can also find similar users.
      • The matrix factorisation attempts to push closely related items and users toward each other
    • Calculating recommendations is done by looking into the factor space and finding the ones that are closer to the user.
      • there are tools that help you interpret dimensions in a vector space but they rarely work
  • A matrix is a rectangular array of numbers and declared by a number of rows m and a number of columns n.
  • A vector is a special case of a matrix that has only one column.
  • You have to have values to put in all the cells of the matrix
  • What’s factorisation? Factorisation is about splitting things up. You can split 100 into the following prime factorisation:100 = 2 × 2 × 5 × 5
    • Taking a number and writing it as a number of factors (in this case prime numbers)
  • UV-decomposition allows you to decompose the matrix into hidden features for items and for users.
    • You need to insert values in the U and V matrices in such a way that UV is as close to R as possible.
    • To do factorisation is to find the v’s and u’s that satisfy the equations.
    • One of the most commonly used methods for matrix factorization is an algorithm called SVD (singular value decomposition)
    • You want to find items to recommend to users, and you want to do it using extracted factors from the rating matrix. The idea of factorization is even more complicated because you want to end up with a formula that enables you to add new users and items without too much fuss.
    • image
    • From A rating matrix M, you want to construct two matrices that you can use: one that represents the customer’s taste and one that contains the item profiles.
    • The diagonal matrix in the middle (Σ) is known as the weights matrix.
      • Each of these weights can be an indication of how much information is present in each dimension.
    • A rule of thumb is that you should retain 90% of the information.
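A sketch with NumPy's SVD, using the cumulative singular values as a rough version of the "retain 90% of the information" rule of thumb (the rating matrix is a made-up, already-filled example):

```python
import numpy as np

# toy user x item rating matrix (in practice, imputed or normalised first)
R = np.array([[5.0, 3.0, 1.0, 1.0],
              [4.0, 3.0, 1.0, 1.0],
              [1.0, 1.0, 5.0, 4.0],
              [1.0, 2.0, 4.0, 5.0]])

U, sigma, Vt = np.linalg.svd(R, full_matrices=False)

# keep the smallest number of factors that retains ~90% of the weight
energy = np.cumsum(sigma) / sigma.sum()
k = int(np.searchsorted(energy, 0.90) + 1)

R_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :]   # rank-k approximation of the rating matrix
print(k, np.round(R_k, 1))
```

The rows of `U` (scaled by the weights) become the user profiles and the columns of `Vt` the item profiles, both living in the same k-dimensional factor space.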
  • Solving the problem of the zeros in the rating matrix using imputation
    • Often only 1% of the cells in the rating matrix have values. Two ways to solve that with imputation…
      • You can calculate the mean of each item (or user) and fill in this mean where there are zeros in each row (or column) of the matrix.
      • You can normalise each row, such that all elements are centred around zero, so the zeros will become the average
    • Baseline predictors are a better alternative.
  • The SVD method allows you to add new users and items into the system
    • To add a new user means to add another row to the rating matrix, which in turn means that the decomposed user matrix would also have another row
    • You need ratings for a new user or item before you can fold it in.
    • You still need to recalculate regularly (daily or weekly); the reduction is done to extract topics, and they aren't updated when you fold in a new user or item, which is placed relative to the topics that are already there
  • Cons of SVD:
    • Need to do something with the unfilled cells in the rating matrix
    • Slow to calculate for large matrices
    • Unexplainable
  • Pros of SVD:
    • New users can be folded in as they arrive (SVD model is static and should be updated as often as possible)
  • Baseline predictors make it easier to add values to the empty slots of the matrix
    • Baseline predictors extract the biases for the items and the users to make a better prediction
    • If a user underperforms the average, rate it lower
    • If an item outperforms the average rating, rate it higher
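A baseline-predictor sketch: compute the global mean plus per-user and per-item biases from the observed ratings, then fill empty cells with $b_{ui} = \mu + b_u + b_i$ (the data is illustrative):

```python
# (user, item) -> rating; only the observed cells of the matrix
ratings = {("A", "x"): 4, ("A", "y"): 5, ("B", "x"): 2, ("B", "y"): 3}

mu = sum(ratings.values()) / len(ratings)   # global mean rating

def mean(xs):
    return sum(xs) / len(xs)

# biases: how far a user's / an item's ratings deviate from the global mean
b_u = {u: mean([r - mu for (uu, _), r in ratings.items() if uu == u]) for u in ("A", "B")}
b_i = {i: mean([r - mu for (_, ii), r in ratings.items() if ii == i]) for i in ("x", "y")}

def baseline_predict(user, item):
    """b_ui = mu + b_u + b_i: the value to impute for an empty cell."""
    return mu + b_u[user] + b_i[item]

print(mu, b_u, b_i, baseline_predict("A", "x"))
```

User A rates above average and user B below it; the baseline bakes both tendencies into the imputed values instead of using one flat mean.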
  • Temporal dynamic: bias isn’t static. Items go in and out of fashion.
    • Your rating prediction function can be made function of time.
    • If your data stretches over a long period of time and you have many ratings, it may increase precision
  • SVD puts a lot of weight into the rating matrix (much of the data in which is imputed)
    • Simon Funk came up with a method that only uses the things you need to know
    • Funk SVD (or regularised SVD)
      • Use RMSE to provide a measure of how close you are to the known ratings
      • Use gradient descent, which uses RMSE to walk toward a better solution.
    • When you’re optimising algorithms, your first approach should always be RMSE.
      • Penalises big errors (takes the square of the difference)
  • You want to find values which make the result of the function as small as possible.
    • Gradient descent is a way to find optimal points on a graph, where optimal means the lowest point (or the highest point)
    • Gradient descent starts somewhere and then assess if there’s a direction that makes the function produce a smaller value.
    • Often the best suggestion is to try different starting points and see if they arrive at the same point.
    • Alpha (α) in the world of gradient descent is the learning rate, which translates to how big a step you should take every time you want to move to the next point.
      • too large a step, you might miss the minimum altogether
  • Stochastic gradient descent looks at one rating at a time. Gradient descent is an important part of how you train deep learning networks
  • When you run through all the known ratings once, you have two matrices that can be used to predict ratings.
    • Shuffle the list of ratings before you start as trends in the ratings can push the factorisation in strange directions.
  • The art of Machine learning is in initialising parameters and what the constants should be
    • learning rate γ = 0.001, regularisation λ = 0.02 with 40 factors
  • You’re not only looking for more precision and recall, you’re also concentrating on training the algorithm to understand the domain.
    • You want to create something that not only remembers the answers from the training data but also predicts ratings from the test data.
      • Aim for as little error as possible both on the training data and on the test data as you descend toward the two matrices of user and item factors.
      • Teach the algorithm too little and you won’t have a result that understands the domain; teach it too much and it’ll be overfitted to the training data
  • The levers you’ll use to manipulate the algorithm include the following:
    • Initialise the factors?: Decides where gradient descent should start walking.
    • Learning rate?: How fast you should move at each step.
    • Regularisation?: How much should the algorithm regularize the errors? Should it be the same for all the expressions or should you split it into factors and bias?
    • How many iterations?: How specialised should the algorithm become to the training data?
  • Doing recommendations with Funk SVD
    • You have four things:
      1. Item factor matrix: each column represents a content item described by the latent factors that you calculated
      2. User factor matrix: each column represents a user described by the latent factors.
      3. Item bias: certain items are generally considered better (or worse) than others
      4. User bias: encompasses different rating scales for different users.
    • You can calculate a predicted rating for any item for any user, so you can now provide predicted ratings on everything
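Putting those four pieces together, a predicted rating is the global mean plus both biases plus the dot product of the factor vectors (a sketch with made-up numbers):

```python
def predict_rating(mu, user_bias, item_bias, user_factors, item_factors):
    """Predicted rating = global mean + user bias + item bias
    + dot product of the user and item latent factor vectors."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    return mu + user_bias + item_bias + dot

# A generous user (bias +0.5) rating a well-liked item (bias +0.3)
# whose factors align somewhat with the user's taste:
r = predict_rating(3.6, 0.5, 0.3, [0.8, -0.2], [0.9, 0.1])  # -> 5.1
```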
  • To find a list of things the user would rate highly…
    • Brute force: calculate a predicted rating for each user for each item, then you sort all the predictions and return the top N items.
    • Neighbourhood recommendation: use the factors that you calculated. You’re then calculating similarities in a lower-dimensional space, where things are closer, which makes it easier.
    • Factor vectors: use factor vectors that are close to the active user’s vector. They’re all in the same space, so why not.
  • I’m torn about whether you should order the items before adding the item biases or not.
  • This model goes out of date quickly. Ideally you’d start a new run as soon as it’s finished. Train daily or weekly at least.
  • If gradient descent is too slow, use alternating least squares (ALS) method, which isn’t as precise as gradient descent, but should work fairly well anyway.
  • You almost always want to use cross-validation to evaluate the performance of the algorithm.
  • Summary:
    • SVD is a way of doing matrix factorisation, but it won’t work if your matrix isn’t complete
    • You have to fill empty cells with baseline predictors
    • Gradient descent and stochastic gradient descent are super tools for solving optimisation problems

Chapter 12: Taking the best of all algorithms and implementing hybrid recommenders

  • Combine recommender algorithms to get a more powerful tool. Achieve better results and patch up corner cases, where algorithms don’t work well. They also allow you to use the data you have available to you
  • The three different types of overall hybrid recommenders
    • Monolithic: takes components of various recommenders and glues them together in new ways
    • Ensemble runs different recommenders and then combines the result into one
    • Mixed recommender runs a number of recommenders, returning all of them
  • A monolithic hybrid mixes components from different recommenders or even adds new steps to improve overall performance. It is composed of parts of other recommender systems.
    • E.g. a content-based approach that finds all the items that are similar content-wise, mixing that with a collaborative filtering approach to calculate predicted ratings the same way you’d use simple collaborative filtering
  • The whole point of hybrids is to take advantage of more types of data.
    • E.g. a collaborative filtering recommender with one extra pre-processing step that adds ratings to the rating-matrix, such that the collaborative filtering will connect things that are related content-wise
  • A mixed hybrid returns the union of all the results, where you’ve a hierarchy of recommenders
    • Consider the recommender as a scale of personalisation
    • Make the first as personalised as possible and then continue until you’re using item popularity to give recommendations
    • Often the most personalised recommender produces only one or two recommendations, the next produces several more, and in that way you’ll always have a good quantity of recommendations with as much quality as possible.
    • If you’ve several good recommenders, each returning a score, then you can return a list ordered accordingly. Remember that the scores should be normalised so all the results are on the same scale
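The cascade above might be sketched like this (my illustration; the callable-recommender interface is assumed), walking from most to least personalised until enough recommendations are collected:

```python
def mixed_recommend(recommenders, user, n=10):
    """Walk a hierarchy of recommenders, most personalised first,
    collecting (item, score) pairs until n items are gathered.
    Scores are assumed to already be normalised to the same scale."""
    seen, results = set(), []
    for recommend in recommenders:
        for item, score in recommend(user):
            if item not in seen:
                seen.add(item)
                results.append((item, score))
            if len(results) >= n:
                return results
    return results

# Most personalised first, item popularity as the fallback:
personalised = lambda user: [("a", 0.9)]
popular = lambda user: [("a", 0.8), ("b", 0.6), ("c", 0.5)]
recs = mixed_recommend([personalised, popular], "u1", n=2)
```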
  • An ensemble combines predictions from different recommenders into one recommendation
    • Start with matrix factorization
      • then add recommenders to create an ensemble recommender
    • If you’ve two recommenders running already, let’s say content-based and collaborative filtering, then why not run them simultaneously and combine the result to get an even better outcome
    • You can use a majority-voting approach
  • A switched ensemble recommender is about using the best tool for the job. If you have two or more recommenders, then a switched ensemble recommender will decide which of them to use given the context of the request
    • Country, time of day, how many ratings a user has made
  • Weighted ensemble recommender
    • Consider Collaborative filtering and content-based filtering
      • Content-based filtering finds similar content.
      • If you know a user likes a topic, you can use content-based filtering to find similar things.
      • But content-based filtering doesn’t distinguish between good and bad quality; it’s only concerned with topic or keyword overlap
      • Collaborative filtering doesn’t care about similarity, only that certain people thought an item was good quality and others didn’t
    • Use them together to combine their strength and you don’t have to give them equal weight.
      • train two different recommenders and ask them both to produce candidates for recommendations.
      • When two or more recommenders are combined like this, each component is called a feature recommender
    • You can find the best feature weights: guessing, regression, A/B testing / multi-armed bandits
    • It’s good having weights for each recommender, but it’s even better to have these weights change depending on the user or items
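A minimal weighted-blend sketch (the weights and the dict-of-scores interface are my assumptions); items missing from one recommender simply contribute zero from it:

```python
def weighted_blend(cb_scores, cf_scores, w_cb=0.4, w_cf=0.6):
    """Blend content-based and collaborative filtering candidates.
    The weights need not be equal -- here CF gets slightly more say."""
    items = set(cb_scores) | set(cf_scores)
    blended = {i: w_cb * cb_scores.get(i, 0.0) + w_cf * cf_scores.get(i, 0.0)
               for i in items}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

cb = {"x": 1.0, "y": 0.5}   # content-based candidates
cf = {"y": 1.0, "z": 0.8}   # collaborative filtering candidates
ranked = weighted_blend(cb, cf)  # "y" scores in both, so it wins
```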
  • Feature-weighted linear stacking (FWLS)
    • How to make the weights into functions instead.
    • Meta features: Weights as functions
    • Make your algorithm more flexible. E.g: use content-based recommendations if the user has only rated a few items, and collaborative filtering if the user has interacted with many items. Replace weights with functions (these functions are called meta functions or feature-weights functions)
      • Let the machine decide what to do. You can say that you want the content-based recommender to provide 90% of the recommendation if the user has rated fewer than three items; otherwise, it should be 50/50.
      • You can use this to blend the output of recommenders using weights that are functions that make those extremely flexible.
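The 90% / 50-50 example above can be sketched as a feature-weight function (the three-rating threshold comes from the text; the function shape is my illustration):

```python
def meta_weight_cb(num_ratings):
    """Feature-weight function: lean heavily on the content-based
    recommender for near-cold-start users, otherwise split evenly."""
    return 0.9 if num_ratings < 3 else 0.5

def fwls_score(cb_score, cf_score, num_ratings):
    """Blend two feature recommenders with a weight that is a
    function of the user, not a constant."""
    w = meta_weight_cb(num_ratings)
    return w * cb_score + (1 - w) * cf_score
```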
    • You want the hybrid recommender function to produce output that makes the difference between an expression and the actual ratings as small as possible.
    • Split the data and remove something like 20% of it to check how well the algorithm does.
      • Split data into training, validation, and test sets
      • Training data: trains the feature recommenders and the hybrid.
      • Validation data: allows you to understand if the hybrid is a good model and to fine-tune the hyperparameters
      • Test set: provides an unbiased validation of the model
    • The evaluation runner is used to evaluate an algorithm. It’s a pipeline where the data is first cleaned, then split into the k folds for cross-validation. Then for each fold it repeats training of the algorithm to evaluate it. When it’s all finished, you aggregate the result.
  • Summary
    • A recommender system can be greatly optimised by adding the output of several algorithms.
    • Hybrids enable you to combine the forces of different recommenders to get better results.
    • The feature-weighted linear stacking (FWLS) algorithm enables the system to use feature recommenders in a function-weighted way, which makes it strong.

Chapter 13. Ranking and learning to rank (LTR)

  • Instead of focusing on recommendations as a rating prediction problem, it sometimes makes more sense to look at how the items should be stacked
  • Defining relevancy like this takes away the need to predict ratings; you only have to know what’s preferred over everything else
  • Information Retrieval Systems: Learning to rank (LTR)
    • Learning to rank is a way to combine different kinds of data sources, such as popularity, distance, or recommender system outputs
    • You’re looking for input sources (features) that will give you an ordering of the objects
    • Foursquare’s algorithm to rank venues near you
      • Higher values should denote a shorter distance and a higher average venue rating.
      • Invert distances (short distance has higher value)
      • Set a max distance (10 minute walk)
      • Rescale data so that everything is between 0 and 1
      • For more information on rescaling, see Wikipedia
      • You want to teach the machine to rank these items based on the input of ratings and distance.
        • You want the system to learn weights (w0 and w1), which produce a value such that the items get ordered as you’d expect on the page
    • How do you optimise the function if you don’t know what it’s supposed to return?
      • You use the check-in feature
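A sketch of such a linear ranking function (the 800 m cap, the 0–10 rating scale, and the weight values are all illustrative assumptions, not Foursquare’s actual numbers):

```python
def rescale(value, lo, hi):
    """Min-max rescale into [0, 1]."""
    return (value - lo) / (hi - lo)

def venue_score(distance_m, avg_rating, w0=0.5, w1=0.5, max_dist=800.0):
    """Linear ranking score: invert distance so shorter walks score
    higher, cap at roughly a 10-minute walk, rescale both features
    to [0, 1]. The weights w0 and w1 are what the system would learn
    from check-ins."""
    d = min(distance_m, max_dist)
    closeness = 1.0 - rescale(d, 0.0, max_dist)  # short distance -> high value
    rating = rescale(avg_rating, 0.0, 10.0)      # assuming a 0-10 rating scale
    return w0 * closeness + w1 * rating
```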
  • A hybrid recommender always predicts ratings, while a LTR algorithm produces orderings
  • Re-ranking example: take a popularity ordering and re-rank the items using the recommender. Popularity narrows the list to the most popular.
  • Collaborative filtering algorithms are prone to recommend items liked by few people, but by people who really like the content. The algorithm has no concept of popularity and could be used for re-ranking instead of as the sole source for the ordering
  • Learning to Rank (LTR)
    • A ranking recommender has a catalog of items. The system retrieves items that are relevant to the user and then ranks them so the items at the top of the list are the most applicable
    • A ranking model is trained using an LTR algorithm, which is a supervised learning algorithm, meaning that you provide it with data containing input and output.
      • e.g., a user_id as input and a ranked list of items as output.
    • This family of algorithms has three subgroups:
      • Pointwise: produces a score for each item and then ranks them accordingly
      • Pairwise: a type of binary classifier. It’s a function that takes two items and returns an ordering of the two. Usually optimised so that you have the minimal number of inversions compared to the optimal ranking of the items
      • Listwise: is the king of all LTR subgroups because it looks at the whole ranked list and optimises that. The advantage of listwise ranking is that it intuits that ordering is more important at the top of a ranked list than at the bottom. Pointwise and pairwise algorithms don’t distinguish where on the ranked list you are
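The “number of inversions” a pairwise ranker minimises can be counted like this (my sketch): a pair is inverted when the produced ranking orders it differently from the optimal ranking.

```python
def inversions(ranked, optimal):
    """Count item pairs ordered differently from the optimal ranking --
    the quantity pairwise learning-to-rank tries to minimise."""
    pos = {item: i for i, item in enumerate(optimal)}
    count = 0
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if pos[ranked[i]] > pos[ranked[j]]:
                count += 1
    return count

# Swapping the top two items produces exactly one inverted pair:
one_swap = inversions(["b", "a", "c"], ["a", "b", "c"])
```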
  • Bayesian Personalised Ranking (BPR)
    • For each user you want to order the content items in such a way that the most relevant is on top
    • Define an ordering that says no matter which two items you hold up, the ordering will rank one better than the other. To make that work, you need three rules:
      • Totality
      • Anti-symmetry
      • Transitivity
    • With implicit ratings you never get negative feedback because you only have events that say “bought.” This makes it hard for the machine learning to know when it’s done something wrong
  • Bayesian statistics are based on this simple equation:
    1. p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}
    2. The probability (p) of event A happening given that B happened is equal to the probability that A happens multiplied by the probability that B happens when A occurs, divided by the probability that B happened.
    3. A is the event that it has rained, and B is the event that the street outside is wet. Then Bayes says that the probability that it has rained given that the street is wet (p(A|B)) is equal to the probability that it has rained (p(A)) multiplied by the probability that the street is wet given it has rained (p(B|A)) divided by the probability that the street is wet (p(B)).
    4. In a BPR, you want to find the model such that there’s the highest probability that the model will produce a perfect ordering for all users
    5. Assumes there’s a way to order all your content items flawlessly for each of your users, which is the total ordering that I keep mentioning.
    6. Mix with a content-based recommender or use something else like the multi-armed bandit scheme to introduce more items into the system.
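A worked instance of the rain example with made-up numbers (all probabilities here are illustrative, not from the book):

```python
# Suppose it rains on 10% of days (p(A)), the street is wet on 12% of
# days (p(B)), and the street is wet on 95% of rainy days (p(B|A)).
p_rain = 0.10            # p(A)
p_wet = 0.12             # p(B)
p_wet_given_rain = 0.95  # p(B|A)

# Bayes: probability it rained, given that the street is wet.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet  # ~0.792
```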
  • Summary Learning to Rank
    • (LTR) algorithms solve a different problem than that of the classic recommender problem of predicting ratings.
    • Foursquare uses a ranking algorithm to combine the ratings and the locations of venues, which is a great example of how to combine relevant data into one recommendation.
    • It’s not easy to come up with a continuous function that can be optimised

Chapter 14: Future of recommender systems

  • In Pedro Domingos’s famous paper “A Few Useful Things to Know about Machine Learning,” he points out that more data beats a more complex algorithm. That’s also true for recommender systems, but collected data isn’t always a source of truth
  • Recommenders will also be used to make decisions in many other scenarios.
  • Next Best Action recommenders are becoming bigger and bigger in marketing and banking.
  • The future of recommenders lies in dynamic recommender algorithms, with a core of offline calculations