Practical Recommender Systems


Author

Kim Falk

Year
2019

Review

Recommendations have become an important part of most products. As such, Product Managers need a good understanding of the underlying technology. Although the theory can be complex, having an understanding of the tradeoffs and limitations of your system is crucial.

It’s been a while since I’ve done any matrix factorisation or gradient descent. Part II of this book was a challenge for me, but reading beyond my comprehension from time to time expands my circle of competence. Part I, however, is entirely comprehensible and a great introduction to recommenders.


Key Takeaways

The 20% that gave me 80% of the value.

  • Recommendations work best when you have lots of data and a lot of things (e.g. products or content) to recommend
  • Types of data involved:
    • Historical data: behaviour data
    • Content metadata
    • Contextual data: device, location, alone or in group, time of day/week/year
    • Demographic: age, gender, nationality, income, politics
  • How to describe a system: Domain, purpose, context, personalisation, opinions, privacy, trustworthiness, interfaces and algorithms
  • There’s a tradeoff between accuracy and explainability (or interpretation)
  • Data is really important. Garbage in, garbage out
  • Two types of feedback:
    • Explicit: ratings or likes
    • Implicit: activity (deduced from monitoring behaviour)
      • Implicit ratings (based on usage data) are often more trustworthy
  • Browser Behaviour events can demonstrate interest
  • Storage is cheap, save as much data as you can
  • What to choose:
    • High user involvement → collaborative filtering
    • Low user involvement (or 1 time visits) → content-based
  • Recommenders only work if you can identify the users; forcing sign-up/login can help
  • Conversion path is the sequence of pages and actions a user takes from the time of arrival on the landing page until conversion
  • You can compute implicit ratings by converting web behaviour into content ratings.
  • User-item preferences are often expressed in a user-item matrix.
    • row for each user
    • column for each item
    • ratings in each cell
  • The rating binds the user, the item and their sentiment toward it
  • Expect empty cells to be common (the sparsity problem). Many recommender systems will try to predict values for empty cells.
  • To calculate an implicit rating: take all events recorded between each user and item, and determine a number that indicates how much they like the item
  • Remember that date/time/day might change what’s best to recommend
  • You can introduce time-decay to your algorithm to give recent signals higher importance
  • Term Frequency - Inverse Document Frequency (TF-IDF)
    • Less frequent items provide more value
    • A user buying a popular item doesn’t say much about taste
  • Start with non-personalised recommendations, then graduate to personalised
  • Content should be ordered based on what you think most users are interested in
    • Use recency to keep your product dynamic
    • For some categories like films, most-recent-first is a sensible rule
  • Seeded recommendations use an item as an input to find others (e.g. frequently bought together)
  • Creating Frequently Bought Together
    • Do affinity analysis to determine links between products
      • Called frequent sets when products are seen in the same basket together
    • Most products are bought with popular items (e.g. milk and bananas in grocery)
    • Instead, find the products that are bought together but not with everything else
      • To do this, divide the number of times the items are bought together by the number of times the first one appears
  • The cold start: when you don’t have knowledge of your users, so can’t personalise well
    • Can apply to users and items
    • Gray sheep → users who have individual tastes (no other users have bought what they have)
  • If asking a user to rate items - use active learning (the art of selecting items to rate that will teach you the most about preferences)
  • Cold items are best handled using content based filtering
  • Create user segments, and see how quickly you can classify people into a segment
    • Try demographic recommendations / personalised
    • Try and get more data (e.g. facebook account)
  • Be careful to show only items that are appropriate based on items they’re already viewing
  • Segments can be created using cluster analysis
    • Cluster: a group with similar traits
    • You can find hidden traits and similarities in the data
  • You can use meta-data from an item to find similar items
    • Grouping and abstraction is really helpful…
      • Art by artist
      • News by topic
      • Articles by author
      • Song to artist
      • Product to brand
    • Need to get the abstraction or generalisation just right
      • too general and it will lose its value
      • too specific and it won’t enable you to connect any more items
  • Personalised recommendations almost always contain calculations of similarity
    • Finding items like the ones you like
    • Finding users who like what you like (e.g. have similar taste)
  • Jaccard similarity: a measure that indicates how close two sets are to each other
    • Also called the Jaccard index or Jaccard similarity coefficient; the Jaccard distance is 1 minus the similarity
  • To find similarity between two items, calculate how many users bought both items, then divide by how many users bought either one (or both).
  • Calculating k-means clusters (segmenting users can help save compute)
  • Scenario: Find a user that’s similar to me, so you can recommend things they like that I haven’t yet come across. You need…
    • Enough items in common to determine that your tastes match
    • They also need a set of items you haven’t seen / come across
  • In ML pipelines, do as much work as possible before the user arrives expecting recommendations
  • If you have more users than items → use item-based filtering. If you have more items than users → use user-based filtering.
  • Data needs to be well connected to make recommendations
    • If no users have rated content, no recommendations will be made
    • If your tastes don’t overlap with others, users won’t receive good recommendations
  • Steps to implement collaborative filtering:
    1. Find all the items that are rated by several users
    2. Calculate the number of users who are connected to one or more of those items
    3. Other users won’t receive recommendations
  • Your system doesn’t need any domain knowledge to do collaborative filtering
  • The collaborative filtering tradeoff
    • Narrow the number of users and you risk not finding any content to recommend
    • Make it too wide and you risk recommendations containing content outside of the users taste
  • Collaborative filtering requires data, which you won’t have for new users or items
    • Hacks:
      • Ask new users to rate a few items when they arrive
      • Create a new arrivals list to showcase new products
      • Use explore/exploit methods to inject new items into, say, every 6th recommendation slot
  • Decisions to make at each step of collaborative filtering (when → what decisions):
    • Ratings: Only the positive ones? Only the most recent ones? How should you normalise them?
    • Similarity: How many users need to rate an item-pair for its similarity to be calculated? Should you use only positive similarities?
    • Neighbourhood: What method do you use to select a neighbourhood? How big should it be?
    • Predicting: Should you use classification or regression? Should you use a weighted average?
    • Returning: Should you return them all? Or just the ones above a threshold?
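The predicting step above is often a similarity-weighted average of the neighbourhood's ratings. A minimal sketch, with invented users, similarities and ratings:

```python
# Sketch: predict a user's rating for an item as the similarity-weighted
# average of ratings from the user's neighbourhood (all data invented).

def predict_rating(item, neighbours, similarity, ratings):
    """neighbours: users similar to the target user.
    similarity[u]: similarity between the target user and u.
    ratings[u]: dict of u's ratings keyed by item."""
    num, den = 0.0, 0.0
    for u in neighbours:
        if item in ratings[u]:
            num += similarity[u] * ratings[u][item]
            den += abs(similarity[u])
    return num / den if den else None   # None: no neighbour rated it

sims = {"bob": 0.9, "eve": 0.5}
r = {"bob": {"matrix": 5.0}, "eve": {"matrix": 3.0}}
print(predict_rating("matrix", ["bob", "eve"], sims, r))
```

The closer neighbour ("bob") pulls the prediction toward their rating, which is the point of weighting by similarity rather than taking a plain average.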
  • Cons of collaborative filtering:
    • Sparsity - most data sets aren’t dense enough to recommend items other than the most popular
    • Gray sheep - hard to find recommendations for users with unique tastes
    • Number of ratings - you need lots of user ratings
    • Similarity - it’s content-agnostic, based on behaviour, and will skew to popular content
  • The best recommender systems also account for diversity, context, evidence, freshness, and novelty
  • All data applications are ‘living’ → they require maintenance and monitoring
  • Implementation:
    1. Start with the simplest algorithm first → then add complexity if needed
    2. Test on a small data set
    3. Code and test it
    4. Offline algorithm evaluation
    5. Introduce to humans (control group)
    6. Release to a small % of users
  • If the key success metric is lagging (retention of subscribers) use a leading indicator (users watching more content)
  • Understanding my taste = minimising prediction error
  • The Matthew effect: “the rich get richer and the poor get poorer.”
    • If popular items are always favoured, you’re also creating what we call a filter bubble.
  • Coverage: The better the diversity, the better the coverage
    • An ideal recommender system would enable users to navigate the full catalog
    • Coverage has two sides:
      • how much of the catalogue is it capable of recommending (catalogue coverage)
      • how many users can it recommend something to (user coverage)
  • Serendipity: giving the user a pleasant surprise, so that all their visits aren’t more of the same. Serendipity is subjective and difficult to calculate.
    • Don’t constrain the recommender’s returns too much. Constraining means less serendipity.
  • Measuring error metrics:
    • Measure the difference between the ratings in your historical data set with what the recommender algorithm outputs
      • The difference between a user’s rating and the recommender’s prediction of that rating
        • mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
          • Mean Absolute Error (MAE): takes the absolute value of each of the differences and finds the average of those. Absolute values are used so a high guess doesn’t cancel out a low guess.
          • Root Mean Squared Error (RMSE): puts big penalties on big errors. Big errors count more than several smaller ones.
            • If it’s important that none of your users gets bad recommendations, then you should use RMSE.
  • Measuring decision-support metrics
    • True positive (TP)— Item recommended and consumed by the user
    • False positive (FP)— Item was recommended but the user didn’t consume it.
    • False negative (FN)— The recommender didn’t include the item in a recommendation and the user consumed it.
    • True negative (TN)— It wasn’t recommended and the user didn’t consume it.
    • Precision— What fraction of the recommended items the user consumed:
    • Precision = \frac{TruePositive}{TruePositive + FalsePositive}
    • Recall— What, out of all the items that the user consumed, was recommended
    • Recall = \frac{TruePositive}{TruePositive + FalseNegative}
    • It’s considered more important to optimise precision, while not putting too much importance on whether the user gets all the possible relevant items (the recall).
    • Decision-support comes from information filtering, originally used for calculating the quality of search results
    • In the world of recommenders, the measurement is restricted to look only at the k top elements (precision at k)
      • P@k is the number of relevant items among the first k positions, divided by k
      • P@k(u) = \frac{\#\,relevant\ content\ in\ the\ top\ k\ positions}{k}
      • Mean Average Precision (MAP): averages the precision at k over the relevant positions, taking rank into account
      • Discounted Cumulative Gain (DCG): finds each relevant item and penalises it the farther down the list it appears. Much like MAP, but where MAP only works with binary relevance, DCG looks at levels of relevancy, so each item needs a relevancy score.
      • NDCG: Normalises DCG vs the optimal ranking, and is therefore a more universal benchmark across systems
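The set-based metrics above are simple to compute from the recommended and consumed items. A minimal sketch (item names invented):

```python
# Sketch: decision-support metrics from recommended vs. consumed items.

def precision(recommended, consumed):
    tp = len(set(recommended) & set(consumed))          # true positives
    return tp / len(recommended) if recommended else 0.0

def recall(recommended, consumed):
    tp = len(set(recommended) & set(consumed))
    return tp / len(consumed) if consumed else 0.0

def precision_at_k(ranked, consumed, k):
    # Only the top k positions of the ranked list count.
    return len(set(ranked[:k]) & set(consumed)) / k

recs = ["a", "b", "c", "d"]   # ranked recommendations
seen = ["b", "d", "e"]        # what the user actually consumed
print(precision(recs, seen))         # 2 of 4 recommendations consumed
print(recall(recs, seen))            # 2 of 3 consumed items recommended
print(precision_at_k(recs, seen, 2)) # only "b" appears in the top 2
```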
  • Some recommenders don’t distinguish between ratings made now and a year ago. A rating’s age might affect its ability to predict
  • You can use content-based filtering to create…
    • similar items recommendations
    • personal recommendations based on taste
  • Systems based on text first extract keywords (removing those that produce noise)
    • Use the important words you find to create a list of categories (also called topics)
    • LDA algorithm (Latent Dirichlet Allocation):
      • Latent → topics found don’t compare to any category that you know
      • Dirichlet → is the way documents are described using these topics
      • Allocation → words are allocated into topics
  • Content-based filtering is about extracting knowledge from the content
  • Content analyser: creates a model based on the content, creating a profile for each item
    • Feature extraction: extracting things that are important for the algorithm to work
    • Mapping data into something a machine can use (vector)
  • User profiler: creates a user profile (sometimes just a consumed items per user)
  • Item retriever: retrieves relevant items found by comparing user profiles to item profiles
  • Finding important words with TF-IDF
    • TF: Term frequency:
      • tf(word,document) = how many times does the word appear in document
      • Or tf(word,document) = 1 + log(word frequency)
    • The more times a word is present in a document, the higher the chance that it’s important.
    • A word is more likely to be important if it’s present in only a few documents
    • Inverse document frequency (idf in the equation)
      • The number of all your documents divided by the number of documents that contain the word (commonly taken with a log)
    • Together, the product is tf-idf, which is defined like this:
      • tf-idf(word, document) = tf(word, document) * idf(word, document)
  • When you find words that provide large TF-IDF returns, you can add those to the list of features
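The tf and idf definitions above can be sketched over a toy corpus. This variant uses raw counts for tf and a log for idf; the documents are invented:

```python
import math

# Sketch: tf-idf over a toy corpus (log-based idf variant).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenised = [d.split() for d in docs]

def tf(word, doc):
    return doc.count(word)                      # raw term frequency

def idf(word):
    n_containing = sum(1 for d in tokenised if word in d)
    return math.log(len(tokenised) / n_containing)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# "cat" is rare (one document), "the" is common (two documents), so
# "cat" scores higher despite appearing less often:
print(tf_idf("cat", tokenised[0]))
print(tf_idf("the", tokenised[0]))
```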
  • Topic Modelling using the LDA
    • LDA (Short for latent Dirichlet allocation)
      • Latent: hidden (or found by the algorithm)
      • Dirichlet: a distribution of distributions
      • Allocation: all words are allocated to the latent topics
    • An LDA is a generative model
  • LDA’s use
    • Content → Topics → Words
      • Content can be described by a series of topics, topics can be described by what words are used when describing the topic
      • The goal is to connect words with topics and topics with documents!
    • Gibbs sampling begins with randomly adding topics to documents, and words to topics → it makes adjustments until it converges on something (that hopefully makes sense)
  • Most of the implementation work in making an LDA function is how you process the text you’re inputting.
  • If you want to control the topics a document contains, you can inject words into the descriptions
  • Key parameters for LDA:
    • k (number of topics): if too low, documents look similar; if too large, you might end up with no similar documents
    • High alpha: distributes each document over many topics
    • High beta: leads to topics being more similar
  • Once you have the LDA model you can predict the similarity of any two documents
    • The probability distribution can be seen as vectors, so you can use cosine similarity
    • You can also compare a document that wasn’t used to create the model (and get around the cold start problem)
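As noted above, topic distributions can be treated as vectors and compared with cosine similarity. A minimal sketch with made-up three-topic distributions:

```python
import math

# Sketch: cosine similarity between documents' LDA topic distributions.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [0.7, 0.2, 0.1]   # mostly topic 0
doc2 = [0.6, 0.3, 0.1]   # a similar mix
doc3 = [0.1, 0.1, 0.8]   # mostly topic 2
print(cosine(doc1, doc2))   # high: the mixes are close
print(cosine(doc1, doc3))   # low: different dominant topics
```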
  • To create personalised recommendations, create a user profile
    • Iterate through each of the items the user likes
    • For each item, find similar products
    • Order according to the product of similarity and the users’ ratings from the original list of consumed items
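The three steps above can be sketched as follows; the item similarities and ratings are invented:

```python
# Sketch: content-based personal recommendations from a user profile.

item_similarity = {          # item -> [(similar item, similarity)]
    "matrix": [("inception", 0.9), ("titanic", 0.2)],
    "aliens": [("inception", 0.6), ("predator", 0.8)],
}
user_ratings = {"matrix": 5, "aliens": 4}   # items the user consumed

scores = {}
for item, rating in user_ratings.items():        # step 1: liked items
    for candidate, sim in item_similarity[item]: # step 2: similar items
        if candidate not in user_ratings:        # don't re-recommend
            # step 3: accumulate similarity * rating
            scores[candidate] = scores.get(candidate, 0) + sim * rating

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

An item similar to several highly rated items ("inception" here) accumulates score from each of them and rises to the top.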
  • Pros and cons of content-based filtering:
    • Pros
      • New items are easy to add (just create the item feature vector)
      • You don’t require much traffic, as it’s the content descriptions that drive it
      • Popularity isn’t considered
    • Cons
      • Conflates liking with importance.
      • No serendipity, it’s specialised
      • Limited understanding of content, so system can misunderstand what a user likes
  • By looking at the behavioural data of users, you can find categories or topics that let you explain users’ tastes with much less granularity than with each movie but more detail than genres. These factors will enable us to position the movies that users like closer to each other.
  • Hidden genres = latent factors
    • Latent factors are defined by an algorithm
    • They can represent a user’s taste, though it might be hard to understand what the factors mean
  • You do dimension reduction by factorising the rating matrix
    • UV-decomposition lets you decompose the rating matrix into hidden features: one matrix for users and one for items
  • Often only 1% of the cells in the rating matrix have values. Two ways to solve that with imputation…
    • You can calculate the mean of each item (or user) and fill in this mean where there are zeros in each row (or column) of the matrix.
    • You can normalise each row, such that all elements are centred around zero, so the zeros will become the average
  • Baseline predictors are a better alternative.
    • Baseline predictors extract the biases for the items and the users to make a better prediction
    • If a user tends to rate below the average, adjust their predicted ratings down
    • If an item tends to be rated above the average, adjust its predicted ratings up
  • The SVD method allows you to add new users and items into the system
    • To add a new user means to add another row to the rating matrix, which in turn means that the decomposed user matrix would also have another row
    • You need user ratings for a new item before you can add it
    • You still need to recalculate regularly (daily or weekly): the reduction is done to extract topics, and these aren’t updated when you fold in a new user or item; new users and items are placed relative to the topics that are already there
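A baseline predictor of the kind described above can be sketched as the global mean plus a user bias and an item bias. The ratings are invented and there is no regularisation:

```python
# Sketch: baseline predictor b_ui = mu + b_u + b_i, where mu is the
# global mean, b_u the user bias and b_i the item bias.

ratings = {  # (user, item) -> rating
    ("ann", "matrix"): 5, ("ann", "titanic"): 3,
    ("bob", "matrix"): 4, ("bob", "titanic"): 2,
}
mu = sum(ratings.values()) / len(ratings)       # global mean

def bias(key_index, key):
    """Average deviation from mu for a user (key_index=0) or item (=1)."""
    vals = [r for k, r in ratings.items() if k[key_index] == key]
    return sum(vals) / len(vals) - mu

def baseline(user, item):
    return mu + bias(0, user) + bias(1, item)

print(baseline("ann", "matrix"))   # generous rater + popular item
```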
  • The three different types of overall hybrid recommenders
    • Monolithic: takes components of various recommenders and glues them together in new ways
    • Ensemble: runs different recommenders and combines the results into one
    • Mixed: runs a number of recommenders and returns all of their results
  • A mixed hybrid returns the union of all the results, where you have a hierarchy of recommenders
    • Consider the recommender as a scale of personalisation
    • Make the first as personalised as possible and then continue until you’re using item popularity to give recommendations
    • Often the most personalised recommender produces only one or two recommendations, and the next produces several more; that way you always have a good quantity of recommendations with as much quality as possible
  • Feature-weighted linear stacking (FWLS)
    • Makes the blend weights functions instead of constants
    • Meta features: weights as functions
    • This makes your algorithm more flexible. E.g. use content-based recommendations if the user has rated only a few items, and collaborative filtering if the user has interacted with many. These functions are called meta functions or feature-weight functions.
      • Let the machine decide what to do. You can say that you want the content-based recommender to provide 90% of the recommendation if the user has rated fewer than three items; otherwise it should be 50/50.
      • Blending the output of recommenders using weights that are functions makes the blend extremely flexible.
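The weights-as-functions idea can be sketched with one hand-written weight function, using the illustrative 90%/50-50 rule mentioned above (scores and thresholds invented):

```python
# Sketch: blend two recommenders with a weight that is a function of the
# user, rather than a constant (a simplified take on FWLS).

def content_weight(n_ratings):
    # Lean on content-based recommendations while the user is still cold.
    return 0.9 if n_ratings < 3 else 0.5

def blended_score(item, user_n_ratings, content_score, collab_score):
    w = content_weight(user_n_ratings)
    return w * content_score + (1 - w) * collab_score

# Cold user: content-based dominates. Warm user: an even split.
print(blended_score("matrix", 1, content_score=0.8, collab_score=0.4))
print(blended_score("matrix", 10, content_score=0.8, collab_score=0.4))
```

In full feature-weighted linear stacking the weight functions themselves are learned from data rather than hand-written like this one.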
  • Learning to rank (LTR) is a way to combine different kinds of data sources, such as popularity, distance, or recommender system outputs
  • You’re looking for input sources (features) that will give you an ordering of the objects
  • Foursquare’s algorithm to rank venues near you
    • Higher values should denote a shorter distance and a higher average venue rating.
    • Invert distances (short distance has higher value)
    • Set a max distance (10 minute walk)
    • Rescale the data so that everything is between 0 and 1
    • For more information on rescaling, see Wikipedia
    • You want to teach the machine to rank these items based on the input of ratings and distance.
      • You want the system to learn weights (w0 and w1), which produce a value such that the items get ordered as you’d expect on the page
  • How do you optimise the function if you don’t know what it’s supposed to return?
    • You use the check-in feature
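The venue-scoring idea above can be sketched as a weighted sum of rescaled features. The weights w0 and w1, the walk cutoff, and the venues are invented; in the real system the weights would be learned from check-ins:

```python
# Sketch: rank venues by a weighted sum of an inverted, capped distance
# feature and a rescaled rating feature.

MAX_DISTANCE = 800.0   # metres, roughly a 10-minute walk (assumed)

def distance_feature(metres):
    metres = min(metres, MAX_DISTANCE)       # cap at the max distance
    return 1.0 - metres / MAX_DISTANCE       # nearer -> closer to 1

def rating_feature(avg_rating):
    return avg_rating / 10.0                 # rescale 0..10 to 0..1

def score(metres, avg_rating, w0=0.6, w1=0.4):
    return w0 * distance_feature(metres) + w1 * rating_feature(avg_rating)

# A close, decent venue outranks a far, excellent one with these weights:
print(score(100, 7.0))
print(score(750, 9.5))
```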

Deep Summary

Longer form notes, typically condensed, reworded and de-duplicated.

Part 1: Getting ready for recommender systems

Chapter 1: What is a recommender?

  • A recommender system provides relevant content to the user based on knowledge of the user, the content and interactions between the two
  • Recommendations work best when you have lots of data and a lot of things (e.g. products or content) to recommend
  • There’s a spectrum of personalisation
    • Non-Personalised (e.g. Top 10)
    • Semi-Personalised (e.g. location)
    • Personalised
  • User profiles enable you to move from households to people
  • Explicit user ratings shouldn’t override everything
    • often aren’t as good at predicting what people will do
    • users might not want to share that data
    • have to ask for help creating a taste profile
  • Implicit ratings (based on usage data) are often more trustworthy
  • Factors to consider:
    • Explicit ratings
    • User behaviour
    • Freshness
    • Boosting (certain content/products)
    • Showing variety
  • Two levels of intelligence to recommendations:
    • Does it make the list? (E.g. is it in Popular on Netflix?)
    • What position on the list? (How the Popular on Netflix list is presented to me)
  • Types of data involved:
    • Historical data: behaviour data
    • Content metadata
    • Contextual data: device, location, alone or in group, time of day/week/year
    • Demographic: Age, gender, nationality, income, politics
  • How to describe a system:
  • Domain, purpose, context, personalisation, opinions, privacy, trustworthiness, interfaces and algorithms
    • Domain: type of content recommended
    • Purpose: purpose for the end user and the provider. North Star Metric?
    • Context: the environment the user receives the recommendation in
    • Personalisation level: non-personalised / semi-personalised / personalised. Seeded recommendations may be based on the item being viewed (customers also bought)
    • Whose opinions: experts or the masses
    • Privacy & trustworthiness: protecting user privacy; are the recommendations trusted?
    • Interface:
      • Kinds of inputs/outputs:
        • Explicit input → user gives ratings or preferences
        • Implicit input → how the user interacts with the system
        • Outputs → predictions, recommendations, filtering
      • Types of presentation:
        • Organic: recommendations are a natural part of the page
        • White-box: recommenders are explainable
        • Black-box: you can’t explain why a recommendation is made
    • Algorithms:
      • Collaborative filtering: employs usage data, e.g. finding similar users and content they like. Needs lots of data.
      • Content-based: uses content metadata and user profiles, e.g. movie metadata. Needs good descriptions.
      • Hybrid: a mix of the two types
  • There’s a tradeoff between accuracy and explainability (or interpretation)
  • Can you predict what a user needs right now?
  • Can you give people a personal service?

Chapter 2: User behaviour and how to collect it

  • Data is really important. Garbage in, garbage out
  • Two types of feedback:
    • Explicit: ratings or likes
    • Implicit: activity (deduced from monitoring behaviour)
  • Behavioural evidence is more powerful than a user’s declared taste
    • May say they love french movies, but actually watch more super hero movies
  • Browser Behaviour events can demonstrate interest (in order)
    • View + page duration
    • Scroll a list or browse a category
    • Mouse over for details
    • Expansion clicks (view more details)
    • Search terms
    • Save for later
    • Buy or consume
      • Content: Start, stop, speed, play to end, replay
    • or they can demonstrate the opposite (stop watching after three minutes and never revisit)
  • What to choose:
    • High user involvement → collaborative filtering
    • Low user involvement / 1 time visits → content-based
  • Visitor ratings
    • Submit preferences
    • Save a rating
    • Negative ratings
    • Voting (TripAdvisor, Hacker News)
  • Recommenders only work if you can identify the users; forcing sign-up/login can help
  • Storage is cheap, save as much data as you can

Chapter 3: Monitoring the system

  • Create a basic funnel dashboard:
    • Visitors → Conversion → Revenue
    • Show KPIs over a historical period
  • Conversion rate = number of goal achievements / number of visits
    • you can measure conversion by users or visits
    • Amazon would choose visit, a car company might choose a customer
  • Conversion path is the sequence of pages and actions a user takes from the time of arrival on the landing page until conversion

Chapter 4: Ratings and how to calculate them

  • You can compute implicit ratings by converting web behaviour into content ratings.
    • Data for implicit ratings is usually easier to collect than explicit ratings from the user
  • Before starting consider:
    • Your site’s purpose
    • The events that happen ahead of consumption
    • How often those events occur
  • User-item preferences are often expressed in a user-item matrix.
    • row for each user
    • column for each item
    • ratings in each cell
  • The rating binds the user, the item and their sentiment toward it
  • Expect empty cells to be common (the sparsity problem). Many recommender systems will try to predict values for empty cells.
  • Explicit ratings don’t always reveal their true preferences.
  • To calculate an implicit rating: take all events recorded between each user and item, and determine a number that indicates how much they like the item
    • Give each behaviour event a value (positive or negative)
    • Aggregate the values to create a score that predicts likeliness to buy
    • Events clustered in the same session might mean less than those spread across sessions
    • Amazon can get by with just using purchase events as they have so many customers
      • others should expect to use all browsing history
  • Remember that date/time/day might change what’s best to recommend
  • You can introduce time-decay to your algorithm to give recent signals higher importance
  • Normalise the values so they’re all in the same range
  • Once you have a User-item Matrix with your values you can use a naive Bayes classifier to predict how likely the behaviour is to lead to a buy
  • Term Frequency - Inverse Document Frequency (TF-IDF)
    • Less frequent items provide more value
    • A user buying a popular item doesn’t say much about taste
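The implicit-rating recipe in this chapter (score each event, decay old signals, aggregate) can be sketched as follows. The event weights and half-life are invented:

```python
# Sketch: turn logged events into an implicit rating using per-event
# weights and exponential time decay (weights and half-life invented).

EVENT_WEIGHTS = {"view": 1.0, "save": 3.0, "buy": 10.0}
HALF_LIFE_DAYS = 30.0   # a signal loses half its weight every 30 days

def implicit_rating(events):
    """events: list of (event_type, days_ago) for one user-item pair."""
    score = 0.0
    for event_type, days_ago in events:
        decay = 0.5 ** (days_ago / HALF_LIFE_DAYS)
        score += EVENT_WEIGHTS[event_type] * decay
    return score

recent = [("view", 1), ("buy", 2)]
stale = [("view", 200), ("buy", 210)]
print(implicit_rating(recent))   # recent activity scores high
print(implicit_rating(stale))    # the same events, decayed away
```

In practice you would then normalise these scores into a common range, as the chapter suggests, before putting them in the user-item matrix.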

Chapter 5: Non-Personalised Recommendations

  • Start with non-personalised recommendations, then graduate to personalised
    • Something to surface before you ‘know your users’
    • Examples:
      • Most popular: humans are flock animals
      • People who bought x also bought y (seeded recommendation)
      • Frequently bought together (seeded recommendation)
      • Most liked
  • Content should be ordered based on what you think most users are interested in
    • Use recency to keep your product dynamic
    • For some categories like films, most-recent-first is a sensible rule
    • For discount vouchers, highest-discount-first is a good option
  • On deploying:
    • calculate as much offline as you can
    • fail gracefully (fall back to something simple, or a hard coded list)
    • keep it separate from other parts of your website
  • Seeded recommendations use an item as an input to find others (e.g. frequently bought together)
  • Creating Frequently Bought Together
    • Do affinity analysis to determine links between products
      • Called frequent sets when products are seen in the same basket together
    • Most products are bought with popular items (e.g. milk and bananas in grocery)
    • Instead, find the products that are bought together but not with everything else
      • To do this, divide the number of times the items are bought together by the number of times the first one appears
  • Hold multiple databases (with version numbers) of recommendations. This allows you to fall back.
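The frequently-bought-together calculation above amounts to an association-rule confidence over baskets. A minimal sketch with invented baskets:

```python
# Sketch: confidence(X -> Y) = count(baskets with X and Y) / count(X),
# which damps links to items that are simply bought with everything.

baskets = [
    {"milk", "bananas", "nappies"},
    {"milk", "beer", "nappies"},
    {"milk", "bananas"},
    {"beer", "nappies"},
]

def confidence(x, y):
    with_x = [b for b in baskets if x in b]
    both = [b for b in with_x if y in b]
    return len(both) / len(with_x)

# nappies -> beer is a stronger link than nappies -> bananas here:
print(confidence("nappies", "beer"))
print(confidence("nappies", "bananas"))
```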

Chapter 6: The User (and content) who came in from the cold

  • The cold start: when you don’t have knowledge of your users, so can’t personalise well
    • Can apply to users and items
    • Gray sheep → users who have individual tastes (no other users have bought what they have)
  • Boost cold products → make an effort to place them where they’ll get noticed
  • Cold visitor tactics:
    • Ask them for a few explicit ratings
      • If asking a user to rate items - use active learning (the art of selecting items to rate that will teach you the most about preferences)
    • Look into their search terms
    • How soon you decide to surface personal recommendations is a trade-off between quality and time
    • You can use recommenders that fall back to most-popular content for cold visitors
  • Cold items are best handled using content based filtering
  • Machine Learning isn’t magic, it can only find signals that actually exist in the data.
  • Create user segments, and see how quickly you can classify people into a segment
    • Try demographic recommendations / personalised
    • Try and get more data (e.g. facebook account)
  • Create association rules based on early behaviour actions (and business logic)
  • Be careful to show only items that are appropriate based on items they’re already viewing
  • Segments can be created using cluster analysis
    • Cluster: a group with similar traits
    • You can find hidden traits and similarities in the data
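Cluster analysis for user segments can be sketched with a tiny k-means loop. The features and user data are invented, and a real system would use a library implementation:

```python
import random

# Minimal k-means sketch: segment users by two behavioural features.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    n = len(cluster)
    return tuple(sum(dim) / n for dim in zip(*cluster))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to nearest centre
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i]   # recompute centres
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# users described by (visits per week, avg. session minutes) - invented
users = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (9, 9)]
centroids, clusters = kmeans(users, k=2)
print(centroids)
```

Each resulting centroid describes a segment (here, roughly "light" vs. "heavy" users); a new visitor is classified by the nearest centroid.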
  • You can use meta-data from an item to find similar items
    • Grouping and abstraction is really helpful…
      • Art by artist
      • News by topic
      • Articles by author
      • Song to artist
      • Product to brand
    • Need to get the abstraction or generalisation just right
      • too general and it will lose its value
      • too specific and it won’t enable you to connect any more items

Part II: Recommender Algorithms

Chapter 7: Finding similarities among users and content

  • Personalised recommendations almost always contain calculations of similarity
    • Finding items like the ones you like
    • Finding users who like what you like (e.g. have similar taste)
  • Similarity: Given two items, create a function that scores their similarity between 0 and 1
    • 0 = nothing in common, 1 = identical objects
  • Jaccard similarity is used to compare sets (e.g. a set of movies a user has bought)
  • Choosing a similarity measure by data type:
    • Unary data (e.g. Person A likes films B and X; Person B likes films A and B) → Jaccard similarity
    • Binary data (e.g. Person A dislikes film B and likes film H; Person B likes film B and dislikes film Q) → Jaccard similarity
    • Quantitative data (e.g. Person A rated film B 5/10; Person B rated film B 8/10) → Pearson or cosine similarity
  • Jaccard similarity: a measure that indicates how close two sets are to each other
    • Also called the Jaccard index or Jaccard similarity coefficient; the Jaccard distance is 1 minus the similarity
  • Think of each piece of content as being a bag that contains all the users who watched it. Here’s an example of two sets:
    • Movie A: Person A, Person B, Person D, Person G
    • Movie B: Person A, Person B, Person K, Person Y
  • Or you could represent the same data in a ‘Unary user-item’ matrix
| Watched / Bought | Film A | Film B | Film C |
| --- | --- | --- | --- |
| Person A | 1 | 0 | 1 |
| Person B | 1 | 0 | 0 |
| Person C | 1 | 0 | 1 |
| Person D | 1 | 1 | 0 |
  • To find similarity between two items, calculate how many users bought both items, then divide by how many users bought either one (or both).
$Similarity_{Jaccard}(i,j) = \frac{\#\,\text{users who bought both items}}{\#\,\text{users who bought either } i \text{ or } j}$
  • How that’s calculated:
| Watched / Bought | Film A | Film B | Similar or not? |
| --- | --- | --- | --- |
| Person A | 1 | 0 | 0 |
| Person B | 1 | 0 | 0 |
| Person C | 1 | 0 | 0 |
| Person D | 1 | 1 | 1 |
| Sum |  |  | 1 |
  • Here 1 user bought both films, out of 4 who bought either, so the Jaccard similarity is 1/4 = 0.25
  • Similarity is domain specific, so think of it as a relative measure
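The Jaccard calculation above can be sketched in a few lines of Python (the two sets are the example users for Movie A and Movie B; names are illustrative):

```python
def jaccard_similarity(a, b):
    """Share of users in both sets out of users in either: 0 = nothing in common, 1 = identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# the two example movies from above
movie_a = {"Person A", "Person B", "Person D", "Person G"}
movie_b = {"Person A", "Person B", "Person K", "Person Y"}
print(jaccard_similarity(movie_a, movie_b))  # 2 shared users out of 6 in total
```

Two users (Person A and Person B) appear in both sets out of six distinct users, giving 2/6 ≈ 0.33.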
  • Measuring distance with $L_p$-norms
  • Calculating the $L_1$ norm (Manhattan distance or taxicab geometry)
    • If two users have rated a movie from 1-10, you can measure the distance between those ratings as $|r_{userA,i} - r_{userB,i}|$ for each movie $i$
    • You can then normalise that back to a scale of 0 to 1
    • You can then do that across all movies
    Calculating mean absolute error (MAE)
    • MAE is the average of the $L_1$ norm; you're dividing by n to get the average distance between ratings
    • $MAE = \frac{1}{n}\sum_{i=1}^n |r_{userA,i} - r_{userB,i}|$
    Calculating the $L_2$ norm (Euclidean norm)
    • If the $L_1$ norm was the distance travelled by a taxicab between two points, the $L_2$ norm is the distance as the crow flies
    $distance(userA, userB) = \|r_{userA} - r_{userB}\|_2 = \sqrt{\sum_{i=1}^n |r_{userA,i} - r_{userB,i}|^2}$
    • The Euclidean norm is often used as a measure of how well machine learning models perform. Taking the square root of the averaged squared differences gives the root mean squared error (RMSE)
    $Root\,Mean\,Squared\,Error\,(RMSE) = \sqrt{\frac{1}{n} \sum_{i=1}^n (r_{userA,i} - r_{userB,i})^2}$
    • You can turn this into a similarity by taking 1 over (1 + the sum of squared differences)
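As a sketch, the distance measures above (and the distance-to-similarity trick) look like this in Python; the rating vectors are made up for illustration:

```python
import numpy as np

def manhattan(a, b):
    """L1 norm: the taxicab distance between two rating vectors."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def euclidean(a, b):
    """L2 norm: the straight-line (as the crow flies) distance."""
    return float(np.sqrt(((np.asarray(a, float) - np.asarray(b, float)) ** 2).sum()))

def rmse(a, b):
    """Root mean squared error: square root of the average squared difference."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(np.sqrt((d ** 2).mean()))

def distance_similarity(a, b):
    """Turn the sum of squared differences into a 0..1 similarity."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return 1.0 / (1.0 + (d ** 2).sum())

user_a, user_b = [4, 7, 9], [5, 5, 8]   # two users' ratings of the same three movies
print(manhattan(user_a, user_b), euclidean(user_a, user_b), rmse(user_a, user_b))
```

For these vectors the Manhattan distance is 4, the Euclidean distance is √6, and the RMSE is √2.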
    Calculating cosine similarity (add normalisation to capture that users rate differently)
    • You can look at the rows of the rating matrix as vectors in space, and then look at the angle between them
    | Rating | Film A | Film B |
    | --- | --- | --- |
    | Person A | 3 | 5 |
    | Person B | 4 | 1 |
    | Person C | 2 | 5 |
    • You could plot these as positions on a 2D graph (as there are only 2 films) but it can work in multiple dimensions too
      • The Rating of Film A can be the x axis, Rating of Film B can be the y axis
      • You then plot the people on the chart as per their film ratings
      • You measure the angle from (0,0), where the x and y axes intersect, to each of the users. The closer the users' angles, the more similar their preferences (tastes)
      • image
    • Amazon finds item-item collaborative filtering more efficient, as they have more customers than items. There are fewer similarities to calculate when done for items
    • You can calculate the angle using the following cosine function:
    sim(i,j)=rirjr12rj2=uri,urj,uuri2,uurj2,usim(i,j)=\frac{r_i·r_j}{||r_1||_2||r_j||_2}=\frac{\sum_ur_i,_ur_j,_u}{\sqrt{\sum_ur_i^2,_u}\sqrt{\sum_ur_j^2,_u}}
    • But because this is a comparison of user ratings, we need to take into account that users use rating scales differently (some users might rate higher, have lots of 10s and no 1s).
      • Badrul Sarwar et al. came up with an adjusted cosine similarity to offset this drawback by subtracting each user's average rating
      • $sim(i,j) = \frac{\sum_u (r_{i,u} - \bar{r}_u)(r_{j,u} - \bar{r}_u)}{\sqrt{\sum_u (r_{i,u} - \bar{r}_u)^2} \, \sqrt{\sum_u (r_{j,u} - \bar{r}_u)^2}}$
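A minimal sketch of plain and adjusted cosine similarity, using the small rating table above; the per-user mean-centring is the Sarwar et al. adjustment:

```python
import numpy as np

def cosine_sim(a, b):
    """Angle-based similarity between two rating vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjusted_cosine(ratings, i, j):
    """Item-item similarity on a user x item matrix, after subtracting
    each user's average rating (the Sarwar et al. adjustment)."""
    r = np.asarray(ratings, float)
    centred = r - r.mean(axis=1, keepdims=True)   # remove per-user rating bias
    return cosine_sim(centred[:, i], centred[:, j])

ratings = [[3, 5], [4, 1], [2, 5]]      # Persons A, B, C rating Films A and B
print(cosine_sim([3, 5], [4, 1]))       # raw cosine of Person A vs Person B
print(adjusted_cosine(ratings, 0, 1))   # Film A vs Film B after centring
```

After centring, the two film columns point in exactly opposite directions, so the adjusted cosine comes out at -1.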
    Calculating Pearson’s correlation coefficient
    image
    • If you plot ratings on the y axis and items on the x axis you can start to see which users have similar taste
    • Pearson’s correlation coefficient looks at the points and measures how different each point is on average
      • Scale: Opposite = -1 → Identical = 1
      • No connection = 0
    • Pearson also controls for user rating differences by subtracting the average
    sim(i,j)=eU(ri,uri)(rj,urˉj)ˉeU(ri,urˉi)2eU(rj,urˉj)2sim(i,j)=\frac{\sum_{e\in U(r_{i,u}-\bar{r_i)(r_{j,u}-\bar{r}_j)}}}{\sqrt{\sum_e\in U(r_{i,u}-\bar{r}_i)^2}\sqrt{\sum_e\in U(r_{j,u}-\bar{r}_j)^2}}
    • this is for the set of users which have rated both i and j
  • A user's rating should always be viewed in relation to that user's other ratings. To normalise ratings, subtract the user's average rating from each of their ratings
    • Normalising ratings around zero like this means you’ll get positive and negative numbers
  • Pearson correlation is similar to cosine. Adjusted cosine controls for normalisation, which makes it the same function as Pearson's… so the only real difference is
    • Pearson uses → items that two users have rated
    • Adjusted cosine uses → all rated items by either or both (setting ratings to zero when one of the users doesn’t rate an item)
    • ⚠️
      This zero effect will push similarity toward zero if there is little overlap between users
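Pearson's coefficient can be sketched directly from the definition; the example rating vectors are illustrative:

```python
import numpy as np

def pearson_sim(x, y):
    """Pearson correlation over co-rated items: -1 = opposite taste, 0 = no connection, 1 = identical."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()        # subtract each average rating
    denom = np.sqrt((xd ** 2).sum()) * np.sqrt((yd ** 2).sum())
    return float((xd * yd).sum() / denom) if denom else 0.0

print(pearson_sim([1, 2, 3], [2, 4, 6]))   # same taste on a different scale -> 1.0
print(pearson_sim([1, 2, 3], [6, 4, 2]))   # opposite taste -> -1.0
```

Note how the first pair gets a perfect score even though the raw ratings differ: subtracting the means removes the scale difference between the two raters.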
Calculating k-means clusters (segmenting users can help save compute)
  • Calculating similarity between users or items is the achilles heel of neighbourhood collaborative filtering algorithms.
    • The algorithm needs a similarity score from each user to all others (or each item to all other items)
    • Divide your data set into smaller groups to reduce the amount of calculations
      • If you cluster users into 4 groups → you can first look up the group the user is in → then you don’t have to compare them to the other 3 groups (reducing time and compute you’re spending creating user-user similarity)
    • Clustering is also called segmentation
    • k-means clustering is an unsupervised ML algorithm.
    • You need to give it a parameter for it to run (it’s a parametric algorithm)
    • You provide the number of clusters it should find (k)
    • It will then find k points (called centroids) that minimise the sum of the distances between all items and their centroids
  • You can use clusters to…
    • Find similar users
      1. Pre-compute clusters
      2. Check what cluster they are in
      3. Find the most similar people in that cluster
    • Place new users in groups
      • Find what centroid they are closest to, which will determine their group
  • Scenario: Find a user that’s similar to me, so you can recommend things they like that I haven’t yet come across. You need…
    • Enough items in common to determine that your tastes match
    • They also need a set of items you haven’t seen / come across
    • In a diagram…
      image
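The clustering idea above can be sketched with a toy k-means (naive initialisation from the first k points; a real system would use scikit-learn or k-means++), including placing a new user by nearest centroid:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: returns (centroids, labels)."""
    pts = np.asarray(points, float)
    centroids = pts[:k].copy()   # naive init: the first k points
    labels = np.zeros(len(pts), dtype=int)
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = pts[labels == c].mean(axis=0)
    return centroids, labels

def assign_new_user(user, centroids):
    """Place a new user in the group of the closest centroid."""
    return int(np.linalg.norm(centroids - np.asarray(user, float), axis=1).argmin())

# two obvious user groups in a 2-D taste space (made-up data)
users = [[1, 1], [8, 8], [1.2, 0.8], [0.9, 1.1], [8.2, 7.9], [7.8, 8.1]]
centroids, labels = kmeans(users, k=2)
print(labels, assign_new_user([7.9, 8.0], centroids))
```

Once the clusters are pre-computed, a new user only has to be compared against k centroids instead of against every other user.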

Chapter 8: Collaborative filtering (neighbourhood-based collaborative filtering)

  • Collaborative filtering is the art of using ratings for generating recommendations. It was developed to fight information overload
  • Assume that people keep their tastes over time → if you agree with somebody in the past, you’ll likely agree with them again in future
  • Two ways to do neighbourhood:
    • Find users with similar tastes and recommend things they like that you haven’t seen
    • Find items similar to items that you already like (item based filtering)
  • User-item matrix
| Rating | Film A | Film B | Film C |
| --- | --- | --- | --- |
| Person A | 5 | 3 | 1 |
| Person B | 4 | 5 |  |
| Person C | 3 | 5 | 3 |
| Person D | 5 |  | 4 |
  • The task is to calculate predictions for the empty cells, using the data already in the matrix
  • It's normal to have more empty cells than filled ones
  • In ML pipelines you do as much as possible before the user visits expecting recommendations
    • Most users don't have many item ratings. Adding a new rating can change the calculation of a user's taste
    • Don’t pre-compute which users are similar, but you can compute what items are similar
    • If you have more users than items → use item-based filtering
    • If you have more items than users → use user-based filtering
  • User similarity is good for serendipity (connecting you with dissimilar content that you'll rate highly), whereas item similarity will only provide similar items
  • Data needs to be well connected to make recommendations
    • If no users have rated content, no recommendations will be made
    • If your tastes don’t overlap with others, users won’t receive good recommendations
  • Steps to implement collaborative filtering:
    1. Find all the items that are rated by several users
    2. Calculate the number of users who are connected to one or more of those items
    3. Other users won’t receive recommendations
  • Your system doesn’t need any domain knowledge to do collaborative filtering
  • Item similarity is stable, so it can be calculated beforehand (or offline)
  • Normalise your rating matrix before using it
  • You can calculate the cosine similarity of each item-pair
    1. Cosine similarity of each item-pair:
       | Cosine Similarity | Film A | Film B | Film C |
       | --- | --- | --- | --- |
       | Film A | 1 | 0.63 | -0.7 |
       | Film B | 0.63 | 1 | 0.78 |
       | Film C | -0.7 | 0.78 | 1 |
    2. In this example, people who rate Film A highly will rate Film C low
    3. People who rate Film B highly will rate Film C highly
  • Be careful if calculating similarity between users that have too few items in common
    • Always require a minimum of overlapping users (number of users who have rated both films)
  • The collaborative filtering tradeoff
    • Narrow the number of users and you risk not finding any content to recommend
    • Make it too wide and you risk recommendations containing content outside of the users taste
  • How to select a neighbourhood:
    • Neighbourhood = a set of items that are similar to the content you’re looking at
    • Techniques:
      • K-means clustering → not great if your user is on the border of the cluster, or if the cluster has a strange shape
      • Top-N → define a number N as the number of neighbours that should be in the neighbourhood, and then say that all items have N fellow items in the neighbourhood
        • Top-N can force the system to use points that are dissimilar (bad for recommendations)
      • Threshold → only want items of a certain standard to be in the neighbourhood by requiring that the similarity needs to be above a specific constant
        • calibrate a threshold to show more niche or more popular items
  • Calculating predicted ratings (classification or regression)
    1. $Pred(u,i) = \bar{r}_u + \frac{\sum_{j \in S_i} \left( sim(i,j) \times r_{u,j} \right)}{\sum_{j \in S_i} sim(i,j)}$
    2. $\bar{r}_u$ is the average rating of user $u$
    3. $r_{u,j}$ is the active user $u$'s (normalised) rating of item $j$
    4. $S_i$ is the set of items in the neighbourhood
    5. $Pred(u,i)$ is the predicted rating for user $u$ of item $i$
    6. $sim(i,j)$ is the similarity between item $i$ and item $j$
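The prediction formula can be sketched as follows; the neighbour ratings are mean-centred here, per the normalisation advice above, and all numbers are made up:

```python
import numpy as np

def predict_rating(user_mean, neighbour_sims, neighbour_ratings):
    """Similarity-weighted average over the item neighbourhood; the neighbour
    ratings are mean-centred so the user's rating scale is added back only once."""
    sims = np.asarray(neighbour_sims, float)
    rates = np.asarray(neighbour_ratings, float)
    return float(user_mean + (sims * (rates - user_mean)).sum() / sims.sum())

# the user averages 3.5; two neighbours of the target item were rated 5 and 4,
# with similarities 0.9 and 0.6 to the target item
print(predict_rating(3.5, [0.9, 0.6], [5, 4]))  # about 4.6
```

Because both neighbour ratings sit above the user's average, the predicted rating lands above it too, pulled hardest by the more similar neighbour.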
  • Cold start problems:
    • Collaborative filtering requires data, which you won’t have for new users or items
      • Hacks:
        • Ask new users to rate a few items when they arrive
        • Create a new arrivals list to showcase new products
        • Use explore/exploit methods to inject new data in say every 6th recommendation slot
  • In summary
    • Find similar items to the ones the active user likes
    • Calculate predicted ratings for these items
    • Use predictions to calculate recommendations
  • Decisions you’ll have to make along the way
| When | What decisions |
| --- | --- |
| Ratings | Only the positive ones? Only the most recent ones? How should you normalise them? |
| Similarity | How many users need to rate an item-pair for its similarity to be calculated? Should you use only positive similarities? |
| Neighbourhood | What method to select a neighbourhood? How big should it be? |
| Predicting | Should you use classification or regression? Should you use a weighted average? |
| Returning | Should you return them all? Or just the ones above a threshold? |
  • Pros and cons of collaborative filtering:
    • Cons:
      • Sparsity - most data sets aren’t dense enough to recommend items other than the most popular
      • Gray sheep - hard to find recommendations for users with unique tastes
      • Number of ratings - you need lots of user ratings
      • Similarity - it’s content-agnostic, based on behaviour, and will skew to popular content

Chapter 9: Evaluating and testing your recommender

  • The Netflix Prize frames the recommendation problem as accurately predicting ratings.
    • The best recommender systems also account for diversity, context, evidence, freshness, and novelty
  • Be clear on why you’re implementing a recommender… be clear about what you want to achieve
  • All data applications are ‘living’ → they require maintenance and monitoring
  • Implementation:
    1. Start with the simplest algorithm first → then add complexity if needed
    2. Test on a small data set
    3. Coded and tested
    4. Offline algorithm evaluation
    5. Introduce to humans (control group)
    6. Release to a small % of users
  • Be clear about your success metrics.
  • If the key success metric is lagging (retention of subscribers) use a leading indicator (users watching more content)
  • Frame hypothesis: the goal of the test (e.g. Recommender B produces recommendations that are clicked more often than recommendations from Recommender A)
  • It can make more sense to say that a recommendation algorithm is successful if a user clicks one or more times on the recommendations. Success depends on the domain.
  • How can you make the customers happy?
    • The site understands my tastes
    • The site gives me a nice variety of recommendations
    • The site surprises me
    • The site covers everything in the catalogue (business goal)
  • Understanding my taste = minimising prediction error
  • The Matthew effect: “the rich get richer and the poor get poorer.”
    • If popular items are always favoured, you’re also creating what we call a filter bubble.
  • Coverage: The better the diversity, the better the coverage
    • An ideal recommender system would enable users to navigate the full catalog
    • Coverage has two sides:
      • how much of the catalogue is it capable of recommending (catalogue coverage)
      • how many users can it recommend something to (user coverage)
    • You can calculate user coverage by iterating over all users, and seeing what it returns
      • item-item collaborative filtering → covered 13% of movies
      • content-based recommenders might be better
  • Serendipity: Serendipity is about giving the user that sensation, so that all their visits aren’t more of the same. Serendipity is subjective and difficult to calculate.
    • Don’t constrain the recommender’s returns too much. Constraining means less serendipity.
  • Types of evaluation:
    • To do a true evaluation, you need a complete ground truth set
      • A data set containing information about all combinations of users and content.
      • You need to assume the user-item combinations are the truth and representative of all users
  • The 3 testing scenarios:
    • offline experiments
    • controlled user experiments
    • online experiments
  • Offline evaluation uses data that you regard as truthful. Split the data into two parts, feed one part to the recommender, and use the other part to verify that it predicts ratings for the items that were hidden from it
    • a recommender can pass an offline evaluation with flying colours but still fail miserably in production
  • Measuring error metrics:
    • Measure the difference between the ratings in your historical data set with what the recommender algorithm outputs
      • The difference between a user’s rating and the recommender’s prediction of that rating
        • mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
          • Mean Absolute Error (MAE): takes the absolute value of each of the differences and finds the average of those. Absolute values are used so a high guess doesn’t cancel out a low guess.
          • Root Mean Squared Error (RMSE): puts big penalties on big errors. Big errors count more than several smaller ones.
            • If it’s important that none of your users gets bad recommendations, then you should use RMSE.
    • Adjustments for skewed data:
      • Weight all users evenly. So those that rate only a few are still considered
      • Split popular items and long-tail items into two separate datasets
  • Measuring decision-support metrics
    • True positive (TP)— Item recommended and consumed by the user
    • False positive (FP)— Item was recommended but the user didn’t consume it.
    • False negative (FN)— The recommender didn’t include the item in a recommendation and the user consumed it.
    • True negative (TN)— It wasn’t recommended and the user didn’t consume it.
    • Precision— What fraction of the recommended items the user consumed:
    • Precision=TruePositiveTruePositive+FalsePositivePrecision = \frac{TruePositive}{TruePositive + FalsePositive}
    • Recall— What, out of all the items that the user consumed, was recommended
    • Recall=TruePositiveTruePositive+FalseNegativeRecall = \frac{TruePositive}{TruePositive + FalseNegative}
    • It’s considered more important to optimise precision, while not putting too much importance on whether the user gets all the possible relevant items (the recall).
    • Decision-support comes from information filtering, originally used for calculating the quality of search results
    • In the world of recommenders, the measurement is restricted to the top k elements (precision at k)
      • The top k elements are measured by counting the relevant items among the first k items
      • $P@k(u) = \frac{\#\,\text{relevant content in the top } k \text{ positions}}{k}$
      • Mean Average Precision(MAP): calculates the average of each precision at k, takes rank into account
      • Discounted Cumulative Gain: finds each relevant item and then penalises the items the farther down it is in the list. It’s much like the MAP that you looked at earlier. But where the MAP only works with relevance, the DCG looks at levels of relevancy. So each item needs a relevancy score.
      • NDCG: Normalises DCG vs the optimal ranking, and is therefore a more universal benchmark across systems
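The decision-support metrics above are straightforward to sketch; the ranked list and consumed set are illustrative:

```python
def precision_recall(recommended, consumed):
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    recommended, consumed = set(recommended), set(consumed)
    tp = len(recommended & consumed)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(consumed) if consumed else 0.0
    return precision, recall

def precision_at_k(ranked, consumed, k):
    """Fraction of the top-k recommendations that the user actually consumed."""
    return sum(1 for item in ranked[:k] if item in set(consumed)) / k

ranked = ["a", "b", "c", "d", "e"]     # recommendations, best first
consumed = {"a", "c", "f"}             # what the user actually consumed
print(precision_recall(ranked, consumed))      # 2 of 5 recommended, 2 of 3 consumed
print(precision_at_k(ranked, consumed, k=3))   # 2 of the top 3 are relevant
```

Here "f" was consumed but never recommended (a false negative, hurting recall), while "b", "d", and "e" were recommended but ignored (false positives, hurting precision).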
  • Preparing data:
    • Filter out users with only a small number of interactions
    • If you have too much data → do stratified sampling
  • Split the data into training, test and validation
    • The test set is where you calculate how good the algorithm is at predicting the produced ratings
    • It makes sense to split most recommendation data by time
    • You can also split user ratings (not users) between sets
  • Some recommenders don’t distinguish between ratings done now, or a year ago. A ratings age might have an effect on their ability to predict
  • Online Evaluation
    • Controlled Experiments → A/B testing
    • Determine if the recommendations are good, show two different types of recommendations to the user, and see which seems to work better
    • You can use explore/exploit or multi-armed bandits to control rollout
  • Keep feedback loops in mind. If the recommender works it might hinder diversity.
  • Find a way to inject new items into the loop (Consumed → in training data → model → in recommendation)

Chapter 10: Content-based filtering

  • Earlier chapters created recommendations using only the interactions between users and content; content-based filtering instead works from the content itself
  • You can use content-based filtering to create…
    • similar items recommendations
    • personal recommendations based on taste
  • Finding content visitors think is similar is key. Typically systems will use tags or textual descriptions.
  • Systems based on text first extract keywords (removing those that produce noise)
    • Use the important words you find to create a list of categories (also called topics)
    • LDA algorithm (Latent Dirichlet Allocation):
      • Latent → topics found don’t compare to any category that you know
      • Dirichlet → is the way documents are described using these topics
      • Allocation → words are allocated into topics
  • Content-based filtering is about extracting knowledge from the content
  • You need the following modules:
    • Content analyser: creates a model based on the content, creating a profile for each item
      • Feature extraction: extracting things that are important for the algorithm to work
      • Mapping data into something a machine can use (vector)
    • User profiler: creates a user profile (sometimes just a consumed items per user)
    • Item retriever: retrieves relevant items found by comparing user profiles to item profiles
  • Metadata can be facts or tags (facts are things like 'film produced in 2021'; tags can mean different things to different people, e.g. 'uplifting')
    • One of the biggest showstoppers for developers trying to use content-based recommenders is that they can’t get the data about the items.
      • You could build it yourself or you could hire people to go through the content and tag
  • To represent basic film tags, you can make a simple table:
    | Year | Starring Ben Affleck | Action | Explosions |
    | --- | --- | --- | --- |
    | 2016 | 0 | 1 | 5 |
    • Expect a long list of features
    • A feature needs to appear in enough of the entries to be relevant (an actor that's in only one film won't help - you need to find things in common)
  • Normalise the data to be between 0 and 1 (scale things like year ‘2016’ too)
  • TF-IDF → what words are in the article? How often do they appear?
  • If you’re using descriptions, use a bag-of-words model to split descriptions into an array of words (and remove words from a stopword list)
    • you can also look at removing high frequency words that appear in everything
  • Finding important words with TF-IDF
    • TF: Term frequency:
      • tf(word,document) = how many times does the word appear in document
      • Or tf(word,document) = 1 + log(word frequency)
    • The more times a word is present in a document, the higher the chance that it’s important.
    • It's more likely to be important if the word is only present in a few documents
    • Inverse term document frequency (idf in the equation)
      • The number of all your documents divided by the number of documents that contain the word.
    • Together, the product is tf-idf, which is defined like this:
      • $tf\text{-}idf(word, document) = tf(word, document) \times idf(word, document)$
  • When you find words that provide large TF-IDF returns, you can add those to the list of features
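A bare-bones tf-idf sketch using raw term counts (the documents are made up; a real pipeline would also strip stopwords first):

```python
import math

def tf_idf(word, document, corpus):
    """Raw term count times log(N / number of documents containing the word).
    Documents are lists of lower-cased tokens."""
    tf = document.count(word)
    containing = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

docs = [
    "the raid is an action film".split(),
    "the film about a quiet garden".split(),
    "the action and more action all day".split(),
]
print(tf_idf("action", docs[0], docs))  # in 2 of 3 docs -> a useful feature
print(tf_idf("the", docs[0], docs))     # in every doc -> idf = 0, no signal
```

Words that appear everywhere score zero regardless of their frequency, which is exactly why tf-idf is a handy filter before building a feature list.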
  • Most people now use LDA (not TF-IDF), but many use TF-IDF to clean the input to LDA
  • Topic Modelling using the LDA
    • LDA (Short for latent Dirichlet allocation)
      • Latent: hidden (or found by the algorithm)
      • Distribution of distributions
      • Allocation: all words are allocated to the latent topics
    • An LDA is a generative model
  • LDA’s use
    • Content → Topics → Words
      • Content can be described by a series of topics, topics can be described by what words are used when describing the topic
      • The goal is to connect words with topics and topics with documents!
    • Gibbs sampling begins with randomly adding topics to documents, and words to topics → it makes adjustments until it converges on something (that hopefully makes sense)
  • Most of the implementation work in making an LDA function is how you process the text you’re inputting.
  • If you want to control the topics a document contains, you can inject words into the descriptions
  • Key parameters for LDA:
    • k (number of topics) if too low, then documents look similar. If too large, you might end up with no similar documents.
    • High Alpha: distribute each document over many topics
    • High Beta: leads to topics being more similar
  • Once you have the LDA model you can predict the similarity of any two documents
    • The probability distribution can be seen as vectors, so you can use cosine similarity
    • You can also compare a document that wasn’t used to create the model (and get around the cold start problem)
  • To create personalised recommendations, create a user profile
    • Iterate through each of the items the user likes
    • For each item, find similar products
    • Order according to the product of similarity and the users’ ratings from the original list of consumed items
  • Pros and cons of content-based filtering:
    • Pros
      • New items are easy to add (just create the item feature vector)
      • You don’t require much traffic - as it’s the content descriptions that’s driving it
      • Popularity isn’t considered
    • Cons
      • Conflates liking with importance.
      • No serendipity, it’s specialised
      • Limited understanding of content, so system can misunderstand what a user likes

Chapter 11: Finding hidden genres with matrix factorisation

  • Hidden genres = latent factors
    • Latent factors are defined by an algorithm
    • They can represent a user's taste, but it might be hard to understand what the factors mean
    • The Netflix prize was won by an ensemble recommender algorithm too complicated to put into production
  • SVD = Singular Value Decomposition… you can add new users easily, but calculating for a large data set is slow upfront
    • Reduce the amount of data.
    • Do dimension reduction on high-dimensional data.
    • You want to reduce the data so that only the directions that provide more information are retained
    • If the algorithm succeeds, then the vectors pointing in those directions are said to be latent factors.
    • By looking at the behavioral data of users, you can find categories or topics that let you explain users’ tastes with much less granularity than with each movie but more detail than genres. These factors will enable us to position the movies that users like closer to each other.
    • You do dimension reduction by factorizing the rating matrix
      • Helps you find important factors so that you have users and items in the same space, and you can recommend films that are interesting to the user by finding which are nearby. You can also find similar users.
      • The matrix factorisation attempts to push closely related items and users toward each other
    • Calculating recommendations is done by looking into the factor space and finding the ones that are closer to the user.
      • there are tools that help you interpret dimensions in a vector space but they rarely work
  • A matrix is a rectangular array of numbers and declared by a number of rows m and a number of columns n.
  • A vector is a special case of a matrix that has only one column.
  • You have to have values to put in all the cells of the matrix
  • What’s factorisation? Factorisation is about splitting things up. You can split 100 into the following prime factorisation:100 = 2 × 2 × 5 × 5
    • Taking a number and writing it as a number of factors (in this case prime numbers)
  • UV-decomposition allows you to decompose the matrix into hidden features for items and for users.
    • You need to insert values in the U and V matrices in such a way that UV is as close to R as possible.
    • To do factorisation is to find the v’s and u’s that satisfy the equations.
    • One of the most commonly used methods for matrix factorization is an algorithm called SVD (singular value decomposition)
    • You want to find items to recommend to users, and you want to do it using extracted factors from the rating matrix. The idea of factorization is even more complicated because you want to end up with a formula that enables you to add new users and items without too much fuss.
    • image
    • From A rating matrix M, you want to construct two matrices that you can use: one that represents the customer’s taste and one that contains the item profiles.
    • The diagonal matrix in the middle (Σ) is known as the weights matrix.
      • Each of these weights can be an indication of how much information is present in each dimension.
    • A rule of thumb is that you should retain 90% of the information.
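A sketch with NumPy's SVD, using the cumulative singular values as a rough version of the "retain 90% of the information" rule of thumb (the rating matrix is a made-up, already-filled example):

```python
import numpy as np

# toy user x item rating matrix (in practice, imputed or normalised first)
R = np.array([[5.0, 3.0, 1.0, 1.0],
              [4.0, 3.0, 1.0, 1.0],
              [1.0, 1.0, 5.0, 4.0],
              [1.0, 2.0, 4.0, 5.0]])

U, sigma, Vt = np.linalg.svd(R, full_matrices=False)

# keep the smallest number of factors that retains ~90% of the weight
energy = np.cumsum(sigma) / sigma.sum()
k = int(np.searchsorted(energy, 0.90) + 1)

R_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :]   # rank-k approximation of the rating matrix
print(k, np.round(R_k, 1))
```

The rows of `U` (scaled by the weights) become the user profiles and the columns of `Vt` the item profiles, both living in the same k-dimensional factor space.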
  • Solving the problem of the zeros in the rating matrix using imputation
    • Often only 1% of the cells in the rating matrix have values. Two ways to solve that with imputation…
      • You can calculate the mean of each item (or user) and fill in this mean where there are zeros in each row (or column) of the matrix.
      • You can normalise each row, such that all elements are centred around zero, so the zeros will become the average
    • Baseline predictors are a better alternative.
  • The SVD method allows you to add new users and items into the system
    • To add a new user means to add another row to the rating matrix, which in turn means that the decomposed user matrix would also have another row
    • You need ratings for a new user or item before you can fold it in.
    • You still need to recalculate regularly (daily or weekly); the reduction is done to extract topics, and they aren't updated when you fold in a new user or item, which is placed relative to the topics that are already there
  • Cons of SVD:
    • Need to do something with the unfilled cells in the rating matrix
    • Slow to calculate for large matrices
    • Unexplainable
  • Pros of SVD:
    • New users can be folded in as they arrive (SVD model is static and should be updated as often as possible)
  • Baseline predictors make it easier to add values to the empty slots of the matrix
    • Baseline predictors extract the biases for the items and the users to make a better prediction
    • If a user underperforms the average, rate it lower
    • If an item outperforms the average rating, rate it higher
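A baseline-predictor sketch: compute the global mean plus per-user and per-item biases from the observed ratings, then fill empty cells with $b_{ui} = \mu + b_u + b_i$ (the data is illustrative):

```python
# (user, item) -> rating; only the observed cells of the matrix
ratings = {("A", "x"): 4, ("A", "y"): 5, ("B", "x"): 2, ("B", "y"): 3}

mu = sum(ratings.values()) / len(ratings)   # global mean rating

def mean(xs):
    return sum(xs) / len(xs)

# biases: how far a user's / an item's ratings deviate from the global mean
b_u = {u: mean([r - mu for (uu, _), r in ratings.items() if uu == u]) for u in ("A", "B")}
b_i = {i: mean([r - mu for (_, ii), r in ratings.items() if ii == i]) for i in ("x", "y")}

def baseline_predict(user, item):
    """b_ui = mu + b_u + b_i: the value to impute for an empty cell."""
    return mu + b_u[user] + b_i[item]

print(mu, b_u, b_i, baseline_predict("A", "x"))
```

User A rates above average and user B below it; the baseline bakes both tendencies into the imputed values instead of using one flat mean.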
  • Temporal dynamic: bias isn’t static. Items go in and out of fashion.
    • Your rating prediction function can be made function of time.
    • If your data stretches over a long period of time and you have many ratings, it may increase precision
  • SVD puts a lot of weight into the rating matrix (much of the data in which is imputed)
    • Simon Funk came up with a method that only uses the things you need to know
    • Funk SVD (or regularised SVD)
      • Use RMSE to provide a measure of how close you are to the known ratings
      • Use gradient descent, which uses RMSE to walk toward a better solution.
    • When you’re optimising algorithms, your first approach should always be RMSE.
      • Penalises big errors (takes the square of the difference)
  • You want to find values which make the result of the function as small as possible.
    • Gradient descent is a way to find optimal points on a graph, where optimal means the lowest point (or the highest point)
    • Gradient descent starts somewhere and then assess if there’s a direction that makes the function produce a smaller value.
    • Often the best suggestion is to try different starting points and see if they arrive at the same point.
    • Alpha (α) in the world of gradient descent is the learning rate, which translates to how big a step you should take every time you want to move to the next point.
      • too large a step, you might miss the minimum altogether
  • Stochastic gradient descent looks at one rating at a time. Gradient descent is an important part of how you train deep learning networks
  • When you run through all the known ratings once, you have two matrices that can be used to predict ratings.
    • Shuffle the list of ratings before you start as trends in the ratings can push the factorisation in strange directions.
  • The art of Machine learning is in initialising parameters and what the constants should be
    • learning rate γ = 0.001, regularisation λ = 0.02 with 40 factors
  • You’re not only looking for more precision and recall, you’re also concentrating on training the algorithm to understand the domain.
    • You want to create something that not only remembers the answers from the training data but also predicts ratings from the test data.
      • Aim for as little error as possible both on the training data and on the test data as you descend toward the two matrices of user and item factors.
      • Teach the algorithm too little and you won’t have a result that understands the domain; teach it too much and it’ll be overfitted to the training data
  • The levers you’ll use to manipulate the algorithm include the following:
    • Initialise the factors?: Decides where gradient descent should start walking.
    • Learning rate?: How fast you should move at each step.
    • Regularisation?: How much should the algorithm regularize the errors? Should it be the same for all the expressions or should you split it into factors and bias?
    • How many iterations?: How specialised should the algorithm become to the training data?
  • Doing recommendations with Funk SVD
    • You have four things:
      1. Item factor matrix: each column represents a content item described by the latent factors that you calculated
      2. User factor matrix: each column represents a user described by the latent factors.
      3. Item bias: certain items are generally considered better (or worse) than others
      4. User bias: encompasses different rating scales for different users.
    • You can calculate a predicted rating for any item for any user, so you can now provide predicted ratings on everything
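Putting those four pieces together, a predicted rating is the global mean plus both biases plus the dot product of the factor vectors (a sketch with made-up numbers):

```python
def predict_rating(mu, user_bias, item_bias, user_factors, item_factors):
    """Predicted rating = global mean + user bias + item bias
    + dot product of the user and item latent factor vectors."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    return mu + user_bias + item_bias + dot

# A generous user (bias +0.5) rating a well-liked item (bias +0.3)
# whose factors align somewhat with the user's taste:
r = predict_rating(3.6, 0.5, 0.3, [0.8, -0.2], [0.9, 0.1])  # -> 5.1
```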
  • To find a list of things the user would rate highly…
    • Brute force: calculate a predicted rating for each user for each item, then you sort all the predictions and return the top N items.
    • Neighbourhood recommendation: use the factors that you calculated. You’re then calculating similarities in a lower-dimensional space, where things are closer, which makes it easier.
    • Factor vectors: use factor vectors that are close to the active user’s vector. They’re all in the same space, so why not.
  • I’m torn about whether you should order the items before adding the item biases or not.
  • This model goes out of date quickly. Ideally you’d start a new run as soon as it’s finished. Train daily or weekly at least.
  • If gradient descent is too slow, use alternating least squares (ALS) method, which isn’t as precise as gradient descent, but should work fairly well anyway.
  • You almost always want to use cross-validation to evaluate the performance of the algorithm.
  • Summary:
    • SVD is a way of doing matrix factorisation, but it won’t work if your matrix isn’t complete
    • You have to fill empty cells with baseline predictors
    • Gradient descent and stochastic gradient descent are super tools for solving optimisation problems

Chapter 12: Taking the best of all algorithms and implementing hybrid recommenders

  • Combine recommender algorithms to get a more powerful tool. Achieve better results and patch up corner cases, where algorithms don’t work well. They also allow you to use the data you have available to you
  • The three different types of overall hybrid recommenders
    • Monolithic: takes components of various recommenders and glues them together in new ways
    • Ensemble runs different recommenders and then combines the result into one
    • Mixed recommender runs a number of recommenders, returning all of them
  • A monolithic hybrid mixes components from different recommenders or even adds new steps to improve overall performance. It is composed of parts of other recommender systems.
    • E.g. a content-based approach that finds all the items that are similar content-wise, mixing that with a collaborative filtering approach to calculate predicted ratings the same way you’d use simple collaborative filtering
  • The whole point of hybrids is to take advantage of more types of data.
    • E.g. a collaborative filtering recommender with one extra pre-processing step that adds ratings to the rating-matrix, such that the collaborative filtering will connect things that are related content-wise
  • A mixed hybrid returns the union of all the results, where you’ve a hierarchy of recommenders
    • Consider the recommender as a scale of personalisation
    • Make the first as personalised as possible and then continue until you’re using item popularity to give recommendations
    • Often the most personalised recommender produces only one or two recommendations, the next produces several more, and in that way you’ll always have a good quantity of recommendations with as much quality as possible.
    • If you’ve several good recommenders, each returning a score, then you can return a list ordered accordingly. Remember that the scores should be normalised so all the results are on the same scale
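The cascade above might be sketched like this (my illustration; the callable-recommender interface is assumed), walking from most to least personalised until enough recommendations are collected:

```python
def mixed_recommend(recommenders, user, n=10):
    """Walk a hierarchy of recommenders, most personalised first,
    collecting (item, score) pairs until n items are gathered.
    Scores are assumed to already be normalised to the same scale."""
    seen, results = set(), []
    for recommend in recommenders:
        for item, score in recommend(user):
            if item not in seen:
                seen.add(item)
                results.append((item, score))
            if len(results) >= n:
                return results
    return results

# Most personalised first, item popularity as the fallback:
personalised = lambda user: [("a", 0.9)]
popular = lambda user: [("a", 0.8), ("b", 0.6), ("c", 0.5)]
recs = mixed_recommend([personalised, popular], "u1", n=2)
```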
  • An ensemble combines predictions from different recommenders into one recommendation
    • Start with matrix factorization
      • then add recommenders to create an ensemble recommender
    • If you’ve two recommenders running already, let’s say content-based and collaborative filtering, then why not run them simultaneously and combine the result to get an even better outcome
    • You can use a majority-voting approach
  • A switched ensemble recommender is about using the best tool for the job. If you have two or more recommenders, then a switched ensemble recommender will decide which of them to use given the context of the request
    • Country, time of day, how many ratings a user has made
  • Weighted ensemble recommender
    • Consider Collaborative filtering and content-based filtering
      • Content-based filtering finds similar content.
      • If you know a user likes a topic, you can use content-based filtering to find similar things.
      • But content-based filtering doesn’t distinguish between good and bad quality; it’s only concerned with topic or keyword overlap
      • Collaborative filtering doesn’t care about similarity, only that certain people thought an item was good quality and others didn’t
    • Use them together to combine their strength and you don’t have to give them equal weight.
      • train two different recommenders and ask them both to produce candidates for recommendations.
      • When two or more recommenders are combined like this, each component is called a feature recommender
    • You can find the best feature weights: guessing, regression, A/B testing / multi-armed bandits
    • It’s good having weights for each recommender, but it’s even better to have these weights change depending on the user or items
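A minimal weighted-blend sketch (the weights and the dict-of-scores interface are my assumptions); items missing from one recommender simply contribute zero from it:

```python
def weighted_blend(cb_scores, cf_scores, w_cb=0.4, w_cf=0.6):
    """Blend content-based and collaborative filtering candidates.
    The weights need not be equal -- here CF gets slightly more say."""
    items = set(cb_scores) | set(cf_scores)
    blended = {i: w_cb * cb_scores.get(i, 0.0) + w_cf * cf_scores.get(i, 0.0)
               for i in items}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

cb = {"x": 1.0, "y": 0.5}   # content-based candidates
cf = {"y": 1.0, "z": 0.8}   # collaborative filtering candidates
ranked = weighted_blend(cb, cf)  # "y" scores in both, so it wins
```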
  • Feature-weighted linear stacking (FWLS)
    • How to make the weights into functions instead.
    • Meta features: Weights as functions
    • Make your algorithm more flexible. E.g: use content-based recommendations if the user has only rated a few items, and collaborative filtering if the user has interacted with many items. Replace weights with functions (these functions are called meta functions or feature-weights functions)
      • Let the machine decide what to do. You can say that you want the content-based recommender to provide 90% of the recommendation if the user has rated fewer than three items; otherwise, it should be 50/50.
      • You can use this to blend the output of recommenders using weights that are functions that make those extremely flexible.
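The 90% / 50-50 example above can be sketched as a feature-weight function (the three-rating threshold comes from the text; the function shape is my illustration):

```python
def meta_weight_cb(num_ratings):
    """Feature-weight function: lean heavily on the content-based
    recommender for near-cold-start users, otherwise split evenly."""
    return 0.9 if num_ratings < 3 else 0.5

def fwls_score(cb_score, cf_score, num_ratings):
    """Blend two feature recommenders with a weight that is a
    function of the user, not a constant."""
    w = meta_weight_cb(num_ratings)
    return w * cb_score + (1 - w) * cf_score
```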
    • You want the hybrid recommender function to produce output that makes the difference between an expression and the actual ratings as small as possible.
    • Split the data and remove something like 20% of it to check how well the algorithm does.
      • Split data into training, validation, and test sets
      • Training data: trains the feature recommenders and the hybrid.
      • Validation data: allows you to understand if the hybrid is a good model and to fine-tune the hyperparameters
      • Test set: provides an unbiased validation of the model
    • The evaluation runner is used to evaluate an algorithm. It’s a pipeline where the data is first cleaned, then split into the k folds for cross-validation. Then for each fold it repeats training of the algorithm to evaluate it. When it’s all finished, you aggregate the result.
  • Summary
    • A recommender system can be greatly optimised by adding the output of several algorithms.
    • Hybrids enable you to combine the forces of different recommenders to get better results.
    • The feature-weighted linear stacking (FWLS) algorithm enables the system to use feature recommenders in a function-weighted way, which makes it strong.

Chapter 13. Ranking and learning to rank (LTR)

  • Instead of focusing on recommendations as a rating prediction problem, it sometimes makes more sense to look at how the items should be stacked
  • Defining relevancy like this takes away the need to predict ratings; you only have to know what’s preferred over everything else
  • Information Retrieval Systems: Learning to rank (LTR)
    • Learning to rank is a way to combine different kinds of data sources, such as popularity, distance, or recommender system outputs
    • You’re looking for input sources (features) that will give you an ordering of the objects
    • Foursquare’s algorithm to rank venues near you
      • Higher values should denote a shorter distance and a higher average venue rating.
      • Invert distances (short distance has higher value)
      • Set a max distance (10 minute walk)
      • Rescale data so that everything is between 0 and 1
      • For more information on rescaling, see Wikipedia
      • You want to teach the machine to rank these items based on the input of ratings and distance.
        • You want the system to learn weights (w0 and w1), which produce a value such that the items get ordered as you’d expect on the page
    • How do you optimise the function if you don’t know what it’s supposed to return?
      • You use the check-in feature
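A sketch of such a linear ranking function (the 800 m cap, the 0–10 rating scale, and the weight values are all illustrative assumptions, not Foursquare’s actual numbers):

```python
def rescale(value, lo, hi):
    """Min-max rescale into [0, 1]."""
    return (value - lo) / (hi - lo)

def venue_score(distance_m, avg_rating, w0=0.5, w1=0.5, max_dist=800.0):
    """Linear ranking score: invert distance so shorter walks score
    higher, cap at roughly a 10-minute walk, rescale both features
    to [0, 1]. The weights w0 and w1 are what the system would learn
    from check-ins."""
    d = min(distance_m, max_dist)
    closeness = 1.0 - rescale(d, 0.0, max_dist)  # short distance -> high value
    rating = rescale(avg_rating, 0.0, 10.0)      # assuming a 0-10 rating scale
    return w0 * closeness + w1 * rating
```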
  • A hybrid recommender always predicts ratings, while a LTR algorithm produces orderings
  • Re-ranking example: take a popularity ordering and re-rank the items using the recommender. Popularity narrows the list to the most popular.
  • Collaborative filtering algorithms are prone to recommend items liked by few people, but by people who really like the content. The algorithm has no concept of popularity and could be used for re-ranking instead of as the sole source for the ordering
  • Learning to Rank (LTR)
    • A ranking recommender has a catalog of items. The system retrieves items that are relevant to the user and then ranks them so the items at the top of the list are the most applicable
    • A ranking model is trained using an LTR algorithm, which is a supervised learning algorithm, meaning that you provide it with data containing input and output.
      • e.g., a user_id as input and a ranked list of items as output.
    • This family of algorithms has three subgroups:
      • Pointwise: produces a score for each item and then ranks them accordingly
      • Pairwise: a type of binary classifier. It’s a function that takes two items and returns an ordering of the two. Usually optimised so that you have the minimal number of inversions compared to the optimal ranking of the items
      • Listwise: is the king of all LTR subgroups because it looks at the whole ranked list and optimises that. The advantage of listwise ranking is that it intuits that ordering is more important at the top of a ranked list than at the bottom. Pointwise and pairwise algorithms don’t distinguish where on the ranked list you are
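The “number of inversions” a pairwise ranker minimises can be counted like this (my sketch): a pair is inverted when the produced ranking orders it differently from the optimal ranking.

```python
def inversions(ranked, optimal):
    """Count item pairs ordered differently from the optimal ranking --
    the quantity pairwise learning-to-rank tries to minimise."""
    pos = {item: i for i, item in enumerate(optimal)}
    count = 0
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if pos[ranked[i]] > pos[ranked[j]]:
                count += 1
    return count

# Swapping the top two items produces exactly one inverted pair:
one_swap = inversions(["b", "a", "c"], ["a", "b", "c"])
```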
  • Bayesian Personalised Ranking (BPR)
    • For each user you want to order the content items in such a way that the most relevant is on top
    • Define an ordering that says no matter which two items you hold up, the ordering will rank one better than the other. To make that work, you need three rules:
      • Totality
      • Anti-symmetry
      • Transitivity
    • With implicit ratings you never get negative feedback because you only have events that say “bought.” This makes it hard for the machine learning to know when it’s done something wrong
  • Bayesian statistics are based on this simple equation:
    1. p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}
    2. The probability (p) of event A happening given that B happened is equal to the probability that A happens multiplied by the probability that B happens when A occurs, divided by the probability that B happened.
    3. A is the event that it has rained, and B is the event that the street outside is wet. Then Bayes says that the probability that it has rained given that the street is wet (p(A|B)) is equal to the probability that it has rained (p(A)) multiplied by the probability that the street is wet given it has rained (p(B|A)) divided by the probability that the street is wet (p(B)).
    4. In a BPR, you want to find the model such that there’s the highest probability that the model will produce a perfect ordering for all users
    5. Assumes there’s a way to order all your content items flawlessly for each of your users, which is the total ordering that I keep mentioning.
    6. Mix with a content-based recommender or use something else like the multi-armed bandit scheme to introduce more items into the system.
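A worked instance of the rain example with made-up numbers (all probabilities here are illustrative, not from the book):

```python
# Suppose it rains on 10% of days (p(A)), the street is wet on 12% of
# days (p(B)), and the street is wet on 95% of rainy days (p(B|A)).
p_rain = 0.10            # p(A)
p_wet = 0.12             # p(B)
p_wet_given_rain = 0.95  # p(B|A)

# Bayes: probability it rained, given that the street is wet.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet  # ~0.792
```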
  • Summary Learning to Rank
    • (LTR) algorithms solve a different problem than that of the classic recommender problem of predicting ratings.
    • Foursquare uses a ranking algorithm to combine the ratings and the locations of venues, which is a great example of how to combine relevant data into one recommendation.
    • It’s not easy to come up with a continuous function that can be optimised

Chapter 14: Future of recommender systems

  • In Pedro Domingos’s famous paper “A Few Useful Things to Know about Machine Learning,” he points out that more data beats a more complex algorithm. That’s also true for recommender systems, but collected data isn’t always a source of truth
  • Recommenders will also be used to make decisions in many other scenarios.
  • Next Best Action recommenders are becoming bigger and bigger in marketing and banking.
  • The future of recommenders lies in dynamic recommender algorithms, with a core of offline calculations