Designing Machine Learning Systems

Author: Huyen Chip
Year: 2022

Review

The book focuses on the important considerations of deploying ML Systems in industry. It is more practical than academic, which is rare for a machine learning book and great news for product managers. There’s just enough depth and detail to help you understand the key tradeoffs.

The machine learning space is moving fast, so any reference to best practice or tooling is going to age quickly. This book focuses on the more timeless problems and areas of interest.

100,000 words later you’ll have a good understanding of designing Machine Learning systems and some foundational vocabulary that will help you navigate this space.


Key Takeaways

The 20% that gave me 80% of the value.

  • Building and deploying machine learning systems is complicated. More stages, stakeholders and components than traditional systems. The actual ML algorithm is only a small part of a production ML system
  • Machine Learning is an approach to learning complex patterns from existing data - and using those patterns to make predictions on unseen data.
  • They need something to learn → there must be clear patterns in the data.
  • What is complex for machines is different from what is complex to humans.
  • Use ML solutions when: patterns and tasks are repetitive, the cost of wrong predictions is small, you're making predictions at scale (required to justify the investment), and the patterns are changing.
    • Note - radical changes in data distributions will still require human intervention
  • Don’t use ML solutions when simpler solutions can do the trick. Always start with a non-ML solution
  • Typical use cases: Search engines, recommender systems, suggestions, translation, assistants, health monitoring, fraud detection, price optimisation, demand forecasting, churn prediction, support ticket classification, sentiment analysis.
  • ML in academia and ML in industry look really different
  • Research vs Production:

    | Research | Production |
    | --- | --- |
    | State-of-the-art model performance | Good enough to be useful |
    | Fast training (training throughput) | Fast inference (latency of generating a prediction) |
    | Static data | Changing data |
    | Clean data | Messy data |
    | Ethics less of a consideration | Ethics can't be ignored |
    | Interpretability not important | Interpretability can be important |
    | Clear goals | Conflicting requirements from stakeholders |
    | Ensembling common | Simpler, less complex systems preferred |
  • During model development training is the bottleneck - once a model is in production, inference is the bottleneck. Research prioritises high throughput - Production prioritises low latency. Latency matters a lot in real world applications.
  • To predict the future, ML algorithms encode the past - perpetuating bias. They can discriminate at scale.
  • Traditionally data and code are separated (Separation of Concerns) - but not in ML systems.
  • Models degrade over time - they're often at their best the moment they go live

Machine Learning Basics

  • Business objectives need to be translated into ML objectives. You need to frame your problem so that ML can solve it - and tie the performance of your ML system back to the overall business.
    • Companies care about outcomes - not ML metrics (F1 score, inference latency)
  • Also for consideration: Reliability, scalability, adaptability and maintainability
    • Reliability: The system can continue to perform at the desired level of performance even in the face of adversity. ML systems can and often fail silently.
    • Scalability: Grow in complexity, traffic, in model count, features. There should be reasonable ways to grow in whatever dimension is most needed. Cloud services are great at auto-scaling. Artefact management is a big part of scaling ML models - as is monitoring and retraining (which needs to be automated at scale).
    • Adaptability: Coping with shifting data distributions. Discovering aspects for performance improvement and allowing updates without service interruption. Data can change quickly so ML systems need to be able to evolve.
    • Maintainability: Set up your process and infrastructure so different contributors are comfortable with the tooling. Code, data and artefacts should be documented and versioned. Models need to be reproducible by other contributors.
  • Developing an ML system is an iterative process - once a system is in production, it'll need to be monitored and updated.
  • Most ML tasks are Classification or Regression
    • Classification is putting things into categories (e.g Spam / Not Spam)
    • Regression models output a continuous value (house price prediction)
  • Classification problems are simpler with few classes. Binary is the simplest form.
  • Multi-label classification is hard. Labels tend to be less consistent (human labellers disagree - this is a strong warning sign).
  • Changing the way you frame the problem - could make it much easier for ML to solve
  • The objective function (or loss function) guides the learning process and tries to minimise wrong predictions. In supervised ML - the loss can be computed vs ground truth labels using RMSE (root mean squared error) or cross entropy
  • When there are multiple objectives - decouple them first because it makes model development and maintenance easier.
  • The success of an ML system depends largely on the data it was trained on

Data Engineering Fundamentals

  • Row-Major VS Column-Major Format: Accessing data by rows is faster than by columns in modern computers. Row-major formats (e.g. CSV) are better for accessing and writing examples. Column-major formats (e.g. Parquet) are better for accessing all features together (column-based reads). Note: pandas' default is column-major, NumPy's default is row-major (see the sketch below).
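A quick sketch of that difference, assuming only NumPy and pandas, showing which access pattern each layout makes cheap:

```python
import numpy as np
import pandas as pd

# NumPy defaults to row-major (C order): each row sits in contiguous memory.
arr = np.zeros((10_000, 100))
row = arr[0, :]    # contiguous read - cheap
col = arr[:, 0]    # strided read - touches every row

# pandas stores each column as its own block, so column access is the cheap path.
df = pd.DataFrame(arr, columns=[f"f{i}" for i in range(100)])
feature = df["f0"]      # one column, one contiguous array
example = df.iloc[0]    # one row, gathered from 100 separate column arrays
```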
  • Data Models: Describe how data is represented. Your model will influence how your systems are built and what problems you can solve.
    • Relational Models: Breaking data into relational tables - rows and columns that can be shuffled. Often normalised - structuring tables into normal forms to reduce data redundancy and improve data integrity. The downside of normalisation is that data is spread across multiple relations - you have to join them back together. A query language is the language you use to specify the data you want. SQL is declarative - you specify the outputs you want and the computer figures out the steps needed to return them. Python is imperative - you specify the steps needed for an action and the computer executes them to return an output (see the sketch after this list). In theory, SQL can be used for any computing problem (it's Turing complete).
    • NoSQL Models:
      • Relational models have to follow a strict schema.
      • Two major types: Document model and the Graph model
        • Documents for when data is self-contained and relationships are rare
        • Graphs for when relationships are common and important
      • Document Model: Can be a single blob of JSON. Documents are more flexible - each one can have a different schema. Document databases shift responsibility of assuming structures from the writing application to the reading application. Each document has locality (holding all relevant information) making retrieval easy. Filtering results for documents with certain attributes is slow - you have to read them all, extract the attribute, then filter. Relational models are faster for this
      • Graph Model: Graph consists of nodes and edges - edges are the relationships between the nodes. The relationships between the data are a priority. It’s faster to retrieve data based on relationships.
    • Queries that are easy to do in one data model are hard in another - picking the right data model can make your life so much easier
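To make the declarative-vs-imperative distinction above concrete, here's a minimal sketch using Python's built-in sqlite3 module (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("London", 10.0), ("Paris", 5.0), ("London", 7.5)])

# Declarative (SQL): say WHAT you want; the engine decides HOW to compute it.
total_sql = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE city = 'London'"
).fetchone()[0]

# Imperative (Python): spell out each step yourself.
total_py = 0.0
for city, amount in conn.execute("SELECT city, amount FROM orders"):
    if city == "London":
        total_py += amount

assert total_sql == total_py == 17.5
```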
  • Structured vs Unstructured Data
    • Structured follows a predefined data model (a.k.a. schema). Pre-defined structure makes data easier to analyse. Disadvantage - you have to commit to a schema in advance, changing it retrospectively is harder.
    • Unstructured data becomes appealing if data models are changing quickly - or you’re reliant on data sources outside of your control.
    | Structured | Unstructured |
    | --- | --- |
    | Schema clearly defined | Data doesn't have to follow a schema |
    | Easy to search and analyse | Fast arrival |
    | Can only handle data with a specific schema | Can handle data from any source |
    | Schema changes cause trouble | Schema changes are easy; downstream applications have the issues |
    | Stored in data warehouses | Stored in data lakes |
Two types of processing - Transactional vs Analytical
  • Transactional processing: actions are inserted as they’re generated, occasionally updated or deleted if something changes. Fast processing (low latency) and high availability are important. ACID (atomicity, consistency, isolation, durability)
    • Atomicity - if any step in the transaction fails - they all fail
    • Consistency - must follow pre-defined rules (e.g. validation)
    • Isolation - concurrent transactions happen as if they were isolated; two users can't change the same data at the same time
    • Durability - once a transaction is committed - it remains so (even if the device dies)
    • Systems that relax the ACID guarantees are sometimes described as BASE (Basically Available, Soft state, Eventual consistency)
    • Transactional databases are often row-major - great for writing new records - but bad for asking ‘What’s the average price of items sold in our London stores?’
  • Analytical processing: Efficient with queries that allow you to look at data from different view points.
  • ETL: Extract, Transform, Load: Extracted from different sources → transformed to desired format → loaded into target
Three main models of data flow (database, request-driven, event-driven):
  • through databases
    • is easiest, but requires that both processes have access to the same database. Read/write can be slow - making it unsuitable for low-latency applications
  • through services using requests provided by REST and RPC APIs (POST/GET requests)
    • Often called 'request-driven' and coupled with a service-oriented architecture (or microservices architecture). A service is a process that can be accessed remotely. Structuring an application as separate services allows for independent development, testing and deployment. Also great for when two companies collaborate.
    • Popular requests:
      • REST (representational state transfer)
        • or RESTful - maps to CRUD: Create (POST), Read (GET), Update (PUT), Delete (DELETE)
        • you can’t get a branch by state (e.g. apply a filter)
        • best for just a simple application
      • RPC (remote procedure call)
        • flexible - great for business rules - designed for actions
        • more scalable in the long run (for different use cases)
  • through real-time transport like Apache Kafka and Amazon Kinesis
    • Request-driven data passing between services is synchronous - and can get slow and complicated if there are too many services. One service can cause all the others to fail too
    • Real-time transport acts as a broker between services - which can either broadcast or listen.
    • We call the pieces of data being transported events. The architecture is called event-driven.
    • Publish-subscribe message brokers (Apache Kafka, Amazon Kinesis) or message queues (Apache RocketMQ, RabbitMQ).
    • In a message queue model - events often have intended consumers.
  • Request driven architecture works well for systems that rely more on logic than on data. Event-driven architecture works better for systems that are data heavy.
  • Batch Processing VS Stream Processing
    • Batch processing = jobs that are kicked off periodically. Stream processing = computation on data as it arrives via real-time transport (in real time or every few minutes).
    • Stream processing can give low latency. It’s not always less efficient either. It can be scalable as computations can be done in parallel. You can also save compute by doing things as they happen, vs re-running large batch jobs.
    • In ML, batch processing is used to compute features that change less often (like a driver's rating)
      • Batch features - are also known as static features
    • Stream processing is used to compute features that change quickly
      • Stream features - are also called dynamic features.

Training Data

  • Sampling
    • Sampling happens throughout the ML project lifecycle
    • Two families of sampling: Non-probability sampling and random sampling:
    • Don’t use Non Probability Sampling for ML models
    • Random Sampling Methods:
    | Method | Description |
    | --- | --- |
    | Simple random | All samples in the population have an equal probability of selection. Easy, but rare categories of data might not make it into your selection |
    | Stratified | Divide the population into groups (strata) and sample from each separately. Ensures some examples of rare classes. Not always possible; hard for multi-label tasks |
    | Weighted | Each sample is given a weight that determines its probability of selection. Lets you leverage domain expertise - e.g. you might want more recent data to have a higher chance of being selected |
    | Reservoir | Useful for streaming data. Keep a reservoir of data points, randomly select data points from the stream, and randomly replace items in the reservoir. All samples seen so far have an equal chance of being selected, and you can stop at any time with a ready sample (see the sketch below) |
    | Importance | Sample from a distribution when we only have access to another distribution. One data source could be expensive, slow or infeasible to sample from, so you sample from a more available source and weight those samples accordingly |
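A minimal sketch of reservoir sampling (Algorithm R), assuming a stream of unknown length and a fixed reservoir size k:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```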
  • Labelling
    • Most ML models are supervised - they need labelled data to learn.
    • Performance of an ML model - depends on quality and quantity of data
    • Two types of labels → Hand labels, or Natural labels.
    • Hand Labels. Acquiring hand labels is often difficult, slow and expensive. Requires somebody seeing your data - so there are privacy implications. Slow labelling leads to slow iteration speeds and makes your model less adaptive to changing environments and requirements. The longer the process takes, the more your existing model will degrade.
      • Label Multiplicity - To get enough labelled data you often have to use multiple annotators or even sources. They will have different levels of expertise and accuracy. This leads to label ambiguity or multiplicity - what to do when there are multiple conflicting labels for a data instance?
      • Disagreements among annotators are extremely common. If humans can’t agree on a label - what does human-level performance even mean?
      • Incorporating clear problem definitions and guidance into annotator training can minimise disagreement
    • Natural labels. Tasks with natural labels can be evaluated by the system - they have a natural ground truth. E.g. Google Maps knows how long the trip actually took and can evaluate how good its prediction was. Recommender systems have natural labels (CLICK or NO CLICK). Labels inferred from user actions (clicks and ratings) are known as behavioural labels.
      • If you don’t have natural labels - consider adding an optional feedback loop
      • Companies find it easier and cheaper to start on tasks with natural labels
      • Implicit labels are presumed. E.g. a recommendation that doesn’t get clicked on is presumed to be bad
      • Explicit labels are when users explicitly demonstrate their feedback
    • Data Lineage - keeping track of each data sample's origin and labels. Essential if taking data from multiple sources. Helps you flag bias and debug models.
    • Feedback Loop Length. Time from prediction to feedback. Recommender systems have short feedback loops. User feedback differs by volume, signal strength and feedback loop length.

  • Four ways to cope with lack of labels
    • Weak supervision → leverages noisy heuristics to generate labels
    • Semi-supervision → leverages structural assumptions to generate labels
    • Transfer learning → leverages models pre-trained on another task for your new task
    • Active learning → labels data samples that are most useful to your model
  • Class Imbalance
    • When there is a substantial difference in the number of samples in each class of the training data. E.g. 0.01% of X-rays might contain cancerous cells
    • Challenges of class imbalance:
      • Deep learning (and ML in general) works best when the data distribution is more balanced, because imbalance brings:
        • insufficient signal to detect minority cases
        • models that get stuck in non-optimal solutions, exploiting simple heuristics instead of learning anything useful
        • asymmetric costs of error - the cost of a wrong prediction on a sample of the rare class can be much higher (missing a rare cancer diagnosis). If your loss function isn’t configured to address this asymmetry, your model will treat all samples the same way
    • Rare events are often more interesting (like in fraud detection, or churn prediction)
    • Three ways to handle class imbalance:
    • Use the right evaluation metrics
      • Model Accuracy and Error Rate (used frequently) are insufficient metrics for tasks with class imbalance because they treat all classes equally. Performance on the majority class dominates the metrics. This is especially bad when the majority class isn’t what you care about.
      • F1, precision, and recall are metrics that measure your model’s performance with respect to the positive class in binary classification problems, as they rely on true positive—an outcome where the model correctly predicts the positive class.
      |  | Positive Prediction | Negative Prediction |
      | --- | --- | --- |
      | Positive Label | True Positive | False Negative |
      | Negative Label | False Positive | True Negative |
      • Precision = True Positive / (True Positive + False Positive)
        • Precision = accuracy of positive predictions
      • Recall = True Positive / (True Positive + False Negative)
        • Recall = proportion of actual positives (in the data) that were correctly predicted
      • F1 = 2 x Precision x Recall / (Precision + Recall)
      • They are all asymmetric metrics - their values change depending on what you call the positive class
      • Classification problems can be modelled as regression problems (instead of returning a class you return the probability of a class). You can then classify based on that probability by setting a threshold. Moving it up and down allows you to increase the true positive rate (also known as recall) while decreasing the false positive rate (also known as the probability of false alarm), and vice versa
        • Plotting true positive rate against false positive rate is the ROC curve. The area under the curve is a measure of how close to perfect the model is.
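A small, dependency-free sketch of these metrics on an invented, imbalanced toy example - note how accuracy looks healthy while precision and recall on the rare class tell the real story:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 8 negatives, 2 positives - an imbalanced toy set.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.8 - looks fine
print(precision_recall_f1(y_true, y_pred))   # (0.5, 0.5, 0.5) - the real story
```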
    • Data-Level Methods: Resampling
      • Modifying the distribution of the training data to reduce the level of imbalance to make it easier for the model to learn.
        • You can OverSample the minority class or UnderSample the majority class
          • Undersampling →
            • Random removals
            • Tomek link UnderSampling - finds pairs of samples from opposite classes that are similar and removes the one from the majority class. Helps models learn the boundary but might make them less robust.
          • Oversampling →
            • Random duplication
            • SMOTE (synthetic minority oversampling technique) - creates novel minority-class samples by interpolating between existing minority-class examples.
      • These techniques only work well in data with low-dimensionality
      • never evaluate your model on resampled data - it will cause your model to overfit to that resampled distribution
      • Undersampling risks losing important data from removing data
      • Oversampling risks of overfitting on training data
    • Algorithm-Level Methods
      • Algorithm-level methods keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
      • Many algorithm-level methods involve adjustment to the loss function (that guides the learning process). Gets the model to prioritise making correct predictions on the more important class. By giving the training instances we care about higher weight, we can make the model focus more on learning these instances
      • Cost-sensitive learning → the loss function is modified to take into account the varying cost of misclassification (but you have to manually define the cost matrix)
      • Class-balanced loss → make the weight of each class inversely proportional to the number of samples in that class, so rarer classes get higher weight and the model is punished more for getting them wrong
      • Focal loss → incentivise the model to focus on learning the samples it still has difficulty classifying. If a sample has a lower probability of being right, it’ll have a higher weight
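A minimal sketch of the inverse-frequency weighting idea behind these methods, assuming PyTorch; published class-balanced losses use more refined weighting schemes, and the class counts here are invented:

```python
import torch
import torch.nn as nn

# Class counts from an imbalanced training set: 990 negatives, 10 positives.
class_counts = torch.tensor([990.0, 10.0])

# Inverse-frequency weighting, normalised so the weights sum to the number of classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)
print(weights)  # tensor([0.5051, 50.0000]) -> the rare class is weighted ~100x more

# Plug the weights into a standard loss; mistakes on the rare class now cost more.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # fake model outputs for 8 samples
labels = torch.randint(0, 2, (8,))    # fake labels
loss = loss_fn(logits, labels)
```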

Data Augmentation - a family of techniques that are used to increase the amount of training data

  • Simple Label-Preserving Transformations → e.g. cropping, flipping, rotating, inverting, or erasing part of the image. In NLP, you can replace a word with a similar word. A quick way to double or triple your training data.
  • Perturbation → adding a small amount of noise to make models more robust

Data Synthesis - creating training data to boost a model’s performance. In NLP → templates can be a cheap way to bootstrap your model.

  • Example Template: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]”
  • You can then list all possible cuisines, all reasonable numbers of miles, and locations (home, office, landmarks, exact addresses) for each city, and generate thousands of training queries from the template
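A tiny sketch of that template trick (the cuisine, distance and location lists are placeholders):

```python
from itertools import product

template = "Find me a {cuisine} restaurant within {miles} miles of {location}"

cuisines = ["Mexican", "Thai", "Italian"]
distances = [2, 5, 10]
locations = ["home", "the office", "Union Square"]

queries = [
    template.format(cuisine=c, miles=m, location=loc)
    for c, m, loc in product(cuisines, distances, locations)
]
print(len(queries))  # 3 * 3 * 3 = 27 synthetic training queries
print(queries[0])    # Find me a Mexican restaurant within 2 miles of home
```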

Feature Engineering

  • The most important thing in developing ML models is having the right features. Coming up with new useful features is a big part of the job. Choosing what information to use and how to extract this information into a format usable by your ML models is feature engineering.
  • Handling Missing Values
    • Not all types of missing values are equal. There are three types of missing values:
      • Missing not at random (MNAR) → Values are missing for reasons related to the values themselves.
      • Missing at random (MAR) → a value is missing is not due to the value itself, but due to another observed variable
      • Missing completely at random (MCAR) → there’s no pattern in when the value is missing.
    • When encountering missing values, you can either fill in the missing values with certain values (imputation) or remove the missing values (deletion)
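A minimal pandas sketch of the two options, with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [28, 35, np.nan, 52],
    "income": [40_000, np.nan, 55_000, 90_000],
})

# Deletion: drop any row with a missing value - simple, but throws data away.
dropped = df.dropna()

# Imputation: fill missing values, e.g. with each column's median.
imputed = df.fillna(df.median())
```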
  • Scaling
    • Models tend to give more extreme numbers more importance. That’s a problem if you have age (less than 120) and annual income in your model (values beyond 100k).
    • It's therefore important to scale features into similar ranges before putting them into a model - this is called feature scaling. It's one of the simplest things you can do, and it often results in a performance boost.
    • Scaling is a common source of data leakage - the scaling statistics should come from the training split only (see the sketch below).
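A short sketch of leakage-free scaling, assuming scikit-learn's usual fit/transform pattern - the scaler's statistics come from the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 3) * [120, 100_000, 5]   # e.g. age, income, rating

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: statistics computed on ALL data, so test-set information reaches training.
# scaler = StandardScaler().fit(X)

# Safe: fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```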
  • Discretisation
    • is the process of turning a continuous feature into a discrete feature - by creating buckets for given values. Rarely helps. The process is also known as binning or quantisation.
  • Feature Crossing.
    • Combines two or more features to generate new features. Useful for modelling nonlinear relationships between features. Example: combine marital status and number of children into "marriage and children". Can help models learn nonlinear relationships faster. Caution - can cause overfitting, and your models might need more training data
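A minimal pandas sketch of that marital-status × children cross:

```python
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["married", "single", "married"],
    "num_children":   [2, 0, 0],
})

# A simple feature cross: concatenate the two values into one categorical feature.
df["marriage_and_children"] = (
    df["marital_status"] + "_" + df["num_children"].astype(str)
)
# -> "married_2", "single_0", "married_0"
```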
  • Data Leakage - Essentially when labels are leaked into the features during training. Leakage is often non-obvious and it can cause models to fail in unexpected and spectacular ways (even if evaluated carefully). It’s common and rarely covered in ML curricula.
    • Example: patients scanned while lying down were more likely to be seriously ill → the model learned to predict serious covid risk from a person’s position
    • Common Causes of Data Leakage:
      • Splitting time-correlated data randomly instead of by time
      • Scaling before splitting
      • Poor handling of data duplication before splitting
      • Group Leakage - you need to understand how your data was generated to avoid this type of data leakage
      • Leakage from data generation process
    • Detecting Data Leakage
      • Measure the predictive power of each feature with respect to the label → then investigate high correlation.
      • Do ablation studies to measure how important a feature is to your model - remove the feature and assess the drop-off in performance.
      • Watch out for new features improving model performance by large amounts
    • Don’t use your test split for anything other than reporting a model’s final performance
  • Engineering Good Features
    • Generally more features are better, so in production the number of features grows over time. But there are downsides: more chances for data leakage, more risk of overfitting, more memory and compute cost, worse inference latency for online prediction, and growing technical debt
      • If a feature doesn't help a model make good predictions, regularisation techniques like L1 regularisation should reduce that feature's weight to zero
    • There are many built-in and open source packages for computing the importance of your features
  • It’s measured by → how much that model’s performance deteriorates if that feature or a set of features containing that feature is removed from the model
  • Often, a small number of features accounts for a large portion of your model's feature importance. Feature importance techniques are also great for interpretability as they help you understand how your model works.
  • Coverage is the % of samples that have values for this feature in the data. The fewer values missing, the higher the coverage. Generally, if a feature appears in a very small percentage of your data, it's not going to be very generalisable.
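Coverage is cheap to compute - a short pandas sketch with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"driver_rating": [4.8, np.nan, 4.5, np.nan, 4.9]})

# Coverage: the share of samples that actually have a value for this feature.
coverage = df["driver_rating"].notna().mean()
print(f"{coverage:.0%}")  # 60%
```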

Model Development

  • In supervised ML, the inputs and outputs are given (together they're the data), and the function is derived from that data. ML isn't powerful enough to derive arbitrary functions from data yet → you need to specify the form you think the function should take.
  • The objective function (or loss function) is highly dependent on the model type and whether the labels are available. If the labels aren’t available (e.g. unsupervised learning) the objective functions depend on the data points themselves
  • For k-means clustering, the objective function is the variance within data points in the same cluster. Unsupervised learning is much less commonly used in production
  • Root Mean Squared Error and Mean Absolute Error are two common objective functions for scalar outputs (scalar output = single variable output (not a distribution))
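A small NumPy sketch of both metrics on invented house prices; note how RMSE penalises the single large miss more heavily than MAE does:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = [300_000, 450_000, 500_000]   # e.g. house prices
y_pred = [310_000, 430_000, 540_000]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # ~26458 vs ~23333
```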
  • Learning procedures - the procedures that help your model find the set of parameters that minimise a given objective function for a given set of data - are diverse
  • Considerations when evaluating a model:
    • Performance: accuracy, F1 score, and log loss
    • How much data are needed
    • How much compute needed
    • Time required to train
    • Inference latency
    • Interpretability (Non-neural network algorithms tend to be more explainable)
  • Six tips for model selection
    1. Avoid the state-of-the-art
    2. Start with the simplest models
    3. Avoid human biases in selecting models
    4. Evaluate good performance now versus good performance later
    5. Evaluate trade-offs (false positive vs false negative, compute vs accuracy, interpretability vs performance)
    6. Understand your model’s assumptions
  • Model Assumptions
    | Assumption | Description |
    | --- | --- |
    | Prediction | Prediction models assume that it's possible to predict the output from the input. |
    | IID | Neural nets assume examples are independent and identically distributed (independently drawn from the same joint distribution). |
    | Smoothness | Supervised ML assumes that if an input X produces an output Y, then an input close to X would produce an output proportionally close to Y. |
    | Tractability | Let X be the input and Z the latent representation of X. Every generative model assumes it's tractable to compute the probability P(Z\|X). |
    | Boundaries | A linear classifier assumes that decision boundaries are linear. |
    | Conditional independence | A naive Bayes classifier assumes that the attribute values are independent of each other given the class. |
    | Normally distributed | Many statistical methods assume that data is normally distributed. |
  • Ensembles: a system that uses multiple models; each model in the ensemble is called a base learner. Can give better accuracy, but ensembles are less favoured in production because they're more complex to deploy and harder to maintain. Common when a small performance boost can lead to a huge gain.
    • You can have 3 models predicting the same class (SPAM, NOT SPAM) and take the majority vote. Makes much more sense if the models aren't correlated (the sketch after this list works through the maths).
  • 3 ways to ensemble: Boosting, Bagging and Stacking:
    • Bagging - short for bootstrap aggregating. You create different datasets (bootstraps) by sampling with replacement, train a model on each bootstrap, and take the majority vote (or average) of their predictions.
    • Boosting - Boosting uses a chain of classifiers, but sample weights are changed based on how well the previous classifier predicted them. A final classifier is made using a combination of the existing ones
    • Stacking - train base learners from the training data then create a meta-learner that combines the outputs of the base learners to output final predictions. The meta-learner could be majority vote or averaging
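A quick worked example of why uncorrelated base learners help: three independent classifiers that are each right 70% of the time give a majority vote that's right about 78% of the time (this assumes fully independent errors, which real models rarely achieve):

```python
from math import comb

p = 0.7  # accuracy of each of three base learners, assumed independent

# The majority vote is correct when at least two of the three models are correct.
p_majority = sum(comb(3, k) * p**k * (1 - p) ** (3 - k) for k in (2, 3))
print(p_majority)  # 0.784 - better than any individual model
```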
  • Aggressive experiment tracking and versioning helps with reproducibility, but it doesn’t ensure reproducibility
  • Start simple and gradually add more components
  • Overfit a single batch - to make sure it gets to the smallest possible loss. If it’s for image recognition, overfit on 10 images and see if you can get the accuracy to be 100%. If it can’t overfit a small amount of data, there might be something wrong with your implementation.
  • Set a random seed - Setting a random seed ensures consistency between different runs. It also allows you to reproduce errors and other people to reproduce your results.
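A minimal seeding sketch; the framework-specific line is left as a comment because it depends on what you use:

```python
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Pin the common sources of randomness so runs - and bugs - can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    # If you use a deep learning framework, seed it too, e.g. torch.manual_seed(seed)

set_seed(42)
print(np.random.rand(3))  # identical output on every run with the same seed
```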
  • Four Phases of ML Model Development
    1. Before Machine Learning → Start with non-ML solutions
    2. Simplest ML Models → validate the usefulness, easy to implement and deploy
    3. Optimise Simple Models → different objective functions, hyperparameter search, feature engineering, data and ensembles
    4. Complex Models → Experiment, look for more significant improvements, think about model degradation over time
  • Baselines
    | Baseline | Description |
    | --- | --- |
    | Random baseline | Expected performance of a model that predicts at random |
    | Simple heuristic | Predictions based on simple heuristics |
    | Zero rule baseline | Predicting the most common class (e.g. predicting that a user will next open the app they most commonly open) |
    | Human baseline | How your model compares to human experts |
    | Existing solutions | ML systems are often designed to replace existing solutions such as hand-coded business logic |

Model Deployment and Prediction

  • Two types of inference (generating predictions): online prediction and batch prediction
  • Machine Learning Deployment Myths
    1. you only deploy 1 or 2 ML models at a time
    2. If we don’t do anything - model performance remains the same
    3. You won’t need to update your models much
    4. Most ML engineers don’t need to worry about scale
|  | Batch Prediction (asynchronous) | Online Prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodical (e.g. every 4 hours) | As soon as requests come in |
| Useful for | Processing accumulated data when you don't need immediate results | When predictions are needed as soon as a data sample is generated |
| Optimised for | High throughput | Low latency |
  • Online prediction isn’t necessarily less efficient - Batch processing can be wasteful, you might be computing predictions that you don’t need (e.g. for users who won’t use your product before the next run)
    • Batch prediction is computing predictions in advance and storing them in a database to be fetched when requests arrive. Can bypass the latency issues of complex models. But it makes you less responsive to users' changing preferences, and you need to know what requests to generate predictions for in advance
  • Building infrastructure to unify stream processing and batch processing is becoming popular. Companies can use feature stores to ensure consistency between the batch features used during training and the streaming features used in prediction.
  • Three main approaches to reduce a model's inference latency:
    • make it do inference faster (inference optimisation)
    • make the model smaller (model compression)
    • make the hardware it's deployed on run faster
  • As cloud bills climb more companies are looking for ways to push their computations to edge devices.

Data Distribution Shifts and Monitoring

  • A model’s performance degrades over time in production. Once deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues
  • Google studied 96 cases where a large ML pipeline at Google broke - 60 of the 96 failures happened due to causes not directly related to ML
  • ML-Specific Failures:
    • Production data differing from training data
    • Edge Cases
    • Degenerate feedback loops
| Shift type | What changes |
| --- | --- |
| Covariate shift | The distribution of the inputs (independent variables) changes, but the conditional distribution of outputs given inputs is unchanged |
| Label shift / prior shift | The output distribution changes, but for a given output the input distribution stays the same |
| Concept shift | Same input, different output. In many cases, concept drifts are cyclic or seasonal |
  • Data distribution shifts are only a problem if they cause your model’s performance to degrade.
  • Statistical Methods:
    • Min, Max, Mean, Median, Variance, 5th, 25th, 75th, 95th, skewness, kurtosis
    • Two-sample test - to determine whether the difference between two populations is statistically significant
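A minimal two-sample test sketch using SciPy's Kolmogorov-Smirnov test on synthetic data (the significance threshold and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # source distribution
live_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)    # slightly shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Possible distribution shift (KS={stat:.3f}, p={p_value:.1e})")
```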
  • Time scale windows for detecting shifts
    • Shifts can happen across two dimensions. Temporal or Spatial.
    • To detect temporal shifts - you can treat input data as time-series data. Detecting temporal shifts is hard when shifts are confounded by seasonal variation
  • Many companies assume that data shifts are inevitable, so they periodically retrain their models—once a month, once a week, or once a day—regardless of the extent of the shift.
  • Retrain your model using the labeled data from the target distribution.
    1. Stateless retraining - train from scratch
    2. Stateful training (fine-tuning) - continue training the existing model on new data
  • You can design your system to make it more robust to shifts.
    • A system uses multiple features, and different features shift at different rates.
    • When choosing features consider the trade-off between the performance and the stability
    • You might also want to design your system to make it easier for it to adapt to shifts. E.g. a separate model for each market, you can update each of them only when necessary
  • Monitoring
    • Monitor accuracy-related metrics, predictions, features and raw inputs
    • Logs: recording events produced at runtime
      • The number of logs grows quickly. Pinpointing where a problem is can be harder than detecting that something happened.
      • Tracing helps us find things later and follow threads → each process has a unique ID that allows us to search logs for it
      • For each event, record all the metadata needed too
      • Companies use ML to analyse logs: to detect anomalies and prioritise them
    • Dashboards: visualising relationships
      • Helps spot patterns
      • Makes monitoring accessible to nonengineers
      • Excessive metrics on a dashboard can also be counter-productive (dashboard rot)
    • Alerts: Alerting the right people to suspicious signals
      • Alert Policy: threshold breach for each metric (sometimes over a duration)
      • Notification channel: slack, pager duty, email
    • A description of the alert
      • Helps the alerted person know what’s going on
      • Make the alert actionable by providing instructions or a runbook
    • Alert fatigue is real, demoralising and dangerous.
  • Observability: setting up the system to get visibility into our system to help us investigate when something goes wrong

Continual Learning and Test in Production

  • Four Stages of Continual Learning in Organisations
    • Manual - Stateless retraining
    • Automated retraining
    • Automated - stateful training
    • Continual Learning
  • How often to update your models
    • You need to figure out how much you gain from updating your model.
    • Value of data freshness → to gain a sense of the performance gain you can get from fresher data, train your model on data from different time windows in the past and test on data from today to see how the performance changes
    • Model iteration vs data iteration → do both from time to time - the more resources you spend on one approach the fewer resources you’ll have for the other
    • In the beginning - when updating your model is manual and slow - do it as often as you can
      • As infrastructure matures and retraining can be done in hours or minutes, the question becomes → how much performance gain would I get from fresher data?
  • Types of Production Testing:
| Strategy | How it works |
| --- | --- |
| Shadow deployment | Deploy the candidate model in parallel, route every request to both models, and log the candidate's predictions for analysis. The new model's predictions aren't served to users, so it's very safe - but it's expensive and doubles compute cost |
| A/B testing | Deploy the candidate model, route a % of traffic to it and serve its predictions, then monitor performance and user feedback/behaviour. Make sure traffic is randomised and the volume is large enough for results to be meaningful. Book recommendation: Trustworthy Online Controlled Experiments - Ron Kohavi |
| Canary release | Slowly roll out the change to a small subset of users before rolling out to the entire infrastructure. Deploy → route some traffic → if performance is OK, increase → stop when it's serving 100% |
| Interleaving experiments | Expose a user to recommendations from two models at the same time, controlling for the position of recommendations (which affects the likelihood of a click) |
| Bandits | Route traffic based on relative model performance to find which model has the highest payoff over time. Requires short feedback loops and uses less data before making a decision |
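The book doesn't prescribe a particular bandit algorithm; as one common example, an epsilon-greedy sketch that mostly routes traffic to the best-performing model so far but keeps exploring (model names and reward numbers are invented):

```python
import random

def choose_model(avg_reward: dict, epsilon: float = 0.1) -> str:
    """Pick which model serves the next request: explore occasionally, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(avg_reward))      # explore a random model
    return max(avg_reward, key=avg_reward.get)      # exploit the best so far

avg_reward = {"model_a": 0.031, "model_b": 0.042}   # e.g. observed click-through rates
print(choose_model(avg_reward))  # usually "model_b", sometimes an exploratory pick
```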

Infrastructure and tooling for MLOps

  • Each company's infrastructure needs are different:

    | Scale | Infrastructure |
    | --- | --- |
    | Single ML app | No infrastructure needed (Jupyter notebooks, Python and pandas). Low investment |
    | Multiple common apps | Can use generalised ML infrastructure. Medium investment. Most companies doing ML are here |
    | Serving millions of requests per hour | Highly specialised infrastructure. High investment |
  • Fundamental facilities that support the development and maintenance of ML systems
| Layer | What it covers |
| --- | --- |
| Storage and compute | Where data is stored and collected; the compute layer provides compute for ML workloads such as training models, computing features and generating predictions |
| Resource management | Tools to schedule and orchestrate your workloads to make the most of available resources (Airflow, Kubeflow, Metaflow) |
| ML platform | Tools to aid the development of ML applications - model stores, feature stores and monitoring tools (SageMaker, MLflow) |
| Development environment | Where code is written and experiments are run. Code needs to be versioned and tested; experiments need to be tracked |
  • Most multi-cloud strategies are by accident, not by design. In theory it would be nice to leverage the cheapest compute and avoid vendor lock-in, but it's really hard to move data and orchestrate workloads across clouds.
  • If you have 100 micro services you might have 100 containers. Container orchestration tools help you manage them (Docker Compose). Kubernetes is a tool that creates a network of containers to communicate and share resources. Helps you spin up more instances when you need more compute / memory and shuts down containers when you don’t need them.

The Human Side of Machine Learning

  • UX considerations for ML systems
    • they are probabilistic instead of deterministic
    • they are mostly correct, but we can't tell when they'll be wrong
    • latency can be an issue
  • Mostly correct predictions are OK for those users who can easily correct them → they aren’t useful if users don’t know how to or can’t correct responses.
  • Smooth Failing → If a model takes too long to respond - you can fall back to a basic heuristic (or cached or precomputed predictions).
  • Team Structure
    • ML systems don't work without subject matter expertise. You need it throughout the process, not just in the labelling phase
    • Problem formulation - Feature engineering - Error analysis - Model evaluation - reranking predictions - user interface (how best to present the results)
    • Think about how to explain the ML algorithm's limitations and capabilities to the user
    • No-code / low-code solutions enable subject matter experts to take the reins
  • Having a separate team manage production makes the most sense. Makes hiring easier as you're splitting skills. Makes life easier for each person involved (as they only have to focus on a single concern). Drawbacks: communication overhead, debugging challenges, finger-pointing, narrow context, might miss E2E optimisation opportunities.
  • Responsible AI
  • Designing, developing and deploying AI systems with good intention and sufficient awareness to empower users, engender trust and ensure a fair and positive impact on society

Deep Summary

Longer form notes, typically condensed, reworded and de-duplicated.

Preface

  • Building and deploying machine learning systems is complicated. More stages, stakeholders and components than traditional systems.
  • As ML systems are data dependent - they tend to be unique, as you have to design around the data.

Chapter 1: Overview of Machine Learning Systems

  • The actual ML algorithm is only a small part of a production ML system. Think… business requirements, users & interface, feature engineering, evaluation, data, infrastructure, deployment, monitoring and updating
  • Machine Learning is an approach to learning complex patterns from existing data - and using those patterns to make predictions on unseen data.
  • They need something to learn from → typically data
    • Zero-shot learning (or zero-data learning) is really hard for machine learning. Most systems require a lot of data to make good predictions.
    • You can deploy a model without training it first - but you risk an initial poor customer experience as it learns
  • They need something to learn → there must be clear patterns in the data
  • Use deterministic mapping (logic) when you can; otherwise machine learning may be able to approximate a mapping by learning patterns from inputs and outputs.
  • What is complex for machines is different from what is complex to humans.
  • Use a Concierge or Wizard of Oz model to get going - and then use that data to train the model later
  • ML models make predictions. So they can only solve problems that require a predictive answer.
  • ML is great if a large number of approximate predictions is useful (e.g. movie recommendations).
  • For your model to be useful - the same patterns must exist in the unseen data as in the training data
  • Use ML solutions when:
    • Patterns and tasks are repetitive.
    • The cost of wrong predictions is small (e.g. movie recommendations)
    • You’re making predictions at scale (to justify the cost {team, compute, data, infra.})
    • Patterns are changing (ML is adaptable - less brittle than hardcoded rule-based solutions)
      • Note - radical changes in data distributions will still require human intervention
  • Don’t use ML solutions when:
    • If it’s unethical to do so.
    • Simpler solutions can do the trick. Always start with a non-ML solution
    • It’s not cost-effective
  • If ML can’t solve your problem - it might be possible to solve part of the problem
  • Machine Learning Use Cases:
    • Search engines, recommender systems, suggestions, translation, assistants, health monitoring, fraud detection, price optimisation, demand forecasting, churn prediction, support ticket classification, sentiment analysis.
  • Most ML isn’t customer facing. In internal applications accuracy is more important than latency.
    • Internal: Reducing costs, generating customer insight, internal processing automation
    • External: improving customer experience, retaining customers, interacting with customers

Machine Learning in Research vs Production

| Research | Production |
| --- | --- |
| State-of-the-art model performance | Good enough to be useful |
| Fast training (training throughput) | Fast inference (latency of generating a prediction) |
| Static data | Changing data |
| Clean data | Messy data |
| Ethics less of a consideration | Ethics can't be ignored |
| Interpretability not important | Interpretability can be important |
| Clear goals | Conflicting requirements from stakeholders |
| Ensembling common | Simpler, less complex systems preferred |
  • Ethayarajh and Jurafsky (2020) argued benchmarks have driven progress in natural language processing at the expense of compactness, fairness, and energy efficiency.
  • People who haven’t deployed an ML system often make the mistake of focusing too much on the model development part and not enough on model deployment and maintenance
  • During model development training is the bottleneck - once a model is in production, inference is the bottleneck.
    • Research prioritises high throughput - Production prioritises low latency
  • Response time - what the client sees
  • Service time - actual time taken to service the request
  • Latency - duration that a request is waiting to be handled
  • Relationship between latency and throughput depends on batch size
    | Scenario | Latency | Batch size (queries) | Throughput (queries per second) |
    | --- | --- | --- | --- |
    | A | 10 ms | 1 | 100 q/s |
    | B | 20 ms | 1 | 50 q/s |
    | C | 10 ms | 10 | 1,000 q/s |
    | D | 20 ms | 50 | 2,500 q/s |

    • If batch size is 1, lower latency means higher throughput
    • If batch size scales faster than latency (as in C → D), accepting higher latency can still increase throughput
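The table's numbers follow from a simple relationship - throughput = batch size / latency - sketched here:

```python
def throughput_qps(batch_size: int, latency_ms: float) -> float:
    """Queries served per second when each batch of requests takes latency_ms."""
    return batch_size * (1000 / latency_ms)

for scenario, batch_size, latency_ms in [("A", 1, 10), ("B", 1, 20), ("C", 10, 10), ("D", 50, 20)]:
    print(scenario, throughput_qps(batch_size, latency_ms))
# A 100.0, B 50.0, C 1000.0, D 2500.0 - matching the table above
```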
Latency matters a lot in real world applications
  • 2017 - Akamai study found that 100ms delay can hurt conversion rates by 7%
  • Booking.com found that a 30% increase in latency cost 0.5% conversion rates.
  • Google found that more than half of people leave a page if it takes 3 seconds to load
  • You can reduce latency by reducing the number of queries processed in parallel - but at the expense of hardware utilisation, increasing the cost to serve.
  • Latency is not an individual number - but a distribution - best reported in percentiles.
    • p50 is the most common. ‘100 ms p50’ means 50% of requests take longer than 100 ms
  • Typically data in research is clean and well formatted - freeing you to focus on model development. If data does have quirks - they’re usually well known and discussed by the community
  • Production data can be messy, noisy, unstructured, constantly shifting, biased, labels can be unbalanced or sparse, you have to think about privacy.
  • Fairness - is harder to measure.
    • Book Reference: Weapons of Math Destruction - Cathy O’Neil
    • To predict the future, ML algorithms encode the past - perpetuating bias. They can discriminate at scale.
  • Interpretability - researchers prioritise model performance over interpretability. Interpretability is important to users and developers (for debugging and improving a model).
  • ML vs Traditional Software
    • Many challenges are unique to ML and require their own tools
    • Data and code are separated in traditional software engineering (Separation of Concerns)
    • ML systems are part code, part data and part artefacts created from the two.
    • Systems with the most/best data win → focusing on improving data is sensible
    • Data changes quickly - so ML systems need fast development and deployment cycles
      • Models degrade over time - they're often at their best the moment they go live
    • In ML you need to test and version your data (in SWE you test and version your code)
    • ML models can have millions of parameters and require gigabytes of RAM

Chapter 2: Introduction to Machine Learning Systems Design

  • Business objectives need to be translated into ML objectives. You need to frame your problem so that ML can solve it - and tie the performance of your ML system back to the overall business.
  • Companies care about outcomes - not ML metrics (F1 score, inference latency)
  • Most companies define custom success measures:
    • Netflix Take Rate: number of quality plays / recommendations seen
    • Higher take rate → higher total streaming hours → lower subscription cancellation rates
  • Return on investment in ML depends on maturity stage of adoption.
    • More experienced teams can deploy ML faster - but expect 30 days from idea to production
  • Also for consideration: Reliability, scalability, adaptability and maintainability
    • Reliability: The system can continue to perform at the desired level of performance even in the face of adversity. ML systems can and often fail silently.
    • Scalability: Grow in complexity, traffic, in model count, features. There should be reasonable ways to grow in whatever dimension is most needed. Cloud services are great at auto-scaling. Artefact management is a big part of scaling ML models - as is monitoring and retraining (which needs to be automated at scale).
    • Adaptability: Coping with shifting data distributions. Discovering aspects for performance improvement and allowing updates without service interruption. Data can change quickly so ML systems need to be able to evolve.
    • Maintainability: Set up your process and infrastructure so different contributors are comfortable with the tooling. Code, data and artefacts should be documented and versioned. Models need to be reproducible by other contributors.
  • Developing an ML system is an iterative process - once a system is in production, it'll need to be monitored and updated.
    • Project Scoping → Data Engineering → ML model development → Deployment → Monitoring and continual learning → Business Analysis
  • An ML problem is defined by inputs, outputs and the objective function that guides the learning process.
  • Types of ML Tasks / Problem Framing
    • The output of a model dictates the task type of your ML problem.
    • Most ML tasks are Classification or Regression
      • Classification tasks can be: Binary, Multi-class or Multi-label
      • Multi-class can be low cardinality or high cardinality
    • Classification is putting things into categories (e.g Spam / Not Spam)
    • Regression models output a continuous value (house price prediction)
    • Regression models can be framed as classification problems and vice versa
      • you can quantize a continuous feature into buckets (under 4ft, 4-5ft, 5-6ft, over 6ft tall)
      • if you output a probability that something belongs to a class - that’s regression
    • Classification problems are simpler with few classes. Binary is the simplest form → calculating F1 and visualising confusion matrices is easier with 2 classes
      • High cardinality is when you have many classes (100 or 1000)
      • Hierarchical classification (first classifier puts into sub-groups of classes, second classifier puts into specific class ) can be useful with high cardinality
    • Multi-Label differs from Binary and Multi-class because in Binary and Multi-class each example belongs to just one class.
    • Multi-label classification is hard. Labels tend to be less consistent (as people disagree - this is a strong warning sign). Because you don't know how many categories an example could belong to, it's unclear how to use probabilities (use the highest? use a threshold?)
    • Changing the way you frame the problem - could make it much easier for ML to solve
      • Given the problem of predicting the app a user will open next - you could frame it as:
        • Classification - The input is the user’s features and environment’s features. The output is a distribution over all apps on the phone.
        • Regression - The input is the user’s, the environment’s and the app’s features. The output is a single value between 0 and 1 - the probability they’ll open that app next.
        • Classification is a bad approach - each new app added requires you to retrain the model.
        • The regression structure means you don't have to retrain the model - another app is just another row to compute
  • Objective Functions
    • The objective function (or loss function) guides the learning process and tries to minimise wrong predictions. In supervised ML - the loss can be computed vs ground truth labels using RMSE (root mean squared error) or cross entropy.
    • Popular Loss functions:
      | Task | Loss function |
      | --- | --- |
      | Regression | RMSE or MAE (mean absolute error) |
      | Binary classification | Logistic loss |
      | Multi-class classification | Cross entropy |
    • Decoupling objectives
      • Framing ML problems is hard when you minimise multiple objective functions
      • In a newsfeed - you might want to drive engagement, but also to maintain quality
        • Option 1: combine two losses into one - and train a single model
        • Option 2: train two models and rank posts by combined scores
      • When there are multiple objectives - decouple them first because it makes model development and maintenance easier
        • You can tweak the system without retraining
        • Different objectives might have different maintenance schedules
  • The success of an ML system depends largely on the data it was trained on
    • Focusing on the data is a good way to improve performance
    • Start with building out your data - in quality and quantity
    • Data Science Hierarchy of needs - start at the bottom

      Imagine this as a pyramid - top level first:

      • Advanced ML / AI - deep learning
      • Learn / optimise - A/B testing, experimentation, simple ML algorithms
      • Aggregate / label - analytics, metrics, segments, features, training data
      • Explore / transform - cleaning, anomaly detection, prep
      • Move / store - reliable data flow, infrastructure, pipelines, ETL, structured and unstructured data storage
      • Collect - instrumentation, logging, sensors, external data, user-generated content

Chapter 3: Data Engineering Fundamentals

  • Data models vs databases:
    • Data models → describe the data in the real world
    • Databases → specify how the data should be stored on machines
  • Two types of processing: Transactional and Analytical
  • Historical data / Streaming data
  • Data Sources
    • User input data: is often malformatted and therefore requires more checking and processing. You usually have to process user input data quickly.
    • System generated data: system outputs and logs. Gives visibility into how the system is doing. Log everything you can when building an ML system - but you'll soon have problems finding and storing it all.
    • Internal Databases: data generated by services and applications.
Definitions of 1st, 2nd and 3rd party data
  • First-Party Data → data that your company collects about your users or customers
  • Second-Party Data → a different company collects data on their customers
  • Third-Party Data → a different company collects data on the public (not their customers)
  • Data Formats
    • Data storage considerations: cost, speed, security, human readability, access patterns, text or binary, file size
    • Data serialisation is the process of converting data into a format for storage or transmission
    | Format | Binary/Text | Human readable | Example use cases |
    | --- | --- | --- | --- |
    | JSON | Text | Yes | Everywhere |
    | CSV | Text | Yes | Everywhere |
    | Parquet | Binary | No | Hadoop, Amazon Redshift |
    | Avro | Binary primary | No | Hadoop |
    | Protobuf | Binary primary | No | Google, TensorFlow |
    | Pickle | Binary | No | Python, PyTorch serialisation |
    • JSON (JavaScript Object Notation) is language-independent and easily parsed. Can have different levels of structure. It’s painful to change schema retrospectively. They take up a lot of space too.
  • Row-Major VS Column-Major Format: in a row-major format, consecutive elements of a row are stored next to each other, so reading and writing whole examples (rows) is fast; in a column-major format, consecutive elements of a column are stored together, so reading all values of a feature (column-based reads) is fast. Row-major formats (e.g. CSV) are better for accessing and writing examples; column-major formats (e.g. Parquet) are better for accessing features. Note: pandas DataFrames are column-major by default, while NumPy arrays are row-major by default.
  • Text VS Binary Format: binary files are more compact - e.g. switching from CSV to Parquet can reduce storage by ~6x and make data ~2x faster to unload. The trade-off is that binary files lose human readability.
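  • A small NumPy sketch of the row-major vs column-major difference (timings are machine dependent; this only illustrates the access-pattern effect):

```python
import time
import numpy as np

n = 4_000
c_order = np.random.rand(n, n)          # NumPy default layout: row-major (C order)
f_order = np.asfortranarray(c_order)    # same values, column-major (Fortran order)

def time_row_reads(arr):
    start = time.perf_counter()
    for i in range(arr.shape[0]):
        arr[i, :].sum()                 # read one full row (one "example") at a time
    return time.perf_counter() - start

# Reading whole rows is faster when rows are contiguous in memory.
print("row reads, row-major:   ", round(time_row_reads(c_order), 3))
print("row reads, column-major:", round(time_row_reads(f_order), 3))
```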
  • Data Models: Describe how data is represented. Your model will influence how your systems are built and what problems you can solve.
    • Relational Models:
      • Breaking data into relational tables. Rows and columns that can be shuffled.
      • Often normalised - structuring tables into normal forms to reduce data redundancy and improve data integrity
      • The downside of normalisation is that data is spread across multiple relations - you have to join them back together
      • Query language is the language you use to specify the data you want
      • SQL is a declarative language - you specify the outputs you want, the computer figures out the steps needed for an action and the computer executes these steps to return the outputs.
      • Python is imperative. You specify the steps needed for an action - and the computer executes them to return an output
      • In theory - SQL can be used for any computing problem (Turing complete)
    • NoSQL Models:
      • Relational models have to follow a strict schema.
      • Two major types: Document model and the Graph model
        • Documents for when data is self-contained and relationships are rare
        • Graphs for when relationships are common and important
      • Document Model:
        • Can be a single blob of JSON. Documents are more flexible - each one can have a different schema. Document databases shift responsibility of assuming structures from the writing application to the reading application.
        • Each document has locality (holding all relevant information) making retrieval easy
        • Filtering results for documents with certain attributes is slow - you have to read them all, extract the attribute, then filter. Relational models are faster for this
      • Graph Model:
        • Graph consists of nodes and edges - edges are the relationships between the nodes. The relationships between the data are a priority. It’s faster to retrieve data based on relationships.
      • Queries that are easy to do in one data model are hard in another - picking the right data model can make your life so much easier
  • Structured vs Unstructured Data
    • Structured follows a predefined data model (a.k.a. schema). Pre-defined structure makes data easier to analyse. Disadvantage - you have to commit to a schema in advance, changing it retrospectively is harder.
    • Unstructured data becomes appealing if data models are changing quickly - or you’re reliant on data sources outside of your control.
    • A repository for storing structured data is called a data warehouse.
    • A repository for storing unstructured data is called a data lake.
    • Comparison:

      | Structured | Unstructured |
      | --- | --- |
      | Schema clearly defined | Data doesn't have to follow a schema |
      | Easy to search and analyse | Fast arrival |
      | Can only handle data with a specific schema | Can handle data from any source |
      | Schema changes cause trouble | Schema changes are easy - downstream applications have the issues |
      | Stored in data warehouses | Stored in data lakes |
  • Data Storage Engines (a.k.a. databases) and Processing
    • Data formats and models specify the interfaces for storage and retrieval
    • Two types of workloads that databases are optimised for:
      • Transactional processing: actions are inserted as they’re generated, occasionally updated or deleted if something changes. Fast processing (low latency) and high availability are important. ACID (atomicity, consistency, isolation, durability)
        • Atomicity - if any step in the transaction fails - they all fail
        • Consistency - must follow pre-defined rules (e.g. validation)
        • Isolation - concurrent transactions happen as if they ran one after another (e.g. two users can't book the same seat at the same time)
        • Durability - once a transaction is committed - it remains so (even if the device dies)
        • Systems that don't meet the ACID criteria are sometimes described as BASE (Basically Available, Soft state, Eventual consistency)
        • Transactional databases are often row-major - great for writing new records - but bad for asking ‘What’s the average price of items sold in our London stores?’
      • Analytical processing: Efficient with queries that allow you to look at data from different view points.
    • Today there are databases that are good at both transactional and analytical tasks.
    • Today it is more common to decouple storage from processing. BigQuery, Snowflake, and Teradata have a processing layer that can be optimised for different types of queries.
  • ETL: Extract, Transform, Load
    • Extracted from different sources → transformed to desired format → loaded into target
    • Data validation is performed in extraction. Transformation can include standardisation of value ranges, transposing, deduplicating, sorting, aggregating, deriving new features etc.
    • Loading everything straight into a data lake without a schema is sometimes called ELT (extract, load, transform).
    • As the data in a data lake grows, it becomes harder to search and use
    • Databricks and Snowflake are hybrid - flexibility of data lakes and the data management aspect of a data warehouse.
  • Models of data flow
    • Three main models of data flow (passing data between processes):
      • through databases
        • Easiest option, but requires that both processes have access to the same database. Read/write can be slow - making it unsuitable for low-latency applications
      • through services using requests provided by REST and RPC APIs (POST/GET requests)
        • Often called ‘request-driven’ and coupled with a service-orientated architecture (or micro-services architecture). A service is a process that can be accessed remotely. Structuring an application as separate services allows for independent development, testing and deployment. Also great for when two companies collaborate.
        • Popular requests:
          • REST (representational state transfer)
            • a.k.a. RESTful - CRUD operations map onto HTTP verbs: Create (POST), Read (GET), Update (PUT), Delete (DELETE)
            • retrieving resources filtered by state (e.g. applying a filter) isn't straightforward
            • best for just a simple application
          • RPC (remote procedure call)
            • flexible - great for business rules - designed for actions
            • more scalable in the long run (for different use cases)
      • through real-time transport like Apache Kafka and Amazon Kinesis
        • Request-driven data passing between services is synchronous - and can get slow and complicated if there are too many services. One service can cause all the others to fail too
        • Real-time transport acts as a broker between services - which can either broadcast or listen.
        • We call the pieces of data being transported events. The architecture is called event-driven.
        • Publish-subscribe message brokers (e.g. Apache Kafka, Amazon Kinesis) or message queues (e.g. Apache RocketMQ, RabbitMQ).
        • In a message queue model - events often have intended consumers.
    • Request driven architecture works well for systems that rely more on logic than on data. Event-driven architecture works better for systems that are data heavy.
  • Batch Processing VS Stream Processing
    • Batch processing - jobs that are kicked off periodically
    • Streaming data → uses realtime transport and stream processing (realtime or every few minutes)
      • Stream processing can give low latency.
      • It’s not always less efficient either. It can be scalable as computations can be done in parallel. You can also save compute by doing things as they happen, vs re-running large batch jobs.
    • In ML, batch processing is used to compute features that change less often (like a driver's rating)
      • Batch features - are also known as static features
    • Stream processing is used to compute features that change quickly
      • Stream features - are also called dynamic features.
    • Many ML systems require a mix of static and dynamic features - with the right infrastructure you can join them together to feed into your ML models. You need a stream computation engine to do computation on data streams. Stream processing is harder because the data amount is unbounded and the data comes in at variable rates and speeds - it's easier to make a stream processor do batch processing than vice versa.

Chapter 4: Training Data

  • Data is messy, complex, unpredictable. It can sink your operation.
  • Use ‘data’ not ‘dataset’ as “dataset” implies it’s finite and stationary. Data in production is neither finite nor stationary - expect ‘Data Distribution Shifts’
  • Data is full of potential biases - that arise in collecting, sampling, or labeling.
    • ML models can perpetuate and amplify any human bias in historical training data
  • Use data - but don’t trust it

Sampling

  • Sampling happens throughout the ML project lifecycle
  • It is often impossible or infeasible to process all the data that you have access to (due to time and resources) - sampling helps you accomplish a task faster and cheaper
  • Data scientists often experiment with a subset of data - before training a new model
  • Two families of sampling: Non-probability sampling and random sampling:
Non Probability Sampling (Don’t use for ML models)
  • Types:
    • Convenience → based on availability of data; popular because it's convenient
    • Snowball → future samples are selected based on existing samples (e.g. when scraping, you might have to start on one node to find others)
    • Judgment → experts decide what samples to include
    • Quota → select samples based on quotas for certain slices of data, regardless of the actual distribution; prone to selection bias and driven by convenience
  • Typically not representative of the real-world data and therefore have selection bias
  • A bad idea to train models on data pulled in this way
  • Often driven by convenience
  • Language models are often trained with data that is easily accessible (Wikipedia, Common Crawl, Reddit)
  • Sentiment analysis is often collected from sources with natural labels (Amazon and IMDB) - biased towards those leaving reviews online - not representative of people
  • Self-driving car data comes mostly from Phoenix and California because of favourable legislation there
  • Random Sampling:
    • Simple random → all samples in the population have an equal probability of selection. Easy, but rare categories of data might not make it into your selection
    • Stratified → divide the population into groups (strata) and sample from each group separately. Ensures some examples of rare classes. Not always possible - hard for multi-label tasks
    • Weighted → each sample is given a weight that determines its probability of selection. You can leverage domain expertise, e.g. giving more recent data a higher chance of being selected
    • Reservoir → useful for streaming data. Keep a reservoir of k data points, randomly replacing them as new items arrive from the stream. Every item seen so far has an equal chance of being in the reservoir, and you can stop at any time with a valid sample (see the sketch below)
    • Importance → sample from one distribution when you only have access to another. If a data source is expensive, slow, or infeasible to sample from, sample from a more available source and weight those samples accordingly
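  • A minimal sketch of reservoir sampling (the classic "Algorithm R"):

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # pick an index in [0, i]
            if j < k:
                reservoir[j] = item      # incoming item replaces a reservoir item with probability k/(i+1)
    # At any point you can stop: `reservoir` is a valid uniform sample of everything seen so far.
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```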

Labelling

  • Most ML models are supervised - they need labelled data to learn.
  • Performance of an ML model - depends on quality and quantity of data
  • Two types of labels → Hand labels, or Natural labels.
  • Hand Labels
    • Acquiring hand labels is often difficult, slow and expensive (if subject matter expertise is required). Requires somebody seeing your data - so there are privacy implications.
    • Slow labelling leads to slow iteration speeds and makes your model less adaptive to changing environments and requirements. The longer the process takes, the more your existing model will degrade.
    • If the task changes or the data changes, you have to wait for new labels before updating your model
  • Label Multiplicity - To get enough labelled data you often have to use multiple annotators or even sources. They will have different levels of expertise and accuracy. This leads to label ambiguity or multiplicity - what to do when there are multiple conflicting labels for a data instance?
    • Disagreements among annotators are extremely common. If humans can’t agree on a label - what does human-level performance even mean?
    • Incorporating clear problem definitions and guidance in annotators training can minimise disagreement
  • Data Lineage - keeping track of each data sample's origin and labels. Essential if taking data from multiple sources. Helps you flag bias and debug models.
  • Natural labels
    • Tasks with natural labels can be evaluated by the system.
    • They might have a natural ground truth.
      • E.g. Google Maps knows how long the trip actually took - and they can evaluate how good their prediction was
    • Recommender systems have natural labels (CLICK or NO CLICK)
    • Labels inferred from user actions (clicks and ratings) are known as behavioural labels
    • If you don’t have natural labels - consider adding an optional feedback loop
      • Examples: like buttons, reactions, ‘submit an alternative translation’
    • Companies find it easier and cheaper to start on tasks with natural labels
    • Implicit labels are presumed. E.g. a recommendation that doesn’t get clicked on is presumed to be bad
    • Explicit labels are when users explicitly demonstrate their feedback

Feedback Loop Length

  • Time from prediction to feedback. Recommender systems have short feedback loops. If you’re recommending clothes on Stitch Fix - you won’t get feedback until the items have been tried on days or weeks later
  • User feedback differs by volume, signal strength and feedback loop length.
    • View, click, add to cart, buy, rate, review
  • Fraud has a long feedback loop. You might need leading indicators to detect issues with your ML model

Handling a lack of labels

  • Four ways to cope:
  • Weak supervision → leverages noisy heuristics to generate labels
    • Use subject matter expertise to create heuristics that label your data instead of using hand labels (sometimes called labelling functions or programmatic labelling)
    • Examples: Keywords, regular expressions, database lookups, outputs from other models
    • A small set of hand labels to check your heuristics against is helpful.
    • Usually results in noisy labels
    • Can be used in privacy situations when you can’t see the data
    • Cheap / Fast / Adaptive / Not accurate though
    • Great to use when you’re getting started - to evaluate if it’s worth getting hand labels
  • Semi-supervision → leverages structural assumptions to generate labels
    • Leverages structural assumptions to generate new labels based on a small set of initial labels.
    • Requires seed labels to generate more
    • Self-training → train a model on existing labelled data, then use that model to make predictions (pseudo-labels) for unlabelled samples.
    • Assume that data samples that have similar characteristics should have similar labels. Clustering or k-nearest neighbours.
    • Perturbation → adding small perturbations to a sample shouldn't change its label. Perturbed samples are given the same labels as the unperturbed ones.
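    • A minimal self-training sketch, assuming NumPy arrays and a scikit-learn-style classifier, keeping only high-confidence pseudo-labels (the threshold and round count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=3):
    """Repeatedly pseudo-label the unlabelled pool with the model's confident predictions."""
    X_train, y_train, pool = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_train, y_train)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold            # only trust high-confidence predictions
        if not confident.any():
            break
        pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo_labels])
        pool = pool[~confident]                               # remove newly pseudo-labelled samples
    return model
```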
  • Transfer learning → leverages models pre-trained on another task for your new task
    • Zero-shot learning doesn’t require labels, but fine-tuning does
    • A model developed for one task is used as the starting point for a model developed for another task
    • Train a language model to predict the next token in a sequence → could be fine tuned to answer questions
    • Transfer learning is appealing when there isn’t much labelled data
    • Lowers the barriers into ML (GPT-3 could save you tens of millions of USD)
  • Active learning → label the data samples that are most useful to your model
    • Requires labels
    • Improves the efficiency of data labelling.
    • Trying to get models to greater accuracy with less data. Instead of randomly selecting samples to label, train on those that are most helpful to your model (e.g. those the model is uncertain about, hoping it will learn those boundaries better)
    • Or you can use data if you have multiple models and their labels disagree.
    • Active learning is important for real-time systems. Allows your model to learn more effectively in real time and adapt faster to changing environments.
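    • A minimal uncertainty-sampling sketch (least-confident querying) for choosing which samples to send to annotators; it assumes any fitted classifier that exposes `predict_proba`:

```python
import numpy as np

def least_confident_queries(model, X_unlabeled, n_queries=10):
    """Uncertainty sampling: return indices of the samples the model is least sure about."""
    probs = model.predict_proba(X_unlabeled)     # class probabilities for each unlabelled sample
    confidence = probs.max(axis=1)               # confidence in the predicted class
    return np.argsort(confidence)[:n_queries]    # lowest-confidence samples go to human annotators

# Typical loop: fit on the labelled set, query the least-confident samples,
# get them labelled, add them to the training set, and repeat.
```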

Class Imbalance

  • When there is a substantial difference in the number of samples in each class of the training data
    • E.g. 0.01% of X-rays might contain cancerous cells
    • E.g. estimating the 95th percentile of healthcare bills is important - as that’s where the bulk of the cost is
  • Challenges of class imbalance:
    • Deep learning (and ML) works well in situations when the data distribution is more balanced, and usually not so well when the classes are heavily imbalanced because:
      • insufficient signal to detect minority cases
      • models get stuck in non-optimal solutions - using simple heuristics instead of learning useful things
      • asymmetric costs of error - the cost of a wrong prediction on a sample of the rare class can be much higher (missing a rare cancer diagnosis). If your loss function isn’t configured to address this asymmetry, your model will treat all samples the same way
    • Rare events are often more interesting (like in fraud detection, or churn prediction)
    • Three ways to handle class imbalance:
    • Use the right evaluation metrics
      • Model Accuracy and Error Rate (used frequently) are insufficient metrics for tasks with class imbalance because they treat all classes equally. Performance on the majority class dominates the metrics. This is especially bad when the majority class isn’t what you care about.
      • F1, precision, and recall are metrics that measure your model’s performance with respect to the positive class in binary classification problems, as they rely on true positive—an outcome where the model correctly predicts the positive class.
      • Confusion matrix:

        |  | Positive prediction | Negative prediction |
        | --- | --- | --- |
        | Positive label | True positive | False negative |
        | Negative label | False positive | True negative |
      • Precision = True Positive / (True Positive + False Positive)
        • Precision = accuracy of positive predictions
      • Recall = True Positive / (True Positive + False Negative)
        • Recall = proportion of actual positives (in the data) that were predicted
      • F1 = 2 x Precision x Recall / (Precision + Recall)
      • They are all asymmetric metrics - their values change depending on what you call the positive class
      • Classification problems can be modelled as regression problems (instead of returning a class you return the probability of a class). You can then classify based on that probability by setting a threshold. Moving it up and down allows you to increase the true positive rate (also known as recall) while decreasing the false positive rate (also known as the probability of false alarm), and vice versa
        • Plotting true positive rate against false positive rate is the ROC curve. The area under the curve is a measure of how close to perfect the model is.
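      • A quick scikit-learn sketch of these metrics on a small, illustrative imbalanced example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 0, 1, 0, 0, 0, 1]                      # 1 = rare positive class
y_pred  = [1, 0, 0, 0, 0, 1, 0, 1]                      # hard class predictions
y_score = [0.9, 0.2, 0.1, 0.4, 0.3, 0.7, 0.05, 0.8]     # predicted probability of the positive class

print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP) = 2/3
print("recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN) = 2/3
print("f1:       ", f1_score(y_true, y_pred))
print("roc auc:  ", roc_auc_score(y_true, y_score))     # threshold-free view of ranking quality
```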
    • Data-Level Methods: Resampling
      • Modifying the distribution of the training data to reduce the level of imbalance to make it easier for the model to learn.
        • You can OverSample the minority class or UnderSample the majority class
          • Undersampling →
            • Random removals
            • Tomek link UnderSampling - finds pairs of samples from opposite classes that are similar and removes the one from the majority class. Helps models learn the boundary but might make them less robust.
          • Oversampling →
            • Random duplication
            • SMOTE (synthetic minority oversampling technique) - synthesises novel samples of the minority class by combining existing minority class examples
      • These techniques only work well with low-dimensional data
      • never evaluate your model on resampled data - it will cause your model to overfit to that resampled distribution
      • Undersampling risks losing important data
      • Oversampling risks overfitting on the training data
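      • A hedged sketch of random over/under-sampling of the training split with scikit-learn's `resample` (SMOTE itself is provided by the separate imbalanced-learn package; the data below is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Random over/under-sampling on the *training split only*; evaluate on the original distribution.
train = pd.DataFrame({"feature": np.random.rand(1000),
                      "label": [1] * 50 + [0] * 950})       # 5% positive class

minority = train[train.label == 1]
majority = train[train.label == 0]

# Oversample the minority class (sampling with replacement) ...
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# ... or undersample the majority class instead.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up.label.value_counts().to_dict(), balanced_down.label.value_counts().to_dict())
```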
    • Algorithm-Level Methods:
      • Algorithm-level methods keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
      • Many algorithm-level methods involve adjustment to the loss function (that guides the learning process). Gets the model to prioritise making correct predictions on the more important class. By giving the training instances we care about higher weight, we can make the model focus more on learning these instances
      • Cost-sensitive learning → the loss function is modified to take into account the varying cost of misclassification (but you have to manually define the cost matrix)
      • Class-balanced loss → punish the model for making wrong predictions on minority classes to correct this bias by making the weight of each class inversely proportional to the number of samples in that class, so that the rarer classes have higher weight
      • Focal loss → incentivise the model to focus on learning the samples it still has difficulty classifying. If a sample has a lower probability of being right, it’ll have a higher weight
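      • A small sketch of class weighting with scikit-learn's logistic regression - one way to implement cost-sensitive / class-balanced weighting without touching the data distribution (the explicit weights are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Weights inversely proportional to class frequency (class-balanced style):
balanced_clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Or supply an explicit, manually chosen cost weighting (cost-sensitive style):
cost_sensitive_clf = LogisticRegression(class_weight={0: 1.0, 1: 20.0}, max_iter=1000)

# Fit as usual, e.g. balanced_clf.fit(X_train, y_train);
# mistakes on the rare class (1) now contribute more to the loss.
```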

Data Augmentation - a family of techniques that are used to increase the amount of training data

  • Used for tasks that have limited training data, such as in medical imaging
  • Augmented data can make our models more robust to noise and even adversarial attacks.
  • Simple Label-Preserving Transformations → e.g. cropping, flipping, rotating, inverting, or erasing part of the image. In NLP, you can replace a word with a similar word. A quick way to double or triple your training data.
  • Perturbation → Neural networks are sensitive to noise. In computer vision adding a small amount of noise to an image can cause a neural network to misclassify it.
    • Using deceptive data to trick a neural network into making wrong predictions is called an adversarial attack
    • Adding noisy samples to training data can help models recognise the weak spots in their learned decision boundary and improve their performance
    • DeepFool finds the minimum possible noise injection needed to cause a misclassification with high confidence

Data Synthesis - creating training data to boost a model’s performance

  • In NLP → templates can be a cheap way to bootstrap your model.
    • Example Template: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]”
      • You can then list all possible cuisines, all reasonable numbers of miles, and locations (home, office, landmarks, exact addresses) for each city, you can generate thousands of training queries from a template
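    • A tiny sketch of template-based synthesis with `itertools.product` (the slot values are illustrative):

```python
import itertools
import random

template = "Find me a {cuisine} restaurant within {miles} miles of {location}"

cuisines = ["Vietnamese", "Italian", "Ethiopian"]
miles = [1, 2, 5, 10]
locations = ["home", "my office", "Union Square"]

queries = [template.format(cuisine=c, miles=m, location=l)
           for c, m, l in itertools.product(cuisines, miles, locations)]

print(len(queries))             # 3 * 4 * 3 = 36 synthetic training queries from one template
print(random.choice(queries))
```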

Chapter 5 - Feature Engineering

  • A 2014 paper ("Practical Lessons from Predicting Clicks on Ads at Facebook") showed that the most important factor in developing ML models is having the right features - more important than hyperparameter tuning, a technique that gets more airtime
  • Coming up with new useful features is a big part of the job
  • Data leakage is a subtle yet disastrous problem that has derailed many ML systems in production

Learned Features Versus Engineered Features

  • Choosing what information to use and how to extract this information into a format usable by your ML models is feature engineering.
  • Many hoped deep learning would be the end of handcrafting features - although some can be automatically learned and extracted, we’re still far from the point where all features can be automated.
  • The majority of ML applications in production aren’t deep learning and ML systems need data beyond just text and images.
    • For spam detection you might include the person who made the comment, the reactions, when their account was created, how often they post, how many views, how many threads
    • There are many possible features you could use in your model

Handling Missing Values

  • Not all types of missing values are equal. There are three types of missing values:
    • Missing not at random (MNAR) → Values are missing for reasons related to the values themselves.
    • Missing at random (MAR) → the reason a value is missing is not the value itself, but another observed variable. E.g. people of gender A in this survey don't like disclosing their age.
    • Missing completely at random (MCAR) → there’s no pattern in when the value is missing. E.g. People just forget to fill in that value sometimes for no particular reason. However, this type of missing is very rare. There are usually reasons why certain values are missing, and you should investigate.
  • When encountering missing values, you can either fill in the missing values with certain values (imputation) or remove the missing values (deletion)
    • Deletion → many prefer deletion because it’s easier to do.
      • Column deletion: if a variable has too many missing values, just remove that variable. You might remove important information and reduce the accuracy of your model.
      • Row deletion: if a sample has missing value(s), just remove that sample. This method can work when the missing values are completely at random (MCAR) and the number of examples with missing values is small, such as less than 0.1%. You don’t want to do row deletion if that means 10% of your data samples are removed.
        • Can remove important information that your model needs to make predictions, especially if the missing values are not at random (MNAR)
        • Can create biases in your model, especially if the missing values are at random (MAR)
    • Imputation → If you don’t want to delete missing values, you will have to impute them, which means “fill them with certain values.” Deciding the values is the hard part
      • You could fill with their defaults E.g. an empty string “”
      • You could fill with the mean, median, or mode
      • Both practices work well in many cases, but sometimes they can cause hair-pulling bugs.
      • Avoid filling missing values with possible real values, such as filling a missing number of children with 0 - downstream you can no longer tell "missing" apart from "actually zero"
      • Deletion risks losing important information or accentuating biases.
      • Imputation risks injecting your own bias into and adding noise
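      • A small pandas sketch of deletion vs imputation (the columns and fill choices are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [29, np.nan, 41, 35, np.nan],
                   "income": [52_000, 61_000, np.nan, 48_000, 55_000]})

# Deletion: drop rows with any missing value (reasonable only when few rows are affected and missingness is MCAR).
dropped = df.dropna()

# Imputation: fill with a summary statistic.
imputed = df.fillna({"age": df["age"].median(),
                     "income": df["income"].mean()})

print(dropped.shape, imputed.isna().sum().sum())   # (2, 2) and 0 missing values after imputation
```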

Scaling

  • Models tend to give features with larger values more importance. That's a problem if your model mixes, say, age (values under ~120) with annual income (values that can run into the hundreds of thousands).
  • It’s therefore important to scale features into similar ranges before putting them into a model - this is called feature scaling.
    • This is one of the simplest things you can do that often results in a performance boost
  • Neglecting to do so can cause your model to make gibberish predictions (especially for gradient-boosted trees and logistic regression)
  • Often people scale values to between 0 and 1 or -1 to 1.
  • If your variables follow a normal distribution, normalise them to have zero mean and unit variance (this is called standardisation)
  • ML models tend to struggle with features that have a skewed distribution
    • To help mitigate the skewness, a technique commonly used is log transformation
  • Scaling is a common source of data leakage.
  • Scaling requires global statistics (looking at the entire training data to calculate its min, max, or mean). During inference, you reuse the statistics you had obtained during training to scale new data. If the new data has changed significantly compared to the training, these statistics won’t be very useful. Therefore, it’s important to retrain your model often to account for these changes.
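  • A minimal NumPy sketch of min-max scaling and standardisation, computing statistics on the train split only and reusing them for new data (the values are illustrative):

```python
import numpy as np

X_train = np.array([[20., 30_000.], [45., 80_000.], [60., 52_000.]])   # columns: age, income
X_new   = np.array([[33., 41_000.]])                                   # data arriving at inference time

# Min-max scaling to [0, 1], using statistics from the train split only.
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = (X_train - train_min) / (train_max - train_min)
X_new_scaled   = (X_new - train_min) / (train_max - train_min)          # reuse the train statistics

# Standardisation: zero mean, unit variance (again with train statistics).
train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)
X_new_standardised = (X_new - train_mean) / train_std
```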

Discretisation

  • Discretisation is the process of turning a continuous feature into a discrete feature - by creating buckets for given values. The process is also known as binning or quantisation.
  • The author has rarely found discretisation to help.
  • Instead of having to learn an infinite number of possible incomes, our model can focus on learning only a few categories, which is a much easier task to learn. This technique is supposed to be more helpful with limited training data.
  • Categorisation introduces discontinuities at the category boundaries—$34,999 is now treated as completely different from $35,000 → choosing the boundaries of categories might be hard
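  • A small pandas sketch of binning a continuous income feature (the bucket boundaries are an illustrative design choice):

```python
import pandas as pd

incomes = pd.Series([18_500, 34_999, 35_000, 72_000, 145_000])

income_bracket = pd.cut(incomes,
                        bins=[0, 35_000, 100_000, float("inf")],
                        right=False,                        # bins are [low, high)
                        labels=["low", "middle", "high"])

print(income_bracket.tolist())   # ['low', 'low', 'middle', 'middle', 'high']: 34,999 and 35,000 land in different buckets
```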

Encoding Categorical Features

  • Feature crossing combines two or more features to generate new features. Useful for modelling nonlinear relationships between features. Example: combine marital status and number of children into "marriage and children". Essential for models that can't learn or are bad at learning nonlinear relationships, such as linear regression, logistic regression, and tree-based models. It's less important in neural networks, but can still help them learn nonlinear relationships faster. Caution - it can cause overfitting and your models might need more training data.

Discrete and Continuous Positional Embeddings

  • Paper: “Attention Is All You Need” (Vaswani et al. 2017)
  • Positional embedding has become a standard data engineering technique for many applications in both computer vision and NLP.
  • What are Embeddings
    • An embedding is a vector that represents a piece of data.
    • All embeddings generated by the same algorithm are called “an embedding space.”
    • All embedding vectors in the same space are of the same size
  • For language modelling where you want to predict the next token (e.g., a word, character, or subword) based on the previous sequence of tokens - embeddings are useful

Data Leakage

  • Data leakage is the phenomenon when a form of the label “leaks” into the set of features used for making predictions, and this same information is not available during inference.
    • Essentially when labels are leaked into the features during training
  • Leakage is often non-obvious and it can cause models to fail in unexpected and spectacular ways (even if evaluated carefully).
    • It’s common
    • It’s rarely covered in ML curricula
  • Bad examples:
    • patients scanned while lying down were more likely to be seriously ill → the model learned to predict serious covid risk from a person’s position
    • certain hospitals dealt with more serious caseloads → the model used the labels and fonts on those scans to predict covid risk
  • Common Causes of Data Leakage:
  • Splitting time-correlated data randomly instead of by time
    • Often data is time-correlated → the time the data is generated affects its label distribution
    • Correlation can be obvious - as in stock prices
      • Similar stocks move together.
      • To predict the future stock prices → split your training data by time, such as training your model on data from the first six days and evaluating it on data from the seventh day
    • Or non-obvious - listening to a song
      • Depends not only on their music taste + the general music trend that day
      • If an artist dies → people are more likely to listen to that artist that day
    • Split your data by time, instead of splitting randomly, whenever possible.
      • If you have 5 weeks of data - use the first four weeks for the train split, then randomly split week 5 into validation and test splits
  • Scaling before splitting
    • Always split your data first before scaling
    • Use statistics from the train split to scale all the splits
    • Leakage might occur if the mean or median is calculated using entire data instead of just the train split
  • Poor handling of data duplication before splitting
    • If you have duplicates or near-duplicates in your data, failing to remove them before splitting your data might cause the same samples to appear in both train and validation / test splits. Data duplication is quite common in the industry.
    • It can result from data collection or merging of different data sources.
    • It was common with COVID-19 data as researchers combined several datasets that actually had overlapping data
    • Always check for duplicates before splitting
    • If you’re oversampling - do it after splitting.
  • Group leakage
    • Common in object detection tasks that contain photos of the same object taken milliseconds apart → some land in the train split while others land in the test split.
    • You need to understand how your data was generated to avoid this type of data leakage
  • Leakage from the data generation process
    • Example: Type of CT scan machine can leak data on the seriousness of the patient case
    • Detecting this type of data leakage requires a deep understanding of the way data is collected.
    • You have to know about the hospital procedures and the machines.
    • Mitigate the risk by keeping track of the sources of your data and understanding how it is collected and processed
  • Detecting Data Leakage
    • It can happen in many steps: Generating, collecting, sampling, splitting, processing data and feature engineering
    • Measure the predictive power of each feature with respect to the label → then investigate high correlation.
      • Two features can independently contain no leakage - but leak data when combined together. An employee’s start date and end date can reveal tenure.
    • Do ablation studies to measure how important a feature is to your model - remove the feature and assess the drop-off in performance.
    • Watch out for new features improving model performance by large amounts
    • Don’t use your test split for anything other than reporting a model’s final performance → you risk leaking information from the future into your training process.

Engineering Good Features

  • Generally more features are better. Therefore in production the number of features grows over time. But there are downsides too
    • More chance for data leakage
    • Too many features can cause overfitting
    • More features take more memory and compute - increasing cost and slowing things down
    • Inference latency gets worse for online prediction
    • You grow technical debt
  • If a feature doesn’t help a model make good predictions → regularisation techniques like L1 regularisation should reduce that feature’s weight to 0.
    • removing unused features can speed up learning → you can store removed features to add them back later.
      • Store with feature definitions to reuse and share across teams in an organisation.

Feature Importance

  • There are many built-in and open source packages for computing the importance of your features
  • It’s measured by → how much that model’s performance deteriorates if that feature or a set of features containing that feature is removed from the model
    • SHAP is great because it not only measures a feature's importance to the entire model, it also measures each feature's contribution to a specific prediction
  • Often, a small number of features accounts for a large portion of your model’s feature importance
  • Facebook found the top 10 features are responsible for about half of the model’s total feature importance, whereas the last 300 features contribute less than 1%.
image
  • Feature importance techniques are also great for interpretability as they help you understand how your model works
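  • A hedged sketch of one common approach, permutation importance in scikit-learn (SHAP is a separate package that adds per-prediction attributions; the dataset here is just a built-in example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# How much does validation performance drop when each feature is randomly shuffled?
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.4f}")
```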

Feature Generalisation

  • Features used for the model should generalise to unseen data - but not all features generalise equally.
  • Measuring feature generalisation is less scientific than measuring feature importance
    • it requires intuition and subject matter expertise on top of statistical knowledge.
  • Coverage is the % of samples that have a value for this feature. The fewer values missing, the higher the coverage. Generally, if a feature appears in a very small percentage of your data, it's not going to be very generalisable.
    • Low coverage features can still be useful especially when missing values are not random
    • Caution: coverage of a feature can differ wildly between the train and test split
  • There’s a trade-off between generalisation and specificity.
    • IS_RUSH_HOUR is more generalisable but less specific than HOUR_OF_THE_DAY

Summary

  • The success of ML systems depends on their features → organisations need to invest time and effort into feature engineering.
  • It requires learning through experience: trying out different features and observing how they affect your models’ performance.
  • You can learn from winning teams of Kaggle competitions too

Basic ML (Chapter 6 Pre-Read)

  • The book recommends reading its basic intro to ML before starting this chapter, so I'll include that here
  • Recommended for a more in-depth introduction to ML…
  • A model is a function that transforms inputs into outputs, which can then be used to make predictions.
    • In traditional programming, functions are given and outputs are calculated from given inputs.
    • In supervised ML, the inputs and outputs are given, which are called data, and the function is derived from data.
      • Given x as input and y as output, you want to learn a function f such that applying f on x will produce y.
      • ML isn’t powerful enough to derive arbitrary functions from data yet → you need to specify the form that you think the function should take
        • Variables in your function learned in the training process are called parameters
    • You need an objective function to evaluate how good a given set of parameters is for a dataset
      • and a procedure to derive the set of parameters best suited for the given data according to that objective, known as a learning procedure.
    • When talking about model selection, most people think about selecting a function form. However, choosing the right objective function and a learning procedure is extremely important in finding a good set of parameters for your model.

Objective Function

  • The objective function (or loss function) is highly dependent on the model type and whether the labels are available. If the labels aren’t available (e.g. unsupervised learning) the objective functions depend on the data points themselves.
    • For k-means clustering, the objective function is the variance within data points in the same cluster. Unsupervised learning is much less commonly used in production.
  • Most algorithms in production are supervised or semi-supervised. Given a set of parameter values, you calculate the outputs from the given inputs, and compare the given function’s predicted outputs (y') to the actual outputs (y).
  • If your model outputs a distribution → a common objective function is cross entropy and its variation.
  • You can modify the objective function to encourage your model to focus on examples of rare classes or examples that are difficult to learn. You can also add regularisers such as L1 and L2 to your loss function to encourage your model to choose parameters of smaller values.
  • Each objective function, together with the data, defines a surface of loss values over the possible parameter values - known as the loss surface
    • if time permits, you should experiment with different objective functions to see how your model’s behaviour changes

Learning Procedure

  • Learning procedures - the procedures that help your model find the set of parameters that minimise a given objective function for a given set of data - are diverse
    • K-means clustering uses an iterative procedure called expectation–maximisation algorithm
    • The most popular family of iterative procedures today is undoubtedly gradient descent, which updates the parameters in the direction that lowers the loss from its current value the most (against the gradient).
    • The function that determines how to update a parameter given a gradient value is called an update algorithm (or optimiser).
    • Good optimisers can both speed up your model's training process and help your model converge to a better set of parameters.
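    • A minimal gradient-descent sketch for linear regression with a mean-squared-error objective (the learning rate and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=200)   # true function: y = 3x + 1, plus noise

w, b, lr = 0.0, 0.0, 0.1                              # parameters and learning rate
for step in range(500):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)                   # d(MSE)/dw
    grad_b = 2 * np.mean(error)                       # d(MSE)/db
    w -= lr * grad_w                                  # the update rule: step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))                       # ≈ 3.0 and 1.0
```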

Machine Learning

| Supervised Learning | Unsupervised Learning | Semi-Supervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Provide input, output and feedback to build the model | Arrives at conclusions and patterns from unlabelled training data | Builds a model through a mix of labelled and unlabelled data, a set of categories, suggestions and example labels | Interprets an environment through a system of rewards and punishments, learned through trial and error, seeking maximum reward |
| Linear regression: sales forecasting, risk assessment | Apriori: sales functions, word associations, search | Generative adversarial networks: audio and video manipulation, data creation | Q-learning: policy creation, consumption reduction |
| Support vector machines: image classification, financial performance comparison | K-means clustering: performance monitoring, searcher intent | Self-trained naive Bayes classifier: natural language processing | Model-based value estimation: linear tasks, estimating parameters |
| Decision trees: predictive analytics, pricing | | | |

Machine Learning

| Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- |
| Machine learns explicitly; data with clearly defined output is given; direct feedback is given; predicts outcome/future; resolves classification and regression problems | Machine understands the data (identifies patterns/structures); evaluation is qualitative or indirect; does not predict/find anything specific | An approach to AI; reward-based learning; learns from positive and negative reinforcement; the machine learns how to act in an environment to maximise rewards |
| Input → training → output | Input → output | Input → output → rewards → loop |

Chapter 6 - Model Development and Offline Evaluation

  • Model development is an iterative process - After each iteration, you’ll want to compare your model’s performance against its performance in previous iterations and evaluate how suitable this iteration is for production.

Evaluating ML Models

  • Deep Learning isn’t going to replace all classical ML algorithms. Even in applications where neural networks are deployed, classic ML algorithms are still being used in tandem.
  • There are many possible ML solutions to any given problem. You need some knowledge of types of problem, and how they’re currently solved. E.g…
    • Text classification problem (classify whether it’s toxic or not)
      • Naive Bayes, logistic regression, recurrent neural networks, and transformer-based models such as BERT, GPT, and their variants.
    • Abnormality detection problem (fraud detection)
      • k-nearest neighbour, isolation forest, clustering, and neural networks
  • Considerations when evaluating a model:
    • Performance: accuracy, F1 score, and log loss
    • How much data are needed
    • How much compute needed
    • Time required to train
    • Inference latency
    • Interpretability (Non-neural network algorithms tend to be more explainable)
  • E.g. a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labeled data to start, it’s much faster to train, it’s much easier to deploy, and it’s also much easier to explain why it’s making certain predictions
  • The ML space is moving quickly - monitor trends at major ML conferences such as NeurIPS, ICLR, and ICML, as well as following researchers whose work has a high signal-to-noise ratio on Twitter
  • Six tips for model selection
    • 1) Avoid the state-of-the-art: there's a big difference between academia and industry. Often the simple solution is a good idea.
    • 2) Start with the simplest models: simple is better than complex.
      • Easier to deploy, allowing for faster validation.
      • Adding complexity step-by-step makes models easier to understand and debug
      • A simple model can be your baseline - a valuable comparison point for assessing more complex models.
    • 3) Avoid human biases in selecting models. Engineers get excited by architectures → spend more time on them. Investing 10x more time or iterations on a certain architecture is going to make comparisons unfair.
    • 4) Evaluate good performance now versus good performance later. Think about where you’ll be in a couple of months from now. Use learning curves (a plot of performance—e.g., training loss, training accuracy, validation accuracy—against the number of training samples it uses) to estimate performance gain from more data.
      • Take into account their potential for improvements in the near future, and how easy/difficult it is to achieve those improvements
    • 5) Evaluate trade-offs
      • False positives vs false negatives. For fingerprint unlocking you might prefer a model that makes fewer false positives (letting the wrong person in is worse than asking the right person to try again)
      • Compute requirement and accuracy. More complex models can be more accurate but take more compute and are therefore more expensive
      • Interpretability and performance trade-off. A more complex model can give a better performance, but its results are less interpretable.
    • 6) Understand your model’s assumptions
      • "All models are wrong, but some are useful" (George Box, 1976)
      • Every model comes with its own assumptions
      • Understanding what assumptions a model makes, and whether your data satisfies those assumptions, helps you evaluate which model works best for your use case
      • Common model assumptions:
        • Prediction assumption → prediction models assume it's possible to predict the output from the input
        • IID → neural networks assume examples are independent and identically distributed (independently drawn from the same joint distribution)
        • Smoothness → supervised ML assumes that if an input X produces an output Y, then an input close to X produces an output proportionally close to Y
        • Tractability → let X be the input and Z the latent representation of X; every generative model assumes it's tractable to compute the probability P(Z|X)
        • Boundaries → a linear classifier assumes decision boundaries are linear
        • Conditional independence → a naive Bayes classifier assumes the attribute values are independent of each other given the class
        • Normally distributed → many statistical methods assume the data is normally distributed

Ensembles

  • Ensemble: A system that uses multiple models. Each model in the ensemble is called a base learner.
  • Ensembling models can lead to better accuracy but it’s less favoured in production because they are more complex to deploy and harder to maintain. They are common when a small performance boost can lead to a huge gain.
  • You can have 3 models predicting the same class (SPAM, NOT SPAM) and take the majority vote. Makes much more sense if the models aren’t correlated.
  • 3 ways to ensemble: Boosting, Bagging and Stacking:
    • Bagging - short for bootstrap aggregating. You create different datasets called bootstraps by sampling with replacement, and train a model (classification or regression) on each bootstrap. Sampling with replacement ensures each bootstrap is created independently from its peers. For regression, the final prediction is the average of all models' predictions; for classification, it's the majority vote.
      • Designed to improve both the training stability and accuracy of ML algorithms.
      • A random forest is an example of bagging: a collection of decision trees constructed with both bagging and feature randomness - each tree can pick only from a random subset of features to use
    • Boosting - uses a chain of classifiers, where sample weights are changed based on how well the previous classifier predicted them. A final classifier is made using a weighted combination of the existing ones:
      1. Train the first weak classifier on the original dataset
      2. Re-weight samples based on how well the first classifier classifies them (misclassified samples are given higher weight)
      3. Train the second classifier on this re-weighted dataset. Your ensemble now consists of the first and the second classifiers.
      4. Samples are weighted based on how well the ensemble classifies them.
      5. Train the third classifier on this re-weighted dataset. Add the third classifier to the ensemble.
      6. Repeat for as many iterations as needed - form final strong classifier as a weighted combination of the existing classifiers—classifiers with smaller training errors have higher weights.
    • Stacking - train base learners from the training data then create a meta-learner that combines the outputs of the base learners to output final predictions. The meta-learner could be majority vote or averaging
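  • A compact scikit-learn sketch of all three ensembling styles on a synthetic dataset (the model choices and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),   # bootstrapped trees, majority vote
    "boosting": GradientBoostingClassifier(random_state=0),          # sequential re-weighted learners
    "stacking": StackingClassifier(                                  # meta-learner over base learners
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```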

Experiment tracking and versioning

  • Keep track of all the definitions needed to re-create an experiment and its relevant artefacts. An artefact is a file generated during an experiment
  • Experiment Tracking: tracking the progress and results
  • Versioning: logging all the details of an experiment for the purpose of possibly recreating it later or comparing it with other experiments
  • Experiment tracking
    • A large part of training an ML model is babysitting the learning processes - many issues can happen when learning
    • Things to track during training:
      • The loss curve corresponding to the train split and each of the eval splits.
      • Model performance on all non-test splits, such as accuracy, F1, perplexity.
      • Log of corresponding sample, prediction, and ground truth label. For ad hoc analytics and sanity check.
      • The speed of your model (steps per second or, if your data is text, the number of tokens processed per second)
      • Memory usage and CPU/GPU utilisation to identify bottlenecks
  • Versioning
    • You need to not only version your code but your data as well. Code versioning is standard in the industry but few do data versioning well.
    • Data versioning is challenging - data is large, so it can't be duplicated or diffed as easily as code
    • There’s still confusion in what exactly constitutes a diff when we version data - and another confusion is in how to resolve merge conflicts.
    • GDPR can make versioning and duplicating complicated
  • Aggressive experiment tracking and versioning helps with reproducibility, but it doesn’t ensure reproducibility
  • Debugging ML Models
    • Especially frustrating for the following three reasons
      • ML models fail silently.
      • It can be frustratingly slow to validate whether the bug has been fixed → you might have to retrain the model and wait until it converges to see whether the bug is fixed, which can take hours
      • Debugging ML models is hard because of their cross-functional complexity. There are many components in an ML system (data, labels, features, ML algorithms, code, infrastructure) that might be owned by different teams
    • Things that can cause an ML model to fail:
      • Theoretical constraints → models make assumptions about data and features; a model can fail if the data it learns from doesn't conform to its assumptions
      • Poor implementation → the bugs are in the implementation of the model; the more components a model has, the more things can go wrong
      • Poor choice of hyperparameters → the model is a good fit, but a poor set of hyperparameters renders it useless
      • Data problems → collection and pre-processing issues
      • Poor feature choice → too many features can cause overfitting or leakage; too few features might lack predictive power
    • Debugging should be both preventative and curative. Healthy practices are needed to minimise opportunities for bugs to proliferate.
    • Tips:
      • Start simple and gradually add more components
      • Overfit a single batch - to make sure it gets to the smallest possible loss. If it’s for image recognition, overfit on 10 images and see if you can get the accuracy to be 100%. If it can’t overfit a small amount of data, there might be something wrong with your implementation.
      • Set a random seed - Setting a random seed ensures consistency between different runs. It also allows you to reproduce errors and other people to reproduce your results.
  • Distributed Training
    • Expertise in scalability requires having regular access to massive compute resources.
    • It’s common to train a model using data that doesn’t fit into memory (CT scans, genome, large language models)
    • Techniques such as gradient checkpointing can help: "for feed-forward models, we were able to fit more than 10x larger models onto our GPU, at only a 20% increase in computation time"
    • Data parallelism → splitting data onto multiple machines, train your model on all of them, and accumulate gradients. This gives rise to a couple of issues.
      • As each machine produces its own gradient, if your model waits for all of them to finish a run—synchronous stochastic gradient descent (SGD)—stragglers will cause the entire system to slow down, wasting time and resources.
      • Spreading your model on multiple machines can cause your batch size to be very big.
    • Model parallelism → different components of your model are trained on different machines. Despite the name, this doesn't mean that the different parts of the model on different machines execute in parallel.
      • Pipeline parallelism is a clever technique to make different components of a model on different machines run more in parallel. The key idea is to break the computation of each machine into multiple parts.
  • AutoML
    • In 2018 Jeff Dean (Google) declared that Google intended to replace ML expertise with 100 times more computational power, introducing AutoML
      • Instead of paying a group of 100 ML researchers/engineers to fiddle with various models and eventually select a suboptimal one, why not use that money on compute to search for the optimal model
    • Soft AutoML: Hyperparameter tuning: A hyperparameter is a parameter supplied by users whose value is used to control the learning process, e.g., learning rate, batch size, number of hidden layers, number of hidden units, dropout probability, β1 and β2 in Adam optimiser, etc.
      • 2018 Paper “On the State of the Art of Evaluation in Neural Language Models” weaker models with well-tuned hyperparameters can outperform stronger, fancier models
      • Despite knowing its importance, many still ignore systematic approaches to hyperparameter tuning.
      • Most ML pipelines have some form of hyperparameter tuning.
      • When tuning hyperparameters, keep in mind that a model’s performance might be more sensitive to the change in one hyperparameter than another - sensitive hyperparameters should be more carefully tuned.
    • Hard AutoML: Involves architecture search → A search space, a performance estimation strategy, a search strategy.
      • Whether it’s architecture search or meta-learning learning rules, the up-front training cost is expensive enough that only a handful of companies in the world can afford to pursue them.
      • Auto ML is likely to improve off the shelf model performance from big companies.
      • More real-world tasks that were previously impossible with existing architectures will become solvable

Four Phases of ML Model Development

  1. Before Machine Learning → Start with non-ML solutions
  2. Simplest ML Models → validate the usefulness, easy to implement and deploy
  3. Optimise Simple Models → different objective functions, hyperparameter search, feature engineering, data and ensembles
  4. Complex Models → Experiment, look for more significant improvements, think about model degradation over time
  • Model Offline Evaluation
    • How do I know that our ML models are any good?
    • Baselines:
      • Random baseline → expected performance of a model that predicts at random
      • Simple heuristic → predictions based on simple heuristics
      • Zero-rule baseline → predicting the most common class (e.g. predicting that a user will next open the app they most commonly open)
      • Human baseline → how your model compares to human experts
      • Existing solutions → ML systems are often designed to replace existing solutions or business logic; compare against those
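    • A quick sketch (my own, assuming a binary churn-style task) of comparing a model against a zero-rule baseline with scikit-learn’s DummyClassifier:

```python
# Zero-rule baseline: always predict the most common class, then compare the real model to it.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("zero-rule F1:", f1_score(y_te, baseline.predict(X_te)))  # likely 0.0 on the rare class
print("model F1:   ", f1_score(y_te, model.predict(X_te)))
```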
  • Evaluation Methods
    • In academia researchers fixate on their performance metrics. In production, we also want our models to be robust, fair, calibrated.
    • Perturbation tests → make small changes (noise) to your clean test inputs to see how the model copes with noisy real-world data. Choose the model that works best on the perturbed data.
    • Invariance tests → certain changes to the inputs shouldn’t change the output (e.g. race information shouldn’t affect a mortgage decision). Better still, exclude sensitive information from the features used to train the model in the first place.
    • Directional expectation tests → changes in inputs should produce predictable directional changes in outputs. If predictions move in the other direction, investigate.
    • Model calibration → perhaps the single most important test of a forecast. Track how often your predictions come true - does that frequency match the probability the model outputs?
    • Confidence measurement → sample-level confidence metrics matter: do you want to show users predictions where the model is highly uncertain?
    • Slice-based evaluation → separate your data into subsets and look at your model’s performance on each subset. You might pay particular attention to accuracy on certain segments (like churn prediction for high-value clients).
  • Simpson’s paradox → a trend appears in several groups of data but disappears or reverses when the groups are combined.
  • Three ways to slice:
    • Heuristics-based → slice your data using domain knowledge of the data and the task at hand.
    • Error analysis → manually go through misclassified examples and find patterns among them.
    • Slice finder → use an automated slice-finding tool.
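  • A minimal heuristics-based slice evaluation sketch (the column names and segments are made up) - compute the metric per slice rather than only overall:

```python
# Accuracy per slice: an overall metric can hide a badly served segment (Simpson's paradox).
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "segment": ["high_value", "high_value", "high_value", "low_value", "low_value", "low_value"],
    "y_true":  [1, 0, 1, 1, 0, 0],
    "y_pred":  [0, 0, 0, 1, 0, 0],
})

overall = accuracy_score(df["y_true"], df["y_pred"])
per_slice = df.groupby("segment").apply(lambda g: accuracy_score(g["y_true"], g["y_pred"]))
print("overall:", overall)     # ~0.67 overall
print(per_slice)               # but high_value customers are served much worse (0.33)
```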

Chapter 7: Model Deployment and Prediction Service

  • Deploy: getting your model running and accessible. During model development, your model usually runs in a development environment.
  • Production means different things in different teams → For teams doing analysis - production might be notebooks and charts. For others it means keeping your models up and running for millions of users a day
  • Production is hard: latency, availability, accuracy, monitoring, alerting, debugging, releasing
  • How a model serves and computes the predictions influences how it should be designed, the infrastructure it requires, and the behaviours that users encounter.
    • Two types of inference (generating predictions): online prediction and batch prediction
    • On the users device (also referred to as the edge) or in the cloud

Machine Learning Deployment Myths

  • Myth 1: You Only Deploy One or Two ML Models at a Time
  • Myth 2: If We Don’t Do Anything, Model Performance Remains the Same
    • ML systems typically degrade over time - but can also suffer from quick distribution shifts
  • Myth 3: You Won’t Need to Update Your Models as Much
    • “How often should I update my models?” It’s the wrong question to ask. The right question should be: “How often can I update my models?”
    • Since a model’s performance decays over time, we want to update it as fast as possible.
  • Myth 4: Most ML Engineers Don’t Need to Worry About Scale
    • A small number of large companies employ the majority of ML engineers, and those companies’ applications operate at scale → most ML engineers do need to care about scale.

Batch Prediction Versus Online Prediction

  • Online or batch prediction is one of the more important decisions.
    • Online prediction (a.k.a. on-demand prediction) is when predictions are generated and returned as soon as requests for these predictions arrive.
    • Batch prediction is when predictions are generated periodically or whenever triggered. The predictions are stored somewhere and retrieved as needed.
      • Netflix might generate movie recommendations for all of its users every four hours, and the precomputed recommendations are fetched and shown to users when they log on to Netflix.
  • Features computed from historical data, such as data in databases and data warehouses, are batch features.
  • Features computed from streaming data—data in real-time transports—are streaming features.
  • Batch prediction (asynchronous) vs online prediction (synchronous):
    • Frequency → batch: periodical (e.g. every 4 hours); online: as soon as requests come
    • Useful for → batch: processing accumulated data when you don’t need immediate results; online: when predictions are needed as soon as a data sample is generated
    • Optimised for → batch: high throughput; online: low latency
  • Online prediction isn’t necessarily less efficient - Batch processing can be wasteful, you might be computing predictions that you don’t need (e.g. for users who won’t use your product before the next run)
  • From Batch prediction to Online Prediction
    • A problem with online prediction is that your model might take too long to generate predictions.
      • Batch prediction is computing predictions in advance and storing them in a database to be fetched when requests arrive. You don’t have to worry about how long a model takes to generate predictions
      • Batch prediction can bypass the latency of complex models (fetching a precomputed prediction is usually much faster than generating one on the fly)
      • Batch prediction makes you less responsive to users’ changing preferences. You also need to know what requests to generate predictions for in advance (a translation system couldn’t anticipate every query)
      • Batch prediction is a workaround for when online prediction isn’t cheap enough or isn’t fast enough.
    • Infrastructure is making online predictions faster and cheaper - it’s becoming the default
  • Two components are required to overcome latency issues of online prediction:
    • A real-time pipeline that can extract streaming features and input them into a model and quickly return a prediction. A streaming pipeline with real-time transport and a stream computation engine can help with that.
    • A model that can generate predictions at a speed acceptable to its end users (milliseconds)
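    • A bare-bones online prediction service sketch (FastAPI; the linear “model” and request schema are placeholders for a real model plus a streaming feature lookup):

```python
# Online (synchronous) prediction: the prediction is computed when the request arrives.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
WEIGHTS = [0.4, -0.2, 0.1]   # stand-in for a trained model loaded at startup

class PredictionRequest(BaseModel):
    features: List[float]    # in a real system, some of these would be streaming features

@app.post("/predict")
def predict(req: PredictionRequest):
    score = sum(w * x for w, x in zip(WEIGHTS, req.features))
    return {"score": score}

# Run with: uvicorn service:app --reload   (then POST JSON to /predict)
```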
  • Unifying Batch Pipeline and Streaming Pipeline → Early ML adopters were leveraging existing batch systems to make predictions. To use streaming features for online prediction they need to build a separate streaming pipeline
    • Having two different pipelines to process your data is a common cause for bugs in ML production.
      • Changes have to be applied carefully to both pipelines
    • Building infrastructure to unify stream processing and batch processing has become a popular topic - major infrastructure overhauls can unify batch and stream processing pipelines by using a stream processor like Apache Flink
    • Companies can use feature stores to ensure the consistency between the batch features used during training and the streaming features used in prediction.
  • Model Compression
    • Three main approaches to reduce its inference latency:
      • make it do inference faster (inference optimisation)
      • make the model smaller (model compression)
      • make the hardware it’s deployed on run faster
    • There are off-the-shelf utilities for model compression - they mostly use four techniques:
    • Low-rank factorisation → replace high-dimensional tensors with lower-dimensional tensors (e.g. compact convolutional filters).
    • Knowledge distillation → a small model (student) is trained to mimic a larger model or ensemble of models (teacher).
    • Pruning → either remove nodes of a neural net or set the least useful parameters to zero; neural nets are over-parameterised. Pruning was originally used for decision trees, where you remove sections of the tree that are uncritical and redundant for classification.
    • Quantisation → reduce a model’s size by using fewer bits to represent its parameters. The most general and commonly used compression method. It reduces memory footprint and improves computation speed, but rounding numbers introduces rounding errors (and can clip distributions). Low-precision training uses quantised data during training; fixed-point inference has become an industry standard.
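    • A sketch of post-training dynamic quantisation in PyTorch (toy model; just to show the mechanics of swapping float32 weights for int8):

```python
# Dynamic quantisation: store Linear-layer weights in int8, compute activations in float.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2)
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Smaller memory footprint and usually faster CPU inference, at the cost of rounding error.
x = torch.randn(1, 256)
print(model(x))
print(quantized(x))   # outputs should be close to the original, but not identical
```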

ML on the Cloud and on the Edge

  • On the cloud: a large chunk of computation is done on the cloud
  • On the edge: a large chunk of computation is done on consumer devices
    • browsers, phones, laptops, smartwatches, cars, security cameras, robots, embedded devices etc
  • The easiest way is to package your model up and deploy it via a managed cloud service such as AWS or GCP
    • Downsides to cloud deployment
      • Cost: ML models can be compute-intensive, and compute is expensive. The largest consumers can save money by running their own data centres
      • As cloud bills climb more companies are looking for ways to push their computations to edge devices.
  • Edge can be appealing
    • Don’t need a stable internet connection
    • Sensitive data doesn’t have to leave the device
    • Don’t have to transfer large datasets to the cloud in the first place
    • No network latency to worry about (network latency can be a bigger bottleneck than inference latency)
    • You might be able to reduce the inference latency
    • Edge computing makes it easier to comply with regulations, like GDPR
  • But edge devices need to have enough CPU, memory and battery
  • Hardware vendors tend to offer their own libraries that support only a narrow range of frameworks
  • Standard local optimisation techniques:
    • Vectorisation → instead of executing one item at a time, operate on multiple elements contiguous in memory at the same time to reduce latency caused by data I/O
    • Parallelisation → divide an input array (or n-dimensional array) into independent chunks and do the operation on each chunk individually
    • Loop tiling → change the data-access order in a loop to leverage the hardware’s memory layout and cache (hardware dependent)
    • Operator fusion → fuse multiple operators into one to avoid redundant memory access
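  • A small sketch (mine) contrasting an element-by-element loop with the vectorised equivalent in NumPy:

```python
# Vectorisation: process elements that are contiguous in memory in one call
# instead of looping over them one at a time in Python.
import numpy as np

x = np.random.rand(1_000_000)
w = np.random.rand(1_000_000)

# one item at a time
total = 0.0
for i in range(len(x)):
    total += x[i] * w[i]

# vectorised: the same dot product, typically orders of magnitude faster
total_vectorised = float(x @ w)
assert np.isclose(total, total_vectorised)
```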
  • Using ML to optimise ML models
    • Hand-designed heuristics are non-optimal and nonadaptive.
    • We can use ML to explore all possible ways to execute a computation graph, record the time they need to run, then pick the best one.
  • ML in Browsers
    • If you can run your model in a browser, you can run your model on any device that supports browsers.
    • There are tools that can help you compile your models into JavaScript, such as TensorFlow.js, Synaptic, and brain.js.
      • JavaScript is slow, and its capacity as a programming language is limited for complex logic such as extracting features from data.
      • A more promising approach is WebAssembly (WASM), an open standard that allows you to run executable programs in browsers. After you’ve built your models in PyTorch, TensorFlow, etc., you can compile them to WASM and run them in the browser.
      • WASM is faster than JavaScript but still slow compared to running code natively on devices
  • Deploying ML models is an engineering challenge, not an ML challenge.
  • I believe that ML systems will transition to making online prediction on-device

Chapter 8: Data Distribution Shifts and Monitoring

  • A model’s performance degrades over time in production. Once deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues
  • Causes of ML System Failures
    • Google studied 96 cases where a large ML pipeline at Google broke - 60 of the 96 failures happened due to causes not directly related to ML
      • Book Recommendation: Reliable Machine Learning
    • Software Specific Failures:
      • Dependency failure → a software package or codebase your system depends on breaks
      • Deployment failure → failures caused by deployment errors
      • Hardware failure → the hardware your system depends on fails
      • Downtime or crashing → a component of your system is down, causing the whole system to be down
    • ML-Specific Failures:
    • Production data differing from training data
      • it’s essential for the training data and the unseen data to come from a similar distribution
      • The underlying distribution of the real-world data is unlikely to be the same as the underlying distribution of the training data.
      • Second, the real world isn’t stationary. Things change. Data distributions shift. Another common failure mode is that a model does great when first deployed, but its performance degrades over time as the data distribution changes. This failure mode needs to be continually monitored and detected for as long as a model remains in production.
      • Some people have the impression that data shifts only happen because of unusual events, which implies they don’t happen often. Data shifts happen all the time, suddenly, gradually, or seasonally.
        • Gradually because social norms, cultures, languages, trends, industries, etc. just change over time
        • Seasonally because people behave differently at certain times of the year
    • Edge cases
      • An ML model that underperforms on edge cases might not be good enough (driverless cars)
      • Edge cases are the data samples so extreme that they cause the model to make catastrophic mistakes.
      • Outliers refer to data: an example that differs significantly from other examples.
      • Edge cases refer to performance: an example where a model performs significantly worse than other examples.
      • not all outliers are edge cases
      • It can be beneficial to remove outliers as it helps your model to learn better decision boundaries and generalise better to unseen data → but during inference, you don’t usually have the option to remove or ignore the queries that differ significantly from other queries.
    • Degenerate feedback loops
      • Feedback loop: time between a prediction being shown to when feedback on the prediction is provided.
      • A degenerate feedback loop: when the predictions themselves influence the feedback - which - influence the next iteration of the model. When a system’s outputs are used to generate the system’s future inputs, which, in turn, influence the system’s future outputs.
      • Predictions can influence how users interact with the system - those interactions might be used as training data - causing unintended consequences
      • Common in tasks with natural labels from users (recommender systems)
      • Popular movies, books, or songs can keep getting more popular, which makes it hard for new items to break into popular lists.
        • Many different names: “exposure bias,” “popularity bias,” “filter bubbles,” and sometimes “echo chambers.”
      • They can perpetuate and magnify biases embedded in data

      Detecting degenerate feedback loops

      • It’s possible to detect degenerate feedback loops by measuring the popularity diversity of a system’s outputs even when the system is offline.
      • Aggregate diversity / average coverage of long-tail items / hit rate against popularity → can help you measure the diversity of outputs of a recommender system.
        • if a recommender system is better at recommending popular items - it likely suffers from popularity bias
        • If predictions become more homogeneous over time, it likely suffers from degenerate feedback loops

      Correcting degenerate feedback loops

      • Use randomisation → Introducing randomisation into predictions can reduce homogeneity. Show them items the model ranks highly and some random others - let their selections guide future decisions. Can improve diversity - but at the cost of user experience.
      • Use positional features → if the position in which a prediction is shown affects its feedback in any way, encode the position information using positional features. This lets your model learn how much each position influences user actions
  • Data Distribution Shifts
    • Data distribution shift: the phenomenon in supervised learning when data changes over time, which causes this model’s predictions to become less accurate as time passes.
    • Types of Distribution Shifts:
    • Covariate shift → the input distribution P(X) changes, but the conditional distribution of outputs given inputs P(Y|X) stays the same
    • Label shift / prior shift → the output distribution P(Y) changes, but for a given output the input distribution P(X|Y) stays the same
    • Concept shift → same input, different output: P(Y|X) changes. In many cases concept drifts are cyclic or seasonal.
    • Nearly every real world dataset suffers from covariate shift
    • Covariate shifts can be caused by:
      • Biases during data selection
      • Training data is artificially altered to make it easier to learn.
      • Learning process through active learning (instead of randomly selecting samples to train a model on, we use the samples most helpful to that model)
      • usually because of major changes in the environment or in the way your application is used
    • If you know in advance how the real-world input distribution will differ from your training input distribution, you can leverage techniques such as importance weighting to train your model to work for the real-world data.
      • Importance weighting
        • Estimate the density ratio between the real-world input distribution and the training input distribution, weight the training data by this ratio, and train the model on the weighted data (see the sketch below)
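        • A sketch of one common way to estimate the density ratio (not necessarily the book’s exact method): train a classifier to distinguish training data from unlabelled real-world data and use p(real) / p(train) as the weight:

```python
# Density-ratio estimation via probabilistic classification (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(5000, 3))   # training inputs
X_world = rng.normal(0.5, 1.0, size=(5000, 3))   # real-world inputs (shifted)

X = np.vstack([X_train, X_world])
domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_world))]   # 0 = train, 1 = world

clf = LogisticRegression(max_iter=1000).fit(X, domain)
p_world = clf.predict_proba(X_train)[:, 1]
importance_weights = p_world / (1.0 - p_world)   # pass as sample_weight when retraining
```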
  • General Data Distribution Shifts
    • More things that can degrade models:
    • Feature change: such as when new features are added, older features are removed, or the set of all possible values of a feature changes.
    • Label schema change is when the set of possible values for Y change.
    • With classification tasks, label schema change could happen because you have new classes. Classes can also become outdated or more fine-grained.
  • There’s no rule that says that only one type of shift should happen at one time
  • Detecting Data Distribution Shifts
    • Data distribution shifts are only a problem if they cause your model’s performance to degrade.
      • monitor your model’s accuracy-related metrics: accuracy, F1 score, recall, AUC-ROC
      • Having access to labels within a reasonable time window will vastly help with giving you visibility of data shifts
      • When ground truth labels are unavailable or too delayed to be useful, we can monitor other distributions of interest. In industry most detection methods focus on detecting changes in the input distribution - especially the distribution of features.
    • Statistical Methods:
      • Summary statistics: min, max, mean, median, variance, the 5th/25th/75th/95th percentiles, skewness, kurtosis
      • Two-sample tests - to determine whether the difference between two populations (e.g. yesterday’s data vs. today’s) is statistically significant
        • Kolmogorov-Smirnov test (KS test)
        • Least-Squares Density Difference
        • MMD - Maximum Mean Discrepancy
      • It’s more worrying if a difference is detectable from a small sample - it suggests a serious shift
      • Two-sample tests work better on low-dimensional data (reduce dimensionality before performing one)
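      • A sketch of a two-sample KS test on a single feature (synthetic data; the “training” and “serving” windows are made up):

```python
# Two-sample Kolmogorov-Smirnov test: training window vs. serving window for one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # reference distribution
serving_feature = rng.normal(loc=0.3, scale=1.0, size=5000)  # current traffic (mean shifted)

result = ks_2samp(train_feature, serving_feature)
if result.pvalue < 0.01:
    print(f"possible covariate shift: KS statistic={result.statistic:.3f}, p={result.pvalue:.1e}")
```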
    • Time scale windows for detecting shifts
      • Shifts can happen across two dimensions. Temporal or Spatial.
      • To detect temporal shifts - you can treat input data as time-series data
        • the timescale you look at affects the shifts you can detect
        • detecting temporal shifts is hard when shifts are confounded by seasonal variation
      • Be cautious of cumulative statistics - as they may contain data from previous time windows
      • Monitor your distributions (e.g. hourly or daily): the shorter the time scale, the faster you can detect changes, but too short a window can lead to false alarms
  • Addressing Data Distribution Shifts
    • Many companies assume that data shifts are inevitable, so they periodically retrain their models—once a month, once a week, or once a day—regardless of the extent of the shift.
    • The optimal frequency to retrain your models is an important decision - many companies still determine based on gut feelings instead of experimental data.
    • To make a model work with a new distribution in production, there are three main approaches.
      1. Train models on massive datasets - hoping that with enough data the model can learn whatever it needs to do well in production
      2. Adapt a trained model to a target distribution without requiring new labels. Heavily under-explored and hasn’t found wide adoption in industry.
      3. Retrain your model using the labeled data from the target distribution. Retraining can mean retraining your model from scratch on both the old and new data or continuing training the existing model on new data. The latter approach is also called fine-tuning.
    • If retraining your model - there are two big questions:
      1. Stateless or Stateful
        • Stateless retraining - train from scratch
        • Stateful training (fine-tuning) - continuing training the existing model on new data
      2. What data to use? last 24 hours, last week, last 6 months, or from the point when data has started to drift?
        • Run experiments to figure out which retraining strategy works best for you.
    • You can design your system to make it more robust to shifts. A system uses multiple features, and different features shift at different rates.
      • When choosing features consider the trade-off between the performance and the stability of a feature: a feature might be really good for accuracy but deteriorate quickly, forcing you to train your model more often.
      • You might also want to design your system to make it easier for it to adapt to shifts. E.g. a separate model for each market, you can update each of them only when necessary
    • Detecting a data shift is hard, but determining what causes a shift can be even harder
  • Monitoring
    • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
    • Categories of: Networking / Machine Performance / Application Performance
    • Examples of: latency, throughput, # prediction requests, % requests return with a 2xx code, CPU/GPU utilisation, memory utilisation, availability, uptime.
      • SLOs: service level objectives
      • SLAs: service level agreements
    • ML Specific Metrics:
      • Four things to monitor: accuracy-related metrics, predictions, features, and raw inputs
      • The further through the ML pipeline, the more transformations the data has gone through, but the more structured it has become
      • Accuracy:
        • User feedback: click, hide, purchase, upvote, downvote, favourite, bookmark, share etc
        • Also collect explicit user feedback where you can (e.g. letting users suggest an alternative translation)
      • Predictions:
        • Low dimensional predictions are easy to monitor and visualise
        • Monitor for distribution shifts using two sample tests
        • Prediction distribution shifts are a proxy for input distribution shifts
        • Monitor for anything odd (e.g. an unusual number of FALSE predictions in a row)
      • Features:
        • Compared to raw input data - features are well structured following a pre-defined schema
        • Feature validation → ensuring all your features follow the expected schema
          • Min, Max, Median values within an acceptable range
          • Value of feature satisfy a regular expression format
          • If all values of a feature belong to a predefined set
          • If the values of a feature are always greater than the values of another feature
        • Table testing or table validation (as features are usually stored in tables)
          • Great Expectations and Deequ are feature validation libraries (a hand-rolled sketch follows after this list)
        • Feature monitoring concerns:
          • Can be expensive if you have 100s of models with 100s of features
          • It’s not that useful for detecting model performance degradation - you could be overwhelmed with alerts
          • Feature processing is often done in multiple steps, with multiple libraries, on multiple services - it can be hard to tell whether a change is caused by a processing error or by a change in the input distribution
          • The schema that your features follow can change over time
      • Raw inputs:
        • Not easy to monitor
        • Sometimes impossible to get access to
        • ML engineers usually only query the data warehouse
        • Might be the responsibility of the ML platform team
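      • The hand-rolled feature validation sketch referenced above (plain pandas checks, not the Great Expectations or Deequ APIs; column names and ranges are made up):

```python
# Minimal feature validation: check that features follow an expected schema
# before they reach the model.
import pandas as pd

features = pd.DataFrame({
    "age": [34, 52, 28],
    "country": ["GB", "US", "FR"],
    "clicks_7d": [3, 0, 12],
})

def validate(df: pd.DataFrame) -> list:
    errors = []
    if not df["age"].between(0, 120).all():
        errors.append("age outside acceptable range")
    if not df["country"].isin({"GB", "US", "FR", "DE"}).all():
        errors.append("country not in predefined set")
    if (df["clicks_7d"] < 0).any():
        errors.append("clicks_7d must be non-negative")
    return errors

print(validate(features) or "all feature checks passed")
```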
    • Monitoring Toolbox: Logs, Dashboards, Alerts
      • Logs: recording events produced at runtime
        • The number of logs grows quickly. Detecting where a problem occurred is harder than detecting that something happened.
        • Tracing helps you find things later and follow threads → each process gets a unique ID that lets you search the logs for it
        • For each event, record all the metadata needed to reconstruct what happened
        • Companies use ML to analyse logs: to detect anomalies and prioritise them
      • Dashboards: visualising relationships
        • Helps spot patterns
        • Makes monitoring accessible to nonengineers
        • Excessive metrics on a dashboard can also be counter-productive (dashboard rot)
      • Alerts: Alerting the right people to suspicious signals
        • Alert Policy: threshold breach for each metric (sometimes over a duration)
        • Notification channel: slack, pager duty, email
        • A description of the alert
          • Helps the alerted person know what’s going on
          • Make the alert actionable by providing instructions or a runbook
      • Alert fatigue is real, demoralising and dangerous.
  • Observability
    • Observability: setting up the system to get visibility into our system to help us investigate when something goes wrong
    • Systems are getting more complex - many components and services.
    • Observability - concept from control theory - bringing better visibility into understanding complex behaviour of software using outputs collected from the system at runtime
    • Observability → assumes the internal state of a system can be inferred from knowledge of external outputs
      • Instrumenting in a way to ensure that sufficient information is collected and analysed
      • Show me all the users for which model a returned wrong predictions over the last hour
    • Monitoring is passive - observability is more active

Chapter 9 - Continual Learning and Test in Production

  • Continual learning is largely an infrastructure problem
  • This usually means updating models in micro-batches - e.g. after every 512 or 1,024 examples
  • Updated models shouldn’t be deployed until validated
  • Only replace the existing model with the new model if the updated one is better
    • Existing model = champion
    • Updated replica model = challenger
  • You don’t have to update models often if you don’t have enough traffic - or your models don’t decay quickly

Stateless Retraining Vs Stateful Training

  • Stateless → training from scratch
  • Stateful → allows you to update your model with less data
  • Most companies do stateless (requires more data)
  • Stateful requires less data, converges faster and requires less compute power
  • With stateful training you might be able to avoid storing data altogether (each sample is used once to update the model)
  • Stateless retraining requires storing historical training data - which isn’t always possible due to privacy concerns; stateful training sidesteps this
  • Most companies doing stateful training will occasionally train their model from scratch too
  • Once your infrastructure is set up to allow both stateless and stateful training, training frequency is just a knob to twist
  • Continual learning → setting up infrastructure so you can update models whenever needed
    • Model Updates: new features, models or architecture
    • Data iteration: refreshing only the data
  • Continual learning can help your model:
    • adapt to data shifts
    • adapt to sudden rare events
    • overcome the continuous cold start problem (making predictions for a new user without historical data)
      • Can be new users, or because somebody switched device, or isn’t logged in
  • Goal - have models adapt to the user within each visiting session (TikTok)
  • Continual Learning Challenges:
    • Getting access to fresh data → often means building a real-time transport pipeline (as data-warehouses are likely to be too slow)
    • Speed at which you can label data → natural labels are the best candidates as they have shorter feedback loops
      • Dynamic pricing, stock price prediction, recommender systems, ad-click through estimation
      • Where behavioural signals become the labels
      • You might be able to use programmatic labelling too
    • Evaluation challenge → Making sure the updated model is good enough to be deployed
      • More you update the more chances there are for it to fail
      • Continual learning makes your models more susceptible to coordinated manipulation and adversarial attack
      • You need to test models before you deploy them. Evaluation takes time - can be another bottleneck for model update frequency.
    • Algorithm challenges → it is easier to adapt models like neural networks to the continual learning paradigm than matrix-based and tree-based models
      • Hoeffding Tree → is an exception
  • Four Stages of Continual Learning in Organisations
    1. Manual - Stateless retraining
      • Focus is on deploying new models.
      • Retraining is manual and ad hoc.
      • Existing models are retrained from scratch (stateless) only when they degrade enough to become a priority for the team
    2. Automated retraining
      • A few models are now in production
      • Maintenance and improvement of existing models is now just as important as working on new models
      • Retraining frequency based on gut feeling
      • Scripts are created to automate the retraining process
      • Some multi-model systems are built so each model can be trained at different frequencies
      • Script automation stages:
        • Pull data, downsample or upsample, extract features, process or annotate labels, kick off the training process, evaluate the new model, deploy it
      • Feasibility of automation relies on: a scheduler, data availability, and a model store
        • Scheduler: a tool that handles task scheduling (e.g. cron jobs)
        • Data availability is likely to take most of your time
        • Model store is needed to version and store your models and their artefacts.
          • E.g. Amazon SageMaker
      • Training is still stateless → expensive if you set higher frequencies
    3. Automated - stateful training
      • Training continues where the previous model left off
      • You need to track your data and model lineage carefully
      • You need real-time transports instead of the data warehouse - mature streaming infrastructure / a streaming pipeline
      • Updated on a fixed schedule set by developers
    4. Continual Learning
      • Instead of a fixed schedule, models are automatically updated whenever data distributions shift and the model’s performance drops
      • Combining continual learning with edge deployment might be the best
      • You need a mechanism to trigger updates
        • Time based, performance based, volume based or drift based.
      • You need a monitoring solution and a strong evaluation pipeline
  • How often to update your models
    • You need to figure out how much you gain from updating your model.
    • Value of data freshness → to gain a sense of the performance gain you can get from fresher data, train your model on data from different time windows in the past and test on data from today to see how the performance changes
    • Model iteration vs data iteration → do both from time to time - the more resources you spend on one approach the fewer resources you’ll have for the other
    • In the beginning - when updating your model is manual and slow do it as often as you can
      • As infrastructure matures - it can be done in hours or minutes the question becomes → how much performance gain would I get from fresher data?
  • Test in Production
    • To sufficiently evaluate your model you need a mixture of offline and online evaluation. The only way to know if a model will do well in production is to deploy it.
    • Offline evaluation has two major test types: test splits and backtests
      • Test Splits → static trusted benchmark to compare models.
        • If the distribution has changed it won’t tell you much
      • Back Test → Try testing on the most recent data you have (last hour).
    • Types of Production Testing:
    • Shadow deployment → deploy the candidate in parallel, route every request to both models, and log the new model’s predictions for analysis. The new model’s predictions aren’t served to users, so it’s very safe - but it’s expensive and doubles compute cost.
    • A/B testing → deploy the candidate model, route a % of traffic to it, and use its predictions; monitor performance and user feedback / behaviour. Make sure the split is randomised and that you run enough volume for differences to be detectable. Book recommendation: Trustworthy Online Controlled Experiments - Ron Kohavi.
    • Canary release → slowly roll out the change to a small subset of users before rolling it out to the entire infrastructure. Deploy → route some traffic → if performance is OK, increase → stop when it serves 100%.
    • Interleaving experiments → expose a user to recommendations from two models at the same time, controlling for the position of recommendations (since position affects the likelihood of a click).
    • Bandits → experiment to see which model has the highest payout over time, routing traffic based on relative model performance. Requires short feedback loops, and uses less data than A/B testing before reaching a decision.
    • You can use contextual bandits to balance showing users items they will like and showing items you want feedback on.
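    • A toy epsilon-greedy “bandit” routing sketch (my own; real systems use more careful algorithms and the reward signal here is assumed to be click / no-click):

```python
# Epsilon-greedy routing between two models based on observed reward.
import random

models = ["champion", "challenger"]
counts = {m: 0 for m in models}
rewards = {m: 0.0 for m in models}
EPSILON = 0.1

def choose_model() -> str:
    if random.random() < EPSILON or not all(counts.values()):
        return random.choice(models)                               # explore
    return max(models, key=lambda m: rewards[m] / counts[m])       # exploit best average reward

def record_feedback(model: str, reward: float) -> None:
    counts[model] += 1
    rewards[model] += reward

# usage: route each request, then log the feedback once it arrives
m = choose_model()
record_feedback(m, reward=1.0)   # e.g. the user clicked
```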

Chapter 10: Infrastructure and tooling for MLOps

  • Infrastructure is the set of fundamental facilities and systems that support the sustainable development and maintenance of ML systems
  • If set up well, infrastructure helps automate processes, reduces the need for specialised knowledge and engineering time, and speeds up the development and delivery of ML applications
    • Setup badly - and it’s hard to use and will slow you down
  • Each company’s infrastructure needs are different:
    • Single ML app → no real infrastructure needed (Jupyter notebooks, Python, pandas). Low investment.
    • Multiple common apps → generalised ML infrastructure works. Medium investment. Most companies doing ML are here.
    • Serving millions of requests per hour → highly specialised infrastructure. High investment.
  • Fundamental facilities that support the development and maintenance of ML systems
    1. Storage and compute → data is collected and stored; the compute layer provides compute for ML workloads such as training models, computing features, and generating predictions
    2. Resource management → tools to schedule and orchestrate workloads to make the most of available resources (Airflow, Kubeflow, Metaflow)
    3. ML platform → tools to aid the development of ML applications: model stores, feature stores, monitoring tools (SageMaker, MLflow)
    4. Development environment → where code is written and experiments are run; code needs to be versioned and tested, experiments need to be tracked
    • Layers nearer the top of this list are more commoditised; layers nearer the bottom are closer to data scientists’ day-to-day work
  • Storage and Compute
    • Storage Layer: data is collected and stored. (Amazon S3 or Snowflake). The storage layer is probably in the cloud - but can be on premises.
    • Compute layer: all the compute resources a company has access to and the mechanism to determine how these resources can be used. Most likely AWS or GCP.
      • The compute layer is usually divided into compute units. A unit can be created for a short-lived job (AWS Step Functions or GCP Cloud Run) and is eliminated after the job finishes.
      • A more permanent compute unit is an ‘instance’
    • Some compute layers abstract away the notions of cores and use other units of computation.
      • Kubernetes uses ‘pods’ → Spark and Ray use ‘job’
    • Compute units have memory (e.g. in GB) and operation speeds (FLOPS or cores)
    • Utilisation = FLOPs actually used by a job ÷ the peak FLOPs the hardware can deliver
      • It’s impossible to run at 100% - 50% might be OK depending on what you’re doing
    • The cloud makes it easy to start building without having to worry about the compute layer. It’s appealing if your company has variable-sized workloads - data science workloads go up and down.
      • Cloud compute is elastic but not magical - you might have limits, or costs might get prohibitive
    • Companies with a huge cloud bill might consider moving workloads back to their own data centres - this is called ‘cloud repatriation’ → but this is hard and requires a big up front investment.
    • Most multi-cloud strategies happen by accident, not by design. In theory it would be nice to leverage the cheapest compute and avoid vendor lock-in, but it’s really hard to move data and orchestrate workloads across clouds.
  • Development Environment
    • Where code is written, experiments are run and interaction with production for testing too.
    • Dev Environment = IDE + Versioning + CI/CD
    • If you do one thing well - make it your development environment - it is where engineers work
    • Versioning is more important for ML because…
      • there’s so much you can change (code, parameters, the data itself)
      • you need to keep track of prior runs to reproduce later on
    • The IDE can be native (VS Code or Vim) or browser based (AWS Cloud9). Many data scientists also use notebooks like Jupyter or Google Colab. Notebooks are stateful - they retain state after runs → you only need to load your data once.
    • Standardise your dev environment (company-wide if you can, if not team-wide).
    • You might want to use a cloud IDE (AWS Cloud9 or Amazon SageMaker Studio)
    • Moving from local dev environments to the cloud has benefits:
      • IT support is easier
      • It makes remote work easier
      • It can help with security
      • Reduces the gap between cloud production environment and your dev environment
        • You might have to move to cloud anyhow as some data can’t be downloaded and stored on local machines.
  • From Dev to Prod: Containers
    • In production you dynamically allocate instances as needed - your environment is stateless.
      • You need to install dependencies using a list of pre-defined instructions
      • Container technology (like Docker) helps with this. A Dockerfile contains instructions to install packages and create an environment in which your model can run.
        • Docker Image: what you get if you run everything in a docker file
        • Docker container: what you get if you run the docker image
        • The docker file is the recipe to construct a mold, from the mold you can create multiple running instances - each is a docker container
      • You can build a docker image either from scratch or from another docker image.
    • You’ll need more than one container - at least one for training and one for Featurising.
    • If you have 100 micro services you might have 100 containers. Container orchestration tools help you manage them (Docker Compose). Kubernetes is a tool that creates a network of containers to communicate and share resources. Helps you spin up more instances when you need more compute / memory and shuts down containers when you don’t need them.
  • Resource Management
    • Resource management used to be about managing finite compute power. With the elasticity of the cloud, the concern has shifted to cost-effectiveness.
    • Engineers’ time is more expensive than compute time, so it typically makes sense to invest in automating what you can
  • Cron, Schedulers, and Orchestrators
    • ML workflows are influenced by their repetitiveness and dependencies
      • repetitive tasks can be scheduled and orchestrated to run smoothly and cost-effectively using available resources
    • Cron → schedules a job to run at a pre-determined time
    • ML workflows might have complex dependencies - with each step depending on the previous.
      • DAG (directed acyclic graph) is a common way to represent workflows and dependencies
    • Schedulers are cron programs that can handle dependencies - leveraging queues to keep track of jobs. Slurm is a popular one. They can also optimise for resource utilisation.
      • Google’s Borg estimates how many resources a job will need and reclaims unused resources for other jobs
    • Orchestrators are concerned with where to get the resources for jobs: they can increase the number of instances in a pool that needs more, provisioning extra machines to handle the workload. Kubernetes is an orchestrator.
  • Data Science Workflow Management
    • Workflows can be defined using either code (Python) or configuration files (YAML). Each step in a workflow is called a task.
    • Almost all workflow management tools come with schedulers: Airflow, Argo, Prefect, Kubeflow, and Metaflow (a sketch of a simple Airflow DAG follows below).
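    • A sketch of a daily retraining workflow expressed as an Airflow DAG (the task names and empty task bodies are placeholders; newer Airflow versions use `schedule=` instead of `schedule_interval=`):

```python
# A daily retraining workflow as an Airflow DAG: each task depends on the previous one.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_data(): ...
def extract_features(): ...
def train_model(): ...
def evaluate_model(): ...

with DAG("daily_retrain", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="pull_data", python_callable=pull_data)
    t2 = PythonOperator(task_id="extract_features", python_callable=extract_features)
    t3 = PythonOperator(task_id="train_model", python_callable=train_model)
    t4 = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    t1 >> t2 >> t3 >> t4   # the dependency chain forms a directed acyclic graph
```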

ML Platform

  • Once you are spending a lot of time on feature management, model management and monitoring across multiple teams - it might be time for an ML platform team.
    • The more ML tools you have - the more you have to gain from standardisation and reducing your support costs
  • Considerations → does it play well with your cloud provider, is it open source or a managed service?
  • Model Deployment
    • Once trained and tested you want to make your model available to users
    • Push your model and its dependencies to a location accessible in production and expose it as a service / endpoint
    • Deployment services can help: AWS Sagemaker, GCP Vertex AI, Azure ML.
    • Consider if it can do online and batch prediction
    • Check that you can easily verify the quality of the models you deploy
  • Model Store
    • They sound simple, so many companies dismiss them
    • To help with debugging and maintenance, it’s important to track as much information associated with a model as possible
    • Things to track
      • Model definition
      • Model parameters
      • Featurise and predict functions
      • Dependencies
      • Data
      • Model Generation Code
        • Frameworks, training steps, how train/valid/test splits were created, the number of experiments run, the range of hyper-parameters considered, the actual set of hyper-parameters that final model used
      • Experiment artefacts
      • Tags (to help with model discovery)
    • MLflow is the most popular model store that isn’t associated with a major cloud provider (see the logging sketch below)
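    • A minimal sketch of logging a run to MLflow (the toy model, parameter, and tag are made up) so the model and its metadata land in a model store:

```python
# Log a model plus its parameters, metrics, and tags to MLflow's local tracking store.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("C", model.C)                       # hyperparameters used
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.set_tag("use_case", "churn")                  # tags help model discovery
    mlflow.sklearn.log_model(model, artifact_path="model")   # model definition + dependencies
```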
  • Feature Store
    • Three problems they aim to solve: feature management, feature computation, feature consistency
    • Feature management → many ML models with many features, and some features are useful for other models. Feature stores help teams share and discover features, and manage sharing settings for each feature. Like a feature catalogue.
    • Feature computation → feature engineering logic, once defined, needs to be computed: actually looking into your data and computing the features. A feature store can perform feature computation and store the results - like a mini data warehouse.
    • Feature consistency → unify the logic for batch features and streaming features, ensuring consistency between features during training and features during inference.
    • SageMaker and Databricks have their own feature stores.
  • Build vs Buy
    • At the limit:
      • Outsource all ML to a company that provides E2E ML applications. Then the only thing you need is to move data.
      • Build and maintain all your infrastructure in-house, even having your own data centres
    • Considerations:
      • The stage your company is at: start with vendor solutions before investing in your own
      • Is it the focus of your company? Your competitive edge? If it is then build.
      • The maturity of the available tools → in the early days of ML adoption, the pioneers had to build their own tooling because no available solutions were mature enough.

Chapter 11: The Human Side of Machine Learning

  • UX considerations for ML systems
    • they are probabilistic instead of deterministic
    • they are mostly correct but we can’t tell when
    • latency can be an issue
  • Inconsistency in experience can be a hindrance. There can be a ‘consistency-accuracy’ trade-off, as the most accurate recommendations might not be the most consistent ones.
  • Mostly correct predictions are OK for those users who can easily correct them → they aren’t useful if users don’t know how to or can’t correct responses.
  • You can show users more predictions - and allow them to choose. Predictions should be rendered in a way that a non-expert can evaluate them.
    • Human-in-the-loop → humans picking the best predictions or making the ultimate decision
  • Smooth Failing → If a model takes too long to respond - you can fall back to a basic heuristic (or cached or precomputed predictions).
    • ‘Speed-accuracy’ tradeoff → a model might have worse performance than another model but can do inference much faster. If latency is crucial, a faster less accurate model might be preferred.

Team Structure

  • ML systems don’t work without subject matter expertise. You need it throughout the process, not just in the labelling phase
  • SMEs help with: problem formulation, feature engineering, error analysis, model evaluation, reranking predictions, and user interface design (how best to present the results)
  • Think about how to explain the ML algorithm’s limitations and capabilities to the user
  • No-code / low-code solutions enable subject matter experts to take the reins

End-to-End Data Scientists

  • There’s a lot of infrastructure work in ML systems - should the data scientist do it all?
  • Having a separate team manage production:
    • Makes hiring easier, as each role needs a narrower set of skills
    • Makes life easier for each person involved (they only have to focus on a single concern)
    • Drawbacks:
      • Communication overhead, debugging challenges, finger-pointing, narrow context, missed end-to-end optimisation opportunities
  • Data Scientists own the entire process
    • It’s a lot of ground to cover for a data scientist
    • Might end up doing too much production, not enough data science
    • Infrastructure requires a very different set of skills - in practice the more time you spend learning one the less time you’re spending on learning the other
    • For data scientists to own the entire process -they need great tooling / infrastructure - to abstract them away from thinking about containerisation and distributed processing.

Responsible AI

  • Designing, developing and deploying AI systems with good intention and sufficient awareness to empower users, to engender trust and ensure fair and positive impact to society
    • Fairness, privacy, transparency, accountability
  • Book recommendation: Weapons of Math Destruction
  • Concretely implementing ethics, safety, and inclusivity into your ML systems
  • image
  • The AI incident database → logging the incidents that have come into public awareness.
  • Ofqual example (automatically estimating grades for students in the UK):
    • Failure to set the right objective → the objective wasn’t grading accuracy for individual students but fitting the predicted grade distribution of each school. Bad news if you performed well at a historically weak school!
    • Failure to perform fine-grained evaluation to discover potential biases → there wasn’t enough data for small schools, so they defaulted to teacher-assessed grades. Those were higher, and the smaller schools were disproportionately private.
    • Failure to make the model transparent → failed to make aspects of their auto-grader public before it was too late. Didn’t open themselves up for scrutiny
  • Strava revealing the locations of military bases and patrol routes
    • US personnel likely didn’t know they were sharing their data with Strava, as sharing was on by default and the privacy settings were unclear.

A Framework for Responsible AI

  • Discover Sources of Model Biases
    • Training Data → must be representative of the real world
    • Labelling → human annotators can encode biases that your model will then scale
    • Feature engineering → Don’t use sensitive information that you don’t want the model to learn on.
    • Model objective → pick an objective that enables fairness - is your model doing better for certain groups?
    • Evaluation → are you performing adequate, fine grained evaluation to understand your models performance on different groups?
  • Understand the limitations of a data-driven approach
    • Don’t rely on data too much - put in effort to understand your blind spots and bias
    • Cross over disciplinary and functional boundaries
  • Understand the trade-offs between different desiderata
    • Improving one property can cause other properties to degrade. E.g…
      • Privacy vs accuracy tradeoff → differential privacy - the higher the level of differential privacy the lower the model accuracy will likely be
      • Compactness vs fairness trade-off → you can prune a model’s size while maintaining high overall accuracy, but accuracy on outliers and underrepresented groups can suffer disproportionately
  • Act early - don’t bypass ethical issues to save cost and time. Surface risks early and delay deployment if you need to.
    • The earlier you can think about responsibility the better
    • NASA - the cost of errors goes up by an order of magnitude at every stage of your project lifecycle
  • Create model cards, documenting:
    • Model details → basic information about the model: who developed it, model date, model version, model type, information about training algorithms, parameters, fairness constraints or other approaches, features, paper or other resources, citation details, licence, where to send questions or comments
    • Intended use → use cases envisioned during development: primary intended uses, primary intended users, out-of-scope use cases
    • Factors → demographic or phenotypic groups, environmental conditions, technical attributes, etc.: relevant factors, evaluation factors
    • Metrics → model performance measures, decision thresholds, variation approaches
    • Evaluation data → where possible: datasets, motivation, preprocessing
    • Training data → where possible, mirror the evaluation data details
    • Quantitative analyses → unitary results, intersectional results
    • Ethical considerations, caveats, and recommendations
  • Establish a process for mitigating biases (e.g. Google Responsible AI)
  • Stay up-to-date on responsible AI