Designing Machine Learning Systems

Author: Huyen Chip
Year: 2022

Review

The book focuses on the important considerations of deploying ML Systems in industry. It is more practical than academic, which is rare for a machine learning book and great news for product managers. There’s just enough depth and detail to help you understand the key tradeoffs.

The machine learning space is moving fast, so any reference to best practice or tooling is going to age quickly. This book focuses on the more timeless problems and areas of interest.

100,000 words later you’ll have a good understanding of designing Machine Learning systems and some foundational vocabulary that will help you navigate this space.


Key Takeaways

The 20% that gave me 80% of the value.

  • Building and deploying machine learning systems is complicated. More stages, stakeholders and components than traditional systems. The actual ML algorithm is only a small part of a production ML system
  • Machine Learning is an approach to learning complex patterns from existing data - and using those patterns to make predictions on unseen data.
  • They need something to learn → there must be clear patterns in the data.
  • What is complex for machines is different from what is complex to humans.
  • Use ML solutions when: patterns and tasks are repetitive, the cost of wrong predictions is small, you're making predictions at scale (required to justify the investment), and the patterns are changing.
    • Note - radical changes in data distributions will still require human intervention
  • Don’t use ML solutions when simpler solutions can do the trick. Always start with a non-ML solution
  • Typical use cases: Search engines, recommender systems, suggestions, translation, assistants, health monitoring, fraud detection, price optimisation, demand forecasting, churn prediction, support ticket classification, sentiment analysis.
  • ML in academia and ML in industry look really different
  • Research vs Production:

    | Research | Production |
    | --- | --- |
    | State-of-the-art model performance | Good enough to be useful |
    | Fast training (training throughput) | Fast inference (latency of generating a prediction) |
    | Static data | Changing data |
    | Clean data | Messy data |
    | Ethics less of a consideration | Ethics can't be ignored |
    | Interpretability not important | Interpretability can be important |
    | Clear goals | Conflicting requirements from stakeholders |
    | Ensembling common | Simpler, less complex systems preferred |
  • During model development training is the bottleneck - once a model is in production, inference is the bottleneck. Research prioritises high throughput - Production prioritises low latency. Latency matters a lot in real world applications.
  • To predict the future, ML algorithms encode the past - perpetuating bias. They can discriminate at scale.
  • Traditionally data and code are separated (Separation of Concerns) - but not in ML systems.
  • Models degrade over time - they're often at their best the moment they go live

Machine Learning Basics

  • Business objectives need to be translated into ML objectives. You need to frame your problem so that ML can solve it - and tie the performance of your ML system back to the overall business.
    • Companies care about outcomes - not ML metrics (F1 score, inference latency)
  • Also for consideration: Reliability, scalability, adaptability and maintainability
    • Reliability: The system can continue to perform at the desired level of performance even in the face of adversity. ML systems can and often fail silently.
    • Scalability: Grow in complexity, traffic, in model count, features. There should be reasonable ways to grow in whatever dimension is most needed. Cloud services are great at auto-scaling. Artefact management is a big part of scaling ML models - as is monitoring and retraining (which needs to be automated at scale).
    • Adaptability: Coping with shifting data distributions. Discovering aspects for performance improvement and allowing updates without service interruption. Data can change quickly so ML systems need to be able to evolve.
    • Maintainability: Set up your process and infrastructure so different contributors are comfortable with the tooling. Code, data and artefacts should be documented and versioned. Models need to be reproducible by other contributors.
  • Developing an ML system is an iterative process - once a system is in production, it'll need to be monitored and updated.
  • Most ML tasks are Classification or Regression
    • Classification is putting things into categories (e.g Spam / Not Spam)
    • Regression models output a continuous value (house price prediction)
  • Classification problems are simpler with few classes. Binary is the simplest form.
  • Multi-label classification is hard. Labels tend to be less consistent (human labellers disagree - this is a strong warning sign).
  • Changing the way you frame the problem - could make it much easier for ML to solve
  • The objective function (or loss function) guides the learning process and tries to minimise wrong predictions. In supervised ML - the loss can be computed vs ground truth labels using RMSE (root mean squared error) or cross entropy
  • When there are multiple objectives - decouple them first because it makes model development and maintenance easier.
  • The success of an ML system depends largely on the data it was trained on

Data Engineering Fundamentals

  • Row-Major VS Column-Major Format: Accessing data by rows is faster than by columns in modern computers. Row-major formats (e.g. CSV) are better for accessing and writing examples. Column-major formats (e.g. Parquet) are better for accessing all features together (column-based reads). Note: pandas' default is column-major, NumPy's default is row-major (see the sketch below).
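A quick sketch of that difference, assuming only NumPy and pandas, showing which access pattern each layout makes cheap:

```python
import numpy as np
import pandas as pd

# NumPy defaults to row-major (C order): each row sits in contiguous memory.
arr = np.zeros((10_000, 100))
row = arr[0, :]    # contiguous read - cheap
col = arr[:, 0]    # strided read - touches every row

# pandas stores each column as its own block, so column access is the cheap path.
df = pd.DataFrame(arr, columns=[f"f{i}" for i in range(100)])
feature = df["f0"]      # one column, one contiguous array
example = df.iloc[0]    # one row, gathered from 100 separate column arrays
```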
  • Data Models: Describe how data is represented. Your model will influence how your systems are built and what problems you can solve.
    • Relational Models: Breaking data into relational tables - rows and columns that can be shuffled. Often normalised - structuring tables into normal forms to reduce data redundancy and improve data integrity. The downside of normalisation is that data is spread across multiple relations - you have to join them back together. A query language is the language you use to specify the data you want. SQL is declarative - you specify the outputs you want and the computer figures out the steps needed to return them. Python is imperative - you specify the steps needed for an action and the computer executes them to return an output (see the sketch after this list). In theory, SQL can be used for any computing problem (it's Turing complete).
    • NoSQL Models:
      • Relational models have to follow a strict schema.
      • Two major types: Document model and the Graph model
        • Documents for when data is self-contained and relationships are rare
        • Graphs for when relationships are common and important
      • Document Model: Can be a single blob of JSON. Documents are more flexible - each one can have a different schema. Document databases shift responsibility of assuming structures from the writing application to the reading application. Each document has locality (holding all relevant information) making retrieval easy. Filtering results for documents with certain attributes is slow - you have to read them all, extract the attribute, then filter. Relational models are faster for this
      • Graph Model: Graph consists of nodes and edges - edges are the relationships between the nodes. The relationships between the data are a priority. It’s faster to retrieve data based on relationships.
    • Queries that are easy to do in one data model are hard in another - picking the right data model can make your life so much easier
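To make the declarative-vs-imperative distinction above concrete, here's a minimal sketch using Python's built-in sqlite3 module (the table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("London", 10.0), ("Paris", 5.0), ("London", 7.5)])

# Declarative (SQL): say WHAT you want; the engine decides HOW to compute it.
total_sql = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE city = 'London'"
).fetchone()[0]

# Imperative (Python): spell out each step yourself.
total_py = 0.0
for city, amount in conn.execute("SELECT city, amount FROM orders"):
    if city == "London":
        total_py += amount

assert total_sql == total_py == 17.5
```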
  • Structured vs Unstructured Data
    • Structured follows a predefined data model (a.k.a. schema). Pre-defined structure makes data easier to analyse. Disadvantage - you have to commit to a schema in advance, changing it retrospectively is harder.
    • Unstructured data becomes appealing if data models are changing quickly - or you’re reliant on data sources outside of your control.
    | Structured | Unstructured |
    | --- | --- |
    | Schema clearly defined | Data doesn't have to follow a schema |
    | Easy to search and analyse | Fast arrival |
    | Can only handle data with a specific schema | Can handle data from any source |
    | Schema changes cause trouble | Schema changes are easy; downstream applications have the issues |
    | Stored in data warehouses | Stored in data lakes |
Two types of processing - Transactional vs Analytical
  • Transactional processing: actions are inserted as they’re generated, occasionally updated or deleted if something changes. Fast processing (low latency) and high availability are important. ACID (atomicity, consistency, isolation, durability)
    • Atomicity - if any step in the transaction fails - they all fail
    • Consistency - must follow pre-defined rules (e.g. validation)
    • Isolation - concurrent transactions happen as if they were isolated; two users can't change the same data at the same time
    • Durability - once a transaction is committed - it remains so (even if the device dies)
    • Systems that relax the ACID guarantees are sometimes described as BASE (Basically Available, Soft state, Eventual consistency)
    • Transactional databases are often row-major - great for writing new records - but bad for asking ‘What’s the average price of items sold in our London stores?’
  • Analytical processing: Efficient with queries that allow you to look at data from different view points.
  • ETL: Extract, Transform, Load: Extracted from different sources → transformed to desired format → loaded into target
Three main models of data flow (database, request-driven, event-driven):
  • through databases
    • is easiest, but requires that both processes have access to the same database. Read/write can be slow - making it unsuitable for low-latency applications
  • through services using requests provided by REST and RPC APIs (POST/GET requests)
    • Often called 'request-driven' and coupled with a service-oriented architecture (or microservices architecture). A service is a process that can be accessed remotely. Structuring an application as separate services allows for independent development, testing and deployment. Also great for when two companies collaborate.
    • Popular requests:
      • REST (representational state transfer)
        • or RESTful - maps to CRUD: Create (POST), Read (GET), Update (PUT), Delete (DELETE)
        • you can’t get a branch by state (e.g. apply a filter)
        • best for just a simple application
      • RPC (remote procedure call)
        • flexible - great for business rules - designed for actions
        • more scalable in the long run (for different use cases)
  • through real-time transport like Apache Kafka and Amazon Kinesis
    • Request-driven data passing between services is synchronous - and can get slow and complicated if there are too many services. One service can cause all the others to fail too
    • Real-time transport acts as a broker between services - which can either broadcast or listen.
    • We call the pieces of data being transported events. The architecture is called event-driven.
    • Publish-subscribe message brokers (Apache Kafka, Amazon Kinesis) or message queues (Apache RocketMQ, RabbitMQ).
    • In a message queue model - events often have intended consumers.
  • Request driven architecture works well for systems that rely more on logic than on data. Event-driven architecture works better for systems that are data heavy.
  • Batch Processing VS Stream Processing
    • Batch processing = jobs that are kicked off periodically. Stream processing = computation on data as it arrives via real-time transport (in real time or every few minutes).
    • Stream processing can give low latency. It’s not always less efficient either. It can be scalable as computations can be done in parallel. You can also save compute by doing things as they happen, vs re-running large batch jobs.
    • In ML, batch processing is used to compute features that change less often (like a driver's rating)
      • Batch features - are also known as static features
    • Stream processing is used to compute features that change quickly
      • Stream features - are also called dynamic features.

Training Data

  • Sampling
    • Sampling happens throughout the ML project lifecycle
    • Two families of sampling: Non-probability sampling and random sampling:
    • Don’t use Non Probability Sampling for ML models
    • Random Sampling Methods:
    | Method | Description |
    | --- | --- |
    | Simple random | All samples in the population have an equal probability of selection. Easy, but rare categories of data might not make it into your selection |
    | Stratified | Divide the population into groups (strata) and sample from each separately. Ensures some examples of rare classes. Not always possible; hard for multi-label tasks |
    | Weighted | Each sample is given a weight that determines its probability of selection. Lets you leverage domain expertise - e.g. you might want more recent data to have a higher chance of being selected |
    | Reservoir | Useful for streaming data. Keep a reservoir of data points, randomly select data points from the stream, and randomly replace items in the reservoir. All samples seen so far have an equal chance of being selected, and you can stop at any time with a ready sample (see the sketch below) |
    | Importance | Sample from a distribution when we only have access to another distribution. One data source could be expensive, slow or infeasible to sample from, so you sample from a more available source and weight those samples accordingly |
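A minimal sketch of reservoir sampling (Algorithm R), assuming a stream of unknown length and a fixed reservoir size k:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```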
  • Labelling
    • Most ML models are supervised - they need labelled data to learn.
    • Performance of an ML model - depends on quality and quantity of data
    • Two types of labels → Hand labels, or Natural labels.
    • Hand Labels. Acquiring hand labels is often difficult, slow and expensive. Requires somebody seeing your data - so there are privacy implications. Slow labelling leads to slow iteration speeds and makes your model less adaptive to changing environments and requirements. The longer the process takes, the more your existing model will degrade.
      • Label Multiplicity - To get enough labelled data you often have to use multiple annotators or even sources. They will have different levels of expertise and accuracy. This leads to label ambiguity or multiplicity - what to do when there are multiple conflicting labels for a data instance?
      • Disagreements among annotators are extremely common. If humans can’t agree on a label - what does human-level performance even mean?
      • Incorporating clear problem definitions and guidance into annotator training can minimise disagreement
    • Natural labels. Tasks with natural labels can be evaluated by the system - they have a natural ground truth. E.g. Google Maps knows how long the trip actually took and can evaluate how good its prediction was. Recommender systems have natural labels (CLICK or NO CLICK). Labels inferred from user actions (clicks and ratings) are known as behavioural labels.
      • If you don’t have natural labels - consider adding an optional feedback loop
      • Companies find it easier and cheaper to start on tasks with natural labels
      • Implicit labels are presumed. E.g. a recommendation that doesn’t get clicked on is presumed to be bad
      • Explicit labels are when users explicitly demonstrate their feedback
    • Data Lineage - keeping track of each data sample's origin and labels. Essential if taking data from multiple sources. Helps you flag bias and debug models.
    • Feedback Loop Length. Time from prediction to feedback. Recommender systems have short feedback loops. User feedback differs by volume, signal strength and feedback loop length.

  • Four ways to cope with lack of labels
    • Weak supervision → leverages noisy heuristics to generate labels
    • Semi-supervision → leverages structural assumptions to generate labels
    • Transfer learning → leverages models pre-trained on another task for your new task
    • Active learning → labels data samples that are most useful to your model
  • Class Imbalance
    • When there is a substantial difference in the number of samples in each class of the training data. E.g. 0.01% of X-rays might contain cancerous cells
    • Challenges of class imbalance:
      • Deep learning (and ML in general) works best when the data distribution is more balanced, because imbalance brings:
        • insufficient signal to detect minority cases
        • models that get stuck in non-optimal solutions, exploiting simple heuristics instead of learning anything useful
        • asymmetric costs of error - the cost of a wrong prediction on a sample of the rare class can be much higher (missing a rare cancer diagnosis). If your loss function isn’t configured to address this asymmetry, your model will treat all samples the same way
    • Rare events are often more interesting (like in fraud detection, or churn prediction)
    • Three ways to handle class imbalance:
    • Use the right evaluation metrics
      • Model Accuracy and Error Rate (used frequently) are insufficient metrics for tasks with class imbalance because they treat all classes equally. Performance on the majority class dominates the metrics. This is especially bad when the majority class isn’t what you care about.
      • F1, precision, and recall are metrics that measure your model’s performance with respect to the positive class in binary classification problems, as they rely on true positive—an outcome where the model correctly predicts the positive class.
      |  | Positive Prediction | Negative Prediction |
      | --- | --- | --- |
      | Positive Label | True Positive | False Negative |
      | Negative Label | False Positive | True Negative |
      • Precision = True Positive / (True Positive + False Positive)
        • Precision = accuracy of positive predictions
      • Recall = True Positive / (True Positive + False Negative)
        • Recall = proportion of actual positives (in the data) that were correctly predicted
      • F1 = 2 x Precision x Recall / (Precision + Recall)
      • They are all asymmetric metrics - their values change depending on what you call the positive class
      • Classification problems can be modelled as regression problems (instead of returning a class you return the probability of a class). You can then classify based on that probability by setting a threshold. Moving it up and down allows you to increase the true positive rate (also known as recall) while decreasing the false positive rate (also known as the probability of false alarm), and vice versa
        • Plotting true positive rate against false positive rate is the ROC curve. The area under the curve is a measure of how close to perfect the model is.
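A small, dependency-free sketch of these metrics on an invented, imbalanced toy example - note how accuracy looks healthy while precision and recall on the rare class tell the real story:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 8 negatives, 2 positives - an imbalanced toy set.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.8 - looks fine
print(precision_recall_f1(y_true, y_pred))   # (0.5, 0.5, 0.5) - the real story
```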
    • Data-Level Methods: Resampling
      • Modifying the distribution of the training data to reduce the level of imbalance to make it easier for the model to learn.
        • You can OverSample the minority class or UnderSample the majority class
          • Undersampling →
            • Random removals
            • Tomek link UnderSampling - finds pairs of samples from opposite classes that are similar and removes the one from the majority class. Helps models learn the boundary but might make them less robust.
          • Oversampling →
            • Random duplication
            • SMOTE (synthetic minority oversampling technique) - creates novel minority-class samples by interpolating between existing minority-class examples.
      • These techniques only work well in data with low-dimensionality
      • never evaluate your model on resampled data - it will cause your model to overfit to that resampled distribution
      • Undersampling risks losing important data from removing data
      • Oversampling risks of overfitting on training data
    • Algorithm-Level Methods
      • Algorithm-level methods keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
      • Many algorithm-level methods involve adjustment to the loss function (that guides the learning process). Gets the model to prioritise making correct predictions on the more important class. By giving the training instances we care about higher weight, we can make the model focus more on learning these instances
      • Cost-sensitive learning → the loss function is modified to take into account the varying cost of misclassification (but you have to manually define the cost matrix)
      • Class-balanced loss → make the weight of each class inversely proportional to the number of samples in that class, so rarer classes get higher weight and the model is punished more for getting them wrong
      • Focal loss → incentivise the model to focus on learning the samples it still has difficulty classifying. If a sample has a lower probability of being right, it’ll have a higher weight
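A minimal sketch of the inverse-frequency weighting idea behind these methods, assuming PyTorch; published class-balanced losses use more refined weighting schemes, and the class counts here are invented:

```python
import torch
import torch.nn as nn

# Class counts from an imbalanced training set: 990 negatives, 10 positives.
class_counts = torch.tensor([990.0, 10.0])

# Inverse-frequency weighting, normalised so the weights sum to the number of classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)
print(weights)  # tensor([0.5051, 50.0000]) -> the rare class is weighted ~100x more

# Plug the weights into a standard loss; mistakes on the rare class now cost more.
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # fake model outputs for 8 samples
labels = torch.randint(0, 2, (8,))    # fake labels
loss = loss_fn(logits, labels)
```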

Data Augmentation - a family of techniques that are used to increase the amount of training data

  • Simple Label-Preserving Transformations → e.g. cropping, flipping, rotating, inverting, or erasing part of the image. In NLP, you can replace a word with a similar word. A quick way to double or triple your training data.
  • Perturbation → adding a small amount of noise to make models more robust

Data Synthesis - creating training data to boost a model’s performance. In NLP → templates can be a cheap way to bootstrap your model.

  • Example Template: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]”
  • You can then list all possible cuisines, all reasonable numbers of miles, and locations (home, office, landmarks, exact addresses) for each city, and generate thousands of training queries from the template
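A tiny sketch of that template trick (the cuisine, distance and location lists are placeholders):

```python
from itertools import product

template = "Find me a {cuisine} restaurant within {miles} miles of {location}"

cuisines = ["Mexican", "Thai", "Italian"]
distances = [2, 5, 10]
locations = ["home", "the office", "Union Square"]

queries = [
    template.format(cuisine=c, miles=m, location=loc)
    for c, m, loc in product(cuisines, distances, locations)
]
print(len(queries))  # 3 * 3 * 3 = 27 synthetic training queries
print(queries[0])    # Find me a Mexican restaurant within 2 miles of home
```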

Feature Engineering

  • The most important thing in developing ML models is having the right features. Coming up with new useful features is a big part of the job. Choosing what information to use and how to extract this information into a format usable by your ML models is feature engineering.
  • Handling Missing Values
    • Not all types of missing values are equal. There are three types of missing values:
      • Missing not at random (MNAR) → Values are missing for reasons related to the values themselves.
      • Missing at random (MAR) → a value is missing is not due to the value itself, but due to another observed variable
      • Missing completely at random (MCAR) → there’s no pattern in when the value is missing.
    • When encountering missing values, you can either fill in the missing values with certain values (imputation) or remove the missing values (deletion)
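A minimal pandas sketch of the two options, with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [28, 35, np.nan, 52],
    "income": [40_000, np.nan, 55_000, 90_000],
})

# Deletion: drop any row with a missing value - simple, but throws data away.
dropped = df.dropna()

# Imputation: fill missing values, e.g. with each column's median.
imputed = df.fillna(df.median())
```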
  • Scaling
    • Models tend to give more extreme numbers more importance. That’s a problem if you have age (less than 120) and annual income in your model (values beyond 100k).
    • It's therefore important to scale features into similar ranges before putting them into a model - this is called feature scaling. It's one of the simplest things you can do, and it often results in a performance boost.
    • Scaling is a common source of data leakage - the scaling statistics should come from the training split only (see the sketch below).
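A short sketch of leakage-free scaling, assuming scikit-learn's usual fit/transform pattern - the scaler's statistics come from the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 3) * [120, 100_000, 5]   # e.g. age, income, rating

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: statistics computed on ALL data, so test-set information reaches training.
# scaler = StandardScaler().fit(X)

# Safe: fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```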
  • Discretisation
    • is the process of turning a continuous feature into a discrete feature - by creating buckets for given values. Rarely helps. The process is also known as binning or quantisation.
  • Feature Crossing.
    • Combines two or more features to generate new features. Useful for modelling nonlinear relationships between features. Example: combine marital status and number of children into "marriage and children". Can help models learn nonlinear relationships faster. Caution - can cause overfitting, and your models might need more training data
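A minimal pandas sketch of that marital-status × children cross:

```python
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["married", "single", "married"],
    "num_children":   [2, 0, 0],
})

# A simple feature cross: concatenate the two values into one categorical feature.
df["marriage_and_children"] = (
    df["marital_status"] + "_" + df["num_children"].astype(str)
)
# -> "married_2", "single_0", "married_0"
```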
  • Data Leakage - Essentially when labels are leaked into the features during training. Leakage is often non-obvious and it can cause models to fail in unexpected and spectacular ways (even if evaluated carefully). It’s common and rarely covered in ML curricula.
    • Example: patients scanned while lying down were more likely to be seriously ill → the model learned to predict serious covid risk from a person’s position
    • Common Causes of Data Leakage:
      • Splitting time-correlated data randomly instead of by time
      • Scaling before splitting
      • Poor handling of data duplication before splitting
      • Group Leakage - you need to understand how your data was generated to avoid this type of data leakage
      • Leakage from data generation process
    • Detecting Data Leakage
      • Measure the predictive power of each feature with respect to the label → then investigate high correlation.
      • Do ablation studies to measure how important a feature is to your model - remove the feature and assess the drop-off in performance.
      • Watch out for new features improving model performance by large amounts
    • Don’t use your test split for anything other than reporting a model’s final performance
  • Engineering Good Features
    • Generally more features are better, so in production the number of features grows over time. But there are downsides: more chances for data leakage, more risk of overfitting, more memory and compute cost, worse inference latency for online prediction, and growing technical debt
      • If a feature doesn't help a model make good predictions, regularisation techniques like L1 regularisation should reduce that feature's weight to zero
    • There are many built-in and open source packages for computing the importance of your features
  • It’s measured by → how much that model’s performance deteriorates if that feature or a set of features containing that feature is removed from the model
  • Often, a small number of features accounts for a large portion of your model's feature importance. Feature importance techniques are also great for interpretability as they help you understand how your model works.
  • Coverage is the % of samples that have values for this feature in the data. The fewer values missing, the higher the coverage. Generally, if a feature appears in a very small percentage of your data, it's not going to be very generalisable.
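Coverage is cheap to compute - a short pandas sketch with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"driver_rating": [4.8, np.nan, 4.5, np.nan, 4.9]})

# Coverage: the share of samples that actually have a value for this feature.
coverage = df["driver_rating"].notna().mean()
print(f"{coverage:.0%}")  # 60%
```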

Model Development

  • In supervised ML, the inputs and outputs are given (together they're the data), and the function is derived from that data. ML isn't powerful enough to derive arbitrary functions from data yet → you need to specify the form you think the function should take.
  • The objective function (or loss function) is highly dependent on the model type and whether the labels are available. If the labels aren’t available (e.g. unsupervised learning) the objective functions depend on the data points themselves
  • For k-means clustering, the objective function is the variance within data points in the same cluster. Unsupervised learning is much less commonly used in production
  • Root Mean Squared Error and Mean Absolute Error are two common objective functions for scalar outputs (scalar output = single variable output (not a distribution))
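A small NumPy sketch of both metrics on invented house prices; note how RMSE penalises the single large miss more heavily than MAE does:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

y_true = [300_000, 450_000, 500_000]   # e.g. house prices
y_pred = [310_000, 430_000, 540_000]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # ~26458 vs ~23333
```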
  • Learning procedures - the procedures that help your model find the set of parameters that minimise a given objective function for a given set of data - are diverse
  • Considerations when evaluating a model:
    • Performance: accuracy, F1 score, and log loss
    • How much data are needed
    • How much compute needed
    • Time required to train
    • Inference latency
    • Interpretability (Non-neural network algorithms tend to be more explainable)
  • Six tips for model selection
    1. Avoid the state-of-the-art
    2. Start with the simplest models
    3. Avoid human biases in selecting models
    4. Evaluate good performance now versus good performance later
    5. Evaluate trade-offs (false positive vs false negative, compute vs accuracy, interpretability vs performance)
    6. Understand your model’s assumptions
  • Model Assumptions
    | Assumption | Description |
    | --- | --- |
    | Prediction | Prediction models assume that it's possible to predict the output from the input. |
    | IID | Neural nets assume examples are independent and identically distributed (independently drawn from the same joint distribution). |
    | Smoothness | Supervised ML assumes that if an input X produces an output Y, then an input close to X would produce an output proportionally close to Y. |
    | Tractability | Let X be the input and Z the latent representation of X. Every generative model assumes it's tractable to compute the probability P(Z\|X). |
    | Boundaries | A linear classifier assumes that decision boundaries are linear. |
    | Conditional independence | A naive Bayes classifier assumes that the attribute values are independent of each other given the class. |
    | Normally distributed | Many statistical methods assume that data is normally distributed. |
  • Ensembles: a system that uses multiple models; each model in the ensemble is called a base learner. Can give better accuracy, but ensembles are less favoured in production because they're more complex to deploy and harder to maintain. Common when a small performance boost can lead to a huge gain.
    • You can have 3 models predicting the same class (SPAM, NOT SPAM) and take the majority vote. Makes much more sense if the models aren't correlated (the sketch after this list works through the maths).
  • 3 ways to ensemble: Boosting, Bagging and Stacking:
    • Bagging - short for bootstrap aggregating. You create different datasets (bootstraps) by sampling with replacement, train a model on each bootstrap, and take the majority vote (or average) of their predictions.
    • Boosting - Boosting uses a chain of classifiers, but sample weights are changed based on how well the previous classifier predicted them. A final classifier is made using a combination of the existing ones
    • Stacking - train base learners from the training data then create a meta-learner that combines the outputs of the base learners to output final predictions. The meta-learner could be majority vote or averaging
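A quick worked example of why uncorrelated base learners help: three independent classifiers that are each right 70% of the time give a majority vote that's right about 78% of the time (this assumes fully independent errors, which real models rarely achieve):

```python
from math import comb

p = 0.7  # accuracy of each of three base learners, assumed independent

# The majority vote is correct when at least two of the three models are correct.
p_majority = sum(comb(3, k) * p**k * (1 - p) ** (3 - k) for k in (2, 3))
print(p_majority)  # 0.784 - better than any individual model
```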
  • Aggressive experiment tracking and versioning helps with reproducibility, but it doesn’t ensure reproducibility
  • Start simple and gradually add more components
  • Overfit a single batch - to make sure it gets to the smallest possible loss. If it’s for image recognition, overfit on 10 images and see if you can get the accuracy to be 100%. If it can’t overfit a small amount of data, there might be something wrong with your implementation.
  • Set a random seed - Setting a random seed ensures consistency between different runs. It also allows you to reproduce errors and other people to reproduce your results.
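A minimal seeding sketch; the framework-specific line is left as a comment because it depends on what you use:

```python
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Pin the common sources of randomness so runs - and bugs - can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    # If you use a deep learning framework, seed it too, e.g. torch.manual_seed(seed)

set_seed(42)
print(np.random.rand(3))  # identical output on every run with the same seed
```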
  • Four Phases of ML Model Development
    1. Before Machine Learning → Start with non-ML solutions
    2. Simplest ML Models → validate the usefulness, easy to implement and deploy
    3. Optimise Simple Models → different objective functions, hyperparameter search, feature engineering, data and ensembles
    4. Complex Models → Experiment, look for more significant improvements, think about model degradation over time
  • Baselines
    | Baseline | Description |
    | --- | --- |
    | Random baseline | Expected performance of a model that predicts at random |
    | Simple heuristic | Predictions based on simple heuristics |
    | Zero rule baseline | Predicting the most common class (e.g. predicting that a user will next open the app they most commonly open) |
    | Human baseline | How your model compares to human experts |
    | Existing solutions | ML systems are often designed to replace existing solutions such as hand-coded business logic |

Model Deployment and Prediction

  • Two types of inference (generating predictions): online prediction and batch prediction
  • Machine Learning Deployment Myths
    1. you only deploy 1 or 2 ML models at a time
    2. If we don’t do anything - model performance remains the same
    3. You won’t need to update your models much
    4. Most ML engineers don’t need to worry about scale
|  | Batch Prediction (asynchronous) | Online Prediction (synchronous) |
| --- | --- | --- |
| Frequency | Periodical (e.g. every 4 hours) | As soon as requests come in |
| Useful for | Processing accumulated data when you don't need immediate results | When predictions are needed as soon as a data sample is generated |
| Optimised for | High throughput | Low latency |
  • Online prediction isn’t necessarily less efficient - Batch processing can be wasteful, you might be computing predictions that you don’t need (e.g. for users who won’t use your product before the next run)
    • Batch prediction is computing predictions in advance and storing them in a database to be fetched when requests arrive. Can bypass the latency issues of complex models. But it makes you less responsive to users' changing preferences, and you need to know what requests to generate predictions for in advance
  • Building infrastructure to unify stream processing and batch processing is becoming popular. Companies can use feature stores to ensure consistency between the batch features used during training and the streaming features used in prediction.
  • Three main approaches to reduce a model's inference latency:
    • make it do inference faster (inference optimisation)
    • make the model smaller (model compression)
    • make the hardware it's deployed on run faster
  • As cloud bills climb more companies are looking for ways to push their computations to edge devices.

Data Distribution Shifts and Monitoring

  • A model’s performance degrades over time in production. Once deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues
  • Google studied 96 cases where a large ML pipeline at Google broke - 60 of the 96 failures happened due to causes not directly related to ML
  • ML-Specific Failures:
    • Production data differing from training data
    • Edge Cases
    • Degenerate feedback loops
| Shift type | What changes |
| --- | --- |
| Covariate shift | The distribution of the inputs (independent variables) changes, but the conditional distribution of outputs given inputs is unchanged |
| Label shift / prior shift | The output distribution changes, but for a given output the input distribution stays the same |
| Concept shift | Same input, different output. In many cases, concept drifts are cyclic or seasonal |
  • Data distribution shifts are only a problem if they cause your model’s performance to degrade.
  • Statistical Methods:
    • Min, Max, Mean, Median, Variance, 5th, 25th, 75th, 95th, skewness, kurtosis
    • Two-sample test - to determine whether the difference between two populations is statistically significant
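A minimal two-sample test sketch using SciPy's Kolmogorov-Smirnov test on synthetic data (the significance threshold and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # source distribution
live_feature = rng.normal(loc=0.3, scale=1.0, size=5_000)    # slightly shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Possible distribution shift (KS={stat:.3f}, p={p_value:.1e})")
```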
  • Time scale windows for detecting shifts
    • Shifts can happen across two dimensions. Temporal or Spatial.
    • To detect temporal shifts - you can treat input data as time-series data. Detecting temporal shifts is hard when shifts are confounded by seasonal variation
  • Many companies assume that data shifts are inevitable, so they periodically retrain their models—once a month, once a week, or once a day—regardless of the extent of the shift.
  • Retrain your model using the labeled data from the target distribution.
    1. Stateless retraining - train from scratch
    2. Stateful training (fine-tuning) - continue training the existing model on new data
  • You can design your system to make it more robust to shifts.
    • A system uses multiple features, and different features shift at different rates.
    • When choosing features consider the trade-off between the performance and the stability
    • You might also want to design your system to make it easier for it to adapt to shifts. E.g. a separate model for each market, you can update each of them only when necessary
  • Monitoring
    • Monitor accuracy-related metrics, predictions, features and raw inputs
    • Logs: recording events produced at runtime
      • The number of logs grows quickly. Pinpointing where a problem is can be harder than detecting that something happened.
      • Tracing helps us find things later and follow threads → each process has a unique ID that allows us to search logs for it
      • For each event, record all the metadata needed too
      • Companies use ML to analyse logs: to detect anomalies and prioritise them
    • Dashboards: visualising relationships
      • Helps spot patterns
      • Makes monitoring accessible to nonengineers
      • Excessive metrics on a dashboard can also be counter-productive (dashboard rot)
    • Alerts: Alerting the right people to suspicious signals
      • Alert Policy: threshold breach for each metric (sometimes over a duration)
      • Notification channel: slack, pager duty, email
    • A description of the alert
      • Helps the alerted person know what’s going on
      • Make the alert actionable by providing instructions or a runbook
    • Alert fatigue is real, demoralising and dangerous.
  • Observability: setting up the system to get visibility into our system to help us investigate when something goes wrong

Continual Learning and Test in Production

  • Four Stages of Continual Learning in Organisations
    • Manual - Stateless retraining
    • Automated retraining
    • Automated - stateful training
    • Continual Learning
  • How often to update your models
    • You need to figure out how much you gain from updating your model.
    • Value of data freshness → to gain a sense of the performance gain you can get from fresher data, train your model on data from different time windows in the past and test on data from today to see how the performance changes
    • Model iteration vs data iteration → do both from time to time - the more resources you spend on one approach the fewer resources you’ll have for the other
    • In the beginning - when updating your model is manual and slow - do it as often as you can
      • As infrastructure matures and retraining can be done in hours or minutes, the question becomes → how much performance gain would I get from fresher data?
  • Types of Production Testing:
| Strategy | How it works |
| --- | --- |
| Shadow deployment | Deploy the candidate model in parallel, route every request to both models, and log the candidate's predictions for analysis. The new model's predictions aren't served to users, so it's very safe - but it's expensive and doubles compute cost |
| A/B testing | Deploy the candidate model, route a % of traffic to it and serve its predictions, then monitor performance and user feedback/behaviour. Make sure traffic is randomised and the volume is large enough for results to be meaningful. Book recommendation: Trustworthy Online Controlled Experiments - Ron Kohavi |
| Canary release | Slowly roll out the change to a small subset of users before rolling out to the entire infrastructure. Deploy → route some traffic → if performance is OK, increase → stop when it's serving 100% |
| Interleaving experiments | Expose a user to recommendations from two models at the same time, controlling for the position of recommendations (which affects the likelihood of a click) |
| Bandits | Route traffic based on relative model performance to find which model has the highest payoff over time. Requires short feedback loops and uses less data before making a decision |
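The book doesn't prescribe a particular bandit algorithm; as one common example, an epsilon-greedy sketch that mostly routes traffic to the best-performing model so far but keeps exploring (model names and reward numbers are invented):

```python
import random

def choose_model(avg_reward: dict, epsilon: float = 0.1) -> str:
    """Pick which model serves the next request: explore occasionally, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(avg_reward))      # explore a random model
    return max(avg_reward, key=avg_reward.get)      # exploit the best so far

avg_reward = {"model_a": 0.031, "model_b": 0.042}   # e.g. observed click-through rates
print(choose_model(avg_reward))  # usually "model_b", sometimes an exploratory pick
```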

Infrastructure and tooling for MLOps

  • Each company's infrastructure needs are different:

    | Scale | Infrastructure |
    | --- | --- |
    | Single ML app | No infrastructure needed (Jupyter notebooks, Python and pandas). Low investment |
    | Multiple common apps | Can use generalised ML infrastructure. Medium investment. Most companies doing ML are here |
    | Serving millions of requests per hour | Highly specialised infrastructure. High investment |
  • Fundamental facilities that support the development and maintenance of ML systems
| Layer | What it covers |
| --- | --- |
| Storage and compute | Where data is stored and collected; the compute layer provides compute for ML workloads such as training models, computing features and generating predictions |
| Resource management | Tools to schedule and orchestrate your workloads to make the most of available resources (Airflow, Kubeflow, Metaflow) |
| ML platform | Tools to aid the development of ML applications - model stores, feature stores and monitoring tools (SageMaker, MLflow) |
| Development environment | Where code is written and experiments are run. Code needs to be versioned and tested; experiments need to be tracked |
  • Most multi-cloud strategies are by accident, not by design. In theory it would be nice to leverage the cheapest compute and avoid vendor lock-in, but it's really hard to move data and orchestrate workloads across clouds.
  • If you have 100 micro services you might have 100 containers. Container orchestration tools help you manage them (Docker Compose). Kubernetes is a tool that creates a network of containers to communicate and share resources. Helps you spin up more instances when you need more compute / memory and shuts down containers when you don’t need them.

The Human Side of Machine Learning

  • UX considerations for ML systems
    • they are probabilistic instead of deterministic
    • they are mostly correct, but we can't tell when they'll be wrong
    • latency can be an issue
  • Mostly correct predictions are OK for those users who can easily correct them → they aren’t useful if users don’t know how to or can’t correct responses.
  • Smooth Failing → If a model takes too long to respond - you can fall back to a basic heuristic (or cached or precomputed predictions).
  • Team Structure
    • ML systems don't work without subject matter expertise. You need it throughout the process, not just in the labelling phase
    • Problem formulation - Feature engineering - Error analysis - Model evaluation - reranking predictions - user interface (how best to present the results)
    • Think about how to explain the ML algorithm's limitations and capabilities to the user
    • No-code / low-code solutions enable subject matter experts to take the reins
  • Having a separate team manage production makes the most sense. Makes hiring easier as you're splitting skills. Makes life easier for each person involved (as they only have to focus on a single concern). Drawbacks: communication overhead, debugging challenges, finger-pointing, narrow context, might miss E2E optimisation opportunities.
  • Responsible AI
  • Designing, developing and deploying AI systems with good intention and sufficient awareness to empower users, engender trust and ensure a fair and positive impact on society

Deep Summary

Longer form notes, typically condensed, reworded and de-duplicated.

Preface

  • Building and deploying machine learning systems is complicated. More stages, stakeholders and components than traditional systems.
  • As ML systems are data dependent - they tend to be unique, as you have to design around the data.

Chapter 1: Overview of Machine Learning Systems

  • The actual ML algorithm is only a small part of a production ML system. Think… business requirements, users & interface, feature engineering, evaluation, data, infrastructure, deployment, monitoring and updating
  • Machine Learning is an approach to learning complex patterns from existing data - and using those patterns to make predictions on unseen data.
  • They need something to learn from → typically data
    • Zero-shot learning (or zero-data learning) is really hard for machine learning. Most systems require a lot of data to make good predictions.
    • You can deploy a model without training it first - but you risk an initial poor customer experience as it learns
  • They need something to learn → there must be clear patterns in the data
  • Use deterministic mapping (logic) when you can; otherwise machine learning may be able to approximate a mapping by learning patterns from inputs and outputs.
  • What is complex for machines is different from what is complex to humans.
  • Use a Concierge or Wizard of Oz model to get going - and then use that data to train the model later
  • ML models make predictions. So they can only solve problems that require a predictive answer.
  • ML is great if a large number of approximate predictions is useful (e.g. movie recommendations).
  • For your model to be useful - the same patterns must exist in the unseen data as in the training data
  • Use ML solutions when:
    • Patterns and tasks are repetitive.
    • The cost of wrong predictions is small (e.g. movie recommendations)
    • You’re making predictions at scale (to justify the cost {team, compute, data, infra.})
    • Patterns are changing (ML is adaptable - less brittle than hardcoded rule-based solutions)
      • Note - radical changes in data distributions will still require human intervention
  • Don’t use ML solutions when:
    • If it’s unethical to do so.
    • Simpler solutions can do the trick. Always start with a non-ML solution
    • It’s not cost-effective
  • If ML can’t solve your problem - it might be possible to solve part of the problem
  • Machine Learning Use Cases:
    • Search engines, recommender systems, suggestions, translation, assistants, health monitoring, fraud detection, price optimisation, demand forecasting, churn prediction, support ticket classification, sentiment analysis.
  • Most ML isn’t customer facing. In internal applications accuracy is more important than latency.
    • Internal: Reducing costs, generating customer insight, internal processing automation
    • External: improving customer experience, retaining customers, interacting with customers

Machine Learning in Research vs Production

| Research | Production |
| --- | --- |
| State-of-the-art model performance | Good enough to be useful |
| Fast training (training throughput) | Fast inference (latency of generating a prediction) |
| Static data | Changing data |
| Clean data | Messy data |
| Ethics less of a consideration | Ethics can't be ignored |
| Interpretability not important | Interpretability can be important |
| Clear goals | Conflicting requirements from stakeholders |
| Ensembling common | Simpler, less complex systems preferred |
  • Ethayarajh and Jurafsky (2020) argued benchmarks have driven progress in natural language processing at the expense of compactness, fairness, and energy efficiency.
  • People who haven’t deployed an ML system often make the mistake of focusing too much on the model development part and not enough on model deployment and maintenance
  • During model development training is the bottleneck - once a model is in production, inference is the bottleneck.
    • Research prioritises high throughput - Production prioritises low latency
  • Response time - what the client sees
  • Service time - actual time taken to service the request
  • Latency - duration that a request is waiting to be handled
  • Relationship between latency and throughput depends on batch size
    | Scenario | Latency | Batch size (queries) | Throughput (queries per second) |
    | --- | --- | --- | --- |
    | A | 10 ms | 1 | 100 q/s |
    | B | 20 ms | 1 | 50 q/s |
    | C | 10 ms | 10 | 1,000 q/s |
    | D | 20 ms | 50 | 2,500 q/s |

    • If batch size is 1, lower latency means higher throughput
    • If batch size scales faster than latency (as in C → D), accepting higher latency can still increase throughput
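The table's numbers follow from a simple relationship - throughput = batch size / latency - sketched here:

```python
def throughput_qps(batch_size: int, latency_ms: float) -> float:
    """Queries served per second when each batch of requests takes latency_ms."""
    return batch_size * (1000 / latency_ms)

for scenario, batch_size, latency_ms in [("A", 1, 10), ("B", 1, 20), ("C", 10, 10), ("D", 50, 20)]:
    print(scenario, throughput_qps(batch_size, latency_ms))
# A 100.0, B 50.0, C 1000.0, D 2500.0 - matching the table above
```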
Latency matters a lot in real world applications
  • 2017 - Akamai study found that 100ms delay can hurt conversion rates by 7%
  • Booking.com found that a 30% increase in latency cost 0.5% conversion rates.
  • Google found that more than half of people leave a page if it takes 3 seconds to load
  • You can reduce latency by reducing the number of queries processed in parallel - but at the expense of hardware utilisation, increasing the cost to serve.
  • Latency is not an individual number - but a distribution - best reported in percentiles.
    • p50 is the most common. ‘100 ms p50’ means 50% of requests take longer than 100 ms
  • Typically data in research is clean and well formatted - freeing you to focus on model development. If data does have quirks - they’re usually well known and discussed by the community
  • Production data can be messy, noisy, unstructured, constantly shifting, biased, labels can be unbalanced or sparse, you have to think about privacy.
  • Fairness - is harder to measure.
    • Book Reference: Weapons of Math Destruction - Cathy O’Neil
    • To predict the future, ML algorithms encode the past - perpetuating bias. They can discriminate at scale.
  • Interpretability - researchers prioritise model performance over interpretability. Interpretability is important to users and developers (for debugging and improving a model).
  • ML vs Traditional Software
    • Many challenges are unique to ML and require their own tools
    • Data and code are separated in traditional software engineering (Separation of Concerns)
    • ML systems are part code, part data and part artefacts created from the two.
    • Systems with the most/best data win → focusing on improving data is sensible
    • Data changes quickly - so ML systems need fast development and deployment cycles
      • Models degrade over time - they're often at their best the moment they go live
    • In ML you need to test and version your data (in SWE you test and version your code)
    • ML models can have millions of parameters and require gigabytes of RAM

Chapter 2: Introduction to Machine Learning Systems Design

  • Business objectives need to be translated into ML objectives. You need to frame your problem so that ML can solve it - and tie the performance of your ML system back to the overall business.
  • Companies care about outcomes - not ML metrics (F1 score, inference latency)
  • Most companies define custom success measures:
    • Netflix Take Rate: number of quality plays / recommendations seen
    • Higher take rate → higher total streaming hours → lower subscription cancellation rates
  • Return on investment in ML depends on maturity stage of adoption.
    • More experienced teams can deploy ML faster - but expect 30 days from idea to production
  • Also for consideration: Reliability, scalability, adaptability and maintainability
    • Reliability: The system can continue to perform at the desired level of performance even in the face of adversity. ML systems can and often fail silently.
    • Scalability: Grow in complexity, traffic, in model count, features. There should be reasonable ways to grow in whatever dimension is most needed. Cloud services are great at auto-scaling. Artefact management is a big part of scaling ML models - as is monitoring and retraining (which needs to be automated at scale).
    • Adaptability: Coping with shifting data distributions. Discovering aspects for performance improvement and allowing updates without service interruption. Data can change quickly so ML systems need to be able to evolve.
    • Maintainability: Set up your process and infrastructure so different contributors are comfortable with the tooling. Code, data and artefacts should be documented and versioned. Models need to be reproducible by other contributors.
  • Developing an ML system is an iterative process - once a system is in production, it'll need to be monitored and updated.
    • Project Scoping → Data Engineering → ML model development → Deployment → Monitoring and continual learning → Business Analysis
  • An ML problem is defined by inputs, outputs and the objective function that guides the learning process.
  • Types of ML Tasks / Problem Framing
    • The output of a model dictates the task type of your ML problem.
    • Most ML tasks are Classification or Regression
      • Classification tasks can be: Binary, Multi-class or Multi-label
      • Multi-class can be low cardinality or high cardinality
    • Classification is putting things into categories (e.g Spam / Not Spam)
    • Regression models output a continuous value (house price prediction)
    • Regression models can be framed as classification problems and vice versa
      • you can quantize a continuous feature into buckets (under 4ft, 4-5ft, 5-6ft, over 6ft tall)
      • if you output a probability that something belongs to a class - that’s regression
    • Classification problems are simpler with few classes. Binary is the simplest form → calculating F1 and visualising confusion matrices is easier with 2 classes
      • High cardinality is when you have many classes (100 or 1000)
      • Hierarchical classification (first classifier puts into sub-groups of classes, second classifier puts into specific class ) can be useful with high cardinality
    • Multi-Label differs from Binary and Multi-class because in Binary and Multi-class each example belongs to just one class.
    • Multi-label classification is hard. Labels tend to be less consistent (as people disagree - this is a strong warning sign). Because you don't know how many categories an example could belong to, it's unclear how to use probabilities (use the highest? use a threshold?)
    • Changing the way you frame the problem - could make it much easier for ML to solve
      • Given the problem of predicting the app a user will open next - you could frame it as:
        • Classification - The input is the user’s features and environment’s features. The output is a distribution over all apps on the phone.
        • Regression - The input is the user’s, the environment’s and the app’s features. The output is a single value between 0 and 1 - the probability they’ll open that app next.
        • Classification is a bad approach - each new app added requires you to retrain the model.
        • The regression structure means you don't have to retrain the model - another app is just another row to compute
  • Objective Functions
    • The objective function (or loss function) guides the learning process and tries to minimise wrong predictions. In supervised ML - the loss can be computed vs ground truth labels using RMSE (root mean squared error) or cross entropy.
    • Popular Loss functions:
      | Task | Loss function |
      | --- | --- |
      | Regression | RMSE or MAE (mean absolute error) |
      | Binary classification | Logistic loss |
      | Multi-class classification | Cross entropy |
    • Decoupling objectives
      • Framing ML problems is hard when you minimise multiple objective functions
      • In a newsfeed - you might want to drive engagement, but also to maintain quality
        • Option 1: combine two losses into one - and train a single model
        • Option 2: train two models and rank posts by combined scores
      • When there are multiple objectives - decouple them first because it makes model development and maintenance easier
        • You can tweak the system without retraining
        • Different objectives might have different maintenance schedules
  • The success of an ML system depends largely on the data it was trained on
    • Focusing on the data is a good way to improve performance
    • Start with building out your data - in quality and quantity
    • Data Science Hierarchy of needs - start at the bottom

      Imagine this as a pyramid - top level first:

      • Advanced ML / AI - deep learning
      • Learn / optimise - A/B testing, experimentation, simple ML algorithms
      • Aggregate / label - analytics, metrics, segments, features, training data
      • Explore / transform - cleaning, anomaly detection, prep
      • Move / store - reliable data flow, infrastructure, pipelines, ETL, structured and unstructured data storage
      • Collect - instrumentation, logging, sensors, external data, user-generated content

Chapter 3: Data Engineering Fundamentals

  • Data models vs databases:
    • Data models → describe the data in the real world
    • Databases → specify how the data should be stored on machines
  • Two types of processing: Transactional and Analytical
  • Historical data / Streaming data
  • Data Sources
    • User input data: is often malformatted and therefore requires more checking and processing. You usually have to process user input data quickly.
    • System generated data: system outputs and logs. Gives visibility into how the system is doing. Log everything you can when building an ML system - but you'll soon have problems finding and storing it all.
    • Internal Databases: data generated by services and applications.
Definitions of 1st, 2nd and 3rd party data
  • First-Party Data → data that your company collects about your users or customers
  • Second-Party Data → a different company collects data on their customers
  • Third-Party Data → a different company collects data on the public (not their customers)
  • Data Formats
    • Data storage considerations: cost, speed, security, human readability, access patterns, text or binary, file size
    • Data serialisation is the process of converting data into a format for storage or transmission
    | Format | Binary/Text | Human readable | Example use cases |
    | --- | --- | --- | --- |
    | JSON | Text | Yes | Everywhere |
    | CSV | Text | Yes | Everywhere |
    | Parquet | Binary | No | Hadoop, Amazon Redshift |
    | Avro | Binary primary | No | Hadoop |
    | Protobuf | Binary primary | No | Google, TensorFlow |
    | Pickle | Binary | No | Python, PyTorch serialisation |
    • JSON (JavaScript Object Notation) is language-independent and easily parsed. Can have different levels of structure. It’s painful to change schema retrospectively. They take up a lot of space too.
  • Row-Major VS Column-Major Format: in a row-major format, consecutive elements of a row are stored next to each other, so reading and writing whole examples (rows) is fast; in a column-major format, consecutive elements of a column are stored together, so reading all values of a feature (column-based reads) is fast. Row-major formats (e.g. CSV) are better for accessing and writing examples; column-major formats (e.g. Parquet) are better for accessing features. Note: pandas DataFrames are column-major by default, while NumPy arrays are row-major by default.
  • Text VS Binary Format: binary files are more compact - e.g. switching from CSV to Parquet can reduce storage by ~6x and make data ~2x faster to unload. The trade-off is that binary files lose human readability.
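  • A small NumPy sketch of the row-major vs column-major difference (timings are machine dependent; this only illustrates the access-pattern effect):

```python
import time
import numpy as np

n = 4_000
c_order = np.random.rand(n, n)          # NumPy default layout: row-major (C order)
f_order = np.asfortranarray(c_order)    # same values, column-major (Fortran order)

def time_row_reads(arr):
    start = time.perf_counter()
    for i in range(arr.shape[0]):
        arr[i, :].sum()                 # read one full row (one "example") at a time
    return time.perf_counter() - start

# Reading whole rows is faster when rows are contiguous in memory.
print("row reads, row-major:   ", round(time_row_reads(c_order), 3))
print("row reads, column-major:", round(time_row_reads(f_order), 3))
```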
  • Data Models: Describe how data is represented. Your model will influence how your systems are built and what problems you can solve.
    • Relational Models:
      • Breaking data into relational tables. Rows and columns that can be shuffled.
      • Often normalised - structuring tables into normal forms to reduce data redundancy and improve data integrity
      • The downside of normalisation is that data is spread across multiple relations - you have to join them back together
      • Query language is the language you use to specify the data you want
      • SQL is a declarative language - you specify the outputs you want, the computer figures out the steps needed for an action and the computer executes these steps to return the outputs.
      • Python is imperative. You specify the steps needed for an action - and the computer executes them to return an output
      • In theory - SQL can be used for any computing problem (Turing complete)
    • NoSQL Models:
      • Relational models have to follow a strict schema.
      • Two major types: Document model and the Graph model
        • Documents for when data is self-contained and relationships are rare
        • Graphs for when relationships are common and important
      • Document Model:
        • Can be a single blob of JSON. Documents are more flexible - each one can have a different schema. Document databases shift responsibility of assuming structures from the writing application to the reading application.
        • Each document has locality (holding all relevant information) making retrieval easy
        • Filtering results for documents with certain attributes is slow - you have to read them all, extract the attribute, then filter. Relational models are faster for this
      • Graph Model:
        • Graph consists of nodes and edges - edges are the relationships between the nodes. The relationships between the data are a priority. It’s faster to retrieve data based on relationships.
      • Queries that are easy to do in one data model are hard in another - picking the right data model can make your life so much easier
  • Structured vs Unstructured Data
    • Structured follows a predefined data model (a.k.a. schema). Pre-defined structure makes data easier to analyse. Disadvantage - you have to commit to a schema in advance, changing it retrospectively is harder.
    • Unstructured data becomes appealing if data models are changing quickly - or you’re reliant on data sources outside of your control.
    • A repository for storing structured data is called a data warehouse.
    • A repository for storing unstructured data is called a data lake.
    • Comparison:

      | Structured | Unstructured |
      | --- | --- |
      | Schema clearly defined | Data doesn't have to follow a schema |
      | Easy to search and analyse | Fast arrival |
      | Can only handle data with a specific schema | Can handle data from any source |
      | Schema changes cause trouble | Schema changes are easy - downstream applications have the issues |
      | Stored in data warehouses | Stored in data lakes |
  • Data Storage Engines (a.k.a. databases) and Processing
    • Data formats and models specify the interfaces for storage and retrieval
    • Two types of workloads that databases are optimised for:
      • Transactional processing: actions are inserted as they’re generated, occasionally updated or deleted if something changes. Fast processing (low latency) and high availability are important. ACID (atomicity, consistency, isolation, durability)
        • Atomicity - if any step in the transaction fails - they all fail
        • Consistency - must follow pre-defined rules (e.g. validation)
        • Isolation - concurrent transactions happen as if they ran one after another (e.g. two users can't book the same seat at the same time)
        • Durability - once a transaction is committed - it remains so (even if the device dies)
        • Systems that don't meet the ACID criteria are sometimes described as BASE (Basically Available, Soft state, Eventual consistency)
        • Transactional databases are often row-major - great for writing new records - but bad for asking ‘What’s the average price of items sold in our London stores?’
      • Analytical processing: Efficient with queries that allow you to look at data from different view points.
    • Today there are databases that are good at both transactional and analytical tasks.
    • Today it is more common to decouple storage from processing. BigQuery, Snowflake, and Teradata have a processing layer that can be optimised for different types of queries.
  • ETL: Extract, Transform, Load
    • Extracted from different sources → transformed to desired format → loaded into target
    • Data validation is performed in extraction. Transformation can include standardisation of value ranges, transposing, deduplicating, sorting, aggregating, deriving new features etc.
    • Loading everything straight into a data lake without a schema is sometimes called ELT (extract, load, transform).
    • As the data in a data lake grows, it becomes harder to search and use
    • Databricks and Snowflake are hybrid - flexibility of data lakes and the data management aspect of a data warehouse.
  • Models of data flow
    • Three main models of data flow (passing data between processes):
      • through databases
        • Easiest option, but requires that both processes have access to the same database. Read/write can be slow - making it unsuitable for low-latency applications
      • through services using requests provided by REST and RPC APIs (POST/GET requests)
        • Often called ‘request-driven’ and coupled with a service-orientated architecture (or micro-services architecture). A service is a process that can be accessed remotely. Structuring an application as separate services allows for independent development, testing and deployment. Also great for when two companies collaborate.
        • Popular requests:
          • REST (representational state transfer)
            • a.k.a. RESTful - CRUD operations map onto HTTP verbs: Create (POST), Read (GET), Update (PUT), Delete (DELETE)
            • retrieving resources filtered by state (e.g. applying a filter) isn't straightforward
            • best for just a simple application
          • RPC (remote procedure call)
            • flexible - great for business rules - designed for actions
            • more scalable in the long run (for different use cases)
      • through real-time transport like Apache Kafka and Amazon Kinesis
        • Request-driven data passing between services is synchronous - and can get slow and complicated if there are too many services. One service can cause all the others to fail too
        • Real-time transport acts as a broker between services - which can either broadcast or listen.
        • We call the pieces of data being transported events. The architecture is called event-driven.
        • Publish-subscribe message brokers (e.g. Apache Kafka, Amazon Kinesis) or message queues (e.g. Apache RocketMQ, RabbitMQ).
        • In a message queue model - events often have intended consumers.
    • Request driven architecture works well for systems that rely more on logic than on data. Event-driven architecture works better for systems that are data heavy.
  • Batch Processing VS Stream Processing
    • Batch processing - jobs that are kicked off periodically
    • Streaming data → uses realtime transport and stream processing (realtime or every few minutes)
      • Stream processing can give low latency.
      • It’s not always less efficient either. It can be scalable as computations can be done in parallel. You can also save compute by doing things as they happen, vs re-running large batch jobs.
    • In ML, batch processing is used to compute features that change less often (like a driver's rating)
      • Batch features - are also known as static features
    • Stream processing is used to compute features that change quickly
      • Stream features - are also called dynamic features.
    • Many ML systems require a mix of static and dynamic features - with the right infrastructure you can join them together to feed into your ML models. You need a stream computation engine to do computation on data streams. Stream processing is harder because the data amount is unbounded and the data comes in at variable rates and speeds - it's easier to make a stream processor do batch processing than vice versa.

Chapter 4: Training Data

  • Data is messy, complex, unpredictable. It can sink your operation.
  • Use ‘data’ not ‘dataset’ as “dataset” implies it’s finite and stationary. Data in production is neither finite nor stationary - expect ‘Data Distribution Shifts’
  • Data is full of potential biases - that arise in collecting, sampling, or labeling.
    • ML models can perpetuate and amplify any human bias in historical training data
  • Use data - but don’t trust it

Sampling

  • Sampling happens throughout the ML project lifecycle
  • It is often impossible or infeasible to process all the data that you have access to (due to time and resources) - sampling helps you accomplish a task faster and cheaper
  • Data scientists often experiment with a subset of data - before training a new model
  • Two families of sampling: Non-probability sampling and random sampling:
Non Probability Sampling (Don’t use for ML models)
  • Types:
    • Convenience → based on availability of data; popular because it's convenient
    • Snowball → future samples are selected based on existing samples (e.g. when scraping, you might have to start on one node to find others)
    • Judgment → experts decide what samples to include
    • Quota → select samples based on quotas for certain slices of data, regardless of the actual distribution; prone to selection bias and driven by convenience
  • Typically not representative of the real-world data and therefore have selection bias
  • A bad idea to train models on data pulled in this way
  • Often driven by convenience
  • Language models are often trained with data that is easily accessible (Wikipedia, Common Crawl, Reddit)
  • Sentiment analysis is often collected from sources with natural labels (Amazon and IMDB) - biased towards those leaving reviews online - not representative of people
  • Self-driving car data comes mostly from Phoenix and California because of favourable legislation there
  • Random Sampling:
    • Simple random → all samples in the population have an equal probability of selection. Easy, but rare categories of data might not make it into your selection
    • Stratified → divide the population into groups (strata) and sample from each group separately. Ensures some examples of rare classes. Not always possible - hard for multi-label tasks
    • Weighted → each sample is given a weight that determines its probability of selection. You can leverage domain expertise, e.g. giving more recent data a higher chance of being selected
    • Reservoir → useful for streaming data. Keep a reservoir of k data points, randomly replacing them as new items arrive from the stream. Every item seen so far has an equal chance of being in the reservoir, and you can stop at any time with a valid sample (see the sketch below)
    • Importance → sample from one distribution when you only have access to another. If a data source is expensive, slow, or infeasible to sample from, sample from a more available source and weight those samples accordingly
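  • A minimal sketch of reservoir sampling (the classic "Algorithm R"):

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # pick an index in [0, i]
            if j < k:
                reservoir[j] = item      # incoming item replaces a reservoir item with probability k/(i+1)
    # At any point you can stop: `reservoir` is a valid uniform sample of everything seen so far.
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```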

Labelling

  • Most ML models are supervised - they need labelled data to learn.
  • Performance of an ML model - depends on quality and quantity of data
  • Two types of labels → Hand labels, or Natural labels.
  • Hand Labels
    • Acquiring hand labels is often difficult, slow and expensive (if subject matter expertise is required). Requires somebody seeing your data - so there are privacy implications.
    • Slow labelling leads to slow iteration speeds and makes your model less adaptive to changing environments and requirements. The longer the process takes, the more your existing model will degrade.
    • If the task changes or the data changes, you have to wait for new labels before updating your model
  • Label Multiplicity - To get enough labelled data you often have to use multiple annotators or even sources. They will have different levels of expertise and accuracy. This leads to label ambiguity or multiplicity - what to do when there are multiple conflicting labels for a data instance?
    • Disagreements among annotators are extremely common. If humans can’t agree on a label - what does human-level performance even mean?
    • Incorporating clear problem definitions and guidance in annotators training can minimise disagreement
  • Data Lineage - keeping track of each data sample's origin and labels. Essential if taking data from multiple sources. Helps you flag bias and debug models.
  • Natural labels
    • Tasks with natural labels can be evaluated by the system.
    • They might have a natural ground truth.
      • E.g. Google Maps knows how long the trip actually took - and they can evaluate how good their prediction was
    • Recommender systems have natural labels (CLICK or NO CLICK)
    • Labels inferred from user actions (clicks and ratings) are known as behavioural labels
    • If you don’t have natural labels - consider adding an optional feedback loop
      • Examples: like buttons, reactions, ‘submit an alternative translation’
    • Companies find it easier and cheaper to start on tasks with natural labels
    • Implicit labels are presumed. E.g. a recommendation that doesn’t get clicked on is presumed to be bad
    • Explicit labels are when users explicitly demonstrate their feedback

Feedback Loop Length

  • Time from prediction to feedback. Recommender systems have short feedback loops. If you’re recommending clothes on Stitch Fix - you won’t get feedback until the items have been tried on days or weeks later
  • User feedback differs by volume, signal strength and feedback loop length.
    • View, click, add to cart, buy, rate, review
  • Fraud has a long feedback loop. You might need leading indicators to detect issues with your ML model

Handling a lack of labels

  • Four ways to cope:
  • Weak supervision → leverages noisy heuristics to generate labels
    • Use subject matter expertise to create heuristics that label your data instead of using hand labels (sometimes called labelling functions or programmatic labelling)
    • Examples: Keywords, regular expressions, database lookups, outputs from other models
    • A small set of hand labels to check your heuristics against is helpful.
    • Usually results in noisy labels
    • Can be used in privacy situations when you can’t see the data
    • Cheap / Fast / Adaptive / Not accurate though
    • Great to use when you’re getting started - to evaluate if it’s worth getting hand labels
  • Semi-supervision → leverages structural assumptions to generate labels
    • Leverages structural assumptions to generate new labels based on a small set of initial labels.
    • Requires seed labels to generate more
    • Self-training → train a model on existing labelled data, then use that model to make predictions (pseudo-labels) for unlabelled samples.
    • Assume that data samples that have similar characteristics should have similar labels. Clustering or k-nearest neighbours.
    • Perturbation → adding small perturbations to a sample shouldn't change its label. Perturbed samples are given the same labels as the unperturbed ones.
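    • A minimal self-training sketch, assuming NumPy arrays and a scikit-learn-style classifier, keeping only high-confidence pseudo-labels (the threshold and round count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=3):
    """Repeatedly pseudo-label the unlabelled pool with the model's confident predictions."""
    X_train, y_train, pool = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_train, y_train)
        if len(pool) == 0:
            break
        probs = model.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold            # only trust high-confidence predictions
        if not confident.any():
            break
        pseudo_labels = model.classes_[probs[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo_labels])
        pool = pool[~confident]                               # remove newly pseudo-labelled samples
    return model
```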
  • Transfer learning → leverages models pre-trained on another task for your new task
    • Zero-shot learning doesn’t require labels, but fine-tuning does
    • A model developed for one task is used as the starting point for a model developed for another task
    • Train a language model to predict the next token in a sequence → could be fine tuned to answer questions
    • Transfer learning is appealing when there isn’t much labelled data
    • Lowers the barriers into ML (GPT-3 could save you tens of millions of USD)
  • Active learning → label the data samples that are most useful to your model
    • Requires labels
    • Improves the efficiency of data labelling.
    • Trying to get models to greater accuracy with less data. Instead of randomly selecting samples to label, train on those that are most helpful to your model (e.g. those the model is uncertain about, hoping it will learn those boundaries better)
    • Or you can use data if you have multiple models and their labels disagree.
    • Active learning is important for real-time systems. Allows your model to learn more effectively in real time and adapt faster to changing environments.
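    • A minimal uncertainty-sampling sketch (least-confident querying) for choosing which samples to send to annotators; it assumes any fitted classifier that exposes `predict_proba`:

```python
import numpy as np

def least_confident_queries(model, X_unlabeled, n_queries=10):
    """Uncertainty sampling: return indices of the samples the model is least sure about."""
    probs = model.predict_proba(X_unlabeled)     # class probabilities for each unlabelled sample
    confidence = probs.max(axis=1)               # confidence in the predicted class
    return np.argsort(confidence)[:n_queries]    # lowest-confidence samples go to human annotators

# Typical loop: fit on the labelled set, query the least-confident samples,
# get them labelled, add them to the training set, and repeat.
```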

Class Imbalance

  • When there is a substantial difference in the number of samples in each class of the training data
    • E.g. 0.01% of X-rays might contain cancerous cells
    • E.g. estimating the 95th percentile of healthcare bills is important - as that’s where the bulk of the cost is
  • Challenges of class imbalance:
    • Deep learning (and ML) works well in situations when the data distribution is more balanced, and usually not so well when the classes are heavily imbalanced because:
      • insufficient signal to detect minority cases
      • models get stuck in non-optimal solutions - using simple heuristics instead of learning useful things
      • asymmetric costs of error - the cost of a wrong prediction on a sample of the rare class can be much higher (missing a rare cancer diagnosis). If your loss function isn’t configured to address this asymmetry, your model will treat all samples the same way
    • Rare events are often more interesting (like in fraud detection, or churn prediction)
    • Three ways to handle class imbalance:
    • Use the right evaluation metrics
      • Model Accuracy and Error Rate (used frequently) are insufficient metrics for tasks with class imbalance because they treat all classes equally. Performance on the majority class dominates the metrics. This is especially bad when the majority class isn’t what you care about.
      • F1, precision, and recall are metrics that measure your model’s performance with respect to the positive class in binary classification problems, as they rely on true positive—an outcome where the model correctly predicts the positive class.
      • Confusion matrix:

        |  | Positive prediction | Negative prediction |
        | --- | --- | --- |
        | Positive label | True positive | False negative |
        | Negative label | False positive | True negative |
      • Precision = True Positive / (True Positive + False Positive)
        • Precision = accuracy of positive predictions
      • Recall = True Positive / (True Positive + False Negative)
        • Recall = proportion of actual positives (in the data) that were predicted
      • F1 = 2 x Precision x Recall / (Precision + Recall)
      • They are all asymmetric metrics - their values change depending on what you call the positive class
      • Classification problems can be modelled as regression problems (instead of returning a class you return the probability of a class). You can then classify based on that probability by setting a threshold. Moving it up and down allows you to increase the true positive rate (also known as recall) while decreasing the false positive rate (also known as the probability of false alarm), and vice versa
        • Plotting true positive rate against false positive rate is the ROC curve. The area under the curve is a measure of how close to perfect the model is.
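      • A quick scikit-learn sketch of these metrics on a small, illustrative imbalanced example:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 0, 1, 0, 0, 0, 1]                      # 1 = rare positive class
y_pred  = [1, 0, 0, 0, 0, 1, 0, 1]                      # hard class predictions
y_score = [0.9, 0.2, 0.1, 0.4, 0.3, 0.7, 0.05, 0.8]     # predicted probability of the positive class

print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP) = 2/3
print("recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN) = 2/3
print("f1:       ", f1_score(y_true, y_pred))
print("roc auc:  ", roc_auc_score(y_true, y_score))     # threshold-free view of ranking quality
```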
    • Data-Level Methods: Resampling
      • Modifying the distribution of the training data to reduce the level of imbalance to make it easier for the model to learn.
        • You can OverSample the minority class or UnderSample the majority class
          • Undersampling →
            • Random removals
            • Tomek link UnderSampling - finds pairs of samples from opposite classes that are similar and removes the one from the majority class. Helps models learn the boundary but might make them less robust.
          • Oversampling →
            • Random duplication
            • SMOTE (synthetic minority oversampling technique) - synthesises novel samples of the minority class by combining existing minority class examples
      • These techniques only work well with low-dimensional data
      • never evaluate your model on resampled data - it will cause your model to overfit to that resampled distribution
      • Undersampling risks losing important data
      • Oversampling risks overfitting on the training data
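      • A hedged sketch of random over/under-sampling of the training split with scikit-learn's `resample` (SMOTE itself is provided by the separate imbalanced-learn package; the data below is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Random over/under-sampling on the *training split only*; evaluate on the original distribution.
train = pd.DataFrame({"feature": np.random.rand(1000),
                      "label": [1] * 50 + [0] * 950})       # 5% positive class

minority = train[train.label == 1]
majority = train[train.label == 0]

# Oversample the minority class (sampling with replacement) ...
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# ... or undersample the majority class instead.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up.label.value_counts().to_dict(), balanced_down.label.value_counts().to_dict())
```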
    • Algorithm-Level Methods:
      • Algorithm-level methods keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
      • Many algorithm-level methods involve adjustment to the loss function (that guides the learning process). Gets the model to prioritise making correct predictions on the more important class. By giving the training instances we care about higher weight, we can make the model focus more on learning these instances
      • Cost-sensitive learning → the loss function is modified to take into account the varying cost of misclassification (but you have to manually define the cost matrix)
      • Class-balanced loss → punish the model for making wrong predictions on minority classes to correct this bias by making the weight of each class inversely proportional to the number of samples in that class, so that the rarer classes have higher weight
      • Focal loss → incentivise the model to focus on learning the samples it still has difficulty classifying. If a sample has a lower probability of being right, it’ll have a higher weight
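      • A small sketch of class weighting with scikit-learn's logistic regression - one way to implement cost-sensitive / class-balanced weighting without touching the data distribution (the explicit weights are illustrative):

```python
from sklearn.linear_model import LogisticRegression

# Weights inversely proportional to class frequency (class-balanced style):
balanced_clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Or supply an explicit, manually chosen cost weighting (cost-sensitive style):
cost_sensitive_clf = LogisticRegression(class_weight={0: 1.0, 1: 20.0}, max_iter=1000)

# Fit as usual, e.g. balanced_clf.fit(X_train, y_train);
# mistakes on the rare class (1) now contribute more to the loss.
```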

Data Augmentation - a family of techniques that are used to increase the amount of training data

  • Used for tasks that have limited training data, such as in medical imaging
  • Augmented data can make our models more robust to noise and even adversarial attacks.
  • Simple Label-Preserving Transformations → e.g. cropping, flipping, rotating, inverting, or erasing part of the image. In NLP, you can replace a word with a similar word. A quick way to double or triple your training data.
  • Perturbation → Neural networks are sensitive to noise. In computer vision adding a small amount of noise to an image can cause a neural network to misclassify it.
    • Using deceptive data to trick a neural network into making wrong predictions is called an adversarial attack
    • Adding noisy samples to training data can help models recognise the weak spots in their learned decision boundary and improve their performance
    • DeepFool finds the minimum possible noise injection needed to cause a misclassification with high confidence

Data Synthesis - creating training data to boost a model’s performance

  • In NLP → templates can be a cheap way to bootstrap your model.
    • Example Template: “Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]”
      • You can then list all possible cuisines, all reasonable numbers of miles, and locations (home, office, landmarks, exact addresses) for each city, you can generate thousands of training queries from a template
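    • A tiny sketch of template-based synthesis with `itertools.product` (the slot values are illustrative):

```python
import itertools
import random

template = "Find me a {cuisine} restaurant within {miles} miles of {location}"

cuisines = ["Vietnamese", "Italian", "Ethiopian"]
miles = [1, 2, 5, 10]
locations = ["home", "my office", "Union Square"]

queries = [template.format(cuisine=c, miles=m, location=l)
           for c, m, l in itertools.product(cuisines, miles, locations)]

print(len(queries))             # 3 * 4 * 3 = 36 synthetic training queries from one template
print(random.choice(queries))
```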

Chapter 5 - Feature Engineering

  • A 2014 paper ("Practical Lessons from Predicting Clicks on Ads at Facebook") showed that the most important factor in developing ML models is having the right features - more important than hyperparameter tuning, a technique that gets more airtime
  • Coming up with new useful features is a big part of the job
  • Data leakage is a subtle yet disastrous problem that has derailed many ML systems in production

Learned Features Versus Engineered Features

  • Choosing what information to use and how to extract this information into a format usable by your ML models is feature engineering.
  • Many hoped deep learning would be the end of handcrafting features - although some can be automatically learned and extracted, we’re still far from the point where all features can be automated.
  • The majority of ML applications in production aren’t deep learning and ML systems need data beyond just text and images.
    • For spam detection you might include the person who made the comment, the reactions, when their account was created, how often they post, how many views, how many threads
    • There are many possible features you could use in your model

Handling Missing Values

  • Not all types of missing values are equal. There are three types of missing values:
    • Missing not at random (MNAR) → Values are missing for reasons related to the values themselves.
    • Missing at random (MAR) → the reason a value is missing is not the value itself, but another observed variable. E.g. people of gender A in this survey don't like disclosing their age.
    • Missing completely at random (MCAR) → there’s no pattern in when the value is missing. E.g. People just forget to fill in that value sometimes for no particular reason. However, this type of missing is very rare. There are usually reasons why certain values are missing, and you should investigate.
  • When encountering missing values, you can either fill in the missing values with certain values (imputation) or remove the missing values (deletion)
    • Deletion → many prefer deletion because it’s easier to do.
      • Column deletion: if a variable has too many missing values, just remove that variable. You might remove important information and reduce the accuracy of your model.
      • Row deletion: if a sample has missing value(s), just remove that sample. This method can work when the missing values are completely at random (MCAR) and the number of examples with missing values is small, such as less than 0.1%. You don’t want to do row deletion if that means 10% of your data samples are removed.
        • Can remove important information that your model needs to make predictions, especially if the missing values are not at random (MNAR)
        • Can create biases in your model, especially if the missing values are at random (MAR)
    • Imputation → If you don’t want to delete missing values, you will have to impute them, which means “fill them with certain values.” Deciding the values is the hard part
      • You could fill with their defaults E.g. an empty string “”
      • You could fill with the mean, median, or mode
      • Both practices work well in many cases, but sometimes they can cause hair-pulling bugs.
      • Avoid filling missing values with possible real values, such as filling a missing number of children with 0 - downstream you can no longer tell "missing" apart from "actually zero"
      • Deletion risks losing important information or accentuating biases.
      • Imputation risks injecting your own bias into and adding noise
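      • A small pandas sketch of deletion vs imputation (the columns and fill choices are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [29, np.nan, 41, 35, np.nan],
                   "income": [52_000, 61_000, np.nan, 48_000, 55_000]})

# Deletion: drop rows with any missing value (reasonable only when few rows are affected and missingness is MCAR).
dropped = df.dropna()

# Imputation: fill with a summary statistic.
imputed = df.fillna({"age": df["age"].median(),
                     "income": df["income"].mean()})

print(dropped.shape, imputed.isna().sum().sum())   # (2, 2) and 0 missing values after imputation
```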

Scaling

  • Models tend to give features with larger values more importance. That's a problem if your model mixes, say, age (values under ~120) with annual income (values that can run into the hundreds of thousands).
  • It’s therefore important to scale features into similar ranges before putting them into a model - this is called feature scaling.
    • This is one of the simplest things you can do that often results in a performance boost
  • Neglecting to do so can cause your model to make gibberish predictions (especially for gradient-boosted trees and logistic regression)
  • Often people scale values to between 0 and 1 or -1 to 1.
  • If your variables follow a normal distribution, normalise them to have zero mean and unit variance (this is called standardisation)
  • ML models tend to struggle with features that have a skewed distribution
    • To help mitigate the skewness, a technique commonly used is log transformation
  • Scaling is a common source of data leakage.
  • Scaling requires global statistics (looking at the entire training data to calculate its min, max, or mean). During inference, you reuse the statistics you had obtained during training to scale new data. If the new data has changed significantly compared to the training, these statistics won’t be very useful. Therefore, it’s important to retrain your model often to account for these changes.
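  • A minimal NumPy sketch of min-max scaling and standardisation, computing statistics on the train split only and reusing them for new data (the values are illustrative):

```python
import numpy as np

X_train = np.array([[20., 30_000.], [45., 80_000.], [60., 52_000.]])   # columns: age, income
X_new   = np.array([[33., 41_000.]])                                   # data arriving at inference time

# Min-max scaling to [0, 1], using statistics from the train split only.
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = (X_train - train_min) / (train_max - train_min)
X_new_scaled   = (X_new - train_min) / (train_max - train_min)          # reuse the train statistics

# Standardisation: zero mean, unit variance (again with train statistics).
train_mean, train_std = X_train.mean(axis=0), X_train.std(axis=0)
X_new_standardised = (X_new - train_mean) / train_std
```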

Discretisation

  • Discretisation is the process of turning a continuous feature into a discrete feature - by creating buckets for given values. The process is also known as binning or quantisation.
  • The author has rarely found discretisation to help.
  • Instead of having to learn an infinite number of possible incomes, our model can focus on learning only a few categories, which is a much easier task to learn. This technique is supposed to be more helpful with limited training data.
  • Categorisation introduces discontinuities at the category boundaries—$34,999 is now treated as completely different from $35,000 → choosing the boundaries of categories might be hard
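  • A small pandas sketch of binning a continuous income feature (the bucket boundaries are an illustrative design choice):

```python
import pandas as pd

incomes = pd.Series([18_500, 34_999, 35_000, 72_000, 145_000])

income_bracket = pd.cut(incomes,
                        bins=[0, 35_000, 100_000, float("inf")],
                        right=False,                        # bins are [low, high)
                        labels=["low", "middle", "high"])

print(income_bracket.tolist())   # ['low', 'low', 'middle', 'middle', 'high']: 34,999 and 35,000 land in different buckets
```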

Encoding Categorical Features

  • Feature crossing combines two or more features to generate new features. Useful for modelling nonlinear relationships between features. Example: combine marital status and number of children into "marriage and children". Essential for models that can't learn or are bad at learning nonlinear relationships, such as linear regression, logistic regression, and tree-based models. It's less important in neural networks, but can still help them learn nonlinear relationships faster. Caution - it can cause overfitting and your models might need more training data.

Discrete and Continuous Positional Embeddings

  • Paper: “Attention Is All You Need” (Vaswani et al. 2017)
  • Positional embedding has become a standard data engineering technique for many applications in both computer vision and NLP.
  • What are Embeddings
    • An embedding is a vector that represents a piece of data.
    • All embeddings generated by the same algorithm are called “an embedding space.”
    • All embedding vectors in the same space are of the same size
  • For language modelling where you want to predict the next token (e.g., a word, character, or subword) based on the previous sequence of tokens - embeddings are useful

Data Leakage

  • Data leakage is the phenomenon when a form of the label “leaks” into the set of features used for making predictions, and this same information is not available during inference.
    • Essentially when labels are leaked into the features during training
  • Leakage is often non-obvious and it can cause models to fail in unexpected and spectacular ways (even if evaluated carefully).
    • It’s common
    • It’s rarely covered in ML curricula
  • Bad examples:
    • patients scanned while lying down were more likely to be seriously ill → the model learned to predict serious covid risk from a person’s position
    • certain hospitals dealt with more serious caseloads → the model used the labels and fonts on those scans to predict covid risk
  • Common Causes of Data Leakage:
  • Splitting time-correlated data randomly instead of by time
    • Often data is time-correlated → the time the data is generated affects its label distribution
    • Correlation can be obvious - as in stock prices
      • Similar stocks move together.
      • To predict the future stock prices → split your training data by time, such as training your model on data from the first six days and evaluating it on data from the seventh day
    • Or non-obvious - listening to a song
      • Depends not only on their music taste + the general music trend that day
      • If an artist dies → people are more likely to listen to that artist that day
    • Split your data by time, instead of splitting randomly, whenever possible.
      • If you have 5 weeks of data - use the first four weeks for the train split, then randomly split week 5 into validation and test splits
  • Scaling before splitting
    • Always split your data first before scaling
    • Use statistics from the train split to scale all the splits
    • Leakage might occur if the mean or median is calculated using entire data instead of just the train split
  • Poor handling of data duplication before splitting
    • If you have duplicates or near-duplicates in your data, failing to remove them before splitting your data might cause the same samples to appear in both train and validation / test splits. Data duplication is quite common in the industry.
    • It can result from data collection or merging of different data sources.
    • It was common with COVID-19 data as researchers combined several datasets that actually had overlapping data
    • Always check for duplicates before splitting
    • If you’re oversampling - do it after splitting.
  • Group leakage
    • Common in object detection tasks that contain photos of the same object taken milliseconds apart → some land in the train split while others land in the test split.
    • You need to understand how your data was generated to avoid this type of data leakage
  • Leakage from the data generation process
    • Example: Type of CT scan machine can leak data on the seriousness of the patient case
    • Detecting this type of data leakage requires a deep understanding of the way data is collected.
    • You have to know about the hospital procedures and the machines.
    • Mitigate the risk by keeping track of the sources of your data and understanding how it is collected and processed
  • Detecting Data Leakage
    • It can happen in many steps: Generating, collecting, sampling, splitting, processing data and feature engineering
    • Measure the predictive power of each feature with respect to the label → then investigate high correlation.
      • Two features can independently contain no leakage - but leak data when combined together. An employee’s start date and end date can reveal tenure.
    • Do ablation studies to measure how important a feature is to your model - remove the feature and assess the drop-off in performance.
    • Watch out for new features improving model performance by large amounts
    • Don’t use your test split for anything other than reporting a model’s final performance → you risk leaking information from the future into your training process.

Engineering Good Features

  • Generally more features are better. Therefore in production the number of features grows over time. But there are downsides too
    • More chance for data leakage
    • Too many features can cause overfitting
    • More features take more memory and compute - increasing cost and slowing things down
    • Inference latency gets worse for online prediction
    • You grow technical debt
  • If a feature doesn’t help a model make good predictions → regularisation techniques like L1 regularisation should reduce that feature’s weight to 0.
    • removing unused features can speed up learning → you can store removed features to add them back later.
      • Store with feature definitions to reuse and share across teams in an organisation.

Feature Importance

  • There are many built-in and open source packages for computing the importance of your features
  • It’s measured by → how much that model’s performance deteriorates if that feature or a set of features containing that feature is removed from the model
    • SHAP is great because it not only measures a feature's importance to the entire model, it also measures each feature's contribution to a specific prediction
  • Often, a small number of features accounts for a large portion of your model’s feature importance
  • Facebook found the top 10 features are responsible for about half of the model’s total feature importance, whereas the last 300 features contribute less than 1%.
image
  • Feature importance techniques are also great for interpretability as they help you understand how your model works
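  • A hedged sketch of one common approach, permutation importance in scikit-learn (SHAP is a separate package that adds per-prediction attributions; the dataset here is just a built-in example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# How much does validation performance drop when each feature is randomly shuffled?
result = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.4f}")
```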

Feature Generalisation

  • Features used for the model should generalise to unseen data - but not all features generalise equally.
  • Measuring feature generalisation is less scientific than measuring feature importance
    • it requires intuition and subject matter expertise on top of statistical knowledge.
  • Coverage is the % of samples that have a value for this feature. The fewer values missing, the higher the coverage. Generally, if a feature appears in a very small percentage of your data, it's not going to be very generalisable.
    • Low coverage features can still be useful especially when missing values are not random
    • Caution: coverage of a feature can differ wildly between the train and test split
  • There’s a trade-off between generalisation and specificity.
    • IS_RUSH_HOUR is more generalisable but less specific than HOUR_OF_THE_DAY

Summary

  • The success of ML systems depends on their features → organisations need to invest time and effort into feature engineering.
  • It requires learning through experience: trying out different features and observing how they affect your models’ performance.
  • You can learn from winning teams of Kaggle competitions too

Basic ML (Chapter 6 Pre-Read)

  • The book recommends reading its basic intro to ML before starting this chapter, so I'll include that here
  • Recommended for a more in-depth introduction to ML…
  • A model is a function that transforms inputs into outputs, which can then be used to make predictions.
    • In traditional programming, functions are given and outputs are calculated from given inputs.
    • In supervised ML, the inputs and outputs are given, which are called data, and the function is derived from data.
      • Given x as input and y as output, you want to learn a function f such that applying f on x will produce y.
      • ML isn’t powerful enough to derive arbitrary functions from data yet → you need to specify the form that you think the function should take
        • Variables in your function learned in the training process are called parameters
    • You need an objective function to evaluate how good a given set of parameters is for a dataset
      • and a procedure to derive the set of parameters best suited for the given data according to that objective, known as a learning procedure.
    • When talking about model selection, most people think about selecting a function form. However, choosing the right objective function and a learning procedure is extremely important in finding a good set of parameters for your model.

Objective Function

  • The objective function (or loss function) is highly dependent on the model type and whether the labels are available. If the labels aren’t available (e.g. unsupervised learning) the objective functions depend on the data points themselves.
    • For k-means clustering, the objective function is the variance within data points in the same cluster. Unsupervised learning is much less commonly used in production.
  • Most algorithms in production are supervised or semi-supervised. Given a set of parameter values, you calculate the outputs from the given inputs, and compare the given function’s predicted outputs (y') to the actual outputs (y).
  • If your model outputs a distribution → a common objective function is cross entropy and its variation.
  • You can modify the objective function to encourage your model to focus on examples of rare classes or examples that are difficult to learn. You can also add regularisers such as L1 and L2 to your loss function to encourage your model to choose parameters of smaller values.
  • Each objective function, together with the data, defines a surface of loss values over the possible parameter values - known as the loss surface
    • if time permits, you should experiment with different objective functions to see how your model’s behaviour changes

Learning Procedure

  • Learning procedures - the procedures that help your model find the set of parameters that minimise a given objective function for a given set of data - are diverse
    • K-means clustering uses an iterative procedure called expectation–maximisation algorithm
    • The most popular family of iterative procedures today is undoubtedly gradient descent, which updates the parameters in the direction that lowers the loss from its current value the most (against the gradient).
    • The function that determines how to update a parameter given a gradient value is called an update algorithm (or optimiser).
    • Good optimisers can both speed up your model's training process and help your model converge to a better set of parameters.
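    • A minimal gradient-descent sketch for linear regression with a mean-squared-error objective (the learning rate and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=200)   # true function: y = 3x + 1, plus noise

w, b, lr = 0.0, 0.0, 0.1                              # parameters and learning rate
for step in range(500):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)                   # d(MSE)/dw
    grad_b = 2 * np.mean(error)                       # d(MSE)/db
    w -= lr * grad_w                                  # the update rule: step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))                       # ≈ 3.0 and 1.0
```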

Machine Learning

| Supervised Learning | Unsupervised Learning | Semi-Supervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Provide input, output and feedback to build the model | Arrives at conclusions and patterns from unlabelled training data | Builds a model through a mix of labelled and unlabelled data, a set of categories, suggestions and example labels | Interprets an environment through a system of rewards and punishments, learned through trial and error, seeking maximum reward |
| Linear regression: sales forecasting, risk assessment | Apriori: sales functions, word associations, search | Generative adversarial networks: audio and video manipulation, data creation | Q-learning: policy creation, consumption reduction |
| Support vector machines: image classification, financial performance comparison | K-means clustering: performance monitoring, searcher intent | Self-trained naive Bayes classifier: natural language processing | Model-based value estimation: linear tasks, estimating parameters |
| Decision trees: predictive analytics, pricing | | | |

Machine Learning

| Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- |
| Machine learns explicitly; data with clearly defined output is given; direct feedback is given; predicts outcome/future; resolves classification and regression problems | Machine understands the data (identifies patterns/structures); evaluation is qualitative or indirect; does not predict/find anything specific | An approach to AI; reward-based learning; learns from positive and negative reinforcement; the machine learns how to act in an environment to maximise rewards |
| Input → training → output | Input → output | Input → output → rewards → loop |

Chapter 6 - Model Development and Offline Evaluation

  • Model development is an iterative process - After each iteration, you’ll want to compare your model’s performance against its performance in previous iterations and evaluate how suitable this iteration is for production.

Evaluating ML Models

  • Deep Learning isn’t going to replace all classical ML algorithms. Even in applications where neural networks are deployed, classic ML algorithms are still being used in tandem.
  • There are many possible ML solutions to any given problem. You need some knowledge of types of problem, and how they’re currently solved. E.g…
    • Text classification problem (classify whether it’s toxic or not)
      • Naive Bayes, logistic regression, recurrent neural networks, and transformer-based models such as BERT, GPT, and their variants.
    • Abnormality detection problem (fraud detection)
      • k-nearest neighbour, isolation forest, clustering, and neural networks
  • Considerations when evaluating a model:
    • Performance: accuracy, F1 score, and log loss
    • How much data are needed
    • How much compute needed
    • Time required to train
    • Inference latency
    • Interpretability (Non-neural network algorithms tend to be more explainable)
  • E.g. a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labeled data to start, it’s much faster to train, it’s much easier to deploy, and it’s also much easier to explain why it’s making certain predictions
  • The ML space is moving quickly - monitor trends at major ML conferences such as NeurIPS, ICLR, and ICML, as well as following researchers whose work has a high signal-to-noise ratio on Twitter
  • Six tips for model selection
    • 1) Avoid the state-of-the-art: there's a big difference between academia and industry. Often the simple solution is a good idea.
    • 2) Start with the simplest models: simple is better than complex.
      • Easier to deploy, allowing for faster validation.
      • Adding complexity step-by-step makes models easier to understand and debug
      • A simple model can be your baseline - a valuable comparison point for assessing more complex models.
    • 3) Avoid human biases in selecting models. Engineers get excited by architectures → spend more time on them. Investing 10x more time or iterations on a certain architecture is going to make comparisons unfair.
    • 4) Evaluate good performance now versus good performance later. Think about where you’ll be in a couple of months from now. Use learning curves (a plot of performance—e.g., training loss, training accuracy, validation accuracy—against the number of training samples it uses) to estimate performance gain from more data.
      • Take into account their potential for improvements in the near future, and how easy/difficult it is to achieve those improvements
    • 5) Evaluate trade-offs
      • False positives vs false negatives. For fingerprint unlocking you might prefer a model that makes fewer false positives (letting the wrong person in is worse than asking the right person to try again)
      • Compute requirement and accuracy. More complex models can be more accurate but take more compute and are therefore more expensive
      • Interpretability and performance trade-off. A more complex model can give a better performance, but its results are less interpretable.
    • 6) Understand your model’s assumptions
      • "All models are wrong, but some are useful" (George Box, 1976)
      • Every model comes with its own assumptions
      • Understanding what assumptions a model makes, and whether your data satisfies those assumptions, helps you evaluate which model works best for your use case
      • Common model assumptions:
        • Prediction assumption → prediction models assume it's possible to predict the output from the input
        • IID → neural networks assume examples are independent and identically distributed (independently drawn from the same joint distribution)
        • Smoothness → supervised ML assumes that if an input X produces an output Y, then an input close to X produces an output proportionally close to Y
        • Tractability → let X be the input and Z the latent representation of X; every generative model assumes it's tractable to compute the probability P(Z|X)
        • Boundaries → a linear classifier assumes decision boundaries are linear
        • Conditional independence → a naive Bayes classifier assumes the attribute values are independent of each other given the class
        • Normally distributed → many statistical methods assume the data is normally distributed

Ensembles

  • Ensemble: A system that uses multiple models. Each model in the ensemble is called a base learner.
  • Ensembling models can lead to better accuracy but it’s less favoured in production because they are more complex to deploy and harder to maintain. They are common when a small performance boost can lead to a huge gain.
  • You can have 3 models predicting the same class (SPAM, NOT SPAM) and take the majority vote. Makes much more sense if the models aren’t correlated.
  • 3 ways to ensemble: Boosting, Bagging and Stacking:
    • Bagging - short for bootstrap aggregating. You create different datasets called bootstraps by sampling with replacement, and train a model (classification or regression) on each bootstrap. Sampling with replacement ensures each bootstrap is created independently from its peers. For regression, the final prediction is the average of all models' predictions; for classification, it's the majority vote.
      • Designed to improve both the training stability and accuracy of ML algorithms.
      • A random forest is an example of bagging: a collection of decision trees constructed with both bagging and feature randomness - each tree can pick only from a random subset of features to use
    • Boosting - uses a chain of classifiers, where sample weights are changed based on how well the previous classifier predicted them. A final classifier is made using a weighted combination of the existing ones:
      1. Train the first weak classifier on the original dataset
      2. Re-weight samples based on how well the first classifier classifies them (misclassified samples are given higher weight)
      3. Train the second classifier on this re-weighted dataset. Your ensemble now consists of the first and the second classifiers.
      4. Samples are weighted based on how well the ensemble classifies them.
      5. Train the third classifier on this re-weighted dataset. Add the third classifier to the ensemble.
      6. Repeat for as many iterations as needed - form final strong classifier as a weighted combination of the existing classifiers—classifiers with smaller training errors have higher weights.
    • Stacking - train base learners from the training data then create a meta-learner that combines the outputs of the base learners to output final predictions. The meta-learner could be majority vote or averaging
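  • A compact scikit-learn sketch of all three ensembling styles on a synthetic dataset (the model choices and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),   # bootstrapped trees, majority vote
    "boosting": GradientBoostingClassifier(random_state=0),          # sequential re-weighted learners
    "stacking": StackingClassifier(                                  # meta-learner over base learners
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```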

Experiment tracking and versioning

  • Keep track of all the definitions needed to re-create an experiment and its relevant artefacts. An artefact is a file generated during an experiment
  • Experiment Tracking: tracking the progress and results
  • Versioning: logging all the details of an experiment for the purpose of possibly recreating it later or comparing it with other experiments
  • Experiment tracking
    • A large part of training an ML model is babysitting the learning processes - many issues can happen when learning
    • Things to track during training:
      • The loss curve corresponding to the train split and each of the eval splits.
      • Model performance on all non-test splits, such as accuracy, F1, perplexity.
      • Log of corresponding sample, prediction, and ground truth label. For ad hoc analytics and sanity check.
      • The speed of your model (steps per second or, if your data is text, the number of tokens processed per second)
      • Memory usage and CPU/GPU utilisation to identify bottlenecks
  • Versioning
    • You need to not only version your code but your data as well. Code versioning is standard in the industry but few do data versioning well.
    • Data versioning is challenging - data is large, so it can't be duplicated or diffed as easily as code
    • There’s still confusion in what exactly constitutes a diff when we version data - and another confusion is in how to resolve merge conflicts.
    • GDPR can make versioning and duplicating complicated
  • Aggressive experiment tracking and versioning helps with reproducibility, but it doesn’t ensure reproducibility
  • Debugging ML Models
    • Especially frustrating for the following three reasons
      • ML models fail silently.
      • It can be frustratingly slow to validate whether the bug has been fixed → you might have to retrain the model and wait until it converges to see whether the bug is fixed, which can take hours
      • Debugging ML models is hard because of their cross-functional complexity. There are many components in an ML system (data, labels, features, ML algorithms, code, infrastructure) that might be owned by different teams
    • Things that can cause an ML model to fail:
      • Theoretical constraints → models make assumptions about data and features; a model can fail if the data it learns from doesn't conform to its assumptions
      • Poor implementation → the bugs are in the implementation of the model; the more components a model has, the more things can go wrong
      • Poor choice of hyperparameters → the model is a good fit, but a poor set of hyperparameters renders it useless
      • Data problems → collection and pre-processing issues
      • Poor feature choice → too many features can cause overfitting or leakage; too few features might lack predictive power
    • Debugging should be both preventative and curative. Healthy practices are needed to minimise opportunities for bugs to proliferate.
    • Tips:
      • Start simple and gradually add more components
      • Overfit a single batch - to make sure it gets to the smallest possible loss. If it’s for image recognition, overfit on 10 images and see if you can get the accuracy to be 100%. If it can’t overfit a small amount of data, there might be something wrong with your implementation.
      • Set a random seed - Setting a random seed ensures consistency between different runs. It also allows you to reproduce errors and other people to reproduce your results.
  • Distributed Training
    • Expertise in scalability requires having regular access to massive compute resources.
    • It’s common to train a model using data that doesn’t fit into memory (CT scans, genome, large language models)
    • Techniques such as gradient checkpointing can help: "for feed-forward models, we were able to fit more than 10x larger models onto our GPU, at only a 20% increase in computation time"
    • Data parallelism → splitting data onto multiple machines, train your model on all of them, and accumulate gradients. This gives rise to a couple of issues.
      • As each machine produces its own gradient, if your model waits for all of them to finish a run—synchronous stochastic gradient descent (SGD)—stragglers will cause the entire system to slow down, wasting time and resources.
      • Spreading your model on multiple machines can cause your batch size to be very big.
    • Model parallelism → different components of your model are trained on different machines. Despite the name, this doesn't mean that the different parts of the model on different machines execute in parallel.
      • Pipeline parallelism is a clever technique to make different components of a model on different machines run more in parallel. The key idea is to break the computation of each machine into multiple parts.
  • AutoML
    • In 2018 Jeff Dean (Google) declared that Google intended to replace ML expertise with 100 times more computational power, introducing AutoML
      • Instead of paying a group of 100 ML researchers/engineers to fiddle with various models and eventually select a suboptimal one, why not use that money on compute to search for the optimal model
    • Soft AutoML: Hyperparameter tuning: A hyperparameter is a parameter supplied by users whose value is used to control the learning process, e.g., learning rate, batch size, number of hidden layers, number of hidden units, dropout probability, β1 and β2 in Adam optimiser, etc.
      • 2018 Paper “On the State of the Art of Evaluation in Neural Language Models” weaker models with well-tuned hyperparameters can outperform stronger, fancier models
      • Despite knowing its importance, many still ignore systematic approaches to hyperparameter tuning.
      • Most ML pipelines have some form of hyperparameter tuning.
      • When tuning hyperparameters, keep in mind that a model’s performance might be more sensitive to the change in one hyperparameter than another - sensitive hyperparameters should be more carefully tuned.
    • Hard AutoML: Involves architecture search → A search space, a performance estimation strategy, a search strategy.
      • Whether it’s architecture search or meta-learning learning rules, the up-front training cost is expensive enough that only a handful of companies in the world can afford to pursue them.
      • Auto ML is likely to improve off the shelf model performance from big companies.
      • More real-world tasks that were previously impossible with existing architectures will become solvable

Four Phases of ML Model Development

  1. Before Machine Learning → Start with non-ML solutions
  2. Simplest ML Models → validate the usefulness, easy to implement and deploy
  3. Optimise Simple Models → different objective functions, hyperparameter search, feature engineering, data and ensembles
  4. Complex Models → Experiment, look for more significant improvements, think about model degradation over time
  • Model Offline Evaluation
    • How do I know that our ML models are any good?
    • Baselines:
      • Random baseline → expected performance of a model that predicts at random
      • Simple heuristic → predictions based on simple heuristics
      • Zero-rule baseline → predicting the most common class (e.g. predicting that a user will next open the app they most commonly open)
      • Human baseline → how your model compares to human experts
      • Existing solutions → ML systems are often designed to replace existing solutions or business logic; compare against those
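    • A quick sketch (my own, assuming a binary churn-style task) of comparing a model against a zero-rule baseline with scikit-learn’s DummyClassifier:

```python
# Zero-rule baseline: always predict the most common class, then compare the real model to it.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("zero-rule F1:", f1_score(y_te, baseline.predict(X_te)))  # likely 0.0 on the rare class
print("model F1:   ", f1_score(y_te, model.predict(X_te)))
```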
  • Evaluation Methods
    • In academia researchers fixate on their performance metrics. In production, we also want our models to be robust, fair, calibrated.
    • Perturbation tests → make small changes (noise) to your clean test inputs to see how the model copes with noisy real-world data. Choose the model that works best on the perturbed data.
    • Invariance tests → certain changes to the inputs shouldn’t change the output (e.g. race information shouldn’t affect a mortgage decision). Better still, exclude sensitive information from the features used to train the model in the first place.
    • Directional expectation tests → changes in inputs should produce predictable directional changes in outputs. If predictions move in the other direction, investigate.
    • Model calibration → perhaps the single most important test of a forecast. Track how often your predictions come true - does that frequency match the probability the model outputs?
    • Confidence measurement → sample-level confidence metrics matter: do you want to show users predictions where the model is highly uncertain?
    • Slice-based evaluation → separate your data into subsets and look at your model’s performance on each subset. You might pay particular attention to accuracy on certain segments (like churn prediction for high-value clients).
  • Simpson’s paradox → a trend appears in several groups of data but disappears or reverses when the groups are combined.
  • Three ways to slice:
    • Heuristics-based → slice your data using domain knowledge of the data and the task at hand.
    • Error analysis → manually go through misclassified examples and find patterns among them.
    • Slice finder → use an automated slice-finding tool.
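  • A minimal heuristics-based slice evaluation sketch (the column names and segments are made up) - compute the metric per slice rather than only overall:

```python
# Accuracy per slice: an overall metric can hide a badly served segment (Simpson's paradox).
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "segment": ["high_value", "high_value", "high_value", "low_value", "low_value", "low_value"],
    "y_true":  [1, 0, 1, 1, 0, 0],
    "y_pred":  [0, 0, 0, 1, 0, 0],
})

overall = accuracy_score(df["y_true"], df["y_pred"])
per_slice = df.groupby("segment").apply(lambda g: accuracy_score(g["y_true"], g["y_pred"]))
print("overall:", overall)     # ~0.67 overall
print(per_slice)               # but high_value customers are served much worse (0.33)
```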

Chapter 7: Model Deployment and Prediction Service

  • Deploy: getting your model running and accessible. During model development, your model usually runs in a development environment.
  • Production means different things in different teams → For teams doing analysis - production might be notebooks and charts. For others it means keeping your models up and running for millions of users a day
  • Production is hard: latency, availability, accuracy, monitoring, alerting, debugging, releasing
  • How a model serves and computes the predictions influences how it should be designed, the infrastructure it requires, and the behaviours that users encounter.
    • Two types of inference (generating predictions): online prediction and batch prediction
    • On the users device (also referred to as the edge) or in the cloud

Machine Learning Deployment Myths

  • Myth 1: You Only Deploy One or Two ML Models at a Time
  • Myth 2: If We Don’t Do Anything, Model Performance Remains the Same
    • ML systems typically degrade over time - but can also suffer from quick distribution shifts
  • Myth 3: You Won’t Need to Update Your Models as Much
    • “How often should I update my models?” It’s the wrong question to ask. The right question should be: “How often can I update my models?”
    • Since a model’s performance decays over time, we want to update it as fast as possible.
  • Myth 4: Most ML Engineers Don’t Need to Worry About Scale
    • A small number of large companies employ the majority of ML engineers, and those companies’ applications operate at scale → most ML engineers do need to care about scale.

Batch Prediction Versus Online Prediction

  • Online or batch prediction is one of the more important decisions.
    • Online prediction (a.k.a. on-demand prediction) is when predictions are generated and returned as soon as requests for these predictions arrive.
    • Batch prediction is when predictions are generated periodically or whenever triggered. The predictions are stored somewhere and retrieved as needed.
      • Netflix might generate movie recommendations for all of its users every four hours, and the precomputed recommendations are fetched and shown to users when they log on to Netflix.
  • Features computed from historical data, such as data in databases and data warehouses, are batch features.
  • Features computed from streaming data—data in real-time transports—are streaming features.
  • Batch prediction (asynchronous) vs online prediction (synchronous):
    • Frequency → batch: periodical (e.g. every 4 hours); online: as soon as requests come
    • Useful for → batch: processing accumulated data when you don’t need immediate results; online: when predictions are needed as soon as a data sample is generated
    • Optimised for → batch: high throughput; online: low latency
  • Online prediction isn’t necessarily less efficient - Batch processing can be wasteful, you might be computing predictions that you don’t need (e.g. for users who won’t use your product before the next run)
  • From Batch prediction to Online Prediction
    • A problem with online prediction is that your model might take too long to generate predictions.
      • Batch prediction is computing predictions in advance and storing them in a database to be fetched when requests arrive. You don’t have to worry about how long a model takes to generate predictions
      • Batch prediction can bypass the latency of complex models (fetching a precomputed prediction is usually much faster than generating one on the fly)
      • Batch prediction makes you less responsive to users’ changing preferences. You also need to know what requests to generate predictions for in advance (a translation system couldn’t anticipate every query)
      • Batch prediction is a workaround for when online prediction isn’t cheap enough or isn’t fast enough.
    • Infrastructure is making online predictions faster and cheaper - it’s becoming the default
  • Two components are required to overcome latency issues of online prediction:
    • A real-time pipeline that can extract streaming features and input them into a model and quickly return a prediction. A streaming pipeline with real-time transport and a stream computation engine can help with that.
    • A model that can generate predictions at a speed acceptable to its end users (milliseconds)
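    • A bare-bones online prediction service sketch (FastAPI; the linear “model” and request schema are placeholders for a real model plus a streaming feature lookup):

```python
# Online (synchronous) prediction: the prediction is computed when the request arrives.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
WEIGHTS = [0.4, -0.2, 0.1]   # stand-in for a trained model loaded at startup

class PredictionRequest(BaseModel):
    features: List[float]    # in a real system, some of these would be streaming features

@app.post("/predict")
def predict(req: PredictionRequest):
    score = sum(w * x for w, x in zip(WEIGHTS, req.features))
    return {"score": score}

# Run with: uvicorn service:app --reload   (then POST JSON to /predict)
```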
  • Unifying Batch Pipeline and Streaming Pipeline → Early ML adopters were leveraging existing batch systems to make predictions. To use streaming features for online prediction they need to build a separate streaming pipeline
    • Having two different pipelines to process your data is a common cause for bugs in ML production.
      • Changes have to be applied carefully to both pipelines
    • Building infrastructure to unify stream processing and batch processing has become a popular topic - major infrastructure overhauls can unify batch and stream processing pipelines by using a stream processor like Apache Flink
    • Companies can use feature stores to ensure the consistency between the batch features used during training and the streaming features used in prediction.
  • Model Compression
    • Three main approaches to reduce its inference latency:
      • make it do inference faster (inference optimisation)
      • make the model smaller (model compression)
      • make the hardware it’s deployed on run faster
    • There are off-the-shelf utilities for model compression - they mostly use four techniques:
    • Low-rank factorisation → replace high-dimensional tensors with lower-dimensional tensors (e.g. compact convolutional filters).
    • Knowledge distillation → a small model (student) is trained to mimic a larger model or ensemble of models (teacher).
    • Pruning → either remove nodes of a neural net or set the least useful parameters to zero; neural nets are over-parameterised. Pruning was originally used for decision trees, where you remove sections of the tree that are uncritical and redundant for classification.
    • Quantisation → reduce a model’s size by using fewer bits to represent its parameters. The most general and commonly used compression method. It reduces memory footprint and improves computation speed, but rounding numbers introduces rounding errors (and can clip distributions). Low-precision training uses quantised data during training; fixed-point inference has become an industry standard.
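    • A sketch of post-training dynamic quantisation in PyTorch (toy model; just to show the mechanics of swapping float32 weights for int8):

```python
# Dynamic quantisation: store Linear-layer weights in int8, compute activations in float.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 128), torch.nn.ReLU(), torch.nn.Linear(128, 2)
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Smaller memory footprint and usually faster CPU inference, at the cost of rounding error.
x = torch.randn(1, 256)
print(model(x))
print(quantized(x))   # outputs should be close to the original, but not identical
```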

ML on the Cloud and on the Edge

  • On the cloud: a large chunk of computation is done on the cloud
  • On the edge: a large chunk of computation is done on consumer devices
    • browsers, phones, laptops, smartwatches, cars, security cameras, robots, embedded devices etc
  • The easiest way is to package your model up and deploy it via a managed cloud service such as AWS or GCP
    • Downsides to cloud deployment
      • Cost: ML models can be compute-intensive, and compute is expensive. The largest consumers can save money by running their own data centres
      • As cloud bills climb more companies are looking for ways to push their computations to edge devices.
  • Edge can be appealing
    • Don’t need a stable internet connection
    • Sensitive data doesn’t have to leave the device
    • Don’t have to transfer large datasets to the cloud in the first place
    • No network latency to worry about (network latency can be a bigger bottleneck than inference latency)
    • You might be able to reduce the inference latency
    • Edge computing makes it easier to comply with regulations, like GDPR
  • But edge devices need to have enough CPU, memory and battery
  • Hardware vendors tend to offer their own libraries that support only a narrow range of frameworks
  • Standard local optimisation techniques:
    • Vectorisation → instead of executing one item at a time, operate on multiple elements contiguous in memory at the same time to reduce latency caused by data I/O
    • Parallelisation → divide an input array (or n-dimensional array) into independent chunks and do the operation on each chunk individually
    • Loop tiling → change the data-access order in a loop to leverage the hardware’s memory layout and cache (hardware dependent)
    • Operator fusion → fuse multiple operators into one to avoid redundant memory access
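  • A small sketch (mine) contrasting an element-by-element loop with the vectorised equivalent in NumPy:

```python
# Vectorisation: process elements that are contiguous in memory in one call
# instead of looping over them one at a time in Python.
import numpy as np

x = np.random.rand(1_000_000)
w = np.random.rand(1_000_000)

# one item at a time
total = 0.0
for i in range(len(x)):
    total += x[i] * w[i]

# vectorised: the same dot product, typically orders of magnitude faster
total_vectorised = float(x @ w)
assert np.isclose(total, total_vectorised)
```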
  • Using ML to optimise ML models
    • Hand-designed heuristics are non-optimal and nonadaptive.
    • We can use ML to explore all possible ways to execute a computation graph, record the time they need to run, then pick the best one.
  • ML in Browsers
    • If you can run your model in a browser, you can run your model on any device that supports browsers.
    • There are tools that can help you compile your models into JavaScript, such as TensorFlow.js, Synaptic, and brain.js.
      • JavaScript is slow, and its capacity as a programming language is limited for complex logic such as extracting features from data.
      • A more promising approach is WebAssembly (WASM), an open standard that allows you to run executable programs in browsers. After you’ve built your models in PyTorch, TensorFlow, etc., you can compile them to WASM and run them in the browser.
      • WASM is faster than JavaScript but still slow compared to running code natively on devices
  • Deploying ML models is an engineering challenge, not an ML challenge.
  • I believe that ML systems will transition to making online prediction on-device

Chapter 8: Data Distribution Shifts and Monitoring

  • A model’s performance degrades over time in production. Once deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues
  • Causes of ML System Failures
    • Google studied 96 cases where a large ML pipeline at Google broke - 60 of the 96 failures happened due to causes not directly related to ML
      • Book Recommendation: Reliable Machine Learning
    • Software Specific Failures:
      • Dependency failure → a software package or codebase your system depends on breaks
      • Deployment failure → failures caused by deployment errors
      • Hardware failure → the hardware your system depends on fails
      • Downtime or crashing → a component of your system is down, causing the whole system to be down
    • ML-Specific Failures:
    • Production data differing from training data
      • it’s essential for the training data and the unseen data to come from a similar distribution
      • The underlying distribution of the real-world data is unlikely to be the same as the underlying distribution of the training data.
      • Second, the real world isn’t stationary. Things change. Data distributions shift. Another common failure mode is that a model does great when first deployed, but its performance degrades over time as the data distribution changes. This failure mode needs to be continually monitored and detected for as long as a model remains in production.
      • Some people have the impression that data shifts only happen because of unusual events, which implies they don’t happen often. Data shifts happen all the time, suddenly, gradually, or seasonally.
        • Gradually because social norms, cultures, languages, trends, industries, etc. just change over time
        • Seasonally because people behave differently at certain times of the year
    • Edge cases
      • An ML model that underperforms on edge cases might not be good enough (driverless cars)
      • Edge cases are the data samples so extreme that they cause the model to make catastrophic mistakes.
      • Outliers refer to data: an example that differs significantly from other examples.
      • Edge cases refer to performance: an example where a model performs significantly worse than other examples.
      • not all outliers are edge cases
      • It can be beneficial to remove outliers as it helps your model to learn better decision boundaries and generalise better to unseen data → but during inference, you don’t usually have the option to remove or ignore the queries that differ significantly from other queries.
    • Degenerate feedback loops
      • Feedback loop: time between a prediction being shown to when feedback on the prediction is provided.
      • A degenerate feedback loop: when the predictions themselves influence the feedback - which - influence the next iteration of the model. When a system’s outputs are used to generate the system’s future inputs, which, in turn, influence the system’s future outputs.
      • Predictions can influence how users interact with the system - those interactions might be used as training data - causing unintended consequences
      • Common in tasks with natural labels from users (recommender systems)
      • Popular movies, books, or songs can keep getting more popular, which makes it hard for new items to break into popular lists.
        • Many different names: “exposure bias,” “popularity bias,” “filter bubbles,” and sometimes “echo chambers.”
      • They can perpetuate and magnify biases embedded in data

      Detecting degenerate feedback loops

      • It’s possible to detect degenerate feedback loops by measuring the popularity diversity of a system’s outputs even when the system is offline.
      • Aggregate diversity / average coverage of long-tail items / hit rate against popularity → can help you measure the diversity of outputs of a recommender system.
        • if a recommender system is better at recommending popular items - it likely suffers from popularity bias
        • If predictions become more homogeneous over time, it likely suffers from degenerate feedback loops

      Correcting degenerate feedback loops

      • Use randomisation → Introducing randomisation into predictions can reduce homogeneity. Show them items the model ranks highly and some random others - let their selections guide future decisions. Can improve diversity - but at the cost of user experience.
      • Use positional features → if the position in which a prediction is shown affects its feedback in any way, encode the position information using positional features. This lets your model learn how much each position influences user actions
  • Data Distribution Shifts
    • Data distribution shift: the phenomenon in supervised learning when data changes over time, which causes this model’s predictions to become less accurate as time passes.
    • Types of Distribution Shifts:
    • Covariate shift → the input distribution P(X) changes, but the conditional distribution of outputs given inputs P(Y|X) stays the same
    • Label shift / prior shift → the output distribution P(Y) changes, but for a given output the input distribution P(X|Y) stays the same
    • Concept shift → same input, different output: P(Y|X) changes. In many cases concept drifts are cyclic or seasonal.
    • Nearly every real world dataset suffers from covariate shift
    • Covariate shifts can be caused by:
      • Biases during data selection
      • Training data is artificially altered to make it easier to learn.
      • Learning process through active learning (instead of randomly selecting samples to train a model on, we use the samples most helpful to that model)
      • usually because of major changes in the environment or in the way your application is used
    • If you know in advance how the real-world input distribution will differ from your training input distribution, you can leverage techniques such as importance weighting to train your model to work for the real-world data.
      • Importance weighting
        • Estimate the density ratio between the real-world input distribution and the training input distribution, weight the training data by this ratio, and train the model on the weighted data (see the sketch below)
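        • A sketch of one common way to estimate the density ratio (not necessarily the book’s exact method): train a classifier to distinguish training data from unlabelled real-world data and use p(real) / p(train) as the weight:

```python
# Density-ratio estimation via probabilistic classification (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(5000, 3))   # training inputs
X_world = rng.normal(0.5, 1.0, size=(5000, 3))   # real-world inputs (shifted)

X = np.vstack([X_train, X_world])
domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_world))]   # 0 = train, 1 = world

clf = LogisticRegression(max_iter=1000).fit(X, domain)
p_world = clf.predict_proba(X_train)[:, 1]
importance_weights = p_world / (1.0 - p_world)   # pass as sample_weight when retraining
```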
  • General Data Distribution Shifts
    • More things that can degrade models:
    • Feature change: such as when new features are added, older features are removed, or the set of all possible values of a feature changes.
    • Label schema change is when the set of possible values for Y change.
    • With classification tasks, label schema change could happen because you have new classes. Classes can also become outdated or more fine-grained.
  • There’s no rule that says that only one type of shift should happen at one time
  • Detecting Data Distribution Shifts
    • Data distribution shifts are only a problem if they cause your model’s performance to degrade.
      • monitor your model’s accuracy-related metrics: accuracy, F1 score, recall, AUC-ROC
      • Having access to labels within a reasonable time window will vastly help with giving you visibility of data shifts
      • When ground truth labels are unavailable or too delayed to be useful, we can monitor other distributions of interest. In industry most detection methods focus on detecting changes in the input distribution - especially the distribution of features.
    • Statistical Methods:
      • Summary statistics: min, max, mean, median, variance, the 5th/25th/75th/95th percentiles, skewness, kurtosis
      • Two-sample tests - to determine whether the difference between two populations (e.g. yesterday’s data vs. today’s) is statistically significant
        • Kolmogorov-Smirnov test (KS test)
        • Least-Squares Density Difference
        • MMD - Maximum Mean Discrepancy
      • It’s more worrying if a difference is detectable from a small sample - it suggests a serious shift
      • Two-sample tests work better on low-dimensional data (reduce dimensionality before performing one)
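      • A sketch of a two-sample KS test on a single feature (synthetic data; the “training” and “serving” windows are made up):

```python
# Two-sample Kolmogorov-Smirnov test: training window vs. serving window for one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)    # reference distribution
serving_feature = rng.normal(loc=0.3, scale=1.0, size=5000)  # current traffic (mean shifted)

result = ks_2samp(train_feature, serving_feature)
if result.pvalue < 0.01:
    print(f"possible covariate shift: KS statistic={result.statistic:.3f}, p={result.pvalue:.1e}")
```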
    • Time scale windows for detecting shifts
      • Shifts can happen across two dimensions. Temporal or Spatial.
      • To detect temporal shifts - you can treat input data as time-series data
        • the timescale you look at affects the shifts you can detect
        • detecting temporal shifts is hard when shifts are confounded by seasonal variation
      • Be cautious of cumulative statistics - as they may contain data from previous time windows
      • Monitor your distributions (e.g. hourly or daily): the shorter the time scale, the faster you can detect changes, but too short a window can lead to false alarms
  • Addressing Data Distribution Shifts
    • Many companies assume that data shifts are inevitable, so they periodically retrain their models—once a month, once a week, or once a day—regardless of the extent of the shift.
    • The optimal frequency to retrain your models is an important decision - many companies still determine based on gut feelings instead of experimental data.
    • To make a model work with a new distribution in production, there are three main approaches.
      1. Train models on massive datasets - hoping that with enough data the model can learn whatever it needs to do well in production
      2. Adapt a trained model to a target distribution without requiring new labels. Heavily under-explored and hasn’t found wide adoption in industry.
      3. Retrain your model using the labeled data from the target distribution. Retraining can mean retraining your model from scratch on both the old and new data or continuing training the existing model on new data. The latter approach is also called fine-tuning.
    • If retraining your model - there are two big questions:
      1. Stateless or Stateful
        • Stateless retraining - train from scratch
        • Stateful training (fine-tuning) - continuing training the existing model on new data
      2. What data to use? last 24 hours, last week, last 6 months, or from the point when data has started to drift?
        • Run experiments to figure out which retraining strategy works best for you.
    • You can design your system to make it more robust to shifts. A system uses multiple features, and different features shift at different rates.
      • When choosing features consider the trade-off between the performance and the stability of a feature: a feature might be really good for accuracy but deteriorate quickly, forcing you to train your model more often.
      • You might also want to design your system to make it easier for it to adapt to shifts. E.g. a separate model for each market, you can update each of them only when necessary
    • Detecting a data shift is hard, but determining what causes a shift can be even harder
  • Monitoring
    • Monitoring: tracking, measuring, and logging different metrics that can help us determine when something goes wrong
    • Categories of: Networking / Machine Performance / Application Performance
    • Examples of: latency, throughput, # prediction requests, % requests return with a 2xx code, CPU/GPU utilisation, memory utilisation, availability, uptime.
      • SLOs: service level objectives
      • SLAs: service level agreements
    • ML Specific Metrics:
      • Four things to monitor: accuracy-related metrics, predictions, features, and raw inputs
      • The further through the ML pipeline, the more transformations the data has gone through, but the more structured it has become
      • Accuracy:
        • User feedback: click, hide, purchase, upvote, downvote, favourite, bookmark, share etc
        • Also collect explicit user feedback where you can (e.g. letting users suggest an alternative translation)
      • Predictions:
        • Low dimensional predictions are easy to monitor and visualise
        • Monitor for distribution shifts using two sample tests
        • Prediction distribution shifts are a proxy for input distribution shifts
        • Monitor for anything odd (e.g. an unusual number of FALSE predictions in a row)
      • Features:
        • Compared to raw input data - features are well structured following a pre-defined schema
        • Feature validation → ensuring all your features follow the expected schema
          • Min, Max, Median values within an acceptable range
          • Value of feature satisfy a regular expression format
          • If all values of a feature belong to a predefined set
          • If the values of a feature are always greater than the values of another feature
        • Table testing or table validation (as features are usually stored in tables)
          • Great Expectations and Deequ are feature validation libraries (a hand-rolled sketch follows after this list)
        • Feature monitoring concerns:
          • Can be expensive if you have 100s of models with 100s of features
          • It’s not that useful for detecting model performance degradation - you could be overwhelmed with alerts
          • Feature processing is often done in multiple steps, with multiple libraries, on multiple services - it can be hard to tell whether a change is caused by a processing error or by a change in the input distribution
          • The schema that your features follow can change over time
      • Raw inputs:
        • Not easy to monitor
        • Sometimes impossible to get access to
        • ML engineers usually only query the data warehouse
        • Might be the responsibility of the ML platform team
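      • The hand-rolled feature validation sketch referenced above (plain pandas checks, not the Great Expectations or Deequ APIs; column names and ranges are made up):

```python
# Minimal feature validation: check that features follow an expected schema
# before they reach the model.
import pandas as pd

features = pd.DataFrame({
    "age": [34, 52, 28],
    "country": ["GB", "US", "FR"],
    "clicks_7d": [3, 0, 12],
})

def validate(df: pd.DataFrame) -> list:
    errors = []
    if not df["age"].between(0, 120).all():
        errors.append("age outside acceptable range")
    if not df["country"].isin({"GB", "US", "FR", "DE"}).all():
        errors.append("country not in predefined set")
    if (df["clicks_7d"] < 0).any():
        errors.append("clicks_7d must be non-negative")
    return errors

print(validate(features) or "all feature checks passed")
```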
    • Monitoring Toolbox: Logs, Dashboards, Alerts
      • Logs: recording events produced at runtime
        • The number of logs grows quickly. Detecting where a problem occurred is harder than detecting that something happened.
        • Tracing helps you find things later and follow threads → each process gets a unique ID that lets you search the logs for it
        • For each event, record all the metadata needed to reconstruct what happened
        • Companies use ML to analyse logs: to detect anomalies and prioritise them
      • Dashboards: visualising relationships
        • Helps spot patterns
        • Makes monitoring accessible to nonengineers
        • Excessive metrics on a dashboard can also be counter-productive (dashboard rot)
      • Alerts: Alerting the right people to suspicious signals
        • Alert Policy: threshold breach for each metric (sometimes over a duration)
        • Notification channel: slack, pager duty, email
        • A description of the alert
          • Helps the alerted person know what’s going on
          • Make the alert actionable by providing instructions or a runbook
      • Alert fatigue is real, demoralising and dangerous.
  • Observability
    • Observability: setting up the system to get visibility into our system to help us investigate when something goes wrong
    • Systems are getting more complex - many components and services.
    • Observability - concept from control theory - bringing better visibility into understanding complex behaviour of software using outputs collected from the system at runtime
    • Observability → assumes the internal state of a system can be inferred from knowledge of external outputs
      • Instrumenting in a way to ensure that sufficient information is collected and analysed
      • Show me all the users for which model a returned wrong predictions over the last hour
    • Monitoring is passive - observability is more active

Chapter 9 - Continual Learning and Test in Production

  • Continual learning is largely an infrastructure problem
  • This usually means updating models in micro-batches - e.g. after every 512 or 1,024 examples
  • Updated models shouldn’t be deployed until validated
  • Only replace the existing model with the new model if the updated one is better
    • Existing model = champion
    • Updated replica model = challenger
  • You don’t have to update models often if you don’t have enough traffic - or your models don’t decay quickly

Stateless Retraining Vs Stateful Training

  • Stateless → training from scratch
  • Stateful → allows you to update your model with less data
  • Most companies do stateless (requires more data)
  • Stateful requires less data, converges faster and requires less compute power
  • With stateful training you might be able to avoid storing data altogether (each sample is used once to update the model)
  • Stateless retraining requires storing historical training data - which isn’t always possible due to privacy concerns; stateful training sidesteps this
  • Most companies doing stateful training will occasionally train their model from scratch too
  • Once your infrastructure is set up to allow both stateless and stateful training, training frequency is just a knob to twist
  • Continual learning → setting up infrastructure so you can update models whenever needed
    • Model Updates: new features, models or architecture
    • Data iteration: refreshing only the data
  • Continual learning can help your model:
    • adapt to data shifts
    • adapt to sudden rare events
    • overcome the continuous cold start problem (making predictions for a new user without historical data)
      • Can be new users, or because somebody switched device, or isn’t logged in
  • Goal - have models adapt to the user within each visiting session (TikTok)
  • Continual Learning Challenges:
    • Getting access to fresh data → often means building a real-time transport pipeline (as data-warehouses are likely to be too slow)
    • Speed at which you can label data → natural labels are the best candidates as they have shorter feedback loops
      • Dynamic pricing, stock price prediction, recommender systems, ad-click through estimation
      • Where behavioural signals become the labels
      • You might be able to use programmatic labelling too
    • Evaluation challenge → Making sure the updated model is good enough to be deployed
      • More you update the more chances there are for it to fail
      • Continual learning makes your models more susceptible to coordinated manipulation and adversarial attack
      • You need to test models before you deploy them. Evaluation takes time - can be another bottleneck for model update frequency.
    • Algorithm challenges → it is easier to adapt models like neural networks to the continual learning paradigm than matrix-based and tree-based models
      • Hoeffding Tree → is an exception
  • Four Stages of Continual Learning in Organisations
    1. Manual - Stateless retraining
      • Focus is on deploying new models.
      • Retraining is manual and ad hoc.
      • Existing models are retrained from scratch (stateless) only when they degrade enough to become a priority for the team
    2. Automated retraining
      • A few models are now in production
      • Maintenance and improvement of existing models is now just as important as working on new models
      • Retraining frequency based on gut feeling
      • Scripts are created to automate the retraining process
      • Some multi-model systems are built so each model can be trained at different frequencies
      • Script automation stages:
        • Pull data, downsample or upsample, extract features, process or annotate labels, kick off the training process, evaluate the new model, deploy it
      • Feasibility of automation relies on: a scheduler, data availability, and a model store
        • Scheduler: a tool that handles task scheduling (e.g. cron jobs)
        • Data availability is likely to take most of your time
        • Model store is needed to version and store your models and their artefacts.
          • E.g. Amazon SageMaker
      • Training is still stateless → expensive if you set higher frequencies
    3. Automated - stateful training
      • Training continues where the previous model left off
      • You need to track your data and model lineage carefully
      • You need real-time transports instead of the data warehouse - mature streaming infrastructure / a streaming pipeline
      • Updated on a fixed schedule set by developers
    4. Continual Learning
      • Instead of a fixed schedule, models are automatically updated whenever data distributions shift and the model’s performance drops
      • Combining continual learning with edge deployment might be the best
      • You need a mechanism to trigger updates
        • Time based, performance based, volume based or drift based.
      • You need a monitoring solution and a strong evaluation pipeline
  • How often to update your models
    • You need to figure out how much you gain from updating your model.
    • Value of data freshness → to gain a sense of the performance gain you can get from fresher data, train your model on data from different time windows in the past and test on data from today to see how the performance changes
    • Model iteration vs data iteration → do both from time to time - the more resources you spend on one approach the fewer resources you’ll have for the other
    • In the beginning - when updating your model is manual and slow do it as often as you can
      • As infrastructure matures - it can be done in hours or minutes the question becomes → how much performance gain would I get from fresher data?
  • Test in Production
    • To sufficiently evaluate your model you need a mixture of offline and online evaluation. The only way to know if a model will do well in production is to deploy it.
    • Offline evaluation has two major test types: test splits and backtests
      • Test Splits → static trusted benchmark to compare models.
        • If the distribution has changed it won’t tell you much
      • Back Test → Try testing on the most recent data you have (last hour).
    • Types of Production Testing:
    • Shadow deployment → deploy the candidate in parallel, route every request to both models, and log the new model’s predictions for analysis. The new model’s predictions aren’t served to users, so it’s very safe - but it’s expensive and doubles compute cost.
    • A/B testing → deploy the candidate model, route a % of traffic to it, and use its predictions; monitor performance and user feedback / behaviour. Make sure the split is randomised and that you run enough volume for differences to be detectable. Book recommendation: Trustworthy Online Controlled Experiments - Ron Kohavi.
    • Canary release → slowly roll out the change to a small subset of users before rolling it out to the entire infrastructure. Deploy → route some traffic → if performance is OK, increase → stop when it serves 100%.
    • Interleaving experiments → expose a user to recommendations from two models at the same time, controlling for the position of recommendations (since position affects the likelihood of a click).
    • Bandits → experiment to see which model has the highest payout over time, routing traffic based on relative model performance. Requires short feedback loops, and uses less data than A/B testing before reaching a decision.
    • You can use contextual bandits to balance showing users items they will like and showing items you want feedback on.
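    • A toy epsilon-greedy “bandit” routing sketch (my own; real systems use more careful algorithms and the reward signal here is assumed to be click / no-click):

```python
# Epsilon-greedy routing between two models based on observed reward.
import random

models = ["champion", "challenger"]
counts = {m: 0 for m in models}
rewards = {m: 0.0 for m in models}
EPSILON = 0.1

def choose_model() -> str:
    if random.random() < EPSILON or not all(counts.values()):
        return random.choice(models)                               # explore
    return max(models, key=lambda m: rewards[m] / counts[m])       # exploit best average reward

def record_feedback(model: str, reward: float) -> None:
    counts[model] += 1
    rewards[model] += reward

# usage: route each request, then log the feedback once it arrives
m = choose_model()
record_feedback(m, reward=1.0)   # e.g. the user clicked
```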

Chapter 10: Infrastructure and tooling for MLOps

  • Infrastructure is the set of fundamental facilities and systems that support the sustainable development and maintenance of ML systems
  • If set up well, infrastructure helps automate processes, reduces the need for specialised knowledge and engineering time, and speeds up the development and delivery of ML applications
    • Setup badly - and it’s hard to use and will slow you down
  • Each company’s infrastructure needs are different:
    • Single ML app → no real infrastructure needed (Jupyter notebooks, Python, pandas). Low investment.
    • Multiple common apps → generalised ML infrastructure works. Medium investment. Most companies doing ML are here.
    • Serving millions of requests per hour → highly specialised infrastructure. High investment.
  • Fundamental facilities that support the development and maintenance of ML systems
    1. Storage and compute → data is collected and stored; the compute layer provides compute for ML workloads such as training models, computing features, and generating predictions
    2. Resource management → tools to schedule and orchestrate workloads to make the most of available resources (Airflow, Kubeflow, Metaflow)
    3. ML platform → tools to aid the development of ML applications: model stores, feature stores, monitoring tools (SageMaker, MLflow)
    4. Development environment → where code is written and experiments are run; code needs to be versioned and tested, experiments need to be tracked
    • Layers nearer the top of this list are more commoditised; layers nearer the bottom are closer to data scientists’ day-to-day work
  • Storage and Compute
    • Storage Layer: data is collected and stored. (Amazon S3 or Snowflake). The storage layer is probably in the cloud - but can be on premises.
    • Compute layer: all the compute resources a company has access to and the mechanism to determine how these resources can be used. Most likely AWS or GCP.
      • The compute layer is usually divided into compute units. A unit can be created for a short-lived job (AWS Step Functions or GCP Cloud Run) and is eliminated after the job finishes.
      • A more permanent compute unit is an ‘instance’
    • Some compute layers abstract away the notions of cores and use other units of computation.
      • Kubernetes uses ‘pods’ → Spark and Ray use ‘job’
    • Compute units have memory (e.g. in GB) and operation speeds (FLOPS or cores)
    • Utilisation = FLOPs actually used by a job ÷ the peak FLOPs the hardware can deliver
      • It’s impossible to run at 100% - 50% might be OK depending on what you’re doing
    • The cloud makes it easy to start building without having to worry about the compute layer. It’s appealing if your company has variable-sized workloads - data science workloads go up and down.
      • Cloud compute is elastic but not magical - you might have limits, or costs might get prohibitive
    • Companies with a huge cloud bill might consider moving workloads back to their own data centres - this is called ‘cloud repatriation’ → but this is hard and requires a big up front investment.
    • Most multi-cloud strategies happen by accident, not by design. In theory it would be nice to leverage the cheapest compute and avoid vendor lock-in, but it’s really hard to move data and orchestrate workloads across clouds.
  • Development Environment
    • Where code is written, experiments are run and interaction with production for testing too.
    • Dev Environment = IDE + Versioning + CI/CD
    • If you do one thing well - make it your development environment - it is where engineers work
    • Versioning is more important for ML because…
      • there’s so much you can change (code, parameters, the data itself)
      • you need to keep track of prior runs to reproduce later on
    • The IDE can be native (VS Code or Vim) or browser based (AWS Cloud9). Many data scientists also use notebooks like Jupyter or Google Colab. Notebooks are stateful - they retain state after runs → you only need to load your data once.
    • Standardise your dev environment (company-wide if you can, if not team-wide).
    • You might want to use a cloud IDE (AWS Cloud9 or Amazon SageMaker Studio)
    • Moving from local dev environments to the cloud has benefits:
      • IT support is easier
      • It makes remote work easier
      • It can help with security
      • Reduces the gap between cloud production environment and your dev environment
        • You might have to move to cloud anyhow as some data can’t be downloaded and stored on local machines.
  • From Dev to Prod: Containers
    • In production you dynamically allocate instances as needed - your environment is stateless.
      • You need to install dependencies using a list of pre-defined instructions
      • Container technology (like Docker) helps with this. A Dockerfile contains instructions to install packages and create an environment in which your model can run.
        • Docker Image: what you get if you run everything in a docker file
        • Docker container: what you get if you run the docker image
        • The docker file is the recipe to construct a mold, from the mold you can create multiple running instances - each is a docker container
      • You can build a docker image either from scratch or from another docker image.
    • You’ll need more than one container - at least one for training and one for Featurising.
    • If you have 100 micro services you might have 100 containers. Container orchestration tools help you manage them (Docker Compose). Kubernetes is a tool that creates a network of containers to communicate and share resources. Helps you spin up more instances when you need more compute / memory and shuts down containers when you don’t need them.
  • Resource Management
    • Resource management used to be about managing finite compute power. With the elasticity of the cloud, the concern has shifted to cost-effectiveness.
    • Engineers’ time is more expensive than compute time, so it typically makes sense to invest in automating what you can
  • Cron, Schedulers, and Orchestrators
    • ML workflows are influenced by their repetitiveness and dependencies
      • repetitive tasks can be scheduled and orchestrated to run smoothly and cost-effectively using available resources
    • Cron → schedules a job to run at a pre-determined time
    • ML workflows might have complex dependencies - with each step depending on the previous.
      • DAG (directed acyclic graph) is a common way to represent workflows and dependencies
    • Schedulers are cron programs that can handle dependencies - leveraging queues to keep track of jobs. Slurm is a popular one. They can also optimise for resource utilisation.
      • Google’s Borg estimates how many resources a job will need and reclaims unused resources for other jobs
    • Orchestrators are concerned with where to get the resources for jobs: they can increase the number of instances in a pool that needs more, provisioning extra machines to handle the workload. Kubernetes is an orchestrator.
  • Data Science Workflow Management
    • Workflows can be defined using either code (Python) or configuration files (YAML). Each step in a workflow is called a task.
    • Almost all workflow management tools come with schedulers: Airflow, Argo, Prefect, Kubeflow, and Metaflow (a sketch of a simple Airflow DAG follows below).
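    • A sketch of a daily retraining workflow expressed as an Airflow DAG (the task names and empty task bodies are placeholders; newer Airflow versions use `schedule=` instead of `schedule_interval=`):

```python
# A daily retraining workflow as an Airflow DAG: each task depends on the previous one.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_data(): ...
def extract_features(): ...
def train_model(): ...
def evaluate_model(): ...

with DAG("daily_retrain", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="pull_data", python_callable=pull_data)
    t2 = PythonOperator(task_id="extract_features", python_callable=extract_features)
    t3 = PythonOperator(task_id="train_model", python_callable=train_model)
    t4 = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
    t1 >> t2 >> t3 >> t4   # the dependency chain forms a directed acyclic graph
```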

ML Platform

  • Once you are spending a lot of time on feature management, model management and monitoring across multiple teams - it might be time for an ML platform team.
    • The more ML tools you have - the more you have to gain from standardisation and reducing your support costs
  • Considerations → does it play well with your cloud provider, is it open source or a managed service?
  • Model Deployment
    • Once trained and tested you want to make your model available to users
    • Push your model and its dependencies to a location accessible in production and expose it as a service / endpoint
    • Deployment services can help: AWS Sagemaker, GCP Vertex AI, Azure ML.
    • Consider if it can do online and batch prediction
    • Check that you can easily verify the quality of the models you deploy
  • Model Store
    • They sound simple, so many companies dismiss them
    • To help with debugging and maintenance, it’s important to track as much information associated with a model as possible
    • Things to track
      • Model definition
      • Model parameters
      • Featurise and predict functions
      • Dependencies
      • Data
      • Model Generation Code
        • Frameworks, training steps, how train/valid/test splits were created, the number of experiments run, the range of hyper-parameters considered, the actual set of hyper-parameters that final model used
      • Experiment artefacts
      • Tags (to help with model discovery)
    • MLflow is the most popular model store that isn’t associated with a major cloud provider (see the logging sketch below)
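    • A minimal sketch of logging a run to MLflow (the toy model, parameter, and tag are made up) so the model and its metadata land in a model store:

```python
# Log a model plus its parameters, metrics, and tags to MLflow's local tracking store.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    mlflow.log_param("C", model.C)                       # hyperparameters used
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.set_tag("use_case", "churn")                  # tags help model discovery
    mlflow.sklearn.log_model(model, artifact_path="model")   # model definition + dependencies
```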
  • Feature Store
    • Three problems they aim to solve: feature management, feature computation, feature consistency
    • Feature management → many ML models with many features, and some features are useful for other models. Feature stores help teams share and discover features, and manage sharing settings for each feature. Like a feature catalogue.
    • Feature computation → feature engineering logic, once defined, needs to be computed: actually looking into your data and computing the features. A feature store can perform feature computation and store the results - like a mini data warehouse.
    • Feature consistency → unify the logic for batch features and streaming features, ensuring consistency between features during training and features during inference.
    • SageMaker and Databricks have their own feature stores.
  • Build vs Buy
    • At the limit:
      • Outsource all ML to a company that provides E2E ML applications. Then the only thing you need is to move data.
      • Build and maintain all your infrastructure in-house, even having your own data centres
    • Considerations:
      • The stage your company is at: start with vendor solutions before investing in your own
      • Is it the focus of your company? Your competitive edge? If it is then build.
      • The maturity of the available tools → in the early days of ML adoption, the pioneers had to build their own tooling because no available solutions were mature enough.

Chapter 11: The Human Side of Machine Learning

  • UX considerations for ML systems
    • they are probabilistic instead of deterministic
    • they are mostly correct but we can’t tell when
    • latency can be an issue
  • Inconsistency in experience can be a hindrance. There can be a ‘consistency-accuracy’ trade-off, as the most accurate recommendations might not be the most consistent ones.
  • Mostly correct predictions are OK for those users who can easily correct them → they aren’t useful if users don’t know how to or can’t correct responses.
  • You can show users more predictions - and allow them to choose. Predictions should be rendered in a way that a non-expert can evaluate them.
    • Human-in-the-loop → humans picking the best predictions or making the ultimate decision
  • Smooth Failing → If a model takes too long to respond - you can fall back to a basic heuristic (or cached or precomputed predictions).
    • ‘Speed-accuracy’ tradeoff → a model might have worse performance than another model but can do inference much faster. If latency is crucial, a faster less accurate model might be preferred.

Team Structure

  • ML systems don’t work without subject matter expertise. You need it throughout the process, not just in the labelling phase
  • SMEs help with: problem formulation, feature engineering, error analysis, model evaluation, reranking predictions, and user interface design (how best to present the results)
  • Think about how to explain the ML algorithm’s limitations and capabilities to the user
  • No-code / low-code solutions enable subject matter experts to take the reins

End-to-End Data Scientists

  • There’s a lot of infrastructure work in ML systems - should the data scientist do it all?
  • Having a separate team manage production:
    • Makes hiring easier, as each role needs a narrower set of skills
    • Makes life easier for each person involved (they only have to focus on a single concern)
    • Drawbacks:
      • Communication overhead, debugging challenges, finger-pointing, narrow context, missed end-to-end optimisation opportunities
  • Data Scientists own the entire process
    • It’s a lot of ground to cover for a data scientist
    • Might end up doing too much production, not enough data science
    • Infrastructure requires a very different set of skills - in practice the more time you spend learning one the less time you’re spending on learning the other
    • For data scientists to own the entire process -they need great tooling / infrastructure - to abstract them away from thinking about containerisation and distributed processing.

Responsible AI

  • Designing, developing and deploying AI systems with good intention and sufficient awareness to empower users, to engender trust and ensure fair and positive impact to society
    • Fairness, privacy, transparency, accountability
  • Book recommendation: Weapons of Math Destruction
  • Concretely implementing ethics, safety, and inclusivity into your ML systems
  • image
  • The AI incident database → logging the incidents that have come into public awareness.
  • Ofqual example (automatically estimating grades for students in the UK):
    • Failure to set the right objective → the objective wasn’t grading accuracy for individual students but fitting the predicted grade distribution of each school. Bad news if you performed well at a historically weak school!
    • Failure to perform fine-grained evaluation to discover potential biases → there wasn’t enough data for small schools, so they defaulted to teacher-assessed grades. Those were higher, and the smaller schools were disproportionately private.
    • Failure to make the model transparent → failed to make aspects of their auto-grader public before it was too late. Didn’t open themselves up for scrutiny
  • Strava revealing the locations of military bases and patrol routes
    • US personnel likely didn’t know they were sharing their data with Strava, as sharing was on by default and the privacy settings were unclear.

A Framework for Responsible AI

  • Discover Sources of Model Biases
    • Training Data → must be representative of the real world
    • Labelling → human annotators can encode biases that your model will then scale
    • Feature engineering → Don’t use sensitive information that you don’t want the model to learn on.
    • Model objective → pick an objective that enables fairness - is your model doing better for certain groups?
    • Evaluation → are you performing adequate, fine grained evaluation to understand your models performance on different groups?
  • Understand the limitations of a data-driven approach
    • Don’t rely on data too much - put in effort to understand your blind spots and bias
    • Cross over disciplinary and functional boundaries
  • Understand the trade-offs between different desiderata
    • Improving one property can cause other properties to degrade. E.g…
      • Privacy vs accuracy tradeoff → differential privacy - the higher the level of differential privacy the lower the model accuracy will likely be
      • Compactness vs fairness trade-off → you can prune a model’s size while maintaining high overall accuracy, but accuracy on outliers and underrepresented groups can suffer disproportionately
  • Act early - don’t bypass ethical issues to save cost and time. Surface risks early and delay deployment if you need to.
    • The earlier you can think about responsibility the better
    • NASA - the cost of errors goes up by an order of magnitude at every stage of your project lifecycle
  • Create model cards, documenting:
    • Model details → basic information about the model: who developed it, model date, model version, model type, information about training algorithms, parameters, fairness constraints or other approaches, features, paper or other resources, citation details, licence, where to send questions or comments
    • Intended use → use cases envisioned during development: primary intended uses, primary intended users, out-of-scope use cases
    • Factors → demographic or phenotypic groups, environmental conditions, technical attributes, etc.: relevant factors, evaluation factors
    • Metrics → model performance measures, decision thresholds, variation approaches
    • Evaluation data → where possible: datasets, motivation, preprocessing
    • Training data → where possible, mirror the evaluation data details
    • Quantitative analyses → unitary results, intersectional results
    • Ethical considerations, caveats, and recommendations
  • Establish a process for mitigating biases (e.g. Google Responsible AI)
  • Stay up-to-date on responsible AI