ICML2020_Machine Learning Production Pipeline

Consideration to make before starting your Machine Learning project

Machine Learning Production Pipeline

Project Flow and Landscape

Chip Huyen | @chipro

Snorkel AI | snorkel.ai

[email protected]

07/17/2020

My background

Writing

Product

AI/ML

Cốc Cốc browser

20M+ monthly active users

Baomoi.com

acquired by VNG

Youth Asia

acquired by Groupon

Table of Contents

Research vs production

Data pipeline

Modeling & training

Serving

Landscape

Slides posted on Twitter @chipro!

Research

vs

Production

Research: train many times, serve few.

Production: train few times, serve many.

It’s necessary for datasets in research to be static so that we can benchmark/compare models

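Performance: SOTA (research) vs. better than simpler models (production)

Priority: fast training (research) vs. fast inference (production)

Data: static (research) vs. constantly shifting (production)

Fairness: good to have, sadly (research) vs. important (production)

Interpretability*: good to have (research) vs. important (production)

Complexity: acceptable (research) vs. impractical (production)

Hard part: modeling (research) vs. everything else (production)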

ML Production Pipeline: Iterative

Project setup

Data pipeline

Modeling & training

Serving

Research: different kind of iterative

After examining the available data, you realize it’s impossible to get the data needed to solve the problem you previously defined, so you have to frame the problem differently.

After training, you realize that you need more data or need to re-label your data.

After serving, the data distribution changes and you need to add more classes.

Data Pipeline

Data pipeline

Deep learning is driven by data

Companies with best data win

Proprietary

“Eye-off”

Machine Learning System Design (Chip Huyen, 2019)

Talents join companies for the access to unique datasets

Andrej Karpathy (2018)

Data challenges

Research data: clean, static, known quirks

Production data: noisy, missing values, missing labels, unprocessed, constantly changing, unknown quirks

Examples of known quirks: NaN values, known typos, known weird spellings (Gutenberg), one tokenizer known to work better than another

Data pipeline

Data availability and collection

User data*

Storage

Data preprocessing & representation

Versioning

Verification

Concerns

Privacy: What privacy concerns do users have about their data? What anonymizing methods do you want to use on their data? Can you store users’ data on your servers, or can you only access it on their devices?

Biases: What biases might be present in the data? How would you correct them? Are your data and your annotation inclusive? Will your data reinforce current societal biases?

Data pipeline

Data availability and collection

What kind of data is available? How much?

How often does the new data come in?

Is it annotated?

If not, how hard/expensive is it to get it annotated? Do you need domain experts?

User data

What data do you need from users?

How do you collect it? Are you allowed to?

How do you get users’ feedback on the system?

How do you use that feedback?

Storage

Cloud? On-prem? Users’ devices?

Does a sample fit into memory?

Data preprocessing & representation

Feature engineering? Feature extraction?

What to do with missing data?

What to do with class imbalance?

What if train and test data come from different distributions?

How to combine multimodal data?
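One common answer to the class-imbalance question is to weight each class inversely to its frequency when computing the loss. The sketch below is only an illustration of that heuristic, not something taken from the slides; the function name `balanced_class_weights` and the toy labels are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Hypothetical helper for illustration. Rare classes get larger weights,
    so their errors count more in a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, made-up label set: 95 negatives and 5 positives.
labels = ["negative"] * 95 + ["positive"] * 5
weights = balanced_class_weights(labels)
print(weights)
```

Reweighting is only one option; resampling or collecting more examples of the rare class may suit a given problem better.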

You can’t just feed raw data to models. Pretrained embeddings?

Versioning

How to go back to a previous version of data?

If label schema changes, your model will be outdated.

Git doesn’t work well with large binary files
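A lightweight way to tell data versions apart, whatever the file format, is to record a content hash of each dataset file next to the model that was trained on it. This is a minimal sketch of that idea, not a replacement for a dedicated data-versioning tool; `dataset_fingerprint` and the file name `labels.bin` are hypothetical.

```python
import hashlib
import os
import tempfile


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binary files never need to fit in memory.

    Hypothetical helper: the hex digest serves as a version identifier.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels.bin")  # made-up file for the demo

    with open(path, "wb") as f:
        f.write(b"cat,dog,cat")
    v1 = dataset_fingerprint(path)

    # Simulate a label schema change: the fingerprint changes with the data.
    with open(path, "wb") as f:
        f.write(b"cat,dog,bird")
    v2 = dataset_fingerprint(path)

print(v1 != v2)
```

Storing the fingerprint with each trained model makes it possible to detect that the data changed, although going back to a previous version still requires keeping the old files somewhere.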

Verification

How to know that your data is correct, fair, and sufficient?

Concerns

Bias

Privacy

Regulation compliance

Data: ethical concerns

Who owns the data?

How was it collected?

Do people consent to their data being used?

Does it contain identifiable information?

Can you share the data with annotators off-prem?

Are you allowed to commercialize a model trained on it?

Modeling & Training

What is taught in most ML courses

Often the easier part*

xkcd

Model Selection

Don’t: follow buzzwords

Do: choose the simplest, not the fanciest, model that can do the job

Be solution-oriented, not technique-oriented

Everyone wants to use BERT

Baselines

Random baseline

Human baseline

Oracle

Simple heuristics

Not talked about: how to choose a metric

Don’t underestimate good heuristics
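To make the random and simple-heuristic baselines concrete, here is a small sketch that scores both on a toy dataset. The data, the class names, and the helper functions are all made up for illustration; a real model would need to beat these numbers on the project's own metric.

```python
import random
from collections import Counter


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def random_baseline(n, classes, seed=0):
    """Predict a uniformly random class for each of n examples."""
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in range(n)]


def majority_baseline(train_labels, n):
    """Simple heuristic: always predict the most frequent training class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Toy, imbalanced labels (assumed for the example).
train_labels = ["ham"] * 90 + ["spam"] * 10
test_labels = ["ham"] * 45 + ["spam"] * 5
classes = sorted(set(train_labels))

random_acc = accuracy(random_baseline(len(test_labels), classes), test_labels)
majority_acc = accuracy(majority_baseline(train_labels, len(test_labels)), test_labels)
print(f"random: {random_acc:.2f}, majority: {majority_acc:.2f}")
```

On imbalanced data like this, the majority heuristic already scores high on accuracy, which is one reason the choice of metric matters as much as the choice of baseline.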

If your model’s performance is low, just choose an easier baseline (jk)

“If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.”

Martin Zinkevich, Google

Deep Learning Catch-22

Need data to develop a model

Can’t collect data without a model

Deep Learning in Production Catch-22

Want to test DL potential without much investment

Can’t get good performance without investing money/time in data labeling

Solution

Weakly-supervised (Snorkel AI)

Unsupervised (moonshot)
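The weakly-supervised route can be illustrated with labeling functions: small heuristics that each vote on a label or abstain, combined into noisy training labels. The sketch below combines votes by simple majority, which is an assumption made for brevity and is not Snorkel's API or its label model; every function name and rule here is hypothetical.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


# Hypothetical labeling functions: each encodes one cheap heuristic.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_reply(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Claim your prize at http://example.com now",
    "ok thanks",
    "See you at the meeting tomorrow morning",
]
weak_labels = [weak_label(t) for t in texts]
print(weak_labels)
```

Examples on which every function abstains stay unlabeled, so coverage grows only as more heuristics are written.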

Debugging

Peak of my career

Why debugging for ML is hard

Blackbox (can’t debug a program if you don’t understand it)

Invisible bugs

Many factors can cause a model to perform poorly

Reasons a model performs poorly

Theoretical constraints

wrong assumptions

poor model/data fit

Poor implementation

Sloppy training techniques

Calling model.train() instead of model.eval() during evaluation
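Why the mode matters: layers such as dropout behave differently in training and evaluation, so evaluating in training mode gives noisy, non-repeatable outputs. The toy class below imitates that switch in plain Python so the effect is visible without any framework; `ToyDropout` is a hypothetical stand-in, not PyTorch code.

```python
import random


class ToyDropout:
    """Hypothetical stand-in for a dropout layer with train/eval modes."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if not self.training:
            # Evaluation mode: deterministic pass-through.
            return list(xs)
        # Training mode: zero each value with probability p, rescale the rest.
        keep = 1.0 - self.p
        return [x / keep if self.rng.random() < keep else 0.0 for x in xs]


layer = ToyDropout(p=0.5)
inputs = [1.0, 2.0, 3.0, 4.0]

layer.train()
noisy_a, noisy_b = layer(inputs), layer(inputs)  # may differ between calls

layer.eval()
clean_a, clean_b = layer(inputs), layer(inputs)  # always identical
print(noisy_a, noisy_b, clean_a, clean_b)
```

A bug of this kind is invisible in the sense the slides describe: nothing crashes, the metrics are just quietly worse and vary from run to run.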

Poor choice of hyperparameters

One set of hyperparameters can give SOTA results while another doesn’t converge

random seed

Data problems

mismatched inputs/labels

over-preprocessed data

noisy labels

Scaling is crucial as models are ...

Becoming bigger: the model can’t fit in memory

Model parallelism

Using more data: the data can’t fit in memory

Data parallelism
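The core of data parallelism can be sketched in a few lines: each worker computes the gradient on its own shard of the batch, and the per-worker gradients are averaged. The example below assumes a one-variable linear model with mean squared error and equally sized shards, and it runs the "workers" sequentially; all names are hypothetical.

```python
def gradient(w, b, xs, ys):
    """Gradient of mean squared error for the model y = w * x + b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def data_parallel_gradient(w, b, xs, ys, n_workers):
    """Split the batch into equal shards, compute one gradient per shard,
    then average them (the job an all-reduce does across real devices)."""
    shard = len(xs) // n_workers  # assumes len(xs) is divisible by n_workers
    grads = [
        gradient(w, b, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    dw = sum(g[0] for g in grads) / n_workers
    db = sum(g[1] for g in grads) / n_workers
    return dw, db


# Toy data (made up): y = 3x + 1.
xs = [float(i) for i in range(8)]
ys = [3.0 * x + 1.0 for x in xs]

full = gradient(0.5, 0.0, xs, ys)
sharded = data_parallel_gradient(0.5, 0.0, xs, ys, n_workers=4)
print(full, sharded)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, so no worker ever needs to hold the whole batch.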

Using more GPUs: large batch sizes, stale gradients

LARS - Layer-wise Adaptive Rate Scaling

Training with large batch sizes

Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments (Boris Ginsburg et al., 2019)
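As a rough illustration of layer-wise adaptive rate scaling, the sketch below gives each layer its own learning rate proportional to the ratio of its weight norm to its gradient norm. It follows the commonly stated form of the LARS update without momentum; the default values, the `eps` term, and the function names are assumptions made for this example.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def lars_step(layers, grads, global_lr=0.1, trust_coef=0.001,
              weight_decay=0.0, eps=1e-9):
    """One LARS-style update (no momentum). Each layer is a flat list of weights.

    local_lr = trust_coef * ||w|| / (||g|| + weight_decay * ||w||)
    """
    new_layers = []
    for w, g in zip(layers, grads):
        w_norm, g_norm = l2_norm(w), l2_norm(g)
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
        new_layers.append([
            wi - global_lr * local_lr * (gi + weight_decay * wi)
            for wi, gi in zip(w, g)
        ])
    return new_layers


# Two toy layers whose gradient scales differ by a factor of 100.
layers = [[3.0, 4.0], [0.3, 0.4]]
grads = [[0.6, 0.8], [60.0, 80.0]]
updated = lars_step(layers, grads)

# Size of each layer's step relative to the size of its weights.
relative_steps = [
    l2_norm([a - b for a, b in zip(w, u)]) / l2_norm(w)
    for w, u in zip(layers, updated)
]
print(relative_steps)
```

Both layers move by the same fraction of their weight norm even though their gradients differ by two orders of magnitude, which is the property that helps keep large-batch training stable.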

a single DGX-1 with 8 NVIDIA V100 GPUs

Serving

Model compression

Large models are slow/costly for real-time inference

Mobile/edge devices
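The slides do not name a specific compression method, so as one example, here is a sketch of symmetric 8-bit quantization: weights are stored as small integers plus a single scale factor, cutting storage to roughly a quarter of 32-bit floats at the cost of a small rounding error. The function names and the toy weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.52, -1.27, 0.003, 0.9]  # made-up values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Other compression families, such as pruning and knowledge distillation, trade accuracy for size in different ways and can be combined with quantization.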

Model compatibility

Framework used in development might not be compatible with consumer devices

CI/CD

ML tests take a long time

Monitoring & analysis

When to update your model?

How?
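One possible trigger for a model update is a detected shift in the input data. The sketch below compares the mean of a live window of one feature against the training-time reference, measured in standard errors. The threshold of 3.0, the function names, and the numbers are all assumptions; a real monitor would track many features and also model outputs.

```python
import math
import statistics


def mean_shift_score(reference, live):
    """How many standard errors the live mean sits from the reference mean."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / math.sqrt(len(live))
    return abs(statistics.fmean(live) - ref_mean) / standard_error


def needs_review(reference, live, threshold=3.0):
    """Flag the feature when the shift exceeds the (assumed) threshold."""
    return mean_shift_score(reference, live) > threshold


# Made-up values of one feature at training time and in two serving windows.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable = [10.1, 9.9, 10.3, 9.7]
shifted = [13.0, 12.5, 13.4, 12.9]

stable_flag = needs_review(reference, stable)
shifted_flag = needs_review(reference, shifted)
print(stable_flag, shifted_flag)
```

A flag like this only says that the inputs changed; whether to retrain, relabel, or add classes is still a judgment call.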

Landscape

What I learned from looking at 200 machine learning tools (huyenchip.com, 2020)

https://huyenchip.com/2020/06/22/mlops.html

Thank you!

Published 11/04/2024, 03:06:59
