The Machine Learning Life Cycle: how to run an ML project

Category: [Machine learning & Statistics]

2019/01/20

I recently came across this page in the DataRobot Artificial Intelligence Wiki. If you don't already know, DataRobot is currently one of the top automated machine learning platforms on the market, with an emphasis on supervised learning and citizen data science. I am quite a big fan of their platform - even though I don't use it in my work, I believe that they and their competitors are heading in the right direction with automated machine learning. In any case, this is not my intended topic for today.

If you have worked on data science projects before, odds are you have heard of the term CRISP-DM, short for CRoss Industry Standard Process for Data Mining. CRISP-DM was developed by a consortium of five companies, including Teradata, in 1997, though it's now largely associated with IBM and SPSS.

The CRISP-DM Process

The official CRISP-DM manual is this 50-page document, and I would be lying if I said I had read it. Nonetheless, CRISP-DM is intuitive enough for me to use in my work to scope and run data science projects. There are multiple ways to use CRISP-DM, such as scoping and costing man-hours, and setting milestones and success criteria. So naturally, I was intrigued by DataRobot's version of CRISP-DM and decided to look a little deeper.

How to run an ML project - a hypothetical walkthrough

In essence, I will also use this post to illustrate, hypothetically, how a typical data science project can be run and managed.

Running a project using the DataRobot Machine Learning Cycle - 5 major steps

There are five major steps in the DataRobot (DR) Machine Learning Cycle:

  1. Define project objectives
  2. Acquire and explore data
  3. Model data
  4. Interpret and communicate
  5. Implement, document and maintain
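
For project management purposes, the five steps above can be treated as an ordered checklist. Below is a minimal sketch in Python of how one might track sign-off on each step. Note that the `Stage` enum, the `Project` class and its methods are hypothetical names made up for illustration; they are not part of DataRobot or CRISP-DM.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five steps listed above, in order."""
    DEFINE_OBJECTIVES = 1
    ACQUIRE_AND_EXPLORE_DATA = 2
    MODEL_DATA = 3
    INTERPRET_AND_COMMUNICATE = 4
    IMPLEMENT_DOCUMENT_MAINTAIN = 5


@dataclass
class Project:
    """Hypothetical tracker that records which stages have been signed off."""
    name: str
    completed: list = field(default_factory=list)

    def next_stage(self):
        """Return the next stage to work on, or None when all are done."""
        for stage in Stage:
            if stage not in self.completed:
                return stage
        return None

    def complete(self, stage):
        """Sign off a stage; refuse to skip ahead of the next pending one."""
        expected = self.next_stage()
        if stage != expected:
            expected_name = expected.name if expected is not None else "nothing"
            raise ValueError(f"cannot complete {stage.name}; next is {expected_name}")
        self.completed.append(stage)

    def reopen(self, stage):
        """Go back to an earlier stage, discarding sign-offs from it onward."""
        self.completed = [s for s in self.completed if s < stage]


project = Project("example-project")  # hypothetical project name
project.complete(Stage.DEFINE_OBJECTIVES)
print(project.next_stage().name)  # ACQUIRE_AND_EXPLORE_DATA
```

The `reopen` method is there because, in practice, a project rarely moves through the steps in a straight line; for example, poor modelling results may send you back to acquiring and exploring data.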

Let's walk through each of them and look slightly deeper.

1. Define project objectives

Under this step, there are:

2. Acquire and explore data

3. Model data

4. Interpret and communicate

This is where it gets hairy, and where I see most data scientists struggle. Needless to say, this is also one of the most important steps in any project.

5. Implement, document and maintain

After getting buy-in and the green light to productionize, last but definitely not least, we do the following:

Running data science projects using process models

As mentioned above, I wanted to use this post to illustrate, hypothetically, how a typical data science project can be run and managed. Overall, my understanding of running ML projects is pretty close to the DR Machine Learning Cycle. Finally, note that each of these process models is built with its respective software in mind:

But that doesn't mean that these can't be extrapolated and modified to your needs. That's all for this post, thanks for reading!