[How-To] Machine Learning in Practice: Data Acquisition

In this How-To series, I want to share my experience with machine learning models in production environments. It starts with the general differences from typical software projects and how to acquire and handle data sets in such projects, continues with managing pre-processing, training, persisting data pipelines and managing models, and ends with monitoring ML systems.

Difference to non-data-driven projects

Machine learning projects, whether they use deep or shallow learning approaches, differ somewhat from typical software engineering projects. Typical modern software projects tend to be as agile as possible. In contrast, ML projects are limited in their agility. While they are also iterative, they require more structured processes, especially if data acquisition is part of the project.

While most tutorials for machine learning frameworks use prepared data sets and focus on the ML part, in “reality” you don't have prepared and clean data sets, and often you haven't even collected the data yet. In almost every project, data preparation (and understanding) is one of the most time-consuming parts.

So, the first step is to collect the data. It sounds easy, but selecting the data is an important step and should be done carefully. You can be sure that your initial data set will differ from the data set that is used in the end. The basic principle should be: don't throw any information away, you might need it later.

If you asked me what this process could look like, I would refer to ASUM-DM, a process description for machine learning projects that comes from IBM and is inspired by CRISP-DM. But in general I am not a big fan of these fancy PowerPoint things.

Data = Code

More interesting is the fact that the code is no longer the source of “business logic”; the model, or rather the data that trained the model, is. Most systems out there are not capable of event sourcing. They are databases, files or event streams, so their content changes over time and you cannot go back to an earlier state. Furthermore, IT systems are coupled, and future changes might change the structure of the data completely. While this has several effects on machine learning systems in terms of monitoring (discussed later in the series), it is also fundamental to data acquisition.

Suppose you are building an ML pipeline in which your data is read directly from the database, preprocessed and used for training. The outcome is a working model. Four weeks later you run the same process and your new model does not work anymore. Sure, you still have the old one, but for some reason you need a new, better model. You can go through the code, and maybe you find out that it did not handle a changed field or something similar. But it is also possible that you don't find anything in the code, because only the data changed. Reverting to the old data is impossible or a lot of work, and without comparing the old and new data you cannot find the cause. Therefore, keep a persisted and versioned dataset.

In ML systems data is the driver of behavior, so data should be treated as code and be versioned and persisted.
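
One possible way to get such a persisted, versioned dataset is to write every training extract to an immutable snapshot file that is named after a hash of its content. The following is only a minimal sketch under that assumption: the function names, the JSON file format and the `dataset-<version>.json` naming scheme are made up for illustration, and a real project might use a dedicated data versioning tool instead.

```python
import hashlib
import json
from pathlib import Path


def snapshot_dataset(rows, snapshot_dir):
    """Persist rows as a content-addressed snapshot and return its version id.

    `rows` is assumed to be a list of JSON-serializable dicts, for example
    the result of the database query that feeds the training pipeline.
    """
    # Serialize deterministically so identical rows in identical order
    # always produce the same bytes and therefore the same hash.
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    target = Path(snapshot_dir) / f"dataset-{version}.json"
    if not target.exists():  # an existing snapshot is never overwritten
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
    return version


def load_snapshot(version, snapshot_dir):
    """Load exactly the rows that were stored under the given version id."""
    path = Path(snapshot_dir) / f"dataset-{version}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

If the version id is stored next to each trained model, the model from four weeks ago can be retrained on exactly the same data, and the old and new snapshots can be compared when a new model suddenly behaves differently.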

Identify invalid data early

Checking for wrong data is an ongoing process. Almost every real-world data source contains some invalid or highly biased data points. In industrial sensor data, broken sensors produce wrong values. In text data, spam bots and spam messages produce a lot of crap, and images tend to be mislabeled in some way. To train a good classifier you need good data, or at least a high percentage of good data.
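
For the sensor case, an early check could look roughly like the sketch below. It assumes each reading is a dict with a `value` field and that plausible physical limits (`low`, `high`) are known for the sensor; both are assumptions for this example, not something a particular framework prescribes. Suspicious readings are set aside rather than deleted, so no information is thrown away.

```python
import math


def split_valid_invalid(readings, low, high):
    """Separate readings into plausible and suspicious ones.

    `low` and `high` are hypothetical physical limits of the sensor.
    A reading is suspicious if its value is missing, not a number,
    NaN or infinite, or outside the [low, high] range.
    """
    valid, invalid = [], []
    for reading in readings:
        value = reading.get("value")
        ok = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)  # bool is a subclass of int
            and math.isfinite(value)
            and low <= value <= high
        )
        (valid if ok else invalid).append(reading)
    return valid, invalid


readings = [
    {"sensor": "t1", "value": 21.5},
    {"sensor": "t1", "value": float("nan")},  # broken sensor
    {"sensor": "t1", "value": 999.0},         # outside the plausible range
    {"sensor": "t1", "value": None},          # missing value
    {"sensor": "t1", "value": 19.0},
]
valid, invalid = split_valid_invalid(readings, low=-40.0, high=85.0)
print(f"{len(valid)} valid, {len(invalid)} suspicious readings")
```

Tracking the share of suspicious readings over time also gives an early hint when a data source starts to degrade.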

Have a reserve validation dataset

In shallow learning it is common to use k-fold cross validation to estimate overall model quality and to fine-tune hyperparameters. Sometimes your data is limited and very sparse. In this case it often makes sense to train your model on as much data as possible, so that it gets close to real-world data. The model selected by k-fold cross validation is then trained on all data. To prove the model's performance, you should have an extra validation dataset. This validation dataset can be quite small and should consist of samples that must be predicted correctly for the project to succeed (manually selected, valuable entries).
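
The workflow described above can be sketched as follows. To keep the example self-contained, it uses a toy nearest-centroid classifier written in plain Python as a stand-in for whatever shallow model you actually use, and all function names are made up for this illustration. In a real project, a library such as scikit-learn would normally provide the cross validation.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Toy model: one mean vector (centroid) per class label."""
    by_label = {}
    for features, label in zip(X, y):
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(features, model[label])))


def accuracy(model, X, y):
    """Fraction of samples that the model predicts correctly."""
    return mean(1.0 if predict(model, f) == t else 0.0 for f, t in zip(X, y))


def cross_validate(X, y, k=5, seed=0):
    """Estimate accuracy by training k times, each time holding out one fold."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return mean(scores)


def failed_reserve_samples(model, X_reserve, y_reserve):
    """Indices of reserve samples that the final model gets wrong."""
    return [i for i, (f, t) in enumerate(zip(X_reserve, y_reserve)) if predict(model, f) != t]


# Made-up training data: two well-separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [0.3, 0.1],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3], [4.9, 4.7]]
y = ["ok"] * 5 + ["fault"] * 5

# Manually selected samples that must be predicted correctly; never used for training.
X_reserve = [[0.1, 0.1], [5.0, 5.0]]
y_reserve = ["ok", "fault"]

cv_estimate = cross_validate(X, y, k=5)   # 1. estimate quality with k-fold CV
final_model = fit_centroids(X, y)         # 2. train the selected model on all data
failures = failed_reserve_samples(final_model, X_reserve, y_reserve)  # 3. final check
print(f"CV accuracy: {cv_estimate:.2f}, failed reserve samples: {failures}")
```

The important design choice is that the reserve samples never take part in training or model selection, so they remain an independent check on the final model.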

In deep learning it is also a good idea to create an extra validation dataset with enough variance, even if it is more time-consuming. In the end it will pay off, because you avoid the trap of optimizing a model that does not solve the problem you want to solve. Think about all the examples where a deep learning classifier achieved a high accuracy but failed on the very important samples.

BTW: Having such a data set is also recommended by Andrew Ng in his book Machine Learning Yearning.
