Preprocessing and data transformation are the most important parts of any machine learning pipeline. No matter which type of model you use, if the preprocessing pipeline is buggy, your model will deliver wrong predictions. This remains true even if…
In this How-To series, I want to share my experience with machine learning models in production environments. It starts with the general differences from typical software projects and how to acquire and deal with data sets in such projects, goes…
In this post I will show how to report JMX metrics to Logstash via TCP in a push-based way, without changing the Java code of an existing application.
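A minimal sketch of the idea: a small standalone reporter connects to the application's remote JMX port and pushes one JSON line per sample to a Logstash TCP input. The host names, ports, and sampled MBean here are assumptions, not the exact setup from the post:

```java
import java.io.PrintWriter;
import java.net.Socket;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxToLogstash {
    public static void main(String[] args) throws Exception {
        // Remote JMX endpoint of the existing application (assumed host/port).
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://app-host:9010/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url);
             // Logstash TCP input (assumed to listen on 5000 with a json_lines codec).
             Socket socket = new Socket("logstash-host", 5000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            while (true) {
                // Read heap usage from the standard java.lang memory MBean.
                CompositeData heap = (CompositeData) mbsc.getAttribute(
                    new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
                // One JSON document per line, matching the json_lines codec.
                out.println(String.format(
                    "{\"metric\":\"heap_used\",\"value\":%s}", heap.get("used")));
                Thread.sleep(10_000);
            }
        }
    }
}
```

Because the reporter runs as its own process and talks to the existing JMX port, the monitored application stays untouched.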
This post is about data structures and how their performance compares when the CPU prefetch mechanism and other cache effects are taken into account. It shows that LinkedLists are a bad choice in most cases and how to deal with variable sharing.
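As a rough illustration (a naive sketch, not a proper JMH benchmark), sequentially summing an ArrayList is typically much faster than summing a LinkedList of the same size, because the backing array is prefetch-friendly while the list nodes are scattered across the heap:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListTraversal {
    static long sum(List<Integer> list) {
        long s = 0;
        for (int v : list) s += v;   // purely sequential traversal
        return s;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < 1_000_000; i++) {
            arrayList.add(i);
            linkedList.add(i);
        }

        // Naive timing; a real measurement should use JMH with warm-up rounds.
        long t0 = System.nanoTime();
        long a = sum(arrayList);
        long t1 = System.nanoTime();
        long b = sum(linkedList);
        long t2 = System.nanoTime();
        System.out.printf("ArrayList:  %d ms (sum=%d)%n", (t1 - t0) / 1_000_000, a);
        System.out.printf("LinkedList: %d ms (sum=%d)%n", (t2 - t1) / 1_000_000, b);
    }
}
```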
Spark 2.0 DataFrames offer a very powerful, SQL-like way of accessing structured data. But sometimes it is hard to find the right built-in expression, so I would like to show some things I was dealing…
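For example (a self-contained sketch using the Java API; the column names and values are made up), built-ins like `coalesce` and `when` cover many cases where one might otherwise reach for a UDF:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class BuiltInExpressions {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("built-in-expressions").master("local[*]").getOrCreate();

        StructType schema = new StructType(new StructField[]{
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("age", DataTypes.IntegerType, true)});
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
            RowFactory.create("alice", 30),
            RowFactory.create("bob", null)), schema);

        df.withColumn("age", coalesce(col("age"), lit(-1)))              // default for nulls
          .withColumn("adult", when(col("age").geq(18), true).otherwise(false))
          .show();

        spark.stop();
    }
}
```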
Spark 2.0 is now released. Time to move forward. After spending some time migrating the Titanic project to the new version, it seems that Spark 2.0 does not change too much. As said at the Spark Summit in San Francisco in 2016…
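One visible change is the unified entry point: `SparkSession` replaces the separate `SparkContext`/`SQLContext` pair. A minimal sketch (the CSV path is just a placeholder):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkTwoEntryPoint {
    public static void main(String[] args) {
        // Spark 2.0: one builder instead of SparkContext + SQLContext.
        SparkSession spark = SparkSession.builder()
            .appName("titanic").master("local[*]").getOrCreate();

        // DataFrame is now just an alias for Dataset<Row>.
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .csv("titanic.csv");   // path is an assumption
        df.printSchema();
        spark.stop();
    }
}
```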
When running Spark 1.6 on YARN clusters, I ran into problems when YARN preempted Spark containers and the Spark job then failed. This happened only occasionally, when YARN used a fair scheduler and other queues with a higher priority submitted…
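The post describes the details; as a rough sketch, one relevant knob in this situation is how many failures Spark tolerates before giving up, since in Spark 1.6 preempted containers counted as ordinary executor failures. The concrete values below are assumptions, not recommendations:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PreemptionTolerantConf {
    public static void main(String[] args) {
        // Raise the failure limits so a few preempted containers do not
        // immediately kill the whole job; values are illustrative only.
        SparkConf conf = new SparkConf()
            .setAppName("preemption-tolerant")
            .set("spark.task.maxFailures", "16")
            .set("spark.yarn.max.executor.failures", "100");
        // --master yarn is supplied via spark-submit in this scenario.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```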
After getting good results with the Random Forest algorithm in the last post, we will take a look at feed-forward networks, which are a type of artificial neural network. Artificial neural networks consist of many artificial neurons, which are based on the…
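Spark ML ships a feed-forward network as `MultilayerPerceptronClassifier`. A minimal sketch (the input file, layer sizes, and iteration count are assumptions):

```java
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FeedForwardExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("feed-forward").master("local[*]").getOrCreate();

        // Assumed input: a label/features DataFrame in libsvm format.
        Dataset<Row> data = spark.read().format("libsvm").load("titanic.libsvm");

        // One hidden layer; the first entry must match the feature vector length,
        // the last one the number of classes.
        int[] layers = new int[]{10, 5, 2};
        MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()
            .setLayers(layers)
            .setMaxIter(100)
            .setSeed(1234L);
        MultilayerPerceptronClassificationModel model = trainer.fit(data);
        model.transform(data).select("prediction", "label").show();

        spark.stop();
    }
}
```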
In the previous post I showed how to use a Support Vector Machine in Spark and apply PCA to the features. In this post I will show how to use decision trees on the Titanic data and why it is better…
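A minimal sketch with the Spark ML decision tree (the input file and the preceding feature assembly, e.g. via StringIndexer and VectorAssembler, are assumptions):

```java
import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TitanicDecisionTree {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("titanic-tree").master("local[*]").getOrCreate();

        // Assumed input: label/features columns prepared beforehand.
        Dataset<Row> data = spark.read().format("libsvm").load("titanic.libsvm");
        Dataset<Row>[] splits = data.randomSplit(new double[]{0.8, 0.2}, 42L);

        DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("label")
            .setFeaturesCol("features")
            .setMaxDepth(5);
        DecisionTreeClassificationModel model = dt.fit(splits[0]);
        model.transform(splits[1]).select("prediction", "label").show();

        spark.stop();
    }
}
```

Unlike the SVM, the tree needs no feature scaling, which is one reason it is attractive on mixed tabular data like the Titanic set.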
In my previous post I showed how to increase the parallelism of Spark processing by increasing the number of executors on the cluster. In this post I will try to show how to distribute the data in a way that the cluster…
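One way to influence the distribution is to hash-partition the data by the key that later operations group or join on, so all records with the same key land in the same partition. A small sketch (the keys and the partition count are example values):

```java
import java.util.Arrays;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DataDistribution {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("distribution").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));

        // Hash-partition by key so equal keys end up on the same executor;
        // 8 partitions is just an illustrative value.
        JavaPairRDD<String, Integer> partitioned =
            pairs.partitionBy(new HashPartitioner(8));
        System.out.println("partitions: " + partitioned.getNumPartitions());

        sc.stop();
    }
}
```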