In recent years a new job role has become very popular (at least I have noticed it in job postings and media): the machine learning engineer. But what is the difference between this new job role and data engineering or…
In this post I will show how to combine features from natural language processing with traditional features (metadata) in one single model in Keras (end-to-end learning). The solution is a multiple-input model. Problem definition: scientific data sets are…
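For orientation, a minimal sketch of what such a multiple-input model can look like with the Keras functional API: one branch for tokenized text, one for the numeric metadata, merged into a single end-to-end network. The input shapes, layer sizes and names are placeholders, not the values used in the post.

```python
# Sketch of a multiple-input Keras model: a text branch and a metadata
# branch merged into one network. Shapes and sizes are placeholders.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model

text_input = Input(shape=(100,), name="text")   # padded token ids
meta_input = Input(shape=(10,), name="meta")    # numeric metadata features

x = Embedding(input_dim=20000, output_dim=64)(text_input)
x = LSTM(32)(x)

merged = concatenate([x, meta_input])           # join both branches
merged = Dense(64, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[text_input, meta_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit([token_ids, meta_features], labels, epochs=5)
```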
Over the last months I have been working on a deep learning project. The goal was to dig into TensorFlow and deep learning in general. For this purpose I re-implemented a paper from January 2016 called Convolutional Pose Machines, which uses deep…
When running Spark 1.6 on YARN clusters, I ran into problems when YARN preempted Spark containers and the Spark job then failed. This happens only sometimes, when YARN uses a fair scheduler and other queues with a higher priority submit…
After getting good results with the Random Forest algorithm in the last post, we will take a look at feed-forward networks, which are artificial neural networks. Artificial neural networks consist of many artificial neurons, which are based on the…
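As a rough sketch of where this is heading, a small feed-forward network (multilayer perceptron) can be trained directly in Spark ML; the DataFrame names and layer sizes below are assumptions for illustration, not the exact setup from the post.

```python
# Sketch: a feed-forward network (multilayer perceptron) in Spark ML.
# Assumes a DataFrame train_df with "features" and "label" columns,
# as prepared in the earlier Titanic posts; layer sizes are placeholders.
from pyspark.ml.classification import MultilayerPerceptronClassifier

# first entry = number of input features, last entry = number of classes
layers = [6, 10, 5, 2]

mlp = MultilayerPerceptronClassifier(layers=layers, maxIter=100, seed=42)
model = mlp.fit(train_df)
predictions = model.transform(test_df)
```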
In the previous post I showed how to use the Support Vector Machine in Spark and apply PCA to the features. In this post I will show how to use Decision Trees on the Titanic data and why it's better…
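A minimal sketch of what training a decision tree with Spark MLlib can look like, assuming an RDD `training` of LabeledPoint objects as built in the earlier posts; the hyperparameters are illustrative.

```python
# Sketch: decision tree training with Spark MLlib on the Titanic features.
# Assumes RDDs `training` and `test` of LabeledPoint; parameters are illustrative.
from pyspark.mllib.tree import DecisionTree

model = DecisionTree.trainClassifier(
    training,
    numClasses=2,
    categoricalFeaturesInfo={},   # treat all features as continuous here
    impurity="gini",
    maxDepth=5,
    maxBins=32,
)
predictions = model.predict(test.map(lambda p: p.features))
```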
In my previous post I showed how to increase the parallelism of Spark processing by increasing the number of executors on the cluster. In this post I will show how to distribute the data in a way that the cluster…
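A minimal sketch of one way to spread keyed data evenly over the executors by hash-partitioning on the key; the RDD name, key choice and partition count are assumptions for illustration.

```python
# Sketch: distribute keyed records across the cluster by hashing the key,
# so each executor gets a share of the work. Names and numbers are placeholders.
num_partitions = 80   # roughly executors * cores (illustrative)

pairs = rdd.map(lambda row: (row[0], row))        # key by the first column
distributed = pairs.partitionBy(num_partitions)   # shuffle records by key hash

# subsequent stages can now run up to num_partitions tasks in parallel
counts = distributed.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)
```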
This is the third part of Kaggle's Machine Learning from Disaster challenge, where I show how you can use Apache Spark for model-based prediction (supervised learning). This post is about support vector machines. The Support Vector Machine (SVM) is…
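As a hedged sketch, a linear SVM can be trained with Spark MLlib roughly as below, assuming RDDs `training` and `test` of LabeledPoint with the (optionally PCA-reduced) features; the iteration count is illustrative.

```python
# Sketch: linear SVM with Spark MLlib. Assumes LabeledPoint RDDs
# `training` and `test`; the iteration count is illustrative.
from pyspark.mllib.classification import SVMWithSGD

model = SVMWithSGD.train(training, iterations=100)

# compare predicted and actual labels on the held-out set
labels_and_preds = test.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
```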
Naive Bayes is a probabilistic classifier that assigns a feature vector to a class. It makes the (often wrong) assumption that the features are statistically independent of each other. Nevertheless, Apache Spark has implemented Naive Bayes and I…
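The independence assumption means that P(x1, …, xn | C) is treated as the product of the per-feature probabilities P(xi | C). A minimal sketch of using Spark MLlib's implementation, assuming an RDD `training` of LabeledPoint with non-negative feature values; the smoothing parameter is illustrative.

```python
# Sketch: Naive Bayes with Spark MLlib (multinomial model). Assumes
# LabeledPoint RDDs `training` and `test`; lambda_ is Laplace smoothing.
from pyspark.mllib.classification import NaiveBayes

model = NaiveBayes.train(training, lambda_=1.0)
predictions = test.map(lambda p: (p.label, model.predict(p.features)))
```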
In Apache Spark the key to performance is parallelism. The first step towards parallelism is to get the partition count to a good level, as the partition is the atomic unit of each job. Reaching a good level of…
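A minimal sketch of inspecting and adjusting the partition count, which bounds how many tasks can run in parallel; the input path and partition numbers are placeholders.

```python
# Sketch: check and change the partition count of an RDD.
# The path and the numbers are placeholders.
rdd = sc.textFile("hdfs:///data/input")   # hypothetical input path

print(rdd.getNumPartitions())             # upper bound on parallel tasks per stage

more_parallel = rdd.repartition(200)      # increase the count (full shuffle)
fewer_files = more_parallel.coalesce(50)  # reduce the count without a full shuffle
```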