spark Archives - Digital Thinking

[Apache Spark] Yarn and preemption and fair schedulers

When running Spark 1.6 on YARN clusters, I ran into problems when YARN preempted Spark containers and the Spark job then failed. This happened only occasionally, when YARN used a fair scheduler and other queues with a higher priority submitted…

July 15, 2016

[Apache Spark] Machine Learning from Disaster: multilayer perceptrons

Big Data | Java | Machine Learning

After getting good results with the Random Forest algorithm in the last post, we will take a look at feed-forward networks, which are artificial neural networks. Artificial neural networks consist of many artificial neurons, which are based on the…

June 30, 2016

[Apache Spark] Machine Learning from Disaster: Random Forest

Big Data | Java | Machine Learning

In the previous post I showed how to use the Support Vector Machine in Spark and apply PCA to the features. In this post I will show how to use Decision Trees on the Titanic data and why it's better…

June 19, 2016

[Apache Spark] Performance: Partitioning

Big Data | Java | Performance

In my previous post I showed how to increase the parallelism of Spark processing by increasing the number of executors on the cluster. In this post I will try to show how to distribute the data in such a way that the cluster…

June 18, 2016

[Apache Spark] Machine Learning from Disaster: SVM

Big Data | Java | Machine Learning

This is the third part of Kaggle's Machine Learning from Disaster challenge, where I show how you can use Apache Spark for model-based prediction (supervised learning). This post is about support vector machines. The Support Vector Machine (SVM) is…

June 15, 2016

[Apache Spark] Machine Learning from Disaster: Naive Bayes

Big Data | Java | Machine Learning

Naive Bayes is a probabilistic classifier that assigns a feature vector to a class. It makes the probably wrong assumption that the features are statistically independent of each other. Anyway, Apache Spark has implemented Naive Bayes and I…

June 5, 2016

[Apache Spark] Performance: Configuration and Memory (YARN)

Big Data | Java | Performance

In Apache Spark, the key to performance is parallelism. The first step towards parallelism is getting the partition count to a good level, as the partition is the atomic unit of each job. Reaching a good level of…

May 28, 2016

[Apache Spark] Machine Learning from Disaster: The Data

Big Data | Java | Machine Learning

How did fare, age, or gender affect the probability of surviving on the Titanic? And how can we use this information to predict the probability of survival for each passenger? That's the competition offered by Kaggle to get into machine learning…

May 26, 2016

[Apache Spark] Adding a column to Dataframes with service invocation

Big Data | Java | Software Engineering

Spark SQL is used to process structured data. I ran into a problem when I wanted to perform operations per partition (connect to a web service, etc.) and add fields to the original data, when I read the data from the new DataFrame…

April 13, 2016