The naive Bayes classifier is a probabilistic classifier that assigns a feature vector to a class. It makes the assumption, probably wrong in practice, that the features are statistically independent of each other. Nevertheless, Apache Spark has implemented Naive Bayes and i…
Java
In Apache Spark, the key to performance is parallelism. The first step toward parallelism is getting the partition count to a good level, since the partition is the atomic unit of each job. Reaching a good level of…
How did fare, age, or gender affect the probability of surviving on the Titanic? And how can we use this information to predict each passenger's probability of survival? That's the competition that Kaggle offers to get into machine learning…
Spark SQL is used to process structured data. I ran into a problem when I wanted to run operations per partition (connecting to a web service, etc.) and add fields to the original data: when I read the data from the new DataFrame…
While using Ehcache to cache RESTful calls to web services, I ran into the problem that Ehcache was sometimes unable to cache my xjc-generated Java classes from XSD, because by default they do not implement the Serializable interface. To…