[Apache Spark] Machine Learning from Disaster: The Data

How did fare, age, or gender affect the probability of surviving the Titanic? And how can we use this information to predict each passenger's chance of survival? That is the question behind a competition that Kaggle offers as an introduction to machine learning and data analytics. Kaggle is a platform that hosts competitions from companies and researchers for anyone who is interested, and the best solutions can win prizes. Best of all, everyone can use the published dataset.

I'd like to share some techniques for estimating whether a passenger survives, using Apache Spark. But first I will share some insights into the dataset to give a rough overview of what data is available. All visualizations were created with Python and Matplotlib.


This series is about using Apache Spark for supervised learning and making predictions with the most relevant classifier types:

  1. Data Exploration: This post
  2. Naive Bayes
  3. SVM
  4. Random Forest
  5. ANN

The Data

Very large data sets often contain invalid, incomplete, or missing data. So the first step is always to look at the data, analyse it, and understand it. This step is very important and can save you a lot of time. Pick some sample data and visualize it.

PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171  7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599   71.2833  C85    C
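
A first look at the data can be as simple as counting how many values are missing in each column. The sketch below uses only the Python standard library and a subset of the columns. The first two rows follow the sample rows shown above; the third row is made up and only there so that there is a gap in more than one column. In practice you would read Kaggle's CSV file instead of the inline string.

```python
import csv
import io
from collections import Counter

# Inline stand-in for the training file (subset of columns).
# Rows 1 and 2 follow the sample rows above; row 3 is hypothetical.
RAW = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22.0,,S
2,1,1,female,38.0,C85,C
3,0,3,male,,,Q
"""


def count_missing(rows):
    """Return a Counter mapping column name -> number of empty cells."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing


rows = list(csv.DictReader(io.StringIO(RAW)))
missing = count_missing(rows)

# Print one line per column, including columns without gaps.
for column in rows[0].keys():
    print(f"{column:12s} missing: {missing[column]} of {len(rows)}")
```

Columns with many gaps (Cabin in this toy sample) need a decision before any classifier sees them.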

Data types (the statistics view)

  • nominal (countable)
  • ordinal (countable, sortable)
  • interval (countable, sortable, +/-, mean)
  • ratio (countable, sortable, +/-, mean, zero point)
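
To make the four levels concrete, the sketch below pairs each one with an operation that is meaningful at that level: counting (mode) for nominal, sorting (median) for ordinal, adding and subtracting (mean) for interval, and dividing for ratio. The pairing is standard statistics rather than something specific to this data set, and all values are toy values; the temperature list is not a Titanic column.

```python
from statistics import mean, median, mode

# Toy values, chosen only to illustrate the four levels.
embarked = ["S", "C", "S", "Q", "S"]   # nominal: labels without order
pclass = [3, 1, 3, 2, 3]               # ordinal: ordered categories
temperature_c = [10.0, 20.0, 30.0]     # interval: differences matter, zero is arbitrary
age = [22.0, 38.0, 26.0]               # ratio: true zero point

most_common_port = mode(embarked)          # counting works on every level
middle_class = median(pclass)              # sorting makes the median meaningful
average_temperature = mean(temperature_c)  # +/- makes the mean meaningful
age_ratio = max(age) / min(age)            # a zero point makes ratios meaningful

print(most_common_port, middle_class, average_temperature, age_ratio)
```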

The Titanic data set contains these columns:

survival        Survival (nominal)
                (0 = No; 1 = Yes)
pclass          Passenger Class (ordinal)
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name (nominal)
sex             Sex (nominal)
age             Age (ratio)
sibsp           Number of Siblings/Spouses Aboard (ratio)
parch           Number of Parents/Children Aboard (ratio)
ticket          Ticket Number (nominal)
fare            Passenger Fare (ratio)
cabin           Cabin (nominal)
embarked        Port of Embarkation (nominal)

The assignment of the types is not set in stone; it depends on the application. For example, pclass could just as well be treated as a nominal category.
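
One way to write such an assignment down is a small schema that converts the raw strings of a record into typed values and turns empty cells into None. This is only a sketch: `optional`, `SCHEMA` and `parse_record` are names invented here, the schema covers a subset of the columns, and the record is hypothetical.

```python
def optional(parser):
    """Wrap a parser so that empty strings become None instead of raising."""
    def parse(text):
        text = text.strip()
        return parser(text) if text else None
    return parse


# Hypothetical schema: column name -> parser. Adjust it to the application.
SCHEMA = {
    "survival": optional(int),   # nominal, coded as 0/1
    "pclass": optional(int),     # ordinal, 1 < 2 < 3
    "sex": optional(str),        # nominal
    "age": optional(float),      # ratio
    "sibsp": optional(int),      # ratio
    "parch": optional(int),      # ratio
    "embarked": optional(str),   # nominal
}


def parse_record(raw):
    """Convert a dict of raw strings into typed values; unknown columns are dropped."""
    return {name: SCHEMA[name](value) for name, value in raw.items() if name in SCHEMA}


# Hypothetical record with a missing age.
record = parse_record({
    "survival": "0", "pclass": "3", "sex": "male",
    "age": "", "sibsp": "1", "parch": "0", "embarked": "S",
})
print(record)
```

Keeping the schema in one place makes it easy to change a column's treatment later, for instance switching pclass from a number to a plain label.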

Data Visualization


[Figure: number of passengers in each class]

[Figure: probability density function of age for each class]
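
The numbers behind the two plots can be computed without any plotting library. The sketch below counts passengers per class and estimates the age density per class with a normalized histogram, which is a simpler stand-in for a smoothed density curve. The (class, age) pairs are hypothetical, and `class_counts` and `age_density` are names invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical (pclass, age) pairs; None marks a missing age.
passengers = [(1, 38.0), (1, 54.0), (2, 27.0), (3, 22.0), (3, 4.0), (3, None), (3, 26.0)]


def class_counts(records):
    """Number of passengers per class."""
    return Counter(pclass for pclass, _ in records)


def age_density(records, bin_width=10.0):
    """Histogram estimate of the age density per class.

    Returns {pclass: {bin_start: density}}. For each class the densities
    multiplied by bin_width sum to 1. Missing ages are skipped.
    """
    ages = defaultdict(list)
    for pclass, age in records:
        if age is not None:
            ages[pclass].append(age)
    result = {}
    for pclass, values in ages.items():
        bins = Counter(int(a // bin_width) * bin_width for a in values)
        result[pclass] = {
            start: n / (len(values) * bin_width) for start, n in sorted(bins.items())
        }
    return result


counts = class_counts(passengers)
density = age_density(passengers)
print(counts)
print(density)
```

The resulting dictionaries can be handed to a bar chart (counts) and a step or line plot (densities) in Matplotlib.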

Prepare the data

When analyzing the training dataset, we see what happens in every big dataset: some entries are incomplete. There are several ways to deal with this, and which one is a good choice depends strongly on the application.

  • Ignore the incomplete entry
  • Interpolate the missing value
  • Ignore only the empty value

Other data sets have other problems, such as duplicates and systematic errors.
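
The three options for incomplete entries can be sketched in a few lines of plain Python. The values are hypothetical and the function names are invented for this example. Note that linear interpolation only makes sense when the order of the values carries information, for example in a time series.

```python
from statistics import mean

# Hypothetical data; None marks a missing value.
ages = [22.0, None, 26.0, 35.0, None, 54.0]
records = [{"age": 22.0, "pclass": 3}, {"age": None, "pclass": 1}]


def drop_incomplete(entries):
    """Option 1: ignore every entry that has a missing field."""
    return [e for e in entries if all(v is not None for v in e.values())]


def interpolate(values):
    """Option 2: linearly interpolate gaps between known neighbours.

    Gaps before the first or after the last known value stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


def mean_ignoring_missing(values):
    """Option 3: keep the entry but skip the empty value in the computation."""
    return mean(v for v in values if v is not None)


print(drop_incomplete(records))
print(interpolate(ages))
print(mean_ignoring_missing(ages))
```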

Data enrichment


A widely used way to deal with missing data is to fill in the average of the full data set. This can be dangerous in some cases: every gap receives the same value, which piles entries up at a single point and distorts the distribution.
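
The effect is easy to reproduce on a handful of numbers. In the sketch below (hypothetical ages, `fill_with_mean` is a name invented here) the mean stays the same after filling, but the standard deviation shrinks because all filled values sit exactly on the mean.

```python
from statistics import mean, pstdev

# Hypothetical ages; None marks a missing value.
ages = [2.0, 4.0, None, 30.0, None, 60.0, None, 64.0]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed_values = [v for v in values if v is not None]
    average = mean(observed_values)
    return [average if v is None else v for v in values]


observed = [v for v in ages if v is not None]
filled = fill_with_mean(ages)

print("mean:  ", mean(observed), "->", mean(filled))      # unchanged
print("stdev: ", pstdev(observed), "->", pstdev(filled))  # smaller after filling
```

Whether that loss of spread matters depends on the application and on how many values are missing.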

May 26, 2016
