[Apache Spark] Machine Learning: Spark 2.0 and Dota 2

Spark 2.0 is now released, so it's time to move forward. After spending some time migrating the Titanic project to the new version, it seems that Spark 2.0 does not change too much. As announced at the Spark Summit 2016 in San Francisco, the core changes are structured streaming and performance improvements in the Dataset/DataFrame API. MLlib also focuses more on DataFrames now. A quick hint: there are two different APIs for Spark machine learning, the mllib package (org.apache.spark.mllib) and the ml package (org.apache.spark.ml). The second one is the "new" one based on DataFrames, while the first one works on RDDs.

Better (Big) Data

The Titanic data set from Kaggle is nice to play around with and try some things out, but it's very small, so in practice using Spark for it would be huge overhead. So I was looking for a better data set, one that can be used for clustering and recommender systems. Finally I had an idea that could be very cool if it works, but could also fail: Dota 2 match data.

Dota 2 match data is available via the Steam API, which is unstable and sometimes does weird stuff. But I managed to get about 300 MB of match data so far, and I will try to get some more. For now this should be enough. The data contains information about Dota 2 matches, like duration, winning team, players, heroes, gold, items etc. If you don't know how MOBAs like Dota 2 work, you can take a look here. If you want to know more about the DataFrame expressions used in the code, you can read my post about them.

Data Preprocessing

First we want to filter out matches that are not relevant for real predictions. I ended up with these filters, which are applied before saving the data:

  • only games without leavers
  • only games with 10 human players
  • only games longer than 15 minutes
  • only ranked

Find the 10 most played heroes

The first thing we can aggregate is the relative occurrence of picked heroes. First we select the column "hero_id" and group by it. Then we join the hero id with the real hero names and sort by count. To get the percentage we can just use a DataFrame expression for a scalar division.

Predicting wins by items

OK, let's do something more interesting and use the already known techniques for some predictions. The first one uses a RandomForest for item-based predictions. Let's say we know the items of a player and we want to predict wins from items.

First we need to extract features from the data by creating a sparse vector of items.

Now we split the data into training and validation data and train the RandomForest.

And finally validate the data and get the error:

As a result we get about a 35% error for the RandomForest, which is not that bad.

Let's try other algorithms:

To be honest, it's not a very useful prediction at the moment, but it basically shows that this direction would be a nice idea. Consider a tool (like dotapicker) that recommends which items you should buy to win a game. The first thing to change would be to make distinctions by hero, or at least by hero type (carry, support, ganker etc.).

Code here -> Github
