[Apache Spark] Machine Learning: Spark 2.0 and Dota 2

Spark 2.0 is now released. Time to move forward. After spending some time migrating the titanic project to the new version, it seems that Spark 2.0 does not change too much. As said at the Spark Summit 2016 in San Francisco, the core changes are structured streaming and performance improvements in the Dataset/DataFrame API. MLlib also focuses more on DataFrames now. A quick hint: there are two different APIs for Spark machine learning, the mllib package (org.apache.spark.mllib) and the ml package (org.apache.spark.ml). The latter is the "new" one based on DataFrames, the former works on RDDs.

Better (Big) Data

The titanic data set from Kaggle is nice to play around with and try some things out, but it is very small. In practice it would be a huge overhead to use Spark for it. So I was looking for a better data set, one that can be used for clustering and recommender systems. Finally I had an idea which could be very cool if it works, but could also fail: Dota 2 match data.

Dota 2 match data is available via the Steam API, which is unstable and sometimes does weird stuff. But I managed to get about 300 MByte of match data so far, and I will try to get some more. For now this should be enough. The data contains information about Dota 2 matches, like duration, winner team, players, heroes, gold, items etc. If you don't know how MOBAs like Dota 2 work, you can take a look here. If you want to know about the DataFrame expressions used in the code, you can read my post about it.

Data Preprocessing

First we want to filter out matches that are not relevant for real predictions. I ended up with these filters, which are applied before we save the data:

  • only games without leavers
  • only games with 10 human players
  • only games longer than 15 minutes
  • only ranked
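
The list above can be sketched as a plain predicate; the field names here are illustrative assumptions, not the real Steam API schema:

```scala
// Hypothetical match record; field names are assumptions, not the real Steam API schema.
case class MatchInfo(leavers: Int, humanPlayers: Int, durationSeconds: Int, ranked: Boolean)

// A match is kept only if it satisfies all four filters from the list above.
def isRelevant(m: MatchInfo): Boolean =
  m.leavers == 0 &&
    m.humanPlayers == 10 &&
    m.durationSeconds > 15 * 60 &&
    m.ranked

isRelevant(MatchInfo(0, 10, 2400, ranked = true))  // true
isRelevant(MatchInfo(1, 10, 2400, ranked = true))  // false, one leaver
```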

Find the 10 most played heroes

The first thing we can aggregate is the relative occurrence of picked heroes. First we select the column "hero_id" and group by it. Then we join the hero_id with the real names and sort by count. To get the percentage we can just use the DataFrame expressions for a scalar division.

    val heroCounts = playersDf.select("col.hero_id")
      .groupBy("hero_id").count()                  // occurrences per hero
      .join(herosIdName, $"hero_id" === $"id")     // attach readable hero names
      .sort($"count".desc)

    val totalCount = playersDf.count().toDouble / 100.0
    val percentage = heroCounts.select($"hero_id", $"localized_name", ($"count" / totalCount).as("percent"))
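The same count-and-divide idea can be checked in plain Scala on a toy pick list (hero names made up for illustration):

```scala
// Toy stand-in for the DataFrame aggregation: relative pick rates in plain Scala.
val picks = Seq("Pudge", "Pudge", "Invoker", "Sniper")
val percent = picks.groupBy(identity)
  .map { case (hero, xs) => hero -> xs.size * 100.0 / picks.size }
// percent("Pudge") == 50.0, percent("Invoker") == 25.0, percent("Sniper") == 25.0
```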

Predicting wins by items

OK, let's do something more interesting and use the already known techniques for some predictions. The first one uses a RandomForest for item-based predictions. Let's say we know the items of a player and we want to predict wins from those items.

First we need to extract features from the data by creating a sparse vector of items.

    val vectors = playerWithWinStatus
      .select($"winner", $"players.hero_id", $"players.item_0", $"players.item_1", $"players.item_2", $"players.item_3", $"players.item_4", $"players.item_5")
      .rdd.map(row => {
        // collect the six item slots of the player into one sequence
        val seq: Seq[Int] = Seq[Int](
          row.getAs[Int]("item_0"), row.getAs[Int]("item_1"), row.getAs[Int]("item_2"),
          row.getAs[Int]("item_3"), row.getAs[Int]("item_4"), row.getAs[Int]("item_5"))
        // count how often each item id occurs: (itemId, count)
        val occurrences = seq.groupBy(l => l).map(t => (t._1, t._2.length.toDouble)).toSeq
        val winnerValue = if (row.getAs[Boolean]("winner")) 1.0 else 0.0
        LabeledPoint(winnerValue, Vectors.sparse(itemCount, occurrences))
      })
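The item-counting step that feeds Vectors.sparse can be tried in isolation; the item slot ids here are made up:

```scala
// Plain-Scala version of the (itemId, count) step that feeds Vectors.sparse.
val slots = Seq(46, 1, 46, 0, 46, 1)   // six toy item slot ids (made up)
val occurrences = slots.groupBy(l => l)
  .map(t => (t._1, t._2.length.toDouble))
// occurrences(46) == 3.0, occurrences(1) == 2.0, occurrences(0) == 1.0
```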

Now we split the data into training and validation data and train the RandomForest.

    val splits = vectors.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]((0, itemCount), (1, itemCount), (2, itemCount), (3, itemCount), (4, itemCount), (5, itemCount))
    val numTrees = 256
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 10
    val maxBins = itemCount

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
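The 70/30 split can be mimicked on a plain collection; a per-element coin flip is the same idea randomSplit applies in a distributed way (the seed here is chosen arbitrarily):

```scala
// Per-element 70/30 coin flip, the same idea the distributed randomSplit is based on.
val rng = new scala.util.Random(42)     // fixed seed, chosen arbitrarily
val data = (1 to 1000).toVector
val (trainingPart, testPart) = data.partition(_ => rng.nextDouble() < 0.7)
// the two parts are disjoint and together cover all 1000 elements
```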

And finally validate the data and get the error:

    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
    println("Test Error = " + testErr)
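The error metric itself is just the fraction of mismatched (label, prediction) pairs, which a toy check makes clear:

```scala
// The error metric: fraction of (label, prediction) pairs that disagree.
val labelAndPreds = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0))
val testErr = labelAndPreds.count(r => r._1 != r._2).toDouble / labelAndPreds.size
// two of the four toy pairs disagree, so testErr == 0.5
```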

As a result we get an error of about 35% for the RandomForest, which is not that bad.

Let's try other algorithms:

  val model = NaiveBayes.train(trainingData)
  Test Error = 0.31350517735183964

  val model = LogisticRegressionWithSGD.train(trainingData, 1000)
  Test Error = 0.3380093520374082

  val model = SVMWithSGD.train(trainingData, 1000)
  Test Error = 0.3442694663167104

To be honest it's not a very useful prediction at the moment, but it basically shows that this direction would be a nice idea. Consider a tool (like dotapicker) which recommends what items you should pick to win a game. The first thing to change would be to make distinctions by hero, or at least by hero type (carry, support, ganker etc.).

Code here -> Github
