[Apache Spark] Tips and Tricks

After some months of working with Apache Spark 1.6, I want to write down some tips which I really would have liked to read before I started.

General

  • MapReduce is slower than Spark, but
  • Spark performance is hard to achieve
  • Spark is highly dependent on configuration
  • Spark stands and falls with in-memory computation
  • Use Scala with Spark
  • In general, the Spark version does not depend on the Hadoop version
  • Use broadcasts to distribute variables across the cluster (see the sketch after this list)
  • SequenceFiles are very fast (compared to many small files), but their handling makes them only useful as an internal format
  • Parquet is very fast and achieves a huge compression because it is column-oriented, but it is also hard to read outside of Hadoop
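
A minimal sketch of a broadcast variable (the lookup map and sample data are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

    // small lookup table that every task needs (hypothetical data)
    val countryNames = Map("DE" -> "Germany", "FR" -> "France")

    // ship it once per executor instead of once per task
    val countriesBc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("DE", "FR", "DE"))
    val resolved = codes.map(code => countriesBc.value.getOrElse(code, "unknown"))

    resolved.collect().foreach(println)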

Configuration

  • ("spark.serializer", "org.apache.spark.serializer.KryoSerializer") uses the Kryo serializer, which is fast. ("spark.kryoserializer.buffer.max", "2000m") configures the buffer size
  • ("spark.rdd.compress", "true") compresses RDDs and saves memory
  • ("spark.io.compression.codec", "snappy") sets the internal compression codec for spill, shuffle and broadcast data
  • ("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops") reduces the pointer address space when using less than 32 GB of heap
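
All of these can be set programmatically on a SparkConf; a minimal sketch (the application name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-spark-app") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "2000m")
      .set("spark.rdd.compress", "true")
      .set("spark.io.compression.codec", "snappy")
      .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")

    val sc = new SparkContext(conf)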

Submitting Applications

  • --num-executors 25 -> use 25 executor instances on which Spark runs the application
  • --executor-cores 4 -> use 4 cores for each of these executors
  • --executor-memory 8G -> use 8 GB RAM per executor (2 GB per core)
  • --driver-memory 2048m -> use 2 GB for the driver

25 * 8 GB + 2 GB = 202 GB of RAM in theory; in practice you need more.
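
A full submit call with these flags could look like this (master, class name and jar path are placeholders):

    spark-submit \
      --master yarn \
      --num-executors 25 \
      --executor-cores 4 \
      --executor-memory 8G \
      --driver-memory 2048m \
      --class com.example.MyJob \
      my-job.jar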

RDD API

  • The older API, which is more general
  • You can cache your RDDs with persist/cache at different storage levels (see the sketch after this list)
  • You should free the memory (unpersist) after using it
  • foreachPartition allows you to execute code per partition on the executors (database connections etc.); you can think of it as a small Scala program running for each partition (e.g. for using Spring)
  • PairRDDs are very useful for counting by key or doing additional key-based operations, like custom partitioning by a given key (performance)
  • In Scala, import the implicits to make sure you can see all methods
  • You can get an RDD from a DataFrame, and convert between the two
  • e.g. you can use custom partitioning on the RDD and convert back
  • You can extend RDDs yourself
  • If you want to stream a huge dataset to the driver, don't use collect; use toLocalIterator, which fetches partition by partition
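
A minimal sketch combining a few of these points (the input path, key extraction and partition count are made up): caching with an explicit storage level, key-based partitioning via a PairRDD, per-partition work, and streaming results to the driver:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("rdd-api-example"))

    // build a PairRDD of (key, record) and partition it by key (hypothetical data)
    val events = sc.textFile("hdfs:///data/events")   // placeholder path
      .map(line => (line.split(",")(0), line))
      .partitionBy(new HashPartitioner(100))           // custom, key-based partitioning
      .persist(StorageLevel.MEMORY_AND_DISK)           // explicit storage level

    // key-based operation without collecting everything to the driver
    val counts = events.countByKey()

    // per-partition work, e.g. opening one database connection per partition
    events.foreachPartition { records =>
      // connection setup would go here (hypothetical)
      records.foreach(record => ())
    }

    // stream a large result partition by partition instead of collect()
    events.toLocalIterator.take(10).foreach(println)

    // free the memory after using it
    events.unpersist()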

Dataframe API

  • Faster than RDDs (with some exceptions)
  • Is part of Spark SQL
  • Uses Tungsten for memory optimization
  • Tungsten uses off-heap/native/C-like memory
  • Allows SQL queries
  • Better IO handling (splitting the output by column)
  • Does not allow a custom partitioner at the moment (which can slow down the whole application)
  • Databricks provides nice output format extensions (Avro, Parquet, CSV and JSON)
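
A minimal sketch of the DataFrame API on Spark 1.6 (the paths, table name and columns are made up): loading JSON, running a SQL query and writing Parquet split by a column:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("dataframe-example"))
    val sqlContext = new SQLContext(sc)

    // load a DataFrame (placeholder path, schema is inferred)
    val users = sqlContext.read.json("hdfs:///data/users.json")

    // SQL queries via a temporary table
    users.registerTempTable("users")
    val adults = sqlContext.sql("SELECT name, country FROM users WHERE age >= 18")

    // write Parquet and split the output by a column
    adults.write.partitionBy("country").parquet("hdfs:///out/adults")

    // CSV and Avro need the Databricks packages, e.g. format("com.databricks.spark.csv")

    // convert to an RDD (of Rows) when you need RDD-only features
    val asRdd = adults.rdd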

Errors

Spark, or Hadoop in general, often does not provide useful error messages.

  • First increase the executor memory and decrease the number of cores; in most cases, job failures with cryptic exceptions are caused by memory problems
  • Look at your per-node logs to find the cause