Optimizing Apache Spark & Tuning Best Practices
Processing data efficiently becomes challenging as it scales up. Drawing on the experience we gained at the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines out there.
What you'll learn
Spark execution model: Driver/Executors
Spark resource managers (YARN, Mesos, Kubernetes)
Understanding the RDD and DataFrame APIs and their language bindings
Difference between Actions and Transformations
How to read the query plan (logical/physical)
Spark Memory model
Understanding persistence (caching)
Catalyst optimizer and Tungsten project
Shuffle service and how a shuffle operation is executed
Concept of fair scheduling and pools
Java and Kryo serializers
Step into JVM world: what you need to know about GC when running Spark applications
Spark optimization: main problems and issues
The most common memory problems
Benefit of using early filtering
Understanding partition and predicate filtering
Combating data skew (preprocessing, broadcasting, salting)
Understanding shuffle partitions: how to tackle memory/disk spill
Downsides of using UDFs
Executor idle timeout
Data formats examples
Moving to production
Debugging / troubleshooting
Productionizing your Spark application
Dynamic allocation and dynamic partitioning
Profiling your Spark application (Sparklint)
Data Engineering Learning Journey
This online course is perfect for
Data and Machine Learning Engineers who transform large volumes of data and need production-quality code.
Expert Data Scientists can also participate: they will learn how to get the most out of Spark and how simple tweaks can improve performance dramatically.
What will you learn during Optimizing Apache Spark & Tuning Best Practices?
After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tune your Spark applications.
Structured, to-the-point, good combination of theory and practical examples, very knowledgeable trainer who can explain concepts very well
It was a hands-on and tangible course. We could apply what we learned in a matter of minutes. The trainer did a great job of answering ad-hoc questions that complemented the material. We appreciated the fact that we could apply what we were taught directly to our company.
I liked every aspect of this training and would like to thank the trainers. They did an excellent job of explaining how to use Spark for data science. This is the fourth GoDataDriven training I’ve followed. All were great, but this was the best one so far.
Climbing a steep Python and Machine Learning curve in three days. This would have taken me months on my own.