
Training schedule

IN-COMPANY TRAINING PROGRAMS

Contact Giovanni Lanzani if you want to know more about custom data & AI training for your teams. He’ll be happy to help you!

Optimizing Apache Spark & Tuning Best Practices

Processing data efficiently becomes challenging as it scales up. Drawing on the experience we gained at the largest Apache Spark users in the world, we give you an in-depth overview of the do’s and don’ts of one of the most popular analytics engines out there.


What you'll learn

Fundamentals

Spark execution model: Driver/Executors
Spark resource managers (YARN, Mesos, K8s)
Understanding the RDD and DataFrame APIs and their language bindings
Difference between actions and transformations
How to read the query plan (logical/physical)

Spark internals

Spark Memory model
Understanding persistence (caching)
Catalyst optimizer and Tungsten project
Shuffle service and how the shuffle operation is executed
Concept of fair scheduling and pools
Java and Kryo serializer
Step into the JVM world: what you need to know about GC when running Spark applications

Spark optimisation: main problems and issues

The most common memory problems
Benefit of using early filtering
Understanding partition and predicate filtering
Join optimisation
Combating data skew (preprocessing, broadcasting, salting)
Understanding shuffle partitions: how to tackle memory/disk spill
Downsides of using UDFs
Executor idle timeout
Data format examples

Moving to production

Debugging / troubleshooting
Productionizing your Spark application
Dynamic allocation and dynamic partitioning
Profiling your Spark application (Sparklint)
JVM profiler

This online course is perfect for

Data and Machine Learning Engineers who deal with transformation of large volumes of data and need production-quality code.

Expert Data Scientists can also participate: they will learn how to get the most performance out of Spark and how simple tweaks can improve performance dramatically.

What will you learn during Optimizing Apache Spark & Tuning Best Practices?

After this training, you will understand how Apache Spark works internally, know the best practices for writing performant code, and have the essential skills to debug and tweak your Spark applications.

Meet your trainer

Vadim Nelidov

Data Enchanter

Vadim is a Data Scientist passionate about solving data-driven problems and sharing his analytical insights to make data literacy a reality for all.

Flexible delivery

The Right Format For Your Preferred Learning Style

In-Classroom & In-Company Training

Online, Instructor-Led Training

Hybrid and Blended Learning

Self-Paced Training

Structured, to-the-point, good combination of theory and practical examples, very knowledgeable trainer who can explain concepts very well

Data scientist

It was a hands-on and tangible course. We could apply what we learned in a matter of minutes. The trainer did a great job of answering ad-hoc questions that complemented the material. We appreciated the fact that we could apply what we were taught directly to our company.

Technical Leader & Software Architect

I liked every aspect of this training and would like to thank the trainers. They did an excellent job of explaining how to use Spark for data science. This is the fourth GoDataDriven training I’ve followed. All were great, but this was the best one so far.

Data Scientist

Climbing a steep Python and Machine Learning curve in three days. This would have taken me months on my own.

Data Scientist
