IN-COMPANY TRAINING PROGRAMS
Contact Giovanni Lanzani if you want to know more about custom data & AI training for your teams. He’ll be happy to help you!
Data Science with Spark Training
Apache Spark is a powerful, open-source processing engine built around speed, ease of use, and advanced analytics. In this course, you will learn to unlock its full potential and master this challenging tool.
This training is for you if…
You have worked with Python before and you want to know how to scale to large datasets
You have started, or are about to start, working with large data
You know the concepts of machine learning and you want to know how to apply them at scale
This training is not for you if…
You won’t be working with Spark but want to learn Python (check out the Python for Data Analysis training instead)
You want a deep dive into machine learning (check out the Certified Data Science with Python training instead)
What you'll learn
Spark basics
- Spark execution and the Spark session
- Transformations vs. actions
- Laziness and lineage: how Spark optimizes code
- How to use the Spark UI
DataFrames
- Spark DataFrames vs pandas DataFrames
- How to load and save DataFrames
- How to join data
- User-defined functions and pandas user-defined functions (with performance implications)
- Window operations
Advanced Spark
- How to apply partitioning and how Spark reads and writes data
- Shuffling, narrow and wide operations, and their impact on performance
- The Catalyst optimizer
- Scheduling and job execution
- Caching and persistence levels
Spark.ml
- Machine learning with Spark
- Pre-processing data and feature engineering
- Model selection
- Pipeline API
- Advanced topics
Spark structured streaming
- Structured streaming
- Machine learning & streaming
- Windows and aggregations
- Fault tolerance & Kafka
- Kafka as a source and sink
The schedule
- Spark execution and Spark sessions
- DataFrame methods, properties, and actions
- APIs: (Py)Spark DataFrame vs Spark SQL
- Reading and writing data in Spark
- The anatomy of a Spark job
- Narrow and wide transformations
- Window functions
- Applied machine learning in Spark
- Spark structured streaming
- Integrating Apache Spark with Apache Kafka
After the training you will be able to:
- Process large-scale data using PySpark
- Understand the fundamentals of Apache Spark
- Perform machine learning on large-scale data