Datasets in Spark

DhineshSunder Ganapathi · Published in Geek Culture · Mar 6, 2022 · 5 min read


Spark has become the ubiquitous platform for data processing and has largely taken over the traditional MapReduce framework. In fact, some technologists would go so far as to declare MapReduce dead. Spark has been shown to outperform MapReduce by as much as 100x in numerous benchmarks and performance studies.

In this article, we will cover the Spark Dataset in brief.


Below is the definition of a Dataset from the official Databricks documentation:

“A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in the statically typed, Spark-supported languages Java and Scala. Datasets are strictly a JVM language feature. Datasets aren’t supported in R and Python since these are dynamically typed languages.”

Overview

A Dataset is a data structure in Spark SQL that is strongly typed and maps to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API. Spark Datasets provide both type safety and an object-oriented programming interface.
The Dataset combines the features of the RDD and the DataFrame. It provides:

  • The convenience of the RDD API.
  • The performance optimizations of the DataFrame.
  • The static type safety of Scala.

Since Spark 2.0, the Dataset has largely superseded the RDD as the primary abstraction: it is strongly typed like an RDD, but with richer optimizations under the hood.
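
To make this concrete, below is a minimal Scala sketch of creating and transforming a Dataset. The Person case class, the local master URL, and the sample data are illustrative assumptions, not something from the article:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical domain type used throughout these sketches.
case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("dataset-sketch")
  .master("local[*]") // local mode, for illustration only
  .getOrCreate()

import spark.implicits._ // brings encoders for case classes into scope

// A strongly typed Dataset[Person]: field names and types are
// checked by the Scala compiler, not at runtime.
val people = Seq(Person("Ada", 36), Person("Kay", 9)).toDS()

// Transformations take plain Scala lambdas, yet still run through
// Catalyst and Tungsten under the hood.
val adults = people.filter(_.age >= 18)
adults.show()
```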

DataFrame and Dataset in Spark

In the context of Scala, we can think of a DataFrame as an alias for a collection of generic objects, Dataset[Row]. A Row is an untyped, generic JVM object that can hold fields of different types. In contrast, a Dataset is a collection of strongly typed JVM objects, defined by a case class in Scala or a class in Java. It is fair to say that every Dataset in Scala has an untyped view called a DataFrame, which is a Dataset of Row. The typed and untyped APIs across the Spark-supported languages are:

  • Scala: typed Dataset[T]; untyped DataFrame (an alias for Dataset[Row])
  • Java: typed Dataset<T>; the untyped view is Dataset<Row>
  • Python: DataFrame only (untyped)
  • R: DataFrame only (untyped)
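
A short sketch of moving between the two views, reusing the hypothetical Person case class and SparkSession from the previous sketch:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._

val df: DataFrame = Seq(Person("Ada", 36)).toDF() // DataFrame = Dataset[Row]

// Row is untyped: fields are looked up by name and cast at runtime.
val age: Int = df.first().getAs[Int]("age")

// as[T] re-attaches the type, turning the untyped view back into a Dataset.
val ds: Dataset[Person] = df.as[Person]

// toDF() gives the untyped DataFrame view of any Dataset.
val untypedAgain: DataFrame = ds.toDF()
```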

Datasets are possible because of a feature called the encoder, which converts between JVM types and Spark SQL’s specialized internal (tabular) representation. Encoders are highly specialized and optimized code generators that emit custom bytecode for serializing and deserializing your data, and every Dataset requires one. The encoder maps domain-specific types to Spark’s internal representation for each type: for instance, an Int field in a Row will be converted to IntegerType (Scala and Java) or IntegerType() (Python). Domain-specific types are expressed as beans in Java and as case classes in Scala.

The tabular representation is stored using Spark’s internal Tungsten binary format, which allows operations directly on serialized data and improves memory utilization. While using the Dataset API, Spark generates code at runtime to serialize a JVM object into this internal binary structure and vice versa. The conversion has a slight performance cost, but it brings several benefits: for example, because Spark understands the structure of the data in a Dataset, it can create a more optimal layout in memory when caching Datasets.
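
As a small illustration, the sketch below asks Spark’s product encoder for the relational schema it derives from the hypothetical Person case class; the Java-bean equivalent would use Encoders.bean:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// The encoder maps the JVM type to Spark's internal Tungsten format
// and carries the relational schema derived from the case class.
val personEncoder: Encoder[Person] = Encoders.product[Person]

personEncoder.schema.printTreeString()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
```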


Features of Dataset in Spark

Having introduced the Dataset, let’s now discuss its main features.

Optimized Queries

Datasets in Spark provide optimized queries through the Catalyst query optimizer and Tungsten. Catalyst is an execution-agnostic framework that represents and manipulates a data-flow graph, which is a tree of expressions and relational operators. Tungsten improves execution by optimizing the Spark job at the physical level, with an emphasis on the hardware architecture of the platform on which Apache Spark runs.
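
You can inspect the plans Catalyst and Tungsten produce for a query with explain(). A sketch, reusing the people Dataset from the first example:

```scala
import spark.implicits._

val query = people
  .filter(_.age >= 18) // typed lambda; the surrounding plan is still optimized
  .groupBy($"name")
  .count()

// Prints the parsed, analyzed, and optimized logical plans,
// plus the Tungsten-backed physical plan.
query.explain(true)
```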

Analysis at compile time

With Datasets, both syntax errors and analysis errors are caught at compile time. That is not possible with DataFrames, where analysis errors only surface at runtime, or with regular SQL queries, where even syntax errors appear only at runtime.
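
A sketch of the difference, assuming the people Dataset and df DataFrame from the earlier sketches; the failing lines are left commented out:

```scala
// Typed API: a misspelled field is rejected by the Scala compiler.
// people.map(_.agee + 1)  // does not compile: value agee is not a member of Person
val next = people.map(_.age + 1) // checked at compile time

// Untyped API: the same mistake compiles fine and only fails at runtime.
// df.select("agee")       // throws AnalysisException when the query is analyzed
```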

Persistent Storage

Spark Datasets are both serializable and queryable, so we can save them to persistent storage.
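
For example, a Dataset can be written to Parquet and read back with its type re-attached (the output path is illustrative):

```scala
// Persist the Dataset as Parquet files.
people.write
  .mode("overwrite")
  .parquet("/tmp/people.parquet")

// Read it back and restore the static type with as[T].
val restored = spark.read.parquet("/tmp/people.parquet").as[Person]
```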

Faster Computation

The Dataset implementation is much faster than the equivalent RDD implementation, which increases the performance of the system. To achieve the same performance with the RDD, the user has to manually work out how to express the computation so that it parallelizes optimally.
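
A sketch of the contrast: computing an average age by hand against the RDD API versus declaratively on the Dataset, where Catalyst plans the execution (reusing the people Dataset from earlier):

```scala
import org.apache.spark.sql.functions.avg
import spark.implicits._

// RDD: the user spells out the aggregation strategy manually.
val (sum, count) = people.rdd
  .map(p => (p.age.toLong, 1L))
  .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
val rddAvg = sum.toDouble / count

// Dataset: declare the result; Spark chooses the execution strategy.
val dsAvg = people.agg(avg($"age")).first().getDouble(0)
```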

Differences with DataFrames

To gain a better understanding of the two, we need to contrast DataFrames and Datasets. Datasets check whether types conform to the specification at compile time. DataFrames aren’t truly untyped, as their types are maintained by Spark, but the check that the types conform to the schema happens only at runtime. Said another way, a DataFrame can be thought of as a Dataset of type Row, which is Spark’s internal optimized in-memory representation for computation. Having its own internal representation of types lets Spark avoid JVM objects, which can be slow to instantiate and carry garbage-collection costs.

Use-cases for Datasets

It may seem redundant to have Datasets when we already have DataFrames, but there are scenarios in which Datasets are the better fit. Some operations simply cannot be expressed with DataFrames, only with Datasets. There is also type safety: for example, attempting to multiply two string variables fails at compile time instead of at runtime, as sketched below. Strong typing helps during development too, since IDEs and other tools can provide auto-completion and other hints for strongly typed objects. Finally, if all of one’s data and transformations are defined in terms of case classes (in Scala), it is trivial to reuse them for both distributed and local workloads.
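
A sketch of both points, assuming the Person case class and people Dataset from earlier; the string multiplication is commented out because it would not compile:

```scala
// Type safety: rejected by the compiler, not discovered in production.
// val nonsense = Person("Ada", 36).name * Person("Kay", 9).name // does not compile

// The same function works on a plain local collection and, unchanged,
// on a distributed Dataset.
def isAdult(p: Person): Boolean = p.age >= 18

val localAdults       = Seq(Person("Ada", 36), Person("Kay", 9)).filter(isAdult)
val distributedAdults = people.filter(isAdult _)
```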

In this article, we have only scratched the surface of what Spark Datasets have to offer. I hope you found this introduction helpful. Until next time, cheers!
