JP
- May 22, 2021

How to read a CSV file with Apache Spark

Apache Spark works well for reading files for data extraction. In this post we'll build an example that reads a CSV file using Spark, Java and Maven.

Maven

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.1.0</version>
    </dependency>
</dependencies>

CSV Content

Suppose the file below is named movies.csv.

title;year;rating
The Shawshank Redemption;1994;9.3
The Godfather;1972;9.2
The Dark Knight;2008;9.0
The Lord of the Rings: The Return of the King ;2003;8.9
Pulp Fiction;1994;8.9
Fight Club;1999;8.8
Star Wars: Episode V - The Empire Strikes Back;1980;8.7
Goodfellas;1990;8.7
Star Wars;1977;8.6
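
The read later in this post loads the relative path movies.csv, so the file has to exist in the directory the application is started from. As a minimal sketch, assuming you prefer to create the file from code rather than by hand, the plain Java class below writes the content shown above. The class name WriteMoviesCsv and its write helper are hypothetical and only serve this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original post
public class WriteMoviesCsv {

    // Writes the lines to the target path as UTF-8, replacing any existing file
    static Path write(Path target, List<String> lines) throws IOException {
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Same content as the sample above: a header line, then one movie per line
        List<String> lines = Arrays.asList(
                "title;year;rating",
                "The Shawshank Redemption;1994;9.3",
                "The Godfather;1972;9.2",
                "The Dark Knight;2008;9.0",
                "The Lord of the Rings: The Return of the King ;2003;8.9",
                "Pulp Fiction;1994;8.9",
                "Fight Club;1999;8.8",
                "Star Wars: Episode V - The Empire Strikes Back;1980;8.7",
                "Goodfellas;1990;8.7",
                "Star Wars;1977;8.6");

        Path written = write(Paths.get("movies.csv"), lines);
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Run it once from the project's working directory before starting the Spark example.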

Creating a SparkSession

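import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Run locally, using as many worker threads as there are cores
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]");
sparkConf.setAppName("app");

// Reuse an existing session or create a new one with this configuration
SparkSession sparkSession = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();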

Running the Read

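import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read movies.csv into a DataFrame (Dataset<Row>)
Dataset<Row> ds = sparkSession.read()
        .format("csv")
        .option("sep", ";")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("movies.csv");

// Print the first rows (up to 20) as a table
ds.select("title", "year", "rating").show();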

Result

[Image: console output of ds.show(), a table with the title, year and rating columns]

Understanding some parameters

.option("sep", ";"): Sets the field delimiter. Spark's default is a comma, so the semicolon (;) used in this file has to be specified explicitly
.option("inferSchema", "true"): Makes Spark scan the file(s) to infer (guess) the data type of each field. Without it, every field is read as a string
.option("header", "true"): Treats the first line of the file as a header and uses its values as the field names
.load("movies.csv"): The path of the file to read, here movies.csv
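
To make the idea behind sep, header and inferSchema more concrete, here is a simplified plain Java sketch of what "guessing" the types means: split each line on the delimiter, take the names from the first line, and pick the narrowest type that fits every value in a column. This is not Spark's implementation, and the class SchemaInferenceSketch, its methods and the type labels are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; Spark's real inference handles many more types and options
public class SchemaInferenceSketch {

    // Narrowest type that fits every value: integer, then double, otherwise string
    static String inferType(List<String> values) {
        boolean allIntegers = true;
        boolean allDoubles = true;
        for (String value : values) {
            if (!isInteger(value)) allIntegers = false;
            if (!isDouble(value)) allDoubles = false;
        }
        if (allIntegers) return "integer";
        if (allDoubles) return "double";
        return "string";
    }

    static boolean isInteger(String s) {
        try { Integer.parseInt(s.trim()); return true; }
        catch (NumberFormatException e) { return false; }
    }

    static boolean isDouble(String s) {
        try { Double.parseDouble(s.trim()); return true; }
        catch (NumberFormatException e) { return false; }
    }

    // The first line is treated as the header; the remaining lines are data
    static Map<String, String> inferSchema(List<String> lines, String sep) {
        String delimiter = Pattern.quote(sep);
        String[] names = lines.get(0).split(delimiter, -1);
        Map<String, String> schema = new LinkedHashMap<>();
        for (int col = 0; col < names.length; col++) {
            List<String> column = new ArrayList<>();
            for (String line : lines.subList(1, lines.size())) {
                column.add(line.split(delimiter, -1)[col]);
            }
            schema.put(names[col], inferType(column));
        }
        return schema;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "title;year;rating",
                "The Shawshank Redemption;1994;9.3",
                "The Godfather;1972;9.2",
                "The Dark Knight;2008;9.0");
        // prints {title=string, year=integer, rating=double}
        System.out.println(inferSchema(lines, ";"));
    }
}
```

The sketch also shows why inference has a cost: every value has to be inspected before a column's type is known, which is why it requires reading the data in addition to loading it.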

Books to study and read

If you want to learn more about Spark and Maven and reach a high level of knowledge, I strongly recommend reading the following books:

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book, you will learn about DataFrames and Spark SQL through practical examples. The book dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark. The book explores Spark's main features, such as working with DataFrames, using Spark SQL to manipulate data with SQL, and using Structured Streaming to process data in real time. It contains practical examples and code snippets that make it easier to follow.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices for handling large-scale data applications with Spark and the Scala language. It covers techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over the RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more.

Maven: The Definitive Guide, written by Maven creator Jason Van Zyl and his team at Sonatype, clearly explains how this tool can bring order to your software development projects.

In this book you'll learn about the POM and project relationships, the build lifecycle, plugins, project website generation, advanced site generation, reporting, properties, build profiles, the Maven repository and more.

Well, that’s it. I hope you enjoyed it!
