• JP

Working with Schemas in Spark Dataframes using PySpark


What's a schema in the Dataframes context?


A schema is metadata that lets us work with data in a standardized way. That's my own definition, but we can also understand a schema as a structure that represents a data context or a business model.


Spark lets us use schemas with Dataframes, and I believe that's a good way to maintain data quality and reliability. We can also use schemas to understand the data and connect it to the business.


But if you know a little more about Dataframes, you know that working with a schema isn't mandatory. Spark can infer a schema when none is defined and often reach the same result, but depending on the data source, the inference might not work as we expect.


In this post we're going to create a simple Dataframe example that reads a CSV file without a schema, and another one that uses a defined schema. Through these examples we'll see the advantages and disadvantages.


Let's get to work!


CSV File content

"type","country","engines","first_flight","number_built"
"Airbus A220","Canada",2,2013-03-02,179
"Airbus A320","France",2,1986-06-10,10066
"Airbus A330","France",2,1992-01-02,1521
"Boeing 737","USA",2,1967-08-03,10636
"Boeing 747","USA",4,1969-12-12,1562
"Boeing 767","USA",2,1981-03-22,1219

As you may have noticed in the content above, we have different data types: string, numeric and date columns. This content is stored in airliners.csv, the file used in the code.
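If you want to follow along, here is a minimal sketch that creates the file with plain Python, using the same rows listed above. It assumes the file should live in the current working directory, which is where the relative path "airliners.csv" in the Spark code will be resolved when running locally.

```python
import csv
from pathlib import Path

# Same rows as listed above; the file name matches the one used in the Spark code.
CSV_CONTENT = '''"type","country","engines","first_flight","number_built"
"Airbus A220","Canada",2,2013-03-02,179
"Airbus A320","France",2,1986-06-10,10066
"Airbus A330","France",2,1992-01-02,1521
"Boeing 737","USA",2,1967-08-03,10636
"Boeing 747","USA",4,1969-12-12,1562
"Boeing 767","USA",2,1981-03-22,1219
'''

csv_path = Path("airliners.csv")
csv_path.write_text(CSV_CONTENT, encoding="utf-8")

# Read the file back with the standard csv module as a quick sanity check.
with csv_path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[0])                    # header row
print(len(rows) - 1, "data rows")
```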



Writing a Dataframe without Schema

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

Dataframe/Print schema result


It seems to have worked fine, but if you look closely, you'll notice that some field types in the schema don't match their values, for example number_built, engines and first_flight. Without a schema, Spark reads every CSV column as a string. But these fields aren't string types, right?
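To see why this matters, here is a small sketch in plain Python (not Spark) that uses the number_built values from the file. It only illustrates the general problem: text is compared character by character, so numbers stored as strings sort and compare differently from real numbers. The variable names are just for this illustration.

```python
# The number_built values as text, the way they come back when every column is a string.
number_built_as_text = ["179", "10066", "1521", "10636", "1562", "1219"]

# The same values converted to integers.
number_built_as_int = [int(value) for value in number_built_as_text]

# Text is compared character by character, so "179" ends up as the "largest" value.
print(sorted(number_built_as_text))
print(max(number_built_as_text))

# Integers are compared by magnitude, which is what we actually want.
print(sorted(number_built_as_int))
print(max(number_built_as_int))
```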


We can try to fix this by adding the "inferSchema" option and setting it to "true".

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

Dataframe/Print schema result


Even with schema inference, the first_flight field is still a string type. Inference also has a cost, since Spark needs an extra pass over the data to work out the types. Let's try a Dataframe with a defined schema to see if this detail gets fixed.

 

Writing a Dataframe with Schema


In the code below, the difference from the previous versions is that we're adding an object that represents the schema. This schema describes the content of the CSV file. Note that we have to declare the name and type of each column.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType, DateType, StructField

if __name__ == "__main__":

    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    # Explicit schema: one StructField (name, type) per CSV column
    airliners_schema = StructType([
        StructField("type", StringType()),
        StructField("country", StringType()),
        StructField("engines", IntegerType()),
        StructField("first_flight", DateType()),
        StructField("number_built", IntegerType())
    ])

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .schema(airliners_schema) \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

Dataframe/Print schema result


After defining the schema, all the field types match their values. This shows how important it is to use schemas with Dataframes. Now we can manipulate the data according to its type with much more confidence.
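As a rough sketch of what typed columns make possible, the helpers below filter by date and sum an integer column. The helper names (read_airliners, first_flown_since, built_per_country) and the 1980 cutoff are hypothetical, made up for this illustration. For brevity the schema is written as a DDL-formatted string, which the reader's schema() method accepts as an alternative to a StructType. The functions expect an existing SparkSession like the one created in the examples above.

```python
from datetime import date

# (column name, Spark SQL type) pairs mirroring the CSV header.
COLUMNS = [
    ("type", "STRING"),
    ("country", "STRING"),
    ("engines", "INT"),
    ("first_flight", "DATE"),
    ("number_built", "INT"),
]

# DDL-formatted schema string; names are backtick-quoted to avoid keyword clashes.
SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in COLUMNS)


def read_airliners(spark, path="airliners.csv"):
    """Read the CSV file using the DDL schema (hypothetical helper)."""
    return (
        spark.read
        .option("header", "true")
        .schema(SCHEMA_DDL)
        .csv(path)
    )


def first_flown_since(df, cutoff=date(1980, 1, 1)):
    """Keep airliners whose first flight is on or after the cutoff date."""
    return df.filter(df.first_flight >= cutoff)


def built_per_country(df):
    """Sum number_built per country; this needs a numeric column."""
    return df.groupBy("country").sum("number_built")


# Usage, assuming `spark` is an existing SparkSession:
#   df = read_airliners(spark)
#   first_flown_since(df).show()
#   built_per_country(df).show()
```

With all columns read as strings, the date comparison would compare text rather than dates, and the sum would be rejected because the column is not numeric.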


Cool? I hope you enjoyed it!

