There are two ways to create tables in the Hive context, and in this post we'll show the differences, advantages, and disadvantages of each.
Internal tables are also known as Managed tables, and we'll see why in a moment. First, let's create an internal table using SQL in the Hive context and look at its advantages and disadvantages.
create table coffee_and_tips_table (name string, age int, address string) stored as textfile;
To be honest, I wouldn't call it an advantage exactly, but Internal tables are managed by Hive itself.
Internal tables can't access remote storage services, such as cloud storage on Amazon AWS, Microsoft Azure, or Google Cloud.
Dropping an Internal table deletes all of its data, including metadata and partitions.
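As a minimal sketch of this behavior (the warehouse path below is just the common default and depends on your configuration):

```sql
-- Internal (managed) table: Hive owns the data files under its warehouse
-- directory, e.g. /user/hive/warehouse/coffee_and_tips_table.
drop table coffee_and_tips_table;
-- After the drop, both the metastore entry and the underlying files are gone.
```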
External tables have some interesting features compared to Internal tables, and they're a good, recommended approach when we need to create tables.
In the script below you can see the difference between the Internal table creation from the last section and an External table: we just added the reserved word external to the script.
create external table coffee_and_tips_external (name string, age int, address string) stored as textfile;
The data and metadata won't be lost if the table is dropped.
External tables can be accessed and managed by external processes.
External tables allow access to a remote storage service as a source location.
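For example, an External table can point at a remote storage location through the location clause. A sketch, assuming a hypothetical S3 bucket named my-bucket:

```sql
-- External table backed by remote storage (bucket name is hypothetical).
create external table coffee_and_tips_s3 (name string, age int, address string)
stored as textfile
location 's3://my-bucket/coffee_and_tips/';
```

Dropping this table removes only the metastore entry; the files in the bucket stay where they are.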
Again, I wouldn't call it a disadvantage exactly, but if the schema changes or partitions are added or removed directly in storage (outside Hive), you'll probably need to run a command to sync the metastore with the table, as shown below.
msck repair table <table_name>
Depending on the volume, this operation may take some time to complete.
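To put the repair command in context, here is a sketch with a partitioned External table whose partition directories are written by an external process (the bucket and paths are hypothetical):

```sql
-- Partitioned external table; partition directories such as
-- .../address=NY/ are created directly in storage by another process.
create external table coffee_and_tips_part (name string, age int)
partitioned by (address string)
stored as textfile
location 's3://my-bucket/coffee_and_tips_part/';

-- Register the partition directories found in storage with the metastore:
msck repair table coffee_and_tips_part;
show partitions coffee_and_tips_part;
```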
To check a table's type, run the command below and look at the Table Type field in the output.
hive> describe formatted <table_name>
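For the external table created earlier, the relevant part of the output looks roughly like this (exact formatting varies by Hive version; the elided lines hold the other table details):

```sql
hive> describe formatted coffee_and_tips_external;
...
Table Type:             EXTERNAL_TABLE
...
```

A managed table would show MANAGED_TABLE in the same field.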
Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s):
Programming Hive This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem.
Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book, you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark applications. The practical examples are in Scala and Python.
Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as DataFrames, Spark SQL (which lets you use SQL to manipulate data), and Structured Streaming for processing data in real time. The book contains practical examples and code snippets that make for easy reading.
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices for using Spark and the Scala language to handle large-scale data applications: techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over the RDD data structure, examples using the Spark MLlib and Spark ML machine learning libraries, and more.
Cool? I hope you enjoyed it!