Applying Change Data Feed to audit Delta tables
What is the Change Data Feed?

Change Data Feed is a Delta Lake feature, available as of version 2.0.0, that tracks row-level changes in Delta tables: DML operations (Merge, Delete or Update), the data version and the timestamp of when the change happened. Because Merge, Delete and Update operations are mapped and the history of changes is kept at row level, every event a record undergoes is registered by Delta through the Change Data Feed, working as a kind of audit log. Of course, it can also serve many other use cases; the possibilities are extensive.

How it works in practice

Enabling Change Data Feed on Delta tables is an interesting way to handle row-level records, and in this post we will show how it works. We will perform some operations to explore more of the power of the Change Data Feed. We will work with the following Dataset:

Creating the Spark Session and configuring some Delta parameters

From now on, we'll build the code in chunks for easier understanding. In the code below we create the method responsible for providing the Spark session, configuring some parameters that Delta needs to work.

Loading the Dataset

Let's load the Dataset and create a temporary view to be used in our pipeline later.

Creating the Delta Table

Now we will create the Delta table, already configuring Change Data Feed in the table properties; all the metadata is based on the previously presented Dataset. Note that we set the property delta.enableChangeDataFeed = true to activate the Change Data Feed.

Performing a Data Merge

Next we'll perform a simple Merge operation so that the Change Data Feed registers it as a change in our table. Note that the Merge uses our previously created global_temp.raw_product view to upsert the data.

Auditing the table

Now that the Merge has been executed, let's read our table to understand what happened and how the Change Data Feed works. Notice that we're passing the following options:

1. readChangeFeed, which is required for reading the Change Data Feed.
2. startingVersion, which restricts the table version from which changes should be displayed.

Result after execution:

See that in addition to the columns defined when creating the table, we have 3 new columns managed by the Change Data Feed:

1. _change_type: the operation performed on the row: insert, update_preimage, update_postimage or delete.
2. _commit_version: the version of the change.
3. _commit_timestamp: the timestamp of when the change happened.

In the result above, the upsert produced simple inserts, since no existing rows satisfied the conditions of an update.

Deleting a record

In this step we will run a simple delete on a table record, just to validate how the Change Data Feed behaves.

Auditing the table (again)

Note below that after deleting the record with id 6, we have a new row marked as delete and its version incremented to 2. Another point is that the original record was kept, but with the old version.

Updating a record

As a last test, we will update a record to understand, once again, the behavior of the Change Data Feed.

Auditing the table (last time)

After running the update, notice that 2 new values have been added to the _change_type column. The update_postimage value is the row as it looks after the update, and this time the old record keeps the same version as the new one in the _commit_version column, because this same record was marked in the _change_type column as update_preimage, that is, its value before the change.

To make the walkthrough concrete, the sketches below approximate each step in code.
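First, the Spark session. This is a minimal sketch, assuming PySpark with the delta-spark 2.x jars available on the classpath; the app name is illustrative:

```python
from pyspark.sql import SparkSession

def get_spark_session() -> SparkSession:
    # Registers the Delta SQL extension and catalog so that Delta tables
    # and MERGE/UPDATE/DELETE statements work in this session.
    return (
        SparkSession.builder
        .appName("delta-change-data-feed")  # illustrative name
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

spark = get_spark_session()
```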
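Next, loading the Dataset and registering the view. The file path, format and schema are assumptions here, since the post's dataset isn't reproduced; what matters is that the view ends up reachable as global_temp.raw_product:

```python
# Hypothetical source file; the actual dataset is not shown in the post.
df_products = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/products.csv")
)

# Global temp views live in the reserved global_temp database,
# which is why the Merge references global_temp.raw_product.
df_products.createOrReplaceGlobalTempView("raw_product")
```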
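Creating the Delta table with the Change Data Feed property. The table name product and its columns are illustrative; the essential piece is delta.enableChangeDataFeed = true:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS product (
        id INT,
        name STRING,
        price DOUBLE
    )
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```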
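The upsert against the view. The join key id is an assumption about the dataset:

```python
spark.sql("""
    MERGE INTO product AS target
    USING global_temp.raw_product AS source
        ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```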
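Reading the table through the Change Data Feed, passing the two options discussed above:

```python
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")  # required to read the change feed
    .option("startingVersion", 0)      # show changes from version 0 onwards
    .table("product")
)

# _change_type, _commit_version and _commit_timestamp are the
# extra columns managed by the Change Data Feed.
changes.show(truncate=False)
```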
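The delete of the record with id 6 (the table name, as before, is the illustrative product):

```python
spark.sql("DELETE FROM product WHERE id = 6")
```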
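Finally, the update. The post doesn't show which record or column was changed, so the values below are hypothetical; re-running the change-feed read afterwards surfaces the update_preimage / update_postimage pair discussed above:

```python
spark.sql("UPDATE product SET price = 49.90 WHERE id = 2")
```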
Conclusion

The Change Data Feed is a great resource for understanding the behavior of your data pipeline, and also a way to audit records in order to better understand the operations performed on a table. According to the Delta team itself, it is a feature that, when kept enabled, does not generate any significant overhead. It can be fully adopted in your data strategy, as it brings the several benefits shown in this post.

Repository: GitHub

Hope you enjoyed!