top of page

Search

44 items found for ""

  • How to generate random Data using Datafaker lib

    Sometimes in our projects we have to fill Java objects for unit tests or even to create a database dump with random data to test a specific feature and etc. We need to be creative trying to create names, street names, cities or documents. There's an interesting and helpful Java library called Datafaker that allows to create random data with a large number of providers. Providers are objects based on a context, for example: If you want to generate data about a Person object, there's a specific provider for this context that will generate name, last name and etc. If you need to create a unit test that you need data about address, you'll find it. In this post we'll create some examples using Maven but the library also provides support for Gradle projects. Maven net.datafaker datafaker 1.1.0 Generating Random Data Let's create a simple Java class that contains some properties like name, last name, address, favorite music genre and food. public class RandomPerson { public String firstName; public String lastName; public String favoriteMusicGenre; public String favoriteFood; public String streetAddress; public String city; public String country; @Override public String toString() { return "firstName=" + firstName + "\n" + "lastName=" + lastName + "\n" + "favoriteMusicGenre="+favoriteMusicGenre + "\n" + "favoriteFood=" + favoriteFood + "\n" + "streetAddress=" + streetAddress + "\n" + "city=" + city + "\n" + "country=" + country ; } static void print(RandomPerson randomPerson){ System.out.println( randomPerson ); } } In the next step we'll fill an object using the providers that we quote in the first section. First of all, we create an object called randomData that represents Faker class. This class contains all the providers in the example below. public static void main(String[] args) { Faker randomData = new Faker(); RandomPerson randomPerson = new RandomPerson(); randomPerson.firstName = randomData.name().firstName(); randomPerson.lastName = randomData.name().lastName(); randomPerson.favoriteMusicGenre = randomData.music().genre(); randomPerson.favoriteFood = randomData.food().dish(); randomPerson.streetAddress = randomData.address().streetAddress(); randomPerson.city = randomData.address().city(); randomPerson.country = randomData.address().country(); print(randomPerson); } After the execution, we can see the results like this at the console: Result firstName=Dorthy lastName=Jones favoriteMusicGenre=Electronic favoriteFood=Cauliflower Penne streetAddress=7411 Darin Gateway city=Gutkowskifort country=Greece Every execution will be a new result because of providers are randoms. Another interesting feature is that we can set up the Locale when instantiate an object. Faker randomData = new Faker(Locale.JAPANESE); See the results based on Local.JAPANESE: Result firstName=航 lastName=横山 favoriteMusicGenre=Non Music favoriteFood=French Fries with Sausages streetAddress=418 美桜Square city=南斉藤区 country=Togo Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): Unit Testing Principles, Practices, and Patterns: Effective Testing Styles, Patterns, and Reliable Automation for Unit Testing, Mocking, and Integration Testing with Examples in C# is a book that covers Unit Testing Principles, Patterns and Practices teaches you to design and write tests that target key areas of your code including the domain model. 
In this clearly written guide, you learn to develop professional-quality tests and test suites and integrate testing throughout the application life cycle. Mastering Unit Testing Using Mockito and JUnit is a book that covers JUnit practices using one of the most famous testing libraries called Mockito. This book teaches how to create and maintain automated unit tests using advanced features of JUnit with the Mockito framework, continuous integration practices (famous CI) using market tools like Jenkins along with one of the largest dependency managers in Java projects, Maven. For you who are starting in this world, it is an excellent choice. Isn't a cool library!? See you!

  • Differences between External and Internal tables in Hive

    There are two ways to create tables in the Hive context and this post we'll show the differences, advantages and disadvantages. Internal Table Internal tables are known as Managed tables and we'll understand the reason in the following. Now, let's create an internal table using SQL in the Hive context and see the advantages and disadvantages. create table coffee_and_tips_table (name string, age int, address string) stored as textfile; Advantages To be honest I wouldn't say that it's an advantage but Internal tables are managed by Hive Disadvantages Internal tables can't access remote storage services for example in clouds like Amazon AWS, Microsoft Azure and Google Cloud. Dropping Internal tables all the data including metadata and partitions will be lost. External Table External tables has some interesting features compared to Internal tables and it's a good and recommended approach when we need to create tables. In the script below you can see the difference between Internal table creation and External table related to the last section. We just added the reserved word external in the script. create external table coffee_and_tips_external (name string, age int, address string) stored as textfile; Advantages The data and metadata won't be lost if drop table External tables can be accessed and managed by external process External tables allows access to remote storage service as a source location Disadvantages Again, I wouldn't say that it's a disadvantage but if you need to change schema or dropping a table, probably you'll need to run a command to repair the table as shown below. msck repair table Depending on the volume, this operation may take some time to complete. To check out a table type, run the following command below and you'll see at the column table_type the result. hive> describe formatted Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): Programming Hive This comprehensive guide introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and about the main Spark's feature. Reading this book you will understand about DataFrames, Spark SQL through practical examples. The author dives into Spark low-level APIs, RDDs and also about how Spark runs on a cluster and how to debug and monitor Spark clusters applications. The practical examples are in Scala and Python. Beginning Apache Spark 3: With Dataframe, Spark SQL, Structured Streaming, and Spark Machine Library with the new version of Spark, this book explores the main Spark's features like Dataframes usage, Spark SQL that you can uses SQL to manipulate data and Structured Streaming to process data in real time. This book contains practical examples and code snippets to facilitate the reading. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of Spark MLlib and Spark ML machine learning libraries usage and more. Cool? I hope you enjoyed it!

  • How to save costs on S3 running Data Lake

    Cloud services provides useful resources to scale your business faster but not always we can measure cloud costs when we’re starting a business from the scratch or even being a solid business, costs always makes part of the strategy for any company who want to provide a better service. Me and my teammates have worked in a Data platform based on events enable to process 350 million events every day. We provide data to the client applications and to the businesses teams to make decisions and it always a challenge do deal with the massive data traffic and how we can maintain these data and saving money with storage at the same time. Storage is too expensive and there are some strategies to save money. For this post I’ll describe some strategies that we’ve adopted to save costs on S3 (Simple Storage Service) and I hope we can help it. Strategies Strategy #1 Amazon S3 storage classes Amazon S3 provides a way to manage files through life cycle settings, out there you can set ways to move files to different storage classes depending on the file’s age and access frequency. This strategy can save a lot of money to your company. Working with storage class enable us saving costs. By default, data are stored on S3 Standard storage class. This storage type has some benefits of storage and data access but we realized that after data transformed in the Silver layer, data in the Bronze layer it wasn’t accessed very often and it was totally possible to move them to a cheaper storage class. We decided to move it using life cycle settings to S3 Intelligent Tiering storage class. This storage class it was a perfect fit to our context because we could save costs with storage and even in case to access these files for a reason we could keeping a fair cost. We’re working on for a better scenario which we could set it a life cycle in the Silver layer to move files that hasn’t been accessed for a period to a cheaper storage class but at the moment we need to access historical files with high frequency. If you check AWS documentation you’ll note that there’s some cheapest storage classes but you and your team should to analyse each case because how cheapest is to store data more expensive will be to access them. So, be careful, try to understand the patterns about storage and data access in your Data Lake architecture before choosing a storage class that could fit better to your business. Strategy #2 Partitioning Data Apache Spark is the most famous framework to process a large amount of data and has been adopted by data teams around the world. During the data transformation using Spark you can set it a Dataframe to partition data through a specific column. This approach is too useful to perform SQL queries better. Note that partitioning approach has no relation to S3 directly but the usage avoids full scans in S3 objects. Full scans means that after SQL queries, the SQL engine can load gigabytes even terabytes of data. This could be very expensive to your company, because you can be charged easily depending on amount of loaded data. So, partitioning data has an important role when we need to save costs. Strategy #3 Delta Lake vacuum Delta Lake has an interesting feature called vacuum that’s a mechanism to remove files from the disk with no usage. Usually teams adopt this strategy after restoring versions that some files will be remain and they won’t be managed by Delta Lake. For example, in the image below we have 5 versions of Delta tables and their partitions. 
Suppose that we need to restore to version because we found some inconsistent data after version 1. After this command, Delta will point his management to version 1 as the current version but the parquet files related to others version will be there with no usage. We can remove these parquets running vacuum command as shown below. Note that parquets related to versions after 1 were removed releasing space in the storage. For more details I strongly recommend seeing Delta Lake documentation. Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS or Amazon Web Services is the most widely used cloud service in the world today, if you want to understand more about the subject to be well positioned in the market, I strongly recommend the study. Well that’s it, I hope you enjoyed it!

  • First steps with DBT - Data Build Tool

    DBT has been used by a lot of companies on Data area and I believe that we can extract good insights in this post about it. That's going to be a practical post showing how DBT works it and hope you guys enjoy it. What's DBT? DBT means Data Build Tool and enables teams to transform data already loaded in their warehouse with simple select statements. DBT does the T in ELT processes, in the other words, he doesn't work to extract and load data but he's useful to transform it. Step 1: Creating a DBT Project Now, we're assume that DBT is already installed but if not, I recommend see this link. After DBT installed you can create a new project using CLI or you can clone this project from the DBT Github repository. Here for this post we're going to use CLI mode to create our project and also to complete the next steps. To create a new project, run the command below. dbt init After running this command, you need to type the project's name and which warehouse or database you're going to use like the image below. For this post, we're going to use postgres adapter. It's very important that you have a postgres database already installed or you can up a postgres image using docker. About adapters, DBT supports different of them and you can check here. I created a table structure and also loaded it with data simulating data from a video platform called wetube and we're going to use them to understand how DBT works it. Follow the structure: Step 2: Structure and more about DBT After running dbt init command to create the project, a structure of folders and files below will be created. I won't talk about the whole directories of project but I'd like to focus in two of them. Sources Sources are basically the data already loaded into your warehouse. In DBT process, sources have the same meaning of raw data. There's no folders representing source data for this project but you need to know about this term because we're going to set up tables already created as sources for the next sections. Seeds Seeds is an interesting and useful mechanism to load static data into your warehouse through CSV files. If you want to load these data you need to create a CSV file on this directory and run the command below. dbt seed For each field on CSV file, DBT will infer their types and create a table into warehouse or database. Models DBT works with Model paradigm, the main idea is that you can create models through the transformation using SQL statements based on tables sources or existing models Every SQL file located in your model folder will create a model into your warehouse when the command below runs. dbt run Remember that a model can be created through a source or another model and don't worry about this, I'll show you more details about it. Step 3: Setting up database connection After project already created, we need to set up our database's connection and here at this post, we're going to use postgres as database. After initialize the project a bunch of files are created and one of them is called profiles.yml. profiles.yml file is responsible to control the different profiles to the different database's connection like dev and production environment. If you've noticed, we can't see this file on the image above because this file is created outside of project to avoid sensitive credentials. You can find this file in ~/.dbt/ directory. If you note, we have one profile named dbt_blog and a target called dev, by default the target refer to dev with the database's connection settings. 
Also, It's possible to create one or more profiles and targets, it enables working with different environments. Another important detail is that dbt_blog profile should be specified on dbt_project.yml file as a default profile. For the next sections, we'll discuss what and how dbt_project.yml file works it. Step 4: Creating dbt_project.yml file Every DBT project has a dbt_project.yml file, you can set up informations like project name, directories, profiles and materialization type. name: 'dbt_blog' version: '1.0.0' config-version: 2 profile: 'dbt_blog' model-paths: ["models"] analysis-paths: ["analyses"] test-paths: ["tests"] seed-paths: ["seeds"] macro-paths: ["macros"] snapshot-paths: ["snapshots"] target-path: "target" # directory which will store compiled SQL files clean-targets: # directories to be removed by `dbt clean` - "target" - "dbt_packages" models: dbt_blog: # Config indicated by + and applies to all files under models/example/ mart: +materialized: table Note that profile field was set up as the same profile specified on profiles.yml file and another important detail is about materialized field. Here was set up as a "table" value but by default, is a "view". Materialized fields allows you to create models as a table or view on each run. There are others type of materialization but we won't discuss here and I recommend see dbt docs. Step 5: Creating our first model Creating first files Let's change a little and let's going to create a sub-folder on model directory called mart and inside this folder we're going to create our .SQL files and also another important file that we don't discuss yet called schema.yml. Creating schema file Schema files are used to map sources and to document models like model's name, columns and more. Now you can create a file called schema.yml e fill up with these informations below. version: 2 sources: - name: wetube tables: - name: account - name: city - name: state - name: channel - name: channel_subs - name: video - name: video_like - name: user_address models: - name: number_of_subs_by_channel description: "Number of subscribers by channel" columns: - name: id_channel description: "Channel's ID" tests: - not_null - name: channel description: "Channel's Name" tests: - not_null - name: num_of_subs description: "Number of Subs" tests: - not_null Sources: At sources field you can include tables from your warehouse or database that's going to be used on model creation. models: At models field you can include the name's model, columns and their description Creating a model This part is where we can create SQL scripts that's going to result in our first model. For the first model, we're going to create a SQL statement to represent a model that we can see the numbers of subscribers by channel. Let's create a file called number_of_subs_by_channel.sql and fill up with these scripts below. with source_channel as ( select * from {{ source('wetube', 'channel') }} ), source_channel_subs as ( select * from {{ source('wetube','channel_subs') }} ), number_of_subs_by_channel as ( select source_channel.id_channel, source_channel.name, count(source_channel_subs.id_subscriber) num_subs from source_channel_subs inner join source_channel using (id_channel) group by 1, 2 ) select * from number_of_subs_by_channel Understanding model creation Note that we have multiple scripts separated by common table expression (CTE) that becomes useful to understand the code. DBT enables using Jinja template {{ }} bringing a better flexibility to our code. 
The usage of keyword source inside Jinja template means that we're referring source tables. To refer a model you need to use ref keyword. The last SELECT statement based on source tables generates the model that will be created as table in the database. Running our first model Run the command below to create our first model dbt run Output: Creating another model Imagine that we need to create a model containing account information and it's channels. Let's get back to schema.yml file to describe this new model. - name: account_information description: "Model containing account information and it's channels" columns: - name: id_account description: "Account ID" tests: - not_null - name: first_name description: "First name of user's account" tests: - not_null - name: last_name description: "Last name of user's account" tests: - not_null - name: email description: "Account's email" tests: - not_null - name: city_name description: "city's name" tests: - not_null - name: state_name description: "state's name" tests: - not_null - name: id_channel description: "channel's Id" tests: - not_null - name: channel_name description: "channel's name" tests: - not_null - name: channel_creation description: "Date of creation name" tests: - not_null Now, let's create a new SQL file and name it as account_information.sql and put scripts below: with source_channel as ( select * from {{ source('wetube', 'channel') }} ), source_city as ( select * from {{ source('wetube','city') }} ), source_state as ( select * from {{ source('wetube','state') }} ), source_user_address as ( select * from {{ source('wetube','user_address') }} ), source_account as ( select * from {{ source('wetube','account') }} ), account_info as ( select account.id_user as id_account, account.first_name, account.last_name, account.email, city.name as city_name, state.name as state_name, channel.id_channel, channel.name as channel, channel.creation_date as channel_creation FROM source_account account inner join source_channel channel on (channel.id_account = account.id_user) inner join source_user_address user_address using (id_user) inner join source_state state using (id_state) inner join source_city city using (id_city) ) select * from account_info Creating our last model For our last model, we going to create a model about how many likes has a video. Let's change again the schema.yml to describe and to document our future and last model. 
- name: total_likes_by_video description: "Model containing total of likes by video" columns: - name: id_channel description: "Channel's Id" tests: - not_null - name: channel description: "Channel's name" tests: - not_null - name: id_video description: "Video's Id" tests: - not_null - name: title description: "Video's Title" tests: - not_null - name: total_likes description: "Total of likes" tests: - not_null Name it a file called total_likes_by_video.sql and put the code below: with source_video as ( select * from {{ source('wetube','video') }} ), source_video_like as ( select * from {{ source('wetube','video_like') }} ), source_account_info as ( select * from {{ ref('account_information') }} ), source_total_like_by_video as ( select source_account_info.id_channel, source_account_info.channel, source_video.id_video, source_video.title, count(*) as total_likes FROM source_video_like inner join source_video using (id_video) inner join source_account_info using (id_channel) GROUP BY source_account_info.id_channel, source_account_info.channel, source_video.id_video, source_video.title ORDER BY total_likes DESC ) select * from source_total_like_by_video Running DBT again After creation of our files, let's run them again to create the models dbt run Output The models were created in the database and you can run select statements directly in your database to check it. Model: account_information Model: number_of_subs_by_channel Model: total_likes_by_video Step 6: DBT Docs Documentation After generated our models, now we're going to generate docs based on these models. DBT generates a complete documentation about models and sources and their columns and also you can see through a web page. Generating docs dbt docs generate Running docs on webserver After docs generated you can run command below to start a webserver on port 8080 and see the documentation locally. dbt docs serve Lineage Another detail about documentation is that you can see through of a Lineage the models and it's dependencies. Github code You can checkout this code through our Github page. Cool? I hope you guys enjoyed it!

  • Getting started using Terraform on AWS

    Terraform is a IaC (Infrastructure as Code) tool that makes it possible to provision infrastructure in cloud services. Instead of manually creating resources in the cloud, Terraform facilitates the creation and control of these services through management of state in a few lines of code. Terraform has its own language and can be used independently with other languages isolating business layer from infrastructure layer. For this tutorial, we will create an S3 Bucket and an SQS through Terraform on AWS Terraform Installation For installation, download the installer from this link according to your operating system. AWS Provider We'll use AWS as a provider. Thus, when we select AWS as a provider, Terraform will download the packages that will enable the creation of specific resources for AWS. To follow the next steps, we hope you already know about: AWS Credentials Your user already has the necessary permissions to create resources on AWS Authentication As we are using AWS as provider, we need to configure Terraform to authenticate and then create the resources. There are a few ways to authenticate. For this tutorial, I chose to use one of the AWS mechanisms that allows you to allocate credentials in a file in the $HOME/.aws folder and use it as a single authentication source. To create this folder with the credentials, we need to install the AWS CLI, access this link and follow the installation steps. This mechanism avoids using credentials directly in the code, so if you need to run a command or SDK that connects to AWS locally, these credentials will be loaded from this file. Credentials settings After installing the AWS CLI, open the terminal and run the following command: aws configure In the terminal itself, fill in the fields using your user's credentials data: After filling in, 2 text files will be created in the $HOME/.aws directory config: containing the profile, in this case the default profile was created credentials: containing own credentials Let's change the files to suit this tutorial, change the config file as below: [profile staging] output = json region = us-east-1 [default] output = json region = us-east-1 In this case, we have 2 profiles configured, the default and staging profile. Change the credentials file as below, replacing it with your credentials. [staging] aws_access_key_id = [Access key ID] aws_secret_access_key = [Secret access key] [default] aws_access_key_id = [Access key ID] aws_secret_access_key = [Secret access key] Creating Terraform files After all these configurations, we will actually start working with Terraform. For this we need to create some base files that will help us create resources on AWS. 1º Step: In the root directory of your project, create a folder called terraform/ 2º Step: Inside the terraform/ folder, create the files: main.tf vars.tf 3º Step: Create another folder called staging inside terraform/ 4º Step: Inside the terraform/staging/ folder create the file: vars.tfvars Okay, now we have the folder structure that we will use for the next steps. Setting up Terraform files Let's start by declaring the variables using the vars.tf file. vars.tf In this file is where we're going to create the variables to be used on resources and bring a better flexibility to our code. We can create variables with a default value or simply empty, where they will be filled in according to the execution environment, which will be explained later. 
variable "region" { default = "us-east-1" type = "string" } variable "environment" { } We create two variables: region: Variable of type string and its default value is the AWS region in which we are going to create the resources, in this case, us-east-1. environment: Variable that will represent the execution environment staging/vars.tfvars In this file we are defining the value of the environment variable previously created with no default value. environment = "staging" This strategy is useful when we have more than one environment, for example, if we had a production environment, we could have created another vars.tfvars file in a folder called production/. Now, we can choose in which environment we will run Terraform. We'll understand this part when we run it later. main.tf Here is where we'll declare resources such as S3 bucket and SQS to be created on AWS. Let's understand the file in parts. In this first part we're declaring AWS as a provider and setting the region using the variable already created through interpolation ${..}. Provider provider "aws" { region = "${var.region}" } Creating S3 Bucket To create a resource via Terraform, we always start with the resource keyword, then the resource name, and finally an identifier. resource "name_resource" "identifier" {} In this snippet we're creating a S3 Bucket called bucket.blog.data, remember that Bucket names must be unique. The acl field defines the Bucket restrictions, in this case, private. The tags field is used to provide extra information to the resource, in this case it will be provide the value of the environment variable. resource "aws_s3_bucket" "s3_bucket" { bucket = "bucket.blog.data" acl = "private" tags = { Environment = "${var.environment}" } } Creating SQS For now, we'll create an SQS called sqs-posts. Note that resource creation follows the same rules as we described earlier. For this scenario, we set the delay_seconds field that define the delay time for a message to be delivered. More details here. resource "aws_sqs_queue" "sqs-blog" { name = "sqs-posts" delay_seconds = 90 tags = { Environment = "${var.environment}" } } Running Terraform 1º Step : Initialize Terraform Open the terminal and inside terraform/ directory, run the command: terraform init Console message after running the command. 2º Step: In Terraform you can create workspaces. These are runtime environments that Terraform provides and bringing flexibility when it's necessary to run in more than one environment. Once initialized, a default workspace is created. Try to run the command below and see which workspace you're running. terraform workspace list For this tutorial we will simulate a development environment. Remember we created a folder called /staging ? Let's getting start using this folder as a development environment. For that, let's create a workspace in Terraform called staging as well. If we had a production environment, a production workspace also could be created. terraform workspace new "staging" Done, a new workspace called staging was created! 3º Step: In this step, we're going to list all existing resources or those that will be created, in this case, the last option. terraform plan -var-file=staging/vars.tfvars The plan parameter makes it possible to visualize the resources that will be created or updated, it is a good option to understand the behavior before the resource is definitively created. 
The second -var-file parameter makes it possible to choose a specific path containing the values ​​of the variables that will be used according to the execution environment. In this case, the /staging/vars.tfvars file containing values ​​related to the staging environment. If there was a production workspace, the execution would be the same, but for a different folder, got it? Messages console after running the last command using plan parameter. Looking at the console, note that resources earlier declared will be created: aws_s3_bucket.s3_bucket aws_sqs_queue.sqs-blog 4º Step: In this step, we are going to definitely create the resources. terraform apply -var-file=staging/vars.tfvars Just replace plan parameter with apply, then a confirmation message will be shown in the console: To confirm the resources creation, just type yes. That's it, S3 Bucket and SQS were created! Now you can check it right in the AWS console. Select workspace If you need to change workspace, run the command below selecting the workspace you want to use: terraform workspace select "[workspace]" Destroying resources This part of the tutorial requires a lot of attention. The next command makes it possible to remove all the resources that were created without having to remove them one by one and avoiding unexpected surprises with AWS billing. terraform destroy -var-file=staging/vars.tfvars Type yes if you want to delete all created resources. I don't recommend using this command in a production environment, but for this tutorial it's useful, Thus, don't forget to delete and AWS won't charge you in the future. Conclusion Terraform makes it possible to create infrastructure very simply through a decoupled code. For this tutorial we use AWS as a provider, but it is possible to use Google Cloud, Azure and other cloud services. Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author make comparisons with several other IaC (Infrastructure as code) tools such as Ansible and Cloudformation (IaC native to AWS) and especially how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a complete engineer or work in the Devops area, I strongly recommend learning about the topic. AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS or Amazon Web Services is the most widely used cloud service in the world today, if you want to understand more about the subject to be well positioned in the market, I strongly recommend the study. Well that’s it, I hope you enjoyed it!

  • Overview about AWS SNS - Simple Notification Service

    SNS (Simple Notification Service) provides a notification service based on Pub/Sub strategy. It's a way of publishing messages to one or more subscribers through endpoints. Is that confuse? Let's go deeper a little more about this theme. Usually the term Pub/Sub is related to event-driven architectures. In this architecture publishing messages can be done through notifications to one or more destinations already known, providing an asynchronous approach. For a destiny to become known, there must be a way to signal that the destination becomes a candidate to receive any message from the source. But how does subscriptions work? For SNS context, each subscriber could be associated to one or more SNS Topics. Thus, for each published message through Topic, one or more subscribers will receive them. We can compare when we receive push notifications from installed apps in our smartphones. It's the same idea, after an installed app we became a subscriber of that service. Thus, each interaction from that application could be done through notifications or published messages. The above example demonstrates a possible use case that SNS could be applied. For the next sections we'll discuss details for a better understanding. SNS basically provides two main characteristics, Publishers and Subscribers. Both of them work together providing resources through AWS console and APIs. 1. Topics/Publishers Topics are logical endpoints that works as an interface between Publisher and Subscriber. Basically Topics provides messages to the subscribers after published messages from the publisher. There are two types of Topics, Fifo and Standard: Fifo: Fifo type allows messages ordering (First in/First out), has a limit up to 300 published messages per second, prevent messages duplication and supports only SQS protocols as subscriber. Standard: It does not guarantee messages ordering rules and all of the supported delivery protocols can subscribe to a standard topic such as SQS, Lambda, HTTP, SMS, EMAIL and mobile apps endpoints. Limits AWS allows to create up to 100.000 topics per account. 2. Subscribers The subscription is a way to connect an endpoint to a specific Topic. Each subscription must have associated to a Topic to receive notifications from that Topic. Examples of endpoints: AWS SQS HTTP HTTPS AWS Kinesis Data Firehose E-mail SMS AWS Lambda These above endpoints are examples of delivery or transportation formats to receive notifications from a Topic through a subscription. Limit Subscriptions AWS allows up to 10 millions subscriptions per Topic. 3. Message size limit SNS messages can contain up to 256 KB of text data. Different of the SMS that can contain up to 140 bytes of text data. 4. Message types SNS supports different types of messages such as text, XML, JSON and more. 5. Differences between SNS and SQS Sometimes people get confused about the differences but don't worry I can explain these differences. SNS and SQS are different services but they can be associated. SQS is an AWS queue service that retains sent messages from the different clients and contexts, but at the same time they can works as a subscriber to a Topic. Thus, a SQS protocol subscribed to a Topic will start receiving notifications, becoming an asynchronous integration. Look at the image above, we're simulating a scenario that we have three SQS subscribed in three Topics. SQS 1 is a subscriber to the Topic 1 and 2. Thus, SQS 1 will receive notifications/messages from the both Topics, 1 and 2. 
SQS 2 is another subscriber to the Topic 2 and 3 that automatically also will receive messages from Topic 2 and 3. And for the last one case, we have SQS 3 as a subscriber to the Topic 3 that will receive messages only from Topic 3. For more details I recommend read this doc. Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS or Amazon Web Services is the most widely used cloud service in the world today, if you want to understand more about the subject to be well positioned in the market, I strongly recommend the study. Well that’s it, I hope you enjoyed it!

  • Working with Schemas in Spark Dataframes using PySpark

    What's a schema in the Dataframes context? Schemas are metadata that allows working with a standardized Data. Well, that was my definition about schemas but we also can understanding schemas as a structure that represents a data context or a business model. Spark enables using schemas with Dataframes and I believe that is a good point to keep data quality, reliability and we also can use these points to understand the data and connect to the business. But if you know a little more about Dataframes, working with schema isn't a rule. Spark provides features that we can infer to a schema without defined schemas and reach to the same result, but depending on the data source, the inference couldn't work as we expect. In this post we're going to create a simple Dataframe example that will read a CSV file without a schema and another one using a defined schema. Through examples we'll can see the advantages and disadvantages. Let's to the work! CSV File content "type","country","engines","first_flight","number_built" "Airbus A220","Canada",2,2013-03-02,179 "Airbus A320","France",2,1986-06-10,10066 "Airbus A330","France",2,1992-01-02,1521 "Boeing 737","USA",2,1967-08-03,10636 "Boeing 747","USA",4,1969-12-12,1562 "Boeing 767","USA",2,1981-03-22,1219 If you noticed in the content above, we have different data types. We have string, numeric and date column types. The content above will be represented by airliners.csv in the code. Writing a Dataframe without Schema from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .master("local[1]") \ .appName("schema-app") \ .getOrCreate() air_liners_df = spark.read \ .option("header", "true") \ .format("csv") \ .load("airliners.csv") air_liners_df.show() air_liners_df.printSchema() Dataframe/Print schema result It seems that worked fine but if you look with attention, you'll realize that in the schema structure there are some field types that don't match with their values, for example fields like number_built, engines and first_flight. They aren't string types, right? We can try to fix it adding the following parameter called "inferSchema" and setting up to "true". from pyspark.sql import SparkSession if __name__ == "__main__": spark = SparkSession.builder \ .master("local[1]") \ .appName("schema-app") \ .getOrCreate() air_liners_df = spark.read \ .option("header", "true") \ .option("inferSchema", "true") \ .format("csv") \ .load("airliners.csv") air_liners_df.show() air_liners_df.printSchema() Dataframe/Print schema result Even inferring the schema, the field first_flight keeping as a string type. Let's try to use Dataframe with a defined schema to see if this details will be fixed. Writing a Dataframe with Schema Now it's possible to see the differences between the codes. We're adding an object that represents the schema. This schema describes the content in CSV file, you can note that we have to describe the column name and type. 
from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StringType, IntegerType, DateType, StructField if __name__ == "__main__": spark = SparkSession.builder \ .master("local[1]") \ .appName("schema-app") \ .getOrCreate() StructSchema = StructType([ StructField("type", StringType()), StructField("country", StringType()), StructField("engines", IntegerType()), StructField("first_flight", DateType()), StructField("number_built", IntegerType()) ]) air_liners_df = spark.read \ .option("header", "true") \ .format("csv") \ .schema(StructSchema) \ .load("airliners.csv") air_liners_df.show() air_liners_df.printSchema() Dataframe/Print schema result After we defined the schema, all the field types match with their values. This shows how important is to use schemas with Dataframes. Now it's possible to manipulate the data according to the type with no concerns. Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and about the main Spark's feature. Reading this book you will understand about DataFrames, Spark SQL through practical examples. The author dives into Spark low-level APIs, RDDs and also about how Spark runs on a cluster and how to debug and monitor Spark clusters applications. The practical examples are in Scala and Python. Beginning Apache Spark 3: With Dataframe, Spark SQL, Structured Streaming, and Spark Machine Library with the new version of Spark, this book explores the main Spark's features like Dataframes usage, Spark SQL that you can uses SQL to manipulate data and Structured Streaming to process data in real time. This book contains practical examples and code snippets to facilitate the reading. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of Spark MLlib and Spark ML machine learning libraries usage and more. Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices. Learning Scala: Practical Functional Programming for the Jvm is an excellent book that covers Scala through examples and exercises. Reading this bool you will learn about the core data types, literals, values and variables. Building classes that compose one or more traits for full reusability, create new functionality by mixing them in at instantiation and more. Scala is one the main languages in Big Data projects around the world with a huge usage in big tech companies like Twitter and also the Spark's core language. Cool? I hope you enjoyed it!

  • Creating a Java code using Builder pattern

    If you're using a language that supports oriented object in your project, probably there's some lines of codes with Builder pattern. If not, this post will help you to understand about it. What's Builder Pattern? Builder Pattern belongs to an area in Software Engineer called Design Patterns, the idea behind of a pattern is to solve commons problems in your project following best practices. Builder Pattern is very useful when we need to provide a better solution in the creational objects part in our project. Sometimes we need to instantiate an object with a lot of parameters and this could be a problem if you pass a wrong parameter value. Things like this happen every time and results in bugs and you will need to find out where's the issue and maybe, refactoring code to improve it. Let's write some lines of code to see how does Builder Pattern works and when to apply it. The code below is an example of a traditional Class with constructor used to load values when the object instantiated. public class PersonalInfo { private final String firstName; private final String lastName; private final Date birthDate; private final String address; private final String city; private final String zipCode; private final String state; private final int population; public PersonalInfo(String firstName, String lastName, Date birthDate, String address, String city, String zipCode, String state, int population){ this.firstName = firstName; this.lastName = lastName; this.birthDate = birthDate; this.address = address; this.city = city; this.zipCode = zipCode; this.state = state; this.population = population; } } And now, we can instantiate the object simulating the client code. PersonalInfo personalInfo = new BuilderPattern("Mônica", "Avelar", new Date(), "23 Market Street", "San Francisco", "94016", "CA", 800000); If you notice, to instantiate the object we should pass all the values related to each property of our class and there's a big chance to pass a wrong value. Another disadvantage of this approach is the possibility to not scale it. In this example we have a few properties but tomorrow we can add more properties and the disadvantage becomes clearer. Working with Builder Pattern Let's rewrite the code above to the Builder Pattern and see the differences. 
public class PersonalInfo { private final String firstName; private final String lastName; private final Date birthDate; private final String address; private final String city; private final String zipCode; private final String state; private final int population; public static class Builder { private String firstName; private String lastName; private Date birthDate; private String address; private String city; private String zipCode; private String state; private int population; public Builder firstName(String value) { firstName = value; return this; } public Builder lastName(String value) { lastName = value; return this; } public Builder birthDate(Date value) { birthDate = value; return this; } public Builder address(String value) { address = value; return this; } public Builder city(String value) { city = value; return this; } public Builder zipCode(String value) { zipCode = value; return this; } public Builder state(String value) { state = value; return this; } public Builder population(int value) { population = value; return this; } public BuilderPattern build() { return new BuilderPattern(this); } } public PersonalInfo(Builder builder){ firstName = builder.firstName; lastName = builder.lastName; birthDate = builder.birthDate; address = builder.address; city = builder.city; zipCode = builder.zipCode; state = builder.state; population = builder.population; } } If you compare both codes you will conclude that the first one is smaller and better to understand than the second one and I agree it. The advantage of the usage is going to be clear for the next example when we create an object based on Builder pattern. Simulating client code using Builder Pattern PersonalInfo personalInfo = new Builder() .firstName("Mônica") .lastName("Avelar") .birthDate(new Date()) .address("23 Market Street") .city("San Francisco") .zipCode("94016") .state("CA") .population(80000) .build(); This last example of creation object using Builder Pattern turns an organized code following the best practices and easy to read. Another advantage of Builder is that we can identify each property before passing values. To be honest I've been using Builder Pattern in my projects and I strongly recommend you do it the same in your next projects. There's an easier way to implement Builder pattern in projects nowadays and I'll write a post about it, see you soon! Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software is a book that through Java examples shows you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based. Design Patterns com Java. Projeto Orientado a Objetos Guiado por Padrões (Portuguese version) is a book that shows the concepts and fundamentals of Design Patterns and how to apply for different contexts using Java language.

  • Creating AWS CloudWatch alarms

    The use of alarms is an essential requirement when working with various resources in the cloud. It is one of the most efficient ways to monitor and understand the behavior of an application if the metrics are different than expected. In this post, we're going to create an alarm from scratch using AWS CloudWatch based on specific scenario. There are several other tools that allow us to set up alarms, but when working with AWS, setting alarms using CloudWatch is very simple and fast. Use Case Scenario To a better understanding, suppose we create a resiliency mechanism in an architecture to prevent data losses. This mechanism always acts whenever something goes wrong, like components not working as expected sending failure messages to a SQS. CloudWatch allows us to set an alarm. Thus, when a message is sent to this queue, an alarm is triggered. First of all, we need to create a queue and sending messages just to generate some metrics that we're going to use in our alarm. That's a way to simulate a production environment. After queue and alarm creation, we'll send more message for the alarms tests. Creating a SQS Queue Let's create a simple SQS queue and choose some metrics that we can use in our alarm. Thus, access the AWS console and in the search bar, type "sqs" as shown in the image below and then access the service. After accessing the service, click Create queue Let's create a Standard queue for this example and name as sqs-messages. You don't need to pay attention to the other details, just click on the Create queue button to finish it. Queue has been created, now the next step we'll send a few messages just to generate metrics. Sending messages Let's send few messages to the previously created queue, feel free to change the message content if you want to. After sending these messages, automatically will generate some metrics according to the action. In this case, a metric called NumberOfMessagesSent was created on CloudWatch and we can use it to create the alarm. Creating an Alarm For our example, let's choose the metric based on number of messages sent (NumberOfMessagesSent). Access AWS via the console and search for CloudWatch in the search bar, as shown in the image below. After accessing the service, click on the In Alarms/In alarm option in the left corner of the screen and then click on the Create alarm button. Select metric according to the screen below Choose SQS Then click Queue Metrics Search for queue name and select the metric name column item labeled NumberOfMessagesSent, then click Select Metric. Setting metrics Metric name: is the metric chosen in the previous steps. This metric measures the number of messages sent to the SQS (NumberOfMessagesSent). QueueName: Name of the SQS in which the alarm will be configured. Statistic: In this field we can choose options such as Average, Sum, Minimum and more. This will depend on the context you will need to configure the alarm and the metric. For this example we choose Sum, because we want to get the sum of the number of messages sent in a given period. Period: In this field we define the period in which the alarm will be triggered if it reaches the limit condition, which will be defined in the next steps. Setting conditions Threshlod type: For this example we will use Static. Whenever NumberOfMessagesSent is...: Let's select the Greater option Than...: In this field we will configure the number of NumberOfMessagesSent as a condition to trigger the alarm. Let's put 5. 
Additional configuration For additional configuration, we have the datapoints field for the alarm in which I would like to detail its operation a little more. Datapoints to alarm This additional option makes the alarm configuration more flexible, combined with the previously defined conditions. By default, this setting is: 1 of 1 How it works? The first field refers to the number of points and the second one refers to the period. Keeping the previous settings combined to the additional settings means that the alarm will be triggered if the NumberOfMessagesSent metric is greater than the sum of 5 in a period of 5 minutes. Until then, the default additional configuration does not change the previously defined settings, nothing changes. Now, let's change this setting to understand better. Let's change from: 1 of 1 to 2 of 2. This tells us that when the alarm condition is met, i.e. for the NumberOfMessagesSent metric, the sum is greater than 5. Thus, the alarm will be triggered for 2 datapoints in 10 minutes. Note that the period was multiplied due to the second field with the value 2. Summarizing, even if the condition is met, the alarm will only be triggered if there are 2 datapoints above the threshold in a period of 10 minutes. We will understand even better when we carry out some alarm activation tests. Let's keep the following settings and click Next Configuring actions On the next screen, we're going to configure the actions responsible for notifying a destination if an alarm is triggered. On this screen, we're going to keep the In alarm setting and then creating a new topic and finally, we're going to add an email in which we want to receive error notifications. Select the option Create new topic and fill in a desired name and then enter a valid email in the field Email endpoints that will receive notification ... Once completed, click Create topic and then an email will be sent to confirm subscription to the created topic. Make sure you've received an email confirmation and click Next on the alarm screen to proceed with the creation. Now, we need to add the name of the alarm in the screen below and then click on Next. The next screen will be the review screen, click on Create alarm to finish it. Okay, now we have an alarm created and it's time to test it. Alarm Testing In the beginning we sent a few messages just to generate the NumberOfMessagesSent metric but at this point, we need to send more messages that will trigger the alarm. Thus, let's send more messages and see what's going to happen. After sending the messages, notice that even if the threshold has exceeded, the alarm was not triggered. This is due to the threshold just reached 1 datapoint within the 10 minute window. Now, let's send continuous messages that exceed the threshold in short periods within the 10 minute window. Note that in the image above the alarm was triggered because in addition to having reached the condition specified in the settings, it also reached the 2 data points. Check the email added in the notification settings, probably an email was sent with the alarm details The status alarm will set to OK when the messages not exceed the threshold anymore. Books to study and read If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. 
It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well that’s it, I hope you enjoyed it!
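As referenced in the alarm configuration steps above, here is a minimal sketch showing how the same alarm could be created programmatically with the AWS SDK for Java (v1), assuming a reasonably recent v1 version that supports the DatapointsToAlarm setting. The alarm name and the SNS topic ARN are hypothetical placeholders; the metric, statistic, period, threshold and 2-of-2 datapoints mirror the console settings from this post.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class CreateSqsAlarm {

    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        PutMetricAlarmRequest request = new PutMetricAlarmRequest()
            .withAlarmName("sqs-messages-alarm")            // hypothetical alarm name
            .withNamespace("AWS/SQS")
            .withMetricName("NumberOfMessagesSent")
            .withDimensions(new Dimension()
                .withName("QueueName")
                .withValue("sqs-messages"))
            .withStatistic(Statistic.Sum)                   // Sum of messages sent
            .withPeriod(300)                                // 5-minute period
            .withThreshold(5.0)                             // greater than 5
            .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
            .withEvaluationPeriods(2)                       // 2 of 2 datapoints
            .withDatapointsToAlarm(2)
            .withAlarmActions("arn:aws:sns:us-east-1:123456789012:alarm-topic"); // hypothetical SNS topic ARN

        cloudWatch.putMetricAlarm(request);
    }
}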

  • How to create S3 notification events using SQS via Terraform

    S3 (Simple Storage Service) makes it possible to notify, through events, when an action occurs within a Bucket or in a specific folder. In other words, it works as a listener: whenever an action takes place on a source, an event notification is sent to a destination. What would those actions be? Any action that takes place within an S3 Bucket, such as creating objects or folders, removing files, restoring files and more.

Destinations

For each event notification configuration, there must be a destination, to which information about each action will be sent. For example: a new file has been created in a specific folder, so information about the file is sent, such as the creation date, file size, event type, file name, and more. Keep in mind that in this process the content of the file itself is not sent. There are 3 types of destinations:
Lambda
SNS
SQS

Understanding how it works

In this post we are going to create an event notification setting in an S3 Bucket, simulate an action and understand the final behavior. We could create this setting via the console, but as a good practice we'll use Terraform as the IaC tool. For those who aren't very familiar with Terraform, follow this tutorial on Getting Started using Terraform on AWS. In the next step, we will create a flow simulating the image below: we'll configure the S3 Bucket so that, for every file created within the files/ folder, a notification event is sent to an SQS queue.

Creating Terraform files

Create a folder called terraform/ in your project; from now on, all .tf files will be created inside it. Now, create a file called vars.tf where we're going to store the variables that will be used, and paste the content below into this file.

variable "region" {
  default = "us-east-1"
  type    = string
}

variable "bucket" {
  type = string
}

Create a file called provider.tf, where we will add the provider settings, which will be AWS. This means Terraform will use AWS as the cloud to create the resources and will download the required plugins on startup. Copy the code below to the file.

provider "aws" {
  region = "${var.region}"
}

Create a file called s3.tf, where we'll add the settings for creating a new S3 Bucket that will be used for this tutorial.

resource "aws_s3_bucket" "s3_bucket_notification" {
  bucket = "${var.bucket}"
}

Now, create a file called sqs.tf, where we'll add the settings for creating an SQS queue and some permissions according to the code below:

resource "aws_sqs_queue" "s3-notifications-sqs" {
  name   = "s3-notifications-sqs"
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:*:*:s3-notifications-sqs",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "${aws_s3_bucket.s3_bucket_notification.arn}"
        }
      }
    }
  ]
}
POLICY
}

Understanding the code above

In the code above, we're creating an SQS queue and adding some policy settings. In more detail:
The SQS name will be s3-notifications-sqs, the value set in the name field.
In the policy field, we define a policy that allows S3 to send notification messages to SQS. Notice that we're referencing the S3 Bucket via ARN in the snippet ${aws_s3_bucket.s3_bucket_notification.arn}.
For the last file, let's create the settings that allow sending event notifications from the S3 Bucket to the SQS queue.
Therefore, create the s3_notification.tf file and add the code below:

resource "aws_s3_bucket_notification" "s3_notification" {
  bucket = aws_s3_bucket.s3_bucket_notification.id

  queue {
    events        = ["s3:ObjectCreated:*"]
    queue_arn     = aws_sqs_queue.s3-notifications-sqs.arn
    filter_prefix = "files/"
  }
}

Understanding the code above

In the code above, we are creating a resource called aws_s3_bucket_notification, which is responsible for enabling notifications from an S3 Bucket. In the bucket field, we refer to the S3 Bucket defined in the s3.tf file. The queue block contains some settings such as:
events: the event type of the notification. In this case, ObjectCreated events: notifications will be sent only for created objects; for deleted objects there will be no notifications. It helps to restrict the notification to certain types of events.
queue_arn: refers to the SQS queue defined in the sqs.tf file.
filter_prefix: this field defines the folder for which we want notifications to be triggered. In the code, we set the files/ folder as the trigger location when files are created.
Summarizing: for every file created within the files/ folder, a notification will be sent to the SQS queue defined in the queue_arn field.

Running Terraform Init

terraform init

Running Plan

The plan command makes it possible to verify which resources will be created; in this case, it is necessary to pass the value of the bucket variable used to create the Bucket in S3.

terraform plan -var 'bucket=type the bucket name'

Running the Apply command

In this step, the creation of the resources will be applied. Remember to pass the name of the bucket you want to create in the bucket variable; the bucket name must be unique.

terraform apply -var 'bucket=type the bucket name'

Simulating an event notification

After running the previous steps and creating the resources, we will manually upload a file to the files/ folder of the bucket that was created. Via the console, access the Bucket created in S3 and create a folder called files. Inside it, upload any file. After uploading the file to the files/ folder, access the created SQS queue; you'll see some available messages. Usually 3 messages will be available in the queue: after creating the S3 event notification settings, a test message is sent; the second one happens when we create the folder; and the last is the message related to the file upload. (A small SDK sketch for reading these messages programmatically appears at the end of this post.) That's it, we have an event notification created!

References:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_notification
https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html
Coffee and tips Github repository: https://github.com/coffeeandtips/s3-bucket-notification-with-sqs-and-terraform

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools such as Ansible and CloudFormation (AWS's native IaC) and, especially, shows how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a complete engineer, I strongly recommend learning about it.
AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well that’s it, I hope you enjoyed it!
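Following up on the simulation step above, here is a minimal sketch that polls the s3-notifications-sqs queue created by the Terraform code and prints the S3 event notifications it receives. It assumes the AWS SDK for Java (v1), the aws-java-sdk-sqs dependency, and credentials/region already configured in your environment.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class ReadS3Notifications {

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Queue created by the Terraform code in this post
        String queueUrl = sqs.getQueueUrl("s3-notifications-sqs").getQueueUrl();

        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
            .withMaxNumberOfMessages(10)   // read up to 10 messages per call
            .withWaitTimeSeconds(10);      // long polling

        for (Message message : sqs.receiveMessage(request).getMessages()) {
            // The body is the raw S3 event notification (JSON)
            System.out.println(message.getBody());

            // Remove the message from the queue after reading it
            sqs.deleteMessage(queueUrl, message.getReceiptHandle());
        }
    }
}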

  • Understanding AWS S3 in 1 minute

    Amazon Web Services (AWS) is one of the most traditional and resourceful Cloud Computing services on the market, and one of its most used resources is S3. What's AWS S3? AWS S3 is an abbreviation for Simple Storage Service. It is a resource managed by AWS itself, that is, we do not need to worry about managing the infrastructure. It provides an object repository that allows you to store objects of different volumes, keep backups, host static websites, perform data analysis, manage Data Lakes, etc. It also provides several integrations with other AWS services that use it as a base repository.

S3 Structure

AWS S3 is divided into different Buckets containing a folder structure based on customer needs. What's a Bucket? An S3 Bucket is a kind of container where all objects will be stored and organized. A very important thing is that the Bucket name must be unique.

Architecture

In the image below we have a simpler diagram of the S3 architecture with buckets and folder directories.

Integrations

The AWS ecosystem makes it possible to integrate most of its own and third-party tools, and S3 is one of the most integrated resources. Some examples of integrated services:
Athena
Glue
Kinesis Firehose
Lambda
Delta Lake
RDS
Others

SDK

AWS provides SDKs for different programming languages that make it possible to manipulate objects in S3, such as creating Buckets and folders, uploading and downloading files, and much more (see the sketch at the end of this post).

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well that’s it, I hope you enjoyed it!
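As a quick illustration of the SDK operations mentioned above, here is a minimal sketch using the AWS SDK for Java (v1). The bucket name is a hypothetical placeholder (remember it must be globally unique), and credentials/region are assumed to be configured in your environment.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3QuickStart {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        String bucket = "coffee-and-tips-sample-bucket"; // hypothetical, must be globally unique

        // Create the bucket
        s3.createBucket(bucket);

        // Upload a small text object into a "folder" (key prefix)
        s3.putObject(bucket, "files/hello.txt", "Hello, S3!");

        // Download the object content as a String and print it
        String content = s3.getObjectAsString(bucket, "files/hello.txt");
        System.out.println(content);
    }
}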

  • Listing AWS Glue tables

    Using an AWS SDK is always a good option if you need to explore some feature further in search of a solution. In this post, we're going to explore a bit of AWS Glue using the SDK and Java. Glue is an AWS ETL tool that provides a central repository of metadata, called the Glue Catalog. In short, the Glue Catalog keeps the entire structure of databases and tables and their schemas in a single place. The idea of this post is to programmatically list all the tables of a given database in the Glue Catalog using the SDK and Java.

Maven dependencies

In this example, we're using Java 8 to better explore the use of Streams in the iteration.

Understanding the code

The awsGlue object is responsible for accessing the resource through the credentials that must be configured; in this post we will not go into this detail. The getTablesRequest object is responsible for setting the request parameters, in this case the database. The getTablesResult object is responsible for listing the tables based on the parameters set by the getTablesRequest object and also for controlling the result flow. Note that in addition to returning the tables through the getTablesResult.getTableList() method, this same object returns a token, explained further in the next item. The token is represented by the getTablesResult.getNextToken() method; the idea of the token is to control the flow of results, as all results are paged, and if a token comes back with a result, it means that there is still data to be returned. In the code, we used a repetition structure based on checking whether the token exists. So, if there is still a token, it will be set in the getTablesRequest object through the code getTablesRequest.setNextToken(token), to return more results. It's a way to paginate results (a sketch reconstructing this flow appears at the end of this post).

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Setup recommendations

If you're interested in knowing the setup I use to develop my tutorials, here it is:
Notebook Dell Inspiron 15 15.6
Monitor LG Ultrawide 29WL500-29
Well that’s it, I hope you enjoyed it!
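Since the post describes the code rather than listing it, here is a minimal sketch reconstructing the flow above, assuming the AWS SDK for Java v1 and a hypothetical database name my_database; credentials and region are assumed to be configured as mentioned.

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.GetTablesRequest;
import com.amazonaws.services.glue.model.GetTablesResult;
import com.amazonaws.services.glue.model.Table;

public class ListGlueTables {

    public static void main(String[] args) {
        // awsGlue accesses the Glue Catalog using the configured credentials
        AWSGlue awsGlue = AWSGlueClientBuilder.defaultClient();

        // getTablesRequest sets the request parameters, in this case the database
        GetTablesRequest getTablesRequest = new GetTablesRequest()
            .withDatabaseName("my_database"); // hypothetical database name

        String token = null;
        do {
            // Set the token returned by the previous page (null on the first call)
            getTablesRequest.setNextToken(token);

            // getTablesResult lists the tables and controls the result flow
            GetTablesResult getTablesResult = awsGlue.getTables(getTablesRequest);

            // Print each table name using Java 8 Streams
            getTablesResult.getTableList().stream()
                .map(Table::getName)
                .forEach(System.out::println);

            // While a token is returned, there is still data to be fetched
            token = getTablesResult.getNextToken();
        } while (token != null);
    }
}

If you manage dependencies with Maven, the com.amazonaws:aws-java-sdk-glue artifact (a recent 1.x version) should be enough for this sketch.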
