
  • Creating Java code using the Builder pattern

    If you're using an object-oriented language in your project, there are probably a few lines of code using the Builder pattern. If not, this post will help you understand it. What's the Builder pattern? The Builder pattern belongs to an area of software engineering called Design Patterns; the idea behind a pattern is to solve common problems in your project by following best practices. The Builder pattern is very useful when we need a better solution for the object-creation part of our project. Sometimes we need to instantiate an object with a lot of parameters, and that can become a problem if you pass a wrong parameter value. Things like this happen all the time, resulting in bugs, and you will need to find out where the issue is and maybe refactor the code to improve it. Let's write some lines of code to see how the Builder pattern works and when to apply it.

    The code below is an example of a traditional class with a constructor used to load values when the object is instantiated.

public class PersonalInfo {

    private final String firstName;
    private final String lastName;
    private final Date birthDate;
    private final String address;
    private final String city;
    private final String zipCode;
    private final String state;
    private final int population;

    public PersonalInfo(String firstName, String lastName, Date birthDate, String address,
                        String city, String zipCode, String state, int population) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthDate = birthDate;
        this.address = address;
        this.city = city;
        this.zipCode = zipCode;
        this.state = state;
        this.population = population;
    }
}

    And now we can instantiate the object, simulating the client code.

PersonalInfo personalInfo = new PersonalInfo("Mônica", "Avelar", new Date(), "23 Market Street",
        "San Francisco", "94016", "CA", 800000);

    Notice that to instantiate the object we have to pass all the values related to each property of our class, and there's a big chance of passing a wrong value. Another disadvantage of this approach is that it doesn't scale well: in this example we have just a few properties, but tomorrow we may add more and the drawback becomes even clearer.

    Working with the Builder pattern

    Let's rewrite the code above using the Builder pattern and see the differences.
public class PersonalInfo {

    private final String firstName;
    private final String lastName;
    private final Date birthDate;
    private final String address;
    private final String city;
    private final String zipCode;
    private final String state;
    private final int population;

    public static class Builder {

        private String firstName;
        private String lastName;
        private Date birthDate;
        private String address;
        private String city;
        private String zipCode;
        private String state;
        private int population;

        public Builder firstName(String value) { firstName = value; return this; }
        public Builder lastName(String value) { lastName = value; return this; }
        public Builder birthDate(Date value) { birthDate = value; return this; }
        public Builder address(String value) { address = value; return this; }
        public Builder city(String value) { city = value; return this; }
        public Builder zipCode(String value) { zipCode = value; return this; }
        public Builder state(String value) { state = value; return this; }
        public Builder population(int value) { population = value; return this; }

        public PersonalInfo build() { return new PersonalInfo(this); }
    }

    public PersonalInfo(Builder builder) {
        firstName = builder.firstName;
        lastName = builder.lastName;
        birthDate = builder.birthDate;
        address = builder.address;
        city = builder.city;
        zipCode = builder.zipCode;
        state = builder.state;
        population = builder.population;
    }
}

    If you compare both versions, you will conclude that the first one is shorter and easier to understand than the second one, and I agree. The advantage becomes clear in the next example, when we create an object using the Builder pattern.

    Simulating client code using the Builder pattern

PersonalInfo personalInfo = new PersonalInfo.Builder()
        .firstName("Mônica")
        .lastName("Avelar")
        .birthDate(new Date())
        .address("23 Market Street")
        .city("San Francisco")
        .zipCode("94016")
        .state("CA")
        .population(80000)
        .build();

    This last example of object creation using the Builder pattern results in organized code that follows best practices and is easy to read. Another advantage of the Builder is that we can identify each property before passing its value. To be honest, I've been using the Builder pattern in my projects and I strongly recommend you do the same in your next projects. There's an easier way to implement the Builder pattern nowadays and I'll write a post about it, see you soon!

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software is a book that, through Java examples, shows you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based. Design Patterns com Java. Projeto Orientado a Objetos Guiado por Padrões (Portuguese version) is a book that shows the concepts and fundamentals of Design Patterns and how to apply them to different contexts using the Java language.

  • Running Spring Boot with ActiveMQ

    Before talking about ActiveMQ, we have to think about common problems in applications that need to scale and better integrate their services. The flow of information transmitted today is infinitely greater than it was 10 years ago, and it is almost impossible to measure the scalability that an application must support.

    Use case

    To understand it better, let's imagine that you were hired to design an architecture for an e-commerce site that will sell tickets for NFL games. As always, you have little time to think about the architecture. The first idea is simple and quick, and the result is the drawing below. Thinking about the number of accesses and requests per second, do you think it is a resilient architecture? Does the database scale? Does the database support concurrent access? And if the database goes down for some reason, will the purchase be lost? We can improve this architecture a little more, making it a bit more professional and resilient. Let's go.

    Let's understand this last drawing. Now, when a purchase order is placed, the orders are sent to a message server (broker). The broker is basically a service capable of holding messages, usually plain text or text in JSON format. In this drawing, we can say that the queue holds customer data, the number of tickets, values, and so on. Finally, there is an application that manages all the orders/purchases. This application reads/removes messages from the broker and can perform validations before writing to the database.

    Now, let's assume that one of the requirements is that for each sale the customer must receive an invoice for the ticket. Since the new architecture is well decoupled, it's easier to "plug in" a new application to do this job. So you came up with a new design, as follows: now the application that manages the purchases, in addition to recording the sale after retrieving the messages from Queue 1, also sends a message to Queue 2, which will hold the customers' invoices. A new application that manages invoices retrieves these messages and records them in a database specific to the finance area.

    But what are the benefits of this new architecture? The architecture is more resilient, asynchronous and fault-tolerant. If one of the applications fails for some reason, the message returns to the queue until the application is reestablished. And finally, it makes it easier to integrate new applications.

    What about ActiveMQ? What does it have to do with all this? ActiveMQ is the service that provides the messaging server. In the design, it would be the message servers (brokers). To understand it even better, let's create a practical example of how to configure and use ActiveMQ with Spring Boot and JMS.

    Creating the project

    To create this Spring Boot project we're going to use Spring Initializr to generate our project faster. Therefore, access https://start.spring.io to create it and choose the dependencies. Fill in the fields and select the 2 dependencies (ActiveMQ 5 and Spring Web) as shown in the image. Generate the file and import it into your project.

    Pom file

    Following is the pom.xml that was created by Spring Initializr.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.4.2</version>
    </parent>
    <groupId>com.spring.active.mq</groupId>
    <artifactId>spring-boot-active-mq</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>spring-boot-active-mq</name>
    <description>Demo project for Spring Boot</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-activemq</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

    Installing ActiveMQ

    We'll download ActiveMQ to make the process more transparent; there is also the possibility of using the version embedded in Spring Boot, but this time we'll present it in the more traditional way. For this example, we're going to use the classic version of ActiveMQ. Download ActiveMQ here: https://activemq.apache.org/components/classic/download/ Steps to install it here: https://activemq.apache.org/getting-started After installation, start the server according to the documentation.

    application.properties file

    In the Spring Boot application you created, fill in the application.properties file:

spring.activemq.broker-url=tcp://127.0.0.1:61616
spring.activemq.user=admin
spring.activemq.password=admin

    The first line sets the message server's URL. The second and subsequent lines are the authentication data.

    Ticket class

public class Ticket {

    private String name;
    private Double price;
    private int quantity;

    public Ticket() {}

    public Ticket(String name, Double price, int quantity) {
        this.name = name;
        this.price = price;
        this.quantity = quantity;
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Double getPrice() { return price; }
    public void setPrice(Double price) { this.price = price; }
    public int getQuantity() { return quantity; }
    public void setQuantity(int quantity) { this.quantity = quantity; }

    @Override
    public String toString() {
        return String.format("Compra de ingresso -> " +
                "Name=%s, Price=%s, Quantity=%s}", getName(), getPrice(), getQuantity());
    }
}

    In the SpringBootActiveMqApplication class previously created by the generator, make the following change.

@SpringBootApplication
@EnableJms
public class SpringBootActiveMqApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringBootActiveMqApplication.class, args);
    }

    @Bean
    public JmsListenerContainerFactory<?> defaultFactory(ConnectionFactory connectionFactory,
            DefaultJmsListenerContainerFactoryConfigurer configurer) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        configurer.configure(factory, connectionFactory);
        return factory;
    }

    @Bean
    public MessageConverter jacksonJmsMessageConverter() {
        MappingJackson2MessageConverter converter = new MappingJackson2MessageConverter();
        converter.setTargetType(MessageType.TEXT);
        converter.setTypeIdPropertyName("_type");
        return converter;
    }
}

    The @EnableJms annotation is the mechanism responsible for enabling JMS. The defaultFactory method configures and registers the factory that connects to the queues using JMS. Finally, the jacksonJmsMessageConverter method converts the JSON messages to the type that will be passed to the JmsTemplate, which we will see soon. All of these methods use the @Bean annotation; methods annotated with @Bean are managed by the Spring container.

    TicketController class

    In the TicketController class, we create a method called buyTicket that is responsible for sending messages to the queue called compra_queue (purchase_queue) through a POST request.
    In this method we're using a JmsTemplate object, which allows objects to be converted and sent to the queue using JMS.

package com.spring.active.mq.springbootactivemq.Controller;

import com.spring.active.mq.springbootactivemq.pojo.Ticket;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.MediaType;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TicketController {

    @Autowired
    private JmsTemplate jmsTemplate;

    @PostMapping(value = "/buy", consumes = MediaType.APPLICATION_JSON_VALUE)
    public void buyTicket(@RequestBody Ticket ticket) {
        jmsTemplate.convertAndSend("compra_queue",
                new Ticket(ticket.getName(), ticket.getPrice(), ticket.getQuantity()));
    }
}

    EventListener class

    The EventListener class is a sort of "listener". The @JmsListener annotation defines this listener characteristic, and in this same annotation it is possible to configure the name of the queue that will be "listened to" by the method. In short, all messages sent to the compra_queue (purchase_queue) queue will be received by this method.

package com.spring.active.mq.springbootactivemq.listener;

import com.spring.active.mq.springbootactivemq.pojo.Ticket;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class EventListener {

    @JmsListener(destination = "compra_queue", containerFactory = "defaultFactory")
    public void receiveMessage(Ticket ticket) {
        System.out.println("Mensagem da fila:" + ticket);
    }
}

    Accessing the Broker service - ActiveMQ

    After starting the service according to the documentation, access the service console through a browser at http://127.0.0.1:8161/

    Creating the queue

    To create the queue, click on the Queues option in the top red bar, as shown in the image below. In the Queue Name field, type the queue's name as shown in the image above and click the Create button. That's it, the queue has been created!

    Starting the application

    From the terminal, access your project directory and run the Maven command below, or launch the application from your IDE.

mvn spring-boot:run

    Sending messages

    We will use Postman to send messages; if you don't have Postman installed, download it from this link: https://www.postman.com/downloads/ After installation, open Postman and fill in the fields as shown in the image below.

    Json content

{"name":"Joao","price":2.0,"quantity":4}

    After clicking the Send button, check the application's console and you will be able to see the message that was sent and transmitted through the queue. Access the ActiveMQ console again and you will see the log of the message that was sent to the queue. The Number Of Consumers column is the number of consumers of the queue, which in this case is just 1. The Messages Enqueued column shows the number of messages that were sent and, finally, the Messages Dequeued column is the number of messages that were removed from the queue. Here I have a Spring Boot with ActiveMQ project repository: https://github.com/jpjavagit/jms-active-mq. It's worth checking out!
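    If you prefer the command line over Postman, the same request can be sent with a few lines of Python. This is just a minimal sketch, assuming the application is running locally on port 8080 and exposes the /buy endpoint shown above; the requests library is an extra dependency and is not part of the original post.

import requests  # third-party HTTP client (pip install requests)

# Payload matching the Ticket class fields (name, price, quantity)
ticket = {"name": "Joao", "price": 2.0, "quantity": 4}

# POST to the /buy endpoint exposed by TicketController
response = requests.post("http://localhost:8080/buy", json=ticket, timeout=10)

# The endpoint returns no body; a 200 status means the message went to compra_queue
print(response.status_code)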
    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Spring Microservices in Action is a book that covers the principles of microservices using Spring, Spring Boot applications using Spring Cloud, resiliency, how to deploy, and real-life examples of good development practices. Spring MVC Beginner's Guide: Beginner's Guide is a book covering fundamental Spring concepts such as architecture, request flows, Bean validation, how to handle exception flows, using REST and Ajax, testing and much more. This book is an excellent choice for anyone wanting to learn more about the fundamentals of Spring. Spring is a Java framework containing different projects, Spring MVC being one of them. By acquiring a good Spring MVC foundation, you will be able to tackle challenges using any Spring Framework project. Learn Microservices with Spring Boot: A Practical Approach to RESTful Services using RabbitMQ, Eureka, Ribbon, Zuul and Cucumber is a book that covers the main features of the Spring ecosystem using Spring Boot, such as creating microservices, event-based architecture, using RabbitMQ as a messaging feature, creating RESTful services and much more. This book is an excellent choice for anyone who wants to learn more about Spring Boot and its features. Well that's it, I hope you enjoyed it!

  • Generating a Maven project without an IDE in 2 minutes

    What's Maven?

    It's common to hear about Maven, especially in Java projects, but don't confuse Maven with Java, okay? Let me explain what Maven is and its use case. Maven is a popular build automation tool primarily used for Java projects. It provides a structured way to manage project dependencies, build processes, and releases. Maven uses a declarative approach to project management, where you define your project's specifications and dependencies in an XML file called pom.xml (Project Object Model). Maven helps simplify the build process by managing the dependencies of your project, downloading the required libraries from repositories, and providing a standardized way to build and package your application. It can also generate project documentation, run tests, and perform other tasks related to building and managing Java projects. To summarize, Maven provides a powerful toolset for building, managing, and releasing Java applications, and it is widely used in the Java development community.

    Generating a Maven project without an IDE

    Usually engineers generate a Maven project through an IDE, but there are easier ways to do the same without IDE support. If you haven't installed Maven yet, I recommend installing it before we start. You can download Maven here and, after downloading it, follow the installation steps here. First of all, to be sure you've installed Maven, open the terminal and run the command below:

mvn -version

    A message similar to the one below will be displayed on the terminal. Now, let's get started generating your Maven project.

    Step 1: Open the terminal again and run the command below.

mvn archetype:generate -DgroupId=com.coffeeantips.maven.app -DartifactId=coffeeantips-maven-app -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.0 -DinteractiveMode=false

    Step 2: After running the command above, a folder called coffeeantips-maven-app/ is created. Change into this directory and you'll see the following structure of folders and files.

    Understanding the command parameters

    archetype:generate: generates a new project from an archetype or updates the current project.
    -DgroupId: specifies the package under which the project's folders and files will be generated.
    -DartifactId: the project's or artifact's name.
    -DarchetypeArtifactId: Maven provides a list of archetypes, which you can check here. For this example, we're using an archetype that generates a sample Maven project.
    -DarchetypeVersion: the archetype's version.
    -DinteractiveMode: defines whether Maven will interact with the user asking for inputs.

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Maven: The Definitive Guide, written by Maven creator Jason Van Zyl and his team at Sonatype, clearly explains how this tool can bring order to your software development projects. In this book you'll learn about: the POM and project relationships, the build lifecycle, plugins, project website generation, advanced site generation, reporting, properties, build profiles, the Maven repository and more. Well that's it, I hope you enjoyed it!

  • How to generate random data using the Datafaker lib

    Sometimes in our projects we have to fill Java objects for unit tests, or even create a database dump with random data to test a specific feature. We need to be creative, trying to come up with names, street names, cities or documents. There's an interesting and helpful Java library called Datafaker that allows creating random data through a large number of providers. Providers are objects based on a context. For example, if you want to generate data about a Person object, there's a specific provider for this context that will generate first names, last names and so on. If you need to create a unit test that requires address data, you'll find a provider for that as well. In this post we'll create some examples using Maven, but the library also provides support for Gradle projects.

    Maven

<dependency>
    <groupId>net.datafaker</groupId>
    <artifactId>datafaker</artifactId>
    <version>1.1.0</version>
</dependency>

    Generating random data

    Let's create a simple Java class that contains some properties like name, last name, address, favorite music genre and food.

public class RandomPerson {

    public String firstName;
    public String lastName;
    public String favoriteMusicGenre;
    public String favoriteFood;
    public String streetAddress;
    public String city;
    public String country;

    @Override
    public String toString() {
        return "firstName=" + firstName + "\n" +
                "lastName=" + lastName + "\n" +
                "favoriteMusicGenre=" + favoriteMusicGenre + "\n" +
                "favoriteFood=" + favoriteFood + "\n" +
                "streetAddress=" + streetAddress + "\n" +
                "city=" + city + "\n" +
                "country=" + country;
    }

    static void print(RandomPerson randomPerson) {
        System.out.println(randomPerson);
    }
}

    In the next step we'll fill an object using the providers mentioned in the first section. First of all, we create an object called randomData of the Faker class. This class contains all the providers used in the example below.

public static void main(String[] args) {
    Faker randomData = new Faker();

    RandomPerson randomPerson = new RandomPerson();
    randomPerson.firstName = randomData.name().firstName();
    randomPerson.lastName = randomData.name().lastName();
    randomPerson.favoriteMusicGenre = randomData.music().genre();
    randomPerson.favoriteFood = randomData.food().dish();
    randomPerson.streetAddress = randomData.address().streetAddress();
    randomPerson.city = randomData.address().city();
    randomPerson.country = randomData.address().country();

    print(randomPerson);
}

    After the execution, we can see a result like this in the console:

    Result

firstName=Dorthy
lastName=Jones
favoriteMusicGenre=Electronic
favoriteFood=Cauliflower Penne
streetAddress=7411 Darin Gateway
city=Gutkowskifort
country=Greece

    Every execution produces a new result, because the providers are random. Another interesting feature is that we can set the Locale when instantiating the object.

Faker randomData = new Faker(Locale.JAPANESE);

    See the results based on Locale.JAPANESE:

    Result

firstName=航
lastName=横山
favoriteMusicGenre=Non Music
favoriteFood=French Fries with Sausages
streetAddress=418 美桜Square
city=南斉藤区
country=Togo

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Unit Testing Principles, Practices, and Patterns: Effective Testing Styles, Patterns, and Reliable Automation for Unit Testing, Mocking, and Integration Testing with Examples in C# is a book that covers unit testing principles, patterns and practices, and teaches you to design and write tests that target key areas of your code, including the domain model.
    In this clearly written guide, you learn to develop professional-quality tests and test suites and integrate testing throughout the application life cycle. Mastering Unit Testing Using Mockito and JUnit is a book that covers JUnit practices using one of the most famous testing libraries, Mockito. This book teaches how to create and maintain automated unit tests using advanced features of JUnit with the Mockito framework, continuous integration practices (the famous CI) using market tools like Jenkins, along with Maven, one of the most widely used dependency managers in Java projects. For those who are starting out in this world, it is an excellent choice. Isn't it a cool library? See you!

  • Working with Schemas in Spark Dataframes using PySpark

    What's a schema in the Dataframe context? Schemas are metadata that allow working with standardized data. Well, that's my definition of schemas, but we can also understand a schema as a structure that represents a data context or a business model. Spark allows using schemas with Dataframes, and I believe that is a good way to keep data quality and reliability; we can also use these points to understand the data and connect it to the business. But if you know a little more about Dataframes, you know that working with a schema isn't a rule. Spark provides features to infer a schema without one being defined and reach the same result, but depending on the data source the inference may not work as we expect. In this post we're going to create a simple Dataframe example that reads a CSV file without a schema, and another one using a defined schema. Through the examples we'll see the advantages and disadvantages. Let's get to work!

    CSV file content

"type","country","engines","first_flight","number_built"
"Airbus A220","Canada",2,2013-03-02,179
"Airbus A320","France",2,1986-06-10,10066
"Airbus A330","France",2,1992-01-02,1521
"Boeing 737","USA",2,1967-08-03,10636
"Boeing 747","USA",4,1969-12-12,1562
"Boeing 767","USA",2,1981-03-22,1219

    As you can notice in the content above, we have different data types: string, numeric and date columns. The content above is represented by airliners.csv in the code.

    Writing a Dataframe without a schema

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

    Dataframe/Print schema result

    It seems to have worked fine, but if you look closely you'll realize that in the schema structure there are some field types that don't match their values, for example the number_built, engines and first_flight fields. They aren't string types, right? We can try to fix it by adding the parameter "inferSchema" set to "true".

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

    Dataframe/Print schema result

    Even when inferring the schema, the first_flight field is kept as a string type. Let's try a Dataframe with a defined schema to see if this detail gets fixed.

    Writing a Dataframe with a schema

    Now it's possible to see the differences between the two approaches. We're adding an object that represents the schema. This schema describes the content of the CSV file; note that we have to describe each column's name and type.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType, DateType, StructField

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    StructSchema = StructType([
        StructField("type", StringType()),
        StructField("country", StringType()),
        StructField("engines", IntegerType()),
        StructField("first_flight", DateType()),
        StructField("number_built", IntegerType())
    ])

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .schema(StructSchema) \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

    Dataframe/Print schema result

    After defining the schema, all the field types match their values. This shows how important it is to use schemas with Dataframes. Now it's possible to manipulate the data according to its type with no concerns. (A shorter, DDL-string way of declaring the same schema is sketched at the end of this post.)

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also into how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python. Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as Dataframe usage, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming to process data in real time. This book contains practical examples and code snippets to make the reading easier. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more. Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices. Learning Scala: Practical Functional Programming for the JVM is an excellent book that covers Scala through examples and exercises. Reading this book you will learn about the core data types, literals, values and variables; building classes that compose one or more traits for full reusability; creating new functionality by mixing them in at instantiation; and more. Scala is one of the main languages in Big Data projects around the world, with huge usage in big tech companies like Twitter, and it is also Spark's core language. Cool? I hope you enjoyed it!
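    As a quick complement to the example above, Spark also accepts a schema written as a DDL string, which can be handy for simple cases. This is only a minimal sketch under the same assumptions as the post (a local airliners.csv file with a header row); it's an alternative way of declaring the same schema, not part of the original example.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("schema-app-ddl") \
    .getOrCreate()

# The same schema as the StructType example, expressed as a DDL string
ddl_schema = "type STRING, country STRING, engines INT, first_flight DATE, number_built INT"

air_liners_df = spark.read \
    .option("header", "true") \
    .schema(ddl_schema) \
    .csv("airliners.csv")

air_liners_df.printSchema()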

  • How to save costs on S3 running a Data Lake

    Cloud services provide useful resources to scale your business faster, but we can't always measure cloud costs when we're starting a business from scratch; and even for a solid business, costs are always part of the strategy of any company that wants to provide a better service. My teammates and I have worked on an event-based data platform able to process 350 million events every day. We provide data to client applications and to business teams so they can make decisions, and it is always a challenge to deal with the massive data traffic and, at the same time, to maintain this data while saving money on storage. Storage is expensive, and there are some strategies to save money. In this post I'll describe some strategies that we've adopted to save costs on S3 (Simple Storage Service), and I hope it helps.

    Strategies

    Strategy #1: Amazon S3 storage classes

    Amazon S3 provides a way to manage files through lifecycle settings; there you can define rules to move files to different storage classes depending on the file's age and access frequency. This strategy can save your company a lot of money. Working with storage classes enables us to save costs. By default, data is stored in the S3 Standard storage class. This storage type has some benefits of storage and data access, but we realized that after the data was transformed into the Silver layer, the data in the Bronze layer wasn't accessed very often, and it was totally possible to move it to a cheaper storage class. We decided to move it, using lifecycle settings, to the S3 Intelligent-Tiering storage class. This storage class was a perfect fit for our context, because we could save costs on storage and, even if we needed to access these files for some reason, we would still keep a fair cost. (A small code sketch of this kind of lifecycle rule appears at the end of this post.) We're working towards a better scenario in which we could also set a lifecycle rule in the Silver layer to move files that haven't been accessed for a period of time to a cheaper storage class, but at the moment we need to access historical files with high frequency. If you check the AWS documentation you'll notice that there are even cheaper storage classes, but you and your team should analyze each case, because the cheaper it is to store data, the more expensive it is to access it. So be careful: try to understand the storage and data access patterns of your Data Lake architecture before choosing a storage class that could better fit your business.

    Strategy #2: Partitioning data

    Apache Spark is the most famous framework for processing large amounts of data and has been adopted by data teams around the world. During data transformation with Spark, you can set a Dataframe to partition the data by a specific column. This approach is very useful for performing SQL queries more efficiently. Note that partitioning has no direct relation to S3, but its usage avoids full scans of S3 objects. A full scan means that, for a SQL query, the SQL engine can load gigabytes or even terabytes of data. This can be very expensive for your company, because you can easily be charged depending on the amount of loaded data. So partitioning data plays an important role when we need to save costs.

    Strategy #3: Delta Lake vacuum

    Delta Lake has an interesting feature called vacuum, which is a mechanism to remove unused files from disk. Teams usually adopt this strategy after restoring versions, when some files remain in storage but are no longer managed by Delta Lake. For example, in the image below we have 5 versions of Delta tables and their partitions.
    Suppose that we need to restore to version 1 because we found some inconsistent data after that version. After the restore command, Delta will point to version 1 as the current version, but the parquet files related to the other versions will remain there unused. We can remove these parquet files by running the vacuum command, as shown below. Note that the parquet files related to the versions after 1 were removed, releasing space in the storage. For more details I strongly recommend checking the Delta Lake documentation.

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject to be well positioned in the market, I strongly recommend studying it. Well that's it, I hope you enjoyed it!
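    To make Strategy #1 a bit more concrete, here's a minimal boto3 (Python) sketch of a lifecycle rule that transitions objects to S3 Intelligent-Tiering. The bucket name, the bronze/ prefix and the 30-day threshold are hypothetical examples, not values from the post; in practice this kind of rule can also be managed through the console or an IaC tool.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix representing the Bronze layer of the Data Lake
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-to-intelligent-tiering",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                # Move objects to S3 Intelligent-Tiering 30 days after creation
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)

print("Lifecycle rule applied")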

  • Differences between FAILFAST, PERMISSIVE and DROPMALFORMED modes in Dataframes

    There are some differences between them, and we're going to find out what they are in this post. The mode parameter is a way to handle corrupted records and, depending on the mode, it allows validating Dataframes and keeping the data consistent. In this post we'll create a Dataframe with PySpark and compare the differences between these three modes: PERMISSIVE, DROPMALFORMED and FAILFAST.

    CSV file content

    The content below simulates some corrupted records: there are string values in the engines column, which we'll define as an integer type in the schema.

"type","country","city","engines","first_flight","number_built"
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A220","Canada","Calgary","two",2013-03-02,179
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A320","France","Lyon","two",1986-06-10,10066
"Airbus A330","France","Lyon","two",1992-01-02,1521
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York",2,1967-08-03,10636
"Airbus A220","Canada","Calgary",2,2013-03-02,179

    Let's start by creating a simple Dataframe that loads data from a CSV file with the content above; let's suppose this content comes from a file called airplanes.csv. To model the content, we're also creating a schema that will allow us to validate the data.

    Creating a Dataframe using PERMISSIVE mode

    The PERMISSIVE mode sets field values to null when corrupted records are detected. By default, if you don't specify the mode parameter, Spark uses the PERMISSIVE value.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

    Result of PERMISSIVE mode

    Creating a Dataframe using DROPMALFORMED mode

    The DROPMALFORMED mode ignores corrupted records, meaning that if you choose this mode the corrupted records won't be listed.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "DROPMALFORMED") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

    Result of DROPMALFORMED mode

    After the execution, it's possible to see that the corrupted records aren't present in the Dataframe.

    Creating a Dataframe using FAILFAST mode

    Unlike the DROPMALFORMED and PERMISSIVE modes, FAILFAST throws an exception when it detects corrupted records.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "FAILFAST") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

    Result of FAILFAST mode

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

    (At the end of this post there's a short extra sketch showing how to keep the malformed rows in a separate column for inspection.)

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also into how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python. Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as Dataframe usage, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming to process data in real time. This book contains practical examples and code snippets to make the reading easier. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more. Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices. Learning Scala: Practical Functional Programming for the JVM is an excellent book that covers Scala through examples and exercises. Reading this book you will learn about the core data types, literals, values and variables; building classes that compose one or more traits for full reusability; creating new functionality by mixing them in at instantiation; and more. Scala is one of the main languages in Big Data projects around the world, with huge usage in big tech companies like Twitter, and it is also Spark's core language. Cool? I hope you enjoyed it!
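    As a small addition to the modes above (not part of the original post), PERMISSIVE mode can also keep the raw text of each malformed record in a dedicated column, which is useful for inspecting bad data. This is a minimal sketch assuming the same airplanes.csv file; the column name _corrupt_record is Spark's default for the columnNameOfCorruptRecord option.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("spark-app-corrupt") \
    .getOrCreate()

# Same schema as before, plus a string column to hold the raw malformed line
schema = StructType([
    StructField("TYPE", StringType()),
    StructField("COUNTRY", StringType()),
    StructField("CITY", StringType()),
    StructField("ENGINES", IntegerType()),
    StructField("FIRST_FLIGHT", StringType()),
    StructField("NUMBER_BUILT", IntegerType()),
    StructField("_corrupt_record", StringType())
])

read_df = spark.read \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .schema(schema) \
    .csv("airplanes.csv")

# Valid rows show null in _corrupt_record; malformed rows keep their original text there
read_df.show(10, truncate=False)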

  • Differences between External and Internal tables in Hive

    There are two ways to create tables in the Hive context, and in this post we'll show the differences, advantages and disadvantages.

    Internal table

    Internal tables are also known as managed tables, and we'll understand the reason shortly. Now, let's create an internal table using SQL in the Hive context and see the advantages and disadvantages.

create table coffee_and_tips_table (name string, age int, address string)
stored as textfile;

    Advantages: to be honest I wouldn't say it's an advantage, but internal tables are managed by Hive.
    Disadvantages: internal tables can't access remote storage services, for example clouds like Amazon AWS, Microsoft Azure and Google Cloud. When an internal table is dropped, all the data, including metadata and partitions, is lost.

    External table

    External tables have some interesting features compared to internal tables, and they're a good and recommended approach when we need to create tables. In the script below you can see the difference from the internal table created in the last section: we just added the reserved word external to the script.

create external table coffee_and_tips_external (name string, age int, address string)
stored as textfile;

    Advantages: the data and metadata won't be lost if the table is dropped; external tables can be accessed and managed by external processes; external tables allow using a remote storage service as the source location.
    Disadvantages: again, I wouldn't say it's a disadvantage, but if you need to change the schema or drop a table, you'll probably need to run a command to repair the table, as shown below.

msck repair table <table_name>

    Depending on the volume, this operation may take some time to complete. To check a table's type, run the command below and you'll see the result in the table_type column.

hive> describe formatted <table_name>

    (A short PySpark sketch showing an external table pointing at remote storage follows at the end of this post.)

    Books to study and read

    If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Programming Hive is a comprehensive guide that introduces you to Apache Hive, Hadoop's data warehouse infrastructure. You'll quickly learn how to use Hive's SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in Hadoop's distributed filesystem. Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also into how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python. Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as Dataframe usage, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming to process data in real time. This book contains practical examples and code snippets to make the reading easier. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more. Cool? I hope you enjoyed it!
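    To illustrate the remote-storage advantage of external tables, here's a minimal PySpark sketch that creates an external table whose location points at an S3 path and then checks its type. This is only a hedged example: it assumes a Spark session with Hive support enabled and the S3 connector/credentials already configured, and the bucket path is a hypothetical placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hive-external-example") \
    .enableHiveSupport() \
    .getOrCreate()

# External table over a (hypothetical) remote S3 location
spark.sql("""
    create external table if not exists coffee_and_tips_external
    (name string, age int, address string)
    stored as textfile
    location 's3a://my-bucket/coffee_and_tips/'
""")

# The Table Type field in the output shows EXTERNAL_TABLE
spark.sql("describe formatted coffee_and_tips_external").show(truncate=False)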

  • First steps with DBT - Data Build Tool

    DBT has been used by a lot of companies in the data area, and I believe we can extract good insights about it in this post. This is going to be a practical post showing how DBT works, and I hope you guys enjoy it.

    What's DBT?

    DBT means Data Build Tool and enables teams to transform data already loaded in their warehouse with simple select statements. DBT does the T in ELT processes; in other words, it doesn't extract or load data, but it's useful for transforming it.

    Step 1: Creating a DBT project

    Now, we assume that DBT is already installed, but if not, I recommend checking this link. Once DBT is installed, you can create a new project using the CLI, or you can clone this project from the DBT Github repository. For this post we're going to use the CLI to create our project and also to complete the next steps. To create a new project, run the command below.

dbt init

    After running this command, you need to type the project's name and choose which warehouse or database you're going to use, as in the image below. For this post, we're going to use the postgres adapter. It's very important that you have a postgres database already installed, or you can spin up a postgres image using Docker. As for adapters, DBT supports several of them, and you can check them here. I created a table structure and also loaded it with data simulating a video platform called wetube, and we're going to use it to understand how DBT works. Here's the structure:

    Step 2: Structure and more about DBT

    After running the dbt init command to create the project, the structure of folders and files below will be created. I won't go through all of the project's directories, but I'd like to focus on two of them.

    Sources

    Sources are basically the data already loaded into your warehouse. In the DBT process, sources have the same meaning as raw data. There are no folders representing source data in this project, but you need to know this term, because we're going to set up already-created tables as sources in the next sections.

    Seeds

    Seeds are an interesting and useful mechanism to load static data into your warehouse through CSV files. If you want to load this data, you need to create a CSV file in this directory and run the command below.

dbt seed

    For each field in the CSV file, DBT will infer its type and create a table in the warehouse or database.

    Models

    DBT works with the model paradigm; the main idea is that you can create models through transformations, using SQL statements based on source tables or existing models. Every SQL file located in your model folder will create a model in your warehouse when the command below runs.

dbt run

    Remember that a model can be created from a source or from another model; don't worry about this, I'll show you more details later.

    Step 3: Setting up the database connection

    With the project created, we need to set up our database connection, and in this post we're going to use postgres as the database. After initializing the project, a bunch of files are created, and one of them is called profiles.yml. The profiles.yml file is responsible for controlling the different profiles for the different database connections, such as dev and production environments. As you may have noticed, we can't see this file in the image above, because it is created outside of the project to avoid exposing sensitive credentials. You can find this file in the ~/.dbt/ directory. Note that we have one profile named dbt_blog and a target called dev; by default, the target refers to dev with the database connection settings.
Also, It's possible to create one or more profiles and targets, it enables working with different environments. Another important detail is that dbt_blog profile should be specified on dbt_project.yml file as a default profile. For the next sections, we'll discuss what and how dbt_project.yml file works it. Step 4: Creating dbt_project.yml file Every DBT project has a dbt_project.yml file, you can set up informations like project name, directories, profiles and materialization type. name: 'dbt_blog' version: '1.0.0' config-version: 2 profile: 'dbt_blog' model-paths: ["models"] analysis-paths: ["analyses"] test-paths: ["tests"] seed-paths: ["seeds"] macro-paths: ["macros"] snapshot-paths: ["snapshots"] target-path: "target" # directory which will store compiled SQL files clean-targets: # directories to be removed by `dbt clean` - "target" - "dbt_packages" models: dbt_blog: # Config indicated by + and applies to all files under models/example/ mart: +materialized: table Note that profile field was set up as the same profile specified on profiles.yml file and another important detail is about materialized field. Here was set up as a "table" value but by default, is a "view". Materialized fields allows you to create models as a table or view on each run. There are others type of materialization but we won't discuss here and I recommend see dbt docs. Step 5: Creating our first model Creating first files Let's change a little and let's going to create a sub-folder on model directory called mart and inside this folder we're going to create our .SQL files and also another important file that we don't discuss yet called schema.yml. Creating schema file Schema files are used to map sources and to document models like model's name, columns and more. Now you can create a file called schema.yml e fill up with these informations below. version: 2 sources: - name: wetube tables: - name: account - name: city - name: state - name: channel - name: channel_subs - name: video - name: video_like - name: user_address models: - name: number_of_subs_by_channel description: "Number of subscribers by channel" columns: - name: id_channel description: "Channel's ID" tests: - not_null - name: channel description: "Channel's Name" tests: - not_null - name: num_of_subs description: "Number of Subs" tests: - not_null Sources: At sources field you can include tables from your warehouse or database that's going to be used on model creation. models: At models field you can include the name's model, columns and their description Creating a model This part is where we can create SQL scripts that's going to result in our first model. For the first model, we're going to create a SQL statement to represent a model that we can see the numbers of subscribers by channel. Let's create a file called number_of_subs_by_channel.sql and fill up with these scripts below. with source_channel as ( select * from {{ source('wetube', 'channel') }} ), source_channel_subs as ( select * from {{ source('wetube','channel_subs') }} ), number_of_subs_by_channel as ( select source_channel.id_channel, source_channel.name, count(source_channel_subs.id_subscriber) num_subs from source_channel_subs inner join source_channel using (id_channel) group by 1, 2 ) select * from number_of_subs_by_channel Understanding model creation Note that we have multiple scripts separated by common table expression (CTE) that becomes useful to understand the code. DBT enables using Jinja template {{ }} bringing a better flexibility to our code. 
The usage of keyword source inside Jinja template means that we're referring source tables. To refer a model you need to use ref keyword. The last SELECT statement based on source tables generates the model that will be created as table in the database. Running our first model Run the command below to create our first model dbt run Output: Creating another model Imagine that we need to create a model containing account information and it's channels. Let's get back to schema.yml file to describe this new model. - name: account_information description: "Model containing account information and it's channels" columns: - name: id_account description: "Account ID" tests: - not_null - name: first_name description: "First name of user's account" tests: - not_null - name: last_name description: "Last name of user's account" tests: - not_null - name: email description: "Account's email" tests: - not_null - name: city_name description: "city's name" tests: - not_null - name: state_name description: "state's name" tests: - not_null - name: id_channel description: "channel's Id" tests: - not_null - name: channel_name description: "channel's name" tests: - not_null - name: channel_creation description: "Date of creation name" tests: - not_null Now, let's create a new SQL file and name it as account_information.sql and put scripts below: with source_channel as ( select * from {{ source('wetube', 'channel') }} ), source_city as ( select * from {{ source('wetube','city') }} ), source_state as ( select * from {{ source('wetube','state') }} ), source_user_address as ( select * from {{ source('wetube','user_address') }} ), source_account as ( select * from {{ source('wetube','account') }} ), account_info as ( select account.id_user as id_account, account.first_name, account.last_name, account.email, city.name as city_name, state.name as state_name, channel.id_channel, channel.name as channel, channel.creation_date as channel_creation FROM source_account account inner join source_channel channel on (channel.id_account = account.id_user) inner join source_user_address user_address using (id_user) inner join source_state state using (id_state) inner join source_city city using (id_city) ) select * from account_info Creating our last model For our last model, we going to create a model about how many likes has a video. Let's change again the schema.yml to describe and to document our future and last model. 
- name: total_likes_by_video description: "Model containing total of likes by video" columns: - name: id_channel description: "Channel's Id" tests: - not_null - name: channel description: "Channel's name" tests: - not_null - name: id_video description: "Video's Id" tests: - not_null - name: title description: "Video's Title" tests: - not_null - name: total_likes description: "Total of likes" tests: - not_null Name it a file called total_likes_by_video.sql and put the code below: with source_video as ( select * from {{ source('wetube','video') }} ), source_video_like as ( select * from {{ source('wetube','video_like') }} ), source_account_info as ( select * from {{ ref('account_information') }} ), source_total_like_by_video as ( select source_account_info.id_channel, source_account_info.channel, source_video.id_video, source_video.title, count(*) as total_likes FROM source_video_like inner join source_video using (id_video) inner join source_account_info using (id_channel) GROUP BY source_account_info.id_channel, source_account_info.channel, source_video.id_video, source_video.title ORDER BY total_likes DESC ) select * from source_total_like_by_video Running DBT again After creation of our files, let's run them again to create the models dbt run Output The models were created in the database and you can run select statements directly in your database to check it. Model: account_information Model: number_of_subs_by_channel Model: total_likes_by_video Step 6: DBT Docs Documentation After generated our models, now we're going to generate docs based on these models. DBT generates a complete documentation about models and sources and their columns and also you can see through a web page. Generating docs dbt docs generate Running docs on webserver After docs generated you can run command below to start a webserver on port 8080 and see the documentation locally. dbt docs serve Lineage Another detail about documentation is that you can see through of a Lineage the models and it's dependencies. Github code You can checkout this code through our Github page. Cool? I hope you guys enjoyed it!

  • Overview of AWS SNS - Simple Notification Service

    SNS (Simple Notification Service) provides a notification service based on the Pub/Sub strategy. It's a way of publishing messages to one or more subscribers through endpoints. Is that confusing? Let's go a little deeper into this topic. Usually the term Pub/Sub is related to event-driven architectures. In this architecture, publishing messages can be done through notifications to one or more already-known destinations, providing an asynchronous approach. For a destination to become known, there must be a way to signal that it has become a candidate to receive messages from the source.

    But how do subscriptions work? In the SNS context, each subscriber can be associated with one or more SNS Topics. Thus, for each message published through a Topic, one or more subscribers will receive it. We can compare it to receiving push notifications from the apps installed on our smartphones. It's the same idea: after installing an app we become a subscriber of that service, and each interaction from that application can be done through notifications or published messages. The example above demonstrates a possible use case where SNS could be applied. In the next sections we'll discuss the details for a better understanding. SNS basically provides two main components, publishers and subscribers. Both of them work together, providing resources through the AWS console and APIs.

    1. Topics/Publishers

    Topics are logical endpoints that work as an interface between publisher and subscriber. Basically, Topics deliver to the subscribers the messages published by the publisher. There are two types of Topics, FIFO and Standard:

    FIFO: the FIFO type provides message ordering (First In/First Out), has a limit of up to 300 published messages per second, prevents message duplication and supports only the SQS protocol as a subscriber.
    Standard: it does not guarantee message ordering, and all of the supported delivery protocols can subscribe to a standard topic, such as SQS, Lambda, HTTP, SMS, email and mobile app endpoints.

    Limits: AWS allows creating up to 100,000 topics per account.

    2. Subscribers

    A subscription is a way to connect an endpoint to a specific Topic. Each subscription must be associated with a Topic to receive notifications from that Topic. Examples of endpoints: AWS SQS, HTTP, HTTPS, AWS Kinesis Data Firehose, e-mail, SMS and AWS Lambda. These endpoints are examples of delivery or transport formats for receiving notifications from a Topic through a subscription.

    Subscription limits: AWS allows up to 10 million subscriptions per Topic.

    3. Message size limit

    SNS messages can contain up to 256 KB of text data, unlike SMS messages, which can contain up to 140 bytes of text data.

    4. Message types

    SNS supports different types of messages, such as text, XML, JSON and more.

    5. Differences between SNS and SQS

    Sometimes people get confused about the differences, but don't worry, I can explain them. SNS and SQS are different services, but they can be associated. SQS is an AWS queue service that retains messages sent from different clients and contexts, but at the same time a queue can work as a subscriber to a Topic. Thus, an SQS queue subscribed to a Topic will start receiving notifications, becoming an asynchronous integration. Looking at the image above, we're simulating a scenario where we have three SQS queues subscribed to three Topics. SQS 1 is a subscriber to Topics 1 and 2; thus, SQS 1 will receive notifications/messages from both Topics 1 and 2.
Imagine a scenario in which we have three SQS queues subscribed to three Topics. SQS 1 is a subscriber of Topics 1 and 2, so it will receive notifications/messages from both of them. SQS 2 is a subscriber of Topics 2 and 3 and will also receive messages from both. Finally, SQS 3 is a subscriber of Topic 3 only, so it will receive messages only from Topic 3. For more details, I recommend reading this doc.

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today, so if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well, that's it, I hope you enjoyed it!

  • How to create S3 notification events using SQS via Terraform

S3 (Simple Storage Service) makes it possible to send event notifications when an action occurs within a Bucket or in a specific folder. In other words, it works as a listener: whenever an action takes place on a source, an event notification is sent to a destination. What would those actions be? Any action that takes place within an S3 Bucket, such as creating objects or folders, removing files, restoring files and more.

Destinations

For each event notification configuration, there must be a destination, to which information about each action will be sent. For example: a new file has been created in a specific folder, so information about that file will be sent, such as the creation date, file size, event type, file name and more. Remember that in this process the content of the file itself is not sent, okay? There are 3 types of destinations:
Lambda
SNS
SQS

Understanding how it works

In this post we are going to create an event notification setting in an S3 Bucket, simulate an action and understand the final behavior. We could create this setting via the console, but for good-practice reasons we'll use Terraform as the IaC tool. For those who aren't very familiar with Terraform, follow this tutorial on Getting Started using Terraform on AWS. In the next steps, we will create a flow simulating the image below: we'll configure the S3 Bucket so that for every file created within the files/ folder, a notification event is sent to an SQS queue.

Creating Terraform files

Create a folder called terraform/ in your project; from now on, all .tf files will be created inside it. Now, create a file called vars.tf where we're going to store the variables that will be used, and paste the content below into this file.

variable "region" {
  default = "us-east-1"
  type    = string
}

variable "bucket" {
  type = string
}

Create a file called provider.tf, where we will add the provider settings, which will be AWS. This means Terraform will use AWS as the cloud to create the resources and will download the required plugins on startup. Copy the code below to the file.

provider "aws" {
  region = "${var.region}"
}

Create a file called s3.tf, where we'll add the settings for creating the new S3 Bucket that will be used in this tutorial.

resource "aws_s3_bucket" "s3_bucket_notification" {
  bucket = "${var.bucket}"
}

Now, create a file called sqs.tf, where we'll add the settings for creating an SQS queue and some permissions, according to the code below:

resource "aws_sqs_queue" "s3-notifications-sqs" {
  name   = "s3-notifications-sqs"
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:*:*:s3-notifications-sqs",
      "Condition": {
        "ArnEquals": { "aws:SourceArn": "${aws_s3_bucket.s3_bucket_notification.arn}" }
      }
    }
  ]
}
POLICY
}

Understanding the code above

In the code above we're creating an SQS queue and adding some policy settings. In more detail:
The SQS queue name will be s3-notifications-sqs, as defined in the name field.
In the policy field, we define a policy that allows S3 to send notification messages to SQS. Notice that we're referencing the S3 Bucket via ARN in the snippet ${aws_s3_bucket.s3_bucket_notification.arn}.

For the last file, let's create the settings that allow sending event notifications from the S3 Bucket to an SQS queue.
Therefore, create the s3_notification.tf file and add the code below:

resource "aws_s3_bucket_notification" "s3_notification" {
  bucket = aws_s3_bucket.s3_bucket_notification.id

  queue {
    events        = ["s3:ObjectCreated:*"]
    queue_arn     = aws_sqs_queue.s3-notifications-sqs.arn
    filter_prefix = "files/"
  }
}

Understanding the code above

In the code above, we are creating a resource called aws_s3_bucket_notification, which is responsible for enabling notifications from an S3 Bucket. In the bucket field, we refer to the S3 Bucket defined in the s3.tf file. The queue block contains some settings, such as:
events: the event type of the notification. In this case, ObjectCreated events, so notifications will be sent only for created objects; for deleted objects there will be no notifications. This helps to restrict notifications to certain types of events.
queue_arn: refers to the SQS queue defined in the sqs.tf file.
filter_prefix: defines the folder where we want notifications to be triggered. In the code, we set the files/ folder as the trigger location for created files.

Summarizing: for every file created within the files/ folder, a notification will be sent to the SQS queue defined in the queue_arn field.

Running Terraform Init

terraform init

Running Plan

The plan makes it possible to verify which resources will be created; in this case it is necessary to pass the value of the bucket variable for its creation in S3.

terraform plan -var 'bucket=type the bucket name'

Running Apply command

In this step, the creation of the resources will be applied. Remember to pass the name of the bucket you want to create in the bucket variable, and the bucket name must be unique.

terraform apply -var 'bucket=type the bucket name'

Simulating an event notification

After running the previous steps and creating the resources, we will manually upload a file to the files/ folder of the bucket that was created. Via the console, access the Bucket created in S3, create a folder called files and upload any file inside it. After uploading the file to the files/ folder, access the created SQS queue and you'll see some available messages. Usually 3 messages will be available in the queue: a test message is sent right after the S3 event notification settings are created, the second one is generated when we create the folder, and the last one is the message related to the file upload. It's done, we have an event notification created!
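If you prefer to check the queue programmatically instead of through the console, here is a minimal sketch (not part of the original tutorial) assuming the AWS SDK for Java v2 and a hypothetical queue URL: it polls the s3-notifications-sqs queue and prints the body of each notification, which contains the S3 event as JSON.

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class ReadS3Notifications {

    public static void main(String[] args) {
        // Hypothetical queue URL, copy the real one from the SQS console
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-notifications-sqs";

        try (SqsClient sqs = SqsClient.builder().region(Region.US_EAST_1).build()) {

            // Long polling: wait up to 10 seconds for notifications to arrive
            ReceiveMessageRequest request = ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(10)
                    .build();

            // Each message body is the S3 event notification in JSON format
            for (Message message : sqs.receiveMessage(request).messages()) {
                System.out.println(message.body());
            }
        }
    }
}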
References:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_notification
https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html

Coffee and tips Github repository: https://github.com/coffeeandtips/s3-bucket-notification-with-sqs-and-terraform

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools such as Ansible and CloudFormation (AWS's native IaC) and, especially, shows how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a complete engineer, I strongly recommend learning about it. AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today, so if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well, that's it, I hope you enjoyed it!

  • Understanding AWS S3 in 1 minute

Amazon Web Services (AWS) is one of the most traditional and resourceful Cloud Computing services on the market, and one of its most used resources is S3.

What's AWS S3?

S3 is an abbreviation for Simple Storage Service. It is a resource managed by AWS itself, that is, we do not need to worry about managing the infrastructure. It provides an object repository that allows you to store objects of different volumes, keep backups, host websites, perform data analysis, manage Data Lakes and more. It also provides several integrations with other AWS services, working as a base repository.

S3 Structure

AWS S3 is divided into different Buckets containing a folder structure based on customer needs.

What's a Bucket?

An S3 Bucket is a kind of container where all objects will be stored and organized. A very important detail is that the Bucket name must be unique.

Architecture

In the image below we have a simpler design of the S3 architecture with Buckets and folder directories.

Integrations

The AWS ecosystem makes it possible to integrate most of its own and third-party tools, and S3 is one of the most integrated resources. Some examples of integrated services:
Athena
Glue
Kinesis Firehose
Lambda
Delta Lake
RDS
Others

SDK

AWS provides SDKs compatible with different programming languages that make it possible to manipulate objects in S3, such as creating Buckets and folders, uploading and downloading files and much more (a minimal sketch is shown at the end of this post).

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today, so if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it. Well, that's it, I hope you enjoyed it!
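Bonus: as mentioned in the SDK section above, here is a minimal sketch (not part of the original post) assuming the AWS SDK for Java v2; the bucket name and object key are hypothetical placeholders. It creates a Bucket, uploads a small text object and reads it back.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3QuickStart {

    public static void main(String[] args) {
        // Hypothetical names; remember that Bucket names must be globally unique
        String bucket = "my-sample-bucket-change-me";
        String key = "files/hello.txt";

        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {

            // 1. Create the Bucket
            s3.createBucket(CreateBucketRequest.builder().bucket(bucket).build());

            // 2. Upload a small text object
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromString("Hello, S3!"));

            // 3. Download the object back as a String and print it
            String content = s3.getObjectAsBytes(
                    GetObjectRequest.builder().bucket(bucket).key(key).build()).asUtf8String();
            System.out.println(content);
        }
    }
}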
