- How to read a CSV file with Apache Spark
Apache Spark works very well for reading multiple files for data extraction. In this post we'll build an example that reads a CSV file using Spark, Java, and Maven.

Maven

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.0</version>
</dependency>
```

CSV Content

Let's suppose the file below is named movies.csv:

```
title;year;rating
The Shawshank Redemption;1994;9.3
The Godfather;1972;9.2
The Dark Knight;2008;9.0
The Lord of the Rings: The Return of the King;2003;8.9
Pulp Fiction;1994;8.9
Fight Club;1999;8.8
Star Wars: Episode V - The Empire Strikes Back;1980;8.7
Goodfellas;1990;8.7
Star Wars;1977;8.6
```

Creating a SparkSession

```java
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]");
sparkConf.setAppName("app");

SparkSession sparkSession = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();
```

Running the Read

```java
Dataset<Row> ds = sparkSession.read()
        .format("csv")
        .option("sep", ";")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("movies.csv");

ds.select("title", "year", "rating").show();
```

Result

(The show() call prints the selected columns as a formatted table in the console.)

Understanding some parameters

- .option("sep", ";"): defines the delimiter used to read the file, in this case a semicolon (;)
- .option("inferSchema", "true"): makes Spark scan the file(s) to infer (guess) the data type of each field; for an explicit alternative, see the sketch at the end of this post
- .option("header", "true"): uses the field names defined in the file header
- .load("movies.csv"): movies.csv is the name of the file to be read

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following books:

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as Dataframe usage, Spark SQL (which lets you use SQL to manipulate data), and Structured Streaming for processing data in real time. The book contains practical examples and code snippets that make for easy reading.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices for using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over the RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more.

Maven: The Definitive Guide, written by Maven creator Jason Van Zyl and his team at Sonatype, clearly explains how this tool can bring order to your software development projects. In this book you'll learn about the POM and project relationships, the build lifecycle, plugins, project website generation, advanced site generation, reporting, properties, build profiles, the Maven repository, and more.

Well that's it, I hope you enjoyed it!
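As referenced in the parameter notes above, here is a minimal sketch of reading the same file with an explicitly defined schema instead of inferSchema. The StructType below is my own illustration (the column types are assumptions based on the sample data), and it reuses the sparkSession created earlier:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declaring the schema up front avoids the extra pass over the data
// that inferSchema performs.
StructType schema = new StructType()
        .add("title", DataTypes.StringType)
        .add("year", DataTypes.IntegerType)
        .add("rating", DataTypes.DoubleType);

Dataset<Row> ds = sparkSession.read()
        .format("csv")
        .option("sep", ";")
        .option("header", "true")
        .schema(schema)
        .load("movies.csv");
```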
- Accessing and Modifying Terraform State
Before we start talking about accessing state, we need to explain what state is.

What is Terraform State?

Terraform State is the mechanism Terraform uses to manage infrastructure, configuration, and created resources: it keeps a mapping of what already exists in order to control the updating and creation of new resources. A basic example is when we create an S3 bucket, an EC2 instance, or an SQS queue via Terraform. All of these resources are mapped in the state and managed by Terraform.

State locations

Local: By default, Terraform stores state locally in the terraform.tfstate file. Keeping state local can work well for a scenario where there is no need to share the state between teams.

Remote: Unlike local state, when teams share the same resources, keeping state remote becomes essential. Terraform provides support for sharing state remotely. We won't go into full configuration detail, but it's possible to keep state in Amazon S3, Azure Blob Storage, Google Cloud Storage, Alibaba Cloud OSS, and other cloud services (a minimal backend sketch appears at the end of this post).

The state is stored in the terraform.tfstate file in JSON format. Here is an example of an S3 bucket mapped in the state:

```json
{
  "version": 4,
  "terraform_version": "0.12.3",
  "serial": 3,
  "lineage": "853d8b-4ee1-c1e4-e61e-e10",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "s3_bucket_xpto",
      "provider": "provider.aws",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "acceleration_status": "",
            "acl": "private",
            "arn": "arn:aws:s3:::bucket.xpto",
            "bucket": "bucket.xpto",
            "bucket_domain_name": "bucket.xpto",
            "bucket_prefix": null,
            "bucket_regional_domain_name": "bucket.xpto",
            "cors_rule": [],
            "force_destroy": false,
            "grant": [],
            "hosted_zone_id": "Z3NHGSIKTF",
            "id": "bucket.xpto",
            "lifecycle_rule": [],
            "logging": [],
            "object_lock_configuration": [],
            "policy": null,
            "region": "us-east-1",
            "replication_configuration": [],
            "request_payer": "BucketOwner",
            "server_side_encryption_configuration": [],
            "tags": {
              "Environment": "development"
            },
            "versioning": [
              {
                "enabled": false,
                "mfa_delete": false
              }
            ],
            "website": [],
            "website_domain": null,
            "website_endpoint": null
          },
          "private": "Ud4JbhV=="
        }
      ]
    }
  ]
}
```

Accessing and updating the State

Although the state lives in a JSON file, it is not recommended to edit the file directly. Terraform provides the terraform state commands, executed via CLI, so that small modifications can be made. Through the CLI, we can run commands to manipulate the state:

terraform state [options] [args]

Sub-commands:
- list: List the resources in the state
- mv: Move an item in the state
- pull: Extract the current state and print the result to stdout
- push: Update a remote state from a local state file
- rm: Remove an instance from the state
- show: Show state resources

1. Listing State resources

Command: terraform state list

The command above lists the resources managed by the state.

Example:

```
$ terraform state list
aws_s3_bucket.s3_bucket
aws_sqs_queue.sqs-xpto
```

In the example above, the result is an S3 bucket and an SQS queue that were created via Terraform and are being managed by the state.
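For context, the two resources listed above could have been declared in Terraform roughly like this. This is a hypothetical sketch, not the original code; the arguments mirror the attributes shown in the state examples in this post:

```hcl
# Hypothetical definitions matching the addresses shown by `terraform state list`.
resource "aws_s3_bucket" "s3_bucket" {
  bucket = "bucket.xpto"
  acl    = "private"

  tags = {
    Environment = "development"
  }
}

resource "aws_sqs_queue" "sqs-xpto" {
  name                       = "sqs-xpto"
  delay_seconds              = 90
  receive_wait_time_seconds  = 10
  visibility_timeout_seconds = 30

  tags = {
    Environment = "staging"
  }
}
```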
2. Viewing a resource and its attributes

Command: terraform state show [options] RESOURCE_ADDRESS

The command above shows a specific resource and its attributes in detail.

Example:

```
$ terraform state show aws_sqs_queue.sqs-xpto
# aws_sqs_queue.sqs-xpto:
resource "aws_sqs_queue" "sqs-xpto" {
    arn                               = "arn:aws:sqs:sqs-xpto"
    content_based_deduplication       = false
    delay_seconds                     = 90
    fifo_queue                        = false
    id                                = "https://sqs-xpto"
    kms_data_key_reuse_period_seconds = 300
    max_message_size                  = 262144
    message_retention_seconds         = 345600
    name                              = "sqs-xpto"
    receive_wait_time_seconds         = 10
    tags                              = {
        "Environment" = "staging"
    }
    visibility_timeout_seconds        = 30
}
```

3. Removing resources from the State

Command: terraform state rm [options] RESOURCE_ADDRESS

The command above removes one or more items from the state. Unlike terraform destroy, it does not delete the remote objects created in the cloud; it only stops Terraform from tracking them.

Example:

```
$ terraform state rm aws_sqs_queue.sqs-xpto
```

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following books:

Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools, such as Ansible and CloudFormation (AWS's native IaC), and above all shows how to create and provision different resources for multiple cloud services. Terraform is currently the most widely used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, and Google Cloud. If you want to be a well-rounded engineer or work in the DevOps area, I strongly recommend learning the topic.

AWS Cookbook is a practical guide containing 70 recipes for AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand the subject better and be well positioned in the market, I strongly recommend studying it.

Well that's it, I hope you enjoyed it!
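As mentioned in the Remote section above, here is a minimal sketch of keeping state in Amazon S3 through a remote backend. The bucket name and key below are hypothetical, not from the original post:

```hcl
terraform {
  # Hypothetical S3 backend: the state file is stored remotely
  # and can be shared by the whole team.
  backend "s3" {
    bucket = "xpto-terraform-state"        # pre-existing bucket that holds the state
    key    = "project/terraform.tfstate"   # path of the state file inside the bucket
    region = "us-east-1"
  }
}
```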
- Java: Streams API - findFirst()
Java 8 Streams introduced different methods for handling collections. One of these methods is findFirst(), which returns the first element of a Stream wrapped in an Optional instance.

Example

(The original post showed the code as images that were not preserved; a reconstructed sketch of all three examples appears at the end of this post.)

Output:

Item: Monica Souza

Using filter

Output:

Item: Andre Silva

Note that it returned the first name with the last name Silva from the collection.

Not using Streams

If we use the traditional way, without Streams, the code filtering by the last name "Silva" relies on a break statement to end the loop once a match is found.

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following books:

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software uses Java examples to show you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based.

Head First Java is a complete learning experience in Java and object-oriented programming. With this book, you'll learn the Java language with a unique method that goes beyond how-to manuals and helps you become a great programmer. Through puzzles, mysteries, and soul-searching interviews with famous Java objects, you'll quickly get up to speed on Java's fundamentals as well as advanced topics including lambdas, streams, generics, threading, networking, and the dreaded desktop GUI.

Well that's it, I hope you enjoyed it!
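As noted above, the original snippets were lost with the images. Here is a minimal sketch reconstructing the three examples; the Person class and the list of names are my own assumptions, inferred from the outputs shown:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FindFirstExample {

    static class Person {
        private final String name;
        Person(String name) { this.name = name; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Monica Souza"),
                new Person("Andre Silva"),
                new Person("Carlos Silva"));

        // findFirst(): returns the first element of the Stream as an Optional
        Optional<Person> first = people.stream().findFirst();
        first.ifPresent(p -> System.out.println("Item: " + p.getName())); // Item: Monica Souza

        // Using filter: the first element whose last name is "Silva"
        people.stream()
                .filter(p -> p.getName().endsWith("Silva"))
                .findFirst()
                .ifPresent(p -> System.out.println("Item: " + p.getName())); // Item: Andre Silva

        // Not using Streams: the traditional loop relies on break to stop early
        for (Person p : people) {
            if (p.getName().endsWith("Silva")) {
                System.out.println("Item: " + p.getName());
                break;
            }
        }
    }
}
```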
- Java Streams API - Filter
Since Java 8, released in 2014, dozens of new features have been added, including improvements to the JVM and functions that make developers' lives easier. Among these features are lambda expressions, which were the starting point for Java's entry into the world of functional programming; improvements to the date API; and no longer needing to create implementations of existing interfaces, thanks to default methods. And another piece of news is the Streams API, the focus of this post.

The Streams API is a new approach to working with collections, making code cleaner and smarter. The Streams API works with on-demand data processing and provides dozens of features for manipulating collections, reducing code and simplifying development in a kind of pipeline that will be explained later.

To understand it better, let's write some simple code for a list of objects and create some conditions in order to extract a new list with the desired values. In this first example we will not use the Streams API.

Let's create a class representing the City entity, with the following attributes: name, state, and population. And finally a method called listCities that loads a list of objects of type City. (The original code snippets were images that were not preserved; a reconstructed sketch appears at the end of this post.)

Filtering

In the next piece of code, we iterate through the list, check which cities have a population greater than 20, and add them to a secondary list. Technically there is nothing wrong with that code, but it could be cleaner.

Now, applying the Streams API, here's the new version. Notice the difference: we invoke the listCities() method, and because it returns a Collection, we can call the stream() method. The call to stream() starts the pipeline.

More examples of Filter

Example 1:

Output:
City: Hollywood /State: CA /Population: 30
City: Venice /State: CA /Population: 10

Example 2:

Output:
City: Venice /State: CA /Population: 10

How does a pipeline work?

Following the previous example, the pipeline is a sequential process that distinguishes between intermediate and final operations. In the example, the Stream is created from a data source (the list of City objects) and works on demand; the filter method is an intermediate operation, that is, it processes the data until the collect method is invoked, producing the final operation.

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following books:

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software uses Java examples to show you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based.

Head First Java is a complete learning experience in Java and object-oriented programming. With this book, you'll learn the Java language with a unique method that goes beyond how-to manuals and helps you become a great programmer. Through puzzles, mysteries, and soul-searching interviews with famous Java objects, you'll quickly get up to speed on Java's fundamentals as well as advanced topics including lambdas, streams, generics, threading, networking, and the dreaded desktop GUI.

Well that's it, I hope you enjoyed it!
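As noted above, here is a minimal sketch reconstructing the examples; the city data and the exact filter conditions are my own assumptions, inferred from the outputs shown, not the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterExample {

    static class City {
        private final String name;
        private final String state;
        private final int population;

        City(String name, String state, int population) {
            this.name = name;
            this.state = state;
            this.population = population;
        }

        String getName() { return name; }
        String getState() { return state; }
        int getPopulation() { return population; }

        @Override
        public String toString() {
            return "City: " + name + " /State: " + state + " /Population: " + population;
        }
    }

    static List<City> listCities() {
        return Arrays.asList(
                new City("Hollywood", "CA", 30),
                new City("Venice", "CA", 10),
                new City("Dallas", "TX", 25));
    }

    public static void main(String[] args) {
        // Without Streams: iterate and collect matches into a secondary list
        List<City> bigCities = new ArrayList<>();
        for (City city : listCities()) {
            if (city.getPopulation() > 20) {
                bigCities.add(city);
            }
        }

        // With the Streams API: filter is an intermediate operation,
        // collect is the final operation that closes the pipeline
        List<City> bigCitiesWithStream = listCities().stream()
                .filter(city -> city.getPopulation() > 20)
                .collect(Collectors.toList());

        // Example 1: cities in the state of CA
        listCities().stream()
                .filter(city -> city.getState().equals("CA"))
                .forEach(System.out::println);

        // Example 2: cities in CA with a population below 20
        listCities().stream()
                .filter(city -> city.getState().equals("CA"))
                .filter(city -> city.getPopulation() < 20)
                .forEach(System.out::println);
    }
}
```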
- How to create Mutation tests with Pitest
Pitest is a tool that provides state-of-the-art mutation testing. It was written to run on Java applications, or on applications that run on top of a JVM.

How does it work?

Pitest works by generating mutant code that changes the bytecode of the source code: changing logic, removing and even changing method return values. This makes it possible to evaluate the code in a more complete way, looking for failures and making the code reliable.

Basic example of mutation

See the code below:

```java
if (name == "") {
    return "Empty name";
} else {
    return name;
}
```

Running Pitest on the code above, it's able to modify the bytecode, changing the code to:

```java
if (name != "") {
    return name;
} else {
    return "Empty name";
}
```

Pitest enables you to identify possible points of failure and guides you to create tests based on those points.

Hands-on time

Maven

```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.pitest</groupId>
            <artifactId>pitest-maven</artifactId>
            <version>1.6.2</version>
        </plugin>
    </plugins>
</build>
```

Let's create a simple class representing a bank account, called BankAccount. To keep it simple, we'll implement the logic inside the entity class itself. In this class we have a method called userValid that checks whether the user is valid.

```java
package com.bank.entity;

public class BankAccount {

    public String bankName;
    public String user;
    public Double balance;

    public BankAccount(String bankName, String user, Double balance) {
        this.bankName = bankName;
        this.user = user;
        this.balance = balance;
    }

    public Boolean userValid(BankAccount bankAccount) {
        if (bankAccount.user != "") {
            return true;
        } else {
            return false;
        }
    }
}
```

Without creating any test classes, run this command via Maven:

```
mvn org.pitest:pitest-maven:mutationCoverage
```

A report is generated in the console with the mutations created, based on the conditions that must be tested. Looking at the console output, we have 3 mutations, or points of failure, which means we need to eliminate these mutations from our code.

Pitest provides another way to inspect these mutations, through a report generated as an HTML page. You can access the generated reports in the /target/pit-reports/ folder. The report summarizes the coverage percentage: note that we start at 0% coverage. The BankAccount.java.html file shows details about the uncovered parts of the code, including a list of Active mutators, which are the mutations based on the code we created.

Removing these mutations

Based on the reports, let's create some tests to reach 100% coverage. First of all, we need to create a test class. For this example we'll create one called BankAccountTest.java. There's one point of attention here: you have to create this class with the same package name used for BankAccount.java. So, inside src/test/java, create a package called com.bank.entity and, finally, BankAccountTest.java.

```java
package com.bank.entity;

import org.junit.Assert;
import org.junit.Test;

public class BankAccountTest {

    @Test
    public void userValid_NegateConditionalsMutator_Test() {
        BankAccount bankAccount = new BankAccount("Chase", "Jonas", 2000.00);

        Boolean actual = bankAccount.userValid(bankAccount);
        Boolean expected = true;

        Assert.assertEquals(expected, actual);
    }

    @Test
    public void userValid_BooleanFalseReturnValsMutator_Test() {
        BankAccount bankAccount = new BankAccount("Chase", "", 2000.00);

        Boolean actual = bankAccount.userValid(bankAccount);
        Boolean expected = false;

        Assert.assertEquals(expected, actual);
    }
}
```

We created only 2 test methods for our coverage:

1. userValid_NegateConditionalsMutator_Test(): covers filled-in usernames. The constructor receives the "Jonas" argument representing the username, so that the method's condition is validated.
It finally validates that the method returns TRUE. With this test case, we have already eliminated two mutations, according to the statistics report:

- BooleanTrueReturnValsMutator
- NegateConditionalsMutator

2. userValid_BooleanFalseReturnValsMutator_Test(): validates the case where the method returns the Boolean value FALSE.

Rerun the Maven command:

```
mvn org.pitest:pitest-maven:mutationCoverage
```

Note in the application console that, for every mutation generated, one was eliminated, reaching 100% test coverage. If you want to narrow down what Pitest analyzes, see the sketch at the end of this post.

References: https://pitest.org

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following books:

Unit Testing Principles, Practices, and Patterns: Effective Testing Styles, Patterns, and Reliable Automation for Unit Testing, Mocking, and Integration Testing with Examples in C# teaches you to design and write tests that target key areas of your code, including the domain model. In this clearly written guide, you learn to develop professional-quality tests and test suites and to integrate testing throughout the application life cycle.

Mastering Unit Testing Using Mockito and JUnit covers JUnit practices using one of the most famous testing libraries, Mockito. This book teaches how to create and maintain automated unit tests using advanced features of JUnit with the Mockito framework, continuous integration (the famous CI) using market tools like Jenkins, and one of the largest dependency managers in Java projects, Maven. For those starting out in this world, it is an excellent choice.

Well that's it, I hope you enjoyed it!
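As mentioned above, the pitest-maven plugin accepts targetClasses and targetTests parameters to restrict which classes are mutated and which tests are run. A minimal sketch, with illustrative package filters:

```xml
<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <version>1.6.2</version>
    <configuration>
        <!-- Only mutate the entity classes and run the matching tests -->
        <targetClasses>
            <param>com.bank.entity.*</param>
        </targetClasses>
        <targetTests>
            <param>com.bank.entity.*</param>
        </targetTests>
    </configuration>
</plugin>
```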
- First steps with CloudFormation
There are different ways to create resources on AWS: you can create an S3 bucket, an SQS queue, an RDS database, and many other resources manually. But when dealing with infrastructure and its management, creating resources manually becomes unsustainable. Another way is to use IaC (Infrastructure as Code) tools, which allow you to create, manage, and provision resources in the cloud with less effort. On AWS, we can use CloudFormation to help us create the resources we want to use.

How does it work?

We start from a template in JSON or YAML format and then upload the file to CloudFormation on AWS. Very simple. To understand the process better, let's create an S3 bucket and an SQS queue through CloudFormation, following what was described earlier, using a template. There are two ways to write a template: you can use a JSON or a YAML file. In this example we will use a template in YAML format.

Creating the S3 Bucket template

```yaml
Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    DeletionPolicy: Retain
    Properties:
      BucketName: blog.data
      AccessControl: Private
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: "AES256"
```

In the template above, we used some essential parameters for creating the bucket; the complete list can be consulted in the AWS documentation. Next, let's briefly understand what each parameter means:

- S3Bucket: an identifier given to the resource; always create an identifier that makes sense in its context
- Type: the resource type
- DeletionPolicy: there are three options:
  - Delete: if the CloudFormation stack is deleted, all related resources will be deleted. Be very careful and understand the risks before using this option.
  - Retain: with this option, you guarantee that when the stack is deleted, the related resources will be kept.
  - Snapshot: used for resources that support snapshots, for example:
    - AWS::EC2::Volume
    - AWS::ElastiCache::CacheCluster
    - AWS::ElastiCache::ReplicationGroup
    - AWS::Neptune::DBCluster
    - AWS::RDS::DBCluster
    - AWS::RDS::DBInstance
    - AWS::Redshift::Cluster

In the Properties, we define the characteristics of the bucket:

- BucketName: the bucket name. Remember that the bucket name must be unique and must follow the naming standards in the documentation.
- AccessControl: the access control for the bucket; the options are Private, PublicRead, PublicReadWrite, AuthenticatedRead, LogDeliveryWrite, BucketOwnerRead, BucketOwnerFullControl, and AwsExecRead
- BucketEncryption: the encryption settings for the bucket's objects; in this case we use the AES256 algorithm

Uploading and creating the resource

1. In the AWS console, go to CloudFormation.
2. Click the Create Stack button.
3. Select Template is ready as the prerequisite.
4. In the Specify template section, select Upload a template file, choose the created file by clicking Choose file, and finally click the Next button. A new page will open for filling in the stack name.
5. Click Next and do the same on the following pages.
6. Finally, the resource will be created. This may take a few minutes depending on the resource.

Notice that two buckets were created:

- blog.data: created via CloudFormation
- cf-templates-1nwl4b3ve439n-us-east-1: bucket created automatically when uploading the template file at the beginning of the process
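If you prefer the command line over the console, the same stack can be created with the AWS CLI. The template file name and stack name below are hypothetical:

```
# Create the stack from the local template file (names are illustrative)
aws cloudformation create-stack \
  --stack-name blog-data-bucket \
  --template-body file://s3-bucket.yaml

# Follow the stack's progress
aws cloudformation describe-stacks --stack-name blog-data-bucket
```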
Creating the SQS template

```yaml
Resources:
  SQS:
    Type: 'AWS::SQS::Queue'
    Properties:
      QueueName: sqs-blog.fifo
      ContentBasedDeduplication: true
      DelaySeconds: 120
      FifoQueue: true
      MessageRetentionPeriod: 3600
```

Understanding the template:

- SQS: the resource identifier
- Type: the resource type
- QueueName: the SQS queue name. An important detail is the .fifo suffix, required if the queue is of the FIFO type.
- ContentBasedDeduplication: ensures messages are not duplicated; works only for FIFO queues
- DelaySeconds: the delay time for each message (in seconds)
- FifoQueue: how the queue manages the arrival and departure of messages (first in, first out)
- MessageRetentionPeriod: how long messages will be held in the queue (in seconds)

After uploading this template the same way, the SQS queue is created.

Conclusion

CloudFormation is an AWS-exclusive tool for resource creation, i.e. if your architecture is built or maintained on the AWS cloud, CloudFormation is a great choice. If you need to maintain flexibility between clouds, such as the ability to use Google Cloud, Terraform may be the better IaC option.

Well that's it, I hope you enjoyed it!




