- What Data Engineers Need to Know in 2024The Evolution of Data Engineering Data engineering has witnessed a transformative journey, evolving from simple data collection and storage to sophisticated processing and analysis. A historical overview reveals its roots in traditional database management, progressing through the advent of big data, to today's focus on real-time analytics and cloud computing. Recent advances have been catalyzed by the integration of artificial intelligence (AI) and machine learning (ML), pushing the boundaries of what's possible in data-driven decision-making. Core Skills for Data Engineers in 2024 What Data Engineers Need to Know in 2024? To thrive in 2024, data engineers must master a blend of foundational and cutting-edge skills: Programming Languages: Proficiency in languages like Python, Scala, and SQL is non-negotiable, enabling efficient data manipulation and analysis. Database Management: Understanding relational and NoSQL databases, alongside data warehousing solutions, forms the backbone of effective data storage strategies. Cloud Computing Platforms: Expertise in AWS, Google Cloud Platform, and Azure is crucial, as cloud services become central to data engineering projects. Data Modeling & ETL Processes: Developing robust data models and streamlining ETL (Extract, Transform, Load) processes are key to ensuring data quality and accessibility. Emerging Technologies and Their Impact Emerging technologies such as AI and ML, big data frameworks, and automation tools are redefining the landscape: Artificial Intelligence & Machine Learning: These technologies are vital for predictive modeling and advanced data analysis, offering unprecedented insights. Big Data Technologies: Hadoop, Spark, and Flink facilitate the handling of vast datasets, enabling scalable and efficient data processing. Automation and Orchestration Tools: Tools like Apache Airflow and Kubernetes enhance efficiency, automating workflows and data pipeline management. The Importance of Data Governance and Security With increasing data breaches and privacy concerns, data governance and security have become paramount: Regulatory Compliance: Familiarity with GDPR, CCPA, and other regulations is essential for legal compliance. Data Privacy Techniques: Implementing encryption, anonymization, and secure access controls protects sensitive information from unauthorized access. Data Engineering in the Cloud Era The shift towards cloud computing necessitates a deep understanding of cloud services and technologies: Cloud Service Providers: Navigating the offerings of major providers ensures optimal use of cloud resources. Cloud-native Technologies: Knowledge of containerization, microservices, and serverless computing is crucial for modern data engineering practices. Real-time Data Processing The ability to process and analyze data in real-time is becoming increasingly important: Streaming Data Technologies: Tools like Apache Kafka and Amazon Kinesis support high-throughput, low-latency data streams. Real-time Analytics: Techniques for real-time data analysis enable immediate insights, enhancing decision-making processes. Advanced Analytics and Business Intelligence Advanced analytics and BI tools are essential for converting data into actionable insights: Predictive Analytics: Using statistical models and machine learning to predict future trends and behaviors. Visualization Tools: Tools like Tableau and Power BI help in making complex data understandable through interactive visualizations. 
Career Pathways and Growth Opportunities Exploring certifications, training, and staying informed about industry demand prepares data engineers for career advancement: Certification and Training: Pursuing certifications in specific technologies or methodologies can bolster expertise and credibility. Industry Demand: Understanding the evolving market demand ensures data engineers can align their skills with future opportunities. Preparing for the Future Continuous learning and community engagement are key to staying relevant in the fast-paced field of data engineering: Continuous Learning: Embracing a mindset of lifelong learning ensures data engineers can adapt to new technologies and methodologies. Networking and Community Engagement: Participating in forums, attending conferences, and contributing to open-source projects fosters professional growth and innovation. Conclusion As data becomes increasingly central to every business, the role of data engineers in shaping the future of technology cannot be overstated. By mastering core skills, staying informed about emerging technologies, and emphasizing data governance and security, data engineers can lead the charge in leveraging data for strategic advantage in 2024 and beyond.
- Programming Language Trends for 2024: What Developers Need to KnowIn the ever-evolving landscape of technology, programming languages stand as the foundational tools empowering innovation, driving progress, and shaping the digital world we inhabit. As we venture into 2024, the significance of understanding and leveraging these languages has never been more pronounced. From powering artificial intelligence to enabling seamless web development, programming languages play a pivotal role in defining the trajectory of tech trends and driving transformative change across industries. In this era of rapid technological advancement, staying abreast of the latest programming languages is not merely advantageous; it's imperative. Developers, engineers, and tech enthusiasts alike must recognize the profound impact that mastering these languages can have on their ability to navigate and thrive in the dynamic tech landscape of 2024. Programming languages serve as the building blocks of innovation, providing developers with the means to translate ideas into tangible solutions. In 2024, familiarity with cutting-edge languages equips individuals with the tools needed to push the boundaries of what's possible, whether through developing AI-driven applications, crafting immersive virtual experiences, or architecting resilient software systems. With every technological advancement comes a myriad of opportunities waiting to be seized. Whether it's capitalizing on the burgeoning fields of data science, blockchain technology, or quantum computing, proficiency in the right programming languages positions individuals to harness these opportunities and carve out their niche in the digital landscape of 2024. In an increasingly competitive job market, proficiency in in-demand programming languages can be a game-changer for career advancement. Employers across industries are seeking skilled professionals capable of leveraging the latest tools and technologies to drive business success. By staying ahead of the curve and mastering emerging languages, individuals can enhance their employability and unlock a wealth of career opportunities. For this post, I decided to write about the programming language trends for 2024, and I hope it helps you make better decisions about which directions to follow this year in this broad field. Python Python continues to maintain its position as one of the most popular and versatile programming languages. With its simplicity, readability, and extensive ecosystem of libraries and frameworks, Python is widely used in fields such as data science, artificial intelligence, web development, and automation. In 2024, Python's relevance is further amplified by its adoption in emerging technologies like machine learning, quantum computing, and the metaverse. Rust Rust has been gaining traction as a systems programming language known for its performance, safety, and concurrency features. In 2024, Rust is increasingly used in critical systems development, including operating systems, game engines, and web browsers. Its emphasis on memory safety and zero-cost abstractions makes it particularly suitable for building secure and reliable software, making it a favored choice for projects demanding high performance and robustness. TypeScript TypeScript, a superset of JavaScript with static typing, continues to see widespread adoption in web development.
Its ability to catch errors at compile time, improve code maintainability, and enhance developer productivity has made it a preferred choice for building large-scale web applications. In 2024, TypeScript's popularity remains strong, driven by its integration with popular frameworks like Angular, React, and Vue.js, as well as its support for modern JavaScript features. Julia Julia, a high-level programming language designed for numerical and scientific computing, is gaining prominence in fields such as data science, computational biology, and finance. Known for its speed and ease of use, Julia combines the flexibility of dynamic languages with the performance of compiled languages, making it well-suited for tasks involving mathematical computations and large-scale data analysis. In 2024, Julia continues to attract researchers, engineers, and data scientists seeking efficient and expressive tools for scientific computing. Kotlin Kotlin, a statically typed programming language for the Java Virtual Machine (JVM), has emerged as a popular choice for Android app development. Offering modern features, interoperability with Java, and seamless integration with popular development tools, Kotlin enables developers to build robust and efficient Android applications. In 2024, Kotlin's adoption in the Android ecosystem remains strong, driven by its developer-friendly syntax, strong tooling support, and endorsement by Google as a preferred language for Android development. Golang (Go) Go, often referred to as Golang, continues to gain traction as a language for building scalable and efficient software systems. Known for its simplicity, performance, and built-in concurrency support, Go is well-suited for developing cloud-native applications, microservices, and distributed systems. In 2024, Go's popularity is fueled by its role in enabling the development of resilient and high-performance software architectures, particularly in cloud computing, DevOps, and container orchestration. What programming languages do big tech companies use? Below is an overview of the programming languages that the main big tech companies use in their stacks, so if you want to work at a big tech company, get ready to learn these languages. Conclusion In 2024, the programming landscape is characterized by a diverse set of languages, each catering to specific use cases and development requirements. From Python's versatility to Rust's performance, TypeScript's productivity to Julia's scientific computing capabilities, Kotlin's Android development to Go's system-level programming, developers have a rich array of tools at their disposal to tackle the challenges and opportunities presented by emerging technologies and industry trends. Whether building AI-powered applications, crafting scalable web services, or optimizing system performance, the choice of programming language plays a crucial role in shaping the success and impact of software projects in the dynamic tech landscape of 2024.
- Exploring the Power of Virtual Threads in Java 21Introduction to Virtual Threads in Java 21 Concurrency has always been a cornerstone of Java programming, empowering developers to build responsive and scalable applications. However, managing threads efficiently while ensuring high performance and low resource consumption has been a perennial challenge. With the release of Java 21, a groundbreaking feature called Virtual Threads emerges as a game-changer in the world of concurrent programming. Concurrency challenges in Java and the problem with traditional threads Concurrency in Java presents developers with both immense opportunities for performance optimization and formidable challenges in ensuring thread safety and managing shared resources effectively. As applications scale and become more complex, navigating these challenges becomes increasingly crucial. Managing Shared Resources: One of the fundamental challenges in concurrent programming is managing shared resources among multiple threads. Without proper synchronization mechanisms, concurrent access to shared data can lead to data corruption and inconsistencies. Avoiding Deadlocks: Deadlocks occur when two or more threads are blocked indefinitely, waiting for each other to release resources. Identifying and preventing deadlocks is crucial for maintaining the responsiveness and stability of concurrent applications. Performance Bottlenecks: While concurrency can improve performance by leveraging multiple threads, it can also introduce overhead and contention, leading to performance bottlenecks. It's essential to carefully design concurrent algorithms and use appropriate synchronization mechanisms to minimize contention and maximize throughput. High Memory Overhead: Traditional threads in Java are implemented as native threads managed by the operating system. Each native thread consumes a significant amount of memory, typically in the range of several megabytes. This overhead becomes problematic when an application needs to create a large number of threads, as it can quickly deplete system resources. Limited Scalability: The one-to-one mapping between Java threads and native threads imposes a limit on scalability. As the number of threads increases, so does the memory overhead and the scheduling complexity. This limits the number of concurrent tasks an application can handle efficiently, hindering its scalability and responsiveness. Difficulty in Debugging and Profiling: Debugging and profiling concurrent applications built with traditional threads can be challenging due to the non-deterministic nature of thread execution and the potential for subtle timing-related bugs. Identifying and diagnosing issues such as race conditions and thread contention requires specialized tools and expertise. What are Virtual Threads? Virtual Threads represent a paradigm shift in how Java handles concurrency. Traditionally, Java applications rely on OS-level threads, which are heavyweight entities managed by the operating system. Each thread consumes significant memory resources, limiting scalability and imposing overhead on the system. Virtual Threads, on the other hand, are lightweight and managed by the Java Virtual Machine (JVM) itself. They are designed to be highly efficient, allowing thousands or even millions of virtual threads to be created without exhausting system resources. Virtual Threads offer a more scalable and responsive concurrency model compared to traditional threads. 
Benefits of Virtual Threads Virtual Threads come with a host of features and benefits that make them an attractive choice for modern Java applications: Lightweight: Virtual Threads have minimal memory overhead, allowing for the creation of large numbers of threads without exhausting system resources. This lightweight nature makes them ideal for highly concurrent applications. Structured Concurrency: Virtual Threads pair naturally with structured concurrency (a preview API in Java 21), which helps developers write more reliable and maintainable concurrent code. By enforcing clear boundaries and lifecycles for concurrent tasks, structured concurrency simplifies error handling and resource management. Improved Scalability: With Virtual Threads, developers can achieve higher scalability and throughput compared to traditional threads. The JVM's scheduler efficiently manages virtual threads, ensuring optimal utilization of system resources. Integration with CompletableFuture: Virtual Threads work seamlessly with CompletableFuture, simplifying asynchronous programming. CompletableFuture provides a fluent API for composing and chaining asynchronous tasks, making it easier to write non-blocking, responsive applications. Examples of Virtual Threads (a condensed code sketch of these examples appears at the end of this post) Creating and Running a Virtual Thread This example demonstrates the creation and execution of a virtual thread. We use the Thread.startVirtualThread() method to start a new virtual thread with the specified task, which prints a message indicating its execution. We then call join() on the virtual thread to wait for its completion before proceeding. CompletableFuture with Virtual Threads This example showcases the usage of virtual threads with CompletableFuture. We chain asynchronous tasks using the supplyAsync(), thenApplyAsync(), and thenAcceptAsync() methods, passing an executor backed by virtual threads. These tasks execute in virtual threads, allowing for efficient asynchronous processing. Virtual Thread Executor Example In this example, we create an executor using Executors.newVirtualThreadPerTaskExecutor(). We then submit tasks to it using the submit() method. Each task executes in its own new virtual thread (virtual threads are cheap, so they are not pooled), demonstrating efficient concurrency management. Using ThreadFactory with Virtual Threads Here, we demonstrate the use of a ThreadFactory with virtual threads. We create a virtual thread factory using Thread.ofVirtual().factory(), and then pass it to Executors.newThreadPerTaskExecutor(). Tasks submitted to this executor run in virtual threads created by the virtual thread factory. Virtual Threads and Thread Groups In this final example, we inspect the thread group of a virtual thread. Java 21 does not allow placing virtual threads in custom thread groups: every virtual thread belongs to a single placeholder group, which the task can observe by calling Thread.currentThread().getThreadGroup() while printing a message indicating its execution. Conclusion In conclusion, Virtual Threads introduced in Java 21 mark a significant milestone in the evolution of Java's concurrency model. By providing lightweight, scalable concurrency within the JVM, Virtual Threads address many of the limitations associated with traditional threads, offering developers a more efficient and flexible approach to concurrent programming. With Virtual Threads, developers can create and manage thousands or even millions of threads with minimal overhead, leading to improved scalability and responsiveness in Java applications.
The structured concurrency model enabled by Virtual Threads simplifies error handling and resource management, making it easier to write reliable and maintainable concurrent code. Furthermore, the integration of Virtual Threads with CompletableFuture and other asynchronous programming constructs enables developers to leverage the full power of Java's concurrency framework while benefiting from the performance advantages of Virtual Threads. Overall, Virtual Threads in Java 21 represent a significant advancement that empowers developers to build highly concurrent and responsive applications with greater efficiency and scalability. As developers continue to explore and adopt Virtual Threads, we can expect to see further optimizations and enhancements that will further elevate Java's capabilities in concurrent programming.
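For reference, here is a condensed sketch of the examples described above, using the Java 21 standard APIs (Thread.startVirtualThread, Executors.newVirtualThreadPerTaskExecutor, Thread.ofVirtual().factory()). The task bodies, names and ordering are illustrative assumptions, not the post's original listings.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class VirtualThreadExamples {
    public static void main(String[] args) throws InterruptedException {
        // 1. Creating and running a virtual thread, then waiting for it with join()
        Thread vt = Thread.startVirtualThread(() -> System.out.println("Running in a virtual thread"));
        vt.join();

        // 2. CompletableFuture whose stages run on virtual threads
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            CompletableFuture.supplyAsync(() -> "email payload", executor)
                    .thenApplyAsync(String::toUpperCase, executor)
                    .thenAcceptAsync(System.out::println, executor)
                    .join();

            // 3. Submitting several tasks: each one gets its own new virtual thread
            for (int i = 0; i < 5; i++) {
                int id = i;
                executor.submit(() -> System.out.println("Task " + id + " on " + Thread.currentThread()));
            }
        } // close() waits for the submitted tasks to finish

        // 4. ThreadFactory that produces named virtual threads
        ThreadFactory factory = Thread.ofVirtual().name("worker-", 0).factory();
        try (ExecutorService perTask = Executors.newThreadPerTaskExecutor(factory)) {
            perTask.submit(() -> System.out.println("Running on " + Thread.currentThread().getName()));
        }

        // 5. Thread group note: every virtual thread reports the same placeholder group
        Thread.startVirtualThread(() ->
                System.out.println("Group: " + Thread.currentThread().getThreadGroup().getName())).join();
    }
}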
- Listing AWS Glue tablesUsing an AWS SDK is always a good option if you need to explore some feature further in search of a solution. In this post, we're going to explore some of AWS Glue using the SDK and Java. Glue is an AWS ETL tool that provides a central repository of metadata, called the Glue Catalog. In short, the Glue Catalog keeps the entire structure of databases and tables and their schemas in a single place. The idea of this post is to programmatically list all the tables of a given database in the Glue Catalog using the SDK and Java. Maven dependencies In this example, we're using Java 8 to better explore the use of Streams in the iteration. Understanding the code The awsGlue object is responsible for accessing the service through the credentials that must be configured; in this post we will not go into that detail. The getTablesRequest object is responsible for setting the request parameters, in this case, the database. The getTablesResult object is responsible for listing the tables based on the parameters set by the getTablesRequest object and also for controlling the result flow. Note that in addition to returning the tables through the getTablesResult.getTableList() method, this same object returns a token that will be explained further in the next item. The token is represented by the getTablesResult.getNextToken() method. The idea of the token is to control the flow of results: all results are paged, and if a token comes back with a result, it means that there is still data to be returned. In the code, we use a loop based on validating the existence of the token. So, if there is still a token, it will be set in the getTablesRequest object through getTablesRequest.setNextToken(token) to return more results. It's a way to paginate results. A minimal code sketch of this listing loop appears at the end of this post. Books to study and read If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud platform in the world today, so if you want to understand the subject better and be well positioned in the market, I strongly recommend studying it. Setup recommendations If you're interested in the setup I use to develop my tutorials, here it is: Notebook Dell Inspiron 15 15.6 Monitor LG Ultrawide 29WL500-29 Well that's it, I hope you enjoyed it!
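As mentioned above, here is a minimal sketch of the listing loop, assuming the AWS SDK for Java v1 (aws-java-sdk-glue) and a hypothetical database name; it illustrates the pagination described in this post rather than reproducing the original listing.
import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.GetTablesRequest;
import com.amazonaws.services.glue.model.GetTablesResult;
import com.amazonaws.services.glue.model.Table;

public class GlueTableLister {
    public static void main(String[] args) {
        // Credentials are resolved through the default provider chain (not covered in this post)
        AWSGlue awsGlue = AWSGlueClientBuilder.defaultClient();
        GetTablesRequest getTablesRequest = new GetTablesRequest()
                .withDatabaseName("my_database"); // hypothetical database name

        String token = null;
        do {
            getTablesRequest.setNextToken(token);
            GetTablesResult getTablesResult = awsGlue.getTables(getTablesRequest);
            getTablesResult.getTableList().stream()   // Java 8 Streams, as mentioned above
                    .map(Table::getName)
                    .forEach(System.out::println);
            token = getTablesResult.getNextToken();   // null when there are no more pages
        } while (token != null);
    }
}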
- Creating AWS CloudWatch alarmsThe use of alarms is an essential requirement when working with various resources in the cloud. It is one of the most efficient ways to monitor and understand the behavior of an application when its metrics differ from what is expected. In this post, we're going to create an alarm from scratch using AWS CloudWatch based on a specific scenario. There are several other tools that allow us to set up alarms, but when working with AWS, setting alarms using CloudWatch is very simple and fast. Use Case Scenario For a better understanding, suppose we create a resiliency mechanism in an architecture to prevent data losses. This mechanism acts whenever something goes wrong, such as components not working as expected and sending failure messages to an SQS queue. CloudWatch allows us to set an alarm so that, when a message is sent to this queue, the alarm is triggered. First of all, we need to create a queue and send messages just to generate some metrics that we're going to use in our alarm. That's a way to simulate a production environment. After the queue and alarm are created, we'll send more messages to test the alarm. Creating a SQS Queue Let's create a simple SQS queue and choose some metrics that we can use in our alarm. Access the AWS console, type "sqs" in the search bar as shown in the image below and then access the service. After accessing the service, click Create queue. Let's create a Standard queue for this example and name it sqs-messages. You don't need to pay attention to the other details, just click on the Create queue button to finish it. The queue has been created; as the next step, we'll send a few messages just to generate metrics. Sending messages Let's send a few messages to the previously created queue; feel free to change the message content if you want to. After sending these messages, some metrics will be generated automatically according to the action. In this case, a metric called NumberOfMessagesSent was created in CloudWatch and we can use it to create the alarm. Creating an Alarm For our example, let's choose the metric based on the number of messages sent (NumberOfMessagesSent). Access AWS via the console and search for CloudWatch in the search bar, as shown in the image below. After accessing the service, click on the In Alarms/In alarm option in the left corner of the screen and then click on the Create alarm button. Select the metric according to the screen below: choose SQS, then click Queue Metrics. Search for the queue name and select the metric name column item labeled NumberOfMessagesSent, then click Select Metric. Setting metrics Metric name: the metric chosen in the previous steps. This metric measures the number of messages sent to the SQS queue (NumberOfMessagesSent). QueueName: Name of the SQS queue for which the alarm will be configured. Statistic: In this field we can choose options such as Average, Sum, Minimum and more. This will depend on the context in which you need to configure the alarm and the metric. For this example we choose Sum, because we want the sum of the number of messages sent in a given period. Period: In this field we define the period over which the metric is evaluated against the threshold condition, which will be defined in the next steps. Setting conditions Threshold type: For this example we will use Static. Whenever NumberOfMessagesSent is...: Let's select the Greater option. Than...: In this field we configure the value of NumberOfMessagesSent that triggers the alarm. Let's put 5.
Additional configuration For additional configuration, we have the datapoints to alarm field, whose operation I would like to detail a little more. Datapoints to alarm This additional option makes the alarm configuration more flexible, combined with the previously defined conditions. By default, this setting is 1 of 1. How does it work? The first field refers to the number of datapoints and the second one refers to the number of evaluation periods. Keeping the previous settings combined with the default additional settings means that the alarm will be triggered if the NumberOfMessagesSent metric exceeds the sum of 5 in a period of 5 minutes. So far, the default additional configuration does not change the previously defined settings; nothing changes. Now, let's change this setting to understand it better. Let's change from 1 of 1 to 2 of 2. This means that the alarm will only be triggered when the alarm condition, i.e. the sum of the NumberOfMessagesSent metric greater than 5, is met for 2 datapoints within 10 minutes. Note that the period was multiplied because of the second field with the value 2. Summarizing, even if the condition is met, the alarm will only be triggered if there are 2 datapoints above the threshold in a period of 10 minutes. We will understand it even better when we carry out some alarm activation tests. Let's keep the following settings and click Next. Configuring actions On the next screen, we're going to configure the actions responsible for notifying a destination if an alarm is triggered. On this screen, we're going to keep the In alarm setting, then create a new topic and, finally, add the email at which we want to receive error notifications. Select the option Create new topic, fill in a desired name and then enter a valid email in the field Email endpoints that will receive notification ... Once completed, click Create topic and then an email will be sent to confirm the subscription to the created topic. Make sure you've received the confirmation email and click Next on the alarm screen to proceed with the creation. Now, we need to add the name of the alarm in the screen below and then click on Next. The next screen will be the review screen; click on Create alarm to finish it. Okay, now we have an alarm created and it's time to test it. Alarm Testing In the beginning we sent a few messages just to generate the NumberOfMessagesSent metric, but at this point we need to send more messages that will trigger the alarm. Thus, let's send more messages and see what happens. After sending the messages, notice that even though the threshold was exceeded, the alarm was not triggered. This is because the threshold was only crossed for 1 datapoint within the 10-minute window. Now, let's send continuous messages that exceed the threshold in short periods within the 10-minute window. Note that in the image above the alarm was triggered because, in addition to meeting the condition specified in the settings, it also reached the 2 datapoints. Check the email added in the notification settings; a message with the alarm details should have been sent. The alarm status will return to OK once messages no longer exceed the threshold. Books to study and read If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges.
It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud platform in the world today, so if you want to understand the subject better and be well positioned in the market, I strongly recommend studying it. Well that's it, I hope you enjoyed it!
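As a complement to the console walkthrough above, here is a hedged sketch of the same alarm created programmatically with the AWS SDK for Java v1. The alarm name and the SNS topic ARN are placeholders, and the values mirror the settings described in this post (Sum over a 5-minute period, threshold greater than 5, 2 of 2 datapoints).
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class CreateSqsAlarm {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        PutMetricAlarmRequest request = new PutMetricAlarmRequest()
                .withAlarmName("sqs-messages-alarm")                 // placeholder alarm name
                .withNamespace("AWS/SQS")
                .withMetricName("NumberOfMessagesSent")
                .withDimensions(new Dimension().withName("QueueName").withValue("sqs-messages"))
                .withStatistic(Statistic.Sum)                        // sum of messages sent
                .withPeriod(300)                                     // 5-minute period, in seconds
                .withEvaluationPeriods(2)                            // "2 of 2" datapoints, as explained above
                .withDatapointsToAlarm(2)
                .withThreshold(5.0)                                  // greater than 5 messages
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:us-east-1:123456789012:alarm-topic"); // placeholder SNS topic ARN

        cloudWatch.putMetricAlarm(request);
        System.out.println("Alarm created");
    }
}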
- First steps with Delta LakeWhat's Delta Lake? Delta Lake is an open-source project that manages the storage layer in your data lake. In practice, it's an abstraction on top of Apache Spark that reuses the same mechanisms while offering extra features such as ACID transaction support. Everyone knows that keeping data integrity in a data pipeline is a critical task in the face of high read and write concurrency. Delta Lake provides audit history, data versioning and supports DML operations such as deletes, updates and merges. For this tutorial, we're going to simulate a data pipeline locally, focusing on Delta Lake's advantages. First, we'll load a Spark Dataframe from a JSON file, create a temporary view and then a Delta table on which we'll perform some Delta operations. Finally, we'll use Java as the programming language and Maven as the dependency manager, along with Spark and Hive to keep our data catalog. Maven dependencies: org.apache.spark:spark-core_2.12:3.0.1, org.apache.spark:spark-sql_2.12:3.0.1, org.apache.spark:spark-hive_2.12:3.0.1 and io.delta:delta-core_2.12:0.8.0. The code will be developed in short snippets for a better understanding. Setting Spark with Delta and Hive
String val_ext = "io.delta.sql.DeltaSparkSessionExtension";
String val_ctl = "org.apache.spark.sql.delta.catalog.DeltaCatalog";
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("app");
sparkConf.setMaster("local[1]");
sparkConf.set("spark.sql.extensions", val_ext);
sparkConf.set("spark.sql.catalog.spark_catalog", val_ctl);
SparkSession sparkSession = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .getOrCreate();
Understanding the code above We define two variables, val_ext and val_ctl, holding the values assigned to the keys spark.sql.extensions and spark.sql.catalog.spark_catalog. These are necessary for configuring Delta together with Spark. We name the Spark application app. Since we are not running Spark on a cluster, the master is configured to run locally with local[1]. Spark supports Hive; in this case we enable it with enableHiveSupport(). Data Ingest Let's work with a Spark Dataframe as the data source. We load the Dataframe from a JSON file. order.json file
{"id":1, "date_order": "2021-01-23", "customer": "Jerry", "product": "BigMac", "unit": 1, "price": 8.00}
{"id":2, "date_order": "2021-01-22", "customer": "Olivia", "product": "Cheese Burguer", "unit": 3, "price": 21.60}
{"id":3, "date_order": "2021-01-21", "customer": "Monica", "product": "Quarter", "unit": 2, "price": 12.40}
{"id":4, "date_order": "2021-01-23", "customer": "Monica", "product": "McDouble", "unit": 2, "price": 13.00}
{"id":5, "date_order": "2021-01-23", "customer": "Suzie", "product": "Double Cheese", "unit": 2, "price": 12.00}
{"id":6, "date_order": "2021-01-25", "customer": "Liv", "product": "Hamburger", "unit": 1, "price": 2.00}
{"id":7, "date_order": "2021-01-25", "customer": "Paul", "product": "McChicken", "unit": 1, "price": 2.40}
Creating a Dataframe
Dataset<Row> df = sparkSession.read().json("datasource/");
df.createOrReplaceGlobalTempView("order_view");
Understanding the code above In the previous section, we create a Dataframe from the JSON file inside the datasource/ directory. Create this directory so that the structure of your code is easier to follow, and then create the order.json file based on the content shown earlier. Finally, we create a global temporary view that will help us in the next steps. Creating a Delta Table Let's create the Delta table from an SQL script. At first the creation is simple, but notice that we use types that differ from those of a table in a relational database.
For example, we use STRING instead of VARCHAR and so on. We are partitioning the table by the date_order field. This field was chosen as a partition because we believe there will be different dates. In this way, queries can use this field as a filter, aiming at better performance. And finally, we define the table as a Delta table through the USING DELTA clause.
String statement = "CREATE OR REPLACE TABLE orders (" +
    "id STRING, " +
    "date_order STRING," +
    "customer STRING," +
    "product STRING," +
    "unit INTEGER," +
    "price DOUBLE) " +
    "USING DELTA " +
    "PARTITIONED BY (date_order) ";
sparkSession.sql(statement);
Understanding the code above In the previous section we define a Delta table called orders and then execute the creation. DML Operations Delta supports Delete, Update and Insert operations, as well as Merge. Using Merge together with Insert and Update In this step, we are going to execute a Merge, which makes it possible to control the flow of inserting and updating data through a table, Dataframe or view. Merge works from row matches, which will become clearer in the next section.
String mergeStatement = "Merge into orders " +
    "using global_temp.order_view as orders_view " +
    "ON orders.id = orders_view.id " +
    "WHEN MATCHED THEN " +
    "UPDATE SET orders.product = orders_view.product," +
    "orders.price = orders_view.price " +
    "WHEN NOT MATCHED THEN INSERT * ";
sparkSession.sql(mergeStatement);
Understanding the code above In the snippet above we execute the Merge operation using the order_view view created in the previous steps. In the same statement we have the condition orders.id = orders_view.id that drives the matches. If the condition is true, that is, MATCHED, the data will be updated. Otherwise, NOT MATCHED, the data will be inserted. In the case above, the data will be inserted, because until then there was no data in the orders table. Run the command below to view the inserted data.
sparkSession.sql("select * from orders").show();
Update the datasource/order.json file by changing the product and price fields and run all the snippets again. You will see that all records will be updated. Update operation It is possible to run an Update without using Merge; just run the command below:
String updateStatement = "update orders " +
    "set product = 'Milk-Shake' " +
    "where id = 2";
sparkSession.sql(updateStatement);
Delete operation
String deleteStatement = "delete from orders where id = 2";
sparkSession.sql(deleteStatement);
In addition to being able to execute the Delete command on its own, it is possible to use it with Merge. Understanding the Delta Lake Transaction Log (DeltaLog) In addition to supporting ACID transactions, Delta generates some JSON files that serve as a way to audit and maintain the history of each transaction, covering DDL and DML commands. With this mechanism it is even possible to go back to a specific state of the table if necessary. For each executed transaction a JSON file is created inside the _delta_log folder. The initial file will always be 000000000.json, containing the transaction commits. In our scenario, this first file contains the commits for creating the orders table. For a better view, go to the local folder that was probably created in the root directory of your project, called spark-warehouse. This folder was created by Hive to hold resources created from JSON and parquet files. Inside it you will find a folder structure as shown below. Note that the files are created in ascending order from each executed transaction.
Access each JSON file and you will see, through the operation field, each transaction that was executed, in addition to other information:
00000000000000000000.json ("operation":"CREATE OR REPLACE TABLE")
00000000000000000001.json ("operation":"MERGE")
00000000000000000002.json ("operation":"UPDATE")
00000000000000000003.json ("operation":"DELETE")
Also note that the parquet files were generated partitioned into folders by the date_order field. Hope you enjoyed!
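To illustrate the "go back to a specific state of the table" idea mentioned above, here is a small, hedged sketch of Delta time travel using the same sparkSession from this tutorial; the version number and the spark-warehouse/orders path are assumptions based on this local setup.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the orders table as it was at a previous version (time travel)
Dataset<Row> ordersAtVersion1 = sparkSession.read()
    .format("delta")
    .option("versionAsOf", 1)              // e.g. the state right after the MERGE commit
    .load("spark-warehouse/orders");       // local path created by Hive/Spark in this tutorial
ordersAtVersion1.show();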
- Using Comparator.comparing to sort Java StreamIntroduction Sorting data is a common task in many software development projects. When working with collections of objects in Java, a powerful and flexible approach to sorting is to use Comparator.comparing in conjunction with Streams. In this post, we are going to show that using Comparator.comparing to sort a Java Stream can make sorting elegant and efficient. What is Comparator.comparing? Comparator.comparing is a feature introduced in Java 8 as part of the java.util.Comparator interface. It is a static method that allows you to specify a key extractor function (sort key) to compare objects. This function extracts a value from each object, and that value is compared during sorting. Flexibility in sorting with Comparator.comparing One of the main advantages of Comparator.comparing is its flexibility. With it, we can sort by different fields of an object, allowing the creation of complex sorting logic in a simple and concise way. Notice in the code below (a minimal sketch appears at the end of this post) that in the sorted() method we simply pass Comparator.comparing as an argument, which in turn receives the city field through a method reference (People::getCity), performing the sort by this field. Output Monica John Mary Anthony Seth Multi-criteria ordering Often, it is necessary to perform sorting based on multiple criteria. This is easily achieved with Comparator.comparing by simply chaining several comparing methods, each specifying a different criterion. Java will carry out the ordering according to the specified sequence. For example, we can sort the same list by city and then by name: Comparator.comparing(People::getCity).thenComparing(People::getName). Ascending and descending sort Another important advantage of Comparator.comparing is the ability to perform sorting in both ascending and descending order. To do this, just chain the reversed() method as in the code below: Output Seth Mary John Anthony Monica Efficiency and simplicity By using Comparator.comparing in conjunction with Streams, sorting becomes more efficient and elegant. The combination of these features allows you to write clean code that is easy to read and maintain. Furthermore, Java internally optimizes sorting using efficient algorithms, resulting in satisfactory performance even for large datasets. Final conclusion Comparator.comparing is a powerful tool for sorting Streams in Java. Its flexibility, ascending and descending sorting capabilities, support for multiple criteria, and efficient execution make it a valuable choice for any Java developer. By taking advantage of it, we can obtain more concise, less verbose and efficient code, facilitating the manipulation of objects in a Stream. Hope you enjoyed!
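For reference, here is the minimal sketch mentioned above; the People class and the sample data are assumptions for illustration (the original post's dataset is not shown here), so the printed order depends on the data you use.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ComparatorComparingExample {

    static class People {
        private final String name;
        private final String city;

        People(String name, String city) {
            this.name = name;
            this.city = city;
        }

        public String getName() { return name; }
        public String getCity() { return city; }
    }

    public static void main(String[] args) {
        List<People> people = Arrays.asList(
                new People("Monica", "Austin"),
                new People("John", "Boston"),
                new People("Mary", "Boston"),
                new People("Anthony", "Chicago"),
                new People("Seth", "Denver"));

        // Sort by city (ascending) and print the names
        people.stream()
                .sorted(Comparator.comparing(People::getCity))
                .map(People::getName)
                .forEach(System.out::println);

        // Multi-criteria: sort by city, then by name
        people.stream()
                .sorted(Comparator.comparing(People::getCity).thenComparing(People::getName))
                .map(People::getName)
                .forEach(System.out::println);

        // Descending order: chain reversed()
        people.stream()
                .sorted(Comparator.comparing(People::getCity).reversed())
                .map(People::getName)
                .forEach(System.out::println);
    }
}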
- Applying Change Data Feed for auditing on Delta tablesWhat is the Change Data Feed? Change Data Feed is a Delta Lake feature as of version 2.0.0 that allows tracking at row levels in Delta tables, changes such as DML operations (Merge, Delete or Update), data versions and the timestamp of when the change happened. The process maps Merge, Delete and Update operations, maintaining the history of changes at line level, that is, each event suffered in a record, Delta through the Change Data Feed manages to register as a kind of audit . Of course it is possible to use it for different use cases, the possibilities are extensive. How it works in practice Applying Change Data Feed for Delta tables is an interesting way to handle with row level records and for this post we will show how it works. We will perform some operations to explore more about the power of the Change Data Feed. We will work with the following Dataset: Creating the Spark Session and configuring some Delta parameters From now on, we'll create the code in chunks for easy understanding. In the code below we are creating the method responsible for maintaining the Spark session and configuring some parameters for Delta to work. Loading the Dataset Let's load the Dataset and create a temporary view to be used in our pipeline later. Creating the Delta Table Now we will create the Delta table already configuring Change Data Feed in the table properties and all the metadata will be based on the previously presented Dataset. Note that we're using the following parameter in the property delta.enableChangeDataFeed = true for activating the Change Data Feed. Performing a Data Merge Now we'll perform a simple Merge operation so that the Change Data Feed can register it as a change in our table. See what Merge uses in our previously created global_temp.raw_product view to upsert the data. Auditing the table Now that the Merge has been executed, let's perform a read on our table to understand what happened and how the Change Data Feed works. Notice that we're passing the following parameters: 1. readChangeFeed where required for using the Change Data Feed. 2. startingVersion is the parameter responsible for restricting which version we want it to be displayed from. Result after execution: See that in addition to the columns defined when creating the table, we have 3 new columns managed by the Change Data Feed. 1. _change_type: Column containing values according to each operation performed as insert, update_preimage , update_postimage, delete 2. _commit_version: Change version 3. _commit_timestamp: Timestamp representing the change date In the above result, the result of the upsert was a simple insert, as it didn't contain all the possible conditions of an update. Deleting a record In this step we will do a simple delete in a table record, just to validate how the Change Data Feed will behave. Auditing the table (again) Note below that after deleting record with id 6, we now have a new record created as delete in the table and its version incremented to 2. Another point is that the original record was maintained, but with the old version. Updating a record Now as a last test, we will update a record to understand again the behavior of the Change Data Feed. Auditing the table (last time) Now as a last test, we run a simple update on a record to understand how it will behave. Notice that 2 new values have been added/updated in the _change_type column. 
The update_postimage value is the value after the update was performed. This time, the old record keeps the same version as the new one in the _commit_version column, because both rows belong to the same update operation: the row whose _change_type column is update_preimage holds the value before the change. Conclusion The Change Data Feed is a great resource for understanding the behavior of your data pipeline and also a way to audit records in order to better understand the operations performed on them. According to the Delta team itself, keeping this feature enabled does not generate any significant overhead. It's a feature that can be fully adopted in your data strategy, as it brings several benefits, as shown in this post. Repository GitHub Hope you enjoyed!
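For reference, here is a minimal sketch of the Change Data Feed steps described above; the table name, columns and starting version are assumptions based on the post's description, not its original code.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ChangeDataFeedExample {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .appName("cdf-example")
                .master("local[1]")
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate();

        // Create the Delta table with the Change Data Feed enabled (assumed schema)
        sparkSession.sql("CREATE TABLE IF NOT EXISTS product (id INT, name STRING, price DOUBLE) "
                + "USING DELTA TBLPROPERTIES (delta.enableChangeDataFeed = true)");

        // ... the MERGE, DELETE and UPDATE operations described in the post happen here ...

        // Audit the table: read the change feed from the first version onwards
        Dataset<Row> changes = sparkSession.read()
                .format("delta")
                .option("readChangeFeed", "true")
                .option("startingVersion", 0)
                .table("product");

        // _change_type, _commit_version and _commit_timestamp are added by the Change Data Feed
        changes.show();
    }
}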
- Understanding Java Record Class in 2 minutesIntroduction First previewed in Java 14 and released as final in Java 16 through JEP 395, the Record class is an alternative to working with classes in Java. Record classes were designed to eliminate the verbosity involved when you need to create a class and its components, such as: canonical constructors, public accessor methods, the equals and hashCode implementations and the toString method. Using Record classes, it is no longer necessary to declare the items above, helping the developer focus on other tasks. Let's understand it better in practice. Let's create a Java class called User and add some fields and methods. Note that for a simple class with 4 fields, we create a constructor, public accessor methods, implement the equals and hashCode methods and finally, the toString method. It works well, but we could avoid the complexity and create less verbose code. In that case, we can use a Record class instead of the User class above (a minimal sketch appears at the end of this post). User Record Class The difference between a Record and a traditional Java class is remarkable. Note that it isn't necessary to declare the fields in the body, create the accessor methods or implement any other method. When a Record class is created, the public accessor methods are generated implicitly, the equals, hashCode and toString implementations are also created automatically, and it is not necessary to implement them explicitly. Finally, the components are stored as private final fields with the same names. Output Disadvantages A Record class behaves like a common Java class, but the difference is that you can't work with inheritance. You can't extend another class, only implement one or more interfaces. Another point is that it's not possible to declare additional instance fields beyond the record's components; only static fields are allowed. Final conclusion Record classes are a great approach for anyone looking for less verbose code or who needs agility in implementing models. Despite the limitation of not being able to extend other classes, it's a limitation that doesn't affect its use in general. Hope you enjoyed!
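For reference, here is the minimal sketch mentioned above; the fields of the User record are assumptions for illustration, since the post's original listing is not shown here.
public record User(String name, String email, String country, Integer age) {

    public static void main(String[] args) {
        // The canonical constructor, accessors, equals, hashCode and toString are all generated
        User user = new User("John", "john@email.com", "USA", 35);
        System.out.println(user.name());   // generated accessor (no "get" prefix)
        System.out.println(user);          // User[name=John, email=john@email.com, country=USA, age=35]
    }
}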
- Getting started with Java Reflection in 2 minutesIntroduction Java Reflection is a powerful API that allows a Java program to examine and manipulate information about its own classes at runtime. With Reflection, you can get information about a class's fields, methods, and constructors, and access and modify those elements even if they're private. In this post we're going to write some Java codes exploring some of the facilities of using Reflection and when to apply it in your projects. Bank Class We'll create a simple class called Bank, where we'll create some fields, methods and constructors to be explored using Reflection. Accessing the fields of the Bank class With the Bank class created, let's explore via Reflection the listing of all fields of the class through the getDeclaredFields method of the Class class. Note that through the static method Class.forName, we pass a string with the name of the class we want to explore via Reflection as a parameter. Output Field name: code Field type: class java.lang.Integer ************ Field name: nameOfBank Field type: class java.lang.String ************ Field name: amountOfDepositedMoney Field type: class java.lang.Double ************ Field name: totalOfCustomers Field type: class java.lang.Integer ************ Accessing the methods of the Bank class Through the getDeclaredMethods method, we can retrieve all methods of the Bank class. Output Method name: doDeposit Method type: class java.lang.String ************ Method name: doWithDraw Method type: class java.lang.String ************ Method name: getReceipt Method type: class java.lang.String ************ Creating objects With the use of Reflection to create objects, it is necessary to create them through a constructor. In this case, we must first invoke a constructor to create the object. The detail is that to retrieve this constructor, we must pay attention to the types of parameters that make up the constructor and the order in which they are declared. This makes it flexible to retrieve different constructors with different parameter numbers and type in a class. Notice below that it was necessary to create an array of type Class assigning different types according to the composition of the constructor that we will use to create our object. In this scenario, it will be necessary to invoke the method class.getConstructor(argType) passing the previously created array as an argument. This way, we will have a constructor object that will be used in the creation of our object. Finally, we create a new array of type Object assigning the values that will compose our object following the order defined in the constructor and then just invoke the method constructor.newInstance(argumentsValue) passing the array as a parameter returning the object we want to create. Output Bank{code=1, nameOfBank='Bank of America', amountOfDepositedMoney=1.5, totalOfCustomers=2500} Invoking methods To invoke a method through Reflection is quite simple as shown in the code below. Note that it is necessary to pass as a parameter in the method cls.getMethod("doDeposit", argumentsType) the explicit name of the method, in this case "doDeposit" and in the second parameter, an array representing the type of data used in the parameter of the method doDeposit( double amount), in this case a parameter of type double. Finally, invoke the method method.invoke passing at the first parameter the object referencing the class, in this case an object of type Bank. 
And as the second parameter, the value that will be executed in the method. Output 145.85 of money has been deposited Conclusion Using Reflection is a good strategy when you need flexibility in exploring different classes and their methods without the need to instantiate objects. Normally, Reflection is used in specific components of an architecture, but it does not prevent it from being used in different scenarios. From the examples shown above, you can see infinite scenarios of its application and the advantages of its use. Hope you enjoyed! 
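For reference, here is a condensed, hedged sketch consistent with the outputs shown above; the Bank class body is an assumption reconstructed from those outputs, not the post's original listing, and it assumes the class lives in the default package.
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class Bank {
    private Integer code;
    private String nameOfBank;
    private Double amountOfDepositedMoney;
    private Integer totalOfCustomers;

    public Bank(Integer code, String nameOfBank, Double amountOfDepositedMoney, Integer totalOfCustomers) {
        this.code = code;
        this.nameOfBank = nameOfBank;
        this.amountOfDepositedMoney = amountOfDepositedMoney;
        this.totalOfCustomers = totalOfCustomers;
    }

    public String doDeposit(double amount) { return amount + " of money has been deposited"; }
    public String doWithDraw(double amount) { return amount + " of money has been withdrawn"; }
    public String getReceipt() { return "Receipt"; }

    @Override
    public String toString() {
        return "Bank{code=" + code + ", nameOfBank='" + nameOfBank + "', amountOfDepositedMoney="
                + amountOfDepositedMoney + ", totalOfCustomers=" + totalOfCustomers + "}";
    }
}

public class ReflectionExample {
    public static void main(String[] args) throws Exception {
        Class<?> cls = Class.forName("Bank");

        // Accessing the fields of the Bank class
        for (Field field : cls.getDeclaredFields()) {
            System.out.println("Field name: " + field.getName());
            System.out.println("Field type: " + field.getType());
        }

        // Accessing the methods of the Bank class
        for (Method method : cls.getDeclaredMethods()) {
            System.out.println("Method name: " + method.getName());
            System.out.println("Method type: " + method.getReturnType());
        }

        // Creating an object through a constructor retrieved via Reflection
        Class<?>[] argType = {Integer.class, String.class, Double.class, Integer.class};
        Constructor<?> constructor = cls.getConstructor(argType);
        Object[] argumentsValue = {1, "Bank of America", 1.5, 2500};
        Object bank = constructor.newInstance(argumentsValue);
        System.out.println(bank);

        // Invoking a method: the parameter type array matches doDeposit(double amount)
        Class<?>[] argumentsType = {double.class};
        Method doDeposit = cls.getMethod("doDeposit", argumentsType);
        System.out.println(doDeposit.invoke(bank, 145.85));
    }
}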
- Tutorial : Apache Airflow for beginnersIntro Airflow has been one of the main orchestration tools on the market and is much talked about in the Modern Data Stack world, as it is a tool capable of orchestrating data workloads through ETLs or ELTs. But in fact, Airflow is not just about that; it can be applied in several day-to-day use cases of a Data or Software Engineer. In this Apache Airflow for Beginners Tutorial, we will introduce Airflow in the simplest way, without the need to know or create ETLs. But what is Airflow actually? Apache Airflow is a widely used workflow orchestration platform for scheduling, monitoring, and managing data pipelines. It has several components that work together to provide its functionalities. Airflow components DAG The DAG (Directed Acyclic Graph) is the main component and workflow representation in Airflow. It is composed of tasks and the dependencies between them. Tasks are defined using operators, such as PythonOperator, BashOperator, SQLOperator and others. The DAG defines the task execution order and dependency relationships. Webserver The Webserver component provides a web interface for interacting with Airflow. It allows you to view, manage and monitor your workflows, tasks, DAGs and logs. The Webserver also allows user authentication and role-based access control. Scheduler The Scheduler is responsible for scheduling the execution of tasks according to the DAG definition. It periodically checks for pending tasks to run and allocates available resources to perform the tasks at the appropriate time. The Scheduler also handles crash recovery and scheduling task retries. Executor The Executor is responsible for executing the tasks defined in the DAGs. There are different types of executors available in Airflow, such as LocalExecutor, CeleryExecutor, KubernetesExecutor and others. Each executor has its own settings and execution behaviors. Metadatabase The Metadatabase is a database where Airflow stores metadata about tasks, DAGs, executions, schedules, among other things. It is used to track the status of tasks, record execution history, and provide information for workflow monitoring and visualization. Several databases can be used to record this history, such as MySQL, Postgres and others. Workers Workers are the execution nodes in a distributed environment. They receive tasks assigned by the Scheduler and execute them. Workers can be scaled horizontally to handle larger data pipelines or to spread the workload across multiple resources. Plugins Plugins are Airflow extensions that allow you to add new features and functionality to the system. They can include new operators, hooks, sensors, connections to external systems, and more. Plugins provide a way to customize and extend Airflow's capabilities to meet the specific needs of a workflow. Operators Operators are basically the building blocks of a DAG. Think of an operator as a block of code with its own responsibility. Because Airflow is an orchestrator and executes a workflow, we can have different tasks to be performed, such as accessing an API, sending an email, accessing a table in a database and performing an operation, executing Python code or even a Bash command. For each of the above tasks, we must use an operator. Next, we will discuss some of the main operators: BashOperator The BashOperator allows you to run Bash commands or scripts directly on the operating system where Airflow is running.
It is useful for tasks that involve running shell scripts, utilities, or any action that can be performed in the terminal. In short, when we need to open our system's terminal and execute some command to manipulate files or something related to the system itself, but within a DAG, this is the operator to be used. PythonOperator The PythonOperator allows you to run Python functions as tasks in Airflow. You can write your own custom Python functions and use the PythonOperator to call those functions as part of your workflow. DummyOperator The DummyOperator is a "dummy" task that takes no action. It is useful for creating complex dependencies and workflows without having to perform any real action. Sensor Sensors are used to wait for some external event to occur before continuing the workflow; they work like listeners. For example, the HttpSensor, which is a type of Sensor, can validate whether an external API is active; if so, the flow continues to run. It's not an HTTP operator that returns something, but a type of listener. HttpOperator Unlike a Sensor, the HttpOperator is used to perform HTTP requests such as GET, POST, PUT, DELETE and so on. In this case, it allows you to interact more fully with internal or external APIs. SqlOperator The SqlOperator is the operator responsible for performing DML and DDL operations in a database, that is, data manipulations such as SELECTs, INSERTs, UPDATEs and so on. Executors Executors are responsible for executing the tasks defined in a workflow (DAG). They manage the allocation and execution of tasks at runtime, ensuring that each task runs efficiently and reliably. Airflow offers different types of executors, each with different characteristics and functionalities, allowing you to choose the most suitable one for your specific needs. Below, we'll cover some of the main executors: LocalExecutor The LocalExecutor is the default executor in Apache Airflow. It is designed to be used in development and test environments where scalability isn't a concern. The LocalExecutor runs tasks on separate threads within the same Airflow process. This approach is simple and efficient for smaller pipelines or single-node runs. CeleryExecutor If you need an executor for distributed and high-scale environments, the CeleryExecutor is an excellent choice. It uses Celery, a task queue library, to distribute tasks across separate execution nodes. This approach makes Airflow well-suited for running pipelines on clusters of servers, allowing you to scale horizontally on demand. KubernetesExecutor For environments that use Kubernetes as their container orchestration platform, the KubernetesExecutor is a natural choice. It leverages Kubernetes' orchestration capability to run tasks in separate pods, which can result in better resource isolation and easier task execution in containers. DaskExecutor If your workflow requires parallel and distributed processing, the DaskExecutor might be the right choice. It uses the Dask library to perform parallel computing on a cluster of resources. This approach is ideal for tasks that can be divided into independent sub-tasks, allowing better use of available resources. Programming language Airflow supports Python as its programming language. To be honest, this is not a blocker for those who don't know the language well. In practice, the process of creating DAGs is fairly standard; what changes according to your needs is which operators you deal with, whether or not they involve Python code.
Hands-on Setting up the environment For this tutorial we will use Docker that will help us provision our environment without the need to install Airflow. If you don't have Docker installed, I recommend following the recommendations in this link and after installing it, come back to follow the tutorial. Downloading project To make it easier, clone the project from the following repository and follow the steps to deploy Airflow. Steps to deploy With docker installed and after downloading the project according to the previous item, access the directory where the project is located and open the terminal, run the following docker command: docker-compose up The above command will start the docker containers where the services of Airflow itself, postgres and more. If you're curious about how these services are mapped, open the project's docker-compose.yaml file and there you'll find more details. Anyway, after executing the above command and the containers already started, access the following address via browser http://localhost:8080/ A screen like below will open, just type airflow for the username and password and access the Airflow UI. Creating a DAG Creating a simple Hello World For this tutorial, we will create a simple DAG where the classic "Hello World" will be printed. In the project you downloaded, go to the /dags folder and create the following python file called hello_world.py. The code above is a simple example of a DAG written in Python. We noticed that we started import some functions, including the DAG itself, functions related to the datetime and the Python operator. Next, we create a Python function that will print to the console "Hello World" called by print_hello function. This function will be called by the DAG later on. The declaration of a DAG starts using the following syntax with DAG(..) passing some arguments like: dag_id: DAG identifier in Airflow context start_date: The defined date is only a point of reference and not necessarily the date of the beginning of the execution nor of the creation of the DAG. Usually the executions are carried out at a later date than the one defined in this parameter, and it is important when we need to calculate executions between the beginning and the one defined in the schedule_interval parameter. schedule_interval: In this parameter we define the periodicity in which the DAG will be executed. It is possible to define different forms of executions through CRON expressions or through Strings already defined as @daily, @hourly, @once, @weekly and etc. In the case of the example, the flow will run only once. catchup: This parameter controls retroactive executions, that is, if set to True, Airflow will execute the retroactive period from the date defined in start_date until the current date. In the previous example we defined it as False because there is no need for retroactive execution. After filling in the arguments, we create the hello_task within the DAG itself using the PythonOperator operator, which provides ways to execute python functions within a DAG. Note that we declared an identifier through the task_id and in the python_callable argument, which is native to the PythonOperator operator, we passed the python print_hello function created earlier. Finally, invoke the hello_task. This way, the DAG will understand that this will be the task to be performed. 
If you have already deployed the project, the DAG will show up in Airflow shortly, ready to be executed, as shown in the image below. Once it appears, activate it and run it by clicking Trigger DAG, as shown in the image above. Click on the hello_operator task (in the center) and a window like the one below will open. Then click the Log button to see more execution details.
Note how simple it is to create a DAG; just imagine the different possibilities and scenarios where this applies. In the next tutorials, we'll work through more complex examples exploring several other scenarios.

Conclusion
Based on this simple example, Airflow offers a flexible and straightforward way to control automated flows, from creating DAGs to navigating its web UI. As mentioned at the beginning, its use is not limited to orchestrating ETLs; it also fits any task that needs flow control with dependencies between components, whether the context is scalable or not.

GitHub Repository
Hope you enjoyed!
- Creating Asynchronous Java Code with Future

Intro
Java's Future is one of several ways to work with the language asynchronously, providing a multi-threaded context in which tasks can run in parallel without blocking the process. In the example below, we will simulate sending fictitious emails in such a way that, even while the sending is in progress, the process is not blocked; in other words, there is no need to wait for the sending to finish before other functionality can keep running.

EmailService class

Understanding the EmailService class
The EmailService class represents the sending of emails in a fictitious way; the idea of the loop is to simulate the sending and deliberately delay the process. At the end of the sending, the sendEmailBatch(int numberOfEmailsToBeSent) method returns a String with a message indicating that the process has finished.

EmailServiceAsync class

Understanding the EmailServiceAsync class
The EmailServiceAsync class represents the asynchronous mechanism itself. Its sendEmailBatchAsync(int numberOfEmailsToBeSent) method is responsible for making the dummy email sending asynchronous. The asynchronous work is managed by an ExecutorService instance, which assigns tasks to a pool of threads. In this case, the call to sendEmailBatch(int numberOfEmailsToBeSent) boils down to a task that is assigned to the single thread defined by Executors.newFixedThreadPool(1). Finally, the method returns a Future, which is literally a promise that the task will be completed at some point, representing the asynchronous process.

EmailServiceAsyncRun class

Understanding the EmailServiceAsyncRun class
This is the class where we test the asynchronous process using Future. Let's recap: in the EmailService class, we created a method called sendEmailBatch(int numberOfEmailsToBeSent), in which we simulate, through the for loop, the sending of dummy emails, printing a message per email that we'll use to observe the concurrency. In the EmailServiceAsync class, the sendEmailBatchAsync(int numberOfEmailsToBeSent) method creates an ExecutorService that manages the tasks together with the thread pool, which here has a single thread defined by Executors.newFixedThreadPool(1), and returns a Future. Now, in the EmailServiceAsyncRun class, we actually run the test; let's go through it in parts (a sketch of all three classes follows this section):
We instantiate an object of type EmailServiceAsync.
We create a Future object and assign it the return of the emailAsync.sendEmailBatchAsync(500) method. The argument 500 simply controls the number of iterations of the for loop, delaying the moment the process finishes. We could also use Thread.sleep() as an alternative and set a delay time, which would work equally well.
Note that we use the futureReturn.isDone() method to control the while loop, which means the main flow is not blocked while the emails are being sent. Any work you want to run concurrently with the sending can be placed inside this while, such as updating customer tables or any other process.
Using the futureReturn.get() method, we print the result of sending the emails.
Finally, we shut down the ExecutorService and its tasks through the executorService.shutdown() method.
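Since the original code listings are not reproduced here, below is a sketch of the three classes consistent with the description above. The method names and console messages follow the text; the exact structure (for example, where the ExecutorService lives and the small pause in the polling loop) is my assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simulates sending a batch of emails; the loop only exists to delay the process.
class EmailService {
    public String sendEmailBatch(int numberOfEmailsToBeSent) {
        for (int i = 1; i <= numberOfEmailsToBeSent; i++) {
            System.out.println("Sending email Nº " + i + "..");
        }
        return "A total of " + numberOfEmailsToBeSent + " emails has been sent";
    }
}

// Wraps the batch sending in an asynchronous task backed by a single-thread pool.
class EmailServiceAsync {
    private final ExecutorService executorService = Executors.newFixedThreadPool(1);

    public Future<String> sendEmailBatchAsync(int numberOfEmailsToBeSent) {
        // Submitting the task returns a Future: a promise that the work
        // will be completed at some point.
        return executorService.submit(
                () -> new EmailService().sendEmailBatch(numberOfEmailsToBeSent));
    }

    public void shutdown() {
        executorService.shutdown();
    }
}

public class EmailServiceAsyncRun {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        EmailServiceAsync emailAsync = new EmailServiceAsync();
        Future<String> futureReturn = emailAsync.sendEmailBatchAsync(500);

        // While the emails are being sent, the current thread stays free
        // for concurrent work, e.g. updating a customer table.
        while (!futureReturn.isDone()) {
            System.out.println("Updating customer table...");
            Thread.sleep(100); // small pause just to keep the console readable (assumption)
        }

        // get() returns the result produced by the asynchronous task.
        System.out.println(futureReturn.get());

        emailAsync.shutdown();
    }
}
```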
Running the process
Notice that there are two distinct processes running: the email-sending process ("Sending email Nº 498..") and the process of updating a customer table. The run finishes when the message "A total of 500 emails has been sent" is printed.

Working with blocking processes
Future is also widely used in cases where we need to block a process: the current thread is blocked until the work being executed by the Future ends. To do so, simply invoke the futureReturn.get() method directly, without the iteration control used in the previous example, as sketched at the end of this post. An important point is that this approach can waste resources, since the current thread sits blocked while it waits.

Conclusion
Future is very handy when we need to add asynchronous behaviour to our code in the simplest way, or even to block on a result when required. It is a lean API with some limitations, but it works well for many scenarios.

Hope you enjoyed!
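As referenced in the blocking-processes section above, here is a minimal sketch of the blocking variant. It reuses the EmailServiceAsync class from the earlier sketch; the class name and batch size are assumptions carried over from there.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class EmailServiceBlockingRun {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        EmailServiceAsync emailAsync = new EmailServiceAsync();
        Future<String> futureReturn = emailAsync.sendEmailBatchAsync(500);

        // No isDone() polling loop here: get() blocks the current thread
        // until the asynchronous task finishes and returns its result.
        System.out.println(futureReturn.get());

        emailAsync.shutdown();
    }
}
```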











