A Beginner’s Guide to Complete Analysis of Apache Spark RDDs and Java 8 Streams

1. What is Apache Spark RDD?

Apache Spark RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant, immutable collection of elements that can be operated on in parallel. We can perform various parallel operations on RDDs, such as map, reduce, filter, count, and distinct, and we can persist them in memory while performing these operations. RDDs can be created in two ways:

A. parallelize(): call the parallelize method on an existing collection in our program (pass the collection object to the method):

    JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5));

B. textFile(): call the textFile method with the path of a file on the local or a shared file system (pass the file URI to the method):

    JavaRDD<String> lines = sparkContext.textFile("URI/to/sample/file.txt");

Both methods are called on a JavaSparkContext instance. Two types of operations can be performed on RDDs:

Transformations: operations on an RDD that return another RDD (e.g. map).
Actions: operations that return a value after performing a computation (e.g. reduce).

Consider the following example of map and reduce, which calculates the total length of the lines in a file using JavaRDDs:

    JavaRDD<String> lines = sc.textFile("URI/to/sample/file.txt");
    JavaRDD<Integer> lengths = lines.map(l -> l.length());
    int totalLength = lengths.reduce((a, b) -> a + b);

2. What is the Java 8 Streams API?

The Java Stream API sounds similar to InputStream and OutputStream from Java IO, but it is completely different, so let's not get confused. Streams were introduced in Java 8 specifically to ease functional programming. Java Streams are monads: structures that represent computations as a chain of steps. Streams are the Java APIs that let you manipulate collections, and you can chain multiple Stream operations together to build a complex data-processing pipeline.
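The kind of chained pipeline described above can be sketched in plain Java with no external dependencies (the word list and class name here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "streams");

        // One declarative pipeline: filter -> map -> collect
        List<Integer> lengths = words.stream()
                .filter(w -> w.length() > 3)     // keeps "spark" and "streams"
                .map(String::length)             // 5, 7
                .collect(Collectors.toList());

        System.out.println(lengths); // [5, 7]
    }
}
```

Each step returns a new Stream, which is what makes the operations composable.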
With the Streams API, you can write code that is:

Declarative: more concise as well as readable.
Composable: greater flexibility.
Parallelizable: better performance (using parallel Streams).

Streams can be created in the same ways as Spark RDDs:

A. From collections and arrays:

    List<String> strings = Arrays.asList("abc", "", "bc", "efg", "abcd", "", "jkl");
    // get the count of empty strings
    long count = strings.parallelStream().filter(string -> string.isEmpty()).count();

B. From file systems:

    Stream<String> stream = Files.lines(Paths.get("URI/to/sample/file.txt"));

Like RDD operations, Stream operations are also of two types:

Intermediate (like Transformations): operations that perform some work on a Stream and return a Stream (e.g. map).
Terminal (like Actions): operations that return a value after performing the computation, or can be void (e.g. reduce, forEach).

    Stream<String> lines = Files.lines(Paths.get("URI/to/sample/file.txt"));
    Stream<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce(0, (a, b) -> a + b);

Streams accept lambda expressions as parameters; a lambda implements a functional interface that specifies the exact behavior of the operation. Intermediate operations are executed only when a terminal operation is called on the pipeline. Once a terminal operation has been called on a Stream, that Stream cannot be reused. If we want to reuse a chain of intermediate operations, we have to create a Stream Supplier, which constructs a new Stream with the intermediate operations already applied; the supplier's get() method returns a fresh Stream each time.

3. What Can We Do with Spark RDDs?

To perform very fast computations over a shared data set, such as iterative distributed computing, we need an excellent data-sharing architecture. This involves processing data using multiple ad-hoc queries and sharing and reusing data among multiple jobs.
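The Supplier idea mentioned above can be sketched as follows, using a small in-memory Stream instead of a file (the class name and data are illustrative):

```java
import java.util.function.Supplier;
import java.util.stream.Stream;

public class SupplierSketch {
    public static void main(String[] args) {
        // A Stream can be consumed only once; a Supplier rebuilds it on demand.
        Supplier<Stream<String>> fruits =
                () -> Stream.of("apple", "banana", "blackberry");

        long startsWithB = fruits.get().filter(s -> s.startsWith("b")).count();
        long total = fruits.get().count(); // fresh Stream, no IllegalStateException

        System.out.println(startsWithB + " " + total); // 2 3
    }
}
```

Calling fruits.get() twice yields two independent Streams, so both terminal operations succeed.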
To perform these operations without RDDs, we would need a mechanism that stores the intermediate data in a distributed data store, which leads to slower processing due to multiple IO operations. RDDs help us do such operations by breaking the computations into small tasks which run on separate machines. We can cache these RDDs on local disks to reuse them in other actions, which helps future actions execute much faster; the persist() and cache() methods keep the computed RDDs in the memory of the nodes. The following properties make RDDs perform best for iterative distributed computing algorithms such as k-means clustering, PageRank, and logistic regression:

Immutable
Partitioned
Fault tolerant
Created by coarse-grained operations
Lazily evaluated
Persistable

More importantly, all Transformations on RDDs are lazy, which means their results are not calculated right away. The transformations are simply remembered and computed only when they are actually needed by the driver program. Actions, on the other hand, are eager.

4. What Can We Do with the Java 8 Streams API?

The Stream API (and of course lambdas) was introduced in Java 8 with parallelism as the main driving force. Streams help write complex code in a concise way that is more readable, flexible, and understandable. Streams can be created from various data sources, such as collections, arrays, and file resources, and they come in two types: sequential and parallel. We can perform computations using multiple threads with parallel Streams, which can boost performance when there is a large amount of input data. Like RDDs, Streams offer methods such as map, reduce, collect, flatMap, sorted, filter, min, max, and count.
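The laziness of intermediate operations (the Streams counterpart of RDD transformations) can be observed directly in plain Java; this sketch uses a counter to show that nothing runs until the terminal operation fires. The class name is illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazinessSketch {
    public static void main(String[] args) {
        AtomicInteger visited = new AtomicInteger();

        // Intermediate operations only: the pipeline is remembered, not run
        Stream<String> pipeline = Stream.of("a", "bb", "ccc")
                .peek(x -> visited.incrementAndGet())
                .filter(x -> x.length() > 1);

        System.out.println(visited.get());       // 0 -- nothing computed yet

        long survivors = pipeline.count();       // terminal: the pipeline runs now
        System.out.println(visited.get() + " " + survivors); // 3 2
    }
}
```

RDD transformations behave analogously: map and filter on an RDD merely record lineage, and work happens only when an action such as reduce or count is invoked.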
Consider a list of fruits (note that a List must first be turned into a Stream via stream() before these operations can be chained):

    List<String> fruits = Arrays.asList("apple", "orange", "pineapple", "grape", "banana", "mango", "blackberry");

filter():

    fruits.stream().filter(fruit -> fruit.startsWith("b"));

map():

    fruits.stream().map(fruit -> fruit.toUpperCase());

collect():

    List<String> filteredFruits = fruits.stream()
            .filter(fruit -> fruit.startsWith("b"))
            .collect(Collectors.toList());

min() and max():

    String shortest = fruits.stream()
            .min(Comparator.comparing(fruit -> fruit.length()))
            .get();

count():

    long count = fruits.stream().filter(fruit -> fruit.startsWith("b")).count();

reduce():

    String reduced = fruits.stream()
            .filter(fruit -> fruit.startsWith("b"))
            .reduce("", (acc, fruit) -> acc + " " + fruit);

5. How Are They the Same?

RDDs and Streams can be created the same ways: from collections and from file systems. They also support the same two kinds of operations:

Transformations in RDDs == Intermediate operations in Streams
Actions in RDDs == Terminal operations in Streams

Transformations (RDDs) and intermediate operations (Streams) share the same important characteristic: laziness. They just remember the operations instead of computing them until an action is needed, while Actions (RDDs) and terminal operations (Streams) are eager. Both reduce the actual number of operations performed on each element by using filters and predicates, and both let developers write much more concise code. RDDs and (parallel) Streams are used for parallel operations where a large amount of data processing is required, and both work on the principles of functional programming and take lambda expressions as parameters.

6. How Are They Different?

Unlike RDDs, Java 8 Streams are of two types: sequential and parallel. Parallelization is just a matter of calling the parallel() method on a Stream, which internally utilizes a thread pool in the JVM, while Spark RDDs can be distributed and deployed over a cluster. And while Spark has different storage levels for different purposes, Streams are purely in-memory data structures.
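The fruit snippets above, stitched into one runnable form (the class name is illustrative; note that each pipeline needs its own fresh Stream):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class FruitSketch {
    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("apple", "orange", "pineapple",
                "grape", "banana", "mango", "blackberry");

        // collect the fruits starting with "b"
        List<String> bFruits = fruits.stream()
                .filter(fruit -> fruit.startsWith("b"))
                .collect(Collectors.toList());   // [banana, blackberry]

        // shortest fruit name; on ties, min keeps the first one encountered
        String shortest = fruits.stream()
                .min(Comparator.comparing(String::length))
                .get();                          // "apple"

        // fold the "b" fruits into a single string
        String reduced = fruits.stream()
                .filter(fruit -> fruit.startsWith("b"))
                .reduce("", (acc, fruit) -> acc + " " + fruit); // " banana blackberry"

        System.out.println(bFruits + " | " + shortest + " |" + reduced);
    }
}
```
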
When you call the parallel() method on a Stream, your data is split into multiple chunks that are processed independently. This process is CPU intensive and utilizes all the available CPUs. Java parallel Streams use the common ForkJoinPool, and the number of threads this pool can use depends on the number of available CPU cores. The value can be changed with the following JVM parameter:

    -Djava.util.concurrent.ForkJoinPool.common.parallelism=5

Because parallel Stream operations use all the available threads in this shared pool, submitting a long-running task can block all the other threads in the pool: one long task could block the entire application.

7. Which One Is Better? When?

Although RDDs and Streams provide quite similar implementations, APIs, and functionality, Apache Spark is much more powerful than Java 8 Streams. While the choice is ultimately ours, we should always analyze our requirements before deciding on an implementation. Since parallel Streams use the common thread pool, we must ensure there are no long-running tasks that would cause other threads to get stuck. Apache Spark RDDs, by contrast, distribute the data over a cluster. In complex situations involving truly huge amounts of data and computation, we should avoid Java 8 Streams. So for non-distributed parallel processing my choice would be Streams, while Apache Spark RDDs would be preferable for real-time analytics or continuously streaming data over distributed environments.
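A small sketch of inspecting the common pool and running a parallel reduction (the class name is illustrative; the reported parallelism depends on the machine, or on the JVM flag above if it is set before the pool is first used):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class CommonPoolSketch {
    public static void main(String[] args) {
        // By default the common pool targets availableProcessors() - 1 workers
        System.out.println(ForkJoinPool.commonPool().getParallelism());

        // A parallel reduction: sum is associative, so per-chunk results
        // computed on different worker threads combine safely
        long sum = LongStream.rangeClosed(1, 1_000).parallel().sum();
        System.out.println(sum); // 500500
    }
}
```

Because every parallel Stream in the JVM shares this one pool by default, the long-running-task caveat above applies process-wide, not per Stream.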

Aziro Marketing

