A Beginner’s Guide to Complete Analysis of Apache Spark RDDs and Java 8 Streams

1. What is Apache Spark RDD?

Apache Spark RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant, immutable collection of elements that can be operated on in parallel. We can perform various parallel operations on RDDs, such as map, reduce, filter, count, and distinct, and we can persist them in memory while performing these operations. RDDs can be created in two ways:

A. parallelize(): call the parallelize method on an existing collection in our program (pass the collection object to the method):

    JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5));

B. textFile(): call the textFile method with the path of a file on the local or a shared file system (pass the file URI to the method):

    JavaRDD<String> lines = sparkContext.textFile("URI/to/sample/file.txt");

Both methods are called on a JavaSparkContext instance. Two types of operations can be performed on RDDs:

Transformations: operations on an RDD that return another RDD (e.g. map).
Actions: operations that return a value after performing a computation (e.g. reduce).

Consider the following example of map and reduce, which calculates the total length of the lines in a file using JavaRDDs:

    JavaRDD<String> lines = sc.textFile("URI/to/sample/file.txt");
    JavaRDD<Integer> lengths = lines.map(l -> l.length());
    int totalLength = lengths.reduce((a, b) -> a + b);

2. What is the Java 8 Streams API?

The Java Stream API sounds similar to InputStream and OutputStream from Java IO, but it is completely different, so let's not get confused. Streams were introduced in Java 8 specifically to ease functional programming. Java Streams are monads: structures that represent computations as a chain of steps. Streams are the Java APIs that let you manipulate collections, and you can chain multiple Stream operations together to build a complex data-processing pipeline.
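The kind of chained pipeline described above can be sketched in plain Java with no external dependencies (the word list and class name here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "streams");

        // One declarative pipeline: filter -> map -> collect
        List<Integer> lengths = words.stream()
                .filter(w -> w.length() > 3)     // keeps "spark" and "streams"
                .map(String::length)             // 5, 7
                .collect(Collectors.toList());

        System.out.println(lengths); // [5, 7]
    }
}
```

Each step returns a new Stream, which is what makes the operations composable.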
With the Streams API, you can write code that is:

Declarative: more concise as well as readable.
Composable: greater flexibility.
Parallelizable: better performance (using parallel Streams).

Streams can be created in the same ways as Spark RDDs:

A. From collections and arrays:

    List<String> strings = Arrays.asList("abc", "", "bc", "efg", "abcd", "", "jkl");
    // get the count of empty strings
    long count = strings.parallelStream().filter(string -> string.isEmpty()).count();

B. From file systems:

    Stream<String> stream = Files.lines(Paths.get("URI/to/sample/file.txt"));

Like RDD operations, Stream operations are also of two types:

Intermediate (like Transformations): operations that perform some work on a Stream and return a Stream (e.g. map).
Terminal (like Actions): operations that return a value after performing the computation, or can be void (e.g. reduce, forEach).

    Stream<String> lines = Files.lines(Paths.get("URI/to/sample/file.txt"));
    Stream<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce(0, (a, b) -> a + b);

Streams accept lambda expressions as parameters; a lambda implements a functional interface that specifies the exact behavior of the operation. Intermediate operations are executed only when a terminal operation is called on the pipeline. Once a terminal operation has been called on a Stream, that Stream cannot be reused. If we want to reuse a chain of intermediate operations, we have to create a Stream Supplier, which constructs a new Stream with the intermediate operations already applied; the supplier's get() method returns a fresh Stream each time.

3. What Can We Do with Spark RDDs?

To perform very fast computations over a shared data set, such as iterative distributed computing, we need an excellent data-sharing architecture. This involves processing data using multiple ad-hoc queries and sharing and reusing data among multiple jobs.
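The Supplier idea mentioned above can be sketched as follows, using a small in-memory Stream instead of a file (the class name and data are illustrative):

```java
import java.util.function.Supplier;
import java.util.stream.Stream;

public class SupplierSketch {
    public static void main(String[] args) {
        // A Stream can be consumed only once; a Supplier rebuilds it on demand.
        Supplier<Stream<String>> fruits =
                () -> Stream.of("apple", "banana", "blackberry");

        long startsWithB = fruits.get().filter(s -> s.startsWith("b")).count();
        long total = fruits.get().count(); // fresh Stream, no IllegalStateException

        System.out.println(startsWithB + " " + total); // 2 3
    }
}
```

Calling fruits.get() twice yields two independent Streams, so both terminal operations succeed.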
To perform these operations without RDDs, we would need a mechanism that stores the intermediate data in a distributed data store, which leads to slower processing due to multiple IO operations. RDDs help us do such operations by breaking the computations into small tasks which run on separate machines. We can cache these RDDs on local disks to reuse them in other actions, which helps future actions execute much faster; the persist() and cache() methods keep the computed RDDs in the memory of the nodes. The following properties make RDDs perform best for iterative distributed computing algorithms such as k-means clustering, PageRank, and logistic regression:

Immutable
Partitioned
Fault tolerant
Created by coarse-grained operations
Lazily evaluated
Persistable

More importantly, all Transformations on RDDs are lazy, which means their results are not calculated right away. The transformations are simply remembered and computed only when they are actually needed by the driver program. Actions, on the other hand, are eager.

4. What Can We Do with the Java 8 Streams API?

The Stream API (and of course lambdas) was introduced in Java 8 with parallelism as the main driving force. Streams help write complex code in a concise way that is more readable, flexible, and understandable. Streams can be created from various data sources, such as collections, arrays, and file resources, and they come in two types: sequential and parallel. We can perform computations using multiple threads with parallel Streams, which can boost performance when there is a large amount of input data. Like RDDs, Streams offer methods such as map, reduce, collect, flatMap, sorted, filter, min, max, and count.
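The laziness of intermediate operations (the Streams counterpart of RDD transformations) can be observed directly in plain Java; this sketch uses a counter to show that nothing runs until the terminal operation fires. The class name is illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazinessSketch {
    public static void main(String[] args) {
        AtomicInteger visited = new AtomicInteger();

        // Intermediate operations only: the pipeline is remembered, not run
        Stream<String> pipeline = Stream.of("a", "bb", "ccc")
                .peek(x -> visited.incrementAndGet())
                .filter(x -> x.length() > 1);

        System.out.println(visited.get());       // 0 -- nothing computed yet

        long survivors = pipeline.count();       // terminal: the pipeline runs now
        System.out.println(visited.get() + " " + survivors); // 3 2
    }
}
```

RDD transformations behave analogously: map and filter on an RDD merely record lineage, and work happens only when an action such as reduce or count is invoked.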
Consider a list of fruits (note that a List must first be turned into a Stream via stream() before these operations can be chained):

    List<String> fruits = Arrays.asList("apple", "orange", "pineapple", "grape", "banana", "mango", "blackberry");

filter():

    fruits.stream().filter(fruit -> fruit.startsWith("b"));

map():

    fruits.stream().map(fruit -> fruit.toUpperCase());

collect():

    List<String> filteredFruits = fruits.stream()
            .filter(fruit -> fruit.startsWith("b"))
            .collect(Collectors.toList());

min() and max():

    String shortest = fruits.stream()
            .min(Comparator.comparing(fruit -> fruit.length()))
            .get();

count():

    long count = fruits.stream().filter(fruit -> fruit.startsWith("b")).count();

reduce():

    String reduced = fruits.stream()
            .filter(fruit -> fruit.startsWith("b"))
            .reduce("", (acc, fruit) -> acc + " " + fruit);

5. How Are They the Same?

RDDs and Streams can be created the same ways: from collections and from file systems. They also support the same two kinds of operations:

Transformations in RDDs == Intermediate operations in Streams
Actions in RDDs == Terminal operations in Streams

Transformations (RDDs) and intermediate operations (Streams) share the same important characteristic: laziness. They just remember the operations instead of computing them until an action is needed, while Actions (RDDs) and terminal operations (Streams) are eager. Both reduce the actual number of operations performed on each element by using filters and predicates, and both let developers write much more concise code. RDDs and (parallel) Streams are used for parallel operations where a large amount of data processing is required, and both work on the principles of functional programming and take lambda expressions as parameters.

6. How Are They Different?

Unlike RDDs, Java 8 Streams are of two types: sequential and parallel. Parallelization is just a matter of calling the parallel() method on a Stream, which internally utilizes a thread pool in the JVM, while Spark RDDs can be distributed and deployed over a cluster. And while Spark has different storage levels for different purposes, Streams are purely in-memory data structures.
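The fruit snippets above, stitched into one runnable form (the class name is illustrative; note that each pipeline needs its own fresh Stream):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class FruitSketch {
    public static void main(String[] args) {
        List<String> fruits = Arrays.asList("apple", "orange", "pineapple",
                "grape", "banana", "mango", "blackberry");

        // collect the fruits starting with "b"
        List<String> bFruits = fruits.stream()
                .filter(fruit -> fruit.startsWith("b"))
                .collect(Collectors.toList());   // [banana, blackberry]

        // shortest fruit name; on ties, min keeps the first one encountered
        String shortest = fruits.stream()
                .min(Comparator.comparing(String::length))
                .get();                          // "apple"

        // fold the "b" fruits into a single string
        String reduced = fruits.stream()
                .filter(fruit -> fruit.startsWith("b"))
                .reduce("", (acc, fruit) -> acc + " " + fruit); // " banana blackberry"

        System.out.println(bFruits + " | " + shortest + " |" + reduced);
    }
}
```
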
When you call the parallel() method on a Stream, your data is split into multiple chunks that are processed independently. This process is CPU intensive and utilizes all the available CPUs. Java parallel Streams use the common ForkJoinPool, and the number of threads this pool can use depends on the number of available CPU cores. The value can be changed with the following JVM parameter:

    -Djava.util.concurrent.ForkJoinPool.common.parallelism=5

Because parallel Stream operations use all the available threads in this shared pool, submitting a long-running task can block all the other threads in the pool: one long task could block the entire application.

7. Which One Is Better? When?

Although RDDs and Streams provide quite similar implementations, APIs, and functionality, Apache Spark is much more powerful than Java 8 Streams. While the choice is ultimately ours, we should always analyze our requirements before deciding on an implementation. Since parallel Streams use the common thread pool, we must ensure there are no long-running tasks that would cause other threads to get stuck. Apache Spark RDDs, by contrast, distribute the data over a cluster. In complex situations involving truly huge amounts of data and computation, we should avoid Java 8 Streams. So for non-distributed parallel processing my choice would be Streams, while Apache Spark RDDs would be preferable for real-time analytics or continuously streaming data over distributed environments.
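A small sketch of inspecting the common pool and running a parallel reduction (the class name is illustrative; the reported parallelism depends on the machine, or on the JVM flag above if it is set before the pool is first used):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class CommonPoolSketch {
    public static void main(String[] args) {
        // By default the common pool targets availableProcessors() - 1 workers
        System.out.println(ForkJoinPool.commonPool().getParallelism());

        // A parallel reduction: sum is associative, so per-chunk results
        // computed on different worker threads combine safely
        long sum = LongStream.rangeClosed(1, 1_000).parallel().sum();
        System.out.println(sum); // 500500
    }
}
```

Because every parallel Stream in the JVM shares this one pool by default, the long-running-task caveat above applies process-wide, not per Stream.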

Aziro Marketing

