Big Data Updates

Uncover our latest and greatest product updates

How to Classify a Product Catalogue Using Ensemble Methods

The Problem Statement
The Otto Product Classification Challenge was a competition hosted on Kaggle, a website dedicated to solving complex data science problems. The purpose of this challenge was to classify products into the correct category, based on their recorded features.

Data
The organizers provided a training data set containing 61,878 entries and a test data set containing 144,368 entries. The data contained 93 features, based on which the products had to be classified. The target column in the training data set indicated the category of the product. The training and test data sets are available for download here. A sample of the training data set can be seen in Figure 1.

Figure 1: Sample Data

Solution Approach
The features in the training data set had a large variance. An Anscombe transform on the features reduced the variance; in the process, it also transformed the features from an approximately Poisson distribution into an approximately normal distribution. The rest of the data was fairly clean, so we could use it directly as input to the classification algorithm. For classification, we tried two approaches: one using the xgboost algorithm, and the other using the deep learning algorithm provided by h2o. xgboost is an implementation of the extreme gradient boosting algorithm, which can be used effectively for classification.

As discussed in the TFI blog, the gradient boosting algorithm is an ensemble method based on decision trees. For classification, at every branch in a tree it tries to eliminate a category, so as to finally have only one category per leaf node. For this, it needs to build trees for each category separately, but since we had only 9 categories, we decided to use it. The deep learning algorithm provided by h2o is based on a multi-layer neural network, trained using a variation of the gradient descent method. We used multi-class logarithmic loss as the error metric to find a good model, as this was also the metric used for ranking by Kaggle.

Building the Classifier
Initially, we created a classifier using xgboost. The xgboost configuration for our best submission is as below:

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = 9,
              "nthread" = 8,
              "eta" = 0.1,
              "max_depth" = 27,
              "gamma" = 2,
              "min_child_weight" = 3,
              "subsample" = 0.75,
              "colsample_bytree" = 0.85)
nround <- 5000
classifier <- xgboost(param = param, data = x, label = y, nrounds = nround)

Here, we have specified our objective function to be 'multi:softprob'. This function returns the probabilities of a product being classified into each category. The evaluation metric, as specified earlier, is the multi-class logarithmic loss function. The 'eta' parameter is the shrinkage applied at each boosting step, which makes the algorithm less prone to overfitting; it usually takes values between 0.1 and 0.001. The 'max_depth' parameter limits the height of the decision trees; shallow trees are constructed faster, and not specifying this parameter lets the trees grow as deep as required. The 'min_child_weight' parameter controls the splitting of the tree: its value puts a lower bound on the weight of each child node before it can be split further. The 'subsample' parameter makes the algorithm choose a subset of the training set; in our case, it randomly chooses 75% of the training data to build the classifier. The 'colsample_bytree' parameter makes the algorithm choose a subset of the features while building each tree, in our case 85%. Both 'subsample' and 'colsample_bytree' help in preventing overfitting.
The classifier was built by performing 5000 iterations over the training data set. These parameters were tuned by experimentation, trying to minimize the log-loss error. The log-loss error on the public leaderboard for this configuration was 0.448.

We also created another classifier using the deep learning algorithm provided by h2o. The configuration for this algorithm was as follows:

classification = T, activation = "RectifierWithDropout", hidden = c(1024, 512, 256), hidden_dropout_ratio = c(0.5, 0.5, 0.5), input_dropout_ratio = 0.05, epochs = 50, l1 = 1e-5, l2 = 1e-5, rho = 0.99, epsilon = 1e-8, train_samples_per_iteration = 4000, max_w2 = 10, seed = 1

This configuration creates a neural network with 3 hidden layers of 1024, 512 and 256 neurons respectively, specified using the 'hidden' parameter. The activation function in this case is Rectifier with Dropout. The rectifier function filters out negative inputs for each neuron; dropout randomly drops inputs to the hidden neuron layers, which leads to better generalization. The 'hidden_dropout_ratio' specifies the percentage of inputs to the hidden layers to be dropped, and 'input_dropout_ratio' the percentage of inputs to the input layer to be dropped. 'epochs' defines the number of training iterations to be carried out. Setting 'train_samples_per_iteration' makes the algorithm choose a subset of the training data per iteration. 'l1' and 'l2' regularize the weights assigned to each feature: l1 reduces model complexity, and l2 introduces bias into the estimation. 'rho' and 'epsilon' together control the adaptive learning rate and thus the speed of convergence. 'max_w2' sets an upper limit on the sum of squared incoming weights into a neuron; this needs to be set for the rectifier activation function. 'seed' is the random seed that controls sampling.

Using these parameters, we performed 10 runs of deep learning and submitted the mean of the 10 results as the output. The log-loss error on the public leaderboard for this configuration was 0.448.

We then merged both results by taking their mean, which resulted in a top-10% submission with a score of 0.428.

Results
Since the public leaderboard evaluation was based on 70% of the test data, the public leaderboard rank was fairly stable. We submitted the best two results, and our rank remained in the top 10% at the end of the final evaluation. Overall, it was a good learning experience.

What We Learnt
Relying on the results of a single model may not give the best possible result. Combining the results of various approaches may reduce the error by a significant margin. As can be seen here, the winning solution in this competition used a much more complex ensemble. Complex ensembles may be the way ahead for getting better at complex problems like classification.
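To make the blending step above concrete, here is a minimal Python sketch (not the code used in the competition) that averages the class-probability matrices produced by two models and scores the blend with multi-class logarithmic loss. The file names and array shapes are assumptions for illustration only.

import numpy as np
from sklearn.metrics import log_loss

# Hypothetical inputs: each model outputs an (n_samples, 9) matrix of class probabilities.
xgb_probs = np.load("xgb_probs.npy")   # assumed output of the xgboost model
dl_probs = np.load("h2o_probs.npy")    # assumed output of the h2o deep learning model
y_true = np.load("labels.npy")         # true class labels, encoded 0..8

# Simple ensemble: arithmetic mean of the two probability matrices,
# renormalized so every row still sums to 1.
blend = (xgb_probs + dl_probs) / 2.0
blend = blend / blend.sum(axis=1, keepdims=True)

# Multi-class logarithmic loss, the metric used for ranking on Kaggle.
print("xgboost :", log_loss(y_true, xgb_probs))
print("h2o DL  :", log_loss(y_true, dl_probs))
print("blend   :", log_loss(y_true, blend))

In this post the blend of two models that each scored 0.448 on their own came out at 0.428, which is exactly the behaviour such an averaging step is meant to capture.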

Aziro Marketing


How to Implement Big Data for Your Organization in the Right Way

Let’s start with an introduction to what IT teams across the globe call “big data.” From a use case perspective, few terms are as overused and hackneyed as big data. Some people say it is all the data in your company, others say it is anything above one terabyte, and a third group argues it is anything you cannot easily tackle. In essence, big data is a broad term applied to data sets so large and complex that they render traditional data processing systems obsolete. This means new systems have to be implemented in order to get a grasp of such large volumes of data.

One might ask what benefits can be drawn from analyzing such a deluge of data. The answer can be quite broad. Businesses can draw huge benefits from big data analytics, and the primary and most sought-after of them all is key business intelligence that translates into higher profits. Traditional business intelligence systems can not only be revamped but also made more beneficial by the implementation of big data systems. Look at Google’s Flu Trends, which gives near real-time trends on flu across the globe with the help of the search data that Google captures in its data centers. This certainly qualifies as big data, and with its help Google is able to provide an accurate worldwide analysis of flu trends. This is one of the major use cases of big data analysis. When it comes to your organization, a big data analytics implementation can make all the difference in profits. Many organizations have found that just by implementing a recommendation engine, they see a huge difference in sales. Let’s look at some key best practices in implementing big data.

1. Analyze Business Requirements
As the first step, you need to know what you will be using big data tools for. The answer lies in your business requirements. You need to gather all business requirements and then analyze them in order to understand them. It is important that all big data projects be aligned to specific business needs.

2. Agile Implementation Is Key
Agile and iterative techniques are required for quick implementation of big data analytics in your organization. Business agility, in its basic sense, is the inherent capability to respond to changes rapidly. Throughout implementation, you may see the requirements of the organization evolve as it understands and adopts big data. Agile and iterative techniques deliver quick solutions based on the current needs.

3. Business Decision Making
Big data analytics solutions reach their maximum potential if implemented from a business perspective rather than an engineering perspective. This essentially means the data analytics solutions have to be tailor-made for your business rather than generic. Tailor-made, business-centric solutions will help you achieve the results that you are looking for.

4. Make Use of the Data You Have
A key thing that many organizations need to do is understand the potential of the data they already have. One of the reasons this is essential is to keep up with the ever-increasing competition out there. A Gartner survey says 73 percent of organizations will implement big data in the next two years. The other 27 percent will surely lose out to the competition! According to IDC, visual data discovery tools will grow 2.5 times faster than the traditional business intelligence market. You already have an edge if you make use of the existing data in your organization. This is a key business decision that you need to make.

5. Don’t Abandon Legacy Data Systems
Abandoning the expensive legacy data systems that you may already have in place can be a big mistake. Relational database management systems will not end their run soon just because a smarter new kid is on the block. Although RDBMS technology may seem dated, the cost of complete abandonment may be large enough to make it unnecessary.

6. Evaluate the Data Requirements Carefully
A complete evaluation of the data that your organization gathers on a daily basis is essential, even if big data implementation is not in your immediate business plan. A stakeholder meeting that clearly consolidates everyone’s opinion is necessary here.

7. Approach It From the Ground Up
One key thing that you should not forget is that it is nearly impossible to tackle an entire organization’s data in one shot. It is better to approach it in a granular way, gradually incorporating data in sets and thoroughly testing the efficiency of the implementation. Taking on too much data in the first step will yield unreliable results or cause a complete collapse of your setup.

8. Set Up Centers of Excellence
In order to optimize knowledge transfer across your organization, tackle any oversights, and correct mistakes, set up a center of excellence. This will also help share costs across the enterprise. One key benefit of setting up centers of excellence is that they ensure the maturity of your information architecture in a systematic way.

Conclusion
Associate big data analytics platforms with the data gathered by your enterprise apps, such as HR management systems, CRMs, ERPs, etc. This will enable information workers to understand and unearth insights from different sources of data. According to IBM, big data is associated with four key aspects: volume, variety, veracity, and velocity. This means you have very little time to start analyzing your data, and the infrastructure needed to analyze it and provide insights from it will be well worth your time and investment. This is why there is such immense interest in big data analytics.

Aziro Marketing


How Big Data Benefits Business Intelligence

Big data is a big phenomenon, one that can overwhelm you at an unimaginably large scale. With the progress of technology firms through the Internet, it has become quite apparent that big data has importance beyond all scales. The term big data is applied to unstructured, highly complex collections of data that warrant special processing technologies; it simply cannot be processed like the regular data that you are familiar with. Today’s organizations are increasingly looking for ways to uncover actionable insights, correlations, and hidden patterns from the hoards of data lying around. Organizations such as Google, Facebook, and Microsoft have immense stores of big data, and they use specific analytics tools to make sense of them.

BI and Big Data
Over the past few decades, enterprises have seen a gradual evolution of data management technologies from OLTP (Online Transaction Processing) to data warehousing and business intelligence. The latest step in this evolution is big data, and almost every large enterprise is now using big data to advance its business. Business intelligence accelerates major business decisions made by IT organizations today. Enterprises store transactional data in big data warehouses and make use of this data for various kinds of analysis. Although business intelligence tools have existed in enterprises for quite a while, the tools and technologies for BI require a revisit to accommodate one of the biggest changes in IT organizations today: the advent of big data.

How Does Big Data Benefit Business Intelligence?
Big data comes with some heavy advantages that can transform businesses completely. Some of them are:
Highly scalable: While a traditional data stack platform is unable to scale to heavy workloads, a big-data-based platform can scale almost infinitely.
Cost savings: Most big data platforms come with open-source licensing options, so the cost of ownership is significantly lower with a big data BI architecture.
MapReduce: The MapReduce programming interface provides powerful custom data management and processing capabilities.
Unstructured data support: One major advantage of big data is that it supports structured, semi-structured, and unstructured data elements.

Conclusion
Enterprises should leverage big data for enterprise business intelligence by incorporating it into their existing BI architecture. The advantages of doing this have been discussed at length above, and the cost of doing so may be well justified when you reap the myriad business benefits.

Aziro (formerly MSys Technologies) solves complex big data challenges for global enterprises. As one of the leading big data service providers, Aziro supports clients in refining their big data strategy by choosing the right tools, technologies and processes that help achieve strategic objectives. Our vendor-neutral big data analytics solutions are tailored to the customer’s current technology landscape, preferences and objectives. We enable enterprises to establish a well-defined modern architecture, which yields greater efficiency in their day-to-day processes. Our expertise in Hadoop-based platforms, MPP databases, cloud storage systems and other emerging technologies helps organizations harness the power of big data to improve business outcomes.

Aziro Marketing


How You Can Hyperscale Your Applications Using Mesos & Marathon

In a previous blog post we saw what Apache Mesos is and how it helps create dynamic partitioning of available resources, which results in increased utilization, efficiency, reduced latency, and better ROI. We also discussed how to install, configure and run Mesos and sample frameworks. There is much more to Mesos than that.

In this post we will explore and experiment with a close-to-real-life Mesos cluster running multiple masters and slaves, along with Marathon, a meta-framework that acts as a cluster-wide init and control system for long-running services. We will set up 3 Mesos masters and 3 Mesos slaves, cluster them along with Zookeeper and Marathon, and finally run a Ruby on Rails application on this Mesos cluster. The post will demo scaling the Rails application up and down with the help of Marathon. We will use Vagrant to set up our nodes inside VirtualBox and will link the relevant Vagrantfile later in this post.

To follow this guide you will need to obtain the binaries for:
Ubuntu 14.04 (64-bit) (Trusty)
Apache Mesos
Marathon
Apache Zookeeper
Ruby / Rails
VirtualBox
Vagrant
Vagrant plugins: vagrant-hosts, vagrant-cachier

Let me briefly explain what Marathon and Zookeeper are. Marathon is a meta-framework you can use to start other Mesos frameworks or applications (anything that you could launch from your standard shell). So if Mesos is your data center kernel, Marathon is your “init” or “upstart”. Marathon provides an excellent REST API to start, stop and scale your application. Apache Zookeeper is a coordination server for distributed systems, used to maintain configuration information and naming, and to provide distributed synchronization and group services. We will use Zookeeper to coordinate between the masters themselves and the slaves.

For Apache Mesos, Marathon and Zookeeper we will use the excellent packages from Mesosphere, the company behind Marathon. This will save us a lot of time over building the binaries ourselves. We also get to leverage a bunch of helpers that these packages provide, such as creating required directories, configuration files and templates, startup/shutdown scripts, etc. Our cluster will look like this:

The above cluster configuration ensures that the Mesos cluster is highly available because of the multiple masters. Leader election, coordination and detection are Zookeeper’s responsibility. Later in this post we will show how all these are configured to work together as a team. Operational Guidelines and High Availability are good reads to learn and understand more about this topic.

Installation
On each of the nodes we first add the Mesosphere APT repositories to the repository source lists, add the relevant keys, and update the system.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ echo "deb http://repos.mesosphere.io/ubuntu trusty main" | sudo tee /etc/apt/sources.list.d/mesosphere.list
$ sudo apt-get -y update

If you are using a version other than Ubuntu 14.04 then you will have to change the above lines accordingly, and if you are using another distribution like CentOS then you will have to use the relevant rpm and yum commands. This applies everywhere henceforth.

On master nodes:
In our configuration, we are running Marathon on the same boxes as the Mesos masters. The folks at Mesosphere have created a meta-package called mesosphere which installs Mesos, Marathon and also Zookeeper.

$ sudo apt-get install mesosphere

On slave nodes:
On slave nodes, we require only zookeeper and mesos installed.
The following command should take care of it.

$ sudo apt-get install mesos

As mentioned above, installing these packages does more than just install software. Much of the plumbing work is taken care of for you: you need not worry whether the mandatory “work_dir”, without which Apache Mesos would not run, has been created, and other such important things. If you want to understand more, extracting the scripts from the packages and studying them is highly recommended. That is what I did as well.

You can save a lot of time if you clone this repository and then run the following command inside your copy.

$ vagrant up

This command will launch a cluster, set the IPs for all nodes, and install all the packages required to follow this post. You are now ready to configure your cluster.

Configuration
In this section we will configure each tool/application one by one. We will start with Zookeeper, then the Mesos masters, then the Mesos slaves and finally Marathon.

Zookeeper
Let us stop Apache Zookeeper on all nodes (masters and slaves).

$ sudo service zookeeper stop

Now let us configure Apache Zookeeper on the masters. Do the following steps on each master. Edit /etc/zookeeper/conf/myid on each of the master nodes. Replace the boilerplate text in this file with a unique number (per server) from 1 to 255. These numbers will be the IDs for the servers coordinated by Zookeeper. Let’s choose 10, 30 and 50 as IDs for the 3 Mesos master nodes, and save the files after adding 10, 30 and 50 respectively to /etc/zookeeper/conf/myid on the corresponding nodes. Here’s what I had to do on the first master node; the same has to be repeated on the other nodes with their respective IDs.

$ echo 10 | sudo tee /etc/zookeeper/conf/myid

Next we edit the Zookeeper configuration file (/etc/zookeeper/conf/zoo.cfg) on each master node. For the purpose of this blog we are just adding the master node IPs and the relevant server IDs chosen in the previous step. Note the configuration template line below: server.id=host:port1:port2. port1 is used by peer Zookeeper servers to communicate with each other, and port2 is used for leader election. The recommended values are 2888 and 3888 for port1 and port2 respectively, but you can choose custom values for your cluster. Assuming that you have chosen the IP range 10.10.20.11-13 for your Mesos masters as mentioned above, edit /etc/zookeeper/conf/zoo.cfg to reflect the following:

# /etc/zookeeper/conf/zoo.cfg
server.10=10.10.20.11:2888:3888
server.30=10.10.20.12:2888:3888
server.50=10.10.20.13:2888:3888

This file has many other Zookeeper-related configurations which are beyond the scope of this post. If you are using the packages mentioned above, the configuration templates should be a lot of help; definitely read the comments sections, as there is a lot to learn there. This is a good tutorial for understanding the fundamentals of Zookeeper, and this document is perhaps the latest and best reference on administering Apache Zookeeper; this section in particular is relevant to what we are doing.

All Nodes
Zookeeper Connection Details
For all nodes (masters and slaves), we have to set up the Zookeeper connection details. These are stored in /etc/mesos/zk, a configuration file that you get thanks to the packages. Edit this file on each node and add the following URL carefully.

# /etc/mesos/zk
zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/mesos

Port 2181 is Zookeeper’s client port, on which it listens for client connections.
The IP addresses will differ if you have chosen different IPs for your servers.

IP Addresses
Next we set up the IP address information for all nodes (masters and slaves).

Masters
$ echo  | sudo tee /etc/mesos-master/ip
$ sudo cp /etc/mesos-master/ip /etc/mesos-master/hostname

Write the IP of the node in the file. Save and close the file.

Slaves
$ echo  | sudo tee /etc/mesos-slave/ip
$ sudo cp /etc/mesos-slave/ip /etc/mesos-slave/hostname

Write the IP of the node in the file. Save and close the file. Keeping the hostname the same as the IP makes DNS resolution easier.

If you are using the Mesosphere packages, then you get a bunch of intelligent defaults. One of the most important things you get is a convenient way to pass CLI options to Mesos. All you need to do is create a file with the same name as the CLI option and put in it the value that you want to pass to Mesos (master or slave). The file needs to be copied to the correct directory: for Mesos masters, copy it to /etc/mesos-master, and for slaves, copy it to /etc/mesos-slave. For example:

$ echo 5050 | sudo tee /etc/mesos-slave/port

We will see some examples of similar configuration setup below. Here you can find all the CLI options that you can pass to the Mesos master/slave.

Mesos Servers
We need to set a quorum for the servers. This can be done by editing /etc/mesos-master/quorum and setting it to a correct value. For our case, the quorum value can be 2 or 3; we will use 2 in this post. The quorum is the strict majority: since we chose 2 as the quorum value, it means that out of 3 masters, we definitely need at least 2 master nodes running for our cluster to work properly.

We need to stop the slave service on all masters if it is running. If it is not, the following command might give you a harmless warning.

$ sudo service mesos-slave stop

Then we disable the slave service by setting a manual override.

$ echo manual | sudo tee /etc/init/mesos-slave.override

Mesos Slaves
Similarly, we need to stop the master service on all slaves if it is running. If it is not, the following command might give you a harmless warning. We also set the master and zookeeper services on each slave to manual override.

$ sudo service mesos-master stop
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override

The above .override files are read by upstart on an Ubuntu box to start/stop processes. If you are using a different distribution, or even Ubuntu 15.04, then you might have to do this differently.

Marathon
We can now configure Marathon, for which some work needs to be done. We will configure Marathon only on the server nodes. First create a directory for the Marathon configuration.

$ sudo mkdir -p /etc/marathon/conf

Then, as we did before, we set configuration properties by creating files with the same name as the property to be set, with the value of the property as the only content of the file (see above). The Marathon binary needs to know the values for --master and --hostname. We can reuse the files that we used for the Mesos configuration.

$ sudo cp /etc/mesos-master/ip /etc/marathon/conf/hostname
$ sudo cp /etc/mesos/zk /etc/marathon/conf/master

To make sure Marathon can use Zookeeper, do the following (note that the endpoint is different in this case, i.e. marathon):

$ echo zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/marathon \
  | sudo tee /etc/marathon/conf/zk

Here you can find all the command line options that you can pass to Marathon.

Starting Services
Now that we have configured our cluster, we can resume all services.

Master
$ sudo service zookeeper start
$ sudo service mesos-master start
$ sudo service marathon start

Slave
$ sudo service mesos-slave start

Running Your Application
Marathon provides a nice web UI to set up your application. It also provides an excellent REST API to create, launch and scale applications, check health status and more. Go to your Marathon web UI; if you followed the above instructions, the URL should be one of the Mesos masters on port 8080 (i.e. http://10.10.20.11:8080). Click on the “New App” button to deploy a new application and fill in the details. The application ID is mandatory. Select relevant values for CPU, memory and disk space for your application. For now let the number of instances be 1; we will increase it later when we scale up the application in our shiny new cluster.

There are a few optional settings that you might have to take care of depending on how your slaves are provisioned and configured. For this post, I made sure each slave had Ruby, Ruby-related dependencies and the Bundler gem installed. I took care of this when I launched and provisioned the slave nodes.

One of the important optional settings is the “Command” that Marathon executes. Marathon monitors this command and reruns it if it stops for some reason. This is how Marathon earns its claim to fame as an “init” that runs long-running applications. For this post, I have used the following command (without the quotes).

“cd hello && bundle install && RAILS_ENV=production bundle exec unicorn -p 9999”

This command reads the Gemfile in the Rails application, installs all the gems required by the application, and then runs the application on port 9999. I am using a sample Ruby on Rails application, and I have put the URL of the tarred application in the URI field. Marathon understands a few archive/package formats and takes care of unpacking them, so we needn’t worry about that. Applications need resources to run properly, and URIs can be used for this purpose. Read more about applications and resources here.

Once you click “Create”, you will see that Marathon starts deploying the Rails application. A slave is selected by Mesos, the application tarball is downloaded and untarred, requirements are installed and the application is run. You can monitor all these steps by watching the “Sandbox” logs that you should find on the main Mesos web UI page. When the state of the task changes from “Staging” to “Running”, we have a Rails application running via Marathon on a Mesos slave node. Hurrah! If you followed the steps above and read the “Sandbox” logs, you know the IP of the node where the application was deployed. Navigate to SLAVE_NODE_IP:9999 to see your Rails application running.

Scaling Your Application
All good, but how do we scale? After all, the idea is for our application to reach web scale and become the next Twitter, and this post is all about scaling applications with Mesos and Marathon. So this is going to be difficult! Scaling up or down is difficult, but not when you have Mesos and Marathon for company. Navigate to the application page in the Marathon UI. You should see a button that says “Scale”. Click on it and increase the number to 2 or 3 or whatever you prefer (assuming that you have that many slave nodes).
In this post we have 3 slave nodes, so I can choose 2 or 3. I chose 3. And voila! The application is deployed seamlessly to the other two nodes, just as it was deployed to the first node. You can see for yourself by navigating to SLAVE_NODE_IP:9999, where SLAVE_NODE_IP is the IP of a slave where the application was deployed. And there you go: you have your application running on multiple nodes. It would be trivial to put these IPs behind a load balancer and a reverse proxy so that access to your application is as simple as possible.

Graceful Degradation (and Vice Versa)
Sometimes nodes in your cluster go down for one reason or another. Very often you get an email from your IaaS provider saying that your node will be retired in a few days, and at other times a node dies before you can figure out what happened. When such inevitable things happen and the node in question is part of the cluster running the application, the dynamic duo of Mesos and Marathon have your back. The system will detect the failure, de-register the slave and deploy the application to a different slave available in the cluster. You could tie this up with your IaaS-provided scaling option and spawn the required number of new slave nodes as part of your cluster, which, once registered with the Mesos cluster, can run your application.

Marathon REST API
Although we have used the web UI to add a new application and scale it, we could have done the same (and much more) using the REST API, and thus perform Marathon operations from a program or script. Here’s a simple example that will scale the application to 2 instances. Use any REST client, or just curl, to make a PUT request to the application ID, in our case http://10.10.20.11:8080/v2/apps/rails-app-on-mesos-marathon, with the following JSON data as payload. You will notice that Marathon deploys the application to another instance if there was only 1 instance before.

{ "instances": 2 }

You can do much more than the above: run health checks, add/suspend/kill/scale applications, and so on. This could become a complete blog post in itself and will be dealt with at a later time.

Conclusion
Scaling your application becomes as easy as pressing buttons with the combination of Mesos and Marathon. Setting up a cluster can become almost trivial once you get your requirements in place and, ideally, automate the configuration and provisioning of your nodes. For this post, I relied on a simple Vagrantfile and a shell script to provision the systems; later I configured them by hand as per the steps above. Using Chef or a similar tool would make the configuration step a single-command job. In fact there are a few open-source projects that are already very successful and do just that. I have played with everpeace/vagrant-mesos and it is an excellent starting point. Reading the code from these projects will help you understand a lot about building and configuring clusters with Mesos. There are other projects that do similar things to Marathon, and sometimes more; I would definitely like to mention Apache Aurora and HubSpot’s Singularity.
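The same scaling call is easy to script. Below is a minimal Python sketch using the requests library against the Marathon endpoint and application ID used in this post; treat the URL as an assumption tied to this example cluster rather than a general default.

import requests

# Marathon REST endpoint for the application created earlier in this post.
MARATHON_APP_URL = "http://10.10.20.11:8080/v2/apps/rails-app-on-mesos-marathon"

# Ask Marathon to run 2 instances of the application. Marathon answers with a
# JSON body describing the triggered deployment.
response = requests.put(MARATHON_APP_URL, json={"instances": 2})
response.raise_for_status()
print(response.json())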

Aziro Marketing


Role of Apache Cassandra in the CAP Theorem

Have you heard of Cassandra? Wikipedia describes it quite aptly: “Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.” Ideal for high scalability without compromising on performance, Cassandra is the perfect database platform for mission-critical data. This blog helps engineers understand what Cassandra is, how Cassandra works, why we need Cassandra in our applications, and how to use the features and capabilities of Apache Cassandra.

Basics First
There is a very famous theorem in the database world, the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency – data should be the same on all the nodes in the cluster. If a user reads or writes from any node, the user should get the same data.
Availability – at any point in time, the database should be accessible for reads and writes, and there should not be any downtime in accessing the database.
Partition Tolerance – in a distributed system, the cluster continues to function even if there is a communication breakdown between two nodes. In this case, the nodes are up in the cluster but unable to communicate with each other, yet the system should still work as expected.

According to this theorem, a distributed system cannot satisfy all three of these guarantees at the same time. In practice, it says you can have either CA, CP or AP in any database. Well, imagine if you were able to create a new database system that supports all of CAP! That would be a priceless innovation in the database world, and very lucrative indeed.

Where Does Cassandra Fit in CAP?
Firstly, Cassandra is a database, and it is classified as AP in CAP. So this is a database that focuses on, and gives importance to, Availability and Partition tolerance. But believe me, the beautiful feature of this database is that we can tune it to also meet Consistency. That surely piques the curiosity of IT folks; I will come to that soon.

What is NoSQL?
Having reasonable working experience with NoSQL databases, I can assure you that NoSQL is still a buzzword in the database world. For easy understanding, I would like to list what people say about NoSQL, along with my opinion on it:

NoSQL is horizontally scalable – Agreed.
NoSQL violates the ACID principle – Not all NoSQL databases. It depends on the database, since most of them partially support ACID: Mongo, HBase and, in some configurations, Cassandra support full durability, and Mongo and HBase support row-level locking. But of course there is no concept of a transaction in these NoSQL databases.
NoSQL is a key-value store architecture – Perfect. That is the core concept of NoSQL, and it is what supports faster writes and reads.
NoSQL is for big data – Agreed.

Yes, all of these characterize NoSQL in the database world. It relaxes ACID, and its key-value store structure definitely departs from the core principles and concepts of relational databases, which is why it is also called “Not only SQL”. I would say that NoSQL sacrifices these principles and concepts to provide performance and data scalability. NoSQL effectively says: take care of ACID in your client code, and in return I will give you performance.

Coming to Cassandra
Cassandra is a NoSQL database, and it is not a master-slave database, which means all the nodes in a Cassandra cluster are the same.
It is a peer-to-peer distributed database, so it has a masterless architecture. (P.S. Throughout this blog, NODE denotes a Cassandra node.)

Masterless Architecture
In master-slave databases like MongoDB or HBase, there will be downtime if the master goes down, while we wait for the next master to come up. That is not the case with Cassandra. It has no special nodes: the cluster has no masters, no slaves and no elected leaders. This enables Cassandra to be highly available with no single point of failure, and it is the reason Cassandra supports the ‘A’ in CAP.

As mentioned, it is a distributed database system, which means a single logical database is spread across a cluster of nodes, and hence the need to spread data evenly among all participating nodes. Cassandra stores data by dividing it evenly across its cluster of nodes, with each node responsible for part of the data. This is how it supports the ‘P’ in CAP. So it is a database that supports AP.

This answers the question of why we need Cassandra in our applications: applications that demand zero downtime need a masterless architecture, and that is where Cassandra delivers value. In simple words, writes and reads can happen from any node in the cluster at any point in time. The example below shows a sample cluster formation in Cassandra for a 5-node setup.

Better Throughput
Another highlight of Cassandra is that it can provide better workload performance as the number of nodes increases, as the diagram below demonstrates. As per Cassandra’s design, if two nodes can handle 100K transactions per second, then 4 nodes can handle 200K transactions per second, and the capacity keeps multiplying in the same way.

The buck does not stop here. There is definitely more to Cassandra, and one can keep exploring to learn more.
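To make the “tunable consistency” point above concrete, here is a minimal sketch using the DataStax Python driver; it is not part of the original post, and the contact points, keyspace, table and replication factor are assumptions chosen for illustration. Raising the per-statement consistency level to QUORUM is how Cassandra, AP by default, is tuned toward C.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical three-node cluster; with the masterless design, any node can
# coordinate reads and writes.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("demo_keyspace")  # assumed keyspace with replication factor 3

# QUORUM requires a majority of replicas (2 of 3 here) to acknowledge.
# With W + R > RF (2 + 2 > 3), a QUORUM read is guaranteed to see the
# latest acknowledged QUORUM write.
write_stmt = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write_stmt, (1, "alice"))

read_stmt = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read_stmt, (1,)).one()
print(row.name)

# Using ConsistencyLevel.ONE instead trades that guarantee back for lower
# latency and higher availability, which is Cassandra's default posture.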

Aziro Marketing

