Tag Archive

Below you'll find a list of all posts that have been tagged as "big data"

10 Steps to Set Up and Manage a Hadoop Cluster Using Ironfan

Recently, we faced a unique challenge – setting up DevOps and management for a relatively complex Hadoop cluster on the Amazon EC2 cloud. The obvious choice was to use a configuration management tool, and having extensively used Opscode's Chef, with the flexibility and extensibility it provides, Chef was the natural pick. While looking around for best practices to manage a Hadoop cluster using Chef, we stumbled upon Ironfan.

What is Ironfan?

In short, Ironfan, open-sourced by InfoChimps, provides an abstraction on top of Chef, allowing users to easily provision, deploy and manage a cluster of servers – be it a simple web application or a complex Hadoop cluster. After a few experiments, we were convinced that Ironfan was the right thing to use, as it simplifies a lot of configuration and avoids repetition while retaining the goodness of Chef. This blog shows how easy it is to set up and manage a Hadoop cluster using Ironfan.

Prerequisites:
- A Chef account (hosted or private) with knife.rb set up correctly on your client machine.
- A Ruby setup (using RVM or otherwise).

Installation:

Now you can install Ironfan on your machine using the steps mentioned here. Once you have all the packages set up correctly, perform these sanity checks:
- Ensure that the environment variable CHEF_USERNAME is your Chef Server username (unless your USER environment variable is the same as your Chef username).
- Ensure that the environment variable CHEF_HOMEBASE points to the location which contains the expanded-out knife.rb.
- ~/.chef should be a symbolic link to your knife directory in the CHEF_HOMEBASE.
- Your knife/knife.rb file is not modified.
- Your Chef user PEM file should be in knife/credentials/{username}.pem.
- Your organization's Chef validator PEM file should be in knife/credentials/{organization}-validator.pem.
- Your knife/credentials/knife-{organization}.rb file should contain your Chef organization, the chef_server_url, the validation_client_name, the path to the validation_key, the aws_access_key_id / aws_secret_access_key, and an AMI ID of an AMI you'd like to be able to boot in ec2_image_info.

Finally, in the homebase, rename the example_clusters directory to clusters. These are sample clusters that come with Ironfan. Perform a knife cluster list command:

$ knife cluster list
Cluster Path: /.../homebase/clusters
+----------------+-------------------------------------------------------+
| cluster        | path                                                  |
+----------------+-------------------------------------------------------+
| big_hadoop     | /.../homebase/clusters/big_hadoop.rb                  |
| burninator     | /.../homebase/clusters/burninator.rb                  |
...

Defining a Cluster:

Now let's define a cluster. A cluster in Ironfan is defined by a single file which describes all the configuration essential for that cluster. You can customize your cluster spec as follows:
- Define cloud provider settings
- Define base roles
- Define various facets
- Define facet-specific roles and recipes
- Override properties of a particular facet server instance

Defining cloud provider settings: Ironfan currently supports the AWS and Rackspace cloud providers. We will take the example of the AWS cloud provider. For AWS you can provide configuration information like:
- Region in which the servers will be deployed
- Availability zone to be used
- EBS-backed or instance-store-backed servers
- Base image (AMI) to be used to spawn servers
- Security group with the allowed port range

Defining base roles: You can define the global roles for a cluster. These roles will be applied to all servers unless explicitly overridden for any particular facet or server. All the available roles are defined in the $CHEF_HOMEBASE/roles directory. You can create a custom role and use it in your cluster config.

Defining the environment: Environments in Chef provide a mechanism for managing different environments such as production, staging, development, testing, etc. with one Chef setup (or one organization on Hosted Chef). With environments, you can specify per-environment run lists in roles, per-environment cookbook versions, and environment attributes. The available environments can be found in the $CHEF_HOMEBASE/environments directory. Custom environments can be created and used.

Ironfan.cluster 'my_first_cluster' do
  # Environment under which Chef nodes will be placed
  environment :dev

  # Global roles for all servers
  role :systemwide
  role :ssh

  # Global EC2 cloud settings
  cloud(:ec2) do
    permanent true
    region 'us-east-1'
    availability_zones ['us-east-1c', 'us-east-1d']
    flavor 't1.micro'
    backing 'ebs'
    image_name 'ironfan-natty'
    chef_client_script 'client.rb'
    security_group(:ssh).authorize_port_range(22..22)
    mount_ephemerals
  end

Defining facets: Facets are groups of servers within a cluster which share common attributes and roles. For example, if your cluster has 2 app servers and 2 database servers, you can group the app servers under an app_server facet and the database servers under a database facet.

Defining facet-specific roles and recipes: You can define roles and recipes particular to a facet. Even the global cloud settings can be overridden for a particular facet.

  facet :master do
    instances 1
    recipe 'nginx'
    cloud(:ec2) do
      flavor 'm1.small'
      security_group(:web) do
        authorize_port_range(80..80)
        authorize_port_range(443..443)
      end
    end
    role :hadoop_namenode
    role :hadoop_secondarynn
    role :hadoop_jobtracker
    role :hadoop_datanode
    role :hadoop_tasktracker
  end

  facet :worker do
    instances 2
    role :hadoop_datanode
    role :hadoop_tasktracker
  end

In the above example we have defined a facet for the Hadoop master node and a facet for the worker nodes. The number of instances of master is set to 1 and that of worker is set to 2. Each of the master and worker facets has been assigned a set of roles. For the master facet we have overridden the EC2 flavor setting to m1.small, and the security group for the master node is set to accept incoming traffic on ports 80 and 443.

Cluster Management:

Now that we are ready with the cluster configuration, let's get hands-on with cluster management. All the cluster configuration files are placed under the $CHEF_HOMEBASE/clusters directory. We will place our new config file as hadoop_job001_cluster.rb. Now our new cluster should be listed in the cluster list.

List clusters:

$ knife cluster list
Cluster Path: /.../homebase/clusters
+---------------+--------------------------------------------+
| cluster       | path                                       |
+---------------+--------------------------------------------+
| hadoop_job001 | HOMEBASE/clusters/hadoop_job001_cluster.rb |
+---------------+--------------------------------------------+

Show cluster configuration:

$ knife cluster show hadoop_job001
Inventorying servers in hadoop_job001 cluster, all facets, all servers
my_first_cluster: Loading chef
my_first_cluster: Loading ec2
my_first_cluster: Reconciling DSL and provider information
+------------------------+-------+-------------+----------+------------+-----+
| Name                   | Chef? | State       | Flavor   | AZ         | Env |
+------------------------+-------+-------------+----------+------------+-----+
| hadoop_job001-master-0 | no    | not running | m1.small | us-east-1c | dev |
| hadoop_job001-client-0 | no    | not running | t1.micro | us-east-1c | dev |
| hadoop_job001-client-1 | no    | not running | t1.micro | us-east-1c | dev |
+------------------------+-------+-------------+----------+------------+-----+

Launch the whole cluster:

$ knife cluster launch hadoop_job001
Loaded information for 3 computer(s) in cluster my_first_cluster
+------------------------+-------+---------+----------+------------+-----+------------+----------------+----------------+------------+
| Name                   | Chef? | State   | Flavor   | AZ         | Env | MachineID  | Public IP      | Private IP     | Created On |
+------------------------+-------+---------+----------+------------+-----+------------+----------------+----------------+------------+
| hadoop_job001-master-0 | yes   | running | m1.small | us-east-1c | dev | i-c9e117b5 | 101.23.157.51  | 10.106.57.77   | 2012-12-10 |
| hadoop_job001-client-0 | yes   | running | t1.micro | us-east-1c | dev | i-cfe117b3 | 101.23.157.52  | 10.106.57.78   | 2012-12-10 |
| hadoop_job001-client-1 | yes   | running | t1.micro | us-east-1c | dev | i-cbe117b7 | 101.23.157.52  | 10.106.57.79   | 2012-12-10 |
+------------------------+-------+---------+----------+------------+-----+------------+----------------+----------------+------------+

Launch a single instance of a facet:

$ knife cluster launch hadoop_job001 master 0

Launch all instances of a facet:

$ knife cluster launch hadoop_job001 worker

Stop the whole cluster:

$ knife cluster stop hadoop_job001

Stop a single instance of a facet:

$ knife cluster stop hadoop_job001 master 0

Stop all instances of a facet:

$ knife cluster stop hadoop_job001 worker

Setting up a Hadoop cluster and managing it cannot get easier than this! Just to recap, Ironfan, open-sourced by InfoChimps, is a systems provisioning and deployment tool which automates systems configuration for the entire Big Data stack, including tools for data ingestion, scraping, storage, computation, and monitoring. There is another tool that we are exploring for Hadoop cluster management – Apache Ambari. We will post our findings and comparisons soon, stay tuned!

Aziro Marketing


8 Steps To Foolproof Your Big Data Testing Cycle

Big Data refers to all the data that is generated across the globe at an unprecedented rate. This data could be either structured or unstructured. Comprehending this information – untangling the patterns, uncovering the trends, and revealing unseen connections within the vast sea of data – is critical and, in reality, a massively rewarding undertaking. Better data leads to better decision making and an improved way to strategize for organizations, irrespective of their size. The best ventures of tomorrow will be the ones that can make sense of all that data at extremely high volumes and speeds to capture newer markets and customer bases.

Why is Big Data Testing required?

With the introduction of Big Data, it becomes especially vital to test the big data system accurately and with suitable data. If not tested properly, it would affect the business significantly; thus, automation becomes a key part of Big Data Testing. If Big Data Testing is done incorrectly, it becomes extremely hard to understand an error and how it happened, and finding the likely solution and mitigation steps can take a long time, resulting in incorrect or missing data; correcting that data is again a huge challenge, since the currently streaming data must not be affected. As the data is critical, it is recommended to have appropriate mechanisms in place so that data isn't lost or corrupted, and proper mechanisms should be used to handle failovers.

Big Data has certain characteristics and hence is defined using the 4 Vs, namely:
- Volume: the amount of data that organizations can collect. It is huge, and consequently the volume of the data becomes a basic factor in Big Data analytics.
- Velocity: the rate at which new data is being created – thanks to our reliance on the web, sensors, and machine-to-machine data – is likewise important, so that Big Data can be parsed in a timely manner.
- Variety: the data that is produced is heterogeneous; it could be of different types like video, text, database records, numeric or sensor data, and so on, and consequently understanding the kind of Big Data is a key factor in unlocking its potential.
- Veracity: knowing whether the available data is coming from a credible source is of utmost importance before deciphering and using Big Data for business needs.

Here is a concise explanation of how exactly organizations are using Big Data: when Big Data is transformed into pieces of information, it becomes quite straightforward for most business enterprises, as they then understand what their customers want, which products are fast moving, what customers expect from customer service, how to speed up time to market, ways to reduce costs, and strategies to build economies of scale in a highly productive way. Thus Big Data distinctly leads to big-time benefits for organizations, and hence naturally there is such a huge amount of interest in it from all around the world.

Testing Big Data (source: Guru99.com):

Let us have a look at the scenarios for which Big Data Testing can be used across the Big Data components:

1. Data Ingestion: This step is considered the pre-Hadoop stage, where data is generated from various sources and flows into HDFS. In this step, the testers verify that data is extracted properly and loaded into HDFS.
- Ensure the right data from the various sources is ingested; i.e., every required datum is ingested as per its defined mapping, and data with a non-matching schema is not ingested. Data which does not match the schema should be stored separately, with details stating the reason.
- Compare the source data with the ingested data to validate that the correct data is pushed (a minimal scripted sketch of such checks appears at the end of this post).
- Verify that the correct data files are generated and loaded into HDFS correctly, into the desired location.

2. Data Processing: This step is used for validating Map-Reduce jobs. Map-Reduce is a concept used for condensing a large amount of data into aggregated data. The ingested data is processed by executing Map-Reduce jobs which give the desired results. In this step, the tester verifies that the ingested data is processed using Map-Reduce jobs and validates whether the business logic is implemented correctly.

3. Data Storage: This step is used for storing output data in HDFS or some other storage system (for example, a data warehouse). In this step the tester verifies that the output data is correctly generated and loaded into the storage system.
- Validate that data is aggregated after the Map-Reduce jobs.
- Verify that the right data is loaded into the storage system, and discard any intermediate data which is present.
- Verify that there is no data corruption by comparing the output data with the HDFS (or other storage system) data.

The other types of testing scenarios a Big Data tester can cover are:

4. Check whether proper alerting mechanisms are implemented, for example mail on alert, sending metrics to CloudWatch, and so on.
5. Check whether exceptions or errors are displayed properly, with an appropriate exception message, so that troubleshooting an error becomes easy.
6. Performance testing, to test the different parameters for processing an arbitrary chunk of large data and to monitor parameters such as the time taken to complete Map-Reduce jobs, memory usage, disk usage, and other metrics as required.
7. Integration testing, for testing the complete workflow right from data ingestion to data storage/visualization.
8. Architecture testing, for testing that Hadoop is highly available all the time and that failover services are properly implemented, to ensure data is processed even in case of node failures.

Note: For testing, it is very important to generate data that covers various test scenarios (positive and negative). Positive test scenarios cover scenarios which are directly related to the functionality. Negative test scenarios cover scenarios which do not have a direct relation with the desired functionality.

Tools used in Big Data Testing:
- Data Ingestion – Kafka, Zookeeper, Sqoop, Flume, Storm, Amazon Kinesis.
- Data Processing – Hadoop (Map-Reduce), Cascading, Oozie, Hive, Pig.
- Data Storage – HDFS (Hadoop Distributed File System), Amazon S3, HBase.
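To make the data-ingestion checks above a little more concrete, here is a minimal, hypothetical sketch in Python (standard library only) of the kind of reconciliation a tester might automate: comparing record counts and schemas between a source extract and data staged back from HDFS. The file paths, field names, and the idea of validating locally staged copies are illustrative assumptions, not part of the original post.

import csv

EXPECTED_FIELDS = ["id", "timestamp", "metric_value"]  # hypothetical schema

def load_rows(path):
    """Read a CSV extract into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))

def validate_ingestion(source_path, ingested_path):
    """Compare a source extract with data staged back from HDFS.

    Returns a list of human-readable findings; an empty list means the
    basic count and schema checks passed.
    """
    source = load_rows(source_path)
    ingested = load_rows(ingested_path)
    findings = []

    # Check 1: no records dropped or duplicated during ingestion.
    if len(source) != len(ingested):
        findings.append(
            f"row count mismatch: source={len(source)} ingested={len(ingested)}")

    # Check 2: every ingested record matches the defined mapping/schema.
    for i, row in enumerate(ingested):
        if sorted(row.keys()) != sorted(EXPECTED_FIELDS):
            findings.append(f"row {i}: unexpected fields {sorted(row.keys())}")
            continue
        try:
            float(row["metric_value"])  # numeric field must parse
        except ValueError:
            findings.append(f"row {i}: non-numeric metric_value {row['metric_value']!r}")

    # Check 3: keys present in the source must all appear after ingestion.
    missing = {r["id"] for r in source} - {r["id"] for r in ingested}
    if missing:
        findings.append(f"{len(missing)} source ids missing after ingestion")
    return findings

if __name__ == "__main__":
    # Assumes the HDFS copy was staged locally first, for example with:
    #   hdfs dfs -get /data/ingested/part-00000.csv ingested.csv
    for issue in validate_ingestion("source_extract.csv", "ingested.csv"):
        print("FAIL:", issue)

In practice such checks would run against the real storage layer (HDFS, Hive, etc.) rather than local files, but the structure – count reconciliation, schema validation, key coverage – stays the same.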

Aziro Marketing


Big Data and Your Privacy: How Concerned Should You Really Be?

Today, every IT-related service, online or offline, is driven by data. In the last few years alone, the explosion of social media has given rise to a humongous amount of data, which is all but impossible to manipulate without specific high-end computing systems. In general, normal people like us are familiar with kilobytes, megabytes, and gigabytes of data, some even with terabytes. But when it comes to the Internet, data is measured on entirely different scales. There are petabytes, exabytes, zettabytes, and yottabytes. A petabyte is a million gigabytes, an exabyte is a billion gigabytes, and so on.

A Few Interesting Statistics

Let me pique your interest with a few statistics here from various sources:

90 percent of the data in existence in the world was created in the last two years alone.

The reason why Amazon sells five times as much as Wal-Mart, Target, and Buy.com combined is that the company steadily grew from a miniature bookseller into a 74-billion-dollar-revenue business by incorporating all the statistical customer data it has gathered since 1994. In a week, Amazon targets close to 130 million customers – imagine the enormous amount of big data it can gather from them.

Google's former CEO and current executive chairman, Eric Schmidt, once said: "If you have something that you don't want anyone to know, maybe you shouldn't be doing it in the first place." The significance of this statement is evident when you realize the magnitude of data that the search giant crunches every second. In its expansive index, Google has stored anywhere between 15 and 20 billion web pages, as in this statistic. On a daily basis, Google processes five billion queries. Beyond these, through the numerous Google apps that you continuously use, such as Gmail, Maps, Android, Google+, Places, Blogger, News, YouTube, Play, Drive, Calendar, etc., Google is collecting data about you on a huge scale.

All of this data is known in industry circles as "big data." Processing such huge chunks of data is not really possible with your existing hardware and software. That's the reason why there are industry-standard frameworks for the purpose. Apache Hadoop, which Google also uses, is one such system. Various components of Hadoop – HDFS, MapReduce, YARN, etc. – are capable of intense data manipulation and processing. Similar to Hadoop, Apache Storm is a big data processing technology used by Twitter, Groupon, and Alibaba (the largest online retailer in the world).

The effects and business benefits of big data can be quite significant. Imagine the growth of Amazon in the last few years. In that ginormous article, George Packer gives a "brief" account of Amazon's growth in the past few years: from "the largest" bookseller to the multi-product online-retail behemoth it is today. What made that happen? In essence, the question is what makes the internet giants what they are today. Companies such as Facebook, Google, Amazon, Microsoft, Apple, Twitter, etc., have reached the position they hold today by systematically processing the big data generated by their users – including you.

In essence, data processing is an essential tool for success on today's Internet. How is the processing of your data affecting your privacy? Some of these internet giants gather and process more data than all governments combined. There really is a concern for your privacy, isn't there?

Look at the National Security Agency (NSA) of the US. It's estimated that the NSA has a tap on every smartphone communication that happens across the world, through any company that has been established in the United States. The NSA is the new CIA, at least in the world of technology. Remember the PRISM program that the NSA contractor Edward Snowden blew the whistle on. For six years, PRISM remained under cover; now we know that the extent of data collected by this program is several times in magnitude the data collected by any technology company. Not only that: the NSA, as reported by the Washington Post, has a surveillance system that can record 100 percent of telephone calls from any country, not only the United States. Also, the NSA allegedly has the capability to remotely install a spy app (known as Dropoutjeep) on all iPhones. The spy app can then activate the iPhone's camera and microphone to gather real-time intelligence about the owner's conversations. The independent security analyst and hacker Jacob Appelbaum reported this capability of the NSA.

The NSA gets a recording of every activity you do online: telephone and VoIP conversations, browsing history, messages, email, online purchases, etc. In essence, this big data collection is the biggest breach of personal privacy in human history. While the government assures us that the entire process is for national security, there are definitely concerns from the general public.

Privacy Concerns

While on one side companies are using your data to grow their profit, governments are using this big data to further surveillance. In a nutshell, this could all mean one thing: no privacy for the average individual. As far back as 2001, industry analyst Doug Laney characterized big data with three Vs: volume, velocity, and variety – volume for the vastness of the data that comes from the peoples of the world (which we saw earlier); velocity for the breathtaking speed at which the data arrives; and variety for the sizeable metadata used to categorize the raw data.

What real danger is there in sharing your data with the world? For one thing, if you are strongly concerned about your own privacy, you shouldn't be doing anything online or over your phone. While sharing your data can help companies like Google, Facebook, and Microsoft show you relevant ads (while increasing their advertising revenues), there is virtually no downside for you. The sizeable data generated by your activities goes into a processing phase wherein it is amalgamated with the big data generated by other users like you. It's hence in many ways similar to disappearing into a crowd, something people like us do in the real world on a daily basis.

However, online, there is always a trace that goes back to you – through your country's internet gateway, your specific ISP, and your computer's specific IP address (attached to a timestamp if you have a dynamic IP). So, it's entirely possible to create a log of all the activities you do online. Facebook and Google already have a log, a thing you call your "timeline." Now, the timeline is a simple representation of your activities online, attached to a social media profile, but with a trace on your computer's web access, the data generated is pretty much your life's log. Then it becomes sort of scary.

You are under trace not only while you are in front of your computer but also when you move around with your smartphone. The phone can virtually be tapped to get every bit of your conversations, and its hardware components – camera, GPS, and microphone – can be used to trace your every movement.

When it comes to online security, the choice is between your privacy and better services. If you divulge your information, companies will be able to provide you with some useful ads for products that you may really like (and play God with your life!). On the other hand, there is always an inner fear that your every movement is being watched. To avoid it, you may have to do the things you want to keep secret offline, away from any connected digital device – in essence, any device that has a power source attached.

In an article that I happened to read some time back, it was mentioned that the only way to bypass security surveillance is to remove the battery from your smartphone. The question remains: how can you trust any technology? I mean, there are a huge number of surveillance technologies and projects that people don't know about even now. With PRISM, we came to know about the NSA's tactics, although most of them are an open secret. Which other countries engage in such tactics is still unknown.

Aziro Marketing


Is There an Alternative to Hadoop?

Using big data technologies for your business is a really attractive proposition, and Hadoop makes it even more appealing nowadays. Hadoop is a massively scalable data storage platform that is used as a foundation for many big data projects. Hadoop is powerful; however, it has a steep learning curve in terms of time and other resources. It can be a game changer for companies if Hadoop is applied the right way. Hadoop will be around for a long time, and for good reason, even though it is not the right fit for every problem. For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, lighter-weight alternatives may serve you better.

Apache Hadoop, the open-source framework for storing and analyzing big data, will be embraced by analytics vendors over the next two years as organizations seek out new ways to derive value from their unstructured data, according to a new research report from Gartner.

A few alternatives to Hadoop

As a matter of fact, there are many ways to store and process data in a structured way which stand as alternatives to Hadoop, namely BashReduce, the Disco Project, Spark, GraphLab, and the list goes on. Each one of them is unique in its own way. GraphLab was developed and designed for use in machine learning, with the focus of making the design and implementation of efficient and correct parallel machine learning algorithms easier, while Spark is one of the newest players in the MapReduce field, whose purpose is to make data analytics fast to write and fast to run.

Conclusion: despite all these alternatives, why Hadoop?

One word: HDFS. For a moment, assume you could bring all of your files and data with you everywhere you go. No matter what system, or type of system, you log in to, your data is intact, waiting for you. Suppose you find a cool picture on the Internet. You save it directly to your file store and it goes everywhere you go. HDFS gives users the ability to dump very large data sets (usually log files) to this distributed file system and easily access them with tools, namely Hadoop. Not only does HDFS store a large amount of data, it is fault tolerant. Losing a disk, or a machine, typically does not spell disaster for your data. HDFS has become a reliable way to store data and share it with other open-source data analysis tools. Spark can read data from HDFS, and if you would rather stick with Hadoop, you can try to spice it up.

The adoption trend backs this up: the research firm projects that 65 percent of all "packaged analytic applications with advanced analytics" capabilities will come prepackaged with the Hadoop framework by 2015. The spike in Hadoop adoption will largely be spurred by organizations' need to analyze the massive amounts of unstructured data being produced from nontraditional data sources such as social media. Source: Gartner

"It doesn't take a clairvoyant — or in this case, a research analyst — to see that 'big data' is becoming (if it isn't already, perhaps) a major buzzword in security circles. Much of the securing of big data will need to be handled by thoroughly understanding the data and its usage patterns. Having the ability to identify, control access to, and — where possible — mask sensitive data in big data environments based on policy is an important part of the overall approach." – Ramon Krikken, Research VP, Security and Risk Management Strategies Analyst at Gartner

"Hadoop is not a single entity, it's a conglomeration of multiple projects, each addressing a certain niche within the Hadoop ecosystem such as data access, data integration, DBMS, system management, reporting, analytics, data exploration and much, much more." – Forrester analyst Boris Evelson

Forrester Research, Inc. views Hadoop as "the open source heart of Big Data", regarding it as "the nucleus of the next-generation EDW [enterprise data warehouse] in the cloud," and has published its first ever The Forrester Wave: Enterprise Hadoop Solutions report (February 2, 2012).

Hadoop Streaming is an easy way to avoid the monolith of vanilla Hadoop without leaving HDFS: it allows the user to write map and reduce functions in any language that supports writing to stdout and reading from stdin. Choosing a simple language such as Python for Streaming allows the user to focus more on writing code that processes data rather than on software engineering. A minimal sketch of such a Streaming job follows at the end of this post.

The bottom line is that Hadoop is the future of the cloud EDW. Its footprint in companies' core EDW architectures is likely to keep growing throughout this decade, and Hadoop is likely to assume a dominant role in EDW strategy.

So, what is your experience with big data? Please share with us in the comments section.
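As promised above, here is a minimal sketch of a Hadoop Streaming word-count job in Python. It only illustrates the stdin/stdout contract described in the post; the input path, output path, and the location of the streaming jar are assumptions that vary by Hadoop distribution.

mapper.py:

#!/usr/bin/env python
# mapper.py - reads lines from stdin, emits "word<TAB>1" pairs to stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

reducer.py:

#!/usr/bin/env python
# reducer.py - Streaming sorts mapper output by key, so counts can be
# summed per word as soon as the key changes.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The job could then be submitted with something like the following (the jar path is an assumption):

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /data/logs -output /data/wordcount \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py

A nice property of Streaming scripts is that the same pair can be tested locally without a cluster: cat input.txt | python mapper.py | sort | python reducer.py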

Aziro Marketing


How to Classify a Product Catalogue Using Ensembles

The Problem Statement

The Otto Product Classification Challenge was a competition hosted on Kaggle, a website dedicated to solving complex data science problems. The purpose of this challenge was to classify products into the correct category, based on their recorded features.

Data

The organizers had provided a training data set containing 61,878 entries, and a test data set that had 144,368 entries. The data contained 93 features, based on which the products had to be classified. The target column in the training data set indicated the category of the product. The training and test data sets are available for download here. A sample of the training data set can be seen in Figure 1.

Figure 1: Sample Data

Solution Approach

The features in the training data set had a large variance. An Anscombe transform on the features reduced the variance. In the process, it also transformed the features from an approximately Poisson distribution into an approximately normal distribution. The rest of the data was pretty clean, and thus we could use it directly as input to the classification algorithm. For classification, we tried two approaches – one using the xgboost algorithm, and the other using the deep learning algorithm through h2o. xgboost is an implementation of the extreme gradient boosting algorithm, which can be used effectively for classification.

As discussed in the TFI blog, the gradient boosting algorithm is an ensemble method based on decision trees. For classification, at every branch in the tree, it tries to eliminate a category, to finally have only one category per leaf node. For this, it needs to build trees for each category separately; but since we had only 9 categories, we decided to use it. The deep learning algorithm provided by h2o is based on a multi-layer neural network. It is trained using a variation of the gradient descent method. We used multi-class logarithmic loss as the error metric to find a good model, as this was also the metric used for ranking by Kaggle.

Building the Classifier

Initially, we created a classifier using xgboost. The xgboost configuration for our best submission is as below:

param <- list('objective' = 'multi:softprob',
              'eval_metric' = 'mlogloss',
              'num_class' = 9,
              'nthread' = 8,
              'eta' = 0.1,
              'max_depth' = 27,
              'gamma' = 2,
              'min_child_weight' = 3,
              'subsample' = 0.75,
              'colsample_bytree' = 0.85)
nround <- 5000
classifier <- xgboost(param = param, data = x, label = y, nrounds = nround)

Here, we have specified our objective function to be 'multi:softprob'. This function returns the probabilities for a product being classified into a specific category. The evaluation metric, as specified earlier, is the multi-class logarithmic loss function. The 'eta' parameter shrinks the priors for features, thus making the algorithm less prone to overfitting; it usually takes values between 0.1 and 0.001. The 'max_depth' parameter limits the height of the decision trees; shallow trees are constructed faster, and not specifying this parameter lets the trees grow as deep as required. The 'min_child_weight' parameter controls the splitting of the tree: its value puts a lower bound on the weight of each child node before it can be split further. The 'subsample' parameter makes the algorithm choose a subset of the training set; in our case, it randomly chooses 75% of the training data to build the classifier. The 'colsample_bytree' parameter makes the algorithm choose a subset of features while building each tree, in our case 85%. Both 'subsample' and 'colsample_bytree' help in preventing overfitting. The classifier was built by performing 5000 iterations over the training data set. These parameters were tuned by experimentation, trying to minimize the log-loss error. The log-loss error on the public leaderboard for this configuration was 0.448.

We also created another classifier using the deep learning algorithm provided by h2o. The configuration for this algorithm was as follows:

classification = T, activation = 'RectifierWithDropout',
hidden = c(1024, 512, 256), hidden_dropout_ratios = c(0.5, 0.5, 0.5),
input_dropout_ratio = 0.05, epochs = 50, l1 = 1e-5, l2 = 1e-5,
rho = 0.99, epsilon = 1e-8, train_samples_per_iteration = 4000,
max_w2 = 10, seed = 1

This configuration creates a neural network with 3 hidden layers, with 1024, 512 and 256 neurons respectively; this is specified using the 'hidden' parameter. The activation function in this case is Rectifier with Dropout. The rectifier function filters negative inputs for each neuron. Dropout lets us randomly drop inputs to the hidden neuron layers, which builds better generalizations. The 'hidden_dropout_ratios' parameter specifies the percentage of inputs to the hidden layers to be dropped, and 'input_dropout_ratio' the percentage of inputs to the input layer to be dropped. 'epochs' defines the number of training iterations to be carried out. Setting 'train_samples_per_iteration' makes the algorithm choose a subset of the training data. Setting 'l1' and 'l2' scales the weights assigned to each feature: l1 reduces model complexity, and l2 introduces bias in estimation. 'rho' and 'epsilon' together slow convergence. The 'max_w2' parameter sets an upper limit on the sum of squared incoming weights into a neuron; this needs to be set for the rectifier activation function. 'seed' is the random seed that controls sampling. Using these parameters, we performed 10 iterations of deep learning, and submitted the mean of the 10 results as the output. The log-loss error on the public leaderboard for this configuration was 0.448.

We then merged both results by taking a mean, and that gave us a top 10% submission, with a score of 0.428 (a small sketch of this blending step appears at the end of this post).

Results

Since the public leaderboard evaluation was based on 70% of the test data, the public leaderboard rank was fairly stable. We submitted the best two results, and our rank remained in the top 10% at the end of the final evaluation. Overall, it was a good learning experience.

What We Learnt

Relying on the results of one model may not give the best possible result. Combining the results of various approaches may reduce the error by a significant margin. As can be seen here, the winning solution in this competition had a much more complex ensemble. Complex ensembles may be the way ahead for getting better at complex problems like classification.
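The final blend described above is simply an element-wise average of the two models' class-probability matrices. As an aside, here is a tiny hypothetical sketch of that merging step in Python with pandas (rather than the R/h2o tooling used in the post); the file names and column layout are assumptions based on Kaggle's usual submission format (an id column plus one probability column per class).

import pandas as pd

# Hypothetical submission files: one from xgboost, one from the h2o deep net.
# Both are assumed to share the Kaggle layout: id, Class_1, ..., Class_9.
xgb = pd.read_csv("xgboost_submission.csv").sort_values("id").reset_index(drop=True)
dl = pd.read_csv("deeplearning_submission.csv").sort_values("id").reset_index(drop=True)

class_cols = [c for c in xgb.columns if c != "id"]

# Simple ensemble: the mean of the two predicted probability matrices.
blend = xgb[["id"]].copy()
blend[class_cols] = (xgb[class_cols] + dl[class_cols]) / 2.0

# Re-normalize each row so the class probabilities still sum to 1.
blend[class_cols] = blend[class_cols].div(blend[class_cols].sum(axis=1), axis=0)

blend.to_csv("blended_submission.csv", index=False)

Weighted averages or rank averaging are common refinements of the same idea, but even the plain mean was enough to lower the log-loss here.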

Aziro Marketing


How to Implement Big Data for Your Organization in the Right Way

Let's start with an introduction to what IT across the globe calls "big data." From a use case perspective, few terms are as overused and hackneyed as big data. Some people say it's the entire data in your company, while some others say it's anything above one terabyte; a third group argues it's something you cannot easily tackle. Well, in essence, big data is a broad term applied to data sets so large and complex as to obsolesce traditional data processing systems. This means new systems have to be implemented in order to get a grasp of such large volumes of data.

One might ask what benefits can be drawn by analyzing such a deluge of data. The answer to that can be quite broad. Businesses can draw huge benefits from big data analytics, and the primary and most sought-after of them all is key business intelligence that translates to higher profits.

Traditional business intelligence systems can not only be revamped but also made more beneficial by the implementation of big data systems. Look at Google's Flu Trends, which gives near real-time trends on flu across the globe, with the help of the search data that Google captures from its data centers. This certainly qualifies as big data, and with its help, Google is able to provide an accurate worldwide analysis of flu trends. This is one of the major use cases of big data analysis. When it comes to your organization, a big data analytics implementation can make all the difference in profits. Many organizations have identified that just by implementing a recommendation engine, they are able to perceive a huge difference in sales. Let's look at some key best practices in implementing big data.

1. Analyze Business Requirements
As the first step, you need to know what you'll be using big data tools for. The answer to that lies in your business requirements. You need to gather all business requirements and then analyze them in order to understand them. It's important that all big data projects be aligned to specific business needs.

2. Agile Implementation Is Key
Agile and iterative techniques are required for quick implementation of big data analytics in your organization. Business agility, in its basic sense, is the inherent capability to respond to changes rapidly. Throughout implementation, you may see that the requirements of the organization evolve as it understands and implements big data. Agile and iterative techniques in implementation deliver quick solutions based on the current needs.

3. Business Decision Making
Big data analytics solutions reach their maximum potential if implemented from a business perspective rather than an engineering perspective. This essentially means the data analytics solutions have to be tailor-made for your business rather than being general. Tailor-made, business-centric solutions will help you achieve the results that you are looking for.

4. Make Use of the Data You Have
A key thing that many organizations need to do is understand the potential of the data they already have. One of the reasons why this is essential is to keep up with the ever-increasing competition out there. A Gartner survey says 73 percent of organizations will implement big data in the next two years. The other 27 percent will lose out in competition, for sure! According to IDC, visual data discovery tools will grow 2.5 times faster than the traditional business intelligence market. You already have an edge if you make use of the existing data in your organization. This is a key business decision that you need to make.

5. Don't Abandon Legacy Data Systems
Abandoning expensive legacy data systems that you may already have in place may be a big mistake. Relational database management solutions will not end their race soon, just because a new, smarter kid is on the block. Although RDBMS may be dated, the cost of complete abandonment may be large enough to make it not worthwhile.

6. Evaluate the Data Requirements Carefully
A complete evaluation of the data that your organization gathers on a daily basis is essential, even if big data implementation is not in your immediate business plan. A stakeholder meeting that clearly consolidates everyone's opinion is necessary here.

7. Approach It From the Ground Up
One key thing that you should not forget is that it's nearly impossible to tackle an entire organization's data in one shot. It's better to go at it in a granular way, gradually incorporating data in sets and thoroughly testing the efficiency of the implementation. Taking on too much data in the first step will yield unreliable results or cause a complete collapse of your setup.

8. Set Up Centers of Excellence
In order to optimize knowledge transfer across your organization, tackle any oversights, and correct mistakes, set up a center of excellence. This will also help share costs across the enterprise. One key benefit of setting up centers of excellence is that it ensures the maturity of your information architecture in a systematic way.

Conclusion
Associate big data analytics platforms with the data gathered by your enterprise apps, such as HR management systems, CRMs, ERPs, etc. This will enable information workers to understand and unearth insights from different sources of data.

Big data is associated with four key aspects: volume, variety, veracity, and velocity, according to IBM. This means you have very little time to start analyzing the data, and the infrastructure needed to analyze it and provide insights from it will be well worth your time and investment. This is the reason why there is such immense interest in big data analytics.

Aziro Marketing


How You Can Hyperscale Your Applications Using Mesos & Marathon

In a previous blog post we saw what Apache Mesos is and how it helps to create dynamic partitioning of our available resources, which results in increased utilization, efficiency, reduced latency, and better ROI. We also discussed how to install, configure and run Mesos and sample frameworks. There is much more to Mesos than that.

In this post we will explore and experiment with a close-to-real-life Mesos cluster running a multiple master-slave configuration along with Marathon, a meta-framework that acts as a cluster-wide init and control system for long-running services. We will set up 3 Mesos masters and 3 Mesos slaves, cluster them along with Zookeeper and Marathon, and finally run a Ruby on Rails application on this Mesos cluster. The post will demo scaling the Rails application up and down with the help of Marathon. We will use Vagrant to set up our nodes inside VirtualBox and will link the relevant Vagrantfile later in this post.

To follow this guide you will need to obtain the binaries for Ubuntu 14.04 (64-bit arch) (Trusty) of:
- Apache Mesos
- Marathon
- Apache Zookeeper
- Ruby / Rails
- VirtualBox
- Vagrant
- Vagrant plugins: vagrant-hosts, vagrant-cachier

Let me briefly explain what Marathon and Zookeeper are.

Marathon is a meta-framework you can use to start other Mesos frameworks or applications (anything that you could launch from your standard shell). So if Mesos is your data center kernel, Marathon is your "init" or "upstart". Marathon provides an excellent REST API to start, stop and scale your application.

Apache Zookeeper is a coordination server for distributed systems, used to maintain configuration information and naming and to provide distributed synchronization and group services. We will use Zookeeper to coordinate between the masters themselves and the slaves.

For Apache Mesos, Marathon and Zookeeper we will use the excellent packages from Mesosphere, the company behind Marathon. This will save us a lot of time over building the binaries ourselves. Also, we get to leverage a bunch of helpers that these packages provide, such as creating required directories, configuration files and templates, startup/shutdown scripts, etc. Our cluster will look like this:

The above cluster configuration ensures that the Mesos cluster is highly available because of the multiple masters. Leader election, coordination and detection are Zookeeper's responsibility. Later in this post we will show how all these are configured to work together as a team. Operational Guidelines and High Availability are good reads to learn and understand more about this topic.

Installation

On each of the nodes we first add the Mesosphere APT repositories to the repository source lists along with the relevant keys, and update the system.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ echo "deb http://repos.mesosphere.io/ubuntu trusty main" | sudo tee /etc/apt/sources.list.d/mesosphere.list
$ sudo apt-get -y update

If you are using some version other than Ubuntu 14.04 then you will have to change the above line accordingly, and if you are using some other distribution like CentOS then you will have to use the relevant rpm and yum commands. This applies everywhere henceforth.

On master nodes:

In our configuration, we are running Marathon on the same boxes as the Mesos masters. The folks at Mesosphere have created a meta-package called mesosphere which installs Mesos, Marathon and also Zookeeper.

$ sudo apt-get install mesosphere

On slave nodes:

On slave nodes, we require only Zookeeper and Mesos installed. The following command should take care of it.

$ sudo apt-get install mesos

As mentioned above, installing these packages will do more than just install packages. Much of the plumbing work is taken care of, for the better. You need not worry whether the mandatory "work_dir" has been created, in the absence of which Apache Mesos would not run, and other such important things. If you want to understand more, extracting the scripts from the packages and studying them is highly recommended. That is what I did as well.

You can save a lot of time if you clone this repository and then run the following command inside your copy.

$ vagrant up

This command will launch a cluster, set the IPs for all nodes, and install all the required packages to follow this post. You are now ready to configure your cluster.

Configuration

In this section we will configure each tool/application one by one. We will start with Zookeeper, then the Mesos servers, then the Mesos slaves and finally Marathon.

Zookeeper

Let us stop Apache Zookeeper on all nodes (masters and slaves).

$ sudo service zookeeper stop

Let us configure Apache Zookeeper on all masters. Do the following steps on each master.

Edit /etc/zookeeper/conf/myid on each of the master nodes. Replace the boilerplate text in this file with a unique number (per server) from 1 to 255. These numbers will be the IDs for the servers being controlled by Zookeeper. Let's choose 10, 30 and 50 as the IDs for the 3 Mesos master nodes. Save the files after adding 10, 30 and 50 respectively to /etc/zookeeper/conf/myid on the nodes. Here's what I had to do on the first master node; the same has to be repeated on the other nodes with their respective IDs.

$ echo 10 | sudo tee /etc/zookeeper/conf/myid

Next we configure the Zookeeper configuration file (/etc/zookeeper/conf/zoo.cfg) on each master node. For the purpose of this blog we are just adding the master node IPs and the relevant server IDs that were selected in the previous step.

Note the configuration template line below: server.id=host:port1:port2. port1 is used by peer Zookeeper servers to communicate with each other, and port2 is used for leader election. The recommended values are 2888 and 3888 for port1 and port2 respectively, but you can choose to use custom values for your cluster.

Assuming that you have chosen the IP range 10.10.20.11-13 for your Mesos servers as mentioned above, edit /etc/zookeeper/conf/zoo.cfg to reflect the following:

# /etc/zookeeper/conf/zoo.cfg
server.10=10.10.20.11:2888:3888
server.30=10.10.20.12:2888:3888
server.50=10.10.20.13:2888:3888

This file will have many other Zookeeper-related configurations which are beyond the scope of this post. If you are using the packages mentioned above, the configuration templates should be a lot of help. Definitely read the comments sections, there is a lot to learn there.

This is a good tutorial to understand the fundamentals of Zookeeper. And this document is perhaps the latest and best document to know more about administering Apache Zookeeper; specifically, this section is of relevance to what we are doing.

All Nodes

Zookeeper connection details: For all nodes (masters and slaves), we have to set up the Zookeeper connection details. These are stored in /etc/mesos/zk, a configuration file that you get thanks to the packages. Edit this file on each node and add the following URL carefully.

# /etc/mesos/zk
zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/mesos

Port 2181 is Zookeeper's client port, which it listens on for client connections. The IP addresses will differ if you have chosen the IPs for your servers differently.

IP addresses: Next we set up the IP address information for all nodes (masters and slaves).

Masters:

$ echo <node-ip> | sudo tee /etc/mesos-master/ip
$ sudo cp /etc/mesos-master/ip /etc/mesos-master/hostname

Write the IP of the node in the file. Save and close the file.

Slaves:

$ echo <node-ip> | sudo tee /etc/mesos-slave/ip
$ sudo cp /etc/mesos-slave/ip /etc/mesos-slave/hostname

Write the IP of the node in the file. Save and close the file. Keeping the hostname the same as the IP makes it easier to resolve DNS.

If you are using the Mesosphere packages, then you get a bunch of intelligent defaults. One of the most important things you get is a convenient way to pass CLI options to Mesos. All you need to do is create a file with the same name as the CLI option and put in it the value that you want to pass to Mesos (master or slave). The file needs to be copied to the correct directory: in the case of Mesos masters, copy the file to /etc/mesos-master, and for slaves copy the file to /etc/mesos-slave. For example:

$ echo 5050 | sudo tee /etc/mesos-slave/port

We will see some examples of similar configuration setup below. Here you can find all the CLI options that you can pass to the Mesos master/slave.

Mesos Servers

We need to set a quorum for the servers. This can be done by editing /etc/mesos-master/quorum and setting it to a correct value. For our case, the quorum value can be 2 or 3; we will use 2 in this post. The quorum is the strict majority. Since we chose 2 as the quorum value, it means that out of 3 masters, we will definitely need at least 2 master nodes running for our cluster to run properly.

We need to stop the slave service on all masters if it is running. If it is not, the following command might give you a harmless warning.

$ sudo service mesos-slave stop

Then we disable the slave service by setting a manual override.

$ echo manual | sudo tee /etc/init/mesos-slave.override

Mesos Slaves

Similarly, we need to stop the master service on all slaves if it is running. If it is not, the following command might give you a harmless warning. We also set the master and Zookeeper services on each slave to manual override.

$ sudo service mesos-master stop
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override

The above .override files are read by upstart on an Ubuntu box to start/stop processes. If you are using a different distribution, or even Ubuntu 15.04, then you might have to do this differently.

Marathon

We can now configure Marathon, for which we need some work to be done. We will configure Marathon only on the server nodes.

First create a directory for the Marathon configuration.

$ sudo mkdir -p /etc/marathon/conf

Then, like we did before, we set configuration properties by creating files with the same name as the property to be set and adding the value of the property as the only content of the file (see the note above).

The Marathon binary needs to know the values for --master and --hostname. We can reuse the files that we used for the Mesos configuration.

$ sudo cp /etc/mesos-master/ip /etc/marathon/conf/hostname
$ sudo cp /etc/mesos/zk /etc/marathon/conf/master

To make sure Marathon can use Zookeeper, do the following (note the endpoint is different in this case, i.e. marathon):

$ echo zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/marathon \
  | sudo tee /etc/marathon/conf/zk

Here you can find all the command line options that you can pass to Marathon.

Starting Services

Now that we have configured our cluster, we can resume all services.

Master:

$ sudo service zookeeper start
$ sudo service mesos-master start
$ sudo service marathon start

Slave:

$ sudo service mesos-slave start

Running Your Application

Marathon provides a nice web UI to set up your application. It also provides an excellent REST API to create, launch and scale applications, check health status and more.

Go to your Marathon web UI; if you followed the above instructions, the URL should be one of the Mesos masters on port 8080 (i.e. http://10.10.20.11:8080). Click on the "New App" button to deploy a new application. Fill in the details. The application ID is mandatory. Select relevant values for CPU, memory and disk space for your application. For now let the number of instances be 1; we will increase it later when we scale up the application in our shiny new cluster.

There are a few optional settings that you might have to take care of depending on how your slaves are provisioned and configured. For this post, I made sure each slave had Ruby, the Ruby-related dependencies and the Bundler gem installed. I took care of this when I launched and provisioned the slave nodes.

One of the important optional settings is the "Command" that Marathon executes. Marathon monitors this command and reruns it if it stops for some reason. Thus Marathon earns its claim to fame as "init" and runs long-running applications. For this post, I have used the following command (without the quotes).

"cd hello && bundle install && RAILS_ENV=production bundle exec unicorn -p 9999"

This command reads the Gemfile in the Rails application, installs all the necessary gems required for the application, and then runs the application on port 9999.

I am using a sample Ruby on Rails application. I have put the URL of the tarred application in the URI field. Marathon understands a few archive/package formats and takes care of unpacking them, so we needn't worry about that. Applications need resources to run properly, and URIs can be used for this purpose. Read more about applications and resources here.

Once you click "Create", you will see that Marathon starts deploying the Rails application. A slave is selected by Mesos, the application tarball is downloaded, untarred, the requirements are installed and the application is run. You can monitor all these steps by watching the "Sandbox" logs that you should find on the main Mesos web UI page. When the state of the task changes from "Staging" to "Running", we have a Rails application running via Marathon on a Mesos slave node. Hurrah!

If you followed the steps above, and you read the "Sandbox" logs, you know the IP of the node where the application was deployed. Navigate to SLAVE_NODE_IP:9999 to see your Rails application running.

Scaling Your Application

All good, but how do we scale? After all, the idea is for our application to reach web scale and become the next Twitter, and this post is all about scaling applications with Mesos and Marathon. So this is going to be difficult! Scaling up/down is difficult, but not when you have Mesos and Marathon for company. Navigate to the application page on the Marathon UI. You should see a button that says "Scale". Click on it and increase the number to 2 or 3 or whatever you prefer (assuming that you have that many slave nodes). In this post we have 3 slave nodes, so I can choose 2 or 3. I chose 3. And voila! The application is deployed seamlessly to the other two nodes just like it was deployed to the first node. You can see for yourself by navigating to SLAVE_NODE_IP:9999, where SLAVE_NODE_IP is the IP of each slave where the application was deployed. And there you go, you have your application running on multiple nodes.

It would be trivial to put these IPs behind a load balancer and a reverse proxy so that access to your application is as simple as possible.

Graceful Degradation (and vice versa)

Sometimes nodes in your cluster go down for one reason or another. Very often we get an email from the IaaS provider that a node will be retired in a few days' time, and at other times a node dies before you can figure out what happened. When such inevitable things happen and the node in question is part of the cluster running the application, the dynamic duo of Mesos and Marathon have your back. The system will detect the failure, de-register the slave and deploy the application to a different slave available in the cluster. You could tie this up with your IaaS-provided scaling option and spawn the required number of new slave nodes as part of your cluster, which, once registered with the Mesos cluster, can run your application.

Marathon REST API

Although we have used the web UI to add a new application and scale it, we could have done the same (and much more) using the REST API, and thus drive Marathon operations from a program or scripts. Here's a simple example that will scale the application to 2 instances. Use any REST client, or just curl, to make a PUT request to the application ID, in our case http://10.10.20.11:8080/v2/apps/rails-app-on-mesos-marathon, with the following JSON data as the payload. You will notice that Marathon deploys the application to another instance if there was only 1 instance before. (A small scripted sketch of this call follows at the end of this post.)

{ "instances": 2 }

You can do much more than this: run health checks, add/suspend/kill/scale applications, etc. That can become a complete blog post in itself and will be dealt with at a later time.

Conclusion

Scaling your application becomes as easy as pressing buttons with a combination of Mesos and Marathon. Setting up a cluster can become almost trivial once you get your requirements in place and, ideally, automate the configuration and provisioning of your nodes. For this post, I relied on a simple Vagrantfile and a shell script to provision the system. Later I configured the system by hand as per the above steps. Using Chef or the like would make the configuration step a single command's work. In fact there are a few open-source projects that are already very successful and do just that. I have played with everpeace/vagrant-mesos and it is an excellent starting point. Reading the code from these projects will help you understand a lot about building and configuring clusters with Mesos.

There are other projects that do similar things to Marathon, and sometimes more. I definitely would like to mention Apache Aurora and HubSpot's Singularity.
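As promised in the REST API section above, here is that scaling call expressed as a minimal Python sketch using the requests library. The Marathon host and the application ID (rails-app-on-mesos-marathon) are the ones used in this post; treat the script as an illustrative sketch rather than a full client (no retries, no authentication handling).

import requests

MARATHON_URL = "http://10.10.20.11:8080"   # one of the Marathon/master nodes
APP_ID = "rails-app-on-mesos-marathon"     # application ID used in this post

def scale_app(instances):
    """Scale a Marathon application by PUTting the desired instance count."""
    response = requests.put(
        f"{MARATHON_URL}/v2/apps/{APP_ID}",
        json={"instances": instances},
    )
    response.raise_for_status()
    # Marathon responds with a deployment reference that can be polled for progress.
    return response.json()

if __name__ == "__main__":
    print(scale_app(2))   # same effect as the curl PUT shown above

Wrapping the call in a script like this is handy when scaling needs to be triggered from a deployment pipeline or an autoscaling hook rather than from the web UI.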

Aziro Marketing

EXPLORE ALL TAGS
2019 dockercon
Advanced analytics
Agentic AI
agile
AI
AI ML
AIOps
Amazon Aws
Amazon EC2
Analytics
Analytics tools
AndroidThings
Anomaly Detection
Anomaly monitor
Ansible Test Automation
apache
apache8
Apache Spark RDD
app containerization
application containerization
applications
Application Security
application testing
artificial intelligence
asynchronous replication
automate
automation
automation testing
Autonomous Storage
AWS Lambda
Aziro
Aziro Technologies
big data
Big Data Analytics
big data pipeline
Big Data QA
Big Data Tester
Big Data Testing
bitcoin
blockchain
blog
bluetooth
buildroot
business intelligence
busybox
chef
ci/cd
CI/CD security
cloud
Cloud Analytics
cloud computing
Cloud Cost Optimization
cloud devops
Cloud Infrastructure
Cloud Interoperability
Cloud Native Solution
Cloud Security
cloudstack
cloud storage
Cloud Storage Data
Cloud Storage Security
Codeless Automation
Cognitive analytics
Configuration Management
connected homes
container
Containers
container world 2019
container world conference
continuous-delivery
continuous deployment
continuous integration
Coronavirus
Covid-19
cryptocurrency
cyber security
data-analytics
data backup and recovery
datacenter
data protection
data replication
data-security
data-storage
deep learning
demo
Descriptive analytics
Descriptive analytics tools
development
devops
devops agile
devops automation
DEVOPS CERTIFICATION
devops monitoring
DevOps QA
DevOps Security
DevOps testing
DevSecOps
Digital Transformation
disaster recovery
DMA
docker
dockercon
dockercon 2019
dockercon 2019 san francisco
dockercon usa 2019
docker swarm
DRaaS
edge computing
Embedded AI
embedded-systems
end-to-end-test-automation
FaaS
finance
fintech
FIrebase
flash memory
flash memory summit
FMS2017
GDPR faqs
Glass-Box AI
golang
GraphQL
graphql vs rest
gui testing
habitat
hadoop
hardware-providers
healthcare
Heartfullness
High Performance Computing
Holistic Life
HPC
Hybrid-Cloud
hyper-converged
hyper-v
IaaS
IaaS Security
icinga
icinga for monitoring
Image Recognition 2024
infographic
InSpec
internet-of-things
investing
iot
iot application
iot testing
java 8 streams
javascript
jenkins
KubeCon
kubernetes
kubernetesday
kubernetesday bangalore
libstorage
linux
litecoin
log analytics
Log mining
Low-Code
Low-Code No-Code Platforms
Loyalty
machine-learning
Meditation
Microservices
migration
Mindfulness
ML
mobile-application-testing
mobile-automation-testing
monitoring tools
Mutli-Cloud
network
network file storage
new features
NFS
NVMe
NVMEof
NVMes
Online Education
opensource
openstack
opscode-2
OSS
others
Paas
PDLC
Positivty
predictive analytics
Predictive analytics tools
prescriptive analysis
private-cloud
product sustenance
programming language
public cloud
qa
qa automation
quality-assurance
Rapid Application Development
raspberry pi
RDMA
real time analytics
realtime analytics platforms
Real-time data analytics
Recovery
Recovery as a service
recovery as service
rsa
rsa 2019
rsa 2019 san francisco
rsac 2018
rsa conference
rsa conference 2019
rsa usa 2019
SaaS Security
san francisco
SDC India 2019
SDDC
security
Security Monitoring
Selenium Test Automation
selenium testng
serverless
Serverless Computing
Site Reliability Engineering
smart homes
smart mirror
SNIA
snia india 2019
SNIA SDC 2019
SNIA SDC INDIA
SNIA SDC USA
software
software defined storage
software-testing
software testing trends
software testing trends 2019
SRE
STaaS
storage
storage events
storage replication
Storage Trends 2018
storage virtualization
support
Synchronous Replication
technology
tech support
test-automation
Testing
testing automation tools
thought leadership articles
trends
tutorials
ui automation testing
ui testing
ui testing automation
vCenter Operations Manager
vCOPS
virtualization
VMware
vmworld
VMworld 2019
vmworld 2019 san francisco
VMworld 2019 US
vROM
Web Automation Testing
web test automation
WFH
