Tag Archive

Below you'll find a list of all posts that have been tagged as "data-analytics"

Fundamentals of Forecasting and Linear Regression in R

In this article, let's learn the basics of forecasting and linear regression analysis, a basic statistical technique for modeling relationships between dependent and explanatory variables. We will also look at how R, a statistical programming language, implements linear regression, through a couple of scenarios.

Let's start by considering the following scenarios.

Scenario 1: Every year, as part of the organization's annual planning process, a requirement is to come up with a revenue target upon which the budget of the rest of the organization is based. The revenue is a function of sales, and therefore the requirement is to approximately forecast the sales for the year. Depending on this forecast, the budget can be allocated within the organization. Looking at the organization's history, we can assume that the number of sales is based on the number of salespeople and the level of promotional activity. How can we use these factors to forecast sales?

Scenario 2: An insurance company was facing heavy losses on vehicle insurance products. The company had data regarding the policy number, policy type, years of driving experience, age of the vehicle, usage of the vehicle, gender of the driver, marital status of the driver, type of fuel used in the vehicle, and the capped losses for the policy. Could there be a relation between the driver's profile, the vehicle's profile, and the losses incurred on its insurance?

The first scenario demands a prediction of sales based on the number of salespeople and promotions. The second scenario demands a relationship between a vehicle, its driver, and the losses accrued on the vehicle as a result of an insurance policy that covers it. These are classic questions that linear regression can easily answer.

What is linear regression?

Linear regression is a statistical technique for generating simple, interpretable relationships between a given factor of interest and the possible factors that influence it. The factor of interest is called the dependent variable, and the possible influencing factors are called explanatory variables. Linear regression builds a model of the dependent variable as a function of the given independent, explanatory variables. This model can then be used to forecast values of the dependent variable, given new values of the explanatory variables.

What are the use cases?

Determining relationships: Linear regression is extensively used to determine the relationship between the factor of interest and the corresponding possible factors of influence. Biology, behavioral sciences, and social sciences use linear regression extensively to find relationships between various measured factors. In healthcare, it has been used to study the causes of health and disease conditions in defined populations.

Forecasting: Linear regression can also be used to forecast trend lines, stock prices, GDP, income, expenditure, demand, risks, and many other factors.

What is the output?

A linear regression quantifies the influence of each explanatory variable as a coefficient. A positive coefficient shows a positive influence, while a negative coefficient shows a negative influence on the relationship. The actual value of the coefficient decides the magnitude of influence: the greater the value of the coefficient, the greater its influence. Linear regression also gives a measure of confidence in the relationships it has determined. The higher the confidence, the better the model for relationship determination.
A regression with high confidence values can be used for reliable forecasting.

What are the limitations?

Linear regression is the simplest form of relationship model and assumes that the relationship between the factor of interest and the factors affecting it is linear in nature. It therefore cannot be used for very complex analytics, but it provides a good starting point for analysis.

How to use linear regression?

Linear regression is natively supported in R, a statistical programming language. We'll show how to run a regression in R, how to interpret its results, and how to use it for forecasting.

For generating relationships and the model: Figure 1 shows the commands to execute for linear regression, and Table 1 explains the contents of its numbered boxes. Figure 2 shows the summary of the regression results, obtained by executing the summary function on the output of lm, the linear regression function; Table 2 explains the various outputs seen in the summary.

For forecasting using the generated model: The regression function returns a linear model based on the input training data. This model can be used to perform prediction, as shown in Figure 3. The predict.lm function is used for predicting values of the factor of interest. It takes two inputs: the model, as generated by the regression function lm, and the new values for the influencing factors.

Figure 1: Reading data and running regression
Figure 2: Interpreting the results of regression
Figure 3: Forecasting using regression

Table 1: Explanation of regression steps
1. This box shows the sample input data. There are two columns, Production and Cost. We have used data for monthly production costs and output for a hosiery mill, which is available at http://www.stat.ufl.edu/~winner/data/millcost.dat.
2. This box shows the summary of the data. The summary gives the minimum, 1st quartile (25th percentile), median (50th percentile), mean, 3rd quartile (75th percentile), and maximum values for the given data.
3. This box shows the command to execute linear regression on the data. The function lm takes a formula as input. The formula is of the form y ~ x1 + x2 + ... + xn, where y is the factor of interest and x1, ..., xn are the possible influencing factors. In our case, Production is the factor of interest, and we have only one factor of influence, namely Cost.

Table 2: Explanation of regression output
4. This box shows the summary of residuals. A residual is the difference between the actual value and the value calculated by the regression, that is, the error in calculation. The residuals section of the summary shows the first quartile, median, third quartile, minimum, maximum, and mean values of the residuals. Ideally, a plot of these residuals should follow a bell curve, that is, there should be a few residuals with value 0, a few residuals with high values, but many residuals with intermediate values.
5. The Estimate column gives the coefficient for each influencing factor, showing the magnitude of its influence and whether that influence is positive or negative. The other columns give various error measures for the estimated coefficient.
6. The number of stars depicts the goodness of the regression. The more stars, the more accurate the regression.
7. The R-squared values give a confidence measure of how accurately the regression can predict. The values fall between zero and one, one being the highest possible accuracy and zero being no accuracy at all.
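Since the original figures are not reproduced here, the following minimal R sketch approximates the steps they illustrate. It assumes millcost.dat contains two whitespace-separated columns read in the order Cost, Production; the column names and the new Cost values used for the forecast are illustrative, not part of the original post.

# Read the data and inspect it (boxes 1 and 2)
mill <- read.table("http://www.stat.ufl.edu/~winner/data/millcost.dat",
                   col.names = c("Cost", "Production"))
summary(mill)

# Run the regression (box 3): factor of interest ~ influencing factor
model <- lm(Production ~ Cost, data = mill)
summary(model)   # residuals, coefficients, significance stars, R-squared

# Forecast the factor of interest for new explanatory values
# (predict() dispatches to predict.lm for an lm model)
new_data <- data.frame(Cost = c(40, 55))
predict(model, newdata = new_data)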
I believe we have now understood the power of linear regression and how it can be used for specific use cases. If you have any comments or questions, do share them below.

Aziro Marketing


Future Outlook: Evolving Trends in Predictive Analytics 2024

Predictive analytics has become an indispensable tool for businesses across industries. By leveraging historical data, statistical modeling, and machine learning algorithms, organizations can gain valuable insights into future trends and customer behavior. This empowers them to make data-driven decisions, optimize operations, and gain a competitive edge. However, the field of predictive analytics is constantly evolving. New technologies and methodologies are emerging, reshaping how businesses utilize this powerful tool. Here, we delve into some of the key trends that will define the future of predictive analytics:

1. Democratization of Predictive Analytics
Traditionally, predictive analytics required significant technical expertise and access to expensive software. This limited its use to large enterprises with dedicated data science teams. But the future is becoming more accessible. Cloud-based solutions, user-friendly interfaces, and pre-built analytics tools are making it easier for businesses of all sizes to leverage predictive power. This democratization will lead to wider adoption and unlock the potential of data for a broader range of organizations.

2. Integration of Artificial Intelligence (AI) and Machine Learning (ML)
AI and machine learning are already playing a major role in predictive analytics. Advanced algorithms are capable of handling complex datasets, identifying hidden patterns, and making more accurate predictions. As these technologies continue to evolve, we can expect even more sophisticated models that can learn and adapt in real-time, leading to highly customized and dynamic predictive insights.

3. Rise of Explainable AI (XAI)
While AI-powered predictive models can be incredibly powerful, a lack of transparency can be a concern. Businesses need to understand the "why" behind the predictions. Explainable AI (XAI) is addressing this by providing insights into how models arrive at their conclusions. This will build trust in AI-driven decision-making and allow businesses to leverage the power of AI while maintaining control and regulatory compliance.

4. Focus on Real-Time and Edge Computing
Traditional predictive analytics often relies on historical data, which can limit its effectiveness in fast-paced environments. Real-time and edge analytics are addressing this by processing data at the source, closer to where it is generated. This enables businesses to make immediate decisions based on real-time insights, allowing them to react to changing situations and optimize performance more effectively.

5. Integration with the Internet of Things (IoT)
The proliferation of IoT devices is generating vast amounts of data. Predictive analytics can be integrated with IoT systems to analyze this data in real-time. This can be used for predictive maintenance of equipment, optimizing supply chains, and personalizing customer experiences. As the number of connected devices continues to grow, the synergy between IoT and predictive analytics will be crucial for businesses looking to extract maximum value from their data.

6. Rise of Prescriptive Analytics
Predictive analytics tells you what is likely to happen. However, the future lies in prescriptive analytics, which goes a step further by recommending specific actions to take based on predicted outcomes. This allows businesses to not just anticipate challenges but also proactively develop strategies to mitigate them or capitalize on opportunities.

7. Growing Focus on Data Security and Privacy
As reliance on data grows, so do concerns about data security and privacy. Businesses need to ensure that they are collecting, storing, and utilizing data ethically and responsibly. This requires robust data security measures and adherence to data privacy regulations like GDPR and CCPA. Predictive analytics solutions that prioritize data privacy and security will be essential moving forward.

8. Emphasis on Human-in-the-Loop Analytics
While AI and machine learning play a significant role, human expertise remains crucial. The future of predictive analytics lies in a collaborative approach – "human-in-the-loop" analytics. Here, human analysts work alongside AI models, leveraging their expertise to interpret results, identify potential biases, and ensure that AI-driven recommendations are aligned with business goals and ethical considerations.

9. Continuous Learning and Iteration
Predictive models are not static. As new data is collected and analyzed, these models need to be continuously refined and updated. This ensures the accuracy and effectiveness of predictions over time. Businesses need to establish a culture of continuous learning and iteration to ensure their predictive analytics models remain valuable assets.

10. Focus on Ethical Considerations
As predictive analytics becomes more powerful, ethical considerations become paramount. Businesses need to be aware of potential biases within their data sets and algorithms. They also need to ensure that their use of predictive analytics does not lead to discrimination or unfair treatment of individuals.

The Future of Predictive Analytics with Aziro (formerly MSys Technologies)
The future of predictive analytics is bright, with exciting trends shaping how businesses leverage data for success. Aziro (formerly MSys Technologies) is at the forefront of this evolution, offering a comprehensive suite of predictive analytics solutions powered by cutting-edge technology and a team of experienced data scientists. We help businesses:

Develop robust predictive models: Our experts can help you design and implement custom predictive models tailored to your specific needs and industry.
Leverage the power of AI and Machine Learning: We utilize advanced AI and ML algorithms to extract valuable insights from your data and deliver highly accurate predictions.
Ensure Explainable AI (XAI): We prioritize transparency in our models, providing clear explanations for their outputs, fostering trust and informed decision-making.
Implement real-time and edge computing solutions: Our expertise allows you to harness the power of real-time data for immediate insights and optimized performance.
Integrate with IoT: We can help you seamlessly integrate predictive analytics with your IoT infrastructure to unlock the full potential of your connected devices.
Develop prescriptive analytics strategies: Go beyond predictions with actionable insights that empower you to take proactive steps towards achieving your goals.
Maintain data security and privacy: We prioritize robust data security practices and adhere to industry regulations to ensure responsible data utilization.
Foster a human-in-the-loop approach: Our collaborative approach combines the power of AI with human expertise, leading to more accurate and reliable results.
Promote continuous learning and model updates: We believe in continuous improvement, ensuring your models remain effective as your data landscape evolves.
Navigate ethical considerations: We work closely with you to identify and mitigate potential biases, ensuring ethical and responsible use of predictive analytics.

If you are ready to unlock the future of predictive analytics for your business, contact Aziro (formerly MSys Technologies) today and schedule a consultation with our data science experts. We can help you leverage the power of predictive analytics to gain a competitive advantage, optimize your operations, and achieve your strategic goals.

Aziro Marketing


How to classify Product Catalogue Using Ensemble

The Problem Statement

The Otto Product Classification Challenge was a competition hosted on Kaggle, a website dedicated to solving complex data science problems. The purpose of this challenge was to classify products into the correct category, based on their recorded features.

Data

The organizers had provided a training data set containing 61878 entries, and a test data set that had 144368 entries. The data contained 93 features, based on which the products had to be classified. The target column in the training data set indicated the category of the product. The training and test data sets are available for download here. A sample of the training data set can be seen in Figure 1.

Solution Approach

The features in the training data set had a large variance. An Anscombe transform on the features reduced the variance. In the process, it also transformed the features from an approximately Poisson distribution into an approximately normal distribution. The rest of the data was pretty clean, and thus we could use it directly as input to the classification algorithm. For classification, we tried two approaches: one using the xgboost algorithm, and the other using the deep learning algorithm provided by h2o. xgboost is an implementation of the extreme gradient boosting algorithm, which can be used effectively for classification.

Figure 1: Sample Data

As discussed in the TFI blog, the gradient boosting algorithm is an ensemble method based on decision trees. For classification, at every branch in the tree, it tries to eliminate a category, to finally have only one category per leaf node. For this, it needs to build trees for each category separately. But since we had only 9 categories, we decided to use it. The deep learning algorithm provided by h2o is based on a multi-layer neural network. It is trained using a variation of the gradient descent method. We used multi-class logarithmic loss as the error metric to find a good model, as this was also the metric used for ranking by Kaggle.

Building the Classifier

Initially, we created a classifier using xgboost. The xgboost configuration for our best submission is as below:

param <- list(objective = "multi:softprob", eval_metric = "mlogloss",
              num_class = 9, nthread = 8, eta = 0.1, max_depth = 27,
              gamma = 2, min_child_weight = 3, subsample = 0.75,
              colsample_bytree = 0.85)
nround <- 5000
classifier <- xgboost(param = param, data = x, label = y, nrounds = nround)

Here, we have specified our objective function to be 'multi:softprob'. This function returns the probabilities for a product being classified into a specific category. The evaluation metric, as specified earlier, is the multi-class logarithmic loss function. The 'eta' parameter shrinks the priors for features, thus making the algorithm less prone to overfitting; it usually takes values between 0.1 and 0.001. The 'max_depth' parameter limits the height of the decision trees. Shallow trees are constructed faster. Not specifying this parameter lets the trees grow as deep as required. The 'min_child_weight' parameter controls the splitting of the tree: its value puts a lower bound on the weight of each child node before it can be split further. The 'subsample' parameter makes the algorithm choose a subset of the training set; in our case, it randomly chooses 75% of the training data to build the classifier. The 'colsample_bytree' parameter makes the algorithm choose a subset of the features while building each tree, in our case 85%. Both 'subsample' and 'colsample_bytree' help in preventing overfitting.
The classifier was built by performing 5000 iterations over the training data set. These parameters were tuned by experimentation, by trying to minimize the log-loss error. The log-loss error on the public leaderboard for this configuration was 0.448.

We also created another classifier using the deep learning algorithm provided by h2o. The configuration for this algorithm was as follows:

classification = T, activation = "RectifierWithDropout", hidden = c(1024, 512, 256),
hidden_dropout_ratio = c(0.5, 0.5, 0.5), input_dropout_ratio = 0.05, epochs = 50,
l1 = 1e-5, l2 = 1e-5, rho = 0.99, epsilon = 1e-8,
train_samples_per_iteration = 4000, max_w2 = 10, seed = 1

This configuration creates a neural network with 3 hidden layers of 1024, 512, and 256 neurons respectively, specified using the 'hidden' parameter. The activation function in this case is Rectifier with Dropout. The rectifier function filters negative inputs for each neuron. Dropout lets us randomly drop inputs to the hidden neuron layers, which builds better generalizations. The 'hidden_dropout_ratio' specifies the percentage of inputs to the hidden layers to be dropped, and 'input_dropout_ratio' the percentage of inputs to the input layer to be dropped. 'epochs' defines the number of training iterations to be carried out. Setting 'train_samples_per_iteration' makes the algorithm choose a subset of the training data per iteration. Setting 'l1' and 'l2' scales the weights assigned to each feature: l1 reduces model complexity, and l2 introduces bias in estimation. 'rho' and 'epsilon' together slow convergence. 'max_w2' sets an upper limit on the sum of squared incoming weights into a neuron; this needs to be set for the rectifier activation function. 'seed' is the random seed that controls sampling.

Using these parameters, we performed 10 iterations of deep learning, and submitted the mean of the 10 results as the output. The log-loss error on the public leaderboard for this configuration was 0.448.

We then merged both results by taking a mean, and that resulted in a top 10% submission, with a score of 0.428.

Results

Since the public leaderboard evaluation was based on 70% of the test data, the public leaderboard rank was fairly stable. We submitted the best two results, and our rank remained in the top 10% at the end of the final evaluation. Overall, it was a good learning experience.

What We Learnt

Relying on the results of one model may not give the best possible result. Combining the results of various approaches may reduce the error by a significant margin. As can be seen here, the winning solution in this competition had a much more complex ensemble. Complex ensembles may be the way ahead for getting better at complex problems like classification.
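To make two of the steps described above concrete, here is a rough R sketch of the Anscombe transform and the final blend. It is only an illustration: train_features, xgb_probs, and dl_probs are hypothetical stand-ins for the feature matrix and the class-probability matrices produced by the two models, not objects from the original code.

anscombe <- function(x) 2 * sqrt(x + 3/8)    # variance-stabilising transform for count features
train_features <- anscombe(train_features)   # applied to the features before training

blend <- (xgb_probs + dl_probs) / 2          # simple mean of the two models' class probabilities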

Aziro Marketing


Descriptive Analytics: Understanding the Past to Inform the Future

In the ever-evolving landscape of data analytics, businesses increasingly rely on data to make informed decisions, drive strategies, and optimize operations. Understanding how descriptive analytics can be applied within various organizations, and how it turns raw data into insights and conclusions for informed decision-making, is crucial for appreciating its value. Among the various branches of analytics, descriptive analytics holds a foundational place, providing critical insights into historical data to paint a comprehensive picture of past performance. This blog delves into the significance of descriptive analytics, its methodologies, its tools, and its crucial role in shaping future strategies.

Understanding Descriptive Analytics

What is Descriptive Analytics?
Descriptive analytics is the process of summarizing historical data to identify patterns, trends, and insights. It answers the question, "What happened?" by analyzing past data to understand the performance and behavior of various business aspects. Descriptive analytics can help in various business applications such as supply chain management, marketing campaign improvement, customer segmentation, operational efficiency analysis, and financial analysis. Unlike predictive analytics or prescriptive analytics, which focus on forecasting future trends and prescribing actions, descriptive analytics is retrospective, focusing solely on past data.

Key Components of Descriptive Analytics
Data Collection: Gathering relevant data from various sources such as transactional databases, logs, and external datasets is essential. This ensures the data is accurate, comprehensive, and representative of the subject being analyzed.
Data Cleaning: Ensuring data accuracy by identifying and correcting errors, inconsistencies, and missing values.
Data Aggregation: Combining data from different sources to create a comprehensive dataset.
Data Analysis: Using statistical methods and tools to analyze the data and identify patterns and trends.
Data Visualization: Presenting the analyzed data through charts, graphs, dashboards, and reports for easy interpretation.

Importance of Descriptive Analytics

Informing Decision Making
Descriptive analytics provides a factual basis for decision-making by offering a clear view of what has transpired in the past. Analyzing data points such as social media engagement, email open rates, and number of subscribers can optimize marketing campaigns and clarify the company's performance. Businesses can use these insights to understand their strengths and weaknesses, make informed strategic decisions, and set realistic goals.

Performance Measurement Using Key Performance Indicators
Organizations use descriptive analytics to measure performance against key performance indicators (KPIs). By tracking metrics over time, businesses can assess their progress, identify areas for improvement, and make necessary adjustments to achieve their objectives.

Enhancing Customer Understanding with Historical Data
By analyzing historical customer data, businesses can gain valuable insights into customer behavior, preferences, and buying patterns. Analyzing historical sales data, in particular, reveals patterns, seasonality, and long-term trends, which helps in decision-making and forecasting future performance. This information helps in creating targeted marketing strategies, improving customer service, and enhancing customer satisfaction.
Operational Efficiency
Descriptive analytics helps businesses optimize their operations by identifying inefficiencies and areas of waste. By understanding past performance, organizations can streamline processes, reduce costs, and improve productivity.

Methodologies in Descriptive Analytics

Data Mining
Data mining involves exploring large datasets to discover patterns, correlations, and anomalies. Exploratory data analysis involves techniques such as summary statistics and data visualization to understand data characteristics and identify initial patterns or trends. Techniques such as clustering, association rule mining, and anomaly detection are commonly used in descriptive analytics to uncover hidden insights.

Descriptive Statistics and Analysis
Statistical analysis uses mathematical techniques to analyze data and draw conclusions. Diagnostic analytics focuses on explaining why specific outcomes occurred and is used to make changes for the future. Descriptive statistics such as mean, median, mode, standard deviation, and variance provide a summary of the data's central tendency and dispersion (a minimal sketch of this step appears below, just before the case studies).

Data Visualization
Data visualization is a key aspect of descriptive analytics, enabling businesses to present complex data in an easily understandable format. Tools like bar charts, line graphs, pie charts, and histograms help in identifying trends and patterns visually.

Reporting
Reporting involves generating structured reports that summarize the analyzed data. These reports provide stakeholders with actionable insights and facilitate data-driven decision-making.

Tools for Descriptive Analytics

Microsoft Power BI
Power BI is a powerful business analytics tool that enables organizations to visualize their data and share insights across the organization. It offers robust data modeling, visualization, and reporting capabilities, making it a popular choice for descriptive analytics.

Tableau
Tableau is a leading data visualization tool that helps businesses create interactive and shareable dashboards. Its drag-and-drop interface and extensive visualization options make it easy to explore and present data effectively.

Google Data Studio
Google Data Studio is a free tool that allows users to create customizable and interactive reports. It integrates seamlessly with other Google services, making it a convenient choice for organizations using Google Analytics, Google Ads, and other Google products.

SAS Visual Analytics
SAS Visual Analytics offers a comprehensive suite of analytics tools for data exploration, visualization, and reporting. It leverages data science to transform raw data into understandable patterns, trends, and insights, enabling organizations to make informed decisions. It is known for its advanced analytics capabilities and user-friendly interface, catering to both novice and experienced users.

Qlik Sense
Qlik Sense is a self-service data visualization and discovery tool that empowers users to create personalized reports and dashboards. Its associative data model allows for intuitive data exploration and analysis.

Data Collection Methods
Effective descriptive analytics relies on accurate data collection methods, including:
Internal Databases: Leveraging data stored in company databases.
Customer Surveys: Collecting feedback directly from customers.
Website Analytics: Analyzing user behavior on company websites.
Social Media Data: Gathering insights from social media interactions and engagements.
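As a minimal illustration of the descriptive statistics step mentioned above, the following Python sketch summarizes a small sales table with pandas. The file name and column names are hypothetical; substitute your own data.

import pandas as pd

sales = pd.read_csv("monthly_sales.csv")         # hypothetical columns: region, units, revenue
print(sales[["units", "revenue"]].describe())    # count, mean, std, quartiles, min, max
print(sales["units"].median(), sales["units"].mode()[0])
print(sales.groupby("region")["revenue"].sum())  # simple aggregation for a report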
Case Studies: Real-World Applications of Descriptive Analytics

Sales & Marketing
In sales and marketing, descriptive analytics can be used to analyze past sales data, identifying best-selling products, seasonal trends, and customer demographics. By transforming raw data into actionable insights, businesses can better understand their market and make informed decisions. This information helps tailor marketing campaigns for better targeting and improved ROI. For instance, a company might find that a certain product sells well among young adults during the summer, leading them to focus their marketing efforts on that demographic during that season.

Retail Industry
A leading retail chain used descriptive analytics to analyze sales data from its various stores. By identifying patterns in customer purchases, the company was able to optimize inventory levels, improve product placement, and increase sales. Descriptive analytics also helped the retailer segment its customer base and develop targeted marketing campaigns, resulting in higher customer engagement and loyalty.

Healthcare Sector
A healthcare provider utilized descriptive analytics to examine patient data and identify trends in disease outbreaks, treatment effectiveness, and patient outcomes. This analysis enabled the organization to improve patient care, streamline operations, and allocate resources more efficiently. By understanding historical data, the healthcare provider could also predict future healthcare needs and plan accordingly.

Financial Services
A financial institution leveraged descriptive analytics to analyze transaction data and detect fraudulent activities. By identifying unusual patterns and anomalies, the bank could prevent fraud and enhance its security measures. Additionally, descriptive analytics helped the bank understand customer behavior, enabling it to offer personalized financial products and services.

Manufacturing Industry
A manufacturing company used descriptive analytics to monitor production processes and identify inefficiencies. By analyzing machine performance data, the company could predict maintenance needs, reduce downtime, and improve overall productivity. Descriptive analytics also helped the manufacturer optimize supply chain operations and reduce operational costs.

Human Resources
In HR, descriptive analytics can identify top performers, track employee turnover rates, and improve talent acquisition strategies. For example, by analyzing employee data, a company might find that turnover is highest among new hires within the first six months. This insight can lead to improved onboarding processes and retention strategies.

Best Practices for Implementing Descriptive Analytics

Define Clear Objectives
Before embarking on a descriptive analytics initiative, it is crucial to define clear objectives. Understanding what you want to achieve with your analysis will guide the data collection, analysis, and reporting processes.

Ensure Data Quality
High-quality data is the foundation of effective descriptive analytics. Invest in data cleaning and validation processes to ensure the accuracy, consistency, and completeness of your data.

Choose the Right Tools
Selecting the appropriate tools for data analysis and visualization is essential. Consider factors such as ease of use, scalability, integration capabilities, and cost when choosing analytics tools.

Focus on Visualization
Effective data visualization makes it easier to interpret and communicate insights.
Invest in tools and techniques that allow you to create clear, interactive, and compelling visualizations.

Foster a Data-Driven Culture
Encourage a data-driven culture within your organization by promoting the use of data in decision-making. Provide training and resources to help employees develop their data literacy skills.

Regularly Review and Update Your Analysis
Descriptive analytics is an ongoing process. Regularly review and update your analysis to reflect new data and changing business conditions. Continuously seek feedback and make improvements to your analytics processes.

The Future of Descriptive Analytics
As technology advances and the volume of data continues to grow, the future of descriptive analytics looks promising. Here are some trends to watch:

Integration with Predictive and Prescriptive Analytics
Descriptive analytics will increasingly integrate with advanced analytics techniques such as predictive and prescriptive analytics. Predictive analytics makes predictions about future performance based on statistics and modeling, benefiting companies by identifying inefficiencies and forecasting future trends. This integration will provide a more comprehensive view of the data, enabling businesses to move from understanding the past to predicting and shaping the future.

Real-Time Analytics
The demand for real-time insights is growing. Future developments in descriptive analytics will focus on real-time data processing and analysis, allowing businesses to make timely and informed decisions.

AI and Machine Learning
Artificial intelligence (AI) and machine learning will play a significant role in enhancing descriptive analytics. These technologies will automate data analysis, uncover deeper insights, and provide more accurate and actionable recommendations.

Enhanced Data Visualization
Advancements in data visualization tools will enable more sophisticated and interactive visualizations. Businesses will be able to explore their data in new ways, uncover hidden patterns, and communicate insights more effectively.

Increased Accessibility
As analytics tools become more user-friendly and affordable, descriptive analytics will become accessible to a broader range of users. Small and medium-sized businesses will increasingly leverage descriptive analytics to gain a competitive edge.

Conclusion
Descriptive analytics is a vital component of any data-driven strategy. By providing a clear understanding of past performance, it empowers businesses to make informed decisions, optimize operations, and enhance customer experiences. As technology evolves, the capabilities of descriptive analytics will continue to expand, offering even greater insights and opportunities. By embracing descriptive analytics, organizations can build a solid foundation for future success, leveraging historical data to navigate the complexities of the modern business landscape.

For more insights on analytics and its applications, read our blogs: AI in Predictive Analytics Solutions: Unlocking Future Trends and Patterns in the USA (2024 & Beyond); Predictive Analytics Solutions for Business Growth in Georgia; Prescriptive Analytics: Definitions, Tools, and Techniques for Better Decision Making.

Aziro Marketing


How to use Naive Bayes for Text Classification

Classification is a process by which we can segregate different items to match their specific class or category. This is a very commonly occurring problem across all activities that happen throughout the day, for all of us. Classifying whether an activity is dangerous, good, moral, ethical, criminal, and so on, are deep-rooted and complex problems, which may or may not have a definite solution. But each of us, in a bounded rational world, tries to classify actions, based on our prior knowledge and experience, into one or more of the classes that we may have defined over time. Let us take a look at some real-world examples of classification, as seen in business activities.

Case 1: Doctors look at various symptoms and measure various parameters of a patient to ascertain what is wrong with the patient's health. The doctors use their past experience with patients to make the right guess.

Case 2: Emails need to be classified as spam or not spam, based on various parameters, such as the source IP address, domain name, sender name, content of the email, subject of the email, etc. Users also feed information to the spam identifier by marking emails as spam.

Case 3: IT-enabled organizations face a constant threat of data theft from hackers. The only way to identify these hackers is to search for patterns in the incoming traffic, and classify the traffic as genuine or a threat.

Case 4: Most organizations that do business in the B2C (business to consumer) segment keep getting feedback about their products or services from their customers in the form of text, ratings, or answers to multiple-choice questions. Surveys, too, provide such information regarding the services or products. Questions such as "What is the general public sentiment about the product or service?" or "Given a product and its properties, will it be a good sell?" also need classification.

As we can imagine, classification is a very widely used technique for applying labels to the information that is received, thus assigning it some known, predefined class. Information may fall into one or more such classes, depending on the overlap between them. In all the cases above, and in most other cases where classification is used, the incoming data is usually large. Going through such large data sets manually to classify them can become a significantly time-consuming activity. Therefore, many classification algorithms have been developed in artificial intelligence to aid this intuitive process. Decision trees, boosting, Naive Bayes, and random forests are a few commonly used ones. In this blog, we discuss the Naive Bayes classification algorithm.

Naive Bayes is one of the simplest and most widely used statistical classification techniques, and it works well on text as well as numeric data. It is a supervised machine learning algorithm, which means that it requires some already classified data, from which it learns and then applies what it has learnt to new, previously unseen information, giving a classification for the new information.

Advantages

Naive Bayes classification assumes that all the features of the data are independent of each other. Therefore, the only computation required in the classification is counting. Hence, it is a very compute-efficient algorithm. It works equally well with numeric data as well as text data.
Text data requires some pre-processing, like removal of stop words, before this algorithm can consume it. The learning time is very short compared to a few other classification algorithms.

Limitations

It does not understand ranges; for example, if the data contains a column which gives age brackets, such as 18-25, 25-50, 50+, then the algorithm cannot use these ranges properly. It needs exact values for classification. It can also classify only on the basis of the cases that it has seen. Therefore, if the data used in the learning phase is not a good representative sample of the complete data, then it may wrongly classify data.

Classification Using Naive Bayes With Python

Data
In this blog, we used the customer review data for electronic goods from amazon.com. We downloaded this data set from the SNAP website. Then we extracted features from the data set after removing stopwords and punctuation. We used the stopwords list provided in the nltk corpus for the identification and removal. We also applied labels to the extracted reviews, based on the ratings available in the data: 4 and 5 as good, 3 as average, and 1 and 2 as bad. A sample of this extracted data set is shown in Table 1.

Features | Label
(good, look, bad, phone) | bad
(worst, phone, world) | bad
(unreliable, phone, poor, customer, service) | bad
(basic, phone) | bad
(bad, cell, phone, batteries) | bad
(ok, phone, lots, problems) | average
(good, phone, great, pda, functions) | average
(phone, worth, buying, would, buy) | average
(beware, flaw, phone, design, might, want, reconsider) | average
(nice, phone, afford, features) | average
(chocolate, cheap, phone, functionally, suffers) | average
(great, phone, price) | good
(great, phone, cheap, wservice) | good
(great, entry, level, phone) | good
(sprint, phone, service) | good
(free, good, phone, dont, fooled) | good
Table 1: Sample Data

Implementation
The classification algorithm works in two steps: first the training phase, and second the classification phase.

Training Phase
In the training phase, the algorithm takes two parameters as input. The first is the set of features, and the second is the classification label for each feature set. A feature is a part of the data which contributes to the label or class attached to the data. In the training phase, the classification algorithm builds the probabilities for each of the unique features given a class. It also builds prior probabilities for the classes themselves, that is, the probability that a given set of features will belong to that class. Algorithm 1 gives the algorithm for training. The implementation of this in Python is shown in Figure 1.

Classification Phase
In the classification phase, the algorithm takes the features and outputs the label or class with the maximum confidence. Algorithm 2 gives the algorithm for classification.
Its implementation can be seen in Figure 2.

Algorithm 1: Naive Bayes Training
Data: C, D, where C is a set of classes and D is a set of documents.
TrainNaiveBayes(C, D) begin
    V ← ExtractVocabulary(D)
    N ← CountDocs(D)
    for each c ∈ C do
        Nc ← CountDocsInClass(D, c)
        prior[c] ← Nc ÷ N
        textc ← ConcatenateTextOfAllDocumentsInClass(D, c)
        for each t ∈ V do
            Tct ← CountTokensOfTerm(textc, t)
        for each t ∈ V do
            condprob[t][c] ← (Tct + 1) ÷ Σt′ (Tct′ + 1)
    return V, prior, condprob

Algorithm 2: Naive Bayes Classification
Data: C, V, prior, condprob, d, where C is a set of classes, d is the new input document to be classified, and V, prior, condprob are the outputs of the training algorithm.
ApplyNaiveBayes(C, V, prior, condprob, d) begin
    W ← ExtractTermsFromDoc(V, d)
    for each c ∈ C do
        score[c] ← log(prior[c])
        for each t ∈ W do
            Ndt ← CountTokensOfTermInDoc(t, d)
            score[c] ← score[c] + Ndt × log(condprob[t][c])
    return argmax over c ∈ C of score[c]

Figure 1: Training Phase
Figure 2: Classification Phase

Concluding Remarks
The training phase only needs to count features per class and compute the class priors, and the classification phase combines these counts with the priors to pick the most probable label for new, unseen text. With basic pre-processing such as stopword removal, this simple, compute-efficient classifier works well for text classification problems like the product review data used here.
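The original Figures 1 and 2 showed the Python implementation; since they are not reproduced here, the following is a minimal, self-contained Python sketch of Algorithms 1 and 2 (function and variable names are illustrative, not taken from the original figures).

import math
from collections import Counter, defaultdict

def train_naive_bayes(classes, documents):
    # documents is a list of (tokens, label) pairs, e.g. (("great", "phone", "price"), "good")
    vocab = {t for tokens, _ in documents for t in tokens}
    n_docs = len(documents)
    prior, condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for tokens, label in documents if label == c]
        prior[c] = len(class_docs) / n_docs
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts[t] + 1 for t in vocab)   # add-one (Laplace) smoothing
        for t in vocab:
            condprob[t][c] = (counts[t] + 1) / total
    return vocab, prior, condprob

def apply_naive_bayes(classes, vocab, prior, condprob, tokens):
    scores = {}
    for c in classes:
        scores[c] = math.log(prior[c])
        for t in tokens:
            if t in vocab:                          # ignore terms never seen in training
                scores[c] += math.log(condprob[t][c])
    return max(scores, key=scores.get)

# Example usage with the labels from Table 1 (training_pairs is a hypothetical
# list of (tokens, label) tuples built from the extracted reviews):
# vocab, prior, condprob = train_naive_bayes(["good", "average", "bad"], training_pairs)
# apply_naive_bayes(["good", "average", "bad"], vocab, prior, condprob, ["great", "phone"])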

Aziro Marketing


How you can Hyperscale your Applications Using Mesos & Marathon

In a previous blog post we saw what Apache Mesos is and how it helps to create dynamic partitioning of our available resources, which results in increased utilization, efficiency, reduced latency, and better ROI. We also discussed how to install, configure, and run Mesos and sample frameworks. There is much more to Mesos than that.

In this post we will explore and experiment with a close-to-real-life Mesos cluster running multiple master-slave configurations along with Marathon, a meta-framework that acts as a cluster-wide init and control system for long-running services. We will set up 3 Mesos masters and 3 Mesos slaves, cluster them along with Zookeeper and Marathon, and finally run a Ruby on Rails application on this Mesos cluster. The post will demo scaling the Rails application up and down with the help of Marathon. We will use Vagrant to set up our nodes inside VirtualBox and will link the relevant Vagrantfile later in this post.

To follow this guide you will need to obtain the binaries for Ubuntu 14.04 (64-bit arch) (Trusty):
Apache Mesos
Marathon
Apache Zookeeper
Ruby / Rails
VirtualBox
Vagrant
Vagrant plugins: vagrant-hosts, vagrant-cachier

Let me briefly explain what Marathon and Zookeeper are. Marathon is a meta-framework you can use to start other Mesos frameworks or applications (anything that you could launch from your standard shell). So if Mesos is your data center kernel, Marathon is your "init" or "upstart". Marathon provides an excellent REST API to start, stop, and scale your application. Apache Zookeeper is a coordination server for distributed systems, used to maintain configuration information and naming, and to provide distributed synchronization and group services. We will use Zookeeper to coordinate between the masters themselves and the slaves.

For Apache Mesos, Marathon, and Zookeeper we will use the excellent packages from Mesosphere, the company behind Marathon. This will save us a lot of time, as we do not have to build the binaries ourselves. Also, we get to leverage a bunch of helpers that these packages provide, such as creating required directories, configuration files and templates, startup/shutdown scripts, etc. Our cluster will look like this:

The above cluster configuration ensures that the Mesos cluster is highly available because of the multiple masters. Leader election, coordination, and detection are Zookeeper's responsibility. Later in this post we will show how all these are configured to work together as a team. Operational Guidelines and High Availability are good reads to learn and understand more about this topic.

Installation

On each of the nodes we first add the Mesosphere APT repository and its key to the repository source lists, and update the system.

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ echo "deb http://repos.mesosphere.io/ubuntu trusty main" | sudo tee /etc/apt/sources.list.d/mesosphere.list
$ sudo apt-get -y update

If you are using some version other than Ubuntu 14.04 then you will have to change the above line accordingly, and if you are using some other distribution like CentOS then you will have to use the relevant rpm and yum commands. This applies everywhere henceforth.

On master nodes: In our configuration, we are running Marathon on the same box as the Mesos masters. The folks at Mesosphere have created a meta-package called mesosphere which installs Mesos, Marathon, and also Zookeeper.

$ sudo apt-get install mesosphere

On slave nodes: On slave nodes, we require only Zookeeper and Mesos installed.
The following command should take care of it.

$ sudo apt-get install mesos

As mentioned above, installing these packages does more than just install binaries. Much of the plumbing work is taken care of for you. You need not worry whether the mandatory "work_dir" has been created, in the absence of which Apache Mesos would not run, and other such important things. If you want to understand more, extracting the scripts from the packages and studying them is highly recommended. That is what I did as well.

You can save a lot of time if you clone this repository and then run the following command inside your copy.

$ vagrant up

This command will launch a cluster, set the IPs for all nodes, and install all the packages required to follow this post. You are now ready to configure your cluster.

Configuration

In this section we will configure each tool/application one by one. We will start with Zookeeper, then the Mesos masters, then the Mesos slaves, and finally Marathon.

Zookeeper

Let us stop Apache Zookeeper on all nodes (masters and slaves).

$ sudo service zookeeper stop

Next, let us configure Apache Zookeeper on all masters. Do the following steps on each master. Edit /etc/zookeeper/conf/myid on each of the master nodes. Replace the boilerplate text in this file with a unique number (per server) from 1 to 255. These numbers will be the IDs for the servers being controlled by Zookeeper. Let's choose 10, 30, and 50 as the IDs for the 3 Mesos master nodes, and save the files after adding 10, 30, and 50 respectively in /etc/zookeeper/conf/myid for the nodes. Here's what I had to do on the first master node; the same has to be repeated on the other nodes with their respective IDs.

$ echo 10 | sudo tee /etc/zookeeper/conf/myid

Next we edit the Zookeeper configuration file (/etc/zookeeper/conf/zoo.cfg) on each master node. For the purpose of this blog we are just adding the master node IPs and the relevant server IDs that were selected in the previous step. Note the configuration template line: server.id=host:port1:port2. port1 is used by peer ZooKeeper servers to communicate with each other, and port2 is used for leader election. The recommended values are 2888 and 3888 for port1 and port2 respectively, but you can choose to use custom values for your cluster.

Assuming that you have chosen the IP range 10.10.20.11-13 for your Mesos masters as mentioned above, edit /etc/zookeeper/conf/zoo.cfg to reflect the following:

# /etc/zookeeper/conf/zoo.cfg
server.10=10.10.20.11:2888:3888
server.30=10.10.20.12:2888:3888
server.50=10.10.20.13:2888:3888

This file will have many other Zookeeper-related configurations which are beyond the scope of this post. If you are using the packages mentioned above, the configuration templates should be a lot of help; definitely read the comments sections, as there is a lot to learn there. This is a good tutorial for understanding the fundamentals of Zookeeper, and this document is perhaps the latest and best document to learn more about administering Apache Zookeeper; this section in particular is relevant to what we are doing.

All Nodes

Zookeeper Connection Details

For all nodes (masters and slaves), we have to set up the Zookeeper connection details. These are stored in /etc/mesos/zk, a configuration file that you get thanks to the packages. Edit this file on each node and add the following URL carefully.

# /etc/mesos/zk
zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/mesos

Port 2181 is Zookeeper's client port, on which it listens for client connections.
The IP addresses will differ if you have chosen the IPs for your servers differently.

IP Addresses

Next we set up the IP address information for all nodes (masters and slaves).

Masters

$ echo <ip-of-this-node> | sudo tee /etc/mesos-master/ip
$ sudo cp /etc/mesos-master/ip /etc/mesos-master/hostname

Write the IP of the node in the file. Save and close the file.

Slaves

$ echo <ip-of-this-node> | sudo tee /etc/mesos-slave/ip
$ sudo cp /etc/mesos-slave/ip /etc/mesos-slave/hostname

Write the IP of the node in the file. Save and close the file. Keeping the hostname the same as the IP makes it easier to resolve DNS.

If you are using the Mesosphere packages, then you get a bunch of intelligent defaults. One of the most important things you get is a convenient way to pass CLI options to Mesos. All you need to do is create a file with the same name as the CLI option and put in it the value that you want to pass to Mesos (master or slave). The file needs to be placed in the correct directory: for Mesos masters the file goes in /etc/mesos-master, and for slaves in /etc/mesos-slave. For example:

$ echo 5050 | sudo tee /etc/mesos-slave/port

We will see some examples of similar configuration setup below. Here you can find all the CLI options that you can pass to the Mesos master/slave.

Mesos Servers

We need to set a quorum for the masters. This can be done by editing /etc/mesos-master/quorum and setting it to a correct value. In our case, the quorum value can be 2 or 3; we will use 2 in this post. The quorum is the strict majority: since we chose 2 as the quorum value, it means that out of 3 masters we will need at least 2 master nodes running for our cluster to run properly.

We need to stop the slave service on all masters if it is running. If it is not, the following command might give you a harmless warning.

$ sudo service mesos-slave stop

Then we disable the slave service by setting a manual override.

$ echo manual | sudo tee /etc/init/mesos-slave.override

Mesos Slaves

Similarly, we need to stop the master service on all slaves if it is running. If it is not, the following command might give you a harmless warning. We also set the master and zookeeper services on each slave to manual override.

$ sudo service mesos-master stop
$ echo manual | sudo tee /etc/init/mesos-master.override
$ echo manual | sudo tee /etc/init/zookeeper.override

The above .override files are read by upstart on an Ubuntu box to start/stop processes. If you are using a different distribution, or even Ubuntu 15.04, then you might have to do this differently.

Marathon

We can now configure Marathon, for which some work needs to be done. We will configure Marathon only on the master nodes. First create a directory for the Marathon configuration.

$ sudo mkdir -p /etc/marathon/conf

Then, as we did before, we set configuration properties by creating files with the same name as the property to be set, with the value of the property as the only content of the file (see the note above). The Marathon binary needs to know the values for --master and --hostname. We can reuse the files that we used for the Mesos configuration.

$ sudo cp /etc/mesos-master/ip /etc/marathon/conf/hostname
$ sudo cp /etc/mesos/zk /etc/marathon/conf/master

To make sure Marathon can use Zookeeper, do the following (note that the endpoint is different in this case, i.e.
marathon):

$ echo zk://10.10.20.11:2181,10.10.20.12:2181,10.10.20.13:2181/marathon | sudo tee /etc/marathon/conf/zk

Here you can find all the command-line options that you can pass to Marathon.

Starting Services

Now that we have configured our cluster, we can resume all services.

Master
$ sudo service zookeeper start
$ sudo service mesos-master start
$ sudo service marathon start

Slave
$ sudo service mesos-slave start

Running Your Application

Marathon provides a nice Web UI to set up your application. It also provides an excellent REST API to create, launch, and scale applications, check health status, and more. Go to your Marathon Web UI; if you followed the above instructions then the URL should be one of the Mesos masters on port 8080 (i.e. http://10.10.20.11:8080). Click on the "New App" button to deploy a new application. Fill in the details. The application ID is mandatory. Select relevant values for CPU, memory, and disk space for your application. For now let the number of instances be 1. We will increase it later when we scale up the application in our shiny new cluster.

There are a few optional settings that you might have to take care of, depending on how your slaves are provisioned and configured. For this post, I made sure that Ruby, the Ruby-related dependencies, and the Bundler gem were installed on each slave. I took care of this when I launched and provisioned the slave nodes.

One of the important optional settings is the "Command" that Marathon will execute. Marathon monitors this command and reruns it if it stops for some reason; this is how Marathon earns its claim to fame as an "init" and runs long-running applications. For this post, I have used the following command (without the quotes).

"cd hello && bundle install && RAILS_ENV=production bundle exec unicorn -p 9999"

This command reads the Gemfile in the Rails application, installs all the necessary gems required for the application, and then runs the application on port 9999. I am using a sample Ruby on Rails application, and I have put the URL of the tarred application in the URI field. Marathon understands a few archive/package formats and takes care of unpacking them, so we needn't worry about that. Applications need resources to run properly, and URIs can be used for this purpose. Read more about applications and resources here.

Once you click "Create", you will see that Marathon starts deploying the Rails application. A slave is selected by Mesos, the application tarball is downloaded and untarred, the requirements are installed, and the application is run. You can monitor all of these steps by watching the "Sandbox" logs that you should find on the main Mesos web UI page. When the state of the task changes from "Staging" to "Running", we have a Rails application running via Marathon on a Mesos slave node. Hurrah!

If you followed the steps above and read the "Sandbox" logs, you know the IP of the node where the application was deployed. Navigate to SLAVE_NODE_IP:9999 to see your Rails application running.

Scaling Your Application

All good, but how do we scale? After all, the idea is for our application to reach web scale and become the next Twitter, and this post is all about scaling applications with Mesos and Marathon. So this ought to be difficult, right? Scaling up or down is usually difficult, but not when you have Mesos and Marathon for company. Navigate to the application page on the Marathon UI. You should see a button that says "Scale". Click on it and increase the number to 2 or 3 or whatever you prefer (assuming that you have that many slave nodes).
In this post we have 3 slave nodes, so I can choose 2 or 3; I chose 3. And voila! The application is deployed seamlessly to the other two nodes, just as it was deployed to the first node. You can see for yourself by navigating to SLAVE_NODE_IP:9999, where SLAVE_NODE_IP is the IP of each slave where the application was deployed. And there you go: you have your application running on multiple nodes. It would be trivial to put these IPs behind a load balancer and a reverse proxy so that access to your application is as simple as possible.

Graceful Degradation (and Vice Versa)

Sometimes nodes in your cluster go down for one reason or another. Very often you get an email from your IaaS provider that your node will be retired in a few days' time, and at other times a node dies before you can figure out what happened. When such inevitable things happen and the node in question is part of the cluster running the application, the dynamic duo of Mesos and Marathon have your back. The system will detect the failure, de-register the slave, and deploy the application to a different slave available in the cluster. You could tie this up with your IaaS-provided scaling option and spawn the required number of new slave nodes as part of your cluster, which, once registered with the Mesos cluster, can run your application.

Marathon REST API

Although we have used the Web UI to add a new application and scale it, we could have done the same (and much more) using the REST API, and thus drive Marathon operations from a program or script. Here's a simple example that will scale the application to 2 instances. Use any REST client, or just curl, to make a PUT request to the application ID, in our case http://10.10.20.11:8080/v2/apps/rails-app-on-mesos-marathon, with the following JSON data as the payload (a full curl invocation is shown at the end of this post). You will notice that Marathon deploys the application to another instance if there was only 1 instance before.

{
    "instances": 2
}

You can do much more than this: run health checks, add/suspend/kill/scale applications, and so on. That could become a complete blog post in itself and will be dealt with at a later time.

Conclusion

Scaling your application becomes as easy as pressing buttons with a combination of Mesos and Marathon. Setting up a cluster can become almost trivial once you get your requirements in place and, ideally, automate the configuration and provisioning of your nodes. For this post, I relied on a simple Vagrantfile and a shell script to provision the system; later I configured the system by hand as per the above steps. Using Chef or a similar tool would make the configuration step a single-command job. In fact, there are a few open-source projects that are already very successful and do just that. I have played with everpeace/vagrant-mesos and it is an excellent starting point. Reading the code from these projects will help you understand a lot about building and configuring clusters with Mesos. There are other projects that do things similar to Marathon, and sometimes more; I would definitely like to mention Apache Aurora and HubSpot's Singularity.
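For reference, the scaling request described in the Marathon REST API section above can be issued with curl as follows. The IP and application ID are the ones used in this post, so substitute your own:

$ curl -X PUT -H "Content-Type: application/json" \
       -d '{"instances": 2}' \
       http://10.10.20.11:8080/v2/apps/rails-app-on-mesos-marathon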

Aziro Marketing


So You Want to Build Your Own Data Center?

In today’s internet-of-things world, companies run their applications 24×7, and this generally results in a lot of users and data. This data needs to be stored, analyzed, and post-processed; in essence, some action needs to be taken on it. We are looking at huge, fluctuating workloads. The scale of operations is enormous, and to handle operations of this magnitude, clusters are built. In the age of commodity hardware, clusters are easy to build, but clusters with a specific software stack that can do only one type of task (static partitioning of resources) lead to suboptimal resource utilization, because it is possible that no task of that type is running at a given time. For example, Jenkins slaves in a CI cluster could sit idle at night or during a common vacation period when developers are not pushing code. But when a product release is near, developers may be pushing and hacking away at code so frequently that the build queue grows because there are not enough slaves to run the CI jobs. Both situations are undesirable and reduce the efficiency and ROI of the company.

Dynamic partitioning of resources is the solution to this problem. Here, you pool your resources (CPU, memory, IO, etc.) so that the nodes of your cluster act as one huge computer. Based on the current requirement, resources are allocated to the task that needs them, so the same pool of hardware runs your Hadoop, MySQL, Jenkins, and Storm jobs. You can call this "node abstraction": diverse cluster computing on commodity hardware through fine-grained resource sharing. To put it simply, different distributed applications run on the same cluster.

Google has mastered this kind of cluster computing for almost two decades now, though the outside world does not know much about its project, known as Borg, or its successor, Omega. Ben Hindman, an American entrepreneur, and a group of researchers from UC Berkeley came up with an open-source solution inspired by Borg and Omega.

Enter Mesos!

What Is Mesos?

Mesos is a scalable, distributed resource manager designed to manage resources for data centers. Mesos can be thought of as a "distributed kernel" that achieves resource sharing via APIs in various languages (C++, Python, Go, etc.). Mesos relies on cgroups for process isolation and works on top of distributed file systems (e.g., HDFS). Using Mesos you can create and run clusters running heterogeneous tasks. Let us see what it is all about, along with some fundamentals on getting Mesos up and running.

Basic Terminology and Architecture

Mesos follows a master-slave architecture and can run with multiple masters and slaves. The multi-master setup makes Mesos fault-tolerant, with the leader elected through ZooKeeper.

A Mesos application, or in Mesos parlance a "framework," is a combination of a scheduler and an executor. The framework's scheduler is responsible for registering with the Mesos master and for accepting or rejecting resource offers from the Mesos master. An executor is a process running on a Mesos slave that actually runs the tasks. Mesos uses a distributed, two-level scheduling mechanism called a resource offer. A "resource offer" can be thought of as a two-step process: first, the Mesos master sends a message to a particular framework describing what resources (CPU, memory, IO, etc.) are available to it.
The framework then decides which offers to accept or reject and which tasks to run on the resources offered. A task could be a Jenkins job inside a Jenkins slave, a Hadoop MapReduce job, or even a long-running service like a Rails application. Tasks run in isolated environments, which can be achieved via cgroups, Linux containers, or even Zones on Solaris; since Mesos v0.20.0, native Docker support has been available as well. Examples of useful existing frameworks include Storm, Spark, Hadoop, Jenkins, etc. Custom frameworks can be written against the API provided by the Mesos kernel in various languages: C++, Python, Java, etc.

(Image credit: Mesos documentation)

A Mesos slave informs the Mesos master about the resources it is ready to share. Based on its allocation policy, the Mesos master makes "resource offers" to a framework, and the framework's scheduler decides whether or not to accept them. Once an offer is accepted, the framework sends the Mesos master a description of the tasks it needs to run (along with their resource requirements). The Mesos master then sends these tasks to the Mesos slave, where they are executed by the framework's executor. Finally, once a task is complete and the Mesos slave is idle, it reports the freed resources back to the Mesos master.

Mesos is used by Twitter for many of its services, including analytics, typeahead, etc. Many other companies with large cluster and big data requirements, such as Airbnb, Atlassian, eBay, and Netflix, use Mesos as well.

What Do You Get With Mesos?

Arguably the most important feature you get out of Mesos is "resource isolation." The resource can be CPU, memory, etc. Mesos allows multiple distributed applications to run on the same cluster, and this gives us increased utilization, efficiency, reduced latency, and better ROI.

How to Build and Run Mesos on Your Local Machine?

Enough with the theory! Now let us get to the fun bits of actually building the latest Mesos from Git and running Mesos and a test framework. The steps below assume you are running Ubuntu 14.04 LTS.

* Get Mesos
    git clone https://git-wip-us.apache.org/repos/asf/mesos.git
* Install dependencies
    sudo apt-get update
    sudo apt-get install build-essential openjdk-6-jdk python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev autoconf libtool
* Build Mesos
    cd mesos
    ./bootstrap
    mkdir build && cd build
    ../configure
    make

A lot of things can be configured, enabled, or disabled before building Mesos. Most importantly, you can choose where you want to install Mesos by passing the directory to "--prefix" at the configure step. You can optionally use system-installed versions of ZooKeeper, gmock, protocol buffers, etc., instead of building them, and thus save some time. You can also save time by disabling language bindings that you do not need. As a general rule, use a beefy machine with at least 8 GB of RAM and a reasonably fast processor if you are building Mesos locally on your test machine.

* Run tests
    make check

Note that these tests take a lot of time to build (if they are not built by default) and to run.

* Install Mesos
    make install

This step is optional; if you skip it, you can run Mesos from the build directory you created earlier. If you do install it, it goes into the $PREFIX you chose during the configure step; without a custom $PREFIX, it is installed under /usr/local (binaries in /usr/local/bin).
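Pulling the build and install steps together, a customized build might look like the sketch below. Treat the flag names as assumptions to verify against the output of "../configure --help" for your Mesos release.

    cd mesos
    ./bootstrap
    mkdir build && cd build
    # install under /opt/mesos and skip bindings you do not need (verify flag names with ../configure --help)
    ../configure --prefix=/opt/mesos --disable-java --disable-python
    make -j4
    sudo make install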
* Prepare the system
    sudo mkdir /var/lib/mesos
    sudo chown $(whoami) /var/lib/mesos

The above two steps are mandatory: Mesos will throw an error if the directory does not exist or if its permissions and ownership are not set correctly. You can choose some other directory, but you then have to pass it as work_dir; see the next command.

* Run Mesos Master
    ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos

It is mandatory to pass --work_dir with the correct directory as its value; the Mesos master uses it to store the registry's replicated log.

* Run Mesos Slave
    ./bin/mesos-slave.sh --master=127.0.0.1:5050

And voila! Now you have a Mesos master and a Mesos slave running.

Mesos by itself is incomplete; it uses frameworks to run distributed applications, so let's run a sample framework. Mesos ships with a bunch of example frameworks in the "mesos/src/examples" folder of your Mesos Git clone. For this article, I will run the Python framework that you should find in "mesos/src/examples/python".

You can play with the example code for fun and profit. See what happens when you increase the value of TOTAL_TASKS in "mesos/src/examples/python/test_framework.py", or try to simulate tasks of different durations by inserting a random amount of sleep into the run_task() method in "mesos/src/examples/python/test_executor.py".

* Run frameworks
    cd mesos/build
    ./src/examples/python/test-framework 127.0.0.1:5050

Assuming you have followed the steps above, you can view the Mesos dashboard at http://127.0.0.1:5050. Here is how it looked on our test box.

Conclusion

Marathon, a meta-framework on top of Mesos, is a distributed init.d: it takes care of starting, stopping, and restarting services. Chronos is a scheduler; think of it as a distributed, fault-tolerant cron (the *nix scheduler) that takes care of scheduling tasks. Mesos even has a CLI tool (pip install mesos.cli) with which you can interact (tail, cat, find, ls, ps, etc.) with your Mesos cluster from the command line and feel geeky about it (a rough usage sketch follows at the end of this post). A lot can be achieved with Mesos, Marathon, and Chronos together, but more about these in a later post. I hope you have enjoyed reading about Mesos. Please share your questions through the comments.
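If you want to try the mesos.cli tool mentioned in the conclusion, a session might look roughly like the following. This is a sketch only: the subcommand syntax and configuration details may differ between versions, and <task-id> is a placeholder for a real task ID from your cluster.

    pip install mesos.cli
    # point the tool at your master if it is not on the local machine (see the mesos.cli documentation)
    # list tasks known to the cluster
    mesos ps
    # inspect a task's sandbox
    mesos ls <task-id>
    mesos cat <task-id> stdout
    # follow a task's stderr
    mesos tail -f <task-id> stderr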

Aziro Marketing

EXPLORE ALL TAGS
2019 dockercon
Advanced analytics
Agentic AI
agile
AI
AI ML
AIOps
Amazon Aws
Amazon EC2
Analytics
Analytics tools
AndroidThings
Anomaly Detection
Anomaly monitor
Ansible Test Automation
apache
apache8
Apache Spark RDD
app containerization
application containerization
applications
Application Security
application testing
artificial intelligence
asynchronous replication
automate
automation
automation testing
Autonomous Storage
AWS Lambda
Aziro
Aziro Technologies
big data
Big Data Analytics
big data pipeline
Big Data QA
Big Data Tester
Big Data Testing
bitcoin
blockchain
blog
bluetooth
buildroot
business intelligence
busybox
chef
ci/cd
CI/CD security
cloud
Cloud Analytics
cloud computing
Cloud Cost Optimization
cloud devops
Cloud Infrastructure
Cloud Interoperability
Cloud Native Solution
Cloud Security
cloudstack
cloud storage
Cloud Storage Data
Cloud Storage Security
Codeless Automation
Cognitive analytics
Configuration Management
connected homes
container
Containers
container world 2019
container world conference
continuous-delivery
continuous deployment
continuous integration
Coronavirus
Covid-19
cryptocurrency
cyber security
data-analytics
data backup and recovery
datacenter
data protection
data replication
data-security
data-storage
deep learning
demo
Descriptive analytics
Descriptive analytics tools
development
devops
devops agile
devops automation
DEVOPS CERTIFICATION
devops monitoring
DevOps QA
DevOps Security
DevOps testing
DevSecOps
Digital Transformation
disaster recovery
DMA
docker
dockercon
dockercon 2019
dockercon 2019 san francisco
dockercon usa 2019
docker swarm
DRaaS
edge computing
Embedded AI
embedded-systems
end-to-end-test-automation
FaaS
finance
fintech
FIrebase
flash memory
flash memory summit
FMS2017
GDPR faqs
Glass-Box AI
golang
GraphQL
graphql vs rest
gui testing
habitat
hadoop
hardware-providers
healthcare
Heartfullness
High Performance Computing
Holistic Life
HPC
Hybrid-Cloud
hyper-converged
hyper-v
IaaS
IaaS Security
icinga
icinga for monitoring
Image Recognition 2024
infographic
InSpec
internet-of-things
investing
iot
iot application
iot testing
java 8 streams
javascript
jenkins
KubeCon
kubernetes
kubernetesday
kubernetesday bangalore
libstorage
linux
litecoin
log analytics
Log mining
Low-Code
Low-Code No-Code Platforms
Loyalty
machine-learning
Meditation
Microservices
migration
Mindfulness
ML
mobile-application-testing
mobile-automation-testing
monitoring tools
Mutli-Cloud
network
network file storage
new features
NFS
NVMe
NVMEof
NVMes
Online Education
opensource
openstack
opscode-2
OSS
others
Paas
PDLC
Positivty
predictive analytics
Predictive analytics tools
prescriptive analysis
private-cloud
product sustenance
programming language
public cloud
qa
qa automation
quality-assurance
Rapid Application Development
raspberry pi
RDMA
real time analytics
realtime analytics platforms
Real-time data analytics
Recovery
Recovery as a service
recovery as service
rsa
rsa 2019
rsa 2019 san francisco
rsac 2018
rsa conference
rsa conference 2019
rsa usa 2019
SaaS Security
san francisco
SDC India 2019
SDDC
security
Security Monitoring
Selenium Test Automation
selenium testng
serverless
Serverless Computing
Site Reliability Engineering
smart homes
smart mirror
SNIA
snia india 2019
SNIA SDC 2019
SNIA SDC INDIA
SNIA SDC USA
software
software defined storage
software-testing
software testing trends
software testing trends 2019
SRE
STaaS
storage
storage events
storage replication
Storage Trends 2018
storage virtualization
support
Synchronous Replication
technology
tech support
test-automation
Testing
testing automation tools
thought leadership articles
trends
tutorials
ui automation testing
ui testing
ui testing automation
vCenter Operations Manager
vCOPS
virtualization
VMware
vmworld
VMworld 2019
vmworld 2019 san francisco
VMworld 2019 US
vROM
Web Automation Testing
web test automation
WFH
