Analytics, Best Practices, Bigdata, Exploratory Data Analysis, Machine Learning

Residual Plots for Regression Analysis…

In my last article I promised to show you the parameters used to judge the accuracy and predictive power of a regression model, but before going into that we first need to understand the importance of residual plots. Without them, any discussion of regression would be incomplete. Residual analysis lets us check whether a linear model suits the data or whether a nonlinear one is needed. Residual plots reveal unwanted patterns in the residuals that indicate biased results. You just need to master reading them visually. In residual analysis we check that the residuals are randomly scattered around zero across the entire range of fitted values. […]
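To make the "randomly scattered around zero" check concrete, here is a minimal sketch in plain Python. The data sets and the helper names `fit_line` and `residuals` are made up for illustration and do not come from the article. It fits a straight line to one roughly linear series and one curved series, then prints the residuals so the difference in pattern is visible even without a plot.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Observed value minus fitted value for every point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


xs = list(range(1, 11))
# Hypothetical data: a straight line with small alternating noise ...
linear_ys = [2.0 * x + 1.0 + (0.3 if x % 2 else -0.3) for x in xs]
# ... and a curve that a straight line cannot capture.
curved_ys = [float(x * x) for x in xs]

for name, ys in [("linear", linear_ys), ("curved", curved_ys)]:
    print(name, [round(r, 2) for r in residuals(xs, ys)])
```

For the linear series the residuals stay small and keep changing sign. For the curved series they are positive at both ends and negative in the middle, the kind of systematic pattern that signals a straight line is the wrong model.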

Best Practices, Bigdata, Data Science, Exploratory Data Analysis, Machine Learning

Ordinary least squares regression (OLSR)

Invented in 1795 by Carl Friedrich Gauss, ordinary least squares regression (OLSR) is considered one of the earliest known general prediction methods. OLSR is a linear modeling technique. It estimates the unknown parameters of a linear regression model by minimizing the sum of the squared differences between the observed values of the response variable and the values predicted from the explanatory variables. Ordinary least squares regression is also known as ordinary least squares or least squared errors regression. Let's start with a linear regression model like the one below. Here is some of the terminology we use when we […]
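As a rough illustration of what "minimizing the sum of squares" means, the sketch below uses the closed-form estimates for a single explanatory variable. It then checks that nudging either coefficient away from those estimates makes the sum of squared errors larger. The observations and the function names `sum_squared_errors` and `ols_fit` are hypothetical and chosen only for this example.

```python
def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def ols_fit(xs, ys):
    """Closed-form OLS estimates (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(xs, ys)
best = sum_squared_errors(xs, ys, b0, b1)

# Any small change to the fitted coefficients increases the objective.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_squared_errors(xs, ys, b0 + d0, b1 + d1) > best

print("intercept:", round(b0, 4), "slope:", round(b1, 4), "SSE:", round(best, 4))
```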

Analytics, Apache Spark, Best Practices, Bigdata, Framework

Tips and Tricks for Apache Spark RDD API, DataFrame API: Part 1

I am planning to share my knowledge of the Apache Spark RDD and DataFrame APIs, along with some tips and tricks. Combining everything into one article would make it very lengthy, so I am dividing it into three separate articles, and this is the first in that series. The three parts are: the Spark RDD API, the DataFrame API, and tips and tricks for both. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could […]

Best Practices, Bigdata, Hadoop, Kafka

Better late than never: Time to replace your micro-service architecture with Kafka…

Kafka has already helped many organizations move into the microservices architecture world. If Kafka is still not part of your infrastructure, it's high time you adopted it. I am not claiming Kafka is better than any other message queue system, as many articles on that subject are already floating around the internet. Kafka’s uniqueness is that it provides both simple file system and bridge functions. A Kafka broker’s most basic task is to write messages to and read messages from the log on disk as quickly as possible. Queued messages are not lost once they have been persisted, which is […]
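The idea of a log on disk that survives restarts can be sketched with a toy append-only file. This is not Kafka's API or storage format. The class `AppendOnlyLog`, its methods, and the file name are all hypothetical, and the sketch assumes one message per line with the line index standing in for the offset.

```python
import os
import tempfile


class AppendOnlyLog:
    """Toy append-only log: one message per line, offset = line index."""

    def __init__(self, path):
        self.path = path
        # Create the file if needed, then count messages already persisted.
        with open(path, "a+", encoding="utf-8") as f:
            f.seek(0)
            self.next_offset = sum(1 for _ in f)

    def append(self, message):
        """Persist one message and return the offset it was written at."""
        if "\n" in message:
            raise ValueError("messages must be single-line in this toy format")
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(message + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the write to disk
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        with open(self.path, encoding="utf-8") as f:
            return [line.rstrip("\n") for i, line in enumerate(f) if i >= offset]


log_path = os.path.join(tempfile.mkdtemp(), "topic-0.log")  # hypothetical name
log = AppendOnlyLog(log_path)
log.append("order-created")
log.append("order-paid")

# Opening the same file again stands in for a restart: the messages remain.
reopened = AppendOnlyLog(log_path)
print(reopened.read_from(0))
```

Because writes only ever go to the end of the file and readers track their own offset, a consumer can resume from wherever it stopped by calling `read_from` with its last offset plus one.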

Best Practices, Hadoop, Hive

Performance utilities in Hive

Before going into the details of the utilities Hive provides, let me explain a few components so you can follow the execution flow and see where the related information is stored in the system. Hive is data warehouse software best suited to OLAP (OnLine Analytical Processing) workloads that handle and query vast volumes of data residing in distributed storage. The Hadoop Distributed File System (HDFS) is the storage layer in which Hive keeps the data reliably and survives hardware failures. Hive is a SQL-like relational big data warehousing approach developed on top of Hadoop. HiveQL, as described, is an SQL-like query language for […]

Best Practices, Database, Hive

Best Practices for Hive Authorization when using connector to HiveServer2

Recently we have been working with Presto and configuring its Hive connector. It connected successfully using the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on a separate machine (the Presto Machine) and uses the Hive connector to communicate with a Hadoop cluster running on different machines. The Presto Machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive client, and hdfs-site.xml and core-site.xml files for HDFS. Below is the architecture of our environment. Below is the command to invoke Presto… /presto --server XX.X.X.XX:9080 --catalog hive No presto user exists in […]

Best Practices, Hive

Hive Naming conventions and database naming…

Short Description: Naming conventions make it easier for programmers and architects to understand what is going on inside a business. Article: I have worked with almost 20 to 25 applications. Whenever I start on one, I first have to understand that application's naming convention, and I keep wondering why we do not all follow a single naming convention. As Hadoop is evolving rapidly, I would like to share my naming convention, so that if you come to my project you will feel comfortable, and so will I if you follow it too. Database Names: If the application serves a technology, then the database name would be […]