Bigdata, Database

Encouraging you to switch to Jupyter Lab…

Notebooks are great for prototyping and for longer pipelines or processes. If you are an exploratory data scientist using PyCharm or Jupyter Notebook, I would encourage you to switch to Jupyter Lab. For Jupyter Lab installation steps, go here. Below are some of the advantages I see in Jupyter Lab over Jupyter Notebook: the terminal now opens in a tab view, and multiple windows can be laid out easily, much like an IDE. This makes working on a remote server so much nicer: just start Jupyter Lab and an ssh tunnel and you have a […]

Analytics, Bigdata, Database

Why and when we need Machine Learning…

I have been in data management/data quality for several years. When I ask some people what their data management processes are, they simply reply, “well, we have some of our data stored in a database and other data stored on file shares with proper permissions.” This isn’t data management…it’s data storage. If you and/or your organization don’t have good, clean data, you are most definitely not ready for machine learning. Data management should be your first step before diving into any other data project(s). Now, if you do have good data management and data tagged for machine learning, give yourself a pause and […]

Bigdata, Database, Python

Python Lists and Lambda Learning…

There are many ways to use Python’s lists and lambdas. Here I am going to show some useful tips and tricks. So let’s first start with lists. Below are the list operations we use most of the time:

>>> a = [66.6, 333, 333, 1, 1234.5]
>>> print a.count(333), a.count(66.6), a.count('x')
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.6, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
1
>>> a.index(333, 2)
3
>>> a.remove(333)
>>> a
[66.6, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, […]
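The excerpt cuts off before the lambda material, so here is a minimal sketch of the kind of list-plus-lambda usage the title refers to; the variable names and examples are mine, not the author's:

nums = [66.6, 333, -1, 1, 1234.5]

# Sort a copy of the list by absolute value, using a lambda as the key function.
print(sorted(nums, key=lambda x: abs(x)))  # [-1, 1, 66.6, 333, 1234.5]

# Filter and transform with lambdas: keep the positive values, then square them.
positives = list(filter(lambda x: x > 0, nums))
print(list(map(lambda x: x * x, positives)))  # squares of 66.6, 333, 1 and 1234.5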

Apache Spark, Bigdata, Database

Apache Spark RDD API using PySpark…

In my previous article, I used Scala to show the usability of the Spark RDD API. Many of us use PySpark to work with RDDs and lambda functions. Though the function names and output are the same as in Scala, the PySpark syntax for RDD operations is different. Here I’ll explain PySpark RDDs using a different approach, and with a different perspective on solving the problem. Let us consider that we are streaming data using Spark, we have created an RDD in this streaming application, and we want to perform RDD operations on this stream of data in a particular time interval. Here I am […]
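As a small taste of what the full post covers, here is a minimal PySpark sketch of RDD operations driven by lambda functions; it uses a static RDD built with parallelize rather than the streaming source described above, and all names are illustrative:

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

# A local collection stands in for the streamed data described in the post.
words = sc.parallelize(["spark", "rdd", "api", "spark", "pyspark"])

# Classic word count: map each word to a (word, 1) pair, then reduce by key.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('rdd', 1), ('spark', 2), ...]

sc.stop()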

Analytics, Apache Spark, Bigdata, Database

Tips and Tricks for Apache Spark RDD API, DataFrame API - Part 1

I am planning to share my knowledge on the Apache Spark RDD and DataFrame APIs and some tips and tricks. If I combined everything into one article, it would be very lengthy, so I am dividing the material into three separate articles, of which this is the first: 1. Spark RDD API; 2. DataFrame API; 3. Tips and tricks on the RDD API and DataFrame API. Let us start with the basics of the RDD API. A Resilient Distributed Dataset (RDD) is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could […]
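To preview the two APIs the series covers, here is a minimal PySpark sketch that builds the same records as an RDD and as a DataFrame; the column names and data are mine, not the article's:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# The same records, first as an RDD acted on with lambdas ...
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
print(rdd.map(lambda row: row[1]).collect())  # [34, 45]

# ... then as a DataFrame, where columns are named and queries are declarative.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age > 40).show()

spark.stop()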

Bigdata, Database, Hadoop

Cloud Databases & Cloud Blob…

Cloud computing is the next stage in the evolution of the Internet. The cloud in cloud computing provides the means through which everything, from computing power, computing infrastructure, and applications to business processes and personal collaboration, can be delivered to you as a service wherever and whenever you need it. Cloud databases are web-based services, designed for running queries on structured data stored on cloud data services. Most of the time, these services work in conjunction with cloud compute resources to give users the capability to store, process, and query data sets within the cloud environment. These services are designed to […]

Database, GPU, PostgreSQL

PG-Storm: Let PostgreSQL run faster on the GPU

The PostgreSQL extension PG-Storm allows users to customize the data scan and run queries faster. CPU-intensive workload is identified and transferred to the GPU, which uses its powerful parallel execution ability to complete the data task. Compared with a CPU's small number of cores and limited RAM bandwidth, the GPU has a unique advantage: GPUs typically have hundreds of processor cores and RAM bandwidths several times larger than CPUs', so they can handle large numbers of computations in parallel and their operations are very efficient. PG-Storm is based on two basic ideas: on-the-fly native GPU code generation […]
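A quick way to see whether a query is actually offloaded is to inspect its plan. Below is a minimal sketch using psycopg2 (my choice, not the post's) against a PostgreSQL instance with the extension enabled (the package installs as pg_strom); the connection details, table t, and column x are placeholders:

import psycopg2

# Placeholder connection details; adjust for your environment.
conn = psycopg2.connect(host="localhost", dbname="testdb", user="postgres")
cur = conn.cursor()

# When PG-Storm takes over a CPU-intensive scan, the plan shows a custom
# GPU scan node instead of a plain sequential scan.
cur.execute("EXPLAIN SELECT count(*) FROM t WHERE x % 100 = 0")
for (line,) in cur.fetchall():
    print(line)

cur.close()
conn.close()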

Best Practices, Database, Hive

Best Practices for Hive Authorization when using a connector to HiveServer2

Recently we have been working with Presto and configuring its Hive connector. It got connected successfully with the steps given at prestodb.io/docs/current/connector/hive.html. An overview of our architecture: Presto runs on a separate machine (the Presto machine) and uses the Hive connector to communicate with a Hadoop cluster running on different machines. The Presto machine has a hive.properties file, which tells Presto to use a Thrift connection to the Hive client, and hdfs-site.xml/core-site.xml files for HDFS. Below is the architecture of our environment, followed by the command to invoke Presto: /presto --server XX.X.X.XX:9080 --catalog hive. No presto user exists in […]
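The post drives Presto through its CLI; as an alternative way to exercise the same Hive catalog from code, here is a minimal sketch using the PyHive library (my addition, not part of the post), with host and port kept as the placeholders from the CLI example above:

from pyhive import presto

# Placeholder coordinates matching the CLI's --server value.
conn = presto.connect(host="XX.X.X.XX", port=9080, catalog="hive", schema="default")
cur = conn.cursor()

# List the tables Presto sees through the Hive connector.
cur.execute("SHOW TABLES")
print(cur.fetchall())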

Database, Hbase, Tephra

Tephra is an open-source project that adds complete transaction support to Apache HBase…

Transaction support in HBase? Yes, a wide range of use cases require transaction support. Firstly, we want the client to have great insight into, and fine-grained control of, what the transaction system can do. Having full control on the client side not only allows you to make the best decisions for optimizing specific use cases, but it also makes integration with third-party systems simpler. Secondly, when different types of components in your application share data and update it in multiple data stores in many different ways (as Hadoop applications do), it is important for the transaction system to support you. Thirdly, […]

Database, HPL

HPL/SQL: Make SQL-on-Hadoop More Dynamic

Think about the old days, when we solved many business problems using dynamic SQL, exception handling, flow-of-control, and iteration. Working on a couple of migration projects, I found a few business rules that needed to be transformed to be Hive-compatible (some of them very complex and nearly impossible). The solution is HPL/SQL (formerly PL/HQL), a language translation and execution layer developed by Dmitry Tolpeko (http://www.hplsql.org/). Why HPL/SQL? The role of Hadoop in data warehousing is huge. But to implement comprehensive ETL, reporting, analytics, and data mining processes, you not only need distributed processing engines such as MapReduce, Spark or Tez, you […]