Hadoop

Reference architecture of bigdata solution in GCP and Azure…

This article is a showcase of a Reference architecture approach for the financial sector where stream and batch processing is a common part of its solution with other designs. Firstly the requirement analysis is the step to define the implementation of any use case. Therefore before moving to reference architecture we first need to understand Requirements Engineering. 

Requirements Engineering is regarded as one of the most important steps in software engineering and takes about 30% of project time. When done properly, it can provide a good foundation for the system design and development as the functionality and components needed for the system become clear during the Requirements Engineering process. it is important to understand what a requirement is. Generally spoken, it is the property of the system requested by a stakeholder for solving a problem, achieving an aim or fulfilling a contract. A stakeholder can be a user of the developed system or a customer who is buying a product that is supported by the system. Since requirements can be the basis of a contract between the software provider and customer, they have to be described precisely and unambiguously. On the other hand, it should be possible to adjust them later during the project in case the circumstances do change, so the requirements should also be flexible. Defining the requirements can be covered in two main activities:

a. Requirements Elicitation: Here the requirements for the system are identified using terms understood by customers and users. The elicitation takes place in close collaboration with the users and customers, often involving meetings and workshops in order to understand their needs. The requirements are then reconciled with the different stakeholders, analyzed for their feasibility and consolidated before being written down for documentation purposes. The final result of this process step is the Requirements Specification document that describes the purpose, functionality and environment of the system to be developed on a high-level using abstractions. Its main goal is to show what the system does and not how it does it.

b. Analysis (Modeling): Here the first steps of the system design take place. The system to be developed is described using a semi-formal modeling language, e.g. UML. The result is a so-called technical specification that is used as the foundation for the development by business analysts and programmers and already contains detailed descriptions. After it comes the system decomposition and the detailed design of the system.

Collecting requirements can be classified into two categories below:-

Functional: As the name already tells, a functional requirement describes the functionality of a system. The functionality is basically what the system does do if a user interacts with it through a user interface, for example the execution of a sequence of functions. It also covers the relationship between inputs and outputs in the system. A functional requirement can also state what the system shouldn’t do in specific/abnormal situations and is formulated independently from the actual implementation.

Non-Functional: In most cases these do refer to the system as a whole instead of one single feature in it and thus have a great impact for the system’s architecture. Nonfunctional requirements deal with the three following topics: performance, quality requirements and constraints. In performance requirements the system’s speed, recovery and response times and the availability are described. When it comes to a distributed Big Data system, they are very important hence they are relevant for the system’s stability. Issues like the number of simultaneous users or the amount of data processed are clarified here. Quality requirements include aspects like adaptability, maintainability and security. Constraints – or pseudo-requirements – are additional limitations to the system design brought up either through compliance or through the customer. This can be for instance the requirement to implement the system in a specific programming language or the need to comply with regulatory standards. Furthermore, organizational requirements have to be considered when eliciting and analyzing non-functional requirements.

Note : Before starting implementation and particularly before rolling out the system, it makes sense to validate them with the customer, as the cost of solving a requirements error after a roll-out are up to 100 times more than fixing an implementation error.

Now lets move to develop a set of cross-industrial Big Data artifacts, including definitions, security standards and finally a Big Data Reference Architecture. As the aim of this article is to design a Big Data Reference Architecture for the financial sector, the approach chosen here may varies with a very complex system(I guess that would be highly interesting). For developing a generic Reference Architecture is started out by collecting a number of cross-industrial Use Cases. The Use Case specific requirements were afterward verified for deriving generic, cross Use Case requirements from them and finally map these generic requirements to components here in the Reference Architecture. For use case evaluation I have used the template below.

1. Use Case Title

2. Use Case Description

3. Big Data Characteristics

4. Big Data Science

5. Security and Privacy

6. Organizational Requirements

7. Other Big Data Challenges

Detail view of this template document I’d share separately into another article. As of now we can move on to Reference Architecture. Before turning to the Reference Architecture it is needed to important to understand what IT-architecture actually is.

According to the IEEE14, architecture is “the fundamental organization of a system embodied in its components, their relationship to each other, and to the environment, and the principles guiding its design and evolution”.

TOGAF, an architecture framework, offers the following two definitions, depending on the usage context: “[Architecture is a] formal description of a system, or a detailed plan of the system at component level, to guide its implementation. [Architecture is] the structure of components, their inter-relationships, and the principles and guidelines governing their design and evolution over time”

So generally spoken architecture is a collection of artifacts that describe the set-up of a system by defining the relationships between them. An artifact can for example be an architectural component (e.g. a data lake in a Big Data context) or a guideline that governances the design of a system. Eventually IT-architecture creates the foundation for a structured system design and development.

TOGAF names the following three main architecture fields

a. Business Architecture: Here the business strategy, core business processes, business capabilities and the organizational structure are defined. The interrelation between all the artifacts named is described in this section – as are the high-level business requirements. Business Architecture is closely related to other business topics such as enterprise planning or business product development and has to be reconciled with them. It is also the precondition and baseline for designing the other architectures

b. Information System Architecture: In some cases also called Solution Architecture, it is made up of two parts:

  1. Application Architecture: Here the applications and (sub-) systems required for enabling the business processes are described thus giving an overview over the application landscape. Furthermore, it includes the description of the application’s relationships to each other. From a more detailed point of view, especially concerning (sub-) systems, the interfaces between them are covered by the Application Architecture, too. The Application Architecture relies upon the data objects identified in the Data Architecture for designing the entire software architecture.
  2. Data Architecture: All data objects, entities and their relationships needed for implementing the business processes and functions that were defined in the Business Architecture are listed here. A deeper view provides also the attributes for each data entity. Aspects like data integration, transformation, quality, migration (in case an existing application will be replaced with a new one), storage and governance are addressed in the Data Architecture as well.

c. Technology Architecture: Sometimes also called technical architecture, this part of architecture deals with logical and physical infrastructure issues, including the infrastructure landscape, its setup and operation. Above all the server nodes are mapped to the applications that run on them and the network protocols for the communication between systems and nodes are defined.

Finally the term Reference Architecture has also no unique definition. Seen from a more technical point of view, a Reference Architecture is a set of patterns designed for being used in a specific technical or business context and supported by a further set of artifacts. Other, more business-oriented definitions take also business goals and organizational aspects into account. Based on this it can be said that a Reference Architecture should present a standardized solution consisting of functional architectural components for a specific domain or industry problem. The components should cover both technological issues on an abstract level (e.g. data integration or data processing) and business requirements (e.g. for complying with regulatory standards). In a Big Data context a Reference Architecture should provide a conceptual standard or model that can be used as baseline or blueprint for developing Big Data systems. This standard should contain all components required for building a Big Data system in the respective domain whilst ensuring that the components are vendor, product and technology-neutral. This means in case of a real-time data processing component the component will not be named Apache Spark but e.g. “real-time processing engine”. Furthermore the Reference Architecture should introduce some common terms understandable for stakeholders both from business and technology. Since the Reference Architecture should cover business needs, the components should be functionally motivated. Finally security and data privacy issues have to be addressed by including the respective components into the Reference Architecture.

Big Data Reference Architectures where all components are grouped into three main categories: Data Platform, Analytics Platform and Management Platform.

Data Platform:

  • Information Gathering: Covers all mechanisms needed for ingesting data into the platform such as streaming, messaging and ETL. Furthermore basic validations and transformations are executed here. A task scheduler for managing the available resources is in place as well.
  • Information Store: This layer offers technologies for storing the available data. It ranges from conventional relational databases over document storage to technologies for storing massive amounts of data such as NoSQL. For enabling fast access to data In-memory databases exist.
  • Data Processing: Here are the technologies for processing large amounts of data and/or processing data at high speed, e.g. through distributed parallel processing. Conventional data processing, for instance by using Rules Engines, can be found here as well as data enrichment.

Analytics Platform:

  • Data Analytics: This layer covers the technologies required for getting insights from the data such as ML, Text and Data Mining, and Natural Language Processing. Furthermore mechanisms for simulating and optimizing the ML models are available here.
  • Decision Support/Utilization: Here various tools and mechanisms for presenting and visualizing the results from the data analysis are covered. Furthermore possible consumers of these results such as Business Process Management or search functions can be found here. Interestingly Online Analytical Processing (OLAP) is mentioned as a single element in this layer, although the complete OLAP process covers many other components such as Information Gathering, Data Processing and Data Analytics.

Management Platform:

  • Governance: All processes for data lifecycle management, ensuring high data quality and data audits can be found here. Furthermore the highly important aspect of IT Security is covered here.
  • Infrastructure: Any tools or technologies required for operating the components from the Big Data Reference Architecture are covered here. The goal of this layer is ensuring scalability, performance and availability.

Google: Google’s Big Data Reference Architecture (or Google Cloud Platform Architecture – GCP) consists of four main stages: Ingest, Store, Process & Analyze, and Explore & Visualize. The data passes these four stages from left to right. Although no infrastructure, governance or security components are shown in the provided architecture diagram, GCP ofcourse offers all these components. An overview over the components within GCP is given, although not each of them is described here in detail.

  • Ingest:It provides tools for collecting any incoming data and pipelining it to the Storage layer. Specifically, GCP provides tools for three possible sources: application data, streaming data and batch data. Application data can come from click-streams or transactions as well as application monitoring systems and will pass Stackdriver Logging (a component designed for log transfer). All this data can either be streamed or batch-loaded to the target storage or processing system, depending on the latency required for the respective Use Case. Streaming can be needed e.g. for pipelining sensor or clickstream data by using Cloud Pub/Sub, a real-time messaging service offered by GCP. Bulk data eventually can be loaded as a batch from the respective source, be it a relational or NoSQL database, or another existing cloud storage e.g. at Amazon Web Services (AWS). Here GCP also provides a number of tools, e.g. Cloud Transfer Service or Transfer Appliance.
  • Storage: In this layer GCP provides a number of storage technologies for keeping the data that came in during the Ingestion phase. With Cloud Storage GCP has got an object (file) storage for both structured and unstructured data. It can store data from ETL processes or media and is integrated with many other GCP components, like e.g. ML APIs. Cloud Storage guarantees fast access to data and can be configured for both frequent or seldom data access. When it comes to database storage, GCP covers both relational and NoSQL databases. Cloud SQL makes it possible to store any data intended for a conventional database such as Online Transactional Processing (OLTP) data or customer data. With Cloud Spanner there is now a relational database that guarantees not only ACID consistency but is also highly scalable for large loads of data. Cloud Datastore is a document database with a flexible, yet structured data schema ensuring high scalability. Cloud Bigtable is a column-oriented NoSQL database that was designed for guaranteeing high throughput and low latency for large datasets. It is comparable to other NoSQL databases HBase or Cassandra and can be used for storing IoT sensor data, real-time application and streaming data. Finally GCP offers also an analytical database – BigQuery. It can be seen as a Data Warehouse for Big Data that makes it possible to execute both real-time and conventional analytics on data stored there. If necessary – for instance in a velocity application – data can be pipelined directly to Big Query for being analyzed there. BigQuery is also a possible solution for OLAP tasks.
  • Process & Analyze: Simply storing data is not enough – it has to be analyzed for deriving insights from it so that the company’s business can benefit from it. Again GCP offers a number of tools that can be used for processing and analyzing data. When it comes to analyzing large datasets, distributed processing clusters are required. By using Cloud Dataproc, companies can move their existing Hadoop or Spark clusters to a service that administers and monitors these clusters and above all enables integration with other GCP components, e.g. from the Storage layer. Cloud Dataflow is a service for optimizing both stream and batch-oriented processing tasks by offering on-demand resources instead of having a predefined cluster size. One can use Apache Beam (a programming model) for writing ones own batch and streaming data processing pipelines and then deploy them to Cloud Dataflow. It is also integrated through connectors with various products from the Storage layer like Bigtable or Cloud Storage. Since BigQuery is an OLAP solution it has to ensure not only the respective storage functionalities but also the analytical and querying capabilities needed for Business Intelligence or real-time analytics. Finally ML is a very important aspect in GCP’s Process & Analyze layer. Here GCP provides two things: the first is a legion of ML APIs for proven models that have already been trained by Google. These API’s include text analytics (Cloud Natural Language API), image processing and recognition (Cloud Vision API), audio data analysis, video recordings analysis (Cloud Video Intelligence API) and language translation. The second thing is Google Cloud Machine Learning. It can be applied for executing and training models a company has developed on its own by using TensorFlow, Google’s ML framework. TensorFlow offers a wide range of ML algorithms including Deep Learning and is highly scalable as well. After a preprocessing stage, TensorFlow models are converted into models that can run on Cloud Machine Learning (also called graph building) where they are afterwards trained on a large dataset. Finally when a model has been trained it is used for making predictions [88]. Above all it is also possible to deploy ML models on GCP that were developed with other ML tools such as MLlib from Apache Spark.
  • Explore & Visualize: In this layer GCP provides several tools for visualizing the results of the calculations from the Process & Analyze layer. With Cloud Datalab data scientists can explore datasets by executing test models (that were developed for example with TensorFlow) and see their outcomes directly – this approach is also called sandboxing). Cloud Datalab is intended for visualizing the results and the data from data science calculations and analyzes. For visualizing data from any kinds of Business Intelligence analyses Cloud Data Studio is available, where it is possible to create drag-and-drop dashboards, reports or other visualizations. When integrated with BigQuery data has not to be imported to Data Studio but can be accessed directly thus enabling a visualization of real-time analytics. Above all BigQuery can be integrated with external providers of data visualization tools like e.g. Tableau.

Figure Above: Google’s Big Data Reference Architecture in action

Microsoft Azure: The architecture described here shows which technologies Microsoft Azure does offer for a Big Data architecture. Like the architecture diagram showing the components in GCP (In above figure) Microsoft’s architecture presents a number of technologies that can be used in the different phases of a data pipeline. Depending on the Use Case or project that has to be implemented, the required components can be selected and put into a solution architecture. Microsoft Azure is able to take in data from many different sources, such as sensors, transactional or CRM systems, social media and so on. A cross-layer component is Data Factory, an orchestration tool that takes care of coordinating task execution by single components (when they process data workloads) and distributes existing resources amongst them. It is also worth noting that the Streaming and Batch layers introduce technologies for both ingesting the data and also processing it so that these two phases of a data lifecycle are somewhat intermixed here. After data comes in from one of the sources it passes the following layers (not necessarily all or in this order).

  • Streaming: For any data that has to be loaded at (near) real-time speed into a Big Data application Microsoft’s own technologies and other open source products are available. The latter are for instance Spark Streaming or Apache Storm that can be used not only for pipelining but also for processing incoming data. With Azure Stream Analytics it is possible to process streaming data by applying data transformations and manipulations, and integrate it with both storage technologies and PowerBI a Presentation layer. Event Hub and IoT Hub are simple messaging (or event) queues for streaming data into an application in the conventional sense, i.e. without processing it. Both are able to persist data for a period of eight days in case the data pipeline should break down so that no data will be lost during the time of an outage. The IoT hub supports most common protocols in the IoT area and can be used for bidirectional communication, i.e. not only for passing data from an IoT device into the pipeline but also for sending event triggers from the Big Data application back to the device.
  • OLTP:Interestingly, Microsoft Azure lists OLTP with the respective storage technologies as an own layer in its architecture. Azure SQL Database is a storage technology that easily scales for large amounts of data and offers fast access to data – however it can only store data that has a strict schema and therefore is not suitable for many Big Data datasets. Azure Cosmos DB is a schema-less NoSQL database that is optimized for high throughput and can guarantee ACID for database transactions. Finally with Azure HDInsight there is a managed service available where one can set up clusters for HBase or other open source Big Data technologies (also for data processing such as Apache Spark or Apache Storm). HBase is a column-oriented NoSQL database that is highly scalable and can be used for fast OLTP operations.
  • Storage: For storing data Microsoft Azure offers Blob Storage that stores objects such as text data or media files and can be used as a data repository. When it comes to storing large amounts of data or accessing them quickly, Azure Data Lake Storage exists for solving these problems. It is based on Hadoop Distributed File System (HDFS); can be used for analytical workloads and can be easily integrated with other Big Data technologies from the Hadoop ecosystem. Any large data sets from batch or streaming, such as clickstreams or sensor data can be stored here. It is possible to use analytical applications like Apache Hive or Apache Impala (for real-time analytics) for accessing data directly from Azure Data Lake Storage.
  • Batch: When it comes to batch processing, Microsoft Azure supports a wide range of open source and Microsoft’s own products. In HDInsight, a managed service from Microsoft, batches can be processed using a Hadoop implementation provided by Cloudera, a software company. Furthermore it offers support for micro-batches with Apache Spark or conventional Hadoop batches executed using Hive (a data-warehouse solution) or Pig (a platform for analyzing and transforming large datasets). Additionally Azure Data Lake, a storage solution mentioned before, provides Azure Data Lake Analytics, a service that makes it possible to query large amounts of data that has no schema by using the language U-SQL (a combination of C# and SQL). Unlike HDInsight it does not need a pre-configuration of the amount of resources needed for executing a task, so it is much easier and flexible to use. Azure SQL Data Warehouse makes it possible to integrate data with a strong schema with data having only a weak schema.
  • Analytics: In order to analyze data, Microsoft Azure provides support for R and Apache Spark MLlib. Using R, a statistical programming language, ML models can be created and trained. Spark MLlib is a ML framework from Apache Spark that contains pre-trained models and mechanisms for tuning these models. Due to its integration with Spark SQL it can access large data sets required for training the models directly. Furthermore Microsoft offers Azure Machine Learning (or Machine Learning Studio), a managed service where own ML models can be easily developed, e.g. by using Python or R, and afterwards deployed. Above all it contains several pre-trained models for solving common problems such as text analytics or image recognition.
  • Presentation: Besides visualizing data with Excel, Microsoft offers PowerBI, a tool for visualizing data graphically with dashboards, scorecards and other presentation tools. PowerBI can be integrated into business applications thus also enabling direct access to data and ad-hoc analytics. RStudio and Azure Machine Learning also offer several mechanisms for visualizing results of analytical and predictive ML models.

Figure Above: Azure Big Data Reference Architecture in action.

Summary: This article has shown that Big Data can be applied for many different Use Cases by using some generic artifacts. It also showcase a technical basis for implementing these Use Cases by providing a Big Data Reference Architecture and a Big Data platform. Keeping the Big Data Reference Architecture and using a platform approach for reducing redundancies and assuring flexibility should always be part of the IT strategy. With the technical path being set, it is up to the users to decide which Use Cases their companies should implement by an exploratory PoC. Exploring new opportunities is also essential because techs and internet companies like MSFT, Google and Amazon will seek new business opportunities in the cloud market and offer new products with a new, compelling customer experience.

Leave a Reply

Your email address will not be published. Required fields are marked *