Tuesday, August 25, 2015

New BI Paradigm

We are in the midst of a fundamental shift in how data is stored, with rapid adoption of NoSQL and other technologies for storing unstructured and semi-structured data.

With an estimated 80% of enterprise data in unstructured or semi-structured form, enterprises are increasingly adopting a “Polyglot Persistence” architecture, which is a fancy term for using different database technologies to handle different persistence needs. In practice, this means traditional SQL-based relational databases for structured data, in conjunction with newer NoSQL databases and file-based sources for unstructured and semi-structured data.
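To make the term concrete, here is a minimal sketch of polyglot persistence in application code. The databases, credentials, and fields are all hypothetical; the point is simply that one application writes structured rows to MySQL while pushing schemaless event documents to MongoDB.

```python
# Hypothetical polyglot persistence: structured orders in MySQL,
# semi-structured clickstream events in MongoDB.
import mysql.connector
from pymongo import MongoClient

# Structured, relational data: fixed schema, transactional guarantees.
mysql_conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="shop"
)
cursor = mysql_conn.cursor()
cursor.execute(
    "INSERT INTO orders (customer_id, total, status) VALUES (%s, %s, %s)",
    (42, 19.99, "PAID"),
)
mysql_conn.commit()

# Semi-structured data: schemaless documents whose fields vary per event.
mongo = MongoClient("mongodb://localhost:27017")
mongo.shop.events.insert_one({
    "customer_id": 42,
    "type": "page_view",
    "context": {"url": "/checkout", "device": "mobile"},  # nested, optional
})
```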

So, there are lots of great options for storing the data. But when it comes to unlocking the value of this data across heterogeneous sources, there are significant challenges, given that traditional Business Intelligence solutions are built for structured data.

The standard approach to BI looks like this:

1. Extract, Transform & Load (ETL): Relevant data from heterogeneous sources is typically consolidated into a data warehouse, but the first step in that process is the dreaded ETL. This usually entails:
a)  A team writing custom scripts to ETL data from various SQL-based sources into the predefined schema/warehouse

b)  Another set of ETL processes to pull relevant data from NoSQL databases, and yet another for files, shoehorning that data into a relational structure (a simplified sketch follows this list)

2.  Data Warehouse: Generally a SQL-based database, the warehouse requires schema definitions mapped out upfront that determine what gets stored where, plus a dedicated database team to manage it. Any new metric, or a change to an existing one, means changes to both the ETL layer and the warehousing tier.

3.  BI Tool: Once the data is in a structured form in the warehouse, a BI tool comes into play. With traditional BI tools, this means:
a)  Desktop licenses for use by report creators
b)  Server licenses with IT involvement for publishing and sharing the reports
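To illustrate steps 1 and 2, here is a simplified sketch of a traditional ETL job, with entirely hypothetical source, table, and field names: documents are extracted from MongoDB, flattened into a predefined relational schema, and loaded into a SQL-based warehouse. Real pipelines add scheduling, error handling, and incremental loads on top of this.

```python
# Hypothetical traditional ETL: MongoDB documents -> fixed SQL schema.
import mysql.connector
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
warehouse = mysql.connector.connect(
    host="warehouse-db", user="etl", password="secret", database="dw"
)
cursor = warehouse.cursor()

# The schema must be designed upfront; any new metric means changing
# this DDL as well as the transform logic below.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS fact_messages (
        customer_id INT,
        sent_date   DATE,
        volume      INT
    )
""")

# Extract raw documents, then transform: flatten nested JSON into the
# fixed columns, discarding any fields the schema has no place for.
for doc in mongo.app.messages.find():
    row = (
        doc.get("customer_id"),
        doc["sent_at"].date() if doc.get("sent_at") else None,
        doc.get("volume", 1),
    )
    # Load the flattened row into the warehouse table.
    cursor.execute(
        "INSERT INTO fact_messages (customer_id, sent_date, volume)"
        " VALUES (%s, %s, %s)",
        row,
    )

warehouse.commit()
```

Note how the fixed fact_messages schema bakes the metric into the DDL: adding or changing a metric means touching both the CREATE TABLE and the transform loop, which is exactly the brittleness described below.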

Each stage of this process involves specific product suites along with specialized engineering and IT resources: for example, a custom SQL database or Amazon Redshift for the warehousing layer, Pentaho or Alteryx for ETL and data processing, and Tableau or Qlik for data visualization.

The problem with this approach is that the path from data to insights is far too cumbersome, brittle, and expensive, demanding significant time, resources, and skills before any tangible results are in sight. Because the database schema has to be designed upfront, every new metric or modification to an existing one requires BI and engineering touchpoints, increasing the time and effort needed to obtain insights and impairing the organization’s ability to react to data changes quickly.

Cloud9 Charts takes a different approach. Built specifically for modern unstructured and structured data, it dramatically simplifies every step on the path from data to insights.
Let’s contrast the above with the example below, captured as a screenshot GIF:


The example sets up connectivity to MongoDB and MySQL, auto-generates a query against each, and performs an inner join on the results: it determines sent-message volume by customer from MongoDB and combines it with each customer’s address from a MySQL database. Finally, it creates a shareable, embeddable dashboard from the data.
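For readers curious about the shape of the underlying data flow, here is a rough hand-rolled equivalent of that joined query in Python; the collection, table, and column names are made up, and in Cloud9 Charts this is all done through the UI rather than in code.

```python
# Hypothetical cross-database inner join: message volume (MongoDB)
# combined with customer addresses (MySQL).
import mysql.connector
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
mysql_conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="crm"
)

# MongoDB side: aggregate sent-message volume per customer.
volumes = {
    r["_id"]: r["count"]
    for r in mongo.app.messages.aggregate([
        {"$group": {"_id": "$customer_id", "count": {"$sum": 1}}}
    ])
}

# MySQL side: customer addresses.
cursor = mysql_conn.cursor()
cursor.execute("SELECT id, name, address FROM customers")

# Inner join: keep only customers that appear on both sides.
for cust_id, name, address in cursor.fetchall():
    if cust_id in volumes:
        print(cust_id, name, address, volumes[cust_id])
```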

Key things to note:

-  We’ve queried data in MongoDB and combined it with data from MySQL in just a few clicks, with no cumbersome data cleanup, complex ETL processing, or data warehouse setup and management.

-  The query results are seamlessly stored in a flexible, scalable data lake/warehouse, without the need to pre-define a schema or write complex ETL jobs. The warehouse provides fast access to the results, stores multi-structured data, offers schema-on-read and SQL-like query capabilities (illustrated after this list), and incrementally tracks the results. Reporting and dashboards can be driven off this warehouse, shielding your production systems from reporting workloads.

-  While the warehousing approach provides significant benefits, there may be cases where you need to run queries and visualizations directly against the source database; this is also supported, in a single click.

-  Dashboard creation, sharing, and more are enabled right inside the portal, eliminating the need for multiple product suites: one for data discovery and report creation, and yet another for sharing.

-  A range of sophisticated ways to query, aggregate, and pipeline the data is built in: for example, running a query against thousands of databases and merging the results (sketched below), querying logs, JSON, and XML, plugging in custom machine learning algorithms, and more.
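As a generic illustration of the schema-on-read idea mentioned above (this is not Cloud9 Charts’ internals, just the concept in miniature using pandas), raw JSON results are stored as-is and a tabular view is projected out only at read time:

```python
# Schema-on-read in miniature: no upfront schema; a tabular structure
# is derived only when the data is read.
import pandas as pd

# Stored as-is: documents with varying, nested shapes.
raw_results = [
    {"customer_id": 1, "volume": 120, "address": {"city": "Boston"}},
    {"customer_id": 2, "volume": 87},  # no address captured for this one
]

# The schema is derived when the data is read, not when it is written:
# a missing field becomes a null instead of breaking a load job.
df = pd.json_normalize(raw_results)
print(df[["customer_id", "volume", "address.city"]])
```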
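And here is a sketch of the fan-out pattern from the last bullet, assuming a fleet of per-tenant MySQL databases with identical schemas; the hosts, credentials, and query are made up:

```python
# Hypothetical fan-out: run the same query against many databases in
# parallel, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

import mysql.connector

TENANT_HOSTS = ["tenant1-db", "tenant2-db", "tenant3-db"]  # ...up to thousands

def query_tenant(host):
    conn = mysql.connector.connect(
        host=host, user="report", password="secret", database="app"
    )
    cur = conn.cursor()
    cur.execute("SELECT customer_id, COUNT(*) FROM messages GROUP BY customer_id")
    rows = cur.fetchall()
    conn.close()
    return rows

# Fan out in parallel, then merge per-customer counts across tenants.
merged = {}
with ThreadPoolExecutor(max_workers=16) as pool:
    for rows in pool.map(query_tenant, TENANT_HOSTS):
        for customer_id, count in rows:
            merged[customer_id] = merged.get(customer_id, 0) + count
```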

By quickly connecting the dots from the raw sources to insights in an integrated platform, we enable teams not only to make data-driven decisions across all their data sources, but also to react to data changes quickly.