Thursday, February 16, 2017

7 Requirements to Deliver Business-Driven Analytics in the Modern Data Age

The journey to modern data has been fast and furious. Within just a few decades, we have gone from retrospective analytics on operational data located within your four walls to prescriptive analytics on real-time, multi-sourced data, and everything in between.

Evolution of Business Analytics
While data, data storage, and data processing technologies have advanced, I would argue that most analytics platforms are still playing catch-up, specifically when it comes to analytics on unstructured data sources.

Many organizations have figured out that moving, flattening, and aggregating unstructured data so it can be loaded into relational tables is not a good long-term solution. It adds unnecessary complexity to already complex architectures, forces the business to decide ahead of time what data they want to see, and degrades the fidelity of the data.

To pile on, business teams are greedy. They want the same self-service they get with SQL-friendly desktop tools, but they want to avoid heavy processes like ETL because it adds weeks to their analytics projects.

So how do you architect a modern analytics platform that supports polyglot persistence strategies and business-driven analytics? It's a hard problem to solve, but done right, it can transform the business.

Having been involved in hundreds of such projects, we've come up with a list of 7 requirements to deliver business-driven analytics on modern data architectures.


#1 - Stop Moving Your Data Around
Modern analytics platform architecture
In the age of polyglot persistence, native integration with NoSQL data sources eliminates the pre-defined data constraints and performance issues that are a by-product of moving, flattening, and aggregating unstructured data so it can be loaded into relational table structures.

#2 - Data Discovery Across Sources
Ideally, the complexity of the underlying data architecture is hidden from the data engineer: they simply configure new data sources and start exploring, joining multi-sourced and differently structured data together into blended datasets and visualizing the results immediately.
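As a rough, hypothetical illustration (the source names and columns below are made up, and in practice the blending is done through the platform rather than hand-written SQL), a blended dataset conceptually behaves like a join across differently sourced tables:

SELECT o.customer_id, o.order_total, p.plan_tier
FROM postgres_orders o                 -- dataset from a relational source
JOIN mongo_customer_profiles p         -- dataset derived from a NoSQL collection
  ON o.customer_id = p.customer_id;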

#3 - Shift to an Agile Analytics Approach
First, let data engineers take over the creation and management of datasets. Second, shift development to an agile, iterative approach.

#4 - Enable Self-Service for All Data
In modern architectures, all data is equal, and advanced analytics platforms should enable the same level of business self-service on NoSQL, SQL, RDBMS, or file-based data sources.

#5 - Embed Analytics in Data Applications
Embedding analytics in data applications not only makes it easier for the business to integrate Big Data into their decision-making processes, but it also makes their decisions data driven.

#6 - Drive Action Not Just Insights
Many business leaders believe advanced analytics is how they will deliver business transformation and growth, so even if you are not there yet, plan for it.

#7 - Ensure Enterprise Readiness
The business is relying on this data to make decisions, so processes need to be layered into the platform to ensure it is reliable, scalable, secure, and fault-tolerant.

We generally see high levels of success and higher user adoption rates when these requirements are delivered.  Our customers also report over a 10x reduction in analytics project implementation time and rework.

For more detail on these requirements and how we narrowed the list down to these seven, download our free whitepaper, Business-Driven Analytics in the Modern Data Age, or click the button below.



Wednesday, December 28, 2016

Data and Analytics Trends for 2017



No, not in the sense of fleeting fads and fast concepts.

What we’re talking about are areas in tech that stood out this year.
Polyglot Persistence, Analytics 3.0, AI and machine learning all left a lasting impression.

Here’s a closer look at these trends and why they should be on your radar.

The rise of Polyglot Persistence
The concept of one-size-fits-all should be left to muumuus and ball caps, not data sources. Variety, as they say, is the spice of life. Not surprisingly, we're hard-pressed to assume companies can manage all aspects of all data through a single channel. That's like saying you could store all your kitchen gadgets in one slim drawer: it only works if the design fits the concept. According to Technologies & Heller (2016), "Each database is built for its own unique type of workload. Its authors have made intentional trade-offs to make their database good at some things while sacrificing flexibility or performance in other categories." So do as the Romans do... wait, what do the Romans do? I kid, but seriously, let's talk about joining forces, er, sources. Our friends at dummies.com tell us polyglot persistence "...is used when it is necessary to solve a complex problem by breaking that problem into segments and applying different database models" (Hurwitz, Nugent, Halper, & Kaufman, n.d.). Once we have our sources, anything from RDBMSs to NoSQL to REST APIs, don't we have to ETL it all over the place to get insights and access to that data? Enter Analytics 3.0.

Analytics 3.0
Reporting, analytics, and self-service analytics have been around for a while. What has changed are the methods companies and people use to reach their final destinations. Let's take a quick trip down memory lane to see where we've been, as well as the illustrious yellow-brick road we are about to journey down…

BI 1.0 (A blast from the past)
Oy. This is painful. Excruciating, debilitating, and time-consuming. What is it that has so many organizations screaming, "Uncle!"? Complex back-end prep, that's what. Once deployed, engineers and architects spend days, weeks, and sometimes months painstakingly processing raw data. Enter ETL. Extremely Tough Life? Well, yes and no. It's a tough life for those charged with Extracting, Transforming, and Loading heaps of information. I don't wish this on my worst enemy... well, that's not entirely true. I have seen this soul-crushing process, and it is definitely working harder, not smarter. Now, let's be fair: at one point this business intelligence process was innovative, nay brilliant, nay revolutionary! But so was the horse and cart. For the first time, data could be recorded, aggregated, and analyzed (Davenport, 2013). Luckily, there was a proverbial throwing up of the hands, which led us to a new frontier...

Big Data 2.0 (Volume, Velocity & Variety)
Imagine you’re in the middle of a labyrinthian building. People talk in hushed tones while the rhythmic keystrokes of well-worn computers drum on. You ask yourself, “What are they up to? And what is NoSQL?” You, my friend, have just landed at a major crossroads: you can either continue with your legacy BI tool, or you can embrace this new wave of unstructured data. Unstructured? I like structure. I live for structure. My boss will can me if I don’t structure! Calm down, tiger. Data storage needs depend on many variables, notably the 3Vs: volume, variety, and velocity. No doubt all aspects of the 3Vs and Big Data will cross-pollinate; the question is to what degree. Volumetrically, data storage requires scalability, and relational stores just can’t cope (Media, 2012). The velocity of incoming data (as well as its sheer variety) is better served by NoSQL databases. Relational stores need a large amount of work before insights can be of any real use (Media, 2012). NoSQL data sources are agile and capture data that is in constant flux. NoSQL options complement traditional SQL-based offerings with efficient options for unique data needs.

Current Generation (Dude, where’s my data?)
Data is everywhere. From the thermostat controlled via smartphone to the video camera that captures traffic patterns, data is all around.  The current version of analytics must address both data harnessing across multi-structured data sources  as well as actionable insights.

Analytics 3.0 moves further down the road from data that simply informs to data that predicts and prescribes actions. Leading-edge companies are already using machine learning to understand how their customers actually use their products, services, etc. Then they use AI to personalize experiences so a person will actually do something, buy something, sign up for something... you get my point. This will only accelerate in 2017. According to Ferreira (2015), our new era will be “the driving fact behind not only operational and strategic decision making, but also the creation of new services and products for companies.”

What is AI & Machine Learning and Why You Should Care
As humans and computers intersect, the need for advanced technology increases. While often used in the same sentence, these two concepts are not synonymous. Nor are they mutually exclusive. Heavy hitters like Google, Facebook, and Amazon are making AI and machine learning more widespread (Bell, 2016). In essence, machine learning uses previous experiences to influence future decisions. Artificial intelligence is the process of making machines smarter, or more intelligent. For example, when we make a typo, machines suggest a replacement and remember our response for the future. Alternatively, look at computer science, AI, and machine learning as a series of umbrellas. According to Intel’s Nidhi Chappell, “The way I think of it is: AI is the science and machine learning is the algorithms that make the machines smarter” (Bell, 2016). As reported by Faggella (2016), “only one percent of all medium-to-large companies across all industries are adopting AI”. AI adopters typically see increased revenue via faster image interpretation, documentation, and data entry, and productivity gains of 25% (Wilson, Sachdev, & Alter, 2016). And that’s just the tip of the iceberg! Imagine all the time we free up when we use machines to work for us! My cereal-filled mornings with The Jetsons are looking less nebulous and more reachable every day.

Did we miss your top trend for 2017? Leave your thoughts and comments below!

Cloud9 Charts is an enterprise-ready business intelligence platform for modern data stacks. We are revolutionizing the path from data to insights by keeping engineers close to the data, no matter the source or type, and eliminating the need for traditional data warehouses and ETL. Cloud9 Charts is the de facto business intelligence platform for the modern enterprise, large and small.
References

Bell, L. (2016, December 1). Machine learning versus AI: What’s the difference? Retrieved December 21, 2016, from http://www.wired.co.uk/article/machine-learning-ai-explained

Davenport, T. H. (2013, December 1). Analytics 3.0. Retrieved December 20, 2016, from Analytics, https://hbr.org/2013/12/analytics-30

Faggella, D. (2016, September 30). Valuing the artificial intelligence market, graphs and predictions for 2016 and beyond -. Retrieved December 21, 2016, from Tech Emergence, http://techemergence.com/valuing-the-artificial-intelligence-market-2016-and-beyond/

Ferreira, T. (2015, December 09). What is Analytics 3.0? Retrieved December 20, 2016, from https://www.quora.com/What-is-Analytics-3-0

Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (n.d.). Big data and polyglot persistence. Retrieved December 19, 2016, from Engineering, http://www.dummies.com/programming/big-data/engineering/big-data-and-polyglot-persistence/

Media, Or. (2012, January 19). Volume, velocity, variety: What you need to know about big data. Forbes. Retrieved from http://www.forbes.com/sites/oreillymedia/2012/01/19/volume-velocity-variety-what-you-need-to-know-about-big-data/2/#7890f71f7c1d

Technologies, R., & Heller, B. (2016, March 25). Analytics 101: Choosing the right database. Retrieved December 19, 2016, from https://reflect.io/blog/analytics-101-choosing-the-right-database/

Wilson, H. J., Sachdev, S., & Alter, A. (2016, May 3). How companies are using machine learning to get faster and more efficient. Retrieved December 21, 2016, from Harvard Business Review, https://hbr.org/2016/05/how-companies-are-using-machine-learning-to-get-faster-and-more-efficient

Wednesday, December 14, 2016

Product Update - Dataset Lineage and Data Anomaly Alerting and more...

We are excited to announce a number of new features that we've added over the past quarter. If you would like to check them out using your data, please click here to log in.


VISUALIZE DATASET RELATIONSHIPS

It is now easy to see the original query, the dataset(s), and all associated visualizations. This allows you to easily understand how visualizations are built, which datasets are being used, and where the data is coming from.

Additionally, you can make modifications to any part of the flow directly from here, making it simpler to add new widgets or build derived datasets.
Learn More





DYNAMIC DATA ALERTS

Our new alerting capabilities help you rapidly detect anomalies. You can create your own custom alerting logic on any of your data, and a notification is sent immediately when the condition is met.
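As a rough sketch (the alert itself is configured in the product; the dataset and threshold below are hypothetical), the logic behind an alert often boils down to a query that is evaluated on a schedule and fires a notification whenever it returns rows:

-- hypothetical alert condition: flag hours with an unusually high error count
SELECT metric_hour, error_count
FROM hourly_error_counts
WHERE error_count > 500
ORDER BY metric_hour DESC;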





OTHER COOL STUFF



NEW SSH TUNNEL CONNECTIVITY OPTION

SSH tunnel is one of the options available to connect to your database inside a private network, complementing other modes of connectivity.


NEW VISUALIZATIONS

Additions include Sankey diagrams, chord diagrams, custom widgets, web pages, and others.

You can now also customize the colors used in any of the visualizations to match internal branding or to exercise your creative side.

With over 30 visualizations to choose from, finding the right visualization to tell your story is as easy as 1-2-3.


ENHANCED EMAIL REPORTS

Export dashboard reports as PDF or CSV files with 1-click for elegant distribution via email.


MORE POWERFUL QUERY GENERATION

For MarkLogic, we've added code hints that help you build XQuery dynamically.

If you are using ElasticSearch, you should definitely try our new ElasticSearch query generator.

OTHER

Triggered queries enable derived queries to update automatically when the source data has been updated.

Integrations galore: With over 30 data sources, we now support all major NoSQL databases natively, along with SQL databases, REST APIs, and files.

That's it for now. If you have any questions, need more information, or would like to schedule a demo, please contact us.



Try Cloud9 Charts for free:  

Monday, September 26, 2016

Analyzing 1.2 Billion NYC Taxi Rides

Recently, we held a webinar with our friends at Ocean9 focused on on-demand, self-serve analytics on large-scale datasets. We needed a large dataset for our demo, for which we turned to New York City taxi data, nicely put together on github by toddwschneider. Plus, it was an opportune time for me, as I’m headed to New York City this week with lots of cab rides in between meetings.
The dataset consists of trip details for 1.237 billion rides (237 GB of raw CSV data on disk).

STACK

We used the following:
SAP HANA: In-memory, relational database. Given its in-memory architecture, it provides fast lookups on large-scale datasets (though it is by no means cheap).
Ocean9.io: Database-as-a-Service for SAP HANA that enables one click deployment in the cloud.
Cloud9 Charts: Our analytics platform.
This setup enables us to do the analysis entirely in the cloud, with nothing to install (and to tear it all down post-analysis).
DATA IMPORT
Frank Stienhans, CTO at Ocean9, has put together a nice blog post on the data import process into HANA. It’s definitely worth a read if you are thinking of using HANA in the cloud.
The import process took approximately 60 minutes on an R3.8xlarge instance on AWS using Ocean9.

ANALYSIS

After the data was loaded, we connected to it directly using Cloud9 Charts, which features a full HANA integration, including point-and-click, HANA-specific SQL query generation.
Example query:
SELECT avg(total_amt), avg(tip_amt) FROM nyc.yellow
Took 1 sec to return. Not too shabby for a query touching 1.2 billion records.
So let’s turn to some analysis. You can find the full interactive dashboard here: https://cloud9charts.com/d/1.1-Billion-NYC-Taxi-Dataset-Analysis

Trip Geo Clusters

Pickups are heavily concentrated around Manhattan (midtown in particular), as well as JFK and La Guardia.
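For the curious, the clustering above can be produced with a query along these lines, bucketing pickups onto a coarse latitude/longitude grid (column names follow the public TLC schema and may differ slightly in the loaded table):

SELECT ROUND(pickup_latitude, 3)  AS lat_bucket,
       ROUND(pickup_longitude, 3) AS lon_bucket,
       COUNT(*)                   AS pickups
FROM nyc.yellow
GROUP BY ROUND(pickup_latitude, 3), ROUND(pickup_longitude, 3)
ORDER BY pickups DESC;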

Day/Hour Trends

Heatmap of trips by day of week and hour of day:
A few observations:
Peak times are 6–10PM and 8–10AM on weekdays, but notice the dip between 4–5PM. This was a bit puzzling to me at first, as you’d expect more trips during rush hour. It turns out that around 5PM is when the shift change occurs and cab drivers are heading back to the garage. It might be a bit harder to hail a yellow cab during that time.
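A hedged sketch of the aggregation behind the heatmap (assuming a pickup timestamp column named pickup_datetime and HANA's DAYNAME and HOUR functions):

SELECT DAYNAME(pickup_datetime) AS day_of_week,
       HOUR(pickup_datetime)    AS hour_of_day,
       COUNT(*)                 AS trips
FROM nyc.yellow
GROUP BY DAYNAME(pickup_datetime), HOUR(pickup_datetime);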

Monthly Trends & Predictions

The chart below plots the following:
  • Total Monthly Rides since 2009 (Blue)
  • 3 Month Moving Average (Green)
  • Predicted Values (Yellow). This uses Cloud9 Charts’s out-of-the-box prediction models, which automatically backtest the data to select the best model.
According to the data, trips have gone from a peak of almost 16 million trips in May 2012 to a low of 11m in Feb 2016, which also coincides with the rise of Uber, Lyft and Green Cabs in NYC.
However, the prediction model values (in yellow) indicate that the downtrend appears to have stabilized, with a slight uptick expected over the next year.
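The underlying monthly series is a straightforward aggregation along these lines (the moving average and predictions are layered on by Cloud9 Charts rather than computed in SQL; the timestamp column name is assumed):

SELECT YEAR(pickup_datetime)  AS ride_year,
       MONTH(pickup_datetime) AS ride_month,
       COUNT(*)               AS total_rides
FROM nyc.yellow
GROUP BY YEAR(pickup_datetime), MONTH(pickup_datetime)
ORDER BY ride_year, ride_month;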

SUMMARY

This particular taxi analysis just scratches the surface: ride data is not just about going from point A to point B; in some ways it provides a pulse of the city itself. Time permitting, I’ll put together a more detailed analysis of the data in future posts.
The convergence of the cloud, optionality in database types for the right workload, infrastructure provisioning via Database-as-a-Service, and Analytics-as-a-Service drastically accelerates the time to insights and makes self-service analytics a reality.

RESOURCES


Guest Post: NYC Cab Rides using SAP HANA and Ocean9


This is a guest blog post from Frank Stienhans, CTO and co-founder of Ocean9, which provides one-click SAP HANA as a service. This post discusses the setup of a HANA cluster for analysis of NYC Yellow Cab taxi rides from 2009-2016.

We recently held a BrightTALK webinar around self-service analytics on SAP HANA with our friends from Cloud9 Charts. We selected a public dataset of NYC Yellow Taxi rides from 2009 to 2016, nicely stored in Amazon S3 like so many public datasets.

Properties: 217 GB of raw CSV data with 1.231 billion rows.

A word on data oceans

I think it is no longer a discussion that the data persistence layer for data oceans is cloud object storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage.
Standard price points are 2.4 to 3.0 cents / GB / month across the 3 providers.
Also each of the 3 providers have the equivalent of infrequently accessed object storage, which comes in at 1.0 to 1.25 cents / GB / month.
AWS and Google both promise a durability of 99.999999999 %. Azure has no published durability statement at this point.
Don't think though that object storage is a commodity. Between the 3 providers there are massive differences in the domains of performance, security and best practices.

A word on data lakes

There may also be no better storage for your corporate datasets, including your most confidential ones. That is because of:
  • Strong security, data protection, and compliance capabilities
  • Durability and availability by design
  • Unlimited, immediate scaling
And, of course, the price points above.
Security options include locking down access in a number of ways and specifying strong encryption methods. However, as mentioned earlier, the 3 providers are not equals on security and data protection.

SAP HANA and data lakes and oceans

In the following, I will describe a low-tech and a high-tech approach to loading datasets from the data ocean into SAP HANA.
Let me describe some general rules that you should consider.

General Rules

1) Co-Location

You will want to put your SAP HANA system next to the dataset. This will provide you with maximum performance for the data load and minimum cost for outbound dataset traffic to your HANA system.
In the NYC Yellow Taxi case, we can deduce from the Dataset URL that it is stored in Amazon S3 - us-east-1. So you will want to place your HANA System there.

2) Private Network Optimization

If your HANA system runs in a private subnet, then you should configure an AWS VPC Endpoint for S3 to get high-performance access at minimum cost. Otherwise, all S3 traffic will go through your NAT layer, which will certainly not bring you higher performance but will lead to higher cost.

3) Use an instance type with 10 Gbit Networking for High Performance Data Loading

We will see later what Amazon S3 can provide in terms of throughput. If you care about data load speed, then you should select one of:
  • R3.8xlarge (244 GB RAM)
  • M4.10xlarge (160 GB RAM)
  • C3 / C4.8xlarge (60 GB RAM)
Why am I not listing X1 with 2 TB RAM? We are still waiting for the High Performance Network Drivers for SUSE Linux to activate the 20 Gbit mode. Until then X1 will not reach the throughput performance of the instances above.

Manual Approach

After taking care of the above you can continue with

4) Copy the Dataset to the HANA Machine.

You should use the AWS Command Line Interface to achieve a decent storage throughput of around 200 MB/second. I would also suggest storing the dataset on a different storage device than the one underneath /hana/data and /hana/log. Otherwise, SAP HANA will compete for storage bandwidth during the data load.

5) Load Data

Now you can load the data into SAP HANA, either using HANA Studio or the HANA Command Line. This blog post describes nicely how to do this.
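As a minimal sketch of the SQL-based route (the file path, table name, and options here are illustrative; see the linked post for the full procedure), a HANA CSV import looks roughly like this:

IMPORT FROM CSV FILE '/import/yellow_tripdata_2015-01.csv'
INTO NYC.YELLOW_TAXI
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY ','
     SKIP FIRST 1 ROW
     THREADS 16
     BATCH 100000;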

Cloud Native Approach

Well, a cloud-native approach is to do the above in two clicks: one to provision the system and one to load the data.
Ocean9 provides this to you and more.
At Ocean9 we are constantly seeking to get the maximum out of the cloud. We have implemented a direct path for data loading from S3 into SAP HANA, with no persistence in between.
CPU stayed in the 80-90% range throughout. Disk I/O was not an issue at less than 170 MB/second.
The data load took exactly 60 minutes (and yes, there is further room for improvement).
On X1.32xlarge, data loading took 125 minutes (because of the driver situation described above).

HANA Performance

I just ran one statement to get an impression.
select sum(total_amt) from nyc.yellow_taxi
The SQL statement needs to touch all "rows" and can theoretically use all vCPUs in parallel.
The command completes in
  • r3.8xlarge :    1.0 second  (32 vCPUs)
  • r3.4xlarge :    1.8 seconds (16 vCPUs)
  • x1.32xlarge : 0.4 seconds (128 vCPUs)

Backup and Restore

With our Advanced Backup and Restore implementation for SAP HANA, it is always a good idea to perform a data backup to S3 after the data load.
For this dataset backup takes 10 minutes and restore takes 8 minutes.
Now we can create a brand new system including this dataset in 20 minutes, with the data already loaded in memory!
For the BrightTALK webinar, we will launch the system on Wednesday morning and terminate it before noon.
This is how the cloud should be used.

What's next ?

See it for yourself! Watch the BrightTALK webinar to see this in action, combined with an analytics service from Cloud9 Charts!
Try it yourself! Use my HANA SQL schema for the NYC Taxi dataset.