So, you have just started your first project on Hadoop. After hearing about big data and Hadoop for months, you have identified your first use case and implemented it on Hadoop. Data has been extracted into a small cluster, and you are now ready to ‘leverage data.’ This is when you will face the question of selecting your Big Data visualization tool.
In the past, you might have faced the same question when you started with your BI initiative. However, the dynamics of selecting a data visualization tool for Hadoop differ. Yes, learnings from selecting a BI tool can be reused for selecting a visualization tool for Hadoop, but there is so much extra to consider.
After having helped customers implement visualization tools for various types of structured, unstructured, and streaming data on Hadoop, here are a few criteria that I can summarize:
- Budget Available
Hadoop data visualization tools come in four categories. First, there are enterprise BI tools like SAS, Cognos, Microstrategy, and QlikView, which have good Hadoop compatibility. Second, there are Hadoop-specific visualization tools like Hunk, Datameer, and Platfora, which are meant specifically for the visualization of Hadoop data. Third, there are open-source tools like Pentaho, BIRT, and Jaspersoft that were early adopters of Hadoop and have probably invested more in Hadoop compatibility than some of the bigger vendors. Finally, there are charting libraries like RShiny, D3.js, and Highcharts that are mostly open source or low cost and offer good visualization but require scripting and coding. As you move from the first category to the fourth, software license costs, ease of development, and self-service capabilities all tend to decrease. There are some exceptions to this general trend, though.
- Your existing BI tool
Your company is probably already using one BI tool or another, such as SAS, Microstrategy, IBM Cognos, or OBIEE. Most of these vendors have invested heavily in making their tools compatible with the Hadoop ecosystem. They offer connectors for Hadoop and NoSQL databases as well as graphical tools. It is also easier for end users to keep working with something they already know. Consider using your existing BI tool for Hadoop data visualization unless it has obvious drawbacks.
- Hadoop distribution used
If you are using a Hadoop distribution from, say, Cloudera or Hortonworks, you can safely select tools certified by that distributor. For example, Tableau, Microstrategy, Pentaho, and QlikView are all certified by Cloudera and have proven connectors to its distribution of Hadoop. Similarly, most of these tools are partners of Hortonworks. If your Big Data platform is IBM BigInsights, then going for Cognos makes sense, since both are IBM products and compatibility will not be an issue. It is always advisable to check whether the tool you select for visualization is certified by your Hadoop distribution.
- Nature of data
If the data you want to analyze is tabular or columnar, then most tools are capable of visualizing it. However, if the data is log data, special-purpose charting libraries like ‘timeslot’ may be a good option. Similarly, tools like Zoomdata provide better visualization capabilities for social media data.
- End user profile
Who are your end users? Are they data scientists? Then a tool with advanced visualization patterns will be required. If the end users are operational business users (such as sales managers and finance managers), then speed of delivery and cost of the tool (since the number of users may be very high) are more important than advanced visualization.
- Programming skills available
Going for scripting-based tools makes sense if you have good Java and JavaScript skills in-house. Also, if you are an R shop with good R programming capabilities, RShiny can be a good alternative. On the other hand, standard BI tools such as Microstrategy and Pentaho allow writing SQL on top of Hadoop data, while tools like Datameer are schema-free, drag-and-drop tools. In short, each tool comes with its own set of programming skill requirements, and you need to make sure these requirements match the programming skills available in-house.
- Operating system
The operating system is a basic checkbox when selecting visualization tools. We come across customers who use Linux platforms only, and Windows-based tools like QlikView, Tableau, and Microsoft BI are not an option in that case. Also, if you are planning an implementation on the cloud, make sure your cloud provider can provide the OS required by the visualization tool.
- Visualization features required
Traditional BI tools that added Hadoop capabilities are more mature than new entrants in providing commonly required visualization patterns. For example, multiple Y axes, support for HTML5 and animation, and user-friendly drill-down are some features that are very mature in traditional BI tools but still evolving in new entrants, open-source BI tools, and some charting libraries. It is advisable to compare your visualization needs to the capabilities offered by the tools.
- Data Volume
Data volume and the streaming nature of the data are important considerations, especially if you are thinking of a visualization tool with an in-memory architecture. If your Hadoop data store holds terabytes of data, new data is being added in real time, and you plan to use an in-memory visualization tool, you need a mechanism that continuously reduces the volume and feeds data from Hadoop to the visualization tool. This is possible but not very simple. Be aware of the impact of real-time, high-volume data on an in-memory architecture.
- Industry experience
It is always advisable to depend on dominant players in your industry vertical. SAS, for example, has been used by banks in analyzing big data for customer intelligence and risk management. In cases like this, the availability of underlying algorithms and visualization patterns makes the big data project implementation much easier.
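To make the data volume criterion above more concrete, here is a minimal sketch of one possible volume-reduction mechanism: rolling raw events up into fixed time buckets and keeping only the most recent buckets, so that an in-memory tool receives a small, bounded table instead of every raw record. The class name `RollingAggregator`, its parameters, and the `(timestamp, value)` event shape are assumptions made for illustration; they do not come from any particular visualization product or Hadoop API.

```python
class RollingAggregator:
    """Hypothetical helper: rolls raw events up into time buckets and
    retains only the newest `max_buckets` buckets."""

    def __init__(self, bucket_seconds=60, max_buckets=1440):
        self.bucket_seconds = bucket_seconds
        self.max_buckets = max_buckets
        self.buckets = {}  # bucket start (epoch seconds) -> (count, total)

    def add(self, timestamp, value):
        # Align the event to the start of its bucket.
        start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        count, total = self.buckets.get(start, (0, 0.0))
        self.buckets[start] = (count + 1, total + value)
        # Evict the oldest buckets so memory use stays bounded.
        while len(self.buckets) > self.max_buckets:
            del self.buckets[min(self.buckets)]

    def rows(self):
        # One row per bucket: (bucket start, event count, average value).
        return [
            (start, count, total / count)
            for start, (count, total) in sorted(self.buckets.items())
        ]


agg = RollingAggregator(bucket_seconds=60, max_buckets=2)
for ts, value in [(0, 10), (30, 20), (60, 5), (125, 7)]:
    agg.add(ts, value)
reduced = agg.rows()
print(reduced)
```

In a real deployment the roll-up would more likely run inside the cluster (for example as a scheduled aggregation job) and the visualization tool would load only the reduced table; the sketch only shows the shape of the idea.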
All of these factors need to be considered carefully. Some of them, like the operating system, seem to be no-brainers, but I have seen companies overlook them and select visualization tools that later had to be changed. After shortlisting visualization alternatives with these factors in mind, you are ready for the next step in the journey of building a visualization platform for big data: initiating a Proof of Concept. More about learnings from a data visualization Proof of Concept in the next blog.
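One simple way to apply the criteria above when shortlisting is a weighted scoring matrix. The sketch below is only an illustration: the function `rank_tools`, the criterion names, the weights, and the placeholder tools "Tool A" to "Tool C" with their scores are all hypothetical and are not ratings of any real product. A criterion listed in `must_have` acts as a hard filter, so a tool scoring 0 on it (for example, one that does not run on your operating system) is dropped regardless of its other scores.

```python
def rank_tools(weights, scores, must_have=()):
    """Rank tools by weighted average score (hypothetical helper).

    weights:   criterion -> importance weight
    scores:    tool -> {criterion -> score on a 0-5 scale}
    must_have: criteria where a score of 0 disqualifies the tool
    """
    total_weight = sum(weights.values())
    ranked = []
    for tool, tool_scores in scores.items():
        # Hard filter: skip tools that fail a mandatory criterion.
        if any(tool_scores.get(c, 0) == 0 for c in must_have):
            continue
        weighted = sum(w * tool_scores.get(c, 0) for c, w in weights.items())
        ranked.append((tool, round(weighted / total_weight, 2)))
    # Highest weighted score first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Illustrative numbers only.
weights = {"budget": 3, "operating_system": 2, "skills": 1}
scores = {
    "Tool A": {"budget": 4, "operating_system": 5, "skills": 2},
    "Tool B": {"budget": 5, "operating_system": 0, "skills": 5},
    "Tool C": {"budget": 2, "operating_system": 3, "skills": 5},
}
shortlist = rank_tools(weights, scores, must_have=("operating_system",))
print(shortlist)
```

The weights should reflect your own priorities; the value of the exercise is less in the final number than in forcing every criterion, including the apparent no-brainers, to be scored explicitly.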