In Part 1 of this post, I briefly described data virtualization. I also pointed out that each vendor’s solution is their secret sauce: their approach to data integrity, performance, security, connectivity, and accessibility.
In this second part, I share advice from several corporate customers that have first-hand experience with selecting, purchasing, implementing, managing, and being accountable for corporate-wide data virtualization projects.
Hopefully, you will benefit from the lessons they learned.
Note: The advice below came from interviews. The respondents referred to three specific vendor products, which of course will not be named here. Each is generically referred to as “the solution selected”.
Data Virtualization or Data Warehouse?
Data warehouses are not replaced by data virtualization. They can complement each other as needed, especially if you need time-series analysis that requires a persistent view of historical data for comparison. Data virtualization should be considered as one of many strategies that can stand alone or be part of a hybrid. It depends on what you need to do with the data. The answer is not always data virtualization.
Scale Up to Handle the Crazy Hard Stuff
The data virtualization solution must handle the needs of a large organization that requires a tremendous amount of data transformations across multiple domains and businesses. It is the "crazy hard stuff" that needs to be evaluated. It was through those evaluations that we were able to prove that the solution selected was better scaled to meet such challenges.
Extensive Integration Capabilities
At a base level, it is important to be able to easily discover and access traditional data sources like SQL Server, Sybase, and Oracle. However, it is also important to integrate with external sources on the web and with less traditional sources found at vendor or customer sites (proprietary database implementations). The solution selected has a very extensive integration library with broad and easy access methods to that library. This was key to the decision-making process because there was a need to access data from a large variety of implementations.
Optimize Queries Close to the Source
When designing a data virtualization solution, do not pull loads of data and then run all of the queries inside the data virtualization layer. It is better to "push down" partial queries (like counting up all customers in a particular set of zip codes) to the source system where the needed detail data resides. The solution selected is very good at optimizing the queries closer to the source so that only the data that is absolutely needed is used.
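As a rough illustration of the push-down idea, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for a source system; the `customers` table, its columns, and the function names are all hypothetical and do not come from any vendor product. The first function pulls every row into the virtualization layer and filters there. The second sends the count to the source, so only a single row comes back.

```python
import sqlite3


def make_source():
    """Build a tiny in-memory database that stands in for a source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, zip TEXT)")
    rows = [(1, "10001"), (2, "10001"), (3, "94105"), (4, "60601"), (5, "94105")]
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


def count_without_pushdown(conn, zips):
    """Pull every customer row, then filter and count in the calling layer.

    Returns (count, number_of_rows_transferred).
    """
    all_rows = conn.execute("SELECT id, zip FROM customers").fetchall()
    wanted = set(zips)
    count = sum(1 for _id, zip_code in all_rows if zip_code in wanted)
    return count, len(all_rows)


def count_with_pushdown(conn, zips):
    """Let the source system filter and count; only the result is transferred.

    Returns (count, number_of_rows_transferred).
    """
    placeholders = ", ".join("?" for _ in zips)
    sql = f"SELECT COUNT(*) FROM customers WHERE zip IN ({placeholders})"
    rows = conn.execute(sql, list(zips)).fetchall()
    return rows[0][0], len(rows)


if __name__ == "__main__":
    source = make_source()
    print(count_without_pushdown(source, ["10001", "94105"]))
    print(count_with_pushdown(source, ["10001", "94105"]))
```

Both functions return the same count, but the second moves one row instead of the whole table. With millions of customers at the source, that difference is what the customers quoted above were evaluating.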
Involve Both IT and the Business in the Evaluation
A variety of users need to be involved in the evaluation process, including IT, security, business, and data analysts. The solution selected was able to provide the tools necessary to empower non-technology users in data discovery and modeling, thus moving more responsibility to the business side.
The Heavy Lifting: Many Models Over Many Views
The ability to easily write business rules and provide different access methods to the virtualized data model in order to handle complex situations is mandatory. The solution selected was able to easily load and apply logical data models across multiple views of one or more data stores and to tweak those models until the right balance was met. This is where the heavy lifting is. The ergonomics of that work, and how well the tools sit on structured and unstructured data, are critical.
Conclusion
Data virtualization will not be the answer to every data integration challenge. It depends on many factors, most importantly how the business wants to leverage the data it has now and integrate new sources of data in the future.
As advised by customers that have successfully gone through this process, the evaluation process needs to have a cross-functional perspective that involves both the business and IT.