Preparing storage for big data and analytics
Andri Selamat explains how organisations have the tools to analyse their business operations as never before.
This level of analysis is made possible by advances in database, analytics and storage technology that have created a class of solutions under the banner of ‘Big Data’.
‘Big Data’ is a set of processes and technologies that give organisations the capability to analyse large unstructured data sets in near real time. This represents significant progress over data warehousing technology, which is designed to analyse only structured data with batch-processed responses.
All organisations operate in a world of big data
We live in a world where an estimated 5.2 quintillion bytes of data are created every day.
However, it is only in recent years that companies have had the technology to manage and analyse the ever-increasing amounts of data available to them. An example of big data’s impact on an organisation is described below:
Macy’s, the largest department store chain in the USA, can adjust prices across its entire range of products several times a day. This is possible because it implemented a big data solution that reviews 2 terabytes of data daily. The process used to take seven days; it now allows Macy’s to adjust its prices to those of local competitors within hours. At 2 terabytes a day, this price tracking system will require over 730 terabytes of data storage each year.
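The yearly figure follows directly from the daily one. The short sketch below reproduces the arithmetic; it assumes every day's data is retained for the full year with no compression or deduplication, and the function name `yearly_storage_tb` is purely illustrative.

```python
# Figures taken from the Macy's example: 2 terabytes reviewed each day.
DAILY_REVIEWED_TB = 2
DAYS_PER_YEAR = 365


def yearly_storage_tb(daily_tb, days=DAYS_PER_YEAR):
    """Return the storage needed if every day's data is kept for a year.

    Assumption: no compression, deduplication or expiry of old data.
    """
    return daily_tb * days


print(yearly_storage_tb(DAILY_REVIEWED_TB))  # 730
```

Any growth in the daily volume, or any extra copies kept for backup, pushes the requirement above this baseline.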
The ability to analyse terabyte and petabyte data sets has become critical in the ongoing search for competitive advantage. It is sobering to think that the exabyte (1,000 petabytes) will soon be a common measure of storage in enterprise computing.
This exponential growth in big data has a corresponding effect on storage planning.
Big data storage planning considerations – the three Vs
Gartner’s 3 Vs model of big data is widely used to describe the main features of big data applications. There are three core characteristics a big data storage solution needs to consider: volume, velocity and variety. Let’s consider each in more detail:
The volume of data is the first consideration: terabytes and petabytes of data need to be stored in a scalable solution that can accommodate planned data growth. As a result, storage is evolving from discrete devices and networks to hyper-converged systems where compute, storage and virtualisation co-exist on a single platform. Both Nutanix and Nimble are examples of hyper-converged solutions that enable you to better manage and scale both your data and compute requirements.
The network the data is transferred over must not become a bottleneck. The distributed nature of the Hadoop Distributed File System (HDFS) is a response to this challenge, where compute is moved to the data location rather than processing the data centrally.
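The idea of moving compute to the data can be illustrated with a minimal sketch. It is not Hadoop's actual scheduler; the node names, block identifiers and the `assign_tasks` helper are all hypothetical, chosen only to show a task being placed on a node that already holds a replica of the block it needs.

```python
# Hypothetical placement map: block id -> nodes that hold a replica of it.
BLOCK_LOCATIONS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-c"],
}


def assign_tasks(block_locations):
    """Assign each block's task to the least-loaded node holding a replica.

    Because a task only ever runs where its block already lives,
    no block has to cross the network before it is processed.
    """
    load = {}        # node -> number of tasks assigned so far
    assignment = {}  # block -> node chosen to process it
    for block, nodes in sorted(block_locations.items()):
        # Prefer the node with the fewest tasks; break ties by node name.
        chosen = min(nodes, key=lambda n: (load.get(n, 0), n))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


for block, node in assign_tasks(BLOCK_LOCATIONS).items():
    print(f"{block} is processed on {node}")
```

A centralised design would instead copy all three blocks to one processing server, which is exactly the network traffic the distributed approach avoids.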
The velocity of data refers to the speed at which data arrives and the speed at which it must be presented to users.
The delay arising from disk I/O on such large volumes of data has led to the evolution of RAM-based database solutions in recent years. For example, SAP HANA uses ‘data held in memory’ to process large data sets faster than on physical storage, while Nimble uses Cache Accelerated Sequential Layout (CASL) to optimise flash memory.
The security of data must not be ignored in the quest for speed; encryption, user access, privacy and data retention are all considerations. Remember, if only 3% of a petabyte of big data requires encryption, that equates to 30 terabytes. Planning data security before, not after, application implementation takes on a whole new meaning when big data’s involved.
Big data storage must be capable of holding and presenting a wide variety of data. For example, an enterprise might capture terabytes of video security information every day, collect point-of-sale information from across a national retail chain, interpret computer weather modelling for planning purposes, or analyse machine generated data from a factory floor, mining facility or transport system.
Further big data storage planning considerations
It is likely a big data application will be the single biggest volume of data to be stored and managed within an organisation. So, while its implementation has the potential to make a big impact on business users and customers, it will also impact IT management. Just some of the issues to be addressed in your storage planning process include:
- Governance and security of data need to be defined early in the planning.
- Physical storage planning in data centres must scale for future growth.
- Networking capacity must be capable of handling the data, and may even call for a new dedicated network.
- Backups and Disaster Recovery planning must also be considered; how can a big data system with over a petabyte of data be restored within a timeframe acceptable to the business?
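To put the restore question in perspective, a rough estimate helps. The sketch below assumes a sustained restore throughput of 10 gigabits per second, which is a hypothetical figure and not one from any particular product, and it ignores protocol overhead, verification and rebuild time. The `restore_hours` function is illustrative only.

```python
PETABYTE_BYTES = 10**15  # decimal petabyte, as used by storage vendors


def restore_hours(data_bytes, throughput_gbit_per_s):
    """Estimate the hours needed to restore data at a sustained throughput.

    Assumption: the link runs at full speed for the whole restore,
    with no overhead for protocols, verification or index rebuilds.
    """
    bytes_per_second = throughput_gbit_per_s * 1e9 / 8
    return data_bytes / bytes_per_second / 3600


hours = restore_hours(PETABYTE_BYTES, 10)  # 10 Gbit/s is an assumed figure
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints: 222 hours, about 9.3 days
```

Even under these optimistic assumptions a full restore takes more than a week, which is why recovery objectives need to be agreed with the business during planning.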
You may consider moving some or all of a big data application to a cloud storage vendor – both for reasons of scalability and to reduce the need to develop big data storage expertise and infrastructure within your organisation. For example, Xero, the cloud-based accounting service, predicts 60% growth on the 760 terabytes of data it holds. Data was managed on in-house SANs but, after multiple complex SAN migrations, Xero has benefited from moving some of its data to Azure and AWS managed cloud storage services.
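The Xero figures translate into a concrete capacity target. The sketch below applies the 60% growth to the 760 terabytes; the article does not state the period over which that growth is expected, so the `periods` parameter and the `projected_tb` name are assumptions made for illustration.

```python
CURRENT_TB = 760    # data currently held, from the Xero example
GROWTH_RATE = 0.60  # predicted growth, from the Xero example


def projected_tb(current_tb, growth_rate, periods=1):
    """Project storage after compounding growth over a number of periods.

    Assumption: the growth rate applies once per period; the length of
    a period is not given in the source.
    """
    return current_tb * (1 + growth_rate) ** periods


print(round(projected_tb(CURRENT_TB, GROWTH_RATE)))  # 1216
```

If the same rate held for a second period, the requirement would approach 2 petabytes, which shows how quickly a fixed in-house SAN can be outgrown.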
Big data is complex
While the word ‘big’ encourages us to focus on the volume of data in a big data application, as we’ve seen, it’s only one piece of the puzzle.
The application’s complexity, the speed at which data is delivered and the variety of data handled make it vital to be confident in your upfront and ongoing storage planning.
No matter their proven enterprise pedigree, traditional Network Attached Storage (NAS) and Storage Area Network (SAN) solutions are unlikely to meet the demands of a big data application.
To meet their future big data storage requirements, organisations must now look to a new breed of storage technology solutions from companies that are ready to support big data virtualisation, and to those companies’ integration partners.
About the Author
Andri Selamat is Practice Principal at Ricoh IT Services. With almost 20 years’ experience in the IT sector, he’s currently responsible for the delivery of engineering and architecture services across cloud computing, Microsoft, Citrix and VMware.