How big is Big?

Applying Big Data concepts at any scale

Current estimates say that by 2020 there will be more than 40 zettabytes of data worldwide (about 5,200 gigabytes for each person). While most companies will work with just a small fraction of this available data, the challenges they face in aggregating and organizing their data can benefit from approaches similar to those used in big data analysis. Generally, big data projects follow the same process: acquire the data, align it, store it for future use, and then shape it for analysis. So while each data aggregation project is as unique as the customer problem it addresses, there are universal questions you can ask to arrive at the overall strategy that best fits your needs.

[Graphic: Big Data Basics]

“Big Data” system architects commonly consider “the four V’s” of big data (Volume, Velocity, Variety, and Veracity) when designing a data-centric solution. Throughout this article, keep in mind how the four V’s apply to your situation (even if you may not have big data) as you consider each decision you make.


For more info, check out this infographic: http://www.ibmbigdatahub.com/infographic/four-vs-big-data

[Graphic: Data Acquisition]

Data Acquisition and Transportation – How do you get your data and move it around?

The first step in data aggregation for analysis and visualization is to find all the relevant data and collect it in one place. Typically, this involves setting up a way to acquire the data you need on a regular schedule and placing it in the correct storage location once you have it. The best acquisition method will vary with the type of data and the local storage methods each company employs. The considerations below will help you and your software provider narrow down which methods are right for you:

How is the data stored?

Your company probably has data in multiple locations and in many formats, so finding your data and determining its local layout is the first step in aggregation. Do you have a centralized database or multiple data sources within your company? Is some of your data kept in Excel or CSV files that can be collected periodically? Is it streaming from an internet-connected device? Or is it stored in a local system where a web service or API could be used to collect it?
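For instance, two of those answers often translate directly into small collection routines. The Python sketch below (the file name and endpoint are hypothetical, and your storage target will differ) shows a periodic CSV export and a JSON web service being pulled into the same in-memory shape:

import csv
import json
from urllib.request import urlopen

def load_csv_rows(path):
    """Read a periodically collected CSV/Excel-style export into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def fetch_api_records(url):
    """Pull records from a local system that exposes a JSON web service."""
    with urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Both sources end up in the same shape, ready to be pushed into your
    # aggregation store on whatever schedule you choose.
    lab_results = load_csv_rows("lab_results_export.csv")            # hypothetical export file
    visits = fetch_api_records("http://localhost:8080/api/visits")   # hypothetical endpoint
    print(len(lab_results), len(visits))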

How often do you want to acquire new data?

Once you have decided on your data collection strategy, you will need to determine how often new data should be integrated into your analytics processes. Is it sufficient to view historic data current to the last quarter, month, week, or day, or do you need minute-by-minute updates? If you access your data less frequently, then all of it can be sent as a batch during off hours, increasing the performance of queries against the data during the day. If you need more frequent updates, then you will likely want to pursue a method of sending data in frequent small packages, or even streaming. These methods ensure that you have close to real-time data, but they can take a toll on performance.
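As a rough illustration of that trade-off, here is a minimal Python sketch of the two extremes. It assumes a simple SQLite table named readings and a caller-supplied function that returns only the rows added since the last call; both are placeholders, not a prescribed design:

import sqlite3
import time

DDL = "CREATE TABLE IF NOT EXISTS readings(device_id TEXT, value REAL, ts TEXT)"

def nightly_batch_load(rows, db_path="aggregate.db"):
    """Off-hours batch: load everything collected since the last run in one pass,
    keeping daytime queries fast at the cost of day-old data."""
    con = sqlite3.connect(db_path)
    con.execute(DDL)
    con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

def near_real_time_poll(fetch_new, db_path="aggregate.db", interval_seconds=60):
    """Frequent small packages: poll the source every minute and insert only the delta.
    Data stays close to real time, but the store is being written to all day long."""
    con = sqlite3.connect(db_path)
    con.execute(DDL)
    while True:
        delta = fetch_new()  # caller supplies "rows added since my last call"
        if delta:
            con.executemany("INSERT INTO readings VALUES (?, ?, ?)", delta)
            con.commit()
        time.sleep(interval_seconds)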

How much data is being collected at the intervals you’ve selected?

Now you can begin to design your data workflows, but deciding what technology to use will depend on the structural characteristics of the data payloads you are moving around. Is the data in large chunks (such as a batch of long-distance driver logs) or in discrete pieces (like individual laboratory test results)? Most integration and data aggregation approaches leverage some type of messaging infrastructure to implement workflows. The size of your data chunks will help you decide whether to send messages that incorporate the data or to migrate the data as a chunk to temporary storage and then send messages with a location reference so that the data can be retrieved later in the process. As above, there are considerations for each choice. Messages that incorporate data tend to streamline development, but messages with location identifiers can improve performance and increase the modularity of the system, making it more traceable and easier to adapt to change in the future. These considerations can also influence tool selection, as many systems are only designed to accept data as part of the message payload.
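A small Python sketch of the two message shapes might look like the following; the staging directory, message types, and JSON layout are illustrative assumptions, not any particular product’s format:

import json
import uuid
from pathlib import Path

STAGING_DIR = Path("staging")  # hypothetical temporary storage for large chunks

def message_with_data(records):
    """Discrete pieces (e.g., individual test results): embed the data in the message."""
    return json.dumps({"type": "lab_result", "payload": records})

def message_with_location(large_chunk: bytes):
    """Large chunks (e.g., a batch of driver logs): stage the data, send only a reference."""
    STAGING_DIR.mkdir(exist_ok=True)
    ref = STAGING_DIR / f"{uuid.uuid4()}.dat"
    ref.write_bytes(large_chunk)  # migrate the chunk to temporary storage first
    return json.dumps({"type": "driver_log_batch", "location": str(ref)})

def resolve(message: str) -> bytes:
    """A later step in the workflow retrieves the data either way."""
    msg = json.loads(message)
    if "payload" in msg:
        return json.dumps(msg["payload"]).encode()
    return Path(msg["location"]).read_bytes()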

Can you retrieve your data again?

In many cases you need to plan for the possibility that information may be lost or corrupted during the collection and processing steps. Generally, when more traceability and modularity are built into a system, it is easier to pinpoint where something went wrong and correct the issue. If your source data is stored in a file or database, then it is likely you can go back and reload the data if something unexpected happens. However, if you are collecting a live stream or a rolling log, you might not be able to retrieve it again. The impact of data loss or corruption will differ depending on the methods you’ve chosen to acquire and move the data.
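For sources that cannot be re-queried, one common mitigation (sketched below in Python, with the spool file name and the processing step as placeholders) is to persist each incoming event to a local spool before processing it, so a downstream failure can be recovered by replaying the spool:

from pathlib import Path

SPOOL = Path("stream_spool.log")  # hypothetical durable spool for a live stream

def ingest_stream_event(event: str):
    """Write the event to the spool before processing, since the source itself
    cannot be asked for it a second time."""
    with SPOOL.open("a") as f:
        f.write(event.rstrip("\n") + "\n")
    process(event)

def replay_spool():
    """Re-run processing for everything captured so far after a failure."""
    for line in SPOOL.read_text().splitlines():
        process(line)

def process(event: str):
    print("processing", event)  # placeholder for the real pipeline step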

How will you monitor data flow and clean up any errors?

As the old adage goes, “Hope for the best, but plan for the worst.” Once you’ve made your choices on how the system should be set up, you need to know what to do when something goes wrong. Plans for downtime and the inevitable case where data is unavailable need to be put in place so that operations don’t grind to a halt if your access to the data goes down. In addition, you will need to think about a data cleanup strategy for the case where inaccurate or misplaced data gets into the system. For inaccurate data, it is useful to know what actions were performed against the data before it reached storage: what source it came from, what manipulations were performed on it, and when it entered the system. It is also important to consider the possibility that decisions could be made based on the inaccurate data, and to determine a plan of action to prevent and correct any harm done by these events.
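One lightweight way to keep that context is to carry a small provenance record alongside each row as it moves through the pipeline. The Python sketch below is one possible shape for such a record, not a prescribed format:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Provenance:
    """Enough context to trace a row back if it turns out to be inaccurate."""
    source: str                                         # where the data came from
    entered_at: str                                     # when it entered the system
    manipulations: list = field(default_factory=list)   # what was done to it en route

def wrap(row: dict, source: str) -> dict:
    """Attach a provenance record as the row enters the pipeline."""
    stamp = datetime.now(timezone.utc).isoformat()
    return {"data": row, "provenance": Provenance(source=source, entered_at=stamp)}

def record_step(wrapped: dict, step: str) -> dict:
    """Log each manipulation so a cleanup effort can see exactly what happened."""
    wrapped["provenance"].manipulations.append(step)
    return wrapped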

[Graphic: Data Alignment]

Data Alignment – How do you make sure your data is apples to apples?

The two big questions to ask about data alignment are how to match up the data and how to monitor the process. When pulling data from disparate sources, it is common to find that key identifiers are hard to match because the sources aren’t identical. Think of a patient who goes to see their doctor, goes to a lab for a test, and then pays for it via insurance. The patient is the same in each case, but the doctor, the lab, and the insurance company might use different identifiers in their databases to store the information....
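In code, that matching problem is often handled with a crosswalk that maps each system’s local identifier onto a single master identifier. The Python sketch below uses invented sources and identifiers purely for illustration:

# Each source keys the same patient differently; the crosswalk maps every
# local identifier onto one master identifier so records can be joined.
CROSSWALK = {
    ("clinic", "DOC-1017"): "patient-42",
    ("lab", "LAB-88421"): "patient-42",
    ("insurer", "MBR-55-0093"): "patient-42",
}

def align(records):
    """Group records from disparate sources under their master identifier."""
    aligned = {}
    for rec in records:
        master = CROSSWALK.get((rec["source"], rec["local_id"]))
        if master is None:
            continue  # in practice, route unmatched records to an exception queue
        aligned.setdefault(master, []).append(rec)
    return aligned

if __name__ == "__main__":
    sample = [
        {"source": "clinic", "local_id": "DOC-1017", "event": "office visit"},
        {"source": "lab", "local_id": "LAB-88421", "event": "blood panel"},
        {"source": "insurer", "local_id": "MBR-55-0093", "event": "claim paid"},
    ]
    print(align(sample))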

 


 

Looking for more great information like this? Check out our BI Dashboards article to learn how to get started on a BI dashboard project with your newly corralled data.

Not sure where to start? Still have questions? Yahara has been helping clients with their data needs for over 20 years, so don’t hesitate to reach out!

 

Words by Abbey Vangeloff

Images by Sandi Schwert

Collaborators: Adam Steinert, Patrick Cullen, Chad McKee, Kevin Meech, and Garrett Peterson

Sources: http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf

 
