The concept of big data has been around for well over a decade now, and today, data sets are bigger than ever before.
Content production of all types continues to explode. Telemetry-producing devices of all sorts, from IoT sensors to robots to cloud-based microservices, churn out massive quantities of data – all of them potentially valuable.
On the consumption side, AI has given us good reason to generate, collect, and process vast amounts of data – the more, the merrier. Every AI use case, from autonomous vehicles to preventing application outages, can be improved with more data, all the time.
Where to put all these data remains an ongoing concern. The explosion of data threatens to swamp any reduction in storage costs we might eke out. Data gravity continues to weigh us down. And no matter how many data we have, we want whatever insights we can extract from them right now.
The challenge of how to manage all these data, in fact, is more complicated than people realize. There are multiple considerations that impact any decision we might make about collecting, processing, storing, and extracting value from increasingly large, dynamic data sets.
Here are some of the basics.
The Three Dimensions of Big Data Management
The first dimension we must deal with is data gravity. Data gravity refers to the relative cost and time constraints of moving large data sets as compared to the corresponding cost and time impacts of moving compute capabilities closer to the data.
If we’re moving data in or out of a cloud, there are typically egress fees, and sometimes ingress fees, we must take into account. It’s also important to consider the storage costs for those data, given how hot (rapidly accessible) those data must be.
Bandwidth or network costs can also be a factor, especially if moving multiple data sets in parallel is the best bet.
Every bit as important as cost is the time consideration. How long will it take to move these data from here to there? If we’re moving data through a narrow pipe, such time constraints can be prohibitive.
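To make the data gravity calculation concrete, here is a minimal back-of-the-envelope sketch in Python, using purely hypothetical figures, of how transfer time and egress cost scale with the size of a data set:

```python
def transfer_estimate(size_gb: float, bandwidth_gbps: float, egress_per_gb: float):
    """Back-of-the-envelope data gravity estimate for moving a data set.

    size_gb        -- size of the data set in gigabytes
    bandwidth_gbps -- effective network throughput in gigabits per second
    egress_per_gb  -- cloud egress fee in dollars per gigabyte (hypothetical)
    """
    hours = (size_gb * 8) / (bandwidth_gbps * 3600)  # GB to gigabits, seconds to hours
    cost = size_gb * egress_per_gb
    return hours, cost

# Hypothetical example: 500 TB over a 10 Gbps link at $0.09 per GB egress
hours, cost = transfer_estimate(500_000, 10, 0.09)
print(f"~{hours:.0f} hours in transit, ~${cost:,.0f} in egress fees")
```

At rates like these, shrinking the data before it moves, or not moving it at all, is often a better bet than buying a bigger pipe.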
The second dimension is data residency. Data residency refers to the regulatory constraints that limit the physical locations of our data.
Some jurisdictions require data sovereignty – keeping data on EU citizens within Europe, for example. Other regulations constrain the movement of certain data across borders.
In some cases, data residency limitations apply to entire data sets, but more often than not, they apply to specific fields within those data sets. A large file with personally identifiable information (PII) will have numerous regulatory constraints as to its movement and use, while an anonymized version of the same file might not.
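As a rough illustration of that field-level approach, a pipeline might pseudonymize the restricted fields before a record leaves its home region. The sketch below uses hypothetical field names; note that hashing is pseudonymization at best, and real compliance may demand stronger anonymization:

```python
import hashlib

# Hypothetical fields that residency rules might restrict
PII_FIELDS = {"name", "email", "license_plate"}

def pseudonymize(record: dict) -> dict:
    """Replace restricted fields with a one-way hash so the rest of the
    record can move across borders with fewer residency constraints."""
    return {
        key: hashlib.sha256(str(value).encode()).hexdigest() if key in PII_FIELDS else value
        for key, value in record.items()
    }

print(pseudonymize({"name": "Ada Lovelace", "email": "ada@example.com", "reading": 42.7}))
```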
The third dimension we must consider is data latency. Given the network constraints that apply to a particular situation, just how fast can we move any quantity of data? Will those limitations impact any real-time behavior that business stakeholders require from the data?
Data latency is a problem in the first place because of the pesky speed of light. No matter how good your tech is, it’s physically impossible for any message to exceed this cosmic speed limit.
Once your network is as optimized as possible, the only way to reduce latency is to move the endpoints closer together.
Given that the greatest latency any message on earth is likely to experience is about a quarter of a second (representing the round-trip time to a geosynchronous satellite), most business applications don’t care much about latency.
In some situations, however, latency is critically important. Real-time multiplayer gaming, real-time stock trading, and other real-time applications like telesurgery all seek to reduce latency well below the quarter-second threshold.
Low-latency applications may still be the exception in the enterprise, but the situations where latency is an important consideration are exploding. An autonomous vehicle traveling 60 miles per hour will cover 22 feet in a quarter second – so it had better not be waiting for instructions from the cloud, or that pedestrian is toast!
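Both of those figures follow directly from the physics; here is a quick sanity check of the arithmetic:

```python
C_KM_PER_S = 299_792        # speed of light in a vacuum
GEO_ALTITUDE_KM = 35_786    # altitude of a geosynchronous orbit

# Best-case time for a signal to reach a geosynchronous satellite and return
geo_round_trip_s = 2 * GEO_ALTITUDE_KM / C_KM_PER_S
print(f"Geosynchronous round trip: ~{geo_round_trip_s:.2f} s")             # ~0.24 s

# Distance an autonomous vehicle covers in that time at 60 miles per hour
FT_PER_S_AT_60_MPH = 60 * 5280 / 3600                                      # 88 ft/s
print(f"Distance covered at 60 mph: ~{FT_PER_S_AT_60_MPH * 0.25:.0f} ft")  # 22 ft
```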
Calculating the Impact of the Three Dimensions
Most organizations struggling with big data have tackled one or more of these considerations – but typically, they do so separately. It’s important, however, to factor in all three dimensions when planning any big data strategy.
Regulatory constraints provide guardrails, but depending upon specific compliance restrictions, organizations can tackle data residency in different ways. Such calculations should always take into account data gravity considerations as well.
For example, if one strategy for complying with a data residency regulation requires moving large data sets, it’s important to factor in both the cost and time constraints of such movements when deciding whether that particular strategy is the right one.
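As a toy example of such a combined calculation, with all figures hypothetical, compare two ways of handling analytics on a data set that must reside in the EU: anonymize it and replicate it to a central analytics region, or leave it in place and deploy the analytics compute in-region, accepting cross-region round trips for whatever still has to flow out:

```python
# Toy comparison of two residency-compliant strategies (all figures hypothetical)
strategies = {
    "anonymize, then replicate to the central analytics region": {
        "data_moved_gb": 200_000,   # the bulk of the data set crosses regions
        "added_latency_ms": 0,      # analysts query a local copy
    },
    "keep data in-region, deploy analytics compute there": {
        "data_moved_gb": 2_000,     # only aggregated results leave the region
        "added_latency_ms": 80,     # cross-region round trips for remote consumers
    },
}

EGRESS_PER_GB = 0.09    # hypothetical egress fee, dollars per gigabyte
BANDWIDTH_GBPS = 10     # hypothetical effective throughput

for name, s in strategies.items():
    hours = s["data_moved_gb"] * 8 / (BANDWIDTH_GBPS * 3600)
    cost = s["data_moved_gb"] * EGRESS_PER_GB
    print(f"{name}: ~{hours:.1f} h in transit, ~${cost:,.0f} egress, "
          f"+{s['added_latency_ms']} ms per remote query")
```

Neither column of numbers settles the question on its own; the point is that gravity, residency, and latency all land in the same spreadsheet.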
The real wild card in these calculations, however, is the impact of edge computing. Today, organizations are already balancing data gravity and latency considerations when choosing a content delivery network (CDN). CDNs operate servers at the cloud edge – locations within clouds that are geographically close to end-users.
Edge computing adds the near and far edges to this consideration. The near edge includes telco points of presence, factory data centers, and even retail phone closets – anywhere an organization might locate server equipment to better serve edge-based resources.
While the near edge generally assumes stable electricity and plenty of processor power, we can’t make those assumptions on the far edge. The far edge includes smartphones and other smart devices as well as IoT sensors and actuators and any other technology endpoint that might interact (even intermittently) with the near edge.
The edge is such a wild card because disruptive innovation is proceeding at a rapid pace, so planning ahead involves strategic guesswork. That being said, there are plenty of examples today where edge computing has impacted the calculations of the three dimensions of big data management.
Take video surveillance, for example. AI has improved to the point that it can detect suspicious behavior in video feeds in real time. Moving video files from cameras over a network to the near edge and from there to the cloud, however, faces serious data gravity challenges.
As a result, most AI inferencing for video surveillance either takes place at the near edge (say, a server closet near the cameras) or on the cameras themselves.
Moving inferencing to the devices, in turn, reduces latency – enabling the detection of suspicious behavior long before the burglars get away with the jewels.
Localizing such inferencing may also help with regulatory compliance, as such video feeds might contain confidential information like license plates or even people’s faces.
Any data management strategy for video surveillance, therefore, must balance gravity, latency, and residency priorities within the same deployment.
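A minimal sketch of that near-edge pattern, in Python with OpenCV for frame capture, might look like the following. The detection and alerting functions are placeholders for whatever model and pipeline a real deployment would use; the point is that raw frames stay on site and only small alert payloads travel upstream:

```python
import cv2  # OpenCV, assuming a camera reachable from the near-edge server

def detect_suspicious(frame) -> bool:
    """Placeholder for a locally run inference model (for example, an object
    or behavior detector). Returns True when the frame warrants an alert."""
    return False  # stand-in; a real deployment would invoke the local model here

def send_alert(metadata: dict) -> None:
    """Placeholder for forwarding a small alert payload to the cloud.
    Only metadata leaves the site; raw video stays local, which eases both
    data gravity and data residency concerns."""
    print("ALERT:", metadata)

capture = cv2.VideoCapture(0)  # local camera feed
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    if detect_suspicious(frame):
        send_alert({"camera": 0, "tick": cv2.getTickCount()})
capture.release()
```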
The Intellyx Take
Whenever you find yourself facing multi-dimensional big data management challenges, ask yourself these three questions:
- What is the business value of the data?
- What are the costs associated with the data?
- How real-time do the data need to be?
The answers to those questions, in turn, will feed into calculations that balance data gravity, data residency, and data latency priorities.
There will never be one right answer, as the business considerations will vary dramatically from one situation to the next. What’s important to remember, therefore, is that regardless of the situation, you’ll need to crunch the numbers across all the dimensions to figure out your optimal big data management strategy.
© Intellyx LLC. Intellyx publishes the Intellyx Cloud-Native Computing Poster and advises business leaders and technology vendors on their digital transformation strategies. Intellyx retains editorial control over the content of this article. Image credit: autogenerated at Craiyon.