What is Data Gravity? How it Can Influence Your Cloud Strategy
What is Data Gravity?
When working with larger and larger datasets, moving the data around to various applications becomes cumbersome and expensive. This effect is known as “data gravity.”
Because data gravity can lock you into an on-premises data center or a single cloud provider, it hinders your business’s ability to be nimble and innovative. One way to overcome data gravity is to adopt a cloud-attached storage solution that connects to multiple clouds simultaneously.
How Does Data Gravity Influence Your Cloud Strategy?
As providers like AWS, Azure, and Google Cloud compete to be the primary cloud provider, it seems like they all have a pitch to convince you to migrate into their cloud. Adopting one or more clouds might make sense for your business needs, but does it make sense for your data?
Data is massive, both in the sheer size of modern datasets and in the “pull” that data exerts, multiplying requirements for additional capacity and services to utilize it. The term for this effect is “data gravity,” and for many, the associated costs are crushing. Rising data storage costs, fees to access your data, and the doubled cost of hosting, replicating, and syncing duplicate data sets can all impede your budget and business success.
There are two challenges to solving data gravity: latency and scale. The speed of light is a hard limit on how quickly data can be transferred between sites, so placing data as close to your cloud applications and services as possible reduces latency. And as your data grows, it becomes more difficult to move. Let’s look at a few cloud strategies organizations use to address the challenge of data gravity:
All of Your Data in One Cloud
One approach to reducing latency is putting all of your data in a single cloud. Like the proverbial warning about putting all of your eggs in one basket, this approach has some drawbacks. These drawbacks include:
- Synchronization — Maintaining disparate data sets can be complicated and time-consuming. As data sets grow, meeting recovery time objectives (RTO) and recovery point objectives (RPO) becomes more difficult.
- Compatibility — Replicating your solution design using cloud provider resources usually takes the form of a menagerie of functionality and services overlaid on top of your storage to provide similar, or merely compatible, storage workflows. Like others, you may find that cloud provider storage solutions do not fit your use case as well as you need, and require additional, possibly unplanned and unbudgeted, services to fill functionality gaps.
- Fees — Not only are you paying for the base data storage costs with cloud provider storage, but also for performance, transaction, and egress fees.
- Operationalization — Being able to use established methodologies or procedures like array-based replication, snapshots, or multiprotocol access can significantly improve workflows and alleviate budget constraints. Providers may not directly support this functionality.
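The fee stack described above compounds quickly. A minimal back-of-envelope sketch in Python, where all rates are hypothetical placeholders and not any provider’s actual pricing:

```python
# Rough monthly cost model for cloud object storage.
# All rates below are illustrative assumptions, NOT real provider pricing.
STORAGE_PER_GB = 0.023    # $/GB-month, base storage (assumed rate)
EGRESS_PER_GB = 0.09      # $/GB transferred out (assumed rate)
REQUEST_PER_10K = 0.005   # $/10,000 requests (assumed rate)

def monthly_cost(stored_gb: float, egress_gb: float, requests: int) -> float:
    """Sum base storage, egress, and request fees for one month."""
    storage = stored_gb * STORAGE_PER_GB
    egress = egress_gb * EGRESS_PER_GB
    request = requests / 10_000 * REQUEST_PER_10K
    return storage + egress + request

# Example: 100 TB stored, 20 TB read out to another cloud, 5M requests.
print(f"${monthly_cost(100_000, 20_000, 5_000_000):,.2f}")
```

Note that in this sketch the egress line alone is close to the base storage line; that is the “fees to access your data” problem in miniature.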
Each cloud provider promises agility, flexibility, lower costs, and superior services and toolsets, but the reality can be unforgiving. Instead of increased agility and flexibility, your developers may become hamstrung by the single cloud implementation or by an underperforming hybrid cloud configuration. Instead of superior services and toolsets, you’re trapped with a provider (“vendor lock-in”) or struggling with compatibility and service integration. Instead of lowered costs, you’re sitting on a mountain of egress fees or paying for a mismatch in performance levels.
Store Your Data On-Premises Plus Cloud-Native Storage
Duplicated data, outside of backups or DR strategies, is wasteful, so maintaining a single data repository or data lake is the best method to avoid siloed and disparate datasets.
A data lake with appropriate scalability seems easy enough, and it can be — depending on your data needs. Many organizations have a suitable on-premises data lake, but accessing that data lake from the cloud has several challenges:
- Latency – The further you are from your cloud, the higher the latency you will experience. For every doubling in round-trip time (RTT), per-flow throughput is roughly halved.
- Connectivity – Ordering and managing network links can be costly. Balancing redundancy, performance, and operational costs is difficult.
- Support – Operating and maintaining storage systems is expensive and complicated enough to generally require dedicated personnel.
- Capacity – A plan and budget for growth are required.
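The latency point above follows from a simple relationship: for a fixed window, a sender can have at most one window of data in flight per round trip, so per-flow throughput is approximately window / RTT. A minimal sketch, assuming a 64 KiB TCP window (the exact window size is an assumption for illustration):

```python
# Why per-flow throughput halves when RTT doubles:
# with a fixed window, at most `window_bytes` can be in flight
# per round trip, so throughput ~= window / RTT.
def per_flow_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on a single flow's throughput in megabits/second."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

WINDOW = 64 * 1024  # 64 KiB window (assumed for illustration)
for rtt in (5, 10, 20, 40):  # ms: on-prem adjacency out to cross-region
    print(f"RTT {rtt:>2} ms -> {per_flow_throughput_mbps(WINDOW, rtt):6.1f} Mb/s")
```

Each doubling of RTT in the loop halves the achievable rate, which is why physical adjacency to the cloud matters so much.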
Migrating your data into a cloud provider or utilizing on-premises storage in the cloud are both susceptible to some or all of these challenges. Performance will always be hamstrung by high latency with cloud-accessible on-premises solutions, and cloud provider offerings only solve problems for workloads in that cloud. While both of these options are valid, neither will move your performant hybrid cloud or multi-cloud agenda forward.
Adopting Multiple Clouds to Overcome Vendor Lock-In
According to Gartner, by 2024, two-thirds of organizations will use a multi-cloud strategy to reduce vendor dependency. Cloud-native storage tiers on AWS, Google Cloud, and Azure can be matched to the performance and access frequency of different types of data, but each can only be accessed from its own cloud. If your developers and user teams use multiple services from different clouds that all need access to the same data, these cloud provider storage solutions may become a trap. External, remote, or cross-cloud access may be closed off. Cross-availability-zone access within the same cloud, or replication, can become more difficult. Even a simple method for seeding data can become a pain point. Sidestepping these issues may require duplicating your datasets – adding cost and management overhead. While you’ve solved the problem of vendor dependency, this approach still has data access and cost implications.
If your organization has sunk costs in equipment that may not be fully depreciated, legacy applications that are unsuitable for cloud-native deployment designs, or data compliance requirements, overcoming inertia to access the benefits of a multi-cloud strategy may seem impossible. It’s helpful to change the goal from “how do I get my app into the cloud?” to “how do I use my data from the cloud?” This perspective change, which places your data at the center of your strategy, will help your organization chart a path that future-proofs your data and enables you to leverage competitive services from each of the clouds.
When solving for multi-cloud data access, ask these questions:
- How do I minimize latency?
- How do I keep my data secure?
- What is the most efficient way to access my data from anywhere?
- How do I minimize fees?
Some challenges, like cross-cloud access, can add such complexity or cost that the design becomes untenable. The real dream killer, though, is latency. No matter how awesome your storage array is, no matter how fat your network pipes are, storage performance is a function of latency, and distance is the enemy.
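The claim that storage performance is a function of latency can be made concrete with Little’s law: sustained IOPS is bounded by outstanding I/Os divided by per-I/O latency. A minimal sketch, where the queue depth of 32 is an assumed value for illustration:

```python
# Little's law applied to storage: achievable IOPS is capped at
# queue_depth / latency, so every millisecond of network distance
# lowers the ceiling no matter how fast the array itself is.
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS for a given queue depth and round-trip latency."""
    return queue_depth / (latency_ms / 1000)

QD = 32  # outstanding I/Os per client (assumed for illustration)
for latency in (0.5, 2.0, 10.0):  # ms: local flash, metro distance, cross-country
    print(f"{latency:4.1f} ms latency -> {max_iops(QD, latency):>8,.0f} IOPS max")
```

The array’s internal speed never appears in the bound; only distance-driven latency and queue depth do, which is why adjacency dominates the design.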
Where’s the Right Location?
Where can you put your data that allows for multi-cloud access at low latency? Adjacency is the proper solution to latency and the cloud edge is the logical answer, but what does that really mean?
Colocated Data Lake
Colocation data centers that are adjacent to cloud locations can enable data access from multiple clouds, a significant improvement over the data duplication that comes with copies of the same data in each cloud’s native storage. Because organizations often manage their own equipment in a colocation agreement, the responsibility of cross-referencing possible colo data centers with desired public clouds to validate low latency requirements falls on the customer’s organization. If that organization needs cross-region access, additional colocation sites and higher costs are often required. Finally, cross-connects and private circuit options, along with hyper-scaler onramps, introduce additional unknowns and will certainly increase the cost. Leveraging them safely and effectively may be more effort than you are prepared to shoulder.
Managed Data Services
Managed Data Services providers can offer the best of both worlds. They have already done the work of ensuring their data centers are located in close proximity to major hyperscale cloud providers, which means they can offer cloud-adjacent data lakes with low-latency, secure connections as well as SLOs suitable for your unique workloads and use cases. For additional efficiencies, providers can offer a familiar storage platform that you can easily consume without the burden of managing and supporting yourself, bundled with access and service offerings that connect to and augment resources and services of your clouds of choice.
Leveraging your preferred storage platform, directly from multiple cloud edges, is critical to crafting a more performant and reliable multi-cloud environment. The shortcomings of existing solutions are laid bare when high performance and multi-cloud access are needed.
Make the cloud edge the central pivot point for your data workflows to enable simultaneous access from multiple clouds and unlock the innovation and flexibility of multi-cloud. This resolves latency and performance bottlenecks from on-premises or unoptimized data center locations while greatly improving access and availability. Seeding data, configuring DR, and migrating data out become near painless. Best-of-breed storage services and toolsets are available from any cloud provider. Data security is more digestible, and compliance is easier to understand and manage. Finally, cloud arbitrage is possible, allowing you to deploy or shift workloads depending on cloud provider pricing or resource availability, enabling application-level high availability (HA) across clouds. With data at the center of your multi-cloud world, the options are endless.
About Dan: Dan is a Senior Storage Engineer and Infrastructure Architect who has been with Faction for 8 years, focusing on hybrid and multi-cloud storage architectures.