Data-intensive applications require large volumes of data and devote most of their processing time to I/O and manipulation of that data. Use cases range from IoT telemetry to genomic analysis. Many of these applications rely on the MapReduce model, which processes Big Data sets with a parallel, distributed algorithm on a cluster of compute nodes. Most cloud providers offer managed MapReduce services built on Hadoop, such as AWS Elastic MapReduce (EMR) and Azure HDInsight.
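The pattern behind these services can be illustrated with a minimal sketch: a map phase emits key/value pairs, the framework shuffles them by key, and a reduce phase aggregates each group. This is a toy, single-process word count, not the API of any particular service:

```python
# Toy sketch of the MapReduce pattern (word count). In a real service
# like EMR or HDInsight, map and reduce tasks run in parallel across a
# cluster; here everything runs in one process for clarity.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all counts emitted for a single key.
    return key, sum(values)

documents = ["big data big compute", "data gravity"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 2, "compute": 1, "gravity": 1}
```

The same three phases scale out because map and reduce calls for different keys are independent, which is what lets the cluster parallelize them.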
Each public cloud provides different analytics capabilities, and some are unique to one provider. For example, AWS offers a rich set of applications and capabilities for voice recognition and anomaly detection. Azure offers Microsoft Power BI as a service, enabling self-service business intelligence and surfacing meaningful insights through reports and dashboards.
Big Data management and analytics pipelines frequently use object storage services such as Azure Blob Storage, AWS S3, and S3-compatible solutions such as Google Cloud Storage, OpenStack Swift, Riak CS, Cassandra, and AliYun as the source or destination for their data.
The massive amounts of data required by these applications exert tremendous data gravity. As datasets grow larger and larger, they become increasingly difficult to move, which effectively locks an enterprise into its current cloud provider.
There are significant downsides to being locked into a single cloud provider or on-premises deployment: it may be difficult or costly to leverage tools offered by another cloud provider or by your own data center. Common workarounds are synchronizing or copying datasets between providers, but copying is seldom practical given dataset sizes, and maintaining synchronized datasets is costly and error-prone.