Cloudera Can’t Wait! Apache Iceberg Now Available on Cloudera Data Platform

In Cloud by Ron Westfall


The News: Cloudera announced the general availability (GA) of Apache Iceberg in Cloudera Data Platform (CDP). Iceberg is a 100% open table format developed through the Apache Software Foundation, and it seeks to help users avoid vendor lock-in. The GA announcement covers Iceberg running within key data services in CDP, including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML). These services seek to empower analysts and data scientists to collaborate on the same data with their choice of tools and analytic engines. See the Cloudera blog for details.

Analyst Take: Cloudera Data Platform’s GA support for Apache Iceberg advances openness across burgeoning data lakehouse environments. Organizations increasingly demand hybrid data flexibility, and they are adopting open data lakehouses to gain application interoperability and portability between on-premises implementations and public clouds, coupled with data scalability assurances.

What I see as key to expanding CDP’s presence across the enterprise space, and Apache Iceberg’s influence throughout the ecosystem, is that Iceberg’s native integration benefits from the enterprise-level capabilities of the Shared Data Experience (SDX), which include built-in data lineage, audit, and security. Specifically, Apache Iceberg tables in CDP are integrated with the SDX Metastore for table structure and access validation, which means organizations get auditing, fine-grained policies, and unified metadata, security, and governance capabilities out of the box.

From my view, Cloudera’s credentials in supporting openness and interoperability across the data lakehouse community are impressive, including contributions toward Apache Hive, Apache Spark, Apache NiFi, Apache Impala, and Apache YuniKorn. Now, through Apache Iceberg support, Cloudera can augment modern data architectures. The capabilities that enable this include:

  • Multi-function analytics delivered concurrently to fulfill comprehensive data lifecycle requirements, including edge and AI.
  • Time travel with point-in-time queries that support regulatory compliance as well as forensic visibility demands.
  • In-place table evolution that covers schema and partition changes through a single streamlined command, avoiding reliance on intricate, lengthy processes.
  • Performance improvements gained through supercharged partitioning to administer massive-scale data sets.
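
To make the time travel and in-place evolution bullets more concrete, here is a minimal conceptual sketch in Python. It is not Iceberg’s or CDP’s API: the `ToyTable` and `Snapshot` names, the integer timestamps, and the in-memory rows are all assumptions made for illustration. It only shows the general idea that each commit produces an immutable snapshot, that a point-in-time query reads the latest snapshot at or before the requested time, and that adding a column can be a metadata-only change that leaves existing rows untouched.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version (hypothetical, for illustration only)."""
    timestamp: int  # commit time in arbitrary integer ticks
    schema: tuple   # column names visible in this version
    rows: tuple     # rows stored as dicts; never modified after commit


@dataclass
class ToyTable:
    """Toy snapshot-based table; not the Iceberg API."""
    snapshots: list = field(default_factory=list)

    def _commit(self, timestamp, schema, rows):
        # Every change appends a new snapshot; older ones stay readable.
        if self.snapshots and timestamp <= self.snapshots[-1].timestamp:
            raise ValueError("commit timestamps must strictly increase")
        self.snapshots.append(Snapshot(timestamp, tuple(schema), tuple(rows)))

    def create(self, timestamp, schema):
        self._commit(timestamp, schema, ())

    def append(self, timestamp, new_rows):
        current = self.snapshots[-1]
        self._commit(timestamp, current.schema, current.rows + tuple(new_rows))

    def add_column(self, timestamp, name):
        # Metadata-only change: the schema grows, existing rows are reused as-is.
        current = self.snapshots[-1]
        self._commit(timestamp, current.schema + (name,), current.rows)

    def scan(self, as_of=None):
        """Return rows from the latest snapshot, or from the one current at as_of."""
        if not self.snapshots:
            raise LookupError("table has no snapshots")
        if as_of is None:
            snap = self.snapshots[-1]
        else:
            times = [s.timestamp for s in self.snapshots]
            idx = bisect_right(times, as_of) - 1
            if idx < 0:
                raise LookupError("no snapshot at or before the requested time")
            snap = self.snapshots[idx]
        # Project each row onto that snapshot's schema; missing columns read as None.
        return [{col: row.get(col) for col in snap.schema} for row in snap.rows]


table = ToyTable()
table.create(1, ["id", "amount"])
table.append(2, [{"id": 1, "amount": 10}])
table.add_column(3, "region")  # schema change without rewriting data
table.append(4, [{"id": 2, "amount": 20, "region": "EU"}])

print(table.scan(as_of=2))  # the table as it looked at time 2
print(table.scan())         # the current table, including the new column
```

In this toy model, the query as of time 2 still sees the original two-column table, while the current scan shows the added `region` column with `None` for the row written before the change. That is the kind of behavior that makes point-in-time audits and low-friction schema changes possible in snapshot-based table formats.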

I believe only through openness can data lakehouses meet the evolving requirements of customers. Cloudera touts how IQVIA uses the Cloudera portfolio to synthesize more than two petabytes of data from 250 data warehouses globally, including Oracle, IBM Netezza, and Teradata systems, into a multi-tenant data lake to run the company’s analytics. Cloudera’s hybrid data platform provides the open data lakehouse architecture key to supporting the full data lifecycle, one that can deliver multiple advanced analytics use cases alongside data-in-motion and operational database solutions.

These benefits present an opportunity for Cloudera from a sales and marketing standpoint. The company should dig in and amplify the benefits of CDP for customers, particularly how it can help them avoid spiraling administration complexity and the greater expense of building out additional support structures to redesign existing tables or take on third-party tool integration. By delivering on such benefits, we expect that Cloudera can become integral to driving improved data lakehouse and business outcomes across its customer footprint.

Overall, I view CDP as fully aligned with the Apache Iceberg mission, taking advantage of its open, high-performance, cloud-native table format that scales to petabytes while remaining agnostic to the underlying storage layer and access engine layer. This includes enabling smooth integration between processing and streaming engines while maintaining data integrity between them. As such, CDP is well-suited to play an essential role in delivering cloud-driven Apache Iceberg benefits and in supporting organizations’ transitions to open, converged data lakehouse environments.

Disclosure: Futurum Research is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

The analysis and opinions expressed herein are specific to the analyst individually, based on data and other information that might have been provided for validation, and are not those of Futurum Research as a whole.

Image Credit: Cloudera

The original version of this article was first published on Futurum Research.

Ron is an experienced research expert and analyst, with over 20 years of experience in the digital and IT transformation markets. He is a recognized authority at tracking the evolution of and identifying the key disruptive trends within the service enablement ecosystem, including software and services, infrastructure, 5G/IoT, AI/analytics, security, cloud computing, revenue management, and regulatory issues.
