To bundle or not to bundle: Fabric or Databricks for completeness of Data Analytics platform?
Published: June 28, 2024
Eugene Reis
Bundled product offerings are standard these days, in both retail and the enterprise. While the simplicity of such offerings is enticing, once usage matures and grows in scale, customers often find themselves saddled with technical and nontechnical debt, with limited options to fight slow and steady price increases or to decouple the systems. In the long run, it's rare to find customers satisfied with bundled services. A simple analogy in the retail space is customer behavior around bundles of internet, cable TV, and phone service. Once the initial promotional period ends, the cost often doubles or even triples, with limited options to decouple and move on to better providers.
Similar analogies exist in the enterprise data analytics space, and we will address two major contenders in this discussion: Databricks and Microsoft Fabric.
Introduction: Bundling of Services in Data Analytics
This article explores the long-running pursuit of a single, fully integrated data platform that can address most, if not all, data governance and management needs, and argues that Databricks is not only a pioneer in this field but also the most mature and comprehensive platform for fulfilling those requirements.
History
Over the last few decades, data has grown in volume, velocity, and complexity. Several terminologies have been coined to describe platforms trying to be all-encompassing: Data Warehouse, Big Data, Data Lake, Fabric, and so on. The rise of cloud computing made things even more challenging, and globally distributed data is now an everyday reality. Artificial intelligence has also gone mainstream, with seemingly everyone pursuing some type of large language model. How can any company manage such challenging scenarios?
The only way to meet the challenge is to embrace it fully. In 2018, Databricks introduced its Unified Platform, later followed by the formal articulation of the "Lakehouse" paradigm in a paper published in collaboration with UC Berkeley and Stanford University. The paper describes an environment that encompasses existing data technologies and extends them into artificial intelligence (AI) and machine learning (ML). Moreover, the paradigm was designed to be multicloud and scalable.
In our industry, completeness of vision and the ability to execute are critical for companies to achieve results quickly, efficiently, and with reduced friction and risk. That vision has long guided Databricks' roadmap and its commitment to deliver a seamlessly integrated environment, which is by no means an easy task. Unlike competitors who repackage and rebrand existing products, Databricks realized it had to innovate and build new technologies from the ground up to address the constantly growing complexity of data management.
Comparing Analytics Platforms in a Data-Driven World
In today's data-driven world, choosing the right analytics platform can significantly affect your organization's efficiency, given how quickly platforms and technologies change to address scalability, security, and the need for cohesive programming tools that keep pace with evolving business needs. The advent and maturation of big data frameworks were critical to processing large amounts of raw structured and unstructured data and storing it in a unified data lake for consumption by both BI and ML/AI workloads. In usage volume, these platforms have leapfrogged traditional relational data warehouses thanks to their agility, though not without data security and governance challenges. The biggest challenge is retraining existing talent in the development community for new programming paradigms.
On top of this, deploying newer machine learning models that require large-scale training creates a different set of process and governance problems.
Other companies have tried combining openness, integration, and governance into a single ecosystem. However, as mentioned before, previous efforts were based on forcing different products together, which resulted in a lack of consistency and clarity. Because Databricks was built from the ground up and continues to evolve with all those elements in its roadmap, it went further than any previous similar effort.
Two popular platforms today are Databricks and Microsoft Fabric. Both offer powerful data analytics, machine learning, and collaboration features, but they have distinct differences that may make one more suitable for your specific needs.
Some of the key questions and dilemmas around any technology or industry standard are not so much about the marvelous engineering feats behind it but rather:
- Can it scale, and is it reliable to be used in mission-critical production applications?
- Can it satisfy the needs and complexities of large enterprises?
- Does it have the proper safeguards to protect information?
- How much will it cost vs. how much will it help save money/time and generate revenue?
- How much vendor lock-in vs. open standards?
- Does it improve developer productivity?
This leads us to what's currently available. Microsoft Fabric is the most recently announced platform meant to address those questions. Let's explore it for a few moments and contrast it with Databricks, starting with the Microsoft stack.
Fabric
Fabric is a data platform for real-time analytics, integration, and governance. It offers a comprehensive set of tools for data management, ETL processes, and data quality monitoring.
Fabric is an attempt to bundle and unify existing household names (Data Factory, Synapse, Power BI) under one roof. The first thought that might come to many people's minds is that this is a rebranding and repackaging of those products, and that would not be unfair. Microsoft previously introduced Synapse as "simply unmatched and truly limitless," but that didn't go as planned. The most likely reason: Databricks.
Databricks
Databricks is a pure play, organically designed unified analytics platform based on Apache Spark™. It provides an integrated environment for data science, data engineering, and collaborative data analysis. With Databricks, users can run Spark workloads, build machine learning models, and visualize data, all within a single platform. It seamlessly integrates with various data sources, cloud providers, and third-party tools.
Notice that Databricks names types of workloads rather than products. That might sound like semantics, but the reason is simple: products are developed to address needs. That leads Databricks to work on two fronts:
- Develop new products when what's out there is not good enough or doesn't exist
- Integrate with other products when needed or preferable, giving clients choices
Databricks was born in the cloud and designed to scale. Moreover, it was designed to process data in its many forms: relational, hierarchical, graph, and unstructured text, as in natural language. That's the ingestion side. Once data is in, the demands can be just as varied: business intelligence, machine learning, exploratory analysis of large datasets, and compliance with regulatory requirements, to name a few. How can all of that be done? Once again, Databricks was built as an ecosystem.
Architecture
Skepticism has always been a big part of our industry. Buzzwords come and go all the time. Data managers have been hearing about the promised land of a complete platform for decades, and some of them may now suspect it is about as real as a unicorn. The managers who kept searching and found Databricks now want to understand how it can deliver on its promises. If the answers to their questions are not solid enough, they'll quickly move on to another product.
A few principles have guided Databricks in developing a truly unified platform.
Openness. Databricks is not a walled garden. Data can live in many formats, places, and ecosystems. One of the significant issues of the past was vendor lock-in, and it persists today in products that hold your data hostage. Databricks created Delta Lake, an open source, scalable data management layer built on another open standard, Parquet. Spark, widely regarded as the most successful distributed engine ever developed, is also open source.
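To make the openness concrete, here is a minimal sketch of writing and reading a Delta table from plain open source Spark. The path is illustrative, and on Databricks the session configuration shown below is already done for you:

```python
from pyspark.sql import SparkSession

# Assumes the open source delta-spark package is on the classpath
# (preconfigured on Databricks). The output path is illustrative.
spark = (
    SparkSession.builder.appName("delta-openness-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

events = spark.createDataFrame([(1, "login"), (2, "purchase")], ["user_id", "action"])

# A Delta table is just Parquet data files plus an open _delta_log transaction
# log, so any engine that speaks the protocol can read it back.
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")
spark.read.format("delta").load("/tmp/events_delta").show()
```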
Integration from the get-go. Other companies are trying to repackage old products and give the bundle a fancy name. But existing products cannot simply be shoehorned into easy integration and management; that only comes from designing for it from scratch. That makes Databricks different: the platform has always had integration in its DNA. It is present in multiple clouds and seamlessly integrates with most of their services; it integrates well with third-party vendors; and it handles structured and unstructured data, different schedulers, and a variety of authentication mechanisms.
Governance. What good is a platform that can store tons of data but can't manage it? This is a crucial point, and Unity Catalog is meant to address it. It can manage many types of data assets, such as workspaces, cloud storage, language models, and queries with automatic filters based on who is searching for the data. That can only be accomplished when a platform has those safety mechanisms at its core.
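As a small illustration of how that governance surfaces to users, Unity Catalog permissions are expressed as standard SQL grants. A minimal sketch from a notebook cell, where the catalog, schema, table, and group names are all hypothetical:

```python
# Minimal sketch of Unity Catalog grants, run from a Databricks notebook.
# The catalog, schema, table, and group names are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")
```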
Data Processing and Analytics
Built on Apache Spark™ from the start, Databricks excels at large-scale data analytics and machine learning workloads. It offers a notebook-based interface for interactive data exploration and supports multiple programming languages, including Scala, Python, and R. It handles both batch and streaming workloads, with Structured Streaming support through declarative Delta Live Tables. Databricks runtimes stay current with the latest Spark releases and incorporate the next-generation Photon engine (a C++ implementation) to deliver superior performance. Photon is integrated with SQL Warehouse at no additional cost.
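To give a concrete taste of that declarative style, here is a minimal Delta Live Tables sketch in Python. The landing path and table names are illustrative, and it assumes the code runs inside a DLT pipeline, where the spark session and the dlt module are provided:

```python
import dlt  # available inside a Delta Live Tables pipeline
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw JSON files with Auto Loader.
# The landing path is illustrative.
@dlt.table(comment="Raw events ingested from cloud storage")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")
    )

# Silver: a cleaned table derived from the bronze table above.
# DLT infers the dependency graph and manages the streaming state.
@dlt.table(comment="Events with a valid user_id")
def clean_events():
    return dlt.read_stream("raw_events").where(F.col("user_id").isNotNull())
```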
Fabric focuses on real-time analytics and offers a variety of disparate data processing engines, including Spark and Flink, though its Spark runtime may always lag a version or two behind the latest release. It provides a visual interface for designing and monitoring data pipelines, making it easier to manage complex workflows. However, since some of its components are older, traditional tools stitched together, they require programming knowledge on many fronts.
Cloud Support
Databricks is cloud-agnostic and supports all three major providers: Azure, AWS, and Google Cloud. Storage and compute are decoupled, so mixing AWS and Azure for different workloads is seamless, and Databricks workspaces can be migrated from one cloud to another with relative ease. Fabric is Azure-centric and focuses on Microsoft deployments for all data analytics needs. Fabric's simplified billing ends up coupling compute and storage into one pricing structure, making pricing somewhat unclear and unpredictable for customers.
Programmability
Databricks, uniquely positioned to serve all workloads (data engineering, machine learning, and business intelligence), is programmed primarily through simple, declarative, standard SQL and Python scripting, with advanced support for integrating Java and Scala libraries. Fabric, on the other hand, with its disparate tools cobbled together, requires developers skilled in SQL, Python, MS-SQL stored procedures, and DAX for Power BI reporting and dashboard development. Simplifying development and deployment with standard PySpark-based notebooks provides the best of both worlds: simplicity plus integration with third-party libraries.
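As a small sketch of that mixed style (the table name is hypothetical), the same notebook can move freely between declarative SQL and PySpark:

```python
# Start declaratively in SQL; the table name is hypothetical.
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM main.sales.orders
    GROUP BY order_date
""")

# Continue in Python where procedural logic is more natural.
top_days = daily.orderBy(daily["revenue"].desc()).limit(10)
top_days.show()
```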
Security and Governance
Security is critical when choosing a data analytics platform, especially for organizations dealing with sensitive or regulated data. Both Databricks and Fabric prioritize security, but each approaches it differently. Let's delve into the security features offered by each platform to understand their strengths and limitations.
- Data Encryption: Both platforms provide end-to-end data encryption at rest and in transit. However, Fabric's security options are not as extensive as those of Databricks.
- Identity and Access Management (IAM): Databricks integrates with popular identity providers like Azure AD, Okta, and LDAP for centralized user authentication and authorization. It offers fine-grained access control policies to restrict data access based on roles and permissions within its Unity Catalog feature. With its Azure-centric architecture, Fabric may require additional integration with third-party identity providers for centralized management.
- Audit and Compliance: Databricks provides comprehensive auditing capabilities with detailed logs and monitoring tools. Tracking user activities, data access, and system changes helps organizations comply with regulatory requirements like GDPR, HIPAA, and SOC 2. Fabric provides basic auditing and monitoring tools to track data access and system activities. While it helps organizations maintain visibility into data usage, it lacks some of the more advanced compliance features Databricks offers.
- Governance and Secure Collaboration: Through Unity Catalog governance, Databricks offers secure collaboration features such as shared notebooks with access controls, version history, and collaborative workspaces. It ensures that data and insights are shared securely among team members without compromising confidentiality. Fabric supports collaborative data analysis and sharing but does not offer collaboration features as advanced as Databricks'; it may lack shared notebooks with version control and fine-grained access controls. Moreover, because the integration between Fabric's different tools is loosely coupled, it is possible to bypass security controls and reach the underlying data store directly.
Delta Sharing
Delta Sharing is part of Unity Catalog governance and allows the secure distribution of datasets both within and beyond the organization's boundaries. This extends to data sharing across cloud providers, where users can seamlessly share data from an AWS-based workspace to an Azure workspace. This is a crucial differentiator for Databricks and is currently impossible in Fabric. Unity Catalog can be leveraged to unify workloads spanning data engineering, machine learning, and MLOps orchestration and to distribute content securely. That's one more thing that sets Databricks apart from Fabric.
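A minimal provider-side sketch using Databricks SQL from a notebook; the share, recipient, and table names are hypothetical:

```python
# Provider side: publish a governed table through Delta Sharing.
# Share, recipient, and table names are hypothetical.
spark.sql("CREATE SHARE IF NOT EXISTS quarterly_metrics")
spark.sql("ALTER SHARE quarterly_metrics ADD TABLE main.sales.orders")
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_org")
spark.sql("GRANT SELECT ON SHARE quarterly_metrics TO RECIPIENT partner_org")

# Consumer side: the open source delta-sharing client can then read the
# shared table from any environment, even outside Databricks, e.g.:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas("<profile>#quarterly_metrics.sales.orders")
```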
Open Standards
Databricks has always been an open source company, constantly releasing products and new features to the open source community. Apache Spark™ is the flag bearer, joined by Delta Lake, MLflow, Delta Sharing, and now DBRX, its next-generation open large language model. The company's focus has always been on decoupling compute and storage, enabling customers to build cost-efficient applications that leverage and optimize different pricing models based on their workload structure.
Pricing and Costs
Microsoft Fabric's pricing structure is supposed to be simple. The downside of that approach is that it is one-size-fits-all: compute capacity is reserved up front and consumed by every product in the suite. Microsoft itself acknowledges that there is no easy way to estimate in advance the capacity size you will need. On top of that, its allocation model follows an exponential pattern, in which you must allocate 256, 512, 1,024 cores, and so on. That easily opens the door to overallocation and waste.
Databricks, on the other hand, offers granular pricing, in which different types of compute (SQL, model serving, job clusters, all-purpose clusters, and so on) have different price points. The sum of those granular estimates determines the final resource allocation, and clients are not locked to the initial purpose: they can freely reallocate the capacity.
A simple pricing exercise makes the point. Databricks can reserve capacity with more precision. With Fabric, by contrast, you must either allocate the discounted volume tier closest to your actual need ($33,638) or take a lower tier ($16,819) and pay full list price for the overage. It only gets worse as consumption grows, since the next reserved tier jumps to $67,276. The promised simplicity turns into costly complexity.
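A toy calculation makes the overallocation risk tangible. It assumes the three reserved tier prices quoted above and a hypothetical annual need of $35,000:

```python
# Toy model of reserved tiers that double in price, using the figures quoted
# above. The estimated need is hypothetical.
TIERS = [16_819, 33_638, 67_276]  # annual reserved prices per capacity tier

def smallest_covering_tier(estimated_need: float) -> int:
    """Return the cheapest tier whose reserved price covers the estimate."""
    for price in TIERS:
        if price >= estimated_need:
            return price
    return TIERS[-1]

need = 35_000  # hypothetical: just past the $33,638 tier
tier = smallest_covering_tier(need)
print(tier, f"{(tier - need) / need:.0%} overallocated")  # 67276, 92% overallocated
```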
The bundling of services often benefits the vendor as much as the customer: Microsoft's marginal cost to offer more services is lower, since marketing costs shrink when products are channeled through the Fabric ecosystem, and demand becomes easier to predict because customers stop looking at other options. Upsell opportunities for add-on services are generally higher, with opaque pricing in the fine print. Databricks follows a different model, in which clients pay for what they need without hidden costs. Costs can be controlled per workload type, and volume discounts are available alongside granular cost-management mechanisms. For instance, different departments can pay their fair share while the company negotiates a global volume discount.
Building Fences
Microsoft has generally succeeded in building fences around its primary product lines by locking customers into the Azure ecosystem. It has evolved from desktop office products as its main profit engine into a data center company offering bundled products and services so that enterprises, large and small, make optimal use of its computing infrastructure within Azure. This is a sound strategy for Microsoft and its appetite for selling products while establishing vendor lock-in. It has historically worked, but at the cost of slower innovation in delivering nimble, agile solutions for the next-generation data-driven world.
Conclusion
Choosing Databricks over Fabric as a bundled service offers several advantages: comprehensive data analytics and machine learning capabilities, advanced security features, optimized performance, and flexible integration options. While Fabric has strengths in real-time analytics and data integration, Databricks provides a more cohesive, comprehensive, and scalable platform to meet the diverse needs of organizations looking for bundled service offerings.
Ultimately, the decision should be based on thoroughly evaluating your organization's requirements, priorities, and long-term strategic goals. Consider conducting a detailed comparison, evaluating demos or trial versions, and seeking stakeholder input to make an informed decision that aligns with your business objectives and customer needs.