Organisations have always accumulated data. Over the years, the need to organise, locate and retrieve this data has driven a steady evolution in data management, from filing cabinets through to relational databases and data lakes. In our modern, data-rich world, we have a vast array of data stored in different formats and in different places. The challenges of managing all this data, and making real use of it, grow with its increasing diversity and scale. What’s needed is a single framework that can encompass all of an organisation’s data and present it to the right people in the right way. Step in Microsoft Fabric’s OneLake.
OneLake is the backbone of the Microsoft Fabric platform. It links everything together by providing a single, unified data lake that manages all your data. This data is then made easily accessible to each Fabric application and experience. Here’s why OneLake could be a game changer for many organisations.
OneLake uses Lakehouses, which combine the structure of data warehouses with the versatility of data lakes, to store both structured and unstructured data. Data is organised into Tables and Files. Files hold raw data in any format, whereas Tables represent managed data. Tables can be created by simply importing from a file or through a complex ETL process. Tables stored within OneLake are held in Delta Parquet format, an open-source data format optimised for analytics and accessible to multiple applications.
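To make the Tables and Files split concrete, here is a minimal Python sketch that sorts a list of Lakehouse-relative paths into managed tables and raw files. It assumes the two top-level folders are named Tables and Files, and it relies on the fact that a Delta table keeps its transaction log in a `_delta_log` folder alongside its Parquet files. The function name and example paths are hypothetical, and this is an illustration of the layout rather than anything Fabric itself runs.

```python
from pathlib import PurePosixPath


def summarise_lakehouse(paths):
    """Group lakehouse-relative paths into Delta tables and raw files."""
    tables = set()
    files = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue
        if parts[0] == "Tables" and len(parts) >= 2:
            # Treat a folder under Tables/ as a Delta table once we see its log.
            if "_delta_log" in parts[2:]:
                tables.add(parts[1])
        elif parts[0] == "Files":
            # Anything under Files/ is raw data in whatever format it arrived in.
            files.append(raw)
    return sorted(tables), files


# Hypothetical listing of a lakehouse, for illustration only.
example_paths = [
    "Tables/sales/_delta_log/00000000000000000000.json",
    "Tables/sales/part-00000.snappy.parquet",
    "Files/raw/invoices/2023-06.csv",
    "Files/images/logo.png",
]

delta_tables, raw_files = summarise_lakehouse(example_paths)
print(delta_tables)
print(raw_files)
```

Because the table is just Parquet data plus a log, any tool that understands Delta can read it without a conversion step.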
One of the goals of OneLake is to remove duplication of data. Once data has been ingested into OneLake it is, permissions permitting, available to all users within your Fabric tenant to access using their preferred tools. This reduces the need to create multiple copies of the same data across different silos.
Most organisations will already have data stored in different locations. Although there may be aspirations to consolidate these disparate data stores, in practice the time and cost needed are often prohibitive. OneLake helps to ease this challenge by providing shortcuts. Shortcuts allow you to access data stored outside of Fabric by creating a direct link. Once you have defined a shortcut, the linked data can be referenced and managed as if it were in OneLake. Aside from the convenience, this means that you don’t have to create a copy of the data to use it, and when you do access it, you are reading the up-to-date version. Currently, in Fabric Public Preview, shortcuts can help you access AWS S3 and Azure Data Lake Storage Gen2 (ADLS Gen2) files.
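The idea behind a shortcut can be sketched in a few lines of Python. This is a conceptual model only, not the Fabric API: the `Shortcut` class, the `resolve` function, the bucket and the storage account names are all hypothetical. It shows a path inside OneLake being mapped to the external location where the data actually lives, with nothing copied along the way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shortcut:
    onelake_path: str  # where the shortcut appears inside the lakehouse
    target_uri: str    # where the data actually lives


def resolve(path, shortcuts):
    """Return the external URI behind `path`, or None if it is stored in OneLake."""
    for sc in shortcuts:
        prefix = sc.onelake_path.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            # Swap the shortcut prefix for the external target, keep the rest.
            return sc.target_uri.rstrip("/") + path[len(prefix):]
    return None


# Hypothetical shortcuts to an S3 bucket and an ADLS Gen2 container.
shortcuts = [
    Shortcut("Files/s3_sales", "s3://example-bucket/sales"),
    Shortcut("Files/adls_hr", "abfss://hr@exampleaccount.dfs.core.windows.net/exports"),
]

external = resolve("Files/s3_sales/2023/q1.csv", shortcuts)
native = resolve("Files/raw/notes.txt", shortcuts)
print(external)
print(native)
```

Since every read is redirected to the source, the data you see through a shortcut is always the current version.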
OneLake tables are stored in Delta Parquet format, making them directly accessible to multiple programming languages and avoiding the complexity of file format conversion. The different experiences and applications within Fabric are geared towards different ways of working and, as such, make use of different programming languages. This allows different users to work with the tools and languages they know. Whether that is T-SQL, R, Python or something else, users can select the tools that fit best.
Just like OneDrive, you can install OneLake onto your local machine, allowing you to access OneLake files through File Explorer. Uploading and browsing your OneLake files is made super simple, providing a seamless workflow. In File Explorer, the OneLake files are presented within folders which map to your Fabric Workspaces. You can even interact with files stored elsewhere, such as in AWS S3, by using the connection a shortcut provides.
This wide-ranging access to data demands a comprehensive security layer. User identity is derived from Azure Active Directory, which is used for authentication. The primary means of segregating data in Fabric is through Workspaces, which are used to contain the resources for a specific area or project. Access to resources within a Workspace is granted via role-based permissions assigned at the individual or group level. Finer-grained access controls can then be applied at the individual item level, giving additional sharing options (allowing you to grant access to users who have no roles in the Workspace) and promoting collaboration across the organisation. Microsoft Fabric is still in preview, so it seems reasonable to expect that further refinements will be made to the security model.
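The two layers described above, Workspace roles and item-level sharing, can be pictured with a toy Python model. It is a deliberate simplification and not how Fabric evaluates permissions internally: it assumes the four Workspace roles (Admin, Member, Contributor, Viewer) all imply read access, and the `Workspace` class, `can_read` function, users, group and item names are hypothetical.

```python
from dataclasses import dataclass, field

# Workspace roles, all of which are assumed here to allow reading items.
WORKSPACE_ROLES = {"Admin", "Member", "Contributor", "Viewer"}


@dataclass
class Workspace:
    name: str
    roles: dict = field(default_factory=dict)   # user or group -> role
    shares: dict = field(default_factory=dict)  # item name -> set of users or groups


def can_read(ws, item, user, groups=()):
    """True if the user can read the item via a role or an item-level share."""
    principals = {user, *groups}
    # Layer 1: any Workspace role, held directly or through a group.
    if any(ws.roles.get(p) in WORKSPACE_ROLES for p in principals):
        return True
    # Layer 2: the item was shared with someone who has no role in the Workspace.
    return bool(principals & ws.shares.get(item, set()))


finance = Workspace(
    name="Finance",
    roles={"finance-team": "Contributor"},
    shares={"SalesLakehouse": {"dana@example.com"}},
)

via_group = can_read(finance, "SalesLakehouse", "alex@example.com", groups=["finance-team"])
via_share = can_read(finance, "SalesLakehouse", "dana@example.com")
no_access = can_read(finance, "BudgetLakehouse", "dana@example.com")
print(via_group, via_share, no_access)
```

In this model Dana can reach the one item shared with her but nothing else in the Workspace, which is the kind of targeted collaboration item-level sharing is meant to allow.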
Organisations need to know where their data is going and where it has come from. With all your data stored in OneLake, and tables held in Delta Parquet format, the lineage of the data in your Workspace can be tracked. The lineage view comes built into Fabric and allows you to see the flow of data through your processes, giving a clear view of how your data artefacts are connected.
OneLake is different from the already available Azure Data Lake Storage. The key difference is that OneLake comes already provisioned with Fabric. This means there is no requirement to set up or configure your storage: OneLake is there and ready to go when you start using Fabric. OneLake supports the same ADLS Gen2 APIs and SDKs so, if you rely on these, moving to Fabric and OneLake should not create challenges. And if you have data stored in ADLS Gen2 or AWS S3, you can leave your data there and access it using a shortcut.
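Because OneLake speaks the ADLS Gen2 APIs, existing tools mostly just need pointing at a OneLake address. The Python sketch below builds such addresses, assuming the URI shape described in Microsoft's OneLake documentation at the time of writing (a single `onelake.dfs.fabric.microsoft.com` endpoint, with the Workspace in place of the container and the item name plus its type as the first folder). The helper names and the Workspace and Lakehouse names are hypothetical, and names containing spaces or special characters are not handled here.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss(workspace, item, item_type, path=""):
    """Build an abfss:// address, the form used by Spark and ADLS Gen2 drivers."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}/{path.lstrip('/')}"


def onelake_https(workspace, item, item_type, path=""):
    """Build the equivalent https:// address for REST-style access."""
    return f"https://{ONELAKE_HOST}/{workspace}/{item}.{item_type}/{path.lstrip('/')}"


# Hypothetical Workspace and Lakehouse names, for illustration only.
table_uri = onelake_abfss("SalesWorkspace", "SalesLakehouse", "Lakehouse", "Tables/orders")
file_url = onelake_https("SalesWorkspace", "SalesLakehouse", "Lakehouse", "/Files/raw/orders.csv")
print(table_uri)
print(file_url)
```

Code that already reads from an ADLS Gen2 account should, in principle, only need its storage address swapped for one of these.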
While OneLake is still in preview at this point, it looks like a real contender for the unified storage solution we’ve all been dreaming of.
For more on Microsoft Fabric and OneLake, check out our other blog posts or contact us at Katalyze Data to discuss more.