In our first post on data governance for modern organizations, we explained why traditional data governance strategies are failing.
With the rise of decentralized data ownership, data democratization, and new embedded data roles, data is getting more attention — and more out of hand. The world is coming to understand that data can improve any part of business, but while technology has advanced to support this, processes have lagged behind. Data is now easy to make, but hard to find and control.
In this article, we want to provide a framework of recommendations that combat this problem. We believe that in order to create better data governance in the modern data stack, companies need to get data discovery right. By creating a central data catalog that everyone in the organization contributes to and building automation layers on top, data discovery and data governance become sustainable, scalable processes.
How to incorporate a decentralized data governance model
A data catalog is key to incorporating a decentralized data governance model. Using a data catalog as a single source of truth allows different teams to agree on the company KPIs and metrics definitions. A data catalog not only ensures consistency, but when the contributors are the subject matter experts, it provides a direct source of accurate data. The process of creating and maintaining a data catalog can make data governance much smoother when the following steps are taken:
1. Get a better understanding of how data is being consumed today
In order to build a data catalog, you need to understand how data is being used in the organization today. This can be derived from usage statistics on each data asset, user, and team. When you know what data is being accessed (and when), which dashboards are being used (or not being used), and who is actively engaging with existing data (and how), you can determine the following:
- Which datasets need to be classified or documented first?
- Which data pipelines or models are most critical to monitor for quality?
- Which datasets or dashboards can be archived?
Knowing the popularity of data tables, columns, and dashboards makes it easier to identify what needs to be deprecated, documented, and organized. When you have large amounts of data, you can prioritize what to focus on first by looking at what is most or least used.
Exploring trends in how different people in the organization use data makes it easier to delegate and distribute whatever data management work is needed. Moreover, a good understanding of data operations cuts through the noise and the endless manual stewardship work that arises when semantic context is missing or outdated.
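To make this concrete, here is a minimal sketch of the prioritization step, assuming you can export a 30-day query log from your warehouse as (user, asset) pairs. All names here are hypothetical, purely for illustration:

```python
from collections import Counter

# Hypothetical 30-day query log pulled from a warehouse's query
# history: (user, dataset) pairs. The names are illustrative only.
query_log = [
    ("ana", "sales.orders"), ("ben", "sales.orders"),
    ("ana", "sales.orders"), ("cara", "marketing.leads"),
    ("ben", "ops.legacy_snapshot"),
]

def rank_assets(log):
    """Return datasets sorted from most- to least-queried."""
    counts = Counter(asset for _, asset in log)
    return counts.most_common()

ranked = rank_assets(query_log)

# The most-used datasets are candidates to document and monitor first;
# the rarely-used ones are candidates to archive or deprecate.
document_first = ranked[0][0]
archive_candidates = [asset for asset, n in ranked if n == 1]
```

In practice the log would come from your warehouse's query-history tables rather than a hard-coded list, but the ranking logic is the same.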
2. Build a high-level framework to start organizing the data
In order to distribute the work of documenting and classifying data amongst multiple teams, it is important to define an easy framework that everyone can follow. There are three things we recommend including in that framework: tags, ownership, and standardized documentation.
a. Tags

Tags are a simple way to create an organization system for your data. By applying tags to your datasets, you attach semantic meaning that makes it easier to use datasets correctly and more often.
You can get flexibility and coverage by creating two main types of tags — category tags and status tags. Category tags define business units like Sales, Marketing, Operations, or product lines within the organization. They can be thought of as separate workspaces, where the same datasets can be shared between workspaces.
Status tags define how a dataset or field should be classified. Examples of status tags include To be deprecated, Certified, Sensitive, L0 / L1 / L2, Gold / Silver / Bronze, or PII. Tag names should be clear enough that anyone can understand what applying the tag means for how the data should be used or accessed.
We recommend keeping things simple by having fewer tags. This can also make things easier from a governance perspective. With a simple tag framework like category vs. status tags, regardless of which team a user belongs to, it is very easy for them to get a high-level understanding of a dataset.
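A tag framework like this can be kept deliberately small. As a sketch, with hypothetical category and status values drawn from the examples above:

```python
from enum import Enum

class CategoryTag(Enum):
    """Business units or product lines; act like shared workspaces."""
    SALES = "Sales"
    MARKETING = "Marketing"
    OPERATIONS = "Operations"

class StatusTag(Enum):
    """How a dataset or field should be classified and handled."""
    CERTIFIED = "Certified"
    TO_BE_DEPRECATED = "To be deprecated"
    SENSITIVE = "Sensitive"
    PII = "PII"

# A dataset can carry several tags, and the same dataset can appear
# in more than one category "workspace". Names are illustrative.
dataset_tags = {
    "sales.orders": {CategoryTag.SALES, StatusTag.CERTIFIED},
    "marketing.leads": {CategoryTag.MARKETING, StatusTag.PII},
}

def has_status(dataset, status):
    """Check whether a dataset carries a given status tag."""
    return status in dataset_tags.get(dataset, set())
```

Because there are only two tag types, a user from any team can answer "what is this dataset, and how should I treat it?" with a single lookup.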
b. Data ownership
By assigning owners to each dataset, you can distribute the work of documenting and maintaining datasets. The concept of data ownership or stewardship may seem like an extra responsibility for the data team, but encouraging top users of the data to participate in tagging and documentation can create a more collaborative environment.
We find it most useful to have business and technical owners for each dataset. Business owners are typical data stewards in charge of maintaining the correctness of data and its definition. They are usually the technical product manager or data analyst who designed the table. Technical owners are responsible for maintaining the data pipeline and quality. Some companies adopt legal owners for data ownership as well.
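The business/technical/legal split described above might be recorded like this (a minimal sketch; the field names and emails are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ownership:
    business_owner: str   # data steward: maintains definitions and correctness
    technical_owner: str  # maintains the pipeline and data quality
    legal_owner: Optional[str] = None  # optional, for compliance-heavy data

# Illustrative catalog entry mapping a dataset to its owners.
owners = {
    "sales.orders": Ownership(
        business_owner="ana@example.com",
        technical_owner="ben@example.com",
    ),
}
```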
c. Standardized documentation
Creating a documentation template for your data dictionary or metrics definition is another way to ensure the data is well-maintained. If you surface operational metadata such as top users, popular queries, or how many times a dataset has been used in the last 30 days, you can immediately provide a wealth of information to anyone looking to understand that dataset and how it is used.
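A documentation template might combine the curated description with that operational metadata. Here is one possible shape (field names are assumptions, not a standard):

```python
def build_doc_entry(name, description, columns, usage_stats):
    """Assemble a standardized data-dictionary entry.

    usage_stats is operational metadata (top users, recent query
    counts) gathered automatically from query history.
    """
    return {
        "name": name,
        "description": description,
        "columns": columns,  # {column_name: meaning}
        "top_users": usage_stats.get("top_users", []),
        "queries_last_30d": usage_stats.get("queries_last_30d", 0),
    }

# Illustrative entry for a hypothetical dataset.
entry = build_doc_entry(
    "sales.orders",
    "One row per customer order.",
    {"order_id": "Primary key", "amount_usd": "Order total in USD"},
    {"top_users": ["ana", "ben"], "queries_last_30d": 412},
)
```

Because the operational fields are filled in automatically, the only part a steward writes by hand is the description and column meanings.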
Having a simple, easy-to-apply framework based on tags, ownership, and standardized documentation allows different teams to collaborate more easily and ensure data is properly governed.
3. Automate data governance workflows
We’ve covered a number of ways you can organize and describe your data that can make data easier to find, understand, and use. In order to make this data governance strategy we’ve outlined scalable and maintainable, it is necessary to find ways to automate the data governance workflow. Here are a few areas to target for automation:
a. Notification system
A notification system can help owners and top users (i.e., other analysts or engineers utilizing the table) stay on top of questions or changes to their datasets by letting them know automatically if their data needs attention. Notifications can also let someone know when they have been assigned as an owner and that they are responsible for documenting the dataset, ensuring its correctness, or being sure it is functional.
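A minimal routing rule for such a system might look like the following sketch, where the event names are assumptions chosen for illustration:

```python
def notifications_for(event, owners, top_users):
    """Decide who should be notified for a dataset event.

    Owners always hear about their datasets; active consumers
    (top users) are added only for events that affect them.
    """
    recipients = set(owners)
    if event in {"schema_changed", "quality_alert"}:
        recipients |= set(top_users)  # changes affect active consumers too
    elif event == "owner_assigned":
        pass  # only the newly assigned owners need to know
    return sorted(recipients)
```

For example, a schema change would notify both the owners and the analysts who query the table, while an ownership assignment would notify only the new owners.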
b. Tracking metadata changes
Create a system that can automatically identify metadata changes, such as when new datasets are created, raw datasets are added, or data descriptions or loading statuses change. Identifying datasets associated with a team or individual, or flagging data that might contain PII (especially with something explicit and simple, like tags), conveys how the data is used in a simple and efficient manner. When implemented properly, this can significantly reduce the burden of data governance and compliance.
Tracking metadata changes and then notifying users of those changes automatically can help you stay organized. For example, you may want to notify the owners of a table when the description changes so they can approve those changes.
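One simple way to detect such changes is to diff two metadata snapshots, as in this sketch (dataset names and fields are hypothetical):

```python
def diff_metadata(old, new):
    """Compare two metadata snapshots and report what changed.

    Each snapshot maps dataset name -> metadata dict. Returns a
    list of (change_type, dataset_name) tuples to feed into the
    notification system.
    """
    changes = []
    for name in new:
        if name not in old:
            changes.append(("created", name))
        elif old[name] != new[name]:
            changes.append(("updated", name))
    for name in old:
        if name not in new:
            changes.append(("deleted", name))
    return changes

# Yesterday's vs. today's snapshot (illustrative).
old_snapshot = {"sales.orders": {"description": "Orders."}}
new_snapshot = {
    "sales.orders": {"description": "One row per customer order."},
    "sales.refunds": {"description": "Refunds."},
}
changes = diff_metadata(old_snapshot, new_snapshot)
```

Each "updated" event could then trigger a notification to the dataset's owners so they can approve the change.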
c. Bulk updates
When building out your data catalog or updating your data, you’ll want to be able to apply changes in bulk to save time. If your data already has naming conventions or commonalities and you can find big chunks associated with a project or team, you can easily apply tags and owners to that data. Making it easy to update the owners of a dataset in bulk is not only useful when establishing ownership for the first time, but when projects change status or business groups are reorganized.
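When datasets follow naming conventions, a bulk update can be as simple as a glob match over the catalog. A sketch, with hypothetical dataset names and a made-up catalog shape:

```python
import fnmatch

# Illustrative catalog: dataset name -> mutable metadata.
catalog = {
    "sales_orders_raw": {"tags": set(), "owner": None},
    "sales_orders_daily": {"tags": set(), "owner": None},
    "marketing_leads": {"tags": set(), "owner": None},
}

def bulk_assign(catalog, pattern, tag=None, owner=None):
    """Apply a tag and/or owner to every dataset matching a name pattern."""
    matched = fnmatch.filter(catalog, pattern)
    for name in matched:
        if tag:
            catalog[name]["tags"].add(tag)
        if owner:
            catalog[name]["owner"] = owner
    return matched

# Tag and assign everything from the sales project in one pass.
bulk_assign(catalog, "sales_*", tag="Sales", owner="ana@example.com")
```

The same one-liner handles later reorganizations: when a project changes hands, re-run it with the new owner.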
Embracing a federated, decentralized model of data consumption
One of the core difficulties of modern data governance is deciding who should have access to what data. We've mentioned that some companies cope with this issue by granting everyone in the organization access to all data, but that presents significant security concerns. The inverse is also a problem: data access that is too restrictive prevents users from getting their work done.
Data can be organized based on shared knowledge and contributions to the catalog, resulting in a more manageable data model. Creating a system where the entire organization collectively contributes descriptions, tags, ownership, and other metadata and shares the task of maintaining it makes a data catalog easier to achieve. By automating parts of the process, it is even easier to get individuals and teams to agree to play a small part in creating a scalable, sustainable data model.
Allowing all users to search through that metadata means they can understand what data exists and how it is being used without having access to the data itself. Users gain greater clarity into what data they actually need access to, reducing the tension between data governance and data-driven decision making. If organizations embrace a federated, decentralized model of data access by building an effective data catalog, they can actually excel at data governance.