A well-functioning Data Catalog or Metadata Store with accurate, up-to-date technical and business metadata is the dream of everybody in a data team. Even if we start with one, it is not easy to keep the metadata up to date. Things change fast, and nobody cares about keeping metadata documented.
Keeping metadata accurate is only possible (like many other things) with organization-level collaboration among people, including data producers. That is exactly why I think OpenMetadata will work: the tool places people at the core of everything.
OpenMetadata is an all-in-one platform for data and team collaboration, data discovery, data lineage, data quality, observability, and governance. It is one of the fastest growing open-source projects. It acts as a single place for users to discover and collaborate on all data.
How to start
I would recommend installing it locally using docker-compose. Go to the releases page, download the docker-compose file, and bring the containers up. There is also a Python-based CLI tool for installation, but it is just a wrapper around docker-compose.
Once the Docker stack is up, check the containers. If the Elasticsearch container exits with code 137, that indicates a memory error. Increase the memory available to the Docker daemon or shut down other unused containers.
If you can't use a Docker-based installation, try the sandbox environment or sign up for the free trial of the SaaS version.
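Why 137 points to memory: exit codes above 128 conventionally mean "128 plus the signal number", so 137 corresponds to signal 9 (SIGKILL), which is what an out-of-memory kill sends. Here is a minimal Python sketch of that arithmetic. The helper name describe_exit_code is hypothetical and is not part of Docker or OpenMetadata.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} (signal number {signum})"
    return f"exited with application error {code}"


print(describe_exit_code(137))
```

A SIGKILL does not always mean an out-of-memory condition, so treat this as a hint and confirm it by checking the memory limits of your Docker setup.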
Once the UI is up, click on the icon shown below in red and it will take you through important features of the tool.
Next steps
Add Tables And More
As a next step, you can try adding some tables from your Lake or Warehouse. OpenMetadata supports all popular Lake and Warehouse solutions. Select yours from the list here; there is a step-by-step guide with UI screenshots.
- You can specify table/schema filters to include or exclude.
- You can schedule the ingestion pipeline to refresh the metadata.
- A new version is created whenever the metadata is updated, so you can see the history of changes.
- The tool uses an internal Airflow instance for all types of scheduled runs. The Airflow UI is exposed via port 8080.
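To make the include/exclude filters more concrete, here is a small plain-Python sketch of how such filtering can work. It assumes the filters are regular expressions and that a name is kept when it matches at least one include pattern (or no include patterns are given) and matches no exclude pattern. This is an illustration, not OpenMetadata's implementation. The function name is_selected, the sample table names, and the precedence between includes and excludes are all assumptions, so check the connector documentation for the exact behaviour.

```python
import re
from typing import Iterable


def is_selected(name: str, includes: Iterable[str] = (), excludes: Iterable[str] = ()) -> bool:
    """Return True if `name` passes the include/exclude regex filters."""
    includes = list(includes)
    # If include patterns are given, the name must match at least one of them.
    if includes and not any(re.match(pattern, name) for pattern in includes):
        return False
    # The name must not match any exclude pattern.
    return not any(re.match(pattern, name) for pattern in excludes)


# Hypothetical table names, for illustration only.
tables = ["sales_orders", "sales_tmp_load", "hr_employees"]
selected = [
    t for t in tables
    if is_selected(t, includes=[r"^sales_.*"], excludes=[r".*_tmp_.*"])
]
print(selected)  # ['sales_orders']
```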
Tables are just the start. The tool supports many more services, such as:
- Dashboard Services: Looker, Metabase, Mode, PowerBI, Redash, Superset, Tableau
- Pipeline Services: Airbyte, Airflow, Glue Pipeline, Fivetran, Dagster, Domo Pipeline
- ML Model: MLflow
- Metadata Services: Amundsen, Atlas
Create user hierarchy and add users
The tool supports a rich organizational team hierarchy through teamType, which can be Organization, BusinessUnit, Division, Department, or Group (the default team type).
Organization is the root team in the hierarchy. BusinessUnit is the next level. Division is the level below BusinessUnit. Department is the level under Division. Group is the last level in the hierarchy; it can have only Users.
Create teams and users according to your organization's requirements. Ideally this should cover all teams, not only the data team. Read more about it in the official documentation. Roles and Policies can be used to manage access for all teams and users easily.
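The ordering of team types can be pictured with a short Python sketch. It encodes only the simplified rule described above: each type sits below the previous one, and a Group holds users rather than other teams. The names TeamType and can_be_child are hypothetical, and the real nesting rules in OpenMetadata may be more permissive (for example, allowing teams of the same type to nest), so treat this as a mental model rather than the tool's validation logic.

```python
from enum import IntEnum


class TeamType(IntEnum):
    """Team types ordered from the root (0) to the lowest level (4)."""
    ORGANIZATION = 0
    BUSINESS_UNIT = 1
    DIVISION = 2
    DEPARTMENT = 3
    GROUP = 4


def can_be_child(parent: TeamType, child: TeamType) -> bool:
    """Simplified assumption: a child team must be of a strictly lower level."""
    if parent is TeamType.GROUP:
        # A Group is the last level and contains only users, not teams.
        return False
    return child > parent


print(can_be_child(TeamType.ORGANIZATION, TeamType.BUSINESS_UNIT))  # True
print(can_be_child(TeamType.DIVISION, TeamType.BUSINESS_UNIT))      # False
print(can_be_child(TeamType.GROUP, TeamType.GROUP))                 # False
```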
Adding Metadata
Getting the required technical metadata is quite easy: once network connectivity and connection credentials are in place, OpenMetadata takes care of fetching those details at the scheduled frequency.
The next step is to add the actual business meaning of all entities (tables, columns, pipelines, etc.):
- Assign an owner to every entity (start with a few key tables or pipelines).
- Ask the owners to update the metadata from the tool itself. Create tasks, have conversations, and collaborate within the tool itself; read more here.
Data Profiling and quality checks
We can add Data Profiling at the table level and Data Quality checks at the column level.
Both can be done through the UI: see here for the data profiler workflow configuration, and see here for adding test suites and test cases. Profiler and test suite runs can be scheduled.
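If profiling and column-level checks are new to you, the following plain-Python sketch shows the kind of thing they compute. It is not how OpenMetadata implements them. The function names profile_column and check_not_null, the metric names, and the sample data are all made up for illustration.

```python
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]]) -> Dict[str, float]:
    """Compute a few basic profile metrics for one column."""
    total = len(values)
    nulls = sum(v is None for v in values)
    distinct = len({v for v in values if v is not None})
    return {
        "row_count": total,
        "null_count": nulls,
        "null_ratio": nulls / total if total else 0.0,
        "distinct_count": distinct,
    }


def check_not_null(values: List[Optional[Any]]) -> bool:
    """A column-level quality check: passes only if no value is missing."""
    return all(v is not None for v in values)


# Hypothetical column values.
emails = ["a@example.com", None, "b@example.com", "a@example.com"]
print(profile_column(emails))
print(check_not_null(emails))  # False, because one value is missing
```

A profiler run produces metrics like these on a schedule, and a test case turns an expectation about a column into a pass/fail result.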
Notifications for test failures and more
We can configure webhooks for Slack or MS Teams to get notifications for events such as a test failure. This can be done via the UI; see here for the steps.
Data Lineage
Data lineage can be added manually via the UI or the API, or the tool can derive it from the source if the source supports it. Lineage can include any entity, such as Tables, Pipelines, and Dashboards. Tables can also have column-level lineage. Read more here.
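To give a feel for the API route, here is a hedged Python sketch that prepares (but does not send) a request adding a table-to-table lineage edge. The endpoint path (PUT /api/v1/lineage), the body field names, the local server address on port 8585, and the placeholder IDs and token are assumptions based on my understanding of the REST API. Verify them against the API documentation for your version before using anything like this.

```python
import json
import urllib.request

# Assumed address of a local OpenMetadata server; adjust to your deployment.
BASE_URL = "http://localhost:8585/api/v1"


def build_lineage_edge(from_table_id: str, to_table_id: str) -> dict:
    """Build the body for an add-lineage request between two tables (assumed schema)."""
    return {
        "edge": {
            "fromEntity": {"id": from_table_id, "type": "table"},
            "toEntity": {"id": to_table_id, "type": "table"},
        }
    }


def build_request(payload: dict, token: str) -> urllib.request.Request:
    """Prepare, without sending, a PUT request to the assumed lineage endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/lineage",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="PUT",
    )


# Placeholders: replace with real entity IDs and a real token.
payload = build_lineage_edge("<source-table-uuid>", "<target-table-uuid>")
request = build_request(payload, token="<jwt-token>")
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
```

Once the endpoint and body are confirmed for your version, the prepared request could be sent with urllib.request.urlopen(request).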
Conclusion
OpenMetadata is not just a tool; it also provides an open standard for all metadata definitions. I think it has the potential to become the de facto standard for metadata management. I will cover more advanced concepts, such as automating lineage using the APIs, in upcoming articles.