Select Star came out of stealth in late February of 2021 to solve the data discovery problem. We promised easy data discovery via automated data lineage, popularity, top users, and downstream dashboards. When we show prospective customers these features, we find they are immediately excited about the value our easy-to-use, intuitive platform can deliver with so little effort on their part. They also always ask us how it works, so we decided to unpack how Select Star works underneath:
- Data Source Integration
- Metadata Ingestion
- Query Parsing & Analysis — Data Lineage & Popularity
1. Data Source Integration
Select Star supports integrations with the modern data stack tools such as dbt, Snowflake, Google BigQuery, Amazon Redshift, and BI platforms (Tableau, Looker, Mode). You can find a complete list here; we’ll be adding more in the coming months.
Connecting any of these data sources to Select Star is easy to do through the UI — it’s a managed service, so there’s no setup or hosting required on your part. You’ll need admin access to create service accounts for Select Star, and once the right permissions are granted, it’s simple to just plug in authentication information and Select Star will run through its metadata analysis.
We use the official APIs and plugin libraries provided by each tool to authenticate a service user you create. The service users are only required to have the minimum credentials necessary to read your metadata and query history logs. In addition, when we connect to your data warehouses, we don’t require access to your data itself (we only need access to the system metadata and logs). By default, your data isn’t visible to us. Any of the system metadata we do touch is encrypted in transit and at rest. We’re SOC 2 Type II certified.
2. Metadata Ingestion
Once data is ingested into Select Star, we run a metadata sync every 24 hours allowing us to get updates as your data changes. We pick up the following changes in the metadata sync:
- Columns, tables, and schemas added or removed from databases
- SQL queries that have been run since the last sync
- BI dashboards, reports, or workbooks that have been created, modified, or removed
- Operational information about data changes, such as query execution times and last updated timestamps
3. Query Parsing & Metadata Analysis
All the metadata we ingest is processed in our metadata store and represented by our unified metadata model. Then our query parser will parse different SQL dialects, data model files (.lookml, .yaml files), and BI activity logs in order to generate a column-level data lineage and popularity score for each data object.
a. Data Lineage
Data lineage is available in Select Star for any dataset, so you can trace dependencies all the way from raw sources to data dashboards and vice versa. You can see data lineage by table level or column level. Rather than spending hours to find out where a value comes from, you’ll be able to easily see its entire downstream impact through your data model in seconds.
To create a data lineage model, we parse through all DDL and DML queries in your data warehouse or dbt models to see where the data connects. DDL & DML queries include CREATE_TABLE_AS_SELECT, CREATE_VIEW, or MERGE statements. For BI tools, we trace through queries and .lookml files to map the upstream database tables, BI models, and downstream dashboards.
b. Data Popularity
While data lineage shows the context of a specific table or a dashboard, popularity can put different pieces of datasets in perspective.
In order to calculate the popularity of your dataset, we look at SELECT queries and consider how many times a table was used and by how many people. Based on this calculation, our popularity model will assign a score for each table and column relative to all the other tables and columns of the data source. Similarly for BI dashboards, we consider the number of views by number of users in the BI tool.
If you’ve ever found yourself staring at documentation, posting in Slack channels wondering which specific table to use for your query, and trying to guess the right suspiciously-similar-named columns to use, you can understand why popularity is a useful measure. It lets you see what your team has actually done, in addition to the written word.
Select Star uses this popularity score to rank the tables, fields, and any search results, so you can always see the most important data first. Search results that span multiple tools have normalized popularity scores, so even if you’re looking for data with a commonly used name, you’ll see the most used dataset first.
This post is a high-level overview of some of Select Star’s capabilities and the mechanisms behind them. There’s a lot more to explore in the application, and plenty more to come as we continue to develop Select Star.
If this sounds interesting to you, please reach out to us to see a demo and learn more!