Databricks vs dbt testing – Generic test
In the context of dbt, generic tests are reusable tests that can be applied across different models and columns without being rewritten each time. They cover common data quality checks, such as ensuring that values are not null, are unique, or fall within an allowed range.
In Databricks Delta Live Tables (DLT), similar functionality can be achieved with the @dlt.expect decorator, which lets you write generic data quality checks and apply them to multiple datasets or tables. While you won't have a built-in "test" system as in dbt, you can build these reusable tests manually in DLT and apply them across different tables.
Let’s explore how to create generic tests in Databricks DLT.
1. What Are Generic Tests?
In dbt, a generic test allows you to apply the same test logic across different models or tables without having to redefine it each time. For example, dbt provides built-in generic tests like:
- not_null: Ensures a column does not have null values.
- unique: Ensures a column has unique values (no duplicates).
- accepted_values: Ensures values in a column are from a predefined set of allowed values.
These tests are often reusable across many models, so you can define them once and reference them wherever needed.
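For comparison, this is roughly how those generic tests are attached to a model in a dbt schema.yml (the model and column names here are illustrative):

version: 2
models:
  - name: cleaned_data
    columns:
      - name: id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']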
In Databricks DLT, you can recreate similar functionality by defining generic helper functions that express these common data quality checks.
2. Creating Generic Tests in Databricks DLT
In DLT, you will create reusable test functions that build constraints for the @dlt.expect decorator, covering different types of checks (e.g., not null, unique, accepted values). These checks can then be used in multiple tables, just like dbt generic tests.
a) Generic Test for Non-Null Values
A generic non-null test can be created by defining a function that returns the SQL constraint for any given column; @dlt.expect then evaluates that constraint against every row of the table. You can apply the same function to multiple tables.
import dlt
from pyspark.sql.functions import col

# Generic test: builds a SQL constraint asserting that a column is not null.
# @dlt.expect takes an expectation name and a SQL expression string, so a
# reusable test is simply a function that builds that expression for any column.
def test_not_null(column_name):
    return f"{column_name} IS NOT NULL"

# Reusable DLT transformation with the non-null test applied to "id"
@dlt.table
@dlt.expect("id_not_null", test_not_null("id"))
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
    )
In this example:
- test_not_null is a generic function that builds a non-null constraint for any column.
- The resulting expectation is applied to the id column of the cleaned_data table.
You can use the same test_not_null function for other tables as well by just changing the column name.
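As a variant, the same helper works with @dlt.expect_or_drop, which removes the offending rows automatically so no manual filter is needed (the table name below is illustrative):

@dlt.table
@dlt.expect_or_drop("id_not_null", test_not_null("id"))
def cleaned_data_dropped():
    # Rows with a null "id" are removed by the expectation itself
    return dlt.read("raw_data").dropDuplicates(["id"])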
b) Generic Test for Uniqueness
Uniqueness is an aggregate property, so it cannot be written as a single row-level constraint on the table itself. A common pattern is a generic helper that produces per-key counts, published as a separate audit table whose expectation requires every count to be exactly 1.
from pyspark.sql.functions import count

# Generic test for uniqueness: builds the per-key occurrence counts for a
# table so an expectation can assert that every key appears exactly once
def test_unique(source_table, column_names):
    return (
        dlt.read(source_table)
        .groupBy(column_names)
        .agg(count("*").alias("occurrences"))
    )

# Audit table: the update fails if any id appears more than once
@dlt.table
@dlt.expect_or_fail("id_unique", "occurrences = 1")
def cleaned_data_id_unique_check():
    return test_unique("cleaned_data", "id")
Here, test_unique is a generic function that groups a table by the key column(s) and counts occurrences; the expect_or_fail expectation then asserts that no key appears more than once. It can be reused across tables by just passing a different table or column name.
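Because groupBy accepts a list of columns, the same helper also covers composite keys; a sketch against a hypothetical orders table:

# Composite-key uniqueness audit on a hypothetical "orders" table
@dlt.table
@dlt.expect_or_fail("order_line_unique", "occurrences = 1")
def orders_unique_check():
    return test_unique("orders", ["order_id", "line_number"])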
c) Generic Test for Accepted Values
If you want to ensure a column only contains a predefined set of values, you can write a generic test that builds the corresponding IN constraint.
# Generic test for accepted values: builds an IN (...) constraint
def test_accepted_values(column_name, accepted_values):
    quoted = ", ".join(f"'{v}'" for v in accepted_values)
    return f"{column_name} IN ({quoted})"

# Reusable DLT transformation with the accepted-values test on "status"
@dlt.table
@dlt.expect("status_valid", test_accepted_values("status", ["active", "inactive"]))
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("status").isNotNull())
    )
In this case:
- test_accepted_values builds a constraint that restricts a column to a set of allowed values.
- Here, it checks that the status column only contains either "active" or "inactive".
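Since the helper just builds a SQL expression string, you can sanity-check its output in a plain Python session:

print(test_accepted_values("status", ["active", "inactive"]))
# -> status IN ('active', 'inactive')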
3. Using These Generic Tests Across Multiple Tables
You can easily reuse these generic tests across different tables in your DLT pipeline by calling the appropriate test function and passing different column names or parameters.
Example: Reusing the not_null test across multiple tables
@dlt.table
@dlt.expect("id_not_null", test_not_null("id"))
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
    )

@dlt.table
@dlt.expect("email_not_null", test_not_null("email"))
def user_data():
    return (
        dlt.read("raw_data")
        .filter(col("email").isNotNull())
    )
In this example, we apply the test_not_null function to two different columns (id and email) in different tables (cleaned_data and user_data).
4. Combining Multiple Generic Tests
You can apply multiple generic tests to a single table by stacking expectation decorators. This is similar to how dbt allows you to combine multiple tests (e.g., not_null and unique) on a model.
@dlt.table
@dlt.expect("id_not_null", test_not_null("id"))
@dlt.expect("status_valid", test_accepted_values("status", ["active", "inactive"]))
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
    )
Here, we:
- Apply the non-null test on id.
- Apply the accepted-values test on status.
- Enforce uniqueness on id with dropDuplicates in the transformation body (the audit table from section 2b can verify it downstream).
With the default @dlt.expect, violations are recorded in the pipeline's metrics but rows still flow through; switch to the _or_drop or _or_fail variants if you want bad rows discarded or the update halted, ensuring that only clean and valid data moves forward.
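When a table carries many expectations, DLT also provides @dlt.expect_all, which takes a dictionary of named constraints; a sketch using the helpers above (the table name is illustrative):

@dlt.table
@dlt.expect_all({
    "id_not_null": test_not_null("id"),
    "status_valid": test_accepted_values("status", ["active", "inactive"]),
})
def cleaned_data_bundled():
    return dlt.read("raw_data").dropDuplicates(["id"])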
5. How to Handle Errors and Failures
- If an expectation declared with @dlt.expect_or_fail is violated, the pipeline stops processing and the error message indicates which check failed; plain @dlt.expect only records the violation.
- You can use the Databricks UI or API to monitor your DLT pipeline runs, including how many rows violated each expectation.
- For example, if the id column has null values or duplicates under the expectations above, the failing check surfaces in the event log (or fails the update, with _or_fail), preventing bad data from slipping through unnoticed.
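The enforcement level is chosen per expectation via the decorator variant; a sketch showing all three side by side (the table name is illustrative):

@dlt.table
@dlt.expect("id_not_null", test_not_null("id"))                # record violations, keep all rows
@dlt.expect_or_drop("email_not_null", test_not_null("email"))  # drop violating rows
@dlt.expect_or_fail("status_valid", test_accepted_values("status", ["active", "inactive"]))  # fail the update
def monitored_data():
    return dlt.read("raw_data")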
6. Summary: Generic Tests in Databricks DLT
While Databricks DLT doesn't have a built-in system of reusable tests like dbt, you can easily implement similar functionality using the @dlt.expect family of decorators and plain Python helper functions. This allows you to create generic tests for common data quality checks (not null, unique, accepted values, etc.) and apply them across multiple tables in your pipeline.
Key Points:
- Generic tests in DLT can be implemented with helper functions like test_not_null, test_unique, and test_accepted_values.
- You can reapply these tests across multiple tables or columns, similar to how dbt applies tests across models.
- You can combine multiple tests on a single table, and with the _or_fail variants the pipeline will fail if any test does not pass.
This approach gives you flexible, reusable, and declarative data quality checks within Databricks DLT, which is analogous to the functionality provided by dbt’s generic tests.