Databricks vs dbt testing – Singular test
In Databricks Delta Live Tables (DLT), singular tests can be implemented using the @dlt.expect decorator. A singular test in dbt is a one-off SQL query, saved as a .sql file under the tests directory, that returns the rows violating a rule; it is typically used for specific data quality checks such as nulls, duplicates, or referential integrity in a particular field.
In DLT, you can express similar checks as part of your transformation logic by attaching @dlt.expect expectations to your tables. These expectations assert that the data adheres to certain rules; by default a violation is recorded in the pipeline's data quality metrics, while the stricter variants @dlt.expect_or_drop and @dlt.expect_or_fail drop the offending rows or fail the update, the latter being the closest match to a failing dbt test.
Singular Tests in dbt vs. Databricks DLT
In dbt, you might write a singular test like this:
- Not Null Test (dbt):
-- dbt: Check for nulls in a column
select *
from {{ ref('my_model') }}
where my_column is null
- Unique Test (dbt):
-- dbt: Check for duplicates
select my_column, count(*)
from {{ ref('my_model') }}
group by my_column
having count(*) > 1
In Databricks DLT, similar tests can be written using the @dlt.expect decorator inside your pipeline code.
How to Implement Singular Tests in Databricks DLT
Let’s walk through how you can replicate singular tests for common checks such as non-null, unique, and other conditions in DLT.
1. Not Null Test (Singular Test)
To check whether a column contains any null values, use @dlt.expect with a SQL expression that asserts the column is not null. This is a singular test on a single column.
Example: Not Null Test in DLT
import dlt
from pyspark.sql.functions import col

# Singular test: the 'id' column must not contain nulls
@dlt.table
@dlt.expect("id_not_null", "id IS NOT NULL")
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
    )
- Explanation:
  - @dlt.expect("id_not_null", "id IS NOT NULL"): a singular test that checks whether the id column contains any null values. The constraint is a SQL expression string evaluated against each row.
  - With plain @dlt.expect, rows that violate the rule are still written; the violation count is recorded in the pipeline's data quality metrics. If you want the run to stop instead, use @dlt.expect_or_fail, as sketched just after this explanation.
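A minimal sketch of the stricter variant, assuming the same illustrative raw_data source (the table name cleaned_data_strict is just for illustration): @dlt.expect_or_fail stops the pipeline update as soon as a row violates the constraint, which is the closest match to a failing dbt test.
import dlt

# Fail the pipeline update if any row has a null 'id'
@dlt.table
@dlt.expect_or_fail("id_not_null", "id IS NOT NULL")
def cleaned_data_strict():
    return dlt.read("raw_data")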
2. Unique Test (Singular Test)
You can also implement a uniqueness check, similar to dbt's unique test, but it works a little differently: DLT expectations are evaluated row by row, so the rule is attached to an aggregated helper table rather than written as an aggregate condition on the table itself.
Example: Unique Test in DLT
import dlt
from pyspark.sql.functions import col, count

# Cleaned table: de-duplicate on 'id' as part of the transformation
@dlt.table
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
    )

# Singular test: each 'id' in the source must appear exactly once
@dlt.table
@dlt.expect_or_fail("id_unique", "row_count = 1")
def id_unique_check():
    return (
        dlt.read("raw_data")
        .groupBy("id")
        .agg(count("*").alias("row_count"))
    )
- Explanation:
  - DLT evaluates expectation constraints per row, so you cannot put an aggregate such as COUNT(DISTINCT id) directly into @dlt.expect. Instead, the uniqueness rule lives on a small aggregated check table (id_unique_check).
  - The expectation "row_count = 1" asserts that every id occurs exactly once; because it uses @dlt.expect_or_fail, a duplicate id stops the pipeline update, mirroring a failing dbt unique test.
3. Range Validity Test (Singular Test)
If you want to check if a column contains values within a specific range, this can be implemented as a singular test.
Example: Range Test (e.g., Age)
import dlt
from pyspark.sql.functions import col

# Singular test: 'age' must fall within the valid range (18 to 100)
@dlt.table
@dlt.expect("valid_age_range", "age BETWEEN 18 AND 100")
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("age").isNotNull())
    )
- Explanation:
  - @dlt.expect("valid_age_range", "age BETWEEN 18 AND 100"): this singular test checks whether the age column contains values between 18 and 100 (inclusive).
  - Rows outside this range are flagged in the pipeline's data quality metrics; switch to @dlt.expect_or_drop or @dlt.expect_or_fail if such rows should be dropped or should stop the run.
4. Check for Accepted Values (Singular Test)
For some columns, you might want to ensure that the values belong to an allowed set. For example, a status column might only be allowed to contain "active", "inactive", or "pending".
Example: Accepted Values Test (Status)
import dlt
from pyspark.sql.functions import col

# Singular test: 'status' must contain only accepted values
@dlt.table
@dlt.expect("status_valid", "status IN ('active', 'inactive', 'pending')")
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("status").isNotNull())
    )
- Explanation:
  - @dlt.expect("status_valid", "status IN ('active', 'inactive', 'pending')"): this singular test ensures that the status column contains only the values "active", "inactive", or "pending".
  - Any other value is counted as a violation in the pipeline metrics; use @dlt.expect_or_drop or @dlt.expect_or_fail for stricter handling.
5. Combining Multiple Singular Tests
You can also stack multiple expectations on a single table to enforce several data quality rules at once. This is similar to attaching several tests to a single model in dbt.
Example: Combining Singular Tests
import dlt
from pyspark.sql.functions import col

@dlt.table
@dlt.expect("id_not_null", "id IS NOT NULL")                                # Check for nulls in 'id'
@dlt.expect("valid_age_range", "age BETWEEN 18 AND 100")                    # Check that 'age' is within range
@dlt.expect("status_valid", "status IN ('active', 'inactive', 'pending')")  # Check accepted 'status' values
def cleaned_data():
    return (
        dlt.read("raw_data")
        .filter(col("id").isNotNull())
        .dropDuplicates(["id"])
        .filter((col("age") >= 18) & (col("age") <= 100))
    )
- Explanation:
  - Here, three singular tests are stacked on one table:
    - id must not be null.
    - age must fall within the valid range of 18 to 100.
    - status must be one of the accepted values.
  - The uniqueness check from section 2 is kept on its own aggregated check table, because expectation constraints are evaluated per row.
With plain @dlt.expect, a violated rule is recorded under its name in the pipeline's data quality metrics while processing continues; switch any of these decorators to @dlt.expect_or_fail if a violation should stop the update and surface an error naming the failed expectation. If you prefer to group several rules into a single decorator, @dlt.expect_all accepts a dictionary of name/constraint pairs, as sketched below.
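A minimal sketch of the same row-level rules grouped with @dlt.expect_all (the table and column names are the illustrative ones used throughout this post):
import dlt

# Group several row-level rules into one decorator; each rule is
# reported under its own name in the pipeline's data quality metrics.
@dlt.table
@dlt.expect_all({
    "id_not_null": "id IS NOT NULL",
    "valid_age_range": "age BETWEEN 18 AND 100",
    "status_valid": "status IN ('active', 'inactive', 'pending')",
})
def cleaned_data():
    return dlt.read("raw_data")
There are matching variants (@dlt.expect_all_or_drop and @dlt.expect_all_or_fail) if the grouped rules should drop offending rows or fail the update.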
Handling Test Failures in DLT
- When an expectation defined with @dlt.expect is violated, the pipeline keeps running; the violation is counted in the pipeline's event log and surfaced in the Databricks UI.
- The @dlt.expect_or_drop variant drops the offending rows, and @dlt.expect_or_fail stops the update with an error naming the failed expectation, which is the closest match to a failing dbt test.
- You can use the Databricks UI or API to monitor pipeline runs and see which expectations were violated during execution.
If you want to capture errors or take action after a failure, you can also implement custom handling inside your pipeline logic; a short sketch of the three expectation severities follows.
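A minimal side-by-side sketch of the three severities, using the same illustrative columns (the table name cleaned_data_with_policies is hypothetical):
import dlt

# Same decorator family, three behaviors when a row violates the constraint:
@dlt.table
@dlt.expect("age_known", "age IS NOT NULL")                        # record the violation, keep the row
@dlt.expect_or_drop("valid_age_range", "age BETWEEN 18 AND 100")   # drop offending rows
@dlt.expect_or_fail("id_not_null", "id IS NOT NULL")               # fail the pipeline update
def cleaned_data_with_policies():
    return dlt.read("raw_data")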
Summary: Singular Tests in Databricks DLT
In Databricks Delta Live Tables (DLT), you can implement singular tests using the @dlt.expect decorator to enforce data quality checks on specific columns, similar to dbt's singular tests.
- Not Null: Check if a column contains null values.
- Unique: Ensure there are no duplicate values in a column.
- Range Validity: Check if values in a column fall within a defined range.
- Accepted Values: Ensure that column values belong to a predefined set of valid values.
These singular tests can be combined to apply multiple checks on a single table or column, and you choose per rule whether a violation is only recorded, drops the offending rows, or fails the pipeline.
By using @dlt.expect, you can enforce data quality rules within your DLT pipeline, ensuring that your data is validated and cleaned before further processing.