Dataframe Comparison
Create 2 dataframe with some differences¶
In [129]:
data1={'custno':[100,101,102,102,103],
'name':['Tim','Alex','Max','Max','Corey'],
'team':['a','b',None,None,'c'],
'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023','1/1/2023']
}
In [130]:
data2={'custno':[100,101,102,103,104],
'name':['Tim','Alex','Mac','Corey','Jack'],
'team':['a','b','z','c','d']
}
In [131]:
import pandas as pd
df1=pd.DataFrame(data1)
df1
Out[131]:
custno | name | team | started | |
---|---|---|---|---|
0 | 100 | Tim | a | 1/1/2023 |
1 | 101 | Alex | b | 1/1/2023 |
2 | 102 | Max | None | 1/1/2023 |
3 | 102 | Max | None | 1/1/2023 |
4 | 103 | Corey | c | 1/1/2023 |
In [132]:
df2=pd.DataFrame(data2)
df2
Out[132]:
custno | name | team | |
---|---|---|---|
0 | 100 | Tim | a |
1 | 101 | Alex | b |
2 | 102 | Mac | z |
3 | 103 | Corey | c |
4 | 104 | Jack | d |
compare 2 dataframes – Basic Comparision¶
In [133]:
df1.equals(df2)
Out[133]:
False
compare the same 2 dataframes using datacompy Compare¶
In [134]:
import datacompy
In [135]:
compare = datacompy.Compare(
df1,
df2,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level for numeric columns
rel_tol=0, #tolerance level for numeric columns
df1_name='original',
df2_name='new')
#You can pass in on_index=True instead of join_columns to join on the index instead.
#By default the comparison needs to match values exactly,
#but you can pass in abs_tol and/or rel_tol to apply absolute and/or relative tolerances for numeric columns.
In [136]:
print(compare.matches())
False
In [139]:
print(compare.df1_unq_rows)
#extra rows from df1 but not in df2, here duplicate row shows up along wihh temporary index
custno name team started 3 102 Max None 1/1/2023
In [140]:
print(compare.df2_unq_rows)
#extra rows from df2 but not in df1
custno name team 5 104 Jack d
In [141]:
print(compare.intersect_rows)
#column by colum compare after ignorning the mismatch rows from 2 df's.
#This shows up exact column mismatch
custno name_df1 team_df1 started name_df2 team_df2 _merge name_match \ 0 100 Tim a 1/1/2023 Tim a both True 1 101 Alex b 1/1/2023 Alex b both True 2 102 Max None 1/1/2023 Mac z both False 4 103 Corey c 1/1/2023 Corey c both True team_match 0 True 1 True 2 False 4 True
In [78]:
print(compare.df1_unq_columns() ) #any extra fields that are not in df2
OrderedSet(['started'])
In [79]:
print(compare.df2_unq_columns() ) #any extra fields that are not in df1
OrderedSet()
In [88]:
print(compare.intersect_columns()) # to get the overlapping columns
OrderedSet(['custno', 'name', 'team'])
In [90]:
#Generate the output (in the form of report )
print(compare.report())
DataComPy Comparison -------------------- DataFrame Summary ----------------- DataFrame Columns Rows 0 original 4 5 1 new 3 5 Column Summary -------------- Number of columns in common: 3 Number of columns in original but not in new: 1 Number of columns in new but not in original: 0 Row Summary ----------- Matched on: custno Any duplicates on match values: Yes Absolute Tolerance: 0.0001 Relative Tolerance: 0 Number of rows in common: 4 Number of rows in original but not in new: 1 Number of rows in new but not in original: 1 Number of rows with some compared columns unequal: 1 Number of rows with all compared columns equal: 3 Column Comparison ----------------- Number of columns compared with some values unequal: 2 Number of columns compared with all values equal: 1 Total number of values which compare unequal: 2 Columns with Unequal Values or Types ------------------------------------ Column original dtype new dtype # Unequal Max Diff # Null Diff 0 name object object 1 0 0 1 team object object 1 0 1 Sample Rows with Unequal Values ------------------------------- custno name (original) name (new) 2 102 Max Mac custno team (original) team (new) 2 102 None z Sample Rows Only in original (First 10 Columns) ----------------------------------------------- custno name team started 3 102 Max None 1/1/2023 Sample Rows Only in new (First 10 Columns) ------------------------------------------ custno name team 5 104 Jack d
we can pass in ignore_extra_columns=True to ignore non matching column and not return False(It will still check on overlapping columns)¶
Matching scenario¶
In [95]:
data3={'custno':[100,101,102,103],
'name':['Tim','Alex','Max','Corey'],
'team':['a','b',None,'c'],
'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023']
}
In [96]:
data4={'custno':[100,101,102,103],
'name':['Tim','Alex','Mac','Corey'],
'team':['a','b',None,'c']
}
#3rd record anme Mac and Max is different
In [97]:
df3=pd.DataFrame(data3)
df3
Out[97]:
custno | name | team | started | |
---|---|---|---|---|
0 | 100 | Tim | a | 1/1/2023 |
1 | 101 | Alex | b | 1/1/2023 |
2 | 102 | Max | None | 1/1/2023 |
3 | 103 | Corey | c | 1/1/2023 |
In [98]:
df4=pd.DataFrame(data4)
df4
Out[98]:
custno | name | team | |
---|---|---|---|
0 | 100 | Tim | a |
1 | 101 | Alex | b |
2 | 102 | Mac | None |
3 | 103 | Corey | c |
In [103]:
import datacompy
compare = datacompy.Compare(
df3,
df4,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level
rel_tol=0,
df1_name='original',
df2_name='new')
In [104]:
print(compare.matches(ignore_extra_columns=True))
False
All are same except an additional field – Use ignore_extra_columns option¶
In [105]:
data5={'custno':[100,101,102,103],
'name':['Tim','Alex','Max','Corey'],
'team':['a','b',None,'c'],
'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023']
}
data6={'custno':[100,101,102,103],
'name':['Tim','Alex','Max','Corey'],
'team':['a','b',None,'c']
}
In [106]:
df5=pd.DataFrame(data5)
df6=pd.DataFrame(data6)
In [107]:
import datacompy
compare = datacompy.Compare(
df5,
df6,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level
rel_tol=0,
df1_name='original',
df2_name='new')
In [110]:
print(compare.matches()) # 1 field is extra hence it has a difference
False
In [111]:
print(compare.matches(ignore_extra_columns=True)) # Ignore the extra field and compare the overlappoing fields
True
Summary¶
DataComPy will try to join two dataframes either on a list of join columns, or on indexes.
If the two dataframes have duplicates based on join values,
the match process sorts by the remaining fields and joins based on that row number.
Column-wise comparisons attempt to match values even when dtypes don’t match.
So if, for example, you have a column with decimal.
Decimal values in one dataframe and an identically-named column with float64 dtype in another,
it will tell you that the dtypes are different but will still try to compare the values.
In [124]:
pd.testing.assert_frame_equal(df5, df6)
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) <ipython-input-124-36c86a5d766d> in <module> ----> 1 pd.testing.assert_frame_equal(df5, df6) [... skipping hidden 2 frame] AssertionError: DataFrame are different DataFrame shape mismatch [left]: (4, 4) [right]: (4, 3)
In [125]:
data7={'custno':[100,101,102,103],
'name':['Tim','Alex','Max','Corey'],
'team':['a','b',None,'c']
}
data8={'custno':[100,101,102,103],
'name':['Tim','Alex','Max','Corey'],
'team':['a','b',None,'c']
}
df7=pd.DataFrame(data7)
df8=pd.DataFrame(data8)
pd.testing.assert_frame_equal(df7, df8)
Reference¶
A similar comparison can also be done in spark
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)
compare = datacompy.SparkCompare(spark, df1, df2, join_columns=[‘custno’])