Dataframe Comparison

dataset comparison

Create 2 dataframe with some differences¶

In [129]:

data1={'custno':[100,101,102,102,103],
      'name':['Tim','Alex','Max','Max','Corey'],
      'team':['a','b',None,None,'c'],
      'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023','1/1/2023']
    }

In [130]:

data2={'custno':[100,101,102,103,104],
      'name':['Tim','Alex','Mac','Corey','Jack'],
      'team':['a','b','z','c','d']
    }

In [131]:

import pandas as pd
df1=pd.DataFrame(data1)
df1

Out[131]:

	custno	name	team	started
0	100	Tim	a	1/1/2023
1	101	Alex	b	1/1/2023
2	102	Max	None	1/1/2023
3	102	Max	None	1/1/2023
4	103	Corey	c	1/1/2023

In [132]:

df2=pd.DataFrame(data2)
df2

Out[132]:

	custno	name	team
0	100	Tim	a
1	101	Alex	b
2	102	Mac	z
3	103	Corey	c
4	104	Jack	d

compare 2 dataframes – Basic Comparision¶

In [133]:

df1.equals(df2)

Out[133]:

False

compare the same 2 dataframes using datacompy Compare¶

In [134]:

import datacompy

In [135]:

compare = datacompy.Compare(
df1,
df2,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level for numeric columns
rel_tol=0, #tolerance level for numeric columns
df1_name='original',
df2_name='new')

#You can pass in on_index=True instead of join_columns to join on the index instead.

#By default the comparison needs to match values exactly, 
#but you can pass in abs_tol and/or rel_tol to apply absolute and/or relative tolerances for numeric columns.

In [136]:

print(compare.matches())

False

In [139]:

print(compare.df1_unq_rows) 
#extra rows from df1 but not in df2, here duplicate row shows up along wihh temporary index

   custno name  team   started
3     102  Max  None  1/1/2023

In [140]:

print(compare.df2_unq_rows) 
#extra rows from df2 but not in df1

   custno  name team
5     104  Jack    d

In [141]:

print(compare.intersect_rows) 
#column by colum compare after ignorning the mismatch rows from 2 df's.
#This shows up exact column mismatch

   custno name_df1 team_df1   started name_df2 team_df2 _merge  name_match  \
0     100      Tim        a  1/1/2023      Tim        a   both        True   
1     101     Alex        b  1/1/2023     Alex        b   both        True   
2     102      Max     None  1/1/2023      Mac        z   both       False   
4     103    Corey        c  1/1/2023    Corey        c   both        True   

   team_match  
0        True  
1        True  
2       False  
4        True

In [78]:

print(compare.df1_unq_columns() ) #any extra fields that are not in df2

OrderedSet(['started'])

In [79]:

print(compare.df2_unq_columns() ) #any extra fields that are not in df1

OrderedSet()

In [88]:

print(compare.intersect_columns()) # to get the overlapping columns

OrderedSet(['custno', 'name', 'team'])

In [90]:

#Generate the output (in the form of report )
print(compare.report())

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        4     5
1       new        3     5

Column Summary
--------------

Number of columns in common: 3
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: custno
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 4
Number of rows in original but not in new: 1
Number of rows in new but not in original: 1

Number of rows with some compared columns unequal: 1
Number of rows with all compared columns equal: 3

Column Comparison
-----------------

Number of columns compared with some values unequal: 2
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 2

Columns with Unequal Values or Types
------------------------------------

  Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0   name         object    object          1         0            0
1   team         object    object          1         0            1

Sample Rows with Unequal Values
-------------------------------

   custno name (original) name (new)
2     102             Max        Mac

   custno team (original) team (new)
2     102            None          z

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

   custno name  team   started
3     102  Max  None  1/1/2023

Sample Rows Only in new (First 10 Columns)
------------------------------------------

   custno  name team
5     104  Jack    d

we can pass in ignore_extra_columns=True to ignore non matching column and not return False(It will still check on overlapping columns)¶

Matching scenario¶

In [95]:

data3={'custno':[100,101,102,103],
      'name':['Tim','Alex','Max','Corey'],
      'team':['a','b',None,'c'],
      'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023']
    }

In [96]:

data4={'custno':[100,101,102,103],
      'name':['Tim','Alex','Mac','Corey'],
      'team':['a','b',None,'c']
      
    }

#3rd record anme Mac and Max is different

In [97]:

df3=pd.DataFrame(data3)
df3

Out[97]:

	custno	name	team	started
0	100	Tim	a	1/1/2023
1	101	Alex	b	1/1/2023
2	102	Max	None	1/1/2023
3	103	Corey	c	1/1/2023

In [98]:

df4=pd.DataFrame(data4)
df4

Out[98]:

	custno	name	team
0	100	Tim	a
1	101	Alex	b
2	102	Mac	None
3	103	Corey	c

In [103]:

import datacompy
compare = datacompy.Compare(
df3,
df4,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level
rel_tol=0,
df1_name='original',
df2_name='new')

In [104]:

print(compare.matches(ignore_extra_columns=True))

False

All are same except an additional field – Use ignore_extra_columns option¶

In [105]:

data5={'custno':[100,101,102,103],
      'name':['Tim','Alex','Max','Corey'],
      'team':['a','b',None,'c'],
      'started':['1/1/2023','1/1/2023','1/1/2023','1/1/2023']
    }
data6={'custno':[100,101,102,103],
      'name':['Tim','Alex','Max','Corey'],
      'team':['a','b',None,'c']
    }

In [106]:

df5=pd.DataFrame(data5)
df6=pd.DataFrame(data6)

In [107]:

import datacompy
compare = datacompy.Compare(
df5,
df6,
join_columns='custno', #You can also specify a list of columns
abs_tol=0.0001, #tolerance level
rel_tol=0,
df1_name='original',
df2_name='new')

In [110]:

print(compare.matches()) # 1 field is extra hence it has a difference

False

In [111]:

print(compare.matches(ignore_extra_columns=True)) # Ignore the extra field and compare the overlappoing fields

True

Summary¶

DataComPy will try to join two dataframes either on a list of join columns, or on indexes.
If the two dataframes have duplicates based on join values,
the match process sorts by the remaining fields and joins based on that row number.

Column-wise comparisons attempt to match values even when dtypes don’t match.
So if, for example, you have a column with decimal.
Decimal values in one dataframe and an identically-named column with float64 dtype in another,
it will tell you that the dtypes are different but will still try to compare the values.

In [124]:

pd.testing.assert_frame_equal(df5, df6)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-124-36c86a5d766d> in <module>
----> 1 pd.testing.assert_frame_equal(df5, df6)

    [... skipping hidden 2 frame]

AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (4, 4)
[right]: (4, 3)

In [125]:

data7={'custno':[100,101,102,103],
      'name':['Tim','Alex','Max','Corey'],
      'team':['a','b',None,'c']
    }
data8={'custno':[100,101,102,103],
      'name':['Tim','Alex','Max','Corey'],
      'team':['a','b',None,'c']
    }
df7=pd.DataFrame(data7)
df8=pd.DataFrame(data8)
pd.testing.assert_frame_equal(df7, df8)

Reference¶

A similar comparison can also be done in spark

df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

compare = datacompy.SparkCompare(spark, df1, df2, join_columns=[‘custno’])

Top Guides – Data School