Python Solution

Performing the data cleaning / ETL using Pandas

Solution

Pandas :)

Private

Does someone want to be private

Source Table:
- Contacts Table

Solution:
1. Create procedure to add new column Private

Looking for notes

contacts.Remarks[~contacts.Remarks.isin([''])]
0                                 Is anonymous
6    Met in person on 5/9/2018 at Annual Event
8                      Electronic receipt only
Name: Remarks, dtype: object
contacts['Private'] = contacts.Remarks.apply(lambda x: True if x == 'Is anonymous' else False)

ContactType

is required and can only be Household or Organization

Source Table:
- Contacts Table

Solution:
1. Create procedure to add new column ContactType

contacts[['Number', 'CompanyName']]
Number CompanyName
0 653377813-7
1 390551098-7
2 093004505-X
3 729707142-0 A Company Co.
4 488464926-5
5 315297729-8
6 848348568-0
7 029456846-8
8 687119652-8
9 739131380-7
10 809975531-Y


contacts['ContactType'] = contacts.apply(lambda x: 'Household' if x['CompanyName'] == '' else 'Organization', axis=1)
contacts[['Number', 'CompanyName', 'ContactType']].head(3)
Number CompanyName ContactType
0 653377813-7 Household
1 390551098-7 Household
2 093004505-X Household

Postal Code

if address is present and is US, must be a valid zip code, either 12345 or 12345-1234

Source Table:
- Contacts

Solution:
1. Create procedure to remove any postal codees that doesn’t match the approved format from the usps

contacts['Postal'].head(3)
0    20535-871
1        89130
2             
Name: Postal, dtype: object
postal_code_pattern = '[0-9]{5}(?:-[0-9]{4})?$'

Examples

bool(re.match(postal_code_pattern, '32828'))
True
bool(re.match(postal_code_pattern, '328289'))
False
bool(re.match(postal_code_pattern, '3282'))
False
bool(re.match(postal_code_pattern, '32828-237'))
False
bool(re.match(postal_code_pattern, '32828-2379'))
True

Applying Transformation

contacts['Postal'] = contacts.Postal.apply(lambda x: x if re.match(postal_code_pattern, x) else '')
contacts['Postal'].head(3)
0         
1    89130
2         
Name: Postal, dtype: object

IsDeceased

can only be TRUE or FALSE

Source Table: - Contacts

Solution:
1. Create procedure to update Deceased to TRUE/FALSE

contacts.Deceased.head(3)
0    
1    
2    
Name: Deceased, dtype: object
contacts['Deceased'] = contacts.Deceased.apply(lambda x: True if x == 'Yes' else False)
contacts.Deceased.head(3)
0    False
1    False
2    False
Name: Deceased, dtype: bool

GiftType

Can only be Cash, Check, Credit, Other, or Reversing Transaction

Source Table: - Gifts

Solution:
1. Identify Incorrect Gift Types
2. Create procedure to replace invalid gift types

gifts[['AmountReceived', 'PaymentMethod']].tail(10)
AmountReceived PaymentMethod
21 4.21 credit card
22 9.28 cash
23 2.74 PayPal
24 9.00 money order
25 1.88 cash
26 -6.76 check
27 7.58 credit card
28 5.49 cash
29 8.93 money order
30 2.62 credit card
def clean_payment_type(row):
    payment_method = ''
    orginal_payment_method = str(row['PaymentMethod']).lower()
    
    if row['AmountReceived'] < 0:
        payment_method = 'Reversing Transaction'
    elif re.match('credit', orginal_payment_method):
        payment_method = 'Credit'
    elif orginal_payment_method in ['check', 'cash', 'reversing transaction']:
        payment_method = orginal_payment_method.title()
    else:
        payment_method = 'Other'

    return payment_method

source

clean_payment_type

 clean_payment_type (row)

Applying GiftType Transformation

gifts['PaymentMethod'] = gifts.apply(clean_payment_type, axis=1)
gifts.PaymentMethod.tail(10)
21                   Credit
22                     Cash
23                    Other
24                    Other
25                     Cash
26    Reversing Transaction
27                   Credit
28                     Cash
29                    Other
30                   Credit
Name: PaymentMethod, dtype: object

CreditCardType

Can only be Visa, Mastercard, AMEX, Discover

Solution:
1. Identify Incorrect Credit Types
2. Create procedure to replace invalid credit types

gifts.CreditCardType.unique()
array(['', 'American Ex', 'AMEX', 'Visa', 'Master card', 'Mastercard',
       'Discover'], dtype=object)

Creating functions to validate each of the credit card types


source

validate_mastercard

 validate_mastercard (string)

source

validate_amex

 validate_amex (string)

source

validate_visa

 validate_visa (string)

source

validate_discover

 validate_discover (string)

source

validate_credit_card

 validate_credit_card (string)
validate_credit_card('Amex')
'AMEX'
validate_credit_card('Master Card')
'Mastercard'
validate_credit_card('visa')
'Visa'
validate_credit_card('test')
''
validate_credit_card('Gold card')
''

Applying CreditCardType Transformation

gifts['CreditCardType'] = gifts.CreditCardType.apply(validate_credit_card)

Creating Final Datasets

Contact Methods

Table of constituent contact information

Merging Contact and Contact Methods Tables To get ALL the contact information

temp_contacts = contacts[['Number', 'Phone', 'EMail']].rename(columns={
    'Number': 'LegacyContactId',
    'Phone': 'HomePhone',
    'EMail': 'HomeEmail'
}).copy()
temp_contact_methods = contact_methods.rename(columns={'DonorNumber':'LegacyContactId'})
contacts_wide = temp_contacts.merge(temp_contact_methods,
                                how='left',
                                on='LegacyContactId',
                                ).sort_values('LegacyContactId')

Pivoting the Data, converting the HomePhone, HomeEmail, And Fax values to individual Type : Value pairs

contacts_v = contacts_wide.melt(id_vars=['LegacyContactId'], value_vars=['HomePhone', 'HomeEmail', 'Fax', 'Phone', 'EMail'], var_name='Type', value_name='Value')

Creating a function to safley check for nan values

def is_nan(x):
    try:
        return np.isnan(x)
    except TypeError:
        return False
final_contact_methods = contacts_v[~((contacts_v.Value == '') | (contacts_v.Value.apply(is_nan)))].reset_index(drop=True).copy()
def clean_contact_type(s):
    if s == 'Phone':
        return 'HomePhone'
    elif s == 'EMail':
        return 'HomeEmail'
    else:
        return s
final_contact_methods['Type'] = final_contact_methods.Type.apply(clean_contact_type)
final_contact_methods.drop_duplicates(inplace=True)
final_contact_methods.sort_values('LegacyContactId').reset_index(drop=True)
LegacyContactId Type Value
0 029456846-8 HomeEmail jdoley6@telegraph.co.uk
1 093004505-X Fax 818-156-7985
2 093004505-X HomePhone 818-323-9865
3 315297729-8 HomeEmail dmouncey9@cnn.com
4 390551098-7 HomeEmail ebenech1@goodreads.com
5 488464926-5 HomeEmail fcastille4@timesonline.co.uk
6 653377813-7 HomeEmail kklumbers@yahoo.com
7 653377813-7 HomePhone 832-442-4988
8 687119652-8 HomeEmail cmakepeace7@1688.com
9 739131380-7 Fax 626-981-3874
10 848348568-0 HomePhone 702-844-9524


final_contact_methods.to_csv('data/final_contact_methods.csv', index=False)

Contacts

Table of constituent contact information

Checking for any missing contacts

def is_empty(x):
    if str(x) == '' or str(x) == 'nan':
        return True
    else:
        return False
def format_string(x):
    if is_empty(x):
        return ''
    else: 
        return x
contacts_wide['HomePhone'] = np.where(contacts_wide['HomePhone'].apply(is_empty), format_string(contacts_wide['Phone']), format_string(contacts_wide['HomePhone']))
contacts_wide['HomeEmail'] = np.where(contacts_wide['HomeEmail'].apply(is_empty), format_string(contacts_wide['EMail']), format_string(contacts_wide['HomeEmail']))
contacts_wide = contacts_wide[['LegacyContactId', 'HomePhone', 'HomeEmail']].fillna('').copy()
contacts_wide = contacts_wide[~((contacts_wide.HomePhone == '') & (contacts_wide.HomeEmail == ''))].copy()
contacts = (contacts
            .drop(['Phone', 'EMail'], axis=1)
            .merge(contacts_wide,
                   left_on='Number',
                   right_on='LegacyContactId',
                   how='left',
            )
            .drop('LegacyContactId', axis=1)
            )
columns = [
    'LegacyContactId', 'LegacyIndividualId', 'ContactType', 'ContactName',
    'FirstName', 
    'LastName', 'SecondaryLegacyIndividualId', 'SecondaryFirstName',
    'SecondaryLastName', 'HomePhone', 'HomeEmail', 'Address1', 
    'City', 'State', 'PostalCode', 'IsPrivate', 'IsDeceased',
    ]
contacts.rename(columns={
    'Number':'LegacyContactId',
    'Street': 'Address1',
    'Postal': 'PostalCode',
    'Private': 'IsPrivate',
    'Deceased': 'IsDeceased',
}, inplace=True)
final_contacts = contacts[columns].fillna('')
final_contacts
LegacyContactId LegacyIndividualId ContactType ContactName FirstName LastName SecondaryLegacyIndividualId SecondaryFirstName SecondaryLastName HomePhone HomeEmail Address1 City State PostalCode IsPrivate IsDeceased
0 653377813-7 0 Household Karita & Kelvin Lumbers Karita Lumbers 1 Kelvin Lumbers 832-442-4988 kklumbers@yahoo.com 4 Bunting Parkway Washington DC True False
1 390551098-7 2 Household Helga Benech Helga Benech ebenech1@goodreads.com 48684 Jenifer Way Las Vegas NV 89130 False False
2 093004505-X 3 Household Masha Butt Gow Masha Butt Gow 818-323-9865 353 Schmedeman Park Indianapolis IN False False
3 729707142-0 4 Organization Cymbre Cross Cymbre Cross 2055 Lakewood Parkway Camden NJ False False
4 488464926-5 5 Household Hoyt Castille Hoyt Castille fcastille4@timesonline.co.uk 37 8th Trail Grand Rapids MI 49560 False False
5 315297729-8 6 Household Benedict Oscar & Idell Mouncey Benedict Oscar 7 Idell Mouncey dmouncey9@cnn.com 4225 Madison Ave Boise ID False False
6 848348568-0 8 Household Mannie Turpin Mannie Turpin 702-844-9524 NV False True
7 029456846-8 9 Household Romy Doley Romy Doley jdoley6@telegraph.co.uk 608 Old Shore Alley Marietta GA 30066 False False
8 687119652-8 10 Household Ruggiero Makepeace Ruggiero Makepeace cmakepeace7@1688.com 15 Sunbrook Center Omaha NE 68164 False False
9 739131380-7 11 Household Rosemaria & Rogelio Dimond Rosemaria Dimond 12 Rogelio Dimond Juneau AK False False
10 809975531-Y 13 Household Adeline Shakespeare Adeline Shakespeare False False


final_contacts.to_csv('data/final_contacts.csv', index=False)

Gifts

Table with gift history

columns = ['LegacyContactId', 'LegacyGiftId', 'GiftType', 'GiftDate',
           'GiftAmount', 'Notes', 'CreditCardType', 'Project1Code',
           'Project2Code', 'LegacyPledgeID']
gifts.rename(columns={
    'DonorNumber': 'LegacyContactId',
    'PaymentMethod': 'GiftType',
    'AmountReceived': 'GiftAmount',
}, inplace=True)
gifts[columns]
LegacyContactId LegacyGiftId GiftType GiftDate GiftAmount Notes CreditCardType Project1Code Project2Code LegacyPledgeID
0 848348568-0 95196378 Other 2019-03-04 4.1500 None 0
1 729707142-0 95196889 Check 2019-03-05 2.3648 ChildSponsorship None 1
2 687119652-8 95197689 Cash 2019-03-07 1.3100 None 2
3 653377813-7 95198998 Credit 2019-03-10 2.0400 In honor of Mannie Turpin AMEX None 3
4 390551098-7 95198999 Cash 2019-01-10 5.8000 None 89752384
5 848348568-0 95296677 Other 2019-03-20 9.2800 General ReliefFund 5
6 029456846-8 95298831 Check 2019-03-24 5.0000 ACH check #7687 None 6
7 093004505-X 95298845 Check 2019-04-09 4.8300 None 7
8 315297729-8 95298997 Check 2019-04-12 7.0000 SchoolSupplies2019 None 8
9 809975531-Y 9 Credit 2019-08-14 8.4800 AMEX None 9
10 739131380-7 95329966 Credit 2019-04-13 5.8400 Visa None 10
11 739131380-7 95330011 Other 2019-04-13 7.4500 None 11
12 029456846-8 95330012 Check 2019-04-17 8.1300 Mentorship2023 None 12
13 315297729-8 95330110 Reversing Transaction 2019-04-19 -3.0100 None 13
14 739131380-7 95330662 Other 2019-05-10 3.4000 None 14
15 687119652-8 95419562 Other 2019-06-04 5.0700 None 15
16 488464926-5 95422266 Cash 2019-06-05 5.4200 GeneralFund None 16
17 848348568-0 95485564 Other 2019-06-10 6.8000 None 57398862
18 390551098-7 95496635 Other 2019-06-11 6.7800 None 18
19 729707142-0 95497782 Other 2019-06-20 5.2700 None 19
20 315297729-8 20 Check 2019-06-20 5.5900 None 65139856
21 390551098-7 95763575 Credit 2019-07-01 4.2100 Mastercard None 21
22 488464926-5 95798342 Cash 2019-07-18 9.2800 None 22
23 848348568-0 95798343 Other 2019-07-01 2.7400 None 23
24 029456846-8 95801563 Other 2019-08-01 9.0000 None 24
25 729707142-0 95801564 Cash 2019-08-03 1.8800 None 25
26 687119652-8 95835492 Reversing Transaction 2019-08-12 -6.7600 None 26
27 809975531-Y 27 Credit 2019-08-14 7.5800 Mastercard Color run ChildSponsorship 27
28 653377813-7 28 Cash 2019-08-26 5.4900 None 28
29 739131380-7 96638462 Other 2019-09-01 8.9300 ReliefFund None 29
30 093004505-X 96638468 Credit 2019-09-06 2.6200 Discover None 30


final_gifts = gifts[columns].copy()
final_gifts.to_csv('data/final_gifts.csv', index=False)

Export

import nbdev
nbdev.nbdev_export('02_Pandas_Solution.ipynb')