Setup

Friday August 18th, 2023

Previewing The Initial Datasets

Loading Datasets using Pandas

contacts.head()
Number Company Name First Name Last Name Street City State Postal Phone E-mail Remarks Deceased?
0 653377813-7 Karita & Kelvin Lumbers 4 Bunting Parkway Washington DC 20535-871 kklumbers@ yahoo.co Is anonymous
1 390551098-7 Helga Benech 48684 Jenifer Way Las Vegas NV 89130 ebenech1@goodreads.com
2 093004505-X Masha 353 Schmedeman Park Indianapolis IN 577-374-96523
3 729707142-0 A Company Co. 2055 Lakewood Parkway Camden NJ 8104 No
4 488464926-5 Hoyt Castille 37 8th Trail Grand Rapids MI 49560 fcastille4@timesonline.co.uk No
gifts.head()
donor_number gift_id first_name last_name amount received date fund_id credit card type payment method pledge_number notes
0 848348568-0 95196378.0 Mannie Turpin $4.15 3/4/2019 PayPal
1 729707142-0 95196889.0 Cymbre Cross 2.3648 3/5/2019 ChildSponsorship check
2 687119652-8 95197689.0 Ruggiero Makepeace $1.31 3/7/2019 cash
3 653377813-7 95198998.0 Karita Lumbers $2.04 3/10/2019 American Ex credit card In honor of Mannie Turpin
4 390551098-7 95198999.0 Helga Benech $5.80 2019/1/10 cash 89752384.0
contact_methods.head()
donor_number Phone E-mail Fax
0 653377813-7 832-442-4988
1 390551098-7 ebenech1@goodreads.com
2 093004505-X 818-323-9865 818-156-7985
3 729707142-0
4 488464926-5 fcastille4@timesonline.co.uk

Inital Transformations

Transforming certain task on load in order to reduce redundency … and some work for me :)

Column Name Transformation


  • Nothing worse than malformed column names, amiright?


source

to_camel_case

 to_camel_case (s)

Converts a string to camel case.

Examples

stripping spaces

to_camel_case('this isa test')
'ThisIsaTest'

Stripping Special Characters

to_camel_case('this_is_a_test')
'ThisIsATest'
to_camel_case('hello!world_what_is_up?')
'HelloWorldWhatIsUp'

source

transform_cnames

 transform_cnames (df, func=<function to_camel_case>)

Apply a function to all column names of a dataframe. It may not always be the case that I want to apply to_camel_case as the column transformation function. Because of this func is an optional argument with a default value of to_camel_case

Apply a function to all column names of a dataframe

for df in [contact_methods, contacts, gifts]:
    transform_cnames(df)

Validating that the contacts column names are now CamelCase

contacts.columns
Index(['Number', 'CompanyName', 'FirstName', 'LastName', 'Street', 'City',
       'State', 'Postal', 'Phone', 'EMail', 'Remarks', 'Deceased'],
      dtype='object')

Column Type Transformation

I identified 2 columns on the gift table that should be ints

I’m going to convert them both to type int

int_cols = ['GiftId', 'PledgeNumber']
gifts[int_cols] = gifts[int_cols].replace({'':0}).astype(int)

AmountRecieved should be a float

Removing any special characters (besides dashes and periods) and converting to a float :)

gifts['AmountReceived'] = gifts.AmountReceived.apply(lambda x: float(re.sub(r'[^a-zA-Z0-9\.-]', '', x)))

Previewing change

gifts.AmountReceived.head(5)
0    4.1500
1    2.3648
2    1.3100
3    2.0400
4    5.8000
Name: AmountReceived, dtype: float64

Replacing ‘0’ values in PledgeNumber & GiftId to thier index so that each row is unique

gifts.loc[ gifts.PledgeNumber == 0, 'PledgeNumber'] = gifts[gifts.PledgeNumber == 0].index
gifts.loc[ gifts.GiftId == 0, 'GiftId'] = gifts[gifts.GiftId == 0].index
gifts[['PledgeNumber', 'GiftId']].head(5)
PledgeNumber GiftId
0 0 95196378
1 1 95196889
2 2 95197689
3 3 95198998
4 89752384 95198999

Renaming the columns to match virtuous requirements

gifts = gifts.rename(columns={'PledgeNumber': 'LegacyPledgeID', 'GiftId': 'LegacyGiftId'})

Misplaced Data Transformation

Identify and clean any pieces of data in the wrong column

contacts[['Phone', 'EMail']].head(3)
Phone EMail
0 kklumbers@ yahoo.co
1 ebenech1@goodreads.com
2 577-374-96523

Creating a function that uses regular expressions to identify a string as a phone or a email


source

classify_phone_email

 classify_phone_email (value)

Classify a value as a phone number, email, or None.

Examples

classify_phone_email('321-295-2502')
'phone'
classify_phone_email('warren.hyson5@gmail.com')
'email'
classify_phone_email('test321-29@gmail.com')
'email'
classify_phone_email('3212952502@gmail.com')
'email'
classify_phone_email('321-295-2502@gmail.com')
'email'

Applying Phone/Email Transformation

Looping over phone & email records.

If a phone and/or email are identified in the wrong column, I’m swapping thier values

contacts[['Phone', 'EMail']].head(3)
Phone EMail
0 kklumbers@ yahoo.co
1 ebenech1@goodreads.com
2 577-374-96523

Consolidate Contacts Table

  • Searching for missing users
  • Splitting / joining households

Checking for Missing records

Checking for any DonorNumbers that are NOT IN contacts Number…

(~gifts.DonorNumber.isin(contacts.Number.unique())).any()
True

Found 1!

Doing the same for contact_methods

(~contact_methods.DonorNumber.isin(contacts.Number.unique())).any()
False

Didn’t find any there
Going to extract the missing records from the gifts table

donors_not_in_contacts = gifts.loc[~gifts.DonorNumber.isin(contacts.Number.unique()), :]
donors_not_in_contacts
DonorNumber LegacyGiftId FirstName LastName AmountReceived Date FundId CreditCardType PaymentMethod LegacyPledgeID Notes
9 809975531-Y 9 Adeline Shakespeare 8.48 8/14/2019 AMEX credit card 9
27 809975531-Y 27 Adeline Shakespeare 7.58 8/14/2019 Color run, ChildSponsorship Mastercard credit card 27

Adding Shakespeare to our contacts!

contacts = pd.concat([
    contacts,
    # Dataframe of donors not in contacts
    pd.DataFrame(donors_not_in_contacts[['DonorNumber', 'FirstName', 'LastName']]
                 .drop_duplicates()
                 .rename(columns={'DonorNumber': 'Number'})
                 .drop_duplicates()
                 .to_dict('records'))
])

Spliting rows with Multiple People

contacts.FirstName.head(1)
0    Karita & Kelvin
Name: FirstName, dtype: object


Split the names on ’ & ’ or ’ and ’, then expand the resulting lists into new rows

contacts[['FirstName', 'SecondaryFirstName']] = contacts['FirstName'].str.split(' & | and ', expand=True).fillna('')

Aditionally, going to look for any records where there’s a duplicated Number

The following Python code will return True for rows where the Number was deemed to be a duplicate:

contacts.Number.duplicated()

For all the rows that are duplicates, I’m going to:

1. Place them in a seperate variable
2. Remove them from the original dataset
3. Add them as the 2nd Person on the non-duplicated row
4. Ensure they have proper first & last names

records_to_join = contacts.loc[contacts.Number.duplicated(), :].to_dict(orient='records')

Removing duplicates from contacts before joining

contacts = contacts.loc[~contacts.Number.duplicated(), :]

Adding dupliates as secondary contacts

for record in records_to_join:
    contacts.loc[contacts.Number.isin([record['Number']]), ['SecondaryFirstName', 'SecondaryLastName']] = [record['FirstName'], record['LastName']]

Adding Secondary Last Name where it was previously missing

contacts['SecondaryLastName'] = contacts.apply(lambda x: x['LastName'] if x['SecondaryLastName'] == '' and x['SecondaryFirstName'] != '' else x['SecondaryLastName'], axis=1)

Updating unique identifiers

Ensuring LegacyIndividualId is present for all contacts & is unique

Initalizing empty collumns

Adding id since none was provided

id = 0
for index, row in contacts.iterrows():
    contacts.loc[index, 'LegacyIndividualId'] = id
    id += 1
    if row['SecondaryFirstName'] != '':
        contacts.loc[index, 'SecondaryLegacyIndividualId'] = id
        id += 1

Cleaning Contact First & Last Names

Cleaning Records with blank first or last names

Checking for records where FirstName and/or LastName is blank

blank_name_records = ((contacts.FirstName == '') | (contacts.LastName == ''))

Previewing the blank name records

contacts.loc[((contacts.FirstName == '') | (contacts.LastName == '')), :]
Number CompanyName FirstName LastName Street City State Postal Phone EMail Remarks Deceased SecondaryFirstName SecondaryLastName LegacyIndividualId SecondaryLegacyIndividualId
2 093004505-X Masha 353 Schmedeman Park Indianapolis IN 577-374-96523 3
3 729707142-0 A Company Co. 2055 Lakewood Parkway Camden NJ 8104 No 4
7 029456846-8 608 Old Shore Alley Marietta GA 30066 jdoley6@telegraph.co.uk 9

Before I delete the records I’m going to check if the names are present on the gift table

I’m going to start by getting the unique Numbers that the records belong too

blank_name_numbers= contacts.loc[blank_name_records, 'Number']
gifts.loc[gifts.DonorNumber.isin(blank_name_numbers), ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates().head(5)
DonorNumber FirstName LastName
1 729707142-0 Cymbre Cross
6 029456846-8 Romy Doley
7 093004505-X Masha Butt Gow
19 729707142-0 Cymbre

The names are present on the gifts table!

gift_name_records = gifts.loc[gifts.DonorNumber.isin(blank_name_numbers), ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates()

Removing the invalid record

gift_name_records = gift_name_records.loc[((gift_name_records.FirstName != '') & (gift_name_records.LastName != '')), :]
gift_name_records
DonorNumber FirstName LastName
1 729707142-0 Cymbre Cross
6 029456846-8 Romy Doley
7 093004505-X Masha Butt Gow

Updating the records that previousuly had a blank first or last name

for _, row in gift_name_records.iterrows():
    contacts.loc[contacts['Number'] == row['DonorNumber'], ['FirstName', 'LastName']] = [row['FirstName'], row['LastName']]

All the records valid names now!

contacts.loc[blank_name_records, :]
Number CompanyName FirstName LastName Street City State Postal Phone EMail Remarks Deceased SecondaryFirstName SecondaryLastName LegacyIndividualId SecondaryLegacyIndividualId
2 093004505-X Masha Butt Gow 353 Schmedeman Park Indianapolis IN 577-374-96523 3
3 729707142-0 A Company Co. Cymbre Cross 2055 Lakewood Parkway Camden NJ 8104 No 4
7 029456846-8 Romy Doley 608 Old Shore Alley Marietta GA 30066 jdoley6@telegraph.co.uk 9

Add ContactName Column

contacts.head(3)
Number CompanyName FirstName LastName Street City State Postal Phone EMail Remarks Deceased SecondaryFirstName SecondaryLastName LegacyIndividualId SecondaryLegacyIndividualId
0 653377813-7 Karita Lumbers 4 Bunting Parkway Washington DC 20535-871 kklumbers@ yahoo.co Is anonymous Kelvin Lumbers 0 1
1 390551098-7 Helga Benech 48684 Jenifer Way Las Vegas NV 89130 ebenech1@goodreads.com 2
2 093004505-X Masha Butt Gow 353 Schmedeman Park Indianapolis IN 577-374-96523 3

Creating a function to update the contact name depending on the individuals on the household record

def set_contact_name(row):
    """Sets the contact name for a given row in the dataframe."""
    if row['LastName'] == row['SecondaryLastName']:
        return row['FirstName'] + ' & ' + row['SecondaryFirstName'] + ' ' + row['LastName']
    elif row['SecondaryFirstName'] != '':
        return row['FirstName'] +  ' ' + row['LastName'] + ' & ' + row['SecondaryFirstName'] + ' ' + row['SecondaryLastName']
    else:
        return row['FirstName'] + ' ' + row['LastName']

source

set_contact_name

 set_contact_name (row)

Sets the contact name for a given row in the dataframe.

contacts['ContactName'] = contacts.apply(set_contact_name, axis=1)

Previewing result

contacts[['ContactName', 'FirstName', 'LastName', 'SecondaryFirstName', 'SecondaryLastName']].head(5)
ContactName FirstName LastName SecondaryFirstName SecondaryLastName
0 Karita & Kelvin Lumbers Karita Lumbers Kelvin Lumbers
1 Helga Benech Helga Benech
2 Masha Butt Gow Masha Butt Gow
3 Cymbre Cross Cymbre Cross
4 Hoyt Castille Hoyt Castille

ProjectCode Transformation

Splitting FundId

Splitting the FundId column by ','

project_codes = gifts.FundId.str.split(', ', expand=True)
project_codes.head(6)
0 1
0 None
1 ChildSponsorship None
2 None
3 None
4 None
5 General ReliefFund

Assigning the values to the Project1 & 2 columns

gifts[['Project1Code', 'Project2Code']] = project_codes

Removing FundId from the dataset

gifts = gifts.loc[:, gifts.columns.drop('FundId')].copy()

Date Transformation

Formating date

df['Date'].head(5)
0     3/4/2019
1     3/5/2019
2     3/7/2019
3    3/10/2019
4    2019/1/10
Name: Date, dtype: object


Since Dates are in different formats I’m going to create a custom parser


source

custom_parser

 custom_parser (date_str)

Parses a date string into a datetime object.

Examples

custom_parser('4/14/1999')
datetime.datetime(1999, 4, 14, 0, 0)
custom_parser('1999/04/14')
datetime.datetime(1999, 4, 14, 0, 0)

Transforming Dates

gifts['GiftDate'] = gifts['Date'].apply(custom_parser)

Dropping old date columns

gifts = gifts.loc[:, gifts.columns.drop('Date')].copy()

Email Transformation

Validate emails

Previewing the Email column

contacts.EMail.head(1)
0    kklumbers@ yahoo.co
Name: EMail, dtype: object


It looks like the first email ‘kklumbers@ yahoo.co’ is invalid and should be fixed

I’m going to write a function to validate emails and fix common mistakes


source

valid_email

 valid_email (s)

Validates an email address.

Examples

valid_email('warren.hyson5@gmail.com')
True
valid_email('warren.hyson5@gmail')
False
valid_email('warren.hyson5@.com')
False
valid_email('warren.hyson5@gmailcom')
False
valid_email('warren.hyson5@@gmail.com')
False


Creating a second function that attempts to fix common email mistakes like mispelled domains


source

fix_email

 fix_email (email)

Fixes common mistakes in email addresses.

Examples

correct

fix_email('warren.hyson5@gmail.com')
'warren.hyson5@gmail.com'

gmail spelled wrong

fix_email('warren.hyson5@gmial.com')
'warren.hyson5@gmail.com'

com spelled wrong

fix_email('warren.hyson5@gmial.con')
'warren.hyson5@gmail.com'

Applying Email Transformation

Previewing

contacts.EMail.head(1).apply(fix_email)
0    kklumbers@yahoo.com
Name: EMail, dtype: object
contacts['EMail'] = contacts.EMail.apply(fix_email)

Phone # Transformation

Fix phone numbers

There appears to be a invalid phone number

contacts.loc[2, 'Phone']
'577-374-96523'

Writing a function to validate US phone numbers, assuming the phone numbers provided are from the US


source

validate_us_phone_number

 validate_us_phone_number (phone_number)

Validates a US phone number and returns it if valid, otherwise returns an empty string.

Examples

validate_us_phone_number('321-295-2502')
'321-295-2502'

no-dashes 10 digit

validate_us_phone_number('3212952502')
'3212952502'

no-dashes 7 digit

validate_us_phone_number('2952502')
'2952502'

no-dashes 11 digit

validate_us_phone_number('13212952502')
''

With Parenthesese

validate_us_phone_number("(123) 456-7890")
'(123) 456-7890'

Random Text

validate_us_phone_number('407-555-1234 ext. 1234')
''

Applying Phone # Transformation

contacts['Phone'] = contacts.Phone.apply(validate_us_phone_number)
contacts.fillna('', inplace=True)

Results

contacts.head(5)
Number CompanyName FirstName LastName Street City State Postal Phone EMail Remarks Deceased SecondaryFirstName SecondaryLastName LegacyIndividualId SecondaryLegacyIndividualId ContactName
0 653377813-7 Karita Lumbers 4 Bunting Parkway Washington DC 20535-871 kklumbers@yahoo.com Is anonymous Kelvin Lumbers 0 1 Karita & Kelvin Lumbers
1 390551098-7 Helga Benech 48684 Jenifer Way Las Vegas NV 89130 ebenech1@goodreads.com 2 Helga Benech
2 093004505-X Masha Butt Gow 353 Schmedeman Park Indianapolis IN 3 Masha Butt Gow
3 729707142-0 A Company Co. Cymbre Cross 2055 Lakewood Parkway Camden NJ 8104 No 4 Cymbre Cross
4 488464926-5 Hoyt Castille 37 8th Trail Grand Rapids MI 49560 fcastille4@timesonline.co.uk No 5 Hoyt Castille
gifts.head(5)
DonorNumber LegacyGiftId FirstName LastName AmountReceived CreditCardType PaymentMethod LegacyPledgeID Notes Project1Code Project2Code GiftDate
0 848348568-0 95196378 Mannie Turpin 4.1500 PayPal 0 None 2019-03-04
1 729707142-0 95196889 Cymbre Cross 2.3648 check 1 ChildSponsorship None 2019-03-05
2 687119652-8 95197689 Ruggiero Makepeace 1.3100 cash 2 None 2019-03-07
3 653377813-7 95198998 Karita Lumbers 2.0400 American Ex credit card 3 In honor of Mannie Turpin None 2019-03-10
4 390551098-7 95198999 Helga Benech 5.8000 cash 89752384 None 2019-01-10
contacts.head(5)
Number CompanyName FirstName LastName Street City State Postal Phone EMail Remarks Deceased SecondaryFirstName SecondaryLastName LegacyIndividualId SecondaryLegacyIndividualId ContactName
0 653377813-7 Karita Lumbers 4 Bunting Parkway Washington DC 20535-871 kklumbers@yahoo.com Is anonymous Kelvin Lumbers 0 1 Karita & Kelvin Lumbers
1 390551098-7 Helga Benech 48684 Jenifer Way Las Vegas NV 89130 ebenech1@goodreads.com 2 Helga Benech
2 093004505-X Masha Butt Gow 353 Schmedeman Park Indianapolis IN 3 Masha Butt Gow
3 729707142-0 A Company Co. Cymbre Cross 2055 Lakewood Parkway Camden NJ 8104 No 4 Cymbre Cross
4 488464926-5 Hoyt Castille 37 8th Trail Grand Rapids MI 49560 fcastille4@timesonline.co.uk No 5 Hoyt Castille

Export

import nbdev
nbdev.nbdev_export('00_Setup.ipynb')