Setup

Friday August 18th, 2023

Previewing The Initial Datasets

Loading Datasets using Pandas

contacts.head()

	Number	Company Name	First Name	Last Name	Street	City	State	Postal	Phone	E-mail	Remarks	Deceased?
0	653377813-7		Karita & Kelvin	Lumbers	4 Bunting Parkway	Washington	DC	20535-871	kklumbers@ yahoo.co		Is anonymous
1	390551098-7		Helga	Benech	48684 Jenifer Way	Las Vegas	NV	89130		ebenech1@goodreads.com
2	093004505-X		Masha		353 Schmedeman Park	Indianapolis	IN			577-374-96523
3	729707142-0	A Company Co.			2055 Lakewood Parkway	Camden	NJ	8104				No
4	488464926-5		Hoyt	Castille	37 8th Trail	Grand Rapids	MI	49560		fcastille4@timesonline.co.uk		No

gifts.head()

	donor_number	gift_id	first_name	last_name	amount received	date	fund_id	credit card type	payment method	pledge_number	notes
0	848348568-0	95196378.0	Mannie	Turpin	$4.15	3/4/2019			PayPal
1	729707142-0	95196889.0	Cymbre	Cross	2.3648	3/5/2019	ChildSponsorship		check
2	687119652-8	95197689.0	Ruggiero	Makepeace	$1.31	3/7/2019			cash
3	653377813-7	95198998.0	Karita	Lumbers	$2.04	3/10/2019		American Ex	credit card		In honor of Mannie Turpin
4	390551098-7	95198999.0	Helga	Benech	$5.80	2019/1/10			cash	89752384.0

contact_methods.head()

	donor_number	Phone	E-mail	Fax
0	653377813-7	832-442-4988
1	390551098-7		ebenech1@goodreads.com
2	093004505-X	818-323-9865		818-156-7985
3	729707142-0
4	488464926-5		fcastille4@timesonline.co.uk

Inital Transformations

Transforming certain task on load in order to reduce redundency … and some work for me :)

Column Name Transformation

Nothing worse than malformed column names, amiright?

source

to_camel_case

 to_camel_case (s)

Converts a string to camel case.

Examples

stripping spaces

to_camel_case('this isa test')

'ThisIsaTest'

Stripping Special Characters

to_camel_case('this_is_a_test')

'ThisIsATest'

to_camel_case('hello!world_what_is_up?')

'HelloWorldWhatIsUp'

source

transform_cnames

 transform_cnames (df, func=<function to_camel_case>)

Apply a function to all column names of a dataframe. It may not always be the case that I want to apply to_camel_case as the column transformation function. Because of this func is an optional argument with a default value of to_camel_case

Apply a function to all column names of a dataframe

for df in [contact_methods, contacts, gifts]:
    transform_cnames(df)

Validating that the contacts column names are now CamelCase

contacts.columns

Index(['Number', 'CompanyName', 'FirstName', 'LastName', 'Street', 'City',
       'State', 'Postal', 'Phone', 'EMail', 'Remarks', 'Deceased'],
      dtype='object')

Column Type Transformation

I identified 2 columns on the gift table that should be ints

I’m going to convert them both to type int

int_cols = ['GiftId', 'PledgeNumber']

gifts[int_cols] = gifts[int_cols].replace({'':0}).astype(int)

AmountRecieved should be a float

Removing any special characters (besides dashes and periods) and converting to a float :)

gifts['AmountReceived'] = gifts.AmountReceived.apply(lambda x: float(re.sub(r'[^a-zA-Z0-9\.-]', '', x)))

Previewing change

gifts.AmountReceived.head(5)

0    4.1500
1    2.3648
2    1.3100
3    2.0400
4    5.8000
Name: AmountReceived, dtype: float64

Replacing ‘0’ values in PledgeNumber & GiftId to thier index so that each row is unique

gifts.loc[ gifts.PledgeNumber == 0, 'PledgeNumber'] = gifts[gifts.PledgeNumber == 0].index
gifts.loc[ gifts.GiftId == 0, 'GiftId'] = gifts[gifts.GiftId == 0].index

gifts[['PledgeNumber', 'GiftId']].head(5)

	PledgeNumber	GiftId
0	0	95196378
1	1	95196889
2	2	95197689
3	3	95198998
4	89752384	95198999

Renaming the columns to match virtuous requirements

gifts = gifts.rename(columns={'PledgeNumber': 'LegacyPledgeID', 'GiftId': 'LegacyGiftId'})

Misplaced Data Transformation

Identify and clean any pieces of data in the wrong column

contacts[['Phone', 'EMail']].head(3)

	Phone	EMail
0	kklumbers@ yahoo.co
1		ebenech1@goodreads.com
2		577-374-96523

Creating a function that uses regular expressions to identify a string as a phone or a email

source

classify_phone_email

 classify_phone_email (value)

Classify a value as a phone number, email, or None.

Examples

classify_phone_email('321-295-2502')

'phone'

classify_phone_email('warren.hyson5@gmail.com')

'email'

classify_phone_email('test321-29@gmail.com')

'email'

classify_phone_email('3212952502@gmail.com')

'email'

classify_phone_email('321-295-2502@gmail.com')

'email'

Applying Phone/Email Transformation

Looping over phone & email records.

If a phone and/or email are identified in the wrong column, I’m swapping thier values

contacts[['Phone', 'EMail']].head(3)

	Phone	EMail
0		kklumbers@ yahoo.co
1		ebenech1@goodreads.com
2	577-374-96523

Consolidate Contacts Table

Searching for missing users
Splitting / joining households

Checking for Missing records

Checking for any DonorNumbers that are NOT IN contacts Number…

(~gifts.DonorNumber.isin(contacts.Number.unique())).any()

True

Found 1!

Doing the same for contact_methods

(~contact_methods.DonorNumber.isin(contacts.Number.unique())).any()

False

Didn’t find any there
Going to extract the missing records from the gifts table

donors_not_in_contacts = gifts.loc[~gifts.DonorNumber.isin(contacts.Number.unique()), :]

donors_not_in_contacts

	DonorNumber	LegacyGiftId	FirstName	LastName	AmountReceived	Date	FundId	CreditCardType	PaymentMethod	LegacyPledgeID	Notes
9	809975531-Y	9	Adeline	Shakespeare	8.48	8/14/2019		AMEX	credit card	9
27	809975531-Y	27	Adeline	Shakespeare	7.58	8/14/2019	Color run, ChildSponsorship	Mastercard	credit card	27

Adding Shakespeare to our contacts!

contacts = pd.concat([
    contacts,
    # Dataframe of donors not in contacts
    pd.DataFrame(donors_not_in_contacts[['DonorNumber', 'FirstName', 'LastName']]
                 .drop_duplicates()
                 .rename(columns={'DonorNumber': 'Number'})
                 .drop_duplicates()
                 .to_dict('records'))
])

Spliting rows with Multiple People

contacts.FirstName.head(1)

0    Karita & Kelvin
Name: FirstName, dtype: object

Split the names on ’ & ’ or ’ and ’, then expand the resulting lists into new rows

contacts[['FirstName', 'SecondaryFirstName']] = contacts['FirstName'].str.split(' & | and ', expand=True).fillna('')

Aditionally, going to look for any records where there’s a duplicated Number

The following Python code will return True for rows where the Number was deemed to be a duplicate:

contacts.Number.duplicated()

For all the rows that are duplicates, I’m going to:

1. Place them in a seperate variable
2. Remove them from the original dataset
3. Add them as the 2nd Person on the non-duplicated row
4. Ensure they have proper first & last names

records_to_join = contacts.loc[contacts.Number.duplicated(), :].to_dict(orient='records')

Removing duplicates from contacts before joining

contacts = contacts.loc[~contacts.Number.duplicated(), :]

Adding dupliates as secondary contacts

for record in records_to_join:
    contacts.loc[contacts.Number.isin([record['Number']]), ['SecondaryFirstName', 'SecondaryLastName']] = [record['FirstName'], record['LastName']]

Adding Secondary Last Name where it was previously missing

contacts['SecondaryLastName'] = contacts.apply(lambda x: x['LastName'] if x['SecondaryLastName'] == '' and x['SecondaryFirstName'] != '' else x['SecondaryLastName'], axis=1)

Updating unique identifiers

Ensuring LegacyIndividualId is present for all contacts & is unique

Initalizing empty collumns

Adding id since none was provided

id = 0
for index, row in contacts.iterrows():
    contacts.loc[index, 'LegacyIndividualId'] = id
    id += 1
    if row['SecondaryFirstName'] != '':
        contacts.loc[index, 'SecondaryLegacyIndividualId'] = id
        id += 1

Cleaning Contact First & Last Names

Cleaning Records with blank first or last names

Checking for records where FirstName and/or LastName is blank

blank_name_records = ((contacts.FirstName == '') | (contacts.LastName == ''))

Previewing the blank name records

contacts.loc[((contacts.FirstName == '') | (contacts.LastName == '')), :]

	Number	CompanyName	FirstName	Street	City	State	Postal	Phone	EMail	Deceased	LegacyIndividualId
2	093004505-X		Masha	353 Schmedeman Park	Indianapolis	IN		577-374-96523			3
3	729707142-0	A Company Co.		2055 Lakewood Parkway	Camden	NJ	8104			No	4
7	029456846-8			608 Old Shore Alley	Marietta	GA	30066		jdoley6@telegraph.co.uk		9

Before I delete the records I’m going to check if the names are present on the gift table

I’m going to start by getting the unique Numbers that the records belong too

blank_name_numbers= contacts.loc[blank_name_records, 'Number']

gifts.loc[gifts.DonorNumber.isin(blank_name_numbers), ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates().head(5)

	DonorNumber	FirstName	LastName
1	729707142-0	Cymbre	Cross
6	029456846-8	Romy	Doley
7	093004505-X	Masha	Butt Gow
19	729707142-0	Cymbre

The names are present on the gifts table!

gift_name_records = gifts.loc[gifts.DonorNumber.isin(blank_name_numbers), ['DonorNumber', 'FirstName', 'LastName']].drop_duplicates()

Removing the invalid record

gift_name_records = gift_name_records.loc[((gift_name_records.FirstName != '') & (gift_name_records.LastName != '')), :]
gift_name_records

	DonorNumber	FirstName	LastName
1	729707142-0	Cymbre	Cross
6	029456846-8	Romy	Doley
7	093004505-X	Masha	Butt Gow

Updating the records that previousuly had a blank first or last name

for _, row in gift_name_records.iterrows():
    contacts.loc[contacts['Number'] == row['DonorNumber'], ['FirstName', 'LastName']] = [row['FirstName'], row['LastName']]

All the records valid names now!

contacts.loc[blank_name_records, :]

	Number	CompanyName	FirstName	LastName	Street	City	State	Postal	Phone	EMail	Deceased	LegacyIndividualId
2	093004505-X		Masha	Butt Gow	353 Schmedeman Park	Indianapolis	IN		577-374-96523			3
3	729707142-0	A Company Co.	Cymbre	Cross	2055 Lakewood Parkway	Camden	NJ	8104			No	4
7	029456846-8		Romy	Doley	608 Old Shore Alley	Marietta	GA	30066		jdoley6@telegraph.co.uk		9

Add ContactName Column

contacts.head(3)

	Number	FirstName	LastName	Street	City	State	Postal	Phone	EMail	Remarks	SecondaryFirstName	SecondaryLastName	LegacyIndividualId	SecondaryLegacyIndividualId
0	653377813-7	Karita	Lumbers	4 Bunting Parkway	Washington	DC	20535-871		kklumbers@ yahoo.co	Is anonymous	Kelvin	Lumbers	0	1
1	390551098-7	Helga	Benech	48684 Jenifer Way	Las Vegas	NV	89130		ebenech1@goodreads.com				2
2	093004505-X	Masha	Butt Gow	353 Schmedeman Park	Indianapolis	IN		577-374-96523					3

Creating a function to update the contact name depending on the individuals on the household record

def set_contact_name(row):
    """Sets the contact name for a given row in the dataframe."""
    if row['LastName'] == row['SecondaryLastName']:
        return row['FirstName'] + ' & ' + row['SecondaryFirstName'] + ' ' + row['LastName']
    elif row['SecondaryFirstName'] != '':
        return row['FirstName'] +  ' ' + row['LastName'] + ' & ' + row['SecondaryFirstName'] + ' ' + row['SecondaryLastName']
    else:
        return row['FirstName'] + ' ' + row['LastName']

source

set_contact_name

 set_contact_name (row)

Sets the contact name for a given row in the dataframe.

contacts['ContactName'] = contacts.apply(set_contact_name, axis=1)

Previewing result

contacts[['ContactName', 'FirstName', 'LastName', 'SecondaryFirstName', 'SecondaryLastName']].head(5)

	ContactName	FirstName	LastName	SecondaryFirstName	SecondaryLastName
0	Karita & Kelvin Lumbers	Karita	Lumbers	Kelvin	Lumbers
1	Helga Benech	Helga	Benech
2	Masha Butt Gow	Masha	Butt Gow
3	Cymbre Cross	Cymbre	Cross
4	Hoyt Castille	Hoyt	Castille

ProjectCode Transformation

Splitting FundId

Splitting the FundId column by ','

project_codes = gifts.FundId.str.split(', ', expand=True)

project_codes.head(6)

	0	1
0		None
1	ChildSponsorship	None
2		None
3		None
4		None
5	General	ReliefFund

Assigning the values to the Project1 & 2 columns

gifts[['Project1Code', 'Project2Code']] = project_codes

Removing FundId from the dataset

gifts = gifts.loc[:, gifts.columns.drop('FundId')].copy()

Date Transformation

Formating date

df['Date'].head(5)

0     3/4/2019
1     3/5/2019
2     3/7/2019
3    3/10/2019
4    2019/1/10
Name: Date, dtype: object

Since Dates are in different formats I’m going to create a custom parser

source

custom_parser

 custom_parser (date_str)

Parses a date string into a datetime object.

Examples

custom_parser('4/14/1999')

datetime.datetime(1999, 4, 14, 0, 0)

custom_parser('1999/04/14')

datetime.datetime(1999, 4, 14, 0, 0)

Transforming Dates

gifts['GiftDate'] = gifts['Date'].apply(custom_parser)

Dropping old date columns

gifts = gifts.loc[:, gifts.columns.drop('Date')].copy()

Email Transformation

Validate emails

Previewing the Email column

contacts.EMail.head(1)

0    kklumbers@ yahoo.co
Name: EMail, dtype: object

It looks like the first email ‘kklumbers@ yahoo.co’ is invalid and should be fixed

I’m going to write a function to validate emails and fix common mistakes

source

valid_email

 valid_email (s)

Validates an email address.

Examples

valid_email('warren.hyson5@gmail.com')

True

valid_email('warren.hyson5@gmail')

False

valid_email('warren.hyson5@.com')

False

valid_email('warren.hyson5@gmailcom')

False

valid_email('warren.hyson5@@gmail.com')

False

Creating a second function that attempts to fix common email mistakes like mispelled domains

source

fix_email

 fix_email (email)

Fixes common mistakes in email addresses.

Examples

correct

fix_email('warren.hyson5@gmail.com')

'warren.hyson5@gmail.com'

gmail spelled wrong

fix_email('warren.hyson5@gmial.com')

'warren.hyson5@gmail.com'

com spelled wrong

fix_email('warren.hyson5@gmial.con')

'warren.hyson5@gmail.com'

Applying Email Transformation

Previewing

contacts.EMail.head(1).apply(fix_email)

0    kklumbers@yahoo.com
Name: EMail, dtype: object

contacts['EMail'] = contacts.EMail.apply(fix_email)

Phone # Transformation

Fix phone numbers

There appears to be a invalid phone number

contacts.loc[2, 'Phone']

'577-374-96523'

Writing a function to validate US phone numbers, assuming the phone numbers provided are from the US

source

validate_us_phone_number

 validate_us_phone_number (phone_number)

Validates a US phone number and returns it if valid, otherwise returns an empty string.

Examples

validate_us_phone_number('321-295-2502')

'321-295-2502'

no-dashes 10 digit

validate_us_phone_number('3212952502')

'3212952502'

no-dashes 7 digit

validate_us_phone_number('2952502')

'2952502'

no-dashes 11 digit

validate_us_phone_number('13212952502')

''

With Parenthesese

validate_us_phone_number("(123) 456-7890")

'(123) 456-7890'

Random Text

validate_us_phone_number('407-555-1234 ext. 1234')

''

Applying Phone # Transformation

contacts['Phone'] = contacts.Phone.apply(validate_us_phone_number)

contacts.fillna('', inplace=True)

Results

contacts.head(5)

	Number	CompanyName	FirstName	LastName	Street	City	State	Postal	EMail	Remarks	Deceased	SecondaryFirstName	SecondaryLastName	LegacyIndividualId	SecondaryLegacyIndividualId	ContactName
0	653377813-7		Karita	Lumbers	4 Bunting Parkway	Washington	DC	20535-871	kklumbers@yahoo.com	Is anonymous		Kelvin	Lumbers	0	1	Karita & Kelvin Lumbers
1	390551098-7		Helga	Benech	48684 Jenifer Way	Las Vegas	NV	89130	ebenech1@goodreads.com					2		Helga Benech
2	093004505-X		Masha	Butt Gow	353 Schmedeman Park	Indianapolis	IN							3		Masha Butt Gow
3	729707142-0	A Company Co.	Cymbre	Cross	2055 Lakewood Parkway	Camden	NJ	8104			No			4		Cymbre Cross
4	488464926-5		Hoyt	Castille	37 8th Trail	Grand Rapids	MI	49560	fcastille4@timesonline.co.uk		No			5		Hoyt Castille

gifts.head(5)

	DonorNumber	LegacyGiftId	FirstName	LastName	AmountReceived	CreditCardType	PaymentMethod	LegacyPledgeID	Notes	Project1Code	Project2Code	GiftDate
0	848348568-0	95196378	Mannie	Turpin	4.1500		PayPal	0			None	2019-03-04
1	729707142-0	95196889	Cymbre	Cross	2.3648		check	1		ChildSponsorship	None	2019-03-05
2	687119652-8	95197689	Ruggiero	Makepeace	1.3100		cash	2			None	2019-03-07
3	653377813-7	95198998	Karita	Lumbers	2.0400	American Ex	credit card	3	In honor of Mannie Turpin		None	2019-03-10
4	390551098-7	95198999	Helga	Benech	5.8000		cash	89752384			None	2019-01-10

contacts.head(5)

	Number	CompanyName	FirstName	LastName	Street	City	State	Postal	EMail	Remarks	Deceased	SecondaryFirstName	SecondaryLastName	LegacyIndividualId	SecondaryLegacyIndividualId	ContactName
0	653377813-7		Karita	Lumbers	4 Bunting Parkway	Washington	DC	20535-871	kklumbers@yahoo.com	Is anonymous		Kelvin	Lumbers	0	1	Karita & Kelvin Lumbers
1	390551098-7		Helga	Benech	48684 Jenifer Way	Las Vegas	NV	89130	ebenech1@goodreads.com					2		Helga Benech
2	093004505-X		Masha	Butt Gow	353 Schmedeman Park	Indianapolis	IN							3		Masha Butt Gow
3	729707142-0	A Company Co.	Cymbre	Cross	2055 Lakewood Parkway	Camden	NJ	8104			No			4		Cymbre Cross
4	488464926-5		Hoyt	Castille	37 8th Trail	Grand Rapids	MI	49560	fcastille4@timesonline.co.uk		No			5		Hoyt Castille

Export

import nbdev

nbdev.nbdev_export('00_Setup.ipynb')