Will a customer start a dispute regarding financial services?

Predicting a dispute using real complaints data and XGBoost.

Financial services are in the top 10 of the Better Business Bureau's list of consumer complaints and inquiries every year. I thought it would be great to predict what makes a consumer go from a complaint to a legal dispute, since this could save thousands of dollars for both sides.

I'm using real data from the Consumer Financial Protection Bureau, which describes its complaint database this way: "Every complaint provides insight into problems that people are experiencing, helping us identify inappropriate practices and allowing us to stop them before they become major issues." https://www.consumerfinance.gov/data-research/consumer-complaints/

Each week the Bureau receives thousands of consumer complaints about financial products and services and sends them to the companies for a response. Complaints are published after the company responds or after 15 days, whichever comes first.

I will use the information about whether each complaint ended in a dispute for my model. First, I will do a descriptive analysis and clean the data before building the model.

I use different functions for the descriptive analysis, but my goal in this notebook is to focus more on the model; you can find all the functions in the file dispute_functions.py. My first step was importing from dispute_functions.py, which in turn imports the different packages, like nltk for the text analysis, as well as pandas, Seaborn, and XGBoost.

In [1]:
from dispute_functions import * 
[nltk_data] Downloading package words to
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]   Package stopwords is already up-to-date!
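
All the helpers used below come from dispute_functions.py; the file itself is not reproduced here, but a rough sketch of the imports it presumably starts with (based only on what this notebook uses) is:

# dispute_functions.py -- sketch of the import section only; the real file also
# defines the helper functions (createcolumn, cleaning, analysis, modelfit, ...)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

import nltk
from textblob import TextBlob
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Corpora needed for the text cleaning steps
nltk.download('words')
nltk.download('stopwords')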

Basic data processing.

First, I have to do some data preparation: load the file into a pandas dataframe, format all the column names (remove spaces and symbols, and convert everything to lowercase), eliminate duplicates, and select only the rows where the information about disputes is available (most of the recent cases are still in the first phase of the complaint process, so there is no dispute information for them yet).

I also have to convert the dates to datetime format, since they are stored as strings in the file.

In [2]:
file_path = 'complaints.csv'
consumer_data = pd.read_csv(file_path, error_bad_lines=False, index_col=False, dtype='unicode')

# Normalize column names: strip spaces, lowercase, replace separators, drop '?'
consumer_data.columns = consumer_data.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
# Drop duplicate rows, keeping the last occurrence
consumer_data = consumer_data.drop_duplicates(consumer_data.columns, keep='last')
# The dates are stored as strings; convert them to datetime
consumer_data['date_received'] = [datetime.strptime(x, '%Y-%m-%d') for x in consumer_data['date_received']]
consumer_data['date_sent_to_company'] = [datetime.strptime(x, '%Y-%m-%d') for x in consumer_data['date_sent_to_company']]
# Keep only the complaints where the dispute outcome is known
dispute_data = consumer_data[(consumer_data['consumer_disputed']=='Yes')|(consumer_data['consumer_disputed']=='No')]
#print(dispute_data.columns, len(dispute_data.columns))
#dispute_data.head(5)

Reviewing the data columns and shape, I can see that the data consists of 18 columns with information about:

  • Complaint: complaint_id, date_received, date_sent_to_company, submitted_via, state, zip_code.
  • Product: product, sub_product, issue, sub_issue.
  • Company: company, company_public_response, tags, company_response_to_consumer, timely_response.
  • Consumer: consumer_complaint_narrative, consumer_consent_provided, consumer_disputed.

The data is unbalanced: about 80% of the complaints didn't end in a dispute.

In [3]:
dispute_data.groupby('consumer_disputed').size()/len(dispute_data)
Out[3]:
consumer_disputed
No     0.806919
Yes    0.193081
dtype: float64

Dates

We have two date values in the dataset:

  • date_received: when the consumer submitted the complaint to the Consumer Financial Protection Bureau.
  • date_sent_to_company: when the Bureau sent the complaint to the company asking for a response.

The number of complaints that ended in disputes doesn't seem related to either of these dates. The distributions with and without disputes are almost uniform, except for a peak of non-disputed complaints at the beginning of 2017.

In [4]:
received = (dispute_data.groupby(['consumer_disputed','date_received']).size()).rename('Complaints received by bureau').reset_index() 
sent = (dispute_data.groupby(['consumer_disputed','date_sent_to_company']).size()).rename('Complaints sent to company').reset_index()  
fig, axes = plt.subplots(2,figsize=(16,7))
sns.lineplot(x="date_received", y="Complaints received by bureau", hue='consumer_disputed', linewidth=0.5, data=received, ax=axes[0], marker='o')
sns.lineplot(x="date_sent_to_company", y="Complaints sent to company",hue='consumer_disputed', linewidth=0.5, data=sent, ax=axes[1], marker='o')
plt.show()
In [5]:
(received[received['consumer_disputed']=='Yes'].set_index('date_received').groupby(pd.Grouper(freq='A')).size()/received.set_index('date_received').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index() 
Out[5]:
date_received % Complaints ended in Disputes per Year
0 2011-12-31 50.000000
1 2012-12-31 50.000000
2 2013-12-31 50.000000
3 2014-12-31 50.000000
4 2015-12-31 50.000000
5 2016-12-31 50.000000
6 2017-12-31 49.775785
In [6]:
(sent[sent['consumer_disputed']=='Yes'].set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()/sent.set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index() 
Out[6]:
date_sent_to_company % Complaints ended in Disputes per Year
0 2011-12-31 50.000000
1 2012-12-31 48.713826
2 2013-12-31 49.652295
3 2014-12-31 50.000000
4 2015-12-31 50.000000
5 2016-12-31 50.000000
6 2017-12-31 33.623188
7 2018-12-31 NaN

But if I plot the time difference (date_sent_to_company - date_received, in days) against the percentage of complaints that ended in disputes, I can see that as the time difference increases, the probability of a dispute also increases.

In [7]:
dispute_data['time_difference'] = (dispute_data['date_sent_to_company'] - dispute_data['date_received']).astype('timedelta64[D]')
timediff = ((dispute_data[dispute_data['consumer_disputed']=='Yes'].groupby(['time_difference']).size()/dispute_data.groupby(['time_difference']).size())*100).rename('% Complaints ended Disputes').reset_index() 
plt.subplots(figsize=(10,6))
s = sns.scatterplot(x="time_difference", y="% Complaints ended Disputes", linewidth=0.5, data=timediff)
s.set(xlim=(0.7,1000),ylim=(0,103))
s.set(xscale="log")
s.set_title('% Complaints ended Disputes')
Out[7]:
Text(0.5, 1.0, '% Complaints ended Disputes')

Location and zip code

XGBoost only handles numerical values, so for the categorical variables I have to create dummy variables, but I want to be selective and include only the relevant features.

Zip code and state are correlated, so I decided to work with states. The createcolumn function calculates the dispute rate for each value of a column and returns the "N" values with the highest dispute rate and the "N" values with the lowest dispute rate.

For example, applying createcolumn to the state column with N=4 returns the 4 states with the highest dispute rate and the 4 states with the lowest dispute rate, as a dataframe of dummy variables for those states.
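
The real implementation lives in dispute_functions.py; a rough sketch of a helper with the behaviour just described (dispute rate per value, keep the N highest- and N lowest-rate values, return their dummies suffixed with the column name, as in the outputs further below) might look like this:

def createcolumn(data, target_col, positive_value, column, n):
    # Sketch only, not the original helper from dispute_functions.py
    positives = data[data[target_col].str.lower() == positive_value].groupby(column).size()
    totals = data.groupby(column).size()
    rates = (positives / totals).fillna(0).sort_values()

    # n values with the lowest dispute rate plus n values with the highest
    selected = list(rates.head(n).index) + list(rates.tail(n).index)

    dummies = pd.get_dummies(data[column])[selected]
    return dummies.add_suffix('_' + column).reset_index(drop=True)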

In [8]:
states_df = createcolumn(dispute_data,'consumer_disputed','yes','state',4)

Product

There are only 13 values for the product column, so I've decided to use all of them in the model. But sub_product has around 50 different values, so I selected just 20, and for issue and sub_issue I also selected the values with the highest and lowest dispute rates to feed my model.

In [9]:
dispute_data.groupby('product').size()
Out[9]:
product
Bank account or service         86206
Checking or savings account         3
Consumer Loan                   31604
Credit card                     89190
Credit reporting               140432
Debt collection                145815
Money transfers                  5354
Mortgage                       226897
Other financial service          1059
Payday loan                      5543
Prepaid card                     3819
Student loan                    32537
Virtual currency                   18
dtype: int64
In [10]:
print(len(dispute_data['sub_product'].unique()))
print(len(dispute_data['issue'].unique()))
print(len(dispute_data['sub_issue'].unique()))
51
99
62
In [11]:
subproduct_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_product',10)
subproduct_df.head()
Out[11]:
(CD) Certificate of deposit_sub_product Auto_sub_product Cashing a check without an account_sub_product Check cashing_sub_product Checking account_sub_product Conventional adjustable mortgage (ARM)_sub_product Conventional fixed mortgage_sub_product Conventional home mortgage_sub_product Credit card_sub_product Credit repair_sub_product Refund anticipation check_sub_product Reverse mortgage_sub_product Savings account_sub_product Second mortgage_sub_product Title loan_sub_product Transit card_sub_product Traveler’s/Cashier’s checks_sub_product VA mortgage_sub_product Vehicle lease_sub_product Vehicle loan_sub_product
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [12]:
(dispute_data.groupby('issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
issue_df = createcolumn(dispute_data,'consumer_disputed','yes','issue',10)
In [13]:
(dispute_data.groupby('sub_issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
subissue_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_issue',10)

Company

I have 4,289 different companies in the data set (see the table below), so I decided on a different approach for them. I created size categories: if a company has fewer than 50 complaints I label it 'Unique'; between 50 and 999 complaints, 'Small'; between 1,000 and 2,999 complaints, 'Medium'; and a company with 3,000 or more complaints becomes a category of its own.

Doing this I ended up with 38 categories: the size buckets plus the largest financial institutions, each one as its own category. I then created dummy variables for all of these.

In [14]:
df = dispute_data.groupby('company').size().rename('companysize').reset_index()
df
Out[14]:
company companysize
0 (Former)Shapiro, Swertfeger & Hasty, LLP 4
1 1 STOP MONEY CENTERS, LLC 1
2 1ST 2ND MORTGAGE CO. OF NJ INC 1
3 1ST ALLIANCE LENDING, LLC 18
4 1ST PREFERENCE MORTGAGE CORP 2
... ... ...
4284 eMoneyUSA Holdings, LLC 3
4285 i3 Lending, Inc 2
4286 iFreedom Direct Corporation 16
4287 iQuantified Management Services, LLC 5
4288 Lippman Recupero, LLC 5

4289 rows × 2 columns

In [15]:
# Size buckets: <50 complaints = 'Unique', 50-999 = 'Small', 1000-2999 = 'Medium',
# 3000 or more = the company becomes its own category (placeholder 'company' for now)
df['companycode'] = ['Unique' if x < 50 else 'Small' if 1000 > x >49 else 'Medium' if 3000 > x >999 else 'company' for x in df['companysize']]
# Replace the placeholder with the actual company name
df.loc[df.companycode == 'company', 'companycode'] = df['company']

companycode_dum = pd.get_dummies(pd.DataFrame(df['companycode']))
companycode_dum.head(10)
Out[15]:
companycode_AES/PHEAA companycode_ALLY FINANCIAL INC. companycode_AMERICAN EXPRESS COMPANY companycode_BANK OF AMERICA, NATIONAL ASSOCIATION companycode_BARCLAYS BANK DELAWARE companycode_BB&T CORPORATION companycode_CAPITAL ONE FINANCIAL CORPORATION companycode_CITIBANK, N.A. companycode_CITIZENS FINANCIAL GROUP, INC. companycode_DISCOVER BANK ... companycode_SYNCHRONY FINANCIAL companycode_Santander Consumer USA Holdings Inc. companycode_Seterus, Inc. companycode_Small companycode_TD BANK US HOLDING COMPANY companycode_TRANSUNION INTERMEDIATE HOLDINGS, INC. companycode_U.S. BANCORP companycode_UNITED SERVICES AUTOMOBILE ASSOCIATION companycode_Unique companycode_WELLS FARGO & COMPANY
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

10 rows × 38 columns

Other variables

I have other columns with only a few distinct values, so I decided to include all of them in the model:

  • Response to consumer has only 7 values.
  • Consumer consent provided has 5 values.
  • Submitted via has only 6 values.

I created a dataframe with all the dummy variables for these categories.

In [16]:
print('Products... ', dispute_data['product'].unique())
print('Response to consumers  ', dispute_data['company_response_to_consumer'].unique())
print('Consumer consent provided  ', dispute_data['consumer_consent_provided'].unique())
print('submited via:  ', dispute_data['submitted_via'].unique())
dum = pd.get_dummies(dispute_data[['product','consumer_consent_provided','company_response_to_consumer','submitted_via']])
dum.head()
Products...  ['Debt collection' 'Credit card' 'Bank account or service' 'Consumer Loan'
 'Mortgage' 'Payday loan' 'Credit reporting' 'Other financial service'
 'Student loan' 'Prepaid card' 'Money transfers' 'Virtual currency'
 'Checking or savings account']
Response to consumers   ['Closed with explanation' 'Closed with monetary relief'
 'Closed with non-monetary relief' 'Closed' 'Untimely response'
 'Closed with relief' 'Closed without relief']
Consumer consent provided   [nan 'Consent provided' 'Consent not provided' 'Other' 'Consent withdrawn']
submited via:   ['Web' 'Phone' 'Postal mail' 'Referral' 'Fax' 'Email']
Out[16]:
product_Bank account or service product_Checking or savings account product_Consumer Loan product_Credit card product_Credit reporting product_Debt collection product_Money transfers product_Mortgage product_Other financial service product_Payday loan ... company_response_to_consumer_Closed with non-monetary relief company_response_to_consumer_Closed with relief company_response_to_consumer_Closed without relief company_response_to_consumer_Untimely response submitted_via_Email submitted_via_Fax submitted_via_Phone submitted_via_Postal mail submitted_via_Referral submitted_via_Web
16 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
178 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
227 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
265 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
438 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 30 columns

In [17]:
dispute_data['disputed'] = [1 if x =='Yes' else 0 for x in dispute_data['consumer_disputed']] 
dispute_data['timely_response2'] = [1 if x =='Yes' else 0 for x in dispute_data['timely_response']] 

Text analysis

Consumer complaint narrative and company public response

I have two columns with text: consumer complaint narrative (the consumer's description of the complaint) and company public response.

The first thing I did was clean the text: the cleaning function removes non-alphabetic characters, stop words, and numbers, and returns the text in lowercase. Then I use the wordfrequecyplot function to visualize the most frequent words when the complaint ends (or doesn't end) in a dispute.
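
The actual implementation is in dispute_functions.py; as a rough sketch under those assumptions (lowercase, strip non-alphabetic characters and numbers, drop English stop words, and use the 'no_comment' placeholder seen in the outputs further below for missing text), a cleaning function could look like this:

import re
from nltk.corpus import stopwords

def cleaning(data, column):
    # Sketch only, not the original helper from dispute_functions.py
    stop_words = set(stopwords.words('english'))

    def clean_text(text):
        if pd.isnull(text):
            return 'no_comment'                                   # placeholder for missing text
        text = re.sub(r'[^a-z\s]', ' ', str(text).lower())        # keep letters only
        words = [w for w in text.split() if w not in stop_words]  # drop stop words
        return ' '.join(words) if words else 'no_comment'

    return data[column].apply(clean_text)                         # keeps the original index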

For both the consumer narrative and the company response, the most frequent words were the same with and without disputes, so I decided to take a different approach and use sentiment analysis.

In [18]:
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'company_public_response'),'Frequency for public response when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'company_public_response'),'Frequency for public response when consumer didnt disputed')
In [19]:
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer didnt disputed')

Sentiment analysis

I use sentiment analysis with TextBlob, an open-source Python library for textual analysis that is widely used in natural language processing.

There are two things that we can measure:

  • Polarity
  • Subjectivity

POLARITY

Polarity captures the expression and emotion of the author in the text. It is a float value ranging from -1.0 to +1.0:

  • less than 0 denotes negative sentiment,
  • equal to 0 denotes neutral,
  • greater than 0 denotes positive sentiment.

Values near +1 are more strongly positive than values near 0, and likewise for negative values near -1.

SUBJECTIVITY

It tells us whether a sentence is subjective or objective. The value ranges from 0.0 to +1.0.

Subjective sentences are based on personal opinions, responses, and beliefs, whereas objective sentences are based on factual information.

I created a dataframe with all these values and added it to the variables for my model.
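
The analysis helper also lives in dispute_functions.py; a minimal sketch of how the two scores could be computed with TextBlob (leaving the 'no_comment' placeholder rows as NaN, which matches the output below) is:

from textblob import TextBlob

def analysis(clean_series, polarity_name, subjectivity_name):
    # Sketch only, not the original helper from dispute_functions.py
    def scores(text):
        if text == 'no_comment':
            return pd.Series([text, None, None])      # no sentiment for missing text
        sentiment = TextBlob(text).sentiment          # namedtuple (polarity, subjectivity)
        return pd.Series([text, sentiment.polarity, sentiment.subjectivity])

    result = clean_series.apply(scores)
    result.columns = ['Cleantext', polarity_name, subjectivity_name]
    return result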

In [20]:
cleaned_response = cleaning(dispute_data,'company_public_response')
sentiment_response_df = analysis(cleaned_response, 'polarity_response_company', 'subjectivity_response_company')
sentiment_response_df  = pd.concat([dispute_data['disputed'],sentiment_response_df], axis=1, sort=False)
sentiment_response_df
Out[20]:
disputed Cleantext polarity_response_company subjectivity_response_company
16 1 no_comment NaN NaN
178 1 no_comment NaN NaN
227 0 no_comment NaN NaN
265 0 company responded consumer cfpb chooses provid... 0.0 0.066667
438 0 company believes complaint represents opportun... 0.1 0.350000
... ... ... ... ...
1624049 0 no_comment NaN NaN
1624050 0 no_comment NaN NaN
1624051 0 company responded consumer cfpb chooses provid... 0.0 0.066667
1624052 1 company responded consumer cfpb chooses provid... 0.0 0.066667
1624053 0 company responded consumer cfpb chooses provid... 0.0 0.066667

768477 rows × 4 columns

In [21]:
sentiment_response_d = sentiment_response_df.dropna()
ax = sentiment_response_d[sentiment_response_d['disputed']==1][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_response_d[sentiment_response_d['disputed']==0][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
Out[21]:
Text(0.5, 1.0, 'When consumer didnt dispute')
In [22]:
cleaned_complaints = cleaning(dispute_data,'consumer_complaint_narrative')
sentiment_complaint_df = analysis(cleaned_complaints, 'polarity_complaint', 'subjectivity_complaint')
sentiment_complaint_df  = pd.concat([dispute_data['disputed'],sentiment_complaint_df], axis=1, sort=False)
sentiment_complaint_df
Out[22]:
disputed Cleantext polarity_complaint subjectivity_complaint
16 1 no_comment NaN NaN
178 1 monitor credit report frequently attempting ho... 0.192857 0.417857
227 0 xxxx xxxx xxxx received letter stating owed de... 0.093855 0.405556
265 0 stupid charge items macy macy credit card xxxx... 0.003133 0.559818
438 0 vehicle repoed xxxx paychecks gotten loans hun... 0.000000 0.138095
... ... ... ... ...
1624049 0 xxxx xxxx contacted xxxx xxxx branch manager x... 0.082857 0.489286
1624050 0 chase services mortgage owns original loan doc... 0.375000 0.750000
1624051 0 payment citi xxxx credit card xxxx xxxx using ... 0.166667 0.548148
1624052 1 cfbp like file complaint experian reporting ag... -0.100000 0.375000
1624053 0 husband middle short sale property located xxx... 0.092857 0.435714

768477 rows × 4 columns

In [23]:
sentiment_complaint_d = sentiment_complaint_df.dropna()
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==1][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==0][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
Out[23]:
Text(0.5, 1.0, 'When consumer didnt dispute')
In [24]:
narrative = pd.concat([sentiment_response_df,sentiment_complaint_df], axis=1, sort=False)
narrative.head()
narrativedf = narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]

XGBoost model

XGBoost is an implementation of the Gradient Boosting Machine (GBM) with major improvements.

GBM is an algorithm used for supervised learning: an ensemble of weak learners is built, where misclassified records are given greater weight ('boosted') so that later models predict them correctly. These weak learners are then combined into a single strong learner.

GBMs build trees sequentially; XGBoost also adds trees one after another, but it parallelizes the construction of each individual tree, which makes it faster.

Starting predictions

Now I will put together all the features for the model:

- states_df
- subproduct_df
- issue_df
- subissue_df
- companycode_dum
- dum
- narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]
- dispute_data[['time_difference', 'timely_response2']]
In [25]:
all_df = [states_df, companycode_dum, subproduct_df, issue_df, subissue_df, narrativedf.reset_index(drop=True)]
X = pd.DataFrame()
X =  pd.concat(all_df, axis=1, sort=False)
X = pd.concat([X, dum.reset_index(drop=True)], axis=1, sort=False)
fromdf = dispute_data[['time_difference', 'timely_response2']].reset_index(drop=True)
X = pd.concat([X, fromdf], axis=1, sort=False)

Y = dispute_data['disputed'] 
X.head()
Out[25]:
AA_state AE_state AK_state AL_state WA_state WI_state WV_state WY_state companycode_AMERICAN EXPRESS COMPANY companycode_BANK OF AMERICA, NATIONAL ASSOCIATION ... company_response_to_consumer_Closed without relief company_response_to_consumer_Untimely response submitted_via_Email submitted_via_Fax submitted_via_Phone submitted_via_Postal mail submitted_via_Referral submitted_via_Web time_difference timely_response2
0 0 0 0 0 0 0 0 0 0.0 0.0 ... 0 0 0 0 0 0 0 1 1.0 1
1 0 0 0 0 0 0 0 0 0.0 0.0 ... 0 0 0 0 0 0 0 1 1.0 1
2 0 0 0 0 0 0 0 0 0.0 0.0 ... 0 0 0 0 0 0 0 1 0.0 1
3 0 0 0 0 0 0 0 0 0.0 0.0 ... 1 0 0 0 0 0 0 1 3.0 1
4 0 0 0 0 0 0 0 0 0.0 0.0 ... 0 0 0 0 0 0 0 1 0.0 1

5 rows × 128 columns

XGBoost parameters

The XGBoost algorithm uses multiple parameters, so parameter tuning is necessary to improve the model. I use GridSearchCV from scikit-learn to tune it, considering two scoring metrics, accuracy and recall:

  • Accuracy is the fraction of predictions that are correct.
  • Recall (or sensitivity) is the fraction of actual positives that are correctly identified as positive.

My goal is to maximize recall while keeping accuracy as high as possible.
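
Before tuning anything, it can help to see where an untuned classifier lands on these two metrics. This is only an illustrative baseline, not a cell from the original notebook, evaluated by cross-validation on the X and Y built above:

from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# Untuned baseline scored on recall and accuracy, as a reference point for the tuning below
baseline = XGBClassifier(objective='binary:logistic', nthread=4, seed=27)
cv_results = cross_validate(baseline, X, Y, cv=5,
                            scoring={'Recall': 'recall', 'Accuracy': 'accuracy'})
print('Recall  :', cv_results['test_Recall'].mean())
print('Accuracy:', cv_results['test_Accuracy'].mean())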

scale_pos_weight parameter

The first parameter to tune is scale_pos_weight; it controls the balance of positive and negative weights and is useful for unbalanced classes.

The formula to calculate it is sum(negative instances) / sum(positive instances). I know the optimum value should be around 4, but I'm tuning it to find the value that best satisfies my two metrics.
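
As a quick sanity check of that rule of thumb (not a cell from the original notebook, just the formula applied to the target built earlier):

# sum(negative instances) / sum(positive instances)
print((Y == 0).sum() / (Y == 1).sum())   # ~0.807 / 0.193, i.e. roughly 4.2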

Grid search says that the best parameter is 5, but that is because the refit metric is Recall and 5 gives the best recall. When working with multiple metrics, GridSearchCV requires selecting one of them for refit and prioritizes it.

In my case, looking at the validation plot, the value that maximizes recall while minimizing the loss of accuracy is around 3.85, so that is my optimum value for the weight parameter.

In [26]:
from sklearn.metrics import  make_scorer, recall_score, accuracy_score

scoring_evals = {'Recall': make_scorer(recall_score), 'Accuracy': make_scorer(accuracy_score)}
param_test = { 'scale_pos_weight':[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5]}

gsearch = GridSearchCV(estimator =XGBClassifier(n_estimators=200, learning_rate= 0.15, gamma=0, subsample=0.8,
                     max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic', 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_ )

plot_grid_search_validation_curve(gsearch,[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5], 'weight', title='Validation Curve', ylim=None,
                                      xlim=None, log=None)
{'scale_pos_weight': 5}

Number of estimators

n_estimators is the number of gradient-boosted trees, equivalent to the number of boosting rounds.

To tune it I fix all the other parameters and search over the number of estimators. Again, the grid search says the optimum value is 10, since that maximizes recall, but the validation curve shows that the best trade-off is around n_estimators = 70.

In [27]:
param_test = { 'n_estimators':[10,50,100,500,1000]}

gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
                     max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic', 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall',n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_ )
#gsearch.cv_results_ 
plot_grid_search_validation_curve(gsearch,[10,50,100,500,1000], 'n_estimators', title='Validation Curve', ylim=(0.5,0.85),
                                      xlim=None, log=True)
{'n_estimators': 10}

Tune max_depth and min_child_weight

max_depth is the maximum depth of a boosted tree. Increasing this value makes the model more complex and more likely to overfit.

min_child_weight defines the minimum sum of instance weights required in a child node. It helps control overfitting: higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree.

In [31]:
param_test = { 'max_depth':range(0,11,3), 'min_child_weight':[0.2,1,2.5,5]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
                     n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic', 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
gsearch.best_params_ 
Out[31]:
{'max_depth': 3, 'min_child_weight': 5}
In [32]:
plot_grid_search_3d_validation(gsearch, 'max_depth', 'min_child_weight', log1=None, log2=None)

Tune gamma

Gamma specifies the minimum loss reduction required to make a split.

In [33]:
param_test = {'gamma':[i/10.0 for i in range(0,7,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, subsample=0.8,
                     n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic',max_depth=3, min_child_weight = 5, 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[i/10.0 for i in range(0,7,2)], 'gamma', title='Validation Curve', ylim=(0.5,0.8),
                                      xlim=None, log=True)
{'gamma': 0.0}

Tune subsample and colsample_bytree

subsample is the fraction of observations to be randomly sampled for each tree; for example, setting it to 0.5 means that XGBoost randomly samples half of the training data before growing the trees. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.

colsample_bytree is the fraction of columns to be randomly sampled for each tree.

In [34]:
param_test = { 'subsample':[i/10.0 for i in range(3,11,2)], 'colsample_bytree':[i/10.0 for i in range(3,11,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0,
                     n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight = 5, 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_)
plot_grid_search_3d_validation(gsearch, 'subsample', 'colsample_bytree', log1=None, log2=None)
{'colsample_bytree': 0.7, 'subsample': 0.5}

Regularization

reg_alpha is the Lasso (L1) regularization term on weights. Increasing this value makes the model more conservative (more regularization, i.e. a simpler model).

In [35]:
param_test = {'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
                     n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight =5, 
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[0, 0.001, 0.005, 0.01, 0.05], 'reg_alpha', title='Validation Curve', ylim=(0.0,0.8),
                                      xlim=None, log=True)
{'reg_alpha': 0.05}

Reducing learning rate

A smaller learning rate allows the model to learn a more optimal set of weights, but it is necessary to increase n_estimators to make sure the model still reaches that optimum.

In [36]:
param_test = {'learning_rate':[0.0005,0.001, 0.005, 0.01, 0.05, 0.1, 0.3], 'n_estimators':[10,50,100,500,1000]}
gsearch = GridSearchCV(estimator =XGBClassifier(scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
                      objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
                     nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y) 
print(gsearch.best_params_)
{'learning_rate': 0.0005, 'n_estimators': 1000}
In [39]:
plot_grid_search_3d_validation(gsearch, 'learning_rate', 'n_estimators', log1=None, log2=None)
In [40]:
xgbfinal = XGBClassifier(learning_rate=0.01, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
                     n_estimators=100, objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
                     nthread=4, seed=27)
modelfit(xgbfinal, X, Y)
Model Report
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.9, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=3,
              min_child_weight=5, missing=None, n_estimators=100, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=3.8, seed=27,
              silent=None, subsample=0.5, verbosity=1)
Accuracy : 0.5357
AUC Score (Train): 0.609178
submitted_via_Web                                               111
company_response_to_consumer_Closed with explanation             90
product_Mortgage                                                 67
company_response_to_consumer_Closed without relief               66
company_response_to_consumer_Closed                              53
submitted_via_Referral                                           50
product_Credit reporting                                         47
consumer_consent_provided_Consent not provided                   41
time_difference                                                  37
company_response_to_consumer_Untimely response                   34
product_Debt collection                                          29
company_response_to_consumer_Closed with non-monetary relief     10
timely_response2                                                  9
product_Credit card                                               9
Conventional fixed mortgage_sub_product                           8
polarity_response_company                                         8
company_response_to_consumer_Closed with monetary relief          7
Conventional adjustable mortgage (ARM)_sub_product                5
polarity_complaint                                                4
subjectivity_response_company                                     4
dtype: int64
tn 0.411734843968437 fp 0.39499012030063446 fn 0.06926202941260697 tp 0.12401300631832157
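
Reading those proportions back into the two metrics used during tuning (simple arithmetic on the values reported above, rounded):

tn, fp, fn, tp = 0.4117, 0.3950, 0.0693, 0.1240   # proportions from the model report
accuracy = tn + tp          # ≈ 0.536, matching the reported accuracy
recall = tp / (tp + fn)     # ≈ 0.64: about 64% of actual disputes are flagged
print(accuracy, recall)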