Financial services rank in the Better Business Bureau's top 10 categories for consumer complaints and inquiries every year. I thought it would be valuable to predict what makes a consumer go from a complaint to a legal dispute, since this could save thousands of dollars for both sides.
I'm using real data from the Consumer Financial Protection Bureau, which describes its complaint database this way: "Every complaint provides insight into problems that people are experiencing, helping us identify inappropriate practices and allowing us to stop them before they become major issues." https://www.consumerfinance.gov/data-research/consumer-complaints/
Each week the Bureau receives thousands of consumer complaints about financial products and services and sends them to the companies for a response. Complaints are published after the company responds or after 15 days, whichever comes first.
I will use the complaints for which the dispute outcome is known to build my model. First, I will do a descriptive analysis and data cleaning before building the model.
I use different functions for the descriptive analysis, but my goal in this notebook is to focus on the model; you can find all the functions in the file dispute_functions.py. My first step was importing dispute_functions.py, which also imports the packages I need, such as nltk for sentiment analysis, pandas, seaborn, and XGBoost.
from dispute_functions import *
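For context, this is roughly the kind of import block dispute_functions.py provides; the file itself is not reproduced in this notebook, so treat the following as an assumed sketch rather than its exact contents.
# Assumed sketch of what dispute_functions.py makes available (the real file also
# defines helpers such as createcolumn, cleaning, wordfrequecyplot, analysis,
# plot_grid_search_validation_curve, plot_grid_search_3d_validation and modelfit).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from textblob import TextBlob              # sentiment analysis (built on top of nltk)
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV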
First, I have to do some data preparation: load the file into a pandas DataFrame, normalize the column names (remove spaces and symbols, and convert everything to lowercase), eliminate duplicates, and keep only the rows where the dispute information is available (most of the recent cases are still in the first phase of the complaint process, so their dispute outcome is not known yet).
I also have to convert the dates to datetime format, since they are stored as strings in the file.
file_path = 'complaints.csv'
consumer_data = pd.read_csv(file_path, error_bad_lines=False, index_col=False, dtype='unicode')
consumer_data.columns = consumer_data.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
consumer_data = consumer_data.drop_duplicates(consumer_data.columns, keep='last')
# Parse the string dates into datetime columns
consumer_data['date_received'] = pd.to_datetime(consumer_data['date_received'], format='%Y-%m-%d')
consumer_data['date_sent_to_company'] = pd.to_datetime(consumer_data['date_sent_to_company'], format='%Y-%m-%d')
# Keep only the rows where the dispute outcome is recorded
dispute_data = consumer_data[consumer_data['consumer_disputed'].isin(['Yes', 'No'])]
#print(dispute_data.columns, len(dispute_data.columns))
#dispute_data.head(5)
Reviewing the data columns and shape, I can see that the dataset consists of 18 columns with information about the complaint dates, the product and issue involved, the company, the consumer's location, the complaint and response texts, and how the complaint was handled.
The data is imbalanced: about 80% of the complaints did not end in a dispute.
dispute_data.groupby('consumer_disputed').size()/len(dispute_data)
We have two date values in the dataset: date_received and date_sent_to_company.
The number of complaints that ended in disputes doesn't seem related to either of these dates. The distributions of disputed and non-disputed complaints are almost uniform over time, except for a peak of non-disputed complaints at the beginning of 2017.
received = (dispute_data.groupby(['consumer_disputed','date_received']).size()).rename('Complaints received by bureau').reset_index()
sent = (dispute_data.groupby(['consumer_disputed','date_sent_to_company']).size()).rename('Complaints sent to company').reset_index()
fig, axes = plt.subplots(2,figsize=(16,7))
sns.lineplot(x="date_received", y="Complaints received by bureau", hue='consumer_disputed', linewidth=0.5, data=received, ax=axes[0], marker='o')
sns.lineplot(x="date_sent_to_company", y="Complaints sent to company",hue='consumer_disputed', linewidth=0.5, data=sent, ax=axes[1], marker='o')
plt.show()
(received[received['consumer_disputed']=='Yes'].set_index('date_received').groupby(pd.Grouper(freq='A')).size()/received.set_index('date_received').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index()
(sent[sent['consumer_disputed']=='Yes'].set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()/sent.set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index()
But if I plot the time difference (date_sent_to_company - date_received, in days) against the percentage of complaints that ended in disputes, I can see that as the time difference increases, the probability of a dispute increases.
dispute_data['time_difference'] = (dispute_data['date_sent_to_company'] - dispute_data['date_received']).astype('timedelta64[D]')
timediff = ((dispute_data[dispute_data['consumer_disputed']=='Yes'].groupby(['time_difference']).size()/dispute_data.groupby(['time_difference']).size())*100).rename('% Complaints ended Disputes').reset_index()
plt.subplots(figsize=(10,6))
s = sns.scatterplot(x="time_difference", y="% Complaints ended Disputes", linewidth=0.5, data=timediff)
s.set(xlim=(0.7,1000),ylim=(0,103))
s.set(xscale="log")
s.set_title('% Complaints ended Disputes')
XGBoost only handles numerical values, so for the categorical variables I have to create dummy variables, but I want to be selective and include only relevant features.
ZIP code and state are correlated, so I decided to work with states. The createcolumn function calculates the dispute rate for each value of a column and returns the N values with the highest dispute rate and the N values with the lowest.
For example, applying createcolumn to the state column with N=4 returns the 4 states with the highest dispute rate and the 4 states with the lowest, as a dataframe of dummy variables for those states.
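The actual implementation lives in dispute_functions.py; a minimal sketch of how such a helper could work (my assumption, the real function may differ in details) is:
def createcolumn_sketch(data, target, positive, column, n):
    # Dispute rate for each value of `column`: share of rows where `target` equals `positive`.
    rates = (data[target].str.lower() == positive).groupby(data[column]).mean()
    # Keep the n values with the highest and the n values with the lowest dispute rate.
    selected = pd.concat([rates.nlargest(n), rates.nsmallest(n)]).index
    # Dummy variables only for the selected values; all other rows get all-zero dummies.
    return pd.get_dummies(data[column].where(data[column].isin(selected)), prefix=column).reset_index(drop=True)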
states_df = createcolumn(dispute_data,'consumer_disputed','yes','state',4)
There are only 12 values in the product column, so I've decided to use all of them in the model. But sub_product has 49 different values, so I selected just 20, and for issue and sub_issue I also selected only the values with the highest and lowest dispute rates to feed my model.
dispute_data.groupby('product').size()
print(len(dispute_data['sub_product'].unique()))
print(len(dispute_data['issue'].unique()))
print(len(dispute_data['sub_issue'].unique()))
subproduct_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_product',10)
subproduct_df.head()
(dispute_data.groupby('issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
issue_df = createcolumn(dispute_data,'consumer_disputed','yes','issue',10)
(dispute_data.groupby('sub_issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
subissue_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_issue',10)
There are 2,231 different companies in the dataset, so I decided on a different approach for them. I create size categories: if a company has fewer than 50 complaints I label it 'Unique', between 50 and 999 complaints it is a 'Small' company, between 1,000 and 2,999 complaints it is a 'Medium' company, and a company with 3,000 or more complaints becomes a category of its own.
Doing this I ended up with 9 categories, 6 of them corresponding to the largest financial institutions, and I created dummy variables for all of them.
df = dispute_data.groupby('company').size().rename('companysize').reset_index()
df
# Bucket companies by complaint volume; the largest ones keep their own name (next line)
df['companycode'] = ['Unique' if x < 50 else 'Small' if x < 1000 else 'Medium' if x < 3000 else 'company' for x in df['companysize']]
df.loc[df.companycode == 'company', 'companycode'] = df['company']
companycode_dum = pd.get_dummies(pd.DataFrame(df['companycode']))
companycode_dum.head(10)
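Note that companycode_dum above has one row per company; before it can be concatenated with the complaint-level features later on, it needs to be aligned to one row per complaint. A possible sketch of that alignment step (my assumption, since the exact mapping isn't shown here):
# Hypothetical alignment step: map each complaint row to its company's size category,
# then one-hot encode so the result has one row per complaint.
size_map = df.set_index('company')['companycode']
companycode_dum = pd.get_dummies(dispute_data['company'].map(size_map)).reset_index(drop=True)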
There are other categorical columns with only a few distinct values each, so I decided to include all of them in the model:
I created a dataframe with all the dummy variables for these categories.
print('Products... ', dispute_data['product'].unique())
print('Response to consumers ', dispute_data['company_response_to_consumer'].unique())
print('Consumer consent provided ', dispute_data['consumer_consent_provided'].unique())
print('Submitted via: ', dispute_data['submitted_via'].unique())
dum = pd.get_dummies(dispute_data[['product','consumer_consent_provided','company_response_to_consumer','submitted_via']])
dum.head()
dispute_data['disputed'] = [1 if x =='Yes' else 0 for x in dispute_data['consumer_disputed']]
dispute_data['timely_response2'] = [1 if x =='Yes' else 0 for x in dispute_data['timely_response']]
I have two text columns: the consumer complaint narrative (the consumer's description of the complaint) and the company's public response.
The first thing I did was clean the text: the cleaning function removes non-alphabetic characters, stop words, and numbers, and returns the text in lowercase. Then I use the wordfrequencyplot function to visualize the most frequent words when the complaint ends (or doesn't end) in a dispute.
For both the consumer narrative and the company response, the most frequent words were the same with and without disputes, so I decided to take a different approach and use sentiment analysis.
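Both helpers live in dispute_functions.py; as a rough illustration of the cleaning step described above (an assumed sketch, not the actual implementation):
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once
def cleaning_sketch(data, column):
    # Lowercase, keep alphabetic tokens only (drops numbers and punctuation), remove stop words.
    stops = set(stopwords.words('english'))
    tokens = data[column].fillna('').str.lower().str.findall(r'[a-z]+')
    return tokens.apply(lambda words: ' '.join(w for w in words if w not in stops))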
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'company_public_response'),'Frequency for public response when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'company_public_response'),'Frequency for public response when consumer didnt disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer didnt disputed')
I use TextBlob for sentiment analysis. TextBlob is an open-source Python library for textual analysis that is widely used in natural language processing and understanding.
There are two things we can measure: polarity and subjectivity.
Polarity captures the emotion expressed by the author in the text. It is a float ranging from -1.0 to +1.0: values below 0 are negative, 0 is neutral, and values above 0 are positive; the closer the value is to +1 (or -1), the stronger the positive (or negative) sentiment.
Subjectivity tells us whether a sentence is subjective or objective, with values ranging from 0.0 to +1.0. Subjective sentences are based on personal opinions, responses, and beliefs, whereas objective sentences are based on factual information.
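As a quick illustration of what TextBlob returns (the analysis helper in dispute_functions.py presumably wraps something like this):
from textblob import TextBlob
blob = TextBlob("The bank never responded and I am extremely frustrated.")
print(blob.sentiment.polarity)       # float in [-1.0, 1.0]
print(blob.sentiment.subjectivity)   # float in [0.0, 1.0]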
I created a dataframe with all these values and added it to the variables for my model.
cleaned_response = cleaning(dispute_data,'company_public_response')
sentiment_response_df = analysis(cleaned_response, 'polarity_response_company', 'subjectivity_response_company')
sentiment_response_df = pd.concat([dispute_data['disputed'],sentiment_response_df], axis=1, sort=False)
sentiment_response_df
sentiment_response_d = sentiment_response_df.dropna()
ax = sentiment_response_d[sentiment_response_d['disputed']==1][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_response_d[sentiment_response_d['disputed']==0][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
cleaned_complaints = cleaning(dispute_data,'consumer_complaint_narrative')
sentiment_complaint_df = analysis(cleaned_complaints, 'polarity_complaint', 'subjectivity_complaint')
sentiment_complaint_df = pd.concat([dispute_data['disputed'],sentiment_complaint_df], axis=1, sort=False)
sentiment_complaint_df
sentiment_complaint_d = sentiment_complaint_df.dropna()
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==1][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==0][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
narrative = pd.concat([sentiment_response_df,sentiment_complaint_df], axis=1, sort=False)
narrative.head()
narrativedf = narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]
XGBoost is an implementation of gradient boosting machines (GBM) with major improvements.
GBM is a supervised learning algorithm: an ensemble of weak learners is built, where misclassified records are given greater weight ('boosted') so that later models predict them correctly, and the weak learners are then combined into a single strong learner.
GBMs build trees sequentially; XGBoost also adds trees one after another, but it parallelizes the construction of each tree, which makes it faster.
Now I will put together all the features for the model:
- states_df
- subproduct_df
- issue_df
- subissue_df
- companycode_dum
- dum
- narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]
- dispute_data[['time_difference', 'timely_response2']]
all_df = [states_df, companycode_dum, subproduct_df, issue_df, subissue_df, narrativedf.reset_index(drop=True)]
X = pd.concat(all_df, axis=1, sort=False)
X = pd.concat([X, dum.reset_index(drop=True)], axis=1, sort=False)
fromdf = dispute_data[['time_difference', 'timely_response2']].reset_index(drop=True)
X = pd.concat([X, fromdf], axis=1, sort=False)
Y = dispute_data['disputed']
X.head()
The XGBoost algorithm has many hyperparameters, and tuning them is necessary to improve the model. I use GridSearchCV from scikit-learn to tune it, tracking two scoring metrics: accuracy and recall.
My goal is to maximize recall while giving up as little accuracy as possible.
The first parameter to tune is scale_pos_weight, which controls the balance of positive and negative weights and is useful for imbalanced classes.
The usual formula for it is sum(negative instances) / sum(positive instances), so I know the value should be around 4 for this dataset, but I'm tuning it to find the value that best satisfies my two metrics.
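As a quick check (a one-line illustration, not part of the tuning itself), the ratio can be computed directly from the target defined above; with roughly 80% non-disputes it comes out close to 4:
# Negative-to-positive ratio, the usual starting point for scale_pos_weight.
print((Y == 0).sum() / (Y == 1).sum())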
Grid search says the best value is 5, but that is because the refit metric is recall and 5 gives the best recall: when working with multiple metrics, GridSearchCV requires you to pick one of them for refitting and prioritizes it.
In my case, the validation plot shows that the value that maximizes recall while minimizing the loss in accuracy is around 3.85, so that is my chosen value for scale_pos_weight.
from sklearn.metrics import make_scorer, recall_score, accuracy_score
scoring_evals = {'Recall': make_scorer(recall_score), 'Accuracy': make_scorer(accuracy_score)}
param_test = { 'scale_pos_weight':[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5]}
gsearch = GridSearchCV(estimator =XGBClassifier(n_estimators=200, learning_rate= 0.15, gamma=0, subsample=0.8,
max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_ )
plot_grid_search_validation_curve(gsearch,[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5], 'weight', title='Validation Curve', ylim=None,
xlim=None, log=None)
n_estimators is the number of gradient-boosted trees, which is equivalent to the number of boosting rounds.
To tune it, I fix all the other parameters and search over the number of estimators. Again, grid search says the optimum value is 10, since that maximizes recall, but the validation curve shows that the best trade-off is around n_estimators = 70.
param_test = { 'n_estimators':[10,50,100,500,1000]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall',n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_ )
#gsearch.cv_results_
plot_grid_search_validation_curve(gsearch,[10,50,100,500,1000], 'n_estimators', title='Validation Curve', ylim=(0.5,0.85),
xlim=None, log=True)
max_depth is the maximum depth of a boosted tree. Increasing this value makes the model more complex and more likely to overfit.
min_child_weight defines the minimum sum of instance weights required in a child node. It controls overfitting: higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.
param_test = { 'max_depth':range(0,11,3), 'min_child_weight':[0.2,1,2.5,5]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
gsearch.best_params_
plot_grid_search_3d_validation(gsearch, 'max_depth', 'min_child_weight', log1=None, log2=None)
Gamma specifies the minimum loss reduction required to make a split.
param_test = {'gamma':[i/10.0 for i in range(0,7,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, subsample=0.8,
n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic',max_depth=3, min_child_weight = 5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[i/10.0 for i in range(0,7,2)], 'gamma', title='Validation Curve', ylim=(0.5,0.8),
xlim=None, log=True)
subsample is the fraction of observations randomly sampled for each tree; for example, setting it to 0.5 means that XGBoost randomly samples half of the training data before growing each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
colsample_bytree is the fraction of columns randomly sampled for each tree.
param_test = { 'subsample':[i/10.0 for i in range(3,11,2)], 'colsample_bytree':[i/10.0 for i in range(3,11,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0,
n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight = 5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_3d_validation(gsearch, 'subsample', 'colsample_bytree', log1=None, log2=None)
reg_alpha is the Lasso (L1) regularization term on the weights. Increasing it makes the model more conservative (more regularization, i.e. a simpler model).
param_test = {'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight =5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[0, 0.001, 0.005, 0.01, 0.05], 'reg_alpha', title='Validation Curve', ylim=(0.0,0.8),
xlim=None, log=True)
A smaller learning rate allows the model to learn a more optimal set of weights, but it's necessary to increase n_estimators to make sure the model reaches that optimum.
param_test = {'learning_rate':[0.0005,0.001, 0.005, 0.01, 0.05, 0.1, 0.3], 'n_estimators':[10,50,100,500,1000]}
gsearch = GridSearchCV(estimator =XGBClassifier(scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_3d_validation(gsearch, 'learning_rate', 'n_estimators', log1=None, log2=None)
xgbfinal = XGBClassifier(learning_rate=0.01, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
n_estimators=100, objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
nthread=4, seed=27)
modelfit(xgbfinal, X, Y)