Financial services rank in the Better Business Bureau's top 10 categories for consumer complaints and inquiries every year. I thought it would be valuable to predict what makes a consumer go from a complaint to a legal dispute, since this could save thousands of dollars for both sides.
I'm using real data from the Consumer Financial Protection Bureau, which describes its complaint database this way: "Every complaint provides insight into problems that people are experiencing, helping us identify inappropriate practices and allowing us to stop them before they become major issues." https://www.consumerfinance.gov/data-research/consumer-complaints/
Each week the Bureau receives thousands of consumer complaints about financial products and services and sends them to the companies for a response. Complaints are published after the company responds or after 15 days, whichever comes first.
I will use the complaints for which the dispute outcome is known to build my model. First, I will do a descriptive analysis and data cleaning before building the model.
I use different functions for the descriptive analysis, but my goal in this notebook is to focus on the model; you can find all the functions in the file dispute_functions.py. My first step was importing dispute_functions.py, which also imports the packages I need, such as nltk for sentiment analysis, pandas, seaborn, and XGBoost.
from dispute_functions import *
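For context, this is roughly the kind of import block dispute_functions.py provides; the file itself is not reproduced in this notebook, so treat the following as an assumed sketch rather than its exact contents.
# Assumed sketch of what dispute_functions.py makes available (the real file also
# defines helpers such as createcolumn, cleaning, wordfrequecyplot, analysis,
# plot_grid_search_validation_curve, plot_grid_search_3d_validation and modelfit).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from textblob import TextBlob              # sentiment analysis (built on top of nltk)
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV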
First, I have to do some data preparation: load the file into a pandas DataFrame, normalize the column names (remove spaces and symbols, and convert everything to lowercase), eliminate duplicates, and keep only the rows where the dispute information is available (most of the recent cases are still in the first phase of the complaint process, so their dispute outcome is not known yet).
I also have to convert the dates to datetime format, since they are stored as strings in the file.
file_path = 'complaints.csv'
consumer_data = pd.read_csv(file_path, error_bad_lines=False, index_col=False, dtype='unicode')
consumer_data.columns = consumer_data.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
consumer_data = consumer_data.drop_duplicates(consumer_data.columns, keep='last')
# Parse the string dates into datetime columns
consumer_data['date_received'] = pd.to_datetime(consumer_data['date_received'], format='%Y-%m-%d')
consumer_data['date_sent_to_company'] = pd.to_datetime(consumer_data['date_sent_to_company'], format='%Y-%m-%d')
# Keep only the rows where the dispute outcome is recorded
dispute_data = consumer_data[consumer_data['consumer_disputed'].isin(['Yes', 'No'])]
#print(dispute_data.columns, len(dispute_data.columns))
#dispute_data.head(5)
Reviewing the data columns and shape, I can see that the dataset consists of 18 columns with information about the complaint dates, the product and issue involved, the company, the consumer's location, the complaint and response texts, and how the complaint was handled.
The data is imbalanced: about 80% of the complaints did not end in a dispute.
dispute_data.groupby('consumer_disputed').size()/len(dispute_data)
We have two date values in the dataset: date_received and date_sent_to_company.
The number of complaints that ended in disputes doesn't seem related to either of these dates. The distributions of disputed and non-disputed complaints are almost uniform over time, except for a peak of non-disputed complaints at the beginning of 2017.
received = (dispute_data.groupby(['consumer_disputed','date_received']).size()).rename('Complaints received by bureau').reset_index()
sent = (dispute_data.groupby(['consumer_disputed','date_sent_to_company']).size()).rename('Complaints sent to company').reset_index()
fig, axes = plt.subplots(2,figsize=(16,7))
sns.lineplot(x="date_received", y="Complaints received by bureau", hue='consumer_disputed', linewidth=0.5, data=received, ax=axes[0], marker='o')
sns.lineplot(x="date_sent_to_company", y="Complaints sent to company",hue='consumer_disputed', linewidth=0.5, data=sent, ax=axes[1], marker='o')
plt.show()
(received[received['consumer_disputed']=='Yes'].set_index('date_received').groupby(pd.Grouper(freq='A')).size()/received.set_index('date_received').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index()
(sent[sent['consumer_disputed']=='Yes'].set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()/sent.set_index('date_sent_to_company').groupby(pd.Grouper(freq='A')).size()*100).rename('% Complaints ended in Disputes per Year').reset_index()
But if I plot the time difference (date_sent_to_company - date_received, in days) against the percentage of complaints that ended in disputes, I can see that as the time difference increases, the probability of a dispute increases.
dispute_data['time_difference'] = (dispute_data['date_sent_to_company'] - dispute_data['date_received']).astype('timedelta64[D]')
timediff = ((dispute_data[dispute_data['consumer_disputed']=='Yes'].groupby(['time_difference']).size()/dispute_data.groupby(['time_difference']).size())*100).rename('% Complaints ended Disputes').reset_index()
plt.subplots(figsize=(10,6))
s = sns.scatterplot(x="time_difference", y="% Complaints ended Disputes", linewidth=0.5, data=timediff)
s.set(xlim=(0.7,1000),ylim=(0,103))
s.set(xscale="log")
s.set_title('% Complaints ended Disputes')
XGBoost only handles numerical values, so for the categorical variables I have to create dummy variables, but I want to be selective and include only relevant features.
ZIP code and state are correlated, so I decided to work with states. The createcolumn function calculates the dispute rate for each value of a column and returns the N values with the highest dispute rate and the N values with the lowest.
For example, applying createcolumn to the state column with N=4 returns the 4 states with the highest dispute rate and the 4 states with the lowest, as a dataframe of dummy variables for those states.
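The actual implementation lives in dispute_functions.py; a minimal sketch of how such a helper could work (my assumption, the real function may differ in details) is:
def createcolumn_sketch(data, target, positive, column, n):
    # Dispute rate for each value of `column`: share of rows where `target` equals `positive`.
    rates = (data[target].str.lower() == positive).groupby(data[column]).mean()
    # Keep the n values with the highest and the n values with the lowest dispute rate.
    selected = pd.concat([rates.nlargest(n), rates.nsmallest(n)]).index
    # Dummy variables only for the selected values; all other rows get all-zero dummies.
    return pd.get_dummies(data[column].where(data[column].isin(selected)), prefix=column).reset_index(drop=True)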
states_df = createcolumn(dispute_data,'consumer_disputed','yes','state',4)
There are only 12 values in the product column, so I've decided to use all of them in the model. But sub_product has 49 different values, so I selected just 20, and for issue and sub_issue I also selected only the values with the highest and lowest dispute rates to feed my model.
dispute_data.groupby('product').size()
print(len(dispute_data['sub_product'].unique()))
print(len(dispute_data['issue'].unique()))
print(len(dispute_data['sub_issue'].unique()))
subproduct_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_product',10)
subproduct_df.head()
(dispute_data.groupby('issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
issue_df = createcolumn(dispute_data,'consumer_disputed','yes','issue',10)
(dispute_data.groupby('sub_issue').size()/len(dispute_data)*100).sort_values(ascending=False).head(12)
subissue_df = createcolumn(dispute_data,'consumer_disputed','yes','sub_issue',10)
There are 2,231 different companies in the dataset, so I decided on a different approach for them. I create size categories: if a company has fewer than 50 complaints I label it 'Unique', between 50 and 999 complaints it is a 'Small' company, between 1,000 and 2,999 complaints it is a 'Medium' company, and a company with 3,000 or more complaints becomes a category of its own.
Doing this I ended up with 9 categories, 6 of them corresponding to the largest financial institutions, and I created dummy variables for all of them.
df = dispute_data.groupby('company').size().rename('companysize').reset_index()
df
# Bucket companies by complaint volume; the largest ones keep their own name (next line)
df['companycode'] = ['Unique' if x < 50 else 'Small' if x < 1000 else 'Medium' if x < 3000 else 'company' for x in df['companysize']]
df.loc[df.companycode == 'company', 'companycode'] = df['company']
companycode_dum = pd.get_dummies(pd.DataFrame(df['companycode']))
companycode_dum.head(10)
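Note that companycode_dum above has one row per company; before it can be concatenated with the complaint-level features later on, it needs to be aligned to one row per complaint. A possible sketch of that alignment step (my assumption, since the exact mapping isn't shown here):
# Hypothetical alignment step: map each complaint row to its company's size category,
# then one-hot encode so the result has one row per complaint.
size_map = df.set_index('company')['companycode']
companycode_dum = pd.get_dummies(dispute_data['company'].map(size_map)).reset_index(drop=True)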
There are other categorical columns with only a few distinct values each, so I decided to include all of them in the model:
I created a dataframe with all the dummy variables for these categories.
print('Products... ', dispute_data['product'].unique())
print('Response to consumers ', dispute_data['company_response_to_consumer'].unique())
print('Consumer consent provided ', dispute_data['consumer_consent_provided'].unique())
print('Submitted via: ', dispute_data['submitted_via'].unique())
dum = pd.get_dummies(dispute_data[['product','consumer_consent_provided','company_response_to_consumer','submitted_via']])
dum.head()
dispute_data['disputed'] = [1 if x =='Yes' else 0 for x in dispute_data['consumer_disputed']]
dispute_data['timely_response2'] = [1 if x =='Yes' else 0 for x in dispute_data['timely_response']]
I have two text columns: the consumer complaint narrative (the consumer's description of the complaint) and the company's public response.
The first thing I did was clean the text: the cleaning function removes non-alphabetic characters, stop words, and numbers, and returns the text in lowercase. Then I use the wordfrequencyplot function to visualize the most frequent words when the complaint ends (or doesn't end) in a dispute.
For both the consumer narrative and the company response, the most frequent words were the same with and without disputes, so I decided to take a different approach and use sentiment analysis.
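Both helpers live in dispute_functions.py; as a rough illustration of the cleaning step described above (an assumed sketch, not the actual implementation):
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once
def cleaning_sketch(data, column):
    # Lowercase, keep alphabetic tokens only (drops numbers and punctuation), remove stop words.
    stops = set(stopwords.words('english'))
    tokens = data[column].fillna('').str.lower().str.findall(r'[a-z]+')
    return tokens.apply(lambda words: ' '.join(w for w in words if w not in stops))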
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'company_public_response'),'Frequency for public response when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'company_public_response'),'Frequency for public response when consumer didnt disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='Yes'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer disputed')
wordfrequecyplot(cleaning(dispute_data[dispute_data['consumer_disputed']=='No'],'consumer_complaint_narrative'),'Frequency for complaint narrative when consumer didnt disputed')
I use TextBlob for sentiment analysis. TextBlob is an open-source Python library for textual analysis that is widely used in natural language processing and understanding.
There are two things we can measure: polarity and subjectivity.
Polarity captures the emotion expressed by the author in the text. It is a float ranging from -1.0 to +1.0: values below 0 are negative, 0 is neutral, and values above 0 are positive; the closer the value is to +1 (or -1), the stronger the positive (or negative) sentiment.
Subjectivity tells us whether a sentence is subjective or objective, with values ranging from 0.0 to +1.0. Subjective sentences are based on personal opinions, responses, and beliefs, whereas objective sentences are based on factual information.
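As a quick illustration of what TextBlob returns (the analysis helper in dispute_functions.py presumably wraps something like this):
from textblob import TextBlob
blob = TextBlob("The bank never responded and I am extremely frustrated.")
print(blob.sentiment.polarity)       # float in [-1.0, 1.0]
print(blob.sentiment.subjectivity)   # float in [0.0, 1.0]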
I created a dataframe with all these values and added it to the variables for my model.
cleaned_response = cleaning(dispute_data,'company_public_response')
sentiment_response_df = analysis(cleaned_response, 'polarity_response_company', 'subjectivity_response_company')
sentiment_response_df = pd.concat([dispute_data['disputed'],sentiment_response_df], axis=1, sort=False)
sentiment_response_df
sentiment_response_d = sentiment_response_df.dropna()
ax = sentiment_response_d[sentiment_response_d['disputed']==1][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_response_d[sentiment_response_d['disputed']==0][['polarity_response_company','subjectivity_response_company']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
cleaned_complaints = cleaning(dispute_data,'consumer_complaint_narrative')
sentiment_complaint_df = analysis(cleaned_complaints, 'polarity_complaint', 'subjectivity_complaint')
sentiment_complaint_df = pd.concat([dispute_data['disputed'],sentiment_complaint_df], axis=1, sort=False)
sentiment_complaint_df
sentiment_complaint_d = sentiment_complaint_df.dropna()
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==1][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set_title('When consumer disputed')
ax.set(xlim=(-1,1),ylim=(0,10))
ax = sentiment_complaint_d[sentiment_complaint_d['disputed']==0][['polarity_complaint','subjectivity_complaint']].plot.kde()
ax.set(xlim=(-1,1),ylim=(0,10))
ax.set_title('When consumer didnt dispute')
narrative = pd.concat([sentiment_response_df,sentiment_complaint_df], axis=1, sort=False)
narrative.head()
narrativedf = narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]
XGBoost is an implementation of gradient boosting machines (GBM) with major improvements.
GBM is a supervised learning algorithm: an ensemble of weak learners is built, where misclassified records are given greater weight ('boosted') so that later models predict them correctly, and the weak learners are then combined into a single strong learner.
GBMs build trees sequentially; XGBoost also adds trees one after another, but it parallelizes the construction of each tree, which makes it faster.
Now I will put together all the features for the model:
- states_df
- subproduct_df
- issue_df
- subissue_df
- companycode_dum
- dum
- narrative[['polarity_response_company','subjectivity_response_company','polarity_complaint','subjectivity_complaint']]
- dispute_data[['time_difference', 'timely_response2']]
all_df = [states_df, companycode_dum, subproduct_df, issue_df, subissue_df, narrativedf.reset_index(drop=True)]
X = pd.concat(all_df, axis=1, sort=False)
X = pd.concat([X, dum.reset_index(drop=True)], axis=1, sort=False)
fromdf = dispute_data[['time_difference', 'timely_response2']].reset_index(drop=True)
X = pd.concat([X, fromdf], axis=1, sort=False)
Y = dispute_data['disputed']
X.head()
The XGBoost algorithm has many hyperparameters, and tuning them is necessary to improve the model. I use GridSearchCV from scikit-learn to tune it, tracking two scoring metrics: accuracy and recall.
My goal is to maximize recall while giving up as little accuracy as possible.
The first parameter to tune is scale_pos_weight, which controls the balance of positive and negative weights and is useful for imbalanced classes.
The usual formula for it is sum(negative instances) / sum(positive instances), so I know the value should be around 4 for this dataset, but I'm tuning it to find the value that best satisfies my two metrics.
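As a quick check (a one-line illustration, not part of the tuning itself), the ratio can be computed directly from the target defined above; with roughly 80% non-disputes it comes out close to 4:
# Negative-to-positive ratio, the usual starting point for scale_pos_weight.
print((Y == 0).sum() / (Y == 1).sum())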
Grid search says the best value is 5, but that is because the refit metric is recall and 5 gives the best recall: when working with multiple metrics, GridSearchCV requires you to pick one of them for refitting and prioritizes it.
In my case, the validation plot shows that the value that maximizes recall while minimizing the loss in accuracy is around 3.85, so that is my chosen value for scale_pos_weight.
from sklearn.metrics import make_scorer, recall_score, accuracy_score
scoring_evals = {'Recall': make_scorer(recall_score), 'Accuracy': make_scorer(accuracy_score)}
param_test = { 'scale_pos_weight':[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5]}
gsearch = GridSearchCV(estimator =XGBClassifier(n_estimators=200, learning_rate= 0.15, gamma=0, subsample=0.8,
max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_ )
plot_grid_search_validation_curve(gsearch,[3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5], 'weight', title='Validation Curve', ylim=None,
xlim=None, log=None)
n_estimators is the number of gradient-boosted trees, which is equivalent to the number of boosting rounds.
To tune it, I fix all the other parameters and search over the number of estimators. Again, grid search says the optimum value is 10, since that maximizes recall, but the validation curve shows that the best trade-off is around n_estimators = 70.
param_test = { 'n_estimators':[10,50,100,500,1000]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
max_depth=3, min_child_weight = 1, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall',n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_ )
#gsearch.cv_results_
plot_grid_search_validation_curve(gsearch,[10,50,100,500,1000], 'n_estimators', title='Validation Curve', ylim=(0.5,0.85),
xlim=None, log=True)
max_depth is the maximum depth of a boosted tree. Increasing this value makes the model more complex and more likely to overfit.
min_child_weight defines the minimum sum of instance weights required in a child node. It controls overfitting: higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.
param_test = { 'max_depth':range(0,11,3), 'min_child_weight':[0.2,1,2.5,5]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, subsample=0.8,
n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic',
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
gsearch.best_params_
plot_grid_search_3d_validation(gsearch, 'max_depth', 'min_child_weight', log1=None, log2=None)
Gamma specifies the minimum loss reduction required to make a split.
param_test = {'gamma':[i/10.0 for i in range(0,7,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, subsample=0.8,
n_estimators=50, colsample_bytree=0.8, objective= 'binary:logistic',max_depth=3, min_child_weight = 5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[i/10.0 for i in range(0,7,2)], 'gamma', title='Validation Curve', ylim=(0.5,0.8),
xlim=None, log=True)
subsample is the fraction of observations randomly sampled for each tree; for example, setting it to 0.5 means that XGBoost randomly samples half of the training data before growing each tree. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
colsample_bytree is the fraction of columns randomly sampled for each tree.
param_test = { 'subsample':[i/10.0 for i in range(3,11,2)], 'colsample_bytree':[i/10.0 for i in range(3,11,2)]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0,
n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight = 5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_3d_validation(gsearch, 'subsample', 'colsample_bytree', log1=None, log2=None)
reg_alpha is the Lasso (L1) regularization term on the weights. Increasing it makes the model more conservative (more regularization, i.e. a simpler model).
param_test = {'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]}
gsearch = GridSearchCV(estimator =XGBClassifier(learning_rate= 0.15, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
n_estimators=50, objective= 'binary:logistic',max_depth=3, min_child_weight =5,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_validation_curve(gsearch,[0, 0.001, 0.005, 0.01, 0.05], 'reg_alpha', title='Validation Curve', ylim=(0.0,0.8),
xlim=None, log=True)
A smaller learning rate allows the model to learn a more optimal set of weights, but it's necessary to increase n_estimators to make sure the model reaches that optimum.
param_test = {'learning_rate':[0.0005,0.001, 0.005, 0.01, 0.05, 0.1, 0.3], 'n_estimators':[10,50,100,500,1000]}
gsearch = GridSearchCV(estimator =XGBClassifier(scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
nthread=4, seed=27), param_grid = param_test, scoring=scoring_evals, refit='Recall', n_jobs=4, iid=False, cv=5)
gsearch.fit(X,Y)
print(gsearch.best_params_)
plot_grid_search_3d_validation(gsearch, 'learning_rate', 'n_estimators', log1=None, log2=None)
xgbfinal = XGBClassifier(learning_rate=0.01, scale_pos_weight=3.8, gamma=0, colsample_bytree=0.9, subsample=0.5,
n_estimators=100, objective= 'binary:logistic',max_depth=3, min_child_weight = 5, reg_alpha=0,
nthread=4, seed=27)
modelfit(xgbfinal, X, Y)