from IPython.display import Video
Video("test.mp4",width=1024, height=576)
In recent years, influencer marketing has become a key strategy for brands. Brands partner with prominent social media influencers to create sponsored content that resonates with the brand's target audience. But brands are only interested in working with influencers who have active and responsive audiences: the higher your engagement metrics, the better your chances of getting a business to sponsor you.
For Instagram, the main engagement metrics are likes and comments; they show how interested followers are in the influencer's content. So I thought it would be useful to have a tool that predicts these metrics based on the content of an Instagram post.
My goal is to use a regression model to predict likes and comments and to understand how the data in an Instagram post affects them. The data includes followers, followings, media count (number of posts in the feed), caption, kind of picture (selfie, outdoor, body snap, editorial image), and content in the photo (smiles, faces, products, logos).
With these models and the insights obtained, I built SMARTGRAM using Streamlit. You can see the demo above or try it yourself here, and the code is here.
My analysis consists of the following sections:
I obtained the data from the project https://arxiv.org/abs/1704.04137. The authors collected 24,752 Instagram posts from 13,350 users through Instagram's API over a one-month period in January 2015. In all the posts, renowned fashion brands are named in the hashtags.
The data includes:
I'm using Python for this project. The nltk package (the Natural Language Toolkit) provides tools for working with text data, and scikit-learn provides the machine learning models.
I downloaded the data into a pandas dataframe. I will be working with the hashtags and captions, but I'm keeping the columns for followers, likes and comments because I want to compare the different engagement metrics across brands, brand categories and clusters.
My first step was a simple formatting pass (removing non-alphabetical characters and lowercasing) over the column names and over the columns that contain the brand names.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.neighbors import LocalOutlierFactor
import re
import nltk
from nltk.corpus import stopwords
import en_core_web_sm
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
import joblib
from joblib import dump
import warnings
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, make_scorer,mean_squared_error
from math import sqrt
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
#print('Loading words, spacy, punktd, stopwords')
nltk.download('words')
nlp = en_core_web_sm.load()
words = set(nltk.corpus.words.words())
nltk.download('stopwords')
#nltk.download('punkt')
print('done, now loading text and basic formatting of column names')
# Read dataset and format texts
df = pd.read_csv(r'fashion data on instagram.csv', index_col=0)#.sample(frac=0.25)
df = df.loc[~df.index.duplicated(keep='first')]
df = df[~df['Caption'].isnull()]
#Formating column names, brand categories names and brand names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
df.brandname = df.brandname.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
df.brandcategory = df.brandcategory.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('?', '')
The Local Outlier Factor identifies outliers by finding samples with a lower density of neighbors. The density is estimated from the distances to a sample's k-nearest neighbors, so the score measures the local deviation in density of a given sample with respect to its neighbors. The variables I decided to use to identify outliers are likes, comments, followings and followers: apart from having low numbers of likes and comments, accounts with fake followers tend to have a large number of followers, or sometimes just follow other accounts to gain followers.
def filterbyoutlier(df, vector):
    """Score rows with LocalOutlierFactor and keep only those with a mild outlier score."""
    clfhash = LocalOutlierFactor(n_neighbors=50)
    clfhash.fit_predict(vector)
    df['outlier_factor_hash'] = clfhash.negative_outlier_factor_
    return df[df['outlier_factor_hash'] > -5]
print(df.shape)
ndf = filterbyoutlier(df,df[['likes','comments','followings', 'followers']])
print(ndf.shape)
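As a quick sanity check of how these scores behave, here is a minimal sketch on made-up toy points (the values and the small n_neighbors are purely illustrative; the real filtering above uses n_neighbors=50 and a cutoff of -5):
toy = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2], [8.0, 8.0]])
toy_lof = LocalOutlierFactor(n_neighbors=2)
toy_lof.fit_predict(toy)
print(toy_lof.negative_outlier_factor_)  # the isolated point gets a clearly more negative score than the clustered ones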
targets = ndf[['likes','comments']].copy(deep=True)
text_df = ndf[['caption']].copy(deep=True)
numerical = ndf[['followings', 'followers','mediacount','selfie','bodysnap','marketing','productonly','nonfashion','face','logo','brandlogo','smile','outdoor','numberofpeople','numberoffashionproduct']].copy(deep=True).round(0)
The data set also includes learned variables, such as the emotion features the authors obtained in their results.
corrmat = ndf.round(0).corr()
sns.heatmap(corrmat, xticklabels=corrmat.columns, yticklabels=corrmat.columns)
numerical.info()
sns.pairplot(ndf[['likes','comments','followings','followers','mediacount']])
ndf[['likes','comments','followings','followers','mediacount']].describe()
I'm taking the captions and creating a text vector that I will later combine with my numerical data for the model predictions. The first step is the text processing of the captions, including hashtags. The cleaning function takes all the entries in a column, removes the non-alphabetical characters, stopwords and leftover encoded words (unicode-escape fragments from emojis), and returns the remaining text in lowercase.
Then, from the cleaned text, I create the text vectors with the TfidfVectorizer. The tokenize function is called inside the vectorizer; it takes the text entries, extracts the words and reduces them to their roots.
I got 188 words for the vectors. I'm creating a data frame to save the vector for each caption, and I will later concatenate it with the numerical variables to build the predictors dataframe.
def cleaning(frame, col):
    """
    Clean the text from a column in a data frame.
    Removes non-alphabetic characters, stop words and numerical characters and returns the text in lowercase.
    Parameters:
        frame: data frame
        col: name of the text column
    Returns:
        Series with the cleaned text from the column
    """
    newframe = frame.copy()
    punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '#', '%']
    stop_words = text.ENGLISH_STOP_WORDS.union(punc)
    stop_words = list(stop_words)
    # Drop digits, non-word characters and very short words, and lowercase everything
    newframe[col] = newframe[col].str.replace(r'\d+', '', regex=True).str.replace(r'\W', ' ', regex=True).str.lower().str.replace(r'\b(\w{1,3})\b', '', regex=True)
    newframe[col] = [' '.join([w for w in x.split() if w not in stop_words]) for x in newframe[col].tolist()]
    # Remove leftover unicode-escape fragments (e.g. from emojis) that start with 'uf', 'ue' or 'u0'
    newframe['Cleantext'] = [' '.join(word for word in x.split() if not word.startswith(('uf', 'ue', 'u0'))) for x in newframe[col].tolist()]
    return newframe['Cleantext']
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')
def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text)]
vectorizing = TfidfVectorizer(sublinear_tf=True, min_df=300, norm='l2',
ngram_range=(1, 1), tokenizer = tokenize)
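To see what the tokenizer feeds into the vectorizer, here is a quick check on a made-up caption (the sentence is just an example, not from the data set):
print(tokenize('loving these new boots and shoes from the runway'))  # the stemmer reduces e.g. 'boots' to 'boot' and 'loving' to 'love'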
clean_captions = cleaning(text_df,'caption')
%time clean_captions_vector = vectorizing.fit_transform(clean_captions.values)
names= vectorizing.get_feature_names()
textdf = pd.DataFrame(columns=["{}{}".format('txt_',i) for i in names])
joblib.dump(textdf, 'textdf.pkl')
text_vectors = pd.DataFrame(clean_captions_vector.toarray(), columns=["{}{}".format('txt_', i) for i in names])
text_vectors.head()
X = pd.concat([numerical, text_vectors.set_index(numerical.index)], axis=1, sort=False)
Y = targets['likes']
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
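# Grid search over the ridge regularization strength (alpha) for the likes model, scored by MSE with 5-fold CV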
scoring_evals = {'MSE': make_scorer(mean_squared_error)}
param_test = { 'alpha':[0.5, 0.1, 0.01, 0.001, 0.0001]}
gsearch = GridSearchCV(estimator = Ridge(), param_grid = param_test, scoring=scoring_evals, refit='MSE', n_jobs=4, iid=False, cv=5)
gsearch.fit(X_train,y_train)
#print(gsearch.cv_results_)
print(gsearch.best_params_ )
likesmodel=Ridge(alpha=0.0001)
likesmodel.fit(X_train,y_train)
pred_test_rr = likesmodel.predict(X_test)
pred_test_rr = pred_test_rr.clip(min=0)
print(np.sqrt(mean_squared_error(y_test,pred_test_rr)))
print(r2_score(y_test, pred_test_rr))
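Beyond RMSE and R², a quick predicted-vs-actual scatter gives a visual feel for the fit; this is a minimal sketch using the matplotlib import from above:
plt.scatter(y_test, pred_test_rr, alpha=0.3)
plt.xlabel('actual likes')
plt.ylabel('predicted likes')
plt.show()
The same grid search and evaluation are then repeated for the comments target below.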
Y1 = targets['comments']
X_train, X_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)
scoring_evals = {'MSE': make_scorer(mean_squared_error)}
param_test = { 'alpha':[0.5, 0.1, 0.01, 0.001, 0.0001]}
gsearch = GridSearchCV(estimator = Ridge(), param_grid = param_test, scoring=scoring_evals, refit='MSE', n_jobs=4, iid=False, cv=5)
gsearch.fit(X_train,y1_train)
#print(gsearch.cv_results_)
print(gsearch.best_params_ )
commentmodel=Ridge(alpha=0.0001)
commentmodel.fit(X_train,y1_train)
pred_test_rr = commentmodel.predict(X_test)
pred_test_rr = pred_test_rr.clip(min=0)
print(np.sqrt(mean_squared_error(y1_test,pred_test_rr)))
print(r2_score(y1_test, pred_test_rr))
Saving the models for the app
joblib.dump(vectorizing, 'vectorizer.pkl')
joblib.dump(likesmodel, 'likesmodel.pkl')
joblib.dump(commentmodel, 'commentmodel.pkl')
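For reference, this is roughly how the app side could load the saved artifacts and score a new post. The caption and the numerical values below are made-up inputs for illustration; the feature row is built in the same column order used for training:
vectorizer = joblib.load('vectorizer.pkl')
likes_estimator = joblib.load('likesmodel.pkl')
comments_estimator = joblib.load('commentmodel.pkl')
template = joblib.load('textdf.pkl')  # empty frame that remembers the text column names
new_caption = cleaning(pd.DataFrame({'caption': ['new season #fashion look with my favorite boots']}), 'caption')
caption_vector = pd.DataFrame(vectorizer.transform(new_caption.values).toarray(), columns=template.columns)
new_numerical = pd.DataFrame([{'followings': 500, 'followers': 12000, 'mediacount': 340,
                               'selfie': 0, 'bodysnap': 1, 'marketing': 0, 'productonly': 0,
                               'nonfashion': 0, 'face': 1, 'logo': 0, 'brandlogo': 0, 'smile': 1,
                               'outdoor': 1, 'numberofpeople': 1, 'numberoffashionproduct': 2}])
new_X = pd.concat([new_numerical, caption_vector], axis=1)
print(likes_estimator.predict(new_X).clip(min=0))
print(comments_estimator.predict(new_X).clip(min=0))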
featureslist = X.columns.tolist()
importance = commentmodel.coef_
comments_features = {featureslist[i]: importance[i] for i in range(len(importance))}
sorted_comments_features = sorted(comments_features, key=comments_features.get, reverse=True)
print(sorted_comments_features[0:20])
notext = [feature for feature in sorted_comments_features if 'txt' not in feature]
for key in notext:
    print(key, comments_features[key])
importance = likesmodel.coef_
likes_features = {featureslist[i]: importance[i] for i in range(len(importance))}
sorted_likes_features= sorted(likes_features, key=likes_features.get, reverse=True)
print(sorted_likes_features[0:20])
notext = [feature for feature in sorted_likes_features if 'txt' not in feature]
for key in notext:
    print(key, likes_features[key])
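The coefficients are easier to compare in a chart than in the raw prints; a minimal sketch reusing the lists built above:
notext_values = [likes_features[key] for key in notext]
plt.barh(notext, notext_values)
plt.xlabel('ridge coefficient (likes model)')
plt.tight_layout()
plt.show()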