What Do Your Hero Preferences in DotA 2 Tell About Your Personality?

enes · 9 min read · Dec 7, 2020

As an old-school DotA player, I have always wondered whether there is any psychological motivation behind hero picks: not just tactical picking, but especially the picks of players who don't play competitively. Ideally, I would have preferred data on players who recently started playing Dota 2, because past a certain threshold players pick heroes to win, so the most favored heroes shift with gameplay updates. So don't take this analysis too seriously; it's just for fun.

After a random search on Kaggle, I found a dataset collected in 2017: https://www.kaggle.com/definitelyliliput/rawscores. Although the data include both a Big Five personality scale and a game motivation scale, I used only the Big Five as independent variables. Because of the "tactical picking" issue I mentioned above, I picked the variables tagged as "favorite". (For instance, I analyzed "most_preferred_hero" but not "most_played_hero".)

Now let's start the analysis and find out whether personality traits can predict hero-picking behavior in DotA 2.

Before we start, note that our data is raw. That means we have to do some calculations to get the personality scores. The Big Five Inventory (BFI), our personality measurement, is a self-report scale that is widely used among psychology researchers. The BFI yields five primary subscales: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. You can access the full questionnaire here: https://dionysus.psych.wisc.edu:5001/sharing/JTrZWyXGw

Now let's switch to R and begin the analysis. First, let's recode the reverse-scored items.

#our data set:
df_dota2<-read.csv("Dota2.csv",header=T,sep=",")
#recoding
library(car)
df_dota2$B6<-Recode(df_dota2$B6,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B21<-Recode(df_dota2$B21,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B31<-Recode(df_dota2$B31,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B2<-Recode(df_dota2$B2,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B12<-Recode(df_dota2$B12,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B27<-Recode(df_dota2$B27,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B37<-Recode(df_dota2$B37,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B8<-Recode(df_dota2$B8,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B18<-Recode(df_dota2$B18,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B23<-Recode(df_dota2$B23,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B43<-Recode(df_dota2$B43,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B9<-Recode(df_dota2$B9,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B24<-Recode(df_dota2$B24,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B34<-Recode(df_dota2$B34,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B35<-Recode(df_dota2$B35,"1=5;2=4;3=3;4=2;5=1")
df_dota2$B41<-Recode(df_dota2$B41,"1=5;2=4;3=3;4=2;5=1")

The Recode function, which comes with the car package, is handy for basic recoding. Although it makes the script look long, you only need to paste in the variables you want to recode. You can also recode with a loop instead of Recode, just for fun, as in the sketch below.
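For instance, here is a minimal loop-based sketch of the same recoding (run it instead of the Recode block above, not in addition to it, or the items get reversed twice; on a 1–5 scale, 6 minus the response flips the score):

#loop-based alternative to the Recode calls above
reverse_items<-c("B2","B6","B8","B9","B12","B18","B21","B23",
                 "B24","B27","B31","B34","B35","B37","B41","B43")
for(item in reverse_items){
  df_dota2[[item]]<-6-df_dota2[[item]] #1<->5, 2<->4, 3 stays 3
}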

After the recoding, we can calculate the subscales. I did this easily with the tidyverse!

library(tidyverse)
#Extraversion Subscale
df_dota2<-df_dota2 %>%
  rowwise() %>%
  mutate(Extraver = sum(B1,B6,B11,B16,B21,B26,B31,B36))
#Agreeableness Subscale
df_dota2<-df_dota2 %>%
  rowwise() %>%
  mutate(Aggre = sum(B2,B7,B12,B17,B22,B27,B32,B37,B42))
#Conscientiousness Subscale
df_dota2<-df_dota2 %>%
  rowwise() %>%
  mutate(Conscient = sum(B3,B8,B13,B18,B23,B28,B33,B38,B43))
#Neuroticism Subscale
df_dota2<-df_dota2 %>%
  rowwise() %>%
  mutate(Neurotic = sum(B4,B9,B14,B19,B24,B29,B34,B39))
#Openness Subscale
df_dota2<-df_dota2 %>%
  rowwise() %>%
  mutate(Open = sum(B5,B10,B15,B20,B25,B30,B35,B40,B41,B44))
#delete the ID column (the "ï.." prefix comes from a UTF-8 BOM in the csv)
df_dota2$ï..ID<-NULL

Now let’s see what happened to our data:

names(df_dota2)
Output of names(df_dota2)

As you can see above, we now have many variables that we no longer need, so we can select only the ones we want.

df_dota3 <- df_dota2 %>%
  select(age,gender,most_preferred,
         most_played,
         most_preferred_hero1,
         most_preferred_hero2,
         most_preferred_hero3,
         Extraver,Aggre,Conscient,Open,Neurotic)

Dealing With Missing Values

The mice package is useful for handling missing values. mice provides a function called "md.pattern" for inspecting missing-value patterns, so you can both visualize and check your missing data. However, I much prefer the VIM package for the visualization.

library(mice)
md.pattern(df_dota3)
#visualization (aggr comes from the VIM package, not VIF):
library(VIM)
missings <- aggr(df_dota3, col=c('green','orange'), numbers=TRUE, sortVars=TRUE, labels=names(df_dota3), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Output of md.pattern(df_dota3): the patterns of missing values

The output tells us that 916 rows are complete, while 90 rows have missing values in each of the subscales; in total, 540 cells are missing. And the graph tells us that 91% of our data has no missing values. But if you look closely, you can see the missing values are equally distributed: it's the same 90 rows that are missing entirely. So mice will not do any imputation for this data, since those 90 rows are completely missing.

#imputation
imputed_dota<-mice(df_dota3,m=1)
imputed_dota<-complete(imputed_dota,1)
#now let's see whether mice did any imputation or not:
md.pattern(imputed_dota)

So we can freely call the na.omit function to remove all rows with NA values from our data.

imputed_dota<-na.omit(imputed_dota)

Finally, we got rid of those missing values.

Dealing With Outliers Using Mahalanobis Distance

Normally I deal with outliers using the boxplot method, and the rstatix package fixes them very easily. But for me, Mahalanobis distance is a much more comfortable method when the degrees of freedom are > 3. It is also useful for catching multivariate outliers, which helps avoid multicollinearity issues when we fit a model with 3 or more independent variables. Now let's get started:

There are various methods to get outliers with Mahalanobis Distance. I always get indexes of outliers and remove them from the main data frame.

#let's select all the independent variables
mahl<-na.omit(imputed_dota[c("age","Extraver","Aggre","Conscient","Open","Neurotic")])
#the column means give the center point of the ellipse
mahl.center<-colMeans(mahl)
#covariance matrix:
mahl.cov<-cov(mahl)
#now let's get the Mahalanobis distances
distance<-mahalanobis(mahl,center =mahl.center,cov = mahl.cov)

#chi-square cutoff with df = number of variables (6)
cutoff<-qchisq(p=0.95,df=6)
index<-which(distance>cutoff)
#remove these indexes from our main data
imputed_dota<-imputed_dota[-index,]

Now we can focus on the variables we are interested in: the most preferred heroes and the most preferred role.

First, let's see what's inside these variables. The distinct function will show us all the unique values.

imputed_dota%>%distinct(most_preferred)
imputed_dota%>%distinct(most_preferred_hero1)
The output of the distinct function

OK, so we have 2 major roles (support and core). The bad news is that our data contains every hero playable in DotA 2. That's bad because the sample size for each individual hero will probably be very low, so we have to select only the most preferred heroes to fit a robust model.

Let's start with the heroes. I'll select the most preferred heroes in this dataset: Invoker, Pudge, Anti-Mage, Dazzle, Crystal Maiden (cm), Juggernaut, and Ogre Magi.

#selecting the heroes that we wanted to analyze:
invoker<-imputed_dota%>%filter(most_preferred_hero1=="Invoker")
pudge<-imputed_dota%>%filter(most_preferred_hero1=="Pudge")
dazzle<-imputed_dota%>%filter(most_preferred_hero1=="Dazzle")
cm<-imputed_dota%>%filter(most_preferred_hero1=="Crystal Maiden")
anti<-imputed_dota%>%filter(most_preferred_hero1=="Anti-Mage")
jugg<-imputed_dota%>%filter(most_preferred_hero1=="Juggernaut")
ogre<-imputed_dota%>%filter(most_preferred_hero1=="Ogre Magi")
#or select all these heroes in one call; note that == recycles over a
#vector, so use %in% instead:
#filter(most_preferred_hero1 %in% c("Invoker","Pudge","Dazzle",...)) and so on...

Now let’s bind those variables:

anovadota<-rbind(invoker,pudge,dazzle,cm,anti,jugg,ogre)
#convert to factor
anovadota$most_preferred_hero1<-
  as.factor(anovadota$most_preferred_hero1)
#drop the now-empty factor levels
anovadota$most_preferred_hero1<-droplevels(anovadota$most_preferred_hero1)
table(anovadota$most_preferred_hero1)

Just as we suspected: since this variable contains 100+ heroes, it's no surprise to see such low frequencies.

Before the regression model, I would like to run a separate ANOVA for these heroes on each personality trait. I will use this code (swapping the trait) for each boxplot:

library(ggpubr) #for stat_compare_means
ggplot(anovadota,aes(most_preferred_hero1,Extraver,fill=most_preferred_hero1))+geom_boxplot()+stat_compare_means(method = "anova")

Extraversion

Let’s start with extraversion. The main effect of most_preferred_hero1 is not significant (F(6, 214) = 1.45, p = 0.195) and can be considered as small (partial omega squared = 0.01).

Agreeableness

For the agreeableness trait, the main effect is not significant (F(6, 214) = 1.33, p = 0.244) and can be considered as very small (partial omega squared = 8.95e-03).

Conscientiousness

The main effect of most_preferred_hero1 is not significant (F(6, 214) = 1.33, p = 0.245) and can be considered as very small (partial omega squared = 8.85e-03).

Openness

The ANOVA suggests that the main effect of most_preferred_hero1 is not significant (F(6, 214) = 1.42, p = 0.209) and can be considered as small (partial omega squared = 0.01).

Neuroticism

The main effect of most_preferred_hero1 is not significant (F(6, 214) = 1.92, p = 0.079) and can be considered as small (partial omega squared = 0.02).

These F tests show no significant relationship between hero preferences and any personality trait, so we don't need to run post-hoc tests.

Now let's test the most preferred roles: support and carry.

We are back to the imputed_dota data frame.

#I used this code for each boxplot (a,b,c,d,e respectively; the other four are sketched below)
a<-ggplot(imputed_dota,aes(most_preferred,Neurotic,fill=most_preferred))+geom_boxplot()+stat_compare_means(method = "t.test")
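The original post only shows plot a; the remaining four follow the same pattern with only the trait swapped. A sketch (which trait maps to which letter is my assumption):

#the other traits, same pattern, only the y variable changes:
b<-ggplot(imputed_dota,aes(most_preferred,Extraver,fill=most_preferred))+geom_boxplot()+stat_compare_means(method = "t.test")
c<-ggplot(imputed_dota,aes(most_preferred,Aggre,fill=most_preferred))+geom_boxplot()+stat_compare_means(method = "t.test")
d<-ggplot(imputed_dota,aes(most_preferred,Conscient,fill=most_preferred))+geom_boxplot()+stat_compare_means(method = "t.test")
e<-ggplot(imputed_dota,aes(most_preferred,Open,fill=most_preferred))+geom_boxplot()+stat_compare_means(method = "t.test")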

library(gridExtra)
#arrange all five boxplots in a grid
grid.arrange(a,b,c,d,e,ncol=3)
The Welch two-sample t-test suggests that the difference in Agreeableness by most_preferred is significant (difference = -1.41, 95% CI [-2.08, -0.74], t(782.98) = -4.11, p < .001).

Although there is a significant effect for agreeableness, the fairly wide 95% confidence interval suggests the estimate is not very precise. Still, from my Dota 2 experience, I think it may be real: if you play the support role in Dota, you have to be more communicative and talkative, whereas carry (or core) players are much more focused on their own play. In professional games, each team's captain usually plays the support role in order to organize the team's general behavior and reactions. Support players mostly play for their teams: they don't buy items for themselves, they buy items for the team and sacrifice themselves for the team's carry/core players. All of this is compatible with our results and with the agreeableness trait in the Big Five.

Multinomial Logistic Regression For The Heroes

Before we start, I would like to say that it's not a good idea to fit a multinomial logistic regression model on sample sizes as small as ours. But since I enjoy analyzing Dota 2 data (and I don't have a better job to do right now), I will try to make some predictions with a small number of observations.

Now let's start with the sampling process. Note that if we want to make predictions with our multinomial logistic regression model, the class counts of the dependent variable must be at the same level (or very close to each other). That's why we need to downsample.

indexanti<-sample(1:nrow(anti),size = 0.80*nrow(pudge))
indexpud<-sample(1:nrow(pudge),size = 0.80*nrow(pudge))
indexdazz<-sample(1:nrow(dazzle),size = 0.80*nrow(pudge))
indexcm<-sample(1:nrow(cm),size = 0.80*nrow(pudge))
indexjug<-sample(1:nrow(jugg),size = 0.80*nrow(pudge))
indexogr<-sample(1:nrow(ogre),size = 0.80*nrow(pudge))
indexinvo<-sample(1:nrow(invoker),size = 0.80*nrow(pudge))

Pudge has the fewest observations in this data. Since we can't increase the number of Pudge observations, we need to decrease the other heroes' counts to Pudge's level. Then we can create our train and test data frames.

invokertrain<-invoker[indexinvo,]
antitrain<-anti[indexanti,]
juggtrain<-jugg[indexjug,]
dazzletrain<-dazzle[indexdazz,]
ogretrain<-ogre[indexogr,]
cmtrain<-cm[indexcm,]
pudgetrain<-pudge[indexpud,]
#bind the rows:
trainset<-rbind(invokertrain,antitrain,juggtrain,dazzletrain,ogretrain,cmtrain,pudgetrain)

Now all of the levels in the train set are the same size.

#same processing for the test set
invotest<-invoker[-indexinvo,]
antitest<-anti[-indexanti,]
juggtest<-jugg[-indexjug,]
dazztest<-dazzle[-indexdazz,]
ogretest<-ogre[-indexogr,]
cmtest<-cm[-indexcm,]
pudgetest<-pudge[-indexpud,]
testset<-rbind(invotest,antitest,juggtest,dazztest,ogretest,cmtest,pudgetest)

Now we can create our model:

library(nnet)
library(e1071)
library(caret)
multinomi<-multinom(most_preferred_hero1~age+Aggre+Conscient+Extraver+Neurotic+Open,
                    data = trainset)

I put in all the independent variables first. Now let's check the variable importance; the varImp function automatically scales the importance scores.

caret::varImp(multinomi)
Ranking the variable importance

Based on the varImp results, I will fit another model without the "Conscient" variable (edit: whoops, I missed that "Extraver" is actually the lowest! sorry), so we can compare the two models. Before the summary results, let's quickly check the models with the Bayesian information criterion (BIC): the model with the lowest BIC value is assumed to be the best, as in the sketch below.
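The code for the second model isn't shown in the original post, but it is needed for the predictions later. A minimal sketch consistent with the text, assuming the second model simply drops Conscient from the formula:

#second model, without the Conscient variable
multinomi2<-multinom(most_preferred_hero1~age+Aggre+Extraver+Neurotic+Open,
                     data = trainset)
#compare the two models by BIC; lower is better
BIC(multinomi,multinomi2)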

As you can see, after removing the low-importance variable our model got better. Note that there are various other ways to compare models: we can also look at AIC and the residual deviances (see the sketch below). But I'll pass on those, since our data is already a bad fit for multinomial logistic regression.
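For completeness, a quick sketch of the AIC and residual-deviance checks mentioned above (multinom objects from nnet store their deviance directly):

#AIC comparison; lower is better, just like BIC
AIC(multinomi,multinomi2)
#residual deviances of the two models
multinomi$deviance
multinomi2$deviance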

Finally, let’s make some predictions with our models:

predmultinomi<-predict(multinomi,testset)
predmultinomi2<-predict(multinomi2,testset)
caret::confusionMatrix(predmultinomi,as.factor(testset$most_preferred_hero1))

Above we can see model 1's predictions. To be honest, this is the worst model I have ever fit. The Kappa value is 0.06; according to Cohen, Kappa ≤ 0 indicates agreement weaker than expected by chance, and 0.06 is barely above that.

caret::confusionMatrix(predmultinomi2,as.factor(testset$most_preferred_hero1))

In sum, besides the Kappa value and McNemar's p-value, the accuracy of our model, which shows how often the classifier is correct, is also very low. So making predictions with this data is simply not reliable at all.

Finally, let's create a decision tree. The tree gives us almost no new information (given its low statistical reliability), but it makes the probabilities easy to see and is much more readable.

library(rattle)
library(rpart)
modelgini<-rpart(most_preferred_hero1~age+Aggre+Conscient+Open+Neurotic+Extraver,data = testset,method = "class",
parms = list(split="gini"))
fancyRpartPlot(modelgini,cex=0.7)
Output of fancyRpartPlot(modelgini,cex=0.7)

No effect at all 😐

Our results show no remarkable relationship between hero preferences in Dota 2 and personality factors. But remember, our sample size for each hero was very small; maybe we should have focused on role preferences instead. Model-tuning steps might give somewhat better results, but these methods don't perform magic tricks. Even though we couldn't find any interesting results, analyzing a Dota 2 dataset was fun.

Also, players' most preferred heroes change with gameplay updates, which shift what is called the "meta". I mean, in 2017 Invoker may not have been a good hero for winning, so naturally he probably wasn't preferred by players while he was weak. Although we made this analysis for fun, to take it seriously we would have to control for the heroes most preferred by the entire Dota 2 community in 2017.
