Applying Deep Machine Learning for psycho-demographic profiling of Internet users using O.C.E.A.N. model of personality

Iaroslav Omelianenko

Research Director, NewGround LLC

yaric@newground.com.ua

March 3, 2017

Abstract

In the modern era, each Internet user leaves enormous amounts of auxiliary digital residuals (footprints) by using a variety of on-line services. All this data has been collected and stored for many years. Recent works have demonstrated that it is possible to apply simple machine learning methods to analyze collected digital footprints and to create psychological profiles of individuals. However, while these works clearly demonstrated the applicability of machine learning methods to such an analysis, the simple prediction models they created still lack the accuracy necessary to be successfully applied to practical needs. We assumed that using advanced deep machine learning methods may considerably increase the accuracy of predictions. We started with simple machine learning methods to estimate baseline prediction performance and moved further by applying advanced methods based on shallow and deep neural networks. Then we compared the predictive power of the studied models and drew conclusions about their performance. Finally, we made hypotheses about how prediction accuracy can be further improved. As a result of this work, we provide the full source code used in the experiments for all interested researchers and practitioners in the corresponding GitHub repository. We believe that applying deep machine learning to psychological profiling may have an enormous impact on society (for good or worse), and by providing the full source code of our research we hope to encourage further research by a wider circle of scholars.

1. Introduction

By using various on-line services, a modern Internet user leaves an enormous amount of digital tracks in the form of server logs, user-generated content, etc. All these information bits, meticulously saved by on-line service providers, create a vast amount of digital footprints for every Internet user. Recent research [Lambiotte, R., and Kosinski, M., 2014] demonstrated that by applying simple machine learning methods it is possible to find statistical correlations between digital footprints and the psycho-demographic profile of individuals. The considered psycho-demographic profile comprises psychometric scores based on the five-factor O.C.E.A.N. model of personality [Goldberg et al., 2006] and demographic scores such as Age, Gender, and Political Views. O.C.E.A.N. is an abbreviation for Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism.

In this work we decided to test whether advanced machine learning methods applied to the digital footprints of Internet users have enough predictive performance to create precise psycho-demographic profiles of personality. For our experiments we used a data corpus comprising psycho-demographic scores of individuals and their digital footprints in the form of Facebook likes. The data corpus was kindly provided by M. Kosinski through the corresponding web site: http://dataminingtutorial.com.

We started our experiments by building simple machine learning models based on linear/logistic regression methods as proposed by M. Kosinski in [Kosinski et al., 2016]. By training and executing these simple models we estimated the baseline predictive performance of machine learning methods on the available data set. Then we proceeded with experiments on advanced machine learning methods based on shallow and deep neural networks.

Applying Deep Machine Learning for psychological proﬁling using O.C.E.A.N. model of personality

The full source code of our experiments is provided in the form of a GitHub repository: https://github.com/NewGround-LLC/psistats

The source code is written in the R programming language [R Core Team, 2015], which is highly optimized for statistical data processing and allows applying advanced deep machine learning algorithms by bridging with Google Brain's TensorFlow framework [Google Brain Team, 2015].

This paper is organized as follows: In Section 2, we describe the data corpus and its pre-processing routines. It is followed in Section 3 by details about how to build and run simple prediction models based on linear and logistic regression, with the results of their execution. In Section 4, we describe how to create and execute advanced prediction models based on artificial neural networks. Finally, in Section 6 we compare the performance of the different machine learning methods studied in this work and draw conclusions about the predictive power of models built with the considered architectures.

2. Data Corpus Preparation

In this section, we consider the creation of the input data corpus from the publicly available data set and its pre-processing to simplify further analysis by machine learning algorithms.

2.1. Data Set Description

The data set kindly provided by M. Kosinski and used in this work contains psycho-demographic profiles of $n_u = 110\,728$ Facebook users and $n_L = 1\,580\,284$ associated Facebook likes. For simplicity and manageability, the sample is limited to U.S. users [Kosinski et al., 2016]. The following three files can be downloaded:

1. users.csv: contains psycho-demographic user profiles. It has $n_u = 110\,728$ rows (excluding the row holding column names) and nine columns: anonymized user ID, gender ("0" for male and "1" for female), age, political views ("0" for Democrat and "1" for Republican), and scores of the five-factor model of personality [Goldberg et al., 2006].

2. likes.csv: contains anonymized IDs and names of $n_L = 1\,580\,284$ Facebook Likes. It has two columns: ID and name.

3. users-likes.csv: contains the associations between users and their Likes, stored as user-Like pairs. It has $n_{uL} = 10\,612\,326$ rows and two columns: user ID and Like ID. The existence of a user-Like pair implies that a given user had the corresponding Like on their profile.

2.2. Data pre-processing

Data pre-processing is an important step in machine learning analysis: it significantly reduces the time needed for analysis and results in better predictive power of the created machine learning models.

First, we created a sparse matrix representing the relations between users and their Facebook likes. This matrix has an enormous number of feature dimensions: 1 580 284 feature columns (one per Like), holding the 10 612 326 user-Like pairs as non-zero entries. Thus, after construction, we trimmed rare data points and performed dimensionality reduction using singular value decomposition (SVD) [Golub, G. H., and Reinsch, C. 1970].

The necessary pre-processing steps may be summarized as follows:

1. Construction of the sparse users-likes matrix, which represents many-to-many relationships between users and their digital footprints in the form of collected Facebook likes. The constructed matrix is enormous and highly sparse, so it is appropriate to operate on it and store it in a sparse data format, which is optimized for this kind of data.

2. Trimming of the sparse users-likes matrix to exclude rare data points, which have minuscule significance.

3. Data imputation for missing values.

4. Dimensionality reduction by applying singular value decomposition (SVD) to reduce the enormous number of features in the generated data corpus.

5. Factor rotation analysis to simplify the scores obtained over the SVD dimensions after dimensionality reduction.

Construction of sparse users-likes matrix and matrix trimming

Before analysis, we built the sparse matrix encoding users-likes relationships from the provided comma-separated files. After that, the sparse matrix was trimmed by removing rare data points. As a result, a significantly reduced data corpus was created, imposing lower demands on computational resources and being more useful for manual analysis to extract specific patterns. The descriptive statistics of the users-likes matrix before and after trimming are presented in Table 1.

Descriptive statistics    Raw Matrix    Trimmed Matrix
# of users                   110 728            19 742
# of unique Likes          1 580 284             8 523
# of User-Like pairs      10 612 326         3 817 840
Matrix density                0.006%            2.269%
Likes per User
  Mean                            96               193
  Median                          22               106
  Minimum                          1                50
  Maximum                      7 973             2 487
Users per Like
  Mean                             7               448
  Median                           1               290
  Minimum                          1               150
  Maximum                     19 998             8 445

Table 1: The descriptive statistics of the raw and trimmed users-likes matrix, with the minimum users per like threshold set to $u_L = 150$ and the minimum likes per user threshold set to $L_u = 50$.
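The density figures in Table 1 can be checked directly from the counts it reports. The short sketch below only reproduces that arithmetic; the helper name `matrix_density` is ours and is not part of the accompanying repository.

```python
# Counts copied from Table 1 (raw and trimmed users-likes matrices).
def matrix_density(n_pairs: int, n_users: int, n_likes: int) -> float:
    """Share of non-zero cells in a users x likes matrix, in percent."""
    return 100.0 * n_pairs / (n_users * n_likes)


raw_density = matrix_density(10_612_326, 110_728, 1_580_284)
trimmed_density = matrix_density(3_817_840, 19_742, 8_523)
print(f"raw: {raw_density:.3f}%  trimmed: {trimmed_density:.3f}%")
```

Trimming removes most users and Likes, yet the share of filled cells grows by more than two orders of magnitude, which is what makes the trimmed matrix easier to analyze.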

The users-likes matrix can be constructed from the three provided comma-separated files with the help of the accompanying script written in R: src/preprocessing.R. To use this script, make sure that the input_data_dir variable in src/config.R points to the root directory where the sample data corpus in the form of .CSV files was unpacked.

To start pre-processing and trimming, run the following command from a terminal in the project's root directory:

$ Rscript ./src/preprocessing.R -u 150 -l 50

where: -u is the minimum number of users per like $u_L$, and -l is the minimum number of likes per user $L_u$ to keep in the resulting matrix.
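To illustrate what the two thresholds do, here is a minimal sketch of threshold-based trimming on a list of user-Like pairs. It is not the code of src/preprocessing.R: the function name `trim_pairs` and the toy data are hypothetical, and repeating the filter until nothing changes is our assumption (it is consistent with the minimum values reported for the trimmed matrix in Table 1).

```python
from collections import defaultdict


def trim_pairs(pairs, min_users_per_like, min_likes_per_user):
    """Drop rare Likes and low-activity users until both thresholds hold."""
    pairs = set(pairs)
    while True:
        users_per_like = defaultdict(int)
        likes_per_user = defaultdict(int)
        for user, like in pairs:
            users_per_like[like] += 1
            likes_per_user[user] += 1
        kept = {
            (user, like)
            for user, like in pairs
            if users_per_like[like] >= min_users_per_like
            and likes_per_user[user] >= min_likes_per_user
        }
        if kept == pairs:  # nothing was removed, so both thresholds are satisfied
            return kept
        pairs = kept


# Toy data: removing the rare Like "c" leaves user "u3" with a single Like,
# so a second pass is needed to remove that user as well.
toy_pairs = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"), ("u3", "c"),
]
trimmed = trim_pairs(toy_pairs, min_users_per_like=2, min_likes_per_user=2)
```

The toy example shows why a single pass may not be enough: removing a Like can push a user below the likes-per-user threshold, and the reverse.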

The values for the minimum number of users per like $u_L$ and the minimum number of likes per user $L_u$ were selected based on [Kosinski et al., 2016]. We also experimented with another set of parameters ($u_L = 20$ and $L_u = 2$), but the accuracy of the trained prediction models degraded.

Data imputation of missing values

The raw data corpus has missing values in the column holding the 'Political' dependent variable. Before building the prediction model for this dependent variable, it is advisable to impute the missing values. In this work, we apply multivariate imputation to fill in missing values [van Buuren, Groothuis-Oudshoorn, 2011].

The data imputation is performed by the same src/preprocessing.R script, as part of the users-likes matrix building routine. The summary statistics for data imputation applied to the political variable are presented in Table 2.

               est    se       t       df  Pr(>|t|)  lo 95  hi 95  nmis   fmi  lambda
(Intercept)   1.39  0.01  102.29  1240.53      0.00   1.36   1.41    NA  0.06    0.05
gender       -0.02  0.01   -2.80    25.27      0.01  -0.04  -0.01     0  0.44    0.40
age           0.00  0.00   -0.73   577.61      0.47   0.00   0.00     0  0.09    0.08
ope          -0.23  0.00  -68.10  2446.16      0.00  -0.23  -0.22     0  0.04    0.04
con           0.05  0.00   10.92    20.28      0.00   0.04   0.06     0  0.49    0.44
ext           0.03  0.00    6.44    14.40      0.00   0.02   0.04     0  0.58    0.53
agr           0.02  0.00    6.72   189.32      0.00   0.02   0.03     0  0.15    0.14
neu          -0.01  0.00   -2.07    95.30      0.04  -0.02   0.00     0  0.22    0.20

Table 2: The descriptive statistics for data imputation applied to the political variable using the LDA method with the number of multiple imputations equal to $m = 5$. The plausibility of the applied multivariate imputation can be confirmed by low values in the columns fmi and lambda. The column fmi contains the fraction of missing information as defined in [Rubin DB, 1987], and the column lambda is the proportion of the total variance that is attributable to the missing data, $\lambda = (B + B/m)/T$, where $B$ is the between-imputation variance and $T$ is the total variance.
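For readers unfamiliar with the lambda column, the following sketch shows how it follows from Rubin's pooling rules for a single coefficient. Here B is the between-imputation variance, U-bar the average within-imputation variance, and T = U-bar + (1 + 1/m)B the total variance. The function name and the numbers fed to it are illustrative assumptions; they are not values produced by the paper's imputation run.

```python
from statistics import mean, variance


def pool_rubin(estimates, squared_std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(squared_std_errors)   # within-imputation variance
    b = variance(estimates)            # between-imputation variance (m - 1 denominator)
    t = u_bar + (1 + 1 / m) * b        # total variance
    lam = (b + b / m) / t              # share of variance attributable to missing data
    return q_bar, t, lam


# Hypothetical estimates of one coefficient from m = 5 imputations.
q_bar, total_var, lam = pool_rubin(
    estimates=[0.04, 0.05, 0.06, 0.05, 0.05],
    squared_std_errors=[0.0001] * 5,
)
```

With these made-up inputs, lambda comes out at 0.375, meaning that a little over a third of the coefficient's variance would be due to the imputed values.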

Dimensionality reduction with SVD

After the two previous steps, the resulting users-likes matrix still has a considerable number of features per data sample (8 523 feature columns). To make it more manageable, we applied singular value decomposition [Golub, G. H., and Reinsch, C. 1970], an eigendecomposition-based method that projects a set of data points into a set of dimensions. As mentioned in [Kosinski et al., 2016], reducing the dimensionality of the data corpus has a number of advantages:

• With a reduced feature space we can use a smaller number of data samples, as most analysis algorithms require that the number of data samples exceeds the number of features (input variables)

• It will reduce the risk of overfitting and increase the statistical power of results

• It will remove multicollinearity and redundancy in the data corpus by grouping related features (variables) in a single dimension

• It will significantly reduce the required computational power and memory

• And finally, it makes it easier to analyze data by hand over a small set of dimensions as opposed to hundreds or thousands of separate features
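The projection behind these advantages can be written compactly as a rank-$K$ truncated SVD. The symbols below ($M$, $U_K$, $\Sigma_K$, $V_K$, $n$, $p$) are our notation and do not appear in the original text; $n = 19\,742$ users and $p = 8\,523$ Likes are the trimmed matrix sizes from Table 1.

```latex
M \approx U_K \Sigma_K V_K^{T}, \qquad
M \in \mathbb{R}^{n \times p},\quad
U_K \in \mathbb{R}^{n \times K},\quad
\Sigma_K \in \mathbb{R}^{K \times K},\quad
V_K \in \mathbb{R}^{p \times K}
```

Each user is then described by the $K$ entries of the corresponding row of $U_K$ (optionally scaled by $\Sigma_K$) instead of $p$ binary Like indicators, and each column of $V_K$ gives the loadings of the Likes on one dimension.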

To apply SVD analysis to the generated users-likes matrix, run the following command from the project's root directory:

$ Rscript ./src/svd_varimax.R --svd_dimensions 50 --apply_varimax true

where: --svd_dimensions is the number of SVD dimensions for projection, and --apply_varimax is the flag indicating whether varimax rotation should be applied afterwards.

Factor rotation analysis

The factor rotation analysis methods can be used to simplify SVD dimensions and increase their interpretability by

mapping the original multidimensional space into a new, rotated space. Rotation approaches can be orthogonal (i.e.,

producing uncorrelated dimensions) or oblique (i.e., allowing for correlations between rotated dimensions).

We apply one of the most popular orthogonal rotations, varimax. It minimizes both the number of dimensions related to each variable and the number of variables related to each dimension, thus improving the interpretability of the data.

For more details on rotation techniques, see [Abdi, H., 2003].
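As a brief reminder of what varimax optimizes, the raw varimax criterion can be stated as follows. The notation is ours: $L$ is the $p \times K$ matrix of unrotated loadings, $R$ is an orthogonal $K \times K$ rotation matrix, and $\ell_{ij}$ are the entries of the rotated loadings $\Lambda = LR$.

```latex
R^{*} = \arg\max_{R^{T}R = I} \;
\sum_{j=1}^{K} \left[
  \frac{1}{p} \sum_{i=1}^{p} \ell_{ij}^{4}
  - \left( \frac{1}{p} \sum_{i=1}^{p} \ell_{ij}^{2} \right)^{2}
\right],
\qquad \Lambda = L R
```

The bracketed term is the variance of the squared loadings within dimension $j$. Maximizing it pushes each loading towards either zero or a large magnitude, which is what makes the rotated dimensions easier to interpret.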

3. Regression analysis

There is an abundance of methods developed to build prediction models based on large data sets. They range from sophisticated methods such as Deep Machine Learning [Goodfellow et al., 2016], probabilistic graphical models [Daphne Koller, 2012], or support vector machines [Cortes & Vapnik, 1995], to much simpler ones, such as linear and logistic regression [Yan, Su, 2009].

Starting with simple methods is a common practice that allows the creation of a good baseline prediction model with minimal computational effort. The results obtained from these models can be used later to debug and estimate the quality of results obtained from advanced models.

3.1. Cross-Validation

In this work, we apply k-fold cross-validation to help avoid model overfitting. In k-fold cross-validation, the original sample is randomly partitioned into $k$ equal-sized subsamples. Of the $k$ subsamples, a single subsample is retained as the validation data for testing the model, and the remaining $k - 1$ subsamples are used as training data. The cross-validation process is then repeated $k$ times (the folds), with each of the $k$ subsamples used exactly once as the validation data. The $k$ results from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general $k$ remains an unfixed parameter [Kohavi, Ron, 1995].
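The partitioning described above can be sketched in a few lines. This is an illustration of the general technique rather than the repository's implementation; the function name `k_fold_indices` and the fixed seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition of the sample
    folds = [indices[i::k] for i in range(k)]     # k near-equal subsamples
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=20, k=10))
for train, validation in splits[:2]:
    print(len(train), sorted(validation))
```

A model would be fitted on `train` and scored on `validation` for each of the `k` splits, and the `k` scores averaged into a single estimate.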

3.2. Dimensionality Reduction

To reduce the number of features (input variables) in the data corpus, we applied singular value decomposition (SVD) with subsequent varimax factor rotation analysis. The number of varimax-rotated singular value decomposition dimensions ($K$) has a considerable impact on the accuracy of model predictions. To find an optimal number of SVD dimensions, we analyzed the relationship between $K$ and the accuracy of model predictions by producing a series of regression models for different values of $K$. Then we plotted the prediction accuracy of the created models against the chosen number $K$ of SVD dimensions. Typically the prediction accuracy grows rapidly within lower ranges of $K$ and may start decreasing once the number of dimensions becomes large. Selecting a $K$ that marks the end of the rapid growth of prediction accuracy values usually offers decent interpretability of the topics. In general, larger $K$ values often provide better predictive power [Zhang, Marron, Shen, & Zhu, 2007]. See Figure 1 for the results of our experiments.

To start the SVD analysis, run the following command from a terminal in the project's root directory:

$ Rscript ./src/analysis.R

The resulting plots will be saved to the "Rplots.pdf" file in the project root. This file will include two plots:

• the plot of the relationship between the accuracy of predicting psycho-demographic traits of individuals and the number of varimax-rotated SVD dimensions used (Figure 1). With this plot, it is easy to visually find an optimal number $K$ of SVD dimensions to maximize the predictive power of the regression model for a particular psycho-demographic trait of the individual.

• the heat map of correlations between the scores of individuals' digital footprints on varimax-rotated SVD dimensions and their psycho-demographic traits (Figure 2). This plot can be used to visually find the most correlated traits. Later, it will be shown that predictive models for traits with higher correlation have better prediction accuracy.

3.3. Building regression model and prediction results

In our data corpus, we have eight dependent variables with psycho-demographic scores of an individual to be predicted. Among those variables, six have continuous values and two have categorical values (binary: 0, 1). To build prediction models for variables with continuous values we apply linear regression, and for variables with categorical values, logistic regression.

Linear regression

Linear regression is an approach for modeling the relationship between a scalar dependent variable $y$ and one or more explanatory variables (or independent variables) denoted $X$. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression [David A. Freedman, 2009].

In linear regression, the relationships are modeled using the linear predictor function $y = \Theta^T X$, whose unknown model parameters $\Theta$ are estimated from the input data. Such models are called linear models [Hilary L. Seal, 1967].
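As a small worked example of estimating $\Theta$ from data, the sketch below fits the simple (one explanatory variable) case by ordinary least squares. Including an intercept term is an assumption made for the illustration, and the function name and data are hypothetical; the paper's models use many SVD dimensions as inputs.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = theta0 + theta1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = s_xy / s_xx               # slope
    theta0 = y_bar - theta1 * x_bar    # intercept
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the fit should recover those values.
theta0, theta1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta0, theta1)
```

In the notation of the text, the intercept can be folded into $\Theta$ by adding a constant feature equal to 1 to $X$.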

Logistic regression

Logistic regression is a regression model where the dependent variable is categorical [David A. Freedman, 2009]. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which is the cumulative distribution function of the logistic distribution [Rodriguez, G., 2007].
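The logistic function and its use for turning a linear score into a class probability can be sketched as follows. The function names and the parameter values are illustrative assumptions; the branch on the sign of `x` is only there to avoid numeric overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_probability(theta, features):
    """P(y = 1 | features) for a binary logistic regression model."""
    return sigmoid(sum(t * f for t, f in zip(theta, features)))


# Hypothetical parameters: an intercept of -1.0 and one feature weight of 2.0.
p = predict_probability(theta=[-1.0, 2.0], features=[1.0, 0.5])
print(p)
```

Here the linear score is exactly zero, so the predicted probability is 0.5, the point at which the model is undecided between the two classes.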

In this work we consider only binary logistic regression because the categorical dependent variables found in our data corpus are binary, i.e. take only two possible values, "0" and "1".

Figure 1: The relationship between the accuracy of predicting psycho-demographic traits and the number of varimax-rotated singular value decomposition dimensions used. The results suggest that employing $K = 50$ SVD dimensions might be a good choice for building models predicting almost all of the individual's traits of interest, as it offers accuracy that is close to what seems like the upper asymptote for this data. But for the Openness, Extroversion, and Agreeableness traits, prediction results can be slightly improved with higher values of $K$.

Figure 2: The heat map presents correlations between $K = 50$ varimax-rotated singular value decomposition dimensions and scores of psycho-demographic traits of individuals. The heat map suggests that the Age, Gender, and Political view traits have the strongest correlations with the largest number of SVD dimensions. Higher correlation results in higher predictive power of the regression model for particular psycho-demographic traits (which will be shown later).

Implementing prediction models and accuracy results

The data corpus has eight dependent variables for which to build prediction models. Usually, simple machine learning methods are applied with a single dependent variable to be estimated, but methods for multivariate regression analysis also exist. Taking into account that our dependent variables have different types (continuous and nominal), which require different regression methods, we decided to build a separate regression model per dependent variable. The metric used to evaluate the accuracy of a prediction model also depends on the type of regression method in use, and we considered the following metrics:

• the predictive power of a linear regression model is measured as the Pearson product-moment correlation [Gain, 1951]

• the predictive power of a logistic regression model is measured as the area under the receiver operating characteristic curve (AUC) [Sing, Sander, Beerenwinkel, Lengauer, 2005]
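Both metrics listed above are easy to state in code. The sketch below is a plain illustration of their definitions, not the evaluation code of the repository (the AUC citation above suggests a dedicated R package is used there); function names and sample data are hypothetical. AUC is computed through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def auc(labels, scores):
    """Area under the ROC curve; tied positive/negative scores count as 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
area = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(r, area)
```

The pairwise AUC above is quadratic in the number of samples, which is fine for an illustration but would be replaced by a sort-based computation on a corpus of this size.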

Before executing the models, make sure that the data corpus has already been pre-processed as described in the subsection "Construction of sparse users-likes matrix and matrix trimming".

When the data corpus is ready, the following command can be executed to build the linear/logistic regression models and evaluate their predictive performance (run the command from a terminal in the project's root directory):

$ Rscript ./src/regression_analysis.R

The results of the predictive performance evaluation will be saved into the file "out/pred_accuracy_regr.txt". The results of regression model predictions for the data corpus trimmed to a minimum of 150 users per like and 50 likes per user, with $K = 50$ varimax-rotated SVD dimensions, are presented in Table 3.

Trait              Variable   Pred. accuracy
Gender             gender     93.65%
Age                age        61.17%
Political view     political  68.36%
Openness           ope        44.02%
Conscientiousness  con        25.72%
Extroversion       ext        30.26%
Agreeableness      agr        23.97%
Neuroticism        neu        29.11%
Mean                          47.03%

Table 3: The predictive accuracy results of the linear and logistic regression models per dependent variable. Accuracy is the Pearson correlation for continuous variables and the AUC for binary variables, expressed as a percentage.

From Table 3 we can see that the predictive power of the simple linear and logistic regression models differs for each dependent variable, and for most outputs the obtained prediction accuracy is not high enough to be used for real-life predictions. The most accurate predictions are made for Gender, Age, and Political view, with Openness following next. That correlates well with our previous analysis of the SVD correlations heat map (Figure 2). In general, only the prediction model for Gender is accurate enough to be useful in real-life applications.

In the following sections, we consider applying advanced deep machine learning techniques to improve prediction accuracy.

4. Fully Connected Feed Forward Artificial Neural Networks

In this work, we consider multilayer fully connected feed-forward neural networks (NN) for building simple or shallow (single hidden layer) and deep machine learning NN models. In a fully connected NN, every neuron of the current layer is connected to each neuron of the previous layer. A feed-forward NN is not allowed to have cycles from later layers back to earlier ones.

Models based on the NN architecture can have one or multiple hidden layers. When only one hidden layer is present, its nodes (neurons) take inputs from the input nodes (columns of the input data matrix) and feed into the output nodes, where linear or categorical analysis is performed. If the network includes two or more hidden layers, then the first hidden layer takes inputs from each of the input nodes and each subsequent hidden layer takes inputs from the outputs of each node of the previous hidden layer [Christopher M. Bishop, 1995].
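The layer-by-layer flow described above can be illustrated with a tiny forward pass. This is a didactic sketch with made-up weights, not the TensorFlow model from the repository: it uses ReLU in the hidden layers (the activation adopted in Section 4.1) and a linear output layer, and all names and sizes are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: every output unit sees every input."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass inputs through the hidden layers (ReLU) and a linear output layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return dense(activations, out_weights, out_biases)


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer: 2 inputs -> 2 units
    ([[1.0, 2.0]], [0.1]),                    # output layer: 2 units -> 1 output
]
prediction = forward([2.0, 1.0], layers)
print(prediction)
```

Adding more `(weights, biases)` pairs before the last entry of `layers` turns the same code into a deeper network, since no layer feeds back into an earlier one.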

Artificial neural networks (ANN) are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Various regularization methods can be applied during training to help reduce overfitting. One of the promising methods is dropout regularization, where some number of units (neurons) are randomly omitted from the hidden layers during training. This helps to break the rare dependencies that can occur in the training data [Hinton, G. et al., 2012].

4.1. The shallow Feed Forward ANN evaluation

A shallow neural network (SNN) is an artificial neural network with one hidden layer of units (neurons) between the input and output layers. To mimic a biological neuron, hidden units in the network apply specific non-linear activation functions. One popular activation function is the ReLU non-linearity, which we used as the activation function for the hidden layer units in the studied network architectures [Nair, Hinton, 2010].

The ReLU non-linearity was selected because it improves information disentangling and linear separability, producing an efficient variable-size representation of the model's data [Glorot, Bordes, Bengio, 2011].

When applying rectified non-linearity (ReLU), we can consider the predictive model as an exponential number of linear models that share parameters. Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient vanishing effect due to activation non-linearities as in case of sigmoid or tanh activation units), and as a result, the mathematical investigation is easier. Computations are also cheaper: there is no need for computing the exponential function in activations, and sparsity can be exploited [Glorot, Bordes, Bengio, 2011].
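The remark about vanishing gradients can be made concrete by comparing the activation derivatives that get multiplied together during back-propagation. The sketch below deliberately ignores the weight matrices and looks only at the activation terms; the depth of 10 is an arbitrary choice for the illustration.

```python
import math


def relu_grad(x):
    """Derivative of ReLU: 1 on the active path, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    """Derivative of the logistic function, which never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


depth = 10  # hypothetical number of stacked activations
relu_chain = relu_grad(1.0) ** depth        # stays at 1.0 on an active path
sigmoid_chain = sigmoid_grad(0.0) ** depth  # 0.25 ** 10, the best case for sigmoid
print(relu_chain, sigmoid_chain)
```

Even in its most favorable regime the sigmoid shrinks the gradient by a factor of four per layer, whereas an active ReLU unit passes it through unchanged.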

To reduce overfitting, we selected dropout regularization with a drop probability of 0.5, which means that each hidden unit is randomly omitted from the network with the specified probability.

The other way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a vast number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case, but all of these networks share the same weights for the hidden units that are present [Hinton, G. et al., 2012].
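A minimal sketch of dropout applied to one layer of activations is given below. It implements the "inverted" variant, in which the surviving activations are rescaled by 1 / (1 - drop probability) at training time so that no rescaling is needed at test time; that this matches the exact variant used in the experiments is an assumption. The function name, seed, and inputs are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(42)   # fixed seed so the example is reproducible
hidden = [1.0] * 1000     # hypothetical hidden layer activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped.count(0.0), sum(dropped) / len(dropped))
```

Each call draws a fresh random mask, so every training case effectively sees a different thinned network, while all of those networks share the same underlying weights.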

The NN architecture was built using Google Brain's TensorFlow library, an open source software library for numerical computation using data flow graphs [Google Brain Team, 2015]. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The resulting two-layer (one hidden layer) ANN architecture graph is depicted in Figure 5.

It was found that the optimal number of SVD dimensions for the two-layer ANN is $K = 128$, with 512 units in the hidden layer and learning rate $\gamma = 0.0001$; see Table 4. Mean Squared Error (MSE) was selected as the loss function to be optimized, with the Adam optimizer (Adaptive Moment Estimation) used to find its minimum. The Adam optimizer was selected for its proven advantages, some of which are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its step sizes are approximately bounded by the step size hyper-parameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing [Kingma, Ba, 2014].

The batch size was set to 100. We also tested training with batch size 10 but found no statistically relevant difference in prediction accuracy between runs with either batch size.
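Two of the Adam properties quoted above, invariance to gradient rescaling and step sizes bounded by the learning rate, can be seen in a single update of a scalar parameter. The sketch below follows the update rule from the Adam paper with its suggested defaults for beta1, beta2 and epsilon; whether the experiments used those defaults is an assumption, and only the learning rate 0.0001 comes from the text.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.0001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# The same parameter updated once with a small and with a 1000x larger gradient.
small, _, _ = adam_step(0.0, grad=0.5, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(0.0, grad=500.0, m=0.0, v=0.0, t=1)
print(small, large)
```

Both updates move the parameter by almost exactly the learning rate, which shows that the step magnitude does not depend on the scale of the gradient.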

Prediction accuracy, γ = 0.0001

Trait              Variable   K=50    K=128   K=256   K=512
Gender             gender     93.76%  93.61%  93.00%  93.82%
Age                age        85.00%  85.07%  85.03%  83.59%
Political view     political  66.65%  66.86%  67.28%  65.32%
Openness           ope        47.22%  51.24%  47.26%  44.81%
Conscientiousness  con        28.56%  28.45%  29.65%  28.26%
Extroversion       ext        32.53%  32.87%  30.44%  30.32%
Agreeableness      agr        25.21%  27.66%  24.58%  24.80%
Neuroticism        neu        33.53%  36.24%  33.09%  31.96%
Mean                          51.58%  52.75%  51.29%  50.36%

Table 4: The predictive accuracy results of the SNN per $K$ SVD dimensions with learning rate $\gamma = 0.0001$ and 512 units in the hidden layer.

The learning rate was selected by watching the ratio of hidden units that survive with non-zero ReLU activation and by monitoring the plot of the loss function over iterations (Figure 3). With the presented hyper-parameters, the maximum ratio of zero ReLU activations was 0.57, observed for the highest learning rate, which is fairly good taking into account the tendency of ReLU units to saturate at zero during the gradient back-propagation stage when strong gradients are applied due to high learning rates [Nair, Hinton, 2010]. The maximal number of iterations (50 000), and correspondingly the number of training epochs, was selected based on the loss function plot (Figure 3). After a series of experiments, the learning rate $\gamma = 0.0001$ was selected as optimal: it gives a smooth loss function, a minimal number of "dead" neurons after ReLU activation (the ratio of zero ReLU activations is about 0.5), and the best prediction scores among runs (see Table 5).
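The zero-activation ratio used as a diagnostic above can be computed as the share of hidden units whose ReLU output is exactly zero. How the experiments aggregate it (per batch, per unit, or over the whole run) is not stated, so the per-vector version below is an assumption, as are the function name and the sample pre-activations.

```python
def zero_activation_ratio(pre_activations):
    """Share of hidden units whose ReLU output is exactly zero."""
    outputs = [max(0.0, z) for z in pre_activations]
    return sum(1 for o in outputs if o == 0.0) / len(outputs)


# Hypothetical pre-activation values of eight hidden units for one input.
ratio = zero_activation_ratio([-1.2, 0.3, -0.1, 2.0, 0.0, 0.7, -3.4, 1.1])
print(ratio)
```

A ratio that climbs towards 1.0 during training would indicate that most units have stopped responding to the inputs, which is the situation the learning rate selection is meant to avoid.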

Prediction accuracy, K=128 SVD

Trait              Variable   γ=0.001  γ=0.0001  γ=0.00001
Gender             gender     93.88%   93.61%    91.97%
Age                age        83.68%   85.07%    73.81%
Political view     political  66.67%   66.86%    66.95%
Openness           ope        49.43%   51.24%    45.92%
Conscientiousness  con        27.18%   28.45%    27.79%
Extroversion       ext        29.93%   32.87%    31.93%
Agreeableness      agr        25.38%   27.66%    24.51%
Neuroticism        neu        31.15%   36.24%    32.24%
Mean                          50.91%   52.75%    49.39%

Table 5: The predictive accuracy results of the SNN per learning rate with $K = 128$ SVD dimensions and 512 units in the hidden layer.

The accompanying launch script is provided to conduct experiments under Unix:

$ ./eval_mlp_1.sh ul_svd_matrix_file

where: ul_svd_matrix_file is the path to the users-likes matrix whose feature columns have been reduced with the SVD method to the selected number of $K = 128$ SVD dimensions.

The source code of the shallow ANN implementation used for the experiment can be found in src/mlp.R of the accompanying GitHub repository.

4.2. Feed Forward Deep Learning Network

A deep neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers. Like a shallow network, a deep neural network can model complex non-linear relationships. The extra layers enable composition of features from lower layers, giving the potential of modeling complex data with fewer units than a similarly performing shallow network [Bengio, Yoshua, 2009].

Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer [LeCun, Bengio, Hinton, 2015].

As with shallow ANNs, many issues can arise when training DNNs, the two most common being overfitting and computation time [Tetko, Livingstone, Luik, 1995].

The three-layer Deep Learning Network

We started our experiments with deep learning networks using a simple DNN architecture comprising two hidden layers with ReLU activation and dropout after each hidden layer with a keep probability of 0.5. The experimental network graph is depicted in Figure 7.
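To make the described architecture concrete, the following is a minimal pure-Python sketch of the forward pass: two ReLU hidden layers, dropout after each hidden layer with keep probability 0.5, and a linear output layer. It is not the implementation from the repository. The layer sizes are toy values, the weight initialisation is an assumption, and dropout is written in the inverted form (surviving units are rescaled by 1/keep_prob), which is assumed here because it is how TensorFlow's dropout operation behaves.

```python
import random


def relu(vector):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in vector]


def dense(vector, weights, biases):
    """Fully connected layer; weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def dropout(vector, keep_prob, rng):
    """Inverted dropout: keep each unit with keep_prob, rescale survivors."""
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in vector]


def init_layer(n_in, n_out, rng):
    """Random Gaussian weights and zero biases (initialisation is assumed)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features, layers, keep_prob, rng, training):
    """Hidden layers use ReLU (plus dropout in training); output is linear."""
    hidden_layers, output_layer = layers[:-1], layers[-1]
    activation = features
    for weights, biases in hidden_layers:
        activation = relu(dense(activation, weights, biases))
        if training:
            activation = dropout(activation, keep_prob, rng)
    return dense(activation, *output_layer)


rng = random.Random(0)
k_svd, hidden1, hidden2, n_outputs = 8, 16, 8, 3  # toy sizes, not the paper's
layers = [init_layer(k_svd, hidden1, rng),
          init_layer(hidden1, hidden2, rng),
          init_layer(hidden2, n_outputs, rng)]
features = [rng.gauss(0.0, 1.0) for _ in range(k_svd)]

train_output = forward(features, layers, keep_prob=0.5, rng=rng, training=True)
eval_output = forward(features, layers, keep_prob=0.5, rng=rng, training=False)
print("training pass:", train_output)
print("evaluation pass:", eval_output)
```

Dropout is active only in the training pass; the evaluation pass uses all units and is therefore deterministic for a fixed set of weights.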

Prediction accuracy, γ=0.0001

| Trait | Variable | K=50 [512,256] | K=128 [512,256] | K=256 [512,256] | K=512 [1024,512] | K=1024 [2048,1024] |
|---|---|---|---|---|---|---|
| Gender | gender | 92.60% | 92.14% | 92.53% | 93.61% | 95.68% |
| Age | age | 85.03% | 82.97% | 80.67% | 81.45% | 82.53% |
| Political view | political | 66.66% | 67.34% | 67.06% | 67.26% | 65.06% |
| Openness | ope | 44.76% | 47.56% | 47.68% | 46.39% | 40.61% |
| Conscientiousness | con | 24.33% | 25.79% | 25.68% | 27.74% | 24.67% |
| Extroversion | ext | 30.86% | 32.35% | 34.35% | 28.16% | 26.53% |
| Agreeableness | agr | 24.83% | 25.66% | 26.48% | 21.65% | 29.90% |
| Neuroticism | neu | 32.17% | 33.01% | 35.94% | 28.50% | 30.83% |
| Mean | | 50.15% | 50.85% | 51.30% | 49.34% | 49.48% |

Table 6: The predictive accuracy results of the three-layer DNN per K SVD dimensions with learning rate γ=0.0001. The sizes of the hidden layers are given in the table header as [hidden1,hidden2].

We started with the learning rate γ=0.0001, which had shown the best results for the shallow ANN, and ran a series of experiments to estimate the optimal number K of SVD dimensions. The best prediction accuracy of the DNN model was achieved with K=256 SVD dimensions and two hidden layers with [512, 256] units respectively. Similar prediction accuracy is also achievable with K=1024 SVD dimensions and [2048, 1024] units per layer with learning rate γ=10^-5. However, we did not adopt these hyper-parameters because of their extra computational overhead, while the results are statistically the same as with the former. The results of the experiments are presented in Table 6.

After finding the optimal values for the number K of SVD dimensions and the number of units per hidden layer, we conducted a series of experiments to determine the optimal learning rate (see Table 7).

Prediction accuracy, K=256 SVD

| Trait | Variable | γ=0.001 | γ=0.0001 | γ=0.00001 |
|---|---|---|---|---|
| Gender | gender | 90.89% | 92.53% | 82.75% |
| Age | age | 79.60% | 80.67% | 79.44% |
| Political view | political | 67.33% | 67.06% | 59.69% |
| Openness | ope | 44.73% | 47.68% | 39.48% |
| Conscientiousness | con | 21.41% | 25.68% | 22.86% |
| Extroversion | ext | 24.62% | 34.35% | 24.07% |
| Agreeableness | agr | 17.27% | 26.48% | 16.34% |
| Neuroticism | neu | 27.72% | 35.94% | 28.53% |
| Mean | | 46.70% | 51.30% | 44.14% |

Table 7: The predictive accuracy results of the three-layer DNN per learning rate with K=256 SVD dimensions and hidden layers of 512 and 256 units respectively.

In our experiments, we applied exponential learning rate decay with the number of steps before decay set to 10000 and a decay rate of 0.96. Such a scheme has a positive effect on network convergence speed due to the learning rate annealing effect, which gives the system the ability to escape from poor local minima to which it might have been initialized [Kirkpatrick et al., 1983]. We selected a batch size of 100 as optimal for this experiment.
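The decay schedule can be written down explicitly. The sketch below assumes the usual exponential form, learning_rate = initial_rate * decay_rate^(step / decay_steps), which is the form used by TensorFlow's exponential decay schedule. Whether the experiments used the continuous or the staircase (integer division) variant is not stated in the text, so both are shown, and the function name is hypothetical.

```python
def decayed_learning_rate(initial_rate, step, decay_steps=10000,
                          decay_rate=0.96, staircase=False):
    """Exponentially decayed learning rate at a given training step.

    With staircase=True the rate drops in discrete jumps every decay_steps
    steps; otherwise it decays smoothly. Which variant was used in the
    experiments is an assumption left open here.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


for step in (0, 10000, 50000, 100000):
    rate = decayed_learning_rate(0.0001, step)
    print(f"step {step:>6}: learning rate {rate:.3e}")
```

With these settings the learning rate shrinks by 4% every 10000 steps, so the annealing is gentle: it takes roughly 170000 steps for the rate to halve.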

The accompanying launch script is provided to conduct the experiments under Unix:

$ ./eval_dnn.sh ul_svd_matrix_file

where ul_svd_matrix_file is the path to the users-likes matrix whose feature columns were reduced with the SVD method to the selected number of K SVD dimensions.

The source code of the DNN implementation with two hidden layers used for the experiment can be found in src/dnn.R of the accompanying GitHub repository.

The four-layer Deep Learning Network

This architecture includes three hidden layers with ReLU activation and one linear output layer. All network layers are fully connected, and the network architecture is feed-forward as in all previous NN experiments. The experimental network graph is depicted in Figure 8.

We tested two dropout regularization schemes: (a) dropout applied after each hidden layer with keep probability 0.5; (b) dropout applied after every second hidden layer with keep probability calculated by the formula $p_d = \frac{i}{2n}$, where $n$ is the number of dropouts and $i$ is the current dropout index. The former scheme was found to give better results than the latter; thus, for the final evaluation run, we applied dropout regularization after each hidden layer.


Prediction accuracy, γ=0.0001

| Trait | Variable | K=128 [256,128,128] | K=256 [512,256,256] | K=512 [1024,512,512] | K=1024 [2048,1024,1024] |
|---|---|---|---|---|---|
| Gender | gender | 55.07% | 70.63% | 92.13% | 91.03% |
| Age | age | 83.82% | 82.78% | 83.80% | 84.96% |
| Political view | political | 57.42% | 61.23% | 64.52% | 66.52% |
| Openness | ope | 13.52% | 27.50% | 44.88% | 44.86% |
| Conscientiousness | con | 14.41% | 14.92% | 20.76% | 28.72% |
| Extroversion | ext | 4.06% | 7.30% | 22.77% | 26.85% |
| Agreeableness | agr | 10.12% | 10.90% | 17.69% | 20.02% |
| Neuroticism | neu | 8.81% | 11.98% | 27.48% | 28.97% |
| Mean | | 30.91% | 35.90% | 46.75% | 48.99% |

Table 8: The predictive accuracy results of the four-layer DNN per K SVD dimensions with the learning rate γ=0.0001. The number of units in the hidden layers differs per configuration and is presented as [hidden1,hidden2,hidden3] in the table header.

Based on our previous experiments with shallower networks, we decided to start with the following hyper-parameters for the four-layer network architecture: learning rate γ=0.0001, learning rate decay step 10000 with decay rate 0.96, K=128 SVD dimensions, and layers configuration [256, 128, 128]. We conducted a series of experiments to find the optimal number K of SVD dimensions and the hidden layers configuration. The heuristic behind selecting the number of units per hidden layer was rather naive and assumes that with dropout probability 0.5 half of the units will be saturated to zero at the ReLU activation. Thus we decided to make the number of units in the first hidden layer twice the number of features in the input data (K). The results of the experiments are presented in Table 8.
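The sizing heuristic can be expressed as a small helper. The text states only the rule for the first hidden layer (twice the number of input features K); the sizes of the second and third hidden layers (K units each) are read off the configurations listed in Table 8 and are an inference rather than a stated rule. The function names are hypothetical.

```python
def hidden_layer_sizes(k_svd):
    """Hidden layer sizes for the four-layer network as listed in Table 8.

    The first layer has 2*K units (stated heuristic); the remaining two
    layers have K units each (pattern inferred from the table header).
    """
    return [2 * k_svd, k_svd, k_svd]


def expected_active_units(n_units, keep_prob=0.5):
    """Expected number of units that survive dropout in one layer."""
    return n_units * keep_prob


for k in (128, 256, 512, 1024):
    sizes = hidden_layer_sizes(k)
    active = expected_active_units(sizes[0])
    print(f"K={k}: hidden layers {sizes}, "
          f"expected active units in first layer: {active:.0f}")
```

Under this rule the expected number of units left active in the first hidden layer after dropout equals K, the number of input features, which is the intuition behind doubling the layer width.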

Although the best accuracy was achieved with K=1024 and [2048, 1024, 1024] units in the hidden layers respectively, severe overfitting of the model to the training data set was detected for these parameters. We therefore decided to stop increasing K and the sizes of the hidden layers, as this would give no significant further gain in prediction accuracy against the validation data set (see Figure 6).
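Overfitting of this kind shows up as a validation loss that stops improving while the training loss keeps falling. The sketch below illustrates one simple way to flag it programmatically. It is not the procedure used in the experiments, where the loss plots were inspected, and the loss values are synthetic and serve only as an illustration. The patience threshold and function names are assumptions.

```python
def best_validation_step(validation_losses):
    """Index of the evaluation with the lowest validation loss."""
    return min(range(len(validation_losses)),
               key=validation_losses.__getitem__)


def is_overfitting(train_losses, validation_losses, patience=3):
    """Flag overfitting from two loss curves of equal length.

    Returns True when the validation loss has not improved for at least
    `patience` evaluations while the training loss kept decreasing.
    """
    best = best_validation_step(validation_losses)
    stalled = len(validation_losses) - 1 - best >= patience
    train_still_falling = train_losses[-1] < train_losses[best]
    return stalled and train_still_falling


# Synthetic loss curves for illustration only (not data from the paper).
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
validation_losses = [1.05, 0.80, 0.68, 0.66, 0.69, 0.74, 0.81]

print("best validation step:", best_validation_step(validation_losses))
print("overfitting detected:", is_overfitting(train_losses, validation_losses))
```

Once such a condition is met, further training steps or larger layers are unlikely to improve the validation accuracy, which matches the decision to stop increasing K.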

The accompanying launch script is provided to conduct the experiments under Unix:

$ ./eval_3dnn.sh ul_svd_matrix_file

where ul_svd_matrix_file is the path to the users-likes matrix whose feature columns were reduced with the SVD method to the selected number of K SVD dimensions.

The source code of the DNN implementation with three hidden layers used for the experiment can be found in src/3dnn.R of the accompanying GitHub repository.

5. Future Work

From our experiments, we found that the prediction accuracy for each dependent variable differs among the machine learning methods applied, and that the best results are achieved with advanced machine learning methods based on neural network algorithms. At the same time, the prediction accuracy per dependent variable also differs between the studied models. Specific combinations of NN architecture and hyper-parameters are best suited for specific dependent variables but lack predictive power for the others. In future experiments, it would be interesting to investigate this dependency and build a separate NN model for each dependent variable, as was done for the simple machine learning methods (see Section: 'Regression analysis').
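The dependency between model and dependent variable can be illustrated with the accuracy figures reported in Table 9. The sketch below picks, for each variable, the model with the highest reported accuracy. The values are copied from Table 9; the model labels and variable names in the code are shorthand introduced here. Selecting the best model on the same figures that are then reported is optimistic, so the resulting mean should be read only as a rough upper bound, not as a result of the study.

```python
model_names = [
    "Regression, K=50",
    "SNN, K=128",
    "DNN [512,256], K=256",
    "DNN [2048,1024,1024], K=1024",
]

# Accuracy (percent) per dependent variable, copied from Table 9,
# in the same order as model_names.
accuracy_by_variable = {
    "gender":    [93.65, 93.61, 92.53, 91.03],
    "age":       [61.17, 85.07, 80.67, 84.96],
    "political": [68.36, 66.86, 67.06, 66.52],
    "ope":       [44.02, 51.24, 47.68, 44.86],
    "con":       [25.72, 28.45, 25.68, 28.72],
    "ext":       [30.26, 32.87, 34.35, 26.85],
    "agr":       [23.97, 27.66, 26.48, 20.02],
    "neu":       [29.11, 36.24, 35.94, 28.97],
}

best_model_per_variable = {}
for variable, scores in accuracy_by_variable.items():
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_model_per_variable[variable] = (model_names[best_index],
                                         scores[best_index])

mean_of_best = (sum(score for _, score in best_model_per_variable.values())
                / len(best_model_per_variable))

for variable, (model, score) in best_model_per_variable.items():
    print(f"{variable:>9}: {model} ({score:.2f}%)")
print(f"mean accuracy with per-variable model choice: {mean_of_best:.2f}%")
```

No single model wins for every variable: the shallow network leads for four of the eight, while regression and the two deep networks each lead for at least one.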

Also, it seems promising to apply the methodology described in [Ba, Caruana, 2014], which provides evidence that shallow networks are capable of learning the same functions as deep learning networks, often with the same number of parameters. In [Ba, Caruana, 2014] it was shown that with wide shallow networks it is possible to reach the state-of-the-art performance of deep models and reduce training time by a factor of 10 using parallel computational resources (GPU).

6. Conclusion

In our experiments we found that only a weak correlation exists between most of the O.C.E.A.N. psychometric traits of individuals and the Facebook likes associated with them. Both the simple and the advanced machine learning algorithms that we tested provided poor prediction accuracy for almost all O.C.E.A.N. personality traits. It does not yet seem feasible to use machine learning models to precisely estimate the psychometric profile of an individual.

At the same time, we found a strong correlation with demographic traits of individuals such as age, gender and political views, which makes it feasible to apply advanced machine learning methods to create a demographic profile of an individual.

| Trait | Variable | Regression, K=50 | SNN, K=128 | DNN [512,256], K=256 | DNN [2048,1024,1024], K=1024 |
|---|---|---|---|---|---|
| Gender | gender | 93.65% | 93.61% | 92.53% | 91.03% |
| Age | age | 61.17% | 85.07% | 80.67% | 84.96% |
| Political view | political | 68.36% | 66.86% | 67.06% | 66.52% |
| Openness | ope | 44.02% | 51.24% | 47.68% | 44.86% |
| Conscientiousness | con | 25.72% | 28.45% | 25.68% | 28.72% |
| Extroversion | ext | 30.26% | 32.87% | 34.35% | 26.85% |
| Agreeableness | agr | 23.97% | 27.66% | 26.48% | 20.02% |
| Neuroticism | neu | 29.11% | 36.24% | 35.94% | 28.97% |
| Mean | | 47.03% | 52.75% | 51.30% | 48.99% |

Table 9: The comparison of predictive accuracy for the best prediction models found during the experiments. The best average prediction accuracy was demonstrated by the shallow neural network (SNN), with the three-layer deep neural network (DNN [512,256]) following next. A further increase of K SVD dimensions and the addition of hidden layers results in overfitting to the training data and degradation of prediction accuracy over the validation data set.

Among all tested machine learning methods, the best overall prediction accuracy was achieved with the shallow neural network architecture. We hypothesize that this may result from its ability to learn the best parameter space function within an optimal number of SVD dimensions applied to the raw input data set (the users-likes sparse matrix). Adding hidden layers leads either to model overfitting when the number of SVD dimensions is too high, or to underfitting otherwise, compared to the shallow network. It is also interesting to note that the performance of shallow networks and of deep learning networks with two hidden layers is comparable, while introducing a third or further hidden layers makes it drop significantly. Thus we can conclude that no further improvements can be gained by inserting extra hidden layers. See Table 9.


A. The SNN evaluation plots

The following pages provide plots and diagrams related to the evaluation of the two-layer (shallow) feed-forward artificial neural network with one fully connected hidden layer and a linear output layer.

Figure 3: The training process evaluation based on loss values and ReLU zero activations per number of iterations. With the higher learning rate (γ=0.001, orange) we have fast convergence, but the ratio of ReLU-zero activations is higher than 0.5 and quickly rising, with relatively low evaluated prediction accuracy, which implies that the optimum was missed. With the medium learning rate (γ=0.0001, violet) we have a smooth loss function plot with the ratio of ReLU-zero activations below 0.5, giving the best prediction scores among all three runs. With the lowest learning rate (γ=0.00001, purple) we can see that learning struggled to find the global minimum, with reduced speed of convergence and, despite the lowest ReLU zero activations rate, the worst prediction accuracy among all runs due to high loss values.
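The ratio of ReLU-zero activations discussed in the caption can be understood as the fraction of hidden units whose output is exactly zero after the ReLU nonlinearity. The sketch below shows this computation on a hypothetical vector of pre-activation values; how exactly the plotted ratio was collected during training is an assumption, and the function names are illustrative.

```python
def relu(vector):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in vector]


def zero_fraction(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


# Hypothetical pre-activation values of eight hidden units.
pre_activations = [-1.2, 0.4, -0.3, 2.0, -0.7, 0.1, -2.5, -0.9]
ratio = zero_fraction(relu(pre_activations))
print("ratio of ReLU-zero activations:", ratio)
```

A ratio that keeps rising during training indicates that more and more units output zero for the inputs seen and therefore pass no gradient for them.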


Figure 4: The histograms of various tensors collected during three runs within the hidden layer (left: γ=0.0001, middle: γ=0.001, right: γ=0.00001). Examining the weights histograms, it may be noticed that the middle one has the widest base with a sharp peak, which means that the layer converged but the search space was the widest among all runs. The left one has a narrower base and a sharp peak, which means that the layer converged within a narrower search space and as a result has better prediction power. The right one has a narrow base but a wide plateau at the top, which means that the search space is narrow but the algorithm still failed to converge. The left and middle histograms have sharp peaks compared to the right one, which may be a signal that their learning rate values have more relevance for algorithm convergence, and as a result we have better predictions for those learning rates.


Figure 5: The tensor network graph for the multi-layer perceptron with one hidden layer (fully_connected) and one linear output layer (fully_connected1). The input layer is presented as the input tensor placeholder. The hidden layer has a ReLU activation nonlinearity. The loss function is MSE (mean squared error). The train optimizer is Adam (Adaptive Moment Estimation).
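For readers unfamiliar with the two training components named in the caption, the following sketch shows the MSE loss and a single-parameter Adam update as described in [Kingma, Ba, 2014], applied to a toy one-dimensional linear fit. It is an illustration only: the toy data, the learning rate of 0.01 and the function names are assumptions, and the experiments themselves rely on the TensorFlow implementation.

```python
import math


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def adam_step(param, grad, m, v, t, lr=0.0001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Toy data generated by y = 2 * x; the model is y = w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, m, v = 0.0, 0.0, 0.0
initial_loss = mse([w * x for x in xs], ys)
for t in range(1, 2001):
    # Gradient of the MSE loss with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
final_loss = mse([w * x for x in xs], ys)

print("initial loss:", initial_loss)
print("final loss:", final_loss, "fitted w:", w)
```

Because of the bias correction, the very first Adam update has a magnitude close to the learning rate regardless of the scale of the gradient.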


B. Deep NN evaluation plots

The following pages provide plots and diagrams related to evaluation of studied deep neural networks.

Figure 6: The loss function plot against the train and validation data sets, for K=1024 input features and three hidden layers with [2048,1024,1024] units per layer respectively. It can be seen that the model overfitted the training data, and as a result no further improvements over the validation data can be gained with more training steps.


Figure 7: The tensor network graph for the DNN with two hidden layers. The input layer is presented as the input tensor placeholder. The hidden layers have a ReLU activation nonlinearity. The loss function is MSE (mean squared error). The train optimizer is Adam (Adaptive Moment Estimation).


Figure 8: The tensor network graph for the DNN with three hidden layers. The input layer is presented as the input tensor placeholder. The hidden layers have a ReLU activation nonlinearity. The loss function is MSE (mean squared error). The train optimizer is Adam (Adaptive Moment Estimation).


References

[Kosinski et. al, 2016] Michal Kosinski, Yilun Wang, Himabindu Lakkaraju, and Jure Leskovec (2016). Mining Big Data to Extract Patterns and Predict Real-Life Outcomes. Psychological Methods 2016, Vol. 21, No. 4, 493-506. DOI: 10.1037/met0000105

[Lambiotte, R., and Kosinski, M., 2014] Lambiotte, R., and Kosinski, M. (2014). Tracking the digital footprints of personality. Proceedings of the Institute of Electrical and Electronics Engineers, 102, 1934-1939. DOI: 10.1109/JPROC.2014.2359054

[Goldberg et. al, 2006] Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., and Gough, H. G. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96. DOI: 10.1016/j.jrp.2005.08.007

[Golub, G. H., and Reinsch, C. 1970] Golub, G. H., and Reinsch, C. (1970). Singular value decomposition and least squares solutions. Numerische Mathematik, 14, 403-420. DOI: 10.1007/BF02163027

[Abdi, H., 2003] Abdi, H. (2003). Factor rotations in factor analyses. In M. Lewis-Beck, A. E. Bryman, & T. F. Liao (Eds.), The SAGE encyclopedia of social science research methods (pp. 792-795). Thousand Oaks, CA: SAGE.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep learning. Manuscript in preparation. Retrieved from http://www.deeplearningbook.org/

[Daphne Koller, 2012] Daphne Koller (2010-2012). Probabilistic Graphical Models. Stanford University. Retrieved from http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=ProbabilisticGraphicalModels

[Cortes & Vapnik, 1995] Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning. 20 (3): 273-297. DOI: 10.1007/BF00994018

[Yan, Su, 2009] Xin Yan, Xiao Gang Su (2009). Linear Regression Analysis: Theory and Computing. World Scientific, pp. 1-2. ISBN 9789812834119

[David A. Freedman, 2009] David A. Freedman (2009). Statistical Models: Theory and Practice. Cambridge University Press, p. 26.

[Hilary L. Seal, 1967] Hilary L. Seal (1967). The historical development of the Gauss linear model. Biometrika. 54 (1/2): 1-24. DOI: 10.1093/biomet/54.1-2.1

[Rodriguez, G., 2007] Rodriguez, G. (2007). Lecture Notes on Generalized Linear Models. Chapter 3, p. 45. Retrieved from http://data.princeton.edu/wws509/notes/

[Sing, Sander, Beerenwinkel, Lengauer, 2005] Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: Visualizing classifier performance in R. Bioinformatics, 21, 3940-3941. DOI: 10.1093/bioinformatics/bti623

[Gain, 1951] Gain, A. K. (1951). The frequency distribution of the product moment correlation coefficient in random samples of any size drawn from non-normal universes. Biometrika. 38: 219-247. DOI: 10.1093/biomet/38.1-2.219

[Kohavi, Ron, 1995] Kohavi, Ron (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann. 2 (12): 1137-1143. CiteSeerX 10.1.1.48.529

[Zhang, Marron, Shen,& Zhu, 2007] Zhang, L., Marron, J., Shen, H., & Zhu, Z. (2007). Singular value decomposition and its visualization. Journal of Computational and Graphical Statistics, 16, 833-854.


[van Buuren, Groothuis-Oudshoorn, 2011] Stef van Buuren and Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45 (3). American Statistical Association. Retrieved from http://doc.utwente.nl/78938/

[Rubin DB, 1987] Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.

[Christopher M. Bishop, 1995] Christopher M. Bishop (1995). Neural Networks for Pattern Recognition. Oxford University Press, Inc., New York, NY, USA. ISBN: 0198538642

[Bengio, Yoshua, 2009] Bengio, Yoshua (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. 2 (1): 1-127. DOI: 10.1561/2200000006

[LeCun, Bengio, Hinton, 2015] LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). Deep learning. Nature. 521: 436-444. DOI: 10.1038/nature14539

[Tetko, Livingstone, Luik, 1995] Tetko, I. V.; Livingstone, D. J.; Luik, A. I. (1995). Neural network studies. 1. Comparison of Overfitting and Overtraining. J. Chem. Inf. Comput. Sci. 35 (5): 826-833. DOI: 10.1021/ci00027a006

[Hinton, G. et al., 2012] Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580v1

[Nair, Hinton, 2010] Vinod Nair and Geoffrey Hinton (2010). Rectified linear units improve restricted Boltzmann machines. ICML. Retrieved from http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf

[Glorot, Bordes, Bengio, 2011] Xavier Glorot, Antoine Bordes, Yoshua Bengio (2011). Deep Sparse Rectifier Neural Networks. JMLR W&CP 15: 315-323. Retrieved from http://jmlr.org/proceedings/papers/v15/glorot11a.html

[Kingma, Ba, 2014] Diederik P. Kingma; Lei Jimmy Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

[Ba, Caruana, 2014] Lei Jimmy Ba, Rich Caruana (2014). Do Deep Nets Really Need to be Deep? arXiv preprint arXiv:1312.6184

[Kirkpatrick et al., 1983] S. Kirkpatrick, C. D. Gelatt Jr., M. P. Vecchi (1983). Optimization by Simulated Annealing. Science, 13 May 1983: Vol. 220, Issue 4598, pp. 671-680. DOI: 10.1126/science.220.4598.671

[R Core Team, 2015] R Core Team (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/

[Google Brain Team, 2015] Google Brain Team (2015). TensorFlow is an open source software library for numerical computation using data flow graphs. Retrieved from https://www.tensorflow.org
