Archived project

Applying Deep Machine Learning for analysis of psycho-demographic traits

Goal: In this work we show how to apply advanced machine learning algorithms and methodologies to analyze a data corpus comprising collected Facebook likes and psycho-demographic traits of individuals. We consider common methodologies used for data corpus pre-processing and analysis. Finally, we compare the performance of simple linear and logistic regression predictive models against advanced models based on deep machine learning algorithms in order to find out whether the advanced methodologies outperform the simple ones. The work is accompanied by full source code in the R programming language.

Methods: Artificial Neural Network, Machine Learning, R Programming, Logistic Regression Analysis, Multiple Linear Regression, Google TensorFlow

Date: 2 January 2017

Project log

Iaroslav Omelianenko
added 2 research items
Presentation of the research paper: Applying Deep Machine Learning for Psycho-Demographic Profiling of Internet Users using O.C.E.A.N. Model of Personality
Iaroslav Omelianenko
added an update
Research paper deposited to the arXiv preprint service: https://arxiv.org/abs/1703.06914
 
Iaroslav Omelianenko
added an update
Research paper deposited to the HAL archive: https://hal.archives-ouvertes.fr/hal-01482770
 
Iaroslav Omelianenko
added 3 research items
In the modern era, each Internet user leaves enormous amounts of auxiliary digital residuals (footprints) by using a variety of on-line services. All this data has already been collected and stored for many years. Recent works have demonstrated that it is possible to apply simple machine learning methods to analyze collected digital footprints and to create psychological profiles of individuals. However, while these works clearly demonstrated the applicability of machine learning methods for such an analysis, the simple prediction models they created still lack the accuracy necessary to be successfully applied to practical needs. We assumed that using advanced deep machine learning methods may considerably increase the accuracy of predictions. We started with simple machine learning methods to estimate baseline prediction performance and moved further by applying advanced methods based on shallow and deep neural networks. Then we compared the prediction power of the studied models and drew conclusions about their performance. Finally, we formulated hypotheses on how prediction accuracy can be further improved. As a result of this work, we provide the full source code used in the experiments to all interested researchers and practitioners in the corresponding GitHub repository. We believe that applying deep machine learning to psychological profiling may have an enormous impact on society (for good or worse), and by providing the full source code of our research we hope to intensify further research by a wider circle of scholars.
Iaroslav Omelianenko
added an update
Further experiments with a four-layer DNN.
 
Iaroslav Omelianenko
added an update
Experiments with different hidden layer sizes and a new dropout scheme:
  • We reduced the sizes of the hidden layers to 256,256,256 and 340,200,200
  • We applied dropout only after two hidden layers, with keep probability i/(2*n), where n is the number of dropouts
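The keep-probability rule above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the log does not define i, so the sketch assumes i is the 1-based index of the dropout (i = 1..n), and the helper names `keep_probabilities` and `inverted_dropout` are hypothetical. The dropout itself is the common "inverted" variant, in which surviving activations are scaled by 1/keep_prob.

```python
import random


def keep_probabilities(n):
    """Keep probability i/(2*n) for each of n dropouts.

    Assumption: i is the 1-based index of the dropout (i = 1..n);
    the project log does not define i.
    """
    return [i / (2 * n) for i in range(1, n + 1)]


def inverted_dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob and scale the
    survivors by 1/keep_prob so the expected activation is unchanged."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)                 # fixed seed for repeatability
probs = keep_probabilities(2)          # two dropouts -> [0.25, 0.5]
layer_output = [1.0] * 8               # toy activations, not real data
dropped = inverted_dropout(layer_output, probs[1], rng)
```

Under this reading, two dropouts give keep probabilities of 0.25 and 0.5, so the earlier dropout is the more aggressive one.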
 
Iaroslav Omelianenko
added an update
Conducted an experiment with batch size = 10.
No statistically significant difference in prediction accuracy was found between runs with batch sizes 100 and 10.
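The log does not say how the difference between the runs was tested. One possible check, shown here purely as an assumption, is Welch's t statistic computed over the accuracies of repeated runs at each batch size. The function name `welch_t` is hypothetical, and the inputs in the usage lines are toy numbers, not experimental results.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples.

    Each sample needs at least two values, and the samples must not both
    be constant (that would make the standard error zero).
    """
    va = variance(a) / len(a)          # squared standard error of sample a
    vb = variance(b) / len(b)          # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy numbers only, standing in for per-run accuracies.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A small |t| relative to the critical value for the computed degrees of freedom would be consistent with "no significant difference".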
 
Iaroslav Omelianenko
added an update
An experiment with a different non-linear activation. Histograms of the ReLU activations suggest that the best results are obtained when there is an additional peak near zero. We therefore decided to apply a symmetric non-linearity such as softsign and compare performance. ReLU still seems to perform better.
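For reference, the two activations can be written out directly. This is a minimal sketch, assuming the symmetric non-linearity meant here is the softsign function x/(1+|x|). It is not the project's TensorFlow code.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return x if x > 0.0 else 0.0


def softsign(x):
    """Softsign: odd-symmetric around zero, bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


# Compare the two on a few sample inputs.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
table = [(x, relu(x), softsign(x)) for x in samples]
```

ReLU maps every negative input to exactly zero, which is one source of a peak at zero in its activation histogram. Softsign keeps the sign of the input and saturates smoothly on both sides.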
 
Iaroslav Omelianenko
added an update
Further experiments with a DNN with three hidden layers
 
Iaroslav Omelianenko
added an update
First experiment with a DNN architecture with two hidden ReLU-activated layers and dropout regularization.
  • Attempt to find the optimal number K of SVD dimensions given learning rate 10e-4
Found an interesting case of overfitting during training, given:
  • K = 1024
  • DNN hidden layers [1024, 512]
  • dropout 0.5 after each hidden layer
 
Iaroslav Omelianenko
added an update
Further experiments with a two-layer fully connected multilayer perceptron, aimed at hyperparameter optimization:
  1. Finding the optimal number of SVD dimensions to apply to the input feature set
  2. Finding the optimal learning rate for the optimal number of SVD dimensions
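The two steps above amount to a sequential search: fix the learning rate, pick the best number of SVD dimensions, then tune the learning rate for that choice. The sketch below is illustrative only. `two_stage_search`, `toy_score` and all candidate values are hypothetical, and `toy_score` merely stands in for "train the network and return a validation score".

```python
def two_stage_search(evaluate, k_values, learning_rates, default_lr):
    """Choose K at a fixed learning rate, then tune the learning rate for that K.

    evaluate(k, lr) must return a score where higher is better.
    """
    best_k = max(k_values, key=lambda k: evaluate(k, default_lr))
    best_lr = max(learning_rates, key=lambda lr: evaluate(best_k, lr))
    return best_k, best_lr


def toy_score(k, lr):
    """Hypothetical stand-in for a real train-and-validate run."""
    return -abs(k - 128) - 1000.0 * abs(lr - 0.001)


best_k, best_lr = two_stage_search(
    toy_score,
    k_values=[32, 64, 128, 256],
    learning_rates=[0.01, 0.001, 0.0001],
    default_lr=0.001,
)
```

A sequential search needs far fewer training runs than a full grid, at the cost of missing interactions between K and the learning rate.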
 
Iaroslav Omelianenko
added an update
  1. Conducted an experiment applying a two-layer fully connected feed-forward artificial neural network to the preprocessed data corpus (see mlp_network_graph.png)
  2. Found the optimal number of SVD dimensions for dimensionality reduction
  3. Found the optimal loss function optimization algorithm (see results_150_50.txt)
  4. Found the optimal learning rate for the selected loss function optimization algorithm (see mlp_loss_relu-zero_combined.png, mlp_hidden_hist.png)
 
Iaroslav Omelianenko
added an update
  1. Conducted an experiment applying simple ML algorithms such as linear/logistic regression to the preprocessed data corpus
  2. Determined the optimal number of SVD dimensions to apply to the data features in order to reduce dimensionality (see svd_traits_regression_correlations.png)
  3. Made a preliminary analysis of correlations between the dependent variables and the SVD dimensions (see svd_correlation_hmap.png)
  4. Calculated the accuracy of prediction models based on linear/logistic regression (see pred_accuracy_regr.txt), which will serve as a baseline for evaluating the prediction power of advanced ML models
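The correlation analysis in step 3 can be illustrated with a plain Pearson correlation between one dependent variable and one SVD dimension. This is a sketch rather than the project's R code: the function name `pearson` and the toy inputs are assumptions, and the log does not state which correlation coefficient was actually used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy values standing in for a trait column and one SVD dimension.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Repeating this for every pair of trait and SVD dimension gives a matrix of the kind that a heat map visualizes.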
 
Iaroslav Omelianenko
added an update
Implemented data corpus preprocessing and generated the corresponding intermediate data sets, saved as R matrices and data frames
 
Iaroslav Omelianenko
added a project goal
In this work we show how to apply advanced machine learning algorithms and methodologies to analyze a data corpus comprising collected Facebook likes and psycho-demographic traits of individuals. We consider common methodologies used for data corpus pre-processing and analysis. Finally, we compare the performance of simple linear and logistic regression predictive models against advanced models based on deep machine learning algorithms in order to find out whether the advanced methodologies outperform the simple ones. The work is accompanied by full source code in the R programming language.