Article

Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys

Authors:
Mick P. Couper

Abstract

In this paper I review three key technology-related trends: 1) big data, 2) non-probability samples, and 3) mobile data collection. I focus on the implications of these trends for survey research and the research profession. With regard to big data, I review a number of concerns that need to be addressed, and argue for a balanced and careful evaluation of the role that big data can play in the future. I argue that these developments are unlikely to replace traditional survey data collection, but will supplement surveys and expand the range of research methods. I also argue for the need for the survey research profession to adapt to changing circumstances. © European Survey Research Association.


... The established prestige of probability surveys faces other external pressures, such as the rise in the popularity of Big Data and the growing prominence of probability and nonprobability opt-in panels. These developments triggered a debate about whether surveys have a future (Couper 2013;Miller 2017;Singer 2016). Even if the sky has not yet fallen (Couper 2013), probability surveys no longer seem an indispensable method (Cornesse et al. 2020;Simmons and Bobo 2015). ...
... Given the availability of other sources of insight into public opinion, which are typically easier and cheaper to deploy, high data quality remains key to the continuing viability of probability surveys (Beaumont 2020;Lavrakas et al. 2022;MacInnis et al. 2018). ...
... Over the last two decades, the methodology and practice of survey research have faced substantial challenges resulting from a rapid proliferation of communication technologies, evolving societal norms, and changing lifestyles (Couper 2013;Miller 2017). Despite the rise of alternative avenues of data acquisition, probability surveys have nevertheless remained indispensable sources of insight into public opinion and social attitudes. ...
Article
Full-text available
This paper assesses trends in three survey outcome rates within four prominent cross-national comparative surveys conducted in European countries in the 21st century: the European Quality of Life Survey, the European Social Survey, the European Values Study, and the International Social Survey Programme. These projects are recognised for their high-quality sampling and fieldwork procedures, extensive track records, and commitment to rigorous methodological standards. The analysis is based on 753 national surveys conducted on probability samples of the general population in 36 European countries from 1999 to 2018. We investigated whether two essential survey characteristics, namely sampling frames and data collection modes, moderated the decrease of survey outcome rates over time. To analyse these relationships, the survey year was included as the explanatory variable, and we applied multi-level linear regressions with surveys nested within countries. Additionally, the project name was incorporated as a fixed factor, and the sampling frame and mode of data collection were control variables for the effect of time. Our study provides valuable insights into the challenges of conducting high-quality pan-European cross-national comparative surveys over nearly two decades. We observed a consistent decline in survey outcome rates, irrespective of country or project. Neither the sampling frame nor the data collection mode moderated this decline. Hence, even though personal register samples and face-to-face interviews are often regarded as enhancements to overall survey quality, their application does not effectively counter the factors causing a decline in survey outcome rates.
... Non-probability samples are sets of selected objects where the sampling mechanism is unknown. First of all, non-probability samples are readily available from many data sources, such as satellite information (McRoberts et al. 2010), mobile sensor data (Palmer et al. 2013), and web survey panels (Tourangeau et al. 2013). In addition, these non-representative samples are far more cost-effective compared to probability samples and have the potential of providing estimates in near real-time, unlike the traditional inferences derived from probability samples. ...
... Next, we adopt the model-design-based framework for inference, which incorporates the randomness over the two phases of sampling (... 1983; Molina et al. 2001; Binder and Roberts; Xu et al. 2013). The asymptotic properties for μ_A and μ_B can be derived using the standard M-estimation theory under suitable moment conditions. ...
Preprint
Full-text available
Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework of the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case when the study variable is observed in both datasets for estimating the target parameters, and each contains other auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and decide whether or not to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both data for efficient estimation. Otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that target the smallest mean square error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval that has a good finite-sample coverage property.
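To make the test-and-pool idea concrete, here is a minimal sketch of the decision logic (not the authors' estimator; function and variable names are invented): it pretests whether the unweighted non-probability mean is statistically comparable to the design-weighted probability-sample mean, pools the two with inverse-variance weights if the pretest does not reject, and otherwise falls back on the probability sample alone.

```python
import numpy as np
from scipy import stats

def test_and_pool_mean(y_prob, w_prob, y_nonprob, alpha=0.05):
    """Toy test-and-pool estimator of a population mean.

    y_prob, w_prob : outcomes and design weights from the probability sample
    y_nonprob      : outcomes from the non-probability sample
    alpha          : significance level of the comparability pretest
    """
    # Design-weighted (Hajek) mean and a rough variance from the probability sample
    mu_p = np.average(y_prob, weights=w_prob)
    var_p = np.sum((w_prob / w_prob.sum()) ** 2 * (y_prob - mu_p) ** 2)

    # Unweighted mean and variance from the non-probability sample
    mu_np = np.mean(y_nonprob)
    var_np = np.var(y_nonprob, ddof=1) / len(y_nonprob)

    # Pretest: are the two estimates statistically comparable?
    z = (mu_p - mu_np) / np.sqrt(var_p + var_np)
    if abs(z) < stats.norm.ppf(1 - alpha / 2):
        # Comparable: pool with inverse-variance weights for efficiency
        w1, w2 = 1 / var_p, 1 / var_np
        return (w1 * mu_p + w2 * mu_np) / (w1 + w2), "pooled"
    # Not comparable: retain only the (design-unbiased) probability sample
    return mu_p, "probability-only"
```

A faithful implementation would additionally handle the non-regular asymptotics of the resulting estimator and the data-adaptive choice of the pretest threshold that the paper describes.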
... This is because social media data, unlike survey data, are "found data," not "designed data" (Couper 2013). ...
... The free Streaming API provides only real-time data, is limited to about 1% of Twitter traffic, and does not provide a representative sample of tweets in many cases (Hino and Fahey 2019). In addition, Twitter is quite often unable to answer specific questions posed by researchers and data users, because it is not "designed data" (Couper 2013), which makes resorting to other data sources necessary to compensate for the weaknesses of the Twitter datasets (Callegaro and Yang 2018). ...
... The sampling strategies undertaken when no sampling frame is available are based either on non-random methods (convenience, purposive or snowball sampling) or on data search. On the one hand, this is a consequence of the lack of sampling frames, which impedes random selection (Groves et al. 2009, p. 94; Couper 2013); on the other, it is a choice made to guarantee an efficient sampling procedure: selecting a sample of Twitter users by first locating tweets they posted containing the topic or event of interest is an efficient strategy for locating hard-to-reach or topic/event-specific populations. The selection is based on the dependent variable, which guarantees the users are selected within the scope of the study; however, there is a risk of oversampling the users who are more active on the platform and undersampling those who rarely use it or seldom post content (Bruns and Stieglitz 2013; Zhang et al. 2018). ...
Article
Full-text available
All social media platforms can be used to conduct social science research, but Twitter is the most popular as it provides its data via several Application Programming Interfaces, which allows qualitative and quantitative research to be conducted with its members. As Twitter is a huge universe, both in number of users and amount of data, sampling is generally required when using it for research purposes. Researchers only recently began to question whether tweet-level sampling—in which the tweet is the sampling unit—should be replaced by user-level sampling—in which the user is the sampling unit. The major rationale for this shift is that tweet-level sampling does not consider the fact that some core discussants on Twitter are much more active tweeters than other less active users, thus causing a sample biased towards the more active users. The knowledge on how to select representative samples of users in the Twitterverse is still insufficient despite its relevance for reliable and valid research outcomes. This paper contributes to this topic by presenting a systematic quantitative literature review of sampling plans designed and executed in the context of social science research in Twitter, including: (1) the definition of the target populations, (2) the sampling frames used to support sample selection, (3) the sampling methods used to obtain samples of Twitter users, (4) how data is collected from Twitter users, (5) the size of the samples, and (6) how research validity is addressed. This review can be a methodological guide for professionals and academics who want to conduct social science research involving Twitter users and the Twitterverse.
... Actually, several researchers had already pointed out that a barrier to performing big data analytics was that most statistical software could not handle the large file sizes (Vajjhala et al., 2015). However, researchers have found ways around the big data five V's, at least the volume, velocity, and variety attributes, by using cloud-based and distributed software such as Hadoop along with sampling techniques to reduce the five V's (Couper, 2013; Varian, 2014). Nonetheless, this is where another hidden healthcare big data problem lurks. ...
... In layman's terms, researchers of social media big data do not know who in the population is excluded, who is not texting or responding, or even the true extent of the underlying population. Thus, it is clear there is a sampling bias beyond nonresponse in the entire big data paradigm (Couper, 2013; Varian, 2014). Almost an entire global generation and many world-wide non-English speaking cultures could be missing in popular social media big data files, depending on the situation. ...
... The difference between primary and secondary research data collection is that primary research data collection involves conducting the research oneself, or using the data for the purpose it was intended for. Secondary research data, on the other hand, were collected by a third party or for some other purpose (Couper, 2013). An advantage of using primary data is that researchers are collecting information for the specific purposes of their study. ...
Chapter
Full-text available
This chapter discusses several fundamental and managerial controversies associated with artificial intelligence and big data analytics which will be of interest to quantitative professionals and practitioners in the fields of computing, e-commerce, e-business services, and e-government. The authors utilized the systems thinking technique within an action research framework. They used this approach because their ideology was pragmatic, the problem at hand was complex and institutional (healthcare discipline), and they needed to understand the problems from both a practitioner and a nonhuman technology process viewpoint. They used a literature review along with practitioner interviews collected at a big data conference. Although they found many problems, they considered these to be already encompassed in the big data five V's (volume, velocity, variety, value, veracity). Interestingly, they uncovered three new insights about the hidden healthcare artificial intelligence and big data analytics risks; they then proposed solutions for each of these problems.
... The increase in the number of activities reliant on mobile phones and mobile apps has improved users' familiarity with their devices and has paved the way to using mobile devices (tablets, smartphones, mobile Web) for conducting surveys and collecting data (Couper, 2013; Marcano Belisario et al., 2015), even in migration studies (Rocheva et al., 2022). Social studies have therefore analysed big data from social networking sites or online services targeting migrants; moreover, online surveys have become more frequent. ...
... Mobile apps can represent a useful and additional data-collecting method as they can both passively collect data (e.g. geolocalisation) or be used to conduct surveys (Couper, 2013). ...
Article
Full-text available
Despite the process of migrant settlement in Italy, there is still a lack of data on migrant health and access to health services. We have developed the mobile app “ComeStai” to collect new data on this population. ComeStai collects new and original information on migrants' self-assessed health status and access to health care through short surveys. The main questionnaire includes questions about the respondent's immigrant status, health status and health habits, access to and use of health care, and socio-demographic characteristics. Every month, the app displays a follow-up survey to track key variables over time, including respondents' health status and experiences of barriers to accessing health care. After submitting the answers to the main questionnaire, the user can access an in-app platform that provides an overview of the Italian public health system and regulations, a brief description of the main public health services, and a list of NGOs providing health services or support in the Metropolitan area of Milan, as well as their location on a map. Our study aims to test the efficacy of a mobile application as a data collection tool. We present the fieldwork and discuss the advantages and challenges of such a tool. Our evaluation considers the sample size achieved and the heterogeneity and representativeness of the sample relative to the target population. Based on the results, we will evaluate whether to extend the research to a larger geographical scale.
... Despite the existence of an alarmist discourse about a "changing scientific paradigm" and a "new turn" in the social sciences [27], researchers generally agree that developments in the area of big data are unlikely to replace data collection by reactive methods, but will probably supplement it and expand the range of research methods employed [28]. "Non-reactivity" as a conceptual distinction is important to articulate when discussing methodological questions of integration, where one of the important issues is the difference between the types of data being linked, as well as the ways in which they are obtained. ...
... However, it is not only the correct measurement of latent constructs that poses difficulties; above all, digital traces show low validity in measuring basic socio-demographic parameters. For instance, a Pew Research Center report notes that in Google Consumer Survey (GCS) results the inferred gender (presumably computed by algorithms) matches the gender reported by the respondent in the survey in about 75% of cases, while age categories match in about 44% of cases [36], which substantially reduces the validity and reliability of the data and limits the possibilities for multivariate analysis [28]. ...
Article
The main purpose of the current study is to review the main existing methodological approaches to the integration of survey data and digital traces used in sociological research. The paper examines key arguments in the current methodological discussion about the place of big digital data in contemporary social science research. The authors attempt to scrutinize the practice of integrating survey data and digital traces through the concept of "reactive versus non-reactive" measurement. The possible functions of digital traces in study design are indicated, using social media data as an example. Drawing on three research areas (the study of media consumption, media effects, and electoral behavior), general methodological principles for integrating data of different natures are demonstrated, and possible prospects for the development of these approaches are described. The article discusses a wide range of methodological issues: problems of data-linking validity; potential threats to the validity of digital traces; and opportunities to improve survey questionnaires, enrich data, search for new valid indicators of socio-political processes, and cross-validate research results. Current practices of integrating administrative data are considered as well.
... In addition, Twitter opens part of its user-created data to the public in the form of an Application Programming Interface (API), called the Twitter API. For example, the Twitter Streaming API, which allows users to retrieve real-time tweets from Twitter, is known to provide up to a 1% sample of all the tweets created on Twitter at a given time. While this 1% sample may appear to be too small to be used in a study, it could be sufficient in many cases, considering the enormous size of the entire data. On the other hand, it is known that the random samples from Twitter could have a potential bias [5]. ...
... Few studies, however, provide a generic procedure that guides researchers who want to leverage social media data, more specifically Twitter data, for social trend analysis. This study has two main objectives: (1) to effectively identify the target audience of users in Twitter data by user profiling and (2) to develop topical and social insights from the collective voice of the target users. For the user profiling task, specifically, we present text-based customized user profiling, which can be considered to be an alternative when there are no existing user profiling solutions that are available or work for the user attribute or the data of interest. ...
Article
Full-text available
This study develops a pragmatic scheme that facilitates insight development from the collective voice of target users in Twitter, which has not been considered in the existing literature. While relying on a wide range of existing approaches to Twitter user profiling, this study provides a novel and generic procedure that enables researchers to identify the right users in Twitter and discover topical and social insights from their tweets. To identify a target audience of Twitter users that meets certain criteria, we first explore user profiling, potentially followed by text-based, customized user profiling leveraging hashtags as features for machine learning. We then present how to mine popular topics and influential actors from Twitter data. Two case studies on 16 thousand young women interested in fashion and 68 thousand people sharing the same interest in the Me Too movement indicate that our approach facilitates discovery of social trends among people in a particular domain.
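As a rough illustration of what text-based user profiling with hashtags as machine-learning features can look like (a sketch with hand-made example data and labels, not the study's pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one concatenated tweet history per user,
# with a manually assigned label for the attribute of interest.
user_texts = [
    "#fashion #ootd loving this look #style",
    "#gamedev shipped a new build today #indiedev",
    "#fashion haul and a #makeup tutorial",
    "#python #machinelearning notes from the meetup",
]
user_labels = ["fashion", "other", "fashion", "other"]

# Use hashtags only as features, mirroring the idea of hashtag-based profiling.
profiler = make_pipeline(
    CountVectorizer(token_pattern=r"#\w+", lowercase=True),
    LogisticRegression(max_iter=1000),
)
profiler.fit(user_texts, user_labels)

# Classify a previously unseen user from their tweet history.
print(profiler.predict(["#style tips and a #fashion week recap"]))
```

In practice the labelled users would come from an existing profiling step or manual annotation, and the classifier would be validated before being applied at scale.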
... [19]), namely volume, velocity, and variety. Social media are a sub-type of big data where people express their thoughts and opinions with the purpose of sharing them with others [18]. Due to their inherent properties, SMD have been seen as a promising complementary, and even alternative, source of data for exploring PO. ...
... We have identified several dimensions based on recurring criteria mentioned in the literature concerning the nature of and the relationship between both data sources. Often-cited criteria include the type of population and data signal, the unit of observation and analysis, and the available meta-data (for a thorough discussion of the differences see [18,68], and [77]). ...
Article
Full-text available
In this article, we review existing research on the complementarity of social media data and survey data for the study of public opinion. We start by situating our review in the extensive literature (N = 187) about the uses, challenges, and frameworks related to the use of social media for studying public opinion. Based on these 187 relevant articles (141 empirical and 46 theoretical), we identify within the empirical ones six main research approaches concerning the complementarity of both data sources. Results show that the biggest share of the research has focused on how social media can be used to confirm survey findings, especially for election predictions. The main contribution of our review is to detail and classify other growing complementarity approaches, such as comparing both data sources on a given phenomenon, using survey measures as a proxy in social media research, enriching surveys with SMD, recruiting individuals on social media to conduct a second survey phase, and generating new insight on “old” or “under-investigated” topics or theories using SMD. We discuss the advantages and disadvantages associated with each of these approaches in relation to four main research purposes, namely the improvement of validity, sustainability, reliability, and interpretability. We conclude by discussing some limitations of our study and highlighting future paths for research.
... Big datasets tend to have a high number of cases, but few covariates (Couper 2013). This scarcity of covariates is not an issue if the objective is to estimate a single figure. ...
... Other issues are the access and privacy policies. Most of the time, big data sources are proprietary and access, therefore, is restricted (Couper 2013). ...
Article
Full-text available
Nowadays, while surveys still dominate the research landscape in the social sciences, alternative data sources such as social media posts or GPS data open a whole range of opportunities for researchers. In this scenario, some voices advocate for a progressive substitution of survey data. They anticipate that big data, which are cheaper and faster to obtain than surveys, will be enough to answer relevant research questions. However, this optimism contrasts with the quality and accessibility issues associated with big data, such as the lack of coverage of some population groups or the restricted access to some of these sources. Drawing on an in-depth review of the literature of recent years, this article explores how the combination of big data and surveys results in significant improvements in data quality and a reduction in survey costs.
... On the other side, new modes of data collection using sensors, web portals, and smart devices have emerged that routinely capture a variety of human activities. These automated processes have led to an ever-accumulating massive volume of unstructured information, so-called "Big Data" (Couper, 2013; Kreuter and Peng, 2014; Japec et al., 2015). While being cheaper, larger, faster, and more detailed makes Big Data appealing as an alternative or supplement to probability surveys, the non-probabilistic nature of their data-generating process introduces new impediments to valid inference from such data. ...
... Examples include political views shared on social media, Google searches for particular terms, payment transactions recorded by online stores, electronic health records of the patients admitted to a group of hospitals, videos captured by traffic cameras, and mobile GPS trajectory data by satellite. This broad range of data examples share several common characteristics in terms of volume, velocity, variety, and veracity (Couper, 2013). ...
Thesis
The steady decline of response rates in probability surveys, in parallel with the fast emergence of large-scale unstructured data (“Big Data”), has led to a growing interest in the use of such data for finite population inference. However, the non-probabilistic nature of their data-generating process makes big-data-based findings prone to selection bias. When the sample is unbalanced with respect to the population composition, the larger data volume amplifies the relative contribution of selection bias to total error. Existing robust approaches assume that the models governing the population structure or selection mechanism have been correctly specified. Such methods are not well-developed for outcomes that are not normally distributed and may perform poorly when there is evidence of outlying weights. In addition, their variance estimator often lacks a unified framework and relies on asymptotic theory that might not have good small-sample performance. This dissertation proposes novel Bayesian approaches for finite population inference based on a non-probability sample where a parallel probability sample is available as the external benchmark. Bayesian inference satisfies the likelihood principle and provides a unified framework for quantifying the uncertainty of the adjusted estimates by simulating the posterior predictive distribution of the unknown parameter of interest in the population. The main objective of this thesis is to draw robust inference by weakening the modeling assumptions because the true structure of the underlying models is always unknown to the analyst. This is achieved through either combining different classes of adjustment methods, i.e. quasi-randomization and prediction modeling, or using flexible non-parametric models including Bayesian Additive Regression Trees (BART) and Gaussian Process (GP) Regression. More specifically, I modify the idea of augmented inverse propensity weighting such that BART can be used for predicting both propensity scores and outcome variables. This offers additional shields against model misspecification beyond the double robustness. To eliminate the need for design-based estimators, I take one further step and develop a fully model-based approach where the outcome is imputed for all non-sampled units of the population via a partially linear GP regression model. It is demonstrated that GP behaves as an optimal kernel matching tool based on the estimated propensity scores. To retain double robustness with good repeated sampling properties, I estimate the outcome and propensity scores jointly under a unified Bayesian framework. Further developments are suggested for situations where the reference sample is complex in design, and particular attention is paid to the computational scalability of the proposed methods where the population or the non-probability sample is large in size. Throughout the thesis, I assess the repeated sampling properties of the proposed methods in simulation studies and apply them to real-world non-probability sampling inference.
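For orientation, a generic doubly robust (AIPW-style) estimator for combining a non-probability sample with a probability reference sample can be sketched as follows; this uses off-the-shelf logistic and linear models rather than the dissertation's Bayesian BART/GP machinery, and the propensity step is deliberately simplified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_population_mean(X_np, y_np, X_ps, d_ps):
    """Doubly robust estimate of a population mean from a non-probability
    sample (X_np, y_np) and a reference probability sample (X_ps) with
    design weights d_ps. All inputs are NumPy arrays."""
    N_hat = d_ps.sum()  # estimated population size from the design weights

    # 1) Propensity model: probability of appearing in the non-probability sample.
    #    Simplified: fit on the stacked samples, weighting the probability sample
    #    by its design weights so that it stands in for the full population.
    X_all = np.vstack([X_np, X_ps])
    z_all = np.concatenate([np.ones(len(X_np)), np.zeros(len(X_ps))])
    w_all = np.concatenate([np.ones(len(X_np)), d_ps])
    prop = LogisticRegression(max_iter=1000).fit(X_all, z_all, sample_weight=w_all)
    pi_np = prop.predict_proba(X_np)[:, 1].clip(1e-3, 1 - 1e-3)

    # 2) Outcome model fitted on the non-probability sample.
    out = LinearRegression().fit(X_np, y_np)
    m_np, m_ps = out.predict(X_np), out.predict(X_ps)

    # 3) AIPW-style combination: inverse-propensity-weighted residuals plus
    #    model predictions projected onto the probability sample.
    return (np.sum((y_np - m_np) / pi_np) + np.sum(d_ps * m_ps)) / N_hat
```

The estimator remains approximately unbiased if either the propensity model or the outcome model is adequate, which is the double robustness the dissertation builds on and then extends with flexible non-parametric components.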
... Research on data mining has increased over the past few decades, especially with the prevalence of web-based social and communication technologies that have enabled individuals to generate time-stamped, geolocated data and share details about their immediate surroundings using location-based social network services (Han et al., 2020). Social media is a sub-type of big data, where people express their thoughts and opinions to share them with others (Couper, 2013). Since the release of ChatGPT, there has been a wide range of discussions in the virtual space. ...
Article
Full-text available
This study investigated the affordances, constraints, and implications of ChatGPT in education using the affordance theory and social-ecological systems theory. We employed a data mining approach that blends social media analytics including sentiment analysis and topic modelling and qualitative analysis to extract viewpoints from a collection of datasets consisting of 33,456 tweets. Key findings indicate that 42.1% of analysed tweets conveyed a positive sentiment, 39.6% were neutral, and only 18.3% conveyed a negative sentiment. We also identified five categories of ChatGPT properties (e.g., text and data analysis, AI and machine learning) and an array of affordances of ChatGPT in education (e.g., facilitating student personalised learning, classroom instruction, provision of educational resources, curriculum changes, and assessment). Meanwhile, the findings revealed key concerns, including academic dishonesty, bias, and ethics that warrant attention. This study contributes to a real-time understanding of the impact of ChatGPT on education and informs researchers, educators, and policymakers to take a holistic approach to evaluating ChatGPT in educational practices.
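The sentiment-analysis component of such a pipeline can be approximated with a lexicon-based scorer. The sketch below uses NLTK's VADER analyser on made-up tweets with the conventional compound-score cut-offs; it is an illustration of the technique, not the study's data or exact method.

```python
# Minimal lexicon-based sentiment pass over a collection of tweets (illustrative only).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
sia = SentimentIntensityAnalyzer()

tweets = [
    "ChatGPT helped me design a personalised study plan, impressive!",
    "Worried that ChatGPT will make cheating on essays trivial.",
    "Tried ChatGPT for lesson planning today.",
]

def label(text, pos=0.05, neg=-0.05):
    # VADER's compound score ranges from -1 to 1; the cut-offs are conventional.
    score = sia.polarity_scores(text)["compound"]
    return "positive" if score >= pos else "negative" if score <= neg else "neutral"

counts = {}
for t in tweets:
    counts[label(t)] = counts.get(label(t), 0) + 1
print(counts)  # e.g. {'positive': 1, 'negative': 1, 'neutral': 1}
```

Topic modelling would typically follow on the same corpus (for example with an LDA or BERTopic-style model) to group the labelled tweets into themes such as those reported above.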
... This evidence seems to be more marked in the case of young people and young adults who use mobile devices to answer online self-administered questionnaires [49]. With this survey mode, respondents may also choose to respond at their own convenience, thus, in principle, increasing their ability to focus on the topics covered [20]. The main distinction among survey modes relates to the presence or absence of an interviewer. ...
Technical Report
Full-text available
Responses to survey questions can be influenced by the data collection methods or the phrasing of questions. This paper uses 'CUB models' to identify the cognitive process components underlying responses. The methodology is applied to data from sample surveys of Italian households and firms conducted by the Bank of Italy. The results show that different data collection modes and question wording or graphical representation do not affect respondents' perceptions of the phenomena under review, but they can affect response uncertainty. The proposed methodology appears particularly robust for analysing responses that require a high degree of subjective judgment, such as those on perceived satisfaction or expectations.
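For readers unfamiliar with CUB models, the standard specification in the CUB literature mixes a shifted binomial "feeling" component with a discrete uniform "uncertainty" component (notation generic, not taken from the report):

```latex
% CUB model for an ordinal response R on m categories:
% pi     = weight of the deliberate ("feeling") component,
% 1 - pi = share attributable to response uncertainty,
% xi     = parameter governing the location of the shifted binomial.
\Pr(R = r) \;=\; \pi \binom{m-1}{r-1} (1-\xi)^{\,r-1}\, \xi^{\,m-r}
              \;+\; (1-\pi)\,\frac{1}{m},
\qquad r = 1, \dots, m .
```

The uncertainty share 1 - π is, broadly, the quantity on which the paper finds collection mode and question wording to have an effect.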
... ML can analyse historical survey data and identify the most predictive questions for measuring IPD in this context. This might improve the quality of data gathered and the reliability of survey-based assessments (Couper, 2013). Also, classification algorithms can predict which party members are less likely to participate in surveys based on past engagement data. ...
Preprint
Full-text available
The article argues that AI can enhance the measurement and implementation of democratic processes within political parties, known as Intra-Party Democracy (IPD). It identifies the limitations of traditional methods for measuring IPD, which often rely on formal parameters, self-reported data, and tools like surveys. Such limitations lead to the collection of partial data, rare updates, and significant demands on resources. To address these issues, the article suggests that specific data management and Machine Learning (ML) techniques, such as natural language processing and sentiment analysis, can improve the measurement (ML about) and practice (ML for) of IPD. The article concludes by considering some of the principal risks of ML for IPD, including concerns over data privacy, the potential for manipulation, and the dangers of overreliance on technology.
... In Japan, traditional survey methods are at a crossroads because of an overwhelming decline in response rates owing to the deterioration of the survey environment. Meanwhile, many empirical studies in Europe and the United States have proposed survey methodologies to improve response rates and encourage a departure from the existing framework (Couper 2013;Dillman 2017;Dillman et al. 2014;Groves 2011). The current survey situation in Japan undeniably lags the global standard, as few researchers in Japan have conducted experiments on survey methodologies (e.g., Kojima 2010). ...
Article
Full-text available
Traditional survey methods are at a crossroads in Japan owing to decreasing response rates and survey participation. Therefore, this study developed a novel survey method utilizing a Web-based voice recognition survey app as a new “interviewer” that replaces humans to conduct large-scale surveys for probability samples following the same procedure as conventional interview surveys. This study entails the development of the survey process and evaluating the practical feasibility of the new survey methodology. Overall, the response rate of the new survey method was not as high as that of the usual interview surveys. Furthermore, this study shows that the response rate was low compared with the login rate. Consequently, more attractive survey participation features should be developed to retain participants’ responses from the time they log in until they complete the survey.
... For example, Presser and McCulloch (2011) reported that the number of surveys conducted between 1984 and 2004 increased at a rate many times greater than the population in the US. However, in the context of decreasing survey response and increasing costs for data collection, the future of surveys has been questioned (Couper 2013; Alwin 2013; Couper 2017; Rao and Fuller 2017), and as a result, attempts and initiatives have emerged in recent years to find alternatives to surveys (Link et al. 2014; Galesic et al. 2021). From these developments, we derive the importance of this study. ...
Preprint
Full-text available
When conducting a survey, many choices regarding survey design features have to be made. These choices affect the response rate of a survey. This paper analyzes the individual effects of these survey design features on the response rate. For this purpose, data from a systematic review of crime surveys conducted in Germany between 2001 and 2021 were used. First, a meta-analysis of proportions is used to estimate the summary response rate. Second, a meta-regression is fitted, modeling the relationship between the observed response rates and survey design features, such as the study year, target population, coverage area, data collection mode, and institute. The developed model informs about the influence of certain survey design features and can predict the expected response rate when (re-)designing a survey. This study highlights that a thoughtful survey design and professional survey administration can result in high response rates.
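A stripped-down version of such a meta-regression of response rates on design features might look as follows (logit scale with inverse-variance weights; the data and variable names are invented, and the paper's actual model may differ, for instance by including random effects):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical extract from a systematic review: one row per survey.
df = pd.DataFrame({
    "year":      [2003, 2007, 2012, 2016, 2019, 2021],
    "mode_web":  [0, 0, 0, 1, 1, 1],          # 1 = self-administered web mode
    "resp_rate": [0.62, 0.55, 0.48, 0.35, 0.31, 0.28],
    "n_sampled": [3000, 2500, 4000, 3500, 5000, 4500],
})

# Meta-analysis of proportions on the logit scale with inverse-variance weights
# (a common, simplified fixed-effects choice).
p = df["resp_rate"]
y = np.log(p / (1 - p))
var = 1 / (df["n_sampled"] * p * (1 - p))   # variance of a logit-transformed proportion

X = sm.add_constant(df[["year", "mode_web"]])
meta_reg = sm.WLS(y, X, weights=1 / var).fit()
print(meta_reg.summary())

# Predicted response rate for a hypothetical 2023 web survey.
new = sm.add_constant(pd.DataFrame({"year": [2023], "mode_web": [1]}), has_constant="add")
pred_logit = meta_reg.predict(new)
print(1 / (1 + np.exp(-pred_logit)))
```

The same structure extends directly to the richer set of moderators listed in the abstract (target population, coverage area, institute, and so on).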
... Mobile app data are also emerging as sources of insight into user behaviour and preferences, captured through devices such as smartphones and smartwatches that people constantly carry with them (Couper, 2013). These data can enable a high degree of personalisation and contextualisation of learning, despite concerns about privacy and the dependence on active use of the devices. ...
Book
Full-text available
The volume explores the contemporary intersection between data and education, a territory in which big data not only influence but actively transform the ways of teaching and learning. The work opens with an analysis of the role of big data in the information society and of the influence they exert on our individual and social lives, raising cognitive, ethical, and social questions: how is our way of conceiving data changing? What are the ethical implications of collecting and using data in educational settings? What impact can the use of big data in education have on equitable access to educational opportunities? These questions call for new forms of awareness that must be promoted among the citizens of the new millennium to foster the development of data literacy. The book offers an overview of the meanings of the concept of data literacy, dwelling on its theoretical foundations and providing an overall picture of what it means to be data literate in the twenty-first century...
... In this paper, we will look at some options for dealing with the nonresponse problem and other problems with "official" social statistics. We highly recommend that anyone reading the current paper also read Brick (2011), Couper (2013), and Citro (2014), who do an excellent job of assessing the current situation and looking forward. The paper by Groves (2011) is also relevant, particularly its last section. ...
... In this context, ML can analyse historical survey data and identify the most predictive questions for measuring IPD. This might improve the quality of data gathered and the reliability of survey-based assessments (Couper 2013). Also, classification algorithms can predict which party members are less likely to participate in surveys based on past engagement data. ...
Article
Full-text available
The article argues that AI can enhance the measurement and implementation of democratic processes within political parties, known as Intra-Party Democracy (IPD). It identifies the limitations of traditional methods for measuring IPD, which often rely on formal parameters, self-reported data, and tools like surveys. Such limitations lead to the collection of partial data, rare updates, and significant demands on resources. To address these issues, the article suggests that specific data management and Machine Learning (ML) techniques, such as natural language processing and sentiment analysis, can improve the measurement (ML about) and practice (ML for) of IPD. The article concludes by considering some of the principal risks of ML for IPD, including concerns over data privacy, the potential for manipulation, and the dangers of overreliance on technology.
... Concerns over falling response rates have raised question marks over the continued value of bespoke data collection through surveys as opposed to analysis of pre-existing sources of administrative data (Couper, 2013). However, the scope surveys offer researchers to design and tailor their own data collection instruments for a specific purpose, and to control the quality of data collected, mean they are likely to continue to play a central role. ...
... Thus, the field of survey research is in a situation where PS surveys are known to have higher data quality but large samples are expensive while NPS surveys are convenient and more affordable but can suffer from large selection biases. Consequently, a natural avenue of research is the integration of both PS and NPS surveys to exploit their respective advantages in a way that overcomes their respective disadvantages and minimizes overall survey costs (Couper 2013;Miller 2017;Beaumont 2020;Rao 2021). ...
Article
Full-text available
Probability sample (PS) surveys are considered the gold standard for population-based inference but face many challenges due to decreasing response rates, relatively small sample sizes, and increasing costs. In contrast, the use of nonprobability sample (NPS) surveys has increased significantly due to their convenience, large sample sizes, and relatively low costs, but they are susceptible to large selection biases and unknown selection mechanisms. Integrating both sample types in a way that exploits their strengths and overcomes their weaknesses is an ongoing area of methodological research. We build on previous work by proposing a method of supplementing PSs with NPSs to improve analytic inference for logistic regression coefficients and potentially reduce survey costs. Specifically, we use a Bayesian framework for inference. Inference relies on a probability survey with a small sample size, and through the prior structure we incorporate supplementary auxiliary information from a less-expensive (but potentially biased) NPS survey fielded in parallel. The performance of several strongly informative priors constructed from the NPS information is evaluated through a simulation study and real-data application. Overall, the proposed priors reduce the mean-squared error (MSE) of regression coefficients or, in the worst case, perform similarly to a weakly informative (baseline) prior that does not utilize any nonprobability information. Potential cost savings (of up to 68 percent) are evident compared to a probability-only sampling design with the same MSE for different informative priors under different sample sizes and cost scenarios. The algorithm, detailed results, and interactive cost analysis are provided through a Shiny web app as guidance for survey practitioners.
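Schematically, borrowing strength from the nonprobability sample through the prior can be illustrated with a Gaussian prior centred on NPS-estimated coefficients and a MAP fit on the small probability sample. This is only a conceptual sketch with simulated data; the paper itself uses full Bayesian inference and several distinct prior constructions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(n, bias=0.0):
    # Toy logistic data; `bias` mimics selection bias in the non-probability sample.
    X = rng.normal(size=(n, 2))
    eta = -0.5 + 1.0 * X[:, 0] - 0.8 * X[:, 1] + bias
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    return np.column_stack([np.ones(n), X]), y

X_np, y_np = simulate(5000, bias=0.3)   # large, cheap, possibly biased NPS
X_ps, y_ps = simulate(300)              # small probability sample

# Prior centre: coefficients estimated from the non-probability sample
# (C is large, so the fit is essentially unpenalised).
npfit = LogisticRegression(C=1e6).fit(X_np[:, 1:], y_np)
b0 = np.concatenate([npfit.intercept_, npfit.coef_.ravel()])
tau2 = 0.5  # prior variance: how strongly the NPS information is trusted

def neg_log_posterior(beta):
    eta = X_ps @ beta
    loglik = np.sum(y_ps * eta - np.logaddexp(0, eta))
    logprior = -np.sum((beta - b0) ** 2) / (2 * tau2)
    return -(loglik + logprior)

map_fit = minimize(neg_log_posterior, x0=b0, method="BFGS")
print("MAP coefficients:", np.round(map_fit.x, 3))
```

Shrinking tau2 pulls the estimates towards the NPS solution (maximum cost savings, maximum exposure to its bias); inflating it recovers the probability-only analysis.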
... Collecting survey data via the web is also an attractive method for lowering data collection costs and curtailing data collection timeframes, both of which support greater scientific innovation [1][2][3]. These advantages extend when respondents are allowed to use multiple devices, including personal computers, laptops, tablets, and smartphones, further providing respondents with more options for convenience with little difference in measurement error between the devices [4][5][6][7]. This type of data collection approach is particularly attractive given the COVID-19 pandemic, which has strained face-to-face data collection operations in the U.S. and increased the need for the collection of population health data in a timely manner. ...
Article
Full-text available
In the United States, increasing access to the internet, the increasing costs of large-scale face-to-face data collections, and the general reluctance of the public to participate in intrusive in-person data collections all mean that new approaches to nationally representative surveys are urgently needed. The COVID-19 pandemic accelerated the need for faster, higher-quality alternatives to face-to-face data collection. These trends place a high priority on the evaluation of innovative web-based data collection methods that are convenient for the U.S. public and yield scientific information of high quality. The web mode is particularly appealing because it is relatively inexpensive, it is logistically flexible to implement, and it affords a high level of privacy and confidentiality when correctly implemented. With this study, we aimed to conduct a methodological evaluation of a sequential mixed-mode web/mail data collection protocol, including modular survey design concepts, which was implemented on a national probability sample in the U.S. in 2020–2021. We implemented randomized experiments to test theoretically-informed hypotheses that 1) the use of mail and increased incentives to follow up with households that did not respond to an invitation to complete a household screening questionnaire online would help to recruit different types of households; and 2) the use of modular survey design, which involves splitting a lengthy self-administered survey up into multiple parts that can be completed at a respondent’s convenience, would improve survey completion rates. We find support for the use of mail and increased incentives to follow up with households that have not responded to a web-based screening questionnaire. We did not find support for the use of modular design in this context. Simple descriptive analyses also suggest that attempted telephone reminders may be helpful for the main survey.
... While probability samples serve as the gold standard in social science and related fields to estimate population quantities and monitor policy effects, their acquisition and analysis become challenging due to their high cost and low response rates (Couper, 2013;Miller, 2017;Williams and Brick, 2018;Kalton, 2019). During the past 20 years, non-probability samples, on the other hand, have been increasingly prevailing since they are "wider, deeper, faster, better, cheaper, less burdensome, and more relevant" (Holt, 2007;Citro, 2014). ...
Preprint
Full-text available
Valid statistical inference is challenging when the sample is subject to unknown selection bias. Data integration can be used to correct for selection bias when we have a parallel probability sample from the same population with some common measurements. How to model and estimate the selection probability or the propensity score (PS) of a non-probability sample using an independent probability sample is the challenging part of the data integration. We approach this difficult problem by employing multiple candidate models for PS combined with empirical likelihood. By incorporating multiple propensity score models into the internal bias calibration constraint in the empirical likelihood setup, the selection bias can be eliminated so long as the multiple candidate models contain a true PS model. The bias calibration constraint under the multiple PS models is called multiple bias calibration. Multiple PS models can include both missing-at-random and missing-not-at-random models. Asymptotic properties are discussed, and some limited simulation studies are presented to compare the proposed method with some existing competitors. Plasmode simulation studies using the Culture & Community in a Time of Crisis dataset demonstrate the practical usage and advantages of the proposed method.
... The field of survey research has experienced a profound transformation since the end of the 1990s due to the opportunity to use new data sources to make population inferences or to be integrated with traditional surveys [29]. Data integration is not new to survey researchers, who have already combined surveys based on probability-based samples (PS) with auxiliary data from censuses or administrative registers to enhance inference. ...
Article
Full-text available
In recent years, survey data integration and inference based on non-probability samples have gained considerable attention. Because large probability-based samples can be cost-prohibitive in many instances, combining a probabilistic survey with auxiliary data is appealing to enhance inferences while reducing the survey costs. Also, as new data sources emerge, such as big data, inference and statistical data integration will face new challenges. This study aims to describe and understand the evolution of this research field over the years with an original approach based on text mining and bibliometric analysis. In order to retrieve the publications of interest (books, journal articles, proceedings, etc.), the Scopus database is considered. A collection of 1023 documents is analyzed. Through the use of such methodologies, it is possible to characterize the literature and identify contemporary research trends as well as potential directions for future investigation. We propose a research agenda along with a discussion of the research gaps which need to be addressed.
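The descriptive core of such a bibliometric exercise can be sketched in a few lines of pandas. The records below are invented, and the column names ("Year", "Author Keywords") merely follow a typical Scopus export:

```python
import pandas as pd
from collections import Counter

# Toy stand-in for a Scopus export of the retrieved documents.
records = pd.DataFrame({
    "Year": [2014, 2016, 2019, 2019, 2021, 2022],
    "Author Keywords": [
        "non-probability samples; selection bias",
        "big data; official statistics",
        "data integration; propensity score; surveys",
        "non-probability samples; calibration",
        "data integration; machine learning",
        "big data; data integration; inference",
    ],
})

# Publications per year: the simplest view of how the field has grown.
print(records.groupby("Year").size())

# Most frequent author keywords as a rough map of research themes.
keywords = Counter(
    kw.strip().lower()
    for cell in records["Author Keywords"].dropna()
    for kw in cell.split(";")
)
print(keywords.most_common(5))
```

A full analysis would add keyword co-occurrence networks, citation coupling, and the text mining of titles and abstracts described in the article.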
... • operationalisation of production and methodology (Hassani, Saporta and Silva, 2014; Tennekes, De Jonge and Daas, 2011; Zikopoulos et al., 2012; Fry, 2008; Liu, Jiang and Heer, 2013; Couper, 2013; Daas et al., 2013; Unece, 2015; Reimsbach-Kounatze, 2015; Braaksma and Zeelenberg, 2015; Citro, 2014; Tufekci, 2013; Rudin et al., 2014; Buelens et al., 2012; Flekova and Gurevych, 2013; Hastie, Tibshirani and Friedman, 2009; Daas, 2012; Daas and Puts, 2014; O'Connor et al., 2010); • organisational capacity: financial, technological, and human (Mills et al., 2012; NAS, 2013; Schutt and O'Neil, 2013; Parise, Iyer and Vesset, 2012; Scannapieco, Virgillito and Zardetto, 2013; Struijs, Braaksma and Daas, 2014; Tam and Clarke, 2014; Cervera et al., 2014; Dunne, 2013; Davenport and Patil, 2012; LaValle, 2011; UNGP, 2012; Eurostat 2014b); • regulation, legislation, ethics, and privacy (Wirthmann and Reis, 2018; Tam and Kim, 2018; Nasem, 2017a; Vaccari, 2014; Hackl, 2016; Vale, 2015; Struijs and Daas, 2013; Daas et al., 2015; De Jonge, van Pelt and Roos, 2012); and • external relationship networks (ESSnet, 2013; Tam and Clarke, 2015a; Krätke and Byiers, 2014); and ...
Article
The traditional production of socioeconomic indicators, predominantly tied to producers of official and public statistics, has had its quality assessment consolidated over recent decades through numerous publications on frameworks and manuals of guidance and good practice. However, the Data Revolution phenomenon has been posing a series of challenges to producers by introducing big data as a possible data source to feed production processes. Among these challenges is how to guarantee the quality of statistical products. In this context, this article sets out to summarise the technical and scientific literature on the quality of socioeconomic indicators produced from big data. To this end, an integrative systematic review is conducted. As results, the article presents a perspective on which main agents are involved with the topic, which quality dimensions and frameworks have been considered and proposed, and which research gaps remain, thereby directing future efforts and work.
... Furthermore, in most articles that did feature big data, survey data was used in addition to or to enhance what the big data could tell us (Sturgis and Luff, 2021). Scholars (Groves, 2011; Couper, 2013; Japec et al., 2015; Salganik, 2018) champion combining big and small data sources, as they can help mitigate their respective limitations and provide more insightful analysis than either data source could individually. For example, big data can be used to identify the extent of a social issue (e.g., using voting records to measure voting turnout) and combined with survey data to analyse factors that explore the problem in more depth (e.g., creating a survey exploring how likelihood to vote varies by demographic and socioeconomic markers) (Ansolabehere and Hersh, 2012). ...
Article
Full-text available
Recent computing power and storage advancements have meant more data are being collected and stored. Referred to as 'big data', these data sources offer researchers myriad opportunities to make observations about the social world. These data can be massive, provide insight into whole populations rather than just a sample, and be used to analyse social behaviour in real time. Administrative data, a subcategory under the big data umbrella, also offer researchers abundant opportunities to conduct highly relevant research in many areas, including sociology, social policy, education, health studies and many more. This paper offers reflections on social research during the digital age by examining different forms of data, both 'big' and 'small', and their associated advantages and disadvantages. The paper concludes by suggesting that although big data has some promising elements, it also comes with some limitations and will not replace 'traditional' social surveys. And yet, when used in conjunction with social surveys, appropriately and ethically, big data could offer researchers additional valuable insights.
... In the context of decreasing response rates, increasing costs for data collection, demand for more frequent and real-time statistics, demand for improving the userfriendliness of data and statistical products, the quality and adequacy of random sample surveys have frequently been discussed in the scientific literature in recent years. Even more so, the future of survey-based research has been questioned (Alwin 2013;Couper 2013;Stern et al. 2014;Czajka & Beyler 2016;Rao & Fuller 2017;Miller 2017;Couper 2017;Schnell 2019b). ...
... For example, the elderly and young children are less likely to be represented, whereas there is a bias towards males and higher-income groups [17]. Measurement bias occurs because a CDR is generated only when a phone is used; thus, insights generated from the data can be affected by the frequency of records [18]. For example, it is difficult to obtain the detailed movements of users who do not use their phones frequently [19]. ...
Article
Full-text available
Frequent and granular population data are essential for decision making. Furthermore, for progress monitoring towards achieving the sustainable development goals (SDGs), data availability at global scales as well as at different disaggregated levels is required. The high population coverage of mobile cellular signals has been accelerating the generation of large-scale spatiotemporal data such as call detail record (CDR) data. This has enabled resource-scarce countries to collect digital footprints at scales and resolutions that would otherwise be impossible to achieve solely through traditional surveys. However, using such data requires multiple processes, algorithms, and considerable effort. This paper proposes a big data-analysis pipeline built exclusively on an open-source framework with our spatial enhancement library and a proposed open-source mobility analysis package called Mobipack. Mobipack consists of useful modules for mobility analysis, including data anonymization, origin–destination extraction, trip extraction, zone analysis, route interpolation, and a set of mobility indicators. Several implemented use cases are presented to demonstrate the advantages and usefulness of the proposed system. In addition, we explain how a large-scale data platform that requires efficient resource allocation can be constructed for managing data as well as how it can be used and maintained in a sustainable manner. The platform can further help to enhance the capacity of CDR data analysis, which usually requires a specific skill set and is time-consuming to implement from scratch. The proposed system is suited for baseline processing and the effective handling of CDR data; thus, it allows for improved support and on-time preparation.
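As an illustration of one typical step in a CDR pipeline of this kind (not Mobipack's actual API; the column names are hypothetical), the snippet below derives a crude daily origin-destination matrix from time-stamped, anonymised cell records:

```python
import pandas as pd

# Hypothetical anonymised CDR extract: one row per call/SMS event.
cdr = pd.DataFrame({
    "user_id":   ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2023-05-01 07:10", "2023-05-01 12:40", "2023-05-01 21:05",
        "2023-05-01 08:30", "2023-05-01 18:55",
    ]),
    "cell_zone": ["Z12", "Z40", "Z12", "Z07", "Z33"],
})

# Daily origin-destination extraction: first and last observed zone per user
# per day, a crude proxy for the day's origin and main destination.
cdr = cdr.sort_values(["user_id", "timestamp"])
cdr["day"] = cdr["timestamp"].dt.date
od = (cdr.groupby(["user_id", "day"])["cell_zone"]
         .agg(origin="first", destination="last")
         .reset_index())

# Aggregate individual trips into an origin-destination matrix.
od_matrix = od.groupby(["origin", "destination"]).size().rename("trips").reset_index()
print(od_matrix)
```

A production pipeline would add the anonymisation, deduplication, and zone-mapping stages listed in the abstract and would run on a distributed backend rather than in-memory pandas.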
... Furthermore, surveys are appropriate for capturing subjects' attitudes by directly asking respondents about their subjective experiences, perceptions, and attitudes about a topic, but they have limited ability to observe actual behavior. Social media provides a wealth of information about actual user behavior, and is not limited to or focused on specific questions (Couper, 2013). Research has also found a correlation between public opinion and sentiments expressed on Twitter (O'Connor, Balasubramanyan, Routledge, & Smith, 2010). ...
Article
Full-text available
Early in the COVID-19 pandemic, the Diamond Princess became the center of the largest outbreak outside the original epicenter in China. This outbreak, which left 712 passengers infected and 14 dead and was followed by subsequent outbreaks affecting over one-third of the active ships in the cruise industry's global fleet, quickly became a crisis that captured public attention and dominated mainstream news and social media. This study investigates the perception of cruising during these outbreaks by analyzing tweets on cruising using Natural Language Processing (NLP). The findings show a prevalent negative sentiment in most of the analyzed tweets, while the criticisms directed at the cruise industry were based on perceptions and stereotypes of the industry before the pandemic. The study provides insight into the concerns raised in these conversations and highlights the need for new business models outside the pre-pandemic mass-market model and for genuinely making cruising more environmentally friendly.
... Students may become more passive due to this economic focus, expecting their HEI to offer them a degree since they paid for it (Rolfe, 2002). Understanding and measuring student satisfaction is, therefore, a very complex exercise. Due to the dynamic nature of the demographic characteristics of students and technological developments, it is becoming increasingly difficult to understand and evaluate student satisfaction (Couper, 2013). Highly motivated students are more challenging to satisfy due to their higher expectations. ...
Article
Full-text available
The primary purpose of this study is to provide a conceptual framework for 'student satisfaction' in order to understand and conceptualise the key aspects of the term. Satisfaction is a vague construct, and there is currently a lack of consensus on how best to conceptualise the term, which leads to a lack of consensual definitions of customer satisfaction and, ultimately, student satisfaction. Student satisfaction is relatively more complex than customer satisfaction in that dynamic and subjective expectations, which could significantly contribute to student satisfaction, are difficult to capture objectively. Studies have shown that higher education institutions are becoming more aware of the importance of student satisfaction, and a market-oriented and customer-oriented approach is required to manage service quality, the principal determinant of student satisfaction. Yet, with a lack of consensus on its determinants, it is difficult for researchers to develop a valid measure of student satisfaction. Therefore, the study's primary objective is to review how the concept of student satisfaction is applied in the existing academic literature in an attempt to contribute to a more robust and informed body of work. Accordingly, the authors have conducted a systematic literature review to critically examine the most relevant, authentic and recent studies that address the primary research questions; namely, what is student satisfaction, and what are the contemporary ways of comprehending and defining the term? Our systematic review approach answers the research questions by collecting and summarising all the available empirical evidence that fits the authors' pre-specified eligibility criteria (inclusion and exclusion) of student satisfaction and how it has been defined within the last five decades. There is currently no consensus on the conceptualisation of student satisfaction, primarily because of the difficulty of incorporating dynamic and complex student satisfaction determinants in higher educational settings into a single definition; nevertheless, we have attempted to conceptualise the construct. Key Words: "Satisfaction", "Customer Satisfaction", "Student Satisfaction", "Service Industry", "Higher Education"
... To this end, web job portals may be useful auxiliary sources for data on labour demand (see Papoutsoglou et al., 2019, for a recent review; see also Cedefop, 2017); however, using web (big) data in the form of online job advertisements (OJA) is a complex undertaking. The use of such data, in fact, requires the resolution of problems linked to definition and data collection (duplication, selection of stable and credible sources, removing non-work ads, selection of active ads etc.), the transformation from OJA flows to stocks (Garasto et al., 2021;Turrell et al., 2019) and issues such as non-representativeness and lack of coverage at the population level (Couper, 2013;Fan et al., 2014;Japec et al., 2015;Kureková et al., 2015;Tam & Clarke, 2015). ...
Article
Full-text available
Government policy has placed increasing emphasis on the need for robust labour market projections. The job vacancy rate is a key indicator of the state of the economy underpinning most monetary policy decisions. However, its variation over time is rarely studied in relation to employment variations, especially at the sectoral level. The present paper assesses whether changes in the number of vacancies from quarter to quarter are a leading indicator of employment variation in certain economic sectors over the previous decade in Italy, using multivariate time-series tools (the vector autoregressive and error correction models) with Eurostat data. As robustness checks for integration order and cointegration, we compare traditional critical values with those provided by response surface models. To the best of our knowledge, no previous study has evaluated this relationship using Italian data over the last decade. The results demonstrate that percentage changes in numbers employed (occupied persons) react to percentage changes in vacancies (one-quarter lagged), but not vice versa, indicating that variations of vacancies are weakly exogenous. The fastest short-term adjustment from disequilibrium is seen in the construction industry, whereas the manufacturing and the information and communication technology sectors demonstrate the strongest long-run relationships among variations. This suggests that the matching rates – the likelihood that a vacancy is filled – are higher for these than for other sectors, as a result of developments in recruitment technology for professional figures of such industries.
... For computational social science to be successful in using digital trace data, we foresee that in most (if not all) cases, data from different sources need to be combined, either to overcome the problem of unknown populations of inference or to overcome the problem of missing covariates and overall unclear measurement properties (Christen, Ranbaduge, & Schnell, 2020;Couper, 2013;Schnell, 2019). ...
... Still, social surveys are a cornerstone of social science research and are routinely used by the government and private sector to inform decision making. Yet informed commentators have occasionally questioned the viability of survey research moving forward and called for changes in the way that we collect survey data (Couper, 2013; Miller, 2017). Further, events in the U.S., such as the surprise election of Donald Trump to the presidency in 2016, have contributed to a growing public scepticism of surveys (Cohn, 2017). ...
Article
Full-text available
Response rates for surveys have declined steadily over the last few decades. During this period, trust in science has also waned and conspiratorial theorizing around scientific issues has seemingly become more prevalent. In our prior work, we found that a significant portion of respondents will claim that a given survey is “biased.” In this follow-up research, we qualify these perceptions of bias and point to potential causes and ameliorative mechanisms.
... Sensor data is becoming increasingly important in official statistics because such data has some advantages over sample survey data: no declining response rates, no response burden, and continuous measurements in real time [1][2][3][4]. Sensor data is currently rarely used to produce direct estimates due to the often unknown data generating process. However, sensor data can be used to assess underreporting bias in survey point estimates by linking them to survey data and applying capture-recapture (CRC) techniques. ...
Article
Capture-recapture (CRC) is currently considered a promising method to integrate big data in official statistics. We previously applied CRC to estimate road freight transport with survey data (as the first capture) and road sensor data (as the second capture), using license plate and time-stamp to identify re-captured vehicles. A considerable difference was found between the single-source, design-based survey estimate, and the multiple-source, model-based CRC estimate. One possible explanation is underreporting in the survey, which is conceivable given the response burden of diary questionnaires. In this paper, we explore alternative explanations by quantifying their effect on the estimated amount of underreporting. In particular, we study the effects of 1) reporting errors, including a mismatch between the reported day of loading and the measured day of driving, 2) measurement errors, including false positives and OCR failure, 3) considering vehicles reported not owned as nonresponse error instead of frame error, and 4) response mode. We conclude that alternative hypotheses are unlikely to fully explain the difference between the survey estimate and the CRC estimate. Underreporting, therefore, remains a likely explanation, illustrating the power of combining survey and sensor data.
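For intuition about what such a CRC estimate does, the sketch below applies the classic two-source Lincoln-Petersen estimator (with Chapman's small-sample variant), treating the survey as capture 1 and sensor detections as capture 2. The counts are invented and are not taken from the study.

```python
# Two-source capture-recapture sketch with illustrative (invented) counts.

def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """n1: units in source 1, n2: units in source 2, m: units matched in both."""
    if m == 0:
        raise ValueError("No recaptures: the estimator is undefined.")
    return n1 * n2 / m

def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-adjusted variant, often preferred for small samples."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

if __name__ == "__main__":
    # Hypothetical counts: 4,000 vehicle-days reported in the survey,
    # 6,500 detected by road sensors, 2,600 matched on plate and time-stamp.
    print(lincoln_petersen(4000, 6500, 2600))  # ~10,000 estimated vehicle-days
    print(chapman(4000, 6500, 2600))
```

A gap between the design-based survey estimate and such a CRC estimate is what the authors interpret as possible underreporting.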
... Because the inferential leap from the use of new data to the social and behavioral phenomena they are being used to study can be quite large, it will be essential for researchers to explicitly address the size of the inference(s) when designing and carrying out this kind of research. The larger the leap, the greater the risk of identifying correlations that are spurious (e.g., Couper 2013; Conrad et al. 2019) and the higher the burden on researchers to establish the validity and generalizability of the results. There is a concomitant need for heightened vigilance on the part of reviewers and editors (and a more critical eye on the part of readers) to insist that authors adequately demonstrate the credibility of findings based on new data. ...
... The main problem here is that the statistical analysis of large, complex, heterogeneous data can lead to the selection of unsuitable variables and algorithms. Therefore, when big data are used, the risks of errors in coverage, selection, measurement, and the responses obtained must be taken into account [19,20]. ...
Article
Full-text available
Detecting and analysing social relations in society is an important task for the effective governance of e-government and for ensuring socio-economic development and stability in the country. The analysis of social relations makes it possible to see more clearly the processes taking place in society and the existing social problems. The core of the research is an examination of existing methods and approaches for recognising people and objects and identifying events and social relations through the intelligent analysis of video footage. Detecting social relations and predicting possible anomalous events by observing citizens' behaviour in public places is a highly complex process. The article reviews some existing approaches to the analysis of objects and events using footage obtained from video surveillance systems, proposes a new approach for detecting social relations based on the intelligent analysis of video footage, and develops a general architectural scheme for intelligent video surveillance systems. A stage-by-stage solution is proposed for analysing video footage captured by surveillance cameras. The results obtained in the study can be used for more effective e-government administration, forecasting socio-economic processes, ensuring citizens' safety, and in many other areas.
Chapter
Health crises have particularly affected the tourism and hospitality sector, which by its very nature has been particularly vulnerable, but at the same time, resilient to, for example, the recent Covid-19 pandemic. Among the most affected sectors, the cruise industry stands out, which due to its characteristics—small spaces, high density of people and a predominantly adult clientele—represented one of the most critical scenarios for the spread of the virus. The chapter traces the main events that marked the pandemic in the cruise industry—e.g. interruptions and conditional resumption of navigation—analysing the market dynamics with a focus on the three main cruise operators. Finally, the chapter focuses on the impact of the pandemic on the cruise business and the crisis communication disseminated on social media, which amplified public concerns and perceived risk. The chapter concludes by outlining the empirical analysis that will follow in the subsequent chapters of the manuscript, highlighting the aim to explore crisis communication from a dual perspective, organisational and consumer.
Article
The aim of this article is to propose a new general methodology to measure job polarization and to empirically analyse the degree of polarization on the demand side of the European Union labour market at the regional level during the post-COVID-19 period. Unlike most studies that examine polarization from the supply side, typically measured by shifts in employment shares across the occupational skill spectrum, focusing on the simultaneous growth of high-skill and low-skill jobs, this research approaches polarization from the unmet demand side. Methodologically, we applied Self-Organizing Maps for clustering and Generalized Joint Regression models, which offered considerable flexibility in modeling labor market polarization, while also addressing issues of sample selection and covariate endogeneity. This methodological framework naturally provides a new model-based index of polarization. Empirically, we analyzed institutional EU data from Cedefop, which includes job advertisements posted on online portals across the EU, categorized by region and main (ESCO) occupational groups. Conducting this analysis at the regional level in the post-pandemic period addresses significant gaps in the literature. Empirical findings highlight several pressing issues in the European labor market post-pandemic, including strong labour demand polarization, high-skill occupation saturation and educational mismatches (overqualification), particularly affecting women. To our knowledge, this is the first study to focus on job polarization from the demand side, employing flexible models to jointly model polarization and the demand for specific occupations in terms of their main determinants at the regional level, while integrating regional data from the Labour Force Survey and online job advertisements from Cedefop data.
Article
The evaluation of innovative web-based data collection methods that are convenient for the general public and that yield high-quality scientific information for demographic researchers has become critical. Web-based methods are crucial for researchers with nationally representative research objectives but without the resources of larger organizations. The web mode is appealing because it is inexpensive relative to in-person and telephone modes, and it affords a high level of privacy. We evaluate a sequential mixed-mode web/mail data collection, conducted with a national probability sample of U.S. adults from 2020 to 2022. The survey topics focus on reproductive health and family formation. We compare estimates from this survey to those obtained from a face-to-face national survey of population reproductive health: the 2017–2019 National Survey of Family Growth (NSFG). This comparison allows for maximum design complexity, including a complex household screening operation (to identify households with persons aged 18–49). We evaluate the ability of this national web/mail data collection approach to (1) recruit a representative sample of U.S. persons aged 18–49; (2) replicate key survey estimates based on the NSFG, considering expected effects of the COVID-19 pandemic lockdowns and the alternative modes on the estimates; (3) reduce complex sample design effects relative to the NSFG; and (4) reduce the costs per completed survey.
Article
Full-text available
Three approaches to estimation from nonprobability samples are quasi-randomization, superpopulation modeling, and doubly robust estimation. In the first, the sample is treated as if it were obtained via a probability mechanism, but unlike in probability sampling, that mechanism is unknown. Pseudo selection probabilities of being in the sample are estimated by using the sample in combination with some external data set that covers the desired population. In the superpopulation approach, observed values of analysis variables are treated as if they had been generated by some model. The model is estimated from the sample and, along with external population control data, is used to project the sample to the population. The specific techniques are the same or similar to ones commonly employed for estimation from probability samples and include binary regression, regression trees, and calibration. When quasi-randomization and superpopulation modeling are combined, this is referred to as doubly robust estimation. This article reviews some of the estimation options and compares them in a series of simulation studies.
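To make the quasi-randomization idea concrete, here is a minimal, hypothetical sketch: stack a nonprobability sample on a weighted probability reference sample, model membership in the nonprobability sample, and turn the estimated propensities into odds-based pseudo-weights. The data, variable names, and the simple odds-based weighting rule are illustrative assumptions, not the article's own code or simulation design.

```python
# Quasi-randomization sketch: pseudo-weights from an estimated propensity of
# being in the nonprobability sample. All data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Reference probability sample (with design weights) and a nonprobability sample.
X_ref = rng.normal(size=(500, 2))
w_ref = np.full(500, 20.0)                     # each reference unit represents ~20 population units
X_np = rng.normal(loc=0.4, size=(800, 2))      # volunteers skew on the covariates

X = np.vstack([X_np, X_ref])
z = np.r_[np.ones(len(X_np)), np.zeros(len(X_ref))]   # 1 = in nonprobability sample
sw = np.r_[np.ones(len(X_np)), w_ref]                  # design weights for reference units

model = LogisticRegression().fit(X, z, sample_weight=sw)
p = model.predict_proba(X_np)[:, 1]            # estimated pseudo-inclusion propensities
pseudo_w = (1 - p) / p                          # odds-based pseudo-weights

y_np = X_np[:, 0] + rng.normal(scale=0.5, size=len(X_np))   # analysis variable
print("Unweighted mean:", round(y_np.mean(), 3))
print("Pseudo-weighted mean:", round(np.average(y_np, weights=pseudo_w), 3))
```

The superpopulation and doubly robust approaches the article reviews would replace or combine this weighting step with an outcome model projected to population controls.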
Chapter
This chapter illustrates how artificial intelligence (AI) can be applied to analyze big data for the benefit of socioeconomic stakeholders. The research question was how state-of-the-art AI software could be used to help decision makers, particularly after major global crises such as climate change, natural disasters, and pandemics. In this chapter, the issues and controversies of big data and AI software are discussed. After an extensive literature review, this chapter proposes, develops, and tests an AI model using big data and AI statistical software. The concept was proven with sample data and a simulation. This AI big data modeling concept is argued to be valuable to many socioeconomic stakeholders, including computer science researchers and academic scholars.
Article
In the article we describe an enhancement to the Demand for Labour (DL) survey conducted by Statistics Poland, which involves the inclusion of skills obtained from online job advertisements. The main goal is to provide estimates of the demand for skills (competences), which is missing in the DL survey. To achieve this, we apply a data integration approach combining traditional calibration with the LASSO-assisted approach to correct coverage and selection error in the online data. Faced with the lack of access to unit-level data from the DL survey, we use estimated population totals and propose a bootstrap approach that accounts for the uncertainty of totals reported by Statistics Poland. Our empirical results show that online data significantly overestimate interpersonal, managerial and self-organization skills while underestimating technical and physical skills. This is mainly due to the under-representation of occupations categorised as Craft and Related Trades Workers and Plant and Machine Operators and Assemblers.
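As a simplified illustration of the calibration side of such a data-integration approach, the sketch below rakes unit weights in a toy sample of online job ads to assumed population margins via iterative proportional fitting; the categories and totals are invented, and the LASSO-assisted step and the bootstrap for uncertainty are not shown.

```python
# Minimal raking (iterative proportional fitting) sketch: adjust unit weights so
# weighted margins match assumed population totals. Data are illustrative.
import numpy as np
import pandas as pd

def rake(df, weights, margins, iters=50, tol=1e-8):
    """margins: {column: {category: population_total}}"""
    w = weights.astype(float).copy()
    for _ in range(iters):
        max_shift = 0.0
        for col, targets in margins.items():
            current = pd.Series(w).groupby(df[col].values).sum()
            for cat, target in targets.items():
                factor = target / current[cat]
                w[df[col].values == cat] *= factor
                max_shift = max(max_shift, abs(factor - 1))
        if max_shift < tol:
            break
    return w

if __name__ == "__main__":
    ads = pd.DataFrame({
        "occupation": ["craft", "craft", "clerical", "clerical", "manager", "manager"],
        "region": ["north", "south", "north", "south", "north", "south"],
    })
    margins = {"occupation": {"craft": 40, "clerical": 35, "manager": 25},
               "region": {"north": 55, "south": 45}}
    print(rake(ads, np.ones(len(ads)), margins))
```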
Chapter
This chapter differs from most healthcare informatics studies because the focus is on conceptual COVID-19 SARS-CoV2 (coronavirus) prediction rather than detection. The research question was how state-of-the-art informatics software could be used to detect coronavirus based on the analysis of hospital patient medical records. Healthcare practitioners need artificial intelligence (AI) software to predict what has not yet happened to prepare in advance. Therefore, this chapter proposes and tests a generic AI approach to predict first-time coronavirus infection for discharged hospital patients based on data collected from their medical records. This idea could allow healthcare informatics practitioners to leverage AI software to predict which patients will be more likely to become infected by specific viruses or diseases. The concept rather than the actual model is the most valuable outcome of the study.
Conference Paper
Full-text available
Our paper proposes a method of combining probability and non-probability samples to improve analytic inference on logistic regression model parameters. A Bayesian framework is considered where only a small probability sample is available and the information from a parallel non-probability sample is provided naturally through the prior. A simulation study is run applying several informative priors. The performance of the models is compared with reference to their mean-squared error (MSE). In general, the informative priors reduce the MSE or, in the worst-case scenario, perform equivalently to non-informative priors.
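A minimal sketch of the underlying idea, under assumed data: fit a logistic regression to a small probability sample while shrinking the coefficients toward estimates taken from a nonprobability sample via a normal prior. MAP optimization with scipy stands in for a full Bayesian fit; the data, prior means, and prior standard deviations are all invented for illustration.

```python
# Logistic regression with an informative normal prior, estimated by MAP.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Small probability sample.
X = np.column_stack([np.ones(60), rng.normal(size=60)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# "Prior" information: coefficients estimated from a large nonprobability sample.
prior_mean = np.array([-0.4, 0.9])
prior_sd = np.array([0.5, 0.5])      # smaller sd -> more informative prior

def neg_log_posterior(beta):
    eta = X @ beta
    log_lik = np.sum(y * eta - np.log1p(np.exp(eta)))      # Bernoulli log-likelihood
    log_prior = -0.5 * np.sum(((beta - prior_mean) / prior_sd) ** 2)
    return -(log_lik + log_prior)

map_est = minimize(neg_log_posterior, x0=np.zeros(2)).x
print("MAP estimate with informative prior:", map_est)
```

Making the prior standard deviations larger recovers something close to the maximum-likelihood fit on the probability sample alone, which is the comparison the simulation study formalizes via MSE.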
Article
The motivation behind this study is to explore the status and development of scientific research on the impact of social networks on big data and the use of big data for modelling the behaviour of social network users. This paper presents a comprehensive review of the studies related to big data in social media. The study uses the Scopus database as the primary search engine and covers 2,000 highly cited articles over the period 2012-2019. The records are statistically analysed and categorised according to various criteria. The findings show that research has grown exponentially since 2014 and the trend has continued at relatively stable rates. Based on the review, decision support systems is the keyword with the highest density, followed by heuristic methods. Among the most cited articles, papers published by researchers in the United States have received the highest number of citations (7,548), followed by the United Kingdom (588) and China with 543 citations. Thematic analysis shows that the subject has remained a significant and well-developed research field, and better results could be obtained by combining this research with "big data analytics" and "Twitter", which are important topics in this field but not yet well developed.
Article
Full-text available
After highlighting the primary purposes and quality considerations of survey research, we briefly discuss previously recommended approaches for ensuring and improving data quality. Then, we introduce the articles in this special issue that feature novel solutions for improvement.
Article
Full-text available
Public health related tweets are difficult to identify in large conversational datasets like Twitter.com. Even more challenging is the visualization and analyses of the spatial patterns encoded in tweets. This study has the following objectives: How can topic modeling be used to identify relevant public health topics such as obesity on Twitter.com? What are the common obesity related themes? What is the spatial pattern of the themes? What are the research challenges of using large conversational datasets from social networking sites? Obesity is chosen as a test theme to demonstrate the effectiveness of topic modeling using Latent Dirichlet Allocation (LDA) and spatial analysis using Geographic Information System (GIS). The dataset is constructed from tweets (originating from the United States) extracted from Twitter.com on obesity-related queries. Examples of such queries are 'food deserts', 'fast food', and 'childhood obesity'. The tweets are also georeferenced and time stamped. Three cohesive and meaningful themes such as 'childhood obesity and schools', 'obesity prevention', and 'obesity and food habits' are extracted from the LDA model. The GIS analysis of the extracted themes show distinct spatial pattern between rural and urban areas, northern and southern states, and between coasts and inland states. Further, relating the themes with ancillary datasets such as US census and locations of fast food restaurants based upon the location of the tweets in a GIS environment opened new avenues for spatial analyses and mapping. Therefore the techniques used in this study provide a possible toolset for computational social scientists in general and health researchers in specific to better understand health problems from large conversational datasets.
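For readers unfamiliar with the technique, the sketch below fits an LDA topic model to a handful of invented obesity-related snippets with scikit-learn and prints the top words per topic. It illustrates only the topic-modeling step and omits the georeferencing and GIS analysis described in the study.

```python
# Minimal LDA topic-modeling sketch on toy obesity-related texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "childhood obesity rates rising in schools lunch programs",
    "fast food and soda linked to obesity in low income areas",
    "obesity prevention programs promote exercise and healthy eating",
    "food deserts limit access to fresh produce in rural counties",
    "schools add physical education to fight childhood obesity",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                       # document-term matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```

In the study, each tweet's dominant topic would then be mapped using its georeference to examine spatial patterns.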
Article
Full-text available
Fifty-nine methodological studies were designed to estimate the magnitude of nonresponse bias in statistics of interest. These studies use a variety of designs: sampling frames with rich variables, data from administrative records matched to sample cases, use of screening-interview data to describe nonrespondents to main interviews, follow-up of nonrespondents to initial phases of field effort, and measures of behavioral intentions to respond to a survey. This permits exploration of which circumstances produce a relationship between nonresponse rates and nonresponse bias and which do not. The predictors are design features of the surveys, characteristics of the sample, and attributes of the survey statistics computed in the surveys.
Article
Full-text available
This article originated over ten years ago as a set of joke slides showing silly spurious correlations. These statistically appealing relationships between the stock market and dairy products and third world livestock populations have been cited often, in Business Week, the Wall Street Journal, the book "A Mathematician Looks at the Stock Market," and elsewhere. Students from Bill Sharpe's classes at Stanford seem to be familiar with them. The slides were expanded to include some actual content about data mining, and reissued as an academic working paper in 2001. Occasional requests arrive from distant corners of the world, so I thank the editors of the Journal of Investing for publishing this article. Without taking a hatchet to the original, the advice offered remains valuable, perhaps even more so now that there is so much more data to mine. Monthly data arrives as a single data point, once a month. It's hard to avoid data mining sins if you look twice. Ticks, quotes, and executions arrive in millions ...
Article
Full-text available
The disastrous prediction of an Alf Landon victory in the 1936 presidential election by the Literary Digest poll is a landmark event in the history of American survey research in general and polling in particular. It marks both the demise of the straw poll, of which the Digest was the most conspicuous and well-regarded example, and the rise to prominence of the self-proclaimed "scientific" poll. Why did the Digest poll fail so miserably? One view has come to prevail over the years: because the Digest selected its sample primarily from telephone books and car registration lists and since these contained, at the time, mostly well-to-do folks who would vote Republican, it is no wonder the magazine mistakenly predicted a Republican win. This "conventional explanation" has found its way into countless publications (scholarly and in the press) and college courses. It has been used to illustrate the disastrous effects of a poorly designed poll. But is it correct? Empirical evidence, in the form of a 1937 Gallup poll, shows that this "conventional explanation" is wrong, because voters with telephones and cars backed Franklin D. Roosevelt and because it was those who failed to participate in the poll (overwhelmingly supporters of Roosevelt) who were mainly responsible for the faulty prediction.
Article
Full-text available
We developed a practical influenza forecast model based on real-time, geographically focused, and easy to access data, designed to provide individual medical centers with advanced warning of the expected number of influenza cases, thus allowing for sufficient time to implement interventions. Secondly, we evaluated the effects of incorporating a real-time influenza surveillance system, Google Flu Trends, and meteorological and temporal information on forecast accuracy. Forecast models designed to predict one week in advance were developed from weekly counts of confirmed influenza cases over seven seasons (2004-2011) divided into seven training and out-of-sample verification sets. Forecasting procedures using classical Box-Jenkins, generalized linear models (GLM), and generalized linear autoregressive moving average (GARMA) methods were employed to develop the final model and assess the relative contribution of external variables such as Google Flu Trends, meteorological data, and temporal information. A GARMA(3,0) forecast model with Negative Binomial distribution integrating Google Flu Trends information provided the most accurate influenza case predictions. The model, on the average, predicts weekly influenza cases during 7 out-of-sample outbreaks within 7 cases for 83% of estimates. Google Flu Trends data was the only source of external information to provide statistically significant forecast improvements over the base model in four of the seven out-of-sample verification sets. Overall, the p-value of adding this external information to the model is 0.0005. The other exogenous variables did not yield a statistically significant improvement in any of the verification sets. Integer-valued autoregression of influenza cases provides a strong base forecast model, which is enhanced by the addition of Google Flu Trends confirming the predictive capabilities of search query based syndromic surveillance. This accessible and flexible forecast model can be used by individual medical centers to provide advanced warning of future influenza cases.
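To illustrate the general modelling approach with something simpler than a GARMA model, the hedged sketch below fits a Negative Binomial GLM in statsmodels that predicts next week's case count from the lagged count and a lagged Google-Flu-Trends-style index. The data are synthetic and the specification is an assumption for illustration, not the paper's model.

```python
# Count regression sketch: next-week cases from lagged cases and a lagged
# search index. Synthetic data; a stand-in for the paper's GARMA model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
weeks = 150
gft = 50 + 30 * np.sin(np.arange(weeks) * 2 * np.pi / 52) + rng.normal(0, 5, weeks)
cases = rng.poisson(np.exp(0.5 + 0.02 * gft))

df = pd.DataFrame({"cases": cases, "gft": gft})
df["cases_lag1"] = df["cases"].shift(1)
df["gft_lag1"] = df["gft"].shift(1)
df = df.dropna()

X = sm.add_constant(df[["cases_lag1", "gft_lag1"]])
model = sm.GLM(df["cases"], X, family=sm.families.NegativeBinomial()).fit()
print(model.summary())

# One-week-ahead forecast using the most recent observed week as the lag.
new = pd.DataFrame({"const": [1.0],
                    "cases_lag1": [df["cases"].iloc[-1]],
                    "gft_lag1": [df["gft"].iloc[-1]]})
print("Next-week forecast:", float(np.asarray(model.predict(new))[0]))
```

Comparing out-of-sample errors with and without the `gft_lag1` term mirrors the kind of verification the study reports.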
Article
Full-text available
With over 800 million active users, Facebook is changing the way hundreds of millions of people relate to one another and share information. A rapidly growing body of research has accompanied the meteoric rise of Facebook as social scientists assess the impact of Facebook on social life. In addition, researchers have recognized the utility of Facebook as a novel tool to observe behavior in a naturalistic setting, test hypotheses, and recruit participants. However, research on Facebook emanates from a wide variety of disciplines, with results being published in a broad range of journals and conference proceedings, making it difficult to keep track of various findings. And because Facebook is a relatively recent phenomenon, uncertainty still exists about the most effective ways to do Facebook research. To address these issues, the authors conducted a comprehensive literature search, identifying 412 relevant articles, which were sorted into 5 categories: descriptive analysis of users, motivations for using Facebook, identity presentation, the role of Facebook in social interactions, and privacy and information disclosure. The literature review serves as the foundation from which to assess current findings and offer recommendations to the field for future research on Facebook and online social networks more broadly. © The Author(s) 2012.
Article
Full-text available
This article explores new methods for gathering and analyzing spatially rich demographic data using mobile phones. It describes a pilot study (the Human Mobility Project) in which volunteers around the world were successfully recruited to share GPS and cellular tower information on their trajectories and respond to dynamic, location-based surveys using an open-source Android application. The pilot study illustrates the great potential of mobile phone methodology for moving spatial measures beyond residential census units and investigating a range of important social phenomena, including the heterogeneity of activity spaces, the dynamic nature of spatial segregation, and the contextual dependence of subjective well-being.
Article
Full-text available
Vast data-streams from social networks like Twitter and Facebook contain people's opinions, fears and dreams. Thomas Lansdall-Welfare, Vasileios Lampos and Nello Cristianini exploit a whole new tool for social scientists.
Article
Full-text available
Social network sites (SNSs) are increasingly attracting the attention of academic and industry researchers intrigued by their affordances and reach. This special theme section of the Journal of Computer-Mediated Communication brings together scholarship on these emergent phenomena. In this introductory article, we describe features of SNSs and propose a comprehensive definition. We then present one perspective on the history of such sites, discussing key changes and developments. After briefly summarizing existing scholarship concerning SNSs, we discuss the articles in this special section and conclude with considerations for future research.
Article
Full-text available
Google Flu Trends (GFT) uses anonymized, aggregated internet search activity to provide near-real time estimates of influenza activity. GFT estimates have shown a strong correlation with official influenza surveillance data. The 2009 influenza virus A (H1N1) pandemic [pH1N1] provided the first opportunity to evaluate GFT during a non-seasonal influenza outbreak. In September 2009, an updated United States GFT model was developed using data from the beginning of pH1N1. We evaluated the accuracy of each U.S. GFT model by comparing weekly estimates of ILI (influenza-like illness) activity with the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). For each GFT model we calculated the correlation and RMSE (root mean square error) between model estimates and ILINet for four time periods: pre-H1N1, Summer H1N1, Winter H1N1, and H1N1 overall (Mar 2009-Dec 2009). We also compared the number of queries, query volume, and types of queries (e.g., influenza symptoms, influenza complications) in each model. Both models' estimates were highly correlated with ILINet pre-H1N1 and over the entire surveillance period, although the original model underestimated the magnitude of ILI activity during pH1N1. The updated model was more correlated with ILINet than the original model during Summer H1N1 (r = 0.95 and 0.29, respectively). The updated model included more search query terms than the original model, with more queries directly related to influenza infection, whereas the original model contained more queries related to influenza complications. Internet search behavior changed during pH1N1, particularly in the categories "influenza complications" and "term for influenza." The complications associated with pH1N1, the fact that pH1N1 began in the summer rather than winter, and changes in health-seeking behavior each may have played a part. Both GFT models performed well prior to and during pH1N1, although the updated model performed better during pH1N1, especially during the summer months.
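The two headline metrics in this evaluation are straightforward to reproduce; the sketch below computes Pearson correlation and RMSE between a synthetic surveillance benchmark and a noisy estimate of it, standing in for ILINet and GFT respectively.

```python
# Correlation and RMSE between a model estimate and a surveillance benchmark.
# The weekly series below are synthetic.
import numpy as np

def evaluate(estimates: np.ndarray, benchmark: np.ndarray):
    r = np.corrcoef(estimates, benchmark)[0, 1]          # Pearson correlation
    rmse = np.sqrt(np.mean((estimates - benchmark) ** 2))  # root mean square error
    return r, rmse

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    ili = 2 + np.sin(np.linspace(0, 6, 52)) ** 2 * 5      # benchmark (e.g., weekly ILI %)
    gft = ili + rng.normal(0, 0.4, 52)                     # noisy model estimate
    r, rmse = evaluate(gft, ili)
    print(f"correlation={r:.2f}, RMSE={rmse:.2f}")
```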
Article
Full-text available
This article argues that in an age of knowing capitalism, sociologists have not adequately thought about the challenges posed to their expertise by the proliferation of 'social' transactional data which are now routinely collected, processed and analysed by a wide variety of private and public institutions. Drawing on British examples, we argue that whereas over the past 40 years sociologists championed innovative methodological resources, notably the sample survey and the in-depth interviews, which reasonably allowed them to claim distinctive expertise to access the 'social' in powerful ways, such claims are now much less secure. We argue that both the sample survey and the in-depth interview are increasingly dated research methods, which are unlikely to provide a robust base for the jurisdiction of empirical sociologists in coming decades. We conclude by speculating how sociology might respond to this coming crisis through taking up new interests in the 'politics of method'.
Article
Full-text available
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Article
I characterize the prevailing philosophy of official statistics as a design/model compromise (DMC). It is design-based for descriptive inferences from large samples, and model-based for small area estimation, nonsampling errors such as nonresponse or measurement error, and some other subfields like ARIMA modeling of time series. I suggest that DMC involves a form of "inferential schizophrenia", and offer examples of the problems this creates. An alternative philosophy for survey inference is calibrated Bayes (CB), where inferences for a particular data set are Bayesian, but models are chosen to yield inferences that have good design-based properties. I argue that CB resolves DMC conflicts, and capitalizes on the strengths of both frequentist and Bayesian approaches. Features of the CB approach to surveys include the incorporation of survey design information into the model, and models with weak prior distributions that avoid strong parametric assumptions. I describe two applications to U.S. Census Bureau data.
Article
We examine the trade-offs associated with using Amazon.com's Mechanical Turk (MTurk) interface for subject recruitment. We first describe MTurk and its promise as a vehicle for performing low-cost and easy-to-field experiments. We then assess the internal and external validity of experiments performed using MTurk, employing a framework that can be used to evaluate other subject pools. We first investigate the characteristics of samples drawn from the MTurk population. We show that respondents recruited in this manner are often more representative of the U.S. population than in-person convenience samples (the modal sample in published experimental political science), but less representative than subjects in Internet-based panels or national probability samples. Finally, we replicate important published experimental work using MTurk samples. © The Author 2012. Published by Oxford University Press on behalf of the Society for Political Methodology. All rights reserved.
Article
This article, delivered as the 22nd Memorial Morris Hansen lecture, argues that the contract houses, typified by Westat, are uniquely situated in the cluster of institutions, practices, and principles that collectively constitute a bridge between scientific evidence on the one hand and public policy on the other. This cluster is defined in The Use of Science as Evidence in Public Policy as a policy enterprise that generates a form of social knowledge on which modern economies, policies, and societies depend (National Research Council 2012). The policy enterprise in the U.S. largely took shape in the first half of the twentieth century, when sample surveys and inferential statistics matured into an information system that provided reliable and timely social knowledge relevant to the nation's policy choices. In ways described shortly, Westat and other social science organizations that respond to requests for proposals (RFPs) from the government for social data and social analysis came to occupy a unique niche. The larger question addressed is whether the policy enterprise as we know it is prepared for the tsunami beginning to encroach on its territory. Is it going to be swamped by a data tsunami that takes information from very different sources than the familiar census/survey methods?
Article
This paper describes an experiment in which a single questionnaire was fielded in four different styles of presentation: Text Only, Decoratively Visual, Functionally Visual and Gamified. Respondents were randomly assigned to only one presentation version. To understand the effect of presentation style on survey experience and data quality, we compared response distributions, respondent behaviour (such as time to complete), and self-reports regarding the survey experience and level of engagement across the four experimental presentations. While the functionally visual and gamified treatments produced higher satisfaction scores from respondents, we found no real differences in respondent engagement measures. We also found few differences in response patterns.
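One of the comparisons described, differences in completion time across the four presentation styles, can be illustrated with a simple nonparametric test. The sketch below runs a Kruskal-Wallis test on synthetic timing data; it is only a stand-in for the analyses actually reported.

```python
# Comparing completion times across four experimental presentation styles
# with a Kruskal-Wallis test. Timing data are synthetic.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(5)
times = {
    "Text Only":           rng.lognormal(mean=2.30, sigma=0.3, size=200),
    "Decoratively Visual": rng.lognormal(mean=2.35, sigma=0.3, size=200),
    "Functionally Visual": rng.lognormal(mean=2.40, sigma=0.3, size=200),
    "Gamified":            rng.lognormal(mean=2.45, sigma=0.3, size=200),
}
stat, p = kruskal(*times.values())
print(f"Kruskal-Wallis H={stat:.2f}, p={p:.3f}")
```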
Article
Replicability of findings is at the heart of any empirical science. The aim of this article is to move the current replicability debate in psychology towards concrete recommendations for improvement. We focus on research practices but also offer guidelines for reviewers, editors, journal management, teachers, granting institutions, and university promotion committees, highlighting some of the emerging and existing practical solutions that can facilitate implementation of these recommendations. The challenges for improving replicability in psychological science are systemic. Improvement can occur only if changes are made at many levels of practice, evaluation, and reward.
Article
The response rate has played a key role in measuring the risk of nonresponse bias. However, recent empirical evidence has called into question the utility of the response rate for predicting nonresponse bias. The search for alternatives to the response rate has begun. The present article offers a typology for these indicators, briefly describes the strengths and weaknesses of each type, and suggests directions for future research. New standards for reporting on the risk of nonresponse bias may be needed. Certainly, any analysis into the risk of nonresponse bias will need to be multifaceted and include sensitivity analyses designed to test the impact of key assumptions about the data that are missing due to nonresponse.
Article
There is considerable policy interest in the impact of macroeconomic conditions on health-related behaviours and outcomes. This paper sheds new light on this issue by exploring the relationship between macroeconomic conditions and an indicator of problem drinking derived from state-level data on alcoholism-related Google searches conducted in the US over the period 2004-2011. We find the current recessionary period coincided with an almost 20% increase in alcoholism-related searches. Controlling for state and time-effects, a 5% rise in unemployment is followed in the next 12 months by an approximate 15% increase in searches. The use of Internet searches to inform on health-related behaviours and outcomes is in its infancy; but we suggest that the data provides important real-time information for policy-makers and can help to overcome the under-reporting in surveys of sensitive information.
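A hedged sketch of the kind of specification the abstract describes: a regression of a log search index on lagged unemployment with state and month fixed effects, estimated here with statsmodels' formula interface on synthetic data. The variable names, lag structure, and data are assumptions, not the authors' dataset or code.

```python
# State-by-month panel regression with two-way fixed effects (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
states = [f"S{i}" for i in range(10)]
months = pd.period_range("2004-01", "2011-12", freq="M")

panel = pd.DataFrame([(s, str(m)) for s in states for m in months],
                     columns=["state", "month"])
panel["unemp_lag12"] = rng.uniform(3, 12, len(panel))          # unemployment rate a year earlier
panel["log_searches"] = 0.03 * panel["unemp_lag12"] + rng.normal(0, 0.1, len(panel))

# C() creates dummy variables, absorbing state and time effects.
fit = smf.ols("log_searches ~ unemp_lag12 + C(state) + C(month)", data=panel).fit()
print("Effect of lagged unemployment on log searches:", round(fit.params["unemp_lag12"], 4))
```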
Article
Many surveys of the U.S. household population are experiencing higher refusal rates. Nonresponse can, but need not, induce nonresponse bias in survey estimates. Recent empirical findings illustrate cases when the linkage between nonresponse rates and nonresponse biases is absent. Despite this, professional standards continue to urge high response rates. Statistical expressions of nonresponse bias can be translated into causal models to guide hypotheses about when nonresponse causes bias. Alternative designs to measure nonresponse bias exist, providing different but incomplete information about the nature of the bias. A synthesis of research studies estimating nonresponse bias shows the bias often present. A logical question at this moment in history is what advantage probability sample surveys have if they suffer from high nonresponse rates. Since postsurvey adjustment for nonresponse requires auxiliary variables, the answer depends on the nature of the design and the quality of the auxiliary variables.
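For reference, two standard textbook expressions of nonresponse bias that this line of work builds on can be written as follows, assuming a sample of n units split into r respondents and m nonrespondents, with ρ denoting the response propensity; these formulas are a general illustration, not reproduced from the article itself.

```latex
% Deterministic view: the gap between the respondent mean and the full-sample mean
\bar{y}_r - \bar{y}_n = \frac{m}{n}\,\left(\bar{y}_r - \bar{y}_m\right)

% Stochastic view: bias driven by the covariance between response propensity and y
\operatorname{bias}(\bar{y}_r) \approx \frac{\operatorname{Cov}(\rho_i, y_i)}{\bar{\rho}}
```

Both expressions show why a high nonresponse rate alone need not produce bias: the bias also depends on how strongly the survey variable is related to the propensity to respond.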
Article
In the wake of high-profile controversies, psychologists are facing up to problems with replication.
Article
Results are reported from a preliminary study testing a new technology for survey data collection: audio computer-assisted self-interviewing. This technology has the theoretical potential of providing privacy (or anonymity) of response equivalent to that of paper self-administered questionnaires (SAQs). In addition, it could offer the advantages common to all computer-assisted methods, such as the ability to implement complex questionnaire logic, consistency checking, etc. In contrast to Video-CASI, Audio-CASI proffers these potential advantages without limiting data collection to the literate segment of the population. In this preliminary study, results obtained using RTI's Audio-CASI system were compared to those for paper SAQs and for Video-CASI. Survey questionnaires asking about drug use, sexual behavior, income, and demographic characteristics were administered to a small sample (N = 40) of subjects of average and below-average reading abilities using each method of data collection. While the small sample size renders many results suggestive rather than definitive, the study did demonstrate that both Audio- and Video-CASI systems work well even with subjects who do not have extensive familiarity with computers. Indeed, respondents preferred the Audio- and Video-CASI to paper SAQs. The computerized systems also eliminated errors in execution of "skip" instructions that occurred when subjects completed paper SAQs. In a number of instances, the computerized systems also appeared to encourage more complete reporting of sensitive behaviors such as use of illicit drugs. Among the two CASI systems, respondents rated Audio-CASI more favorably than Video-CASI in terms of interest, ease of use, and overall preference.
Article
The Literary Digest poll of 1936 holds an infamous place in the history of survey research. Despite its importance, no empirical research has been conducted to determine why the poll failed. Using data from a 1937 Gallup survey which asked about participation in the Literary Digest poll I conclude that the magazine's sample and the response were both biased and jointly produced the wildly incorrect estimate of the vote. But, if all of those who were polled had responded, the magazine would have, at least, correctly predicted Roosevelt the winner. The current relevance of these findings is discussed.
Article
Most common diseases are complex genetic traits, with multiple genetic and environmental components contributing to susceptibility. It has been proposed that common genetic variants, including single nucleotide polymorphisms (SNPs), influence susceptibility to common disease. This proposal has begun to be tested in numerous studies of association between genetic variation at these common DNA polymorphisms and variation in disease susceptibility. We have performed an extensive review of such association studies. We find that over 600 positive associations between common gene variants and disease have been reported; these associations, if correct, would have tremendous importance for the prevention, prediction, and treatment of most common diseases. However, most reported associations are not robust: of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated. Interestingly, of the remaining 160 associations, well over half were observed again one or more times. We discuss the possible reasons for this irreproducibility and suggest guidelines for performing and interpreting genetic association studies. In particular, we emphasize the need for caution in drawing conclusions from a single report of an association between a genetic variant and disease susceptibility.
Article
Over the past few years surveys have expanded to new populations, have incorporated measurement of new and more complex substantive issues and have adopted new data collection tools. At the same time there has been a growing reluctance among many household populations to participate in surveys. These factors have combined to present survey designers and survey researchers with increased uncertainty about the performance of any given survey design at any particular point in time. This uncertainty has, in turn, challenged the survey practitioner's ability to control the cost of data collection and quality of resulting statistics. The development of computer-assisted methods for data collection has provided survey researchers with tools to capture a variety of process data ('paradata') that can be used to inform cost-quality trade-off decisions in real time. The ability to monitor continually the streams of process data and survey data creates the opportunity to alter the design during the course of data collection to improve survey cost efficiency and to achieve more precise, less biased estimates. We label such surveys as 'responsive designs'. The paper defines responsive design and uses examples to illustrate the responsive use of paradata to guide mid-survey decisions affecting the non-response, measurement and sampling variance properties of resulting statistics. Copyright 2006 Royal Statistical Society.