The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data
The goal of this paper is to examine the extent to which social science research data are shared and assess whether data sharing affects research productivity tied to the research data themselves. We construct a database from administrative records containing information about thousands of social science studies that have been conducted over the last 40 years. Included in the database are descriptions of social science data collections funded by the National Science Foundation and the National Institutes of Health. A survey of the principal investigators of a subset of these social science awards was also conducted. We report that very few social science data collections are preserved and disseminated by an archive or institutional repository. Informal sharing of data in the social sciences is much more common. The main analysis examines publication metrics that can be tied to the research data collected with NSF and NIH funding – total publications, primary publications (including PI), and secondary publications (non-research team). Multivariate models of count of publications suggest that data sharing, especially sharing data through an archive, leads to many more times the publications than not sharing data. This finding is robust even when the models are adjusted for PI characteristics, grant award features, and institutional characteristics. This paper was presented at “The Organisation, Economics and Policy of Scientific Research” workshop, Torino, Italy, in April, 2010. See: National Library of Medicine (R01 LM009765). The creation of the LEADS database was also supported by the following research projects at ICPSR: P01 HD045753, U24 HD048404, and P30 AG004590.
Amy M. Pienta, George Alter, Jared Lyle
Inter-university Consortium for Political and Social Research, Institute for Social Research,
University of Michigan
1 We would like to acknowledge the National Library of Medicine (R01 LM009765). We also thank Darrell
Donakowski, Myron Gutmann, Felicia LeClere, JoAnne O’Rourke, James McNally, Russell Hathaway, Kristine
Witkowski, Kelly Zidar, Tannaz Sabet, Lisa Quist, and Robert Melendez for their contributions to the LEADS database
at ICPSR. The creation of the database was also supported by the following research projects at ICPSR: P01
HD045753, U24 HD048404, and P30 AG004590. Email Address of Corresponding Author:
Federal funding for scientific research has always been a highly competitive endeavor with a
small proportion of research grant submissions receiving awards from the National Institutes of
Health (NIH) each year. The impact, or success, of a funded research project is measured, partly,
by the research productivity of the PI and his or her research team who publish findings from
primary data collection activities. Increasingly, NIH and the National Science Foundation (NSF)
have become interested in data sharing as a means of supporting the scientific process and ensuring
the highest return on competitive investments. However, there has been little investigation of
research productivity that extends beyond the primary use of data to test hypotheses outlined in
the grant application that led to the data collection activity. We propose to redress this gap by
examining research productivity of research data, both as it applies to the original research team and
reuse by members outside the research team.
This research question is particularly salient for the social sciences because several social
science disciplines were among the earliest to organize efforts to share research data – so the
avenues for sharing data have been fairly well known, especially in the social science disciplines of
political science, sociology and economics. Social science research occurs in other social and
behavioral disciplines, as well. So, there is tremendous heterogeneity in data sharing in the social
sciences. As a result, despite the best efforts of social science data archives in the United States,
many social science studies do not reside in a permanent archive (Pienta et al. 2008).
The largest share of social science research is conducted with federal support. The National
Science Foundation (NSF) and the National Institutes of Health (NIH) historically have supported a
significant share of social science data collections and the trend continues today (Alpert 1955;
Alpert 1960; Kalberer 1992). This paper focuses on analyzing information from grant awards made
by NSF and NIH, making it possible to enumerate the bulk of the major social science data
collections that exist today. Also, NSF and NIH keep electronic records about grant awardees that
can be and have been culled into a single database useful for understanding the scope and breadth of
social science research that has produced research data. Thus, this research topic is both timely and
Data sharing has been an important topic of debate in the social sciences for more than
twenty years, initially spurred by a series of National Research Council Reports (1985) and more
recently the publication of the National Institutes of Health Statement on Sharing Research Data in
February 2003 (NIH 2003). Despite this formal written statement from NIH and a similar one from
the National Science Foundation (NSF-SBE n.d.) that give official support for the long held
expectations placed on grantees to share their research data, little is known about the extent to
which data collected with support from NIH or NSF have been shared with other researchers. The
limited work done suggests considerable variability in the extent to which researchers share and
archive research data. Our LEADS database of NIH and NSF grant information will fill this gap in
knowledge and create a research database for answering these questions.
NIH’s policy is designed to encourage data sharing with the goal of advancing science. The
benefits of sharing data have been widely discussed and understood by researchers for years. An
important part of Kuhn’s (1970) scientific paradigm is the replication and confirmation of results.
Sharing data is at the core of direct replication (Anderson et al. 2005; Kuhn 1970; Freese 2006). The
foundation of the scientific process is that research should build on previous work, where
applicable, and data sharing makes this possible (Bailar 2003; Louis, Jones & Campbell 2002). The
argument has been made, and there is some evidence to support it, that sharing data and allowing
for replication makes one’s work more likely to be taken seriously and cited more frequently (King
et al. 1995). In fact, Gleditsch, Metelits, and Strand (2003: 92) find that articles whose authors
make their data available are cited twice as frequently as articles with “no data but otherwise
equivalent credentials, including degree of formalization.”
Additionally, the nature of large datasets virtually guarantees that a single researcher or
group of researchers will not be able to use the dataset to its full potential for a single project. It may
be the case that those who collect the data are not the best at analyzing them beyond basic
descriptive analyses (Bailar 2003). Sharing data in this way ensures that resources spent on data
collection are put to the best use possible and the public benefit is enhanced.
Finally, the use of secondary data is crucial in the education of undergraduate and graduate
students (Fienberg 1994; King 2006). It is not feasible for students in a semester-long course to
collect and analyze data on a large scale. Using datasets that have been archived and shared allows
students to experience science firsthand. Instructors can use the metadata accompanying shared data
to teach students about “good science” and the results obtained from even simple analyses to
illustrate the use of evidence (data) in support of arguments (Sobal 1981).
Policies about Data Sharing
Most institutes and organizations that finance research, especially data collection, have a
policy about sharing data once the initial project is completed. The National Institutes of Health
(NIH 2003) and National Science Foundation (NSF-SBE n.d.), for example, require a clearly
detailed plan about data sharing as part of research proposals submitted for review. Plans must
cover how and where materials will be stored, how access will be given to other researchers, and
any precautions that will be taken to protect confidentiality when the data is made public. These
requirements are not, however, evaluated in the review process nor are there formal penalties for
non-compliance after the award. Most professional organizations also include a statement in their
“best practice” or ethics guidelines that addresses the issue that research reports should be detailed
enough to allow for replication and that researchers should also make available their data and
assistance in these replication attempts when the requests are made (e.g., American Sociological
Association, American Psychological Association, American Association for Public Opinion Research).
In addition to such general statements that data collected with public funds must be shared
with other researchers and that individuals should be willing to assist others replicating their work,
some fields, such as Economics, have taken steps to make the data sharing policy more concrete. In
an attempt to allow for direct replications as well as full-study replications, the American Economic
Review and other major economics journals have instituted the practice that any article to be
published must be accompanied by the data, programs used to run the analyses, and clear, sufficient
details about the procedures prior to publication (Freese 2006; Anderson et al. 2005). The
requirement to include not only the data but also statistical code written to perform analyses
requires that individual researchers thoroughly and carefully document decisions made during the
analysis stages of the project and allows other researchers to more easily use these as starting points
for their own work. This has led to an increased use and citation of work that has been published in
journals where this type of information is required (Anderson et al. 2005; Gleditsch et al. 2003).
Sharing Social Science Data
Data are currently shared in many different ways ranging from formal archives to informal
self-dissemination. Data are often stored and disseminated through established data archives. These
data generally reach a larger part of the scientific community. Also, data in formal archives
typically include information (metadata) about the data collection process as well as any missing
data imputations, weighting, and other data enhancements. These archiving institutions have written
policies and explicit practices to ensure long-term access to the digital assets that they hold,
including replication copies stored off-site and a commitment to the migration of data storage formats.
These are the characteristics that define data archives.
Another tier of data archives has more narrowly focused collections around a particular
substantive theme, such as the Association of Religion Data Archives (ARDA).
The data in these kinds of thematic archives are not necessarily unique, though some of their
holdings are, but the overlap between archives makes data available to broader audiences than
might be captured by a single archive. The ARDA, for instance, has a broader non-scientific
audience who are interested in analysis and reports as well as the micro-data files for reanalysis.
These archives expend resources on the usability of the collection for the present day and make
some kind of a commitment to long-term access through migration and back-ups.
Some data archives are designed solely to support the scientific notion of replication.
Journal-based systems of sharing data have become popular in Economics and other fields as a way
of encouraging replication of results (Anderson et al. 2005; Gleditsch et al. 2003). The longevity
of these collections is sometimes more tenuous than the formal archives, particularly if the
sustainability of their archival model relies on a single funding source.
Some examples of less formal approaches include authors who acknowledge they will make
their data available upon request or who distribute information or data through a website.
Researchers often keep these sites up to date with information about findings from the study and
publication lists, in addition to data files and metadata. These sites reach only those who know
about the study by name or for whom the website has shown up in an internet search (see also
Berns, Bond & Manning 1996). Typically, the commitment to preserving the content lasts only as
long as the individual has resources available.
The Reluctance of Researchers to Archive Data
The time and effort required to produce data products that are useable by others in the
scientific community are substantial. This extra effort is seen by many as a barrier to sharing data
(Birnholz & Bietz 2003; Stanley & Stanley 1988). In addition to the actual data, information must
be added to assist secondary users in identifying whether the data would be of value to them and in
the analysis and interpretation of results. Such metadata includes complete descriptions of all stages
of the data collection process (sampling, mode of data collection, refusal conversion techniques,
etc.) as well as details about survey question wording, skip patterns and universe statements, and
post-data processing. All of these factors allow subsequent researchers to judge the quality of the
data they are receiving and whether it is adequate for their research agenda. Therefore, substantial
effort is required of those sharing data, while the benefits accrue to the secondary user.
Another significant barrier in the sharing of data is the risk of breaching the confidentiality
of respondents and the potential for the identification of respondents (Bailar 2003). The issue of
protecting confidentiality has become more salient as studies collect information about social
context, which may include census tract or block group identification to allow researchers to link
the data collected with information about the context. Not only are data about social and community
contexts being collected and included in datasets but also global positioning coordinates and
information about multiple members of a household, all of which could make identification of any
single individual easier. Additional information about biomarkers and longitudinal follow up are
also hallmarks of new data collection efforts. Both methodological innovations make it more
difficult for Institutional Review Boards to allow for the wide redistribution of data.
Other reasons individuals give for withholding data include wanting to protect their or their
students’ ability to publish from the data as well as the extra effort involved in preparing data for
sharing (Louis et al. 2002). Retaining the ability to publish from one’s data seems to be a significant
concern among scientists, both for fear of others “scooping” the story and that others will find
mistakes in their attempt to replicate results (Anderson et al. 2005; Bailar 2003; Freese 2006).
Current publication and academic promotion practices act as another barrier to sharing data
– or, put another way, those who “hoard” their data are likely to be rewarded more than those who
“share”. There are often few, if any, rewards to sharing data, especially given the expense in terms
of time and effort required to prepare clean, detailed data and metadata files. Researchers are not
typically rewarded for such behavior, particularly if the time spent on data sharing tasks infringes
on one’s ability to prepare additional manuscripts for publication. Academic culture does not
support the scientific norm of replication and sharing with tangible rewards (Anderson et al. 2005;
Berns et al. 1996). As an example, in discussing the notion that researchers might share not only
data but also analytic/statistical code, Freese (2006:11) notes that a typical reaction to a “more
social replication policy would be to expend less effort writing code, articulating a surprisingly
adamant aversion to having [one’s] work contribute to others’ research unless accompanied by clear
and complete assurance in advance that they would be credited copiously for any such
contribution.” It is unlikely that attitudes about data sharing will change without strong leadership
and examples set by senior scientists and the commitment of scientific institutions such as
universities and professional societies who facilitate and enforce such sharing (Berns et al. 1996).
Extending Research Productivity to include Data Reuse
Research productivity is often thought of as something that scientists accomplish by
publishing their research discoveries. Research productivity reflects not only how many
times one’s ideas are published, but also how often they are cited in the work of others (Matson,
Gouvier, and Manikam 1989). This is an analysis of citation counts of a scientist’s publications – how
widely cited their publications are. Thus, the impact of a scientist’s scholarship is derived directly
from their own published work. However, there has been movement in the scientific academy to
recognize the importance and value of research data. We consider the possibility that research data
may have enduring value on scientific progress as scientists use and reuse research data to draw
new analyses and conclusions. This notion is rooted in the idea of a data life cycle – where research
data can often have use beyond its original designed purpose (Jacobs and Humphrey 2004). This is
not farfetched given research productivity measures have also been used to assess institutional
productivity across universities (Toutkoushian, Porter, Danielson, and Hollis 2003). Here, we
consider the research productivity attached to research data collected by a scientist with federal funding.
In summary, while the social sciences share in the normative expectation that research data
must be shared to foster replication and reanalysis, there is little to suggest that it is a widespread
practice. Federal institutions and professional organizations underscore these normative
expectations with implicit and explicit sharing policies. The advantages of sharing data with the
research community are large and cumulative. Yet, with the exception of leading journals in
Economics, there are few cases in which these normative statements are coupled with penalties or
incentives to reinforce them. The institutional, financial, and career barriers to data sharing are
substantial as noted. What remains an open empirical question is the extent of data sharing across
social science disciplines and the value this has for the social sciences.
To address this question we construct a database of research projects. The LEADS database
is comprised of social and behavioral science awards made by NSF and NIH. From the National
Science Foundation online grants database, we include in our study research grant awards from
three NSF organizations that matched prominent search terms relating to the social sciences2.
We further restrict this set of awards to awards that include descriptions of research activity
that (1) relates to the social and/or behavioral sciences and (2) likely includes original (or
primary) data collection (including assembly of a new database from existing or archival sources).
From the National Institutes of Health online CRISP (Computer Retrieval of Information on
Scientific Projects) database3, we include extramural research grant awards from the top ten NIH
institutes engaging in social and behavioral science research4. In addition to screening these
awards for social and behavioral science content, these awards also were restricted to the
collection of original quantitative data. This strategy differs from the NSF award review in that
strictly qualitative studies were not identified as such and excluded from LEADS.
2 We determined this set of organizations in a pretest. The NSF organizations we include in the sample are: Directorate
for Computer & Information Science & Engineering, Office of Polar Programs, and Directorate for Social, Behavioral
& Economic Sciences. Search terms used to select possible awards from these NSF organizations for inclusion in
LEADS included: SOC*, POLIT*, and/or STAT*.
3 CRISP has since been replaced by the RePORT Expenditures and Results (RePORTER) query tool,
although all NIH information described in this paper was collected using the CRISP database.
4 We determined this set of institutes in a pretest. The NIH institutes we include in the sample are: Agency for
Healthcare Research and Quality, National Institute of Child Health and Human Development, National Institute of
Mental Health, National Institute of Nursing Research, National Institute on Aging, National Institute on Alcohol Abuse
and Alcoholism, National Institute on Drug Abuse
Of the 235,953 eligible NSF and NIH awards in the LEADS database, 12,464 matched our
initial screening criteria (i.e., social/behavioral science and collected research data). We select
awards from 1985-2001 (n=7,040). We selected this range of years because we wanted to inquire about
completed research that could have led to publications and data archiving. But, we did not want to
select awards that were completed so long ago that recall of information about the publications
related to the award would be unreasonable. From this set of awards, we determined there are 4,883
unique PIs. We attempted to invite all 4,883 of these PIs5 to complete a web survey6.
5 While we attempted to e-mail survey invitations to all investigators, we were unable to reach at least 1,632 PIs after
one or more attempts. This was due to PIs no longer living, non-public e-mail addresses (i.e., e-mail addresses not
readily discoverable on the Web), and non-working e-mail addresses (i.e., ‘bounced’ e-mails).
6 PIs with multiple awards were asked about just one of their awards, as selected randomly.
The PI survey consisted of questions about research data collected, various
methods for sharing research data, attitudes about data sharing and demographic information. PIs
were also asked about publications tied to the research project including information about their
own publications, research team publications, and publications outside the research team. We
received 1,217 responses (24.9% response rate). For the analytic sample we select PIs and their
research data if (1) they confirm they collected research data (86.6% of the responses), (2) they did
not collect data for a dissertation award (n=33), and (3) they were not missing data on the dependent
variables.
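The sample counts reported above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the variable names are hypothetical, and the rate among reachable PIs is our own derived figure, computed on the assumption that exactly 1,632 PIs could not be reached (the text says "at least" 1,632, so the true rate among reached PIs could be higher).

```python
# Counts taken from the text; all variable names are illustrative only.
eligible_awards = 235_953   # NSF and NIH awards in LEADS
screened_awards = 12_464    # matched the initial screening criteria
unique_pis = 4_883          # PIs of the 1985-2001 awards
unreachable_pis = 1_632     # reported as "at least" this many
responses = 1_217           # completed surveys


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


screen_rate = pct(screened_awards, eligible_awards)   # share of awards passing the screen
response_rate = pct(responses, unique_pis)            # rate reported in the text

# Assumption: treat 1,632 as the exact number of unreachable PIs.
reachable_pis = unique_pis - unreachable_pis
reachable_response_rate = pct(responses, reachable_pis)

print(screen_rate, response_rate, reachable_response_rate)
```

The second figure reproduces the 24.9% response rate stated in the text; the other two are derived here only for orientation.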
Publication Measures – Research productivity is typically assessed by either or both citation
and publication analysis. The outcome measures used in this analysis are various measures of
publication counts. Publication counts are based on self-reported information provided by PIs of
the research grant awards at NSF and NIH. PIs are asked to report number of publications related to
the data they collected including estimates for: own publications, publications of the research team,
extant publications not related to the research team and the number of publications (in each of the
three previous categories) that include students. We include in this analysis count of publications
where the PI is one of the authors (range 0 – 100). This is the first measure of primary publications.
A second measure of primary publications is created that also includes counts of publications where
the PI may or may not be an author, but at least one member of the research team is an author
(range 0-350). Secondary publications are publications where none of the original research team
(PI, co-investigators, students or other researchers) is a co-author of the publication (range 0-700).
This measure indicates the extent of reuse (or secondary use) of research data beyond its original
collection purpose. Next, total publication count is constructed by adding counts of all primary
publications with counts of secondary publications (range=0-713). Finally, the number of
publications where a student was author or co-author is defined (range 0-160).
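To make the relationship among these outcome measures concrete, the sketch below shows one way the counts could be derived from a single survey response. It is an illustration only: the class and field names are hypothetical, and it assumes that PI-authored publications and team publications without the PI are reported as non-overlapping counts, which the survey description does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class PublicationReport:
    """Self-reported publication counts for one award (hypothetical field names)."""
    pi_pubs: int          # publications with the PI as an author
    team_only_pubs: int   # research-team publications without the PI (assumed non-overlapping)
    secondary_pubs: int   # publications with no member of the original team as author
    student_pubs: int     # publications in any category with a student author

    @property
    def primary_pi(self) -> int:
        # First measure of primary publications: PI is one of the authors.
        return self.pi_pubs

    @property
    def primary_team(self) -> int:
        # Second measure: at least one research-team member is an author.
        return self.pi_pubs + self.team_only_pubs

    @property
    def total(self) -> int:
        # Total publications: all primary plus secondary publications.
        return self.primary_team + self.secondary_pubs


report = PublicationReport(pi_pubs=12, team_only_pubs=5, secondary_pubs=20, student_pubs=4)
print(report.primary_pi, report.primary_team, report.total)
```

The example values are invented for illustration and do not come from the survey data.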
The main independent variable used in the analysis is data sharing status. PIs are asked
whether they share the research data from the award (1) formally through a data archive (or
institutional repository), (2) informally, not through a data archive (including shared upon request,
personal website, departmental website), or (3) not shared.
To ensure that data sharing is not “standing in” for other known predictors of productivity,
we include covariates describing characteristics of the individuals who collect the data, the award
mechanism used to fund the data, and the institutional home of the original data collection.
Research productivity has been linked to departmental prestige (Long 1978), age (Levin and
Stephan 1991) and gender (Penas and Willet 2006) among other factors. We begin by describing
PI characteristics we are able to measure.
PI Characteristics – We expect that characteristics of the PIs themselves will be associated
with both data sharing status and various publication counts. Some researchers have more time for
archiving and publishing whereas others may be more likely to engage in training and service. We
attempt to control for this by including various social and demographic characteristics of the PIs in
the models. The gender of the PI is coded male (=1) or female (=0)7. The self-reported race/ethnicity
of the PI is defined as white (=1) versus non-white (=0). Age (in years) at time of initial award is
calculated by subtracting year of birth from year at start of initial award (range 27-75).
Self-reported faculty status/rank at time of initial award is defined as senior (tenured faculty),
junior (untenured faculty), and non-faculty (including students, postdocs, research staff).
Self-reported discipline is classified from an open-ended question and collapsed into the following
categories: (1) health sciences (nursing, medicine, public health) and psychology, (2) core social
science (political science, sociology, and economics), and (3) other social science-related
discipline (anthropology, film, communications)8. Finally, the number of federal grants awarded
throughout one’s career is defined as number of self-reported federal research grants (range 1-100).
7 A small number of cases were missing data about self-reported gender. For these, we used a constructed measure of
gender where gender was assigned using a tool that analyzed the common first names and estimated gender. This
measure was constructed for the full database, not just the PIs who were part of the survey. The tool constructed names
based on the top 100 names from the Social Security Baby Names index for 1930-1975 (in 5-year increments). See:
8 We combine health sciences and psychology because neither of the disciplines typically shares data. We combine the
core social science disciplines (sociology, political science, and economics) because the rates of data sharing were
similar across the three disciplines. The other category shares data at rates midway between the other two categories.
Institutional Characteristics – Next, we construct a set of measures about the institutions
awarded the research grant (the institution of the PI at time of initial award). First, we use the
Carnegie Classification9 to differentiate research institutions from non-research institutions.
Research institutions include research universities, doctoral granting universities, and medical
schools/centers. Non-research institutions include 2- and 4-year colleges, Master’s colleges and
universities, professional institutions and tribal colleges. Other institutions not classified under
Carnegie are divided into private research organizations10 and other non-Carnegie institutions11.
The other institutional characteristic defined is the region where the institution is located. This is
defined into the following categories: northeast, south, midwest and west.
Grant Award Characteristics – Two measures of the grant award are defined. First, we
differentiate awards made by the National Science Foundation from the National Institutes of
Health. NSF has had in place a data sharing policy for a longer time and it is expected that data will
be shared and archived more frequently when funded by NSF. The other award measure is the
duration of the award, measured in years (range =0-8 years). This is included to capture the size of
the award.
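The coding rules described for the PI and award covariates can be sketched as a small recoding function. This is a hedged illustration rather than the authors' procedure: the record keys, category labels, and function names are hypothetical, and only the coding decisions themselves (male = 1, white = 1, age as award year minus birth year, three collapsed discipline groups, an NSF indicator) are taken from the text.

```python
# Discipline groups as described in the text; set names are illustrative.
HEALTH_PSYCH = {"nursing", "medicine", "public health", "psychology"}
CORE_SOCIAL = {"political science", "sociology", "economics"}


def collapse_discipline(raw: str) -> str:
    """Collapse an open-ended discipline answer into one of three groups."""
    d = raw.strip().lower()
    if d in HEALTH_PSYCH:
        return "health sciences and psychology"
    if d in CORE_SOCIAL:
        return "core social science"
    return "other social science-related"


def code_pi(record: dict) -> dict:
    """Recode one raw survey record (hypothetical keys) into model covariates."""
    return {
        "male": 1 if record["gender"] == "male" else 0,
        "white": 1 if record["race"] == "white" else 0,
        # Age at time of initial award: year at start of award minus year of birth.
        "age_at_award": record["award_year"] - record["birth_year"],
        "discipline": collapse_discipline(record["discipline"]),
        "nsf": 1 if record["agency"] == "NSF" else 0,
    }


example = code_pi({
    "gender": "female", "race": "white", "birth_year": 1950,
    "award_year": 1992, "discipline": "Sociology", "agency": "NSF",
})
print(example)
```

The example record is invented; any answer outside the two named groups falls into the third category in this sketch, which is a simplification of how open-ended responses would actually be classified.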
9 Carnegie Classification information was based on the "Classifications Data File" (Last update: April 12, 2009). See:
10 Private Research Organizations include Non-academic institutions with a research mission (e.g. American Bar
11 These include: Non-US institutions (e.g. University of Oxford) and awards made to individuals
Analysis Plan – Descriptive results are calculated using univariate and bivariate statistics.
Because the outcome measures are publication counts, Poisson regression models are considered.
Overdispersion led us to estimate negative binomial regression models of publication
counts for longitudinal data (offset by the amount of time between initial award and the survey).
We estimate two sets of models for each outcome. First, we estimate models that include only a
three category data sharing status measure. The second set of models adds the various PI,
institution, and award characteristics. We do not include any covariates in the final models shown
(model 2) that were not statistically significant across all outcomes. The hierarchical set of models
(model 1 and model 2) allows us to understand the extent to which differences by data sharing
status might be attributed to other characteristics of PIs, institutions and the awards.
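The move from Poisson to negative binomial models can be illustrated with a short, self-contained sketch. It is not the estimation code used for this paper: it assumes the common NB2 parameterization (variance = mu + alpha * mu^2), treats years since the award as an exposure whose logarithm enters as the offset, and uses invented counts and function names. It only evaluates log-likelihoods at a fixed linear predictor rather than fitting a model.

```python
import math


def dispersion_ratio(counts):
    """Sample variance divided by the mean; values well above 1 suggest overdispersion."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((y - mean) ** 2 for y in counts) / (n - 1)
    return var / mean


def poisson_loglik(counts, exposures, etas):
    """Poisson log-likelihood with mu_i = exposure_i * exp(eta_i)."""
    total = 0.0
    for y, t, eta in zip(counts, exposures, etas):
        mu = t * math.exp(eta)  # log(exposure) acts as the offset
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total


def nb2_loglik(counts, exposures, etas, alpha):
    """NB2 log-likelihood with the same offset; Var(y_i) = mu_i + alpha * mu_i**2."""
    r = 1.0 / alpha
    total = 0.0
    for y, t, eta in zip(counts, exposures, etas):
        mu = t * math.exp(eta)
        total += (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                  + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return total


# Invented illustration: publication counts and years between award and survey.
counts = [0, 2, 15, 1, 40]
exposures = [10, 12, 9, 15, 11]
etas = [0.0] * len(counts)  # fixed linear predictor: one publication per year

ratio = dispersion_ratio(counts)
ll_poisson = poisson_loglik(counts, exposures, etas)
ll_nb = nb2_loglik(counts, exposures, etas, alpha=1.0)
print(ratio, ll_poisson, ll_nb)
```

When the counts are far more variable than their mean, as in this toy example, the negative binomial log-likelihood is higher than the Poisson one at the same linear predictor; as alpha approaches zero the two coincide.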
Descriptive sample characteristics are presented in Table 1. The sample of PIs is fairly
evenly divided between males (51.9 percent) and females (48.1 percent). The majority of the sample is
white (86.8 percent) and tenured (54.3 percent). Only 20 percent of the sample of PIs is non-
faculty. The mean number of Federal grants the PIs have been awarded throughout their careers is
6.2. The majority of PIs come from either the psychological or health sciences (62.5%). Just over
a quarter of the sample are PIs in the core social science disciplines (sociology, economics and
political science).
[Insert table 1 about here]
The sample of PIs comes from all four major regions of the U.S. The largest numbers of
grant awards are made to institutions located in the northeast (36 percent) and the fewest number of
grant awards are made to institutions located in the west (18.7 percent). The vast majority of PIs of
the research grant awards are working at institutions classified by the Carnegie Classification as
research institutions (78.8 percent). The second largest institution type represented in the PI survey
is private research organizations (12.3 percent). Few awards were made to non-research institutions
and other types of organizations not reflected in the Carnegie Classification (6.5 percent and 2.5
percent respectively). Only 27.3 percent of the awards in our sample come from the National
Science Foundation, with the majority coming from the National Institutes of Health (72.7%). The
mean length of award is 3.1 years. Few awards produce research data that are shared formally
either in a data archive or institutional repository (11.5%). Of the rest, half the data from the awards
are shared informally, not in an archive (44.6 percent), and half are not being shared beyond the
research team (43.9%).
Turning to Table 2, we next examine how various characteristics of the PIs, institutions and
grant awards are related to data sharing status. Women are more likely to archive data than men
(12.0 and 8.1 percent respectively; chi-squared is statistically significant). We see that PIs who are
senior faculty are more likely to archive data as well (12.0 percent) – nearly twice as often as junior
faculty (7.1 percent) and non-faculty (8.6 percent). There are strong disciplinary differences as
well. The core social science disciplines archive data at the highest rate (27 percent). Psychologists
and health scientists archive data least often (4.6 percent). PIs at institutions located in the south
are least likely to archive data (8.5 percent). The Carnegie Classification of the institution awarded
a research grant to collect data is not associated with data sharing status. Data funded by NSF
research grant awards are nearly three times more likely to be archived than data funded by NIH.
[Table 2 about here]
Table 3 shows the distribution of various publication counts for the full sample and by data
sharing status. For example, PIs write (or contribute to) a median of 4 primary publications. However, PIs
who archive data formally write a median of 6 primary publications, compared with a median of 3 for PIs who
do not share data. Research teams are also more productive when they archive the data.
The median number of research team publications is 8 when data are archived compared to 3 when
data are not shared outside the research team.
[Table 3 about here]
Many research data collections lead to no secondary publications beyond the PI and
research team. Thus, across all categories, the median number of secondary publications is 0. For
this outcome we therefore examine the mean. A research grant award produces 2 secondary (non-research
team) publications on average. However, when data are archived 4 secondary publications are
reported on average. We turn to the total number of publications next. A research grant award
produces a median of 5 total publications. However, when data are archived a research grant award
leads to a median of 10 publications. When data are shared informally a research grant is linked to a
median of 7 publications. And, when data are not shared outside the research team, the research
data lead to only a median of 4 publications overall. The same pattern is found for publications
with student authorship as well.
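The reason for switching from the median to the mean for secondary publications can be seen in a small sketch. The counts below are hypothetical and chosen only to mimic a distribution in which most awards have no secondary publications; they are not drawn from the LEADS data.

```python
import statistics

# Hypothetical secondary-publication counts for ten awards (not the LEADS data):
# most awards have none, a few have many.
secondary = [0, 0, 0, 0, 0, 0, 1, 2, 5, 12]

median_count = statistics.median(secondary)  # middle of the sorted counts
mean_count = statistics.mean(secondary)      # total publications / number of awards
share_zero = sum(1 for c in secondary if c == 0) / len(secondary)
```

When more than half of the awards have zero secondary publications, the median is 0 in every group and cannot distinguish between them, while the mean still reflects the awards whose data are reused.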
Multivariate results are presented in Table 4.¹²
[Table 4 about here]
Count of total publications is related to data
sharing status. Both archiving data and sharing data informally are related positively to count of
total publications (b=1.094 and b=1.020 respectively). Both associations are statistically significant
(p<.01). Exponentiating these coefficients gives the multiplicative effect on the expected count:
archiving data leads to 2.98 times as many publications as not sharing data, and sharing data informally
(compared to not sharing at all) leads to 2.77 times as many.
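The conversion from coefficient to multiplicative effect can be checked directly. In a negative binomial model the coefficients are on the log scale of the expected count, so exponentiating one gives a ratio of expected counts. The sketch below uses the two model 1 coefficients reported in the text; the function and variable names are illustrative.

```python
import math

def rate_ratio(b):
    """Exponentiate a log-scale count-model coefficient into a multiplicative effect."""
    return math.exp(b)

# Model 1 coefficients for total publications, as reported in the text.
total_model1 = {"archived": 1.094, "shared informally": 1.020}

for status, b in total_model1.items():
    print(f"{status}: b={b:.3f} -> about {rate_ratio(b):.2f} times as many publications")
```

Results recomputed this way may differ from the published figures in the last digit, because the published coefficients are themselves rounded to three decimals.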
Turning to model 2, additional covariates are added to the model to account for potential
differences in PIs, institutions, and the grant awards. The coefficients for archiving data and
informally sharing data remain positively associated with total publication count in
comparison to not sharing data at all. These two coefficients are smaller than in model 1, but still
statistically significant. Exponentiating the coefficients, archiving data leads to 2.42 times as many
publications as not sharing data, and informally sharing data leads to 2.31 times as many. Thus, the
effect of data sharing, formally or informally, is not explained by differences in the PIs themselves,
the awards or the institutions that were given the awards to conduct the research. Research productivity
clearly benefits from data sharing, particularly archiving data.
12 Dispersion differs from 0 across all outcomes and models, supporting negative binomial regression models.
Log-likelihood estimates are presented in Tables 4 & 5; standard errors appear in parentheses.
Other coefficients in the model demonstrate that being older at the time of award is
associated with a higher log expected count of total publications. Being older at time of award may
serve as a proxy for writing and publishing experience, and in turn older PIs may have a
publishing advantage that is not explained by other factors. One of the surprising results is that
faculty status (senior, junior, and non-faculty) at time of award was not statistically significant in
the model. This covariate, along with gender, race and number of federal grants, is not included in model
2. The other PI characteristic that affects total publications is the PI’s discipline. Compared to data
collected by core social scientists, data collected by health scientists and psychologists have a lower
log expected count of overall publications (b=-.254).
Only one measure of the institutional climate surrounding the award that produced data is
retained in model 2. Carnegie Classification is associated with total publication count. Compared
to data collected at research universities, data collected at non-research institutions have a lower
log expected count of overall publications (b=-.685). Data collected at other non-Carnegie classified institutions
(but not private research organizations, which were classified separately), compared to data collected
at Carnegie research universities, are actually associated with a higher log expected count of publications
(b=.230). Finally, the longer the initial award period, the greater the log expected count of
publications (b=.199).
The next set of models examines the number of secondary publications. Secondary
publications are publications by researchers outside the research team. We find that secondary
publications are also related to data sharing status. Both archiving data and sharing data informally
are positively related to the log expected count of secondary publications (b=2.515 and b=2.375
respectively). Both associations are statistically significant (p<.01). Exponentiating the
coefficients, archiving data leads to 12.37 times as many secondary publications as
not sharing data. When data are shared informally (compared to not shared at all), 10.75 times the
number of secondary publications is produced.
Turning to model 2, additional covariates are added to the first model to account for
potential differences in PIs, institutions, and the grant awards. The coefficient for archiving data is
positively associated with secondary publication count in comparison to not sharing data at all, but
is smaller than in model 1 (b=1.919 in model 2 compared to b=2.515 in model 1). Exponentiating the
model 2 coefficient, archiving data leads to 6.81 times as many
secondary publications as not sharing data. Both archiving and informal sharing are positive and
statistically significant in model 2 (p<.01). Informal data sharing leads to 4.78 times as many
secondary publications as not sharing data. Thus, the effect of data sharing, formally or
informally, is not explained by differences in the PIs themselves, the awards or the institutions that
were given the awards to conduct the research. Data reuse that leads to research productivity is tied
closely to data sharing, particularly archiving data.
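The size of these ratios is easier to judge alongside their uncertainty. The sketch below computes approximate 95 percent Wald intervals from the archiving coefficients for secondary publications and the standard errors printed beside them in Table 4. Treating those standard errors as belonging to the archiving row, and using a normal approximation with z = 1.96, are assumptions of this illustration rather than part of the original analysis.

```python
import math

def rate_ratio_with_ci(b, se, z=1.96):
    """Return (ratio, lower, upper): exp(b) with a Wald interval exp(b -/+ z * se)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Archiving coefficients for secondary publications with the standard errors
# printed beside them in Table 4 (assumed here to belong to the archiving row).
secondary_archived = {"model 1": (2.515, 0.415), "model 2": (1.919, 0.443)}

for model, (b, se) in secondary_archived.items():
    ratio, lower, upper = rate_ratio_with_ci(b, se)
    print(f"{model}: ratio {ratio:.2f}, approx. 95% interval {lower:.2f} to {upper:.2f}")
```

Because the interval is built on the log scale and then exponentiated, it is asymmetric around the ratio. A lower bound above 1 is consistent with the reported statistical significance.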
The remaining covariates in model 2 have similar relationships with secondary publications
as with total publications. A few notable differences emerge. Other social science disciplines differ
from the core social science disciplines in that data collected by other social scientists have a lower
log expected count of secondary publications. The other difference is that research data collected by private
research organizations are related to a greater log expected count of secondary publications compared to research
universities (b=1.230). This reinforces the idea that scientists at private research organizations
collect data to be shared externally.
Multivariate results are also presented in Table 5.¹³
[Table 5 about here]
The first set of models shows that
primary publications (PI included as an author) are also related to data sharing status. Archiving
data is positively related to the log expected count of primary PI publications (b=.620). Informal data sharing
is also positively related to the log expected count of primary PI publications (b=.743). Both associations are
statistically significant (p<.01). Exponentiating the archiving coefficient,
archiving data leads to nearly 2 times as many primary publications as not sharing data. Adding the
additional covariates in model 2 does not explain the data sharing effects. In the last set of models
we saw private research organizations (PROs) produce data that lead to greater numbers of
secondary publications. Here, in model 2, we see that PROs produce data that lead to a lower log
expected count of primary PI publications compared to research universities (b=-.216). Also in this model, we
see that NIH data increase the log expected count of primary publications compared to NSF data.
Much like the other publication metrics, the number of publications including students is
related to data sharing status. Archiving (b=.700) and sharing data informally (b=.763) increase the
log expected count of publications including students in comparison to not sharing data. Adding the
additional covariates in model 2 does not explain data sharing differences.
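The coefficients for primary and student publications are reported mostly without their exponentiated values. A minimal sketch of one way to read them is to express each as a percent change in the expected count. The helper name is illustrative, and the interpretation assumes the coefficients are on the log scale of the expected count, as is standard for negative binomial regression.

```python
import math

def percent_change(b):
    """Percent change in the expected count implied by a log-scale coefficient."""
    return 100.0 * (math.exp(b) - 1.0)

# Coefficients as reported in the text. The first four are data sharing effects
# relative to not sharing; the last is the model 2 PRO effect on primary publications.
reported = {
    "primary, archived": 0.620,
    "primary, shared informally": 0.743,
    "student, archived": 0.700,
    "student, shared informally": 0.763,
    "primary, PRO vs. research university": -0.216,
}

for label, b in reported.items():
    print(f"{label}: {percent_change(b):+.1f}%")
```

A positive coefficient maps to a percent increase and a negative one to a percent decrease, which keeps the distinction between "times as many" and "percent more" explicit.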
The LEADS database contains valuable information about a wide range of social science
research data collected with support from the National Science Foundation and the National
Institutes of Health. NSF and NIH awards typically lead to some of the largest investigator-initiated
research activities in the U.S. and both institutions have had longstanding expectations that data
collected with public money ought to be made available to the public and/or research community. In
the social science research community, more so than in other basic disciplines, there have been
longstanding avenues for archiving and sharing data. Even with this advantage, we confirm that the
majority of social science data are not archived publicly (88.5%). Informal data sharing, though
much more common (44.6%), does not ensure that the scientific information collected with public
funding has enduring value beyond its original primary publications.
13 Dispersion differs from 0 across all outcomes and models, supporting negative binomial regression models.
Log-likelihood estimates are presented in Tables 4 & 5; standard errors appear in parentheses.
One of the central questions stemming from this disparity is whether research productivity
varies by data sharing. We find strong and consistent evidence that data sharing, both formal and
informal, increases research productivity across a wide range of publication metrics. Data
archiving, in particular, yields the greatest returns on investment with research productivity (number
of publications) being greater when data are archived. Not sharing data, either formally or
informally, severely limits the number of publications tied to research data. We hypothesized that
some of the data sharing advantage would be explained by PI characteristics and characteristics of
the institutions and grant awards. We find that although this is true, large persistent advantages in
research productivity accrue when data are shared. Finally, we also include a large number of
publication metrics to better understand how data sharing affects primary versus secondary
publications. Data sharing is related to all publication metrics, even primary PI publications.
However, data sharing has the largest effects on secondary publications as expected. Data
archiving, and informal data sharing, generate many more secondary publications than PI and
research team exclusive use.
Limitations - It is unclear whether larger numbers of primary publications lead to data
sharing or if sharing data leads to more primary publications. While both are plausible, it is likely
that the association we observe between data archiving and primary publications reflects the fact
that PIs archive data when their research is complete and all primary findings are published. That
said, we carefully selected a range of grant awards that would have been completed years ago.
Larger research projects probably lead to more publications and greater likelihood of data
sharing. While we have included a measure of grant award duration to get at some of the variability
in grant award size, a better measure of the size of the research project is amount of dollars. The
largest social science data collections cost more money to collect, are intended for public
dissemination, and have more questions that will appeal to a larger number of scientists.
Unfortunately, this information was not available for NIH awards.
Table 1. Descriptive Sample Characteristics (n=930)
PI Characteristics
Female (%) 48.1
White (%) 86.8
Faculty Status @ Award - Senior (%)
Faculty Status @ Award - Junior (%)
Faculty Status @ Award - Non-Fac (%)
Discipline - Core Social Science
Discipline - Psychology & Health
Discipline - Other
# Fed Grants in Lifetime (mean)
Institutional Characteristics
Region - Northeast (%)
Region - Midwest (%)
Region - South (%)
Region - West (%)
Carnegie-Research (%)
Carnegie-Non Research (%)
Carnegie-Uncl, PRO (%)
Carnegie-Uncl, Other (%)
Grant Characteristics
NSF Award (%)
Duration of Initial Award, Years
Data Sharing Status
Shared Formally, Archived
Shared Informally, Not Archived
Not Shared
Table 2. Bivariate Relationships: Data sharing status by PI Characteristics,
Institutional Characteristics, and Grant Award Characteristics
(n=409) p-
PI Characteristics
Female (%)
White (%) 12.0 45.5 42.5 *
Nonwhite 8.1 39.0 52.9
Age @ award (mean)
Faculty Status @ Award - Senior (%)
Faculty Status @ Award - Junior (%)
Faculty Status @ Award - Non-Fac
Discipline - Core Social Science
Discipline - Psychology & Health
Discipline - Other
# Fed Grants in Lifetime (mean)
Institutional Characteristics
Region - Northeast (%)
Region - Midwest (%)
Region - South (%)
Region - West (%)
Carnegie-Research (%)
Carnegie-Non Research (%)
Carnegie-Uncl, PRO (%)
Carnegie-Uncl, Other (%)
Grant Award Characteristics
NSF Award (%)
NIH Award
Duration of Initial Award, Years
* p<.1; ** p<.05; ***p<.01 (p-values for chi square tests)
Table 3. Bivariate results: Data Sharing Status by Publication Counts
Primary Publications (w/ PI)
Primary Publications (w/ any Research Team Member)
Secondary Publications (no Team Member)
Total Publications
Total Publications including Students
Table 4. Multivariate Results: Negative Binomial Regression Models of Publication Counts
Total # Publications, Self-Reported Total # Secondary Publications, Self-Reported
Model 1 Model 2 Model 1 Model 2
Data Sharing Status
Shared Formally, Archived 1.094 (0.123) *** 0.884 (0.128) *** 2.515 (0.415) *** 1.919 (0.443) ***
Shared Informally-Not Archived
Not Shared
PI Characteristics
Age at award
Discipline - Health and Psychology -0.254 (0.102) ** -0.977 (0.370) ***
Discipline - Other (vs. Core Soc Sci)
Institutional Characteristics
Carnegie-Non Res University
Carnegie-PRO (vs. Res Univ) 0.230 (0.113) 1.230 (0.387) ***
Grant Award Characteristics
NIH (vs. NSF) 0.075 (0.093) -0.202 (0.358)
Duration of Award, Years
log-likelihood estimates (standard errors in parentheses)
* p<.1; ** p<.05; ***p<.01
Table 5. Multivariate Results: Negative Binomial Regression Models of Publication Counts
Total # Primary Publications, Self-Reported Total # Student Publications, Self-Reported
Model 1 Model 2 Model 1 Model 2
Data Sharing Status
Shared Formally, Archived
Shared Informally-Not Archived
Not Shared
PI Characteristics
Age at award
Discipline - Health and Psychology
Discipline - Other (vs. Core Social Sci)
Institutional Characteristics
Carnegie-Non Res University
Carnegie-PRO (vs. Res Univ)
Grant Award Characteristics
NIH (vs. NSF)
Duration of Award, Years
log-likelihood estimates (standard errors in parentheses)
* p<.1; ** p<.05; ***p<.01
... sharing also influence data-sharing behaviors (Kim & Zhang, 2015). Other research has identified a variety of barriers to data sharing, including time constraints, ethical concerns, and inadequate incentive structures (Pienta et al., 2010(Pienta et al., , 2008. ...
... Informal data sharing involves sharing data upon request, placing data on a website such as, or other forms of selfdissemination (Pienta et al., 2010(Pienta et al., , 2008. Formal data sharing is the long-term preservation of data in data archives or repositories (such as, ...
... Formal data sharing is the long-term preservation of data in data archives or repositories (such as, the official portal for European data) which may be accessed by other researchers (Alter & Gonzalez, 2018;Pienta et al., 2010;Tenopir et al., 2011). Depositing data in a formal repository comes with advantages, such as secure, long-term preservation of data in formats accessible to other researchers (Van den Eynden & Corti, 2017). ...
This secondary analysis of restricted-use data from the LEADS Database examines individual and contextual factors associated with social scientists’ propensity to share data. Prior literature primarily examines simple bivariate relationships between individual and contextual factors and data sharing. This study improves on this literature by considering multiple such factors simultaneously among U.S. principal investigators in the social sciences using structural equation modeling. By examining this full set of predictors, we explained a high proportion of variance (R² = .69) in researchers’ propensity to share data. Ethics-related barriers, favorable disciplinary attitudes toward data sharing, discipline, and academic rank all significantly predict propensity to share data. Also, favorable disciplinary attitudes toward data sharing were negatively associated with both perceived time- and ethics-related barriers to data sharing. These findings, although contradictory compared to some prior research, have implications for educators and may inform efforts to promote data sharing in the social sciences.
... Experience has demonstrated that the durability of the data increases and the cost of processing and preserving the data decreases when deposits are timely. Further, archived data result in a greater number of publications and a higher profile for data producers (Pienta, 2010). ...
... Sharing data helps advance science and maximize research investment. Recent research has found that when data are shared through an archive, research productivity is enhanced and the number of publications based on the data is dramatically increased (Pienta, 2010). Experience also has shown that the durability of the data improves and the cost of processing and preservation decreases when data deposits are timely. ...
... In the study (Hajduk et al., 2019), scientists emphasize that data exchange should be simple, feasible, and accessible. Research by scientists demonstrating the benefits of applying in practice the principles of open science and the dissemination of research data should also be considered (Pienta, Alter & Lyle, 2010;Henneken, & Accomazzi, 2011;Dorch, 2012;Piwowar & Vision, 2013;McKiernan et al., 2016;Zhang, Ma, 2021). ...
Full-text available
JEL Classification: I23, I25, I28, O1 Purpose: To determine the current state of development of open science in the paradigm of open research data in Ukraine and the world, as well as to analyze the representation of Ukraine in the world research space, in terms of research data exchange. Design / Method / Research Approach: Methods of synthesis, logical and comparative analysis used to determine the dynamics of the number of research data journals and data files in the world, as well as to quantify the share of research data repositories in Ukraine and the world. Trend and bibliometric analysis were used to determine the share of publications with their open primary data; analysis of their thematic structures; identification of the main scientific clusters of such publications; research of geographic indicators and share of publications by research institutions. Findings: The study found a tendency to increase both the number of data logs and data files in Dryad (open data repository). The results of the analysis of the share of data repositories indexed in re3data (register of research data repositories) show that 51% of the total number are repositories of data from European countries, with Germany leading with 460 repositories, followed by the United Kingdom (302 repositories) and France (116 repositories). Ukraine has only 2 data repositories indexed in re3data. The trend of relevance of data exchange is confirmed by the increase of publications with datasets for the last 10 years (2011-2020) in 5 times. Research institutions and universities are the main sources of research data, which are mainly focused on the fields of knowledge in chemistry (23.3%); biochemistry, genetics and molecular biology (13.8%); medicine (12.9%). An analysis of the latest thematic groups formed on the basis of publications with datasets shows that there is a significant correlation between publications with open source data and COVID-19 studies. 
More than 50% of publications with datasets both in Ukraine and around the world are aimed at achieving the goal of SDG 3 Good Health. Theoretical Implications: It is substantiated that in Ukraine there is a need to implement specific tactical and strategic plans for open science and open access to research data. Practical Implications: The results of the study can be used to support decision-making in the management of research data at the macro and micro levels. Future Research: It should be noted that the righteous bibliometric analysis of the state of the dissemination of data underlying the research results did not include the assessment of quality indicators and compliance with the FAIR principles, because accessibility and reusability are fundamental components of open science, which may be an area for further research. Moreover, it is advisable to investigate the degree of influence of the disclosure of the data underlying the research result on economic indicators, as well as indicators of ratings of higher education, etc. Research Limitations: Since publications with datasets in Scopus-indexed journals became the information base of the analysis for our study, it can be assumed that the dataset did not include publications with datasets published in editions that the Scopus bibliographic database does not cover. Мета дослідження: Дослідження має на меті визначити сучасний стан розвитку відкритої науки в парадигмі відкритих даних досліджень в Україні та світі, а також проаналізувати представлення України у світовому дослідницькому просторі, в частині обміну даними досліджень. Дизайн / метод / підхід дослідження: Методи синтезу, логічного та порівняльного аналізу, використані з метою визначення динаміки кількості журналів даних досліджень та файлів даних в світі, а також для здійснення кількісної оцінки частки репозитаріїв даних досліджень в Україні та світі. 
Трендовий та бібліометричний аналіз використано для визначення частки публікацій з їх відкритими первинними даними; аналізу їх тематичних структур; визначення основних наукових кластерів таких публікацій; дослідження географічних показників та частки публікацій за дослідницькими установами. Результати дослідження: Дослідження виявило тенденцію до зростання як кількості журналів даних, так і файлів даних в Dryad (репозитарій відкритих даних досліджень). Результати аналізу частки репозитаріїв даних індексованих в re3data (реєстр сховищ даних досліджень) показують, що 51% від загальної кількості складають репозитарії даних країн Європи, причому Німеччина лідирує з 460 репозитаріями, за якою йдуть Великобританія (302 репозитарії) та Франція (116 репозитаріїв). Україна має лише 2 репозитарії даних, індексованих в re3data. Тенденція актуальності обміну даними підтверджується збільшенням публікацій з наборами даних за останні 10 років (2011-2020) у 5 разів. Науково-дослідні установи та університети є основними джерелами даних досліджень, які переважно зосереджені на галузях знань з хімії (23.3%); біохімії, генетики та молекулярної біології (13.8%); медицини (12.9%). Аналіз найновіших тематичних груп, що були сформовані на основі публікацій з наборами даних показує, що існує значна кореляція між публікаціями з відкритими первинними даними та дослідженнями COVID-19. Більше 50% публікацій з наборами даних як в Україні, так і світі спрямовано на забезпечення цілі SDG 3 «Міцне здоров'я». Теоретичне значення дослідження: Обґрунтовано, що в Україні виникає необхідність впровадження конкретних тактичних й стратегічних планів щодо відкритої науки та відкритого доступу до даних досліджень. Практичне значення дослідження: Результати дослідження можуть бути використанні для підтримки прийняття рішень в управлінні даними досліджень на макро та мікрорівні. 
Перспективи подальших досліджень: Варто зазначити, що праведний бібліометричний аналіз стану поширення даних, які лежать в основі результатів досліджень, не включав оцінювання показників якості та відповідність FAIR принципам, адже доступність та можливість повторного використання є основоположними складовими відкритої науки, що може бути напрямком подальших досліджень. Крім того, також доцільно дослідити ступінь впливу оприлюднення даних, які лежать в основі результату дослідження на економічні показники, а також показники рейтингів закладів вищої освіти тощо. Обмеження дослідження: Оскільки інформаційною базою аналізу для нашого дослідження стали публікації з наборами даних в журналах, що індексуються базою Scopus, можна передбачити, що у вибірку даних не увійшли публікації з наборами даних, що були опубліковані у виданнях, які бібліографічна база Scopus не охоплює. Тип статті: Теоретичний Ключові слова: відкрита наука, управління даними досліджень, обмін даними, репозитарії даних досліджень, публікації з наборами даних.
... The theoretical results from Sections "The Effect of Different Possible Levels of k" and "Costs of Preparing Open Data" both predict an overall correlation between the adoption of open data and research quality. This fits well with the existing empirical literature showing that papers published under open data have higher citation counts than those without (Pienta et al., 2010; Marwick and Birch, 2018). The results in these papers are correlational, and it is conceivable that the open data itself increased citation count by encouraging others to build on the published results; indeed, that is the preferred interpretation in the literature. ...
Open data, the practice of making available to the research community the underlying data and analysis codes used to generate scientific results, facilitates verification of published results, and should thereby reduce the expected benefit (and hence the incidence) of p-hacking and other forms of academic dishonesty. This paper presents a simple signaling model of how this might work in the presence of two kinds of cost. First, reducing the cost of “checking the math” increases verification and reduces falsification. Cases where the author can choose a high or low verification-cost regime (that is, open or closed data) result in unraveling; not all authors choose the low-cost route, but the best do. The second kind of cost is the cost to authors of preparing open data. Introducing these costs results in high- and low-quality results being published in both open and closed data regimes, but even when the costs are independent of research quality, open data is favored by high-quality results in equilibrium. A final contribution of the model is a measure of “science welfare” that calculates the ex-post distortion of equilibrium beliefs about the quality of published results, and shows that open data will always improve the aggregate state of knowledge.
... An analysis of more than 10,000 studies in the life sciences found that studies with the underlying data available in public repositories received 9% more citations than similar studies for which data were not available (Piwowar & Vision 2013). Research has also indicated that sharing data leads to more research outputs using the data set than when the data set is not shared (Pienta et al. 2010). Open code is also associated with studies being more likely to be cited than those that do not share their code: in a study of all papers published in IEEE Transactions on Image Processing between 2004 and 2006, the median number of citations for papers with code available online was three times higher than those without (Vandewalle 2012). ...
Energy use is of crucial importance for the global challenge of climate change, and also is an essential part of daily life. Hence, research on energy needs to be robust and valid. Other scientific disciplines have experienced a reproducibility crisis, i.e. existing findings could not be reproduced in new studies. The ‘TReQ’ approach is recommended to improve research practices in the energy field and arrive at greater transparency, reproducibility and quality. A highly adaptable suite of tools is presented that can be applied to energy research approaches across this multidisciplinary and fast-changing field. In particular, the following tools are introduced – preregistration of studies, making data and code publicly available, using preprints, and employing reporting guidelines – to heighten the standard of research practices within the energy field. The wider adoption of these tools can facilitate greater trust in the findings of research used to inform evidence-based policy and practice in the energy field. PRACTICE RELEVANCE Concrete suggestions are provided for how and when to use preregistration, open data and code, preprints, and reporting guidelines, offering practical guidance for energy researchers for improving the TReQ of their research. The paper shows how employing tools around these concepts at appropriate stages of the research process can assure end-users of the research that good practices were followed. This will not only increase trust in research findings but also can deliver other co-benefits for researchers, e.g. more efficient processes and a more collaborative and open research culture. Increased TReQ can help remove barriers to accessing research both within and outside of academia, improving the visibility and impact of research findings. Finally, a checklist is presented that can be added to publications to show how the tools were used.
... Papers with publicly available datasets receive a higher number of citations than similar studies without available data (Piwowar et al. 2007, Piwowar & Vision 2013, Henneken & Accomazzi 2011, Dorch 2012, Sears 2011, Gleditsch & Strand 2003). In addition to increased citations for data sharing, Pienta et al. (2010) found that data sharing is associated with higher publication productivity. They examined 7,040 NSF and NIH awards and concluded that the typical research grant award produces a median of five publications, but when data are archived a research grant award leads to a median of ten publications. ...
Computers are a central tool in the research process, enabling complex and large scale data analysis. As computer-based research has increased in complexity, so have the challenges of ensuring that this research is reproducible. To address this challenge, we review the concept of the research compendium as a solution for providing a standard and easily recognisable way for organising the digital materials of a research project to enable other researchers to inspect, reproduce, and extend the research. We investigate how the structure and tooling of software packages of the R programming language are being used to produce research compendia in a variety of disciplines. We also describe how software engineering tools and services are being used by researchers to streamline working with research compendia. Using real-world examples, we show how researchers can improve the reproducibility of their work using research compendia based on R packages and related tools.
Aim/purpose The present investigation was undertaken to establish the current position, including the policies, practices, issues, and concerns of librarians, in collecting, processing, and archiving research data in public university libraries in Jordan. Background Research is the bastion of a society's scientific growth and development, and data are its chief ingredient, communicating the results and inferences of the research. The research process is directly interrelated with the data life cycle, and the two cannot be separated. Given the importance of curating and managing research data and the limited policy interest of funding agencies and governments, the current study was undertaken to explore the status and practices of Research Data Management (RDM) in Jordan. Methodology The study used an online questionnaire to gather the opinions of librarians and heads of libraries in public universities and to document the prevailing state of affairs. Contribution The study is an eye-opener for the higher education commission in Jordan in designing policies and suitable guidelines for streamlining the RDM system in the country, especially in public and private universities. It can be extended to explore RDM practices and policies in other, similar Middle Eastern countries to determine the issues and challenges in implementing RDM policies in their research institutes. Findings The study revealed that Jordan lags in RDM policies on all fronts (government, institutional, and funding agency), which hinders the progress and growth of RDM in university libraries. However, despite these obstacles and challenges, the Jordanian Public University Libraries (JPUL) are providing RDM services, albeit on a limited scale: curating only some valuable research data, preserving it, and providing access to preferred users through online platforms that substitute for data repositories.
There is a lack of national and international RDM policies, and implementation is ineffective. Coupled with this is a wide gap between the required and available skills among library staff, which can be filled by conducting suitable training programs and workshops on a continuous basis. Further improvement in providing effective RDM services is needed among the JPUL, and the same may be true in other Middle Eastern countries; this includes developing infrastructure, framing policy, designing data repositories, and offering specialized courses on data archiving, data cataloguing, data sharing, etc., while protecting intellectual property rights. Future research Similar studies can be undertaken to identify existing policies and the loopholes in implementing national and international data management policies in different research institutes. A more important area would be to examine the role of funding agencies and the framing of suitable policies for the curation and management of data from their funded projects. Research implications The present study is distinctive in that few studies of Jordanian universities give a comprehensive account of the processes and policies adopted by libraries for collecting and archiving research data and making it accessible. The study can help initiate new policies on data management tools and techniques in Jordan, like those seen in some well-developed countries, where many courses on research data management and its implementation are already running in universities.
Relevance of the present study The results of the present study can be used by other libraries, especially in designing and developing research data management strategies that can help improve data reuse and sharing in Jordanian universities for secondary analysis. It can also help refocus the research community on data and its immense value. In addition, it can help research funding agencies and the government make policy decisions about requiring reports to be submitted along with the data, saving much of the human effort and resources needed to collect and collate data for future research. The study will be helpful in reducing the financial burden on the research projects of different funding agencies, which otherwise spend heavily on data collection, surveys, questionnaire design, pilot studies, interviews, etc.
The relevance of open research data is already acknowledged in many disciplines. Demanded by publishers, funders, and research institutions, the number of published research data increases every day. In learning analytics though, it seems that data are not sufficiently published and re-used. This chapter discusses some of the progress that the learning analytics community has made in shifting towards open practices, and it addresses the barriers that researchers in this discipline have to face. As an introduction, the movement and the term open science is explained. The importance of its principles is demonstrated before the main focus is put on open data. The main emphasis though lies in the question, Why are the advantages of publishing research data not capitalized on in the field of learning analytics? What are the barriers? The authors evaluate them, investigate their causes, and consider some potential ways for development in the future in the form of a toolkit and guidelines.
Technical Report
The report covers the DA(ta) SH(aring) - 2020 survey study, which dealt with issues of data sharing, data management, and data (re-)use in the Austrian social sciences. The German-language report is a comprehensive overview of the empirical results of the study.
ICPSR has created a database to document information about the thousands of social science studies that have been conducted over the last 40 years. Included in the database are descriptions of social science data collections funded by the National Science Foundation and the National Institutes of Health. These records are supplemented with additional information gathered through correspondence with principal investigators of those awards with the goal of gathering information about the public availability of any research data collected with grant support. The goal of this paper is to describe the LEADS database and provide results regarding the scope of social science research data that are "at risk" of being lost. In the social science research community there have been longstanding expectations and mechanisms for archiving and sharing data. Even with this expectation, analysis of the LEADS database shows that the majority (nearly 75%) of researcher-initiated social science research data is not archived publicly. Further, we find that a substantial minority have been lost.
A general overview of the use of secondary data in teaching sociology is presented, pulling together previous information about teaching with secondary data and laying out the potential for additional applications. After describing what constitutes secondary data, reasons for using secondary data in the classroom, where it can be used, ways of using it, sources of data, information about computing, and potential problems are discussed, and appendices listing sources of data and data sets designed for teaching are included. It is hoped that sociologists will be encouraged to at least dabble in this teaching technique. If they feel it is successful, then they can use the information presented here to aid them in expanding secondary data use.
Identifies 2 important dimensions of data sharing: the degree to which the primary investigators may determine whether they will share their data (i.e., whether they are free to refuse data requests) and the reason for which the data set is being requested. The negative aspects of viewing data sharing as an obligation include increased burden on the primary investigator, lack of incentive to share data, loss of control over the use of data, and negative effects on scientific progress. Recommendations regarding data sharing policies are made in light of these negative effects. (PsycINFO Database Record (c) 2014 APA, all rights reserved)
This paper examines the interrelationship between scientific productivity and academic position, two key dimensions of the scientific career. Contrary to the results of most earlier studies, the effect of departmental location on productivity is found to be strong, whereas the effect of productivity on the allocation of positions is found to be weak. Productivity, as indicated by measures of publications and citations, is shown to have an insignificant effect on both the prestige of a scientist's initial academic appointment and on the outcome of institution changes later in the career. Although the relationship between productivity and the prestige of an academic appointment is insignificant at the time a position is obtained, the effect of departmental prestige on productivity increases steadily with time. For those scientists who change institutions, the prestige of the new department significantly affects changes in a scientist's productivity after the move. It is argued that past studies have obtained spurious results due to their failure to employ a longitudinal design. Not only do cross-sectional designs provide misleading results regarding the interrelationship between departmental location and productivity, but they also systematically alter the findings regarding the effects of sponsorship and doctoral training on productivity.
The credibility of quantitative social science benefits from policies that increase confidence that results reported by one researcher can be verified by others. Concerns about replicability have increased as the scale and sophistication of analyses increase the possible dependence of results on subtle analytic decisions and decrease the extent to which published articles contain full descriptions of methods. The author argues that sociology should adopt standards regarding replication that minimize its conceptualization as an ethical and individualistic matter and advocates for a policy in which authors use independent online archives to deposit the maximum possible information for replicating published results at the time of publication and are explicit about the conditions of availability for any necessary materials that are not provided. The author responds to several objections that might be raised to increasing the transparency of quantitative sociology in this way and offers a candidate replication policy for sociology.
Important new developments have strengthened the standing of the social sciences in the federal government. Historical analysis emphasizes the recency of the government's recognition of the national contributions of social science research. Significant progress has been made despite critical fluctuations. Five factors contributing to the more favored governmental position of social science research are (1) changing congressional attitudes; (2) acceptance of the social sciences at the White House level; (3) inclusion of the social sciences as part of broad definitions of scientific disciplines; (4) the general post-Sputnik interest in American education; and (5) the concern with redressing imbalances in American higher education. Research support for the social sciences is growing, but a critical shortage remains in funds for fellowships and assistantships. The social sciences approach the next decade in a climate of acceptance and encouragement.
Argues that the study of research productivity by G. S. Howard et al (see record 1988-09385-001) replicated the failure of W. M. Cox and V. Catt (see record 1978-21651-001) to use a representative sample by selecting only in-house American Psychological Association (APA) journals and ignoring some journals published by specific APA divisions. (PsycINFO Database Record (c) 2012 APA, all rights reserved)