How do you organize your experimental data?

Science is all about data, but we scientists (at least biologists) are not data specialists. I spend so much time organizing data that I wonder whether we could use Ruby on Rails-like approaches to organize, analyze, and visualize it.

I want to handle my data in a simpler way. I guess I could use an MVC, DRY, or RESTful approach. That way, other people could understand my data more easily. Does anyone use Rails in research?

16 Replies
  • Pablo de Castro

    To bring a different angle to the discussion, let me express my own regrets about how hard it is for research data management project teams (based at the University Research Office, Library, etc.) to reach researchers and to disseminate research data management (RDM) initiatives and opportunities at an institutional level (meaning a university or research centre).

    There is a large number of ongoing RDM projects at institutional, national, international and discipline-based levels that seem unable to raise awareness of the need for systematic research data management by research groups everywhere. The technical infrastructure is increasingly available (see LabArchives, http://www.labarchives.com/ to add just one more example to those already mentioned), the policies are often there as well (for instance the NSF data sharing mandate in the US, see http://blog.datadryad.org/2010/11/15/nsf-policy-on-dissemination-and-sharing-of-research-results/), and the international networks are already collaborating to harmonise the different national approaches (see the "Research Data Management: Activities and Challenges" workshop held last November in Bonn by the Knowledge Exchange, http://www.knowledge-exchange.info/Default.aspx?ID=475, with attached report). So everything seems ready for a broad movement towards accountable research data management to flourish. But we need researchers to be aware of those efforts!

    Maybe the publishers are actually the most appropriate stakeholders to look to for getting these initiatives widely disseminated. Some good news came out last week in this regard, see "LabArchives and BioMed Central: a new platform for publishing scientific data" at http://blogs.openaccesscentral.com/blogs/bmcblog/entry/labarchives_and_biomed_central_a

    Apr 12, 2012
  • Roi Paz

    It has already been done: an ELN that includes a file system and lets you import, export, and edit all types of files online.
    It is very simple to use, with unlimited space for free. Try the www.sparklix.com e-Notebook - you'll be surprised how easy managing scientific data is made.
    Enjoy

    Apr 8, 2012
  • M Fabiana Kubke

    @ Constantinescu I use VUE quite a bit (and love it), but while it is good for some things, I am not sure I would describe it as a good data management system.

    Mar 30, 2012
  • M Fabiana Kubke

    I don't use it (yet) so not sure how useful this is - but I am about to begin to explore Data Finder (http://www.andreas-schreiber.net/2008/07/datafinder-a-python-application-for-scientific-data-management/) mainly because I am helping a set of programmers that are trying to create an elab notebook that integrates with the file management system.

    Mar 30, 2012
  • Constantinescu Nicolaie

    Please take a look at http://vue.tufts.edu/ and you will be in for a very nice surprise, as it has the means and the tools to manage even data sets visually. And believe it or not, it does the job pretty darn well. It links well with repository solutions and, more than that, do not forget that it does all this visually.

    And, yes it's free!

    Mar 1, 2012
  • Nikolay Chekanov

    The problem with software development tools like git is that they are intended for plain text, which is what code is. If your manuals are compiled from TeX or something like that, then why not.
    Also, a friend suggested that I push "launch codes" (scripts and snippets for starting programs, with all parameters, for any given stage of research) to a version control system, so they can be quickly found and reused later (bash history is not a reliable friend for that). Anyone who has submitted an article to a journal and been asked to redo some parts several months later will understand my point.
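
    A rough sketch of the "launch codes" idea in Python. The helper name, the `launch` folder and the file naming scheme are assumptions made for illustration; the only point is that the full command line ends up in a plain-text file that a version control system can track.

```python
import shlex
from datetime import date
from pathlib import Path


def save_launch_script(stage, command, out_dir="launch", run_date=None):
    """Write the exact command used for one research stage to a shell script.

    `stage`, `out_dir` and the naming scheme are hypothetical choices,
    not something prescribed in the thread.
    """
    run_date = run_date or date.today()
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    script = out_path / f"{run_date.isoformat()}_{stage}.sh"
    # shlex.join quotes each argument so the line can be pasted back into a shell.
    script.write_text("#!/bin/sh\n" + shlex.join(command) + "\n")
    return script
```

    For example, `save_launch_script("alignment", ["aligner", "--threads", "8", "sample 1.fastq"])` (with `aligner` standing in for whatever program is actually run) writes a dated script that can then be added and committed like any other source file.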

    Feb 25, 2012
  • Taro Kiritani

    Thanks a lot for your input! I was not aware of the tools you mentioned, and they might be what I have been looking for. I imagined that our research and data are so diverse that it might be difficult to make a generic tool. But it is nice to know that people have already implemented some software that could make my life easier.

    -Nikolay, thanks for sharing your thoughts. I have been discussing that kind of stuff with my friends, but it was vague to me. Thanks for making it explicit. I can imagine how hard you have been working to manage people's data. Another way to share data effectively might be to use GitHub. Open source and distributed science? We could share and edit manuals, for example?

    Feb 25, 2012
  • Nikolay Chekanov

    Good question. It seems that everyone understands it in a different sense, though. My reading is the most abstract one.

    The problem is that while software with good built-in data management should exist, you rarely get the chance to use it alone. Instead you need different tools, especially in the beginning, when you try them all and choose the one with the best results, features and/or usability, or in complex cases where merging different data types is required. In bioinformatics, for example, one might need to compare gene expression and DNA methylation datasets, the former obtained with microarrays and the latter with sequencing.
    Another issue lies in sharing. Primarily it's about working together with your colleagues. Imagine someone taking a long vacation, and you have to use his results. Try to find these amongst folders named "exp1", "2", "2revised", "tmp_280211" and "2010-09-14". He may have written about it in his lab journal, or he may not. Not all of us are bureaucrats, you know.
    Yet there's another one: the reference data, which are static by nature. They are the most shared (bioinformatics again: you need a reference genome for most types of sequencing experiments), yet they come from external sources and in many varieties. So, if you haven't organized them well already, be prepared for every colleague of yours to download his own copy and hide it deep in his "tmpfolder4" structures. The same applies to programs: users are often not patient enough to wait for the sysadmin to install yet another small tool (and such tools tend to be upgraded very rapidly). So again, the lack of a unified common userland leads to feudalism.

    I work in such a multi-user, multi-data environment. In fact, I'm the only one who organizes it. Here are my thoughts:
    - No writing in the root, ever. That applies to the top level of the shared storage, of your own home, of a project folder, etc. The root should be very simple and stable. We have these folders: userland-programs, reference, raw data, and "Projects". Programs is a little /usr where everyone can install compiled stuff; it has its own bin, lib and include directories, bin is added to PATH, and so on. The Projects folder should be hierarchically organized too: if you have three or more experiments that are similar in nature yet independent (or vice versa), place them in a separate folder in Projects. A temp folder is a good idea too, though there should be several of them, one for each project and user. If you write code, you'll obviously need a repository with version control for it.
    - Encourage a unified naming and placement policy. For example, there might be short independent projects; we name them with a date and the name of the person who needs them (often not the one actually working on them). Within a project I prefer to make a simply named folder for each stage (e.g. Alignment, SNP calling, SNP comparison), inside of which there may be a series of uniquely but uniformly named experiments. For my purposes the date alone is not sufficient, so I follow it with the name of the program used and its major parameters, or with distinctive sample features.
    - You'll definitely need a metadata archive to record details about objects. That poses the problem of laziness, as the archive usually lives in a different reality from the file system: on a lab intranet website or in your lab journal (the latter is not recommended). We use the Redmine issue tracking system to monitor everyone's tasks; it also has a wiki and a good document management plugin (dmsf). Information about the disk contents goes into the wiki, while dmsf is good for storing articles, books, manuals, etc. When a project has a lot of similar lengthy stages, these stages can be represented as issues in Redmine, so overall project progress can be clearly monitored. Jaime Alexander here proposed mind maps; I think they would be a good substitute for the wiki in some cases. Another approach is iRODS, which works with metadata at a low level, close to the file system. That is good, but it might take a considerable amount of time and brains to understand and deploy it.
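
    A minimal sketch of the layout and naming rules from the list above, in Python. The function names, the `metadata.json` sidecar file and the example names are assumptions for illustration; the setup described in this post keeps its metadata in a Redmine wiki, not in files next to the data.

```python
import json
from datetime import date
from pathlib import Path

# Top-level folders named in the post; everything else here is illustrative.
TOP_LEVEL = ["userland-programs", "reference", "raw data", "Projects"]
PROGRAM_SUBDIRS = ["bin", "lib", "include"]


def create_root(root):
    """Create the small, stable root layout and the /usr-like programs tree."""
    root = Path(root)
    for name in TOP_LEVEL:
        (root / name).mkdir(parents=True, exist_ok=True)
    for sub in PROGRAM_SUBDIRS:
        (root / "userland-programs" / sub).mkdir(exist_ok=True)
    return root


def experiment_name(run_date, program, params):
    """Build a uniform folder name: date, then program, then major parameters."""
    parts = [run_date.isoformat(), program]
    parts += [f"{key}{value}" for key, value in sorted(params.items())]
    return "_".join(parts)


def create_experiment(root, project, stage, run_date, program, params, note=""):
    """Create Projects/<project>/<stage>/<experiment>/ with a metadata sidecar."""
    exp_dir = (Path(root) / "Projects" / project / stage
               / experiment_name(run_date, program, params))
    exp_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "date": run_date.isoformat(),
        "program": program,
        "params": params,
        "note": note,
    }
    # Hypothetical sidecar file, swapped in here for the wiki used in the post.
    (exp_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return exp_dir
```

    Generating the folder names from code, instead of typing them by hand, is one way to keep them "uniquely but uniformly named"; a sidecar file also keeps at least some metadata in the same place as the data it describes.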

    I didn't cover many related topics, such as how to organize databases or backups, but these are the most general and basic guidelines. If anyone has additional opinions on the topic in that sense, you're welcome to share them.

    Feb 25, 2012
  • Jaime Alexander Cuellar

    I'm a sociologist, and in the social sciences we usually face problems with different kinds of data (qualitative/quantitative). I used ATLAS-ti long ago, but dropped it because it is expensive and a little time-consuming... not suited to quick tasks, but useful and light. The learning curve is long if you want to get the very best out of it.
    Now I use a combination of my own folder structure and file naming, plus open or free software to visualize the information structure: Cmaptools (http://cmap.ihmc.us/) and its ontology management version COE (http://www.ihmc.us/groups/coe/); or now I'm using Docear (http://www.docear.org/), which automatically creates a mind map of the folder structure with links to files... and even to bookmarks if the PDF file has a table of contents! (well, if it is not protected).
    Recently I found a not-so-costly option for data organization and analysis as an online service: Dedoose (http://www.dedoose.com/).
    I hope this helps.

    Feb 24, 2012
  • Fabio Porto

    I suggest using scientific workflow languages/engines to run your experiments, preferably one with provenance management. VisTrails, Taverna and Chiron are examples of these.

    Feb 24, 2012
  • Clinton Thompson

    I'm not a biologist and haven't used any of the approaches you mentioned, so I can't comment on them, although data organization, analysis, and presentation apply to almost all of us. With that said, I found J. Scott Long's "The Workflow of Data Analysis Using Stata" to be a fantastic resource in that it explicitly outlines and describes what an efficient data workflow looks like and how to achieve it. The book is, of course, geared toward Stata users, but I suspect the principles would be helpful no matter what program you use.

    Feb 24, 2012
  • Taro Kiritani

    I just suspected that someone had already created an efficient way to manage scientific data. I do have an SQL database for my experiments, but it has become a bit cluttered and consumes a lot of time and energy.

    Thanks for your insights. I was looking for a programming tool, but ATLAS sounds interesting. I hope it will simplify my data handling. I was also interested in R, but my problem is in organization of data rather than in analysis.

    Feb 24, 2012
  • Elena Mihailova

    Thank you very much! I will try to download the ATLAS program if I find it.

    Feb 24, 2012
  • John Gómez

    I didn't know MVC, DRY or the other approaches you mention. If you need software, maybe R: it is a free and powerful tool to handle and analyze data.

    Feb 23, 2012