by Stephen Kiyoi, Second Year NLM Associate Fellow
NN/LM Pacific Southwest Region
UCLA Louise M. Darling Biomedical Library
On January 18, 2012, the UCLA Louise M. Darling Biomedical Library hosted Carly Strasser, Ph.D., for an enlightening talk on data management in scientific research. Carly’s work at DataONE and the California Digital Library (CDL), together with her doctoral education in Biological Oceanography at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, gives her a unique perspective on the challenges of scientific data management. In addition to explaining her own work, Carly gave an excellent background on the cultural and technical issues in data archiving, sharing, and publication. The following is a detailed summary of Carly’s talk. Her presentation is also accessible on SlideShare.
The Mistakes Scientists Make
Carly’s experience and research on data management indicate that many scientists are not familiar with metadata, data centers, or repositories, and do not generally archive or share their data. However, new funding agency requirements mandating specific data management plans will encourage movement towards these best practices. The National Science Foundation (NSF) has led this effort by requiring funded researchers to create a data management plan. Elements of this plan include: (1) types of data; (2) metadata and data standards; (3) policies for access and sharing; (4) provisions for reuse; and (5) plans for archiving. Enforcement and specificity of these plans, while currently lax, will likely increase over time. Carly encourages the research community to factor data management costs into their grant applications.
Barriers to Best Practice
Data management involves large investments of money, time, and training. Scientists fear loss of rights or benefits: their research ideas may be “scooped,” their data may be used inappropriately, and other scientists may contradict their original findings through new analyses. Scientists also lack incentives, since there is currently no commonly accepted standard for citing and rewarding scientists who share their data. Carly’s research has indicated that although many professors see data management as generally important, few cover data management principles in their undergraduate courses, which raises the question of where students will learn about data management.
Best Practices for Scientists
Carly emphasized the importance of data organization, which involves creating unique identifiers early in the research process, standardizing fields in spreadsheets and databases, using descriptive file names, and archiving an original copy of raw data before performing analysis. She also emphasized the importance of good data collection, including defining and enforcing standards, having two people enter the same dataset to reduce human error, and minimizing manual data entry whenever possible.
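The double-entry practice mentioned above can be sketched in a few lines of Python. This is only an illustration and not something shown in Carly’s talk: it assumes each person’s entry is available as CSV text, that both copies share a unique identifier column (here hypothetically named sample_id), and that the function name find_entry_mismatches and the sample values are made up for the example.

```python
import csv
import io


def find_entry_mismatches(entry_a, entry_b, key="sample_id"):
    """Compare two independently typed copies of the same dataset.

    Both arguments are CSV text with a header row. Returns a list of
    (identifier, field, value_in_a, value_in_b) tuples, one per disagreement.
    """
    reader_a = csv.DictReader(io.StringIO(entry_a))
    reader_b = csv.DictReader(io.StringIO(entry_b))
    fields = reader_a.fieldnames or []
    rows_a = {row[key]: row for row in reader_a}
    rows_b = {row[key]: row for row in reader_b}

    mismatches = []
    # Walk every identifier seen in either copy, in a stable order.
    for record_id in sorted(rows_a.keys() | rows_b.keys()):
        a = rows_a.get(record_id)
        b = rows_b.get(record_id)
        if a is None or b is None:
            # The record was typed in by only one of the two people.
            mismatches.append((
                record_id,
                "<record>",
                "present" if a is not None else "missing",
                "present" if b is not None else "missing",
            ))
            continue
        for field in fields:
            value_a = (a.get(field) or "").strip()
            value_b = (b.get(field) or "").strip()
            if value_a != value_b:
                mismatches.append((record_id, field, value_a, value_b))
    return mismatches


# Hypothetical data: the second person transposed two digits in S002.
entry_a = "sample_id,site,temp_c\nS001,reef,21.4\nS002,lagoon,19.8\n"
entry_b = "sample_id,site,temp_c\nS001,reef,21.4\nS002,lagoon,18.9\n"

for mismatch in find_entry_mismatches(entry_a, entry_b):
    print(mismatch)
```

Each reported mismatch would then be checked against the original field notes or instrument output, and the untouched raw files would stay archived separately from the corrected copy.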
Several software packages have been designed to help create metadata (Morpho, Metavist, NOAA MERMaid) and workflows (Kepler), but Carly noted that these tools, while powerful, are very difficult to use. Carly hopes that as the usability of these tools improves, more scientists will use them to make data management more powerful and automated. Carly envisions a future in which scientists know where and how to archive and share their data, and regularly cite the datasets they consult.
Digital Curation for Excel (DCXL) Project
Carly is currently the project manager for DCXL, an Excel plugin that will enable scientists to share, publish, and archive their data. She has conducted numerous surveys and interviews with scientists, librarians, and repository managers to determine the features to be included in the DCXL tool. When it is released, the tool will enable compatibility between systems, automatically generate metadata, facilitate manual metadata creation, and assist with depositing data into relevant repositories. In its first iteration, the DCXL tool will focus on atmospheric, ecological, hydrological, and oceanographic data.
Lastly, Carly highlighted several data management services already offered by the University of California Curation Center (UC3) at the CDL. UC3 has created Merritt, a data repository; EZID, for persistent identification of datasets; and the DMPTool, which guides researchers through the creation of ready-to-use data management plans required by specific funding agencies. Carly has also collaborated on a short, twenty-page primer on data management and why it is important. Please visit Carly’s website to learn more about her work!