SEA Currents January 23rd, 2019




Reflections on Big Data in Healthcare: Exploring Emerging Roles

Posted on May 3rd, 2018. Posted in: Data Science

Written by: Paul Levett, Reference and Instructional Librarian, Himmelfarb Health Sciences Library, George Washington University, Washington, DC

Do you think health sciences librarians should get involved with big data in healthcare?

Of the four V’s of big data described in Cognitive Class (n.d.), velocity, volume, variety, and value, it is value that brings medical librarians into the discussion, because we add value to unstructured data: we bring order to chaos! Traditionally librarians have done this by creating metadata about learning objects, e.g. cataloging, finding aids, and infographics. However, data mining, cleaning, analysis, and visualization require computer programming, mathematics, and statistics skills that are not part of library school MLS programs.

Burton & Lyon (2017) point to a technical skills gap that prevents librarians from contributing to big data initiatives. They promote the NCSU Data Science and Visualization Institute and Library Carpentry workshops as ways to gain knowledge and the opportunity to practice. But the NCSU Data Science and Visualization Institute lasts just one week, nowhere near enough time to develop and practice computer programming, math, and statistics skills. Library Carpentry workshops are typically one-off instructional sessions that offer even less time, although I appreciate that the course material is available online at http://librarycarpentry.github.io/.

On the question of whether librarians should be doing data science, one can argue that data science skills touch on all the domains identified by Drummond et al. (2015, Fig. 3, p. 15) in the national librarian education needs assessment. Were I invited to suggest a program for developing the skills needed to work with big data in health care information systems, I would suggest something like the MSc Data Analytics program in the University of Sheffield Department of Computer Science, which provides opportunities to study R and Python programming and statistical analysis, and to apply those skills to a real-world project over a one-year timeframe. Students on this program apply advanced mathematics skills, which is why the program requires an undergraduate degree in mathematics, economics, accounting, physics, chemistry, or engineering.

This suggests a need for a specialist data scientist role, but I am not convinced the library is the best home for that role. Simmons College (2017) recently surveyed 1117 graduates of its MLS program about core librarian professional skills and knowledge. Nobody rated data science as a core or a specialized skill; 14 mentioned statistics/working with data, and only 6 mentioned data science/curation/management. As recently as last November, at the IMLS (2017) meeting on positioning MLS programs for the 21st century, there was a lot of discussion about increasing the diversity of the profession but only one mention of data curation.

Tsakalos (2017) described data wrangling as “the process of importing, cleaning, and transforming raw data into actionable information for analysis. It is a time-consuming process that is estimated to take about 60-80% of analysts’ time.” I feel the current push for librarians to develop data wrangling skills is perilously close to an admission from data analysts that they want to offload what appears to be an onerous burden. This role would better fit someone working in a university department of Computer Science, Mathematics, Statistics, or Epidemiology and Biostatistics. It’s critical for librarians to set the expectation that the library is not a raw data processing warehouse but a knowledge repository.
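To make the three steps in Tsakalos’s definition concrete, here is a minimal Python sketch of importing, cleaning, and transforming a small data set. The column names, the sample values, and the `wrangle` function are all invented for illustration; they do not come from Tsakalos (2017) or from any real system.

```python
import csv
import io

# Hypothetical raw export: columns and values are invented for illustration.
RAW = """patient_id,visit_date,weight_kg
 001 ,2018-01-05,70.5
002,2018-01-06,
003,2018-01-07,sixty
001,2018-01-05,70.5
"""


def wrangle(raw_text):
    """Import, clean, and transform raw CSV text into analysis-ready rows."""
    rows = csv.DictReader(io.StringIO(raw_text))  # import
    seen = set()
    cleaned = []
    for row in rows:
        patient_id = row["patient_id"].strip()  # clean: trim stray whitespace
        try:
            weight = float(row["weight_kg"])  # transform: text to number
        except (TypeError, ValueError):
            continue  # clean: drop missing or non-numeric weights
        key = (patient_id, row["visit_date"])
        if key in seen:  # clean: drop repeated patient/date records
            continue
        seen.add(key)
        cleaned.append(
            {
                "patient_id": patient_id,
                "visit_date": row["visit_date"],
                "weight_kg": weight,
            }
        )
    return cleaned


if __name__ == "__main__":
    for record in wrangle(RAW):
        print(record)
```

Even in this toy case, three of the four raw rows are discarded, and every rule for discarding them is a judgment call that someone has to make and document. That is part of why the work is so time-consuming.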

Where should librarians get involved?

There may be a role for librarians in passing on to hospital IT departments information about updates and changes to important biomarkers, where those need to be set manually as parameters by programmers building clinical decision support on top of EHR systems. However, as this enters the realm of medico-legal responsibility, the onus should be on EHR software developers to perform this necessary ongoing maintenance.

Krumholz (2014) described how observational, non-experimental studies generate data to support causal inferences, and he points to comparative effectiveness studies as a potentially useful application of cluster analysis on large clinical data sets. A systematic review should be a prerequisite for any health policy comparative effectiveness study, and this is where I as a librarian could best employ my literature search skills.

Librarians could be trained and certified to deliver REDCap training; the data capture form design issues are similar to those in Microsoft Access. Librarians would benefit by developing a deeper understanding of study design issues such as timing follow-up, patient data protection principles, and setting automated reminder parameters, while the enterprise would benefit from additional trainers to further spread the use of the REDCap clinical trial data collection tool.


Burton, M., and Lyon, L. (2017). Data Science in Libraries. Research Data and Preservation (RDAP) Review. Bulletin of the Association for Information Science and Technology, 43(4), 33-35.

Cognitive Class/Fireside Analytics (n.d.). Big Data 101. Retrieved from https://cognitiveclass.ai/courses/what-is-big-data/

Drummond, C., Clareson, T., Gemmill Arp, L., and Skinner, K. (2015). Libraries, Archives, and Museums (LAM) Education Needs Assessments: Bridging the Gaps. Retrieved from https://educopia.org/sites/educopia.org/files/publications/MtL_LAM_EducationNeedsAssessments_20151104_0.pdf

U.S. Institute of Museum and Library Services (IMLS) (2017). Positioning MLS programs for the 21st century. Retrieved from https://www.imls.gov/news-events/events/positioning-library-and-information-science-graduate-programs-21st-century

Krumholz, H. M. (2014). Big data and new knowledge in medicine. Health Affairs, 33(7), 1163-1170.

Simmons College (2017). Librarian professional skills and knowledge survey April 2017. Retrieved from http://slis.simmons.edu/blogs/unbound/2017/05/17/core-skills-lis/

Tsakalos, V. (2017). Data wrangling. Retrieved from https://www.r-bloggers.com/data-wrangling-cleansing-regular-expressions-33/



Funded under cooperative agreement number UG4LM012340 with the University of Maryland, Health Sciences and Human Services Library, and awarded by the DHHS, NIH, National Library of Medicine.
