by Douglas J. Joubert, Informationist, NIH Library, Washington, DC
Over the last seven weeks, in the Big Data in Healthcare – Opportunities for Librarians course, we learned about big data and data science within the context of five distinct disciplines. This essay provides an overview of big data and data science within each of the five disciplines, with a focus on how librarians can support researchers working in these fields.
Although not focused exclusively on big data, a recent report strongly advocates for an increased role for librarians in the field of data science (Burton, Lyon, Erdmann, & Tijerina, 2018). This report outlines a multi-faceted framework for understanding the internal (within the discipline) and external (within the broader science disciplines) drivers that are changing the way we think about data.
Data science is one of those terms that can take on different meanings, depending on the practice area. One of the more popular representations of data science is that of Drew Conway, who represents data science as the intersection of three primary domains [Figure 1]. It is not vital that librarians be experts in each of the three domains that comprise this Venn diagram, nor is it even possible. What is important, and serves as the primary thesis of this essay, is that librarians be grounded in how researchers in each of these areas produce, organize, and analyze data.
Figure 1: The Data Science Venn diagram.
This course introduced us to a number of different perspectives on the topic of big data. The first view was provided by a data informationist (Lisa Federer) who works for a large biomedical research center. She defined big data as having a number of distinct qualities. The first of these qualities is the amount of data being produced, commonly referred to as its volume (Federer, 2017). The second quality is the variety of the data, specifically, pulling data from many different sources, in many different formats (Federer, 2017). The third feature of big data is the rate at which the data is being produced, or its velocity (Federer, 2017). The last is data veracity, which refers to how much trust we place in the source of the data and in the data quality (Federer, 2017). Additional definitions were provided by two social scientists, a practicing clinician, and a nursing researcher.
The nursing perspective provided some additional insights that are worth exploring. First is the unique role that nurses play in the delivery of health care, and how this role influences big data research (Brennan, 2015). Second, Dr. Brennan emphasized that terms like Big Data to Knowledge (BD2K), big data, and precision medicine mean different things to different people (Brennan, 2015). The role of nursing is to make these terms meaningful to patients and their families. Last, she emphasized that these tools need to be understood from the nursing experience, which takes a more humanistic approach than the traditional medical model of health care delivery. Nurses are focused on getting the goals of precision medicine into the “hands of the people” (Brennan, 2015). All of these different perspectives are needed to fully understand the role of big data and how it is changing the way that we conduct research, deliver health care, and make informed decisions.
Using three elements from Martin’s User-Centered Data Management Framework for Librarians, I will advocate for the increased role of librarians in both data science and big data initiatives. These elements are: (1) Service, (2) Best Practices for Data Management, and (3) Literacy (Martin, 2016).
Libraries have a long and rich history of providing services to different user groups. Adding data services to more traditional library services allows libraries to respond to an increased demand for specialized levels of support for data science. Potential roles for librarians could fall into the following categories: (1) data extraction, (2) data wrangling, (3) data analysis, or (4) data visualization (Hamalainen, 2016). Some of these skills, like data extraction or data analysis, can be performed without much additional training. Data wrangling and data visualization are not out of reach for most librarians, provided they get supplemental training. These four areas also require the least overhead when compared with, for example, hosting a data repository.
Also, many data service questions are very similar to the types of reference questions that librarians have traditionally answered. For example:
Each week in this class presented us with a different challenge for managing data, and innovative solutions for dealing with these challenges. We also learned that these challenges are being addressed by local and national initiatives. At the federal level, the Office of Science and Technology Policy released a 2013 memorandum that outlined a number of important policy principles (Holdren, 2013). Many of these principles align with the work of libraries, and present us with numerous opportunities. The first is helping researchers comply with changing grant requirements. The second is working with researchers to maximize transparency and accountability in collecting and storing data. The last is connecting researchers with tools like the Open Science Framework to support data sharing and increase reproducibility.
As someone who has spent a great deal of his professional life teaching library users, this topic resonates the most with me. I also feel that librarians make some of the best teachers. Teaching data literacy, data analysis, and data management offers incredible potential for librarians. It has been my experience that starting small is the best entry point into teaching these topics, for example, working with a colleague to develop a data literacy class, or volunteering to serve as a teaching assistant or back-up for a more seasoned teacher. Teaching a class in R or Python is an admirable goal. However, it might not be the best place to start, nor is it necessarily the right solution for your library. Finally, look for both formal and informal professional development opportunities. This MOOC (Big Data in Healthcare) and Best Practices for Biomedical Research Data Management are just two recent examples of librarian-led data management classes. However, Meetup groups and connections developed through social media are also wonderful ways to learn and network.
Brennan, P. (2015). Big Data in Nursing. Bethesda: NINR Big Data Bootcamp.
Burton, M., Lyon, L., Erdmann, C., & Tijerina, B. (2018). Shifting to Data Savvy: The Future of Data Science in Libraries.
Federer, L. (2017). Data Science 101. NNLM Beyond the SEA Webinar Series.
Hamalainen, H. W. (2016). Geoscience Librarianship 101: Making Sense out of “GeoReference.” Baltimore.
Holdren, J. P. (2013). Increasing Access to the Results of Federally Funded Scientific Research. Retrieved from https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
Martin, E. (2016). The Role of Librarians in Data Science: A Call to Action. Journal of eScience Librarianship, e1092. http://doi.org/10.7191/jeslib.2015.1092
 http://www.datacommunitydc.org/calendar/ or https://www.meetup.com/find/s