Managing Scholarly Datasets: Challenges and Best Practices for Archivists

As scientific research becomes increasingly data-intensive and collaborative, libraries face mounting pressure to provide comprehensive support for the entire research data lifecycle. This evolution brings both new opportunities and significant challenges that call for strategic responses and innovative solutions.

The Expanding Role of Research Data Management

Research data management (RDM) covers the organization of data from its entry into the research cycle through the dissemination and archiving of valuable results (1). Academic libraries have emerged as critical stakeholders in this process, offering services that range from basic consultations on data management plans to advanced services such as establishing institutional repositories and providing specialized curation support (1,2). The implementation of open science policies, including Plan S in Europe and the Nelson Memo in the United States, has further underscored the foundational importance of RDM in ensuring that research data remain accessible and reusable (1,2).

The Association of College & Research Libraries has highlighted growing trends in open science and reproducibility, indicating that libraries must expand their roles in RDM to advance scientific progress (1). This shift reflects broader recognition that proper data management is not merely beneficial but essential for researchers, particularly as funding agencies increasingly mandate data sharing and preservation (2).

Critical Infrastructure Challenges

One of the most pressing challenges facing libraries is the infrastructure required to manage big data. Modern research datasets are characterized by four key attributes: volume, velocity, variety, and veracity (3). The sheer amount of data being generated requires substantial storage spaces and increased network capacity (3). Libraries must invest in sophisticated technological infrastructure capable of handling structured, semi-structured, and unstructured data while ensuring data quality and reliability (3).

Storage costs present particularly acute challenges. Approximately 30-35% of all stored data is categorized as “cold,” meaning it’s rarely accessed but must be preserved long-term (4). While cloud storage offers convenient solutions for small data volumes, costs scale approximately linearly with archive size. For archives exceeding a few hundred terabytes, on-premises solutions using Linear Tape Open (LTO) technology become dramatically more cost-effective (5). The California Digital Library has explored pilot projects that involve upfront capital investment in storage nodes to address these financial pressures and to create sustainable preservation networks (6).
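
The break-even reasoning above can be illustrated with a minimal sketch. It assumes cloud cost is purely linear in archive size and time, and that on-premises tape cost is an upfront capital outlay plus per-terabyte media plus yearly operations. All figures below are hypothetical placeholders chosen for illustration, not prices drawn from the cited sources.

```python
def cloud_cost(tb: float, years: float, price_per_tb_year: float) -> float:
    """Cloud cost modeled as linear in archive size and retention time."""
    return tb * years * price_per_tb_year


def on_prem_cost(tb: float, years: float, fixed_capital: float,
                 media_per_tb: float, ops_per_year: float) -> float:
    """On-premises tape cost: upfront capital + media + yearly operations."""
    return fixed_capital + tb * media_per_tb + years * ops_per_year


def break_even_tb(years: float, price_per_tb_year: float, fixed_capital: float,
                  media_per_tb: float, ops_per_year: float):
    """Archive size (TB) above which on-premises is cheaper, or None if never."""
    saving_per_tb = years * price_per_tb_year - media_per_tb
    if saving_per_tb <= 0:
        return None  # cloud stays cheaper at every size under this model
    return (fixed_capital + years * ops_per_year) / saving_per_tb


# Hypothetical placeholder figures (not real vendor prices).
YEARS = 5
CLOUD_PRICE = 50.0        # currency units per TB per year
CAPITAL = 40_000.0        # tape library and drives
MEDIA_PER_TB = 10.0       # tape cartridges
OPS_PER_YEAR = 4_000.0    # staff time, power, maintenance

print(break_even_tb(YEARS, CLOUD_PRICE, CAPITAL, MEDIA_PER_TB, OPS_PER_YEAR))  # 250.0
```

Under these placeholder inputs the two cost curves cross at 250 TB. An institution would substitute its own quotes, and the crossover point moves accordingly.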

Current storage technologies present their own preservation challenges. Hard disk drives remain a cornerstone of archival storage, and emerging technologies such as Heat-Assisted Magnetic Recording promise increased capacity (4). Magnetic tape continues to offer compelling advantages for cold storage, with durability of up to 30 years when properly stored and strong cost-effectiveness at scale (4). However, balancing performance, cost, and environmental impact requires careful strategic planning, as different storage types consume varying amounts of energy and have different carbon footprints (7).

Metadata Standardization Complexities

Metadata challenges represent another significant hurdle for libraries managing scholarly datasets. Despite the availability of numerous metadata standards, including Dublin Core, MARC, MODS, and discipline-specific schemas, libraries face persistent problems with incomplete, inconsistent, or missing metadata (8,9). Because no single standard is widely adopted, stakeholders often fail to adhere to any of them unless major platforms such as PubMed or Crossref require it (8).

Research from Memorial Sloan Kettering Cancer Center highlights how variable levels of searchability across data repositories, limited normalization between repositories, and the sheer volume of biomedical research data contribute to cataloging difficulties (9). Affiliation fields are particularly problematic, as they are rarely identified, lack standardization, contain misspellings, and are often excluded from indexing (9). These metadata quality issues create barriers to dataset discovery, access, and reproduction—fundamental impediments to open science (8).
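
A simple completeness check shows how such gaps could be surfaced before a record is published. This is a sketch only: the required-field list is an assumed local policy built from common Dublin Core element names, and the `affiliation` handling is a hypothetical normalization step, not a method described in the cited sources.

```python
# Assumed local policy: Dublin Core elements every record must carry.
REQUIRED_FIELDS = ("title", "creator", "date", "identifier")


def missing_fields(record: dict) -> list[str]:
    """Return required fields that are absent, None, or blank."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def normalize_affiliation(raw: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal.

    This does not correct misspellings; that would need an authority file
    or a persistent organization identifier.
    """
    return " ".join(raw.split()).casefold()


record = {"title": "Survey responses 2023", "creator": " ", "date": None}
print(missing_fields(record))  # ['creator', 'date', 'identifier']
print(normalize_affiliation("  Memorial  Sloan Kettering ") ==
      normalize_affiliation("memorial sloan kettering"))  # True
```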

The scholarly communications industry faces additional complications from interoperability issues arising from disconnected submission and production systems, manual-entry errors when publishers enter persistent identifiers, and a lack of consistent affiliation and funding data (8). These problems make it challenging to model future agreements and compel institutions to allocate unanticipated resources to educate researchers on metadata creation and use (8).

Skills and Collaboration Barriers

Technical expertise represents a critical constraint for many libraries. Providing repository services, access and discovery systems, and preparing datasets for repositories requires sophisticated skill sets and sustainable technological infrastructure (10). Libraries must navigate complex technical aspects while also addressing the diverse expectations and limitations of multiple stakeholders, including IT services, academic departments, research support services, data centers, and other institutions (10).

The fragmented nature of research processes compounds these difficulties. Many institutions lack integrated RDM policies and strategies to guide their data management support (10). Limited ICT policies and infrastructures further constrain what libraries can accomplish (10). The development of coordinated, cohesive collaboration among stakeholders is particularly challenging due to unclear roles, uncertain responsibilities, and insufficient support from senior management (10).

Privacy and ethical considerations add another layer of complexity. The healthcare, social networking, and government domains contain large amounts of sensitive information that require careful handling (3). Libraries must ensure robust anonymization techniques, implement strict access controls, and use role-based permissions to protect participants’ privacy while enabling data sharing (11). Researchers need training on data anonymization, informed consent, and the ethical implications of data sharing to navigate these requirements successfully (11).
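
As a rough illustration of role-based permissions combined with identifier protection, the sketch below uses a hypothetical role table and replaces participant identifiers with keyed hashes. Keyed hashing is pseudonymization rather than full anonymization: anyone holding the key can re-link records, and other fields may still identify a participant. The role names and actions are assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical roles and the actions each may perform.
ROLE_PERMISSIONS = {
    "curator": {"read_metadata", "read_data", "edit_metadata"},
    "researcher": {"read_metadata", "read_data"},
    "public": {"read_metadata"},
}


def is_allowed(role: str, action: str) -> bool:
    """Unknown roles get no permissions (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())


def pseudonymize(participant_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, participant_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


print(is_allowed("public", "read_data"))       # False
print(is_allowed("curator", "edit_metadata"))  # True
```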

Best Practices and Strategic Responses

Despite these challenges, leading libraries have developed effective strategies for managing scholarly datasets. Successful institutions adopt a tiered approach to storage, balancing performance and cost-efficiency by matching storage technologies to access patterns (5,7). The “3-2-1” storage rule (maintaining three copies of data on two different storage types, with one in a different geographical location) is a widely used benchmark for preserving critical data, although its expense means it must be applied selectively (12).
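
The 3-2-1 rule can be expressed as a small check over an inventory of copies. The following is a minimal sketch; the `StoredCopy` type and its field values are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredCopy:
    storage_type: str  # e.g. "disk", "tape", "cloud"
    location: str      # e.g. a site or region name


def satisfies_3_2_1(copies: list[StoredCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 storage types, in >= 2 locations."""
    storage_types = {c.storage_type for c in copies}
    locations = {c.location for c in copies}
    return len(copies) >= 3 and len(storage_types) >= 2 and len(locations) >= 2


inventory = [
    StoredCopy("disk", "main-campus"),
    StoredCopy("tape", "main-campus"),
    StoredCopy("tape", "remote-vault"),
]
print(satisfies_3_2_1(inventory))  # True
```

Because the rule is costly to meet for every dataset, a check like this would most plausibly run only over collections flagged as critical.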

Libraries should provide comprehensive services throughout the research data lifecycle, including educational programs on data management methodologies, workshops on using field-specific datasets, and consultations on data management plans (13). Many institutions have established specialized data services with dedicated staff, recognizing that researchers face numerous concerns regarding data storage, integrity, and backup options, yet often lack sufficient time or preparation to meet these requirements independently (13).

Platform solutions like Globus demonstrate how software-as-a-service models can simplify research data management while reducing costs (14). These platforms address challenges through cloud-hosted services, efficient data transfer mechanisms, and integrated authentication systems (14). Similarly, Library Services Platforms such as Bibliovation offer comprehensive solutions by combining traditional bibliographic management with support for diverse metadata formats, including RDA, Dublin Core, and MARC21, while storing all data types—bibliographic, patron, transaction, acquisitions, and digital objects—in relational databases accessible via web browsers (15). Such platforms help libraries manage both physical and digital content within unified systems, addressing the challenges posed by emerging data sources and external applications (15). Additionally, developing data lakehouse architectures offers scalable solutions to enhance institutional repository capabilities and improve researcher experiences (11).

Effective metadata practices require strategic standardization efforts. The Make Data Count initiative exemplifies how international collaboration can transform impact measurement for open research data by developing evidence-based metrics that capture dataset usage and citations comprehensively (16). Libraries should implement interactive pre-deposit support to help researchers prepare data for publication and conduct post-deposit enhancements to improve dataset quality and usability (16).

Integration of emerging technologies offers promising opportunities. Artificial intelligence and machine learning can automate metadata extraction, content analysis, and preservation planning (11,17). Blockchain technology provides mechanisms for establishing provenance and securing digital content against unauthorized changes (17). Internet of Things devices enable capture and storage of real-time data streams, expanding the scope of digital library assets (17).
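
The tamper-evidence idea behind blockchain-based provenance can be shown with a much simpler structure: a hash chain in which each log entry's hash covers the previous entry's hash. This sketch is not a blockchain (there is no distribution or consensus), and the event fields are hypothetical. It only demonstrates why altering an earlier record becomes detectable.

```python
import hashlib
import json


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the event together with the previous entry's hash."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the last entry in the log."""
    prev_hash = log[-1]["hash"] if log else ""
    log.append({"event": event, "hash": entry_hash(prev_hash, event)})


def verify(log: list[dict]) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = ""
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_event(log, {"action": "deposit", "object": "dataset-001"})
append_event(log, {"action": "metadata-update", "object": "dataset-001"})
print(verify(log))  # True

log[0]["event"]["object"] = "dataset-999"  # simulate an unauthorized change
print(verify(log))  # False
```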

Managing large-scale scholarly datasets presents multifaceted challenges for academic libraries, spanning technical infrastructure, financial sustainability, metadata standardization, and collaborative coordination. As research becomes increasingly data-intensive, libraries must strategically invest in appropriate storage technologies, develop comprehensive skill sets, establish robust metadata practices, and foster institutional partnerships. By embracing emerging technologies, adhering to international standards, and maintaining focus on user needs, libraries can fulfill their evolving role as essential partners in the research enterprise while ensuring long-term preservation and accessibility of valuable scholarly data.

References

1. Ho, A.K. et al. (2025). Research data management services in academic libraries to support the research data life cycle: A systematic review. Journal of the Association for Information Science and Technology. https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.70008
2. Subaveerapandiyan, A. & Ugwulebo, J.E. (2023). Research Data Management Practices and Challenges in Academic Libraries: A Comprehensive Review. Library Philosophy and Practice. https://www.researchgate.net/publication/372692575
3. Garoufallou, E. et al. (2021). Big Data: Opportunities and Challenges in Libraries, a Systematic Literature Review. College & Research Libraries, 82(3). https://crl.acrl.org/index.php/crl/article/view/24918/32769
4. APTrust (2025). Exploring Trends in Archival Storage: Insights from a National Academies Report. https://aptrust.org/2025/07/28/exploring-trends-in-archival-storage-insights-from-a-national-academies-report/
5. Active Archive Alliance (2024). 2024 Data Storage and Active Archive Trends – Part III. https://activearchive.com/blog/2024-data-storage-and-active-archive-trends-part-iii/
6. California Digital Library (2018). Tackling the storage costs of digital preservation. https://uc3.cdlib.org/2018/10/22/tackling-the-storage-costs-of-preservation/
7. APTrust (2024). Understanding Storage and Data Access in Digital Preservation. https://aptrust.org/2024/07/29/understanding-storage-and-data-access-in-digital-preservation/
8. Insights Publishing Group (2024). Challenges and roadblocks to robust metadata in the scholarly communications industry. https://insights.uksg.org/articles/10.1629/uksg.642
9. Memorial Sloan Kettering Cancer Center Library (2024). Reflections on the MSK Data Catalog at 1,000 Records. https://library.mskcc.org/blog/2024/10/a-reflection-of-msk-data-catalog-at-1000-records/
10. Yu, H. H. (2017). The role of academic libraries in research data service (RDS) provision. The Electronic Library, 35(4), 783–797. https://doi.org/10.1108/el-10-2016-0233
11. Mwinami, P. et al. (2024). Research data management in academic libraries: institutional repositories as a reservoir for research data. Library Management, 45(1/2). https://www.emerald.com/insight/content/doi/10.1108/lm-06-2024-0070/full/html
12. Vassar College Libraries. Digital Preservation Basics. https://library.vassar.edu/c.php?g=1218328&p=8911412
13. Yoon, A. & Schultz, T. (2017). Research Data Management Services in Academic Libraries in the US: A Content Analysis of Libraries’ Websites. College & Research Libraries, 78(7). https://crl.acrl.org/index.php/crl/article/view/16788/18346
14. Chard, K. et al. (2017). Globus: A research data management platform. Referenced in Subaveerapandiyan & Ugwulebo (2023). https://www.researchgate.net/publication/318224056_Globus_Research_Data_Management_as_Service_and_Platform
15. LibLime/PTFS (2023). Bibliovation Library Services Platform. https://ptfs.com/wp-content/uploads/2023/10/LibLime-Bibliovation.pdf and https://liblime.com/bibliovation/
16. California Digital Library (2024). Community enrichment of DOI metadata and Make Data Count initiative. https://uc3.cdlib.org/2024/08/
17. Confinity (2024). Future Trends in Digital Preservation for Libraries. https://www.confinity.com/culture/the-future-of-digital-preservation-in-libraries-trends-and-directions
