Log Number: LG-71-17-0202-17

The University of North Texas Libraries and the Computer Science and Engineering Department will research the efficacy of using machine-learning algorithms to identify and extract publications contained in web archives. The overarching goal of this project is to determine whether machine-learning models can successfully identify content-rich PDF and Word documents in web archives that align with library and archives collecting plans. The researchers are working in two phases. First, they are deepening their understanding of the workflows, practices, and selection criteria of librarians and archivists through ethnographic observations and interviews. Next, this understanding informs the use of novel machine-learning techniques to identify content-rich publications collected in existing web archives. Identifying these documents will help libraries, archives, and museums meet their curatorial missions.
