Cultural Heritage Language Technologies

  • Scaife, Allen (PI)

Grants and Contracts Details


Digital libraries have already begun to serve humanists in two major ways. First, they enhance the reach of traditional researchers: once a text is digitized, even the simplest of search facilities allows users to interact with and study texts in entirely new ways. Second, where the technologies of industrialized print made possible a network of research libraries centered primarily in Europe and North America, electronic collections, linked by high-speed networks, are becoming digital libraries that serve a global audience, reaching far beyond universities into schools, public libraries, and private homes. The cultural heritage of humanity can play new roles in the lives of professional scholars and the general public alike. A properly designed global information infrastructure can strengthen our own cultural identities while opening up for us cultures radically different from our own. But humanists cannot develop their information technologies in isolation. They must build upon and help shape a common information infrastructure that they share with scientific enterprises on the one hand and businesses on the other. The range of technologies is impressive and growing rapidly. In the United States, for example, the National Science Foundation is creating a National Science Digital Library (NSDL) aimed at distributing information to an audience ranging from elementary school students through adult practitioners. The US Defense Department, meanwhile, is supporting an initiative entitled Translingual Information Detection, Extraction, and Summarization (TIDES). TIDES is developing ways to apply a range of language technologies, such as machine translation and information extraction, to dozens of languages. Programs such as these provide a technological foundation that could revolutionize the position of cultural heritage languages in scholarship and in society as a whole.
Programs such as the NSDL and TIDES, though of potential benefit to humanists, are aimed at very different audiences. Scientists do not read journal articles in the same ways that humanists read Shakespeare or Dante. We in the humanities need to extend and generalize technologies designed for scientists, scholars, and diplomats if we are to enjoy their benefits. We propose to concentrate in this project on the problems of integrating electronic text corpora and scholarly resources for cultural heritage languages. The overall problem is immense: digital libraries not only facilitate work on individual questions within existing disciplines (e.g., a study of "guest friendship" in archaic Greece) but can also raise to a new level our ability to compare cultures (e.g., Homeric epics and Old Norse sagas). For the purposes of this grant, we have elected to work with three languages: Classical Greek, early modern Latin, and Old Norse. Our focus for all three of these languages will be the creation of advanced digital library applications that (1) adapt computational linguistic and data mining techniques for the needs of humanists, (2) establish an international framework for the long-term preservation of data, the sharing of metadata, and interoperability between affiliated digital libraries, and (3) lower the barriers to reading these texts. The final result of this collaboration will be a suite of applications that includes multilingual information retrieval facilities; concept identification and visualization tools; vocabulary profiles; and a syntactic parsing toolbox with facilities for word sense and morphological disambiguation and for the resolution of attachment ambiguity. It will also include infrastructure-level programs that share data, metadata, and tools among affiliated digital libraries.
This infrastructure will allow partner libraries to generate automatic hypertexts that link similar resources in different collections, federate their search facilities, and share resource-intensive programs. Finally, we will create or integrate new corpora of texts as testbeds for these applications. These corpora include approximately 300 MB (more than 60,000 printed pages) of literary and scientific early modern Latin texts, including many of Isaac Newton's papers, and 12 MB of Old Norse literature with many texts linked to manuscript images.
For this project, we have gathered an international research team combining humanists, digital library specialists, and computer scientists. The principal investigator for this project, Jeffrey Rydberg-Cox at the University of Missouri at Kansas City, has worked intensively for the past three years to study the application of computational linguistics to Ancient Greek texts and to integrate his findings into a digital library. He has worked with co-principal investigator Gregory Crane on the Classical Greek texts, lexica, and parser developed by the Perseus Project based at Tufts University. Greek thus provides us with one well-established dataset on which we can build. Second, two of our partner institutions, the Perseus Project and Andrea Bozzi at the Istituto di Linguistica Computazionale del CNR in Pisa, have created similar foundations for Classical Latin. In this grant, we will extend these tools to the problems of early modern Latin, an immense corpus far too large for conventional translation but essential to the study of European culture. In this area, we will work with Stephan Rueger and the Newton Project, a group digitizing Isaac Newton's papers at Imperial College, London. Rueger is a computer scientist who will develop data visualization and extraction tools for Latin texts. Third, we will work with Timothy R. Tangherlini at the University of California at Los Angeles and Matthew Driscoll of the Arnamagnaean Institute on Old Norse, a language with an immense and rich literature, to study the problems of bootstrapping a cultural heritage language into an already existing system. Finally, we will work with Ross Scaife and the Stoa Publishing consortium at the University of Kentucky to explore the problems of sharing metadata and applications among affiliated digital libraries.
We chose these particular languages based on the research interests and existing corpora of the collaborators. Because our project includes both humanists and computer scientists, and because some of our collaborators are also researchers with significant experience in the theory and practice of digital library systems, we are well placed to assess the needs of humanities scholars and the potential applications of information technology to these materials. We believe this collaboration among humanists, programmers, and humanist-programmers will result in a richer, more usable system than either humanists or digital library developers could create alone.
Effective start/end date: 1/1/02 to 8/31/05

