The Archive Assistant (AA) is an open-source application that runs on a server with a database backend. All components of the system are open-source (Linux, Apache, MySQL, etc.), and the entire suite is available free of charge on the Internet from the respective open-source providers. Like other programs of its kind, the AA is meant to assist archives and libraries in scanning, cataloguing and disseminating their manuscript collections. It differs from similar initiatives in that it enables the institution to gradually receive, examine and accumulate the transcriptions returned by end-users of the Transcription Assistant application described below.
The premise on which our whole initiative rests is that institutional archives will make the digital images of the manuscript pages in their holdings available to transcribers, so that they can be downloaded via an Internet search engine, as is becoming increasingly customary at many institutions. Our system is designed to work with manuscripts of any type, language, alphabet and age, though to date it has been tested primarily on Venetian manuscripts from the 13th to 17th centuries from the Venice State Archives, and on early American manuscripts from the American Antiquarian Society. Our emergent transcription system relies on the diffusion of digital images of manuscripts as the basis for the distributed, asynchronous production of transcriptions. Scanned images are packaged together with the metadata of the manuscript they depict to create an XPG (eXtended jPeG) file, a file type that holds both metadata and image in a single XML file. The metadata are what make a manuscript usable in a historical context, and they are generally the same metadata already used in the manuscript catalogues in operation at libraries and archives.
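As a rough illustration of the packaging idea, the following Python sketch wraps image bytes and catalogue metadata in a single XML document. The element names (`xpg`, `metadata`, `field`, `image`) and the use of base64 for the image payload are assumptions made for this example only; the actual XPG schema is not specified here.

```python
import base64
import xml.etree.ElementTree as ET


def build_xpg(image_bytes, metadata):
    """Package image bytes and catalogue metadata into one XML string.

    All element and attribute names are hypothetical, not the real XPG schema.
    """
    root = ET.Element("xpg")
    meta = ET.SubElement(root, "metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "field", {"name": name})
        field.text = value
    # Binary image data is base64-encoded so it can live inside XML text.
    image = ET.SubElement(root, "image", {"encoding": "base64", "format": "jpeg"})
    image.text = base64.b64encode(image_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def read_xpg(xml_text):
    """Recover the image bytes and the metadata dictionary from the XML string."""
    root = ET.fromstring(xml_text)
    metadata = {f.get("name"): f.text for f in root.find("metadata")}
    image_bytes = base64.b64decode(root.find("image").text)
    return image_bytes, metadata
```

Keeping the image and its description in one file means a downloaded page never becomes separated from the catalogue information that makes it interpretable.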
Our metadata sub-system currently consists of a superset of the MARC and Dublin Core standards, which allows conversion from one standard to the other. We are currently working on AA functions to facilitate the bulk importing of existing MARC and Dublin Core databases into our system.
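One way to picture such a superset is a table of internal fields, each tied to its counterpart in both standards. The sketch below assumes a small, simplified field table; the internal field names and the choice of MARC tags are illustrative (they loosely follow commonly used MARC to Dublin Core pairings) and are not the AA's actual schema.

```python
# Hypothetical internal field -> (MARC tag and subfield, Dublin Core element).
# A real crosswalk covers many more fields and handles repeated values.
FIELD_MAP = {
    "title": ("245$a", "title"),
    "author": ("100$a", "creator"),
    "date": ("260$c", "date"),
    "subject": ("650$a", "subject"),
    "language": ("041$a", "language"),
}


def marc_to_record(marc_fields):
    """Convert a flat {MARC tag: value} dict into the internal superset record."""
    by_marc = {marc: name for name, (marc, _dc) in FIELD_MAP.items()}
    return {by_marc[tag]: value for tag, value in marc_fields.items() if tag in by_marc}


def record_to_dublin_core(record):
    """Convert the internal superset record into a {Dublin Core element: value} dict."""
    return {FIELD_MAP[name][1]: value for name, value in record.items() if name in FIELD_MAP}


def marc_to_dublin_core(marc_fields):
    """Convert MARC to Dublin Core by passing through the internal record."""
    return record_to_dublin_core(marc_to_record(marc_fields))
```

Routing every conversion through the internal record means each standard needs only one mapping to the superset, rather than a separate mapping for every pair of standards. Fields with no entry in the table are silently dropped in this sketch; a bulk importer would more likely log them for review.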
Further components of the AA application are being developed to allow advanced searches on metadata and transcription text, with the possibility of later extending searches to image-based algorithms such as those discussed in Rath et al. After a successful search, users will be able to browse the listed manuscript pages and download selected pages to their own machines for use with the Transcription Assistant.
After the end-user has transcribed a manuscript page, the XPG file is augmented with an XML-based transcription section that follows the Manuscript Markup Language (MML) we have developed for this purpose. Once an initial transcription exists, the manuscript page (manuscript metadata + image + transcription metadata + transcription) is packaged as an MML file from then on.
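The augmentation step can be sketched as follows, assuming the XPG file is an XML document to which a transcription section is appended. The tag names (`mml`, `transcription`, `line`) and the `transcriber` attribute are hypothetical stand-ins for illustration; they are not taken from the MML specification.

```python
import xml.etree.ElementTree as ET


def add_transcription(xpg_xml, transcriber, lines):
    """Append a transcription section to an XPG document and return it as MML.

    Tag and attribute names are hypothetical placeholders for the real MML schema.
    """
    root = ET.fromstring(xpg_xml)
    # Transcription metadata is kept as attributes on the new section.
    section = ET.SubElement(root, "transcription", {"transcriber": transcriber})
    for number, text in enumerate(lines, start=1):
        line = ET.SubElement(section, "line", {"n": str(number)})
        line.text = text
    # The existing metadata and image are left untouched; only the wrapper changes.
    root.tag = "mml"
    return ET.tostring(root, encoding="unicode")
```

Because the original metadata and image travel inside the same document, a returned file can be matched to its source page without any side-channel information.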
A further component handles the reception of returned MML files containing the transcriptions produced by end-users. It works in conjunction with the Contribution Accountant and with the backend MySQL database on the archive server, where the XPG manuscript images and MML files are permanently stored.
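A minimal sketch of the reception step is given below. It uses Python's built-in `sqlite3` module as a stand-in for the MySQL backend so that the example is self-contained, and the table layout (`transcriptions` with `page_id`, `contributor` and `mml` columns) is an assumption for illustration rather than the AA's actual schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) transcriptions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcriptions ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " page_id TEXT NOT NULL,"
        " contributor TEXT NOT NULL,"
        " mml TEXT NOT NULL)"
    )
    return conn


def receive_mml(conn, page_id, contributor, mml_text):
    """Store one returned MML file; earlier submissions are never overwritten."""
    cursor = conn.execute(
        "INSERT INTO transcriptions (page_id, contributor, mml) VALUES (?, ?, ?)",
        (page_id, contributor, mml_text),
    )
    conn.commit()
    return cursor.lastrowid


def submissions_for_page(conn, page_id):
    """Return (contributor, mml) pairs for a page in order of arrival."""
    rows = conn.execute(
        "SELECT contributor, mml FROM transcriptions WHERE page_id = ? ORDER BY id",
        (page_id,),
    )
    return rows.fetchall()
```

Storing each submission as a new row, rather than replacing the previous one, lets the institution accumulate and compare transcriptions of the same page over time.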
The final component of the AA assists with manuscript processing. It automatically draws boxes around what it takes to be individual words (using a “smearing” algorithm designed by WPI students in a previous project) and then applies optical character recognition (OCR) techniques to make a preliminary guess at the content of a newly added manuscript.
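The details of the WPI smearing algorithm are not given here, so the sketch below illustrates only the general idea using a generic run-length smoothing approach, which is an assumption on our part. Small gaps between ink pixels (within a word) are filled in, while larger gaps (between words) are left open, so that each remaining run marks one candidate word. The function names, the binary list-of-lists image representation and the `max_gap` parameter are all hypothetical.

```python
def smear_row(row, max_gap):
    """Fill background runs of length <= max_gap that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, pixel in enumerate(row):
        if pixel:
            if last_ink is not None and 0 < i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def word_spans(line_image, max_gap):
    """Return inclusive (start, end) column spans of candidate words in one text line.

    line_image is a list of equal-length rows of 0 (background) and 1 (ink).
    """
    width = len(line_image[0])
    # Column profile: 1 wherever any row has ink in that column.
    profile = [int(any(row[x] for row in line_image)) for x in range(width)]
    smeared = smear_row(profile, max_gap)
    spans, start = [], None
    for x, value in enumerate(smeared):
        if value and start is None:
            start = x
        elif not value and start is not None:
            spans.append((start, x - 1))
            start = None
    if start is not None:
        spans.append((start, width - 1))
    return spans
```

The choice of `max_gap` controls the trade-off: too small and single words are split at the gaps between letters, too large and neighbouring words merge into one box. In practice the threshold would need tuning to the hand and the scan resolution.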