Future of the Book – Page 8

Week 4: Colonial Despatches

Having a deep passion for history, my attention was almost immediately drawn to the University of Victoria’s “Colonial Despatches” project presented in the TEI project list. The project is a digital archive containing “transcriptions of virtually the complete correspondences between the British colonial authorities and the successive governors of the nascent Vancouver Island and British Columbia colonies”, among other historically valuable documents. These artifacts provide the history of Vancouver Island and British Columbia from 1846 to 1871 from the perspective of the individuals who were closely tied to the governance and development of the land, its resources and its population.

The original project was created in the 1980s using files encoded in Waterloo Script (a text-encoding language processed using SCRIPT). These files have now been converted into XML based on the TEI P5 Guidelines, and the University of Victoria has built a web application to make them readable and searchable.

The project provides a detailed guideline that outlines the mark-up scheme used for each record. This document demonstrates how the TEI guidelines were used in the creation of the XML files. The project guide also includes details about the tags used within the markup, offering an explanation for the purpose of the tag and examples on how to use them.

The XML code for all documents in the project is available to the public, making this a very useful example for our own encoding challenge if you are working with handwritten or annotated works.

Bibliography:

Colonial Despatches: The Colonial Despatches of Vancouver Island and British Columbia 1846-1871. University of Victoria (n.d.). Retrieved from: http://bcgenesis.uvic.ca/about.htm

TEI in the wild

For this week’s task, identifying TEI in the wild, I selected the New York Public Library’s Digital Schomburg: African American Women Writers of the 19th Century. This project makes its use of TEI explicit, detailing its use of machine-readable form under a link called “Technical Notes.” According to the site, Digital Schomburg uses Standard Generalized Markup Language (SGML), according to the Text Encoding Initiative Lite Document Type Definition.

On the home page, beneath the technical notes link is a link titled “Editorial Methods.” This includes a comparison of the print and electronic forms of the text, which claims “Not a single character of text has been deleted or excluded from the source documents in their conversion to the Digital Schomburg Edition” (http://digital.nypl.org/schomburg/writers_aa19/technotes.html). It describes its use of TEI as “literary,” and explains the difference between a TEI header, encoding description, revision history description, and text profile description.

While the project clearly describes its method, the editors do not go so far as to describe challenges encountered during their work. They stick to a descriptive approach to their use of XML. Having scoured the project’s website, I found no trace of accessible code. While the editors are willing to describe their process, they protect it by refusing to share it with visitors to the site. This closes down the possibilities for scholarly engagement with the text, assuming users’ interest in the project’s form is fairly superficial.

The link to the Digital Schomburg project can be found here: http://digital.nypl.org/schomburg/writers_aa19/technotes.html

TEI in eZISS

Looking through the list of TEI projects, I noticed eZISS: Scholarly Digital Editions of Slovenian Literature. This project is hosted by the Scientific Research Centre of the Slovenian Academy of Sciences and Arts in Ljubljana, Slovenia. I was interested in eZISS because the material was unfamiliar and the project is extensively documented. Although the texts are only in Slovenian, the entire website is available in English.

eZISS offers “selected Slovenian texts with integrated facsimiles, transcription and scholarly commentary, in some cases including audiovisual recordings” (eZISS). The project is described as existing at the intersection of Slovenian literature, ecdotics (philological study of texts and their presentations), and modern information technology (eZISS).

There is a large focus on encoding, which is likely because information technology is part of this intersection. Encoding isn’t just a vehicle for showcasing texts but a fundamental aspect of the project. On the main page of the site, they write:

The complex digital encoding of texts with facsimiles, transcriptions, critical apparatus and audiovisual recordings is achieved with the help of open standards of textual markup: Unicode, XML, and the TEI Guidelines. This foundation helps the editions to be better resistant to technological change, software independent and compatible with other standardised digital resources. From the source XML, an HTML version is created with XSLT stylesheets; to read the HTML, only a standard browser is required (eZISS).
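To make the XML-to-HTML step concrete, here is a minimal sketch. eZISS does this with XSLT stylesheets; Python's standard library has no XSLT processor, so the sketch swaps in a hand-written transformation using xml.etree.ElementTree. The sample TEI fragment and the tag mapping are my own assumptions for illustration and are not taken from the project.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI fragment; not taken from an eZISS edition.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <head>Sample heading</head>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body></text>
</TEI>"""

# Assumed mapping from TEI element names to HTML element names.
TAG_MAP = {"head": "h1", "p": "p"}


def tei_to_html(tei_source: str) -> str:
    """Turn the direct children of <body> into a flat HTML <body>."""
    root = ET.fromstring(tei_source)
    tei_body = root.find(f".//{{{TEI_NS}}}body")
    html_body = ET.Element("body")
    for child in tei_body:
        local_name = child.tag.split("}")[-1]  # drop the namespace part
        html_tag = TAG_MAP.get(local_name)
        if html_tag is None:
            continue  # skip elements the mapping does not cover
        out = ET.SubElement(html_body, html_tag)
        out.text = child.text
    return ET.tostring(html_body, encoding="unicode")


html = tei_to_html(SAMPLE)
print(html)
```

A real stylesheet would also handle nested elements, attributes, and the facsimile links; the point here is only that the HTML is derived from the XML, never edited by hand.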

The researchers are also very open with their sources and code. The texts are all licensed under a Creative Commons Attribution-Share Alike license. As an example, I looked at the Freising Manuscripts, the first document of Slovenian culture.

From here, you can view all of the components online or save them to your computer. Interestingly, the website states that you will “also get the XML/TEI files, suitable for further processing” (Freising Manuscripts).

The downloaded edition provides everything from the icons used on the webpage to the facsimiles (gif folder) to the TEI files.

What impressed me most about this project is the emphasis on openness, either through the CC license or through explicit mentions of further processing.

Lastly, the project is very well documented. There are three English publications on the development of the eZISS editions (and several more in Slovenian). According to the article “E-Slomšek: A TEI Encoding of a Critical Edition of 19th Century Slovenian Rhetoric Prose” by Erjavec, Ogrin, and Faganel, in small nations such as Slovenia, “publishing critical editions with facsimile, transcriptions and apparatus in traditional print form faces great economic barriers, primarily due to the very small book market” (Erjavec et al., 2004, p. 31). Digital editions of this work, thus, have a “much better chance… of preserving, interpreting and making available Slovenian cultural heritage” (p. 31). In this case, the availability of open standards allowed for projects that the researchers saw as contributing to national identity and preservation of culture.

References

Erjavec, T., et al. (2004). E-slomšek: A TEI encoding of a critical edition of 19th century Slovenian rhetoric prose. Pregled NCD 5: 31-41.

eZISS (2011). About. eZISS: Scholarly digital editions of Slovenian literature. Accessed February 2, 2015.

TEI (2007). Scholarly digital editions of Slovenian literature. Accessed February 2, 2015.

TEI in the wild – Vincent van Gogh’s letters

‘Vincent van Gogh – The Letters’ is a “scholarly edition of all extant letters (902) sent by or to Vincent van Gogh (1853-1890)” (TEI, 2010). The TEI project page notes that, “Van Gogh’s correspondence is a unique resource for the insight it provides into both his artistic practice and his personal life. The full digital edition is available online; a reading edition is available as a six volume book edition. While intended for a scholarly audience, the edition is expected to serve the interest of a much wider public” (TEI, 2010).

The project, which is hosted by the Huygens Instituut, in collaboration with the Van Gogh Museum Amsterdam, provides for each letter, “a zoomable facsimile, a transcription of the original text (mostly Dutch or French), a new translation into English, and extensive annotation. More than 2000 illustrations are given of the works of art that Van Gogh mentions in his letters. Introductory essays discuss Van Gogh, his letters and his circle. Other material includes a timeline, maps, indices and a bibliography” (TEI, 2010).

Interestingly, the TEI project page also notes that, “the project started before the era of the web, and it was only later decided the web would be its main publication platform. The letters and annotations were created in a word-processing program and later converted semi-automatically into TEI (something that you probably want to avoid doing)” (TEI, 2010). It also describes how the project, “created one TEI (P5) document per letter, holding header information (we introduced some new header elements), facsimile information, transcription, translation and annotation. Other TEI documents hold secondary texts such as the essays and bibliography” (TEI, 2010).

The website itself describes the use of XML generally – though it takes a bit of searching to find the link (under ‘About this edition’ – ‘The web edition’). It notes that: “This edition, like many modern digital editions, is based on XML (eXtensible Markup Language) documents. XML is a standard for the creation of documents in which the document text is interspersed with ‘tags’, brief labels that describe the nature and properties of the text fragments that they surround. The Text Encoding Initiative (TEI) has proposed guidelines for the names and types of the tags to be employed in humanities texts. Out of the 400+ existing tags, a so-called ‘schema’ can be created that contains exactly those tags that are applicable to a certain type of document (such as a letter that is prepared for a scholarly edition). New tags can be defined when the existing tagset is insufficient. The schema describes the required and permitted tags in a class of XML documents. It can be used to check the correctness of these documents. A dedicated schema was created for the Van Gogh edition. A number of non-standard tags were used, some of which were ‘borrowed’ from the DALF (Digital Archive of Letters in Flanders) project” (Jansen, Luijten, and Bakker (eds.), 2009).

It then goes on to explain the specific use of XML for the project: “One XML document was created for each of Van Gogh’s letters and each related document. It holds letter-level metadata (title, number, date, correspondents, etc.), the full transcription, the translation, the notes, the textual notes, and the information that connects transcribed pages with images of those pages (facsimile elements). The XML files were created in an automatic conversion from word-processor documents. The conversion result was checked and extensively corrected. The XML files were manually indexed to facilitate searching and cross-referencing” (Jansen, Luijten, and Bakker (eds.), 2009).
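The one-file-per-letter layout and the idea of a schema that lists required parts can be sketched in a few lines. This is a toy check with Python's standard library, not real W3C schema validation, and the letter below is invented: the element choices, title, and file name are my assumptions and do not come from the Van Gogh sample files.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical letter file; element choices are illustrative only.
LETTER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Letter 001 (invented)</title></titleStmt></fileDesc>
  </teiHeader>
  <facsimile><graphic url="page1.jpg"/></facsimile>
  <text><body>
    <div type="original"><p>Invented original text.</p></div>
    <div type="translation"><p>Invented English text.</p></div>
  </body></text>
</TEI>"""

# Assumed list of parts every letter file must contain.
REQUIRED = {
    "title": ".//tei:titleStmt/tei:title",
    "facsimile": ".//tei:facsimile/tei:graphic",
    "original": ".//tei:div[@type='original']",
    "translation": ".//tei:div[@type='translation']",
}


def missing_parts(xml_source: str) -> list:
    """Return the names of required parts that the letter lacks."""
    root = ET.fromstring(xml_source)
    return [name for name, path in REQUIRED.items()
            if root.find(path, NS) is None]


def letter_title(xml_source: str) -> str:
    """Read the letter-level title from the header."""
    root = ET.fromstring(xml_source)
    return root.findtext(".//tei:titleStmt/tei:title", namespaces=NS)


print(letter_title(LETTER), missing_parts(LETTER))
```

A genuine schema (W3C or Relax NG) would also constrain order, attributes, and nesting; this sketch only shows why keeping everything about a letter in one predictable file makes such checks easy.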

I find it incredible to think that this XML was converted from Word documents and not born digital!

The page even mentions that, “for those interested in technical matters, we provide some sample XML files. In the zip file we also include the so-called ‘ODD’ file which is used to customise the TEI schema and the schema files generated from the ODD-file. We use W3C schema rather than Relax NG because the contractors who performed the conversion to XML were more familiar with that format” (Jansen, Luijten, and Bakker (eds.), 2009).

[Screenshot: one of the sample XML files opened in Word]

The site also describes in detail the software tools that support the project. Overall, I am impressed with the degree of detail provided by the project, and their transparency with regards to the technical process.

Bibliography:

Text Encoding Initiative. (2010). Vincent van Gogh – The letters. http://www.tei-c.org/Activities/Projects/vi02.xml.

Jansen, Leo, Luijten, Hans, and Bakker, Nienke (eds.) (2009). Vincent van Gogh – The Letters. Amsterdam & The Hague: Van Gogh Museum & Huygens ING. http://vangoghletters.org.

TEI for the Divine Comedy Soul: The World of Dante

Through the TEI projects portal, I came across The World of Dante created by the Institute for Advanced Technologies in the Humanities, University of Virginia. The site is dedicated to the study of the Divine Comedy through multimedia research tools such as interactive maps, diagrams, music, and a timeline. The project uses a combination of SGML and XML and states that translators had to make compromises but their hope is that any loss in translation can be made up by the other data available on the site that is not found elsewhere online (Parker, The World of Dante).

The site offers a variety of viewpoints into Dante’s world, but some aspects seem odd and hard to maneuver. For example, the interactive view of Botticelli’s Chart of Hell is at first unclear, as there are no instructions on how to manipulate the map or on what you are seeing (or that could just be me). The timeline is an interesting feature that uses XML and gives documentation and source code for those wishing to build their own timelines. Each book is edited in XML that is indexed and searchable throughout the site. The layout of each canto is Italian on the left and English translation on the right. XML tags are visible in a column on the right and are divided into people, places, creatures, etc. Selecting an option brings up a new window with the tag used displayed in the header and a description of that selection. For example, in Canto 2 of Inferno under the People tag, if Enea is selected, a window opens with the header “Person”, a “Name” tag, a “Description” tag, and a “more information” link. Clicking on the link will show more tags associated with the person, such as gender, nature, etc.
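The people/places lists suggest an index built from tagged names. Here is a minimal sketch of how such an index could be derived, assuming entities are tagged inline with a key attribute. The markup, the line text, and the category names are invented for illustration; the site does not publish its XML, so this is not its actual encoding.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented canto fragment; the line text is placeholder English, not Dante.
CANTO = """<div type="canto" n="2">
  <l n="1">An invented line naming <persName key="Enea">Enea</persName>,</l>
  <l n="2">who travelled to <placeName key="Roma">Rome</placeName>;</l>
  <l n="3">a second mention of <persName key="Enea">him</persName>.</l>
</div>"""

# Assumed mapping from element names to the sidebar categories.
CATEGORIES = {"persName": "people", "placeName": "places"}


def build_index(xml_source: str) -> dict:
    """Map category -> entity key -> list of line numbers mentioning it."""
    index = defaultdict(lambda: defaultdict(list))
    root = ET.fromstring(xml_source)
    for line in root.iter("l"):
        for elem in line:
            category = CATEGORIES.get(elem.tag)
            if category:
                index[category][elem.get("key")].append(int(line.get("n")))
    return index


index = build_index(CANTO)
print(dict(index["people"]))
```

Because the key stays the same even when the surface text changes ("Enea", "him"), every mention lands under one entry, which is what makes the tagged text searchable in a way plain text is not.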

The site offers a detailed description of the editorial process and some of the challenges with translating poetry. It gives some editorial guidelines for tagging and explains what decisions were made about relevant data such as persons, mythical creatures, etc. It mentions the use of software for the project, but does not name it. It also does not make its code available for others to use or view. Overall, the site is a good resource for those looking to engage with The World of Dante, offers open access to literary texts, and makes good use of XML for presentation and searchability of material.

Bibliography:

The World of Dante. University of Virginia. Accessed February 2, 2016. http://www.worldofdante.org/

Week 4: TEI in the Wild – Dictionary Edition

One TEI-using project close to home is the Dictionary of Old English Web Corpus (DOE), which has its physical headquarters on the 14th floor of Robarts. The DOE is a compilation of all surviving Old English texts, some in more than one copy. Each text has been XML-encoded and complies with TEI guidelines. Essentially, the DOE contains all of the surviving vocabulary of the Old English period (around 600-1150 CE). It is searchable in a variety of ways and is one of the best resources in the study of Old English.

Unfortunately, the website gives very little detail about its encoding strategies other than that they are compatible with the TEI-P5 2007 guidelines. It does not make its code available for others.

The project has also not published anything about its methods or challenges. The DOE’s editor, Antonette diPaolo Healey, wrote an article about the move from “manuscripts to megabytes” and the digital tools used by the DOE, yet she does not go so far as to talk about the code behind it.

However, in looking for information on the DOE, I did come across a short paper on the use of XML to create electronic texts of medieval manuscripts. The author goes through a few examples of this, and it showcases a current use for XML. It also has a great title.

That article can be found in the UTL catalogue: Powell, Kathryn. “XML and Early English Manuscripts: Extensible Medieval Literature.” Literature Compass 1 (2003): 1-5. doi: 10.1111/j.1741-4113.2004.00061.x.

Bibliography

DOE website: http://tapor.library.utoronto.ca.myaccess.library.utoronto.ca/doecorpus/index.html

DOE About Page: http://www.doe.utoronto.ca/pages/pub/web-corpus.html

Healey, Antonette diPaolo. “The Dictionary of Old English: From Manuscripts to Megabytes.” Dictionaries: Journal of the Dictionary Society of North America 23 (2002): 156-179. doi: 10.1353/dic.2002.0009.

Week 4: TEI in the Wild – Voices of the Holocaust

Through the TEI projects page, I came across a project at the Illinois Institute of Technology called “Voices of the Holocaust,” which is an “online collection of interviews with Holocaust survivors conducted in the immediate aftermath of World War II.” The project relies on TEI encoding to “provide a structured data model for the transcriptions, which allows various manifestations of the interviews (text, audio)–as well as other types of content (metadata, GIS, scholarly criticism)–to be integrated into a dynamic, robust presentation for the user” (http://www.tei-c.org/Activities/Projects/vo02.xml).

On the “Voices of the Holocaust” website, there is a more in-depth description of its use of TEI (version P5), “which is used to encode not only the text itself, but also the biographical, historical, and geographical metadata related to the transcriptions, interviewees, and content; scholarly commentary and footnotes; and time-code information from the audio files to facilitate text-audio synchronization […] The Glossary of Terms, Glossary of Camps & Ghettos, and GIS data are also stored in TEI XML format; information from these files is included within the interview files using XInclude.” It also states that <oXygen/> XML Editor was used for text encoding, and mentions that built-in support for the TEI schema is included in the program (http://voices.iit.edu/project_notes).
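To show what pulling glossary data into an interview file with XInclude looks like, here is a small sketch using the ElementInclude module from Python's standard library. The interview and glossary contents, the file name glossary.xml, and the in-memory loader are all my own assumptions; the project's real files are not reproduced here.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_NS = "http://www.w3.org/2001/XInclude"

# Invented glossary and interview documents, kept in memory for the sketch.
GLOSSARY = """<list type="gloss"><item>Invented glossary entry</item></list>"""

INTERVIEW = f"""<body xmlns:xi="{XI_NS}">
  <u who="#speaker1">An invented utterance.</u>
  <xi:include href="glossary.xml"/>
</body>"""

FILES = {"glossary.xml": GLOSSARY}


def load_from_memory(href, parse, encoding=None):
    """Loader that serves included documents from the FILES dict."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


def resolve(xml_source: str) -> ET.Element:
    """Parse the interview and replace each xi:include with its target."""
    root = ET.fromstring(xml_source)
    ElementInclude.include(root, loader=load_from_memory)
    return root


root = resolve(INTERVIEW)
print(ET.tostring(root, encoding="unicode"))
```

The benefit of this design is that the glossary lives in one file and every interview that includes it picks up corrections automatically.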

Lastly, the “Voices” website offers a link to a sample TEI XML interview file, which can be found at this URL: http://voices.iit.edu/xml/voth_project_tei_example.xml

All in all, the project is fairly up front regarding its methods of TEI encoding.

Stressing TEI

There are a lot of online scholarly projects that feel like graveyards even when they first launch – at least that’s my experience. The usefulness of some of these sites seems limited, especially in terms of their potential audience. But For Better for Verse is a relatively practical site for all students of poetry to practice scansion. (Maybe my enthusiasm stems from my own difficulties around scansion…) Users of the site can guess where stresses go in a poem – specific to the syllable – and have their guesses checked. As discussed in class, this site uses TEI as a way of tagging something that is not presentational: syllables.

TEI features this site on its “Projects” page, in which an overview of the project’s use of TEI is given:

“The several dozen poems in the site are marked up with TEI P5 coding, especially subsection 6 on Verse. We introduced slight modification of the marking of syllable divisions within a word but chiefly followed the TEI protocols.”

The actual website of For Better for Verse, however, does not contain any of this information (or even a link to the TEI page). The XML is not available to access. The site was developed by the University of Virginia Department of English. Contact information is given, and so it is possible that the contacts may be able to share the source code with anyone interested in the site’s use of TEI.
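Since the site's XML is not available, the following is only a guess at how checking a user's scansion against encoded stresses might work. It assumes the expected pattern is stored in TEI's met attribute on a line element, with "-" for unstressed, "+" for stressed, and "|" for foot boundaries; that notation, the placeholder line, and the function names are my assumptions, not the project's documented encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded line: the text is a placeholder, the pattern is
# what matters. "-" = unstressed, "+" = stressed, "|" = foot boundary.
LINE = '<l met="-+|-+|-+|-+|-+">Placeholder text standing in for a line</l>'


def expected_stresses(line_xml: str) -> str:
    """Return the encoded stress pattern with foot boundaries removed."""
    line = ET.fromstring(line_xml)
    return line.get("met").replace("|", "")


def check_guess(line_xml: str, guess: str) -> list:
    """Return the 1-based syllable positions where the guess is wrong."""
    expected = expected_stresses(line_xml)
    if len(guess) != len(expected):
        raise ValueError("guess must mark every syllable exactly once")
    return [position
            for position, (guessed, actual) in enumerate(zip(guess, expected), start=1)
            if guessed != actual]


print(check_guess(LINE, "-+-+-+-+-+"))
print(check_guess(LINE, "+--+-+-+-+"))
```

The first call reports no errors and the second flags syllables 1 and 2, which is the kind of syllable-level feedback the site gives its users.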

Bibliography:

“For Better for Verse.” Text Encoding Initiative. Accessed February 3, 2016. http://www.tei-c.org/Activities/Projects/fo02.xml.

Week 4: TEI in the REEDs

Records of Early English Drama (REED) is an online “international scholarly project” aimed at “establishing for the first time the context from which the drama of Shakespeare and his contemporaries grew” (Records of Early English Drama). It brings together resources (namely transcriptions of historical documents related to the topic) from all over the world, including U of T, and makes them open and accessible from one online location.

The project is fairly transparent about the construction of the resource through a document it makes available called the “Fortune White Paper.” It discusses their TEI-encoded prototype edition in depth and is accessible from the drop-down menu under Online Resources, then Building EREED. It’s quite a lengthy document, but if you scroll down to section 3 (Editorial Work: Technology) it starts to discuss their servers as well as their use of Oxygen in their TEI work. They describe their records as beginning in Microsoft Word and then being converted to TEI-XML. Section 4 details how they chose to work with TEI’s Guidelines for Electronic Text Encoding and Interchange, based on the eXtensible Markup Language. The document continues to give a detailed breakdown of each decision and method employed in the creation of the markup of the REED project, and even offers a sample segment of code using a random item. It doesn’t offer a downloadable XML file from within this document; however, if one returns to the Building EREED page and clicks the Downloadable Script link, a page is made available with two different downloadable TXT files as well as a schema. The first one is described as “A TXT file of a PERL script, which parses REED document text files, converting REED’s markup into TEI-Lite conformant XML, … [which] populates the REED database”, while the second one notes it was prepared as an experiment and is a “TXT file of a drafted PERL script, which queries the REED database and formats the resulting data as Microsoft Word RTF output” (Records of Early English Drama). I get the sense that these are still examples of how the database works rather than actual samples of XML from the records made available online, but it does seem fairly in depth overall.
Unlike the Folger Shakespeare example, one cannot download the code directly from the page of the record one is viewing, so the level of transparency is not quite on par, but there is a great deal of information prepared and made accessible about the project online.
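The kind of conversion the first script is described as doing (in-house markup in, TEI-style XML out) can be sketched roughly as follows. REED's script is written in Perl and its actual markup codes are not quoted in this post, so the brace codes below are invented stand-ins and the Python is only an illustration of the general approach.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical in-house codes mapped to TEI-style elements.
# REED's real markup is not shown in the post, so these are invented.
RULES = [
    (re.compile(r"\{i\}(.*?)\{/i\}"), r'<hi rend="italic">\1</hi>'),
    (re.compile(r"\{del\}(.*?)\{/del\}"), r"<del>\1</del>"),
]


def to_tei_paragraph(record_text: str) -> str:
    """Convert one record's text from the invented codes to a TEI <p>."""
    xml = escape(record_text)  # protect &, < and > before adding tags
    for pattern, replacement in RULES:
        xml = pattern.sub(replacement, xml)
    return f"<p>{xml}</p>"


converted = to_tei_paragraph("Paid {i}xx s.{/i} & {del}more{/del}")
ET.fromstring(converted)  # raises ParseError if the result is not well-formed
print(converted)
```

Pattern-by-pattern substitution like this is easy to write but fragile with nested or unclosed codes, which may be one reason a conversion of this sort needs checking against a schema afterwards.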

Bibliography

Records of Early English Drama. University of Toronto. 2016. Online. February 2016.

TEI in the Wild: Yellow Fever Commission

The U.S. Army Yellow Fever Commission IMLS Digitization Project by the University of Virginia is an online exhibit that explores the work, historical importance and impact of the U.S. Army Yellow Fever Commission. The homepage for the project is available at http://exhibits.hsl.virginia.edu/yellowfever/. The collection consists of thousands of documents, including handwritten and typed notes, newspaper articles, photographs, miscellaneous printed materials and artifacts, which the University has been digitizing.

The project uses XML with TEI attributes to mark up the resources and save them in TIFF files on CDs. The project has also made digitized, transcribed and marked-up primary materials available online on the website. It provides a rather detailed history of its journey through the digitization process, as well as the mistakes and challenges it faced. The project goes further to share lessons learned and make recommendations for similar projects. More information on the digitization process is available at http://exhibits.hsl.virginia.edu/yellowfever-new/collection-digitization-report-1999-2004/.

The details of its use of XML are not available, but it appears to be based upon the University of Virginia Library’s TEI Encoding Guidelines, which are available at http://dcs.library.virginia.edu/digital-stewardship-services/tei-encoding-guidelines/.