Manus OnLine and the Text Encoding Initiative Schema

The Italian Central Institute of Cataloguing (ICCU) has developed a web application to export manuscript descriptions from Manus OnLine (MOL), the Italian national catalogue of manuscripts, into TEI XML documents. As of June 2013, single descriptions can be downloaded from the MOL Online Public Access Catalogue (OPAC) as TEI XML documents by any user. The aim of the first tool is to supply manuscript descriptions to the institutions which produced them using MOL, because libraries and research organizations need to use their own data outside the MOL project. The aim of the second service, exporting manuscript catalogues of entire collections as TEI XML files, is to promote exchange of data with other national and international projects. For these purposes, the ICCU needs a rich schema, not only a partial set of metadata, in order to be able to export complete structured manuscript descriptions. As bibliographical standards (MARC, Unimarc) cannot represent the entire structure of MOL descriptions adequately, the ICCU adopted the TEI schema as an interchange standard. This paper describes the solutions adopted by the ICCU in encoding MOL contents within the TEI <msDesc> element and the problems encountered in performing this work. The encoding of information about material (<material>) and about the number of folios and physical dimensions of manuscripts (<extent>) is discussed, as well as a list of data pertaining to the cataloguing of original letters.

digital images through MOL using personal computers connected through the Internet to the ICCU main server. Within MOL, manuscript descriptions can be linked to images stored in any digital repository. Moreover, cataloguers share the same authority list of names (Marcuccio 2010; Bagnato, Barbero, and Menna 2009).
Since all the manuscript descriptions are stored directly in the same central repository, as soon as a description is marked as "published" by the cataloguer, it becomes immediately accessible and searchable through the OPAC.
At the time of writing (January 2014), MOL contains about 141,000 manuscript descriptions from 270 different libraries, and 470 cataloguers are involved in the project. The catalogue will be continuously updated and extended. As the Italian manuscript collections preserve the majority of European manuscripts, MOL contains material of outstanding research importance for all periods and disciplines. Greek manuscripts are catalogued as well, for example those belonging to the Biblioteca Trivulziana of Milan, the Biblioteca Riccardiana in Florence, the Biblioteca Angelica, and the Biblioteca Nazionale Centrale (National Central Library) in Rome. Having been created and developed by the Ministry of Heritage and Culture, MOL is free: cataloguers need only a username and a password to begin their work.

MOL and the TEI Schema: the Same Logical Model
MOL and the TEI schema share a logical model which is typical of European scientific manuscript cataloguing practices (Petrucci 2001; Rehbein, Sahle, and Schassan 2009; Fischer, Fritze, and Vogeler 2010). Data considered in paleographical and codicological environments, and as a consequence in manuscript cataloguing, are
• the manuscript identifier;
• the physical description;
• history;
• the contents (texts);
• the bibliography; and
• names of people and corporations who have any responsibility for the manuscript curation and for texts, and names of places where manuscripts were written.
In European manuscript cataloguing practice, the manuscript identifier (composed of the name of the place and library where the manuscript is preserved and of an official shelfmark) is necessary to define every single manuscript. This identifier is linked to a single physical description when the manuscript consists of one unit, or to multiple physical descriptions when the manuscript is composite. Each area devoted to the physical description is linked to one or more texts, because several texts can be copied in the same manuscript or part of a manuscript. Names can be linked both to the physical description (to represent, e.g., owners or copyists) and to texts (to represent, e.g., authors or translators). This traditional conceptual data model inspired the organization of data storage in the MOL database, in which the identifier and the physical descriptions are considered as two separate entities, while texts and names are considered different types of entities. The relationship between the identifier and the physical descriptions in MOL is one-to-n (one-to-many); the relationship between physical descriptions and texts is also one-to-n. The relationship between names (which populate the MOL authority file) and physical descriptions, and between names and texts, is n-to-n.
The TEI schema has the same logical structure as MOL. In fact, following the TEI P5 Guidelines (TEI Consortium 2013: 10.2 The Manuscript Description Element), within an element <msDesc>
• the manuscript identifier is marked with <msIdentifier>, which is mandatory in any description;
• the physical description is marked with <physDesc> (optional);
• history is marked with <history> (optional);
• the contents (texts) are marked with <msContents> (optional);
• the bibliography is marked with <msBibl> within <additional> (optional); and
• names can be inserted at different levels within several other elements.
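As a minimal sketch of the structure just listed, the following Python function assembles an <msDesc> skeleton in which only <msIdentifier> is always present. The function name and its parameters are hypothetical, the TEI namespace is omitted for brevity, and the code is not the ICCU export application.

```python
import xml.etree.ElementTree as ET

# First-level children other than <msIdentifier>, in the order used here.
OPTIONAL_CHILDREN = ("msContents", "physDesc", "history", "additional")


def build_ms_desc(identifier: dict, available: set) -> ET.Element:
    """Return an <msDesc> with a mandatory <msIdentifier> and only
    those optional children for which information is available."""
    ms_desc = ET.Element("msDesc")
    ms_id = ET.SubElement(ms_desc, "msIdentifier")
    for tag in ("settlement", "repository", "idno"):
        if tag in identifier:
            ET.SubElement(ms_id, tag).text = identifier[tag]
    for tag in OPTIONAL_CHILDREN:
        if tag in available:
            ET.SubElement(ms_desc, tag)
    return ms_desc


skeleton = build_ms_desc(
    {"repository": "Example library", "idno": "MS 1"},  # placeholder values
    {"physDesc", "history"},
)
print(ET.tostring(skeleton, encoding="unicode"))
```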
Moreover, at deep levels the TEI schema contains an almost complete set of tags devoted to manuscript description and cataloguing (Barbero and Smaldone 2000; Burnard 2001; Milanese 2007). This is why on several occasions and with different aims the ICCU adopted the TEI schema to exchange data between different electronic repositories.
In the past, between 2004 and 2007, when Manus was still a standalone application, the ICCU was able to export TEI XML files from local Manus databases, which were not accessible through the web, and import them into the online public catalogue (Barbero 2004). At present, the TEI schema is used in MOL with another very important aim. The ICCU wants to supply data to the institutions which produced them. In fact, libraries and research organizations also need to use their own data outside MOL. For example, they need complete data to manage their own online public access catalogues (OPACs) or to populate their local digital libraries. Since institutions produce data on their own, they have the right to access all the data they produce; that is why the ICCU needs a suitably rich schema, not only a partial set of metadata. Bibliographical standards are not adequate for structuring manuscript descriptions in their entirety, because they have a quite different logical structure (Daines and Nimer 2012; Barbero 2013).
In 2012, an export application was developed to export catalogues of manuscript collections from MOL as TEI XML files; moreover, users can download each description from the OPAC in TEI format. The first-child-level elements of <msDesc> defined by the TEI schema, <msIdentifier>, <msContents>, <physDesc>, <history>, and <additional>, appear whenever the MOL database contains the corresponding information. If a manuscript is defined as composite and its parts are described separately in MOL, then the TEI XML document contains as many <msPart> elements as there are codicological sections.
Within these elements, many other TEI elements are used to create a deeply structured XML manuscript description. In several cases, during the design of the export application, it was clear which TEI elements had to be used in encoding the original MOL information. For example, there is no doubt that the name of the library holding a manuscript must be encoded through a <repository> element within <msIdentifier>; the field containing the history of the manuscript, which can be very long in MOL, can be encoded in a <summary> element within the element <history>; the author's name is inserted in the <author> tag within an <msItem>.
Several specic highly formalized data have also been encoded as TEI attribute values.For example, the international code identifying libraries, composed of two letters representing a city and four numbers, is encoded as the value of the canonical @key attribute within <institution>, because this code can associate each library with the records of the national database Anagrafe delle biblioteche italiane: 8 <institution key="LO0020">Biblioteca comunale Laudense</institution> The code identifying a single record inside the MOL authority le of names (composed of CNMN and a number) is encoded as the value of the @key attribute within <name>: <name key="CNMN0000014152">Alighieri, Dante</name> The global @n attribute has been used in <msPart> and <msItem> to encode the numbers of the codicological units and of the texts.These are numbers which are automatically recorded in MOL to order codicological parts and texts in the same sequence as they appear within the original codex: The @type attribute has been used on <title>, and four values have been dened to express whether the title is "attested" in the manuscript by the copyist, "added" by a hand dierent from the main copyist's hand, "elaborated" by the cataloguer, or "identied" as a published work: <title type="elaborato">Divina Commedia.Paradiso</title> <title type="presente"> Seguita la terça Comedia di Dante chiamata Paradiso et capitolo primo</title> ...

The @type attribute is also used on <incipit> and <explicit> to distinguish which part of the text begins or ends with those words: the dedication letter, the preface, the first poem of an anthology, the main text, or any other significant part of the text. The ICCU also uses the @defective attribute on the elements <incipit> and <explicit> to specify whether the incipit or explicit belongs to an acephalous text or to a text lacking its end:
<incipit type="prefazione" defective="false">Apuleius Plato ad cives suos Ex pluribus paucas vires herbarum</incipit>
<incipit type="primoTesto" defective="false">Herba nomine plantago a Graecis dicitur</incipit>
In some other cases, the XML encoding of information stored in MOL needs to be discussed, because different solutions can be adopted in the export process to create valid TEI documents.
Astonishingly, this is the case in the encoding of three types of codicological data which are very common in manuscript descriptions: material, number of folios, and dimensions of the manuscripts.
In fact, there is no consensus on the values of the @material attribute that can be used in the element <supportDesc>. Even if papyrus, parchment, and paper are certainly the most common materials, together with manuscripts of mixed material, different solutions have been chosen to express these data through the @material attribute. The TEI P5 Guidelines (Appendix C) suggest "paper", "parch", and "mixed" as values of this attribute, but ultimately the ICCU decided to follow the important European cataloguing projects Manuscriptorium and e-Codices in using "chart" (paper), "perg" (parchment), and "mixed".
In describing the number of folios, cataloguers usually need to encode the quantity of guard leaves bound at the beginning of the volume, of folios which constitute the volume proper, and of guard leaves at the end of the volume. These are three simple numbers, but the TEI P5 Guidelines (10.7.1.2 Extent) propose different solutions, and different projects have adopted different encoding methods, which hinders the sharing of data.
The project Manuscriptorium, which describes itself as a "European digital library of written cultural heritage," uses arabic numerals directly within the <extent> element; if the codex has guard leaves, their number is distinguished through a + (plus sign): <extent>2+30+2</extent>. As all three numerals (separated by plus signs) are expressed only if a manuscript has guard leaves at both the beginning and the end, and sometimes manuscripts have guard leaves only at the beginning or only at the end, an automatic procedure cannot distinguish the semantic value of the numerals expressed.
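The ambiguity of the plus-sign notation can be illustrated with a short sketch. The function below is hypothetical and assumes that the content of <extent> consists only of integers separated by plus signs; it lists every reading that an automatic procedure would have to consider.

```python
def possible_readings(extent: str) -> list:
    """List the leaf distributions compatible with a string such as
    '2+30+2' (front guard leaves + body + back guard leaves)."""
    parts = [int(p) for p in extent.split("+")]
    if len(parts) == 3:
        front, body, back = parts
        return [{"front": front, "body": body, "back": back}]
    if len(parts) == 2:
        first, second = parts
        # Two numerals: guard leaves may be at the front or at the back.
        return [
            {"front": first, "body": second, "back": 0},
            {"front": 0, "body": first, "back": second},
        ]
    if len(parts) == 1:
        return [{"front": 0, "body": parts[0], "back": 0}]
    raise ValueError("unexpected extent value: " + extent)


unambiguous = possible_readings("2+30+2")
ambiguous = possible_readings("2+30")
```

With three numerals there is a single reading, while a value with two numerals yields two candidate readings, and nothing in the string itself tells them apart.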
e-Codices, the virtual manuscript library of Switzerland, encodes the quantity of folios through non-structured information within a <measure> element contained in the element <extent>; at the same time a structured version of the same information is given as the value of the @n attribute: <extent><measure type="leavesCount" n="114 + 3">114 Bll. + 2 Vor- und 1 Nachsatzbl.</measure></extent>. As @n is a global attribute which should number the element, not express any number contained within the element, this solution is probably not to be recommended.
As the ICCU needs to preserve the structure of the MOL data model as far as possible, and also the semantic distinction between initial and final guard leaves, in the export process the following solution has been adopted:
<extent>
<measure type="Guardieiniziali" unit="carte">2</measure>
<measure type="Corpo" unit="carte">30</measure>
<measure type="Guardiefinali" unit="carte">2</measure>
</extent>
As with material and number of folios, the size of leaves can also be marked with different elements following the TEI P5 Guidelines (10.7.1.2 Extent): with the elements <dimensions>, <height>, and <width>, or with the element <measure>. Manuscriptorium adopts the first solution, while e-Codices uses an empty <measure> element and, again, an @n attribute: <measure type="pageDimensions" n="24.0 x 17.5 cm / 23.3 x 16.5 cm / 24 x 18 cm"/>. As several measurements can be recorded in MOL, and each of them must be related to the folio from which the size was taken, in the export application the following solution has been adopted:
<extent>
<measure n="1" type="height" unit="mm">234</measure> x <measure n="2" type="width" unit="mm">123</measure> <locus>c.1</locus>
</extent>
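One practical consequence of the ICCU encoding is that a consumer can read leaf counts and dimensions back without parsing free text. The sketch below uses Python's standard xml.etree.ElementTree on the two <extent> samples given above; the function names are hypothetical and the TEI namespace is omitted, as in the samples.

```python
import xml.etree.ElementTree as ET

FOLIOS = """<extent>
<measure type="Guardieiniziali" unit="carte">2</measure>
<measure type="Corpo" unit="carte">30</measure>
<measure type="Guardiefinali" unit="carte">2</measure>
</extent>"""

SIZE = """<extent>
<measure n="1" type="height" unit="mm">234</measure> x
<measure n="2" type="width" unit="mm">123</measure>
<locus>c.1</locus>
</extent>"""


def measures_by_type(extent_xml: str) -> dict:
    """Map each <measure> @type to its integer content."""
    extent = ET.fromstring(extent_xml)
    return {m.get("type"): int(m.text) for m in extent.findall("measure")}


def measured_leaf(extent_xml: str):
    """Return the content of <locus>, i.e. the leaf that was measured."""
    locus = ET.fromstring(extent_xml).find("locus")
    return None if locus is None else locus.text


folios = measures_by_type(FOLIOS)
total_leaves = sum(folios.values())
size = measures_by_type(SIZE)
```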
Other problems to be discussed are due to MOL contents and data structure. First, each cataloguing area, for example those devoted to material, date, number of folios, and many others, contains a generic field called "Note," where cataloguers can register annotations in prose about that area.
The ICCU decided not to encode all the MOL "Note" fields in XML with an undifferentiated <note> element, but to record in each <note> element which MOL field it derives from. For this reason each <note> uses the @n attribute to number the element and the @type attribute to encode the subject of the original "Note" field. For example, annotations about the first area of MOL, devoted to the identification of the manuscript, are marked as <note n="1" type="sez01">, while annotations about the second area, devoted to the form of the manuscript, are marked as <note n="2" type="sez02">.
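The numbering convention for notes can be sketched as follows. The two examples above suggest that @n carries the area number and @type the string "sez" followed by the same number padded to two digits; extending that pattern to every MOL area is an assumption made here for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def note_for_area(area_number: int, text: str) -> ET.Element:
    """Build a <note> whose @n and @type identify the MOL area
    (assumed pattern: type = 'sez' + two-digit area number)."""
    note = ET.Element(
        "note",
        {"n": str(area_number), "type": "sez{:02d}".format(area_number)},
    )
    note.text = text
    return note


# Placeholder annotation text, not taken from a real MOL record.
first = note_for_area(1, "Annotation on the identification area")
second = note_for_area(2, "Annotation on the form of the manuscript")
```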
However, even if these data can be encoded in a TEI XML document by creating specific combinations of a generic <note> element with specific attribute values, at least for the data that are commonly recorded in manuscript cataloguing practice (for example pricking, ruling, and ink), the ICCU would recommend the introduction of new specific elements in the TEI schema.

Other codicological data pertaining to manuscript decoration, music notation, and binding, which are analytically recorded in several MOL fields and are expressed through a specialized lexicon, are exported in TEI XML files within a generic <term> element. As with our use of the <note> element, this encoding also uses @n and @type attributes to mark data derived from different MOL fields and to express in this way the specific semantic area of each <term>. For example, a cataloguer can point out the presence of neumes in a manuscript by putting a flag (yes in the database) in the MOL field "Neumatic notation"; such information is exported in the TEI XML document as <term n="4" type="notazione">neumatica</term>. The presence of clasps can be recorded by putting a flag in the MOL field "Clasps," which causes an element <term n="14">fermagli</term> to be created in the TEI XML document.
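The flag-to-term export just described amounts to a lookup table. The sketch below is hypothetical: the table holds only the two mappings quoted above, the record is modelled as a plain dictionary of booleans, and nothing is implied about how MOL stores its flags internally.

```python
import xml.etree.ElementTree as ET

# Only the two mappings mentioned in the text; a real table would be larger.
FLAG_TO_TERM = {
    "Neumatic notation": {"n": "4", "type": "notazione", "text": "neumatica"},
    "Clasps": {"n": "14", "type": None, "text": "fermagli"},
}


def terms_for_flags(record: dict) -> list:
    """Create one <term> element for every flag set to True."""
    terms = []
    for field_name, spec in FLAG_TO_TERM.items():
        if not record.get(field_name):
            continue
        term = ET.Element("term", {"n": spec["n"]})
        if spec["type"] is not None:
            term.set("type", spec["type"])
        term.text = spec["text"]
        terms.append(term)
    return terms


terms = terms_for_flags({"Neumatic notation": True, "Clasps": True})
```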

Letters
The Italian standard for manuscripts adopted by MOL also includes original letters. The information about original letters in MOL consists of the following fields related to the physical description:
• type of letter (letter, postcard, picture postcard, business card…),
• presence of envelope,
• presence of typescript part,
• presence of a printed letterhead,
• presence of original signature,
• presence of manuscript notes.
Such data should be encoded in a <physDesc>, but at present, there are no TEI elements which could be used for this purpose.
The information about the text of original letters recorded in MOL consists of
• folios,
• name of sender,
• name of addressee,
• stage of the text (draft, original, copy),
• place,
• date,
• summary.
Such data should be encoded through an <msItem> within <msContents>. The number of folios can be marked with <locus>; the sender can be encoded as an <author> and the addressee as a name within a specific <respStmt>; but at present the ICCU does not know which elements could be used for the other fields. Close collaboration among different projects and the TEI Consortium would be essential in order to develop an encoding system for original letters compatible with TEI <msDesc> encoding; for example, a new set of tags to be used in the header of correspondence editions could be studied in collaboration by the TEI Manuscripts SIG and the TEI Correspondence SIG.

Figure 1. The OPAC of Manus OnLine with the download button.
European project Rinascimento Virtuale, devoted to the digital publication of Greek palimpsests; on that occasion, the TEI schema was adopted to share manuscript descriptions of palimpsests among all the European partners (Magrini, Pasini, and Arduini 2005). Later, starting in 2007, the ICCU planned and performed harvesting between the CERL Portal, a meta search interface developed by the Consortium of European Research Libraries, and MOL: the ICCU exposes MOL contents on the web, providing OAI-PMH records with light TEI metadata inside, and CERL harvests them.