Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A Cross-Linguistic Account

This paper introduces an integrative and comprehensive method for the linguistic annotation of parliamentary discourse. Initially conceived as documentation for a specific, small-scale research project, the annotation scheme takes national specificities into account and is designed to be both highly standardized and adaptable to other research contexts. The paper presents an application of the Text Encoding Initiative (TEI) framework to a subset of official transcripts of plenary proceedings in three parliamentary cultures. The TEI annotation scheme proposed here has two main applications: first, it serves as a basis for the encoding of parliamentary corpora by providing a systematic way of annotating both elements within the text (e.g. turns, incidents, interruptions) and the metadata associated with it (e.g. variables pertaining to the speaker or the speech event); second, it provides a cross-linguistic empirical basis for further annotation projects.


Introduction: Parliamentary talk as a linguistic object of annotation
Linguistic annotation can be defined as "the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data" (Leech, 2013: 2).
While reflections on linguistic annotation go hand in hand with the development of corpus studies (Ide and Pustejovsky, 2017), we argue that there is still a need for a context-sensitive, fine-grained annotation of parliamentary corpora, specifically in the context of linguistic research. Since linguistic annotation shapes linguistic research (i.e. allows for specific research questions, but also potentially limits the interpretation), we maintain that the issues and decisions pertaining to linguistic annotation are an integral part of linguistic research and should therefore be released with the annotated corpus and made available to the research community together with the data.
This paper addresses this issue by offering methodological reflections on what doing linguistic annotation within the TEI framework means, especially when the annotation serves a small-scale contrastive research project but is geared towards further applications and intended to yield an open and reusable language resource. In other words, what can be learned from cross-linguistic research on parliamentary discourse for other, potentially more comprehensive annotation frameworks for parliamentary discourse? In this paper, we do not propose any analysis or substantive discussion utilising the data. Rather, we intend, through a focus on specialized discourse, to show how and why a reflection on annotation practices belongs to the analysis and is not only a preliminary step.
On this ground, we present an integrative and comprehensive approach for the linguistic annotation of parliamentary discourse on the basis of "small specialised corpora" (Koester, 2010).
We apply the annotation scheme to three electronic corpora based on the stenographic protocols of the British House of Commons, the German Bundestag, and the French Assemblée nationale.
The novelty of this approach is that it integrates three different parliamentary traditions. In order to ensure not only interoperability but also comparison between different parliamentary cultures, we need a common annotation framework flexible enough to accommodate national specificities, yet standardized enough to be valid for, we expect, any type of parliamentary discourse. Based on the Text Encoding Initiative (TEI) Guidelines, the annotation framework offers a structure that is both highly standardized and flexible: it is specific to the methodological and technical difficulties encountered while dealing with this particular type of corpora, and generalizable to other types of linguistic projects that may aim at extending or refining the annotation scheme presented here.
The argumentation proceeds in four steps. First, we explain the rationale behind a cross-linguistic encoding of parliamentary debates. Second, we show why the Text Encoding Initiative is a sustainable, reproducible, highly standardized, yet equally flexible annotation framework well suited to capturing parliamentary interaction. Third, we describe the annotation scheme at the level of the metadata contained in the TEI header, more specifically the variables associated with each speaker (each Member of Parliament in our case). We also detail the annotation scheme at the level of the text, delineating why the transcription of speech vocabulary should be preferred to a more drama-oriented mark-up for parliamentary data. Finally, to encourage further applications of the TEI framework, a last part is devoted to documenting and archiving the data from an open access perspective.

Adopting a contrastive view on the annotation scheme
The corpus annotation and documentation take place in a specific research project focusing on the uses and functions of third-person forms in three communities of practice: the German, French, and British parliaments (Truan 2018, Truan 2021). While the focus of this research project and the reasons for the comparison of these three linguacultures will not be discussed in this paper, it appears necessary to sketch out the context in which corpus building has materialized. We first discuss the focus on parliamentary discourse, then move to the contrastive view underlying the project since its inception. We finally set forth the reasons why the Text Encoding Initiative (TEI) is a robust procedure for encoding parliamentary corpora.
Yet "in spite of the growing visibility of parliamentary institutions, the scholarly interest for the study of parliamentary discourse has been rather low until recently" (Ilie, 2006: 188). In this context, parliamentary debates are increasingly becoming an object of linguistic annotation (see Fišer and Lenardič, 2018 for an overview of CLARIN parliamentary corpora, in which the corpora addressed here are also listed).
While we do not engage in a debate on whether parliamentary discourse is of intrinsic research value, we believe that records of parliamentary interaction constitute a very insightful corpus for linguistic analysis. First, in most Western countries, parliamentary debates are publicly available in several complementary formats: video, audio, text. Hence, the plenary sessions are already transcribed by a team of professional stenographers familiar with parliamentary procedures as well as with the Members of Parliament, which enables the researcher to focus on other levels of transcription and annotation.
The corpus used for the present study relies on the official transcripts of the plenary debates. The differences between stenographic protocols and the parliamentary debates, as well as the problems they raise, have been extensively described for the three countries under investigation (see Slembrouck, 1992; Mollin, 2007 for the House of Commons; Gardey, 2005 for the Assemblée nationale; Olschewski, 2000 for the German Bundestag). Notwithstanding these valid reservations, official transcripts are "a valuable basis to start from" (Zima, Brône, and Feyaerts, 2010: 140) (also see Cribb and Rochford (2018: 13), who speak of "a robust reporting procedure"). Moreover, video recordings are not a panacea since they are highly dependent on the choices made by the cameraperson. In the case of unauthorised turns, verifications with the video recordings are sometimes impossible since the camera focuses on the speaker and very rarely on the co-interlocutors.
Second, being at the interface between spoken and written data, parliamentary discourse gives access to a wide range of discourse features (see Vuković (2012) for a comparison of pre-prepared and spontaneous parliamentary discourse at the House of Commons). Even if official proceedings/transcripts do not adequately capture interactional features of the events such as pauses, hesitations, etc. (see Cribb and Rochford (2018) for an example based on the House of Commons), we adopted a TEI structure based on spoken data rather than drama-oriented data in order to allow for further projects that would take into consideration the interactional nature of plenary sessions. Finally, parliamentary debates display a wide range of speakers over a large time span, thus inviting both diachronic and synchronic sociolinguistic case studies in terms of (expressed) gender, status, or political affiliation (see Burnett and Bonami (2019) for the Assemblée nationale).

Why a new annotated corpus of parliamentary debates?
As sketched above, corpus studies based on parliamentary interaction have become numerous in the last decade. Against this background, what may a new annotated corpus of parliamentary data bring? Why not work with already available parliamentary corpora? While reference corpora such as the Hansard corpus, which consists of British Parliament speeches between 1803 and 2005 (1.6 billion words, 7.5 million talks), would offer statistically robust results with corpus-assisted techniques, they do not give access to the whole co-text2 because of property rights. Moreover, no equivalent corpus for the German Bundestag and the French Assemblée nationale existed when the project started (2015-2016).
It should be noted, however, that several projects involving parliamentary data have been launched in the meantime. The GermaParl R data package, a corpus that includes "all plenary protocols that were published by the German Bundestag between February 1996 and December 2016" (Blätte and Blessing, 2018: 810), has been developed (Blaette 2017). Apart from the fact that the period covered by the corpus is by no means comparable to that available for the British House of Commons, it also raises problems in terms of transcription that will be addressed below. Furthermore, as the authors acknowledge, "[a] thematically specialized corpus […] may offer significantly more detailed metadata and annotation" (Blätte and Blessing, 2018: 810). A provisional version of other annotated French parliamentary debates has also been created (Diwersy, Frontini, and Luxardo, 2018) after the first release of the corpus in November 2016 (see Section 4 for more detail on the platform that hosts the corpora).
Parla-CLARIN 3, a comprehensive project aiming "to develop a TEI customization for annotating parliamentary debates" by "storing and interchanging linguistically annotated corpora of parliamentary data to be used in scholarly research" 4, was launched in 2018, i.e. two years after the release of the corpora in open access. Similarly, "ParlaMint: Towards Comparable Parliamentary Corpora" 5, a project funded by CLARIN, "is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid-2020" 6. ParlaMint was not, however, available when the project started (2015-2016), as it was launched in July 2020. While these projects offer important and valuable sources of comparison, the annotation scheme described in this paper was conceived prior to them.
However, most of the projects presented above differ considerably from the small-scale contrastive project which is the focus of this paper, as they involve teams and infrastructures, whereas the TEI annotation presented here was implemented by one person only (the first author). The annotation scheme used for the analysis (Truan 2018, 2021) thus not only invites extension and possibly revision, but also offers a point of entry for further (doctoral) projects working on small specialized corpora, showing what can be annotated for specific research purposes and with limited means. Moreover, small-scale annotation schemes offer other advantages: for example, the possibility to encode the variable majority/opposition, which has not, to the best of our knowledge, been implemented otherwise (Truan 2019: 45), as it needs to be done manually.
The variety of sources and formats is a strong point in favour of a common annotation framework. All the texts have been retrieved from the official websites of the respective parliaments: http://hansard.parliament.uk/ for the House of Commons; http://pdok.bundestag.de/ for the German Bundestag; http://archives.assemblee-nationale.fr/ for the Assemblée nationale 7. Both the British House of Commons and the French Assemblée nationale display the parliamentary proceedings in HTML, which allows for a quick, easy, and accurate retrieval of the content. The German corpus, on the other hand, is based on PDF files. PDF files are noticeably less adequate for further encoding and tagging. In this case, the files have sometimes suffered from inadequate word breaks, thus necessitating minor corrections.
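The kind of minor correction required by PDF extraction can be illustrated with a short sketch. It assumes, as a working hypothesis not documented in the sources, that the inadequate word breaks take the form of a hyphen at a line end followed by a lower-case continuation; the function name is hypothetical, and any such automatic repair would still need manual checking.

```python
import re

def join_broken_words(text):
    # Remove a line-final hyphen only when the next line starts with a
    # lower-case letter, so that genuine compounds such as "EU-Rat"
    # split across lines are left untouched (assumed heuristic).
    return re.sub(r"(?<=\w)-\n(?=[a-zäöüß])", "", text)

sample = "Der Bundes-\ntag debattiert mit dem EU-\nRat."
print(join_broken_words(sample))
```

The lower-case restriction is a deliberately conservative design choice: it misses some broken words but avoids merging compounds that are legitimately hyphenated.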
We carried out the encoding according to the TEI Guidelines by combining manual and automatic processing workflows, with the aim of preserving both the content and the metadata of the sources. In particular, we used the GROBID software suite 8, which provides a relatively efficient transformation from PDF to a reasonable TEI format, although one that is not fully compliant with the target encoding scheme. Attention was given to unifying the final format across the three languages and parliamentary settings so that the same phenomena and features would be encoded in exactly the same way for each sub-corpus.

Small monolingual corpora as the basis for a cross-linguistic perspective
The rationale behind the constitution of "small monolingual corpora" 9 (Koester, 2010) is to allow for the interaction between statistical measures and a close-reading analysis sensitive to the sociopolitical context in which parliamentary interaction takes place. In order to ensure that external variables that may shape parliamentary talk are accordingly assessed, the research project that builds the basis for the annotation scheme focused on a limited range of national debates concerning a major European Council meeting (see Truan 2021: chap. 4).
Despite their high degree of conventionality, parliamentary debates involve a wide range of different activities (or subgenres) such as ministerial statements, speeches, debates, oral/written questions and Question Time (Ilie, 2006: 191). In order to capture a wide array of speakers and to ensure a thematic continuity, one plenary debate per year held in the British, German, and French national parliaments, respectively, about a major European Council meeting (ex ante, ex post, or on the same day) was selected between 1998 and 2015. As Auel and Raunio (2014: 17) stress, "problematic for the comparative analysis is that identifying EU debates is rather difficult in some parliaments". While the Bundestag and the Assemblée nationale list what they consider to be EU debates on their websites for the current and previous legislative periods, the House of Commons does not provide such information on its website. Search engines do not enable further distinction between the EU being only mentioned in a debate on, say, agriculture on a national level, and the EU being the specific topic covered during the plenary session. For these reasons, and given the fact that European affairs are not the focus of this work, but only a common variable to ensure the comparability of the data, it has been assumed that the European Council meetings offer a baseline against which to collect the national plenary debates.
8 GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications. For a quick overview, see Romary and Lopez (2015).
9 The German corpus is the largest with 417,095 tokens. The British corpus encompasses 188,913 tokens, the French one 137,620.
To increase the reliability of the comparison, the genre of parliamentary debate has therefore been considered a constant variable, together with the focus on European Council meetings. The main purpose was to avoid contrastive analyses based on the languages but disregarding the specificities of a particular culture or institution. Following Krzeszowski (1989: 61), we recognise that "[t]ext-bound CS [contrastive studies] are corpus-restricted" since no systematic generalizations outside the original data are made. Bearing in mind that institutional settings are accordingly more stabilised, routinised and conventionalised than everyday interactions, it can be posited that genres function as an intermediary level of representativeness prior to analysis or as a first step towards the comparison of discourse communities that should be the horizon of expectations of a contrastive discourse analysis (see von Münchow, 2010).
While the annotation scheme described in this paper presents typical features of parliamentary interaction, it also represents a first step towards the integration of contrastive perspectives while developing an annotation framework. The advantage of the comparison pertains to its heuristic value: by reflecting on similarities and differences during the annotation process, we come closer to an architecture valid and applicable to a large variety of linguistic data and metadata (also see Truan 2019 for methodological reflections on doing contrastive discourse analysis).

Preventing the built-in obsolescence of the corpus
In this section, we outline the principles guiding the documentation of the corpus and show how the choices we made are intended to serve general purposes. We argue that annotating corpora cross-linguistically calls for a very flexible annotation framework that allows for multiple, expansible, and evolving annotations that may change over the course of time, a principle that is deeply rooted in the Text Encoding Initiative (TEI). In order for this paper to be received outside the TEI community as well, we first briefly present the TEI Guidelines and show why they are deemed to be appropriate for parliamentary debates. We then link this general framework to what we call a sustainable corpus.

The TEI annotation scheme
The Text Encoding Initiative (TEI, see Romary 2009) has become, since its inception in 1987, the reference technical standard for the representation of textual content in the humanities. Based upon the W3C XML recommendation, it covers a wide range of genres and provides users with a vocabulary of nearly 600 XML elements. At the core of the TEI Guidelines lies the fact that any TEI-based project should define its own subset (or customisation) in which the elements deemed useful for the representational task at hand are selected, documented and possibly amended.
The TEI annotation is used to store the "detailed information about the speakers or writers" (Koester, 2010: 72). Storing this information together with "the goals of the interactions in texts and the setting in which they were produced as part of the corpus database means that linguistic practices can easily be linked to specific contextual variables" (Koester, 2010: 72). The XML-TEI annotation enables researchers to fruitfully visualise the articulation between text and context, i.e. between the plenary session and the metadata associated with it. Interpretative data is situated within the corpus as dedicated TEI elements. As will be detailed in Section 6, the corpus is available under a CC BY 4.0 license, which enables anyone to correct or extend the metadata if necessary.
Based on this general understanding, the annotation framework has been conceived with the contrastive research question in mind: the subset we have defined consists of elements that are deemed equally valid for British, French, and German parliamentary debates. We argue that the cross-linguistic view enables us to take into account national specificities while "emphasiz[ing] what is common to every kind of document", as Burnard (2014) highlights for the TEI. In this sense, and despite the fact that the political context varies over time and between France, Germany, and the United Kingdom, the TEI gives access to a common technical, practical, and methodological framework across the three subcorpora and the three languages.

A sustainable corpus
When designing the TEI-based encoding scheme of our corpus, we were guided by the idea that it should be easy for other scholars to take up in order to carry out various types of research, and that it should allow for possible extension (in terms of coverage) or enrichment (e.g. additional annotated features).
Although we would avoid the term 'reference corpus', which is more applicable to large-scale endeavours to build up a representative sample for a language (see e.g. Kupietz et al., 2010), we strove towards a sustainable corpus that may be combined in time and space with other endeavours to describe language resources in a variety of contexts and for a variety of genres. In this framework, adopting a sampling strategy focused on our research question was not seen as a restriction in the constitution of the corpus. Rather, we saw this strategy as a way to have a better grasp of the parameters for the linguistic analysis and thus the encoding.
With this perspective in mind, the use of the TEI Guidelines as a reference background for the encoding scheme was motivated by the lack of consistency across the various corpora of parliamentary debates available online in their native source representations. As reflected in the corpus overview page compiled by the CLARIN infrastructure 10, existing corpora have been mainly designed on the basis of proprietary formats ranging from flat plain-text representations (Kapočiūtė-Dzikienė, Šarkutė, and Utka, 2017; Clarin:el, 2011) to ad hoc XML vocabularies (Pražák and Šmídl, 2012; Hansen, 2018; Vitali and Zeni, 2007), with even some attempts to define a specific metadata schema for parliamentary debates (Gartner, 2013). This practice can be seen as opposite to the underlying assumptions of the TEI community, which strives towards finding consensus to cover similar use cases 11 rather than ad hoc solutions. Besides, even for those corpora abiding by the TEI Guidelines, there are some strong discrepancies in the actual TEI encoding styles: whereas some (Research Group of Computational Linguistics, University of Tartu n.d.) have used a simple paragraph segmentation for the encoding of turns and associated features, others (Blätte and Blessing, 2018) have considered parliamentary debates as a possible instance of drama, and a third group of researchers (Pančur, Šorn, and Erjavec, 2018) based their work upon the Transcription of Speech module of the TEI Guidelines.
The (internal) debate within the TEI community as to which module can optimally deal with parliamentary corpora, 'Drama' or 'Transcription of Speech', relates to a more essential question: how should parliamentary debates be considered as a scholarly source? Three arguments plead, in our view, for an annotation as transcription of speech rather than drama. First, when designing the annotation scheme, we were quickly set on identifying parliamentary debates as the tangible record of an observable interaction rather than a performance that could be derived from a pre-existing script. Indeed, even if MPs may have notes that they read when intervening in a parliamentary debate, "seul le prononcé fait foi", i.e. the transcription only records what has actually been said.
10 https://www.clarin.eu/resource-families/parliamentary-corpora, accessed on 05.03.2019.
11 As another illustration of this, we can mention what TEI Lex-0 has aimed to achieve in the domain of interoperable lexical data, for which it was granted the Rahtz Prize for TEI Ingenuity in 2020 (see https://www.dariah.eu/2020/11/20/dariah-working-group-on-lexical-resources-wins-innovation-prize/).
Second, even if one could claim, following the theatrical metaphor, that MPs play a role, specifically depending on their relation to the government (majority, opposition) or their specific positioning on certain political issues, we also observe speakers as concrete entities with which we can associate, as we shall see, concrete personal and sociolinguistic markers in the context of a given political speech. In addition, parliamentary debates display a wide range of phenomena pertaining to spoken (multimodal) interactions such as overlaps, interruptions, background noises or applause, which may all be deemed to bear (an interactional, if not political) meaning and thus cannot be equated with blocking, i.e. indications pertaining to the staging of actors in order to facilitate the performance. Furthermore, MPs often depart from the script. (At the British House of Commons, they are not allowed to read a text aloud.) While the resemblance between parliamentary debates and theatre is attested (Ilie 2003), there is always room for improvisation, unplanned reactions, interventions, or comments in parliament. It is true that some of these characteristics may not be transcribed by the official stenographers (see below for a discussion); yet they remain available.
Third, although some parliamentary records appear to be strongly edited and may be seen as very close to written prose or drama (in style or structure), we think it would go against the general striving towards interoperability to adopt, for a subset of the general corpus of parliamentary records, an encoding strategy different from what is needed for more fine-grained transcriptions. As a matter of fact, the tag set for the transcription of spoken language in the TEI Guidelines does not require that all details from the source be encoded, and one can implement, with a very small subset of the corresponding elements, exactly what could be achieved with an encoding strategy based upon the tag set intended for drama.
For these three main reasons, we have adopted a TEI annotation scheme based on the transcription of speech rather than on drama.
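To make the transcription-of-speech choice concrete, the following minimal sketch parses a hypothetical fragment in which turns are encoded as <u> elements pointing to a speaker via @who and non-verbal events as <incident> elements, both taken from the TEI module for transcriptions of speech. The speaker identifiers, the wording, and the incident descriptions are invented for illustration and do not reproduce the corpus itself.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: ids, wording and incidents are illustrative only.
SAMPLE = """
<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div>
      <u who="#speaker_a">I thank the Prime Minister for his statement.</u>
      <incident><desc>Interruption</desc></incident>
      <u who="#speaker_b">Will the right honourable Gentleman give way?</u>
      <u who="#speaker_a">No.</u>
      <incident><desc>Applause</desc></incident>
    </div>
  </body>
</text>
"""

def summarise(xml_text):
    """Return the number of turns per speaker and the incident descriptions."""
    root = ET.fromstring(xml_text)
    turns = {}
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "").lstrip("#")
        turns[speaker] = turns.get(speaker, 0) + 1
    incidents = [d.text for d in root.iterfind(".//tei:incident/tei:desc", TEI_NS)]
    return turns, incidents

turns, incidents = summarise(SAMPLE)
print(turns)
print(incidents)
```

Because turns and incidents are sibling elements in document order, a finer-grained transcription (pauses, overlaps) could later be added without restructuring the existing mark-up.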

Enabling sociolinguistic explorations: The TEI Header
The criteria for documenting the corpus are directly derived from the model sketched out in the first two sections. In this section, we account for two levels of analysis underlying the annotation scheme: first, the TEI header (<teiHeader> element), which stores information related to "the metadata associated with the digital document itself, analogous to the title page of a printed book" (Burnard, 2014); second, the transcriptions of speech within the <text> element itself (for instance, the distribution of turns).

Political speakers: The TEI element <person>
In this part, we describe the metadata attached to the TEI element <person> corresponding to each speaker. In this corpus, the TEI header contains, among others, the metadata (or variables) associated with the environment of the parliamentary debate (organisation, place, date encoded in <settingDesc>, see figure 3) and with the speakers (name, sex, political party, political affiliation, position encoded in <particDesc>, see figure 1) 12.
An important decision was to encode speaker-related information in the header of each document and to associate such descriptions with a group of features relevant for the linguistic analysis of parliamentary discourse. In compliance with the TEI Guidelines, and more specifically its Language Corpora module, such information is situated in the profile description section (<profileDesc>) of the TEI header within the element (<particDesc>) dedicated to the cataloguing of participants in a spoken discourse. Our choice was essentially motivated by the need to find an adequate compromise between two possible strategies: i. on the one hand, localising speaker-related information at the utterance level, with the risk of lacking genericity, introducing redundancy and above all introducing contradictory information throughout the document when annotation is not carried out consistently; ii. on the other hand, grouping all speaker-related information within a global prosopographic document (i.e. an independent digital thesaurus of persons) where each MP would have been identified once and for all, thus preventing a finer-grained analysis accounting for the variation of, for instance, political role over time and across parliamentary debates 13. Crucially, providing the speaker's description at the (local) level of each parliamentary debate or TEI document (i) does not prevent anyone from setting up an external, more comprehensive prosopographic document where all biographic indications (which are to some extent independent of specific political contexts) may be maintained (ii). Referring from the corpus documentation to such a prosopographic document by means of the @corresp attribute on the <person> element is technically simple.
12 For a comprehensive mapping of all the TEI tags used in this work and how they have been applied to parliamentary debates specifically, see [link removed because points to one of the authors].
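The link from a session-level <person> to an external prosopographic record via @corresp can be sketched as follows. The file name prosopography.xml, the identifiers, and the <birth> value are hypothetical, and the external document is held in memory here rather than read from disk; this is an illustration of the mechanism under these assumptions, not the corpus's actual set-up.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical session header: the political role stays local to the debate.
SESSION = """
<particDesc xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="speaker_a" corresp="prosopography.xml#p0001">
      <persName>Speaker A</persName>
      <floruit>majority</floruit>
    </person>
  </listPerson>
</particDesc>
"""

# Hypothetical external document holding context-independent biographic data.
DOCUMENTS = {
    "prosopography.xml": """
<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="p0001">
    <persName>Speaker A</persName>
    <birth when="1960"/>
  </person>
</listPerson>
"""
}

def resolve(person, documents):
    """Follow @corresp ('file#id') to the matching external <person>, if any."""
    target = person.get("corresp")
    if not target or "#" not in target:
        return None
    doc_name, fragment = target.split("#", 1)
    root = ET.fromstring(documents[doc_name])
    for candidate in root.iter(f"{{{TEI}}}person"):
        if candidate.get(XML_ID) == fragment:
            return candidate
    return None

local = ET.fromstring(SESSION).find(".//tei:person", TEI_NS)
external = resolve(local, DOCUMENTS)
print(external.find("tei:birth", TEI_NS).get("when"))
```

Each session document thus remains readable on its own, while the pointer makes the fuller record reachable when it is needed.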
Our documentation strategy has been determined by our basic decision to fragment parliamentary debates within our corpus into document units corresponding to plenary sessions, with the additional advantage of optimising the maintenance of the corresponding information within our corpus at large (e.g. allowing other researchers to easily complement the corpus with additional sessions, as independent TEI documents), as well as facilitating cross-session analysis. Hence, each XML-TEI document corresponds to one plenary debate as a communicative unit, i.e. a given spatiotemporal unit bound to a specific situation in which a group of given participants discusses a given topic (Kerbrat-Orecchioni, 1990: 216), thus making the text the proper linguistic object under investigation 14.
We have chosen to identify the speakers in each debate in the corresponding header and not in each utterance (or prior to each utterance) for three main reasons: i. it allows for a better readability of the TEI document at first glance since the metadata associated with each speaker is not interspersed in the text, where it would potentially be hard to retrieve (see the ode to simplicity in the next section); ii. it ensures the consistency of the parameters applied to each speaker since the list of the speakers attending a specific plenary debate is given at the beginning; iii. it makes it possible to develop and extend the metadata associated with each speaker if necessary by changing the TEI header only once, and not every time a speaker produces a new turn.
13 There is a trade-off here as to how much speaker-related information should be localized with the parliamentary debate as opposed to being grouped in a prosopographic document. We expect our encoding to reflect the need to make each plenary debate an autonomous object not requiring constant back-and-forth access to an external authority document.
14 But note that one session can last more than one day, i.e. can be split.
Usually, the <id> of a speaker corresponds to the last name. In case of speakers sharing their last name with another speaker of the corpus, as is the case here, the first name has been added. Another option could have been to add the date of birth for each speaker.
The first group of features attached to the description of an MP within a plenary debate corresponds to stable (or very rarely varying) characteristics pertaining to the identification of the speakers according to long-term properties such as name (<persName>), sex (<sex> 16) and nationality (<nationality>). Knowledge of the <sex> content allows for simple comparisons such as length of speeches by gender (see Truan 2021: chap. 4).
The second group of features is more specific to each plenary debate and corresponds to the characteristics borne by the speakers from a political point of view: these are their political affiliation (<affiliation>), their relation to the current government (<floruit> 17, with values majority and opposition) and the constituency in which they have been elected (<residence>).
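The two groups of features can be read from a <person> element as in the following sketch. The element names are those listed above; the identifier and all values (party, constituency) are placeholders invented for illustration, and the split into "stable" and "session" keys is merely a convenient way of mirroring the distinction drawn in the text.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical speaker description; all values are placeholders.
PERSON = """
<person xmlns="http://www.tei-c.org/ns/1.0" xml:id="speaker_a">
  <persName>Speaker A</persName>
  <sex>female</sex>
  <nationality>German</nationality>
  <affiliation>Party X</affiliation>
  <floruit>opposition</floruit>
  <residence>Constituency Y</residence>
</person>
"""

STABLE = ("persName", "sex", "nationality")            # long-term properties
SESSION_BOUND = ("affiliation", "floruit", "residence")  # may vary per debate

def read_person(xml_text):
    """Return the speaker id and the two feature groups as plain dictionaries."""
    person = ET.fromstring(xml_text)

    def text_of(tag):
        element = person.find(f"tei:{tag}", TEI_NS)
        return element.text.strip() if element is not None and element.text else None

    return {
        "id": person.get(XML_ID),
        "stable": {tag: text_of(tag) for tag in STABLE},
        "session": {tag: text_of(tag) for tag in SESSION_BOUND},
    }

record = read_person(PERSON)
print(record["stable"])
print(record["session"])
```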
This approach allowed us to look into the corpus through variables that have not, as far as we know, been consistently integrated into the corpus-based and corpus-driven analysis of parliamentary discourse so far. We could gain insights into the relationship between opposition and majority in terms of person reference that would otherwise have remained hidden. For instance, referring to certains ('some'), for a member of the UMP (Conservatives in France), is likely to denote the Communists at the Assemblée nationale (see Truan 2021: chap. 7). Building categories of discourse participants is closely intertwined with the speaker's construal of who is included and who is excluded. Such a finding could only be attained by exploring the correlation between linguistic forms and manually encoded variables in the form of TEI constructs.

16 Although the TEI documentation reports on a @value attribute to normalize the corresponding content of the <sex> element, it does not provide a real standardized set of values as reference (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-sex.html, accessed on 29.04.2019). We thus discarded this attribute in our encoding, but we used normalized values within the corpus (male/female/none).

17 Although the description of the <floruit> element in the TEI Guidelines may suggest that <floruit> should remain limited to the description of a time span, it appears acceptable to extend this description to the nature of the activity of the person in the given time span, especially when this activity may change over time, as is the case for the variable majority/opposition in the political sphere.
The annotation framework was geared towards the coding of external variables (or metadata) that had only rarely been taken into account before the first release of the corpus in November 2016, such as the variable majority/opposition or the grouping of parliamentary groups into broader categories, e.g. PDS/Die Linke coded as "Far Left" (see the use of <trait> in figure 1; for some observations on the variable majority/opposition in a Norwegian corpus, see Lapponi and Søyland, 2016; Lapponi et al., 2018).
Although we have not encountered this situation in our corpus, it should be noted that even in the last group of features, a change can happen within a given debate, for instance when an MP changes sides. Such a scenario is attested by the creation of The Independent Group (TIG) in February 2019 18. In such cases, the flexibility of the TEI toolkit would allow for a meaningful representation, notably through the use of temporal attributes, as exemplified in figure 2.

18 https://news.sky.com/story/live-speculation-more-mps-will-quit-to-join-independent-group-11642586, accessed on 16.03.2019.

As shown previously, each parliamentary debate constitutes a specific speech event taking place at one time and in one place. The speech event constitutes a macro frame in which speakers, who alternately become hearers as well, produce several turns. The contextual description of the speech event must thus contain the basic features that enable a user of the corpus to situate each utterance within a precise geo-temporal environment, but also to understand the broader political context.
The TEI Guidelines provide a suitable construct to do so within the TEI header by means of the <setting> element within the <settingDesc> element, whose usage we have adapted to match our purposes. As illustrated in the example below, we have described the following features attached to a parliamentary debate: i. the name of the organisation (<orgName> 19) where the debate is taking place, namely the corresponding national parliament (for this corpus: House of Commons, Deutscher Bundestag and Assemblée nationale); ii. the actual date of the debate (<date type="parliamentaryDebateDate">), both as recorded in the original transcript and normalized according to the ISO 8601 standard (yyyy-mm-dd); iii. the name of the head of government in place (<persName> 20), so that the debate can easily be related to a wider political context. We adopted a complementary numbering marker (e.g. Blair I) to signal successive governments with the same leader; iv. the actual legislative session (<name> 21) within which the debate is taking place.
In addition to these generic political parameters, we added two specific descriptors intended to provide information about the European debate per se. In doing so, we pursued our general encoding strategy and reused existing elements from the TEI Guidelines while slightly adapting their semantics as TEI components. For the description of the main topic(s) of the European Council meeting that the national parliament is debating, we used the <activity> element. For the description of the place where the European Council meeting took place, we used the <locale> element. These two choices would probably be the least consensual ones if we were to carry out a wider dialogue with the scientific community on the standardisation process and the encoding of parliamentary debates.
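The six descriptors listed above can be gathered into a single record with standard-library Python. The following is only a minimal sketch under stated assumptions: the fragment is hypothetical and is not the authors' figure 3; the use of a @when attribute for the normalized date is our assumption; the label "Cameron I" merely imitates the "Blair I" pattern; the session, topic and place values are explicit placeholders; and read_setting is a helper name introduced for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical <settingDesc>: element names follow the text above. The @when
# attribute, the "Cameron I" label and all placeholder values are assumptions.
SETTING = """
<settingDesc xmlns="http://www.tei-c.org/ns/1.0">
  <setting>
    <orgName>House of Commons</orgName>
    <date type="parliamentaryDebateDate" when="2012-10-22">22 October 2012</date>
    <persName>Cameron I</persName>
    <name>placeholder legislative session</name>
    <activity>placeholder topic of the European Council meeting</activity>
    <locale>placeholder meeting place</locale>
  </setting>
</settingDesc>
"""


def read_setting(xml_text):
    """Collect the descriptors of a debate's setting into a flat dictionary."""
    setting = ET.fromstring(xml_text).find(TEI + "setting")
    debate_date = setting.find(TEI + "date")
    return {
        "parliament": setting.findtext(TEI + "orgName"),
        "date_as_printed": debate_date.text,
        # fromisoformat raises ValueError if the value is not yyyy-mm-dd.
        "date_iso": date.fromisoformat(debate_date.get("when")),
        "government": setting.findtext(TEI + "persName"),
        "session": setting.findtext(TEI + "name"),
        "council_topic": setting.findtext(TEI + "activity"),
        "council_place": setting.findtext(TEI + "locale"),
    }


setting = read_setting(SETTING)
print(setting["parliament"], setting["date_iso"].isoformat())
```

Parsing the normalized date into a date object, rather than keeping it as a string, gives a cheap consistency check on the ISO 8601 values across all session documents.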

Encoding the content
In this section, we present the decisions pertaining to the turn level (5.1) as well as the intra-turn level (5.2). Importantly, we do not address other levels of annotation, such as word-level annotation, that could have been subjected to TEI markup as well. For the purpose of this project, the results are based on automatic part-of-speech tagging: the open source software TXM 22 (Heiden, Magué, and Pincemin 2010) used in this project performs language-specific part-of-speech tagging when a corpus is imported.

22 Available at http://textometrie.ens-lyon.fr/?lang=en (accessed on 05.02.2021).

5.1. The representation of spoken political discourse: The turn level

Utterances/Turns
The <u> element (with gloss utterance) in the TEI Guidelines potentially covers any kind of linguistic segmentation in a transcription of a spoken sequence, as long as the segment can be attributed to a single speaker. For the purpose of encoding parliamentary debates, we decided to adopt a narrower interpretation of this element and considered that it should represent a turn in the standard linguistic sense. Turns are a superficial unit pertaining "to the surface structure of conversation" (Kerbrat-Orecchioni, 2004: 8), since they solely indicate a change of speaker. The reason behind this choice is to account for the essentially monological nature of parliamentary interaction, so that a specific speaker's intervention can be easily identified and distinguished from the preceding and following turns of other MPs.

<u who="#ROBERTSON-ANGUS">On the question of European enlargement and immigration

Interruptions
Diwersy, Frontini, and Luxardo (2018) observe that the descriptor "speech type (debate, interruption, vote explanation, etc.)", which is not given in our corpus annotation, proves to be "particularly important when it comes to differentiate effects of register variation ranging from highly formulaic to less formal speech (as in the case of e.g. interruptions)". The main reason for not annotating this level of analysis is, once again, to be found in the contrastive perspective we adopt. Whereas interruptions are thoroughly transcribed in the official records of the Bundestag and the Assemblée nationale, enabling new research questions on the special kind of dialogue emerging during these interactions, unexpected or unauthorised turns in the British parliament are only indicated as 'interruption', with no further information provided on the nature, source or content of the disruption, as in (1):

(1) Mr. David Cameron (Tories) [majority]: There is a case for saying that the institutions that Europe put in place after the second world war and I would include NATO as well as the European Union have played a role in making sure that we settle our problems around conference tables rather than on the fields of Flanders. To that extent, yes, I think that it is right. Interruption Someone says, "Why not go?". (UK 2012.10.22)

Although the co-text sometimes gives insights into what kind of 'interruption' was at stake (and although the video recordings are available online), it is clear that transcription practices (to name only one factor) have a considerable impact on contrastive research overall. For statistical purposes, it appeared more suitable to encode changes of speaker without discriminating between unauthorised and authorised interventions, which enables us to automatically retrieve all the utterances of a given speaker.
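The automatic retrieval of all the utterances of a given speaker can be sketched with standard-library Python. This is only an illustration under stated assumptions: the <body> fragment, the speaker identifiers and the turn contents are invented placeholders, not corpus data; turns_by_speaker is a hypothetical helper name; and we assume that every turn is a <u> element carrying a @who pointer, as in the example given in the Utterances/Turns section.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical transcript fragment: speakers and contents are placeholders.
BODY = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <u who="#SPEAKER-A">First placeholder turn.</u>
  <u who="#SPEAKER-B">Placeholder interjection.</u>
  <u who="#SPEAKER-A">Second placeholder turn, slightly longer.</u>
</body>
"""


def turns_by_speaker(xml_text):
    """Group the text of every <u> by speaker, in document order."""
    turns = defaultdict(list)
    for u in ET.fromstring(xml_text).iter(TEI + "u"):
        # @who is a pointer such as "#SPEAKER-A"; drop the leading "#".
        speaker_id = u.get("who", "").lstrip("#")
        # itertext() also collects text nested in child elements;
        # split/join normalizes line breaks and repeated spaces.
        text = " ".join("".join(u.itertext()).split())
        turns[speaker_id].append(text)
    return dict(turns)


turns = turns_by_speaker(BODY)
for speaker_id, texts in turns.items():
    # Whitespace-separated word count, a rough proxy for speech length.
    print(speaker_id, len(texts), sum(len(t.split()) for t in texts))
```

Since authorised and unauthorised interventions are encoded alike, one pass over the <u> elements is enough; the resulting speaker identifiers can then be matched against the header to compare, for instance, majority and opposition.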

5.2. Segments and quotes: The intra-turn level

Finally, we had to resort to the very generic <note> element to mark up additional commentaries present in the transcripts of the debates and usually added by the parliamentary clerks:

<note>Official Report, 15 January 2014, Vol. 573, c. 11MC.</note>

For the purpose of our corpus, we have not fully used the richness of the Transcriptions of Speech module of the TEI Guidelines, as described in Schmidt (2011). This is due both to the specific scope of the linguistic study that we were pursuing and to the actual informational simplicity of the available sources. Still, our choice of this module offers the possibility of a variety of potential enrichments, either by ourselves or by anyone who would want to further complement the corpus. The possibility of precisely aligning, by means of a timeline, the various turns, sub-segments or any kind of incident offers the potential for better insight into the nature of the interactions carried out in parliamentary contexts, for instance from a prosodic or gestural point of view.

Documenting and archiving the data
The Text Encoding Initiative has been, right from the outset, the basis for a strong open science vision, in which interoperability is at the service of sharing and reusing digital content encoded according to the TEI Guidelines (see Romary, 2020 for an overview). For this reason, we provide here an overview of our efforts to make the corpus FAIR (Findable, Accessible, Interoperable and Reusable; Wilkinson et al., 2016).
As already alluded to, the corpus has been designed with the idea that it could be easily reused and complemented by others. It thus appeared coherent to adopt a completely open distribution setting for it, by releasing it on the Ortolang platform (https://www.ortolang.fr/). Ortolang combines several important technical features: i. specialisation in linguistic data, with the possibility to attach several linguistic descriptors (language, genre, source type, etc.) to the corpus itself; ii. provision of unique identifiers for the resources; iii. long-term archiving of all uploaded resources; iv. version management, which allows corrections and improvements to the corpus to be published while keeping the same underlying digital identity; v. precise identification of the various contributors to a resource; vi. linking of resources with open licences, in our case a Creative Commons licence requiring proper attribution to the authors (CC BY 4.0), recorded in <publicationStmt>; vii. finally, the possibility to add an XSLT stylesheet to the corpus to provide a default search and presentation environment (in HTML).
Beyond the technical setting, we conclude with dissemination issues that, in our view, are an essential part of the annotation project. First, we considered that beyond seeing the corpus as reusable (linguistic) content, presenting the annotation framework as an ongoing process could also serve as a methodological point of comparison for other comparable endeavours. As a consequence, we decided to distribute all the source documents rather than limiting access through, e.g., a query interface, as is the case for the EuroParl corpus. Second, although there are often fears of being plundered when data is disseminated at too early a stage in a research process, the author who compiled the corpus as part of her dissertation project took the decision to put the data online even before the actual doctoral publication was available. 23

23 One of the points of contention could be that the dissemination through Ortolang is not a fully open source project such as GitHub. We see GitHub, which is a private platform and thus does not fulfil all our criteria of a sustainable environment, as a possible front end for the further development of a corpus such as ours, while keeping an environment such as Ortolang as the final publication setting.

The three corpora are available online at the following addresses: hdl.handle.net/11403/uk-parl for the British corpus;
These links are dynamic persistent identifiers that always reference the latest published version of the sub-corpora; thus no specific date of access is provided for them. The corpus (or rather the three sub-corpora) was released online in November 2016.

Conclusion
This paper suggested an integrative and comprehensive approach to the linguistic annotation of parliamentary discourse that takes into account national specificities and is specifically geared to proposing an annotation scheme that is both highly standardized and adaptable. The method is based on the Text Encoding Initiative (TEI) framework. We argued that the linguistic features of parliamentary interaction call for an annotation scheme distinct from the ways theatrical plays have been accounted for within the TEI community. We also pleaded for a cross-linguistic annotation framework that is easily reproducible. Specifically, we have shown that metadata such as 'political affiliation' or the distinction between majority and opposition are crucially needed in order to allow for the comparison of several parliamentary systems.
We understand this paper as a first step towards the annotation of parliamentary corpora on a larger scale. We recognize that the small size of the corpora (from approximately 137,000 tokens for the French corpus to 417,000 for the German corpus) allowed for a fine-grained annotation that may be more difficult to implement on a larger scale. Accordingly, the application of this annotation scheme to a bigger corpus needs to be systematized. On the other hand, it would also be possible to further complement the detailed annotation scheme, for instance by providing timestamps and hyperlinks to the videos, as suggested by Cribb and Rochford (2018: 13), in particular "so that a user at a particular point in the report can link through to the audio recording effortlessly and accurately". A closer linkage between the videos and the transcripts could also lead to an insightful annotation in terms of kinesics, a dimension which, arguably, would adequately complete a close-reading discourse-analytic endeavour.
These further extensions and exploitations of the annotated corpora are at the core of our understanding of annotation as a process rather than a finished product (also see Bucholtz, 2000 for a similarly reasoned argument in terms of "the politics of transcription"). Making decisions explicit, transparent, and replicable is a primary prerequisite for doing science in the digital age.
We hope to have shown that the annotation scheme developed in this project is only a first step.

Figure 1: Example of a speaker's description entry in the TEI header of a session document

Figure 2: Exemplifying a change in political party within a plenary debate

Figure 3: Example of a session's description entry in the TEI header of a session document

Since the release of the corpora in open access in November 2016, the corpora (including their documentation) have been downloaded (partly or as a whole) 111 times for the British corpus, 43 times for the German corpus, and 373 times for the French corpus. The corpora are listed on the website of CLARIN 24. Moreover, they have been discussed in several other annotation projects (see for instance Blätte 2018 or Diwersy, Frontini, and Luxardo 2018) and used as a comparison corpus for other research projects (see for instance Stefanowitsch 2019). These examples show that an early release (November 2016) of the annotated corpora together with their documentation in open access, long before the dissertation project was submitted (October 2018), is beneficial to the community.