A French text-message corpus: 88milSMS. Synthesis and usage

In this article, we first briefly summarise the sud4science project and data collection (http://sud4science.org), the ensuing processing/analysing stages, and the resulting corpus, 88milSMS (http://88milsms.huma-num.fr), through a synthesis of quotes and references to previous articles (§ 1). Secondly, we provide a state of the art on some research initiatives that use 88milSMS in various domains and frameworks, which will enable future cross-disciplinary insight (§ 2). Then, we present other usages of the 88milSMS corpus that we identified through surveys (§ 3). Finally, we suggest future paths for textual data collection and analysis.

In our previous work (Panckhurst et al. 2016b), we described the methods used to collect, pre-process, and publish the data of the sud4science project. This paper discusses and analyses the use of the 88milSMS corpus obtained in the context of our project.
In this article, we first briefly summarise the sud4science data collection, the ensuing processing/analysing stages, and the resulting corpus, 88milSMS, through a synthesis of quotes and references to previous articles (§ 1). Secondly, we provide a state of the art on some research initiatives that use 88milSMS in various domains and frameworks, which will enable future cross-disciplinary insight (§ 2). Then, we present other usages of the 88milSMS corpus that we identified through surveys (§ 3). Finally, we suggest future paths for textual data collection and analysis.

From sud4science to 88milSMS
This section provides a schematic synthesis of both the text-message data collection project sud4science (http://sud4science.org), which was part of the sms4science international initiative (http://www.sms4science.org), and the data processing to compile the resulting 88milSMS corpus. A more in-depth project description and analysis is provided in Panckhurst (2017: 185-235). Exhaustive references to the data-collection project and ensuing corpus can be consulted online.

Data collection
In 2011, over 88,000 authentic French text messages were collected during a 13-week period from the general public in Montpellier, France (Panckhurst et al. 2013, Panckhurst et al. 2016b) and SMS 'donors' were also invited to fill out a sociolinguistic questionnaire (Moïse 2013, Panckhurst and Moïse 2014).
Figure 1 provides quantitative results on the sud4science text-message data collection (number of SMS, characters, words, donors, smileys/emoticons, emoji) and sociolinguistic questionnaire (donor gender and age, telephone type, monthly plan, education level, etc.) (cf. Panckhurst et al. 2013: 109-111, for more detail). After the sud4science SMS data collection took place, there was a pre-processing phase of checking and eliminating any spurious information (including duplicates, advertisements, messages from telephone operators, etc.) (cf. Panckhurst et al. 2014b and 2014c for general explanations, details and advice).

Anonymization
An anonymization phase was conducted (Accorsi et al., 2014, Patel et al., 2013), owing to legal requirements for data-protection of private data (Ghliss and André, 2017). This involved anonymizing names, telephone numbers, places, brand names, addresses, codes, URLs (see Fig. 2 for precise tags and occurrences and § 2.2 for more detail on the semi-automatic software procedure).
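The kind of tagging described above can be illustrated with a minimal Python sketch. It is not the Seek&Hide software: the tag names (`<NAME>`, `<TEL>`, `<URL>`), the two tiny word lists and the regular expressions are all assumptions made for illustration, and the precise tags used in the corpus are those of Fig. 2. The sketch loosely mirrors a semi-automatic procedure in which some items are replaced automatically, some are left alone, and ambiguous ones are set aside for a human annotator.

```python
import re

# Hypothetical resources: tiny stand-ins for real dictionaries.
FIRST_NAMES = {"nicolas", "nico", "niko"}
# Words that can be a first name or a common French word: a human decides.
AMBIGUOUS = {"rose", "pierre"}

# Assumed patterns for URLs and French 10-digit telephone numbers.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")


def anonymize(sms):
    """Return (anonymized text, items left for manual review)."""
    text = URL_RE.sub("<URL>", sms)
    text = PHONE_RE.sub("<TEL>", text)
    out, review = [], []
    # Split into word and non-word runs so the text can be rebuilt exactly.
    for piece in re.findall(r"\w+|\W+", text):
        key = piece.lower()
        if key in FIRST_NAMES:
            out.append("<NAME>")   # a) automatically anonymized
        elif key in AMBIGUOUS:
            out.append(piece)      # c) kept, but flagged for the annotator
            review.append(piece)
        else:
            out.append(piece)      # b) ignored
    return "".join(out), review


text, review = anonymize("Nico appelle moi au 06 12 34 56 78, Rose vient aussi")
print(text)
print(review)
```

An exact dictionary lookup of this kind does not catch creative spellings of a name, which is one reason a manual review step remains useful.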

Transcoding and annotation
Before disseminating the constructed corpus, we explored the possibilities of "transcoding" raw text messages into standardized French and linguistic "annotation". Concerning the terminology, we chose to define these terms as follows:
"[Transcoding] can be defined as converting from one form of coded representation to another. This allows to discriminate between oral speech (to written) 'transcription' techniques and written (to written) 'transcoding' ones, such as SMS data. From a linguistic point of view, one can also use the mainstream 'standardization', a synonym that we indeed used previously, along with 'normalization', which we prefer to use when faced with computational linguistics matters (Lopez et al., 2014)." (Panckhurst, 2016: 3).
"Linguistic annotation of SMS data for the 88milSMS corpus [is] 'interpretative' linguistic information indicated via appropriate tags [see below] related to the difference between a 'raw' text message and its transcoded equivalent in standardized French. [We decided not to include] lemmatisation or part-of-speech (POS) tagging […], which do indeed also correspond to other methods of linguistic annotation (based mainly on providing lexico-morpho-syntactic information)." (Panckhurst, 2016: 5).
[…] Mark-up initiatives should not be imposed upon researchers; it seems more relevant to let them conduct their own annotation bearing their specific scientific questioning in mind, without being trapped within a unique theoretical framework. Another alternative is that researchers may of course prefer to provide both 'raw' and tagged corpora: "Dissemination will take two different forms: one version of a corpus with the 'raw' text without any tokenization and annotation (v1), and a second version of the same corpus with the annotations (v2)." (Chanier et al., 2014, p. 2). For instance, Riou and Sagot (2016) present morpho-syntactic tagging of a specific corpus within the French CoMeRe corpora repository (v2), following on from a previous version without it (v1). (Panckhurst 2016: 7-8).

Language Sciences perspectives
9. Textual and graphical 'softeners' are frequently used to decrease ambiguity, and/or aid interpretation;
10. Emoticons :-^^ <3 are common and emoji (e.g. U+1F382) are sometimes included.
Other research projects include further findings on evolving writing practices: "[In sms4science] Cougnon (2015) showed that there was little difference between generations concerning linguistic practices; for example, there are no differences regarding words borrowed from other languages, or regionalisms. Cougnon and Draelants (2018) also showed that all generations find that respecting norms in writing conventions is very important. In terms of spelling and syntax, there are more subtle variations: verb tenses and modes are more often problematic for the young and informal question forms and negations which suppress the "ne" particle are also more apparent, which shows young people communicate in a more informal manner, but not in an incorrect one. Cougnon et al. (2017) compared dictations and writings over a 100-year period, and their study shows that today's younger generations are in actual fact better at writing essays, using connectors and expressing ideas as compared to young people of yesteryear." (Panckhurst & Cougnon, 2019)
A number of recent Master's and PhD dissertations make it possible to pursue further in-depth linguistic (André 2017, Cougnon 2015, Guryev 2017, Morel 2017, see below) and NLP analyses (Kogkitsidou 2018, Tarrade 2017, Zenasni 2018, cf. § 2.2) of French SMS and instant-message writing.
A French text-message corpus: 88milSMS. Synthesis and usage. Corpus, 20 | 2020
By manual linguistic analysis of ~10,000 authentic text messages in French, from corpora collected in the sms4science project including 88milSMS (Belgium, Reunion Island, Switzerland, Quebec and southern France), André (2017) shows that SMS writing is aimed at personal appropriation of the graphic code, without orthographic standards systematically declining. He stipulates that SMS writing reveals identity, in terms of scriptors' relationship to writing and their ability to adapt their discourse. The study also indicates that SMS writing can sometimes present characteristics that account for the existence of a strong link between graphic code and cognitive oralisation of a message.
Cougnon (2015) conducts detailed linguistic analyses of over 50,000 text messages collected from around the world within the sms4science project, including: language switching, neologism usage, regionalisms. She also provides descriptive and inferential statistics which give insight into modern trends of SMS-writing linked to sociodemographic variables (age, sex, education, etc.).
In his PhD dissertation, Guryev (2017) provides an analysis of the syntactic variation of French interrogative structures in Swiss spontaneous electronic interaction (instant messaging, texting, WhatsApp, etc.). He postulates that under the pressure of various linguistic and non-linguistic constraints, the SMS writer chooses the particular variant which allows him/her to best achieve given communicative goals. In order to identify different types of constraints or factors that may influence the choice of variants, a multidimensional analysis model is applied which focuses simultaneously on grammatical, interactional and sociolinguistic parameters.
Morel (2017) analyses plurilingual practices within the Swiss sms4science.ch corpus (both SMS and WhatsApp) with French as a main language. The research focuses on three levels of regularity of plurilingual texting, i.e. (1) linguistic, (2) sociolinguistic, and (3) interactional. His PhD provides a detailed account of a previously unexplored pattern of plurilingualism.

NLP and Data Mining approaches
As specified in § 1.1, the NLP dimension of the project allowed initial processing of the data collection, in particular with the 'Seek&Hide' student software for anonymization (Accorsi et al. 2014, Patel et al. 2013), and 'AlignSMS', a student alignment prototype for transcoding/normalizing French text messages (Lopez et al., 2014). Next, the focus was on classifying 'unknown' non-standard items (INSO) (Lopez et al. 2015) in text messages, thus helping to automatically identify lexical creativity in 88milSMS, which in turn may increase and improve electronic dictionary content (Figure 7).
Six key points summarise the computational linguistics and text-mining processing aspects of the project (see Figure 7 for a graphical representation): anonymization, alignment, INSO extraction, normalization, spatial entity recognition and extraction, and sentiment analysis.
Real-life applications emanating from such projects could have an enormous societal impact: e.g., automatic transcoding of text messages into standardized French could be successfully incorporated into vocalizing software for those unable to consult the telephone screen (drivers, the blind, etc.).
Anonymization (Seek&Hide). First names are the main items to be hidden, but the task is difficult because different spellings can be used for a given name (e.g. Nicolas, Nico, Nicooo, Niko, Nicoco, Nyko). Within the framework of the Seek&Hide software, word-processing techniques based on a dictionary are used to label the information which needs to be anonymized. Based on such labels, the three-step semi-automatic system decides which words are to be: a) automatically anonymised, b) ignored, or c) highlighted so that the human linguist expert annotators can then process the data, via a web interface (cf. Fig. 8).
Alignment (AlignSMS). The algorithm we proposed to align "raw" anonymized SMSs with normalized SMSs is based on the pivot principle (Choudhury et al., 2007) and follows four steps: 1) identification of textual blocks to be aligned, 2) identification and alignment of invariant blocks (i.e. pivot blocks), 3) deduction of alignments based on step 2, and 4) manual alignment of non-aligned blocks.
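The four alignment steps can be sketched in a few lines of Python. This is only an illustration of the pivot idea, not the AlignSMS prototype: it assumes whitespace tokens as textual blocks and swaps in the standard-library `difflib.SequenceMatcher` to find the invariant (pivot) blocks; the function name `align` and the example messages are hypothetical.

```python
from difflib import SequenceMatcher


def align(raw, normalized):
    """Align a raw SMS with its normalized version, token by token.

    Returns (aligned pairs, pairs left for manual alignment).
    """
    # Step 1: textual blocks to be aligned (here, whitespace tokens).
    raw_tok, norm_tok = raw.split(), normalized.split()
    matcher = SequenceMatcher(a=raw_tok, b=norm_tok, autojunk=False)
    aligned, manual = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        raw_block = " ".join(raw_tok[i1:i2])
        norm_block = " ".join(norm_tok[j1:j2])
        if tag == "equal":
            # Step 2: invariant (pivot) blocks align with themselves.
            aligned.extend(zip(raw_tok[i1:i2], norm_tok[j1:j2]))
        elif tag == "replace":
            # Step 3: blocks between two pivots are deduced to correspond.
            aligned.append((raw_block, norm_block))
        else:
            # Step 4: one side is empty, so leave the block to a human.
            manual.append((raw_block, norm_block))
    return aligned, manual


aligned, manual = align("slt tu fé koi ce soir", "salut tu fais quoi ce soir")
print(aligned)
print(manual)
```

In this example "tu" and "ce soir" act as pivots, so "slt" is paired with "salut" and the block "fé koi" with "fais quoi"; a finer-grained alignment inside such blocks would need additional rules.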
INSO extraction for lexical creativity identification (Lopez et al. 2015). Our system uses ten sequential filters in order to classify items into ten predefined categories. These categories are designed to capture all items which are not considered to be an INSO (in French, Item Non Standard Original, for Unknown Non Standard Item). Examples of categories are "items identifiable from lexical resources", "items without accents but identifiable in dictionaries", "items with a sole character", "hours and dates", "smileys", etc. The main idea is to capture the various items with these filters. Items that pass through all filters are considered to be INSOs (cf. Figure 7). This kind of resource is relevant for electronic dictionary improvement.
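The filter cascade can be sketched as follows. The sketch assumes only five of the ten filters, a toy lexicon, and hand-written regular expressions for hours, dates and smileys; the category names and the `classify` function are hypothetical and do not reproduce the system of Lopez et al. (2015).

```python
import re
import unicodedata

# Hypothetical lexical resource: a tiny stand-in for a real dictionary.
LEXICON = {"salut", "demain", "café", "je", "suis", "déjà"}


def strip_accents(word):
    """Remove diacritics: 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


ACCENTLESS = {strip_accents(w) for w in LEXICON}
TIME_OR_DATE = re.compile(r"\d{1,2}[h:]\d{0,2}|\d{1,2}/\d{1,2}(/\d{2,4})?")
SMILEY = re.compile(r"[:;=][-^']?[)(DPp]|\^\^|<3")

# Filters are tried in order; the first one that accepts the item wins.
FILTERS = [
    ("lexicon", lambda t: t.lower() in LEXICON),
    ("lexicon_without_accents", lambda t: t.lower() in ACCENTLESS),
    ("single_character", lambda t: len(t) == 1),
    ("hour_or_date", lambda t: TIME_OR_DATE.fullmatch(t) is not None),
    ("smiley", lambda t: SMILEY.fullmatch(t) is not None),
]


def classify(token):
    """Return the first matching category, or 'INSO' if no filter matches."""
    for name, accepts in FILTERS:
        if accepts(token):
            return name
    return "INSO"


for item in ["salut", "cafe", "x", "18h30", ":-)", "bizouxx"]:
    print(item, classify(item))
```

Ordering matters in such a cascade: an item is assigned to the first category that accepts it, so the cheaper and more reliable filters are best placed first.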
Normalization. Based on the 88milSMS corpus, Tarrade (2017) develops a rule-based system using the Stanford CoreNLP architecture. These rules aim at generating normalized item candidates taking into account diacritic signs, agglutination, apocopes, consonant contractions/clippings, etc. according to a predefined typology of linguistic phenomena (Tarrade et al., 2017). A score is computed for each candidate according to the kind of triggered rules and the morphosyntactic context of the item. Kogkitsidou (2018) proposes a hybrid approach for automatic SMS normalization by combining fine-grained linguistic analysis, based on local grammars, with a machine translation model. For an information retrieval task, over the original and normalized versions of an SMS corpus, a comparison with three open source tools for named entity recognition shows that each system enhances the tagging performance over the normalized SMS.

Spatial entity recognition and extraction.
Other recent research encompasses spatial recognition/extraction and sentiment analysis. Zenasni et al. (2018) propose a new method combining several NLP approaches, including statistical information (i.e. similarity measures), lexical analysis (i.e. presence or absence of accents), grammatical analysis (i.e. part-of-speech (POS) tagging), and a text-mining approach based on n-grams of words for identifying and extracting spatial entities from the 88milSMS corpus.
The proposed methods make it possible to extract variations of spatial entities (e.g. motpellier, montpelier, Montpel are associated with Montpellier). Moreover, this unsupervised method has been compared with a machine learning approach in order to identify spatial entities in the 88milSMS corpus (Lopez et al. 2018). The latter combines an approach based on Linked Open Data for extracting rich contextual features along with standard ones that are usually included in NER systems. Both approaches (i.e. unsupervised and supervised) obtain comparable results.
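To illustrate how a similarity measure can associate spelling variants with a place name, here is a minimal Python sketch. It is not the method of Zenasni et al. (2018): the three-entry gazetteer, the 0.75 threshold and the use of `difflib.SequenceMatcher.ratio()` as the similarity measure are assumptions chosen for illustration, and accents are ignored before comparison.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical gazetteer: a tiny stand-in for a real list of place names.
GAZETTEER = ["Montpellier", "Nîmes", "Béziers"]


def fold(word):
    """Lower-case a word and remove its diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def match_place(token, threshold=0.75):
    """Return the most similar gazetteer entry, or None below the threshold."""
    best, best_score = None, 0.0
    for place in GAZETTEER:
        score = SequenceMatcher(None, fold(token), fold(place)).ratio()
        if score > best_score:
            best, best_score = place, score
    return best if best_score >= threshold else None


for variant in ["motpellier", "montpelier", "Montpel", "bonjour"]:
    print(variant, match_place(variant))
```

The threshold trades recall against precision: a lower value accepts heavily truncated variants but also risks matching ordinary words, which is where the lexical and grammatical cues mentioned above would help.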
Sentiment analysis. Khiari et al. (2016) present a new opinion-mining method that combines lexical and semantic information. More precisely, the proposed approach, applied to 88milSMS, gives more weight to words with a sentiment (i.e. presence of words in a dedicated dictionary) for a classification task based on three classes: positive, negative, and neutral. Moreover, the system takes into account lexical information (e.g. repetitions of characters) in the prediction model.
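The two ingredients mentioned above (a sentiment dictionary and character repetitions) can be illustrated with a simple rule-based scorer. This is a deliberately simplified stand-in, not the prediction model of Khiari et al. (2016): the polarity dictionary, the `repeat_boost` parameter and the decision rule are all hypothetical.

```python
import re

# Hypothetical sentiment dictionary: +1 for positive, -1 for negative words.
POLARITY = {"super": 1, "content": 1, "génial": 1, "nul": -1, "triste": -1}

# Three or more identical characters in a row, as in 'suuuper'.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def classify_sentiment(sms, repeat_boost=0.5):
    """Return 'positive', 'negative' or 'neutral' for one message."""
    score = 0.0
    for token in re.findall(r"\w+", sms.lower()):
        elongated = REPEAT_RE.search(token) is not None
        # Collapse the repetition so 'suuuper' is looked up as 'super'.
        base = REPEAT_RE.sub(r"\1", token)
        polarity = POLARITY.get(base, 0)
        if polarity:
            # An elongated sentiment word counts for more than a plain one.
            score += polarity * (1 + (repeat_boost if elongated else 0))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for message in ["c trop suuuper", "c nul", "a demain", "super mais nulll"]:
    print(message, classify_sentiment(message))
```

In a learned classifier the same signals would typically enter as weighted features rather than as a hand-set score.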

Surveys
Once the 88milSMS corpus was uploaded to the Huma-Num platform (http://88milsms.huma-num.fr) in 2014, we gave researchers and the general public the option of signing up to a scientific newsletter.

Corpus usage (2017)
Three years after providing 88milSMS for public download and dissemination, we decided to conduct a survey on usage of the corpus and asked whether researchers were interested in a study day to be organised. Unfortunately, only 10% of those receiving the newsletter responded. General answers are summarised in Figure 9, with a strong disciplinary tendency towards language sciences and computing including NLP, text mining and corpus linguistics research, within Europe and beyond, mainly from higher education establishments.
In terms of dissemination, 50% of the research cited was successfully circulated in Master's theses, PhDs, habilitations, books, articles, proceedings, etc. (Figure 10). Several colleagues and students from other disciplines contacted us in order to insert their references on our website (Kodelja et al. 2015, Thovex 2016).

Survey update (2019)
In March 2019, we sent an update query via the scientific newsletter to find out whether colleagues had cited and/or used the 88milSMS corpus data in their work. Few survey responses were received. However, they indicate that the corpus is being used in language sciences, as is to be expected, but also in other disciplines:
• Geography: identification of place names and interpretation of variations (upcoming Master's 2 internship subject, 2019, IGN-Paris & Paris-Est Marne-la-Vallée University);
• Language Sciences (use of 88milSMS):
- university courses for 2nd-year students; identifying and correcting spelling mistakes (Poitiers University); discourse genres (Lorraine University);
- recent PhD (date not stipulated) on French as a foreign language and how to include SMS-writing in didactic situations;
- qualitative comparative analysis between differing corpora, related to morphosyntactic French question-form usage (Guryev 2018) and interactional aspects comparing SMS and oral language (Guryev 2019);
• Psychology: digital communication and teenagers (relational, emotional, romantic aspects, 12-16 year-olds, Master's 1 thesis 2019, Toulouse Jean-Jaurès University).

Conclusion
This article provided a synthesis of the sud4science/88milSMS project and resulting corpus usage. In addition, this research allowed us to discover a number of new facets which are not necessarily systematically investigated by academics. Also, we sometimes […]
We consider the following four key points to be fundamental for successful applied research:
1. deliver crucial research information to the general public;
2. demand that research results be factored into Ministerial reforms;
3. provide scientific expertise for devising real-life applications/software;
4. continue applied research and link academic and other institutions.
Also, real-life applications/software can help improve people's daily lives.Voice recognition and speech synthesis have been perfected over the decades.Our SMS research might provide insight into how electronic lexica can be modified in order to improve vocal tools used by the blind and/or those who are momentarily impeded from writing on their mobile devices.
Academics need to spend more time off-campus, mingling with people from other walks of life, in order to understand how their own research can become truly applied and useful for all.Links between Universities and other institutions/private enterprise are also crucial.
We consider SMS-writing to be one of the major creative features (an enrichment) of 21st century French written language. Analysing mediated digital discourse inevitably places researchers in the public eye. However, society often perceives contemporary writing styles in a negative fashion. As linguists and computer scientists working with NLP and text-mining, we shall continue to observe and not judge. It is our job (albeit a constant struggle) to continue to dismantle popular beliefs and convey that all written forms should be acceptable, not only standard French language. More positive ideas about technology usage and societal links need to be conveyed.
Recent data collections (Whatsup, Ueberwasser and Stark 2017; thumbs4science, Cougnon et al. 2017) and future ones will continue to study evolving written language in the 21st century, i.e., investigating sociolinguistic aspects and societal impacts related to mobile technology usage and mediated digital discourse, including plurilingual and cross-cultural perspectives.

Figure 3. Transcoding example and related issues

Figure 8. Screenshot of the 'Seek and Hide' web interface