Encoding Newton’s Alchemical Library: Integrating Traditional Bibliographic and Modern Computational Methods

The Chymistry of Isaac Newton (http://chymistry.org) project team has digitized and encoded, following the TEI Guidelines, the complete corpus of Newton’s alchemical manuscripts, which total more than two thousand pages and over one million words. Newton made more than five thousand citations to published and unpublished works in these manuscripts; many of his annotations reference items in his own library, as he was an exceptionally dedicated reader of alchemical texts. Newton’s extensive citations and annotations provide a window into his alchemical research and practices, and serve as the basis for our authoritative bibliography of his alchemical sources. The bibliography is being developed as both a stand-alone reference work and an integrated resource with the alchemical manuscripts, providing additional context for Newton’s citations and florilegia. Once finished, the bibliography will provide complete, structured citations, which often appear heavily abbreviated or incomplete in the manuscripts, that can be formatted to comply with modern bibliographic conventions and bibliographic management systems. Our bibliography will also link to digitized online versions of the source texts available through Early English Books Online, HathiTrust Digital Library, and other digital repositories. The citations include quasi-facsimile title page transcription, a technique used for bibliographic description of rare books, to enable richer forms of citation analysis. By analyzing the citations, we will be able to date Newton’s manuscripts, cluster manuscripts that cite the same or related sources, and, ultimately, generate network graphs that will reveal connections between the cited authors and texts and how they influence Newton’s own ideas and work.


Introduction
Best known for his contributions to gravitational theory, calculus, and optics, Newton was also a serious student and practitioner of alchemy. His library was full of dog-eared alchemical books and manuscripts, and he wrote and transcribed close to a million words on the subject, although he never published any of them. His notes and unfinished manuscripts contained over five thousand references to alchemical texts and practices (figure 1). Newton even employed his own citation methods within his manuscripts and notes. It is unusual to see this level of specificity in citation practices in the seventeenth century. In an effort to better understand Newton's alchemical scholarship, as well as the study of alchemy in the seventeenth century more broadly, our team seeks to reconstruct, from the fragmentary citations in his personal papers, a comprehensive list, with more complete bibliographic information, of the hundreds of alchemical texts that Newton read and referenced. This work is meant as a complement to the larger Chymistry of Isaac Newton (http://chymistry.org) project, which began in 2003 with a focus on transcribing and TEI-encoding the complete corpus of Newton's alchemical manuscripts, which total more than two thousand pages and over one million words. Along with a scholarly edition of diplomatic and normalized transcriptions and facsimile page images of Newton's alchemy, the Chymistry of Isaac Newton project also includes pedagogical resources primarily focused on recreating experiments, including a lab unit that features video recordings of reenacted experiments, and online tools that include reference works and a Latent Semantic Analysis Tool to enable a deeper understanding of Newton's writings.

The Bibliography
The methods for generating Newton's alchemical bibliography required traditional bibliographic research as well as compiling and encoding the bibliography following the TEI P5 Guidelines (TEI Consortium 2017).

Tracing the Bibliographic References
The team started the bibliography by simply identifying and tagging citations to printed works and manuscripts in Newton's alchemical papers. To date we have located over five thousand citations, which have been encoded with <bibl> elements that will all soon point to the full citation in the bibliography using the @corresp attribute (example 1). As part of the process of tracing the citations and adding the corresponding link to the main entry in the bibliography, the project team will also be checking to make sure all citations are tagged. We do have a few manuscripts that were published without the <bibl> tagging, so we expect the total number of citations to grow to considerably over five thousand as we revisit the corpus.
Example 1. Example of encoded citations provided by Newton using the <bibl> tag.
<p><del rend="strike" hand="#in"><g ref="#UNx263f">☿</g><hi rend="super"><choice> <orig>ij</orig> <reg>ii</reg> </choice></hi></del> <add place="supralinear" rend="caret">lapidis</add> pro ejus solutione seu liquefactione in decoctione<lb/> ab albedine ad rubedinem <bibl>Philal on Ripl. G p. <add place="supralinear">61, 62,</add> 180, 365.</bibl> <bibl>Artef p 5 lin 12</bibl><lb/></p>
<p><del rend="strike" hand="#in">lapidis</del> vel Ceratio lapidis pro ejus liquefactione <add place="supralinear" rend="caret">&amp; ablutione</add> post nigredinem <bibl>Flammel annot<lb/> p 770.</bibl><lb/></p>
The tagging was the easy part. Next, identifying exactly what Newton was referring to in each of these citations was a meticulous process requiring detective work by several specialists: subject experts and rare books and special collections librarians. Newton's citations were often fragmentary because he used abbreviated notes intended for himself. Considering that he was working before formal citation practices were developed, his references are remarkably consistent and clear to the modern reader. That said, in some cases we were able to see that Newton was referencing something (page numbers and abbreviations to titles), but exactly what he was citing, as in figure 2, is not immediately obvious. For example, Newton used the term "Th. Ch." to refer to the Theatrum Chemicum, a multivolume compilation containing a multitude of alchemical tracts, which he cited numerous times throughout his manuscripts. Newton referenced a handful of other collections as well as the Theatrum Chemicum, such as the Artis Auriferae, published several times in ever-expanding form during the sixteenth and early seventeenth centuries, and the Musaeum Hermeticum, another work that grew over time as it was republished. We compiled the tables of contents for each of these collections to properly identify the individual tracts that Newton referenced.
The project team agreed to enter referenced tracts as individual entries in the bibliography with a complete citation to the anthologized source. Newton occasionally cited "second hand" references in which he would attribute something to one author that was actually stated by another author. Clarifying this is critical for pointing to the correct reference from the alchemical manuscripts.
Bibliographic tagging of the manuscripts also allowed us to do a rudimentary text analysis to study the words that frequently occurred in the citations. After generating the output of the existing <bibl>s encoded in the manuscripts, we used the TAPoRware Text Analysis Tool and the Voyant Tools to check for frequency of terms and distribution of terms across the corpus. This allowed us to determine that Newton's most frequently cited text was George Starkey's Secrets Reveal'd, published posthumously in 1669, a result which provided quantitative evidence that Newton had studied this work carefully. Starkey, writing under the pseudonym Eirenaeus Philalethes, was irregularly cited by Newton as philal.philaletha, philal, philos, and other variants. In addition, running the citations through the text analysis tools confirmed the degree to which name variants would benefit from normalization through the compilation of the bibliography. The text analysis also showed that Newton frequently cited George Ripley, a well-known fifteenth-century British alchemist, and Raymond Lull, a thirteenth-century philosopher, among others.
Newton's alchemical manuscripts reflect not only his own original work, but the work of other scholars, alchemists, and philosophers. By compiling an authoritative bibliography, we are able to correctly attribute the paraphrases, quotes, or transcriptions of long passages that appear in Newton's alchemical manuscripts, as well as the extent to which Newton drew from other authors.
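The frequency check described above can be illustrated with a minimal Python sketch. It is not the TAPoRware or Voyant workflow the team used; it only shows the idea of pulling the text out of each <bibl> and counting terms after folding name variants together. The fragment is adapted from example 1 with the TEI namespace omitted, the fourth citation is invented, and the `VARIANTS` table is a hypothetical stand-in for the normalization that the finished bibliography would supply.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI-like fragment adapted from example 1. The last <bibl> is invented, and
# the TEI namespace that real project files declare is omitted for brevity.
SAMPLE = """<div>
  <p>ab albedine ad rubedinem
    <bibl>Philal on Ripl. G p. <add place="supralinear">61, 62,</add> 180, 365.</bibl>
    <bibl>Artef p 5 lin 12</bibl></p>
  <p>post nigredinem <bibl>Flammel annot<lb/> p 770.</bibl></p>
  <p><bibl>Philaletha p 12</bibl></p>
</div>"""

# Hypothetical variant table (an assumption, not project data).
VARIANTS = {"philal": "Philalethes", "philaletha": "Philalethes",
            "artef": "Artephius", "flammel": "Flamel", "ripl": "Ripley"}

def citation_texts(xml_string):
    """Return the flattened text of every <bibl>, including nested elements."""
    root = ET.fromstring(xml_string)
    return [" ".join("".join(b.itertext()).split()) for b in root.iter("bibl")]

def term_counts(citations):
    """Count alphabetic tokens, mapping known name variants to one form."""
    counts = Counter()
    for text in citations:
        for token in re.findall(r"[A-Za-z]+", text):
            counts[VARIANTS.get(token.lower(), token.lower())] += 1
    return counts

citations = citation_texts(SAMPLE)
counts = term_counts(citations)
```

Without the variant table, "Philal" and "Philaletha" would be counted as two separate terms, which is the normalization problem the paragraph above describes.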
Owing to the iterative nature of the process of compiling the bibliography, which required extensive research, the project team decided to use Zotero because of the ease of data entry and availability of the Zotero-to-TEI XSLT stylesheet as an initial way to generate the bibliography.
A key resource for building our bibliography was John Harrison's The Library of Isaac Newton (

Use of Quasi Facsimile Transcription
Writing in the late seventeenth century, Newton typically referenced texts written and/or published during the fifteenth through seventeenth centuries. He also cited medieval sources, but these were usually reprinted in some of the contemporary printed editions and compilations in his library. According to print practice during the early modern period, all the bibliographic information about a work, such as author, date of publication, and place of publication, was contained on the title page. Title pages were critical to the Newton bibliography because we want to pinpoint as precisely as possible which edition or printing of a text Newton cited. This level of precision was important to the project team because the exact printing dates of the material Newton cited in his work allow us to better date when he was producing his alchemical manuscripts and to accurately identify his citations.
However, the fine detail of these title pages is frequently garbled by modern bibliographic protocols; it is not uncommon, for instance, for catalogers to replace the original punctuation with modern punctuation. Moreover, the titles commonly used to refer to books of this period may bear little resemblance to the title as printed on the title page. To give an obvious example, Newton's masterwork of gravitational theory is often referred to in brief as "the Principia," the third word of its actual title, Philosophiae naturalis principia mathematica. Harrison's The Library of Isaac Newton frequently abbreviates long-winded seventeenth-century titles, undoubtedly in the interest of conserving space, but at the same time creating the potential for confusion.

In order to precisely record the fine nuances of an early-modern title page, bibliographers and catalogers have long used a method called quasi-facsimile transcription (QFT). The goal of QFT, as it was put by Fredson Bowers, who clarified and codified its rules in his magisterial Principles of Bibliographical Description, is "bringing an absent book before the eye of the reader" (2005). The method involves using a very specific set of rules to transcribe every letter, punctuation mark, rule, and page break on the title page, capturing as much detail as possible, down to the use of small caps and swash italics (figure 5).
[...] (1961, 38–41) notes, to make the problem of identification acute, these two editions can only be distinguished by three inconsistencies in the spelling and punctuation as seen in figure 6, the one spelling "Naturall" with two l's instead of one, with commas rather than periods after "philosophy" and "it," and spelling "Ric: Davis" rather than "Ri: Davis." Harrison's citation for this book, as compiled in The Library of Isaac Newton (1978, 109), "Some considerations touching the usefulnesse of experimental naturall philosophy... 2 vols. 4°, Oxford, 1664-1671," is utterly incapable of distinguishing which edition Newton might have owned.
[...] Natural Philosophy illustrating the nuances of different editions, and how the act of quasi-facsimile transcription assists in identifying the precise text that Newton referenced.
We used QFT in order to record the most accurate information possible about the texts Newton cited. We chose QFT over discrete TEI elements for representing bibliographic metadata found in the title pages mostly for practical reasons. It would have been too resource-intensive to reflect the typographic conventions of transcribing a title page from an early modern edition using TEI, and we did not want to break new ground given the well-established and widely accepted conventions of QFT.
Using QFT consistently was essential to the bibliographic research process. Including the QFT in the TEI document, even if the title page elements were not granularly encoded, allows the team to maintain the TEI XML document as the authoritative source for the bibliography. The QFT transcription is encoded in the <title> element that is part of the <biblStruct> along with a supplied title to streamline metadata display for readability (example 2). As mentioned earlier, we compiled the bulk of the bibliography using Zotero. Once the bibliography was close to completion, we exported the bibliography from Zotero to RDF, then used stylesheets provided by the TEI Community (available on GitHub) to convert from RDF to P4. Finally, another stylesheet was used to conform to the most current version of the TEI Guidelines, P5.

The entries in the bibliography are grouped using a <listBibl> with individual citations in a <biblStruct> (figure 7). The bibliography is still a work in progress as new <bibl>s are encoded in the manuscripts that cite sources not yet compiled. Those newer citations are shorthand encoded with a <bibl> and identifier so that the linking mechanism from the manuscripts to the bibliography can continue smoothly. Entries tagged with <bibl>s in the bibliography will be collocated and individually traced following the methodology detailed earlier.

Integrating the Bibliography with the Manuscripts
We envision Newton's bibliography as a standalone online reference and also as a resource tightly integrated with the alchemical manuscripts. At this point in the project, we have preliminary conceptual designs of how to display full citations in context in light of other critical apparatus conventions we are currently employing for the alchemical manuscripts. We have identified two challenges regarding integration of the bibliography with the alchemical manuscripts that the project team needs to further consider: (1) contextualizing citations that reference longer quotes, and (2) properly attributing quotes that reference multiple authors. The standalone version of the bibliography is still under development and is relying on TEI Boilerplate for online publication. Our goal is to include full text access via persistent URLs to the source materials hosted by HathiTrust, the Internet Archive, or EEBO, giving preference to the best available scans and open access resources.

To help us eciently and accurately integrate the bibliography, the project team created a series of stylesheets to output the citation (contents within a <bibl>), the value of the @corresp attribute, and the manuscript source (gure 8). This serves two distinct purposes: (1) it provides the encoders with a quick way to reference whether an entry in the bibliography already exists, and (2) it facilitates review by the project editors to ensure that passages were properly cited.

Next Steps
Once the bibliography is complete, the Newton project team, through careful analysis of the citations, will be better able to date Newton's manuscripts, to cluster manuscripts that cite the same or related sources, and, ultimately, to generate network graphs that will reveal connections between the cited authors and texts and how they influenced Newton's ideas and work. The citation analysis will be combined and integrated with parallel work being done in other veins by this team to establish the order of composition of the alchemical manuscripts.
We have also been working on Newton's watermarks; on the evolution of his orthography; on the elemental composition of his inks by XRF spectrometry; and on mapping the overall semantic structure of the corpus through latent semantic analysis, with its observable patterns of reuse and reengagement.
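The reasoning behind citation-based dating and clustering can be made concrete with a short Python sketch. It is an illustration under stated assumptions and not the project's method: a manuscript cannot predate the latest edition it cites, and manuscripts can be compared by the overlap of their cited sources (here a Jaccard score, chosen for simplicity). The manuscript identifiers and the entry "hypothetical-tract" are invented; only the 1669 date of Secrets Reveal'd comes from the text above.

```python
# Hypothetical data: manuscript ids and "hypothetical-tract" are invented.
# Each manuscript maps cited source ids to the year of the cited edition.
CITED = {
    "ms-a": {"secrets-reveald": 1669, "hypothetical-tract": 1652},
    "ms-b": {"secrets-reveald": 1669},
    "ms-c": {"hypothetical-tract": 1652},
}

def earliest_possible_year(cited_years):
    """A manuscript cannot predate the latest edition it cites."""
    years = list(cited_years)
    return max(years) if years else None

def shared_source_score(sources_a, sources_b):
    """Jaccard overlap between the sets of sources two manuscripts cite."""
    a, b = set(sources_a), set(sources_b)
    return len(a & b) / len(a | b) if a | b else 0.0

bounds = {ms: earliest_possible_year(src.values()) for ms, src in CITED.items()}
overlap_ab = shared_source_score(CITED["ms-a"], CITED["ms-b"])
overlap_bc = shared_source_score(CITED["ms-b"], CITED["ms-c"])
```

The bound is only a lower limit on the date of composition, which is why it needs to be combined with the other lines of evidence.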

The Newton Corpus and Latent Semantic Analysis
The team has had a conceptual map of the corpus in hand for several years, drawn from latent semantic analysis (LSA), but the ideas themselves do not suggest an obvious order of progress.
Newton's scholarly progression in topics like calculus, mechanics, and gravitation, for which we have well-founded intuitions, seems to unfold in his manuscripts in a discernible order. Yet, we still do not understand the directions Newton took in his alchemical studies because the ideas remain largely mysterious to us. As a result, we have a map of his alchemical ideas but we still need other clues to clarify their order of development, and the citations will constitute one of the foundations on which we can determine ordering and dating of manuscripts.

LSA is a well-established method in the field of information retrieval. It was originally designed to accomplish basic tasks in search (Berry, Dumais, and O'Brien 1995), and was subsequently used to try to model human cognition (Landauer and Dumais 1997). It starts with word counts from a set of documents, usually a large set, that are used to create a term-document matrix, which is a simple numerical representation of the corpus. Linear algebra and its vector-space methods give us a numerical model of the structure of Newton's alchemical manuscripts based ultimately on shared vocabulary and ideas. We have discovered in our work with Newton that the mathematical foundations of LSA make it particularly well suited to identifying the reuse of text passages and phrases in large corpora produced by one or more authors, and that makes LSA a valuable tool for structural text analysis of large corpora.
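The first step described above, turning word counts into a term-document matrix, can be sketched in plain Python. This is a minimal illustration with invented passages and stops at the counting stage; a full LSA pipeline would go on to weight the matrix and reduce it with a singular value decomposition, which is not shown here.

```python
import re
from collections import Counter

# Invented passages standing in for manuscript chunks.
DOCS = ["sublimation of bodies", "dissolve the bodies", "bodies dissolve"]

def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per term, one column per document."""
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = term_document_matrix(DOCS)
```

Each column of the matrix is the "bag of words" for one passage; passages that share vocabulary have columns that point in similar directions.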

The Chymistry of Isaac Newton project has published the results of its LSA work in an interactive online component on its public website. The LSA component can produce a list of chunks or passages that are strongly linked by shared vocabulary and provide a measure of the strength of the relationship using cosine similarities. More simply, LSA represents documents as "bags" or "buckets" of words with emphasis on how many times a word appears in a document. To identify concepts, since words have multiple meanings, LSA looks for patterns that group words together: for example, "sublimation," "dissolve," and "bodies" might appear in passages in which Newton is noting the transition of substances from solid to gas without passing through the liquid phase (see figure 9).
[...] Newton's alchemical corpus that reveal strongly correlated passages (denoted by the yellow highlighting).

LSA also gives us numerical measures of the semantic similarity of any two passages in the whole corpus. Mathematically, that measure is a cosine calculated from vector representations of the two passages in an eigenvector space, and it has a value between zero and one. When two texts have a cosine nearly equal to one, it implies that the two are virtually identical, likely word-for-word from one end to the other. The cosines are a convenient measure of the degree of semantic entanglement of the two passages.
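The cosine measure itself is simple to compute. The following Python sketch works on raw count vectors, which is an assumption made for brevity; the project's tool computes cosines on vectors in the reduced space produced by LSA.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty passage shares nothing with any other
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # proportional word counts
no_overlap = cosine([1, 0], [0, 1])            # no shared vocabulary
partial = cosine([1, 1], [1, 0])               # some shared vocabulary
```

Because the cosine depends on direction rather than length, a short passage and a long one with the same proportions of vocabulary score close to one.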
are likely to nd Newton reusing or rethinking text: working over the same ground, recalling or copying the same sentences or phrasing from one member of the pair to the other-and, always, one of the two must have been written before the other. In a mysterious corpus like these alchemical papers, large amounts of this kind of low-level information about otherwise hard-torecognize shared structure can help us to see the shape of this work in much greater detail, and, perhaps, thereby make sense of larger trends in Newton's evolution as a practical chymist and a student of alchemy. 8 Figure 10. Screen shot of the Latent Semantic Analysis Tool, available as part of the Chymistry of Isaac Newton project, revealing pairs of manuscript passages that highly overlap with cosine similarities of 0.9 and greater.

As the cosines decrease toward 0.7 and below, there can still be a fair amount of shared vocabulary in the two, but often less shared phrasing, if any at all. Inspection of these pairs can suggest that they belong to some subgenre because of the language, but Newton is clearly doing different work with the same language. In pairs much below 0.7, there may be apparent likenesses in the use of one or two co-occurring terms that suggest a possible connection, but usually there is little else to support the idea. In LSA's spectrum-like vector representations of the text passages, even the co-occurrence of a few words in two passages must increase their cosine. It may be an indication of the general semantic similarity of these documents that the lowest observed cosine of any pair in the alchemical corpus was just above 0.4 and not lower.

LSA also gives us network graphs of all the passages as clouds of individual nodes, connected with other nodes only when their cosine exceeds a given threshold like 0.7, or 0.8, or 0.9, and these graphs help us to visualize the shape of the whole corpus, or pieces of it. The network graph (figure 11), for example, shows all the pairs of passages in Newton's alchemical manuscripts that have a cosine similarity of 0.7 or greater. It is a stable pattern because the underlying foundations (the collection of documents and the word counts in their tranches) do not change as a rule, but the graph shows that the whole collection does separate into many smaller semantic subnets. The graph can serve as a kind of map or atlas of locations where Newton worked with the same ideas across the entire corpus of 119 manuscripts.
Figure 13. Network graph produced from the Latent Semantic Analysis Tool represents six documents that are found by LSA to share a large amount of text in certain sections of each of these documents.
Each node represents a span of around 250 words of manuscript text, a lengthy passage with a quill and ink. In the passages shown in the graph, Newton rewrote the same material or revisited the same authors a number of times, and so this concatenation may represent a persistent locus of interest over a period of months or years.
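The way a cosine threshold carves the corpus into separate subnets can be shown with a small Python sketch. The passage identifiers and cosine values are invented, and the component search is a plain depth-first traversal, not the layout or analysis code behind the project's graphs.

```python
from collections import defaultdict

# Invented pairwise cosines between hypothetical passages p1..p5.
SIMS = {("p1", "p2"): 0.93, ("p2", "p3"): 0.74,
        ("p3", "p4"): 0.55, ("p4", "p5"): 0.81}

def threshold_graph(similarities, threshold):
    """Connect two passages only when their cosine is at least `threshold`."""
    graph = defaultdict(set)
    for (a, b), cos in similarities.items():
        if cos >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def subnets(graph):
    """Return the connected components of the graph as sorted lists."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

at_07 = subnets(threshold_graph(SIMS, 0.7))
at_09 = subnets(threshold_graph(SIMS, 0.9))
```

Raising the threshold from 0.7 to 0.9 drops the weaker links, so larger subnets break apart and only the near-verbatim pairs remain connected. Passages with no link above the threshold do not appear in the graph at all.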

Passages or nodes in gure 13 that possess many connections will also likely contain direct quotations from the alchemical books that Newton was reading. The nodes or passages to which they are connected also often make the same citations, or paraphrase the quotations and contents found in the multiply connected passages. This graph therefore serves as a map of citation patterns across these six documents.
As it is everywhere else, the basic problem here is to discern the order of composition of these six documents. Sometimes Newton's editorial marks provide clues, but not as often as we would like. This is where we rely on the citations, bibliography, and the orthographic, watermark, and ink evidence to ll in the gaps in the analysis. The resulting clusters will not only have the benet of showing the gradual increase of authoritative sources by Newton; they will also lay the groundwork for network analysis to reveal the connections that he saw among authors' works and ideas.

The citations constitute an independent order of evidence with its own rules that will have an impact on how to determine the order of composition of Newton's work in alchemy. When the improved and expanded citation analysis and the ink and paper evidence are all integrated with the semantically distinct clusters of passages and manuscripts that we have already discovered with our LSA tool, we should achieve a highly articulated view of how each cluster of related passages was constructed and gain a better sense of what Newton was doing in each.

Note 8. In the sixteenth and seventeenth centuries, the term "chymistry" was used interchangeably with "alchemy." Chymistry was a field that included not only the attempt to transmute base metals into gold and silver, but a host of other activities as well. Early modern chymists distilled alcoholic spirits from wine and beer, made mineral acids for use in metallurgy and mining, produced sophisticated pharmaceuticals, and fabricated pigments for artists, among other pursuits. One could almost say that chymistry combined pursuits linked nowadays to the disciplines of nuclear physics (at least in the case of transmutation), pharmacology, and industrial or technical chemistry.