Today is the American Library Association midwinter meeting LibHackathon here at the Penn Libraries. I thought I'd share a project using library data that I've been working on for a little while now, in the hope that it will not only be useful to scholars but also generate some conversation about how libraries and archives distribute their valuable descriptive information.
In short, this piece is all about how we get to this:
From this:
Over the years and especially here at Penn I've been fortunate enough to work with a number of catalogers in both special and general collections. I can't think of a more under-appreciated part of the scholarly community. I've seen first-hand how much time, energy, and bibliographic skill goes into the description of texts and objects of all kinds. I've heard heated debates over whether one piece of information or another should go into one of the million-and-one MARC fields. What comes out of the other side of this process should be a goldmine of easily usable truly 'big' bibliographic data. Instead, I think it's safe to say that 99% of library users have no idea why one might want to search the 752 field instead of the 260 field for place of publication. Moreover, this is hardly the sole fault of users. Try searching any library online catalog for just information from subfield c of field 300 and see how far you get! So much structured data ignored and thousands of hours of cataloger effort hidden from the world [1].
Fortunately the data is there if you know how to find it [2]! I've been playing around with our catalog data at Penn for a while now and decided a few weeks ago that I wanted an easy way to visually display networks of provenance in our manuscript collection. Penn has a deep commitment to provenance and book history and for my money our catalogers have done some of the richest work in describing provenance of any manuscript collection I've seen. The Kislak Center here at the Penn Libraries currently has cataloged around 1,640 codex manuscripts (manuscripts bound in book form) as well as around 300 codex manuscripts from the Lawrence J. Schoenberg collection [3]. I knew from experience that most of these had detailed descriptions of former ownership in their online catalog records and it seemed reasonable to just download them all and make a quick visualization of who owned which manuscripts in common.
I realize now that this task would have been near to impossible at most libraries where the online catalogs and back-end databases don't easily allow public users to batch download full records. Fortunately at Penn all of our catalog records are available in MARC-XML form which looks something like this:
I knew that our catalogers were keen on including structured data about former owners in the 700 field with a "former owner" phrase after their name. It was easy enough to download a list of all of the manuscripts that possessed this field. Then, after some much-needed coaching from Dot Porter, the Kislak Center's XML guru and medievalist extraordinaire, I was able to write an XSL transformation which would spit out just what I wanted. At first glance, though, I didn't turn up nearly as many results as I'd hoped and I seemed to be missing a lot of data. Looking through the records I saw that, on the plus side, the 700 field was highly structured with authorized name headings, but it didn't always incorporate all of the rich narrative textual information in the 561 field (labeled "provenance" in our public catalog). For example, an owner like Sir Thomas Phillipps would have his name included in the 700 field but the auction house which sold the manuscript would appear only in the 561. This is for very good reasons ("Sotheby's" is rarely a "former owner") but I really wanted to know everything about a text so I moved on to extracting every 561 field from the manuscripts. Instead of nice, neat authorized names, I of course got a lot of fascinating narrative:
I broke each of these lines of narrative into sentences and began the arduous work of identifying each owner in a chain of provenance uniquely. After some maddening time using OpenRefine, regular expressions, and plain copying and pasting I got a list I was happy with. In the end I came up with 3,252 manuscript/provenance pairs, like so:
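For readers who want to try something similar, here is a minimal Python sketch of the two extraction steps described above: pulling "former owner" names from the 700 field and splitting the 561 provenance note into sentences. It is not the XSL transformation or the OpenRefine workflow actually used for this project. The sample record, its names, the use of field 099 for the shelfmark, and the helper function names are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Standard MARC-XML ("MARC21 slim") namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record: the shelfmark, names, and wording are placeholders,
# and putting the shelfmark in field 099 is an assumption.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="099" ind1=" " ind2=" ">
    <subfield code="a">Ms. Codex 0000</subfield>
  </datafield>
  <datafield tag="561" ind1=" " ind2=" ">
    <subfield code="a">Sold by Example Auction House, 1900. Formerly owned by Jane Example.</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2=" ">
    <subfield code="a">Example, Jane,</subfield>
    <subfield code="e">former owner.</subfield>
  </datafield>
</record>"""


def subfields(record, tag, code):
    """Yield the text of every subfield `code` in every datafield `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    for node in record.findall(path, MARC_NS):
        if node.text:
            yield node.text.strip()


def former_owners(record):
    """Names from 700 fields whose relator term ($e) mentions 'former owner'."""
    names = []
    for field in record.findall("marc:datafield[@tag='700']", MARC_NS):
        relators = [s.text or "" for s in field.findall("marc:subfield[@code='e']", MARC_NS)]
        if any("former owner" in r.lower() for r in relators):
            name = field.find("marc:subfield[@code='a']", MARC_NS)
            if name is not None and name.text:
                names.append(name.text.strip().rstrip(","))
    return names


def provenance_sentences(record):
    """Split each 561 $a note into rough sentences after each full stop."""
    sentences = []
    for note in subfields(record, "561", "a"):
        sentences.extend(part for part in re.split(r"(?<=\.)\s+", note) if part)
    return sentences


record = ET.fromstring(SAMPLE_RECORD)
shelfmark = next(subfields(record, "099", "a"))
owners = former_owners(record)
sentences = provenance_sentences(record)

print(shelfmark, owners)
for sentence in sentences:
    print(" -", sentence)
```

The sentence splitter is deliberately naive: abbreviations such as "Mr." or "ca." will produce false breaks, which is one reason this kind of work still needs a good deal of manual cleanup.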
1,285 of our 1,640-odd codices (including two ms. rolls, because: why not) had at least some provenance data recorded, as did an additional 265 of the 311 Schoenberg manuscripts we've cataloged. Out of these I was able to identify 985 "unique" entities through whose hands our manuscripts had passed. More interestingly, 225 of these owners had formerly been in possession of two or more of our manuscripts.
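Counts like these fall out of the manuscript/provenance pairs quite directly. As a rough sketch, assuming the pairs are available as simple (manuscript, owner) tuples, something like the following would do it. The pairs below are invented placeholders, not data from the Penn catalog.

```python
from collections import defaultdict

# Invented (manuscript, owner) pairs standing in for the real list.
pairs = [
    ("Ms. A", "Owner One"),
    ("Ms. A", "Dealer X"),
    ("Ms. B", "Dealer X"),
    ("Ms. C", "Owner Two"),
    ("Ms. C", "Dealer X"),
    ("Ms. C", "Owner One"),
]

# Group manuscripts under each owner; a set ignores repeated pairs.
manuscripts_by_owner = defaultdict(set)
for manuscript, owner in pairs:
    manuscripts_by_owner[owner].add(manuscript)

# Number of distinct owners, and those linked to two or more manuscripts.
unique_owners = len(manuscripts_by_owner)
repeat_owners = sorted(
    owner for owner, mss in manuscripts_by_owner.items() if len(mss) >= 2
)

print(unique_owners, repeat_owners)
```

The result is only as good as the name cleanup that precedes it: two spellings of the same dealer would be counted as two "unique" entities.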
The historical strengths of our collection and Penn's institutional history can be seen pretty clearly here at the center of the cluster. Our codices primarily come from European and American collections as mediated by the prominent dealers and auction houses of London, New York, Philadelphia, Paris, Florence, and Munich. In addition we have received several very large collections over the years, including the Gondi-Medici collection via the dealer Bernard Rosenthal and the recent gift of the Lawrence J. Schoenberg collection.
In addition to seeing this network of former owners connected to manuscripts I also wanted to see which groups of owners went together - that is, what strings of provenance were the most common in our collection. To accomplish this I leaned on Andy Danner, a rising star in the Computer Science department at Swarthmore (and a former grad school roommate). Where I had puzzled over ways to write a script to count up owners by commonly owned manuscripts and come up with pretty much nothing workable, Andy took all of 25 minutes to write a simple Python program to accomplish just that. After running the program I had a list of 4,752 pairs of co-owners. These ranged from contemporary collector-auction house pairs, like Schoenberg-Sotheby's, with 91 manuscripts having passed through both their hands, to connections between bookdealers (for example, the 6 manuscripts which passed through the hands of both Sam Fogg and Jörn Günther), to older provenance strings such as the triad of the Venetian senator Jacopo Soranzo (1686-1761), the Italian collector Matteo Luigi Canonici (1727-1805), and the British bibliophile Walter Sneyd (1809-1888), with five manuscripts passing to Canonici, then to Sneyd, and then through a variety of channels to be reunited at Penn.
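Andy's program isn't reproduced here, but the general idea of counting co-owners can be sketched in a few lines of Python. This is one plausible approach under my own assumptions, not his code: group owners by manuscript, then count every unordered pair of owners that share a manuscript. The function name and the sample pairs are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_owner_counts(pairs):
    """Count, for every pair of owners, how many manuscripts they share."""
    owners_by_manuscript = defaultdict(set)
    for manuscript, owner in pairs:
        owners_by_manuscript[manuscript].add(owner)

    counts = Counter()
    for owners in owners_by_manuscript.values():
        # Sorting makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(owners), 2):
            counts[(a, b)] += 1
    return counts


# Invented (manuscript, owner) pairs, not data from the Penn catalog.
sample_pairs = [
    ("Ms. A", "Owner One"),
    ("Ms. A", "Dealer X"),
    ("Ms. B", "Dealer X"),
    ("Ms. C", "Owner Two"),
    ("Ms. C", "Dealer X"),
    ("Ms. C", "Owner One"),
]

counts = co_owner_counts(sample_pairs)
for (a, b), shared in counts.most_common():
    print(f"{a} / {b}: {shared}")
```

Note that this treats co-ownership as unordered: it records that two owners held the same manuscript, not who had it first. Recovering ordered chains like Soranzo to Canonici to Sneyd would need dates or sequence information from the provenance notes.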
In looking at this network I was interested in the densest former possessor clusters. Unsurprisingly, Larry Schoenberg appears statistically at the head of all of our past possessors in terms of sheer number of ties to other "owners." This league table looks different from the raw counts, of course: the Gondi-Medici manuscripts might make up a substantial portion of our collection, but they barely appear as important in this visualization, which privileges connections over sheer volume (the Gondi-Medici mss. share almost no owners in common with other parts of the collection). This also tends to privilege manuscripts which have moved around quite a bit. Some of the manuscripts in the Schoenberg collection have 7 or 8 recorded owners if we include auction houses, with LJS 64 taking the cake with 11! I'd be curious to see how this robust movement compares with similar research library collections.
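The difference between "volume" and "ties" is easy to see in a toy example. The sketch below is plain Python rather than the Gephi workflow behind the diagrams, and the owners and manuscripts are invented: one owner holds many manuscripts that share no other owners, while another holds fewer manuscripts that each passed through several hands.

```python
from collections import defaultdict

# Invented (manuscript, owner) pairs, chosen to contrast volume with ties.
pairs = [
    ("Ms. 1", "Family Archive"),
    ("Ms. 2", "Family Archive"),
    ("Ms. 3", "Family Archive"),
    ("Ms. 4", "Collector"),
    ("Ms. 4", "Dealer X"),
    ("Ms. 4", "Auction House Y"),
    ("Ms. 5", "Collector"),
    ("Ms. 5", "Dealer Z"),
]

owners_by_manuscript = defaultdict(set)
manuscripts_by_owner = defaultdict(set)
for manuscript, owner in pairs:
    owners_by_manuscript[manuscript].add(owner)
    manuscripts_by_owner[owner].add(manuscript)

# An owner's neighbours are all the other owners of the same manuscripts.
neighbours = defaultdict(set)
for owners in owners_by_manuscript.values():
    for owner in owners:
        neighbours[owner] |= owners - {owner}

# Volume: manuscripts held. Ties: distinct co-owners (the node's degree).
volume = {owner: len(mss) for owner, mss in manuscripts_by_owner.items()}
ties = {owner: len(neighbours[owner]) for owner in manuscripts_by_owner}

for owner in sorted(volume, key=ties.get, reverse=True):
    print(f"{owner}: {volume[owner]} mss., {ties[owner]} ties")
```

Here the hypothetical "Family Archive" tops the list by volume but has no ties at all, which is roughly the position of the Gondi-Medici manuscripts in the co-owner network.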
To get a sense of what this means, I looked at two of Penn's great donors, Henry Charles Lea and Larry Schoenberg, and their collections. It's instructive to note how far apart on the network diagram Lea and Schoenberg are. Though they both gave large collections of medieval and early modern manuscripts to the university, they collected in different eras and with different methods and agendas. Lea built his collection over the 19th century through trips to Europe and conversations with dealers disposing of the remnants of religious and family libraries across Italy and Iberia, with a focus on the Inquisition. Schoenberg collected pieces originating all over the world, from Japan to the United States, with broad interests in science, mathematics, language, and the manuscript form. He also collected in the 20th and 21st centuries, which meant his sources were primarily dealers and auction houses. Their networks do overlap in a few places, but these points of intersection are mostly dealers like Tregaskis of London and Sotheby's, which just confirms the centrality of the rare book market centered on London.
It's no surprise that another point of connection between Lea and Schoenberg is Sir Thomas Phillipps, perhaps the greatest English book collector of all time (ok, well in quantity at least). Lea bought a manuscript at one of the first sales of Phillipps' collection and Schoenberg purchased several more a century later. The repeated Phillipps sales across the long 20th century injected a huge number of manuscripts into the market at a time when American universities and private collectors were more aggressively building collections and I wouldn't be surprised if Phillipps is at the heart of many American institutional collection networks.
I should say that there are of course serious data bias issues going on here. Schoenberg is well known for his passion for provenance history and for founding the Schoenberg Database of Manuscripts (SDBM); he had an interest in documenting the life story of his collection in a way Lea did not, leaving us with much less easily accessible data on the prior disposition of Lea's collection. That being said, I think this kind of network diagram helps make clear the rising importance of specialized antiquarian book dealers and houses over the last century and a quarter.
Finally and perhaps most exciting of all, I'm just now in the process of working through the data and the visualizations to suggest manuscripts which, in the absence of visible or recorded evidence, might have been owned by a particular person or institution based on similar chains of provenance. I hope to make all of this data available through our institutional repository at some point and I'd love to hear from others engaged in similar projects - and I hope this piece will encourage others to begin taking a look at how they might be able to use the detailed metadata created by generations of librarians.
__________
Network diagram of Penn codex manuscripts and former owners
MARC record for UPenn Ms. Codex 465
Provenance note for UPenn Ms. Codex 234
Past possessors of Penn's manuscript codices in yellow with individual manuscripts in grey. (Gephi network diagram rendered with sigma.js).
Center cluster showing a variety of donors, bookdealers, and auction houses
Former possessor network of manuscripts owned by Walter Sneyd
Past possessors of Penn's manuscript codices in yellow linked with others who possessed the same manuscript. (Gephi network diagram rendered with sigma.js).
Network of former "owners" of Mss. also owned by Lea
Network of former "owners" of Mss. also owned by Schoenberg
__________
[1] The introduction of RDA and other linked data standards may correct some of these issues but I think that will likely be a long time coming.
[2] See Anna-Sophia Zingarelli-Sweet's work outlining some of the promises and difficulties of looking at catalog data at scale. I'm also lucky enough to work with one of the most impressive data-mappers and library catalog users in the country: John Mark Ockerbloom, who maintains the Online Books Page and pioneered the "get at your library" link in Wikipedia.
[3] In addition the Kislak Center here at Penn holds nearly 1,000 other "manuscript collections" which can consist of anything from single leaves to tens of thousands of pages of archival material as well as codices. Given this somewhat arbitrary distinction, I'd love to expand this analysis to the entirety of the Penn collection.
I did my master's thesis on what used to be Engl. 6 (a Wycliffite Bible), now Codex 201. It bears the ownership inscription of Gilbert, Bishop of Bath and Wells. The problem is that there were TWO Gilberts who fit the name, and one succeeded the other, and the inscription is in a secretary's hand, so it's impossible to tell whether it was Gilbert Bourne, Mary Tudor's bishop, or Gilbert Berkeley, Elizabeth I's bishop. The librarians at Wells Cathedral in the 80's were unable to distinguish which bishop it was. The curator of the Morgan library dated the binding as 1550-1570, so that doesn't help either. I probably still have all those documents in my files somewhere if you want scans of them; please let me know. All best, Jo Koster (olim 'Tarvers')