Resources for Software Curation

I. Collecting/Acquiring/Appraising Software

Data Management, Planning & Policies

Cornell’s Guide to Writing “Readme” Style Metadata: Templates/best practice/guidance for creating “readme” files to accompany data sets/software.

Data Management Planning Tool (2011-present): An online application that helps researchers create data management plans.

Depsy (2015-present): Depsy helps users investigate impact metrics for scientific software, tracking research software packages hosted on CRAN (software repo for R programming language) or PyPI (software repo for Python-language software).

GNU Ethical Repository Criteria: Criteria for “hosting parts of the GNU operating system”; can also be used to evaluate other repositories hosting free source code (and optionally executable programs too).

1st IEEE Workshop on Future of Research Curation and Research Reproducibility (2016): Summarizes workshop discussions and recommendations related to curation of research data, software, and related artifacts.

IFLA Key Issues for E-Resources Collection Development: A Guide for Libraries (2012): Overview for libraries that addresses some key issues in collecting “e-resources.”

Springer Nature Research Data Policies (2016): FAQ by researchers about data policies, data repositories, and sharing data.

Guidelines & Tools

Collecting Software: A New Challenge for Archives and Museums

Guidelines for Transparency and Openness Promotion in Journal Policies: “Established by the Open Science Framework. The TOP Guidelines provide a template to enhance transparency in the science that journals publish. With minor adaptation of the text, funders can adopt these guidelines for research that they fund.”

How to Appraise and Select Research Data for Curation (2010): Discussion of appraisal concepts; geared towards research data but provides insight into practices for appraising software.

Media Stability Ratings (2018): Assigns a “media stability rating” to different media formats, in an attempt to mitigate loss.

Stewardship of E-Manuscripts (2009): Compilation of tools that can be used in acquisition & stewarding of born-digital materials.

Timbus Debian Software Extractor (2015): Tool to extract metadata for Debian software packages, developed as part of the Timbus Context Project.

II. Describing Data/Software/Environments

Descriptive Standards & Definitions

Asset Description Metadata Schema for Software: A metadata schema and vocabulary to describe software making it possible to more easily explore, find, and link software on the Web.

Best Practices for Cataloging Video Games Using RDA & MARC 21 (2015)

DataCite (2016-present): A metadata schema for the publication and citation of research data.

Data Documentation Initiative (2011-present): Standard to describe the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.

DDI-RDF Discovery Vocabulary (2013): RDF vocabulary to support the discovery of micro-data sets (aka “raw data”) and related metadata using RDF technologies.

Force 11 Software Citation Principles (2016): A consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues.

Software Ontology (2011): A resource for describing software tools, their types, tasks, versions, provenance and data associated.

Trove Software Map: Classifies software by attributes including development status, environment, intended audience, name, natural language, operating system, programming language, and topic.

User Studies

Software Search is Not a Science, Even Among Scientists (2016): Survey of how researchers search for software, including criteria they use to evaluate software results (e.g., how easy the software is to learn).

Examples of Cataloged Software/Data Sets/Repositories

JHU’s Data Archive: Data and Software associated with Seviour et al.

Computer History’s Source Code for FORTRAN II compiler

re3data: Registry of research data repositories

III. Preserving Software

Case Studies & Reports

A Case Study in Preserving a High Energy Physics Application with Parrot (2015): Describes the development of Parrot, an application dependency capture program for complex environments.

Exploring Curation-Ready Software (2017): Report 1 by the Curation-Readiness Working Group at the Software Preservation Network.

Heritage.exe (2016): Cross-comparison case study of software preservation strategies at three US institutions.

Improving Curation-Readiness (2017): Report 2 by the Curation-Readiness Working Group at the Software Preservation Network.

Preserving and Emulating Digital Art Objects (2015): Reports on the results of an NEH-funded research project “to create contemporary emulation environments for artworks selected from the archive, to classify works according to type and document research discoveries regarding the preservation effort.”

Preserving Virtual Worlds I, II (2007-2010; 2011-2013): The Preserving Virtual Worlds projects I and II explore methods for preserving digital games and interactive fiction.

Preserving.Exe: Toward a National Strategy for Software Preservation (2013): A report from the National Digital Information Infrastructure and Preservation Program of the Library of Congress, focused on identifying valuable and at-risk software.

SPN Metadata Survey (2017): Survey results on how institutions with digital preservation programs are using metadata to aid in preserving software.

Research Initiatives

The Digital Curation Sustainability Model (DCSM) (2015): JISC-funded project to highlight the key concepts, relationships and decision points for planning how to sustain digital assets into the future.

National Software Reference Library (NSRL): The NSRL is designed to collect software from various sources and incorporate file profiles computed from this software into a Reference Data Set (RDS) of information.

PERSIST (2012-present): UNESCO-hosted initiative to “ensure long-term access to the World’s Digital Heritage by facilitating development of effective policies, sustainable technical approaches, and best preservation practices.”

Software Preservation Network (SPN) (2013-present): Community of practitioners and researchers, working to address the problems of how to preserve software.

Software Heritage Network (2016-present): “The goal of the SHN is to collect all publicly available software in source code form, replicate it massively to ensure its preservation, and make it available to everyone who needs it.”

Tools, Applications, Best Practices & Standards

Library of Congress Recommended Format Statement for Software: “Identifies hierarchies of the physical and technical characteristics of software which will best meet the needs of all concerned, maximizing the chances for survival and continued accessibility of creative content well into the future.”

National Archives’ Strategy for Preserving Digital Archival Materials (2017): Overview of strategies used by NARA to preserve digital materials.

Obsolescence Ratings (2018): “This list categorizes the ease with which a range of formats that have been, or are, in common use in their fields can be read, in terms of the equipment available to do so.”

Pericles Extraction Tool (2015-present): Extraction of significant environment information from live environments, to better support object use and reuse, in the scope of long term preservation of data.

Preservation Quality Tool (2016-present): “This tool will provide for reuse of preserved software applications, improve technical infrastructure, and build on existing data preservation services.”

Software Independent Archival of Relational Databases (SIARD) (2007): An open file format developed by the Swiss Federal Archives for the long-term archiving of relational databases; data can be stored long-term independently of the original software.



Scholar Profile: Nick Montfort

As part of ongoing research at MIT Libraries, I have been conducting interviews with scholars across campus who create, use, or reuse software to understand more about their scholarly practices.  Below are snippets from my interview with Nick Montfort, a professor of digital media in the Comparative Media Studies and Writing section at MIT.  Nick is also an interactive fiction writer, computational poet, and code studies scholar.    

On Reconstructing Code

“So software or creative computing programs or research programs….these are the areas I work in.  There are different sorts of outcomes and some of them are important software produced at MIT, like Joseph Weizenbaum’s Eliza which is a very frequently cited research system and highly influential – Janet Murray named it the first “computer character.” It’s a simulated parody of a Rogerian psychotherapist….asking for you to speak about yourself, and then reflecting that back for you to hear.

One of the interesting things about this system from my perspective is that the original code doesn’t exist, but there’s a paper that describes its function in great detail. So there are many, many re-implementations of it. You can run it on the Commodore 64 and BASIC – there are programs to implement an Eliza-like system for that. So there’s not really a canonical Eliza in the way that there is a canonical Adventure. The lack of preservation for software doesn’t always mean that – if you don’t have the original code or object – it doesn’t always mean that it’s not influential, important, able to be cited, able to be part of the intellectual discourse. Of course, it presumably doesn’t HURT to have access to those works in any case.”
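
Because a published description can be enough to rebuild a system’s behavior, a re-implementation can be very small. The following is a minimal Python sketch of an Eliza-like “reflect it back” step. It is not Weizenbaum’s script or any particular re-implementation; the pronoun table, the rules, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical pronoun swaps used to turn the speaker's words back on them.
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "me", "your": "my",
}

# Hypothetical (pattern, response template) rules, checked in order.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]

# Fallback used when no rule matches.
DEFAULT_RESPONSE = "Please go on."


def reflect(fragment):
    """Swap first- and second-person words in a fragment of user input."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(statement):
    """Return a reply that turns the user's statement back on the user."""
    cleaned = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            # Reflect every captured fragment before filling the template.
            reflected = [reflect(group) for group in match.groups()]
            return template.format(*reflected)
    return DEFAULT_RESPONSE
```

With these made-up rules, `respond("I feel sad about my job.")` returns `"Why do you feel sad about your job?"`, and any statement that matches no rule falls through to `"Please go on."`.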

On Emulation as Software Preservation

“An emulator is a software version of a computer. Some people find it very distasteful that the emulator is not the authentic hardware which is interesting to note….the way we see it, you can think about it as a particular edition OF a computer. In fact, the Commodore 64 that’s over there (points) running that program right now is one edition but there are different editions of the C64 with different hardware. So for example, there’s been a ROM revision to the Commodore 64, so it behaves a little bit differently depending upon which ROM revision you have. So, in fact even when you say, ‘the hardware, it’s running on the hardware’…there’s more than one ‘the hardware!’ I think that’s even more obvious today. So, for example, when you have a PlayStation 3 that is supposed to be compatible with a PlayStation 1 or 2 initially, but then that feature is dropped as they refine the production of it….”

A Close Reading of a Commodore 64 Keyboard


“You can see a lot about the layout of the keyboard which is different from modern keyboards.  So if you tried to type in this program that I initially typed in, one thing that you might find funny about it is that if you press shift plus…you need the shift to type plus on a modern typewriter…you get this large cross symbol that doesn’t work – it’s not a plus sign – it’s a special graphical character… the keyboard layout is different in several ways…you have a pi symbol on the keyboard, you don’t have curly braces, you have the arrow keys are in the bottom right and you need to press shift to move up and shift to move left….so maybe these are all curiosities, but when you start to use the system, they change your experience of it.  The other thing is that these graphical characters, including the ones you see on here, are characters you can just type, along with other graphical characters.  You can type them into a program or directly at the BASIC interpreter – you can deal with it quite easily…

The thing about the hardware version then is just from the standpoint of the keyboard, you can see the keyboard is different. It wasn’t standardized in the way that our Mac and PC keyboards are today, but it also provided these extra facilities like the curious character set of the Commodore 64 was exposed to you because it was actually visible on the keyboard – you could see what the different characters were. And when you work in an emulator….well, first of all you have to figure out how you want your key mappings to be. For example, if you’re a Commodore 64 touch typist, you might want your keyboard to be set up in the same physical layout as the C64, but mostly people chose a logical layout where, for instance, if you press shift plus on your keyboard it’s going to correspond to the plus sign on the Commodore 64. So, you have these issues with setting up the keyboard – that’s one of the reasons why emulation is better suited for joystick games, where it’s a pretty straightforward mapping than using the keyboard in elaborate ways. On the other hand, if you do want to use an emulator, it provides these extra facilities. So, you can save the full state of the machine at any point. So if you look at something more intricate and wanted to show how a word processor or GEOS (the Macintosh-like operating system for the C64) or an elaborate game that has a lot of state….if you want to show how these things worked, then you probably want to save a particular point and you might not always have the capability for doing this within the software itself, but the emulator would allow you to say, ‘Ok, we’ll just take the full machine state’ and will allow a classroom working together or students individually or scholars to come back to that.”

On Temporality and Games

“I don’t go very often to play old games, actually…I fear I’m more of a collector (laughs) although I am interested in the ability for people to use these, rather than for their preciousness and economic value. When people came to play A Mind Forever Voyaging, we did some videography. It’s a 1985 Infocom game and it’s very easily played on modern day computers. But what I did is I set up for a group of four people the first official Infocom edition of the game to run on the Apple IIc. And then over on this large screen, I connected a computer with the most recent (although it’s pretty old) official Infocom release. Activision released this Masterpieces of Infocom for MS-DOS Windows 3.1/ Windows 95 at some point in the late 90s. And I had this running in DOSBox essentially. So they had their choice between playing these…or both of these…and the group decided they wanted to play on the Apple II and they remarked on some specific material differences there.

One of the things that’s interesting is that the pace of play is different – you don’t have a multi-tasking machine, it’s not connected to the internet, you can’t go and look for hints…you can go and look on your phone, of course, but you don’t have it easily available to you. Additionally, you don’t have the same very rapid pace.  I watched students playing interactive fiction recently and not stopping to read the text outputs, just sort of powering through typing commands.   On the Apple II when you type a command, there would be a little pause before you get a response.  If you type something that’s completely not understood or not useful, you would get a response back fairly quickly.  And then if you did something interesting that changed the state of the game or required disk access, then there would be a longer pause — the disk would spin up, and for players, what I remember and what people report is that there is this moment of anticipation – like ‘Oh Something Is Going to Happen Now! It’s So Exciting!’ So the material qualities of the system there make some sort of difference in play.  I think it’s also why people would play interactive fiction pieces that took maybe ten or twelve hours to work through in the 1980s.  People spend that much time playing games, but interactive fiction specifically is much more abbreviated in comparison to that.  Now people make 2 hour 15 minute games that are for briefer play – people still enjoy engaging with the form – eighty games were released at the IF Competition this year.”

On Authenticity and Networked Everything

“At a classic gaming expo, there was this setup with a big wood-grained cathode ray tube television, and like a really ugly 1970s couch with Atari cartridges on a coffee table and a system in front… and of course it’s in the middle of a convention center, not in someone’s house and you could sit down and play the games in this reconstructed sort of context. So people can always build more or less context around things, to give different sorts of ideas. We can’t reconstruct even the 70s or 80s in great detail and certainly as you go further back in the history of material texts or literary or gaming or cultural history, it’s very tough to do. So I think that there are certain things that people are going to encounter because of historical interest and as scholars. Their engagement with it might be limited and that’s fine, they also might bring ideas back into the mainstream. So for instance, one of my points in showing people the Commodore 64 is that you can turn it on, you can write a one line program like this… it’s not just historical curiosity about the Commodore 64. There are a bunch of reasons for this. It didn’t come with a disk drive, you needed to purchase it separately which allowed for the up-selling of it. And it allowed for lower cost of that one unit that didn’t have moving parts and so forth. But it did have BASIC built in, which was the case with essentially all home computers at the time and that programming language did facilitate this immediate exploration of what you could do with computing, being able to do very small scale programs. Some people would type in pages-long programs from magazines or books and not have any way to save them! So when you turned off your computer it was gone! It took a long time to type this in, and you might make mistakes and have to go correct it, and then you could play the game afterwards. As soon as you turned the computer off it was gone, but the whole process of doing this engaged you with programming and computing in ways that aren’t as possible now.

Of course, there are people who did engage with the early World Wide Web that way, they went to ‘view source’, they looked at how HTML was put together and that’s how they learned. There’s no view source in the App store…there was ‘view source’ in the 90s, there still is, and this ability to turn something on and immediately type in a short program and make changes to it, work with it, is not something that I bring up…when people come in and sometimes students say I’d like to take your course and it says no programming experience is required but I’m worried that I don’t have programming experience, and I say, ‘Well, sit down at the Commodore 64 and let’s program some.’ And in fact it’s not that much of a challenge when it’s posed that way. So, it’s still something that is useful today and it’s still also useful as a design critique of current computers. While we’ve added a lot of capabilities, certainly the Commodore 64 is not better at accessing social networks, video editing, etc… but we’ve lost some of the ability to work with computation in direct and useful and powerful ways. And I’m not sure that an emulator accomplishes that – I think sitting down at a Commodore 64 accomplishes that in a different way, because by the time you have installed the emulator and opened it up and your keyboard doesn’t match etc., you now have made things into a much harder problem than they originally were.”

On Curating Software-Driven Works: Autofolio Babel 

“This is Autofolio Babel or Portfolio Babel, you could also say, it’s based on Jorge Luis Borges’ Library of Babel – there are a lot of computational projects on this.  One of the things about this piece is that Borges defines quite specifically how the books are supposed to look: that they are 80 characters wide and 40 characters tall, arranged in a square… and Borges specifies a 24-character alphabet with some punctuation symbols. Instead of using this alphabet, I used a unigram distribution of Borges’ story itself in Spanish.  So the most likely thing that one would see coming up on the screen would be a page from Borges’ story, and if you look closely you can probably see, because of accent marks maybe if you study it for a while,  you can tell that it’s Spanish language text in its origin.
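
The substitution described above, drawing each character from the unigram (single-character) distribution of a source text instead of uniformly from a fixed alphabet, can be sketched roughly as follows. This is an illustration in Python, not the code of the piece itself, which runs as a web page. The short sample string is a made-up placeholder standing in for the full Spanish text of the story, and the function names are assumptions. The page dimensions follow the 80-by-40 figure given in the interview.

```python
import random
from collections import Counter

PAGE_WIDTH = 80   # characters per line, as stated in the interview
PAGE_HEIGHT = 40  # lines per page, as stated in the interview

# Placeholder text (an assumption); the piece would use the whole story.
SAMPLE = "una página de babel, una página cualquiera de la biblioteca"


def unigram_distribution(source_text):
    """Count how often each character occurs in the source text."""
    return Counter(ch for ch in source_text if ch != "\n")


def generate_page(distribution, rng=random):
    """Draw every character independently, weighted by its frequency."""
    symbols = list(distribution.keys())
    weights = list(distribution.values())
    lines = []
    for _ in range(PAGE_HEIGHT):
        # random.choices samples with replacement according to the weights.
        line = rng.choices(symbols, weights=weights, k=PAGE_WIDTH)
        lines.append("".join(line))
    return "\n".join(lines)
```

Calling `generate_page(unigram_distribution(SAMPLE))` yields one page of 40 lines of 80 characters. Frequent characters in the source (spaces, common vowels, the occasional accented letter) turn up proportionally often, which is why the origin of the text in Spanish can show through.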



Screenshot, Una página de Babel

So there’s a piece of software, each of these screens is driven by a Raspberry Pi Zero and this is just a program, it goes much slower than if it runs on a standard, much larger computer – I’ve rotated the screen at the HTML level – the material aspects of this are a bit different – we have a folio here (two screens), and here (two computers), it’s one folio that generates another folio, although this folio is powering this folio down here. They really generate each other.

One of the ways in which this work might be presented is on a table, possibly in front of a chair, or at a lectern, in a way that is suitable to its nature as a book object rather than some other type of screen.  So it would be similar to the kind of curation that people do with video art and to have that kind of care with a piece like this.  There are elements of these pieces that will wear out.  And thinking about if you were curating [Nam June Paik’s] Electronic Superhighway – it has like 170 CRTs and you can’t just say I’ll throw in a flat panel if one of them goes out…most people can, but not people who curate video art.

It’s not really a software concern at this point, but rather a system concern for a system that includes software.  And having Babel as the software component work – that’s more or less a subset.  I wouldn’t want someone to take video of this and put that video out as a ‘preservation method.’  This needs to be a functioning computing machine for this to work, so the software preservation would be part of it from my standpoint.

So I would want the ability to actively compute and recombine…and then one could do various things…in the same way that if your book wears out, you have some type of manuscript or print codex that is damaged or something, you can think about how you would restore this if it were a book? So you can obviously rebind books, in this example, maybe it would be the opposite of binding — maybe you replace the screens, but keep the casing and power apparatus if there were some problem there.  Certainly, if you needed to replace capacitors, most people wouldn’t say that would be problematic. It sort of gets into being a Ship of Theseus problem… of how much replacement effaces the original. This is an interesting case, but it’s something I would consider within book arts/art curation.  I would say librarians and special collections have a particular perspective on it, and art curators would have another.”  

Describing Autofolio Babel (currently in the Trope Tank at MIT)

“Autofolio Babel consists of these two Dell displays. They are the same model, logos in the front are covered with gaffer’s tape, these are salvaged…everything here is salvaged…I bought the Raspberry Pis at some point but not for the purpose of making this particular piece. So this is a type of bricolage maybe…one of the ways you could describe the media of the piece is reused electronics. These have two monitors that are detachable from these stands, but they are both on the stands that come with them. There are two Mini-HDMI to HDMI male to male cables. There are two micro USB to USB male to male cables. There are two Raspberry Pi Zeros – a very early model. There’s 8 GB SD cards, two of those. There’s two of everything because it’s a folio. And these are bound together with two wire twist ties – and there are two power cords which go from the monitors to a standard 125 volt power supply. So the SD card has a Raspberry Pi image and that’s an image that is set up to automatically start. It’s a fairly standard image, but there are a few important changes that are made so it starts a browser. In this case, it starts Chromium in a particular mode where it doesn’t pester and ask you about unlocking your password and stuff; and it sets it to full screen and runs. It also turns off screen blanking, power saving, and screen saving. So this will run as long as this is on and then the piece itself that’s in there is a free software piece – it’s a single webpage that is almost the same as the one that’s online at –the change really is just rotating this page.

If I were to sell this to a collector, for instance, they would….I’m trying to think of what the licensing situation would be…there is a slight customization I’ve made to a free software piece, but there’s nothing that the collector would be able to do that would restrict the basic software from being freely available as it is now…and also able to be modified. People can make their own versions, they can make their own work out of it, as has happened at least once. So I’ll just show you…..this is just an operating system, that’s Chromium… I haven’t hooked up a mouse, just hooked up a keyboard, but in fact you don’t really need a mouse because you can get to most things on the keyboard here. So this doesn’t have networking – it’s not on the network and this particular piece is to be read in a certain way for certain values of reading. This is easier to manage since this is not a networked artifact – it doesn’t receive updates – there are not security issues with it – you can go in and mount this card read only and go through the whole image if you wanted and get the information you wanted or copy it and go through that.”

On Authorship and Code Modification

“For my dissertation, I created a research interactive fiction system called Curveship with its own domain – so you could do everything you expect to do with interactive fiction, but it wasn’t deployable.  You couldn’t make a game you could give to other people.  So for that reason or other reasons, it never took off for people to use.  But that’s a larger system with thousands of lines of code – in theory it would be a platform for work.  Most of my work is considerably smaller- a page or line of code – these are online for people to use and modify.   Taroko Gorge is an example of something I wrote in Python when I was in Taiwan years ago, and after that made a JavaScript version of it.  And people began to modify that JavaScript version and put in their own words, without having a lot of expertise as programmers or identifying as programmers.  And they started to make their own “remixes” of that work, so there’s dozens of those that are available online.   To me, they don’t really threaten the integrity of the original work. I suppose there’s a possibility that someone could be confused that someone’s later modification might be something I did somehow.  But given the whole context of computing, the real concern is that people are intimidated and don’t think that things are open to modification – I see that it’s much more urgent to make that work available.  

I have a project called Memory Slam which is a slowly-growing collection of classic systems – classic and simple versions that I’ve re-implemented.  So, I’ve made Python versions and I’ve made JavaScript versions…there’s six of those pieces now.  I created this so that people could study and modify these systems but they are not close material re-makings of the systems.  So David Link took an exhibit on tour where he rebuilt Ferranti Mark 1 (the world’s first commercially-available electronic computer) and had things functioning very much like the original Christopher Strachey Loveletter Generator and for the people who got to go to that exhibit, great, but there’s another experience to be able to study and modify the way that code like that functions.  So, for example, could you make a love-letter generator into something that expresses dislike or hatred of someone? Could you make a love-letter generator about food? To what extent are the formal properties of the system susceptible to various changes?  
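
One way to probe how far the formal properties bend is to separate the generating procedure from its vocabulary. The sketch below, in Python, is only loosely in the spirit of a combinatorial letter generator. It is not Strachey’s program and not the Memory Slam code; the sentence template, the word lists, and the function names are all invented for illustration.

```python
import random

# Hypothetical word lists. Swapping one for another changes the subject
# of the letter without touching the generating procedure.
LOVE = {
    "salutation": ["Darling", "Dear"],
    "adjective": ["tender", "devoted", "eager"],
    "noun": ["affection", "longing", "heart"],
    "verb": ["adores", "treasures", "yearns for"],
}

FOOD = {
    "salutation": ["Dear Chef", "Dear Baker"],
    "adjective": ["buttery", "crisp", "smoky"],
    "noun": ["appetite", "palate", "hunger"],
    "verb": ["craves", "savors", "devours"],
}


def sentence(vocab, rng):
    """Fill one fixed template with random picks from the vocabulary."""
    return "My {0} {1} {2} your {3} {4}.".format(
        rng.choice(vocab["adjective"]), rng.choice(vocab["noun"]),
        rng.choice(vocab["verb"]),
        rng.choice(vocab["adjective"]), rng.choice(vocab["noun"]))


def letter(vocab, sentences=3, seed=None):
    """Build a short letter: salutation line, body line, sign-off line."""
    rng = random.Random(seed)
    body = " ".join(sentence(vocab, rng) for _ in range(sentences))
    return "{0},\n{1}\nYours.".format(rng.choice(vocab["salutation"]), body)
```

Calling `letter(LOVE)` and `letter(FOOD)` runs the same procedure over different word lists, which is the kind of modification the questions above point toward: the form stays put while the subject changes.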


Screenshot, Love Letter Generator

So when I redid these, the point was mainly to make them available for that type of study, modification, play…I think they are good formal models of those original systems, but they are not capturing all the material qualities.  And the reason I mention all this about Memory Slam is that probably it would make sense to put new versions of that code up – I have Python 2 code – and it might be useful to add Python 3 or somehow find something that could work in both versions.  I could make cleaner html and JavaScript versions. And if I do this – is there a point to keeping the original version and how would that be kept?”  

Dear Reader, I Was Hoping He Would Tell Me

“So one thing I could do is include the git repository in the directory itself that’s available to anyone – so if you really care to know the history of it…you can review that.  When I worked on Curveship, I used Subversion.  Sometimes, it’s rather heavy and sometimes you don’t know whether you will be done with something in 30 minutes or whether it might be a project of several weeks. And you don’t know with a small scale work, do you want to create a branch where you are exploring that you might merge in? This version control perspective is often quite elaborate for very small scale projects.”

On Distributional Poetics

“This 10-print program, which is a random maze generator, is an example of a particular type of distributional poetics, where you see there’s two symbols and in this case, picking from them is equally likely… and that’s a concrete poem or visual art piece that’s made that way. You can make things with words or with lines or syntactically with phrases as well.  There is a shift both as a reader or appreciator of this work from an aesthetic perspective, and as a maker of this work.  It’s that both perspectives need to be…it’s only meaningful if they are attuned to the distributional nature of the work.
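
The equal-likelihood choice between two symbols can be sketched in a few lines. This Python version is an illustration rather than the Commodore 64 BASIC program itself. The ASCII slash and backslash stand in for the C64’s diagonal graphical characters, and the function name and default grid size are assumptions.

```python
import random

# Stand-ins for the two diagonal graphical characters.
SYMBOLS = "/\\"


def ten_print(width=40, height=10, rng=random):
    """Build a maze-like grid by picking one of two symbols per cell."""
    rows = []
    for _ in range(height):
        # rng.choice gives each symbol the same probability.
        rows.append("".join(rng.choice(SYMBOLS) for _ in range(width)))
    return "\n".join(rows)
```

Running `print(ten_print())` fills a 40-column block with diagonals. Each run produces a different arrangement, and no single output is the work; the distribution is.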


Inside 10 PRINT

So Borges’ description of the Library of Babel is one in which you have an exhaustive library and some pages might be ripped but there’s always a page that is one character different somewhere else in the library, right? So the idea of an exhaustive library in which every possible page like this, every volume containing these pages is represented, and this is a distribution of analog…it’s important also that even though you don’t see this in the work, on the web it makes more sense but these are pages…they are web-pages…so that is something that metaphorically connects through the web to Borges’ idea. So if you come to this thinking ‘it’s a loop of video’ rather than ‘it’s producing every possible arrangement of these letters’ then I don’t see how your aesthetic perspective on it would be particularly useful – or would allow you the fullest appreciation of it. I think there are ways in which we are readers of distributions and ways in which we are writers of distributions, and this is keeping things fairly simple, because if you start with existing stores of text and process them, that’s something else. But here we are just talking about a simple distribution system and just processing them right? So the poetics question is – how do we present this in such a way and how do we make this in such a way that it has the inter-textual connections and the metaphorical connections? It is a page, it connects to the description, it implements the specification of Borges’ story in one way but not in another way…and so forth.

So the poetics of this piece have to do with the physical organization of it, what’s shown to someone who is viewing it.  There are certain things… it has a title that evokes something about book arts, for instance, and so a person who knows something about digital media art and something about book arts might know there are things that appear on screens that aren’t videos and might be more aesthetically prepared to receive this.” 

Preservation as Play Back?

“So there’s also the ability to document things.  Compared to documenting a play, it would be significantly harder to have video documentation of a play in part because when you get video documentation it interferes with the production of the play – with the actors putting it on. Here you can just go and take video of this and see what the piece looks like, pretty much, as documentation, but you are not preserving the object any more than taking a good photograph of a painting is preserving a painting.  The archival perspective is often coming from record keeping…in this case, the informational content or the record content is maybe not the main thing going on.”

What is the Scholarly Object? What Should we Preserve? 

“Let’s make a distinction between traditional scholarship and creative practice – so in this piece (Una página de Babel) the software component is referred to by Álvaro Seiça in his PhD and some of his work was actually modifying this piece. So from that standpoint, it enters traditional scholarship, just as there has been practice-based scholarship with other pieces of mine. So in order to follow the arguments that Álvaro makes, in order to follow the discussions in the “great conversation” – what types of software preservation should be done….well, this goes back to Joseph Weizenbaum. The version we have for his system is a LISP implementation that some people call the original, but he didn’t write it in LISP, he wrote it in Michigan Algorithm Decoder, this system called MAD, the code may be around….it might be in the archives….but the core of what was needed was his representation of how that system worked in his paper.  Now could we learn more about the specifics of this — the type of implementation he did, what his process of development was–  if we had that code….yes, of course, that would be very useful.  And we have snippets of example interactions.  But at some point there were lots of these and they were on Teletypes so they were actually in a medium that, if that hadn’t been discarded…there could be a box of transcripts with Eliza that is sitting in the Institute Archives right now.”


Data Epistemologies and Ways of Knowing

“Edison could tell a soprano from Basso Baritone or Tenor & each from another…by looking at record thru a microscope”  -from Paul Israel’s biography of Thomas Edison

While empirical research has confirmed that digital tools and technologies are fundamentally changing how disciplinary scholars work with digital collections [1], the inverse of this relationship has received little attention.  Are digital collections changing to support the needs and emerging practices of scholars?  Are interfaces, aggregate thumbnail displays, query mechanisms, search terms, types of content, download options, etc., enabling scholars to work in these spaces?  As the grant proposal for the recent IMLS project “Always Already Computational” notes, “Predominant digital collection development focuses on replicating traditional ways of interacting with objects in a digital space.” [2]  Indeed, much of the research exploring interactions with, and use of, digital collections does not attend to the space of user interaction as a potential site of meaning-making. [3]

My doctoral research focused on this problem area to understand how historians use digitized archival photographs as evidence in their scholarly activities. [4]  An underlying objective of my research was to explore the humanistic practices that scholars bring to bear on non-textual archival objects such as digitized photographs in an attempt to understand whether and how “ways of knowing” have shifted in digital research environments.  I wanted to understand what mattered to historians across dimensions of their overall experience, foregrounding the space of interaction. [5]  Conversely, I also wanted to understand the possible implications of interacting with digitized images as forms of data.  For example, did it matter that historians might privilege things like technical metadata in their interpretations, aspects that might have been invisible to them in print/analog environments?

Don Ihde, a philosopher of technology, has written compellingly about the hermeneutic qualities of scientific instruments, and how these tools can shape and mediate our perceptions.  Interestingly, Ihde describes a research scenario from a third-order perspective, describing the different players in the system (the scientist, the laboratory, the instruments, and the object(s) of study) as parts interacting in the construction of the story of scientific research.   “The laboratory not only prepares inscriptions -but it is the place, the site, where things – scientific objects – are prepared or made readable.” [6]  How do our tools structure and/or facilitate the telling of our stories?

This question is an important counterpoint to the evolving “collections as data” imperative and, I’d like to think, is motivated by a similar point of provocation.  While we consider the possibilities in building computationally-aware platforms that can compute all kinds of “data”, can we also explore ways to document the pathways and actions that our computational tools and techniques afford to us? To make visible what often becomes hidden in the abstract space of computation?  I’m angling here for an approach that captures how we know what we know – an “epistemology of data” perspective.

To the extent that we have transitioned from a scholarship of fixed representations to a scholarship of dynamic digital traces, how we as information workers prime and prune these traces is of enduring interest.  Embedding self-reflexive modes into our spaces of interaction can help us to see how technology participates in the equation. More to the point of this post, incorporating self-reflexive design into the building of computational spaces can potentially reveal much about how tools structure our practices, both in scholarly “ways of knowing” and in everyday life.

[1] Rutner, J. & Schonfeld, R. (2012).  Supporting the changing research practices of historians, Final Report from ITHAKA S+R; Chassanoff, A. (2013).  Historians and the use of primary source materials in the digital age. The American Archivist 76(2), 458-480.

[2] IMLS Grant Proposal (2017). “Always already computational: Collections as data.” Disclosure: I am participating as a current “partner” on this grant.

[3] Two notable exceptions in the field of Library and Information Science (LIS) are: Bates, M. (2003).  The cascade of interactions in the digital library interface. Information Processing and Management 38(3), 381-400;  Lee, C.A. (2012). Digital curation as communication mediation.  In A. Mehler, L. Romary, & D. Gibbon (Eds.), Handbook of technical communication (pp. 507-530). Berlin: Mouton De Gruyter.

[4] Chassanoff, A. (2016).  Historians’ experiences using digitized archival photographs as evidence (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (UMI No.10251831).

[5] In the LIS literature, digital collections use is often examined through quantitative measures of access to resources (e.g., transaction logs, web analytics) or qualitative analysis of scholarly preferences for resource access (e.g., are scholars retrieving materials through print or electronic methods?)   Such approaches tell us little about how scholars interact with and use digital collections.

[6] Ihde, D. (1999). Expanding hermeneutics: Visualism in science. Evanston, IL: Northwestern University Press.


Towards Strategies for Making Legacy Software Curation-Ready

In this blog post, I am going to reflect upon potential strategies that institutions can adopt for making legacy software curation-ready.  The notion of “curation-ready” was first articulated as part of the “Curation Ready Working Group”, which formed in 2016 as part of the newly emerging Software Preservation Network (SPN).  The goal of the group was to “articulate a set of characteristics of curation-ready software, as well as activities and responsibilities of various stakeholders in addressing those characteristics, across a variety of different scenarios”[1].  Drawing on inventory at our own institutions, the working group explored different strategies and criteria that would make software “curation-ready” for representative use cases.  In my use case, I looked specifically at the GRAPPLE software program and wrote about particular uses and users for the materials.

This work complements the ongoing research I’ve been doing as a Software Curation Fellow at MIT Libraries [2] to envision curation strategies for software.  Over the past six months, I have conducted an informal assessment of representative types of software in an effort to identify baseline characteristics of materials, including functions and uses.  

Below, I briefly characterize the state of legacy software at MIT.

  • Legacy software often exists among hybrid collections of materials, and can be spread across different domains.
  • Different components (e.g., software dependencies, hardware) may or may not be co-located.
  • Legacy software may or may not be accessible on original media. Materials are stored in various locations, ranging from climate-controlled storage to departmental closets.
  • Legacy software may exist in multiple states with multiple contributors over multiple years.
  • Different entities (e.g., MIT Museum, Computer Science and Artificial Intelligence Laboratory, Institute Archives & Special Collections) may have administrative purview over legacy software with no centralized inventory available.
  • Collected materials may contain multiple versions of source code housed in different formats (e.g., paper printouts, on multiple diskettes) and may or may not include user manuals, requirements definitions, data dictionaries, etc.
  • Legacy software has a wide range of possible scholarly uses and users. These may include the following: research on institutional histories (e.g., government-funded academic computing research programs), biographies (e.g., notable developers and/or contributors of software), socio-technical inquiries (e.g., extinct programming languages, implementation of novel algorithms), and educational endeavors (e.g., reconstruction of software).

We define curation-ready legacy software as having the following characteristics: being discoverable, usable/reusable, interpretable, citable, and accessible.  Our approach views curation as an active, nonlinear, iterative process undertaken throughout the life (and lives) of a software artifact.

Steps to increase curation-readiness for legacy software

Below, I briefly describe some of the strategies we are exploring as potential steps in making legacy software curation-ready.  Each of these strategies should be treated as suggestive rather than prescriptive at this stage in our exploration.

Identify appraisal criteria. Establishing appraisal criteria is an important first step that can be used to guide decisions about selection of relevant materials for long-term access and retention. As David Bearman writes, “Framing a software collecting policy begins with the definition of a schema which adequately depicts the universe of software in which the collection is to be a subset.”[3]  It is important to note that for legacy software, determining appraisal criteria will necessarily involve making decisions about both the level of access and preservation desired.  Decision-making should be guided by an institutional understanding of what constitutes a fully-formed collection object. In other words, what components of software should be made accessible? What will be preserved? Does the software need to be executable? What levels of risk assessment should be conducted throughout the lifecycle?  Making these decisions institutionally will in turn help guide the identification of appropriate preservation strategies (e.g., emulation, migration) based on desired outcomes.

Identify, assemble, and document relevant materials. A significant challenge with legacy software lies in the assembling of relevant materials to provide necessary context for meaningful access and use.  Locating and inventorying related materials (e.g., memos, technical requirements, user manuals) is an initial starting point. In some cases, meaningful materials may be spread across the web at different locations.  While it remains a controversial method in archival practice, documentation strategy may provide useful framing guidance on principles of documentation [4].

Identify stakeholders. Identifying the various stakeholders, either inside or outside of the institution, can help ensure proper transfer and long-term care of materials, along with managing potential rights issues where applicable.  Here we draw on Carlson’s work developing the Data Curation Profile Toolkit and define stakeholders as any groups, organizations, individuals, or others that have an investment in the software and that you would feel the need to consult regarding access, care, use, and reuse of the software[5].

Describe and catalog materials. Curation-readiness can be increased by thoroughly describing and cataloging select materials, with an emphasis on preserving relationships among entities. In some cases, this may consist of describing aspects of the computing environment and relationships between hardware, software, dependencies, and/or versions. Although the software itself may not be accessible, describing related materials (e.g., printouts of source code, technical requirements documentation) adequately can provide important points of access. It may be useful to consider the different conceptual models of software that have been developed in the digital preservation literature and decide which perspective aligns best with your institutional needs [6].
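To make the idea of “preserving relationships among entities” more concrete, here is a minimal Python sketch of a catalog that records both items and the relationships between them. It is an illustration only, not a metadata standard or the approach used at MIT Libraries: the class names, field names, identifiers, and relationship labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    identifier: str
    title: str
    kind: str          # e.g. "source printout", "manual", "media"
    accessible: bool   # can a researcher consult it today?


@dataclass
class Catalog:
    items: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def add(self, item):
        self.items[item.identifier] = item

    def relate(self, subject, predicate, obj):
        # Refuse relationships that point at materials not yet described.
        for identifier in (subject, obj):
            if identifier not in self.items:
                raise KeyError(f"unknown item: {identifier}")
        self.relations.append((subject, predicate, obj))

    def access_points(self, identifier):
        """Accessible items directly related to `identifier`, in either direction."""
        related = set()
        for subject, _, obj in self.relations:
            if subject == identifier:
                related.add(obj)
            elif obj == identifier:
                related.add(subject)
        return sorted(i for i in related if self.items[i].accessible)


# Hypothetical records: an unreadable tape described through related paper materials.
catalog = Catalog()
catalog.add(Item("tape-01", "Unidentified computer tape", "media", accessible=False))
catalog.add(Item("print-01", "Source code printout", "source printout", accessible=True))
catalog.add(Item("manual-01", "User manual", "manual", accessible=True))
catalog.relate("print-01", "may_be_copy_of", "tape-01")
catalog.relate("manual-01", "documents", "tape-01")

print(catalog.access_points("tape-01"))
```

In this sketch the tape itself cannot be consulted, but the recorded relationships still lead a researcher to the printout and the manual, which is the sense in which description can provide points of access.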

Digitize and OCR paper materials. Paper printouts of source code and related documentation can be digitized according to established best practice workflows[7].  The use of optical character recognition (OCR) programs produces machine-readable output, enabling easy indexing of content to enhance discoverability and/or textual transcriptions.  The latter option can make historical source code more portable for use in simulations or reconstructions of software.
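As a rough illustration of how machine-readable OCR output can support discoverability, the Python sketch below builds a simple inverted index over page-level text. It assumes the OCR step has already been run by some other tool and has produced one text string per page; the page identifiers and sample text are invented for the example and are not transcriptions of any archival material.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page identifiers on which it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, term):
    """Return the sorted page identifiers containing `term` (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


# Invented sample text standing in for OCR output.
ocr_pages = {
    "page-001": "DEFINE ERASER ICON",
    "page-002": "define window; draw icon",
}
index = build_index(ocr_pages)
print(search(index, "icon"))
print(search(index, "ERASER"))
```

Real OCR output from source code printouts is likely to contain recognition errors, so an index like this is better treated as a finding aid than as a faithful transcription.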

Migrate media. Legacy software often resides on unstable media such as floppy disks or magnetic tape. In cases where access to the software itself is desirable, migrating and/or extracting media contents (where possible) to a more stable medium is recommended [8].
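One common companion to migration, not discussed in this post and added here only as an assumption about typical practice, is to confirm that a copy is bit-for-bit identical to what was read from the source medium. The Python sketch below copies a file and compares SHA-256 checksums; the function names and file names are hypothetical, and the "disk image" is a stand-in file created in a temporary directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy `source` to `destination` and return the checksum if both match."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"checksum mismatch for {destination}")
    return expected


# Stand-in for an extracted disk image (hypothetical content).
with tempfile.TemporaryDirectory() as workdir:
    image = Path(workdir) / "disk.img"
    image.write_bytes(b"\x00" * 1024)
    checksum = copy_and_verify(image, Path(workdir) / "disk_copy.img")
    print(checksum)
```

Storing the returned checksum alongside the migrated file would allow the same comparison to be repeated later.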


As an active practice, software curation means anticipating future use and uses of resources from the past. Recalling an earlier blog post, our research aims to produce software curation strategies that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future”[9]. As the born-digital record increases in scope and volume, libraries will necessarily have to address significant changes in the ways in which we use and make use of new kinds of resources.  Technological quandaries of storage and access will likely prove less burdensome than the social, cultural, and organizational challenges of adapting to new forms of knowledge-making. Legacy software represents this problem space for libraries/archives today.  Devising curation strategies for software helps us to learn more about how knowledge-embedded practices are changing and gives us new opportunities for building healthy infrastructures [10].   


[1] Rios, F., Almas, B., Contaxis, N., Jabloner, P., Kelly, H. (2017). Exploring curation-ready software: use cases. doi:10.17605/OSF.IO/8RZ9E

[2] These are some of the open research questions being addressed by the initial cohort of CLIR/DLF Software Curation Fellows in different institutions across the country.  

[3] Bearman, D. (1985). Collecting software: a new challenge for archives & museums. Archives & Museum Informatics, Pittsburgh, PA.

[4] Documentation strategy approaches archival practice as a collaborative work among record creators, archivists, and users.  It often traverses institutions and represents an alternative approach by prompting extensive documentation organized around an “ongoing issue or activity or geographic area.” See:  Samuels, H. (1991). “Improving our disposition: Documentation strategy,” Archivaria 33,

[5] Carlson, J. (2010). “The Data Curation Profiles toolkit: Interviewer’s manual,”

[6] The results of two applied research projects provide examples from the digital preservation literature.  In 2002, the Agency to Researcher Project at the National Archives of Australia developed a conceptual model based on software performance as a measure of the effectiveness of digital preservation strategies. See: Heslop, H., Davis, S., Wilson, A. (2002). “An approach to the preservation of digital records,” National Archives of Australia. In their 2008 JISC report, Matthews et al. proposed a composite view of software with the following four entities: package, version, variant, and download. See: Matthews, B., McIlwrath, B., Giaretta, D., Conway, E. (2008). “The significant properties of software: A study,”

[7]  Technical guidelines for digitizing archival materials for electronic access: Creation of production master files–raster images. (2005). Washington, D.C.: Digital Library Federation,

[8] For a good overview of storage recommendations for magnetic tape, see: To read more about the process of reformatting analog media, see: Pennington, S., and Rehberger, D. (2012). The preservation of analog video through digitization. In D. Boyd, S. Cohen, B. Rakerd, & D. Rehberger (Eds.), Oral history in the digital age. Institute of Museum and Library Services. Retrieved from

[9] Moore, R. (2008). “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).

[10] Thinking about software as infrastructure provides a useful framing for envisioning strategies for curation.  Infrastructure perspectives advocate “adopting a long term rather than immediate timeframe and thinking about infrastructure not only in terms of human versus technological components but in terms of a set of interrelated social, organizational, and technical components or systems (whether the data will be shared, systems interoperable, standards proprietary, or maintenance and redesign factored in).”  See:  Bowker, G.C., Baker, K., Millerand, F. & Ribes, D. (2010). “Toward information infrastructure studies: Ways of knowing in a networked environment.” In J. Hunsinger, L. Klastrup, & M. Allen (Eds.), International handbook of Internet research. Dordrecht: Springer, 97-117.


Software as a Collection Object

As I described in my first post, an initial challenge at MIT Libraries was to align our research questions with the long-term collecting goals of the institution. As it happens, MIT Libraries had spent the last year working on a task force report to begin to formulate answers to just these sorts of questions. In short, the task force envisions MIT Libraries as a global platform for scholarly knowledge discovery, acquisition, and use. Such goals may at first appear lofty. However, the acquisition of knowledge through public access to resources has been a central organizing principle of libraries since their inception. In his opening statement at the first national conference of librarians in 1853, Charles Coffin Jewett proclaimed, “We meet to provide for the diffusion of a knowledge of good books and for enlarging the means of public access to them.” [1]

Archivists and professionals working in special collections have long been focused on providing access to, and preservation of, local resources at their institutions. What is perhaps most distinctive about the past decade is the broadened institutional focus on locally-created content. This shift in perspective towards looking inwards is a trend noted by Lorcan Dempsey, who describes it thusly:

In the inside-out model, by contrast, the university, and the library, supports resources which may be unique to an institution, and the audience is both local and external. The institution’s unique intellectual products include archives and special collections, or newly generated research and learning materials (e-prints, research data, courseware, digital scholarly resources, etc.), or such things as expertise or researcher profiles. Often, the goal is to share these materials with potential users outside the institution.[2]

Arguably, this shift in emphasis can be attributed to the affordances of the contemporary networked research environment, which has broadened access to both resources and tools. Archival collections previously considered “hidden” have been made more accessible for historical research through digitization. Scholars are also able to ask new kinds of historical questions using aggregate data, and answer historical questions in new kinds of ways.

This raises the question – what unique and/or interesting content does an institution with a rich history of technology and innovation already have in its possession?

Exploring Software in MIT Collections

As a research institution, MIT has played a fundamental role in the development and history of computing. Since the 1940s, the Institute has excelled in the creation and production of software and software-based artifacts. Project Whirlwind, Sketchpad, and Project MAC are just a few of the monumental research computing projects conducted here. As such, the Institute Archives & Special Collections has over time acquired a significant number of materials related to software developed at MIT.

In our quest to understand how software may be used (and made useful) as an institutional asset, we engaged in a two-pronged approach. First, we aimed to identify the types of software that MIT might provide access to.  Second, we aimed to understand more about the active practices of researchers creating, using, and/or reusing software. What function or purpose was software being created, used, and/or reused for? We thought that framing our research in this way might help us develop a robust understanding of both existing practices and potential user needs. At the same time, we also recognized that identifying and exposing potential “pain points” could guide and inform future curation strategies. After an initial period of exploratory work, we identified representative software cases found in various pockets across the MIT campus.

Collection #1: The JCR Licklider Papers and the GRAPPLE software

Materials in The JCR Licklider Papers were first acquired by the Institute Archives and Special Collections in 1996. Licklider was a psychologist and renowned computer scientist who came to MIT in 1950. He is widely hailed as an influential figure for his visionary ideas around personal computing and human-computer interaction.

In my exploration of archival materials, I looked specifically at boxes 13-18 in the collection, which contained documentation about GRAPPLE, a dynamic graphical programming system developed while Licklider was at the MIT Laboratory for Computer Science. According to the user manual, the focus of GRAPPLE is on “the development of a graphical form of a language that already exists as a symbolic programming language.” [3] Programs could be written using computer-generated icons and then monitored by an interpreter.


Figure 1. Folder view, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

Materials in the collection related to GRAPPLE include:

  • Printouts of GRAPPLE source code
  • GRAPPLE program description
  • GRAPPLE interim user manual
  • GRAPPLE user manual
  • GRAPPLE final technical report
  • Undated and unidentified computer tapes
  • Assorted correspondence between Licklider and the Department of Defense

Each of the documents has multiple versions included in the collection, typically distinguished by date and filename (where visible). The printouts of GRAPPLE source code totaled around forty pages. The computer tapes have not yet been formatted for access.

While the software may be cumbersome to access on existing media, the materials in the collection contain substantial amounts of useful information about the function and nature of software in the early 1980s. Considering the documentation related to GRAPPLE in different social contexts helped to illuminate the value of the collection in relationship to the history of early personal computing.

Historians of programming languages would likely be interested in studying the evolution of the coding syntax contained in the collection. The GRAPPLE team used the now-defunct programming language MDL (which stands for “More Datatypes than Lisp”); the extensive documentation provides examples of MDL “in action” through printouts of code packages.


Figure 2. Computer file printout, “eraser.mud.1”, 31 May 1983, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

The challenges facing the GRAPPLE team at the time of coding and development would be interesting to revisit today. One obstacle to successful implementation noted by the team was the existing limitations of graphical display environments. In their final technical report on the project from 1984, the GRAPPLE team describes the potential of desktop icons for identifying objects and their representational qualities.

Our conclusion is that icons have very significant potential advantages over symbols but that a large investment in learning is required of each person who would try to exploit the advantages fully. As a practical matter, symbols that people already know are going to win out in the short term over icons that people have to learn in applications that require more than a few hundred identifiers. Eventually, new generations of users will come along and learn iconic languages instead of or in addition to symbolic languages, and the intrinsic advantages of icons as identifiers (including even dynamic or kinematic icons) will be exploited. [4]

Some fundamental dynamics in the study of human-computer interaction remain relatively unchanged despite advances in technology; namely, the powerful relationship between representational symbols and the production of knowledge/knowledge structures. What might it look like to bring to life today software that was conceived in the early days of personal computing? Such aspirations are certainly possible. Consider the journey of the Apollo 11 source code, which was transcribed from digitized code printouts and then put onto GitHub. One can even simulate the Apollo missions using a virtual Apollo Guidance Computer (AGC).

Other collection materials also offer interesting documentation of early conceptions of personal computing while also providing clear evidence that computer scientists such as Licklider regarded abstraction as an essential part of successful computer design. A pamphlet entitled “User Friendliness–And All That” notes the “problem” of mediating between “immediate end users” and “professional computer people” to successfully aid in a “reductionist understanding of computers.”

Figure 3. Pamphlet, “User friendliness-And All That”, undated, box 16, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

These descriptions are useful for illuminating how software was conceived and designed to be a functional abstraction. Such revelations may be particularly relevant in the current climate – where debates over algorithmic decision making are rampant. As the new media scholar Wendy Chun asks, “What is software if not the very effort of making something intangible visible, while at the same time rendering the visible (such as the machine) invisible?” [5]


Building capacity for collecting software as an institutional asset is difficult work. Expanding collecting strategies presents conceptual, social, and technical challenges that crystallize once scenarios for access and use are envisioned. For example, when is software considered an artifact ready to be “archived and made preservable”? What about research software developed and continually modified over the years in the course of ongoing departmental work? What about printouts of source code – is that software? How do code repositories like GitHub fit into the picture? Should software only be considered as such in its active state of execution? Interesting ontological questions surface when we consider the boundaries of software as a collection object.

Archivists and research libraries are poised to meet the challenges of collecting software. By exploring what makes software useful and meaningful in different contexts, we can more fully envision potential future access and use scenarios. Effectively characterizing software in its dual role as both artifact and active producer of artifacts remains an essential piece of understanding its complex value.


[1] “Opening Address of the President.” Norton’s Literary Register And Book Buyers Almanac, Volume 2. New York: Charles B. Norton, 1854.

[2] Dempsey, Lorcan. “Library Collections in the Life of the User: Two Directions.” LIBER Quarterly 26, no. 4 (2016): 338–359. doi:

[3]  GRAPPLE Interim User Manual, 11 October 1981, box 14, J.C.R. Licklider Papers, 1938-1995 (MC 499), Institute Archives and Special Collections, MIT Libraries, Cambridge, Massachusetts.

[4] Licklider, J.C.R. Graphical Programming and Monitoring Final Technical Report, U.S. Government Printing Office, 1988, 17.

[5] Chun, Wendy Hui Kyong. “On Software, or the Persistence of Visual Knowledge.” Grey Room 18 (Winter 2004): 26-51.

Curation as Context: Software in the Stacks

As scholarly landscapes shift, differing definitions for similar activities may emerge from different communities of practice.   As I mentioned in my previous blog post, there are many distinct terms for (and perspectives on) curating digital content depending on the setting and whom you ask [1].  Documenting and discussing these semantic differences can play an important role in crystallizing shared, meaningful understandings.  

In the academic research library world, the so-called data deluge has presented library and information professionals with an opportunity to assist scholars in the active management of their digital content [2].  Curating research output as institutional content is a relatively young, though growing, phenomenon.  Research data management (RDM) groups and services are increasingly common in research libraries, partially fueled by changes in federal funding grant application requirements to encourage data management planning.  In fact, according to a recent content analysis of academic library websites, 185 libraries are now offering RDM services [3].  The charge for RDM groups can vary widely; tasks can range from advising faculty on issues related to privacy and confidentiality, to instructing students on potential avenues for publishing open-access research data.

As these types of services increase, many research libraries are looking to life cycle models as foundations for crafting curation strategies for digital content [4].  On the one hand, life cycle models recognize the importance of continuous care and necessary interventions that managing such content requires.  Life cycle models also provide a simplified view of essential stages and practices, focusing attention on how data flows through a continuum.  At the same time, the data flow perspective can obscure both the messiness of the research process and the complexities of managing dynamic digital content [5,6].  What strategies for curation can best address scenarios where digital content is touched at multiple times by multiple entities for multiple purposes?  

Christine Borgman notes the multifaceted role that data can play in the digital scholarship ecosystem, serving a variety of functions and purposes for different audiences.  Describing the most salient characteristics of that data may or may not serve the needs of future use and/or reuse. She writes:

These technical descriptions of “data” obscure the social context in which data exist, however. Observations that are research findings for one scientist may be background context to another. Data that are adequate evidence for one purpose (e.g., determining whether water quality is safe for surfing) are inadequate for others (e.g., government standards for testing drinking water). Similarly, data that are synthesized for one purpose may be “raw” for another. [7]

Particular data sets may be used and then reused for entirely different intentions.  In fact, enabling reuse is a hallmark objective for many current initiatives in libraries/archives.  While forecasting future use is beyond our scope, understanding more about how digital content is created and used in the wider scholarly ecosystem can prove useful for anticipating future needs.  As Henry Lowood argues, “How researchers will actually put their hands and eyes on historical software and data collections generally has been bracketed out of data curation models focused on preservation” [8].

As an example, consider the research practices and output of faculty member Alice, who produces research tools and methodologies for data analysis. If we were to document the components used and/or created by Alice for this particular research project, it might include the following:

  • Software program(s) for computing published results
  • Dependencies for software program(s) for replicating published results
  • Primary data collected and used in analysis
  • Secondary data collected and used in analysis
  • Data result(s) produced by analysis
  • Published journal article

We can envision at least two uses of this particular instantiation of scholarly output. First, the statistical results of the data can be verified by replicating the conditions of the analysis.  Second, the statistical approach implemented by the software program can be executed on a new input data set.  In this way, software can simultaneously serve as both an outcome to be preserved and a methodological means to a (new) end.

There are certain affordances in thinking about strategies for curation-as-context, outside the life cycle perspective.  Rather than emphasizing content as an outcome to be made accessible and preserved through a particular workflow, curation could instead aim to encompass the characterization of well-formed research objects, with an emphasis on understanding the conditions of their creation, production, use, and reuse.   Recalling our description of Alice above, we can see how each component of the process can be brought together to represent an instantiation of a contextually-rich research object.
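To make the idea of a research object a little more concrete, here is a minimal sketch in Python that gathers the components from Alice's project into one structure and asks which of the two uses described above it could support. The names `ResearchObject`, `Component`, and `Role`, the example labels, and the rules about which components each use requires are all illustrative assumptions on my part, not a model produced by the project.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Role(Enum):
    """Component types taken from the list describing Alice's project."""
    SOFTWARE = "software program"
    DEPENDENCY = "software dependency"
    PRIMARY_DATA = "primary data"
    SECONDARY_DATA = "secondary data"
    RESULT = "data result"
    PUBLICATION = "published journal article"


@dataclass
class Component:
    role: Role
    label: str
    context: str = ""  # free-text note on how the component was created or used


@dataclass
class ResearchObject:
    creator: str
    components: List[Component] = field(default_factory=list)

    def roles(self) -> Set[Role]:
        """Return the set of component roles present in this object."""
        return {c.role for c in self.components}

    def supports_replication(self) -> bool:
        # Assumption: verifying published results needs the software, its
        # dependencies, the original primary data, and the results to compare.
        needed = {Role.SOFTWARE, Role.DEPENDENCY, Role.PRIMARY_DATA, Role.RESULT}
        return needed <= self.roles()

    def supports_reuse_on_new_data(self) -> bool:
        # Assumption: applying the method to new data needs only the software
        # and its dependencies; the new data set comes from the future user.
        return {Role.SOFTWARE, Role.DEPENDENCY} <= self.roles()


# Hypothetical description of Alice's project; every label is made up.
alice = ResearchObject(
    creator="Alice",
    components=[
        Component(Role.SOFTWARE, "analysis program", "computes published results"),
        Component(Role.DEPENDENCY, "statistics library"),
        Component(Role.PRIMARY_DATA, "survey responses"),
        Component(Role.SECONDARY_DATA, "reference data set"),
        Component(Role.RESULT, "summary tables"),
        Component(Role.PUBLICATION, "journal article"),
    ],
)

# The same project with only the software and its dependencies deposited.
software_only = ResearchObject(
    creator="Alice",
    components=[c for c in alice.components
                if c.role in (Role.SOFTWARE, Role.DEPENDENCY)],
)
```

In this sketch, `alice` satisfies both checks, while `software_only` can still be run on new data but can no longer be used to verify the published results. The free-text `context` field is a crude stand-in for the richer documentation of creation, production, use, and reuse that a curation-as-context approach would call for.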

Curation-as-context approaches can help us map the always-already in flux terrain of dynamic digital content.  In thinking about curating software as a complex object for access, use, and future use, we can imagine how mapping the existing functions, purposes, relationships, and content flows of software within the larger digital scholarship ecosystem may help us anticipate future use, while documenting contemporary use.  As Cal Lee writes:

Relationships to other digital objects can dramatically affect the ways in which digital objects have been perceived and experienced. In order for a future user to make sense of a digital object, it could be useful for that user to know precisely what set of surrogate representations – e.g. titles, tags, captions, annotations, image thumbnails, video keyframes – were associated with a digital object at a given point in time. It can also be important for a future user to know the constraints and requirements for creation of such surrogates within a given system (e.g. whether tagging was required, allowed, or unsupported; how thumbnails and keyframes were generated), in order to understand the expression, use and perception of an object at a given point in time [9].

Going back to our previous blog post, we can see how questions like “How are researchers creating and managing their digital content?” are essential counterparts to questions like “What do individuals served by the MIT Libraries need to be able to reuse software?” Our project aims to produce software curation strategies at MIT Libraries that embrace Reagan Moore’s theoretical view of digital preservation, whereby “information generated in the past is sent into the future” [10].  In other words, what can we learn about software today to make an essential contribution to meaningful access and use tomorrow?

Works Cited
[1] Palmer, C., Weber, N., Muñoz, T., and Renar, A. (2013), “Foundations of data curation: The pedagogy and practice of ‘purposeful work’ with research data”, Archives Journal, Vol 3.

[2] Hey, T. and Trefethen, A. (2008), “E-science, cyberinfrastructure, and scholarly communication”, in Olson, G.M., Zimmerman, A., and Bos, N. (Eds), Scientific Collaboration on the Internet, MIT Press, Cambridge, MA.

[3] Yoon, A. and Schultz, T. (2017), “Research data management services in academic libraries in the US: A content analysis of libraries’ websites” (in press). College and Research Libraries.

[4] Ray, J. (2014), Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[5] Carlson, J. (2014), “The use of lifecycle models in developing and supporting data services”, in Ray, J. (Ed),  Research Data Management: Practical Strategies for Information Professionals, Purdue University Press, West Lafayette, IN.

[6] Ball, A. (2010), “Review of the state of the art of the digital curation of research data”, University of Bath.

[7] Borgman, C., Wallis, J. and Enyedy, N. (2006), “Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries”, Center for Embedded Network Sensing, 7(1–2), 17–30. doi: 10.1007/s00799-007-0022-9. UCLA: Center for Embedded Network Sensing.

[8] Lowood, H. (2013), “The lures of software preservation”, Preserving.exe: Towards a national strategy for software preservation, National Digital Information Infrastructure and Preservation Program of the Library of Congress.

[9] Lee, C. (2011), “A framework for contextual information in digital collections”, Journal of Documentation 67(1).

[10] Moore, R. (2008), “Towards a theory of digital preservation”, International Journal of Digital Curation 3(1).

Software Curation: An Introductory Post

In October 2016, I began working at the MIT Libraries as a CLIR/DLF Postdoctoral Fellow in Software Curation. CLIR began offering postdoctoral fellowships in data curation in 2012; however, three others and I were part of the first cohort conducting research in the area of Software Curation.  At our fellowship seminar and training this summer, the four of us joked about not having any idea what we would be doing (and Google wasn’t much help). Indeed, despite years of involvement in digital curation, I was unsure of what it might mean to curate software. As has been well-documented in the library/archival science community, curation of data means many different things to many different people.  Add in the term “software” and the complexities only increase.

At MIT Libraries, I have had the good fortune of working with two distinguished experts in library research: Nancy McGovern, the Director of the Digital Preservation Program, and Micah Altman, the Director of Research.  This blog post describes the first phase of our work together in defining a research agenda for curating software as an institutional asset.

Defining Scope

As we began to suss out possible research objectives and assorted activities, we found ourselves circling back to four central questions – which themselves split into associated sub-questions.

  • What is software? What is the purpose and function of software? What does it mean to curate software? How do these practices differ from preservation?
  • When do we curate software? Is it at the time of creation? Or when it is acquired by an institution?
  • Why do institutions and researchers curate software?
  • Who is institutionally responsible for curating software and for whom are we curating software?

Developing Focus and Purpose

We also began to outline the types of exploratory research questions we might ask depending on the specific purpose and entities we were creating a model for (see Table 1 below). Of course, these are only some of the entities that we could focus on; we could also broaden our scope to include research questions of interest to software publishers, software journals, or funding agencies interested in software curation.

Entity | All libraries/archives | MIT Libraries
Research library | What does a library need to safeguard and preserve software as an asset? How are other institutions handling this? How are funding agencies considering research on software curation? | What are the MIT Libraries’ existing and future needs related to software curation?
Software creator | What are the best practices software creators should adopt when creating software? How are software creators depositing their software and how are journals recommending they do this? | What are the individual needs and existing practices of software creators served by the MIT Libraries?
Software user | What are the different kinds of reasons why people may use software? What are the conditions for use? What are the specific curation practices we should implement to make software usable for this community? | What do individuals served by the MIT Libraries need to be able to reuse software?

Table 1: Research questions by entity and intended audience

Importantly, we wanted to adopt an agile research approach that considered software as an artifact, rather than (simply) as an outcome to be preserved and made accessible.  Curation in this sense might seek to answer ontological questions about software as an entity with significant characteristics at different levels of representation.   Certainly, digital object management approaches that emphasize documentation of significant properties or characteristics are long-standing in the literature.  At the same time, we wanted our approach to address essential curatorial activities (what Abby Smith termed “interventions”) that help ensure digital files remain accessible and usable. [1]  We returned to our shared research vision: to devise a set of conceptual models for software curation strategies to assist research outcomes that rely on the creation, use, reuse, and study of software.

Statement of Research Objectives and Working Definitions

Given the preponderance of definitions for curation and the wide-ranging implications of curating for different purposes and audiences, we thought it would be essential for us to identify and make clear our particular interests.  We developed the following statement to best describe our goals and objectives:

Libraries and archives are increasingly tasked with responsibilities related to the effective long-term preservation and curation of software.  The purpose of our work is to investigate and make recommendations for strategies that institutions can adopt for managing software as complex digital objects across generations of technology.

We also developed the following working definition of software curation for use in our research:

Software curation encompasses the active practices related to the creation, acquisition, appraisal and selection, description, transformation, preservation, storage, and dissemination/access/reuse of software over short and long periods of time.

What’s Next

The next phase of our research involves formalizing our research approach through the evaluation, selection, and application of relevant models (such as the OAIS Reference Model) and ontologies (such as the Software Ontology, SWO). We are also developing different data curation profiles to flesh out the activities, roles, and relationships that are bound up in software creation, use, and reuse. In addition to reports on the status of our project, you can expect to read blog posts about both the philosophical and practical implications of curating software in an academic research library setting.


[1] As Abby Smith notes, “We have to intervene continually to keep digital files alive.  We cannot put a digital file on a shelf and decide later about preservation intervention.  Storage means active intervention.” See: Smith, Abby (2000). Authenticity in Perspective. In Authenticity in a Digital Environment, Council on Library and Information Resources.

A version of this post first appeared on MIT’s Program for Information Science blog.