A Citation is Forever

“Miscellaneous fancy work.” From the Project Gutenberg eBook of Encyclopedia of Needlework. (Photo credit: SWANclothing)

Last week, I talked about the basic model I'm considering for managing static web content in a way that lets us find it based on when we looked at it. The idea is that if I want to cite something, I should be able to point at what I'm citing and know that someone else following my citations will see the same thing I did.

Today, I want to explore what it means for something to be citable.

I come from the sciences, where citation is a shorthand for bringing in a body of work that you don't want to reproduce in your text. It's like linking a library into a program. You're asserting that something is important to your argument and that anyone can find out why they should believe it by following the citation. You don't have to explain the reasoning behind what you're referencing.

If you use citations to give shout-outs to people in your field, then you don't need what I'm thinking about. Readers understand that these citations are there to remind them of other people and their body of work, not of the particular passage the citation points to. The details aren't important enough to look up.

I'm interested in the citations that people need to follow.

If you are citing a paper in a print journal, you're relying on the fact that it is very difficult to change a published journal and fairly easy to keep the paper around long enough. No one is going to go around to all the libraries that hold a copy and replace a page to update data or correct some mistake in the text. They might publish a follow-up article or letter letting everyone know about the mistake, but that is a separate piece that can be cited along with the original. You can point to both and show a sequence.

More importantly, someone can see what you saw when you cited the first but not the second in a paper you wrote. If your conclusions are wrong because of a mistake in the paper you cited, your reader can trace your error back to the one in your source.

Everyone can track the sequence in the scholarly conversation because they can see all the pieces as they happened. They can trace lines of thought as they evolved. They can see where mistakes were made and then corrected.

This isn't just true of journals. It holds for books and any other publication that happens to be on paper. When it's easier to publish a new edition or a new work than to correct the copies already published, we get a trail that shows us the conversation.

But what happens when it's easier to change what was published than to create a new work? We only need to look at current practices for web-based projects to see: they can change at any time.

This means that what I see today might not be there tomorrow. Even if I give a URL and a date on which I accessed the site, someone else might not be able to find what I found. The conversation becomes quicksand. How can we depend on something that can change at any time?

One solution could be to scrape every scholarly web-based project and store it in the Internet Archive, but that loses any data exploration capabilities. It reduces the web to a static, electronic equivalent of the book without taking advantage of the affordances of the medium.

Another solution is to package everything up into a virtual machine, but that breaks down as well. When someone wants to view the project, they run the virtual machine. If an institution is hoping to preserve the site, then it won't allow changes. The project becomes a limited-access, dead project. It becomes a bauble. A curiosity. It's a bit better than scraping the pages, but it doesn't capture all the affordances of the medium.

Even if we were to find these solutions passable, we would have to constantly be scraping the web or taking virtual machine snapshots as the project grew. It's not enough to preserve the end result if the project has been available long enough for people to cite it.

Every citation must be accessible.

Let's consider a project that is collecting data and providing analytics on that data. If I go to the site today, run through the data, and use those results, then I should cite the project as it was today when I ran the analytics. Someone coming to the site tomorrow might see different results if new data has been added to the project in the interim.

Ideally, I should be able to point someone to a URL that will show what I saw when I saw it using the data that was available at that time. If a project has the URL http://www.example.com/foo/, then perhaps the citation could be http://www.example.com/201206071600/foo/. Assuming the project is built on software that can track when data was added and when pages were changed, this would give someone the same content I had when I visited. Memento is a project looking at this question, but from a slightly different angle.
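Here is a minimal sketch of how that lookup might work. It is an illustration, not a description of any existing system. It assumes the dated URL layout from my example above and assumes the project keeps every revision of a page along with the time it was saved. The function names and the revision history are made up.

```python
import bisect
import re
from datetime import datetime

# Assumed URL layout, taken from the example above: /<YYYYMMDDHHMM>/<path>
CITATION_PATH = re.compile(r"^/(\d{12})(/.*)$")


def citation_path(path, when):
    """Build a dated path such as /201206071600/foo/ from a plain path."""
    return "/" + when.strftime("%Y%m%d%H%M") + path


def parse_citation_path(dated_path):
    """Split a dated path into (timestamp, original path)."""
    match = CITATION_PATH.match(dated_path)
    if match is None:
        raise ValueError("not a dated citation path: " + dated_path)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M"), match.group(2)


def content_as_of(revisions, when):
    """Return the content of the latest revision saved at or before `when`.

    `revisions` is a list of (saved_at, content) tuples sorted by saved_at.
    Returns None if nothing had been published yet.
    """
    saved_times = [saved_at for saved_at, _ in revisions]
    index = bisect.bisect_right(saved_times, when)
    if index == 0:
        return None
    return revisions[index - 1][1]


# Made-up revision history for the hypothetical page /foo/
history = {
    "/foo/": [
        (datetime(2012, 6, 1, 9, 0), "first draft of the data"),
        (datetime(2012, 6, 7, 12, 30), "data as I saw it"),
        (datetime(2012, 6, 8, 8, 15), "data after a later update"),
    ]
}

when, path = parse_citation_path("/201206071600/foo/")
print(content_as_of(history[path], when))
```

The point of the sketch is that the project never overwrites anything. A citation made at 4 p.m. on June 7 keeps resolving to the revision that was current at that moment, no matter how many revisions are saved afterward.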

Big assumptions, but disk is cheap and getting cheaper. There's no reason we can't come up with a technological solution to support citations in this way.

The other big problem is sustainability. Say we solve the technology problem of pointing people to the right version of a resource. How do we know it will still be around in a hundred years? Citation requires being able to point at the right version, but it also requires accessibility and availability. Someone has to keep the computers running.

If you have your own site, you might pay for a domain name and web hosting. It's not a lot of money. Maybe a hundred dollars a year. Are you willing to pay that much each year for the rest of your life? What happens after you aren't around anymore to pay the bills for the site? Who pays to upgrade to the newest software?

This is why libraries and archives tend to scrape the web or run the last version of a project in a virtual machine. These are both relatively low maintenance. No code has to be updated to keep the information around. It's like pressing a leaf into a book. We keep the outline and colors, but have to give up its life in the process.

I don't have all the answers, but these are some of the problems I see. Feel free to chime in with your own thoughts on this. I'm curious what you see as possible ways around the problem of citation and sustainability in the online world.