Rather than write specifically about software I’ve been testing, I’d like to take this opportunity to share where I am with a personal DH project. I plan for this project to be of use in both of the DH-centric courses I am taking this semester: Digital Humanities Methods 2 and Visualizing Big Data. I worry that if I do not do this now, in the shared blog format, I risk losing some key points of trouble and success I encountered during long hours of experimentation over the break. While I found myself hardcore devoted to XML and XSLT transformations for several weeks, my latest obsession has been the R language. The visualization techniques I have been exploring in R are helping me create very interesting data-based images, which I hope to integrate into work where much of the web-based presentation will be done through XML-to-HTML transformations, with the texts themselves encoded in XML using techniques we have studied, notably RegEx to find specific highlights.
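Since I keep invoking RegEx, here is a minimal sketch of the kind of pattern-matching I mean, written in R rather than inside an XML workflow; the sample sentence and the pattern are invented examples of my own:

```r
# Find "highlights" in a text with a regular expression in base R.
text <- "Call me Ishmael. Some years ago, never mind how long precisely."

# Pull out every capitalized word (a stand-in for names worth tagging in XML).
matches <- regmatches(text, gregexpr("\\b[A-Z][a-z]+\\b", text, perl = TRUE))[[1]]
print(matches)  # "Call" "Ishmael" "Some"
```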
I may be getting ahead of myself here, and that is just because I’m trying to sound all of this out for myself. I imagine that many of my classmates find themselves needing to do this on a regular basis. The nice thing about all of these methods is that they remain unchanged, unlike my project ideas. I have done my best to post about some of the hurdles I am jumping as I hammer these new techniques out for myself, and I still plan to record some of my initial experiences learning basic R programming. That said, let me take a few breaths here and explain my line of thinking as I approach the moment where I need a foundational answer to the question of what I would like to use these skills to accomplish:
Idea #1: The Scrooge McDuck Analysis
My first ideas are often my least practical, but the opportunity to jump right into exploring assigned DH methods on my own terms has been both challenging and satisfying. My attempt to scour some comic books in the desk drawer for relevant data was not completely in vain, and it did lead me to some interesting findings. Most notably, I found that there are still some major headaches in data acquisition when attempting to transform scanned printed material into digitally malleable text. Seeing the trouble one has to go through, even with expensive OCR software, I realized that messing about with old comic books might be a little flippant without a larger end-game in mind, so I tried to shift my lens to a more encompassing range moving forward.
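For anyone curious what that OCR step looks like in R, here is a minimal sketch using the rOpenSci tesseract package; the scan filename is a hypothetical placeholder, and comic-book lettering will come out far rougher than this tidy snippet suggests:

```r
# OCR a scanned page with the tesseract package (install.packages("tesseract")).
library(tesseract)

eng  <- tesseract("eng")                           # English-trained engine
text <- ocr("comic_page_scan.png", engine = eng)   # hypothetical scan file
cat(text)

# Per-word confidence scores show exactly where the OCR is guessing.
words <- ocr_data("comic_page_scan.png", engine = eng)
head(words)
```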
Idea #2: The Postmodern Database
When exploring methods of XML and XSLT transformation using oXygen, my initial attempt involved creating spreadsheet-based data on a small set of canonical works of postmodern fiction. While this allowed me to play with data encoding on a very simple level, I was again left with the question: what do I want to do with this? And once again my exploration of the materials in my developing toolkit offered less insight than the tools themselves promised. This is because, again, my initial question was far too simple, and the question that developed became too complicated. I began to think I could perhaps analyze something nested in these postmodern texts, but since few of them (okay, none of them) are in the public domain, the prospect of finding full versions of these texts was anxiety-inducing. Another problem was my realization that the prose itself is so experimental that I should perhaps move out of this arena, at least for the time being. I needed data that was more easily available for “playing with,” and this data, while extremely playful in and of itself, was simply not readily available for me to use. I needed more readily acquired initial data that could be ported into the onslaught of new applications and methods coming at me from both my DH Methods class and my Data Vis class.
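As a small bridge between the XML work and my new R obsession, here is a minimal sketch of parsing the kind of tiny hand-rolled record I was building, using the xml2 package; the element names are my own invented example, not a real encoding standard:

```r
# Parse a small XML record with the xml2 package (install.packages("xml2")).
library(xml2)

doc <- read_xml('
<library>
  <novel year="1966"><title>The Crying of Lot 49</title></novel>
  <novel year="1985"><title>White Noise</title></novel>
</library>')

titles <- xml_text(xml_find_all(doc, "//novel/title"))
years  <- xml_attr(xml_find_all(doc, "//novel"), "year")
data.frame(title = titles, year = years)
```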
Idea #3: The 19th Century Metadex
At this point I had hit two specific walls of frustration, one from each of my two DH classes. From my Methods course: how can I acquire a collection of texts I might like to mark up for close analysis with a digital flavor? From my Big Data Vis course: how can I find cultural data with quantitative value that can be uniquely visualized? I thought that if I could create a value scale out of a specific 19th-century data set of Books about Books and cross-reference it (perhaps a better word than the “superimpose” I first reached for) with full texts found in the gutenberg.org database, I could offer scholars an interesting hub from which to pull 19th-century information in a way that would reduce grunt work in a pleasurable way. While I was able to find very interesting old data, it was hard to find new data to match it up with. None of the books I found in a collection of bibliographers’ indexes seemed to match up with latter-day value systems like LibraryThing or GoodReads. I also began to see warning signs that placing monetary data in relation to this data set was going to be damn near impossible, but more on that when I hit Idea #4. Basically, the data set I had found was too small to get any interesting results out of, covering little more than a single YEAR of data to plot. Now, I could have used concordance software to create numerical data out of word usage or character appearances and whatnot (a quick sketch of what I mean follows below), but I found myself more and more drawn to the idea of the valuation of texts, bringing me to…
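Here is that sketch: crude concordance-style word counts in base R over a plain-text file, where the filename is a placeholder for any downloaded Project Gutenberg e-text:

```r
# Crude word-frequency counts over a plain-text file.
# "gutenberg_text.txt" is a placeholder for any downloaded e-text.
raw   <- readLines("gutenberg_text.txt", warn = FALSE)
words <- unlist(strsplit(tolower(paste(raw, collapse = " ")), "[^a-z']+"))
words <- words[words != ""]        # drop empty strings left by the split

freq <- sort(table(words), decreasing = TRUE)
head(freq, 20)                     # the twenty most frequent words
```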
Idea #4: The Billion Dollar Library
I found an NYT article from 1898, linked here: http://query.nytimes.com/gst/abstract.html?res=F30E1FFD3C5C11738DDDAD0894DE405B8885F0D3
I found this data interesting because it is a summation of two of the most expensive book auctions to take place in the 19th Century. The data is old enough that no real copyright issues would hold me back from using it, and it comes with monetary figures I could explore with the R visualizations I was, at this point, being expected to learn. I thought it would be interesting to calculate the rate of inflation from 1898 to the present day, work out how much each of these books went to auction for in today’s dollars (or pounds), and then link to digitized versions of these texts (hence my title, The Billion Dollar Library). I still think this is my best idea; however, I again found myself running into one of my problems from Idea #1: OCR software and its current limitations. While I found an entire book with dates and values for the whole Henry Perkins library, summarized in the NYT article, running this book through OCR left me with more errors than information. It would take me months to enter this data correctly, and even then the data is very limited (it fills three columns, and one of those columns has no numerical value). This is something I really do hope to come back to, but for now I think I must move on once again.
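Before I do, the inflation arithmetic at the heart of the idea is simple enough in R. A minimal sketch, where the CPI index values and auction prices below are made-up placeholders rather than researched figures:

```r
# Adjust 1898 auction prices to present-day dollars via a CPI ratio.
# Both index values are invented placeholders, not real CPI data.
cpi_1898  <- 25
cpi_today <- 310

hammer_prices_1898 <- c(1500, 4200, 980)     # hypothetical 1898 hammer prices
adjusted <- hammer_prices_1898 * (cpi_today / cpi_1898)
round(adjusted, 2)                           # approximate present-day values
```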
Idea #5: Idea number 5, there is NO idea number 5, Idea number 6… Nobody Expects the Spanish Inquisition?
I am taking a breather from idea-having for the time being, and for my R programming I am pulling provided data sets from a collection of databases (Quandl, NY Open Data, etc.). As far as XML and XSLT go, I feel I need to get back in there and practice before I forget what I have figured out. This is one of my biggest issues: trying to make steady strides forward while juggling so many new techniques. I often find that in under-the-hood computing land, the biggest danger is forgetting what you already know; I would definitely not equate programming, encoding, or most digital methodologies with bike-riding. I am looking forward to more geolocation methods in our next DH Methods meeting, though I hesitate to say I see myself using many of them in my major project; even so, I have found learning these methods VERY helpful. They have guided me as I try to decipher exactly what the limitations of vector imagery are. Vectors are also very important in the R language, and I look forward to more and more vector usage, creation, and purpose.
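To close on that vector point, here is the sort of elementary vector behavior I mean, with made-up prices and titles standing in for real data:

```r
# Vectors are R's basic building block; most operations are element-wise.
prices <- c(12.50, 40.00, 7.25)                 # made-up prices in dollars
titles <- c("Folio A", "Quarto B", "Octavo C")  # made-up titles

prices * 2             # arithmetic applies to every element at once
titles[prices > 10]    # logical indexing: titles priced over $10
```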