09/21/12

Publishing and Using Linked Data at DHWI

In January I will be conducting a week-long workshop on Publishing and Using Linked Data as part of the Digital Humanities Winter Institute at the Maryland Institute for Technology in the Humanities.  Space is still available, so register today!

The publication of structured knowledge representations and open data on the Web opens new possibilities for collaboration among humanities researchers and cultural heritage organizations. This course will introduce participants to the core principles of Linked Open Data (LOD), techniques for building and understanding LOD models, how to locate LOD sources for research, tools for manipulating, visualizing, and integrating available data, and best practice methodologies for publicizing and sharing datasets.

For this course I will be drawing from initial work done by the Learning Linked Data project at the University of Washington iSchool, which has laid out a core inventory of learning topics.  The LOD for Libraries, Archives, and Museums community has also been actively promoting access to increasing amounts of cultural heritage information via Linked Data approaches.  Some of the questions we’ll be exploring in the workshop are:

  • what does the digital humanities community need from linked data?
  • what use can we make of these large data sets?
  • how can we synchronize scholarly work with the larger linked data community?

To help gain momentum for the workshop, I’ve created a wiki called Linked Data for Humanities, where I will be sharing drafts of the syllabus, resources, and example humanities projects.  (A big hat tip to Mia Ridge and the Museums and the Machine Processable Web wiki, which has been an important resource for the LODLAM community.)  If you have a humanities-based Linked Data project, questions, comments, or recommendations for things the course should cover, please join in the conversation.

08/7/12

NASASocial Reflections

Last week I had the chance to participate in a #NASASocial event commemorating the 50th Anniversary of the Kennedy Space Center.  The event was timed to kick off a weekend of NASA Social events leading up to the Mars Science Laboratory (#MSL) landing on August 5/6, 2012.

For me, this was a dream come true. I’ve wanted to visit KSC ever since I stuffed myself with too many Cheerios in order to get a special Space Shuttle kit in the 80s. Sadly, I never made it to a shuttle launch before the program ended. KSC had been high on my to-do list ever since I moved to Florida, so it was exciting to receive the invitation to the #KSC50 event. The NASASocial team hosted us in the KSC press center for two days of presentations about current NASA programs, especially focusing on the Curiosity mission. We also were able to participate in the first multi-site NASA Social event by joining a live simulcast with MSL scientists and engineers at the Jet Propulsion Lab (other NASA Centers were hosting similar events). The highlight of the event for me was our tour of early launch sites and getting to go inside the Launch Control Center and Vehicle Assembly Building.


This was my first “social media event” of this type and it’s given me a lot to ponder. Interestingly, this felt very different from my use of social media during conferences. When I’m at MCN, MW, etc., I have a pretty good idea of who the audience is for my tweets, but here I felt a little spammy. I’m not sure what you all thought of the stream coming at you last week, but I was trying to be somewhat restrained in crossing the streams. I am also a more casual fan of the space program and rank pretty low on the space geek ladder. After arriving, I wished I’d done some more reading up about what’s going on at NASA so I could ask our panelists better questions.

I was a little disappointed that we didn’t hear more history during an event cast as a 50th Anniversary celebration. I’m not sure this is a criticism as much as a surprising mismatch of expectations.  We did get to hear from some NASA old-timers who shared some great personal anecdotes about their time at NASA.  We also got a fat copy of This New Ocean, a part of NASA’s historical publications, but little mention was made of NASA’s other historical collections or efforts to document its history.  During the event I started tweeting links to oral histories from some of our speakers (Jay Honeycutt, Lee Solid, Roy Tharpe).  As we went around on our bus tour our guide did point out some landmarks, but only provided a little bit of what I’d call interpretation. Throughout the tour I was pulling up information from Wikipedia and other NASA sites about the locations we were visiting. (Had I thought about it, I should have looked for any dedicated apps related to KSC.  They do seem to have an official app, but the one review doesn’t make it look worth $.99.)  I’d be curious to see what kind of interpretation is offered on the public tour that covers the same area.

My other takeaway from this event is that I need to work at being social at social events.  I’m usually pretty good in a crowd of people I know, but still shy among strangers.  I sense there was some unofficial back channel that I might have tuned into if I’d been a little more aggressive about talking to other attendees.  The organizers seemed to leave this part of being social to us, and it has me wondering what impact “icebreakers,” etc. have on these kinds of events.  Compared to my conference experiences, I didn’t see as much direct back-and-forth on Twitter among participants (at least using the NASASocial hashtag).  Again, as a n00b, I may have been missing out on something (and ugh! I could never get the wifi to work right, so I was limited to my phone and spent the weekend regretting my wifi-only iPad).  From NASA’s perspective, I’m guessing that the events were successful.  The NASASocial tag trended in the US on both days and seemed to feed the buzz leading up to the landing.

Big thanks to NASASocial for letting me come aboard for this event. It was a great opportunity to learn more about KSC and the MSL Mission.  Since it was my first time participating in an event like this, it also has me thinking hard about how museums can use social media in this way to engage their audiences.  It’s going to provide a great example for my students when we discuss social media and museums later this fall.

(And yes, it’s been a while. Do we really need another “I haven’t been blogging for a while” post? I don’t think so!)

08/11/11

Reconciliation Recap

@jonvoss asked what I’d been up to related to reconciling my data, so here’s a brief account of what I’ve done over the last few weeks.   Much of this is proof-of-concept that will result in recommendations about what IMLS DCC might have to do to move towards Linked Open Data in the future. There are probably more efficient ways to program these tasks, but for the moment I’m using some simple tools that work for me.

In my previous post, I shared an example collection-level record set as RDF. I’ve gone back and simplified this transformation to leave out the representations of institutions and projects. It turns out the URIs that are present will resolve to a vCard RDF representation, e.g. http://imlsdcc.grainger.uiuc.edu/Registry/Institution/?1316 will return some XML. Maybe not the best representation, but we can work on that as a separate problem. This has the benefit of making the CLD instances simpler. I made a small change that will still associate a project with a funding agency (to demonstrate the contributions they’ve made).

Using the SIMILE Gadget tool, I’ve also extracted unique terms & frequency counts from the CLD records(1). These terms/frequencies are then imported into Google Refine and reconciled against appropriate LOD data:

Using Freebase has been pretty painless.  When a column of terms is reconciled,  Refine stores the Freebase ID.   To get the Freebase URI,  simply create a “New Column Based on This One” using the following GREL

"http://rdf.freebase.com/ns/m/"+cell.recon.match.id

Using this Freebase URL I can replace the literal statement

<dcterms:spatial>Illinois (state) </dcterms:spatial>

with a linked data statement:

<dcterms:spatial rdf:resource="http://rdf.freebase.com/ns/m/03v0t" rdfs:label="Illinois (state)" />
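For anyone who would rather script that substitution than hand-edit records, here is a minimal Python sketch using only the standard library. It is not the stylesheet I actually used. The sample record, its example.org URI, the "Unmatched place" value, and the `link_spatial` helper are all made up for illustration, and it assumes you already have a table mapping each literal label to its reconciled URI.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Keep readable prefixes when the tree is serialized again.
for prefix, uri in (("dcterms", DCTERMS), ("rdf", RDF), ("rdfs", RDFS)):
    ET.register_namespace(prefix, uri)


def link_spatial(root, uri_for_label):
    """Swap literal dcterms:spatial values for rdf:resource links.

    uri_for_label is a hypothetical dict of {literal label: reconciled URI}.
    Returns the number of elements that were rewritten.
    """
    changed = 0
    for el in root.iter(f"{{{DCTERMS}}}spatial"):
        label = (el.text or "").strip()
        uri = uri_for_label.get(label)
        if uri is None:
            continue  # no reconciled match: leave the literal alone
        el.text = None
        el.set(f"{{{RDF}}}resource", uri)
        el.set(f"{{{RDFS}}}label", label)
        changed += 1
    return changed


# Made-up sample record for demonstration only.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dcterms="{DCTERMS}">
  <rdf:Description rdf:about="http://example.org/collection/1">
    <dcterms:spatial>Illinois (state) </dcterms:spatial>
    <dcterms:spatial>Unmatched place</dcterms:spatial>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
matches = {"Illinois (state)": "http://rdf.freebase.com/ns/m/03v0t"}
count = link_spatial(root, matches)
print(ET.tostring(root, encoding="unicode"))
```

Unmatched literals are left as they are, so the output stays valid even when only part of a column has been reconciled.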

Reconciling against id.loc.gov has been more difficult. From my literal values I can create a query string that (sometimes) fetches the correct set of triples for a term. This works for most of our terms, though participants have contributed a few uncontrolled terms that don’t match. e.g. http://api.talis.com/stores/lcsh-info/items?query=preflabel:photographs&max=1

It is a little sensitive to plural/singular terms. For example the difference between “scrapbooks” and “scrapbook.” Most terms are plural, but there seems to be some distinction I don’t understand between Painting and Paintings.
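Here is a rough Python sketch of building those query strings, including a naive guess at the singular/plural variant to try when the first form misses. It only constructs URLs in the same shape as the example above and makes no network request. The `lookup_url` and `number_variants` helpers are my own illustrative names, and the add-or-drop-an-"s" rule is an assumption that will not handle irregular plurals.

```python
from urllib.parse import urlencode

# Same endpoint as the example URL above.
BASE = "http://api.talis.com/stores/lcsh-info/items"


def lookup_url(term, max_results=1):
    """Build a preflabel query URL for one literal term."""
    params = {"query": "preflabel:" + term.strip(), "max": max_results}
    # safe=":" keeps the colon readable, matching the hand-written URL.
    return BASE + "?" + urlencode(params, safe=":")


def number_variants(term):
    """Return the term plus a naive singular/plural alternative."""
    t = term.strip()
    other = t[:-1] if t.endswith("s") else t + "s"
    return [t, other]


for variant in number_variants("scrapbooks"):
    print(lookup_url(variant))
```

Trying both variants in order at least covers the scrapbook/scrapbooks case without touching the source data.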

In Refine I can pull back the RDF for these terms, but am still working out how I might extract the canonical concept URI for each term. This looks like it will require parsing the RDF to get the right URI out of it. If anyone has a good cookbook for reconciling terms with id.loc.gov URIs, I’d love to see it. (Something using the Refine Reconciliation Service API would be swell.)
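In the meantime, here is a small sketch of what that parsing step might look like outside Refine. It assumes the returned RDF/XML describes each concept as a node with an rdf:about URI and a skos:prefLabel child; I have not verified that the real responses are shaped this way. The sample document, its example.org URIs, and the `concept_uri` helper are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def concept_uri(rdf_xml, term):
    """Return the rdf:about URI of the node whose skos:prefLabel equals term.

    Matching is case-insensitive and ignores surrounding whitespace.
    Returns None when no label matches.
    """
    root = ET.fromstring(rdf_xml)
    wanted = term.strip().lower()
    for node in root.iter():
        about = node.get(f"{{{RDF}}}about")
        if about is None:
            continue
        for label in node.findall(f"{{{SKOS}}}prefLabel"):
            if (label.text or "").strip().lower() == wanted:
                return about
    return None


# Hypothetical response with placeholder URIs, for illustration only.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/subjects/0001">
    <skos:prefLabel xml:lang="en">Photographs</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/subjects/0002">
    <skos:prefLabel xml:lang="en">Scrapbooks</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

print(concept_uri(SAMPLE, "photographs"))
print(concept_uri(SAMPLE, "paintings"))
```

Requiring an exact label match, rather than taking the first result, is a deliberate choice: it avoids silently linking an uncontrolled term to the wrong concept.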

I may give our subject headings a twirl, but I may need some subject cataloger help there.  The published LC authorities include headings like “Cemeteries--Recording” but not localized forms like “Cemeteries--Recording--Illinois.”  Since these are all strings in a dc:subject, some way of parsing the subdivisions is needed.
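One possible way to do that parsing, sketched in Python: split the string on the subdivision separator, then try progressively shorter headings until one matches a published authority. This assumes subdivisions are separated by a double hyphen or by a long dash (however the separator ended up being typed), and that single hyphens belong to the words themselves. The `split_heading` and `candidate_headings` names are made up for the example.

```python
import re

# A double hyphen, en dash, or em dash, with optional surrounding spaces.
_SUBDIVISION = re.compile(r"\s*(?:--|[\u2013\u2014])\s*")


def split_heading(heading):
    """Split a subject string into its main heading and subdivisions."""
    return [part for part in _SUBDIVISION.split(heading.strip()) if part]


def candidate_headings(heading):
    """List the full heading first, then each shorter prefix, to try in order."""
    parts = split_heading(heading)
    return ["--".join(parts[:n]) for n in range(len(parts), 0, -1)]


for candidate in candidate_headings("Cemeteries--Recording--Illinois"):
    print(candidate)
```

The longest-first order means a localized heading falls back to “Cemeteries--Recording” before it falls back to plain “Cemeteries,” so the most specific published form wins.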

Update: After posting this, I started playing with my subject headings and found that the LCSH triples were loaded into Freebase in May (http://www.freebase.com/view/topic/en/loc_subject_headings_full_load). Refine will pick them up if you create a “namespaced” reconciliation service (point it at the Library of Congress Namespace). Now, let’s get those names & other vocabularies loaded!

(1) I’ve tried to replicate this in Google Refine, but on my computer it seems to choke on the complex XML record structure. It’s quite happy with large tabular representations though.
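If neither tool cooperates, the unique terms and frequency counts can also be pulled out with a few lines of Python. This is only a sketch of the general idea, not what SIMILE Gadget does internally. The sample records, the "Indiana (state)" value, and the `term_counts` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def term_counts(xml_text, local_name):
    """Count the distinct text values of every element with this local name."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        # Tags look like '{namespace}local'; compare on the local part only.
        if el.tag.rsplit("}", 1)[-1] == local_name:
            value = (el.text or "").strip()
            if value:
                counts[value] += 1
    return counts


# Made-up records for demonstration only.
SAMPLE = """<records xmlns:dcterms="http://purl.org/dc/terms/">
  <record><dcterms:spatial>Illinois (state)</dcterms:spatial></record>
  <record><dcterms:spatial>Illinois (state) </dcterms:spatial></record>
  <record><dcterms:spatial>Indiana (state)</dcterms:spatial></record>
</records>"""

counts = term_counts(SAMPLE, "spatial")
for term, n in counts.most_common():
    print(f"{term}\t{n}")
```

The tab-separated output is the kind of flat table that Refine handles happily.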

07/4/11

Piloting collection-level LOD for IMLS DCC

On Metadata
Since the #lodlam conference, I haven’t had much chance to play around with my shipyard LOD — the dissertation calls. Plus I’m spending about half my time this summer as part of the team working on the CLIR/DLF/IMLS DCC Beta Sprint for the Digital Public Library of America (DPLA).

What follows is a bit of skunkworkery that I’m doing for self-edification & also to help suggest ways we can make IMLS DCC data more LOD friendly.**  Currently people can browse the site at http://imlsdcc.grainger.illinois.edu/history or as XML via OAI-PMH for collection-level and item-level metadata.   As part of the Collection/Item Metadata Working Group (CIMR),  I helped build an RDF testbed that was oriented towards our research problems.

Using some of the stylesheets developed for CIMR, I’ve generated LOD representations for the currently available collection-level records.  When the rubber hits the road like this, there are lots of design choices you can make – in terms of encodings,  which vocabularies to use, etc., etc.  Here is a sample set of records and the XSLT used to generate them from the OAI-PMH, imlsdcc listRecords format.  Some questions:

  • this looks rather complicated.  Maybe that’s OK, as it seems to represent much of the information currently shared publicly by the project. I’d welcome any suggestions for simplifications or better approaches to representing this as LOD.
  • are there best practices for representing organizations as organizations?  FOAF/vCard seem very oriented towards people (who have associations with an organization).  I also picked up the Organization Ontology from Describing Libraries, Their Collections and Services in RDF.
  • Many of the URIs here are just made up for demonstration purposes.
  • There are lots of organizations we have minimal information for. It would be nice to reconcile our URIs with other published URIs for these institutions. What would be the most authoritative source for that LOD?
  • Many organizations aren’t publishing their own “authorized” graphs for themselves.  Is this something a project like IMLS DCC should consider?   I added a stub description of IMLS DCC to this file to demonstrate the relationships between the project and the aggregated collections.
  • Right now this RDF mostly contains the strings found in the original XML.  I would like to reconcile controlled terms where possible to existing LOD vocabularies (like id.loc.gov,  language terms, formats, etc.).  I think that would make this data more “linked.”
  • In theory the XSLT above should still work with the SIMILE OAI-PMH RDFizer

Thanks if you have a chance to take a look and offer comments on this.  And do let me know if you’d like to see more of this kind of data!

** Disclaimer: this is some work I’m doing on the side, on my own. Neither the rdf nor the XSLT should be considered an “official” release by the project. Any mistakes here are mine.

05/17/11

On the ways (Part II)

"Galen L. Stone" Interior view, ribbing of tug under construction.  Delaware Public Archives

Tonight I decided to go back to my large table of >1,000 ships and continue doing some clean-up.  However, instead of trying to edit the values I had, I gave the Google Refine/Freebase reconciliation service a try.  Boy-howdy, I really should have taken @jonvoss’s advice and done this sooner.  I pretty quickly whipped through my Ship Type column and matched it to the /boats/ship/ship_type vocabulary in Freebase.  As with any classification task, I think some of the subtleties of my data get lost. For the moment I think that’s OK, but if you care about the difference between a sidewheel paddle steamer and a sternwheel paddle steamer, they’ve been lumped together under the same class.

The reconciliation tool made quick work of matching to the companies that make up the majority of the shipyards in my database, which are mostly the late 19th and 20th century yards.  There’s not much of a record about individual vessels from the earlier yards.  In the few cases where there wasn’t a match, I asked Freebase to create a new topic.  (I’ll go back later and see how this populates Freebase itself.)  I also did this for the Owners column, which was able to match a smaller number of organizations and people.  I don’t know whether I’m being a little cavalier about making new topics on Freebase, but this seems like the easiest thing to do. (I should probably be keeping better notes about what new things I’m creating; it would be nice to get some sort of report/e-mail with all those things listed.) The latter part has been slow going due to a bug in Refine that takes you back to the first row after reconciling a row that may be deep in your data. It helps to select the (none) facet, which removes rows for which judgements have been assigned, and use additional facets to narrow things down.  While I’ve cut this list of owners down significantly, I’m still looking at a long tail of about 400 unmatched entities. (Many are individuals whose first names are abbreviated; with a little googling I can find many of them and expand the names.)

A resource that is proving useful for double-checking my work is Shipbuilding History, which includes lists of vessels from the Wilmington yards. Tim seems to have collected some information I’m missing, so I’m thinking about the best way to reconcile his information with mine.  There are other lists of vessels that are currently not linked data, but are large tables on the web.  Perhaps a screenscraper might make quick work of turning those into linked data graphs that can be merged with my graphs.
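A bare-bones version of that screenscraper idea might look like the Python sketch below, using only the standard library. It reads the rows of an HTML table and emits one (subject, property, value) triple per cell. I have not run this against any real site. The sample table, its vessel names, the example.org base URI, and the use of column headers as property names are all placeholders for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_triples(html, base_uri):
    """Treat the first row as headers and every later row as one subject."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body, start=1):
        subject = f"{base_uri}{i}"
        for name, value in zip(header, row):
            if value:  # skip empty cells
                triples.append((subject, name, value))
    return triples


# Made-up table for demonstration only.
SAMPLE = """<table>
  <tr><th>Name</th><th>Type</th><th>Owner</th></tr>
  <tr><td>Example Tug</td><td>Tug</td><td></td></tr>
  <tr><td>Example Steamer</td><td>Steamer</td><td>Example Line</td></tr>
</table>"""

triples = table_to_triples(SAMPLE, "http://example.org/vessel/")
for triple in triples:
    print(triple)
```

The row-number subjects are throwaway identifiers; reconciling them against the vessels already in my own table would still be a separate step.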

But I think I’ve hit my limit for tonight. (I’ve been grading all day, so more than 14 hrs of staring at a screen is probably enough – time to hit my bunk).

On the Ways (Part I)

05/11/11

Birdseye of Wilmington, DE

Just a fun aside, from the BigMaps blog.  Bummer, I don’t see a way to embed the zoomable version here.  Pointing to these kinds of things from my RDF is one of the longer-term goals of my exploration.  Take a ride along the waterfront to see the various ships under construction. (I wonder if I can infer from my data which ones those might be?)