Saturday, February 6, 2010

How to: get data from SPARQL endpoints and save data in a spreadsheet

An updated version of this blog post is now available as a chapter in the TopBraid Application Development Quickstart Guide (pdf).

This entry covers two useful SPARQLMotion techniques: how to retrieve and use data from SPARQL endpoints, and how to save data so that you can open it with a spreadsheet program such as Excel or OpenOffice Calc. While this may look like a lot of material to cover, both techniques are so simple that this is the shortest How to blog entry yet.

Retrieving data from a SPARQL endpoint

A SPARQL endpoint is a server that accepts SPARQL queries and returns the result using the SPARQL Protocol for RDF, a W3C standard XML-based format that is much simpler than RDF/XML. This endpoint can be part of your intranet, providing access for others there to data behind your firewall, but more and more SPARQL endpoints are available on the public internet to provide you with data to use in your applications, and excitement about the new possibilities provided by this data is a big driver of the Linked Data movement. (In a future entry, we'll see how you can create your own SPARQL endpoint with TopBraid Live.)

The ability to retrieve SPARQL endpoint data from within a SPARQLMotion script is not a feature specific to SPARQLMotion, but actually an extension to the Jena framework that will hopefully become part of SPARQL 1.1: the SERVICE keyword. This lets you name a data source right in your query, and the source can be a remote SPARQL endpoint.

To try it out, enter the following query into TopBraid Composer's SPARQL view and execute it. Instead of pulling data from the currently open data graph, it gets it from the SPARQL endpoint specified by the URI http://dbpedia.org/sparql, which is the endpoint for the huge DBpedia collection of triples harvested from the structured infobox data in Wikipedia pages:


SELECT ?subsidiaryName ?description
WHERE {
{
SERVICE <http://dbpedia.org/sparql> {
<http://dbpedia.org/resource/IBM> <http://dbpedia.org/ontology/Company/subsidiary> ?ibmSub .
?ibmSub <http://dbpedia.org/property/abstract> ?description .
?ibmSub rdfs:label ?subsidiaryName .
FILTER (lang(?description) = "en")
FILTER (lang(?subsidiaryName) = "en")
}
}
}

This query retrieves names and descriptions of IBM subsidiaries from DBpedia and only passes along the English language ("en") ones.

You can use this technique anywhere that you can use a SPARQL SELECT query in your TopBraid development—including in the sml:selectQuery property of a SPARQLMotion Create Spreadsheet module, as we'll see.

Saving data in a spreadsheet

If you search the TopBraid Composer online help for the word "spreadsheet," you'll see that TopBraid offers a variety of ways to read and write spreadsheet data. The simplest way to create a spreadsheet from a SPARQLMotion script is the Create Spreadsheet module. After you enter a SELECT query in its sml:selectQuery property, it sends a tab-delimited spreadsheet to the next SPARQLMotion module with a column for each variable in the select query. (Any spreadsheet program can read a tab-delimited file. OpenOffice Calc doesn't offer "tab-delimited file" as a choice of file format to open, but if you tell it to open a "Text CSV" file and point it at a tab-delimited file, it will find that tab characters are the delimiters and import it properly.) For example, with the SELECT query shown above, this module will create a two-column spreadsheet with subsidiaryName values in the first column and description values in the second.

Let's create a SPARQLMotion script that does this. Create a new SPARQLMotion file as described at How to: create and run a SPARQLMotion script, calling the file endpointtsv. Next, select Create SPARQLMotion script from the Scripts menu, and when it asks you to select an initial module type, select sml:CreateSpreadsheet from the sml:TextProcessing modules choices under sml:ProcessingModules. Name the module GetIBMSubs and click OK.

When the icon for your new module appears on the SPARQLMotion workspace, double-click it to configure it and set sml:selectQuery to the query shown above. The sml:CreateSpreadsheet module creates a spreadsheet in memory, not on disk, to provide you with the flexibility of doing other things with its output before you send it to a specific output destination such as a disk file, so set the sm:outputVariable property to the name spreadsheetData so that subsequent modules can reference it. You're finished configuring this module, so click the Close button.

Our SPARQLMotion script will only have one subsequent module: the one that saves the spreadsheet to a disk file. Drag an Export to text file module from the Export to local section of the palette onto the workspace and name it SaveSpreadsheet. Double-click its icon to configure it and set the following properties:

  • Set sml:targetFilePath to ibmsubs.txt.

  • The sml:text property is where you assign the text to store in the file being created; set it to {?spreadsheetData} so that it uses the value of the variable being set in the GetIBMSubs module.

  • Set sml:replace to true so that running the script more than once replaces the existing disk file instead of appending the data on as new lines. (Of course, setting sml:replace to false can be very handy in other scripts that you write.)

Click the Close button, and then click on Add connection on the palette. Connect your two icons, and you're finished with your simple little script:

Save it, select the SaveSpreadsheet icon, and click the debug icon to run your script. You should see an ibmsubs.txt file appear in the same directory as the script itself, because you didn't specify a path name when you configured the sml:targetFilePath property. (You're welcome to include a pathname with the file in that property.)

Open up this new file in a spreadsheet program, and then you can use any of the program's formatting features that you like before saving the spreadsheet using the program's native format. Here are the first few rows of the spreadsheet in Microsoft Excel:

If you combine these two techniques with others covered in this How-to series, you could have all of this triggered by a web service call, or the script could combine the data retrieved from a SPARQL endpoint with other data from other sources before saving the spreadsheet, or it might output the data in another of the destinations available in the "Export to" sections of the SPARQLMotion workspace palette. Take another look at those palette choices, or better yet, review the help page for them; I'm sure you'll get some great ideas.

0 comments:

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.