EPrints importers made publicly available

We are really pleased to say that RJ Broker now has importers available for both EPrints 3.2 and 3.3.

The EPrints 3.2 code is essentially unchanged; however, it is now publicly available on GitHub (https://github.com/edina/RJ_Broker_Importer_3.2) and has been proven to work in a number of installations.

The EPrints 3.3 plugin is also available on GitHub (https://github.com/edina/RJ_Broker_epm), and it can additionally be installed directly from the central EPrints Bazaar (http://bazaar.eprints.org/332/). This code has been tested in EPrints 3.3.5 and 3.3.12 (and I believe the patch required to correct the returned atom:id will be applied in EPrints 3.3.13).

The two code-bases are almost identical, and we would be delighted to receive feedback/improvements on either.

Setting up redundancy in a live service

At EDINA, we strive for resilience in the services we run. We may not be competing with the massive international server-farms that the likes of Google, Facebook, Amazon, eBay, etc run… however we try to have systems that cope when things become unavailable due to network outages, power-cuts, or even hardware failures: the services continue and the user may not even notice!

For read-only services, such as ORI, this is relatively easy to do: the datasets are apparently static (any updates are done in the background, without any interaction from the user) so two completely separate services can be run in two completely separate data centres, and the systems can switch between them without anyone noticing – however Repository Junction Broker presented a whole different challenge.

The basic premise is that there are two instances of the service running – one at data centre A, the other at data centre B – and the user should be able to use either, and not spot any difference.

This first part is easy: the user connects to the public name for the service. The public name is answered by something called a load-balancer which directs the user to whichever service it perceives as being least loaded.

In a read-only system, we can stop here: we can have two entirely disconnected systems, let them respond as they need to, and use basic admin commands to update the services as required.

For RJ Broker, this is not the end of the story: RJB has to take in data from the user, and store that information in a database. Again, a relatively simple solution is available: run the database through the load-balancer, so both services are talking to the same database, and use background admin tools to copy database A to database B in the other data-centre.
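
The copying itself can be as mundane as a scheduled dump-and-restore. Purely as an illustration (the database engine, hostnames and credentials below are assumptions for the sketch, not what EDINA actually runs):

 # nightly cron job: copy database A to its counterpart in data centre B
 # (MySQL is assumed; the hostnames and credentials are placeholders)
 mysqldump -h db-a.example -u rjb_admin -p'secret' rjb_live \
   | mysql -h db-b.example -u rjb_admin -p'secret' rjb_live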

Again, RJB needs more: as well as storing information in the database, the service also writes files to disk. Here at EDINA, we make use of Storage Area Network (SAN) systems – which means we mount a chunk of disk space into the file-system, across the network.

The initial solution was to get both services to mount disk space from SAN A (at data centre A) into /home/user/san_A and then get the application to store the files into this branch of the file-system… meaning the files are instantly visible to both services.

This, however, still has a “single point of failure”: SAN A. What is needed is to replicate the data in SAN A to SAN B, and make SAN B easily available to both installations.

The first part of this is simple enough: mount disk space from SAN B (in data centre B) into /home/user/san_B. We then use an intermediate symbolic link (/home/user/filestore) to point to /home/user/san_A and get the application to store data in /home/user/filestore/... Now, if there is an outage at data centre A, we simply need to swap the symbolic link /home/user/filestore to point to /home/user/san_B and the application is none the wiser.
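
In shell terms, the indirection and the failover swap amount to something like this (a minimal sketch of the approach just described, using the paths above):

 # normal running: the application writes via the indirection link to SAN A
 ln -s /home/user/san_A /home/user/filestore

 # failover: force-replace the link so it now points at SAN B
 ln -sfn /home/user/san_B /home/user/filestore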

The only thing that still needs to happen is the magic to ensure that all the data written to SAN A is duplicated in SAN B (and database A is duplicated into database B).
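
How that duplication is done is down to the storage and database layers; as a very rough approximation of the file side of it (not how EDINA's SANs actually replicate), a scheduled one-way copy would look like:

 # periodic copy of everything written to SAN A over to SAN B
 rsync -a /home/user/san_A/ /home/user/san_B/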

Embargoes in real metadata, take 2

Following on from the earlier discussion, we have ruled out the first option (where we add an attribute to METS):

<div ID="sword-mets-div-2" oarj_embargo="2013-05-29">
  <fptr FILEID="eprint-191-document-581-0"/>
</div>

as the METS schema doesn’t allow additional attributes to be added (and the investigation into writing a validating schema with additional attributes was fun in its own right) – so this leaves us with the xmlData-within-the-amdSec solution.

To recap, the amdSec will read something like:

<amdSec ID="sword-mets-adm-1" LABEL="administrative" TYPE="LOGICAL">
  <rightsMD ID="sword-mets-amdRights-1">
    <mdWrap MDTYPE="OTHER" OTHERMDTYPE="RJ-BROKER">
      <xmlData>
        <epdcx:descriptionSet xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"
                              xsi:schemaLocation="http://purl.org/eprint/epdcx/2006-11-16/
                                                  http://purl.org/eprint/epdcx/xsd/2006-11-16/epdcx.xsd ">
          <epdcx:description epdcx:resourceId="sword-mets-div-3" 
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
          <epdcx:description epdcx:resourceId="sword-mets-div-2"
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
        </epdcx:descriptionSet>
      </xmlData>
    </mdWrap>
  </rightsMD>
</amdSec>
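
For a receiving importer, those embargo dates are easy to pull back out of the rights section. As a quick command-line illustration (not part of the Broker code; “mets.xml” is just an assumed filename for the manifest above):

 # list the valueString elements that carry the embargo dates
 xmllint --xpath "//*[local-name()='valueString']" mets.xml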

One of the questions I have been asked a few times is “why don’t you put the actual file URL with the embargo date”, and I refer you to the explanation in the original article:

  • A document may actually be composed of multiple files (consider a web page – the .html file is the primary file, however there are additional image files, stylesheet files, and possibly various other files that combine to present the whole document)

In other words, whilst 99% of cases will be a single file for a single document, it’s not always that simple, and I don’t believe the metadata should lead you into a false understanding of what is there – that way, things don’t break when the simple case doesn’t apply.

SWORD 1.3 vs SWORD 2

What’s the difference, and how do they compare?

In summary

SWORD 1.3 is a one-off package deposit system: the record is wrapped up in some agreed format, and dropped into the repository. SWORD 1.3 uses an HTTP header to define what that package format is, and individual repositories use that header to determine how to unpack the record. Every deposit is a new record.

SWORD 2.0 is a CRUD-based (Create, Read, Update, Delete) system, where the emphasis is on being able to manage existing records, as well as creating new ones. SWORD 2 uses the URL to identify the record being manipulated, and the mime-type of the object being presented to know what to do with it.

In detail (EPrints-specific)

This is, perforce, EPrints-specific, as I am an EPrints user with no experience coding in DSpace/Fedora/etc.

SWORD 1.3

With a SWORD 1.3 system, one defines a mapping between the X-Packaging header URI and the importer package to handle it:

  $c->{sword}->{supported_packages}->{"http://opendepot.org/broker/1.0"} =
  {
    name => "Open Access Repository Junction Broker",
    plugin => "Sword::Import::Broker_OARJ",
    qvalue => "0.6"
  };

The importer routine then has some internal logic to ensure it only tries to manage records of the right type (XML files, Word Documents, Spreadsheets, Zip files, etc).

In the case of compressed files, it is customary to also indicate the routine to un-compress the file. For example, the same importer could manage .zip, .tar, and .tgz files – which are all variations on a compressed collection of files – covering the following collection of mime-types:

application/x-gtar
application/x-tar
application/x-gtar-compressed
application/zip

Therefore our importer would have code like this:

   our %SUPPORTED_MIME_TYPES = ( "application/zip"               => 1,
                                 "application/tar"               => 1,
                                 "application/x-gtar"            => 1,
                                 "application/x-gtar-compressed" => 1, );

   our %UNPACK_MIME_TYPES = ( "application/zip"               => "Sword::Unpack::MyNewZip",
                              "application/tar"               => "Sword::Unpack::MyNewTar",
                              "application/x-gtar"            => "Sword::Unpack::MyNewTar",
                              "application/x-gtar-compressed" => "Sword::Unpack::MyNewTar", );

So, a basic SWORD 1.3 deposit is a simple POST request to a defined URL, with a set of headers to manage the deposit, and the record as the body of the request:

  curl -X POST \
       -i \
       -u username:password \
       --data-binary "@myFile.zip" \
       -H 'X-Packaging: http://opendepot.org/broker/1.0' \
       -H 'Content-Type: application/zip' \
       http://my.repo.url/sword-path/collection

This will deposit the binary file myFile.zip into the collection point in the repository, using the importer identified by the packaging URI http://opendepot.org/broker/1.0.

SWORD 2.0

This is much vaguer, as I’ve not really got a good working example of a SWORD 2 sequence available (the Broker doesn’t do CRUD).

With SWORD 2, the idea is to be able to update existing records, piecemeal:

  • Create a blank record
  • Add some basic metadata (title, authors, etc)
  • Add the rough-draft file
  • Add the post-review article
  • Delete the rough-draft file
  • Add the abstract
  • Add the publication metadata (journal, issue, pages, etc)

With SWORD 2, which routine is used to process the request is determined by the mime-type given in the headers.

Within each importer, there is a new function:

sub new {
  my ( $class, %params ) = @_;
  my $self = $class->SUPER::new(%params);

  $self->{name} = "Import RJBroker SWORD (2.0) deposits";
  $self->{visible}   = "all";
  $self->{advertise} = 1;
  $self->{produce}   = [qw( list/eprint dataobj/eprint )];
  $self->{accept}    = [qw( application/vnd.broker.xml )];
  $self->{actions}   = [qw( unpack )];

  return $self;
}

So, to create a new record, one posts a file with no record id:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/content

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use it to create a new record. The server response will include URLs for updating the record.

To add a file to a known record:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyOtherFile.pdf" \
       http://my.repo.url/id/eprints/123

This will use the default application/octet-stream importer, and add the file MyOtherFile.pdf to the record with the id 123.

To add more metadata:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use that code to add the metadata to the record with the id 123.

Note: there is a difference between PUT and POST:

  • POST adds content to any existing data. Where a field already exists, the action is determined by the importer.
  • PUT deletes the existing data and adds the new information – it replaces the whole record (see the example below).
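
For completeness, a PUT mirroring the POST examples above would look like this (a hypothetical call; as noted, the Broker itself only deposits and doesn't do CRUD):

  curl -X PUT -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123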

Summary

  • SWORD 1.3 uses the X-Packaging header to determine which importer routine to use, and the importer uses the mime-type to confirm suitability.
  • SWORD 2 uses the mime-type to determine which importer routine to use.
  • The URLs for making deposits are different.

Embargoes in real metadata

One very important function the RJ Broker needs to support is that of embargoes: publishers and other data suppliers are going to be much more willing to be involved in a dissemination programme if they believe their property is not being abused. To be blunt about it: most journals make money from selling copies – if they’re giving articles away for free, who would buy them?

So – the RJ Broker needs to ensure that embargo periods are defined, and clearly defined, in the data that’s passed on… and that’s why recipients of data from the RJ Broker need to sign an agreement: to assert they will actually honour any embargo periods for records they receive.

We know from previous conversations that one cannot embargo metadata, so embargo only applies to the binary objects attached to the metadata record.

The first question is “Is there a blanket embargo for all files, or can different files have different embargoes?”, and the second question is “Is there a difference between ‘document’ and ‘file’?”

Actually, thinking about it, a blanket embargo can be mimicked by having the same embargo for all files; however, variable embargoes cannot be (sensibly) implemented using a single field. The distinction between “files” and “documents” comes from the EPrints platform, which has the concept that a document may actually be composed of multiple files (consider a web page: the .html file is the primary file, however there are additional image files, stylesheet files, and other files that combine to present the whole document).

The third question is how to encode this embargo information.

The Broker has already defined its basic metadata file as being a METS file, with the record metadata encoded in epdcx (see SWAP and epdcx).

Looking round the net, I found several formal structures for defining administrative metadata, archival metadata, preservation metadata, and so on… but none seemed to actually define a nice, simple embargo date.

In the end, I have loaded up two options, and we’ll investigate which one makes more sense as things get used.

The easier option, but the one that breaks the METS standard, is to add an attribute to each structure element in the METS file:

<structMap ID="sword-mets-struct-1" LABEL="structure" TYPE="LOGICAL">
  <div ID="sword-mets-div-1" DMDID="sword-mets-dmd-eprint-191" TYPE="SWORD Object">
    <div ID="sword-mets-div-2" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-581-0"/>
    </div>
    <div ID="sword-mets-div-3" oarj_embargo="2013-05-29">
      <fptr FILEID="eprint-191-document-582-0"/>
    </div>
  </div>
</structMap>

This has the beauty that the embargo metadata is directly linked to the document it belongs to, and that information is immediately available to any import routines.

The second is to write a more convoluted, but formally correct, structure within the METS Administrative section:

<amdSec ID="sword-mets-adm-1" LABEL="administrative" TYPE="LOGICAL">
  <rightsMD ID="sword-mets-amdRights-1">
    <mdWrap MDTYPE="OTHER" OTHERMDTYPE="RJ-BROKER">
      <xmlData>
        <epdcx:descriptionSet xmlns:epdcx="http://purl.org/eprint/epdcx/2006-11-16/"
                              xsi:schemaLocation="http://purl.org/eprint/epdcx/2006-11-16/
                                                  http://purl.org/eprint/epdcx/xsd/2006-11-16/epdcx.xsd ">
          <epdcx:description epdcx:resourceId="sword-mets-div-3" 
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
          <epdcx:description epdcx:resourceId="sword-mets-div-2"
                             epdcx:resourceURI="http://devel.edina.ac.uk:1203/191/">
            <epdcx:statement epdcx:propertyURI="http://purl.org/dc/terms/available"
                             epdcx:valueRef="http://purl.org/eprint/accessRights/ClosedAccess">
              <epdcx:valueString epdcx:sesURI="http://purl.org/dc/terms/W3CDTF">2013-05-29</epdcx:valueString>
            </epdcx:statement>
          </epdcx:description>
        </epdcx:descriptionSet>
      </xmlData>
    </mdWrap>
  </rightsMD>
</amdSec>

As you can see, this follows the rules for the epdcx structure. This was a deliberate choice, as epdcx is already used for the primary metadata, so the importers will already have routines for following the structure.

What will be interesting is which is more usable when it comes to writing importers.

“Will Triplestore replace Relational Databases?”

It is not possible to give a definitive answer; however, it is worth looking at this technology, which has been causing a stir in the informatics field.

Basically a triplestore is a purpose-built database for the storage and retrieval of triples (Jack Rusher, Semantic Web Advanced Development for Europe). Rather than highlighting the main features of a triplestore (by way of making a comparison to a traditional relational database), we will give a brief overview of the how and why of choosing, installing, and maintaining a triplestore, giving a practical example not only of the installation phase but also of the graphical interface customization and some security policies that should be considered for the SPARQL endpoint server.

Choosing the application

The first task, obviously, is choosing the triplestore application.

Following Richard Wallis’ advice (see reference), 4Store was considered a good tool, useful for our needs. There are a number of reasons why we like this application: firstly, as Richard says, it is an open source project and it “comes from an environment where it was the base platform for a successful commercial business, so it should work”. In addition, as their website suggests, 4store’s main strengths are its performance, scalability and stability.

It may not provide many features beyond RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist. We did investigate other products (such as Virtuoso) but none were as simple and efficient as 4Store.

Hardware platform

At EDINA, we tend to manage our services in hosted systems (similar to the concept that many web-hosting companies use.)

After considering the application framework for a SPARQL service, and various options of how triplestores could be used within EDINA, we decided to create an independent host for the 4Store application. This would allow us to both keep the application independent of other services and allow us to evaluate the performance of the triplestore.

It was configured with the following features:

  • 64bit CPU
  • 4 GB of RAM (with the possibility to increase this amount if needed)
  • Linux (64-bit RedHat Enterprise Level 6)

Application installation

At EDINA, we try to install services as a local (non-root) user. This allows us the option of multiple independent services within one host, and reduces the chance that an “exploit” (or cyber break-in) gains significant access to the host’s operating system.

Although some libraries were installed at system level, almost everything was installed at user level. Overall, the 4Store installation was quick and easy to configure: installing the software as a normal user required the installation paths to be specified (i.e. --prefix=~~~), however there were no significant challenges. We did fall foul when installing the raptor and rasqal libraries (the first provides a set of parsers that generate triples, while the second handles query language syntaxes [such as SPARQL]): there are fundamental differences between v1 & v2 – and we required v2.
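
As a rough sketch of that non-root install pattern (the prefix path is illustrative, and the exact configure options depend on the release being built):

 # build and install into the service user's home directory, no root required
 ./configure --prefix=$HOME/4store
 make
 make install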

Configuration and loading data

Once the installation is finished, 4Store is ready to store data.

  1. The first operation is to set up a dataset, which (for this example) we will call “ori”:
    $ 4s-backend-setup ori
  2. Then we start the database with the following command:
    $ 4s-backend ori
  3. Next we need to load the triples from the data file. This step could take a while, depending on the system and the amount of data.
    $ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file.ttl

    • This command line includes some options useful for the storing process. 4store’s import command produces no output by default unless something goes wrong (ref: 4Store website).
    • Since we would like more verbose feedback on how the import is progressing, or just some reassurance that it is doing something at all, we add the option “-v”.
    • The option “-m” (or “--model”) defines the model URI: it is, effectively, a namespace. Every time we import data, 4Store defines a model that maps to a SPARQL graph for the imported data. By default the name of the imported model (SPARQL graph) in 4store will be a URI derived from the import filename, but we can use the --model flag to override this with any other URI.

The model definition is important when looking to do data-replacement:
$ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file2.ttl

By specifying the same model, we replace all the data previously stored under that model in the “ori” dataset with the information contained in the new file.

Having imported the data, the server is ready to be queried, however this is only via local (command-line) clients. To be useful, we need to start an HTTP server to access the data store.

4Store includes a simple SPARQL HTTP protocol server, which can answer SPARQL queries using the standard SPARQL HTTP query protocol among other features.
$ 4s-httpd -p 1234 ori
will start a server on port 1234 which responds to queries for the 4Store database named “ori”.

If you have several 4store databases they will need to run on separate ports.
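
For example, a second (purely hypothetical) dataset would simply get its own server on its own port:

 $ 4s-httpd -p 1234 ori
 $ 4s-httpd -p 1235 other_dataset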

HTTP Interface for the SPARQL server

Once the server is running we can see an overview page from a web browser at http://my.host.com:1234/status/
There is a simple HTML query interface at http://my.host.com:1234/test/ and a machine-to-machine (m2m) SPARQL endpoint at http://my.host.com:1234/sparql/

These last two links provide the means to execute queries to retrieve information present in the database. Note that neither interface allows INSERT, DELETE, or LOAD functions, or any other function which could modify the data in the database.

You can send SPARQL 1.1 Update requests to http://my.host.com:1234/update/, and you can add or remove RDF graphs using a RESTful interface at the URI http://my.host.com:1234/data/
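
As a quick smoke test of the read-only endpoint (the query and the Accept header are just illustrative; any SPARQL SELECT will do):

 curl -H 'Accept: application/sparql-results+json' \
      --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10' \
      http://my.host.com:1234/sparql/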

GUI Interface

The default test interface, whilst usable, is not particularly welcoming.

We wanted to provide a more powerful, and visually more appealing, tool that would allow users to query the SPARQL endpoint without affecting 4Store’s performance. From the Developer Days promoted by JISC and Dev8D, we were aware of Dave Challis’ SPARQLfront project. SPARQLfront is a PHP- and JavaScript-based frontend to RDF SPARQL endpoints. It uses a modified version of ARC2 (a nice PHP RDF library) for most of the functionality, Skeleton to provide the basic HTML and stylesheets, and CodeMirror to provide syntax highlighting for SPARQL.

We installed, configured and customized SPARQLfront under the default system Apache2/PHP server: http://my.host.com/endpoint/

The main features of this tool are the syntax highlighting, which helps during query composition, and the ability to select different output formats.

Security

EDINA is very conscious of security issues relating to services (as noted, we run services as non-root where possible).

It is desirable to block access to anything that allows modification of the 4Store dataset; however, the 4Store server itself provides no Access Control Lists, and is therefore unable to block update and data connections directly. The simplest way to block these connections is to use a reverse proxy in the Apache2 server to pass on only the calls we do want to allow to the 4Store server, and to completely block direct access to the 4Store server using a firewall.

Thus, we added a proxy-pass configuration to the Apache server:

 #--Reverse Proxy--
 ProxyPass /sparql http://localhost:1234/sparql
 ProxyPassReverse /sparql http://localhost:1234/sparql
 ProxyPass /status http://localhost:1234/status
 ProxyPassReverse /status http://localhost:1234/status
 ProxyPass /test http://localhost:1234/test
 ProxyPassReverse /test http://localhost:1234/test


Note: do NOT enable ProxyRequests or ProxyVia, as this opens your server as a “forwarding” proxy to all-and-sundry, using your host to access anything on the web! (see the Apache documentation)

We then use the firewall to ensure that all ports we are not using are blocked, whilst allowing localhost to connect to port 1234:

 # allow localhost to do everything
 iptables -A INPUT -i lo -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
 # allow connections to port 80 (http) & port 22 (ssh)
 iptables -A INPUT -p tcp -m multiport -d 12.23.34.45 --destination-ports 22,80 -m state --state NEW -j ACCEPT


Your existing firewall may already have a more complex configuration, or may be configured using a GUI. Whichever way, you need to allow localhost to connect to port 1234, and block everything else from most things (especially port 1234!).
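
With the rules above in place, a final catch-all rule (assuming the chain's default policy is not already DROP, and using the same illustrative address) would be something like:

 # reject anything not explicitly allowed above (including external attempts on 1234)
 iptables -A INPUT -p tcp -d 12.23.34.45 -j REJECT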


Cesare

Catalyst is awesome!

As part of the [potential?] move into UK RepositoryNet+, I am moving the discovery side of OA-RJ into a whole new framework… and using Catalyst.

It is AWESOME!

It’s faster by fairly serious amounts compared to straight CGI scripts, and the whole way of interacting with the database is just so clean it’s brilliant. I’m suffering a bit from the “old dog, new tricks” syndrome…. but man – it’s amazing!