New blog for the Jisc Publications Router

The latest phase of the projects documented in this blog has moved to a new blog.


Our new blog will be used to outline the developments and benefits of the Jisc Publications Router service. It begins with an introductory post that includes links to the service page and information on interacting with the Router.

The Publications Router is a free-to-use, standalone middleware tool that automates the delivery of research publications from data suppliers to institutional repositories. The Router extracts authors’ affiliations from the metadata provided to determine appropriate target repositories before transferring publications to repositories registered with the service. The Router addresses the duplication of effort involved in recording a single research output, a problem that grows in the increasingly collaborative world of research publishing. It is intended to minimise effort on the part of potential depositors while maximising the distribution and exposure of research outputs.

The Router has its origins in the Open Access Repository Junction project. A brief recap of the various stages of evolution can be found in a post on the history of the project.

If you wish to find out more about the service the Router offers, please see the about page.

History of the Router – it started on the back of an envelope

Envelope concept image

The Jisc Publications Router has its origins in the preceding Open Access Repository Junction (OA-RJ) project which itself continued on from the work carried out on the Depot.

The Depot bridged a gap for researchers before a specific local institutional repository was available to them. It aimed to make more content available in repositories and to make it easier for researchers to have research results exposed to a wider readership under open access. The Depot is still available and providing researchers with a repository at http://opendepot.org/

One of the objectives of the Depot was to devise an unmediated reception and referral service called the Repository Junction. The Junction collected information in order to redirect users to existing institutional repository services near them. Institutional affiliation of potential depositors was deduced through an IP lookup and external directories were queried to find an appropriate location for deposit. This facilitated the redirection of a user to the most appropriate repository. If none of the suggested repositories were suitable for the researcher they could still deposit in the Depot.

OA-RJ started as an investigation to improve the simplistic approach of the Repository Junction and provide a service within the Jisc information environment. After consultation with other technologists in the repository community it became clear that there were two workflows that should be addressed: first, that the deposited object could be data-mined for additional information on the author affiliation and, second, that the object itself could be deposited into repositories. This second workflow could solve the many-to-many problem of research publications with multiple authors from multiple institutions whose publications need to be deposited in multiple locations. The aim was to minimise effort on the part of potential depositors while maximising the distribution and exposure of research outputs.

The foundation for OA-RJ can be seen in the ‘back of an envelope’ diagram (above), born from a meeting between Theo Andrew, Jim Downing, Richard Jones, Ben O’Steen and Ian Stuart. With smoother edges, the above diagram looks like this:

Tidy concept image

OA-RJ then split into discovery and delivery, providing services for each. The Repository Junction would discover repository targets while a standalone broker would enable content providers to make deposits with multiple recipients. OA-RJ became two distinct projects as part of the UK Repository Net+ (RepNet) infrastructure project: Organisation and Repository Identification (ORI) handling the discovery and the Repository Junction Broker (RJB) dealing with delivery. ORI is now an EDINA microservice providing APIs to access authoritative data on organisations and repositories. The latest phase of RJB is the Jisc Publications Router.

The Router is a service based on the RJB application. The Publications Router aims to deliver open access content in a format that can be understood by institutional repositories. Having evolved from the projects outlined above, the Router automates the delivery of research publications from multiple suppliers (publishers, subject repositories) to multiple institutional repositories. The Router parses the metadata to determine the appropriate target repositories based on the authors responsible for the output and transfers the publication to the institutional repositories registered with the service. It is intended to minimise effort on the part of potential depositors in order to maximise the distribution and exposure of research outputs.

The envelope sketch is now a fully realised service.

You can view blog posts from the previous incarnations of the Router at https://oarepojunction.wordpress.com/, and we will highlight some of these older posts in the future.

If you have any queries about the Publications Router please contact the Edina Helpdesk or email Edina@ed.ac.uk.

Setting up redundancy in a live service

At EDINA, we strive for resilience in the services we run. We may not be competing with the massive international server farms run by the likes of Google, Facebook, Amazon and eBay… however, we try to have systems that cope when things become unavailable due to network outages, power cuts, or even hardware failures: the services continue and the user may not even notice!

For read-only services, such as ORI, this is relatively easy to do: the datasets are effectively static (any updates are done in the background, without any interaction from the user), so two completely separate services can be run in two completely separate data centres, and the systems can switch between them without anyone noticing – however, Repository Junction Broker presented a whole different challenge.

The basic premise is that there are two instances of the service running – one at data centre A, the other at data centre B – and the user should be able to use either, and not spot any difference.

This first part is easy: the user connects to the public name for the service. The public name is answered by something called a load-balancer which directs the user to whichever service it perceives as being least loaded.

In a read-only system, we can stop here: we can have two entirely disconnected systems, let them respond as they need to, and use basic admin commands to update the services as required.

For RJ Broker, this is not the end of the story: RJB has to take in data from the user, and store that information in a database. Again, a relatively simple solution is available: run the database through the load-balancer, so both services are talking to the same database, and use background admin tools to copy database A to database B in the other data-centre.
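The post does not name the database engine or the admin tools involved, so purely as an illustrative sketch (the engine, host names, user and database name below are assumptions, not EDINA’s actual setup), a background job could periodically pull a copy of the live database across to the standby data centre:

 # Illustrative only: dump the live database at data centre A
 # and load it into the copy at data centre B.
 pg_dump -h db-a.example.org -U rjb rjb | psql -h db-b.example.org -U rjb rjb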

RJB load-balancer diagram

Again, RJB needs more: as well as storing information in the database, the service also writes files to disk. Here at EDINA, we make use of Storage Area Network (SAN) systems – which means we mount a chunk of disk space into the file-system, across the network.

The initial solution was to get both services to mount disk space from SAN A (at data centre A) into /home/user/san_A and then get the application to store the files into this branch of the file-system… meaning the files are instantly visible to both services.

This, however, still has a “single point of failure”: SAN A. What is needed is to replicate the data in SAN A to SAN B, and make SAN B easily available to both installations.

The first part of this is simple enough: mount disk space from SAN B (in data centre B) into /home/user/san_B. We then use an intermediate symbolic link (/home/user/filestore) to point to /home/user/san_A and get the application to store data in /home/user/filestore/... Now, if there is an outage at data centre A, we simply need to swap the symbolic link /home/user/filestore to /home/user/san_B and the application is none the wiser.
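A minimal sketch of that symlink swap, using the paths described above:

 # Normal operation: the application writes to /home/user/filestore, which points at SAN A
 ln -s /home/user/san_A /home/user/filestore
 # Outage at data centre A: repoint the link at SAN B
 # (-n treats the existing link as a plain file rather than following it; -f replaces it)
 ln -sfn /home/user/san_B /home/user/filestore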

The only thing that needs to happen is the magic to ensure that all the data written to SAN A is duplicated in SAN B (and database A is duplicated into database B).
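The post leaves that duplication as “magic”; one hedged illustration for the file-store side is a periodic rsync between the two mounted trees (the schedule and exact flags would depend on the real setup):

 # Mirror the SAN A tree onto SAN B, removing files deleted at the source,
 # so the two copies stay identical (illustrative, not necessarily what EDINA runs)
 rsync -a --delete /home/user/san_A/ /home/user/san_B/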

RSP Webinar on RJ Broker: Automating Delivery of Research Output to Repositories

On Wednesday 29th May 2013, Muriel Mewissen presented a Webinar on the Repository Junction Broker (RJ Broker) for the Repository Support Project (RSP).

This presentation discusses:

  • the need for a broker to automate the delivery of research output to Institutional Repositories
  • the development of the middleware tool for RJ Broker
  • the data deposit trials involving a publisher (Nature Publishing Group) and a subject repository (Europe PubMed Central) which have recently taken place
  • what the future holds for the RJ Broker.

A recording of the webinar and the presentation slides are available on the RSP website.

“Will Triplestore replace Relational Databases?”

It is not possible to give a definitive answer; however, it is worth taking a look at this technology, which has been causing a stir in the informatics field.

Basically, a triplestore is a purpose-built database for the storage and retrieval of triples (Jack Rusher, Semantic Web Advanced Development for Europe). Rather than highlighting the main features of a triplestore (by way of making a comparison to a traditional relational database), we will give a brief overview of the how and why of choosing, installing and maintaining a triplestore, with a practical example covering not only the installation phase but also the graphical interface customization and some security policies that should be considered for the SPARQL endpoint server.

Choosing the application

The first task, obviously, is choosing the triplestore application.

Following Richard Wallis’ advice (see reference), 4Store was considered a good tool, useful for our needs. There are a number of reasons why we like this application: firstly, as Richard says, it is an open source project and it “comes from an environment where it was the base platform for a successful commercial business, so it should work”. In addition, as their website suggests, 4store’s main strengths are its performance, scalability and stability.

It may not provide many features beyond RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist. We did investigate other products (such as Virtuoso) but none were as simple and efficient as 4Store.

Hardware platform

At EDINA, we tend to manage our services in hosted systems (similar to the concept that many web-hosting companies use).

After considering the application framework for a SPARQL service, and various options of how triplestores could be used within EDINA, we decided to create an independent host for the 4Store application. This would allow us to both keep the application independent of other services and allow us to evaluate the performance of the triplestore.

It was configured with the following features:

  • 64bit CPU
  • 4 GB of RAM (with the possibility to increase this amount if needed)
  • Linux (64-bit RedHat Enterprise Level 6)

Application installation

At EDINA, we try to install services as a local (non-root) user. This allows us the option of multiple independent services within one host, and reduces the chance that an “exploit” (or cyber break-in) gains significant access to the host’s operating system.

Although some libraries were installed at system level, almost everything was installed at user level. Overall, the 4Store installation was quick and easy to configure: installing the software as a normal user required the installation paths to be specified (i.e. --prefix=~~~), however there were no significant challenges. We did fall foul when installing the raptor and rasqal libraries (the first provides a set of parsers that generate triples, while the second handles query language syntaxes [such as SPARQL]): there are fundamental differences between v1 & v2 – and we required v2.
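A rough sketch of that user-level build (the source directory names and the install prefix are placeholders, not taken from the post):

 # Build the v2 libraries and 4store into a prefix under the service user's home directory
 PREFIX=$HOME/local
 ( cd raptor2-x.y.z && ./configure --prefix=$PREFIX && make && make install )
 ( cd rasqal-x.y.z && ./configure --prefix=$PREFIX && make && make install )
 # Let 4store's configure find the user-level raptor/rasqal rather than any system copies
 export PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig
 ( cd 4store-x.y.z && ./configure --prefix=$PREFIX && make && make install )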

Configuration and load data

Once the installation is finished, 4Store is ready to store data.

  1. The first operation is to set up a dataset, which (for this example) we will call “ori”:
    $ 4s-backend-setup ori
  2. Then we start the database with the following command:
    $ 4s-backend ori
  3. Next we need to load the triples from the data file. This step could take a while, depending on the system and on the amount of data.
    $ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file.ttl

    • This command line includes some options useful for the storing process. 4store’s import command produces no output by default unless something goes wrong (ref: 4Store website).
    • Since we would like more verbose feedback on how the import is progressing, or just some reassurance that it’s doing something at all, we add the option “-v”.
    • The option “-m” (or “--model”) defines the model URI: it is, effectively, a namespace. Every time we import data, 4Store defines a model used to map a SPARQL graph for the imported data. By default, the name of your imported model (SPARQL graph) in 4store will be a file URI derived from the import filename, but we can use the --model flag to override this with any other URI.

The model definition is important when looking to do data-replacement:
$ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file2.ttl

By specifying the same model, we replace all the data present in the “ori” dataset with the information contained in the new file.

Having imported the data, the server is ready to be queried; however, this is only possible via local (command-line) clients. To be useful, we need to start an HTTP server to access the data store.
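A quick sanity check is still possible at this stage with 4store’s bundled command-line client, 4s-query; the query below is purely illustrative and scopes the pattern to the model URI we set with -m during the import:

    $ 4s-query ori 'SELECT ?s ?p ?o WHERE { GRAPH <http://localhost:8001/data/ori> { ?s ?p ?o } } LIMIT 10'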

4Store includes a simple SPARQL HTTP protocol server, which can answer SPARQL queries using the standard SPARQL HTTP query protocol among other features.
$ 4s-httpd -p 1234 ori
will start a server on port 1234 which responds to queries for the 4Store database named “ori”.

If you have several 4store databases they will need to run on separate ports.

HTTP Interface for the SPARQL server

Once the server is running we can see an overview page from a web browser at http://my.host.com:1234/status/
There is a simple HTML query interface at http://my.host.com:1234/test/ and a machine2machine (m2m) SPARQL endpoint at http://my.host.com:1234/sparql/

These last two links provide the means to execute queries to retrieve information held in the database. Note that neither interface allows INSERT, DELETE or LOAD functions, or any other function which could modify the data in the database.

You can send SPARQL 1.1 Update requests to http://my.host.com:1234/update/, and you can add or remove RDF graphs using a RESTful interface at the URI http://my.host.com:1234/data/
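For example, using curl (the host name, graph URI and file name below are illustrative):

 # run a query against the m2m SPARQL endpoint
 curl -s -d 'query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10' http://my.host.com:1234/sparql/
 # add (or replace) a named graph via the RESTful data interface
 curl -s -T extra.ttl -H 'Content-Type: application/x-turtle' 'http://my.host.com:1234/data/http://example.org/graph/extra'
 # remove that graph again
 curl -s -X DELETE 'http://my.host.com:1234/data/http://example.org/graph/extra'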

GUI Interface

The default test interface, whilst usable, is not particularly welcoming.

We wanted to provide a more powerful, and visually more appealing, tool that would allow users to query the SPARQL endpoint without affecting 4Store’s performance. From the Developer Days promoted by JISC and Dev8D, we were aware of Dave Challis’ SPARQLfront project. SPARQLfront is a PHP- and JavaScript-based frontend to RDF SPARQL endpoints. It uses a modified version of ARC2 (a nice PHP RDF library) for most of the functionality, Skeleton to provide the basic HTML and stylesheets, and CodeMirror to provide syntax highlighting for SPARQL.

We installed, configured and customized SPARQLfront under the default system Apache2/PHP server: http://my.host.com/endpoint/

The main features of this tool are syntax highlighting, which helps during query composition, and the option to select different output formats.

Security

EDINA is very conscious of security issues relating to services (as noted, we run services as non-root where possible).

It is desirable to block access to anything that allows modification of the 4Store dataset; however, the 4Store server itself provides no Access Control Lists and is therefore unable to block update and data connections directly. The simplest way to block these connections is to use a reverse-proxy in the Apache2 server to pass on the calls we do want to allow through to the 4Store server, and to completely block direct access to the 4Store server using a firewall.

Thus, we added proxy-pass configuration to the apache server:

 #--Reverse Proxy--
 ProxyPass /sparql http://localhost:1234/sparql
 ProxyPassReverse /sparql http://localhost:1234/sparql
 ProxyPass /status http://localhost:1234/status
 ProxyPassReverse /status http://localhost:1234/status
 ProxyPass /test http://localhost:1234/test
 ProxyPassReverse /test http://localhost:1234/test

 

Note: do NOT enable proxyRequests or proxyVia as this opens your server as a “forwarding” proxy to all-and-sundry using your host to access anything on the web! (see the Apache documentation)

We then use the Firewall to ensure that all ports we are not using are blocked, whilst allowing localhost to connect to port 1234:

 # allow localhost to do everything
 iptables -A INPUT -i lo -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
 # allow connections to port 80 (http) & port 22 (ssh)
 iptables -A INPUT -p tcp -m multiport -d 12.23.34.45 --destination-ports 22,80 -m state --state NEW -j ACCEPT

 

Your existing firewall may already have a more complex configuration, or may be configured using a GUI. Whichever way, you need to allow localhost to connect to port 1234, and block everything else from most things (especially port 1234!).
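As one hedged sketch of those two requirements with iptables (it assumes the ACCEPT rules above are already in place and nothing later re-accepts the traffic):

 # allow return traffic for connections this host initiated
 iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
 # explicitly refuse connections to the 4store port that did not arrive on loopback
 iptables -A INPUT -p tcp --dport 1234 ! -i lo -j REJECT
 # drop anything else that has not matched an earlier rule
 iptables -A INPUT -j DROP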


Cesare

OA-RJ is dead. Long live ORI and RJ Broker.

JISC is funding EDINA to participate in the UK Repository Net+ infrastructure project. UK Repository Net+, or RepNet for short, is a socio-technical infrastructure supporting deposit, curation and exposure of Open Access research literature. EDINA’s contribution to RepNet is based on the Open Access Repository Junction (OA-RJ) project outcomes. Two independent tools will be developed as two separate projects from the ‘discovery’ and ‘delivery’ functionality of the OA-RJ.

The Organisation and Repositories Identification (ORI) project will design a standalone middleware tool for identifying academic organisations and their associated repositories. The tool will harvest data from several authoritative sources in order to provide information on over 23,000 organisations and 3,000 repositories worldwide. APIs will be provided to query the content of the ORI tool.

The Repository Junction Broker (RJ Broker) project will deliver a standalone middleware tool for handling the deposit of research articles to multiple repositories. This will offer an effective solution to authors and publishers wishing to deposit open access publications in relevant subject and institutional repositories. The RJ Broker will parse the metadata of an article to determine the appropriate target repositories and transfer the publication to the registered repositories. It is intended to minimise effort on the part of potential depositors, and thereby maximise distribution and exposure of research outputs.

The RJ Broker project has been funded for a year as part of the wave 1 development of UK Repository Net+, while the ORI project was awarded six months of funding.

Catalyst is awesome!

As part of the [potential?] move into UK RepositoryNet+ I am moving the discovery side of OA-RJ into a whole new framework…. and using Catalyst.

It is AWESOME!

It’s faster by fairly serious amounts compared to straight CGI scripts, and the whole way of interacting with the database is just so clean it’s brilliant. I’m suffering a bit from the “old dog, new tricks” syndrome…. but man – it’s amazing!

Validating turtle

When I create the linked-data copy of the ORI (Organisation and Repository Identification: what was called Repository Junction) dataset, I create a turtle file, and then use any23.org to convert it to RDF – woot: valid turtle & valid RDF.

At dev8D [Feb 2012] I tried to throw it through a summary parser (which uses a Jena-based parser), and it all fell apart.

It turns out Jena is a really strict parser – so now I’m reviewing the whole build process to ensure I have a properly valid turtle file.
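One way to do that checking – a sketch, assuming Apache Jena’s command-line tools are installed and the file is called ori.ttl – is to run the file through Jena’s riot parser in validation mode before publishing:

 # strict parse: prints errors and warnings only, and outputs no triples
 $ riot --validate ori.ttl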