Setting up redundancy in a live service

At EDINA, we strive for resilience in the services we run. We may not be competing with the massive international server-farms that the likes of Google, Facebook, Amazon, eBay, etc run… however we try to have systems that cope when things become unavailable due to network outages, power-cuts, or even hardware failures: the services continue and the user may not even notice!

For read-only services, such as ORI, this is relatively easy to do: the datasets are apparently static (any updates are done in the background, without any interaction from the user) so two completely separate services can be run in two completely separate data centres, and the systems can switch between them without anyone noticing – however Repository Junction Broker presented a whole different challenge.

The basic premise is that there are two instances of the service running – one at data centre A, the other at data centre B – and the user should be able to use either, and not spot any difference.

This first part is easy: the user connects to the public name for the service. The public name is answered by something called a load-balancer which directs the user to whichever service it perceives as being least loaded.

In a read-only system, we can stop here: we can have two entirely disconnected systems, let them respond as they need to, and use basic admin commands to update each service as required.

For RJ Broker, this is not the end of the story: RJB has to take in data from the user, and store that information in a database. Again, a relatively simple solution is available: run the database through the load-balancer, so both services are talking to the same database, and use background admin tools to copy database A to database B in the other data centre.

Again, RJB needs more: as well as storing information in the database, the service also writes files to disk. Here at EDINA, we make use of Storage Area Network (SAN) systems, which means we mount a chunk of disk space into the file-system, across the network.

The initial solution was to get both services to mount disk space from SAN A (at data centre A) into /home/user/san_A and then get the application to store the files into this branch of the file-system, meaning the files are instantly visible to both services.

This, however, still has a “single point of failure”, which is SAN A. What is needed is to replicate the data in SAN A in SAN B, and make SAN B easily available to both installations.

The first part of this is simple enough: mount disk space from SAN B (in data centre B) into /home/user/san_B. We then use an intermediate symbolic link (/home/user/filestore) to point to /home/user/san_A and get the application to store data in /home/user/filestore/... Now, if there is an outage at data centre A, we simply need to swap the symbolic link /home/user/filestore to point to /home/user/san_B and the application is none the wiser.
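As a rough sketch of that switch-over, the script below runs in a throw-away temporary directory rather than /home/user, with plain directories standing in for the two SAN mounts (both are assumptions made purely so the example can be run anywhere):

```shell
#!/bin/sh
set -eu

# Hypothetical sandbox standing in for /home/user; the real san_A and san_B
# would be SAN mounts, not ordinary directories.
base=$(mktemp -d)
mkdir "$base/san_A" "$base/san_B"

# Normal running: the application writes via filestore, which points at SAN A.
ln -s "$base/san_A" "$base/filestore"

# Outage at data centre A: repoint the link at SAN B.
# -n stops ln from following the existing link into san_A; -f replaces it.
ln -sfn "$base/san_B" "$base/filestore"

# The application keeps using the same path and its files now land on SAN B.
touch "$base/filestore/deposit.txt"
readlink "$base/filestore"
```

Note that ln -sfn removes the old link and then creates the new one, so there is a brief moment with no link at all. On a busy service it may be worth creating the new link under a temporary name and renaming it into place instead.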

The only thing that still needs to happen is the magic to ensure that all the data written to SAN A is duplicated in SAN B (and database A is duplicated into database B).

SWORD 1.3 vs SWORD 2

What’s the difference, and how do they compare?

In summary

SWORD 1.3 is a one-off package deposit system: the record is wrapped up in some agreed format, and dropped into the repository. SWORD 1.3 uses an HTTP header to define what that package format is, and the individual repositories use that header to determine how to unpack the record. Every deposit is a new record.

SWORD 2.0 is a CRUD-based (Create, Read, Update, Delete) system, where the emphasis is on being able to manage existing records, as well as creating new records. SWORD 2 uses the URL to identify the record being manipulated, and the mime-type of the object being presented to know what to do with it.

In detail (EPrints-specific)

This is, perforce, EPrints-specific as I am an EPrints user, with no experience coding in DSpace/Fedora/etc.

SWORD 1.3

With a SWORD 1.3 system, one defines a mapping between the X-Packaging header URI and the importer package to handle it:

  $c->{sword}->{supported_packages}->{"http://opendepot.org/broker/1.0"} =
  {
    name => "Open Access Repository Junction Broker",
    plugin => "Sword::Import::Broker_OARJ",
    qvalue => "0.6"
  };

The importer routine then has some internal logic to ensure it only tries to manage records of the right type (XML files, Word Documents, Spreadsheets, Zip files, etc).

In the case of compressed files, it is customary to also indicate the routine used to uncompress the file. For example, the same importer could manage .zip, .tar, and .tgz files, which are all variations on a packaged collection of files, and which have the following collection of mime-types:

application/x-gtar
application/x-tar
application/x-gtar-compressed
application/zip

Therefore our importer would have code like this:

   our %SUPPORTED_MIME_TYPES = ( "application/zip"    => 1, "application/tar"               => 1,
                                 "application/x-gtar" => 1, "application/x-gtar-compressed" => 1,);

   our %UNPACK_MIME_TYPES = ( "application/zip"               => "Sword::Unpack::MyNewZip",
                              "application/tar"               => "Sword::Unpack::MyNewTar",
                              "application/x-gtar"            => "Sword::Unpack::MyNewTar",
                              "application/x-gtar-compressed" => "Sword::Unpack::MyNewTar");

So, a basic SWORD 1.3 deposit is a simple POST request to a defined URL, with a set of headers to manage the deposit, and the record as the body of the request:

  curl -X POST \
       -i \
       -u username:password \
       --data-binary "@myFile.zip" \
       -H 'X-Packaging: http://opendepot.org/broker/1.0' \
       -H 'Content-Type: application/zip' \
       http://my.repo.url/sword-path/collection

This will deposit the binary file myFile.zip into the collection point in the repository, using the importer identified by the Package http://opendepot.org/broker/1.0.

SWORD 2.0

This is much vaguer, as I’ve not really got a good working example of a SWORD 2 sequence available (the Broker doesn’t do CRUD).

With SWORD 2, the idea is to be able to update existing records, piecemeal:

  • Create a blank record
  • Add some basic metadata (title, authors, etc)
  • Add the rough-draft file
  • Add the post-review article
  • Delete the rough-draft file
  • Add the abstract
  • Add the publication metadata (journal, issue, pages, etc)

With SWORD 2, the routine used to process a request is chosen according to the mime-type given in the headers.

Each importer has a constructor function, new, which declares (among other things) the mime-types it accepts:

sub new {
 my ( $class, %params ) = @_;
 my $self = $class->SUPER::new(%params);
 $self->{name} = "Import RJBroker SWORD (2.0) deposits";

 $self->{visible}   = "all";
 $self->{advertise} = 1;
 $self->{produce}   = [qw( list/eprint dataobj/eprint )];
 $self->{accept}    = [qw( application/vnd.broker.xml )];
 $self->{actions}   = [qw( unpack )];
 return $self;
}

So, to create a new record, one posts a file with no record id:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/content

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use it to create a new record. The server response will include URLs for updating the record.

To add a file to a known record:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyOtherFile.pdf" \
       http://my.repo.url/id/eprints/123

This will use the default application/octet-stream importer, and add the file MyOtherFile.pdf to the record with the id 123.

To add more metadata:

  curl -X POST -i \
       -u username:password \
       --data-binary "@MyData.xml" \
       -H 'Content-Type: application/vnd.broker.xml' \
       http://my.repo.url/id/eprints/123

This will find the importer that claims to understand ‘application/vnd.broker.xml’, and use that code to add the metadata to the record with the id 123.

Note: there is a difference between PUT and POST:

  • POST adds contents to any existing data. Where a field already exists, the action is determined by the importer
  • PUT deletes the data and adds the new information – it replaces the whole record.
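To make the PUT case concrete, here is a hedged sketch of replacing record 123 outright. It reuses the placeholder host, credentials, and file name from the examples above, and assumes the repository accepts PUT on the same record URL. The wrapper function replace_record is made up for this illustration: by default it only prints the request, and it sends it only when RUN=1 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: $1 = record URL, $2 = metadata file.
replace_record() {
    # Build the curl command as the function's argument list.
    set -- curl -X PUT -i \
        -u username:password \
        --data-binary "@$2" \
        -H 'Content-Type: application/vnd.broker.xml' \
        "$1"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"                  # really send the request
    else
        printf '%s\n' "$*"    # dry run: show the command (printed without shell quoting)
    fi
}

replace_record http://my.repo.url/id/eprints/123 MyData.xml
```

The only difference from the POST example is the method: the same importer is selected by the Content-Type, but the existing record contents are replaced rather than added to.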

Summary

  • SWORD 1.3 uses the X-Packaging header to determine which importer routine to use, and the importer uses the mime-type to confirm suitability
  • SWORD 2 uses the mime-type to determine which importer routine to use.
  • The URLs for making deposits are different.

“Will Triplestore replace Relational Databases?”

It is not possible to give a definitive answer, however it is important to look at this technology which has been causing a stir in the informatics field.

Basically, a triplestore is a purpose-built database for the storage and retrieval of triples (Jack Rusher, Semantic Web Advanced Development for Europe). Rather than listing the main features of a triplestore by way of comparison with a traditional relational database, we will give a brief overview of the how and why of choosing, installing, and maintaining one. Our practical example covers not only the installation phase but also the customisation of the graphical interface and some security policies that should be considered for the SPARQL endpoint server.

Choosing the application

The first task, obviously, is choosing the triplestore application.

Following Richard Wallis’ advice (see reference), 4Store was considered a good tool, useful for our needs. There are a number of reasons why we like this application: firstly, as Richard says, it is an open source project and it “comes from an environment where it was the base platform for a successful commercial business, so it should work”. In addition, as their website suggests, 4store’s main strengths are its performance, scalability and stability.

It may not provide many features beyond RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist. We did investigate other products (such as Virtuoso) but none were as simple and efficient as 4Store.

Hardware platform

At EDINA, we tend to manage our services in hosted systems (similar to the concept that many web-hosting companies use).

After considering the application framework for a SPARQL service, and various options of how triplestores could be used within EDINA, we decided to create an independent host for the 4Store application. This would allow us to both keep the application independent of other services and allow us to evaluate the performance of the triplestore.

It was configured with the following features:

  • 64bit CPU
  • 4 GB of RAM (with the possibility to increase this amount if needed)
  • Linux (64-bit Red Hat Enterprise Linux 6)

Application installation

At EDINA, we try to install services as a local (non-root) user. This allows us the option of multiple independent services within one host, and reduces the chance that an “exploit” (or cyber break-in) gains significant access to the host’s operating system.

Although some libraries were installed at system level, almost everything was installed at user level. Overall, the 4Store installation was quick and easy to configure: installing the software as a normal user required the installation paths to be specified (i.e. --prefix=~~~), however there were no significant challenges. We did fall foul when installing the raptor and rasqal libraries (the first provides a set of parsers that generate triples, while the second handles query language syntaxes [such as SPARQL]): there are fundamental differences between v1 and v2, and we required v2.

Configuration and load data

Once the installation is finished, 4Store is ready to store data.

  1. The first operation is to set up a dataset, which (for this example) we will call “ori”:
    $ 4s-backend-setup ori
  2. Then we start the database with the following command:
    $ 4s-backend ori
  3. Next we need to load the triples from the data file. This step could take a while, depending on the system and on the amount of data.
    $ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file.ttl

    • This command line includes some options that are useful when loading data. 4store’s import command produces no output by default unless something goes wrong (ref: 4Store website).
    • Since we would like more verbose feedback on how the import is progressing, or just some reassurance that it’s doing something at all, we add the option “-v”.
    • The option “-m” (or “--model”) defines the model URI: it is, effectively, a namespace. Every time we import data, 4Store defines a model, which maps to a SPARQL graph for the imported data. By default the name of your imported model (SPARQL graph) in 4store will be a file: URI derived from the import filename, but we can use the --model flag to override this with any other URI.

The model definition is important when looking to do data-replacement:
$ 4s-import -v ori -m http://localhost:8001/data/ori /path/to/datafile/file2.ttl

By specifying the same model, we replace all the data present in the “ori” dataset with the information contained in the new file.

Having imported the data, the server is ready to be queried, however this is only via local (command-line) clients. To be useful, we need to start an HTTP server to access the data store.

4Store includes a simple SPARQL HTTP protocol server, which can answer SPARQL queries using the standard SPARQL HTTP query protocol among other features.
$ 4s-httpd -p 1234 ori
will start a server on port 1234 which responds to queries for the 4Store database named “ori”.

If you have several 4store databases they will need to run on separate ports.

HTTP Interface for the SPARQL server

Once the server is running we can see an overview page from a web browser at http://my.host.com:1234/status/
There is a simple HTML query interface at http://my.host.com:1234/test/ and a machine2machine (m2m) SPARQL endpoint at http://my.host.com:1234/sparql/

These last two links provide the means to execute queries that retrieve information present in the database. Note that neither interface allows INSERT, DELETE, or LOAD functions, or any other function which could modify the data in the database.

You can send SPARQL 1.1 Update requests to http://my.host.com:1234/update/, and you can add or remove RDF graphs using a RESTful interface at the URI http://my.host.com:1234/data/
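As an illustration of using the m2m endpoint from the command line, the sketch below builds a standard SPARQL protocol GET request with curl. The query itself, the Accept header, and the helper name sparql_get are assumptions made for the example rather than anything 4Store-specific; by default the script only prints the command, and RUN=1 would actually send it.

```shell
#!/bin/sh
set -eu

endpoint=http://my.host.com:1234/sparql/
# Illustrative query: fetch any ten triples.
query='SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'

# Hypothetical helper: $1 = endpoint URL, $2 = SPARQL query text.
sparql_get() {
    # -G turns the --data-urlencode value into a URL query string on a GET,
    # i.e. .../sparql/?query=<url-encoded query>
    set -- curl -G \
        -H 'Accept: application/sparql-results+xml' \
        --data-urlencode "query=$2" \
        "$1"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"                  # really send the request
    else
        printf '%s\n' "$*"    # dry run: show the command (printed without shell quoting)
    fi
}

sparql_get "$endpoint" "$query"
```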

GUI Interface

The default test interface, whilst usable, is not particularly welcoming.

We wanted to provide a more powerful, and visually more appealing, tool that would let users query the SPARQL endpoint without affecting the 4Store performance. From the Developer Days promoted by JISC and Dev8D, we were aware of Dave Challis’ SPARQLfront project. SPARQLfront is a PHP and JavaScript based frontend to RDF SPARQL endpoints. It uses a modified version of ARC2 (a nice PHP RDF library) for most of the functionality, Skeleton to provide the basic HTML and stylesheets, and CodeMirror to provide syntax highlighting for SPARQL.

We installed, configured and customized SPARQLfront under the default system Apache2/PHP server: http://my.host.com/endpoint/

The main features of this tool are the syntax highlighting, which helps during query composition, and the option to select different output formats.

Security

EDINA is very conscious of security issues relating to services (as noted, we run services as non-root where possible).

It is desirable to block access to anything that allows modification of the 4Store dataset, however the 4Store server itself provides no Access Control Lists and is therefore unable to block update and data connections directly. The simplest way to block these connections is to use a reverse proxy in the Apache2 server to pass on only the calls we do want to reach the 4Store server, and to block direct access to the 4Store server completely using a firewall.

Thus, we added proxy-pass configuration to the Apache server:

 #--Reverse Proxy--
 ProxyPass /sparql http://localhost:1234/sparql
 ProxyPassReverse /sparql http://localhost:1234/sparql
 ProxyPass /status http://localhost:1234/status
 ProxyPassReverse /status http://localhost:1234/status
 ProxyPass /test http://localhost:1234/test
 ProxyPassReverse /test http://localhost:1234/test

 

Note: do NOT enable ProxyRequests or ProxyVia as this opens your server as a “forwarding” proxy to all-and-sundry using your host to access anything on the web! (see the Apache documentation)

We then use the firewall to ensure that all ports we are not using are blocked, whilst allowing localhost to connect to port 1234:

 # allow localhost to do everything
 iptables -A INPUT -i lo -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
 # allow connections to port 80 (http) & port 22 (ssh)
 iptables -A INPUT -p tcp -m multiport -d 12.23.34.45 --destination-ports 22,80 -m state --state NEW -j ACCEPT

 

Your existing firewall may already have a more complex configuration, or may be configured using a GUI. Whichever way, you need to allow localhost to connect to port 1234, and everything else to be blocked from most things (especially 1234!)
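The rules quoted above only add ACCEPT entries, so on their own they block nothing unless the chain's default policy is already DROP. A fuller sketch follows, assuming a host with no other firewall rules in place. The ESTABLISHED,RELATED rule and the DROP policy are additions for illustration, not part of the configuration quoted above, and the fw wrapper is a made-up helper that prints each rule unless the script is run as root with APPLY=1.

```shell
#!/bin/sh
set -eu

# Hypothetical wrapper: print the rule by default, apply it when APPLY=1.
fw() {
    if [ "${APPLY:-0}" = 1 ]; then
        iptables "$@"
    else
        echo "iptables $*"
    fi
}

# Packets belonging to connections that were already accepted
# (needed because the ssh/http rule below only matches NEW connections).
fw -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# Loopback traffic, which covers Apache proxying to 4Store on localhost:1234.
fw -A INPUT -i lo -j ACCEPT
fw -A OUTPUT -o lo -j ACCEPT
# New inbound connections for ssh and http only (address as in the example above).
fw -A INPUT -p tcp -m multiport -d 12.23.34.45 --destination-ports 22,80 -m state --state NEW -j ACCEPT
# Everything else inbound, port 1234 included, is dropped.
# Set last so that an ssh session used to apply the rules is not cut off part-way.
fw -P INPUT DROP
```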


Cesare

Trying to get my head around the options….

OK, so here’s the problem:

An organisation can have multiple names, and it can have multiple URLs… and sometimes one can identify a straight one-to-one relationship between the two.

For example: Riga Technical University is the English name for Rīgas Tehniskā Universitāte. Being clever, I have identified http://www.rtu.lv as the home page (in its native Latvian) and http://www.rtu.lv/en as the English-language version. I can even associate the URLs as appropriate: the English name links to the English-language pages, and the Latvian name links to the Latvian-language pages.

Life is a tad more complex in other places. For example, “Đại học Quốc gia Hồ Chí Minh” can be called either “National University of Ho Chi Minh” or “Ho Chi Minh City Vietnam National University” in English, yet I have only one URL: http://www.vnuhcm.edu.vn

Contrariwise: EDINA has just one name, but two URLs (http://edina.ac.uk and http://www.edina.ac.uk).

There are, naturally, some unknown number of instances where the name and the URL have not been linked – where the harvesting code was unable to make a “sensible” correlation.

The problem is working out how to model this sometimes-present relationship of many-to-many – in code, in data-returns, and on the screen.
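One possible way to model it, offered only as a sketch, is a link table with one row per known (organisation, name, URL) pairing, so that a name can appear against several URLs and a URL against several names. The organisation keys (rtu, vnuhcm, edina), the choice of which pairs are linked, and the helper function names are all assumptions made for this illustration; a name or URL with no known partner would simply be stored with the other field left empty.

```shell
#!/bin/sh
set -eu

# One row per link: org|name|url (illustrative rows, not the harvested data).
links=$(mktemp)
cat > "$links" <<'EOF'
rtu|Riga Technical University|http://www.rtu.lv/en
rtu|Rīgas Tehniskā Universitāte|http://www.rtu.lv
vnuhcm|Đại học Quốc gia Hồ Chí Minh|http://www.vnuhcm.edu.vn
vnuhcm|National University of Ho Chi Minh|http://www.vnuhcm.edu.vn
vnuhcm|Ho Chi Minh City Vietnam National University|http://www.vnuhcm.edu.vn
edina|EDINA|http://edina.ac.uk
edina|EDINA|http://www.edina.ac.uk
EOF

# URLs linked to a given name; no output means no link is known.
urls_for_name() {
    awk -F'|' -v n="$1" '$2 == n && $3 != "" { print $3 }' "$links"
}

# Names linked to a given URL; no output means no link is known.
names_for_url() {
    awk -F'|' -v u="$1" '$3 == u && $2 != "" { print $2 }' "$links"
}

urls_for_name 'EDINA'
names_for_url 'http://www.vnuhcm.edu.vn'
```

The same three-column shape carries over to a relational link table or a list of pairs in a data return. How to present it on screen is a separate question.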