Validating turtle

When I create the linked-data copy of the ORI (ORI – Organisation and Repository Identification: what was called Repository Junction) dataset, I create a turtle file, and then use any23.org to convert it to RDF – woot: valid turtle & valid RDF.

At dev8D [Feb 2012] I tried to throw it through a summary parser (which uses a Jena-based parser), and it all fell apart.

It turns out Jena is a really strict parser – so now I’m reviewing the whole build process to ensure I have a properly valid turtle file.

Development APIs : The main api

This is the final, and main, api for the system (cf the AJAXie get_xxx and the list functions). Currently at http://devel.edina.ac.uk:1201/cgi/api5, this is the API that the front page of OpenDepot.org uses, and it is expected to be the primary contact point for other developers.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. All return the data as a nested object, with three top-level elements:
     {
       'message' => {},
       'status'  => 'ok',
       'to'      => 'http://.....'
     }

    status is “ok” or “fail”, to is the url that made the query, and message contains the actual data being returned.
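As a minimal sketch of how a client might handle this envelope (assuming the json format, and assuming a "fail" status should be treated as an error; the unwrap helper and the sample body are illustrative, not part of the API):

```python
import json


def unwrap(body: str):
    """Return the 'message' payload of a response, or raise if status is not 'ok'.

    Hypothetical helper: it relies only on the three top-level elements
    described above ('message', 'status', 'to').
    """
    envelope = json.loads(body)
    if envelope.get("status") != "ok":
        # 'to' is the url that made the query, which is handy in error reports
        raise RuntimeError("query failed: %s" % envelope.get("to"))
    return envelope["message"]


# Illustrative response body, not captured from the live service
sample = '{"status": "ok", "to": "http://.....", "message": {}}'
payload = unwrap(sample)
```

Keeping the status check in one place means the rest of a client only ever sees the message data.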

The call

The “locus” for the api search can be defined in a number of ways:

  1. You can specify an IP number to base the search on (ip=129.215)
    • If a full quad is not given, then the full range based on what is given is assumed (so 129.215 means 129.215.0.0 to 129.215.255.255)
    • If a range is defined (ie 129.214-129.217) then the upper and lower bands are set accordingly (ie 129.214.0.0-129.217.255.255)
  2. You can specify a geographic location to base the search on (geo=55.95,-3)
    • The accuracy for the search depends on the numbers given: the range is always +/- 1 either side of the last decimal place given (so a bounding box of 55.94,-2.9 to 55.96,-3.1)
  3. You can specifically define an organisation ID to fix your search on (org=2736)
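The IP expansion rules above can be sketched as follows. This is only an illustration of the described behaviour, not the code the API actually runs; the function names are made up for the example.

```python
def expand_partial(ip: str, fill: int) -> str:
    """Pad a partial dotted quad out to four parts using `fill` (0 or 255)."""
    parts = ip.strip().split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("not a (partial) dotted quad: %r" % ip)
    parts += [str(fill)] * (4 - len(parts))
    return ".".join(parts)


def ip_range(spec: str) -> tuple:
    """Turn an ip= value into its (lower, upper) bounds.

    A single partial address covers everything beneath it; a 'low-high'
    range pads the lower band with 0s and the upper band with 255s.
    """
    lower, _, upper = spec.partition("-")
    if not upper:
        upper = lower  # no explicit range: both bounds come from one value
    return expand_partial(lower, 0), expand_partial(upper, 255)
```

For example, ip_range("129.215") gives the 129.215.0.0 to 129.215.255.255 range described above.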

You can specify multiple locus points; how they interact needs to be made clear:

  • Every locus definition within the same type is cumulative: if you specify two IP ranges, then anything in either range is listed
    • This can lead to lots and lots of results
  • Every locus definition that combines different types results in an intersection of the results (all the results on a specified network range that are also within a specified geographic location)
    • This can lead to Zero results
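In set terms, that is a union within each locus type followed by an intersection across the types. A small sketch, assuming each locus definition has already been resolved to a set of (hypothetical) repository ids:

```python
def combine(results_by_type: dict) -> set:
    """Combine per-locus result sets.

    `results_by_type` maps a locus type ('ip', 'geo', 'org') to a list of
    result sets, one per locus definition of that type.
    """
    # Same type: cumulative, so take the union of every set for that type
    per_type = [set().union(*sets) for sets in results_by_type.values() if sets]
    if not per_type:
        return set()
    # Different types: only results present for every type survive
    return set.intersection(*per_type)


# Two IP ranges and one geographic box, with made-up ids
matches = combine({"ip": [{1, 2}, {3}], "geo": [{2, 3, 4}]})
```

With these made-up ids the two IP ranges pool to {1, 2, 3}, and intersecting with the geographic set leaves {2, 3}.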

In addition to defining the locus for the search, the repositories returned can be tuned to return only those of a certain type, and/or only those that accept particular types of deposits.

  • type is the parameter that defines the type of repository (Institutional, Data, etc), and it's the code number you need (see the appropriate list/type call for the known list of types)
  • content is the parameter that defines the type of content the repository accepts (pre-prints, data, learning objects, etc), and it's the code number you need (see the appropriate list/content call for the known list of content-types)
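Putting the locus and tuning parameters together, a query URL could be assembled along these lines. One assumption to flag loudly: the text does not say how multiple locus points are passed, so repeating the parameter is a guess here, and build_query is a made-up helper.

```python
from urllib.parse import urlencode

API_URL = "http://devel.edina.ac.uk:1201/cgi/api5"


def build_query(loci, repo_type=None, content=None, fmt="json"):
    """Build a query URL for the main api.

    `loci` is a list of (name, value) pairs such as ("ip", "129.215").
    ASSUMPTION: multiple loci are sent as repeated parameters.
    """
    params = list(loci)
    if repo_type is not None:
        params.append(("type", repo_type))    # code number from list/type
    if content is not None:
        params.append(("content", content))   # code number from list/content
    params.append(("format", fmt))
    return API_URL + "?" + urlencode(params)


url = build_query([("ip", "129.215"), ("geo", "55.95,-3")], repo_type=2, content=9)
```

Note that urlencode percent-encodes the comma in the geo value; whether the service also accepts it unencoded is not something the text confirms.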

The return

The data object returned is a set of net objects (indexed by net_id), within which is a list of org objects associated with that network. Within each org object is a list of repo objects. All objects conform to the specification here.

 {
   'message' => {net} => 'i38647' => { 'dec_lower' => '152.78.0.0',
                                       'dec_upper' => '152.78.255.255',
                                       'orgs' => [ { 'org_name' => 'AgentLink.org',
                                                     'org_url' => 'http://www.agentlink.org',
                                                     'repos' => [ { 'repo_name' => 'xxxxxxx',
                                                                    'org_url'   => 'yyyyyy',
                                                                    ................
                                                                   },
                                                                   {
                                                                    .................
                                                                   } ]
                                                     },
                                                     {
                                                 } ],
                                       ...........
                                     }
                      => 'i39677' => {
                                       .............
                                     }
 }

The data is not sorted before being returned.
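Since nothing is sorted, a client that wants a stable listing has to walk the net, org and repo levels itself. A sketch, assuming the message has been decoded into plain dictionaries and lists shaped like the dump above (the sample values are placeholders):

```python
def flatten(message: dict) -> list:
    """Return sorted (org_name, repo_name, net_id) rows from a message."""
    rows = []
    for net_id, net in message.get("net", {}).items():
        for org in net.get("orgs", []):
            for repo in org.get("repos", []):
                rows.append((org.get("org_name", ""),
                             repo.get("repo_name", ""),
                             net_id))
    # The API does not sort, so do it here: by organisation, then repository
    return sorted(rows)


# Placeholder data in the documented shape, not real API output
sample_message = {
    "net": {
        "i39677": {"orgs": [{"org_name": "Org B",
                             "repos": [{"repo_name": "Repo 2"}]}]},
        "i38647": {"orgs": [{"org_name": "Org A",
                             "repos": [{"repo_name": "Repo 9"},
                                       {"repo_name": "Repo 1"}]}]},
    }
}
rows = flatten(sample_message)
```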

Development APIs : The get functions

This second suite of functions (cf the main api and the list functions) was initially created as part of a set of “data sanity checking” web pages; however, it became apparent that their usefulness extended beyond my own needs, so they have been brought in line with the other functions and made generic.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘prototype’ (the only place this format is available), ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. For prototype returns, the data is formatted as an xhtml unordered list (as per the script.aculo.us/Prototype requirements), with the for attribute set to match EPrints field names.
  3. For all other returns, the data is a list of data records (as per the description here)

The queries

Currently at http://devel.edina.ac.uk:1201/cgi/get_xxx5, this is a suite of three APIs that are there to support AJAX calls.

The basic premise is that the term to be looked up is passed in a parameter “q”, and all the records that have that term somewhere in the data are returned.

Additional parameters can be used to tune the query:

  • format will define the format being returned
  • field will specify which field to query on (see the individual functions for more details on this)

The three queries are:

  • get_orgs5
  • get_nets5
  • get_repos5

get_orgs5

This query will search either the name or the url to return a list of organisations that match. By default, the name field is searched.

get_nets5

This query will search either the name or an ip number to return a list of networks that match. By default, the name field is searched, however if the script spots an IP number, it will automatically switch to an ip search.

get_repos5

This query will search either by name or url to return a list of repositories that match. By default, the name field is searched.
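A lookup URL for any of the three functions can be built the same way. This is a sketch only: lookup_url is a made-up helper, and the exact values the field parameter accepts are not spelled out above, so "url" in the example is an assumption.

```python
from urllib.parse import urlencode

BASE = "http://devel.edina.ac.uk:1201/cgi/"
FUNCTIONS = ("get_orgs5", "get_nets5", "get_repos5")


def lookup_url(function: str, term: str, field=None, fmt=None) -> str:
    """Build a get_xxx5 lookup URL with the term in the 'q' parameter."""
    if function not in FUNCTIONS:
        raise ValueError("unknown function: %r" % function)
    params = {"q": term}
    if field:
        params["field"] = field    # ASSUMPTION: e.g. "url" rather than the default name search
    if fmt:
        params["format"] = fmt     # 'prototype', 'json', 'xml' or 'text'
    return BASE + function + "?" + urlencode(params)


by_name = lookup_url("get_orgs5", "Edinburgh")
by_url = lookup_url("get_repos5", "edina", field="url", fmt="xml")
```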

Demos of the new APIs

To help ensure usability of the new APIs, I’ve created some example clients.

Most of the calls are done using jQuery, which has a mechanism for doing CrossDomain calls – and all the scripts support that functionality if you want it.

Development APIs : The list functions

OK, so there’s some interesting data to get – but how do you get it?

There are three general APIs, or 10… depending on how you count them.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. All return the data as a nested object, with three top-level elements:
 {
   'message' => {},
   'status'  => 'ok',
   'to'      => 'http://.....'
 }

status is “ok” or “fail”, to is the url that made the query, and message contains the actual data being returned… which is dependent on the query!

The queries

Let’s start with the suite that lists things (cf the AJAXie get_xxx functions and the main api)… currently at http://devel.edina.ac.uk:1201/cgi/list5/xxx, this is a suite of six APIs that pull out a list of things:

  • type
  • content
  • country
  • lang
  • org
  • net

type

This lists the type (or classification) of repository.

'message' => {
               'type' => [
                           {
                             'code' => 1,
                             'text' => 'Subject (Research Cross-Institutional)'
                           },
                           {
                             'code' => 2,
                             'text' => 'Other'
                           },
                           ......
                         ]
                      },
code  text
1     Undetermined – Repositories whose type has not yet been assessed
2     Institutional (Institutional or departmental repositories)
3     Disciplinary (Cross-institutional subject repositories)
4     Aggregating (Archives aggregating data from several subsidiary repositories)
5     Governmental (Repositories for governmental data)
6     Subject (Research Cross-Institutional)
7     Journal (e-Journal/Publication)
8     Thesis
9     Database (Database/A&I Index)
10    Learning (Learning and Teaching Objects)
11    Other
12    Demonstration

When a repository type is needed by /api, it is the code number you need.

Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.
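Because /api wants the code number rather than the text, a client will typically turn one of these lists into a lookup table. A sketch, assuming the message has been decoded into dictionaries and lists; the sample copies the two entries from the dump above.

```python
def code_table(message: dict, key: str) -> dict:
    """Map code numbers to their text for a list5 response ('type', 'content', ...)."""
    return {entry["code"]: entry["text"] for entry in message.get(key, [])}


def code_for(message: dict, key: str, text: str):
    """Reverse lookup: find the code number for an exact text label, or None."""
    for code, label in code_table(message, key).items():
        if label == text:
            return code
    return None


sample_types = {
    "type": [
        {"code": 1, "text": "Subject (Research Cross-Institutional)"},
        {"code": 2, "text": "Other"},
    ]
}
types = code_table(sample_types, "type")
```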

content

This lists the type of content that repositories accept

  <message>
    <content>
      <code>1</code>
      <text>Research papers (pre- and postprints)</text>
    </content>
    <content>
      <code>2</code>
      <text>Research papers (preprints only)</text>
    </content>
    .....
  </message>
code  text
1     Research papers (pre- and postprints)
2     Research papers (preprints only)
3     Research papers (postprints only)
4     Bibliographic references
5     Conference and workshop papers
6     Theses and dissertations
7     Unpublished reports and working papers
8     Books & chapters and sections
9     Datasets
10    Learning Objects
11    Multimedia and audio-visual materials
12    Software
13    Patents
14    Other special item types

When a content type is to be defined in /api, it is the code number you need.

Adding the parameter full=1 will cause the query to return all the repositories that accept the content-type listed under a repos element. Note that repositories usually accept multiple content-types, so will appear under multiple entries.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.
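For clients that ask for xml instead, the same lookup table can be pulled out with the standard library. A sketch, assuming the message element looks like the fragment shown above; how the full xml document wraps it is not shown here, so the code searches for message wherever it sits.

```python
import xml.etree.ElementTree as ET


def content_types(xml_text: str) -> dict:
    """Map content-type code numbers to their text from an xml response."""
    root = ET.fromstring(xml_text)
    # ASSUMPTION: <message> is either the root or nested somewhere beneath it
    message = root if root.tag == "message" else root.find(".//message")
    if message is None:
        raise ValueError("no <message> element found")
    return {int(item.findtext("code")): item.findtext("text")
            for item in message.findall("content")}


sample_xml = """
<message>
  <content><code>1</code><text>Research papers (pre- and postprints)</text></content>
  <content><code>2</code><text>Research papers (preprints only)</text></content>
</message>
"""
contents = content_types(sample_xml)
```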

lang

This lists all the languages the dataset knows about (in essence, the ISO 639 codes).

(We are limited to ISO 639-2 as ISO 639-3 & later are not Open Access lists and there is a clause which states “the product, system, or device does not provide a means to redistribute the code set.”)

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/lang",
  "status" : "ok",
  "message" : {
    "lang" : [
      {
        "text" : "Abkhazian",
        "iso3_b" : "abk",
        "code" : "ab"
      },
      {
        "text" : "Achinese",
        "iso3_b" : "ace"
      },
      .....
    ]
  }
}

Adding the parameter full=1 will cause the query to return all the repositories that assert they use that language in their interface, listed in a repos element. Many non-English interfaces are multi-lingual, and those repositories will appear in multiple lists.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

country

This lists all the countries the dataset knows about (in essence, the ISO 3166-1 codes).

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/country",
  "status" : "ok",
  "message" : {
    "country" : [
      {
        "text" : "Andorra",
        "code" : "ad"
      },
      {
        "text" : "United Arab Emirates",
        "code" : "ae"
      },
      .....
    ]
  }
}

Adding the parameter full=1 will cause the query to include all the repositories, under a repos element, that are listed [in OpenDOAR] as being from that country. OpenDOAR does not have a concept of multiple countries for a repository.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

org

This lists all the organisations in the dataset. This script will take over 15 minutes to complete… there is a LOT of data to return!

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/org",
  "status" : "ok",
  "message" : {
    "org" : {
      "1" : {
        <as per org listing>
      },
      "4": { 
        <as per org listing>
      },
      .....
    }
  }
}

Adding the parameter full=1 will cause the query to return the repositories for each organisation, listed under a repos element. Running the query with the full flag can take twenty minutes!

The repos sub-elements are, in this situation, listed as described in this post.

net

Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

Trying to get my head around the options….

OK, so here’s the problem:

An organisation can have multiple names, and it can have multiple URLs… and sometimes one can identify a straight one-to-one relationship between the two.

For example: Riga Technical University is the English name for Rīgas Tehniskā Universitāte. Being clever, I have identified http://www.rtu.lv as the home page (in its native Latvian) and http://www.rtu.lv/en as the English-language version. I can even associate the URLs as appropriate: the English name links to the English-language pages, and the Latvian name links to the Latvian-language pages.

Life is a tad more complex in other places. For example “Đại học Quốc gia Hồ Chí Minh” can be called either “National University of Ho Chi Minh” or “Ho Chi Minh City Vietnam National University” in English… yet I have only one URL: http://www.vnuhcm.edu.vn

Contrariwise: EDINA has just one name, but two URLs (http://edina.ac.uk and http://www.edina.ac.uk)

There are, naturally, some unknown number of instances where the name and the URL have not been linked – where the harvesting code was unable to make a “sensible” correlation.

The problem is working out how to model this sometimes-present relationship of many-to-many – in code, in data-returns, and on the screen.
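One possible way to hold this in code (a sketch of a single option, not a decision) is to keep names and URLs as two independent sets on the organisation, plus an optional set of explicit (name, url) links for the cases where a correlation is known. The class and method names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Organisation:
    names: set = field(default_factory=set)
    urls: set = field(default_factory=set)
    links: set = field(default_factory=set)   # known (name, url) pairings only

    def link(self, name: str, url: str) -> None:
        """Record a name and a URL that are known to belong together."""
        self.names.add(name)
        self.urls.add(url)
        self.links.add((name, url))

    def urls_for(self, name: str) -> set:
        """URLs linked to a name; with no known link, fall back to every URL."""
        linked = {u for n, u in self.links if n == name}
        return linked or set(self.urls)


rtu = Organisation()
rtu.link("Riga Technical University", "http://www.rtu.lv/en")
rtu.link("Rīgas Tehniskā Universitāte", "http://www.rtu.lv")

vnu = Organisation(
    names={"Đại học Quốc gia Hồ Chí Minh", "National University of Ho Chi Minh"},
    urls={"http://www.vnuhcm.edu.vn"},
)
```

The fallback in urls_for is a design choice: it keeps the unlinked cases usable, at the cost of hiding whether a pairing was actually asserted or merely assumed.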