Development APIs : The main api

This is the final, and main, API for the system (cf. the AJAXie get_xxx and the list functions). Currently at http://devel.edina.ac.uk:1201/cgi/api5, this is the API that the front page of OpenDepot.org uses, and it is expected to be the primary contact point for other developers.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. All return the data as a nested object, with three top-level elements:
     {
       'message' => {},
       'status'  => 'ok',
       'to'      => 'http://.....'
     }

    status is “ok” or “fail”, to is the url that made the query, and message contains the actual data being returned.
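As a rough illustration of handling that three-element wrapper on the client side, here is a minimal Python sketch. The function name unwrap and the sample reply are made up for the example; only the status, to and message keys come from the description above.

```python
import json


def unwrap(raw):
    """Parse a JSON reply and return its 'message' part.

    Raises RuntimeError when 'status' is anything other than 'ok'.
    """
    reply = json.loads(raw)
    if reply.get("status") != "ok":
        raise RuntimeError("query %r failed" % reply.get("to"))
    return reply["message"]


# A hand-written sample reply (not real output from the service).
sample = (
    '{"message": {"net": {}}, "status": "ok", '
    '"to": "http://devel.edina.ac.uk:1201/cgi/api5?ip=129.215"}'
)
print(unwrap(sample))
```

Checking status before touching message means a failed query never gets mistaken for an empty result.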

The call

The “locus” for the api search can be defined in a number of ways:

  1. You can specify an IP number to base the search on (ip=129.215)
    • If a full quad is not given, then the full range based on what is given is assumed (so 129.215 means 129.215.0.0 to 129.215.255.255)
    • If a range is defined (e.g. 129.214-129.217) then the lower and upper bounds are set accordingly (i.e. 129.214.0.0 to 129.217.255.255)
  2. You can specify a geographic location to base the search on (geo=55.95,-3)
    • The accuracy for the search depends on the numbers given: the range is always +/- 1 either side of the last decimal place given (so a bounding box of 55.94,-2.9 to 55.96,-3.1)
  3. You can specifically define an organisation ID to fix your search on (org=2736)
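The way a partial IP number expands into a range can be sketched in a few lines of Python. This is only my reading of the rule described above, not the server's actual code, and the function names are invented for the example.

```python
from typing import Tuple


def expand_quad(partial: str, fill: int) -> str:
    """Pad a partial dotted quad (e.g. '129.215') out to four parts."""
    parts = partial.split(".")
    parts += [str(fill)] * (4 - len(parts))
    return ".".join(parts)


def ip_range(spec: str) -> Tuple[str, str]:
    """Turn an ip= value into (lower, upper) bounds.

    Missing parts become 0 for the lower bound and 255 for the upper one.
    """
    lower, _, upper = spec.partition("-")
    return expand_quad(lower, 0), expand_quad(upper or lower, 255)


print(ip_range("129.215"))
print(ip_range("129.214-129.217"))
```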

You can specify multiple locus points, but how they interact needs to be made clear:

  • Locus definitions of the same type are cumulative: if you specify two IP ranges, then anything in either range is listed
    • This can lead to lots and lots of results
  • Locus definitions of different types are intersected: the results are only those on a specified network range that are also within a specified geographic location
    • This can lead to zero results
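To make the union-then-intersection rule concrete, here is a small Python model of it using sets of made-up repository IDs. It only mimics the behaviour described above; the real work happens on the server, and the function name and data shape are assumptions for the example.

```python
def combine(results_by_type):
    """Combine locus results.

    results_by_type maps a locus type ('ip', 'geo', 'org') to a list of
    result sets, one set per locus of that type.
    Same type: union. Different types: intersection.
    """
    per_type = [set().union(*sets) for sets in results_by_type.values() if sets]
    if not per_type:
        return set()
    return set.intersection(*per_type)


# Two IP ranges (unioned), then intersected with one geographic box.
print(combine({"ip": [{1, 2}, {3}], "geo": [{2, 3, 4}]}))
```

Two loci of different types that share nothing give an empty set, which is the "zero results" case.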

In addition to defining the locus for the search, the results can be restricted to repositories of a certain type, and/or to those that accept particular types of deposits.

  • type is the parameter that defines the type of repository (Institutional, Data, etc), and it’s the code number you need (see the appropriate list/type call for the known list of types)
  • content is the parameter that defines the type of content the repository accepts (pre-prints, data, learning objects, etc), and it’s the code number you need (see the appropriate list/content call for the known list of content-types)
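Putting the parameters together, a query URL could be built along these lines in Python. The parameter names (ip, geo, org, type, content, format) are the ones described above; passing several loci by repeating a parameter is my assumption, as is the helper name build_query.

```python
from urllib.parse import urlencode

BASE = "http://devel.edina.ac.uk:1201/cgi/api5"


def build_query(loci, repo_type=None, content=None, fmt="json"):
    """Build an api5 URL.

    loci is a list of (name, value) pairs such as ("ip", "129.215").
    repo_type and content are the code numbers from list/type and
    list/content.
    """
    params = list(loci)
    if repo_type is not None:
        params.append(("type", repo_type))
    if content is not None:
        params.append(("content", content))
    params.append(("format", fmt))
    # Keep the comma in geo=lat,long readable rather than %2C.
    return BASE + "?" + urlencode(params, safe=",")


print(build_query([("ip", "129.215"), ("geo", "55.95,-3")], repo_type=2, content=9))
```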

The return

The data object returned is a set of net objects (indexed by net_id), within which is a list of org objects associated with that network. Within each org object is a list of repo objects. All objects conform to the specification here.

 {
   'message' => { 'net' => { 'i38647' => { 'dec_lower' => '152.78.0.0',
                                           'dec_upper' => '152.78.255.255',
                                           'orgs' => [ { 'org_name' => 'AgentLink.org',
                                                         'org_url' => 'http://www.agentlink.org',
                                                         'repos' => [ { 'repo_name' => 'xxxxxxx',
                                                                        'repo_url'  => 'yyyyyy',
                                                                        ................
                                                                      },
                                                                      {
                                                                        .................
                                                                      } ]
                                                       },
                                                       {
                                                         .................
                                                       } ],
                                           ...........
                                         },
                             'i39677' => {
                                           .............
                                         }
                           }
                }
 }

The data is not sorted before being returned.
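Since nothing is sorted, a client that wants a stable order has to do it itself. Here is a minimal Python sketch that walks net → orgs → repos and sorts the result; it assumes the net objects sit under a 'net' key inside message (my reading of the example above) and uses invented sample data.

```python
def repo_names(message):
    """Return sorted (org_name, repo_name) pairs from an api5 message."""
    pairs = []
    for net in message.get("net", {}).values():
        for org in net.get("orgs", []):
            for repo in org.get("repos", []):
                pairs.append((org.get("org_name", ""), repo.get("repo_name", "")))
    return sorted(pairs)


# Hand-made sample shaped like the example above (not real data).
sample = {
    "net": {
        "i2": {"orgs": [{"org_name": "Zeta", "repos": [{"repo_name": "Z1"}]}]},
        "i1": {"orgs": [{"org_name": "Alpha",
                         "repos": [{"repo_name": "B"}, {"repo_name": "A"}]}]},
    }
}
print(repo_names(sample))
```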

Development APIs : The get functions

This second suite of functions (cf. the main API and the list functions) was initially created as part of a set of “data sanity checking” web pages. However, it became apparent that their usefulness went beyond my own needs, so they have been brought in line with the other functions and made generic.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘prototype’ (the only place this format is available), ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. For prototype returns, the data is formatted as an xhtml unordered list (as per the script.aculo.us/Prototype requirements), with the for attribute set to match EPrints field names.
  3. For all other returns, the data is a list of data records (as per the description here)

The queries

Currently at http://devel.edina.ac.uk:1201/cgi/get_xxx5, this is a suite of three APIs that are there to support AJAX calls.

The basic premise is that the term to be looked up is passed in a parameter “q”, and all the records that have that term somewhere in the data are returned.

Additional parameters can be used to tune the query:

  • format will define the format being returned
  • field will specify which field to query on (see the individual functions for more details on this)

The three queries are:

  • get_orgs5
  • get_nets5
  • get_repos5

get_orgs5

This query will search either the name or the url to return a list of organisations that match. By default, the name field is searched.

get_nets5

This query will search either the name or an ip number to return a list of networks that match. By default, the name field is searched, however if the script spots an IP number, it will automatically switch to an ip search.

get_repos5

This query will search either by name or url to return a list of repositories that match. By default, the name field is searched.
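A lookup URL for any of the three queries can be put together as below. This is a sketch only: the helper name lookup_url is mine, and the actual values the field parameter accepts (I use "url" in the example) are an assumption, since only the parameter itself is documented above.

```python
from urllib.parse import urlencode

BASE = "http://devel.edina.ac.uk:1201/cgi/"
QUERIES = ("get_orgs5", "get_nets5", "get_repos5")


def lookup_url(query, term, field=None, fmt=None):
    """Build a get_xxx5 URL with the search term in the q parameter."""
    if query not in QUERIES:
        raise ValueError("unknown query: %s" % query)
    params = {"q": term}
    if field:
        params["field"] = field   # assumed value names, e.g. "url"
    if fmt:
        params["format"] = fmt
    return BASE + query + "?" + urlencode(params)


print(lookup_url("get_orgs5", "Edinburgh"))
print(lookup_url("get_orgs5", "edina.ac.uk", field="url", fmt="xml"))
```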

Demos of the new APIs

To help ensure usability of the new APIs, I’ve created some example clients.

Most of the calls are done using jQuery, which has a mechanism for doing cross-domain calls – and all the scripts support that functionality if you want it.

Development APIs : The list functions

OK, so there’s some interesting data to get – but how do you get it?

There are three general APIs, or 10… depending on how you count them.

Data returns

All APIs return data in the same ways:

  1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
    • If there’s a callback parameter, and the format is json, then a crossDomain package is returned… very useful!
  2. All return the data as a nested object, with three top-level elements:
 {
   'message' => {},
   'status'  => 'ok',
   'to'      => 'http://.....'
 }

status is “ok” or “fail”, to is the url that made the query, and message contains the actual data being returned… which is dependent on the query!

The queries

Let’s start with the suite that lists things (cf. the AJAXie get_xxx functions and the main API)… currently at http://devel.edina.ac.uk:1201/cgi/list5/xxx, this is a suite of six APIs that pull out a list of things:

  • type
  • content
  • country
  • lang
  • org
  • net

type

This lists the type (or classification) of repository.

'message' => {
               'type' => [
                           {
                             'code' => 6,
                             'text' => 'Subject (Research Cross-Institutional)'
                           },
                           {
                             'code' => 11,
                             'text' => 'Other'
                           },
                           ......
                         ]
             },

code  text
1     Undetermined – Repositories whose type has not yet been assessed
2     Institutional (Institutional or departmental repositories)
3     Disciplinary (Cross-institutional subject repositories)
4     Aggregating (Archives aggregating data from several subsidiary repositories)
5     Governmental (Repositories for governmental data)
6     Subject (Research Cross-Institutional)
7     Journal (e-Journal/Publication)
8     Thesis
9     Database (Database/A&I Index)
10    Learning (Learning and Teaching Objects)
11    Other
12    Demonstration

When a repository type is needed by /api, it is the code number you need.

Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.
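Because /api wants the code number rather than the text, a client will probably want a lookup table built from the list reply. A minimal Python sketch, assuming the message has already been decoded into a dict shaped like the example above (the function name and sample entries are mine, with codes taken from the table):

```python
def codes_by_text(message, key):
    """Map each entry's 'text' to its 'code' for a list5 reply.

    key is the element name inside message, e.g. 'type' or 'content'.
    """
    return {entry["text"]: entry["code"] for entry in message.get(key, [])}


sample = {
    "type": [
        {"code": 2, "text": "Institutional (Institutional or departmental repositories)"},
        {"code": 8, "text": "Thesis"},
    ]
}
lookup = codes_by_text(sample, "type")
print(lookup["Thesis"])
```

The same helper works for the content list by passing "content" as the key.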

content

This lists the types of content that repositories accept.

  <message>
    <content>
      <code>1</code>
      <text>Research papers (pre- and postprints)</text>
    </content>
    <content>
      <code>2</code>
      <text>Research papers (preprints only)</text>
    </content>
    .....
  </message>

code  text
1     Research papers (pre- and postprints)
2     Research papers (preprints only)
3     Research papers (postprints only)
4     Bibliographic references
5     Conference and workshop papers
6     Theses and dissertations
7     Unpublished reports and working papers
8     Books & chapters and sections
9     Datasets
10    Learning Objects
11    Multimedia and audio-visual materials
12    Software
13    Patents
14    Other special item types

When a content type is to be defined in /api, it is the code number you need.

Adding the parameter full=1 will cause the query to return all the repositories that accept the content-type listed under a repos element. Note that repositories usually accept multiple content-types, so will appear under multiple entries.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

lang

This lists all the languages the dataset knows about (in essence, the ISO 639 codes).

(We are limited to ISO 639-2, as ISO 639-3 and later are not Open Access lists, and there is a clause which states “the product, system, or device does not provide a means to redistribute the code set.”)

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/lang",
  "status" : "ok",
  "message" : {
    "lang" : [
      {
        "text" : "Abkhazian",
        "iso3_b" : "abk",
        "code" : "ab"
      },
      {
        "text" : "Achinese",
        "iso3_b" : "ace"
      },
      .....
    ]
  }
}

Adding the parameter full=1 will cause the query to return all the repositories that assert they use that language in their interface, listed in a repos element. Many non-English interfaces are multi-lingual, and those repositories will appear in multiple lists.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.
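Note from the example that not every language entry carries a two-letter code (Achinese only has iso3_b), so client code should not assume it is there. A small Python sketch, with a made-up function name and the two sample entries from above:

```python
def language_index(message):
    """Index languages by their iso3_b code; the value is the two-letter
    code when the entry has one, otherwise None."""
    return {lang["iso3_b"]: lang.get("code") for lang in message.get("lang", [])}


sample = {
    "lang": [
        {"text": "Abkhazian", "iso3_b": "abk", "code": "ab"},
        {"text": "Achinese", "iso3_b": "ace"},
    ]
}
print(language_index(sample))
```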

country

This lists all the countries the dataset knows about (in essence, the ISO 3166-1 codes).

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/country",
  "status" : "ok",
  "message" : {
    "country" : [
      {
        "text" : "Andorra",
        "code" : "ad"
      },
      {
        "text" : "United Arab Emirates",
        "code" : "ae"
      },
      .....
    ]
  }
}

Adding the parameter full=1 will cause the query to include all the repositories, under a repos element, that are listed [in OpenDOAR] as being from that country. OpenDOAR does not have a concept of multiple countries for a repository.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

org

This lists all the organisations in the dataset. This script will take over 15 minutes to complete… there is a LOT of data to return!

{
  "to" : "http://devel.edina.ac.uk:1201/cgi/list5/org",
  "status" : "ok",
  "message" : {
    "org" : {
      "1" : {
        <as per org listing>
      },
      "4" : {
        <as per org listing>
      },
      .....
    }
  }
}

Adding the parameter full=1 will cause the query to return all the repositories associated with each organisation, listed under a repos element. Running the query with the full flag can take twenty minutes!

The repos sub-elements are, in this situation, listed as described in this post.

net

This lists all the networks in the dataset.

Adding the parameter full=1 will cause the query to return all the repositories associated with each network, listed under a repos element.

The repos sub-elements are indexed by repo_id. There is also a count element which will tell you how many repositories are in the set.

Introducing the new APIs

It’s been a long time coming (OK, I’ve been distracted by other things too), but the new APIs, using a new dataset, are nearly ready.

The new calls return far more data, and in a consistent way!

The new dataset is a better merging of OpenDOAR and ROAR (and it updates from those “Authoritative” sources weekly), and adds in records from the UK Access Management Federation (harvested daily) and the webometrics list of 12,000 universities (http://www.webometrics.info/ – harvested on an ad-hoc basis).

The OA Organisation Identification Service (as we are now starting to call it) is now predominantly a list of [academic] organisations, with details of networks and repositories associated with them… it is no longer a list of repositories and their organisations (as ROAR & OpenDOAR are).

How big is it?

How does 14,000 organisations, 2,700 repositories, and 6,700 networks grab you? There are 17,000 URLs and 33,000 names for these objects… it’s big… and growing bigger all the time!

If you can find me more good sources of Repositories or Academic Organisations, I’ll see about including them too!

What data is returned?


When you get data on an organisation, you get:

org_id     The ID for the org (can be used in other API calls)
lat        The latitude held for the organisation
long       The longitude held for the organisation
identites  A list of names (and URLs) for the organisation (see below for details)

Data is also pulled in from the identities data…

… the following are taken from the first identity record:

org_name
org_npri
org_acronym
org_npref
org_iri

… and these are taken from the first matching (else non-matching) URL for the first identity:

org_url
org_upri
org_checked_good
org_date_checked

When you get data on a repository, you get:

repo_id          The ID for the repository (can be used in other API calls)
lat              The latitude held for the repository
long             The longitude held for the repository
postaddress      The address the repository is located at
countrycode      The country the repository is in
oaibaseurl       The URL for OAI harvesting
softwarename     What software it uses (EPrints, DSpace, flubber, etc)
softwareversion  What version of the software
description      The main description for the repository
comment          A list of additional comments for the repository
types            A list of repository types the repository is (institutional, data, etc)
content          A list of content types the repository accepts (pre-prints, data, etc)
external_ids     A list of external ids [OpenDOAR_123, etc]
language         A list of languages used in the repository interface
sword            A list of servicedocument locations for the repository
identites        A list of names (and URLs) for the repository (see below for details)

… the following are taken from the first identity record:

repo_name
repo_npri
repo_acronym
repo_npref
repo_iri

… and these are taken from the first matching (else non-matching) URL for the first identity:

repo_url
repo_upri
repo_checked_good
repo_date_checked

When you get data on a network, you get:

net_id     The ID for the network (can be used in other API calls)
inetnum    The IP range for the network (123.234.0.0-123.234.63.255)
dec_lower  The first IP number of the range (123.234.0.0, from above)
dec_upper  The last IP number of the range (123.234.63.255, from above)
identites  A list of name(s) for the network (see below for details) – there are no URLs, obviously

… the following are taken from the first identity record:

net_name
net_npri
net_acronym
net_npref
net_iri

identities

Each entry in the array is a name for the object, with whichever name is defined as “Primary” at the start of the list.

Each identity object contains the following keys (if they exist in the database):

name     The name of the object (‘Poppleton University’, ‘Plink-Plonk Repository’, etc)
acronym  Any acronym the object may be known as (‘PU’, ‘PPR’, etc)
npref    A true/false flag that indicates whether this is the preferred term.
         (Absent means true, not false… or “There is no statement that the name is not the preferred term”)
pri      A true/false flag that indicates if the name is marked as Primary.
         Again, this flag is not always defined, as there may be only one option, or there may be no definite name that is the primary name.
iri      The Open Linked-Data IRI to get the linked-data record
nid      The database ID for the name
urls     A sub-element containing URL data for the object, as associated with the particular name.
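The “absent means true” rule for npref is easy to get wrong, so here is a short Python sketch of how a client might pick out the preferred names. It assumes the flags arrive as real booleans (or 1/0) once decoded, which may not match the actual encoding; the function name and sample are invented.

```python
def preferred_names(identities):
    """Return the names that are not explicitly marked as non-preferred.

    A missing 'npref' key counts as true, as described above.
    """
    return [ident["name"] for ident in identities if ident.get("npref", True)]


sample = [
    {"name": "Poppleton University", "acronym": "PU"},
    {"name": "Poppleton Uni", "npref": False},
    {"name": "University of Poppleton", "npref": True},
]
print(preferred_names(sample))
```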

urls

In the database, there is an association between names and URLs. This is to enable objects to have multi-lingual names, and appropriate urls for each language (e.g. Ukrainian, Russian, and English).

The urls element contains two keys: “matching” and “non_matching”, both of which are lists of url objects:

'urls' => {
            'matching'     => [
                                {....},
                                {....}
                              ],
            'non_matching' => [
                                {....},
                                {....}
                              ]
          }

If a URL is flagged as Primary, it is placed at the front of the appropriate list.

Within each url object, the following data is returned:

url   The actual URL
pri   Whether the URL is marked as a primary one
live  A true/false flag to indicate if the URL returns [a non-error] web page
date  The date that the URL was last checked.
      Note that no history is kept of the alive/not-alive checking.
      Hosts that are alive are re-checked weekly; hosts that are not flagged as alive are checked daily.
uid   The database ID for the URL
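The “first matching (else non-matching) URL” rule used for org_url and repo_url can be mimicked on the client like this. It is a sketch of the rule as I have described it, not the server code, and it assumes the keys are spelt 'matching' and 'non_matching' as in the example dump.

```python
def first_url(identity):
    """Return the first matching URL for an identity, falling back to the
    first non-matching one, or None when there are no URLs at all."""
    urls = identity.get("urls", {})
    for key in ("matching", "non_matching"):
        if urls.get(key):
            return urls[key][0].get("url")
    return None


sample = {
    "name": "EDINA",
    "urls": {
        "matching": [],
        "non_matching": [{"url": "http://edina.ac.uk"},
                         {"url": "http://www.edina.ac.uk"}],
    },
}
print(first_url(sample))
```

Because Primary URLs are placed at the front of each list, taking element zero picks up the Primary one when there is one.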

Comprehensive enough? Want more? Speak to me…

Trying to get my head around the options….

OK, so here’s the problem:

An organisation can have multiple names, and it can have multiple URLs… and sometimes one can identify a straight one-to-one relationship between the two.

For example: Riga Technical University is the English name for Rīgas Tehniskā Universitāte. Being clever, I have identified http://www.rtu.lv as the home page (in its native Latvian) and http://www.rtu.lv/en as the English-language version. I can even associate the URLs as appropriate: the English name links to the English-language pages, and the Latvian name links to the Latvian-language pages.

Life is a tad more complex in other places. For example “Đại học Quốc gia Hồ Chí Minh” can be called either “National University of Ho Chi Minh” or “Ho Chi Minh City Vietnam National University” in English… yet I have only one URL: http://www.vnuhcm.edu.vn

Contrariwise: EDINA has just one name, but two URLs (http://edina.ac.uk and http://www.edina.ac.uk)

There are, naturally, some unknown number of instances where the name and the URL have not been linked – where the harvesting code was unable to make a “sensible” correlation.

The problem is working out how to model this sometimes-present relationship of many-to-many – in code, in data-returns, and on the screen.
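One possible way to hold this in code, purely as a thinking aid and not a description of how the database does it: keep the names and URLs as two plain lists, plus a set of (name, url) pairs for the links that are actually known. Everything here (class name, method name, the matching/non_matching split in the return value) is an assumption for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Organisation:
    names: list
    urls: list
    # (name, url) pairs known to belong together; may be empty.
    links: set = field(default_factory=set)

    def urls_for(self, name):
        """Split the URLs into those linked to this name and the rest."""
        matching = [u for u in self.urls if (name, u) in self.links]
        non_matching = [u for u in self.urls if (name, u) not in self.links]
        return {"matching": matching, "non_matching": non_matching}


rtu = Organisation(
    names=["Riga Technical University", "Rīgas Tehniskā Universitāte"],
    urls=["http://www.rtu.lv", "http://www.rtu.lv/en"],
    links={("Riga Technical University", "http://www.rtu.lv/en"),
           ("Rīgas Tehniskā Universitāte", "http://www.rtu.lv")},
)
edina = Organisation(names=["EDINA"],
                     urls=["http://edina.ac.uk", "http://www.edina.ac.uk"])

print(rtu.urls_for("Riga Technical University"))
print(edina.urls_for("EDINA"))
```

With no links recorded, as for EDINA here, every URL simply lands in the non-matching list, so the unlinked cases need no special handling.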