The De-dup Challenge

Save for quite specific deposit processes such as the DepositMO Project or BioMed Central automatic article deposit, there have so far been no general-purpose tools for automated ‘random’ content delivery into IRs. Consequently, the need to identify publications in order to avoid ingesting duplicates has not been felt in Repositoryland. The Repository Junction Broker (RJB) will eventually change this by delivering content from various sources (such as publishers or subject repositories) into IRs, and de-dup at entry level will quickly become an issue.

The RJB, as a tool for automated SWORD-mediated item ingest into IRs, has so far only undergone preliminary testing with internal test DSpace, EPrints and Fedora repositories at EDINA. However, plans for testing actual content transfers to IRs at partner institutions are already being rolled out, and the de-dup issue is rapidly becoming a critical challenge. There are basically two ways of dealing with it. The first is to have the RJB leave the content at the IR’s front door for the IR manager to check whether the item has already been filed in the IR (this is the default procedure for RJB). The other would be to use some kind of article identifier to detect potential duplicates. But this is a hard one, of course: most content sources tend to use their own internal identifiers, and although there are wide-scope identifiers such as the Scopus# or the WoK#, these are not used widely enough to be of any real help. At best you can get PubMed IDs (which would be fine if IR managers routinely collected them as part of the relevant metadata; regrettably, most do not) or DOIs. But even then we’re only talking about partial IR content coverage.
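As a rough illustration of the second approach, identifier-based matching could look something like the sketch below. This is not how the RJB works; the record layout (plain dictionaries with optional `doi` and `pmid` keys) and the function names are assumptions made up for the example. An incoming item with no usable identifier is simply handed back for manual checking, which mirrors the 'front door' default described above.

```python
import re


def normalize_doi(doi):
    """Lower-case a DOI and strip common resolver or 'doi:' prefixes."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def identifier_keys(record):
    """Collect the comparable identifiers of a record (hypothetical field names)."""
    keys = set()
    if record.get("doi"):
        keys.add(("doi", normalize_doi(record["doi"])))
    if record.get("pmid"):
        keys.add(("pmid", str(record["pmid"]).strip()))
    return keys


def find_duplicates(incoming, repository_records):
    """Return repository records sharing a DOI or PubMed ID with the incoming item.

    Returns None when the incoming item carries no identifier at all,
    meaning it has to be left for the IR manager to check by hand.
    """
    incoming_keys = identifier_keys(incoming)
    if not incoming_keys:
        return None
    return [r for r in repository_records if incoming_keys & identifier_keys(r)]
```

Note that a sketch like this only catches duplicates where both the incoming item and the existing IR record carry the same kind of identifier, which is exactly the partial-coverage problem mentioned above.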

In the same way that ORCID is a potential solution for author identification, we probably need similar progress in document identification, de-dup strategies and metadata enhancement at IRs.

EDINA Blogs