The repository is watching: automated harvesting from replicated filesystems

One of the final things I’m looking at on this jiscPUB project is a demonstration of a new class of tool for managing academic projects, not just documents. For a while we were calling this idea the Desktop Repository: repository services would watch your entire hard disk and expose all the content in a local website with repository and content management services. That’s possibly a very useful class of application for some academics, but in this project we are looking at a slightly different slant on the idea.

The core use case I’m illustrating here is thesis writing, but the same workflow would be useful across a lot of academic projects, including all the things we’re focussing on in the jiscPUB project: academic users managing their portfolio of work, project reporting and courseware management. This tool is about a lot more than just ebook publishing, but I will look at that aspect of it, of course.

In this post I will show some screenshots of The Fascinator repository in action, talk about how you can get involved in trying it out, and finish with some technical notes about installation and setup. I was responsible for leading the team that built this software at the University of Southern Queensland. Development is now being done at the University of Central Queensland and the Queensland Cyber Infrastructure Foundation where Duncan Dickinson and Greg Pendlebury continue work on the ReDBox research data repository which is based on the same platform.

I know Theo Andrew at Edinburgh is keen to get some people trying this, so this blog post will serve to introduce it and give his team some ideas. We’ll follow up on their experiences if there are useful findings.

Managing a thesis

The short version of how this thesis story might work is:

  • The university supplies the candidate with a dropbox-like shared file system they can use from pretty much any device to access their stuff. But there’s a twist: a web-based repository watches the shared folder and exposes everything in it to the web.

  • The university helpfully adds into the share a thesis template that’s ready to go, complete with all the cover page stuff, margins all set, automated tables of contents for sections, tables and figures, and the right styles, and trains the candidate in the basics of word processing.

  • The candidate works away on their project, keeping all their data, presentations, notes and so on in the Dropbox and filling out the thesis template as they go.

  • The supervisor can drop in on the work in progress and leave comments via an annotation system.

  • At any time, the candidate can grab a group of things, which we call a package, and publish it to a blog or deposit it in a repository at the click of a button. This includes not just documents, but data files (the ones that are small enough to keep in a replicated file system), images, presentations etc.

  • The final examination process could be handled using the same infrastructure, and the university could make its own packages of all the examiners’ reports etc for deposit into a closed repository.

The result is web-based, web-native scholarship where everything is available in HTML, not just PDF or application file formats, and there are easy ways to route content to other repositories or publish it in various ways.

Where might ebook dissemination fit into this?

Well, pretty much anywhere in the above that someone wants to either take a digital object ‘on the road’ or deposit it in a repository of some kind as a bounded digital thing.

Demonstration

I have put a copy of Joss Winn’s MA thesis into the system to show how it works. It is available in the live system (note that this might change if people play around with it). I took an old OpenOffice .sxw file Joss sent me and changed the styles a little bit to use the ICE conventions. I’m writing up a much more detailed post about templates in general, so stay tuned for a discussion of the pros and cons of various options for choosing style names and conventions and whether or not to manage the document as a single file or multiple chapters.

Illustration 1: The author puts their stuff in the local file system, in this case replicated by Dropbox.

Illustration 2: A web-view of Joss Winn’s thesis.

The interface provides a range of actions.

Illustration 3: You can do things with content in The Fascinator, including blogging and export to zip or (experimental) EPUB.

The EPUB export was put together as a demonstration for the Beyond The PDF effort by Ron Ward. At the moment it only works on packages, not individual documents, and it uses some internal Python code to stitch together documents, rather than calling out to Calibre as I did in earlier work on this project. The advantage of doing it this way is that you don’t have Calibre adding extra stuff and reprocessing documents to add CSS. The disadvantage is that a lot of what Calibre does is useful, for example working around known bugs in reader software, though it does tend to change formatting on you, not always in useful ways.
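
To make the stitching step concrete, here is a minimal sketch in Python (this is not Ron Ward’s actual code; the function name and chapter structure are invented for illustration). An EPUB is a zip whose first entry must be an uncompressed mimetype file, plus a META-INF/container.xml pointing at an OPF package file that lists the content documents in a manifest and spine:

```python
import zipfile

def make_minimal_epub(path, title, chapters):
    """Build a bare-bones EPUB 2 file from (filename, xhtml) pairs.
    Illustrative only -- a real tool also needs a toc.ncx, richer metadata, etc."""
    opf_items = "\n".join(
        f'<item id="c{i}" href="{name}" media-type="application/xhtml+xml"/>'
        for i, (name, _) in enumerate(chapters))
    opf_spine = "\n".join(f'<itemref idref="c{i}"/>' for i in range(len(chapters)))
    opf = f"""<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:identifier id="id">urn:example:demo</dc:identifier>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>{opf_items}</manifest>
  <spine>{opf_spine}</spine>
</package>"""
    container = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles><rootfile full-path="content.opf" media-type="application/oebps-package+xml"/></rootfiles>
</container>"""
    with zipfile.ZipFile(path, "w") as z:
        # the mimetype entry must come first and be stored uncompressed
        z.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", container)
        z.writestr("content.opf", opf)
        for name, xhtml in chapters:
            z.writestr(name, xhtml)
```

The point is just how little packaging stands between a set of HTML documents and a valid-shaped EPUB; the hard part is producing clean XHTML in the first place.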

I put the EPUB into the dropbox so it is available in the demo site (you need to expand the Attachments box to get the download; that’s not great usability, I know). Or you can go to the package and export it yourself. Log in first, using admin as the username and the same for the password.

Illustration 4: Joss Winn’s thesis exported as EPUB.

I looked at a different way of creating an EPUB book from the same thesis a while ago; that version will be available for a while at the Calibre server I set up.

One of the features of this software is that more than one person can look at the web site and there are extensive opportunities for collaboration.

Illustration 5: Colleagues and supervisors can leave comments via inline annotation (including annotating pictures and videos).

Illustration 6: Annotations are threaded discussions.

Illustration 7: Images and videos can be annotated too. At USQ we developed a JavaScript toolkit called Anotar for this, the idea being that you could add annotation services to any web site quickly and easily.

This thesis package only contains documents, but one of the strengths of The Fascinator platform is that it can aggregate all kinds of data, including images, spreadsheets and presentations, and it can be extended to deal with any kind of data file via plugins. I have added another package, modestly calling itself the research object of the future, using some files supplied by Phil Bourne for the Beyond the PDF group. The Fascinator makes web views of all the content and can package it all as a zip file or an EPUB.

Illustration 8: A spreadsheet rendered into HTML and published into an EPUB file (demo quality only).

This includes turning PowerPoint into a flat web page.

Illustration 9: A presentation exported to EPUB along with data and all the other parts of a research object.

Installation notes

Installing The Fascinator (I did it on Amazon’s EC2 cloud on Ubuntu 10.04.1 LTS) is straightforward. These are my notes, not intended to be a detailed how-to, but possibly enough for experienced programmers/sysadmins to work it out.

  • Check it out.

    sudo svn co https://the-fascinator.googlecode.com/svn/the-fascinator/trunk /opt/fascinator
  • Install Sun’s Java

    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:sun-java-community-team/sun-java6
    sudo apt-get update
    sudo apt-get install sun-java6-jdk

    http://stackoverflow.com/questions/3747789/how-to-install-the-sun-java-jdk-on-ubuntu-10-10-maverick-meerkat/3997220#3997220

  • Install Maven 2.

    sudo apt-get install maven2
  • Install ICE or point your config at an ICE service. I have one running for the jiscPUB project; you can point to it by changing the ~/.fascinator/system-config.json file.

  • Install Dropbox or your file replication service of choice. This takes a little bit of work on a headless server, but there are instructions linked from the Dropbox.com site.

  • Make some configuration changes, see below.

  • To run ICE and The Fascinator on their default ports on the same machine, add this to /etc/apache2/apache.conf (I think the proxy modules I’m using here are non-standard).

    LoadModule  proxy_module /usr/lib/apache2/modules/mod_proxy.so
    LoadModule  proxy_http_module /usr/lib/apache2/modules/mod_proxy_http.so
    ProxyRequests Off
    <Proxy *>
    Order deny,allow
    Allow from all
    </Proxy>
    ProxyPass        /api/ http://localhost:8000/api/
    ProxyPassReverse /api/  http://localhost:8000/api/
    ProxyPass       /portal/ http://localhost:9997/portal/
    ProxyPassReverse /portal/ http://localhost:9997/portal/
  • Run it.

    cd /opt/fascinator
    ./tf.sh restart

Configuration follows:

  • To set up the harvester, add this to the empty jobs list in ~/.fascinator/system-config.json

"jobs" : [
    {
        "name": "dropbox-public",
        "type": "harvest",
        "configFile": "${fascinator.home}/harvest/local-files.json",
        "timing": "0/30 * * * * ?"
    }
]

And change /harvest/local-files.json to point at the Dropbox directory:

"harvester": {
    "type": "file-system",
    "file-system": {
        "targets": [
            {
                "baseDir": "${user.home}/Dropbox/",
                "facetDir": "${user.home}/Dropbox/",
                "ignoreFilter": ".svn|.ice|.*|~*|Thumbs.db|.DS_Store",
                "recursive": true,
                "force": false,
                "link": true
            }
        ],
        "caching": "basic",
        "cacheId": "default"
    }
}

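I have not checked how The Fascinator applies that ignoreFilter internally, but a pipe-separated pattern list like the one above can be matched with a few lines of Python; this sketch is only meant to show what the setting does:

```python
import fnmatch

# the ignoreFilter value from the harvest config above
IGNORE_FILTER = ".svn|.ice|.*|~*|Thumbs.db|.DS_Store"
PATTERNS = IGNORE_FILTER.split("|")

def is_ignored(name):
    """True if a file or directory name matches any ignore pattern."""
    return any(fnmatch.fnmatchcase(name, pat) for pat in PATTERNS)
```

So hidden files (`.*`), word processor lock/backup files (`~*`) and OS thumbnail databases are skipped, while ordinary documents in the Dropbox are harvested.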
To add the EPUB support and the red branding, unzip the skin files in this zip file into the portal/default/ directory: http://ec2-50-19-86-198.compute-1.amazonaws.com/portal/default/download/551148ce6d80bfc0c9c36914f9df4f91/jiscpub.zip

unzip -d /opt/fascinator/portal/src/main/config/portal/default/ jiscpub.zip

Copyright Peter Sefton, 2011-07-12. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Making EPUB from WordPress (and other) web collections

Background

As part of Workpackage 3 I have been looking at WordPress as a way of creating scholarly monographs. This post carries on from the last couple, but it’s not really about EPUB or about WordPress, it’s about interoperability and how tools might work together in a Scholarly HTML mode so that people can package and repackage their resources much more reliably and flexibly than they can now.

While exploring WordPress I had a look at the JISC funded KnowledgeBlog project. The team there has released a plugin for WordPress to show a table of contents made up of all the posts in a particular category. It seemed that with a bit of enhancement this could be a useful component of a production workflow for book-like projects, particularly for project reports and theses (where they are being written online in content management systems; maybe not so common now, but likely to become more common) and for course materials.

Recently I looked at Anthologize, a WordPress-based way of creating ebooks from HTML resources sourced from around the web (I noted a number of limitations which I am sure will be dealt with sooner or later). Anthologize is using a design pattern that I have seen a couple of times with EPUB: converting the multiple parts of a project to an XML format that already has some tools for rendering, and using those tools to generate outputs like PDF or EPUB. Asciidoc does this using the DocBook tool-chain and Anthologize uses TEI tools. I will write more on this design pattern and its implications soon. There is another obvious approach: to leave things in HTML and build books from that, for example using Calibre, which already has ways to build ebooks from HTML sources. This approach could be added to Anthologize very easily, to complement the TEI approach.

So, I have put together a workflow using Calibre to build EPUBs straight from a blog.

Why would you want to do this? Two main reasons. Firstly, to read a report, thesis or course, or an entire blog on a mobile device. Secondly, to be able to deposit a snapshot of same into a repository.

In this post I will talk about some academic works:

The key to this effort is the KnowledgeBlog table of contents plugin ktoc, with some enhancements I have added to make it easier to harvest web content into a book.

The results are available on a Calibre server I’m running in the Amazon cloud just for the duration of this project. (The server is really intended for local use; the way I am running it behind an Apache reverse proxy it doesn’t seem very happy, and you may have to refresh a couple of times until it comes good.) This is rough. It is certainly not production quality.


These books are created using calibre ‘recipes’: available here. You run them like this:

ebook-convert thesis-demo.recipe .epub --test

If you are just trying this out, be kind to site owners: --test will cause it to fetch only a couple of articles per feed.

I added them to the calibre server like this:

calibredb add --library-path=./books thesis-demo.epub

The projects page at my site has two TOCs for two different projects.

[ktoc cat="jiscPUB" title="Digital Monograph Technical Landscape study #jiscPUB" show_authors="false" orderby="date" toc_author="Peter Sefton"]

[ktoc cat="ScholarlyHTML" title="Scholarly HTML posts" orderby="date" show_authors="false" toc_author="Peter Sefton" ]

The title is used to create sections in the book; in both cases the posts are displayed in date order, and I am not showing the name of the author on the page because that’s not needed when it is all me.

The resulting book has a nested table of contents, seen here in Adobe Digital Editions.

Illustration 1: A book built from a WordPress page with two table of contents blocks generated from WordPress categories.

Read on for more detail about the process of developing these things and some comments about the problems I encountered working with multiple conflicting WordPress plugins, etc.

The Scholarly HTML way to EPUB

The first thing I tried in this exploration was writing a recipe to make an EPUB book from a Knowledge Blog, for the Ontogenesis project. It is a kind of encyclopaedia of ontology development maintained in a WordPress site with multiple contributors. It worked well, for a demonstration, and did not take long to develop. The Ontogenesis recipe is available here and the resulting book is available on the Calibre server.

But there was a problem.

The second blog I wanted to try it on was my own, so I installed ktoc, changed the URL in the recipe and ran it. Nothing. The problem is that Ontogenesis and my blog use different WordPress themes, so the structure is different. Recipes have stuff like this in them to locate the parts of a page, such as <p class='details_small'>:

remove_tags_before = dict(name='p', attrs={'class':'details_small'})

remove_tags_after = dict(name='div', attrs={'class':'post_content'})

That’s for Ontogenesis; different rules are needed for other sites. You also need code to find the table of contents amongst all the links on a WordPress page, and to deal with pages that might have two or more ktoc-generated tables for different sections of a journal, or parts of a project report.

Anyway, I wrote a different recipe for my site, but as I was doing so I was thinking about how to make this easier. What if:

  • The ktoc plugin output a little more information in its list of posts that made it easy to find no matter what WordPress theme was being used.

  • The actual post part of each page (ie not the navigation, or ads) identified itself as such.

  • The same technique could be extended to other websites in general.

There is already a standard way to do the most important part of this, listing a set of resources that make up an aggregated resource; the Object Reuse and Exchange specification, embedded in HTML using RDFa. ORE in RDFa. Simple.

Well no, it’s not, unfortunately. ORE is complicated and has some very important but hard-to-grasp abstractions, such as the difference between an Aggregation and a Resource Map. An Aggregation is a collection of resources which has a URI, while a Resource Map describes the relationship between the Aggregation and the resources it aggregates. These things are supposed to have different URIs. Now, for a simple task like making a table of contents of WordPress posts machine-readable so you can throw together a book, these abstractions are not really helpful to developers or consumers. But what if there were a simple recipe/microformat (what we call a convention in Scholarly HTML) to follow, which was ORE compliant and also simple to implement at both the server and client end?

What I have been doing over the last couple of days, as I continue this EPUB exploration, is trying to use the ORE spec in a way that will be easy to implement, say in the Digress.it TOC page, or in Anthologize, while still being ORE compliant. That discussion is ongoing, and will take place in the Google groups for Scholarly HTML and ORE. It is worth pursuing because if we can get it sorted out then, with a few very simple additions to the HTML they spit out, any web system can get EPUB export quickly and cheaply by adhering to a narrowly defined profile of ORE, subject to the donor service being able to supply reasonable-quality HTML. More sophisticated tools that do understand RDFa and ORE will be able to process arbitrary pages that use the Scholarly HTML convention, but developers can choose the simpler convention over a full implementation for some tasks.

The details may change, as I seek advice from experts, but basically, there are two parts to this.

Firstly there’s adding ORE semantics to the ktoc (or any) table of contents. It used to be a plain-old unordered list, with list items in it:

<p><strong>Articles</strong></p>
<ul>
<li><a href="http://ontogenesis.knowledgeblog.org/49">Automatic
maintenance of multiple inheritance ontologies</a> by Mikel Egana
Aranguren</li>
<li><a href="http://ontogenesis.knowledgeblog.org/257">Characterising
Representation</a> by Sean Bechhofer and Robert Stevens</li>
<li><a href="http://ontogenesis.knowledgeblog.org/1001">Closing Down
the Open World: Covering Axioms and Closure Axioms</a> by Robert
Stevens</li>
</ul>

The list items now explicitly say what is being aggregated. The plain old <li> becomes:

<li  rel="http://www.openarchives.org/ore/terms/aggregates"
resource="http://ontogenesis.knowledgeblog.org/49">

(The fact that this is an <li> does not matter, it could be any element.)

And there is a separate URI for the Aggregation and resource map courtesy of different IDs. And the resource map says that it describes the Aggregation as per the ORE spec.

<div id="AggregationScholarlyHTML">

<div rel="http://www.openarchives.org/ore/terms/describes" resource="#AggregationScholarlyHTML" id="ResourceMapScholarlyHTML" about="#ResourceMapScholarlyHTML">

It is verbose, but nobody will have to type this stuff. What I have tried to do here (and it is a work in progress) is to simplify an existing standard which could be applied in any number of ways, and boil it down to a simple convention that’s easy to implement but that still honours the more complicated specifications in the background. (Experts will realise that I have used an RDFa 1.1 approach here, meaning that current RDFa processors will not understand it; this is so that we don’t have to deal with namespaces and CURIEs, which complicate processing for non-native tools.)

Secondly, the plugin wraps a <div> element around the content of every post to label it as being Scholarly HTML; this is a way of saying that this part of the whole page is the content that makes up the article, thesis chapter or similar. Without a marker like this, finding the content is a real challenge: pages are loaded up with all sorts of navigation, decoration and advertisements, the structure is different on just about every site, and it can change at the whim of the blog owner if they change themes.

<div rel="http://scholarly-html.org/schtml">

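To show that the convention really is cheap to consume, here is a sketch of a client that pulls the aggregated resource URIs out of a page using only Python’s standard html.parser. It is illustrative only and deliberately ignores full RDFa processing; it just looks for the rel/resource attributes described above:

```python
from html.parser import HTMLParser

ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

class AggregationParser(HTMLParser):
    """Collect resource URIs from elements carrying the ORE 'aggregates' rel."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        # the convention allows the marker on any element, not just <li>
        a = dict(attrs)
        if a.get("rel") == ORE_AGGREGATES and "resource" in a:
            self.resources.append(a["resource"])

def aggregated_resources(html):
    """Return the list of aggregated URIs found in a TOC page."""
    p = AggregationParser()
    p.feed(html)
    return p.resources
```

A book-building tool would then fetch each of those URIs and look for the Scholarly HTML content marker on each page.
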
Why not define an even simpler format?

It would be possible to come up with a simple microformat that had nice human-readable class attributes or something to mark the parts of a TOC page. I didn’t do that because people would rightly point out that ORE exists, and we would end up with a convention that covered a subset of the existing spec, making it harder for tool makers to cover both and less likely that services will interoperate.

So why not just use general ORE and RDFa?

There are several reasons:

  • Tool support is extremely limited for client- and server-side processing of full RDFa, for example in supporting the way namespaces are handled in RDFa using CURIEs. (Sam Adams has pointed out that it would be a lot easier to debug my code if I did use CURIEs and RDFa 1.0, so I followed his advice, did some searching and replacing, and checked that the work I am doing here is indeed ORE compliant.)

  • The ORE spec is suited only for experienced developers with a lot of patience for complexities like the difference between an aggregation and a resource map.

  • RDFa needs to apply to a whole page, with the correct document type and that’s not always possible to do when we’re dealing with systems like WordPress. The convention approach means you can at least produce something that can become proper RDFa if put into the right context.

Why not use RSS/Atom feeds?

Another way to approach this would be to use a feed, in RSS or Atom format. WordPress has good support for feeds; there’s one for just about everything. So you can look at all the posts on my website:

http://ptsefton.com/category/uncategorized/feed/atom

or use Tony Hirst’s approach to fetch a single post from the jiscPUB blog:

http://jiscpub.blogs.edina.ac.uk/2011/05/23/a-view-from-academia-on-digital-humanities/feed/?withoutcomments=1

The nice thing about this single post technique is that it gives you just the content in a content element so there is no screen scraping involved. The problem is that the site has to be set up to provide full HTML versions of all posts in its feeds or you only get a summary. There’s a problem with using feeds on categories too, I believe, in that there is an upper limit to how many posts a WordPress site will serve. The site admin can change that to a larger number but then that will affect subscribers to the general purpose feeds as well. They probably don’t want to see three hundred posts in Google Reader when they sign up to a new blog.

Given that Atom (the best standardised and most modern feed format) is one of the official serialisation formats for ORE it is probably worth revisiting this question later if someone, such as JISC, decides to invest more in this kind of web-to-ebook-compiling application.
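
For completeness, here is a sketch of how a harvester might tell full-content Atom entries apart from summary-only ones using Python’s standard library. The feed snippet in the test is invented, and a real tool would also need to fetch the feed and handle errors:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def entries_with_content(feed_xml):
    """Return (title, content) pairs for Atom entries that carry full content.
    Entries that only provide a <summary> are skipped -- those are the ones
    that force you back to screen scraping."""
    root = ET.fromstring(feed_xml)
    out = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        content = entry.find(ATOM + "content")
        if content is not None and (content.text or len(content)):
            out.append((title, ET.tostring(content, encoding="unicode")))
    return out
```

If a site only serves summaries, this returns nothing useful, which is exactly the limitation described above.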

What next?

There are some obvious things that could be done to further this work:

  • Set up a more complete and robust book server which builds and rebuilds books from particular sites and distributes them in some way, using Open Publication Distribution System (OPDS) or something like this thing that sends stuff to your Kindle.

  • Write a ‘recipe factory’. With a little more work the ScholarlyHTML recipe can be got to the point where the only required variable is a single page URL; everything else can be harvested from the page or over-ridden by the recipe.

  • Combine the above to make a WordPress plugin that can create EPUBs from collections of in-built content (tricky because of the present Calibre dependency, but it could be re-coded in PHP).

  • Add the same ScholarlyHTML convention for ORE to other web systems such as the Digress.it plugin and Anthologize. Anthologize is appealing because it allows you to order resources in ‘projects’ and nest them into ‘parts’ rather than being based on simple queries but at the moment it does not actually have a way to publish a project directly to the web.

  • Explore the same technique in the next phase of WorkPackage 3 when I return to looking at word processing tools and examine how cloud replication services like DropBox might help people to manage book-like projects that consist of multiple parts.

Postscript: Lessons and things that need fixing or investigating

I encountered some issues. Some of these are mentioned above but I wanted to list them here as fodder for potential new projects.

  • As with Anthologize, if you use the WordPress RSS importer to bring in content, it does not change the links between posts so that they point to the new location. Likewise with importing a WordPress export file.

  • The RSS importer applied to the thesis created hundreds of blank categories.

  • I tried to add my ktoc plugin to a Digress.it site, but ran into problems. It uses PHP’s simplexml parser which chokes on what I am convinced is perfectly valid XML in unpredictable ways. And the default Digress.it configuration expects posts to be formatted in a particular way as a list of top-level paragraphs, rather than with nested divs. I will follow this up with the developers.

  • Calibre does a pretty good job of taking HTML and making it into EPUBs but it does have its issues. I will work through these on the relevant forums as time permits.

    • There are some encoding problems with the table of contents in some places. Might be an issue with my coding in the recipes.

    • Unlike other Calibre workflows, such as creating books from raw HTML, ebook-convert adds navigation to each HTML page in the book created by a recipe. This navigation is redundant in an EPUB, but apparently it would require a source code change to get rid of it.

    • It does something complicated to give each book its style information. There are some odd presentation glitches in the samples as a result of Calibre’s algorithms. This requires more investigation.

    • It doesn’t find local links between parts of a book (ie links from one post to another which occur a lot in my work and in Tony’s course), but I have coded around that in the Scholarly HTML recipes.

It will be up to Theo Andrew, the project manager, whether any of these next steps or issues get attention during the rest of this project.

Copyright Peter Sefton, 2011-05-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

How to add EPUB support to EPrints

In a previous post here on the jiscPUB project I said it would be good for the EPrints repository software to support EPUB uploads.

I’d love to do something with a repository – I’m thinking that it would be great to deposit theses in EPUB format – and the repository could provide a web-based reader, along the lines of IbisReader, which Liza Daly and company created. I’m looking at you, Eprints! Eprints already almost supports this: if you upload a zip file it will stash all the parts for you in a single record. All we would need would be something like this little reader my colleagues at USQ made. It would just be a matter of transforming the EPUB TOC into JSON, and loading the JavaScript into an Eprints page.

I called Les Carr’s attention to the post and he responded:

lescarr @ptsefton just tell us what to do and we’ll do it.

OK. Here goes with my specification for how EPrints could add at least basic support for EPUB.

Putting EPUB into EPrints as-is

To explore this, I ran the EPrints live CD (livecd_v3.1-x.iso) under VirtualBox on Windows 7 – this worked well when I gave it a decent amount of memory – it didn’t manage to boot in several hours at 256Mb. (Note that no repositories were harmed in the making of this post – I did not change the Eprints code at all.)

The EPUB format is a zip file containing XHTML payload documents, a manifest, and a table of contents. On one level EPrints already supports this, in that there is support for uploading ZIP files. I tested this using Danny Kingsley’s thesis (as received, with no massaging or adding of metadata apart from tweaking the title in Word) converted to EPUB via the ICE service I have been working on.
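
As a concrete illustration of that “zip file with a manifest” structure, a repository could locate the OPF package document with a few lines of Python; this is a sketch of the standard META-INF/container.xml lookup, not anything EPrints does today:

```python
import zipfile
import xml.etree.ElementTree as ET

# namespace used by the OCF container.xml file
CN = "{urn:oasis:names:tc:opendocument:xmlns:container}"

def opf_path(epub_file):
    """Return the archive path of the OPF package document inside an EPUB."""
    with zipfile.ZipFile(epub_file) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
    rootfile = container.find(f"{CN}rootfiles/{CN}rootfile")
    return rootfile.get("full-path")
```

From the OPF a repository could read the title, the spine order and the table of contents, which is all it needs to build navigation instead of showing a flat list of parts.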

The procedure:

  1. Generated an EPUB using ICE.
  2. Changed the file extension to .zip.
  3. Uploaded it into EPrints.

The result is an EPrints item with many parts. If you click on any of the HTML files that make up the thesis, they work as web pages – ie the table of contents (if you can find it amongst the many files) links to the other pages. But there is no navigation to tie it all together; you have to keep hitting back – each HTML page from the EPUB is a stand-alone fragment.

Illustration 1: The management interface in EPrints showing all the parts of an EPUB file which has been uploaded and saved as a series of parts in a single record.


At this point I went off on a side trip and wrote this little tool to add an HTML view to an EPUB file.

Putting enhanced EPUB into Eprints

Now, let’s try that again with the version where I added an HTML index page to the EPUB using the new demo tool, epub2html. I uploaded the file, clicked around semi-randomly until I figured out how to see all the files listed from the zip, and selected index.html as the ‘main’ file. From memory I thought the repository would do that for me, but it didn’t. Anyway, I ended up with this:

Illustration 2: The details screen that users see – clicking on the description takes you to the HTML page I picked as the main file.

Illustration 3: A rudimentary ebook reader using an inline frame.

If I click on the link starting with Other, there we have it – more-or-less working navigation within the limits of this demo-quality software. All I had to do was change the extension from .epub to .zip and select the entry page, and I had a working, navigable document.

The initial version of epub2html used the unsupported epubjs as a web-based reader application – but Liza Daly suggested I use the more up-to-date Monocle.js library instead. I tried that, but I’m afraid the amount of setup required is too much for the moment, so what you see here is an HTML page with an inline frame for the content.

What does the repository need to do?

So what does the EPrints team need to do to support EPUB a bit better?

  • Add EPUB to the list of recognised files.
  • Upon recognising an EPUB…
    • Use a service like epub2html that can generate an HTML view of the EPUB. I wrote mine in Python, Eprints is written in Perl but I’m sure that can be sorted out via a re-write or a web service or something*.
    • Allow the user to download the whole EPUB, or choose to use an online viewer. Could be static HTML, frames (not nice), or some kind of JavaScript based viewer.
    • Embed some kind of viewer in the EPrints page itself, or at least provide a back-link in the document viewer to the EPrints page.

Does that make sense, Les?

Copyright Peter Sefton, 2011-04-15. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.



* Maybe there’s a Python interpreter written in Perl?


Introducing Epub2Html – adding a plain HTML view to an EPUB

Background

EPUB ebook files are useful if you have an application to read them, but not everyone does. We have been discussing this in the Scholarly HTML movement; to some of us EPUB looks like a good general-purpose packaging format for scholarship. Not just for HTML (if you can make it XHTML, that is) but potentially for the other stuff that makes up a research object, such as data files or provenance information. One of the big problems, though, is that the format is still not that widely known; what is a researcher to do when they are given a file ending in .epub? That question remains unresolved for the moment, but in this post I will talk about one small step towards making EPUB more useful in the general academic community.

This week I was looking at the potential for EPUB support in repositories, which I will cover in my next post. An EPUB is full of HTML, but it’s not something that is necessarily straightforward to display on the web. jiscPUB colleague Liza Daly’s company has a product called IbisReader that serves EPUB over the web, and she also worked on BookWorm, parts of which are available as open source.

What I wanted was a bit different – to add something equivalent to a README file to an EPUB, so that people could read the content and web site or repository managers would be able to do something with it. So, I wrote a small tool, intended as a demonstrator only, which:

  • Generates a plain HTML table of contents.
  • Adds an index.html page to the root of an EPUB (this is legit, it gets added to the manifest as well, but not the TOC) with a simple frame-based navigation system so if you can open the EPUB zip, you can browse it.
  • Bundles in a lightweight JavaScript viewer. Initially I tried the Paquete system from USQ, but it turned out to have a few more issues than I had hoped. For this first release I have used a bit of Liza’s code from a couple of years ago, epubjs, with a couple of modifications. Status? Works for me.
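In outline, the index-adding step looks something like the sketch below. This is a simplification, not the actual epub2html source – for one thing, a real implementation also has to keep the uncompressed mimetype entry first in the zip – and the `html-index` id is my own invention:

```python
import zipfile
import xml.etree.ElementTree as ET

CONTAINER = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "http://www.idpf.org/2007/opf"

def add_index_html(src, dest, main_file):
    # Copy every entry across, patching the OPF manifest on the way
    # through, then drop a frame-based index.html into the zip root.
    ET.register_namespace("", OPF_NS)
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dest, "w") as zout:
        container = ET.fromstring(zin.read("META-INF/container.xml"))
        opf_path = container.find(f".//{CONTAINER}rootfile").get("full-path")
        for name in zin.namelist():
            data = zin.read(name)
            if name == opf_path:
                opf = ET.fromstring(data)
                manifest = opf.find(f"{{{OPF_NS}}}manifest")
                # register index.html in the manifest (the href is
                # relative to the OPF file, so step back up to the root)
                ET.SubElement(manifest, f"{{{OPF_NS}}}item", {
                    "id": "html-index",
                    "href": "../" * opf_path.count("/") + "index.html",
                    "media-type": "application/xhtml+xml"})
                data = ET.tostring(opf)
            zout.writestr(name, data)
        # plain HTML entry page: inline-frame fallback for the content
        zout.writestr("index.html",
                      "<html><body><h1>Contents</h1>"
                      f'<iframe src="{main_file}" width="100%" '
                      'height="90%"></iframe></body></html>')
```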

Demo

So here’s what it looks like in real life, warts and all.

I used the test file I was working on earlier in the week with embedded metadata.

Illustration 1: Test EPUB from Edinburgh thesis template, with added metadata, in Adobe Digital Editions

I ran the new code:

python epub2html.py Edinburgh-ThesisSingleSided-plus-inline-metadata.epub

Which made a new file. (It does make epubcheck complain, but that’s mostly to do with HTML attributes it doesn’t like, not EPUB structural problems.)

Edinburgh-ThesisSingleSided-plus-inline-metadata-html.epub

Now, if I unzip it there is an index.html, and some JavaScript from epubjs. In Firefox that looks like this.


Illustration 2: HTML view of the EPUB being served from the file system, using epubjs for navigation

But, if the JavaScript is not working, then you can still see the content courtesy of the less than ideal inline frame:

Illustration 3: Fall-back to plain HTML with no JavaScript; the index.html file has an inline frame for the EPUB content. Not elegant, but it lets the content be seen.

Trying it out / the future

If you want to try this out, or help out, you can get the tool from Google Code.

svn co https://integrated-content-environment.googlecode.com/svn/branches/temp-2011/epub2html

There are lots of things to do, like adding command-line options for output files, extracting the EPUB+HTML for immediate use (after safety-checking it), and choosing whether to bundle the JavaScript in the EPUB or link to it via the web. Does anyone want this? Let us know.
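For instance, the suggested options could look something like this argparse sketch. None of these flags exist yet – the current epub2html.py takes just a single filename argument:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical interface only; flag names are suggestions.
    p = argparse.ArgumentParser(prog="epub2html.py")
    p.add_argument("epub", help="input EPUB file")
    p.add_argument("-o", "--output",
                   help="output file (default: <input>-html.epub)")
    p.add_argument("--extract", action="store_true",
                   help="also unzip the result for immediate serving")
    p.add_argument("--link-js", metavar="URL",
                   help="link the JavaScript viewer from the web "
                        "instead of bundling it in the EPUB")
    return p.parse_args(argv)
```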

One of the things I like about Paquete is that it generates # URLs for the different pages you view, making it possible to bookmark chapters, like this: http://demo.adfi.usq.edu.au/paquete/demo/#configuration.htm. I will explore whether this can be added to epubjs, or whether it is worth pressing on with Paquete, which does have some more options, like navigation buttons and a tree widget for the table of contents.

Like I said, I did this as part of the notes I was putting together on how repositories might support EPUB and maybe, finally, start serving real web content rather than exclusively PDF; more on that soon.

This approach might also help us add previews to web services so people can see their content in ereader mode, something I know David Flanders, the JISC manager on this project, is keen on.

And finally, something like this approach might be part of a tool chain that could help people break up long documents into parts, package them in EPUB, and upload them to services like http://digress.it, which want things broken up into parts.

Copyright Peter Sefton, 2011-04-14. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.