Template design issues for word processors and (possible future) EPUB export

This document is a collection of notes on how to design word processing templates for creating EPUBs – particularly theses. It’s probably not very interesting as a general read. The intended audience is support and technical staff who are working with theses and preparing for ebook creation projects. It may be of use to projects following on from jiscPUB, particularly in the area of thesis management, submission and deposit, where theses are required to be published in HTML and/or EPUB. These notes are incomplete – this is not a full “word processing for theses” book, and it does not provide an answer to how to actually create high quality EPUB theses from word processing documents, although I produced some promising demonstrations of the potential during my work on this project.

To make theses produced with a word processor available as EPUB, it is a given that the word processor, or some other application that can read word processing documents, needs to be able to produce good quality HTML. Given good HTML, EPUB can be created even if the word processing package or content management system being used is not capable of exporting EPUB natively. As Liza Daly notes in the final report for this project, that is difficult to achieve from arbitrary word processing documents, which is why it is useful to design a template, documentation and training that help users to choose appropriate features in their word processors, such as using defined styles rather than direct formatting.

I was involved in a word-processor based web publishing project at the University of Southern Queensland from 2004 to 2010. The project, the Integrated Content Environment (ICE), produced some templates, toolbars for creating documents, and HTML conversion code, all released under an open source license. I refer to that project a lot here, as it dealt with many of the relevant issues in setting up word processors for academic use, including fairly comprehensive documentation about how to do things the right way. There is a fork of the project on Google Code which I added to during the jiscPUB project.

This document takes a general look at template design and provides some specific examples and advice for two applications, Microsoft Word and OpenOffice.org Writer (including the new LibreOffice fork and the other derivatives); many of the issues are the same for other word processing packages, but this project didn’t have the resources to explore all the available options, such as Apple’s Pages and Google Docs. Another option is the open source LyX word processor (or document processor, as its creators call it) which, with some training and template development, may suit some candidates – but note that it would need to be run in a very well-supported environment.

In this document I refer to the thesis template provided by the University of Edinburgh and use it as the basis for some examples.

Templates vs “document prototype factories”

Templates are a starting point for creating different genres of document. When properly installed, they allow users to choose something like File / New / From template... and pick the kind of document they want: a thesis chapter, a paper, a report or a blog post. But templates suffer from several usability and maintenance issues in today’s computing environment.

  • If you click to open a template, it spawns a new document. In my experience users tend to save these new documents wherever they work normally – maybe in a shared drive, often on the desktop, and leave the template where they downloaded it. So the most likely place a template will end up living is in the Downloads or Desktop folder – where it is not subject to version control or management.
  • In OpenOffice.org the template system is arcane and difficult to navigate – it is possible to import a template via the user interface but it is complicated.
  • My advice here is not to attempt to distribute templates unless it is possible to do so via something like a standard institutional desktop, but instead to make blank prototype documents available for download from a content management system or a shared directory, and to put in place managed processes, automated if possible, for creating the prototype documents – something along the lines of a ‘document prototype factory’.

If you decide to maintain a family of document templates:

  • Try to share as much as possible between document prototypes/templates, including style names and, if possible, the same fonts and margins, to reduce maintenance overhead.
  • Maintain the core styles and common elements from the templates in one place – a ‘master’ template.
    • When making changes make them in the master template and then import the changes into the other templates/prototypes.
    • Consider automating production of sets of styles using macros or by producing the raw XML for .docx or .odt files. The ICE system, for example, contains macros that create a complete set of styles on demand using default settings. This means (if the macros work – and I’m not sure that they do 100%) that a new template can be created by setting the margins, the font and the spacing for a couple of base elements, and having the machine generate all the rest.

Granularity

One of the fundamental choices to make in designing templates for long documents like theses is whether to manage the document in one long file or to break it up into multiple chapters.

Historically, it was important to work on compound documents for performance reasons. These days, performance is probably not a major problem, with most computers having plenty of RAM, but there are still reasons why compound documents make sense, for example where a resource is to be assembled out of a range of source documents, or other objects. It makes particular sense in collaborative environments, where multiple parties are working on a project and editing different chapters. Theses are not usually meant to be collaborative (although that might be changing) but in the absence of collaboration infrastructure which can manage comments from a supervisor, sending off chapter one to a supervisor to add comments while the candidate works on chapter two allows for simpler management than mailing off the whole thesis for comment, and then having to integrate the two versions.

The major problem with the compound approach is when it comes time to join the thesis into a single final product for printing.

Microsoft Word has long had a reputation for poor performance in managing master documents. I have not checked this in detail in the latest version, but I would urge any template project to check its performance carefully before relying on it. The simple approach of copy-pasting the chapters into one document once the thesis is finished, as recommended in the help text in the Edinburgh thesis template, is possibly the most reliable, but it can be time-consuming, and small differences in formatting that have crept into the various chapter documents can cause problems.

The ICE project used compound documents because its focus was course documents authored by multiple parties. Our initial experiments with OpenOffice.org master documents assembled by computer program were not a success, so we settled on an approach which automated copying and pasting the parts together, according to a table-of-contents-like manifest, to produce a final compound file. This avoided all sorts of complexities to do with differences in page layout and accidental changes to styles.

With the rapid rise of ebook readers and a shift away from paper-based publishing, we in the academy should be treating thesis submission as a web-based process, possibly with EPUB as a container format. Given that thesis projects take a few years to complete, the time for projects that consider how thesis authoring and submission should work is now.

How to set up a master document

In this section I give some sketchy instructions for setting up compound theses via master documents. Note that these instructions are a starting point only.

In Writer you can turn a long document into a master document with multiple parts – I put examples of these in the demonstration system for the jiscPUB project.

  • First, use styles for your headings.
  • Work out which heading style is being used for Chapter headings. In the Edinburgh template it’s Heading 1 – in a typical ICE thesis it would be h1n or Title Chapter. I have used h1 in the examples.

If you are using Writer:

  • From the File menu choose Send, then Create Master Document.
  • In the Template dropdown choose the style that’s used for chapter headings and provide a file name.
  • Click Save.
  • The application will create a series of files, one for each block that starts with the chapter style. (Or at least it should – there seem to be bugs in LibreOffice 3.3.3 and the splitting feature didn’t work for me.) The resulting master document will contain all the front matter text with the chapters included. I recommend moving the front matter to a sub-document too:
    • Select all the front matter and Cut it.
    • In the navigator in the Master document, right-click on the first chapter (eg my-thesis1.odt) and choose Insert New Document.
    • Paste the front-matter into the new document.
    • Save the new document as my-thesis0.odt (for example).
  • To make the chapters usable as stand-alone documents:
    • Open each document.
    • In Tools / Outline Numbering, set the Start at number for the first outline level to the chapter number.

To perform the same trick in Microsoft Word you have to split the document manually.

  • Copy and paste all the chapters and the front-matter into a series of files.
  • Create a new, blank document with the correct margins and styles.
  • Switch to outline view via the status bar, bottom right of the document window.
  • In the Outlining tab, click Show document (this makes more options appear).
  • For each chapter, click Insert, then pick the chapter and click Open.
  • To make sure that each chapter has the correct number and title when you are managing it individually:
    • For each chapter, open the file itself.
    • Copy the text from the chapter heading and enter it as the document title in the File tab under Info, Properties, Title.
    • In the chapter heading, right-click and choose Numbering, then Set Numbering Value...
    • Choose the chapter number in Set value to: and click OK.

In either Word or Writer you now have a series of stand alone documents that you can edit one at a time, and a master document that contains the whole book.

Converting the files to HTML

Converting the thesis to HTML can now be done either chapter by chapter, for example as a series of posts or pages in WordPress, or via the master document, with the usual caveat that word processors tend to make poor HTML. One drawback of the approach I have outlined here is that each of the sub-documents uses the Heading 1 style as its title, so when converted to HTML as a stand-alone document it has a slightly odd structure. Dealing with this kind of document structure is something for a (forthcoming) wish-list of features for a good quality HTML converter – it should be able to normalise headings in the documents it outputs, and ‘do the right thing’ with each document delineated by article tags, containing sections. HTML5 has specific rules about document outlines which allow for re-combining content from multiple fragments.
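As a rough illustration of the kind of normalisation such a converter might do, the sketch below is my own example, not part of ICE or any existing tool; it assumes the BeautifulSoup library and already-clean chapter fragments. It wraps each chapter in an article element and demotes the chapter’s headings by one level so the fragments can sit under a single book title.

    # Sketch only: normalise chapter fragments so they can be recombined.
    from bs4 import BeautifulSoup

    def normalise_chapter(fragment):
        """Wrap one chapter's HTML in <article> and demote h1-h5 by one level."""
        soup = BeautifulSoup(fragment, "html.parser")
        for level in range(5, 0, -1):              # deepest first, to avoid double demotion
            for heading in soup.find_all("h%d" % level):
                heading.name = "h%d" % (level + 1)
        article = soup.new_tag("article")
        for child in list(soup.contents):
            article.append(child.extract())
        soup.append(article)
        return str(soup)

    chapters = ["<h1>Chapter 1</h1><p>...</p>", "<h1>Chapter 2</h1><p>...</p>"]
    book_body = "\n".join(normalise_chapter(c) for c in chapters)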

Styles

Styles are one of the key innovations that make word processing useful for technical and academic content. A style is a named bundle of formatting attributes that can be attached to a paragraphs, span of text inside a paragraph and to lists and table structures.

The most basic use of styles (and the only area where there is anything like a cross-application standard approach) is using heading styles to structure a document. Most word processors use Heading 1, Heading 2, and so on, attached to paragraphs, as the standard way to create a document outline. That is where the quasi-standardisation ends, though; there are no widely used standards for the other things we need in academic documents.

A thesis template should have, at a minimum styles for:

  • Headings
  • Metadata
  • Block-quotes
  • Examples
  • Pre-formatted code.

For example, the ICE system specifies a set of styles which has been used for several years for producing academic documents at the University of Southern Queensland. The styles are summarised in the table below, updated from one which originally appeared in an article at xml.com. These style names were chosen to be mostly very short, so they would be easy to see in the interface in both Word and OpenOffice.org, particularly in Word’s view that shows style names on the left.

 

Family                            Type                       Style names (levels 1–5)
Paragraph (p)                     –                          p, p-centre, p-right, p-indent*
Heading (h)                       –                          h1  h2  h3  h4  h5
Heading (h)                       Numbered (n)               h1n h2n h3n h4n h5n
List item (li)                    Numbered (n)               li1n li2n li3n li4n li5n
List item (li)                    Bullet (b)                 li1b li2b li3b li4b li5b
List item (li)                    Uppercase Alpha (A)        li1A li2A li3A li4A li5A
List item (li)                    Lowercase Alpha (a)        li1a li2a li3a li4a li5a
List item (li)                    Lowercase Roman (i)        li1i li2i li3i li4i li5i
List item (li)                    Uppercase Roman (I)        li1I li2I li3I li4I li5I
List item (li)                    Continuing paragraph (p)   li1p li2p li3p li4p li5p
Blockquote (bq)                   –                          bq1 bq2 bq3 bq4 bq5
Definition List Term (dt)         –                          dt1 dt2 dt3 dt4 dt5
Definition List Description (dd)  –                          dd1 dd2 dd3 dd4 dd5

It was not intended that users would have to type these or even select them from a drop-down; rather, they would use add-in interfaces which aided them. The first generation is described in the XML.com article I wrote. This used a hierarchical menu system (also keyboard navigable) which was the same in Word and OpenOffice.org.


 

The second generation of this interface is the ICE toolbar, which uses a set of buttons very like those in most modern editing applications, but which tries to “Do the right thing” and apply styles; it is documented at the ICE site.


Saving as HTML

Using styles does not do a lot to help the quality of HTML exported from our two word processors out of the box but many third-party applications for creating HTML do try to use styles, for example ICE, or the commercial HTML Transit.

(I gather that ICE is no longer actively maintained by USQ – I’m using it here as an example of the kinds of interfaces that make it easier for users to apply styles than the defaults that come with their word processing packages – it is open source, so organisations who wanted to develop templates like the ones used in ICE could adopt part or all of it).

Heading numbering and document outlines

One of the key benefits of using heading styles is that they allow for automatic tables of contents and for an outline view of a document.

One issue that needs to be dealt with is document numbering. It is possible to attach numbering to styles so that the headings in a document are numbered. The simplest case is to map style names to numbers – but there are use-cases where documents have both numbered and non-numbered parts – and special cases such as appendices which might be sections at the same level as, say, chapters but have different numbers.

ICE (barely) manages to deal with this complexity by using a compound document approach, with each chapter or appendix stored in a separate file. The ICE system was designed to be aggressively interoperable between the OpenOffice.org family and Word, which imposed a major limitation: OOo Writer can only tie ONE style to each numbering level in the document outline – with the added complication that recent versions do support ‘outline level’ as a paragraph attribute, although this is not tied to the outline numbering feature as far as I can tell.

Lists

List structures are one of the most difficult things to deal with in word processing, for template design, HTML export and for basic usability.

  • Word’s lists have historically been very unstable. There are multiple ways to make lists in Word, including direct formatting, named list outlines, anonymous list outlines and list styles, many of which are almost impossible to access in the Ribbon interface that Word moved to in version 2007. There have also been many changes to the way Word handles lists and list styles over the years, making this a very complicated topic.
  • Writer’s list support is close to unusable. The Open Document Format, which is the native file format, has lists as a first-class object and has provision for a document to contain hierarchical list structures like those in HTML. The problem is that in a paragraph-based editing environment it is almost impossible for an author to understand the hierarchical structure of their lists – there are only very small cues in the interface to show what level of list a particular item is on, for example – and the process of adding an extra paragraph into a list without a bullet is bizarrely complicated: it is not a matter of applying formatting or styling, but a structural manipulation which is at odds with the way word processors typically work.

Interoperability is a problem – when transferring documents with or without styles between Word and Writer, lists often break, numbering is destroyed, and indenting changes. Even when using styles, when Writer saves to the .doc format, instead of creating Word list styles corresponding to its internal ones it creates new ones. So even saving a Writer document to .doc and reloading it back into Writer breaks documents.

Against this background I think it is worth describing the ICE approach to interoperability here as an illustration of the sort of thinking that is needed in a heterogeneous application environment.

In ICE there is a standard set of list style names which is implemented differently in Word and OpenOffice. Both share a set of paragraph styles with the same name, li1b for a first-level bullet list, li2n for a second level numbered list item and so on.

In Word
Each paragraph style is tied to a named list outline (not a list style), so the styles li1b, li2b et al. are attached to a single outline called lib. While Word has these named outlines, they are difficult to access reliably – there is no way to pick one from a list; they only appear in galleries, and if the one you want is not showing you cannot access it. In ICE, use of these lists is entirely via macros, which can repair them when they break. (And they do.)
In Writer
There is a corresponding list style for each paragraph style, and when a user uses the ICE toolbar or menus to apply a paragraph style, a macro applies the relevant list style at the same time. Writer has long had an option to tie a paragraph style to a list style, but it doesn’t work reliably.

In both cases, when things go wrong there is a macro that cycles through every paragraph in the document and re-applies each style, including making sure that if a paragraph is in the li1b style it is attached to the correct list. In Word, there is a macro to reset its list formatting, rebuilding each named list outline, as Word has a tendency to do what can only be described as ‘go crazy’ and have all the lists in a document change formatting (I have not checked up on this in the latest version, but I have no reason to think that this has been fixed).

Saving as HTML

Saving lists as HTML is one of the worst-performing areas for word processors; their algorithms typically do a very poor job. Word 2010 still saves list items as paragraphs with formatting rather than as list structures, and the OpenOffice.org family produces non-standard, often flat-out wrong structures. The ICE approach of a full, consistent set of styles means that ICE can create properly structured output, including correctly nesting block quotes and non-numbered paragraphs inside complex list structures. It does this by using the level numbers in ICE styles to work out what should be nested inside what.
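To make the ‘nest by level number’ idea concrete, here is a minimal sketch of my own (not the actual ICE converter; it assumes a clean run of list-item styles with no skipped levels) that turns a flat sequence of (style, text) pairs into nested HTML lists:

    # Sketch only: use the digit in ICE style names (li1b, li2n, ...) as nesting depth.
    import re

    LIST_TAGS = {"b": "ul", "n": "ol"}     # bullet -> ul, numbered -> ol

    def flat_to_nested(paragraphs):
        out, stack = [], []                # stack holds the tags of currently open lists
        for style, text in paragraphs:
            level_str, kind = re.match(r"li(\d)([bn])$", style).groups()
            level, tag = int(level_str), LIST_TAGS[kind]
            while len(stack) > level:      # come back up: close the open item and its list
                out.append("</li></%s>" % stack.pop())
            if len(stack) == level:        # sibling: close the previous item
                out.append("</li>")
            while len(stack) < level:      # go deeper: open a new list inside the open item
                out.append("<%s>" % tag)
                stack.append(tag)
            out.append("<li>%s" % text)    # left open in case a nested list follows
        while stack:
            out.append("</li></%s>" % stack.pop())
        return "".join(out)

    print(flat_to_nested([("li1b", "Fruit"), ("li2b", "Apples"),
                          ("li2b", "Pears"), ("li1b", "Vegetables")]))

The example prints <ul><li>Fruit<ul><li>Apples</li><li>Pears</li></ul></li><li>Vegetables</li></ul> – the kind of properly nested structure that the word processors’ own exporters fail to produce.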

In a potential new service for converting word processing content to HTML this could be extended to deal not only with a standard set of style names, but also to infer structure in other situations, indenting being one of the major cues (that seems obvious, but the current algorithms in word processors and in browser-based editors manage to get it wrong – they produce odd structures that are almost certainly not what any author meant).

Metadata

I looked at metadata in a blog post.

Embedding images

By default in Word and OpenOffice.org, if you paste in or create an image or other inline object such as a chart or drawing, it ‘floats’ relative to the content. The idea is that objects can be placed anywhere on a page. For web and ebook publishing this is not useful and it leads to a lot of frustration. Unless very fine-grained control over image placement is required for print publication, it is usually best to anchor images as characters rather than as floating objects.

  • Anchor images and objects as characters. In Writer:
    • Right click on an embedded object and choose Anchor, As Character.

    In Word:

    Right click on the object and choose Wrap Text, Inline with Text

  • Use the in-built vector drawing packages for diagramming, but:
    • Don’t draw on the document as though it were paper, insert objects that contain drawings.
      • In Writer: From the Insert menu choose, Object, OLE Object, (Name of application) Drawing
      • In Word: From the Insert Tab choose Shapes, New Drawing Canvas.
  • Use the inbuilt Maths editors in either platform.

Maths

Maths support on the web has been a problem, but things are slowly improving. The ideal is to use MathML, which is part of HTML5. Current practice on the web often involves the use of LaTeX as a source for mathematics, which is then rendered into HTML via other tools. There are commercial plugins for both Word and Writer that can deal with LaTeX markup.

Word 2007 and 2010 and Writer can export MathML and save MathML inside their file formats, although this does not happen when you save as HTML, so it should be possible to automate production of high quality output to HTML, given the resources. As far as I know, nobody has done this yet.
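Getting at the formula markup is at least straightforward in the ODF case, where Writer stores each formula as a MathML sub-document inside the .odt zip. The sketch below is my own illustration, assuming the usual ‘Object N/content.xml’ layout rather than anything built for this project; it pulls out those fragments, which is the raw material an HTML-with-MathML pipeline would need.

    # Sketch only: list the MathML formula objects embedded in an .odt file.
    import sys
    import zipfile
    import xml.etree.ElementTree as ET

    MATHML_MATH = "{http://www.w3.org/1998/Math/MathML}math"

    def extract_mathml(odt_path):
        formulas = []
        with zipfile.ZipFile(odt_path) as odt:
            for name in odt.namelist():
                # Formula objects live in sub-documents, not the main content.xml.
                if name.endswith("content.xml") and name != "content.xml":
                    root = ET.fromstring(odt.read(name))
                    if root.tag == MATHML_MATH:
                        formulas.append(ET.tostring(root, encoding="unicode"))
        return formulas

    if __name__ == "__main__":
        for formula in extract_mathml(sys.argv[1]):
            print(formula)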

For casual use of maths, the approach I describe below – generating images using the word processor’s inbuilt Save as HTML, which creates images of the maths – is probably adequate, but it is far from ideal where mathematics is a key part of the content.

 

Converting images to HTML

One of the areas where many HTML conversion projects fall down is images. Because office suites have tight integration with drawing and presentation applications, inbuilt maths rendering and so on, it is often very difficult for external code to render anything but a plain-text document, or one whose images are already in web formats such as JPEG or PNG, as HTML from a word processor file. The ICE application uses OpenOffice.org to render inline objects from both Word and Writer documents, and in parallel creates HTML from the XML source files.
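A rough way to reproduce the ‘let the office suite render the pictures’ half of that pipeline, without ICE, is to drive the suite headlessly and keep the image files it writes next to its HTML. This is a sketch under the assumption that a reasonably recent LibreOffice/OpenOffice.org soffice binary with the --convert-to option is on the path; the file name is a placeholder.

    # Sketch only: have the office suite render a document to HTML so that
    # embedded charts, drawings and formulas come out as web-ready image files.
    import subprocess
    from pathlib import Path

    def render_with_office(doc_path, out_dir="rendered"):
        Path(out_dir).mkdir(exist_ok=True)
        subprocess.run(
            ["soffice", "--headless", "--convert-to", "html",
             "--outdir", out_dir, doc_path],
            check=True,
        )
        # The suite writes an .html file plus images alongside it; a better
        # converter would keep the images and generate its own HTML from the
        # XML inside the .odt/.docx, which is what ICE did.
        return sorted(Path(out_dir).iterdir())

    for produced in render_with_office("my-thesis.odt"):
        print(produced)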

 

Illustration 1: Diagram showing how an HTML converter can use the word processor to create web-ready images, while still creating HTML from the XML inside its native document format (.docx or .odt)

In a previous project I worked on with some members of the ICE team we simply used the HTML output from Word 2000 and massaged it to be much better quality HTML, discarding the formatting that Word outputs and using the style names (which are output as classes) to generate the HTML.

This is an important area because the various integration features which allow authors to embed charts, vector graphics and so on are one of the main reasons to keep using word processors. If candidates are working on theses that are exclusively text, then a tool-chain such as asciidoc (with a wiki-like text format) or Pandoc may be worth considering.

Reference managers

There is no space here for a full evaluation of reference managers such as EndNote, Zotero or Mendeley, all of which have integration with word processors; for now candidates must be assisted in choosing an appropriate tool for their discipline and institution. Regarding the future, JISC is investing in this area with support for the Open Bibliography project – one important dimension of this will be working out how cost and effort across the entire sector can be reduced by simplifying and rationalising the process of citing works. If we have a large-scale open bibliography available, then referencing in many disciplines could be as simple as linking to a URI for a resource in that shared bibliography, with all the details of presenting citations and reference lists handled automatically.

Tables of contents, figures etc

Both Word and Writer have extensive automation features for tables of contents and tables of figures etc as demonstrated in the Edinburgh template. It is important to set up examples and encourage candidates to use them. A template should have the required tables of contents for headings, figures etc already in place with examples and instructions on how to insert figures etc so they are numbered. Most of these should probably be discarded in exported HTML and EPUB versions and appropriate native HTML versions prepared automatically by software.

Summary

Any new template design process needs to consider all of the above (and more) in multiple cycles, until a stable set of design constraints emerges.

  • Interoperability requirements. The range of packages you want users to be able to work with imposes constraints on which features can be used. Current trends such as tablet computing, and the rise of vertical platforms such as Apple’s iOS devices, need to be given consideration. (On the ICE project, several years ago we decided to support OpenOffice.org Writer and Microsoft Word to ensure cross-platform coverage across Windows, Mac and Linux – today’s environment is very different, but during the ICE project our each-way bet paid off when Microsoft dropped support for Visual Basic scripting in the Mac version of Office – we were able to keep coverage for the style toolbar on that platform by offering Writer to Mac users.)
  • Whether to support single-file theses or multi file theses or both. Multiple files will increase the need to provide support, and possibly require the use of external tools, but for modern research theses, the ability to aggregate different things such as data files together is attractive.
  • A set of styles and/or other guidelines.
  • How to make HTML and EPUB versions of the content. If an application can produce HTML then that can be converted to EPUB automatically (see the sketch below); it is producing HTML of sufficient quality that is the problem.
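To make the last point concrete: once reasonable HTML exists, the mechanical conversion step really is trivial – with Calibre installed it is a single call to ebook-convert (for example, ebook-convert thesis.html thesis.epub). A minimal sketch, with placeholder file names and metadata:

    # Sketch only: convert already-good HTML to EPUB with Calibre's ebook-convert.
    import subprocess

    subprocess.run(
        ["ebook-convert", "thesis.html", "thesis.epub",
         "--title", "My Thesis", "--authors", "A. Candidate"],
        check=True,
    )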

The ICE system I have continually referenced throughout this document was one fully worked example of all of the above considerations. It was not designed for theses, although it was tested on them and found to be adequate. But ICE is several years old, so re-doing this process now would produce a different design. Some of the key insights from the ICE design process include:

  • Templates need to be immediately useful to their users. That is, people have to be able to see the point of what they are being asked to do or fill in. For theses this is simpler than for some other types of document: the institution can say to a candidate, “Use this!” or, “Your thesis must meet the formatting criteria we specify, here is a template that helps”.
  • Following from the above point, rapid feedback is required – if the final deliverable is expected to be an ebook, amongst other formats, make sure there is a system in place to show the candidate and their supervisor what that ebook will look like as the work progresses.
  • The document authoring system needs to be integrated into institutional processes, so making the authoring system part of the supervisor/candidate conversation, and automating submission will be important.

While this document has looked at some design issues for templates, it does not provide a solution to the question it is trying to answer: how to set up an environment for creating EPUB theses from word processing source files. I will produce one final blog post for this project outlining some potential solutions to some of the issues raised in this document, as a guide to where JISC might or might not like to invest in future work.

Copyright Peter Sefton, 2011-07-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

 


The repository is watching: automated harvesting from replicated filesystems

One of the final things I’m looking at on this jiscPUB project is a demonstration of a new class of tool for managing academic projects, not just documents. For a while we were calling this idea the Desktop Repository – the idea being that there would be repository services watching your entire hard disk and exposing all the content in a local website with repository and content management services. That’s possibly a very useful class of application for some academics, but in this project we are looking at a slightly different slant on that idea.

The core use case I’m illustrating here is thesis writing, but the same workflow would be useful across a lot of academic projects, including all the things we’re focussing on in the jiscPUB project: academic users managing their portfolio of work, project reporting and courseware management. This tool is about a lot more than just ebook publishing, but I will look at that aspect of it, of course.

In this post I will show some screenshots of The Fascinator repository in action, talk about how you can get involved in trying it out, and finish with some technical notes about installation and setup. I was responsible for leading the team that built this software at the University of Southern Queensland. Development is now being done at the University of Central Queensland and the Queensland Cyber Infrastructure Foundation where Duncan Dickinson and Greg Pendlebury continue work on the ReDBox research data repository which is based on the same platform.

I know Theo Andrew at Edinburgh is keen to get some people trying this, so this blog post will serve to introduce it and give his team some ideas – we’ll follow up on their experiences if there are useful findings.

Managing a thesis

The short version of how this thesis story might work is:

  • The university supplies the candidate with a Dropbox-like shared file system they can use from pretty much any device to access their stuff. But there’s a twist: there is a web-based repository watching the shared folder and exposing everything in it to the web.

  • The university helpfully adds into the share a thesis template that’s ready to go – complete with all the cover page material, margins all set, automated tables of contents for sections, tables and figures, and the right styles – and trains the candidate in the basics of word processing.

  • The candidate works away on their project, keeping all their data, presentations, notes and so on in the Dropbox and filling out the thesis template as they go.

  • The supervisor can drop in on the work in progress and leave comments via an annotation system.

  • At any time, the candidate can grab a group of things – which we call a package – and publish it to a blog or deposit it in a repository at the click of a button. This includes not just documents, but data files (the ones that are small enough to keep in a replicated file system), images, presentations etc.

  • The final examination process could be handled using the same infrastructure and the university could make its own packages of all the examiners reports etc for deposit into a closed repository.

The result is web-based, web-native scholarship where everything is available in HTML, not just PDF or application file formats and there are easy ways to route content to other repositories or publish it in various ways.

Where might ebook dissemination fit into this?

Well, pretty much anywhere in the above that someone wants to either take a digital object ‘on the road’ or deposit it in a repository of some kind as a bounded digital thing.

Demonstration

I have put a copy of Joss Winn’s MA thesis into the system to show how it works. It is available in the live system (note that this might change if people play around with it). I took an old OpenOffice.org .sxw file Joss sent me and changed the styles a little to use the ICE conventions. I’m writing a much more detailed post about templates in general, so stay tuned for a discussion of the pros and cons of various options for choosing style names and conventions, and whether or not to manage the document as a single file or multiple chapters.

Illustration 1: The author puts their stuff in the local file system, in this case replicated by Dropbox.

Illustration 2: A web view of Joss Winn’s thesis.

The interface provides a range of actions.

Illustration 3: You can do things with content in The Fascinator, including blogging and export to zip or (experimental) EPUB.

The EPUB export was put together as a demonstration for the Beyond The PDF effort by Ron Ward. At the moment it only works on packages, not individual documents, and it uses some internal Python code to stitch together documents, rather than calling out to Calibre as I did in earlier work on this project. The advantage of doing it this way is that you don’t have Calibre adding extra material and reprocessing documents to add CSS; the disadvantage is that a lot of what Calibre does is useful, for example working around known bugs in reader software – although it does tend to change formatting on you, not always in useful ways.

I put the EPUB into the Dropbox share so it is available in the demo site (you need to expand the Attachments box to get the download – that’s not great usability, I know). Or you can go to the package and export it yourself. Log in first, using admin as the username and the same for the password.

Illustration 4: Joss Winn’s thesis exported as EPUB.

I looked at a different way of creating an EPUB book from the same thesis a while ago; that version will be available for a while here at the Calibre server I set up.

One of the features of this software is that more than one person can look at the web site and there are extensive opportunities for collaboration.

Illustration 5: Colleagues and supervisors can leave comments via inline annotation (including annotating pictures and videos)

Illustration 6: Annotations are threaded discussions

Illustration 7: Images and videos can be annotated too. At USQ we developed a Javascript toolkit called Anotar for this, the idea being you could add annotation services to any web site quickly and easily.

This thesis package only contains documents, but one of the strengths of The Fascinator platform is that it can aggregate all kinds of data, including images, spreadsheets and presentations, and it can be extended to deal with any kind of data file via plugins. I have added another package, modestly calling itself the research object of the future, using some files supplied by Phil Bourne for the Beyond the PDF group. The Fascinator makes web views of all the content and can package it all as a zip file or an EPUB.

Illustration 8: A spreadsheet rendered into HTML and published into an EPUB file (demo quality only)

This includes turning PowerPoint into a flat web page.

Illustration 9: A presentation exported to EPUB along with data and all the other parts of a research object

Installation notes

Installing The Fascinator (I did it on Amazon’s EC2 cloud on Ubuntu 10.04.1 LTS) is straightforward. These are my notes – not intended to be a detailed how-to, but possibly enough for experienced programmers/sysadmins to work it out.

  • Check it out.

    sudo svn co https://the-fascinator.googlecode.com/svn/the-fascinator/trunk /opt/fascinator
  • Install Sun’s Java

    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:sun-java-community-team/sun-java6
    sudo apt-get update
    sudo apt-get install sun-java6-jdk

    http://stackoverflow.com/questions/3747789/how-to-install-the-sun-java-jdk-on-ubuntu-10-10-maverick-meerkat/3997220#3997220

  • Install Maven 2.

    sudo apt-get install maven2
  • Install ICE or point your config at an ICE service. I have one running for the jiscPUB project; you can point to this by changing the ~/.fascinator/system-config.json file.

  • Install Dropbox or your file replication service of choice – a little bit of work on a headless server, but there are instructions linked from the Dropbox.com site.

  • Make some configuration changes, see below.

  • To run ICE and The Fascinator on their default ports on the same machine, add this to /etc/apache2/apache.conf (I think the proxy module setup I’m using here is non-standard).

    LoadModule  proxy_module /usr/lib/apache2/modules/mod_proxy.so
    LoadModule  proxy_http_module /usr/lib/apache2/modules/mod_proxy_http.so
    ProxyRequests Off
    <Proxy *>
    Order deny,allow
    Allow from all
    </Proxy>
    ProxyPass        /api/ http://localhost:8000/api/
    ProxyPassReverse /api/  http://localhost:8000/api/
    ProxyPass       /portal/ http://localhost:9997/portal/
    ProxyPassReverse /portal/ http://localhost:9997/portal/
  • Run it.

    cd /opt/fascinator
    ./tf.sh restart

Configuration follows:

  • To set up the harvester, add this to the empty jobs list in ~/.fascinator/system-config.json:

    "jobs": [
        {
            "name": "dropbox-public",
            "type": "harvest",
            "configFile": "${fascinator.home}/harvest/local-files.json",
            "timing": "0/30 * * * * ?"
        }
    ]

And change /harvest/local-files.json to point at the Dropbox directory:

    "harvester": {
        "type": "file-system",
        "file-system": {
            "targets": [
                {
                    "baseDir": "${user.home}/Dropbox/",
                    "facetDir": "${user.home}/Dropbox/",
                    "ignoreFilter": ".svn|.ice|.*|~*|Thumbs.db|.DS_Store",
                    "recursive": true,
                    "force": false,
                    "link": true
                }
            ],
            "caching": "basic",
            "cacheId": "default"
        }
    }

To add the EPUB support and the red branding, unzip the skin files in this zip file into the portal/default/ directory: http://ec2-50-19-86-198.compute-1.amazonaws.com/portal/default/download/551148ce6d80bfc0c9c36914f9df4f91/jiscpub.zip

unzip -d /opt/fascinator/portal/src/main/config/portal/default/ jiscpub.zip

Copyright Peter Sefton, 2011-07-12. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Making EPUB from WordPress (and other) web collections

Background

As part of Workpackage 3 I have been looking at WordPress as a way of creating scholarly monographs. This post carries on from the last couple, but it’s not really about EPUB or about WordPress, it’s about interoperability and how tools might work together in a Scholarly HTML mode so that people can package and repackage their resources much more reliably and flexibly than they can now.

While exploring WordPress I had a look at the JISC-funded KnowledgeBlog project. The team there has released a plugin for WordPress to show a table of contents made up of all the posts in a particular category. It seemed that with a bit of enhancement this could be a useful component of a production workflow for book-like projects, particularly for project reports and theses (where they are being written online in content management systems – maybe not so common now, but likely to become more common) and for course materials.

Recently I looked at Anthologize, a WordPress-based way of creating ebooks from HTML resources sourced from around the web (I noted a number of limitations which I am sure will be dealt with sooner or later). Anthologize uses a design pattern that I have seen a couple of times with EPUB: converting the multiple parts of a project to an XML format that already has tools for rendering, and using those tools to generate outputs like PDF or EPUB. Asciidoc does this using the DocBook tool-chain, and Anthologize uses TEI tools. I will write more on this design pattern and its implications soon. There is another obvious approach: to leave things in HTML and build books from that, for example using Calibre, which already has ways to build ebooks from HTML sources. This is an approach which could be added to Anthologize very easily, to complement the TEI approach.

So, I have put together a workflow using Calibre to build EPUBs straight from a blog.

Why would you want to do this? Two main reasons. Firstly, to read a report, thesis or course, or an entire blog on a mobile device. Secondly, to be able to deposit a snapshot of same into a repository.

In this post I will talk about some academic works:

The key to this effort is the KnowledgeBlog table of contents plugin ktoc, with some enhancements I have added to make it easier to harvest web content into a book.

The results are available on a Calibre server I’m running in the Amazon cloud just for the duration of this project. (The server is really intended for local use; the way I am running it, behind an Apache reverse proxy, it doesn’t seem very happy – you may have to refresh a couple of times until it comes good.) This is rough. It is certainly not production quality.


These books are created using calibre ‘recipes’: available here. You run them like this:

ebook-convert thesis-demo.recipe .epub --test

If you are just trying this out, be kind to site owners: --test will cause it to fetch only a couple of articles per feed.

I added them to the calibre server like this:

calibredb add --library-path=./books thesis-demo.epub

The projects page at my site has two TOCs for two different projects.

[ktoc cat="jiscPUB" title="Digital Monograph Technical Landscape study #jiscPUB" show_authors="false" orderby="date" toc_author="Peter Sefton"]

[ktoc cat="ScholarlyHTML" title="Scholarly HTML posts" orderby="date" show_authors="false" toc_author="Peter Sefton" ]

The title is used to create sections in the book; in both cases the posts are displayed in date order, and I am not showing the name of the author on the page because that’s not needed when it is all me.

The resulting book has a nested table of contents, seen here in Adobe Digital Editions.

Illustration 1: A book built from a WordPress page with two table of contents blocks generated from WordPress categories.

Read on for more detail about the process of developing these things and some comments about the problems I encountered working with multiple conflicting WordPress plugins, etc.

The Scholarly HTML way to EPUB

The first thing I tried in this exploration was writing a recipe to make an EPUB book from a Knowledge Blog, for the Ontogenesis project. It is a kind of encyclopaedia of ontology development maintained in a WordPress site with multiple contributors. It worked well, for a demonstration, and did not take long to develop. The Ontogenesis recipe is available here and the resulting book is available on the Calibre server.

But there was a problem.

The second blog I wanted to try it on was my own, so I installed ktoc, changed the URL in the recipe and ran it. Nothing. The problem is that Ontogenesis and my blog use different WordPress themes, so the structure is different. Recipes have rules like this in them to locate the parts of a page, such as <p class='details_small'>:

remove_tags_before = dict(name='p', attrs={'class':'details_small'})

remove_tags_after = dict(name='div', attrs={'class':'post_content'})

That’s for Ontogenesis; different rules are needed for other sites. You also need code to find the table of contents amongst all the links on a WordPress page, and to deal with pages that might have two or more ktoc-generated tables for different sections of a journal, or parts of a project report.
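For readers who have not met Calibre recipes, the shape of one of these site-specific recipes is roughly as follows. This is a simplified sketch of the pattern, not the actual Ontogenesis recipe; the URL and the theme-specific class names are placeholders.

    # Sketch of the recipe pattern only; the URL and CSS classes are placeholders.
    from calibre.web.feeds.news import BasicNewsRecipe

    class ExampleKtocRecipe(BasicNewsRecipe):
        title = 'Example KnowledgeBlog book'
        INDEX = 'http://example.org/table-of-contents/'

        # Theme-specific rules that strip everything except the post content.
        remove_tags_before = dict(name='div', attrs={'class': 'post_content'})
        remove_tags_after = dict(name='div', attrs={'class': 'post_content'})

        def parse_index(self):
            # Fetch the TOC page and turn its links into Calibre's
            # list-of-sections structure: [(section_title, [articles]), ...]
            soup = self.index_to_soup(self.INDEX)
            articles = []
            for link in soup.findAll('a'):
                if link.get('href'):
                    articles.append({'title': self.tag_to_string(link),
                                     'url': link['href']})
            return [('Articles', articles)]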

Anyway, I wrote a different recipe for my site, but as I was doing so I was thinking about how to make this easier. What if:

  • The ktoc plugin output a little more information in its list of posts that made it easy to find no matter what WordPress theme was being used.

  • The actual post part of each page (ie not the navigation, or ads) identified itself as such.

  • The same technique could be extended to other websites in general.

There is already a standard way to do the most important part of this – listing a set of resources that make up an aggregated resource: the Object Reuse and Exchange (ORE) specification, embedded in HTML using RDFa. ORE in RDFa. Simple.

Well no, it’s not, unfortunately. ORE is complicated and has some very important but hard-to-grasp abstractions, such as the difference between an Aggregation and a Resource Map. An Aggregation is a collection of resources which has a URI, while a Resource Map describes the relationship between the Aggregation and the resources it aggregates. These things are supposed to have different URIs. Now, for a simple task like making a table of contents of WordPress posts machine-readable so you can throw together a book, these abstractions are not really helpful to developers or consumers. But what if there were a simple recipe/microformat – what we call a convention in Scholarly HTML – to follow, which was ORE compliant and also simple to implement at both the server and the client end?

What I have been doing over the last couple of days, as I continue this EPUB exploration, is trying to use the ORE spec in a way that will be easy to implement, say in the Digress.it TOC page or in Anthologize, while still being ORE compliant. That discussion is ongoing, and will take place in the Google groups for Scholarly HTML and ORE. It is worth pursuing because, if we can get it sorted out, then with a few very simple additions to the HTML they spit out, any web system can get EPUB export quickly and cheaply by adhering to a narrowly defined profile of ORE, subject to the donor service being able to supply reasonable quality HTML. More sophisticated tools that do understand RDFa and ORE will be able to process arbitrary pages that use the Scholarly HTML convention, but developers can choose the simpler convention over a full implementation for some tasks.

The details may change, as I seek advice from experts, but basically, there are two parts to this.

Firstly there’s adding ORE semantics to the ktoc (or any) table of contents. It used to be a plain-old unordered list, with list items in it:

<p><strong>Articles</strong></p>
<ul>
<li><a href="http://ontogenesis.knowledgeblog.org/49">Automatic
maintenance of multiple inheritance ontologies</a> by Mikel Egana
Aranguren</li>
<li><a href="http://ontogenesis.knowledgeblog.org/257">Characterising
Representation</a> by Sean Bechhofer and Robert Stevens</li>
<li><a href="http://ontogenesis.knowledgeblog.org/1001">Closing Down
the Open World: Covering Axioms and Closure Axioms</a> by Robert
Stevens</li>
</ul>

The list items now explicitly say what is being aggregated. The plain old <li> becomes:

<li  rel="http://www.openarchives.org/ore/terms/aggregates"
resource="http://ontogenesis.knowledgeblog.org/49">

(The fact that this is an <li> does not matter, it could be any element.)

And there is a separate URI for the Aggregation and resource map courtesy of different IDs. And the resource map says that it describes the Aggregation as per the ORE spec.

<div id="AggregationScholarlyHTML">

<div rel="http://www.openarchives.org/ore/terms/describes" resource="#AggregationScholarlyHTML" id="ResourceMapScholarlyHTML" about="#ResourceMapScholarlyHTML">

It is verbose, but nobody will have to type this stuff. What I have tried to do here (and it is a work in progress) is to simplify an existing standard which could be applied in any number of ways, and boil it down to a simple convention that’s easy to implement but that still honours the more complicated specifications in the background. (Experts will realise that I have used an RDFa 1.1 approach here, meaning that current RDFa processors will not understand it; this is so that we don’t have to deal with namespaces and CURIEs, which complicate processing for non-native tools.)

Secondly, the plugin wraps a <div> element around the content of every post to label it as Scholarly HTML; this is a way of saying that this part of the whole page is the content that makes up the article, thesis chapter or similar. Without a marker like this, finding the content is a real challenge: pages are loaded up with all sorts of navigation, decoration and advertisements, the layout is different on just about every site, and it can change at the whim of the blog owner if they change themes.

<div rel="http://scholarly-html.org/schtml">
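To show how little work this leaves for a consumer, here is a rough client-side sketch – my own illustration, assuming the markup above and the requests and BeautifulSoup libraries – that reads a TOC page, collects the aggregated post URLs, and keeps only the marked content div from each post.

    # Sketch only: harvest a book's worth of posts using the convention above.
    import requests
    from bs4 import BeautifulSoup

    AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"
    SCHTML = "http://scholarly-html.org/schtml"

    def harvest(toc_url):
        toc = BeautifulSoup(requests.get(toc_url).text, "html.parser")
        chapters = []
        for item in toc.find_all(attrs={"rel": AGGREGATES}):
            post_url = item.get("resource")
            post = BeautifulSoup(requests.get(post_url).text, "html.parser")
            content = post.find("div", attrs={"rel": SCHTML})
            if content is not None:
                chapters.append((post_url, str(content)))
        return chapters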

Why not define an even simpler format?

It would be possible to come up with a simple microformat that had nice human readable class attributes or something to mark the parts of a TOC page. I didn’t do that because then people will rightly point out that ORE exists and we would end up with a convention that covered a subset of the existing spec, making it harder for tool makers to cover both and less likely that services will interoperate.

So why not just use general ORE and RDFa?

There are several reasons:

  • Tool support is extremely limited for client and server side processing of full RDFa, for example in supporting the way namespaces are handled in RDFa using CURIES. (Sam Adams has pointed out that it would be a lot easier to debug my code if I did use CURIES and RDFa 1.0 so I followed his advice, did some search and replacing and checked that the work I am doing here is indeed ORE compliant).

  • The ORE spec is suited only for experienced developers with a lot of patience for complexities like the difference between an aggregation and a resource map.

  • RDFa needs to apply to a whole page, with the correct document type and that’s not always possible to do when we’re dealing with systems like WordPress. The convention approach means you can at least produce something that can become proper RDFa if put into the right context.

Why not use RSS/Atom feeds?

Another way to approach this would be to use a feed, in RSS or Atom format. WordPress has good support for feeds – there’s one for just about everything. So you can look at all the posts on my website:

http://ptsefton.com/category/uncategorized/feed/atom

or use Tony Hirst’s approach to fetch a single post from the jiscPUB blog:

http://jiscpub.blogs.edina.ac.uk/2011/05/23/a-view-from-academia-on-digital-humanities/feed/?withoutcomments=1

The nice thing about this single post technique is that it gives you just the content in a content element so there is no screen scraping involved. The problem is that the site has to be set up to provide full HTML versions of all posts in its feeds or you only get a summary. There’s a problem with using feeds on categories too, I believe, in that there is an upper limit to how many posts a WordPress site will serve. The site admin can change that to a larger number but then that will affect subscribers to the general purpose feeds as well. They probably don’t want to see three hundred posts in Google Reader when they sign up to a new blog.

Given that Atom (the best standardised and most modern feed format) is one of the official serialisation formats for ORE it is probably worth revisiting this question later if someone, such as JISC, decides to invest more in this kind of web-to-ebook-compiling application.

What next?

There are some obvious things that could be done to further this work:

  • Set up a more complete and robust book server which builds and rebuilds books from particular sites and distributes them in some way, using Open Publication Distribution System (OPDS) or something like this thing that sends stuff to your Kindle.

  • Write a ‘recipe factory’. With a little more work the Scholarly HTML recipe can be got to the point where the only required variable is a single page URL – everything else can be harvested from the page or over-ridden by the recipe.

  • Combine the above to make a WordPress plugin that can create EPUBs from collections of in-built content (tricky because of the present Calibre dependency, but it could be re-coded in PHP).

  • Add the same ScholarlyHTML convention for ORE to other web systems such as the Digress.it plugin and Anthologize. Anthologize is appealing because it allows you to order resources in ‘projects’ and nest them into ‘parts’ rather than being based on simple queries but at the moment it does not actually have a way to publish a project directly to the web.

  • Explore the same technique in the next phase of WorkPackage 3 when I return to looking at word processing tools and examine how cloud replication services like DropBox might help people to manage book-like projects that consist of multiple parts.

Postscript: Lessons and things that need fixing or investigating

I encountered some issues. Some of these are mentioned above but I wanted to list them here as fodder for potential new projects.

  • As with Anthologize, if you use the WordPress RSS importer to bring-in content it does not change the links between posts so they point to the new location. Likewise with importing a WordPress export file.

  • The RSS importer applied to the thesis created hundreds of blank categories.

  • I tried to add my ktoc plugin to a Digress.it site, but ran into problems. It uses PHP’s simplexml parser which chokes on what I am convinced is perfectly valid XML in unpredictable ways. And the default Digress.it configuration expects posts to be formatted in a particular way as a list of top-level paragraphs, rather than with nested divs. I will follow this up with the developers.

  • Calibre does a pretty good job of taking HTML and making it into EPUBs but it does have its issues. I will work through these on the relevant forums as time permits.

    • There are some encoding problems with the table of contents in some places. Might be an issue with my coding in the recipes.

    • Unlike other Calibre workflows, such as creating books from raw HTML, ebook-convert adds navigation to each HTML page in the book created by a recipe. This navigation is redundant in an EPUB, but apparently it would require a source code change to get rid of it.

    • It does something complicated to give each book its style information. There are some odd presentation glitches in the samples as a result of Calibre’s algorithms. This requires more investigation.

    • It doesn’t find local links between parts of a book (ie links from one post to another which occur a lot in my work and in Tony’s course), but I have coded around that in the Scholarly HTML recipes.

It will be up to Theo Andrew, the project manager, whether any of these next steps or issues get any attention during the rest of this project.

Copyright Peter Sefton, 2011-05-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Anthologize: a WordPress based collection tool

In this post I’ll look at Anthologize. Anthologize lets you write or import content into a WordPress instance, organise the ‘parts’ of your ‘project’ and publish to PDF or EPUB, HTML or into TEI XML format. This is what I referred to in my last post about WordPress as an aggregation platform.

Anthologize background and use-cases

Anthologize was created in an interesting way. It is the (unfinished as yet) outcome of a one-week workshop conducted at the Centre for History and New Media – the same group that brought us Zotero and Omeka, which is one good reason to take it seriously. They produce very high quality software.

Anthologize is a project of One Week | One Tool a project of the Center for History and New Media, George Mason University. Funding provided by the National Endowment for the Humanities. © 2010, Center for History and New Media. For more information, contact infoATanthologizeDOTorg. Follow @anthologize.

Anthologize is a WordPress plugin that adds import and organisation features to WordPress. You can author posts and pages as normal, or you can import anything with an RSS/Atom feed. The imported documents don’t seem to be able to be published for others to view, but you can edit them locally. This could be useful – but it introduces a whole lot of management issues around provenance and version control. When you import a post from somewhere else, the images stay on the other site, so you have a partial copy of the work with references back to a different site. I can see some potential problems with that if other sites go offline or change.

Let’s remind ourselves about the use-cases in workpackage 3:

The three main use cases identified in the current plan, and a fourth proposed one: [labels a–d added for this post]

  a. Postgrad serializing PhD (or conference paper etc.) for mobile devices
  b. Retiring academic publishing their ‘best-of’ research (books)
  c. Present final report as EPUB
  d. Publish course materials as an eBook (extra use-case proposed by Sefton)

http://jiscpub.blogs.edina.ac.uk/2011/03/03/workpackage-3/

Many documents like (a) theses or (c) reports are likely to be written as monolithic documents in the first place, so it would be a bit strange to write, say, a report in Word, LaTeX or asciidoc (which is how I think Liza Daly will go about writing the landscape paper for this project), export that as a bunch of WordPress posts for dissemination, then reprocess it back into an Anthologize project, and then to EPUB. There’s much more to go wrong with that, and more information to be lost, than going straight from the source document to EPUB. It is conceivable that this would be a good tool for thesis by publication, where the publications were available as HTML that could be fed or pasted into WordPress.

I do see some potential with (d) courseware here – it seems to me that it might make sense to author course materials in a blog-post-like way, covering topics one by one. I have put some feelers out for someone who might like to test publishing course materials, without spending too much of this project’s time, as this is not one of the core use cases. If anyone wants to try this, or can point me to some suitable open materials somewhere with categories and feeds I can use, then I will give it a go.

There is also some potential with (c), project reports, particularly if anyone takes up the JiscPress way of doing things and creates their project outputs directly in WordPress+digress.it. It would also be ideal for compiling stuff that happens on the project blog as a supporting appendix. So an EPUB that gathers together, say, all the blog posts I have made on WorkPackage 3, or the whole of the jiscPUB blog, might make sense. These could be distributed to JISC and stakeholders as EPUB documents to read on the train, or deposited in a repository.

The retiring academic (b) (or any academic really) might want to make use of Anthologize too – particularly if they’ve been publishing online. If not, they could paste their works into WordPress as posts, and deal with the HTML conversion issues inherent in that, or try to post from Word to WordPress. The test project I chose was to convert the blog posts I have done for jiscPUB into an EPUB book. That’s use case (c), more or less.

 

How did the experiment go?

I have documented the basic process of creating an EPUB using Anthologize below, with lots of screenshots, but here is a summary of the outcomes.

Some things went really well.

  • Using the control panel at my web host I was able to set up a new WordPress website on my domain, add the Anthologize plugin and make my first EPUB in well under an hour. (But as usual, it takes a lot longer to back-track and investigate and try different options, and read the Google group to see if bugs have been reported, and so on.)
  • The application is easy to install and easy to use – with some issues I note below.
  • Importing a feed just works if you search to find out how to do it on a standard WordPress host (although I think there might be issues trying to get large amounts of content if the source does not include everything in the feed).
  • Creating parts and dragging in content is simple.
  • Anthologize looks good.

The good looks and simple interface are deceptive: lots of functionality I was expecting to be there just isn’t – yet. I have been in contact with the developers and noted my biggest concerns, but here’s a list of the major issues I see with the product at this stage of its development:

  • There does not seem to be a way to publish the project (or the imported docs) directly to the web, rather than exporting it. Adding that seems like an obvious win. I can see that being really useful with Digress.it, for one thing. The other big win there would be if the Table of Contents could have some semantics embedded in it so it could act like an ORE resource map – meaning that machines would be able to interpret the content. (I will come back to this idea soon with a demo of using Calibre to make an EPUB.)
  • There are no TOC entries for the posts within a ‘part’; that is, if you pull in a lot of WordPress posts, they don’t get individual entries in the EPUB TOC.
  • Links, even internal ones like the table of contents links on my posts, all point back to the original post. This makes packaging stuff up much less useful – you’d need to be online, and you lose the context of an intra-linked resource. This is a known problem, and the developers say they are going to fix it.
  • A potential problem is the way Anthologize EPUB export puts all the HTML content for the whole project into one HTML file – I gather from poking around with Calibre and other tools that many ebook readers need their content chunked into multiple files.
  • There’s a wizard for exporting your EPUB, and you can enter some metadata and choose some options – all of which is immediately forgotten by the application, so if you do it again, you have to re-enter all the information.
  • Epubcheck complains about the test book I made:
    • It says the mimetype (a simple file that MUST be present in every EPUB) is wrong – the content of the file looks OK to me, so the problem may be in how it is packed into the zip (see the note after this list).
    • It complains about the XHTML containing stuff from the TEI namespace and a few other things.
  • Finally, PDF export fails on my blog with a timeout error – but that’s not an issue for this investigation.
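On the mimetype complaint: I have not confirmed that this is what is wrong with the Anthologize output, but the usual cause is not the file’s content – the EPUB spec requires the mimetype entry to be the very first file in the zip and to be stored uncompressed. Repacking an EPUB to meet that rule takes only a few lines of Python; a minimal sketch (not Anthologize code):

    import zipfile

    def repack_epub(src, dest):
        # 'mimetype' must be the first entry and stored uncompressed;
        # everything else can be deflated as usual.
        zin = zipfile.ZipFile(src)
        zout = zipfile.ZipFile(dest, 'w')
        zout.writestr('mimetype', 'application/epub+zip',
                      compress_type=zipfile.ZIP_STORED)
        for name in zin.namelist():
            if name != 'mimetype':
                zout.writestr(name, zin.read(name),
                              compress_type=zipfile.ZIP_DEFLATED)
        zout.close()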

Summary

For the use case of bundling together a bunch of blog posts (or anything that has a feed) into a curated whole, Anthologize is a promising application, but unless your needs are very simple it’s probably not quite ready for production use. I spent a bit of time looking at it though, as it shows great promise and comes from a good stable.

Here’s the result I got importing the first handful of posts from my work on this project.

Illustration 1: The test book in Adobe Digital Editions – note some encoding problems (bottom right) and the lack of depth in the table of contents. There are several posts but no way to navigate to them. Also, clicking on those table of contents links takes you back to the jiscPUB blog, not to the heading.

Walk through

 

Illustration 2: Anthologize uses ‘projects’. These are aggregated resources; in many cases they will be books, but ‘project’ seems like a nice media-neutral term.

 

Illustration 3: A new project in a fresh WordPress install – only two things can be added to it until you write or import some content.

 

 

Illustration 4: Importing the feed for workpackage 3 in the jiscPUB project. http://jiscpub.blogs.edina.ac.uk/category/workpackage-3/feed/atom/

 

 

 

 

Illustration 5: You can select which things to keep from the feed. Ordering is done later. Remember that imported documents are copies, so there is potential for confusion if you edit them in Anthologize.

 

Illustration 6: Exporting content is via a wizard, easy to use but frustrating because it asks some of the same questions every time you export.

 

Illustration 7: Having to retype the export information is a real problem, as you can only export one format at a time. Exported material is not stored in the WordPress site either; it is downloaded, so there is no audit trail of versions.

 

 

 

Copyright Peter Sefton, 2011-05-04. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

 

 


WordPress

Introduction

So far in the jiscPUB project I have been looking at word processing applications and EPUB, as well as how repositories and other web applications might support EPUB document production. One of the tasks in workpackage 3 is to look at WordPress as an example of an online tool that’s being used quite a bit in academia for both writing and publishing.

The three main use cases identified in the current plan, and a fourth proposed one: [labels a–d added for this post]

  a. Postgrad serializing PhD (or conference paper etc.) for mobile devices
  b. Retiring academic publishing their ‘best-of’ research (books)
  c. Present final report as EPUB
  d. Publish course materials as an eBook (extra use-case proposed by Sefton)

The next few posts will explore web-based authoring and publishing with a focus on WordPress, and how they relate to packaging content as electronic books.

WordPress can be used in a number of different ways. For this project I am thinking of it as:

  • A publishing platform.
  • A collaboration platform.
  • A content aggregation platform.
  • An authoring environment where people might write academic content. (I put this last, because I think it’s the most controversial).

All of these overlap, and the same installation of WP might be doing all or none, as might other content management systems being used in academia.

In future posts I’m going to look at building ebooks via aggregation using the Anthologize plugin, at an alternative way of building EPUB books from lists of WordPress posts using Calibre, and at Martin Fenner’s EPUB plugin for WordPress. In this post I will look at some of the issues around WordPress as used in a couple of projects related to this one, particularly JISC-funded or JISC-friendly work. This is not a survey of how WordPress is being used in academia everywhere – there’s no time for that. Please use the comments below if I’ve missed something that’s important to this project.

At the moment, I am thinking that the most compelling match-ups between the use cases for this project and what is being done with WordPress are these:

  • b: Retiring academic publishing their ‘best-of’ research (books): not so much books but using a tool like Anthologize to draw together papers or other documents.
  • d: Publish course materials as an eBook (extra use-case proposed by Sefton): I see great potential for tools like Anthologize as a way of compiling reading packages from web resources and packaging them to take away on mobile devices, likewise for conference proceedings and programs and other aggregated documents.

And possibly, where people are using JiscPress, use-case c: Present final report as EPUB.

Publishing platform

A great example of using a blogging platform for scholarship is the KnowledgeBlog project:

We are investigating a new, light-weight way of publishing scientific, academic and technical knowledge on the web. Currently, Knowledge Blog is being funded by a JISC grant.

And the sites it has under its wing.

KnowledgeBlog uses the WordPress platform to publish articles and for article review, and serves as a live example of a new mode of scholarship. It’s a publisher, but not as we know it.

A new entrant in the WordPress-backed publishing space (and in the authoring space) is Annotum, which has not released any code yet but has very lofty ambitions. I’ll come back to Annotum below.

An aggregation platform – bringing together content from elsewhere.

I’ll cover this in my next post, looking at Anthologize, which is a promising but immature tool for pulling together stuff from multiple sources and/or authoring it locally, then grouping it with a customized table of contents and publishing to a variety of media.

An authoring platform

It has to be said that WordPress as an editor gets some bad press from time to time. Phillip Lord at KnowledgeBlog advises against using it for authoring.

WordPress is not an authoring environment

http://www.knowledgeblog.org is hosted using WordPress. It’s a very good tool in many ways, but it was intended for and is most suited for use as a publishing tool; most blogs are written by single authors who wish to place their thoughts on the web either for authors or themselves to be able to read. It is not an authoring tool, however. It does not provide a particularly rich environment for editing, and particularly not for collaborative editing. Most people get tired of the wordpress authoring tool very quickly, as it’s just not suited for serious scientific authoring. Nor does it provide good facilities for collaborative editing; normally, only one person can see a draft post, so you cannot pass this around between several authors.

http://process.knowledgeblog.org/3

The KnowledgeBlog site encourages people to use their current authoring tools and treat the KnowledgeBlog WordPress platform as a publishing and review system.

Others are more positive about WordPress as an editor. Martin Fenner, for example, is a tireless promoter of the practice. And the Digress.it help recommends using WordPress to create content from scratch, the opposite of the advice coming from KnowledgeBlog:

We recommend using the WordPress editor directly for a number of reasons:

  • Multiple authors can easily collaborate on a single document;
  • A complete revision history of the document is maintained with the ability to roll-back to earlier versions;
  • This method produces a web-ready document, native to WordPress, and avoids the two-stage process of ‘re-publishing’ on your Digress.it site; and
  • You can easily embed video and other objects.

And then there’s Annotum. The site says:

Annotum will build upon the WordPress platform as a foundation, filling in the gaps by providing the following additional features:

  • Rich, web-based authoring and editing:
    • “What you see is what you get” (WYSIWYG) authoring with rich toolset (equations, figures, tables, citations and references)
    • coauthoring, comments, version tracking, and revision comparisons
    • Strict conformance to a subset of the NLM journal article publishing tag set

And a long list of other features. There is no code to show yet, though.

Collaboration platform

Others are seeing WordPress as a place for collaborative authoring and editing. Annotum promises this on a grand scale. For those who would like to get started, Martin Fenner listed some resources late last year:

The Co-Authors Plus Plugin enables multiple authors per article. Each author can be linked to an author page for displaying biographical info. WordPress could be extended to include additional info such as institution or past publications. Linking the WordPress user account to the unique author identifier ORCID, and describing the role of the author in the paper (e.g. conceived and designed the experiments or analyzed the data) would be particularly interesting. Plugins such as Edit Flow can extend the workflow by adding custom status messages (e.g. resubmission), reviewer comments, and email notifications.

http://blogs.plos.org/mfenner/2010/12/05/blogging-beyond-the-pdf/

Post-publication collaboration is handled by a WordPress tool that’s been a hit in the UK, and with JISC. Digress.it is a tool for public annotation and discussion of long-form documents. The JISC incarnation is at jiscpress.org. Digress.it is related to Commentpress. (They’re different things, although sometimes confused with each other – at least by me. See them compared here.)

For a JiscPress example see this document, which has a number of comments.

Issues

Some issues I have observed with WordPress in the past include the problems with its authoring environment, covered above, but there are also a number of other considerations.

There is the WordPress version of Microsoft’s “DLL hell” – call it “plugin hell” – many WordPress plugins and/or themes interact with each other in unpredictable ways. I found this out first hand, trying to show off some work my team at USQ had done on an annotation system. It worked (with bugs) in a plain WordPress site, but failed completely in Martin Fenner’s demo site, where there are many other plugins installed. I never got to the bottom of that. Plugins also go out of sync with WordPress as it evolves, so a site with lots of plugins can be hard to maintain; this is also the case with systems like Drupal, which have their own enthusiastic following.

Some of the above systems require the content management system to be used in very particular ways – for example, Digress.it treats each document as a new WordPress site and asks you to upload posts in a particular order so that the Table of Contents for the site looks right. There are two issues with this kind of approach. I’m not saying that people are not already aware of these issues, but noting that they are there:

  • There’s sometimes a fair bit of overhead involved in setting things up just so. Sometimes it would make sense to automate some of the processes. Other times a re-think to reduce complexity might be in order.
  • There is a risk of creating a new form of the proprietary lock-in we had up until recently (and arguably still have) with document formats like Microsoft’s .doc. The documents we create in some of these systems may end up being unusable in other systems. If you author a long document in Digress.it and depend on a particular configuration of WordPress, having posts in a certain order and so on, for the document’s integrity, then it is essential to consider an exit strategy and an archiving strategy (more on that soon – an EPUB export might be just the ticket).

    There are similar issues and risks with WordPress shortcodes such as KCite from KnowledgeBlog. It’s a great tool for authors, allowing them to cite things in a rational way:

    DOI example – [cite source='doi']10.1021/jf904082b[/cite]

    PMID example – [cite source='pubmed']17237047[/cite]

    But it’s proprietary to a particular processing environment. If one wants to be able to re-use these documents or archive them, then it is important to consider which version of the documents in WordPress to keep. (I’d argue that in this case best practice would be to transform the above to an RDFa representation in HTML and treat the HTML version as the version of record – more on this later in the project; a sketch of what I mean follows below.)
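To make that concrete, here is a sketch of the kind of shortcode-to-RDFa transformation I have in mind. This is not something KCite or KnowledgeBlog provides; the CiTO ‘cites’ property and the resolver URLs are just one possible mapping.

    import re

    CITE = re.compile(r"\[cite source=.(doi|pubmed).\](.*?)\[/cite\]")

    def resolve(source, ident):
        if source == 'doi':
            return 'http://dx.doi.org/' + ident
        return 'http://www.ncbi.nlm.nih.gov/pubmed/' + ident

    def shortcodes_to_rdfa(html):
        # Replace each KCite shortcode with a plain link carrying a CiTO
        # 'cites' property, so the citation survives outside WordPress.
        # The '.' in the pattern allows straight or curly quotes.
        def replacement(match):
            source, ident = match.group(1), match.group(2)
            return ('<a rel="http://purl.org/spar/cito/cites" href="%s">%s</a>'
                    % (resolve(source, ident), ident))
        return CITE.sub(replacement, html)

Run over the post HTML before archiving, this produces ordinary links that any later system can interpret, whether or not it knows about KCite.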

All this adds up to saying that WordPress + plugins can be fragile – the application itself needs to be updated frequently for security reasons, and so does the operating system underneath, and inevitably stuff breaks. The more complex the plugin set, and the further you stray from straight WordPress, the worse the risk. Even on simple sites there can be issues. For example, one of the WordPress sites I use regularly currently has a bug with remote publishing via AtomPub and XML-RPC. One day it was working, and the next all my attempts to post from the tools I use every day (as per the best-practice advice from the KnowledgeBlog people) came through stripped of the characters < and > in the document source, both of which are obviously essential to the web.

For those interested in learning more about WordPress for scholarship, there’s a Google Group called WordPress for Scientists that is worth joining even if you are not a scientist, and a test site that Martin Fenner has set up for trying out WordPress plugins.

Copyright Peter Sefton, 2011-05-09. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.


How to add EPUB support to EPrints

In a previous post here on the jiscPUB project I said it would be good for the EPrints repository software to support EPUB uploads.

I’d love to do something with a repository – I’m thinking that it would be great to deposit theses in EPUB format – and the repository could provide a web-based reader, along the lines of IbisReader, which Liza Daly and company created. I’m looking at you, EPrints! EPrints already almost supports this: if you upload a zip file it will stash all the parts for you in a single record. All we would need would be something like this little reader my colleagues at USQ made. It would just be a matter of transforming the EPUB TOC into JSON, and loading the JavaScript into an EPrints page.

I called Les Carr’s attention to the post and he responded:

lescarr @ptsefton just tell us what to do and we’ll do it.

OK. Here goes with my specification for how EPrints could add at least basic support for EPUB.

Putting EPUB into EPrints as-is

To explore this, I ran the EPrints live CD (livecd_v3.1-x.iso) under VirtualBox on Windows 7 – this worked well when I gave it a decent amount of memory; it didn’t manage to boot in several hours at 256MB. (Note that no repositories were harmed in the making of this post – I did not change the EPrints code at all.)

The EPUB format is a zipfile containing some XHTML payload documents, a manifest, and a table of contents. On one level EPrints already supports this, in that there is support for uploading ZIP files. I tested this using Danny Kingsley’s thesis (as received, with no massaging or adding of metadata apart from tweaking the title in Word) converted to EPUB via the ICE service I have been working on.

The procedure:

  1. Generated an EPUB using ICE.
  2. Changed the file extension to .zip.
  3. Uploaded it into EPrints.

The result is an EPrints item with many parts. If you click on any of the HTML files that make up the thesis, they work as web pages – i.e. the table of contents (if you can find it amongst the many files) links to the other pages. But there is no navigation to tie it all together – you have to keep hitting back; each HTML page from the EPUB is a stand-alone fragment.

 


Illustration 1: The management interface in EPrints showing all the parts of an EPUB file which has been uploaded and saved as a series of parts in a single record.

 

At this point I went off on a side trip and wrote this little tool to add an HTML view to an EPUB file.

Putting an enhanced EPUB into EPrints

Now, let’s try that again with the version where I added an HTML index page to the EPUB using the new demo tool, epub2html. I uploaded the file, clicked around semi-randomly until I figured out how to see all the files listed from the zip, and selected index.html as the ‘main’ file. From memory, I thought the repository would do that for me, but it didn’t. Anyway, I ended up with this:

 


Illustration 2: The details screen that users see – clicking on the description takes you to the HTML page I picked as the main file.

 

 


Illustration 3: A rudimentary ebook reader using an inline frame.

If I click on the link starting with Other, there we have it – more-or-less working navigation within the limits of this demo-quality software. All I had to do was change the extension from .epub to .zip and select the entry page, and I had a working, navigable document.

The initial version of epub2html used the unsupported epubjs as a web-based reader application – but Liza Daly suggested I use the more up-to-date Monocle.js library instead. I tried that, but I’m afraid the amount of setup required is too much for the moment, so what you see here is an HTML page with an inline frame for the content.

What does the repository need to do?

So what does the EPrints team need to do to support EPUB a bit better?

  • Add EPUB to the list of recognised files.
  • Upon recognising an EPUB…
    • Use a service like epub2html that can generate an HTML view of the EPUB. I wrote mine in Python; EPrints is written in Perl, but I’m sure that can be sorted out via a rewrite or a web service or something*. (A sketch of what such a service might do with the EPUB’s table of contents follows this list.)
    • Allow the user to download the whole EPUB, or choose to use an online viewer. Could be static HTML, frames (not nice), or some kind of JavaScript based viewer.
    • Embed some kind of viewer in the EPrints page itself, or at least provide a back-link in the document viewer to the EPrints page.
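To flesh out the ‘transform the EPUB TOC into JSON’ idea quoted above, here is a minimal sketch, assuming a typical EPUB 2 package with a toc.ncx. This is not EPrints code, and a real implementation would need more error handling.

    import json
    import zipfile
    import xml.etree.ElementTree as ET

    CNT = '{urn:oasis:names:tc:opendocument:xmlns:container}'
    OPF = '{http://www.idpf.org/2007/opf}'
    NCX = '{http://www.daisy.org/z3986/2005/ncx/}'

    def toc_as_json(epub_path):
        z = zipfile.ZipFile(epub_path)
        # META-INF/container.xml points at the OPF package file
        container = ET.fromstring(z.read('META-INF/container.xml'))
        opf_path = container.find('.//' + CNT + 'rootfile').get('full-path')
        opf = ET.fromstring(z.read(opf_path))
        base = opf_path.rsplit('/', 1)[0] + '/' if '/' in opf_path else ''
        # The manifest item with the NCX media type is the table of contents
        ncx_href = None
        for item in opf.find(OPF + 'manifest'):
            if item.get('media-type') == 'application/x-dtbncx+xml':
                ncx_href = item.get('href')
        ncx = ET.fromstring(z.read(base + ncx_href))
        entries = []
        for nav in ncx.iter(NCX + 'navPoint'):
            title = nav.find(NCX + 'navLabel/' + NCX + 'text').text
            src = nav.find(NCX + 'content').get('src')
            entries.append({'title': title, 'href': base + src})
        return json.dumps(entries, indent=2)

A JavaScript reader embedded in the EPrints summary page could then fetch that JSON and build its navigation from it.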

Does that make sense, Les?

Copyright Peter Sefton, 2011-04-15. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.


 

* Maybe there’s a Python interpreter written in Perl?

 

Introducing Epub2Html – adding a plain HTML view to an EPUB

Background

EPUB ebook files are useful if you have an application to read them, but not everyone does. We have been discussing this in the Scholarly HTML movement; to some of us EPUB looks like a good general-purpose packaging format for scholarship. Not just for HTML (if you can make it XHTML, that is) but potentially for other stuff that makes up a research object, such as data files or provenance information. One of the big problems, though, is that the format is still not that widely known; what is a researcher to do when they are given a file ending in .epub? That question remains unresolved at the moment, but in this post I will talk about one small step towards making EPUB potentially more useful in the general academic community.

This week, I was looking at the potential for EPUB support in repositories, which I will cover in my next post. An EPUB is full of HTML, but it’s not something that is necessarily straightforward to display on the web. jiscPUB colleague Liza Daly’s company has a product called IbisReader that serves EPUB over the web, and also worked on BookWorm, parts of which are available as open source.

What I wanted was a bit different – I wanted to be able to add something equivalent to a README file to an EPUB, so that people could read the content and web site or repository managers could do something with it. So I wrote a small tool, intended as a demonstrator only, which:

  • Generates a plain HTML table of contents.
  • Adds an index.html page to the root of an EPUB (this is legitimate – it gets added to the manifest as well, though not the TOC) with a simple frame-based navigation system, so if you can open the EPUB zip, you can browse it. (A sketch of the manifest step follows this list.)
  • Bundles in a lightweight JavaScript viewer. Initially I tried the Paquete system from USQ, but it turned out to have a few more issues than I had hoped. For this first release I have used a bit of Liza’s code from a couple of years ago, epubjs, with a couple of modifications. Status? Works for me.
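The actual code is in the Google Code branch linked below; as a sketch, the manifest step boils down to something like this (assuming the OPF file has already been located via META-INF/container.xml, and that the rewritten zip keeps the mimetype entry first and uncompressed):

    import xml.etree.ElementTree as ET

    OPF_NS = 'http://www.idpf.org/2007/opf'

    def register_index(opf_xml):
        # Add a manifest entry for the generated index.html so the EPUB
        # stays valid; the id 'htmlview' is arbitrary.
        ET.register_namespace('', OPF_NS)
        opf = ET.fromstring(opf_xml)
        manifest = opf.find('{%s}manifest' % OPF_NS)
        item = ET.SubElement(manifest, '{%s}item' % OPF_NS)
        item.set('id', 'htmlview')
        item.set('href', 'index.html')
        item.set('media-type', 'application/xhtml+xml')
        return ET.tostring(opf)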

Demo

So here’s what it looks like in real life, warts and all.

I used the test file I was working on earlier in the week with embedded metadata.

Illustration 1: Test EPUB from the Edinburgh thesis template, with added metadata, in Adobe Digital Editions

I ran the new code:

python epub2html.py Edinburgh-ThesisSingleSided-plus-inline-metadata.epub

This made a new file. (It does make epubcheck complain, but that’s mostly to do with HTML attributes it doesn’t like, not EPUB structural problems.)

Edinburgh-ThesisSingleSided-plus-inline-metadata-html.epub

Now, if I unzip it, there is an index.html and some JavaScript from epubjs. In Firefox that looks like this:

 

Illustration 2: HTML view of the EPUB being served from the file system, using epubjs for navigation

But if the JavaScript is not working, you can still see the content, courtesy of the less-than-ideal inline frame:

Illustration 3: Fall-back to plain HTML with no JavaScript: the index.html file has an inline frame for the EPUB content. Not elegant, but it lets the content be seen.

Trying it out / the future

If you want to try this out, or help out, you can get the tool from Google Code:

svn co https://integrated-content-environment.googlecode.com/svn/branches/temp-2011/epub2html

There are lots of things to do, like adding command-line options for output files, extracting the EPUB+HTML for immediate use (after safety-checking it), and choosing whether to bundle the JavaScript in the EPUB or link to it via the web. Does anyone want this? Let us know.

One of the things I like about Paquete is that it generates # URLs for the different pages you view, making it possible to bookmark chapters like this: http://demo.adfi.usq.edu.au/paquete/demo/#configuration.htm. I will explore whether this can be added to epubjs, or whether it is worth pressing on with Paquete, which does have some more options like navigation buttons and a tree widget for the table of contents.

Like I said, I did this as part of the notes I was putting together on how repositories might support EPUB and maybe, finally, start serving real web content rather than exclusively PDF; more on that soon.

This approach might also help us add previews to web services so people can see their content in ereader mode, something I know David Flanders, the JISC manager on this project, is keen on.

And finally, something like this approach might be part of a tool-chain that could help people break long documents up into parts, package them in EPUB, and upload them to services like http://digress.it, which want things broken up into parts.
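The splitting step itself could be quite simple. A rough sketch that chops an HTML body at each top-level heading (a proper tool would use a real HTML parser and keep the heading text for the table of contents):

    import re

    def split_at_headings(body_html):
        # Crude split: start a new part at every <h1>. Each part would become
        # one post on digress.it, or one chapter file inside an EPUB.
        parts = re.split(r'(?i)(?=<h1[ >])', body_html)
        return [p for p in parts if p.strip()]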

Copyright Peter Sefton, 2011-04-14. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.