I am again at the IIPC WAC / RESAW Conference 2017 and, for today, I am in the following session:
Tools for web archives analysis & record extraction (chair Nicholas Taylor)
Digging documents out of the archived web – Andrew Jackson
This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…
- staff looked in an Outlook calendar for reminders
- looked for new updates since last check
- download each to local folder and open
- check catalogue to avoid re-submitting
- upload to internal submission portal
- add essential metadata
- submit for ingest
- clean up local files
- update stats sheet
- Then ingest usually automated (but can require intervention)
- Updates catalogue once complete
- New catalogue records processed or enhanced as necessary.
It was very manual, and very inefficient… So we have created a harvester:
- Setup: specify “watched targets” then…
- Harvest (harvester crawls targets as usual) -> Ingested… but also…
- Document extraction:
- spot documents in the crawl
- find landing page
- extract machine-readable metadata
- submit to W3ACT (curation tool) for review
- check document harvester for new publications
- edit essential metadata
- submit to catalogue
- cataloguing records processed as necessary
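As a rough sketch of that extraction step – my own illustration in Python with invented field names, not the actual harvester code:

```python
from urllib.parse import urlparse

def extract_documents(crawl_log_entries):
    """Spot likely documents in a crawl and pair them with a landing page.

    crawl_log_entries: iterable of dicts with 'url', 'mime', 'referrer'
    (a simplified stand-in for real crawl log records).
    """
    for entry in crawl_log_entries:
        if entry.get("mime") != "application/pdf":
            continue
        # The page that linked to the PDF is our best guess at the
        # landing page carrying the human-readable metadata.
        landing_page = entry.get("referrer")
        yield {
            "document_url": entry["url"],
            "landing_page_url": landing_page,
            "source_host": urlparse(entry["url"]).netloc,
            "title": None,   # to be filled in by the metadata extractors
            "status": "PENDING_REVIEW",  # awaiting curator sign-off in W3ACT
        }
```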
This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, that can frustrate technical folk… You take data and text from the page – honouring what is there rather than normalising it… We can dishonour intent by how we capture the pages… It is challenging…
MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…
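To see why it really is just encoding: a MARC (ISO 2709) record is a 24-byte leader, a directory of fixed-width entries, then the field data. A minimal sketch – my illustration, ignoring subfield parsing and repeated tags:

```python
FIELD_TERMINATOR = b"\x1e"

def parse_marc(record: bytes):
    """Decode one ISO 2709 / MARC record: a 24-byte leader, a directory
    of 12-byte entries (tag, length, offset), then the field data."""
    leader = record[:24]
    base_address = int(leader[12:17].decode("ascii"))
    directory = record[24:base_address - 1]  # trailing field terminator dropped
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base_address + start:base_address + start + length]
        # Repeated tags overwrite each other here -- a simplification.
        fields[tag] = data.rstrip(FIELD_TERMINATOR).decode("utf-8")
    return leader, fields
```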
One of the intentions of the metadata extraction work was to provide an initial guess at the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that authors’ names etc. in the document metadata are rarely correct. We start with the weakest extractor and layer better ones on top, so we have the best shot. What works best is extracting from the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.
What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, and it’s worth coding against as gov.uk is a big publisher for us.
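For example, a minimal sketch of pulling metadata from the GOV.UK Content API – my illustration; the field names should be checked against the live responses:

```python
import requests

def govuk_metadata(path: str) -> dict:
    """Fetch structured metadata for a gov.uk page from its Content API.

    The field names used here ('title', 'description', 'first_published_at',
    'document_type') are assumptions to verify against real responses.
    """
    resp = requests.get(
        f"https://www.gov.uk/api/content/{path.lstrip('/')}", timeout=30
    )
    resp.raise_for_status()
    doc = resp.json()
    return {
        "title": doc.get("title"),
        "description": doc.get("description"),
        "published": doc.get("first_published_at"),
        "document_type": doc.get("document_type"),
    }
```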
But now we have to resolve references… Multiple use cases for “records about this record”:
- publisher metadata
- third party data sources (e.g. Wikipedia)
- Our own annotations and catalogues
- Revisit records
We can’t ignore the revisit records… Have to do a great big join at some point… To get the best possible quality data for every single thing…
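A crude sketch of what that join might look like – my own illustration, with an invented source-priority ordering, not the actual pipeline:

```python
from collections import defaultdict

# Lower number = more trusted: our own annotations win over publisher
# metadata, which wins over automated extraction. The exact ordering
# here is an assumption for illustration.
SOURCE_PRIORITY = {"curator": 0, "publisher": 1, "wikipedia": 2, "extracted": 3}

def join_records(records):
    """Group 'records about a record' by URL and keep, for each field,
    the value from the most trusted source that supplies one."""
    by_url = defaultdict(list)
    for rec in records:
        by_url[rec["url"]].append(rec)
    merged = {}
    for url, recs in by_url.items():
        recs.sort(key=lambda r: SOURCE_PRIORITY.get(r["source"], 99))
        best = {}
        for rec in recs:
            for field, value in rec.get("fields", {}).items():
                best.setdefault(field, value)  # first (most trusted) wins
        merged[url] = best
    return merged
```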
And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy, so we will be correcting this…
We do need to do more future experimentation… Multiple workflows bring synchronisation problems. We need to ensure documents are accessible when discoverable. Need to be able to re-run automated extraction.
We want to iteratively improve automated metadata extraction:
- improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
- Bring together different sources
- Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals) – see the sketch below
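For a taste of what a smarter extractor looks like in practice, here is a minimal sketch of calling GROBID’s documented processHeaderDocument service – my illustration, assuming a GROBID instance on its default port, not part of the project’s actual pipeline:

```python
import requests

def grobid_header(pdf_path: str, grobid_url: str = "http://localhost:8070") -> str:
    """Send a PDF to a local GROBID service and get TEI-XML header
    metadata (title, authors, abstract) back."""
    with open(pdf_path, "rb") as pdf:
        resp = requests.post(
            f"{grobid_url}/api/processHeaderDocument",
            files={"input": pdf},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.text  # TEI XML, to be parsed for title/author fields
```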
And we still have that tension between what a publication is… A tension between established practice and publisher output. Need to trial different approaches with cataloguers and users… Close that whole loop.
Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…
A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.
Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…
A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently…
Q2) Geoffrey Bilder also working on this…
A2) And that’s the ideal… To improve the standards more broadly…
Q3) Are these all PDF files?
A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…
Q4) What does the user see at the end of this… Is it a PDF?
A4) This work ends up in our search service, and that metadata helps them find what they are looking for…
Q4) Do they know its from the website, or don’t they care?
A4) Officially, the way the library thinks about monographs and serials would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too…
Q4) For me as an archivist, all that data on where the document is from, what issues there were in accessing it, etc. would be extremely useful…
Q5) You spoke yesterday about engaging with machine learning… Can you say more?
A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high-level challenge so it’s quite amenable to machine learning. We have a massive golden data set… There’s at least a Masters thesis in there, right! And if we built something, then ran it over the 3 million-ish items with little metadata, it could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…
A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.
Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform
Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.
So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most use the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged, you need to be an expert.
So, building this portal is about making this easier to use… We want web archives to be used on page 150 of some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.
We work on workflow… We run workshops… We separated the collections so that postdocs can look at this…
We are using Warcbase (warcbase.org) and command line tools; we transferred data from the Internet Archive and generate checksums; we generate scholarly derivatives – plain text, hypertext graph, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at visualisations of the Canadian political parties and political interest group web crawls, which track changes, although that may include crawler issues.
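As a rough illustration of that derivative step – my own sketch using the Python warcio library, not the Warcbase toolchain the project actually uses:

```python
from warcio.archiveiterator import ArchiveIterator

def plain_text_records(warc_path):
    """Yield (url, body) for HTML responses in a WARC file -- the raw
    material for a plain-text derivative."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()
```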
Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.
Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each other’s work. Last year we presented webarchives.ca. We’ve indexed 10 TB of WARCs since then, representing 200M+ Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.
Then we have also dealt with scaling issues… from a 30–40GB index to a 1TB one. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… Right now it isn’t at production scale, but it’s getting there. We are looking at SolrCloud and potentially using a shard per collection.
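For a flavour of what those facets look like in practice, a minimal Solr query sketch – my illustration; the facet parameters are standard Solr, but the field names (institution, collection_name) are assumptions:

```python
import requests

def collection_facets(solr_url: str):
    """Ask Solr for document counts per institution and collection --
    the extra facets the index needed once it grew past one collection."""
    params = {
        "q": "*:*",
        "rows": 0,               # we only want the facet counts
        "facet": "true",
        "facet.field": ["institution", "collection_name"],
        "wt": "json",
    }
    resp = requests.get(f"{solr_url}/select", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["facet_counts"]["facet_fields"]
```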
Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight, which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called WARCLight. It does almost all that Shine does, except the tree structure and the advanced searching…
Ian spoke about derivative datasets… For each collection, via Blacklight or Scholars Portal, we want domain/URL counts; full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.
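And a trivial sketch of one such derivative, the domain counts – my illustration:

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(urls):
    """How many captured URLs per domain. 'urls' could come from a CDX
    index or from the WARC reader sketched above."""
    return Counter(urlparse(u).netloc for u in urls)

# e.g. domain_counts(["http://example.ca/a", "http://example.ca/b"])
#      -> Counter({"example.ca": 2})
```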
So, that goal Ian talked about: one central hub for archived data and derivatives…
Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?
A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That worked but didn’t look great in the browser. But it would be great if the casual user could look at drag-and-drop R-type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…
A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff, so in due course we may bring that in…
Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…
A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet…
Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…
A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…
Q3) Do you think in a few years’ time…
A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…
Q4) What are some of the organisational, admin and social challenges of building this?
A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging… “Is an institution going to devote a person to this?”
A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs and CSVs, those are familiar formats…
A4 – Nick) And when I get back I’m going to be doing some work and sharing, to enable an actual community to work on this…