Developer's Challenge Ideas

Below are the ideas submitted for the OR2013 Developer's Challenge. These are open for any team to pick up and develop. Extra consideration will be given during judging for entries that build on these ideas:

PDF/Eh?
For the most part, PDF files are reasonably straightforward to preserve for the long term, but there are pitfalls. Certain characteristics can cause longevity issues: for example, unusual but critical fonts that have not been embedded in the file. So how do we know whether our PDFs contain features that put preservation at risk?

PDF/Eh? is a prototype PDF risk identifier that exploits the functionality of the Apache Preflight tool. Preflight can validate any PDF against the PDF/A standard and reports any areas that are not compliant, as well as a range of other possible PDF errors. PDF/Eh? simplifies the output for the ordinary user by providing a configurable XML output that focuses on just the issues that are likely to be of concern for long-term preservation. More details and links to source code can be found here:
http://openplanetsfoundation.org/blogs/2013-03-15-pdf-eh-another-hackath...

This technology has the potential to root out possible preservation problems in collections of PDF files, but how should it be realised in a repository environment? Could it be incorporated into the ingest workflow so that bad PDFs are spotted and rejected (or replaced) before they are ingested? Should it be provided as a back-end tool that lets repository managers scan their entire PDF collection and assess it for preservation risks?
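To make the ingest-workflow question a little more concrete, here is a minimal sketch of a policy check that a repository could run on the issues reported for one PDF. It does not call Preflight or parse the real PDF/Eh? XML output: the `Issue` class, the issue codes and the `RISK_POLICY` table are all hypothetical names invented for illustration, and a real entry would need to map them onto whatever PDF/Eh? actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    code: str          # hypothetical issue identifier, not a real Preflight code
    description: str


# Hypothetical, repository-configurable policy: issue code -> action.
RISK_POLICY = {
    "FONT_NOT_EMBEDDED": "reject",
    "ENCRYPTED": "reject",
    "INVALID_METADATA": "warn",
}


def assess(issues, policy=RISK_POLICY):
    """Return (decision, relevant_issues) for the issues found in one PDF.

    Issues whose code is not in the policy are ignored, which mirrors the
    idea of focusing only on preservation-relevant problems.
    """
    relevant = [issue for issue in issues if issue.code in policy]
    actions = {policy[issue.code] for issue in relevant}
    if "reject" in actions:
        decision = "reject"
    elif "warn" in actions:
        decision = "accept-with-warning"
    else:
        decision = "accept"
    return decision, relevant
```

The same function could serve both scenarios: called once per file at ingest, or in a loop over an existing collection to produce a risk report.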

The SPRUCE Project is making available a prize of £1000 (UK pounds) for the best application of PDF/Eh? in a repository environment. The prize will be awarded to the entry that makes the most practical use of PDF/Eh? for repository managers to utilise at their own organisations.

PDF/Eh? was developed at a hackathon in Yorkshire, UK, and seems perfectly named to be taken forward at a hackfest in Canada!
Leveraging Fedora 4
The general idea is to extend and/or leverage Fedora 4 to achieve novel repository outcomes. The list of ideas includes:
1. Create a sequencer that does <x>
  1. characterization (e.g. DROID)
  2. derivative generation (e.g. thumbnail creation)
  3. normalization (e.g. PDF generation from various sources)
  4. OCR
  5. automated metadata enhancement (e.g. going out to Wikipedia)
  6. metadata crosswalks
2. Use the cluster-aware low-level storage service to do some map-reduce operations on repository content
  1. text mining
  2. word-count
  3. format migration (e.g. JPEG to PNG)
3. fcrepo-admin Hydra head adapted for fcrepo4
4. resurrecting legacy-api
5. workflow demo using workspaces
6. repository reporting statistics, audit, and visualization
7. new jax-rs resources
8. support authentication with an arbitrary external provider (e.g. Google, Facebook)
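
As a starting point for the map-reduce idea in item 2, the word-count case can be sketched in a few lines. This is only an illustration of the map and reduce steps on plain strings; it does not use any Fedora 4 API, and how the text of each object would be read from the cluster-aware storage service is left open. The function names are made up for this example.

```python
import re
from collections import Counter
from functools import reduce


def map_word_count(text):
    """Map step: count the words in a single document."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(documents):
    """Run the map step on every document, then fold the partial results."""
    return reduce(reduce_counts, map(map_word_count, documents), Counter())
```

Because the reduce step is associative, partial counts could be computed on separate nodes and merged in any order, which is what makes this a natural fit for clustered storage.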
Integrate ORCID iDs
I would love to see integration of ORCID iDs into a repository in the spirit of the RFP from ORCID: https://orcid.org/content/rfp-2013-06-orcid-id-adoption-and-integration-....

Some thoughts and questions that might suggest an angle a group could take:

- Can ORCID iDs be used to perform name disambiguation or collocation within the IR?
- Can the iD be clickable so that it goes to the researcher's public ORCID page?
- Can researchers add the ORCID iDs of co-authors as well?
- What role does an ORCID iD play in the authentication process?
- Can we reuse any part of ORCID records to populate a repository, or at least its name fields?

There are probably many others!
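
One small building block for several of the questions above is checking that a string really is a well-formed ORCID iD before storing or linking it. The sketch below assumes the ISO 7064 MOD 11-2 check digit that ORCID iDs use and the `https://orcid.org/` public URL form; the function names are invented for this example, and the HTML snippet is just one way a repository might render the link.

```python
import re

ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the iD has the right shape and a correct check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def orcid_link(orcid):
    """Render a clickable link to the public ORCID page for a valid iD."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid!r}")
    return f'<a href="https://orcid.org/{orcid}">{orcid}</a>'
```

A check like this only confirms that the iD is well formed; it says nothing about whether the iD belongs to the person entering it, which is where the authentication question comes in.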
Open Repos Conference Repository
The 2013 Conference team has been provided with the archive of previous OR presentations. Our idea is to create an initial repository in a system such as Islandora and then build an "auto-sync tool" that syncs this data, and any new material, with repositories running DSpace, EPrints, Hydra and any other appropriate system. A "Hylandora" system that uses a single Fedora with two front ends is also an option. The goal is a single canonical repository that receives material and is tied to the additional systems, so that identical data is simultaneously accessible in all of them and can be used to demonstrate the functionality of the core open source repository systems. Ideally, an ingest in the canonical repository triggers a sync on the others, including updates and deletions.
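
The core of such an auto-sync tool is working out what has to change on each mirror. Here is a minimal sketch of that planning step, assuming each repository can be summarised as a mapping from item identifier to a content checksum. How those summaries are obtained from Islandora, DSpace, EPrints or Hydra is not covered, and `plan_sync` is a name invented for this example.

```python
def plan_sync(canonical, mirror):
    """Compare two {item_id: checksum} mappings and list the needed actions.

    Returns a dict with the item ids to create, update and delete on the
    mirror so that it matches the canonical repository.
    """
    canonical_ids = set(canonical)
    mirror_ids = set(mirror)
    return {
        "create": sorted(canonical_ids - mirror_ids),
        "update": sorted(
            item for item in canonical_ids & mirror_ids
            if canonical[item] != mirror[item]
        ),
        "delete": sorted(mirror_ids - canonical_ids),
    }
```

Running the same plan against every mirror after each ingest would cover additions, updates and deletions in one pass.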
Hosting Multiple Fedora 3 Objects in Fedora 4
Fedora 4 has a hierarchical PID space and a goal of multiple tenancy, while Fedora 3 had a flat PID space. Fedora 4 also supports the Fedora 3 APIs, as long as you do not use the hierarchy and put your objects within the /objects folder.

For this challenge, make it possible to host multiple Fedora 3 /objects folders within the same Fedora 4 repository. For instance, you could put a set of PIDs in a path like /my/old/repository/objects/ and have the Fedora 3 APIs working at that /my/old/repository/objects path.
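
One piece of this would be routing: deciding which legacy repository a request belongs to and which PID it refers to. The sketch below shows only that path-splitting step on plain strings. It is not based on the Fedora 4 code base, it assumes that a PID never contains a slash, and `resolve_legacy_path` is a hypothetical helper name.

```python
OBJECTS_SEGMENT = "objects"


def resolve_legacy_path(path):
    """Split a request path into (repository_prefix, pid).

    Returns (prefix, None) for the objects folder itself, (prefix, pid)
    for a single object, and None if the path is not under an objects
    folder at all.
    """
    parts = [segment for segment in path.split("/") if segment]
    if parts and parts[-1] == OBJECTS_SEGMENT:
        prefix, pid = parts[:-1], None
    elif len(parts) >= 2 and parts[-2] == OBJECTS_SEGMENT:
        prefix, pid = parts[:-2], parts[-1]
    else:
        return None
    return "/" + "/".join(prefix), pid
```

With the prefix in hand, a Fedora 3 API handler could be scoped to one legacy repository while sharing the same Fedora 4 installation with the others.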

This would provide a clean way to migrate many legacy repositories into one Fedora 4 architecture. It would bring all of the storage, scale and performance gains of Fedora 4 to Fedora 3 applications, without needing duplicate Fedora 4 installations.
Meeting OSTP Directives
For those of us in the US, there has been some debate about how the OSTP directive (http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_publi...) should be implemented. One proposal is SHARE: http://www.arl.org/news/arl-news/2773-shared-access-research-ecosystem-p....

But SHARE, as many have pointed out, describes functionality that many repositories do not currently have. It would be great to see a group build an add-on that would help repositories meet these requirements!
Integration with ArchivesSpace
Would it be possible for Islandora to work with ArchivesSpace, so that metadata entered in one is also entered in the other and items can be moved between them? Or could data be harvested from one into the other?
RDF and LOD
I work at the FAO of the United Nations, where I am the database curator of AGRIS, a completely open bibliographic repository that indexes more than 4 million records on agricultural sciences.
My team harvests, collects and processes metadata using OAI-PMH and several other approaches. I basically reuse metadata from other repositories and disseminate their contents. Content providers are happy to see that their articles are more visible to the entire Internet community.
In addition to that, each piece of information is enriched and expanded by the OpenAGRIS triple store (http://agris.fao.org/openagris/), a semantic web application.

My idea and vision?
If open repositories started publishing their datasets in RDF, using the LOD paradigm, there would be more interlinking and less overhead in collecting and centralizing data.
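
To give a flavour of what publishing in RDF could look like at the simplest level, here is a sketch that turns one bibliographic record into N-Triples using Dublin Core terms. It is an illustration only, not how AGRIS or OpenAGRIS works: the record and concept URIs in the usage example are made-up `example.org` addresses, and `record_to_ntriples` is a name invented here.

```python
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(subject_uri, literals, links):
    """Serialise one record as N-Triples.

    literals: {dcterms property name: string value}
    links:    {dcterms property name: URI of another resource}
    """
    lines = []
    for term, value in literals.items():
        lines.append(f'<{subject_uri}> <{DCTERMS}{term}> "{escape_literal(value)}" .')
    for term, target in links.items():
        lines.append(f"<{subject_uri}> <{DCTERMS}{term}> <{target}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(record_to_ntriples(
        "http://example.org/record/1",
        {"title": "Rice yield under drought"},
        {"subject": "http://example.org/concept/rice"},
    ))
```

The links are where the interlinking happens: when the subject points at a shared concept URI instead of a local string, other datasets can refer to the same thing without anyone having to centralize the records.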

I would be happy to see (maybe at OR2013) how Fedora and Hydra, for example, are leveraging RDF.
Semantic Wiki or CMS for Describing Research Context
If you are building a repository for research data, then you need to be able to record a lot of contextual metadata about the data being collected. For example, you might have some way to attach data to instruments. We typically see designs with hierarchies something like Facility / Experiment / Dataset / File.

The problem is that if you design this into the application, for example via database tables, it becomes much harder to adapt to a new domain or to changing circumstances, where you might have more or fewer levels, or where hierarchies of experiments or instruments might become important.

So, what I’d like to see would be a semantic wiki or CMS for describing research context, with some built-in (but extensible) concepts such as “Institute”, “Instrument”, “Experiment”, “Study” and “Clinical Trial”. Researchers, data librarians and repository managers could use it to describe research context as a series of pages or nodes, and thus create a series of URIs to which data in any repository anywhere can point. The research data repository could then concentrate on managing the data, and link the units of data (files, sets, databases, collections) to the context via RDF assertions such as ‘generatedBy’. Describing new data sets would involve look-ups and auto-completes against the research-context semantic wiki, which is a really interesting user interface challenge.
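
A toy version of this separation might look like the following. The registry stands in for the semantic wiki: it mints a URI for each context node and offers a prefix look-up of the kind an auto-complete field would need. Everything here is an assumption made for illustration, including the class and function names, the URI layout and the `http://example.org/terms/generatedBy` predicate; a real entry would pick an existing vocabulary.

```python
import re

# Hypothetical predicate URI; not taken from any published vocabulary.
GENERATED_BY = "http://example.org/terms/generatedBy"


def slugify(text):
    """Turn a label into a lowercase, hyphen-separated URI segment."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


class ContextRegistry:
    """In-memory stand-in for the research-context wiki."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self.nodes = {}  # uri -> (kind, label)

    def add(self, kind, label):
        """Register a context node and return the URI minted for it."""
        uri = f"{self.base_uri}/{slugify(kind)}/{slugify(label)}"
        self.nodes[uri] = (kind, label)
        return uri

    def suggest(self, prefix):
        """Return (label, uri) pairs whose label starts with the prefix."""
        needle = prefix.lower()
        return sorted(
            (label, uri)
            for uri, (_kind, label) in self.nodes.items()
            if label.lower().startswith(needle)
        )


def link_generated_by(data_uri, context_uri):
    """Express 'this data was generated by that context node' as N-Triples."""
    return f"<{data_uri}> <{GENERATED_BY}> <{context_uri}> ."
```

The data repository only ever stores the triple; adding a new kind of context, or a deeper hierarchy, changes the registry and nothing else.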

It would be great to see someone demonstrate this architecture, building on a wiki or CMS framework such as Drupal, or maybe one of the NoSQL databases, or maybe as a Fedora 4 app, showing how describing research context in a flexible way can be decoupled from one or more data repositories. In fact, the same principle would apply to lots of repository metadata: instead of configuring input forms with things like institutional hierarchies, why not set up semantic web sites that document research infrastructure and processes, and link the forms to them?

Sign up your team today!
