As part of the work we're doing on the Preserving Virtual Worlds
project (game preservation), I'm having to collect representation
information which documents the formats used for storing certain game
materials. For instance, games like "Mindwheel", written for the
Apple II, tend to get handed around as disk images produced using a
utility like Disk Copy. When I'm lucky, a company like Apple will
actually make the format specs for their own disk image structures
available, but they tend to make them available as a set of several
hundred highly interlinked web pages. I need to be able to say 'this
file here is in a format documented by that spec over there' when that
spec over there is several hundred web resources. OAI-ORE nicely
provides a way to make those several hundred web pages a single,
addressable object, but I'm currently having to generate OAI-ORE for
those hundreds of pages by hand. Very annoying.
Has anyone modified one of the web harvesting tools so that instead of
actually grabbing web resources it generates an OAI-ORE description of
a site? Or do I have a little hacking project to undertake?
I have some bits of code that could be tweaked to do this (uses wget and
creates an Atom doc), if an Atom representation of the ReM is useful.
Further tweaking could create RDF ReM. What's your preferred language
between Python, Perl & PHP? I'd be happy to share.
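For readers following along, here is a rough sketch of what an Atom representation of a ReM could look like when built from a plain list of URLs. It is not the code offered above, which was not posted in this thread. The function name `build_atom_rem` and the example.org URIs in any usage are made up for illustration, and the entry is simplified: a valid Atom entry would also need `updated` and author elements, and the output has not been checked against the ORE Atom profile.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ORE = "http://www.openarchives.org/ore/terms/"


def build_atom_rem(rem_uri, aggregation_uri, title, resource_uris):
    """Return a simplified Atom entry describing one aggregation.

    Hypothetical helper: each URL becomes an ore:aggregates link.
    """
    ET.register_namespace("atom", ATOM)
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = aggregation_uri
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    # Link back to the resource map document itself.
    ET.SubElement(entry, f"{{{ATOM}}}link", rel="self",
                  type="application/atom+xml", href=rem_uri)
    # State which aggregation this resource map describes.
    ET.SubElement(entry, f"{{{ATOM}}}link", rel=ORE + "describes",
                  href=aggregation_uri)
    ET.SubElement(entry, f"{{{ATOM}}}category", scheme=ORE,
                  term=ORE + "Aggregation", label="Aggregation")
    # One link per aggregated resource, de-duplicated and sorted.
    for uri in sorted(set(resource_uris)):
        ET.SubElement(entry, f"{{{ATOM}}}link", rel=ORE + "aggregates",
                      href=uri)
    return ET.tostring(entry, encoding="unicode")
```

The URL list could come from any harvester that can log the URLs it visits, so the serialization step stays independent of how the site is crawled.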
On Fri, Jul 10, 2009 at 11:23 AM, Jerome <jmcdo...@uiuc.edu> wrote:
> Howdy,
> As part of the work we're doing on the Preserving Virtual Worlds
> project (game preservation), I'm having to collect representation
> information which documents the formats used for storing certain game
> materials. For instance, games like "Mindwheel", written for the
> Apple II, tend to get handed around as disk images produced using a
> utility like Disk Copy. When I'm lucky, a company like Apple will
> actually make the format specs for their own disk image structures
> available, but they tend to make them available as a set of several
> hundred highly interlinked web pages. I need to be able to say 'this
> file here is in a format documented by that spec over there' when that
> spec over there is several hundred web resources. OAI-ORE nicely
> provides a way to make those several hundred web pages a single,
> addressable object, but I'm currently having to generate OAI-ORE for
> those hundreds of pages by hand. Very annoying.
> Has anyone modified one of the web harvesting tools so that instead of
> actually grabbing web resources it generates an OAI-ORE description of
> a site? Or do I have a little hacking project to undertake?
The hardest part is knowing when to stop adding resources into the
aggregation.
There are a few obvious ways, none of which are great:
1. Limit by URL template(s): Crawl until the URL no longer matches a
template given at run time. However, this might be tricky to construct,
especially to include CSS, images and so on, while not including pages you
don't want.
2. Limit by sitemap: Requires a sitemap, obviously.
3. Limit by crawl depth from initial page: Very arbitrary.
Also happy to share/help, especially in Python+Foresite :)
Peter:
Have you considered writing a RAP-based (or other?) PHP library for ORE?
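The three stopping rules listed above could each be written as a small predicate that a crawler consults before adding a URL to the aggregation. This is only a sketch: the function names (`by_url_template`, `by_sitemap`, `by_depth`) and the `keep(url, depth)` signature are assumptions made for illustration, not part of Foresite or any other tool mentioned here.

```python
import re
from urllib.parse import urldefrag


def by_url_template(patterns):
    """Keep a URL only if it matches one of the regular expressions."""
    compiled = [re.compile(p) for p in patterns]

    def keep(url, depth):
        return any(rx.match(url) for rx in compiled)
    return keep


def by_sitemap(sitemap_urls):
    """Keep a URL only if the sitemap lists it (fragments ignored)."""
    allowed = {urldefrag(u).url for u in sitemap_urls}

    def keep(url, depth):
        return urldefrag(url).url in allowed
    return keep


def by_depth(max_depth):
    """Keep a URL only if it is within max_depth links of the start page."""
    def keep(url, depth):
        return depth <= max_depth
    return keep
```

With the template approach, stylesheets and images that live outside the main path need their own pattern, for example `r"^http://example\.org/static/.*\.(css|png)$"` next to `r"^http://example\.org/spec/"`, which is the construction difficulty mentioned in point 1.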
On Fri, Jul 10, 2009 at 5:31 PM, Peter Keane <pke...@mail.utexas.edu> wrote:
> Jerome-
> I have some bits of code that could be tweaked to do this (uses wget and
> creates an Atom doc), if an Atom representation of the ReM is useful.
> Further tweaking could create RDF ReM. What's your preferred language
> between python, perl & php? I'd be happy to share.
> --peter
> On Fri, Jul 10, 2009 at 11:23 AM, Jerome <jmcdo...@uiuc.edu> wrote:
>> Has anyone modified one of the web harvesting tools so that instead of
>> actually grabbing web resources it generates an OAI-ORE description of
>> a site? Or do I have a little hacking project to undertake?
It's relatively simple -- it just takes a start page and some
restriction regular expressions for what to include, but should be
modifiable for other strategies.
If you run it as is, it should generate a description of the ORE 1.0
spec.
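The script being described is not reproduced in this thread, so the following is not that script. It is a minimal sketch of the same idea, a start page plus restriction regular expressions, using only the Python standard library. The names `crawl`, `LinkParser` and `fetch_html` are invented for this example, and the `fetch` parameter is an assumption added so the crawl logic can be exercised without network access.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href and src attribute values from one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def fetch_html(url):
    """Default fetcher: return the body of HTML pages, '' for anything else."""
    with urlopen(url) as resp:
        if "html" not in resp.headers.get_content_type():
            return ""
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(start_url, include_patterns, fetch=fetch_html):
    """Breadth-first crawl; return the sorted URLs matching any pattern."""
    include = [re.compile(p) for p in include_patterns]
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            target = urldefrag(urljoin(url, href)).url
            if target not in seen and any(rx.search(target) for rx in include):
                seen.add(target)
                queue.append(target)
    return sorted(seen)
```

The returned list is the candidate set of aggregated resources; turning it into a resource map is a separate serialization step. Swapping the regular expression check for a sitemap or depth test would give the other strategies discussed earlier in the thread.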
Yeah, we have that problem (and similar ones) littering the
landscape. I think I could probably spend 5 years just researching
the user interface issues around making it simple/easy for users to do
highly selective web harvesting. For now, limiting by URL template is
probably going to be good enough for me.
On Jul 10, 1:51 pm, Robert Sanderson <azarot...@gmail.com> wrote:
> The hardest part is knowing when to stop adding resources into the
> aggregation.
> There are a few obvious ways, none of which are great:
> 1. Limit by URL template(s): Crawl until the URL no longer matches a
> template given at run time. However this might be tricky to construct,
> especially to include css, images and so on, while not including pages you
> don't want.
> 2. Limit by sitemap: Requires a sitemap, obviously.
> 3. Limit by crawl depth from initial page: Very arbitrary.
> Also happy to share/help, especially in Python+Foresite :)
> Peter:
> Have you considered writing a RAP based (or other?) PHP library for ORE?
> Rob