Web Harvesting to OAI-ORE
From: Jerome <jmcdo...@uiuc.edu>
Date: Fri, 10 Jul 2009 09:23:59 -0700 (PDT)
Local: Sat, Jul 11 2009 2:23 am
Subject: Web Harvesting to OAI-ORE
Howdy,

As part of the work we're doing on the Preserving Virtual Worlds
project (game preservation), I'm having to collect representation
information which documents the formats used for storing certain game
materials.  For instance, games like "Mindwheel", written for the
Apple II, tend to get handed around as disk images produced using a
utility like Disk Copy.  When I'm lucky, a company like Apple will
actually make the format specs for their own disk image structures
available, but they tend to make them available as a set of several
hundred highly interlinked web pages.  I need to be able to say 'this
file here is in a format documented by that spec over there' when that
spec over there is several hundred web resources.  OAI-ORE nicely
provides a way to make those several hundred web pages a single,
addressable object, but I'm currently having to generate OAI-ORE for
those hundreds of pages by hand.  Very annoying.

Has anyone modified one of the web harvesting tools so that instead of
actually grabbing web resources it generates an OAI-ORE description of
a site?  Or do I have a little hacking project to undertake?


From: Peter Keane <pke...@mail.utexas.edu>
Date: Fri, 10 Jul 2009 11:31:09 -0500
Local: Sat, Jul 11 2009 2:31 am
Subject: Re: Web Harvesting to OAI-ORE

Jerome-

I have some bits of code that could be tweaked to do this (uses wget and
creates an Atom doc), if an Atom representation of the ReM is useful.
Further tweaking could create an RDF ReM.  What's your preferred language
among Python, Perl, and PHP?  I'd be happy to share.
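
For readers wondering what an Atom representation of a ReM might look like, here is a minimal Python sketch. It is not Peter's code. The function name `rem_as_atom`, its parameters, and the example.org URIs are hypothetical. The `describes` and `aggregates` link relations come from the ORE terms vocabulary; the rest of the entry layout is simplified and should be checked against the ORE Atom serialization before real use.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ORE_TERMS = "http://www.openarchives.org/ore/terms/"

def rem_as_atom(rem_uri, aggregation_uri, title, updated, resource_uris):
    """Build one Atom entry that lists every aggregated resource as a link.

    Assumption: the aggregation URI doubles as the atom:id, which is a
    simplification made for this sketch.
    """
    ET.register_namespace("", ATOM)
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = aggregation_uri
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    # The entry's own address (the ReM) and the aggregation it describes.
    ET.SubElement(entry, f"{{{ATOM}}}link", rel="self",
                  type="application/atom+xml", href=rem_uri)
    ET.SubElement(entry, f"{{{ATOM}}}link",
                  rel=ORE_TERMS + "describes", href=aggregation_uri)
    # One link per harvested page, stylesheet, image, and so on.
    for uri in resource_uris:
        ET.SubElement(entry, f"{{{ATOM}}}link",
                      rel=ORE_TERMS + "aggregates", href=uri)
    return ET.tostring(entry, encoding="unicode")

# Hypothetical URIs, for illustration only.
atom_xml = rem_as_atom(
    "http://example.org/rem/spec.atom",
    "http://example.org/aggregation/spec",
    "Disk image format spec",
    "2009-07-10T16:31:09Z",
    ["http://example.org/spec/index.html", "http://example.org/spec/a.html"],
)
```

The list of resource URIs would come from whatever does the harvesting (wget output, a crawler, a sitemap).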

--peter


From: Robert Sanderson <azarot...@gmail.com>
Date: Fri, 10 Jul 2009 19:51:16 +0100
Local: Sat, Jul 11 2009 4:51 am
Subject: Re: Web Harvesting to OAI-ORE

Hi Jerry,

The hardest part is knowing when to stop adding resources into the
aggregation.
There are a few obvious ways, none of which are great:

1. Limit by URL template(s):  Crawl until the URL no longer matches a
template given at run time. However, this might be tricky to construct,
especially to include CSS, images and so on, while not including pages you
don't want.
2. Limit by sitemap:  Requires a sitemap, obviously.
3. Limit by crawl depth from initial page:  Very arbitrary.
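
The three stopping rules above can each be sketched as a small predicate that a crawler calls before adding a URL to the aggregation. This is an illustrative Python sketch only: the function names, the example.org site layout, and the patterns are all hypothetical.

```python
import re

def in_scope_by_template(url, patterns):
    """Strategy 1: keep a URL only if it matches one of the run-time regexes."""
    return any(re.search(p, url) for p in patterns)

def in_scope_by_sitemap(url, sitemap_urls):
    """Strategy 2: keep a URL only if the site's sitemap lists it."""
    return url in sitemap_urls

def in_scope_by_depth(depth, max_depth):
    """Strategy 3: keep a URL only if it is within max_depth links of the start page."""
    return depth <= max_depth

# Hypothetical templates: the spec pages themselves, plus the shared
# stylesheets and images they depend on.
patterns = [
    r"^https://example\.org/spec/",
    r"^https://example\.org/static/.*\.(css|png|gif)$",
]
```

Note how the second pattern is needed only to pull in CSS and images that live outside the spec's own path, which is the "tricky to construct" part of strategy 1.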

Also happy to share/help, especially in Python+Foresite :)

Peter:
Have you considered writing a RAP-based (or other?) PHP library for ORE?

Rob


From: Rob Sanderson <azarot...@gmail.com>
Date: Sat, 11 Jul 2009 10:49:21 -0700 (PDT)
Local: Sun, Jul 12 2009 3:49 am
Subject: Re: Web Harvesting to OAI-ORE

Here's an initial attempt at this idea:

http://code.google.com/p/foresite-toolkit/source/browse/foresite-pyth...

It's relatively simple -- it just takes a start page and some
restriction regular expressions for what to include, but should be
modifiable for other strategies.
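
For readers who cannot follow the truncated link, the general idea (a start page plus inclusion regular expressions) can be sketched with the Python standard library alone. This is not the Foresite script. `LinkExtractor`, `collect_aggregation`, and the `fetch` callback are hypothetical names, and the pages are an in-memory stand-in for real HTTP requests.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href and src attribute values from a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def collect_aggregation(start_url, include_patterns, fetch):
    """Breadth-first walk from start_url; returns in-scope URLs in discovery order.

    fetch(url) must return the page's HTML as a string, or None when the
    resource is not HTML (it is still listed, just not parsed for links).
    """
    include = [re.compile(p) for p in include_patterns]
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            # Resolve relative links and drop #fragments so pages dedupe.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute in seen:
                continue
            if not any(p.search(absolute) for p in include):
                continue
            seen.add(absolute)
            order.append(absolute)
            queue.append(absolute)
    return order

# In-memory stand-in for a small site (hypothetical URLs).
pages = {
    "http://example.org/spec/index.html":
        '<a href="a.html">A</a> <a href="http://other.example/x">X</a>',
    "http://example.org/spec/a.html":
        '<a href="index.html#top">back</a> <img src="fig.png">',
}
urls = collect_aggregation(
    "http://example.org/spec/index.html",
    [r"^http://example\.org/spec/"],
    pages.get,
)
```

The resulting list is what would be handed to an ORE library as the aggregated resources. Swapping `pages.get` for a function that does real HTTP requests is left to the reader.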

If you run it as is, it should generate a description of the ORE 1.0
spec.

Hope that helps :)
-- Rob


From: Jerome <jmcdo...@uiuc.edu>
Date: Thu, 16 Jul 2009 12:34:49 -0700 (PDT)
Local: Fri, Jul 17 2009 5:34 am
Subject: Re: Web Harvesting to OAI-ORE
Hi Peter,

If you could share your code, that would be fantastic.  Of those
languages, I'll take Python, thanks!

On Jul 10, 11:31 am, Peter Keane <pke...@mail.utexas.edu> wrote:


From: Jerome <jmcdo...@uiuc.edu>
Date: Thu, 16 Jul 2009 12:36:32 -0700 (PDT)
Local: Fri, Jul 17 2009 5:36 am
Subject: Re: Web Harvesting to OAI-ORE
Yeah, we have that problem (and similar ones) littering the
landscape.  I think I could probably spend 5 years just researching
the user interface issues around making it simple/easy for users to do
highly selective web harvesting.  For now, limiting by URL template is
probably going to be good enough for me.

On Jul 10, 1:51 pm, Robert Sanderson <azarot...@gmail.com> wrote:


From: Jerome <jmcdo...@uiuc.edu>
Date: Thu, 16 Jul 2009 12:36:48 -0700 (PDT)
Local: Fri, Jul 17 2009 5:36 am
Subject: Re: Web Harvesting to OAI-ORE
Thanks!

On Jul 11, 12:49 pm, Rob Sanderson <azarot...@gmail.com> wrote:

