On Jul 2, 4:15 pm, BadXAsh wrote:
Hello all,
I have something of a problem I was hoping the gods of the web that
reside here could help me with. I'm making my sitemap for Google, and
my site is rather large (2 million+ pages). When I run the Python
script (sitemap_gen.py) it starts off without a hitch and works
beautifully, until it hits sitemap54.xml.gz... then without fail it
crashes. Below is the message I get. (I cut the file path down to save
space, as you don't need to see the huge file path it goes through.)
---
Writing Sitemap file "(file path)/sitemap53.xml.gz" with 50000 URLs
Sorting and normalizing collected URLs.
Writing Sitemap file "(file path)/sitemap54.xml.gz" with 50000 URLs
Traceback (most recent call last):
  File "sitemap_gen.py", line 2208, in ?
    sitemap.Generate()
  File "sitemap_gen.py", line 1780, in Generate
    input.ProduceURLs(self.ConsumeURL)
  File "sitemap_gen.py", line 979, in ProduceURLs
    os.path.walk(self._path, PerDirectory, None)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 290, in walk
    func(arg, top, names)
  File "sitemap_gen.py", line 974, in PerDirectory
    PerFile(dirpath, name)
  File "sitemap_gen.py", line 959, in PerFile
    consumer(url, False)
  File "sitemap_gen.py", line 1841, in ConsumeURL
    self._urls[hash] = 1
MemoryError
---
Does anyone have any insight or workarounds for how I can free up the
memory that is apparently gummed up by this process? Any help is
gratefully appreciated!
Can you set the verbose attribute of the <site> node in config.xml to
3 (the highest level of diagnostic output when you run the sitemap
generator) to see if you get more diagnostic messages, and check
whether you get a diagnostic message about the directory being walked
just before that error message in walk?
Are you using the <directory> nodes of config.xml to walk your server
file system?
Another thing: 2 million+ URLs is quite a lot. Are you sure there are
no duplicate URLs, and that you want to list all of these URLs in your
sitemaps?
Cristina.
On Jul 2, 9:39 pm, BadXAsh wrote:
Man, I was very confident that would work. I changed the verbose
attribute of the <site> node in config.xml to 3, and it does say
something about the directory being walked at the very beginning of
the process, but then around sitemap 54 I received this message:
---
URL: loc=[http://www.diyautoparts.com/search/parts/1985/dodge/aries/air-check-valve.shtml] lastmod=[2008-01-17T16:25:38Z] changefreq=[] priority=[]
URL: loc=[http://www.diyautoparts.com/search/parts/1985/dodge/aries/air-conditioning-accumulator.shtml] lastmod=[2008-01-17T16:25:38Z] changefreq=[] priority=[]
Traceback (most recent call last):
  File "sitemap_gen.py", line 2206, in ?
    sitemap.Generate()
  File "sitemap_gen.py", line 1778, in Generate
    input.ProduceURLs(self.ConsumeURL)
  File "sitemap_gen.py", line 979, in ProduceURLs
    os.path.walk(self._path, PerDirectory, None)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 298, in walk
    walk(name, func, arg)
  File "/usr/lib/python2.4/posixpath.py", line 290, in walk
    func(arg, top, names)
  File "sitemap_gen.py", line 974, in PerDirectory
    PerFile(dirpath, name)
  File "sitemap_gen.py", line 959, in PerFile
    consumer(url, False)
  File "sitemap_gen.py", line 1839, in ConsumeURL
    self._urls[hash] = 1
MemoryError
---
I am using the directory node; this is what I have:
(My Site Path) is my long path, which I actually have typed in, but
I'm so paranoid I took it out just in case, haha. Anyway, I do believe
my Python is up to date, as my web hosting company takes care of the
server-side software.
And yes, those pages are all different, one for each type of part we
carry, which is a lot, so they are not the same page over and over, if
that's what you mean.
On Jul 2, 6:31 pm, cristina wrote:
Can you run the sitemap generator more than once, using different
config.xml files with different settings for the <directory> node, so
that the sitemaps are split across different sub-folders? That would
show whether the problem really is running out of memory because of
the large number of URLs, and not some problem with walking the file
system.
For example, first run the sitemap generator only for the directory
where you got the error, to check that this directory can be walked
OK. (Change default_file to index.shtml if the default home page is
index.shtml.)
After that, run the sitemap generator for the other, non-overlapping
directories. If you want, you can also use <sitemap> nodes to
aggregate the sitemaps (you can use <sitemap> nodes in version 1.4; I
am not sure whether you can use them in version 1.5).
It is not a great solution, just a way to check that the problems are
indeed memory problems caused by the large number of URLs.
Cristina.
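
To make the idea of separate, non-overlapping runs concrete, here is a
rough Python sketch that only plans the batches; it does not call the
sitemap generator or write any config.xml. It assumes the sub-folders
under /parts/ are named by year, as the URLs in the verbose output
suggest. The function names, the file-system root and the example.com
URL are hypothetical, chosen purely for illustration.

```python
def plan_batches(first_year, last_year, years_per_run):
    """Split an inclusive year range into consecutive, non-overlapping batches."""
    if years_per_run < 1:
        raise ValueError("years_per_run must be at least 1")
    batches = []
    start = first_year
    while start <= last_year:
        # The last batch may be shorter than years_per_run.
        end = min(start + years_per_run - 1, last_year)
        batches.append(list(range(start, end + 1)))
        start = end + 1
    return batches


def directory_entries(batch, fs_root, url_root):
    """Pair each year folder's file-system path with its URL prefix."""
    return [("%s/%d" % (fs_root, year), "%s/%d/" % (url_root, year))
            for year in batch]


if __name__ == "__main__":
    # Hypothetical roots: the real file-system path is not given in the thread.
    FS_ROOT = "/path/to/site/search/parts"
    URL_ROOT = "http://www.example.com/search/parts"
    for number, batch in enumerate(plan_batches(1965, 2007, 10), 1):
        print("run %d: years %d-%d" % (number, batch[0], batch[-1]))
        for path, url in directory_entries(batch, FS_ROOT, URL_ROOT):
            # Each pair would go into that run's own config file.
            print("  path=%s url=%s" % (path, url))
```

Each batch would then get its own config file and its own run of the
generator, so no single run has to hold every URL in memory at once.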
Well, I ran just the '85 Dodge Aries folder and it ran perfectly,
finished in the blink of an eye. I also tried to run just the /parts/
directory, which is the single largest directory on my site: a page
for each and every part for every vehicle from 1965 through 2007, so
as you can imagine it's rather large. Walking and sitemapping that
directory alone failed with a MemoryError as well. This time it got
past the '85 Dodge Aries and crashed in the '86 year. Something
similar, though, is that it crashed when it reached the middle of
sitemap 55.
So I'm wondering, can I set up a filter to break down the /parts/
directory and map it in sections, say in 10- to 20-year increments,
i.e. /parts/1965 - 1985? I'm not too clear on the filter rules; it
seems like I can't really specify the directories I want filtered.
Alternatively, can I leave out the /parts/ directory once I have it
mapped, so I can map the rest of the site without it?
Or should I just abandon all hope?? Hehe. Thank you for your help so
far, Cristina!
You can filter out directories from the sitemap with the filter nodes.
The filter nodes apply to URLs, so to filter out the /parts/ directory
you can add to your config.xml file a filter node that drops URLs
matching /parts/.
You can see from the comments in the example_config.xml file that the
filters are applied in the order they appear in the config.xml file,
and a pass filter short-circuits any later filters that match.
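
The ordering rule can be pictured with a small Python sketch. This is
not the code from sitemap_gen.py, just an illustration of "first
matching filter wins". The (action, pattern) tuples and the keep_url
name are hypothetical, Python's fnmatch stands in for wildcard
matching, and the sketch assumes a URL that matches no filter is kept.

```python
import fnmatch


def keep_url(url, filters):
    """Return True if the URL is kept, False if it is dropped.

    Filters are tried in order. The first filter whose pattern matches
    decides the outcome, and later filters are not consulted.
    Assumption: a URL that matches no filter at all is kept.
    """
    for action, pattern in filters:
        if fnmatch.fnmatchcase(url, pattern):
            return action == "pass"
    return True


if __name__ == "__main__":
    # Keep the 1985 pages, drop everything else under /parts/.
    filters = [
        ("pass", "*/parts/1985/*"),
        ("drop", "*/parts/*"),
    ]
    urls = [
        "http://www.example.com/search/parts/1985/dodge/aries/air-check-valve.shtml",
        "http://www.example.com/search/parts/1986/dodge/aries/air-check-valve.shtml",
        "http://www.example.com/about.shtml",
    ]
    for url in urls:
        print("%s %s" % ("keep" if keep_url(url, filters) else "drop", url))
```

If the two filters were listed in the opposite order, the drop filter
would match first and the 1985 pages would be dropped too, which is
why the order in config.xml matters.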