[CoLoCo] Wikipedia project

Neal McBurnett neal at bcn.boulder.co.us
Fri Feb 11 12:30:26 UTC 2011


Folks do this sort of analysis all the time, and I doubt that they
each download the whole site these days.  It is pretty big.
And they may already have analysis data relevant to your interests.

I don't know the details, but I suspect this sort of analysis is best
done by getting in touch with the folks who run an existing copy, like
the Toolserver:

 http://toolserver.org/

One of its policy pages talks about academic research:
 https://wiki.toolserver.org/view/Account_approval_policy/en

Cheers,

Neal McBurnett                 http://neal.mcburnett.org/

On Thu, Feb 10, 2011 at 10:12:32PM -0700, Jason B. Hill wrote:
> The size of the current text-only, non-revision based English
> Wikipedia dump is 6.07 GB (~27 GB uncompressed). It is available at
> the following address:
> 
> http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
> 
> If you need revisions (all revisions/all pages), then you're looking
> at 5 TB of text.
> 
> 
> If you're going to plow through that much data, it's possible that
> building your own 2/4/6-core machine with enough storage would be
> cheaper than using Amazon's cloud service. I don't think 27 GB of text
> is an unreasonable amount of data to sift for a commodity machine if
> done correctly.
> 
> Jason
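
For what it's worth, "done correctly" here mostly means streaming the
compressed dump instead of unpacking it or loading it into memory. Below
is a minimal sketch of that idea in Python, using only the standard
library. The names iter_pages, count_pages and local_name are made up
for illustration, and the element names (<page>, <title>, <text>) are an
assumption based on the MediaWiki XML export layout, so check them
against the actual file before relying on this.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, one page at a time.

    `source` is a path to a .bz2 dump, or a binary file-like object
    holding bz2-compressed XML. Assumes one <text> per page, as in the
    pages-articles dump; with a full-history dump only the last
    revision's text of each page would be kept.
    """
    with bz2.open(source, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _event, root = next(context)  # first event is the root's start tag
        title, text = None, ""
        for event, elem in context:
            if event != "end":
                continue
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                root.clear()  # drop finished pages so memory stays flat


def count_pages(source):
    """Return (number of pages, total characters of page text)."""
    pages = chars = 0
    for _title, text in iter_pages(source):
        pages += 1
        chars += len(text)
    return pages, chars
```

Calling count_pages("enwiki-latest-pages-articles.xml.bz2") on the file
from the URL above would walk the whole dump in a single pass. The
decompression runs on one core, so more cores only help if the per-page
work is handed off to other processes.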
> 
> 
> 
> 
> On Thu, Feb 10, 2011 at 9:31 PM, Neal McBurnett <neal at bcn.boulder.co.us> wrote:
> > Wikipedia is a web application and database hosted on a bunch of
> > servers, not a static site.  What do you really want, and what will
> > you do with it?
> >
> > Neal McBurnett                 http://neal.mcburnett.org/
> >
> > On Thu, Feb 10, 2011 at 09:14:56PM -0700, mkass at numericalgeo.com wrote:
> >> Just be careful of bandwidth if you host it on a private server.  I know
> >> Comcast only gives you 250 GB a month, which might throw a monkey wrench into your plan
> >> if you're scripting a look at every single page.  Unless of course you ran it
> >> remotely on that server...
> >>
> >> Might be obvious but I thought I'd point that out...
> >>
> >>     -------- Original Message --------
> >>     Subject: [CoLoCo] Wikipedia project
> >>     From: Jim Hutchinson <jim at ubuntu-rocks.org>
> >>     Date: Thu, February 10, 2011 8:39 pm
> >>     To: coloco-list <ubuntu-us-co at lists.ubuntu.com>
> >>
> >>     Greetings CoLoCo,
> >>
> >>     I posted a few days ago about needing a programmer for a side job and
> >>     someone did contact me so hopefully that piece will work out. The other
> >>     piece is storage. I will need to download a copy of Wikipedia - probably
> >>     the entire English Wikipedia - and store the data somewhere for a few
> >>     months. I don't have all the answers yet, but it looks like it could be in the
> >>     terabyte range. I don't have access to anything that could host this unless
> >>     I go and buy a few big drives. I'm wondering if anyone has access to a
> >>     server with space that could host a static copy of Wikipedia for a while.
> >>     It would probably only be a few months. Obviously, I'd like to find a way
> >>     to get by on the cheap. I am aware of the Amazon cloud option and depending
> >>     on how much that would cost it's an option. However, I thought I'd see if
> >>     anyone was sitting on a lot of drive space not doing much at the moment. It
> >>     would need a certain level of security to ensure things weren't changed
> >>     while the data collection was happening. After that, if people wanted to
> >>     explore that would be fine.
> >>
> >>     Thanks.
> >>
> >>     --
> >>     Jim (Ubuntu geek extraordinaire)
> >>     ----
> >>     Please avoid sending me Word or PowerPoint attachments.
> >>     See http://www.gnu.org/philosophy/no-word-attachments.html
> >>     ---------------------------------------------------------------------------
> >>     --
> >>     Ubuntu-us-co mailing list
> >>     Ubuntu-us-co at lists.ubuntu.com
> >>     Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-us-co
> >>


