[CoLoCo] Wikipedia project
Neal McBurnett
neal at bcn.boulder.co.us
Fri Feb 11 12:30:26 UTC 2011
Folks do this sort of analysis all the time, and I doubt that they
each download the whole site these days. It is pretty big.
And they may already have analysis data relevant to your interests.
I don't know the details, but I suspect this sort of analysis is best
done by getting in touch with folks at an existing copy like toolserver:
http://toolserver.org/
One policy page that discusses academic research is here:
https://wiki.toolserver.org/view/Account_approval_policy/en
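
If you do end up working from the pages-articles dump Jason links below,
it can be read as a stream, so the ~27 GB of XML never has to be
decompressed to disk. Here is a rough sketch of that idea, not a tested
tool. The function names (local_name, iter_pages, summarize_dump) are
made up for illustration, and it assumes the usual MediaWiki export
layout of <page> elements containing <title> and <text>.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text_length) for each <page> in an export XML stream.

    Assumes one <text> per page, as in a current-revisions-only dump;
    if a page holds several, only the last one is reported.
    """
    title, text_len = None, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text_len = len(elem.text or "")
        elif name == "page":
            yield title, text_len
            title, text_len = None, 0
            elem.clear()  # drop the finished page's contents to keep memory low


def summarize_dump(path):
    """Count pages and total characters of page text in a .xml.bz2 dump."""
    pages = chars = 0
    with bz2.open(path, "rb") as stream:  # decompresses on the fly
        for _title, text_len in iter_pages(stream):
            pages += 1
            chars += text_len
    return pages, chars
```

Something like summarize_dump("enwiki-latest-pages-articles.xml.bz2")
would then walk the whole file in one pass. Real analysis would replace
the character count with whatever you actually want to measure.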
Cheers,
Neal McBurnett http://neal.mcburnett.org/
On Thu, Feb 10, 2011 at 10:12:32PM -0700, Jason B. Hill wrote:
> The size of the current text-only, non-revision based English
> Wikipedia dump is 6.07 GB (~27 GB uncompressed). It is available at
> the following address:
>
> http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>
> If you need revisions (all revisions/all pages), then you're looking
> at 5 TB of text.
>
>
> If you're going to plow through that much data, it's possible that
> building your own 2/4/6-core machine with enough storage would be
> cheaper than using Amazon's cloud service. I don't think 27 GB of text
> is an unreasonable amount of data to sift for a commodity machine if
> done correctly.
>
> Jason
>
> On Thu, Feb 10, 2011 at 9:31 PM, Neal McBurnett <neal at bcn.boulder.co.us> wrote:
> > Wikipedia is a web application and database hosted on a bunch of
> > servers, not a static site. What do you really want, and what will
> > you do with it?
> >
> > Neal McBurnett http://neal.mcburnett.org/
> >
> > On Thu, Feb 10, 2011 at 09:14:56PM -0700, mkass at numericalgeo.com wrote:
> >> Just be careful of bandwidth if you host it on a private server. I know
> >> Comcast only gives you 250 GB a month, which might throw a monkey wrench into your plan
> >> if you're scripting a look at every single page. Unless of course you ran it
> >> remotely on that server...
> >>
> >> Might be obvious but I thought I'd point that out...
> >>
> >> -------- Original Message --------
> >> Subject: [CoLoCo] Wikipedia project
> >> From: Jim Hutchinson <jim at ubuntu-rocks.org>
> >> Date: Thu, February 10, 2011 8:39 pm
> >> To: coloco-list <ubuntu-us-co at lists.ubuntu.com>
> >>
> >> Greetings CoLoCo,
> >>
> >> I posted a few days ago about needing a programmer for a side job and
> >> someone did contact me so hopefully that piece will work out. The other
> >> piece is storage. I will need to download a copy of Wikipedia - probably
> >> the entire English Wikipedia - and store the data somewhere for a few
> >> months. I don't have all the answers yet, but it looks like it could be in the
> >> terabyte range. I don't have access to anything that could host this unless
> >> I go and buy a few big drives. I'm wondering if anyone has access to a
> >> server with space that could host a static copy of Wikipedia for a while.
> >> It would probably only be a few months. Obviously, I'd like to find a way
> >> to get by on the cheap. I am aware of the Amazon cloud option and depending
> >> on how much that would cost it's an option. However, I thought I'd see if
> >> anyone was sitting on a lot of drive space not doing much at the moment. It
> >> would need a certain level of security to ensure things weren't changed
> >> while the data collection was happening. After that, if people wanted to
> >> explore that would be fine.
> >>
> >> Thanks.
> >>
> >> --
> >> Jim (Ubuntu geek extraordinaire)
> >> ----
> >> Please avoid sending me Word or PowerPoint attachments.
> >> See http://www.gnu.org/philosophy/no-word-attachments.html
> >> ---------------------------------------------------------------------------
> >> --
> >> Ubuntu-us-co mailing list
> >> Ubuntu-us-co at lists.ubuntu.com
> >> Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-us-co