[ec2-beta] data corruption

Mark Shuttleworth mark at ubuntu.com
Tue Apr 14 20:54:44 BST 2009

Ben Hendrickson wrote:
> I use around 16 large instances for data processing tasks.  When using
> the beta AMIs, I had around a dozen instances of data corruption.
> Previously when using the alestic.com images, and now that I've
> switched back to them, I haven't seen any corruption.  Is there any
> known issue this could be related to?
>
> I use machines by raiding together the disks via RAID 0, and then
> installing ReiserFS on top of that.  The list of commands I use to do
> this is at the bottom of this email.  The workload of the machines
> changes somewhat, but generally it maxes out both of the cores, uses
> around 15MB/s of disk throughput (split evenly between reads and writes),
> and has the disks around 60% full.  Our data is always compressed on disk
> (LZO), and we have checksums every 64KB of uncompressed data.  What I
> would see is that at a seemingly random point in a file the checksum
> wouldn't match, although the checksums for the rest of the file both
> before and after this point would be fine.  I didn't notice anything
> unusual in the system logs.

Thanks for the detailed information; that may help to narrow the search
for the problem substantially.
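
To see whether the failures follow any pattern, it could be worth logging
exactly which 64KB blocks fail their checksum. A minimal sketch in Python,
assuming CRC32 checksums computed over already-decompressed data; the real
checksum algorithm and file layout are not stated above, and the function
names here are hypothetical:

```python
import zlib

# Assumption: one checksum per 64KB of uncompressed data, as described above.
BLOCK_SIZE = 64 * 1024


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return one CRC32 per block_size chunk of data (the last chunk may be short)."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def find_bad_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return the indexes of blocks whose CRC32 differs from the expected value."""
    actual = block_checksums(data, block_size)
    if len(actual) != len(expected):
        raise ValueError(
            "block count mismatch: %d vs %d" % (len(actual), len(expected))
        )
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    # Build 256KB of sample data (4 blocks) and record its checksums.
    original = bytes(range(256)) * 1024
    sums = block_checksums(original)

    # Simulate corruption by flipping one byte inside the third block.
    damaged = bytearray(original)
    damaged[BLOCK_SIZE * 2 + 10] ^= 0xFF

    for index in find_bad_blocks(bytes(damaged), sums):
        print("block %d (byte offset %d) fails its checksum"
              % (index, index * BLOCK_SIZE))
```

Recording the failing block indexes and offsets on each instance would make
it easier to compare runs on the beta AMIs against runs on the Alestic
images.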

Is ReiserFS integral to the solution, or a personal preference? It
jumped out at me as an area of risk, as it's not a filesystem we're
particularly focused on. Ext3, the newer ext4, and ultimately btrfs
would be the "stable, next, future" default filesystems we'd recommend
unless there was a specific technical reason to do otherwise. If Reiser
isn't integral, I'd be interested in your results with ext3, both
performance- and stability-wise.

The Alestic images and the beta AMIs use different kernels, as I
understand it. Chuck and Eric will be able to say in detail, but AIUI the
beta AMIs use newer kernels, which bring some benefits but are also quite
possibly the source of new gotchas. You may have triggered one of those.

