Tracker in Edgy?

John Richard Moser nigelenki at comcast.net
Fri Jun 30 20:17:03 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Jamie McCracken wrote:
> John Richard Moser wrote:
> 
>> Want a bet?  I've seen JFS kill all of /usr; I've seen XF86Config get
>> portions of Xorg.0.log in it on ReiserFS (how did THAT happen?!); and
>> I've seen XFS and EXT3 straight truncate files that were in the middle
>> of being resized when the power drops.
>>
>> My worry is the database may be growing and suddenly get truncated to
>> half its size or just thrown out by the file system.  Even journaled
>> file systems will sometimes not know quite what to do; but they'll know
>> what needs fixing right off.  They don't magically complete anything you
>> do; they just clean up the file system by repairing meta-data, which may
>> involve truncating files, freeing blocks, and unlinking.
>>
>> This is an inherent hazard of having files that change size.  As long as
>> you're making changes in-place, the file system is probably not going to
>> have meta-data (besides mtime and atime, which are journaled perfectly)
>> being rewritten; however, when a file is being resized, you can 1) wind
>> up truncating it; 2) wind up deleting/unlinking it; or 3) wind up
>> filling it with zeroes (if the FS doesn't know the original size or
>> allocated blocks, it should fill it with zeroes to avoid leaking
>> previously deleted, possibly privileged data).
> 
> You can tell Ext3 and ReiserFS to journal data as well as metadata and
> in those cases it should protect you from this. (use data=journal in
> /etc/fstab).
> 

Let's take a guess at how this works:


Without data journaling:

 - Journal file resize
 - Journal new blocks allocated to the file
 - Set those blocks in use
 - Allocate those blocks to the file
 - Clear the journal entry for new blocks allocated
 - Change the file's size
 - Zero the blocks
 - Clear the journal entry for file resizing
 - Write the data

With data journaling:

 - Journal file resize
 - Journal new blocks allocated to the file
 - Set those blocks in use
 - Allocate those blocks to the file
 - Clear the journal entry for new blocks allocated
 - Change the file's size
 - Zero the blocks
 - Clear the journal entry for file resizing
 - Journal the data
 - Write the data

Remember, the file is resized and written to in separate steps; there
are different POSIX interfaces for each (ftruncate() to resize, write()
to store data), unless you are simply appending to the end of the file.
The newly allocated blocks have to be zeroed before the resize is
finished.
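
A rough sketch of that split, assuming a POSIX system and Python's os
module; the file name and the helper grow_then_write() are made up for
illustration.  It resizes with ftruncate(), looks at the file, then
writes, so you can see the new space read back as zeroes in between:

```python
import os
import tempfile

def grow_then_write(path, new_size, offset, payload):
    """Resize a file and write into it as two separate system calls."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, new_size)                # step 1: resize only
        after_resize = os.pread(fd, new_size, 0)  # new space reads as zeroes
        os.pwrite(fd, payload, offset)            # step 2: store the data
        after_write = os.pread(fd, new_size, 0)
    finally:
        os.close(fd)
    return after_resize, after_write

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")      # hypothetical file name
    after_resize, after_write = grow_then_write(demo_path, 16, 8, b"RECORD")
    print(after_resize)
    print(after_write)
```

Anything that interrupts the process between the two calls leaves a
file of the new size with no new data in it.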

Now some file systems will defer this zeroing to a later point in time,
placing it on writeback (reiserfs).  Crash during a resize, and one of
two things happens:  1) the file is zeroed out, or at least the tail end
is; 2) (reiserfs) you get access to old, deleted data (now your database
has junk in it).
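
The two outcomes can be put in a toy model.  This is only an in-memory
illustration of the guess above, not how any real file system is
implemented; the function tail_after_crash() and its arguments are
invented for the example:

```python
def tail_after_crash(old, stale, new, data_reached_disk, zeroing_deferred):
    """Contents a reader might see after a crash while growing a file.

    old   -- bytes already in the file before the resize
    stale -- bytes left in the newly allocated blocks by a deleted file
    new   -- bytes the application meant to put in the new space
    """
    if len(stale) < len(new):
        raise ValueError("stale must cover the newly allocated space")
    if data_reached_disk:
        return old + new                 # the write completed; nothing lost
    if zeroing_deferred:
        return old + stale[:len(new)]    # outcome 2: old, deleted data shows
    return old + bytes(len(new))         # outcome 1: tail reads as zeroes

print(tail_after_crash(b"ROW1", b"pass", b"ROW2", False, False))
print(tail_after_crash(b"ROW1", b"pass", b"ROW2", False, True))
```

Either way the database sees a file of the right size whose tail is not
what it wrote.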

Data journaling does not magically solve anything; it just slows things
down.  As for meta-data journaling, different implementations have
different problems; some methods can truncate if you're shrinking a
file, some methods can zero-fill a file for whatever reason, and some
will just remove the file.

> Of course if there are bugs in the FS implementation then problems can
> occur and I suspect that might be why some of the newer ones (like JFS)
> have problems too.

JFS is in general trash.

> 
>>
>>> also mysql has an excellent reputation with regards to the integrity of
>>> its databases.
>>>
>>
>> I've heard exactly the contrary (MySQL can be easily shredded by a
>> single power drop, and its recovery routines don't work most of the
>> time); but I haven't seen a problem myself.  Like I said, I'm more
>> worried about the underlying file system interacting with the database.
> 
> On dodgy filesystems that's true but try a journalised data and metadata
> FS and it should be extremely robust.
> 

Define 'dodgy'.

It seems to me it's much more efficient to buffer writes and then write
them out in a straight line in journal and then on disk, instead of in
the chronological order they come in.  It's also a lot faster and more
efficient to combine writes and adjust meta-data changes to reflect
batch writing, instead of writing data 5 times to the same spot on disk.

In practice, a write-back buffer that gets flushed every 10 seconds and
has the same area rewritten 3 times will flush once for all 3 writes.
The sequence can be DB_JOURNAL, DB_DATA, DB_JOURNAL, with both journal
writes landing on the same area.  Now the first journal write never hit
disk; it only ever existed in writeback memory.
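
A toy version of that buffer, to make the coalescing visible.  The
class WriteBackBuffer and the two offsets are hypothetical; a real page
cache is far more involved, but the effect on a rewritten area is the
same idea:

```python
class WriteBackBuffer:
    """Toy write-back cache: later writes to an offset replace earlier ones."""

    def __init__(self):
        self.pending = {}      # offset -> latest unflushed bytes
        self.disk = {}         # what has actually been flushed
        self.disk_writes = 0   # number of writes that reached the "disk"

    def write(self, offset, data):
        self.pending[offset] = data   # overwrites any unflushed data here

    def flush(self):
        for offset in sorted(self.pending):   # write out in on-disk order
            self.disk[offset] = self.pending[offset]
            self.disk_writes += 1
        self.pending.clear()

DB_JOURNAL, DB_DATA = 0, 4096   # hypothetical offsets within the file

buf = WriteBackBuffer()
buf.write(DB_JOURNAL, b"begin txn 1")
buf.write(DB_DATA, b"row contents")
buf.write(DB_JOURNAL, b"commit txn 1")
buf.flush()
print(buf.disk_writes)        # three writes were issued, fewer reach disk
print(buf.disk[DB_JOURNAL])   # only the last journal write survives
```

The "begin" record was replaced in memory before the flush, so the disk
never saw the ordering the database thought it was getting.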

To work around this, database backends like to force flushes of the
file.  The POSIX interface for this (fsync()) can only flush the whole
file, not a chosen range, so the journal has to be at the beginning
and be flushable from beginning to end, and has to assume the changes
may or may not be written to the db yet.  This is why sending a single
joint transaction of 50000 operations is faster than sending about 50
individual ones.
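
A minimal sketch of why batching helps, assuming a POSIX system; the
helper append_records() and the record format are invented here.  It
counts forced flushes rather than timing them, since the cost of each
fsync() depends heavily on the disk and file system:

```python
import os
import tempfile

def append_records(path, records, batch):
    """Append records to a journal file; return how many fsync calls ran.

    batch=True  -> one fsync after all records (one large transaction)
    batch=False -> one fsync after every record (many small transactions)
    """
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)
            if not batch:
                os.fsync(fd)   # flushes the whole file, not a byte range
                syncs += 1
        if batch:
            os.fsync(fd)
            syncs += 1
    finally:
        os.close(fd)
    return syncs

with tempfile.TemporaryDirectory() as tmp:
    ops = [b"op\n"] * 50
    print(append_records(os.path.join(tmp, "batched.log"), ops, batch=True))
    print(append_records(os.path.join(tmp, "single.log"), ops, batch=False))
```

The data written is identical in both cases; only the number of times
the database stops and waits for the disk changes.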


Remember the database does about what the file system does in terms of
integrity management and data organization.  You're dealing with a piece
of data that's pretending to be meta-data; and it has an inherent
coarse-grained integrity check around it when something goes wrong.
It's nice, but it's not magic.

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond

    We will enslave their women, eat their children and rape their
    cattle!
                  -- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRKV4rQs1xW0HCTEFAQLkbQ/8D6dRuCYFmQlPeG23Q5xWphDAprml0MTg
3Y2I5cxC3+ZanjuLbrC8fscNtOZV47iHqj4BoY2+PGlE/72096Y4r+B3mCYGIIJI
nL7wLAaixmLcSZMLFGBe3txwoES5AjSqWzHS41a2m5fN2ObT/BfTvcTDpRwGrs7t
6ZOr/XinHXda6uu5e9EUgacAYbOyvBonkZ/PndaQnGKOKKVyxt1xL33ozmYEUSbj
fSSGblanAEa///qbCdjONZ+q9IBaLJmi2eAYzy3ZGWDD+KFMQp/LbrdqXrHbvtSv
zIkSweL0yFnrs8sfHpd3y2WXcF/ziUeFRFzfkxpMRvuQG7h22iC0udTzRiqmxpAv
F53B2PbIKeUf79BaU1Xq/Gnv+7kPs4GtoiLysFKTyZZqkDDWieJBMhJumDo5Lrq4
b8FTvovPb0jPJJQXOgA/blI7DFRZL+2bQVyeHkpyJvgqpYfuTH3aH5qP5igKKvgn
0xDGsuOZ9YaYg6w977tfJO8nW8rWVnEb8qmtO8DKmG3ga/szQuU9ugSMmfrbtR/z
dPr1SMWpbS+xNF1+zRAlKuRQ8Wh8Lqcy7y1OCh0duuJSaMRskF+MOOkmSoA72AbT
rgOL+F9bO/mvtHH7lp3vkSvntAZHACwrxhyKSp/KL5O0S6AquI/6+JMQYBdah3A7
SHU5VILCMqY=
=PJca
-----END PGP SIGNATURE-----



More information about the ubuntu-devel mailing list