Slower performance with ext4

Wed Nov 4 09:52:36 UTC 2009

Chan Chung Hang Christopher wrote:
>
>
> ROTFL. Nice OPTIMIZATION? One is possibly doing almost DOUBLE the 
> writes. It is really only an optimization if you are using ext3 
> data=journal for a mail queue and the journal is on a uber fast nvram 
> card (memory speed versus disk speed) because most mails should not 
> queue and if you have a nice big nvram card to act as a buffer and speed 
> up response to fsync calls for other cases. Hence why most people use 
> raid cards with nice big bbu caches nowadays. /me jumps up and down on a 
> bunch of 3ware 75xx/85xx cards.
>
>   
>>     
>
> Not so fast pal. data=writeback issues a flush for data...and nothing 
> else (goto flush ... out) and data=ordered issues a call that syncs the 
> inode only. The only part where data buffers are synced is 
> data=writeback (just like what others have explained about 
> data=writeback) and there is no data buffer related call for data = 
> ordered. Just an inode sync.
>
> However, I do have my doubts about the journal being used when 
> data=ordered/writeback. I have not spent a lot of time but I cannot find 
> where the inode sync call puts anything in the journal...the call is 
> generic and not specific to ext3 too. It appears things have changed 
> since barriers were introduced.
>   
>>     

Actually I think we have both misunderstood this point - because the 
code we are looking at is not the whole story. How it works is that an 
application calls fsync(), which will then call sys_fsync(), which will 
(amongst other things) call:

- generic_block_fdatasync() to sync the *data* blocks
- ext3_sync_file() to sort out the metadata and journal stuff

Note the comments in the links you posted actually mention this. We have 
been looking at the latter code only in isolation. I think this article:

http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/ssd’s-journaling-and-noatimerelatime

discusses the business quite well: data=journal *does* write the data 
twice! Once to the files themselves and once to the journal. However, 
under specialized circumstances this is still faster than the other 
journal modes.

> Again, in Linux there ain't no signal to the disk write cache to flush. 
> Either you turn it off or suffer the consequences. Did you miss the 
> Notes at the end of the fsync (2) man page?
>
>   

Exactly - that is precisely the point I was making previously. Note that 
SCSI/SAS disks generally default to the write cache being *off* which 
makes 'em safer choices for serious storage. Write cache *on* means you 
are at the mercy of how good the barrier support is (not that great 
generally it seems), no matter what journal options are used.

Now I think that our differing emphasis on data vs metadata is probably 
due to you minding mail servers (lots of important metadata changes from 
new files etc.) and me minding databases (typically no important metadata 
changes - e.g. InnoDB typically has everything in 3 files...but very 
important data changes - e.g. transaction logs).

In your use case, it makes sense to use data=journal. In mine typically 
it does not (note that a database transaction log functions like a 
journal - a serially appended file of transactions - so 
data=ordered, data=writeback or even XFS journaling etc. is not only fine but 
optimal [1])!
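
For comparison, a minimal sketch of the database side: a transaction log 
that is only ever appended to and fsync()ed on commit. This is not how 
InnoDB or any particular database implements its log; AppendOnlyLog and 
the newline record framing are invented for the example. The point is 
that the file is created once, so after that almost all the important 
changes are data, not metadata.

```python
import os


class AppendOnlyLog:
    """Hypothetical serially appended transaction log."""

    def __init__(self, path):
        # O_APPEND makes every write land at the current end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def commit(self, record):
        """Append one record and return only once fsync() has completed."""
        data = record + b"\n"
        while data:
            written = os.write(self.fd, data)  # may be a partial write
            data = data[written:]
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)
```

Usage would be along the lines of log = AppendOnlyLog("txn.log"), then 
log.commit(b"...") once per transaction. As above, this only covers what 
the kernel is asked to do; the disk write cache is a separate matter.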

regards

Mark

[1] Or even ext2 in some cases.