Slower performance with ext4

Chan Chung Hang Christopher christopher.chan at bradbury.edu.hk
Tue Nov 3 14:16:55 UTC 2009


markir at paradise.net.nz wrote:
> Quoting Chan Chung Hang Christopher <christopher.chan at bradbury.edu.hk>:
>
>
>   
>> Maybe things have changed for XFS now but for ext3, disk = journal.
>>
>> http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/fsync.c#L71
>>
>> When data=journal, data and metadata for file are written to the journal
>> and then fsync returns. End of story.
>>
>> When data=ordered, when metadata is written via sync_inode(), fsync
>> returns and you hope nothing happens within the next half second if you
>> want data consistency too.
>>
>> Hence the reason why a ext3 filesystem on software raid but mounted
>> data=journal and with an external journal on a bbu nvram card will blow
>> away other filesystems in performance and data consistency.
>>
>> Comments for your pleasure (fs/ext3/fsync.c, lines 53-70):
>>
>> /*
>>  * data=writeback:
>>  *  The caller's filemap_fdatawrite()/wait will sync the data.
>>  *  sync_inode() will sync the metadata
>>  *
>>  * data=ordered:
>>  *  The caller's filemap_fdatawrite() will write the data and
>>  *  sync_inode() will write the inode if it is dirty. Then the caller's
>>  *  filemap_fdatawait() will wait on the pages.
>>  *
>>  * data=journal:
>>  *  filemap_fdatawrite won't do anything (the buffers are clean).
>>  *  ext3_force_commit will write the file data into the journal and
>>  *  will wait on that.
>>  *  filemap_fdatawait() will encounter a ton of newly-dirtied pages
>>  *  (they were dirtied by commit). But that's OK - the blocks are
>>  *  safe in-journal, which is all fsync() needs to ensure.
>>  */
>>
>>
>>     
>
> Good idea to post the source :-).
>
> However it does not seem to actually support your statement.
>
> When fs is mounted data=journal then yes - the logic goes as you suggest.
> Clearly, as the data+metadata is in the journal, then this is all we need to
> sync (it's a nice optimization).
>   

ROTFL. Nice OPTIMIZATION? One is possibly doing almost DOUBLE the
writes, since the data goes to the journal first and then to its final
location. It is really only an optimization in two cases: you are using
ext3 data=journal for a mail queue with the journal on an uber-fast
nvram card (memory speed versus disk speed), because most mails should
not queue; or you have a nice big nvram card to act as a buffer and
speed up the response to fsync calls in other cases. Hence why most
people use raid cards with nice big bbu caches nowadays. /me jumps up
and down on a bunch of 3ware 75xx/85xx cards.
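
For anyone who wants to measure the fsync response time being argued
about here, a minimal sketch follows. It mimics a mail-queue style
workload (write a small file, fsync it, remove it, repeat) and records
how long each fsync call takes. The function name time_fsyncs, the
message size and the file naming are assumptions made up for
illustration, not anything taken from ext3 or from a real MTA.

```python
import os
import tempfile
import time


def time_fsyncs(directory, count=20, size=4096):
    """Write `count` small files into `directory`, fsync each one,
    and return the list of per-call fsync latencies in seconds."""
    payload = b"x" * size
    latencies = []
    for i in range(count):
        path = os.path.join(directory, "msg-%04d" % i)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # blocks until the filesystem reports the write as done
            latencies.append(time.perf_counter() - start)
        finally:
            os.close(fd)
        os.unlink(path)  # a queued mail is normally delivered and removed quickly
    return latencies


if __name__ == "__main__":
    # Uses a temporary directory; point it at a directory on the
    # filesystem under test to get meaningful numbers.
    with tempfile.TemporaryDirectory() as tmp:
        results = time_fsyncs(tmp)
        print("mean fsync latency: %.6f s" % (sum(results) / len(results)))
```

Running it against directories on filesystems mounted with different
data= options (and with or without an external journal) gives a rough
comparison. It only times the call; it cannot tell whether the data
really reached the platter.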

> In other cases (no journal, data=ordered,writeback), then  the metadata is
> synced to the journal, and the data buffers are synced to their respective
> inodes - that is what the comments appear to say as well.
>
> So it seems that disk = journal *only* if you are journalling the *data*! (not
> that staggering an observation, but as you mentioned does explain why sometimes
> data=journal performs better than the other ext3 journal options). 
>
>   

Not so fast, pal. data=writeback issues a flush for data...and nothing
else (goto flush ... out) and data=ordered issues a call that syncs the
inode only. The only part where data buffers are synced is
data=writeback (just like what others have explained about
data=writeback) and there is no data buffer related call for
data=ordered. Just an inode sync.
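
To keep straight who does what in each mode, here is a small sketch
that transcribes the fsync.c comment quoted earlier into a lookup
table. It is only a model of that comment, not kernel code. The
"caller" / "ext3" labels are my reading of which side performs each
step, and the names FSYNC_STEPS and steps_for are made up for
illustration.

```python
# Steps transcribed from the comment in fs/ext3/fsync.c quoted in this thread.
# "caller" = the code that calls into ext3 for fsync (assumed attribution),
# "ext3"   = the ext3 fsync routine itself (assumed attribution).
FSYNC_STEPS = {
    "writeback": [
        ("caller", "filemap_fdatawrite()/wait syncs the data"),
        ("ext3", "sync_inode() syncs the metadata"),
    ],
    "ordered": [
        ("caller", "filemap_fdatawrite() writes the data"),
        ("ext3", "sync_inode() writes the inode if it is dirty"),
        ("caller", "filemap_fdatawait() waits on the pages"),
    ],
    "journal": [
        ("caller", "filemap_fdatawrite() does nothing (buffers are clean)"),
        ("ext3", "ext3_force_commit() writes file data into the journal and waits"),
        ("caller", "filemap_fdatawait() sees pages newly dirtied by the commit"),
    ],
}


def steps_for(mode, who=None):
    """Return the step descriptions for a data= mode, optionally
    restricted to one side ("caller" or "ext3")."""
    if mode not in FSYNC_STEPS:
        raise ValueError("unknown data= mode: %r" % mode)
    return [text for side, text in FSYNC_STEPS[mode] if who is None or side == who]


if __name__ == "__main__":
    for mode in ("writeback", "ordered", "journal"):
        print("data=%s" % mode)
        for text in steps_for(mode):
            print("  " + text)
```

Filtering with who="ext3" shows the point in dispute: in this reading
the ext3 side only touches the inode for writeback and ordered, and only
forces a journal commit for data=journal.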

However, I do have my doubts about the journal being used when
data=ordered/writeback. I have not spent a lot of time on it, but I
cannot find where the inode sync call puts anything in the journal, and
the call is generic rather than specific to ext3. It appears things have
changed since barriers were introduced.

> Also there is still the issue of does your data (or metadata) actually hit the
> disk platter (whether via the journal or the file itself), and this concerns the
> business of disk write caches and barrier support - since for journal or file
> you gotta signal the backing device to flush. If it tells fibs to you, or your
> barrier support is buggy - then you can still get data loss, no matter what fs
> options are enabled.
>
>   
Again, in Linux there ain't no signal to the disk write cache to flush.
Either you turn the write cache off or suffer the consequences. Did you
miss the Notes at the end of the fsync(2) man page?
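
Since this whole argument hinges on which data= mode and barrier
setting are actually in effect, a quick check is to look at the
filesystem's line in /proc/mounts. A minimal sketch, assuming the
options of interest show up there as data=... and barrier=...; the
helper name journal_options and the sample line are made up for
illustration. It says nothing about whether the drive's own write cache
is switched on.

```python
def journal_options(mounts_line):
    """Parse one /proc/mounts style line and return
    (fstype, data_mode, barrier). data_mode and barrier are None
    when the corresponding option is not listed on the line."""
    fields = mounts_line.split()
    if len(fields) < 4:
        raise ValueError("not a /proc/mounts line: %r" % mounts_line)
    fstype = fields[2]
    data_mode = None
    barrier = None
    for opt in fields[3].split(","):
        key, _, value = opt.partition("=")
        if key == "data":
            data_mode = value
        elif key == "barrier":
            barrier = value
    return fstype, data_mode, barrier


# Hypothetical sample line, invented for this example.
SAMPLE = "/dev/md0 /var/spool ext3 rw,noatime,data=journal,barrier=1 0 0"

if __name__ == "__main__":
    print(journal_options(SAMPLE))
    # On a live system, feed it the real lines instead:
    # with open("/proc/mounts") as f:
    #     for line in f:
    #         print(journal_options(line))
```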



