Slower performance with ext4
Chan Chung Hang Christopher
christopher.chan at bradbury.edu.hk
Tue Nov 3 14:16:55 UTC 2009
markir at paradise.net.nz wrote:
> Quoting Chan Chung Hang Christopher <christopher.chan at bradbury.edu.hk>:
>
>
>
>> Maybe things have changed for XFS now but for ext3, disk = journal.
>>
>> http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/fsync.c#L71
>>
>> When data=journal, data and metadata for the file are written to the
>> journal and then fsync returns. End of story.
>>
>> When data=ordered, when metadata is written via sync_inode(), fsync
>> returns and you hope nothing happens within the next half second if you
>> want data consistency too.
>>
>> Hence the reason why an ext3 filesystem on software raid but mounted
>> data=journal and with an external journal on a bbu nvram card will blow
>> away other filesystems in performance and data consistency.
>>
>> Comments for your pleasure:
>>
>> http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/fsync.c#L53 (lines 53-70)
>>
>> /*
>>  * data=writeback:
>>  *  The caller's filemap_fdatawrite()/wait will sync the data.
>>  *  sync_inode() will sync the metadata
>>  *
>>  * data=ordered:
>>  *  The caller's filemap_fdatawrite() will write the data and
>>  *  sync_inode() will write the inode if it is dirty.  Then the caller's
>>  *  filemap_fdatawait() will wait on the pages.
>>  *
>>  * data=journal:
>>  *  filemap_fdatawrite won't do anything (the buffers are clean).
>>  *  ext3_force_commit will write the file data into the journal and
>>  *  will wait on that.
>>  *  filemap_fdatawait() will encounter a ton of newly-dirtied pages
>>  *  (they were dirtied by commit).  But that's OK - the blocks are
>>  *  safe in-journal, which is all fsync() needs to ensure.
>>  */
>>
>>
>>
>
> Good idea to post the source :-).
>
> However it does not seem to actually support your statement.
>
> When fs is mounted data=journal then yes - the logic goes as you suggest.
> Clearly, as the data+metadata is in the journal, then this is all we need to
> sync (it's a nice optimization).
>
ROTFL. Nice OPTIMIZATION? One is possibly doing almost DOUBLE the
writes. It is really only an optimization if you are using ext3
data=journal for a mail queue and the journal is on an uber-fast nvram
card (memory speed versus disk speed), because most mails should not
queue, and if you have a nice big nvram card to act as a buffer and
speed up the response to fsync calls in other cases. That is why most
people use raid cards with nice big bbu caches nowadays. /me jumps up
and down on a bunch of 3ware 75xx/85xx cards.
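
To put a rough number on "almost DOUBLE", here is a back-of-the-envelope sketch. It assumes that with data=journal both data and metadata go to the journal and are later written again to their final location, while in the other modes only metadata goes through the journal. Journal descriptor and commit blocks are ignored, and the function name and the 64 KiB / 4 KiB figures are made up for illustration.

```python
def bytes_written(data_bytes, metadata_bytes, mode):
    """Rough count of bytes sent to storage for one write followed by fsync.

    Assumption: journal bookkeeping blocks are ignored, so this is a lower
    bound meant only to compare the modes against each other.
    """
    if mode == "journal":
        # Data and metadata both pass through the journal first.
        journal = data_bytes + metadata_bytes
    elif mode in ("ordered", "writeback"):
        # Only metadata is journalled.
        journal = metadata_bytes
    else:
        raise ValueError(f"unknown data mode: {mode}")
    # Everything is eventually written to its final location as well.
    final = data_bytes + metadata_bytes
    return journal + final


# Hypothetical 64 KiB mail with 4 KiB of metadata updates.
data, meta = 64 * 1024, 4 * 1024
ratio = bytes_written(data, meta, "journal") / bytes_written(data, meta, "ordered")
print(round(ratio, 2))  # 1.89
```

With the journal on the same spindles as the data, that extra traffic competes with everything else; with the journal on nvram it mostly does not, which is the whole point of the setup described above.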
> In other cases (no journal, data=ordered,writeback), then the metadata is
> synced to the journal, and the data buffers are synced to their respective
> inodes - that is what the comments appear to say as well.
>
> So it seems that disk = journal *only* if you are journalling the *data*! (not
> that staggering an observation, but as you mentioned does explain why sometimes
> data=journal performs better than the other ext3 journal options).
>
>
Not so fast pal. data=writeback issues a flush for data...and nothing
else (goto flush ... out) and data=ordered issues a call that syncs the
inode only. The only part where data buffers are synced is
data=writeback (just like what others have explained about
data=writeback) and there is no data buffer related call for
data=ordered. Just an inode sync.
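
To keep the three cases straight, here is the comment from fs/ext3/fsync.c quoted earlier in the thread, restated as a small lookup table. This is only a paraphrase of that comment in Python, not kernel code, and it includes the steps the comment attributes to the caller of the ext3 function. The names FSYNC_STEPS and steps_for are made up for illustration.

```python
# Paraphrase of the fs/ext3/fsync.c comment; one entry per mount data mode.
FSYNC_STEPS = {
    "writeback": [
        "caller's filemap_fdatawrite()/wait syncs the data",
        "sync_inode() syncs the metadata",
    ],
    "ordered": [
        "caller's filemap_fdatawrite() writes the data",
        "sync_inode() writes the inode if it is dirty",
        "caller's filemap_fdatawait() waits on the pages",
    ],
    "journal": [
        "filemap_fdatawrite does nothing (the buffers are clean)",
        "ext3_force_commit writes the file data into the journal and waits",
        "filemap_fdatawait() sees newly-dirtied pages; blocks are safe in-journal",
    ],
}


def steps_for(mode):
    """Return the fsync steps the comment lists for the given data mode."""
    try:
        return FSYNC_STEPS[mode]
    except KeyError:
        raise ValueError(f"unknown data mode: {mode}") from None


for mode in FSYNC_STEPS:
    print(f"data={mode}:")
    for step in steps_for(mode):
        print(f"  - {step}")
```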
However, I do have my doubts about the journal being used when
data=ordered/writeback. I have not spent a lot of time on it, but I
cannot find where the inode sync call puts anything in the journal...the
call is generic and not specific to ext3 either. It appears things have
changed since barriers were introduced.
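
Before arguing about which code path applies, it helps to check how the filesystem is actually mounted. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, comma-separated options). The helper name and the sample lines are invented, and the kernel may leave default options out of the list, so a missing data= entry does not prove anything by itself.

```python
def mount_options(mounts_text, mount_point):
    """Return the mount options of mount_point as a dict, or None if absent."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        opts = {}
        for opt in fields[3].split(","):
            key, sep, value = opt.partition("=")
            # Flags such as "rw" carry no value; record them as True.
            opts[key] = value if sep else True
        found = opts  # a later entry for the same mount point wins
    return found


# Invented sample in /proc/mounts format.
SAMPLE = """\
/dev/md0 /var/spool/mail ext3 rw,data=journal,barrier=1 0 0
/dev/sda1 / ext3 rw,errors=remount-ro 0 0
"""

opts = mount_options(SAMPLE, "/var/spool/mail")
print(opts.get("data", "not listed"), opts.get("barrier", "not listed"))
```

On a real box, pass `open("/proc/mounts").read()` instead of SAMPLE.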
> Also there is still the issue of whether your data (or metadata) actually hits
> the disk platter (whether via the journal or the file itself), and this concerns
> the business of disk write caches and barrier support - since for journal or file
> you gotta signal the backing device to flush. If it tells fibs to you, or your
> barrier support is buggy - then you can still get data loss, no matter what fs
> options are enabled.
>
>
Again, in Linux there ain't no signal to the disk write cache to flush.
Either you turn it off or suffer the consequences. Did you miss the
Notes at the end of the fsync(2) man page?
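
For anyone following along, this is roughly what the application side looks like. It is a minimal sketch: the helper name and the file name are made up. The fsync call only returns once the kernel has done its part. Per the man page notes, if the drive's write cache is on, the data may still not be on the platter at that point.

```python
import os
import tempfile


def write_and_fsync(path, payload):
    """Write payload to path and call fsync before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push data and metadata for this file to storage.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "queue-file")
    write_and_fsync(target, b"message body\n")
    print(os.path.getsize(target))  # 13
```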