logrotate configuration seems wrong

Tue Sep 16 14:28:16 UTC 2014

John,

Thank you taking the time to run the logging through these tests.

I found that there is also the $SystemLogRateLimitInterval setting. Setting
that to 0 will disable the default rate limiting in rsyslog.
    http://www.rsyslog.com/changing-the-settings/

I would be curious to see what the performance and logging looks like with
the rate limiting completely disabled.

Should I submit the changes at a PR? Since even though they aren't perfect,
it does resolve a real bug (size mismatch).

Not sure why the second file is hanging around, I wasn't able to replicate
that locally and I wasn't able to find anything that suggested there was a
minimum age. There is a maxage, which ensures old files get deleted, but
not the inverse. I will continue to research optimizations and fixes for
our rsyslog configuration.

Wayne

On Tue, Sep 16, 2014 at 4:13 AM, John Meinel <john at arbash-meinel.com> wrote:

> Oh, and after all this has been running for a while, I did see:
> -rw------- 1 syslog adm    178M Sep 16 08:11 all-machines.log
> -rw------- 1 syslog adm    513M Sep 16 07:53 all-machines.log.1
> -rw------- 1 syslog adm    513M Sep 16 07:26 all-machines.log.2
>
> So it is successfully rotating the log file, but it seems to be keeping
> too many of them?
>
> Maybe it is keeping more because they are happening on the same day and
> there is some sort of "minimum age before deleting" rule?
>
> John
> =:->
>
> On Tue, Sep 16, 2014 at 12:11 PM, John Meinel <john at arbash-meinel.com>
> wrote:
>
>> So poking around and filing
>> https://bugs.launchpad.net/juju-core/+bug/1369909
>> I found lines about
>> Sep 16 07:48:14 ip-10-39-60-247 rsyslogd-2079: message repeated 498
>> times: [too many tcp sessions - dropping incoming request [try
>> http://www.rsyslog.com/e/2079 ]]
>>
>> That is 500 times rsyslog had to deny someone. (So probably 500 Rsyslog
>> workers that got restarted, triggering 500*5 lines of machine-0.log
>> information.)
>>
>> By putting this line in our /etc/rsyslog.d/25-juju.conf
>>
>> $InputTCPMaxSessions 20000 # Maximum TCP sessions (default 200)
>>
>> In the "ModLoad imtcp" section (before the InputTCPServerRun) and then
>> restarting (reloading is not enough), it seemed to make the system a lot
>> happier.
>>
>> It isn't 100% happy, as still the #1 thing in machine-0.log is
>> information about WatchForRsyslogChanges being stopped. But it is at a
>> lower rate.
>>
>> Again filtering for a single connection it seems that the watcher is
>> being stopped every 2 minutes, and then restarted 3s later.
>>
>> Given that this is are logging infrastructure, failures like this mean
>> that it is going to be hard to debug, since we can't aggregate logs to the
>> central point that say "I could not connect to the logging infrastructure".
>>
>> John
>> =:->
>>
>>
>>
>> On Tue, Sep 16, 2014 at 11:39 AM, John Meinel <john at arbash-meinel.com>
>> wrote:
>>
>>> I tested your branch in the same configuration. First I had to change
>>> "logging-config: <root>=DEBUG" again. I thought that if you did "juju
>>> bootstrap --debug" that it always left DEBUG level on.
>>>
>>> Anyway, something doesn't quite seem right, as I saw:
>>> -rw------- 1 syslog adm    500M Sep 16 07:25 all-machines.log
>>> ...
>>> -rw------- 1 syslog syslog 300M Sep 16 07:21
>>> machine-0-2014-09-16T07-21-02.193.log
>>> -rw------- 1 syslog syslog 300M Sep 16 07:25
>>> machine-0-2014-09-16T07-25-14.840.log
>>> -rw------- 1 syslog syslog  31M Sep 16 07:25 machine-0.log
>>>
>>>
>>> Which means that machine-0.log has gotten more than 600MB of data for
>>> sure (I saw it rotate at least once above this). But there is only a single
>>> all-machines.log of only 500MB.
>>>
>>> As I sat there watching, I did manage to see:
>>> -rw------- 1 syslog adm    7.0M Sep 16 07:26 all-machines.log
>>> -rw------- 1 syslog adm    513M Sep 16 07:26 all-machines.log.1
>>> ...
>>>
>>> Which looks correct. But it would seem that something (load?) was
>>> causing it to not get all of the machine-0 messages.
>>> Maybe it is getting rate limited?
>>> I can watch the machine-0.log file fill at about 1MB per second. And
>>> then after about 5s I see all-machines.log only go up by a couple of MB.
>>> (machine-0 log seemed to go up by about 12MB, and then all-machines.log
>>> only went up by 3MB.)
>>>
>>> Now, I'm also seeing a rediculous amount of logging, and lots of lines
>>> like:
>>> http://paste.ubuntu.com/8356019/
>>>
>>> Which seems to indicate that every 3 seconds the Rsyslog Watcher is
>>> dying and causing everything to restart and reconnect and ask for the new
>>> credentials.
>>>
>>> I'm also a bit surprised that we're telling them to connect to a DNS
>>> name, and not an IP address, but the resetting is the important bit. I
>>> filed https://bugs.launchpad.net/juju-core/+bug/1369900 to track that
>>> bug.
>>>
>>> Anyone care to try to figure out why I'm seeing 3-4x the amount of data
>>> in machine-0.log than seems to be ending up in all-machines.log ?
>>>
>>> John
>>> =:->
>>>
>>> On Mon, Sep 15, 2014 at 7:13 PM, Wayne Witzel <
>>> wayne.witzel at canonical.com> wrote:
>>>
>>>> Here is a branch that addresses the concerns and bug. I've tested it
>>>> locally and with digitalocean. I would love for it to be tested under the
>>>> scaling scenario where you first encountered the issues.
>>>>
>>>>     https://github.com/wwitzel3/juju/tree/ww3-rsyslogd-logrotate
>>>>
>>>> I've changed the size in logrotate to an arbitrarily small size, since
>>>> the rotation is driven by rsyslog. I've also updated the "style" of the
>>>> rotation to create a new log file and use postrotate to HUP rsyslog.
>>>>
>>>> Also I removed compression. I don't think we need it given that we are
>>>> only keeping one rotation around and it is size restricted. This will just
>>>> make it easier to get at information across both of the log files.
>>>>
>>>> Wayne
>>>>
>>>> On Mon, Sep 15, 2014 at 10:30 AM, Nate Finch <nate.finch at canonical.com>
>>>> wrote:
>>>>
>>>>> tl;dr:   rsyslog sees 512,000,000 bytes and tells logrotate "time to
>>>>> rotate!"  and logrotate sees less than 512MB and say "nah, not big enough"
>>>>> and rsyslog never writes the logs anymore because the file is too big.
>>>>>
>>>>> On Mon, Sep 15, 2014 at 10:07 AM, Wayne Witzel <
>>>>> wayne.witzel at canonical.com> wrote:
>>>>>
>>>>>> When this was being implemented and during the review process the
>>>>>> actual size of the rotated files got adjusted. I tested very thoroughly the
>>>>>> actual rotation script and also tested the triggering of the rotation
>>>>>> script. Enough that I was happy landing the changes.
>>>>>>
>>>>>> When the sizes were adjusted the same amount of diligence was not
>>>>>> applied in testing. This is what introduced the bug.
>>>>>>
>>>>>> The size mismatch is problem. rsyslogd will stop logging to the file
>>>>>> and attempted to execute the script in the outchannel every time it gets a
>>>>>> new log message, which runs logrotate, which does nothing because the file
>>>>>> isn't big enough.
>>>>>>
>>>>>> Here is the rsyslog docs for implementation of sized based rotation.
>>>>>>     http://www.rsyslog.com/doc/log_rotation_fix_size.html
>>>>>>
>>>>>> I first attempted use the mv command, but with our rsyslogd
>>>>>> configuration, when I performed a mv, rsyslogd would stop logging to the
>>>>>> file until I actually did a reload on the process. Also this didn't easily
>>>>>> support things like compression or adding more archives if we decided to
>>>>>> keep more than one rotation around. So the choice to use logrotate is was
>>>>>> made.
>>>>>>
>>>>>> As for the actual act of reloading rsyslog. You can use copytruncate
>>>>>> to avoid it all together or you can use postrotate to reload rsyslogd. In
>>>>>> both cases there is a small window of possible data loss. Though with
>>>>>> copyandtruncate it is more likley to happen when the system is under load.
>>>>>> So changing to a postrotate that reloads rsyslogd is probably a good idea.
>>>>>>
>>>>>> The rotate setting is just the number of files to keep around. I
>>>>>> asked some people after the size update if 1 rotation, a total of 1GB was
>>>>>> enough to keep around. Most people thought it was fine, since the original
>>>>>> purpose of the sized based long rotation was to address a ticket that was
>>>>>> complaining about the juju log folder taking up 3GB+ of space.
>>>>>>
>>>>>> I will get a ticket created to address the size mismatch issue and
>>>>>> switch the copyandtruncate approach to a postrotate of rsyslogd.
>>>>>>
>>>>>> Wayne
>>>>>>
>>>>>> On Mon, Sep 15, 2014 at 5:30 AM, John Meinel <john at arbash-meinel.com>
>>>>>> wrote:
>>>>>>
>>>>>>> So we are using rsyslog.conf to have it figure out when rotation
>>>>>>> needs to be done with:
>>>>>>> # Maximum size for the log on this outchannel is 512MB
>>>>>>> # The command to execute when an outchannel as reached its size
>>>>>>> limit cannot accept any arguments
>>>>>>> # that is why we have created the helper script for executing
>>>>>>> logrotate.
>>>>>>> $outchannel
>>>>>>> logRotation,{{logDir}}/all-machines.log,512000000,{{logrotateHelperPath}}
>>>>>>>
>>>>>>> I would think that would not require SIGHUP along with having it run
>>>>>>> the script at our request.
>>>>>>>
>>>>>>> John
>>>>>>> =:->
>>>>>>>
>>>>>>> On Mon, Sep 15, 2014 at 12:13 PM, Stuart Bishop <
>>>>>>> stuart.bishop at canonical.com> wrote:
>>>>>>>
>>>>>>>> On 15 September 2014 12:38, John Meinel <john at arbash-meinel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> > 7) "copytruncate" seems the wrong setting for interactive with
>>>>>>>> rsyslog. I
>>>>>>>> > believe rsyslog is already aware that the file needs to be
>>>>>>>> rotated, and thus
>>>>>>>>
>>>>>>>> It is only aware if you sent it a HUP signal.
>>>>>>>>
>>>>>>>> > it shouldn't be trying to write to the same file handle (and thus
>>>>>>>> we don't
>>>>>>>> > need to truncate in place). I'm not 100% sure on the interactions
>>>>>>>> here, but
>>>>>>>> > "copytruncate" seems to have an inherent likelyhood of dropping
>>>>>>>> data (while
>>>>>>>> > you are copying, if any data gets written then you'll miss those
>>>>>>>> last few
>>>>>>>> > bytes when you go to truncate, right?)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Stuart Bishop <stuart.bishop at canonical.com>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Juju-dev mailing list
>>>>>>> Juju-dev at lists.ubuntu.com
>>>>>>> Modify settings or unsubscribe at:
>>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Wayne Witzel III
>>>>>> wayne.witzel at canonical.com
>>>>>>
>>>>>> --
>>>>>> Juju-dev mailing list
>>>>>> Juju-dev at lists.ubuntu.com
>>>>>> Modify settings or unsubscribe at:
>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Wayne Witzel III
>>>> wayne.witzel at canonical.com
>>>>
>>>
>>>
>>
>

-- 
Wayne Witzel III
wayne.witzel at canonical.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20140916/8636cf7f/attachment-0001.html>