logrotate configuration seems wrong

Wed Sep 17 05:58:26 UTC 2014

Even with the rate limiting on, I'm seeing 100% of 1 CPU dedicated to
rsyslogd, I'm *guessing* that is TLS connection overhead, but I really
don't have any insight there. I can say that I'm still seeing some agents
get "could not connect" or "connection reset by peer" errors.

I certainly think we should land your changes so far, as it does fix a real
bug, and we can always continue to iterate on fixing other things.

John
=:->

On Tue, Sep 16, 2014 at 6:28 PM, Wayne Witzel <wayne.witzel at canonical.com>
wrote:

> John,
>
> Thank you taking the time to run the logging through these tests.
>
> I found that there is also the $SystemLogRateLimitInterval setting.
> Setting that to 0 will disable the default rate limiting in rsyslog.
>     http://www.rsyslog.com/changing-the-settings/
>
> I would be curious to see what the performance and logging looks like with
> the rate limiting completely disabled.
>
> Should I submit the changes at a PR? Since even though they aren't
> perfect, it does resolve a real bug (size mismatch).
>
> Not sure why the second file is hanging around, I wasn't able to replicate
> that locally and I wasn't able to find anything that suggested there was a
> minimum age. There is a maxage, which ensures old files get deleted, but
> not the inverse. I will continue to research optimizations and fixes for
> our rsyslog configuration.
>
> Wayne
>
>
> On Tue, Sep 16, 2014 at 4:13 AM, John Meinel <john at arbash-meinel.com>
> wrote:
>
>> Oh, and after all this has been running for a while, I did see:
>> -rw------- 1 syslog adm    178M Sep 16 08:11 all-machines.log
>> -rw------- 1 syslog adm    513M Sep 16 07:53 all-machines.log.1
>> -rw------- 1 syslog adm    513M Sep 16 07:26 all-machines.log.2
>>
>> So it is successfully rotating the log file, but it seems to be keeping
>> too many of them?
>>
>> Maybe it is keeping more because they are happening on the same day and
>> there is some sort of "minimum age before deleting" rule?
>>
>> John
>> =:->
>>
>> On Tue, Sep 16, 2014 at 12:11 PM, John Meinel <john at arbash-meinel.com>
>> wrote:
>>
>>> So poking around and filing
>>> https://bugs.launchpad.net/juju-core/+bug/1369909
>>> I found lines about
>>> Sep 16 07:48:14 ip-10-39-60-247 rsyslogd-2079: message repeated 498
>>> times: [too many tcp sessions - dropping incoming request [try
>>> http://www.rsyslog.com/e/2079 ]]
>>>
>>> That is 500 times rsyslog had to deny someone. (So probably 500 Rsyslog
>>> workers that got restarted, triggering 500*5 lines of machine-0.log
>>> information.)
>>>
>>> By putting this line in our /etc/rsyslog.d/25-juju.conf
>>>
>>> $InputTCPMaxSessions 20000 # Maximum TCP sessions (default 200)
>>>
>>> In the "ModLoad imtcp" section (before the InputTCPServerRun) and then
>>> restarting (reloading is not enough), it seemed to make the system a lot
>>> happier.
>>>
>>> It isn't 100% happy, as still the #1 thing in machine-0.log is
>>> information about WatchForRsyslogChanges being stopped. But it is at a
>>> lower rate.
>>>
>>> Again filtering for a single connection it seems that the watcher is
>>> being stopped every 2 minutes, and then restarted 3s later.
>>>
>>> Given that this is are logging infrastructure, failures like this mean
>>> that it is going to be hard to debug, since we can't aggregate logs to the
>>> central point that say "I could not connect to the logging infrastructure".
>>>
>>> John
>>> =:->
>>>
>>>
>>>
>>> On Tue, Sep 16, 2014 at 11:39 AM, John Meinel <john at arbash-meinel.com>
>>> wrote:
>>>
>>>> I tested your branch in the same configuration. First I had to change
>>>> "logging-config: <root>=DEBUG" again. I thought that if you did "juju
>>>> bootstrap --debug" that it always left DEBUG level on.
>>>>
>>>> Anyway, something doesn't quite seem right, as I saw:
>>>> -rw------- 1 syslog adm    500M Sep 16 07:25 all-machines.log
>>>> ...
>>>> -rw------- 1 syslog syslog 300M Sep 16 07:21
>>>> machine-0-2014-09-16T07-21-02.193.log
>>>> -rw------- 1 syslog syslog 300M Sep 16 07:25
>>>> machine-0-2014-09-16T07-25-14.840.log
>>>> -rw------- 1 syslog syslog  31M Sep 16 07:25 machine-0.log
>>>>
>>>>
>>>> Which means that machine-0.log has gotten more than 600MB of data for
>>>> sure (I saw it rotate at least once above this). But there is only a single
>>>> all-machines.log of only 500MB.
>>>>
>>>> As I sat there watching, I did manage to see:
>>>> -rw------- 1 syslog adm    7.0M Sep 16 07:26 all-machines.log
>>>> -rw------- 1 syslog adm    513M Sep 16 07:26 all-machines.log.1
>>>> ...
>>>>
>>>> Which looks correct. But it would seem that something (load?) was
>>>> causing it to not get all of the machine-0 messages.
>>>> Maybe it is getting rate limited?
>>>> I can watch the machine-0.log file fill at about 1MB per second. And
>>>> then after about 5s I see all-machines.log only go up by a couple of MB.
>>>> (machine-0 log seemed to go up by about 12MB, and then all-machines.log
>>>> only went up by 3MB.)
>>>>
>>>> Now, I'm also seeing a rediculous amount of logging, and lots of lines
>>>> like:
>>>> http://paste.ubuntu.com/8356019/
>>>>
>>>> Which seems to indicate that every 3 seconds the Rsyslog Watcher is
>>>> dying and causing everything to restart and reconnect and ask for the new
>>>> credentials.
>>>>
>>>> I'm also a bit surprised that we're telling them to connect to a DNS
>>>> name, and not an IP address, but the resetting is the important bit. I
>>>> filed https://bugs.launchpad.net/juju-core/+bug/1369900 to track that
>>>> bug.
>>>>
>>>> Anyone care to try to figure out why I'm seeing 3-4x the amount of data
>>>> in machine-0.log than seems to be ending up in all-machines.log ?
>>>>
>>>> John
>>>> =:->
>>>>
>>>> On Mon, Sep 15, 2014 at 7:13 PM, Wayne Witzel <
>>>> wayne.witzel at canonical.com> wrote:
>>>>
>>>>> Here is a branch that addresses the concerns and bug. I've tested it
>>>>> locally and with digitalocean. I would love for it to be tested under the
>>>>> scaling scenario where you first encountered the issues.
>>>>>
>>>>>     https://github.com/wwitzel3/juju/tree/ww3-rsyslogd-logrotate
>>>>>
>>>>> I've changed the size in logrotate to an arbitrarily small size, since
>>>>> the rotation is driven by rsyslog. I've also updated the "style" of the
>>>>> rotation to create a new log file and use postrotate to HUP rsyslog.
>>>>>
>>>>> Also I removed compression. I don't think we need it given that we are
>>>>> only keeping one rotation around and it is size restricted. This will just
>>>>> make it easier to get at information across both of the log files.
>>>>>
>>>>> Wayne
>>>>>
>>>>> On Mon, Sep 15, 2014 at 10:30 AM, Nate Finch <nate.finch at canonical.com
>>>>> > wrote:
>>>>>
>>>>>> tl;dr:   rsyslog sees 512,000,000 bytes and tells logrotate "time to
>>>>>> rotate!"  and logrotate sees less than 512MB and say "nah, not big enough"
>>>>>> and rsyslog never writes the logs anymore because the file is too big.
>>>>>>
>>>>>> On Mon, Sep 15, 2014 at 10:07 AM, Wayne Witzel <
>>>>>> wayne.witzel at canonical.com> wrote:
>>>>>>
>>>>>>> When this was being implemented and during the review process the
>>>>>>> actual size of the rotated files got adjusted. I tested very thoroughly the
>>>>>>> actual rotation script and also tested the triggering of the rotation
>>>>>>> script. Enough that I was happy landing the changes.
>>>>>>>
>>>>>>> When the sizes were adjusted the same amount of diligence was not
>>>>>>> applied in testing. This is what introduced the bug.
>>>>>>>
>>>>>>> The size mismatch is problem. rsyslogd will stop logging to the file
>>>>>>> and attempted to execute the script in the outchannel every time it gets a
>>>>>>> new log message, which runs logrotate, which does nothing because the file
>>>>>>> isn't big enough.
>>>>>>>
>>>>>>> Here is the rsyslog docs for implementation of sized based rotation.
>>>>>>>     http://www.rsyslog.com/doc/log_rotation_fix_size.html
>>>>>>>
>>>>>>> I first attempted use the mv command, but with our rsyslogd
>>>>>>> configuration, when I performed a mv, rsyslogd would stop logging to the
>>>>>>> file until I actually did a reload on the process. Also this didn't easily
>>>>>>> support things like compression or adding more archives if we decided to
>>>>>>> keep more than one rotation around. So the choice to use logrotate is was
>>>>>>> made.
>>>>>>>
>>>>>>> As for the actual act of reloading rsyslog. You can use copytruncate
>>>>>>> to avoid it all together or you can use postrotate to reload rsyslogd. In
>>>>>>> both cases there is a small window of possible data loss. Though with
>>>>>>> copyandtruncate it is more likley to happen when the system is under load.
>>>>>>> So changing to a postrotate that reloads rsyslogd is probably a good idea.
>>>>>>>
>>>>>>> The rotate setting is just the number of files to keep around. I
>>>>>>> asked some people after the size update if 1 rotation, a total of 1GB was
>>>>>>> enough to keep around. Most people thought it was fine, since the original
>>>>>>> purpose of the sized based long rotation was to address a ticket that was
>>>>>>> complaining about the juju log folder taking up 3GB+ of space.
>>>>>>>
>>>>>>> I will get a ticket created to address the size mismatch issue and
>>>>>>> switch the copyandtruncate approach to a postrotate of rsyslogd.
>>>>>>>
>>>>>>> Wayne
>>>>>>>
>>>>>>> On Mon, Sep 15, 2014 at 5:30 AM, John Meinel <john at arbash-meinel.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> So we are using rsyslog.conf to have it figure out when rotation
>>>>>>>> needs to be done with:
>>>>>>>> # Maximum size for the log on this outchannel is 512MB
>>>>>>>> # The command to execute when an outchannel as reached its size
>>>>>>>> limit cannot accept any arguments
>>>>>>>> # that is why we have created the helper script for executing
>>>>>>>> logrotate.
>>>>>>>> $outchannel
>>>>>>>> logRotation,{{logDir}}/all-machines.log,512000000,{{logrotateHelperPath}}
>>>>>>>>
>>>>>>>> I would think that would not require SIGHUP along with having it
>>>>>>>> run the script at our request.
>>>>>>>>
>>>>>>>> John
>>>>>>>> =:->
>>>>>>>>
>>>>>>>> On Mon, Sep 15, 2014 at 12:13 PM, Stuart Bishop <
>>>>>>>> stuart.bishop at canonical.com> wrote:
>>>>>>>>
>>>>>>>>> On 15 September 2014 12:38, John Meinel <john at arbash-meinel.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> > 7) "copytruncate" seems the wrong setting for interactive with
>>>>>>>>> rsyslog. I
>>>>>>>>> > believe rsyslog is already aware that the file needs to be
>>>>>>>>> rotated, and thus
>>>>>>>>>
>>>>>>>>> It is only aware if you sent it a HUP signal.
>>>>>>>>>
>>>>>>>>> > it shouldn't be trying to write to the same file handle (and
>>>>>>>>> thus we don't
>>>>>>>>> > need to truncate in place). I'm not 100% sure on the
>>>>>>>>> interactions here, but
>>>>>>>>> > "copytruncate" seems to have an inherent likelyhood of dropping
>>>>>>>>> data (while
>>>>>>>>> > you are copying, if any data gets written then you'll miss those
>>>>>>>>> last few
>>>>>>>>> > bytes when you go to truncate, right?)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Stuart Bishop <stuart.bishop at canonical.com>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Juju-dev mailing list
>>>>>>>> Juju-dev at lists.ubuntu.com
>>>>>>>> Modify settings or unsubscribe at:
>>>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Wayne Witzel III
>>>>>>> wayne.witzel at canonical.com
>>>>>>>
>>>>>>> --
>>>>>>> Juju-dev mailing list
>>>>>>> Juju-dev at lists.ubuntu.com
>>>>>>> Modify settings or unsubscribe at:
>>>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Wayne Witzel III
>>>>> wayne.witzel at canonical.com
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Wayne Witzel III
> wayne.witzel at canonical.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20140917/231f8355/attachment-0001.html>