logrotate configuration seems wrong
John Meinel
john at arbash-meinel.com
Tue Sep 16 08:11:46 UTC 2014
So poking around and filing
https://bugs.launchpad.net/juju-core/+bug/1369909
I found lines about
Sep 16 07:48:14 ip-10-39-60-247 rsyslogd-2079: message repeated 498 times:
[too many tcp sessions - dropping incoming request [try
http://www.rsyslog.com/e/2079 ]]
That is 500 times rsyslog had to deny someone. (So probably 500 Rsyslog
workers that got restarted, triggering 500*5 lines of machine-0.log
information.)
By putting this line in our /etc/rsyslog.d/25-juju.conf
$InputTCPMaxSessions 20000 # Maximum TCP sessions (default 200)
In the "ModLoad imtcp" section (before the InputTCPServerRun) and then
restarting (reloading is not enough), it seemed to make the system a lot
happier.
It isn't 100% happy, as still the #1 thing in machine-0.log is information
about WatchForRsyslogChanges being stopped. But it is at a lower rate.
Again filtering for a single connection it seems that the watcher is being
stopped every 2 minutes, and then restarted 3s later.
Given that this is are logging infrastructure, failures like this mean that
it is going to be hard to debug, since we can't aggregate logs to the
central point that say "I could not connect to the logging infrastructure".
John
=:->
On Tue, Sep 16, 2014 at 11:39 AM, John Meinel <john at arbash-meinel.com>
wrote:
> I tested your branch in the same configuration. First I had to change
> "logging-config: <root>=DEBUG" again. I thought that if you did "juju
> bootstrap --debug" that it always left DEBUG level on.
>
> Anyway, something doesn't quite seem right, as I saw:
> -rw------- 1 syslog adm 500M Sep 16 07:25 all-machines.log
> ...
> -rw------- 1 syslog syslog 300M Sep 16 07:21
> machine-0-2014-09-16T07-21-02.193.log
> -rw------- 1 syslog syslog 300M Sep 16 07:25
> machine-0-2014-09-16T07-25-14.840.log
> -rw------- 1 syslog syslog 31M Sep 16 07:25 machine-0.log
>
>
> Which means that machine-0.log has gotten more than 600MB of data for sure
> (I saw it rotate at least once above this). But there is only a single
> all-machines.log of only 500MB.
>
> As I sat there watching, I did manage to see:
> -rw------- 1 syslog adm 7.0M Sep 16 07:26 all-machines.log
> -rw------- 1 syslog adm 513M Sep 16 07:26 all-machines.log.1
> ...
>
> Which looks correct. But it would seem that something (load?) was causing
> it to not get all of the machine-0 messages.
> Maybe it is getting rate limited?
> I can watch the machine-0.log file fill at about 1MB per second. And then
> after about 5s I see all-machines.log only go up by a couple of MB.
> (machine-0 log seemed to go up by about 12MB, and then all-machines.log
> only went up by 3MB.)
>
> Now, I'm also seeing a rediculous amount of logging, and lots of lines
> like:
> http://paste.ubuntu.com/8356019/
>
> Which seems to indicate that every 3 seconds the Rsyslog Watcher is dying
> and causing everything to restart and reconnect and ask for the new
> credentials.
>
> I'm also a bit surprised that we're telling them to connect to a DNS name,
> and not an IP address, but the resetting is the important bit. I filed
> https://bugs.launchpad.net/juju-core/+bug/1369900 to track that bug.
>
> Anyone care to try to figure out why I'm seeing 3-4x the amount of data in
> machine-0.log than seems to be ending up in all-machines.log ?
>
> John
> =:->
>
> On Mon, Sep 15, 2014 at 7:13 PM, Wayne Witzel <wayne.witzel at canonical.com>
> wrote:
>
>> Here is a branch that addresses the concerns and bug. I've tested it
>> locally and with digitalocean. I would love for it to be tested under the
>> scaling scenario where you first encountered the issues.
>>
>> https://github.com/wwitzel3/juju/tree/ww3-rsyslogd-logrotate
>>
>> I've changed the size in logrotate to an arbitrarily small size, since
>> the rotation is driven by rsyslog. I've also updated the "style" of the
>> rotation to create a new log file and use postrotate to HUP rsyslog.
>>
>> Also I removed compression. I don't think we need it given that we are
>> only keeping one rotation around and it is size restricted. This will just
>> make it easier to get at information across both of the log files.
>>
>> Wayne
>>
>> On Mon, Sep 15, 2014 at 10:30 AM, Nate Finch <nate.finch at canonical.com>
>> wrote:
>>
>>> tl;dr: rsyslog sees 512,000,000 bytes and tells logrotate "time to
>>> rotate!" and logrotate sees less than 512MB and say "nah, not big enough"
>>> and rsyslog never writes the logs anymore because the file is too big.
>>>
>>> On Mon, Sep 15, 2014 at 10:07 AM, Wayne Witzel <
>>> wayne.witzel at canonical.com> wrote:
>>>
>>>> When this was being implemented and during the review process the
>>>> actual size of the rotated files got adjusted. I tested very thoroughly the
>>>> actual rotation script and also tested the triggering of the rotation
>>>> script. Enough that I was happy landing the changes.
>>>>
>>>> When the sizes were adjusted the same amount of diligence was not
>>>> applied in testing. This is what introduced the bug.
>>>>
>>>> The size mismatch is problem. rsyslogd will stop logging to the file
>>>> and attempted to execute the script in the outchannel every time it gets a
>>>> new log message, which runs logrotate, which does nothing because the file
>>>> isn't big enough.
>>>>
>>>> Here is the rsyslog docs for implementation of sized based rotation.
>>>> http://www.rsyslog.com/doc/log_rotation_fix_size.html
>>>>
>>>> I first attempted use the mv command, but with our rsyslogd
>>>> configuration, when I performed a mv, rsyslogd would stop logging to the
>>>> file until I actually did a reload on the process. Also this didn't easily
>>>> support things like compression or adding more archives if we decided to
>>>> keep more than one rotation around. So the choice to use logrotate is was
>>>> made.
>>>>
>>>> As for the actual act of reloading rsyslog. You can use copytruncate to
>>>> avoid it all together or you can use postrotate to reload rsyslogd. In both
>>>> cases there is a small window of possible data loss. Though with
>>>> copyandtruncate it is more likley to happen when the system is under load.
>>>> So changing to a postrotate that reloads rsyslogd is probably a good idea.
>>>>
>>>> The rotate setting is just the number of files to keep around. I asked
>>>> some people after the size update if 1 rotation, a total of 1GB was enough
>>>> to keep around. Most people thought it was fine, since the original purpose
>>>> of the sized based long rotation was to address a ticket that was
>>>> complaining about the juju log folder taking up 3GB+ of space.
>>>>
>>>> I will get a ticket created to address the size mismatch issue and
>>>> switch the copyandtruncate approach to a postrotate of rsyslogd.
>>>>
>>>> Wayne
>>>>
>>>> On Mon, Sep 15, 2014 at 5:30 AM, John Meinel <john at arbash-meinel.com>
>>>> wrote:
>>>>
>>>>> So we are using rsyslog.conf to have it figure out when rotation needs
>>>>> to be done with:
>>>>> # Maximum size for the log on this outchannel is 512MB
>>>>> # The command to execute when an outchannel as reached its size limit
>>>>> cannot accept any arguments
>>>>> # that is why we have created the helper script for executing
>>>>> logrotate.
>>>>> $outchannel
>>>>> logRotation,{{logDir}}/all-machines.log,512000000,{{logrotateHelperPath}}
>>>>>
>>>>> I would think that would not require SIGHUP along with having it run
>>>>> the script at our request.
>>>>>
>>>>> John
>>>>> =:->
>>>>>
>>>>> On Mon, Sep 15, 2014 at 12:13 PM, Stuart Bishop <
>>>>> stuart.bishop at canonical.com> wrote:
>>>>>
>>>>>> On 15 September 2014 12:38, John Meinel <john at arbash-meinel.com>
>>>>>> wrote:
>>>>>>
>>>>>> > 7) "copytruncate" seems the wrong setting for interactive with
>>>>>> rsyslog. I
>>>>>> > believe rsyslog is already aware that the file needs to be rotated,
>>>>>> and thus
>>>>>>
>>>>>> It is only aware if you sent it a HUP signal.
>>>>>>
>>>>>> > it shouldn't be trying to write to the same file handle (and thus
>>>>>> we don't
>>>>>> > need to truncate in place). I'm not 100% sure on the interactions
>>>>>> here, but
>>>>>> > "copytruncate" seems to have an inherent likelyhood of dropping
>>>>>> data (while
>>>>>> > you are copying, if any data gets written then you'll miss those
>>>>>> last few
>>>>>> > bytes when you go to truncate, right?)
>>>>>>
>>>>>> --
>>>>>> Stuart Bishop <stuart.bishop at canonical.com>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Juju-dev mailing list
>>>>> Juju-dev at lists.ubuntu.com
>>>>> Modify settings or unsubscribe at:
>>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Wayne Witzel III
>>>> wayne.witzel at canonical.com
>>>>
>>>> --
>>>> Juju-dev mailing list
>>>> Juju-dev at lists.ubuntu.com
>>>> Modify settings or unsubscribe at:
>>>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>>>
>>>>
>>>
>>
>>
>> --
>> Wayne Witzel III
>> wayne.witzel at canonical.com
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20140916/5391b02e/attachment-0001.html>
More information about the Juju-dev
mailing list