out of space on /root
Xen
list at xenhideout.nl
Mon Mar 6 19:30:41 UTC 2017
Paul Smith wrote on 06-03-2017 19:42:
> On Mon, 2017-03-06 at 18:35 +0100, Xen wrote:
>> >> cat "file" | split -b 500M
>> >
>> > can be more efficiently written:
>> >
>> > split -b 500M < "file"
>>
>> I disagree. The second expression is harder to write mentally and
>> more prone to error.
>
> Those are subjective assertions, and people who have a lot more
> experience than either of us have disagreed with you for over 20 years.
> I have no interest in arguing about it in general.
Well then don't argue it, and let everyone say their own thing.
I am being attacked here, not you. One cannot even mention a solution
without being talked over by other people because one's solution was not
perfect enough.
Might I also add that in those 20 years the shell has not become a more
user-friendly environment in any way, so by that measure they have
failed.
>> There is barely a functional difference apart from the extra cat
>> process, but this doesn't really take any resources. It's either Bash
>> doing it or cat; there is no other difference.
>
> That's not correct. In the "with cat" method you have this:
>
> 1. shell creates a pipe
> 2. shell forks a new process A
> 3. shell forks a new process B
> 4. In process A, duplicate the write side of the pipe to stdout
> 5. In process A, exec /usr/bin/cat "file"
> 6. In process B, duplicate the read side of the pipe to stdin
> 7. In process B, exec split -b 500M
> 8. The "cat" process reads from the file, writes to the pipe
> 9. The "split" process reads from the pipe, writes to files.
> 10. Wait for A and B to finish
>
> In this method the entire file is read into memory by one process, then
> copied back into the kernel via the pipe, then that data is read by the
> second process (split) and worked on. Essentially you have the
> overhead of an entire extra read and write of the file.
>
> In the "without cat" method you have this:
>
> 1. Shell opens "file"
> 2. shell forks a new process A
> 3. In process A, duplicate the "file" descriptor to stdin
> 4. In process A, exec split -b 500M
> 5. The "split" process reads the "file" (stdin), writes to files.
> 6. Wait for A to finish
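To make the two sequences above concrete, here is a minimal sketch; the file name `file` and the tiny sample data are stand-ins (the real file in this thread was a 1.8G binary), and `split -b 4` stands in for `split -b 500M`:

```shell
# Create a small stand-in input file (12 bytes).
printf 'abcdefghijkl' > file

# "with cat": the shell wires cat and split together through a pipe;
# every byte is read by cat, written to the pipe, and read again by split.
cat file | split -b 4 - part_pipe_

# "without cat": the shell opens the file itself and split inherits
# the descriptor as stdin -- no extra process, no extra copy.
split -b 4 - part_redir_ < file

# Both produce the same 4-byte chunks: part_*aa, part_*ab, part_*ac.
```

Reassembling either set of chunks with `cat part_pipe_*` gives back the original file, so the two invocations differ only in mechanics, not in result.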
So you're saying split reads directly from the file. Well, I didn't know
that; thank you.
Regardless, that of course agrees with what I said about the cat | filter
idiom: there is an extra step.
> Even better would be:
>
> split -b 500M "file"
>
> with no redirection at all.
Yes yes, of course, that was the whole point of what I said.
> Here the output filenames would be based on "file" which is nice; some
> programs can behave better if they have an actual file they can stat(2)
> to find the size, etc.
>
> The advantage to "read from the file" methods rather than the first
> "read from pipe" is that a pipe is uni-directional (you can't go
> backward), and it contains a maximum of 4K (on Linux) bytes at a time.
Yes, that's a downside. However, I thought that had grown by now?
Something like 64K? Regardless, that is not a deficiency of the idiom,
but of the implementation.
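For what it's worth, on Linux the default pipe capacity has indeed been 64 KiB (16 pages) since kernel 2.6.11, and since 2.6.35 a program can grow a pipe with fcntl(F_SETPIPE_SZ) up to a system-wide ceiling. A quick, Linux-only way to inspect that ceiling:

```shell
# The hard upper limit for F_SETPIPE_SZ, in bytes (1 MiB by default).
cat /proc/sys/fs/pipe-max-size
```

So the 4K figure is out of date, though the capacity is still finite and the pipe still one-directional.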
Sometimes pipes are incredibly slow, but that was mostly when I tried to
do this on MS Windows using the GnuWin32 tools. On Linux I tried the
"buffer" command, but it doesn't make much of a difference.
I ran some tests on a rather slow device (a NAS) and I guess I was wrong
about the insignificance. The "superior method" (in terms of
performance) finished roughly twice as fast as the cat version, or
faster, feeding a binary file of some 1.8G through "wc -l".
For some reason I just don't expect CPU speed to be a bottleneck in
these operations, but on this NAS the hard disk is actually faster than
the CPU in that sense.
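That experiment is easy to repeat with `time`; the file name and the 16 MiB size here are illustrative stand-ins, not the original NAS test:

```shell
# Build a test file; 16 MiB of random data stands in for the 1.8G binary.
dd if=/dev/urandom of=big.bin bs=1M count=16 2>/dev/null

# Pipe version: an extra process, plus an extra copy of every byte
# through the pipe.
time cat big.bin | wc -l

# Redirect version: wc reads the open file descriptor directly.
time wc -l < big.bin
```

Both report the same line count; on a CPU-bound machine the second typically shows markedly less wall time.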
>
> A program like "tail", if it works on a file, can go to the end of the
> file and then back up from there and not have to read the beginning. A
> program like "split" can read blocks much larger than 4k at a time and
> gain efficiency. There could even be special kernel support for bulk
> file IO that can be taken advantage of, which clearly can't be used
> with pipes.
Well, I just think that is a deficiency of the pipe (at least the 4K
thing), not of the method.
If you make it impossible to travel from your town to the next in a
direct line, so that instead one has to travel half the world because of
regulations, that doesn't suddenly mean travelling half the world is a
"superior" measure. It remains, in that case, in that sense, a sad
situation.
I know buffer sizes can have an enormous impact. Compare dd with bs=512
to dd with bs=4M; the difference is huge.
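For instance (hypothetical sizes, chosen so both commands copy the same 64 MiB):

```shell
# Same data, radically different syscall counts: 131072 writes of
# 512 bytes vs 16 writes of 4 MiB. On a slow CPU the wall-time gap
# is dramatic.
dd if=/dev/zero of=out.small bs=512 count=131072 2>/dev/null
dd if=/dev/zero of=out.big   bs=4M  count=16     2>/dev/null

cmp out.small out.big    # the two output files are byte-identical
```

The data written is identical either way; only the per-call overhead changes.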
> Please don't state your personal subjective opinions as if they were
> facts.
Then don't do it yourself.
> It's true that in many cases the processing difference is not
> significant, because the amount of data involved is small. But, I
> didn't comment on the GENERAL case, I commented on THIS case. In this
> case, where we're dealing with such enormous files, there is absolutely
> no question that the non-cat version is far superior and the cat
> version should be avoided.
The non-cat version should also be avoided in my view, and you should
use the direct file method instead. So it was irrelevant here, and as
such, not to the point.
> My view on using cat/pipes vs. simple redirection in general disagrees
> with yours, but that's an opinion (held by many, but an opinion
> nonetheless).
Sure, I don't mind that. But my experience is that whenever one even
mentions the alternative view, one is chastised for it.
So, I would never go out of my way to berate people for using the "<
file" syntax; however, people generally do go out of their way to berate
the "cat | process" syntax.
So the situation is not equal.
See, the point is that people ARE arguing the general case.
But since you have provided plenty of data on the benefit of the more
direct methods, let's also introduce some analogies here.
Idiomatically, the cat method is met by at least:
zcat
ccat
bzcat
and others. When using anything else, such as gzip or gunzip, one must
remember exactly which flags send the output to stdout, which is easy to
forget. So, from a programming perspective, there is no simple
equivalent for:
grep "pattern" < file.gz
That simply doesn't work: grep would see the compressed bytes. What you
can do, however, is:
zcat file.gz | grep "pattern"
Idiomatically this barely changes from:
cat file | grep "pattern"
so we have more consistency here, which is great from a programmer's
perspective: everything becomes simpler.
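A runnable sketch of that consistency (the sample data is a stand-in; `gzip -dc` is the flag-based equivalent one has to remember, and many systems also ship `zgrep`, which wraps exactly this pipeline):

```shell
# Build a small compressed file.
printf 'needle\nhay\n' | gzip > file.gz

# Plain redirection cannot work here: grep would see compressed bytes.
# The cat-style idiom extends naturally to compressed data:
zcat file.gz | grep 'needle'

# The same thing spelled with gzip's own flags (-d decompress, -c to stdout):
gzip -dc file.gz | grep 'needle'
```

Both pipelines print the matching line; the zcat spelling keeps the same left-to-right shape as `cat file | grep "pattern"`.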
In general, high-level languages are less performant than low-level
languages, and this is a perfect analogy: the "cat" method is less
performant, but easier to use.
(Compared to the stdin-redirect method, not to the direct-file method.)
So I don't know who those "experts" are that have been arguing this
style for 20 years, but they don't know much about programmers. Or they
only program in C themselves, which is not exactly the most
user-friendly language out there. So by that standard we should not
expect user-friendly shell code either.
More information about the ubuntu-users mailing list