gnome "storage" project, smart windows
Forum Post
ulist at gs1.ubuntuforums.org
Tue Dec 13 11:08:32 UTC 2005
> "Searching" is the problem. It's a stupid idea for a machine which you
> control. I cannot tell the folks at wired where to put their pages,
but
> I most certainly can do this on my own machine.
>
>but you never screw up? or get confused about which project you put
>stuff in?
All the time. I have a few well named folders like "music" and "dvds"
and "tv" (all in /hollywood, of course) but I also have thousands of
image and music files that don't really fit anywhere in particular. I
also have about a dozen desktop backups and every one of these has
within it many duplicate files but also many unsorted files that were
in some "download" queue or tempdir at the time I backed them up.
Storing and sorting these will be the next step of the acid test.
The point I was making is that I should not have to concern myself with
where the file goes - so long as I describe it relatively adequately or
am able to describe it in a "search", why should I have to worry about
where it goes on my drive? Having to mess with sorting stuff into
"folders" and constantly worry about what's updated, where this goes
now that I've added ten thousand whatevers and the old method doesn't
work anymore... and then still have to rely on a "search engine" to
interact with that stuff is just nuts - it's twice the work.
> hmm, well that sounds very interesting and I'd love to try it out.
> how about e.g. full text searching of structured documents like
> open-document files? I like that feature in beagle...
Of course, that's a fundamental need. You could not locate a doc by
description if you were not cataloging this stuff.
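For what it's worth, pulling indexable text out of an open-document file
doesn't need mono or beagle either - the format is just a zip with the words
sitting in content.xml. Something along these lines (untested and crude,
"report.odt" is just a stand-in) is enough to feed keywords into the catalog:
# crude sketch: grab the plain text out of an OpenDocument file for indexing
# needs nothing but unzip and sed
_doc="report.odt"
keywords=$(unzip -p "$_doc" content.xml | sed -e 's/<[^>]*>/ /g' -e 's/  */ /g')
The tag-stripping is about as dumb as it gets, but the text is right there for
the taking.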
But having to screw with mono on a system when the basis of a
completely adequate system is part of the existing operating system
seems to me a great waste of resources on many levels. Rather than
create a user-centric "search engine" with primitive security and flaky
behavior, why not instead just build on what's already there and
stable? No, slocate doesn't index stuff by content - but the greater
point is this: even slocate works backwards.
If I save a file I cannot do so by magic. I cannot wish the file to
exist on the hard drive, nor can I retrieve it without the aid of the
operating system. Every time a file is stored on the disc it goes
through linux - nothing goes in or out without linux knowing about it.
So why does linux then have to go back and "search" for all this stuff?
Why isn't linux instead cataloging each file into a *quickly searchable*
database every time it stores that file? And why do I have to know
where that file goes?
The system has been built as it is and changing this from the ground up
is impractical. But adapting the system to perform this maintenance is
not at all impractical. You can even do this with *system* files like
those in the /etc and /var directories. Because linux has a perfectly
usable system that allows symbolic linking of resources, "active"
system files can be stored in a structure that does not fit the
usual /usr, /var and /etc paths but is still accessible in this manner.
So, for example, when I edit my /etc/fstab file why does it get
overwritten? Why doesn't linux just remember the old one and swap in
the new one? It knows I have edited the file and changed its contents -
why does this then have to be retroactively indexed?
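To make that concrete, here's roughly what the fstab case could look like if
/etc/fstab were just a symlink into the store. This is only an illustration -
the edited-copy path is made up, and obviously you'd want to be careful doing
this to real system files:
# rough sketch: keep every revision of fstab in the content store,
# and point /etc/fstab at the current one with a symlink
_new="/tmp/fstab.edited"                      # the freshly edited copy (made-up path)
_hash=$(md5sum "$_new"); _hash="${_hash:0:32}"
fldr="${storagepath}${_hash:0:2}/${_hash:2:2}/${_hash:4:28}"
mkdir -p "$fldr" && cp "$_new" "${fldr}/fstab"
ln -sf "${fldr}/fstab" /etc/fstab             # swap in the new revision; the old
                                              # one stays in the store, still cataloged
Nothing gets overwritten and nothing has to be re-indexed after the fact - the
catalog is updated at the moment the file changes hands.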
I cannot rewrite the kernel - my talents simply are not up to the task.
But there are ways to model this behavior and that's what I'm working
with. A system like beagle can work just fine, but beagle's main
weakness is that it has terrible security and it seeks to overcome this
weakness by being built inside some "sandbox." It's an illusion of
safety that really just adds so much complexity it becomes brittle.
As an example, here's the routine that hashes and stores the file and
then catalogs basic information about the file. This is a very simple
example that doesn't bother with magic numbers or plugin
miners - it just takes a given bit of metadata and a file (or collection
of files) and stores them away. It's nothing but a bash script - well
developed, robust technology in a completely unoptimized, simple,
readable and maintainable script.
if [[ -f "$_file" ]]; then
    _hash=$(md5sum "$_file")
    _hash="${_hash:0:32}"                # keep just the 32-character digest
    _FIL="${_file##*/}"                  # basename of the file
    # content-addressed folder: aa/bb/rest-of-hash under the storage root
    fldr="${storagepath}${_hash:0:2}/${_hash:2:2}/${_hash:4:28}"
    # echo "Storing $_file at hash $fldr" >> ~/wtf.log
    if [[ -e "$fldr" ]]; then
        # content already stored: note the duplicate and record the extra name
        echo "$_file $fldr" >> "${storagepath}dupes.list"
        # echo -e "folder exists\n" >> ~/wtf.log
        cmd="insert into _alias values('${_hash}','${_FIL}')"
        sqlite "${storagepath}meta.db" "$cmd"
    else
        mkdir -p "$fldr"                 # -p so the aa/bb parents get created too
        cp "$_file" "$fldr"
        _SIZ=$(stat -c "%s" "$_file")    # size in bytes
        _DATE=$(stat -c "%Y" "$_file")   # mtime as a unix timestamp
        cmd="insert into _files
            values('${_hash}','${_FIL}','${_FIL##*.}','${_SIZ}','${_DATE}',1,'archive','${_DATE}')"
        sqlite "${storagepath}meta.db" "$cmd"
        # echo -e "files info stored: $cmd \n" >> ~/wtf.log
        cmd="insert into _meta
            values('${_hash}','<keywords>${keywords}</keywords><originalpath>${_file}</originalpath>')"
        sqlite "${storagepath}meta.db" "$cmd"
        # echo -e "meta info stored: $cmd \n" >> ~/wtf.log
        # rm -f "$_file"
        chmod 444 "${fldr}/${_FIL}"      # archived copies are read-only
    fi
fi
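For anyone who wants to poke at it, the script assumes meta.db already holds
three tables. I set them up with something like this - the column names here
are just my own, implied by the inserts above:
# one-time setup of the catalog (same sqlite binary the script calls)
sqlite "${storagepath}meta.db" "create table _files (hash text, name text, ext text,
  size integer, mdate integer, copies integer, status text, added integer);"
sqlite "${storagepath}meta.db" "create table _alias (hash text, name text);"
sqlite "${storagepath}meta.db" "create table _meta (hash text, xml text);"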
In a structure of about 50,000 files the metadata folder is, at
present, less than 30MB. This can be backed up separately and could
also be exchanged with others. Since info about the files is stored
along with their unique hash (and of course md5 can be replaced with
any other) the system can quickly and easily decide if a given file is
present, for example in file sharing applications - just look up the
hash and then see if the files located there match. Because all this
isn't "built into the filesystem" it can be used on any filesystem and
can be maintained with existing, mature and human-friendly tools.
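Checking whether a given file is already in the store is a two-line affair -
hash it and ask the catalog (sketch again; "$somefile" is whatever you happen
to be checking):
# is this content already archived? hash it and look it up
_hash=$(md5sum "$somefile"); _hash="${_hash:0:32}"
sqlite "${storagepath}meta.db" "select name, size from _files where hash='${_hash}';"
# any row coming back means the bytes are already stored, whatever they were called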
Simple example: when downloading files from usenet I no longer even
look at a dialog box; the system itself monitors my incoming usenet
folder and, when it sees a new file appear there it locates the cached
text, copies the fields I have specified (posted by, group, date,
subject line, x-ref number), then hashes and stores the file and its
metadata. Putting together a playlist involves describing the music or
files I want without having to be concerned about the location of the
data.
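The monitor itself doesn't have to be anything fancy. Mine does more (the nntp
header parsing is left out here, and store_file just stands in for the
hash-and-catalog routine above), but the skeleton is a plain polling loop:
# bare-bones watcher: poll the incoming folder, store anything new
incoming=~/usenet/incoming                   # made-up path
stamp="${storagepath}.lastrun"
[[ -e "$stamp" ]] || touch "$stamp"
while true; do
    find "$incoming" -type f -newer "$stamp" | while read -r _file; do
        keywords="usenet"                    # real version fills this from the cached headers
        store_file "$_file"                  # wraps the routine shown earlier
    done
    touch "$stamp"
    sleep 60
done
And a playlist really is nothing more than a query - pick a word, collect the
matching store paths:
# every archived file whose metadata mentions 'coltrane' (word picked at random)
sqlite "${storagepath}meta.db" "select hash from _meta where xml like '%coltrane%';" |
while read -r h; do
    echo "${storagepath}${h:0:2}/${h:2:2}/${h:4:28}/"*
done > coltrane.m3u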
On the "hunter-gatherer" backend, if a "new and improved" version of a
file is posted the system can instantly tell simply by comparing the
posted filesizes to the files it already has. Because the metadata is
more comprehensive than just filenames (as in slocate) it can be
smarter about telling the difference between, say, Al Cooper and Alice
Cooper. But because it never -replaces- a file but only adds more,
mistakes are easily corrected. This would allow building "smart agents"
that can pool the array of resources available to a desktop machine (web
search, p2p, torrent, usenet, irc etc) in "tivo like" fashion. The more
data it collects the more it knows about your tastes and the better it
is able to find other relevant data for the owner. And because it uses
existing security models this can all be built to whatever level of
paranoia the end user happens to feel prudent.
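The size check, for instance, is nothing more than comparing what just got
posted against what the catalog already says. Sketch once more - the posted
name and size would come out of the headers:
# do we already hold this name at this size?
sqlite "${storagepath}meta.db" \
  "select hash, size from _files where name='${postedname}' and size='${postedsize}';"
# no rows back: either something new or a "new and improved" repost worth fetching,
# and since nothing is ever replaced, fetching it can't clobber what's already stored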
--
poptones