[storm] why weakrefdict for cache?
Michael Bayer
mike_mp at zzzcomputing.com
Mon Sep 3 17:40:07 BST 2007
On Sep 3, 2007, at 9:01 AM, Gustavo Niemeyer wrote:
>
> Right. In Storm this won't be an issue. When objects get dirty they
> are added to a dictionary which will strong-reference them so that
> they are kept in memory at least up to the next flush/rollback. They
> continue to be in the weakref'd cache even then, and only leave that
> one when they die.
I'll tell you why we currently don't have a strong-referencing "dirty"
list... it's because our session detects "dirty" changes at flush
time. While most "dirty" objects are detected in the session using a
regular "dirty" flag that was set when an attribute changed (this
part could be replaced with a strong-referencing list instead), there
are some which are detected by comparing the values of their
attributes to those that were loaded from the database. This
approach was copied from Hibernate and supports "mutable" attribute
types, such as a mapped attribute that points to another object
which is pickled into a binary database column. If someone changes
an attribute on that non-mapped, "pickled" object, the change needs
to be detected as well, and the only way to do that is to compare
against what was loaded. We only do the comparison operation on
datatypes that are known to be "mutable". So even if we do reinstate
the strong dirty list and the weakref'd identity map, that case would
still remain as a caveat.
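To illustrate the comparison-based detection (a simplified sketch,
not the actual SQLAlchemy internals; the class and attribute names
here are invented), a "mutable" pickled attribute gets snapshotted at
load time and diffed at flush time, roughly like this:

    import copy

    class TrackedInstance:
        """Sketch: snapshot "mutable" attribute values at load time so
        the flush can detect in-place changes by comparison."""

        MUTABLE_ATTRS = ("settings",)  # e.g. stored as a pickled blob

        def __init__(self, **loaded):
            self.__dict__.update(loaded)
            # deep-copy the loaded values; a simple dirty flag can't
            # see mutations made inside the same object
            self._loaded_snapshot = {
                name: copy.deepcopy(getattr(self, name))
                for name in self.MUTABLE_ATTRS
            }

        def is_dirty(self):
            # compare current values to what was loaded from the database
            return any(
                getattr(self, name) != self._loaded_snapshot[name]
                for name in self.MUTABLE_ATTRS
            )

    # changing something *inside* the pickled attribute is only visible
    # via the comparison, never via an attribute-set event:
    obj = TrackedInstance(id=1, settings={"theme": "light"})
    obj.settings["theme"] = "dark"
    assert obj.is_dirty()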
>
>> While we might someday reinstate a strongly-referenced "dirty"
>> collection, the basic idea of a strongly referenced identity map is
>> generally not a problem for our users; the use case where someone is
>> looping through many objects and throwing away as they iterate is
>> pretty rare and those folks either expunge the objects explicitly or
>> use the "weak referencing" option on their session.
>
> I see.. but then do they have to make sure by themselves that the
> object doesn't die before it gets flushed?
Well, I think the "weak referencing" option is probably not widely
used; people just know to expunge/clear objects from the session
which they don't need. We went with Hibernate's example in this
area, treating it as "not that big a deal".
>
> Did you consider the creation of a more flexible caching system, and
> if so, can you tell me why you gave up? (maybe there's something
> we can learn from that)
We never "gave up", as far as "caching" we've never "begun" that. I
dont really consider the Session's identity map to be much of a
"cache"; while we do use it as a cache in cases where we need to
locate an object by primary key (such as lazy-loading a many-to-one
attribute), i would consider a "more flexible" cache to be a second
level cache which is a distinct plugin to the whole system, which is
configurable with things like cache size, expiration time, expire
event handlers, and maybe even having some form of query caching.
When you really do "caching", people need fine grained control over
the lifespan of objects, which is something I know from all the
caching work we did with Myghty and now Pylons. So we dont try to
turn the Session into the full "caching" solution, its "cache" is
primarily there to maintain identity uniqueness (and we say as much
in our docs). and someday, we might tackle a real second level
solution that integrates nicely. Currently, people who need this
tend to roll their own, or move the caching into a coarser-grained
area (which often is the better place for it), such as page caching
or "sub-template" caching which is something Mako/Pylons supports.
>
> I'm actually a bit surprised that people don't seem to bother with
> the strong references for the duration of the transaction.
> In Landscape, for instance, we have web pages which show up thousands
> of objects, and there isn't a good reason to keep the object in
> memory after it has been displayed.
Our ORM's system of loading objects for a particular query still
needs to store the full results of that query in a single in-memory
collection; since we support queries which add left outer joins of
additional objects to be loaded as part of a collection, we can't
just load a row, create an instance for it, then throw it away; the
next row might also represent the same instance, which needs to be
"uniqued" against the total result set (i.e., we have a mini
"identity map" that applies to a single ORM query). Not only that,
but the eager loading of collections also means the same object, in
rare circumstances, can be represented at different levels in the
same result; object A might reference B, and also might reference C
which *also* references B. While this is another area where I've
proposed that we could add options to not maintain a local "uniqued"
set of instances for a query which doesn't need it, and just allow
"streaming" of ORM'ed objects, it hasn't been needed, and I think
folks who display thousands of rows tend to just use non-ORM result
sets, which of course don't have any of these requirements. Though
as it turns out, DBAPIs like psycopg2 already buffer all the rows of
a result set by default, so there's a lot more "load it all into
memory" going on than people might think anyway.
More commonly, people who are representing thousands of objects will
only be displaying a subset of those on a single page, and only need
to load a range of objects; our "eager loading" does support the
usage of LIMIT and OFFSET in such a way that you limit the "primary"
entities but still get the full list of "collection entities"
associated with them. This is another area where we've looked at
Hibernate, seen that there's no problem with their "non-streamed"
approach, and decided that for now it's "good enough", with the door
open to improve upon it if needed.
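The LIMIT/OFFSET trick amounts to limiting the "primary" entities in
a subquery and then joining the collection table against that
subquery, so the collections come back whole; this is a sketch of
the SQL shape only, with made-up table and column names:

    # LIMIT/OFFSET applies only to the "primary" rows (users); the outer
    # join against the limited subquery then pulls in every address for
    # just those users, so their collections are complete.
    EAGER_LOAD_WITH_LIMIT = """
    SELECT anon.user_id, anon.user_name,
           addresses.address_id, addresses.email
    FROM (
        SELECT users.user_id AS user_id, users.user_name AS user_name
        FROM users
        ORDER BY users.user_id
        LIMIT 10 OFFSET 20
    ) AS anon
    LEFT OUTER JOIN addresses ON addresses.user_id = anon.user_id
    ORDER BY anon.user_id, addresses.address_id
    """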