[storm] why weakrefdict for cache?
Michael Bayer
mike_mp at zzzcomputing.com
Mon Sep 3 17:40:07 BST 2007
On Sep 3, 2007, at 9:01 AM, Gustavo Niemeyer wrote:
>
> Right. In Storm this won't be an issue. When objects get dirty they
> are added to a dictionary which will strong-reference them so that
> they are kept in memory at least up to the next flush/rollback. They
> continue to be in the weakref'd cache even then, and only leave that
> one when they die.
I'll tell you why we currently don't have a strong-referencing "dirty"
list... it's because our session detects "dirty" changes at flush
time. While most "dirty" objects are detected in the session using a
regular "dirty" flag that was set when an attribute changed (this
part could be replaced with a strong-referencing list instead), there
are some which are detected by comparing the values of their
attributes to those that were loaded from the database. This
approach was copied from Hibernate and supports "mutable" attribute
types, such as a mapped attribute that points to another object
which is pickled into a binary database column. If someone changes
an attribute on that non-mapped, "pickled" object, the change needs
to be detected as well, and the only way to do that is to compare
against what was loaded. We only do the comparison operation on
datatypes that are known to be "mutable". So even if we do reinstate
the strong dirty list and the weakref'd identity map, that case would
still remain as a caveat.
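To illustrate the comparison-based detection (a simplified sketch,
not the actual SQLAlchemy internals; the class and attribute names
here are invented), a "mutable" pickled attribute gets snapshotted at
load time and diffed at flush time, roughly like this:

    import copy

    class TrackedInstance:
        """Sketch: snapshot "mutable" attribute values at load time so
        the flush can detect in-place changes by comparison."""

        MUTABLE_ATTRS = ("settings",)  # e.g. stored as a pickled blob

        def __init__(self, **loaded):
            self.__dict__.update(loaded)
            # deep-copy the loaded values; a simple dirty flag can't
            # see mutations made inside the same object
            self._loaded_snapshot = {
                name: copy.deepcopy(getattr(self, name))
                for name in self.MUTABLE_ATTRS
            }

        def is_dirty(self):
            # compare current values to what was loaded from the database
            return any(
                getattr(self, name) != self._loaded_snapshot[name]
                for name in self.MUTABLE_ATTRS
            )

    # changing something *inside* the pickled attribute is only visible
    # via the comparison, never via an attribute-set event:
    obj = TrackedInstance(id=1, settings={"theme": "light"})
    obj.settings["theme"] = "dark"
    assert obj.is_dirty()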
>
>> While we might someday reinstate a strongly-referenced "dirty"
>> collection, the basic idea of a strongly referenced identity map is
>> generally not a problem for our users; the use case where someone is
>> looping through many objects and throwing away as they iterate is
>> pretty rare and those folks either expunge the objects explicitly or
>> use the "weak referencing" option on their session.
>
> I see.. but then do they have to make sure by themselves that the
> object doesn't die before it gets flushed?
Well, I think the "weak referencing" option is probably not widely
used; people just know to expunge/clear objects from the session
which they don't need. We went with Hibernate's example in this
area, treating it as "not that big a deal".
>
> Did you consider the creation of a more flexible caching system, and
> if so, can you tell me why you gave up? (maybe there's something
> we can learn from that)
We never "gave up", as far as "caching" we've never "begun" that. I
dont really consider the Session's identity map to be much of a
"cache"; while we do use it as a cache in cases where we need to
locate an object by primary key (such as lazy-loading a many-to-one
attribute), i would consider a "more flexible" cache to be a second
level cache which is a distinct plugin to the whole system, which is
configurable with things like cache size, expiration time, expire
event handlers, and maybe even having some form of query caching.
When you really do "caching", people need fine grained control over
the lifespan of objects, which is something I know from all the
caching work we did with Myghty and now Pylons. So we dont try to
turn the Session into the full "caching" solution, its "cache" is
primarily there to maintain identity uniqueness (and we say as much
in our docs). and someday, we might tackle a real second level
solution that integrates nicely. Currently, people who need this
tend to roll their own, or move the caching into a coarser-grained
area (which often is the better place for it), such as page caching
or "sub-template" caching which is something Mako/Pylons supports.
>
> I'm actually a bit surprised that people don't seem to bother with
> the strong references for the duration of the transaction.
> In Landscape, for instance, we have web pages which show up thousands
> of objects, and there isn't a good reason to keep the object in
> memory after it has been displayed.
Our ORM's system of loading objects for a particular query still
needs to store the full results of that query in a single in-memory
collection; since we support queries which add left outer joins of
additional objects to be loaded as part of a collection, we can't
just load a row, create an instance for it, then throw it away; the
next row might also represent the same instance, which needs to be
"uniqued" against the total result set (i.e., we have a mini
"identity map" that applies to a single ORM query). Not only that,
but the eager loading of collections also means the same object, in
rare circumstances, can be represented at different levels in the
same result; object A might reference B, and also might reference C
which *also* references B. While this is another area where I've
proposed that we could add options to not maintain a local "uniqued"
set of instances for a query which doesn't need it, and just allow
"streaming" of ORM'ed objects, it hasn't been needed, and I think
folks who display thousands of rows tend to just use non-ORM result
sets, which of course don't have any of these requirements. Though
as it turns out, DBAPIs like psycopg2 already buffer all the rows of
a result set by default, so there's a lot more "load it all into
memory" going on than people might think anyway.
More commonly, people who are representing thousands of objects will
only be displaying a subset of those on a single page, and only need
to load a range of objects; our "eager loading" does support the
usage of LIMIT and OFFSET in such a way that you limit the "primary"
entities but still get the full list of "collection entities"
associated with them. This is another area where we've looked at
Hibernate, seen that there's no problem with their "non-streamed"
approach, and decided that for now it's "good enough", with the door
open to improve upon it if needed.
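The LIMIT/OFFSET trick amounts to limiting the "primary" entities in
a subquery and then joining the collection table against that
subquery, so the collections come back whole; this is a sketch of
the SQL shape only, with made-up table and column names:

    # LIMIT/OFFSET applies only to the "primary" rows (users); the outer
    # join against the limited subquery then pulls in every address for
    # just those users, so their collections are complete.
    EAGER_LOAD_WITH_LIMIT = """
    SELECT anon.user_id, anon.user_name,
           addresses.address_id, addresses.email
    FROM (
        SELECT users.user_id AS user_id, users.user_name AS user_name
        FROM users
        ORDER BY users.user_id
        LIMIT 10 OFFSET 20
    ) AS anon
    LEFT OUTER JOIN addresses ON addresses.user_id = anon.user_id
    ORDER BY anon.user_id, addresses.address_id
    """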