[storm] Pickling Storm objects?

James Henstridge james at jamesh.id.au
Fri Aug 28 07:07:32 BST 2009


On Thu, Aug 27, 2009 at 7:48 PM, Stuart Bishop <stuart.bishop at canonical.com> wrote:
> On Thu, Aug 27, 2009 at 4:56 PM, Gustavo Niemeyer <gustavo at niemeyer.net> wrote:
>> Hey Stuart,
>>
>>> Has anyone looked into making Storm objects pickleable? I want to
>>> stuff expensive query results into memcached.
>>>
>>> I'm using ZStorm so can just use the name to refer to the Store. I can
>>> put together a MaterializedResultSet class supporting a lot of the API
>>> from a materialized list of Storm objects. I think getting the Storm
>>> objects themselves pickled is going to be the tricky bit.
>>
>> Pickling itself shouldn't be hard.  How do you envision an unpickled
>> object should behave?
>
> It's nice to know you don't foresee major roadblocks. I think the major
> difficulty is becoming familiar enough with the Storm internals - I've
> never dealt with ObjectInfo and friends before.
>
> I'd like it to behave like the original object as much as possible.
> The goal is a drop-in replacement for code like:
>
>   results = store.find(... complex and costly conditions ...)
>
> with something like:
>
>   results = cached(store, max_age, ... complex and costly conditions ...)
>
> So unpickled objects can be updated, and code should be able to traverse
> from the unpickled objects to objects loaded from the Store. For the Storm
> objects, I expect they would be indistinguishable from ones loaded from
> the Store (assemble the object, swap in the Store, inject it into the cache).
>
> I don't think I need the result set to support operations like union,
> find, or aggregates beyond count(), so the result set can just be a
> list with a count() method.
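(For readers following along: a minimal sketch of the MaterializedResultSet
idea described above. The class name comes from Stuart's message; the exact
methods shown -- count() and is_empty() -- are my assumption about which bits
of the ResultSet API would be worth mimicking, not a confirmed design.)

```python
class MaterializedResultSet(list):
    """A plain list of already-loaded objects standing in for a
    Storm ResultSet, for results pulled out of a cache."""

    def count(self):
        # No database query needed: the rows are already in memory.
        return len(self)

    def is_empty(self):
        return not self


# Usage: wrap rows fetched (or unpickled) earlier.
rows = MaterializedResultSet(["obj1", "obj2", "obj3"])
print(rows.count())      # 3
print(rows.is_empty())   # False
```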

Out of interest, what processing are you trying to avoid in
particular?  Is it the cost of computing the query, or the cost of
retrieving the data from the database, or both?

If it is primarily the cost of the query, then you could recreate a
similar result set from the primary keys of your previously created
one:

  result2 = store.find(Foo, Foo.id.is_in(result1.values(Foo.id)))

Or for multi-table results:

  result2 = store.find((Foo, Bar),
      In((Foo.id, Bar.id), result1.values(Foo.id, Bar.id)))

In this sort of case, you could use memcached to record which rows
were in the result set while still getting up-to-date field values from
the database (something that might be important if you are doing any
write operations).

James.


