[storm] Cache for nonexistent result
Ivan Zakrevskyi
ivan.zakrevskyi at rebelmouse.com
Sat Jan 17 16:05:22 UTC 2015
Hi, @Free.
I was just looking for a way to reduce the number of DB queries. Of course,
there may be better solutions. My patch was designed around the principle of
least interference with the internal API, and around simplicity of
maintenance and upgrading.
> And you can see the result of this complexity in the increased complexity
> of the APIs you propose (for example the "exists" parameter), which all of
> a sudden become more sophisticated and hence difficult to understand.
Thank you. I agree with you: I've dropped the exists param and extended the
invalidate method just now:
def invalidate(self, obj=None):
    if type(obj) is tuple:
        # A (cls, primary_values) key: drop the remembered miss, if present.
        if obj in self.nonexistent_cache:
            self.nonexistent_cache.remove(obj)
    else:
        StoreOrig.invalidate(self, obj)
        if obj is None:
            del self.nonexistent_cache[:]
So now I can invalidate even nonexistent objects, and the API is unchanged.
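For example (a sketch only; the key mirrors the (cls, primary_values) tuples
that the patch below stores in nonexistent_cache, and TwitterProfile is just
an illustrative model):

store.get(TwitterProfile, (10, 3))           # miss, remembered in nonexistent_cache
store.invalidate((TwitterProfile, (10, 3)))  # forget that single cached miss
store.invalidate()                           # full invalidation clears all misses too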
> (because the minimum 100% "storm-safe" isolation level would become
> serializable, when it's now repeatable read)
I'm not sure about this. It's not a phantom read in its purest form.
"A phantom read occurs when, in the course of a transaction, two identical
queries are executed, and the collection of rows returned by the second
query is different from the first."
My patch does not affect collections, only get() of a specific row. So the
safe isolation level for my patch is also repeatable read.
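To make the distinction concrete (a sketch, assuming the patched store below;
the concurrent INSERT is hypothetical):

store.get(TwitterProfile, (10, 3))  # row absent: one SELECT, miss cached, None returned
# ... a concurrent transaction INSERTs the (10, 3) row and commits ...
store.get(TwitterProfile, (10, 3))  # still None, served from nonexistent_cache
                                    # -> a non-repeatable read of a single row
store.find(TwitterProfile, TwitterProfile.user_id == 3)
                                    # find() bypasses the miss cache, so
                                    # collections are unaffected (no hidden phantoms)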
> Let's not add more built-in complexity; instead I suggest that you
> implement this additional caching mechanism in your application (and I'd
> personally create a separate API built on top of Store, instead of
> subclassing Store).
In my case the overhead was fully absorbed by the decrease in the number of
DB queries. It's a real problem for non-auto-increment or composite primary
keys, especially for models of social profiles. In any case, thank you.
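For reference, a composition-based variant along the lines you suggest might
look roughly like this (a sketch only; CachingStore and its internals are
hypothetical, not part of Storm):

class CachingStore(object):
    # Hypothetical wrapper: remembers get() misses; not part of Storm.

    def __init__(self, store, size=1000):
        self._store = store
        self._size = size
        self._misses = set()  # (cls, key) pairs known to be absent; keys must be hashable

    def get(self, cls, key):
        if (cls, key) in self._misses:
            return None
        obj = self._store.get(cls, key)
        if obj is None and len(self._misses) < self._size:
            self._misses.add((cls, key))
        return obj

    def invalidate(self, obj=None):
        # invalidate() with no argument also forgets all remembered misses.
        if obj is None:
            self._misses.clear()
        self._store.invalidate(obj)

    def __getattr__(self, name):
        # Delegate everything else (find, add, flush, commit, ...) to the real Store.
        return getattr(self._store, name)

Composition keeps the Store API untouched, and the miss cache lives entirely
in application code.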
2015-01-17 12:00 GMT+02:00 Free Ekanayaka <free at 64studio.com>:
> On Sat, Jan 17, 2015 at 12:12 AM, Ivan Zakrevskyi <
> ivan.zakrevskyi at rebelmouse.com> wrote:
>
> [...]
>
>> On the other hand, suppose that an object exists and you have already
>> fetched it in the current transaction. Suppose that afterwards the object
>> was changed in the DB by a concurrent thread. These changes will not
>> affect your object. I think in this case it does not matter whether the
>> cached value is None or a Model instance: once the object has been read,
>> it will not change even if it has been modified by a parallel process.
>>
>
> Yes, for objects already in the cache, that's a tradeoff of the existing
> cache mechanism, and it's why Store.invalidate exists. The original Storm
> developers considered this an acceptable tradeoff: it requires a bit more
> care from people, but brings an immediate performance benefit.
>
> You're proposing to extend that behavior, but this will inevitably make
> reasoning about code more difficult and require even more care (because the
> minimum 100% "storm-safe" isolation level would become serializable, when
> it's now repeatable read), and at that point my feeling is that the
> tradeoff stops being worth it.
>
> Caches ARE hard:
>
> http://martinfowler.com/bliki/TwoHardThings.html
>
> because they are subtle. And you can see the result of this complexity in
> the increased complexity of the APIs you propose (for example the "exists"
> parameter), which all of a sudden become more sophisticated and hence
> difficult to understand.
>
> One of the design goals of Storm is to be simple, and I agree with that
> goal, since the very idea of an ORM is probably questionable: every
> abstraction layer has a cost, especially in the case of ORMs, where the
> abstraction layer can't be mapped cleanly onto the underlying model, due to
> the object-relational impedance mismatch problem.
>
> Let's not add more built-in complexity; instead I suggest that you
> implement this additional caching mechanism in your application (and I'd
> personally create a separate API built on top of Store, instead of
> subclassing Store).
>
> Cheers,
>
> Free
>
>> My patch does not affect store.find() and, hence, selections. I'm not
>> sure that phantom reads are possible here, except perhaps via
>> store.get_multi(). This is rather a "non-repeatable read" than a
>> "phantom read", because it can hide changes to a specific row (with a
>> given primary key), but not to a selection.
>>
>> So, for "Repeatable Read" and "Serializable" my patch is safe (only
>> need add reset of store.nonexistent_cache on commit).
>>
>> For "Read Committed" and "Read Uncommitted" my patch is not safe,
>> because this levels should not to have "Non-repeatable reads". But for
>> existent object storm also can not provide "repeatable reads". So,
>> it's not mater, will "Non-repeatable" be applied reads for existent
>> object or for nonexistent object.
>>
>> Of course, my patch is a temporary solution. There could be more elegant
>> solutions at the library level. But it really does eliminate many DB
>> queries for nonexistent primary keys.
>>
>>
>>
>> 2015-01-16 23:20 GMT+02:00 Free Ekanayaka <free at 64studio.com>:
>> >
>> > See:
>> >
>> > http://en.wikipedia.org/wiki/Isolation_%28database_systems%29
>> >
>> > for reference.
>> >
>> > On Fri, Jan 16, 2015 at 10:19 PM, Free Ekanayaka <free at 64studio.com>
>> wrote:
>> >>
>> >> Hi Ivan,
>> >>
>> >> it feels like what you suggest would work safely only for transactions
>> >> at the serializable isolation level, not repeatable read down through
>> >> read uncommitted (since phantom reads could occur there, and the
>> >> nonexistent cache would hide new results).
>> >>
>> >> Cheers
>> >>
>> >> On Fri, Jan 16, 2015 at 5:55 PM, Ivan Zakrevskyi <
>> ivan.zakrevskyi at rebelmouse.com> wrote:
>> >>>
>> >>> Hi, all. Thanks for the answer. I'll try to explain.
>> >>>
>> >>> First, try to get an existing object:
>> >>>
>> >>> In [2]: store.get(StTwitterProfile, (1,3))
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM twitterprofile WHERE
>> >>> twitterprofile.context_id = %s AND twitterprofile.user_id = %s LIMIT 1;
>> >>> args=(1, 3)'
>> >>> Out[2]: <users.orm.TwitterProfile at 0x7f1e93b6d450>
>> >>>
>> >>> In [3]: store.get(StTwitterProfile, (1,3))
>> >>> Out[3]: <users.orm.TwitterProfile at 0x7f1e93b6d450>
>> >>>
>> >>> In [4]: store.get(StTwitterProfile, (1,3))
>> >>> Out[4]: <users.orm.TwitterProfile at 0x7f1e93b6d450>
>> >>>
>> >>> You can see that Storm issued only one query.
>> >>>
>> >>> Ok, now try to get a nonexistent twitter profile for the given context:
>> >>>
>> >>> In [5]: store.get(StTwitterProfile, (10,3))
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM twitterprofile WHERE
>> >>> twitterprofile.context_id = %s AND twitterprofile.user_id = %s LIMIT 1;
>> >>> args=(10, 3)'
>> >>>
>> >>> In [6]: store.get(StTwitterProfile, (10,3))
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM twitterprofile WHERE
>> >>> twitterprofile.context_id = %s AND twitterprofile.user_id = %s LIMIT 1;
>> >>> args=(10, 3)'
>> >>>
>> >>> In [7]: store.get(StTwitterProfile, (10,3))
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM twitterprofile WHERE
>> >>> twitterprofile.context_id = %s AND twitterprofile.user_id = %s LIMIT 1;
>> >>> args=(10, 3)'
>> >>>
>> >>> Storm sends a query to the database each time.
>> >>>
>> >>> For example, suppose we have a utility function:
>> >>>
>> >>> def myutil(user_id, *args, **kwargs):
>> >>>     context_id = get_context_from_mongodb_redis_memcache_environment_etc(
>> >>>         user_id, *args, **kwargs)
>> >>>     twitter_profile = store.get(TwitterProfile, (context_id, user_id))
>> >>>     return twitter_profile.some_attr
>> >>>
>> >>> In this case, Storm will send a query to the database on every call.
>> >>>
>> >>> The situation is similar for a nonexistent relation:
>> >>>
>> >>> In [20]: u = store.get(StUser, 10)
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM user WHERE user.id = %s LIMIT 1; args=(10,)'
>> >>>
>> >>>
>> >>> In [22]: u.profile
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM userprofile WHERE userprofile.user_id = %s
>> >>> LIMIT 1; args=(10,)'
>> >>>
>> >>> In [23]: u.profile
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM userprofile WHERE userprofile.user_id = %s
>> >>> LIMIT 1; args=(10,)'
>> >>>
>> >>> In [24]: u.profile
>> >>> base.py:50 =>
>> >>> u'(0.001) SELECT ... FROM userprofile WHERE userprofile.user_id = %s
>> >>> LIMIT 1; args=(10,)'
>> >>>
>> >>> I've created a temporary patch to reduce the number of DB queries (see
>> >>> below). But I am sure a more elegant solution is possible (at the
>> >>> library level).
>> >>>
>> >>>
>> >>> class NonexistentCache(list):
>> >>>
>> >>>     # A bounded move-to-front list of keys known to be absent.
>> >>>     _size = 1000
>> >>>
>> >>>     def add(self, val):
>> >>>         if val in self:
>> >>>             self.remove(val)
>> >>>         self.insert(0, val)
>> >>>         if len(self) > self._size:
>> >>>             self.pop()
>> >>>
>> >>>
>> >>> class Store(StoreOrig):
>> >>>
>> >>>     def __init__(self, database, cache=None):
>> >>>         StoreOrig.__init__(self, database, cache)
>> >>>         self.nonexistent_cache = NonexistentCache()
>> >>>
>> >>>     def get(self, cls, key, exists=False):
>> >>>         """Get object of type cls with the given primary key from the database.
>> >>>
>> >>>         This method is patched to cache nonexistent keys and reduce the
>> >>>         number of DB queries. If the object is alive the database won't
>> >>>         be touched.
>> >>>
>> >>>         @param cls: Class of the object to be retrieved.
>> >>>         @param key: Primary key of object. May be a tuple for composed keys.
>> >>>         @param exists: If True, bypass the nonexistent cache and query
>> >>>             the database anyway.
>> >>>
>> >>>         @return: The object found with the given primary key, or None
>> >>>             if no object is found.
>> >>>         """
>> >>>         if self._implicit_flush_block_count == 0:
>> >>>             self.flush()
>> >>>
>> >>>         if type(key) != tuple:
>> >>>             key = (key,)
>> >>>
>> >>>         cls_info = get_cls_info(cls)
>> >>>
>> >>>         assert len(key) == len(cls_info.primary_key)
>> >>>
>> >>>         primary_vars = []
>> >>>         for column, variable in zip(cls_info.primary_key, key):
>> >>>             if not isinstance(variable, Variable):
>> >>>                 variable = column.variable_factory(value=variable)
>> >>>             primary_vars.append(variable)
>> >>>
>> >>>         primary_values = tuple(var.get(to_db=True) for var in primary_vars)
>> >>>
>> >>>         # Patched
>> >>>         alive_key = (cls_info.cls, primary_values)
>> >>>         obj_info = self._alive.get(alive_key)
>> >>>         if obj_info is not None and not obj_info.get("invalidated"):
>> >>>             return self._get_object(obj_info)
>> >>>
>> >>>         if obj_info is None and not exists and alive_key in self.nonexistent_cache:
>> >>>             return None
>> >>>         # End of patch
>> >>>
>> >>>         where = compare_columns(cls_info.primary_key, primary_vars)
>> >>>
>> >>>         select = Select(cls_info.columns, where,
>> >>>                         default_tables=cls_info.table, limit=1)
>> >>>
>> >>>         result = self._connection.execute(select)
>> >>>         values = result.get_one()
>> >>>         if values is None:
>> >>>             # Patched: remember the miss.
>> >>>             self.nonexistent_cache.add(alive_key)
>> >>>             # End of patch
>> >>>             return None
>> >>>         return self._load_object(cls_info, result, values)
>> >>>
>> >>>     def get_multi(self, cls, keys, exists=False):
>> >>>         """Get objects of type cls with the given primary keys from the database.
>> >>>
>> >>>         If an object is alive the database won't be touched.
>> >>>
>> >>>         @param cls: Class of the objects to be retrieved.
>> >>>         @param keys: Collection of primary keys (each may be a tuple
>> >>>             for composed keys).
>> >>>         @param exists: If True, bypass the nonexistent cache.
>> >>>
>> >>>         @return: A dict mapping each given key to the object found, or
>> >>>             to None if no object is found.
>> >>>         """
>> >>>         result = {}
>> >>>         missing = {}
>> >>>         if self._implicit_flush_block_count == 0:
>> >>>             self.flush()
>> >>>
>> >>>         cls_info = get_cls_info(cls)
>> >>>
>> >>>         for key in keys:
>> >>>             key_orig = key
>> >>>             if type(key) != tuple:
>> >>>                 key = (key,)
>> >>>
>> >>>             assert len(key) == len(cls_info.primary_key)
>> >>>
>> >>>             primary_vars = []
>> >>>             for column, variable in zip(cls_info.primary_key, key):
>> >>>                 if not isinstance(variable, Variable):
>> >>>                     variable = column.variable_factory(value=variable)
>> >>>                 primary_vars.append(variable)
>> >>>
>> >>>             primary_values = tuple(var.get(to_db=True) for var in primary_vars)
>> >>>
>> >>>             alive_key = (cls_info.cls, primary_values)
>> >>>             obj_info = self._alive.get(alive_key)
>> >>>             if obj_info is not None and not obj_info.get("invalidated"):
>> >>>                 result[key_orig] = self._get_object(obj_info)
>> >>>                 continue
>> >>>
>> >>>             if obj_info is None and not exists and alive_key in self.nonexistent_cache:
>> >>>                 result[key_orig] = None
>> >>>                 continue
>> >>>
>> >>>             missing[primary_values] = key_orig
>> >>>
>> >>>         if not missing:
>> >>>             return result
>> >>>
>> >>>         wheres = []
>> >>>         for i, column in enumerate(cls_info.primary_key):
>> >>>             wheres.append(In(column, tuple(v[i] for v in missing)))
>> >>>         where = And(*wheres) if len(wheres) > 1 else wheres[0]
>> >>>
>> >>>         for obj in self.find(cls, where):
>> >>>             key_orig = missing.pop(tuple(
>> >>>                 var.get(to_db=True)
>> >>>                 for var in get_obj_info(obj).get("primary_vars")))
>> >>>             result[key_orig] = obj
>> >>>
>> >>>         # Keys still missing after the query do not exist: remember them.
>> >>>         for primary_values, key_orig in missing.items():
>> >>>             self.nonexistent_cache.add((cls_info.cls, primary_values))
>> >>>             result[key_orig] = None
>> >>>
>> >>>         return result
>> >>>
>> >>>     def reset(self):
>> >>>         StoreOrig.reset(self)
>> >>>         del self.nonexistent_cache[:]
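>> >>>
>> >>> A quick usage sketch of the patched store (the model and keys are
>> >>> illustrative):
>> >>>
>> >>> store = Store(database)             # the patched subclass above
>> >>> store.get(TwitterProfile, (10, 3))  # one SELECT; the miss is remembered
>> >>> store.get(TwitterProfile, (10, 3))  # no query; None comes from nonexistent_cache
>> >>> result = store.get_multi(TwitterProfile, [(1, 3), (10, 3)])
>> >>>                                     # a single IN query covering only unknown keys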
>> >>>
>> >>>
>> >>>
>> >>> 2015-01-16 9:03 GMT+02:00 Free Ekanayaka <free at 64studio.com>:
>> >>>>
>> >>>> Hi Ivan
>> >>>>
>> >>>> On Thu, Jan 15, 2015 at 10:23 PM, Ivan Zakrevskyi <
>> ivan.zakrevskyi at rebelmouse.com> wrote:
>> >>>>>
>> >>>>> Hi all.
>> >>>>>
>> >>>>> Storm has excellent caching behavior, but Store._alive holds only
>> >>>>> existing objects. If no object exists for a given key, Storm issues
>> >>>>> the DB query again and again.
>> >>>>>
>> >>>>> Are you planning to add caching for keys of nonexistent objects, to
>> >>>>> prevent these repeated DB queries?
>> >>>>
>> >>>>
>> >>>> If an object doesn't exist in the cache it means that either it was
>> >>>> not loaded at all yet, or it was loaded but is now marked as
>> >>>> "invalidated" (for example, the transaction in which it was loaded
>> >>>> has terminated).
>> >>>>
>> >>>> So I'm not sure what you mean in your question, but I don't think
>> >>>> there is anything more that could be cached (in terms of key->object
>> >>>> values).
>> >>>>
>> >>>> Cheers
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>