bug search and source package names
Robert Collins
robertc at robertcollins.net
Wed Feb 2 19:13:03 UTC 2011
On Thu, Feb 3, 2011 at 6:38 AM, Clint Byrum <clint at ubuntu.com> wrote:
> On Wed, 2011-02-02 at 19:09 +1300, Robert Collins wrote:
>> Hi, as you may have noticed, bug search in Launchpad is not as fast as
>> you might wish :). One of the contributing factors to the search
>> performance is that we do a substring match for package names.
>>
>
> Could you cheat a little and add some tokens to the text that is
> searched?
>
> I presume right now the bug search uses postgres's text searching to
> scan the descriptions of the bugs. So something like
>
> WHERE searchable_description @@ 'ibche' OR package_name LIKE '%ibche%'
Have a look in the bug I linked; we use two fti columns with gist
indices + an ILIKE clause on a text field.
> So the question is, could you add this on save:
>
> UPDATE foo SET searchable_description = concat (description,' ___packagename:',package_name)
>
> Then you can build it as an | query in the text search?
>
> WHERE searchable_description @@ to_tsquery('ibche | ___packagename:ibche')
>
> Not sure how the ranking would be affected.
So we'd need all permutations of substrings - ugh. A better (and
easier to reason about) approach is to use trigrams; this is one of
the medium-term solutions we're considering. One can exclude hits from
ranking during indexing. Trigrams are at least one month, maybe two,
away, and will require a moderate amount of engineering time.
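
To make the trigram idea concrete, here is a minimal Python sketch. It is not Launchpad's code and not how PostgreSQL implements trigram indexing; the function names and the sample package list are invented for illustration. Each name is broken into three-character windows, an inverted index maps every trigram to the names containing it, and a substring query only has to verify the candidates that share all of the query's trigrams.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character windows in text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(names):
    """Map each trigram to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in trigrams(name):
            index[gram].add(name)
    return index


def substring_search(index, names, query):
    """Find names containing query, using the index to narrow candidates."""
    query = query.lower()
    grams = trigrams(query)
    if not grams:
        # Queries shorter than 3 characters have no trigrams: fall back to a scan.
        candidates = set(names)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing trigrams is necessary but not sufficient, so verify each candidate.
    return sorted(n for n in candidates if query in n.lower())


# Hypothetical sample data.
PACKAGES = ["libchewing", "php5-mysql", "mysql-server", "libc6"]
INDEX = build_index(PACKAGES)
```

With this sketch, `substring_search(INDEX, PACKAGES, "ibche")` only inspects names that contain all of `ibc`, `bch` and `che`, rather than every name in the list.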
>> We're looking at schema changes and additional short-term solutions
>> (long term we're moving to a dedicated search engine such as Lucene),
>> but I'm wondering - how important is this substring matching?
>
> I think it's an intuitive and powerful feature. Typing mysql and
> matching on php5-mysql is actually really, really important.
Ok, so one +1 vote for 'slow is better in this case' :).
>> What do I mean by this? If you type 'ibche' into the bug search for
>> Ubuntu it will time out. But if it didn't time out it would find all
>> bugs on libchewing (because ibche is a substring of libchewing).
>>
>> If Ubuntu as a whole is open to this being removed temporarily(*) then
>> we can drop some representative queries down from 7 seconds to 380ms
>> with relatively little effort. We may be able to achieve this sort of
>> result with more significant effort - but if it's not actually a
>> valuable feature, it's much more efficient to disable it for a while.
>>
>
> If nothing else, allowing the user to turn it off in advanced search to
> make things faster might be helpful.
>
> Another thing might be to simply disable it for words shorter than 5 chars.
> I'm assuming that pgsql's string match uses the shortcut of looking for
> the last letter at strlen() chars, then jumping nchars, looking for it
> again, and so on. Plus you definitely don't care if lib matches.
It does a table scan over all 400000 rows of Ubuntu bugs. The number of
words in play has almost no impact on performance; and currently
searching for multiple words /totally disables/ the package name
matching (because 'foo bar' is not a substring of any package name).
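
As a rough analogy only (the real query is an ILIKE over a PostgreSQL table, and the rows below are invented), the behaviour looks like this in Python: every row is examined regardless of the query, and a multi-word query can never be a substring of a package name.

```python
def scan_for_package(bug_rows, search_text):
    """Check every row: keep bugs whose package name contains search_text."""
    needle = search_text.lower()
    return [bug_id for bug_id, package in bug_rows if needle in package.lower()]


# Hypothetical rows of (bug id, source package name).
BUGS = [(1, "libchewing"), (2, "php5-mysql"), (3, "mysql-server")]
```

Here `scan_for_package(BUGS, "mysql crash")` returns an empty list even though two of the rows are mysql packages, because the whole search string is tested as one substring.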
Another cheap thing we could do is tokenise the packagename
(php5-mysql -> php5, mysql) and shove them in the full text index.
That would meet your use case of 'searching with mysql brings back
php5-mysql bugs'. [We already index the source package name as a
whole; the tsearch2 vectorisation code may already make this work, so
it may be 'here is one I prepared earlier'.]
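
A minimal sketch of that tokenising step, in Python for illustration. The splitting rule (break on anything that is not a letter or digit) and the helper names are assumptions; this is not what tsearch2 or Launchpad actually does.

```python
import re


def package_tokens(package_name):
    """Split a package name on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", package_name.lower()) if tok]


def build_token_index(package_names):
    """Map each token to the set of package names it came from."""
    index = {}
    for name in package_names:
        for tok in package_tokens(name):
            index.setdefault(tok, set()).add(name)
    return index
```

A lookup for `mysql` in such an index finds php5-mysql, but a lookup for `ibche` finds nothing, so this covers the whole-word case without supporting arbitrary substrings.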
@James: I've run that experiment; with an appropriate index we can make
*prefix* searching as fast as dropping substring searches.
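
One way to see why prefix matching can be cheap where substring matching is not: in any ordered index, all names sharing a prefix sit next to each other, so two binary searches bound the range. The Python sketch below illustrates that idea with a sorted list; it says nothing about which PostgreSQL index was used in the experiment, and it assumes names never contain the code point U+10FFFF.

```python
import bisect


def prefix_search(sorted_names, prefix):
    """Return all names starting with prefix using two binary searches."""
    lo = bisect.bisect_left(sorted_names, prefix)
    # Upper bound: prefix followed by the largest code point (assumed absent from names).
    hi = bisect.bisect_left(sorted_names, prefix + "\U0010ffff")
    return sorted_names[lo:hi]


# Hypothetical sample data, kept sorted so binary search is valid.
NAMES = sorted(["libc6", "libchewing", "mysql-server", "php5-mysql"])
```

Searching for `libche` finds libchewing this way, while `ibche` finds nothing, which is the trade-off between prefix and substring matching.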
-Rob