Moving udd away from sqlite
Vincent Ladeuil
vila+udd at canonical.com
Mon Jun 18 08:45:26 UTC 2012
>>>>> James Westby <james.westby at canonical.com> writes:
> On Fri, 15 Jun 2012 12:34:12 +0200, Vincent Ladeuil <vila+udd at canonical.com> wrote:
>> > It's not magic. It's moving from a database that's not designed for
>> > concurrent use to one that is designed for concurrent use.
>>
>> Despite not being designed for concurrent use, it *is* used this
>> way and lock contentions have been encountered, leading me to
>> believe that the actual *design* needs to be fixed. The fact that
>> changing the db is triggering more contentions is a symptom of a
>> deeper issue.
> Changing the db access layer is triggering that, we were still
> running on the same (non-multi-user) db. I agree that the design
> needs to be fixed, and that's exactly what we're talking about:
> fixing it by moving to a db that is designed for multi-user use.
It looks like your understanding of the issue is better than mine here;
would you mind sharing that knowledge in an automated test (with the
added benefit that we won't regress in this area)?
Just this weekend we had an add-import-jobs failure:
Traceback (most recent call last):
  File "/srv/package-import.canonical.com/new/scripts/bin/add-import-jobs", line 5, in <module>
    sys.exit(main())
  File "/srv/package-import.canonical.com/new/scripts/udd/scripts/add_import_jobs.py", line 17, in main
    icommon.create_import_jobs(lp, status_db)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 304, in create_import_jobs
    status_db.add_import_jobs(checked, newest_published)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 633, in add_import_jobs
    self._add_job(c, package, self.JOB_TYPE_NEW)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 615, in _add_job
    datetime.utcnow(), None, None))
sqlite3.OperationalError: database is locked
So we already know that add_import_jobs is involved in the bug (with the
current sqlite-based implementation); who is the other user in this case,
and how can this be reproduced?
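For what it's worth, the core of such a contention is easy to provoke on
purpose with plain sqlite3 (the sketch below is not udd code, the table and
file names are made up): one connection keeps a write transaction open
while a second one, with a short busy timeout, tries to write to the same
file.

    import os
    import sqlite3
    import tempfile

    # Throwaway db file standing in for the importer's status db.
    fd, db_path = tempfile.mkstemp(suffix='.db')
    os.close(fd)

    setup = sqlite3.connect(db_path)
    setup.execute("CREATE TABLE jobs (package TEXT, added TEXT)")
    setup.commit()
    setup.close()

    # First writer: the INSERT opens a transaction that is never committed,
    # so the write lock stays held.
    writer = sqlite3.connect(db_path)
    writer.execute("INSERT INTO jobs VALUES ('foo', 'now')")

    # Second writer: give up after 0.1s instead of sqlite's default 5s.
    other = sqlite3.connect(db_path, timeout=0.1)
    try:
        other.execute("INSERT INTO jobs VALUES ('bar', 'now')")
    except sqlite3.OperationalError as e:
        print(e)  # database is locked

If the other user in production turns out to be, say, an import worker
holding a long write transaction while add-import-jobs runs, a test built
on that pattern but going through the real status db layer would pin the
bug down.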
>> Well, when the correctness and safety is demonstrated, the context (and
>> hence my own answer) will probably be different but until then I just
>> can't say.
>>
>> > And I'm very reluctant to fork without an actual plan for merging
>> > back: how to know when it's safe & how to actually achieve it.
>>
>> And I have no idea how (nor time right now) to debug the fallout of such
>> a change, which the actual package importer doesn't need. Hence my
>> tendency to consider that the validity of this change should be
>> demonstrated first.
> But you just said above that you *do* think it needs to be fixed?
Yes, it needs to be fixed. It doesn't cause blocking issues currently,
which is why https://bugs.launchpad.net/udd/+bug/724893 hasn't been
fixed yet.
> How can we demonstrate the validity of the change?
With an automated test (or several ;)! What else? ;)
This would make it more comfortable to run the change in production to
catch the other *fallouts*, being confident the known issue *is* fixed.
I.e. reproducing the add_import_jobs failure in a test that will fail
with sqlite and succeed with your changes will demonstrate we've
captured (and fixed) at least one lock contention.
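A rough shape for such a test, building on the reproduction sketched above
(the real version would go through the status db layer in icommon rather
than raw sqlite3, and would be run against both backends; names and timings
below are just placeholders):

    import os
    import sqlite3
    import tempfile
    import threading
    import time
    import unittest

    class TestConcurrentJobWriters(unittest.TestCase):

        def setUp(self):
            fd, self.db_path = tempfile.mkstemp(suffix='.db')
            os.close(fd)
            conn = sqlite3.connect(self.db_path)
            conn.execute("CREATE TABLE jobs (package TEXT)")
            conn.commit()
            conn.close()

        def test_two_writers(self):
            errors = []

            def add_job(name):
                try:
                    # Each worker has its own connection, like two scripts
                    # hitting the status db at the same time in production.
                    conn = sqlite3.connect(self.db_path, timeout=0.5,
                                           isolation_level='IMMEDIATE')
                    conn.execute("INSERT INTO jobs VALUES (?)", (name,))
                    time.sleep(1)  # hold the write lock for a while
                    conn.commit()
                except sqlite3.OperationalError as e:
                    errors.append(e)

            workers = [threading.Thread(target=add_job, args=('pkg%d' % i,))
                       for i in range(2)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()

            # Fails with "database is locked" on sqlite; the claim to verify
            # is that the same scenario passes on the new backend.
            self.assertEqual([], errors)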
> We can only demonstrate that it doesn't break production by
> running the changes in production.
If the test suite cannot be trusted to catch most of the issues that
happen in production, the test suite should be fixed.
You're not implying that, because testing in production is needed, the
test suite is useless, right?
> What would satisfy you that it was unlikely to break production?
Bugs that happen in production have a far higher cost than test
failures; that's where automated tests get most of their value.
But see below.
>> Would there be a script to migrate from sqlite to PG ?
> Yes.
Cool.
From that, can we imagine a test that will import a few packages and
compare the corresponding dbs?
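Something along these lines perhaps: import a handful of packages (or run
the migration script on a small sqlite status db), then dump and compare
each table from both sides. The table list, the psycopg2 usage and the
connection details are assumptions on my side, not the real udd schema.

    import sqlite3

    import psycopg2  # assuming the PG side would be reached via psycopg2

    TABLES = ['jobs']  # placeholder for the real list of status db tables

    def rows(conn, table):
        cur = conn.cursor()
        cur.execute('SELECT * FROM %s' % table)
        # Sorting makes the comparison independent of row order; some type
        # normalisation (date strings vs datetime objects) may be needed too.
        return sorted(cur.fetchall())

    def same_content(sqlite_path, pg_dsn):
        lite = sqlite3.connect(sqlite_path)
        pg = psycopg2.connect(pg_dsn)
        try:
            return all(rows(lite, t) == rows(pg, t) for t in TABLES)
        finally:
            lite.close()
            pg.close()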
>> Can the package importer be re-started with empty dbs and catch up (how
>> long will it take? Days? Weeks?). Can this trigger bugs because the
>> importer doesn't remember what it pushed to lp?
> It can be restarted with the dbs from whenever the transition starts and
> it will catch up in roughly the time between starting the transition and
> rolling back. There may be a few bugs due to replaying things, but we do
> it all the time (e.g. removing revids and re-importing when someone does
> push --overwrite)
As in requeue --full? requeue --zap-revids? Neither of them is used on a
daily basis, but in my limited experience they never triggered issues
either.
>> Or do you expect us to see another peak like
>> http://webnumbr.com/ubuntu-package-import-failures.from%282012-01-24%29 ?
> Hopefully not.
>> Yes, that's why we're not in a position to safely accept such a change!
>>
>> And all the time spent on integrating these changes is not spent on
>> allowing them to be accepted in good conditions.
> Sorry, I don't understand this, could you explain?
Fire-fighting in production is wasted time if it's caused by a bug that
could have been caught before deployment. Improving the test suite is
the best answer to that: the time is spent on making deployment more
reliable so that, ideally, no fire-fighting (or at least less of it) is
needed.
>> > but it demonstrated the locking problem and is how he came up
>> > with those options.
>>
>> Then the test improvements are certainly valuable to backport to lp:udd,
>> or is there nothing to reuse from the EC2 experiment?
> It wasn't a set of unit tests, it was a set of tests of a live system,
> so nothing to backport.
It's hard for me to judge without being able to redo these tests or
understand how they demonstrate that the issue is fixed :-/
I don't mind having higher level tests to start with (writing unit tests
requires a precise understanding of the bug at the lowest level); what I
care about is that we add tests as we better understand how the importer
works and how it fails. I.e. we reduce the space where *we don't know*
whether the importer works or fails.
Since https://bugs.launchpad.net/udd/+bug/724893 was filed, we haven't
understood *when* the importer fails, but we know it can fail and it has
failed more often recently (and a lot during your first attempt, if I
remember correctly). If you have a fix for that, great! Show us how it's
fixed with a test.
Once we have this test, if we get another lock contention, it will mean
one of two things:
- your test was wrong or incomplete (good, a single place to fix),
- we encountered a *new* bug (good too, we'll add another test).
And the more tests we add, the easier it becomes to add a new one.
And more importantly in this precise case, whatever knowledge you
acquired investigating the issue is captured in these tests, and
whoever has to fire-fight any related issue will be able to acquire
this knowledge and refine it.
But if nobody ever starts to write the first test, we'll keep
fire-fighting in the dark.
>> > We have had a lot of experience recently working with Canonical
>> > IS to get new servers and new staging servers deployed. If you
>> > want a staging server, we'd be happy to help you and would
>> > gladly advocate for you in their priority queue.
>>
>> Great to hear :) I should come back soon on this topic, just a
>> quick question though: are the new servers running lucid or
>> precise?
> New servers are being installed with precise. If an existing
> server is used it may be lucid, but a concerted effort to upgrade all
> machines to precise is currently underway.
Great, that makes my life easier by reducing the gap between the test
environment used by developers and packagers on one side and the
production server on the other.
Vincent