New overview page for the Bazaar importer

Mon Feb 1 23:28:57 GMT 2010

On Thu, 14 Jan 2010 16:58:29 -0600, John Arbash Meinel <john at arbash-meinel.com> wrote:
> So ignoring the 'latest 100' failures,

That's there to see if there are new issues occurring. It was more useful
before auto-retry was added for some LP failures.

> the top 2 failures are:
> 182 packages failed with key
> AssertionError:<module>:main:find_unimported_versions
>
> Which when poking at one looks like:
>   AssertionError: abiword 1.0.2+cvs.2002.06.05-1woody2 debian woody is
> marked but not imported
>
> Poking around I see debian woody, debian sid, ubuntu warty as failing on
> those. This certainly sounds like just bad configuration.

Let me explain the parts of this.

When the importer sees a version for the first time it will record a
tuple of (distro, suite, version, revid, testament sha) for audit
purposes. It refers to this as "marking."

It also has checks to ensure that the tuple matches the data every time
it considers the package. In this case it is complaining because it
appears that the revision has been deleted from the branch, which would
be bad.
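
To make the check concrete, here is a minimal Python sketch of it. It is
not the importer's actual code: the Mark class, check_mark() and the
branch_testaments mapping (revid -> testament sha for revisions present
in the branch) are all hypothetical names used only for illustration.

```python
from typing import Dict, NamedTuple


class Mark(NamedTuple):
    """The audit tuple recorded the first time a version is seen."""
    distro: str
    suite: str
    version: str
    revid: str
    testament_sha: str


def check_mark(package: str, mark: Mark,
               branch_testaments: Dict[str, str]) -> None:
    """Check a recorded mark against what is actually in the branch.

    branch_testaments is an assumed stand-in for looking the revision up
    in the real branch.
    """
    if mark.revid not in branch_testaments:
        # The version was marked, but its revision is not in the branch.
        raise AssertionError(
            "%s %s %s %s is marked but not imported"
            % (package, mark.version, mark.distro, mark.suite))
    if branch_testaments[mark.revid] != mark.testament_sha:
        # The revision exists but no longer matches what was audited.
        raise AssertionError(
            "%s %s: testament sha does not match the mark"
            % (package, mark.version))
```

A mark whose revid is missing from the mapping produces the same kind of
message as the abiword failure quoted above.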

When this was first implemented it wasn't transactional. This meant that
a version was marked even if its import failed, so when you retried it
the check would fail. That's fixed, but the old data wasn't purged
entirely. Packages left in that state are ones whose import never
worked.
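
As an illustration of what "transactional" means here, the sketch below
records the mark and runs the import inside one database transaction, so
a failed import leaves no mark behind. It assumes an sqlite3 store with a
"marks" table; that storage choice, the table layout and the function
names are assumptions for the example, not a description of the
importer.

```python
import sqlite3


def create_marks_table(conn):
    # Hypothetical schema mirroring the audit tuple.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS marks "
        "(distro TEXT, suite TEXT, version TEXT, revid TEXT, "
        "testament_sha TEXT)")


def import_and_mark(conn, mark, do_import):
    """Record the mark and run the import as a single unit.

    Using the connection as a context manager commits if the block
    succeeds and rolls back if it raises, so the INSERT is undone when
    do_import() fails.
    """
    with conn:
        conn.execute("INSERT INTO marks VALUES (?, ?, ?, ?, ?)", mark)
        do_import()
```

With the non-transactional behaviour the INSERT would have survived the
failure, and the next attempt would hit "marked but not imported".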

A little while ago I accidentally retried a bunch of packages that I
shouldn't have, as their problems hadn't been fixed, including abiword.
They failed in this manner, but that failure is masking the real error.

Each package needs to be checked to see whether it is in the case that
never worked, and if so given a full retry to reinstate the original
error. I don't want to do that wholesale without checking, though.

> Number 2 is
> 89 packages failed with key
> AssertionError:<module>:main:import_package:import_package:extract
> 
> And all the ones I poked at were:
>   File
> "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py",
> line 1802, in extract
>     "Can't handle non gz tarballs yet"
> AssertionError: Can't handle non gz tarballs yet

Colin asked me to increase the priority of this today.

> The next 3 failures (34 + 34 + 32 = 100, so it should be #2?), are all
> UnicodeDecodeError.
> The first set are all 'author.decode', the next is path issues (probably
> non-ascii paths in the dataset). The next three are all
> "find_extra_authors" which is breaking down at "change.decode('UTF-8')".

For find_extra_authors we should catch the errors and skip I guess.
For paths, I'm not sure what the best solution is.
For author.decode, I'm not sure if we should skip, perhaps add a
translation from byte-string->unicode string for known problematic byte
strings?
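
A rough sketch of both ideas, to show what I mean. The function names
and the translation table are hypothetical, and the table entry is a
made-up example rather than a real author from the archive.

```python
# Hypothetical table mapping known-bad byte strings to the intended text.
KNOWN_BAD_AUTHORS = {
    b"J\xf6rg Example <joerg@example.invalid>":
        "J\u00f6rg Example <joerg@example.invalid>",
}


def decode_author(raw):
    """Decode an author as UTF-8, falling back to the translation table.

    Returns None when the bytes are neither valid UTF-8 nor a known
    problematic string, leaving it to the caller to skip or fail.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return KNOWN_BAD_AUTHORS.get(raw)


def decode_changes(changes):
    """Decode changelog change lines, skipping any that are not UTF-8."""
    decoded = []
    for change in changes:
        try:
            decoded.append(change.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # catch the error and skip this line
    return decoded
```

Skipping loses an extra author here and there, while the table keeps the
data but needs maintaining by hand.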

> Next is 21 diverged branches...

I figured out what the issue is here and I am considering what the best
way to fix it is.

Basically the collision code forgets to set overwrite=True when
pushing. This has highlighted that the collision code is a little too
strict about how it calculates collisions though, which is what needs to
be considered.

> Next is 16 different serializations (need to be upgraded?) ...

Yes.

> And then a fairly long tail.

> All of this doesn't really look like stuff to do on the bzr side.
> Decoding the changelogs is something to investigate, supporting non .gz
> tarballs, and figuring out what to do when something wants to import a
> distro version that is "marked but not imported" (which I don't claim to
> understand).

Thanks for looking into it, I hope my explanations are useful.

Thanks,

James