Contributors mailing list archives
contributors@odoo-community.org
Re: runbot is down
by
Camptocamp France SAS, Alexandre Fayolle
On 11/10/2016 10:38, Stéphane Bidoul wrote:
> Hi Alexandre,
>
> "internal postgres corruption" looks pretty scary. Any idea on the root
> cause or lesson we can learn here?

Maybe the lesson is "be careful when running 1500 databases on your PostgreSQL cluster with lots of simultaneous connections". And the second lesson is that the cleanup in the runbot is not very good/efficient/robust (I'm not sure exactly which, but it seems to leave a lot of crap behind).

I had to fight against systemd, which decided that PG was taking too long to start up (replaying the WAL / rebuilding some internal data structures was taking a bit of time) and would issue a kill -9, which was *not* a helpful way of solving things.

In the end I:

* manually cleaned up all the builds
* manually dropped all the databases
* rebooted the servers
* DELETEd the runbot.builds related to the heads of the main branches (so that the rebuild would work correctly by recreating a build environment from github)

And since this was still failing in lots of cases, I just went through a 2h pdb session to track down a missing fix, which I just applied on the runbot and which seems to fix the builds on the v10 branch (the problem was introduced by the merge of the upstream branch of odoo-extra in our runbot)...

Things seem to be getting back to normal now.

--
Alexandre Fayolle
Chef de Projet
Tel : +33 4 58 48 20 30

Camptocamp France SAS
Savoie Technolac, BP 352
73377 Le Bourget du Lac Cedex
http://www.camptocamp.com