Debugging TripleO Gate Failures in Mistral

TripleO started making use of Mistral in the OpenStack Newton cycle. To help make sure everything is robust and well tested, a TripleO Jenkins job now runs against all Mistral changes. Hopefully this will prove valuable to both sides - it should help stop Mistral changes breaking TripleO and also provide a good, real-world test of Mistral.

However, it does mean that Mistral developers are going to be exposed to TripleO and its logs for the first time.

So, let's look at how to make sense of the huge log files.

Mistral Logs

Each CI run includes all the logs generated by Mistral. You may want to dive straight in there to see if there is an error. When you click on a failed job, go into the logs directory, then follow through to /var/log/mistral and you will find an individual log for the API, engine and executor.

For example: http://logs.openstack.org/67/356567/8/check/gate-tripleo-ci-centos-7-nonha-multinode/fb8b79e/logs/var/log/mistral/

Note: that link will stop working at some point; I am not sure how long the files are kept.
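
If you prefer to search all of the Mistral logs at once, here is a rough sketch of how you might do it; mirroring them with wget into a local "mistral-logs" directory is just my suggestion, not something the CI tooling provides.

# Mirror the Mistral log directory from the example job above
# (substitute the URL from your own failed job).
wget -q -r -np -nd -P mistral-logs \
    "http://logs.openstack.org/67/356567/8/check/gate-tripleo-ci-centos-7-nonha-multinode/fb8b79e/logs/var/log/mistral/"

# Search every downloaded log for errors and tracebacks
# (use zgrep on each file instead if they were stored gzipped).
grep -rn -E "ERROR|Traceback" mistral-logs/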

console.html

At the root of the logs you will see console.html - this is the main shell output for the CI session and the best place to start, unless you suspect the problem will be in the Mistral logs.

ERROR DURING PREVIOUS COMMAND

This line is output at various points if the job fails, so search for it. Note that it should be at the start of the line, directly after the date and time; otherwise you are just seeing the echoed command that could have output it (bash echoes each command it runs), not the error itself.

Here is an example where I found that error.

2016-09-26 10:12:11.891987 | + ./..//tripleo-ci/scripts/common_functions.sh:delorean_build_and_serve:L1:   echo ERROR DURING PREVIOUS COMMAND '^^^'
2016-09-26 10:12:11.892018 | ERROR DURING PREVIOUS COMMAND ^^^
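
If you have downloaded console.html, a grep along these lines (my own pattern, not something the job provides) only matches the real error line, because it requires the message to follow the timestamp directly rather than appear in an echoed command.

# Match the error only when it comes straight after the "timestamp | " prefix,
# filtering out the lines where bash merely echoes the command.
grep -n '| ERROR DURING PREVIOUS COMMAND' console.html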

If you couldn't find it, skip to the next section.

If you did, look for the titles; they will be something like the "Building mistral" example below. In this example, I had to go up about 30 lines in the log before I could see that it failed building Mistral. This means it failed to create the rpm for Mistral - we will come back to possible reasons for this later.

2016-09-26 10:09:49.996482 | #################
2016-09-26 10:09:49.996609 | tripleo.sh -- Building mistral
2016-09-26 10:09:49.996672 | #################
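
To see which stage a failure belongs to without scrolling, you can ask grep for some leading context and then pick out the banner lines. Again, this is just a sketch of how I search, not something the job provides.

# Show the lines leading up to each real error line...
grep -B 40 '| ERROR DURING PREVIOUS COMMAND' console.html

# ...or jump straight to the nearest "tripleo.sh -- ..." banners in that context.
grep -B 40 '| ERROR DURING PREVIOUS COMMAND' console.html | grep 'tripleo.sh --'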

An example failure due to a client regression

I found this by searching for "ERROR DURING PREVIOUS COMMAND" on a failed job, as described above. It led me to the following Python traceback (I stripped out the dates and times to save space).

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/instack_undercloud/undercloud.py", line 1180, in install
    _post_config(instack_env)
  File "/usr/lib/python2.7/site-packages/instack_undercloud/undercloud.py", line 1161, in _post_config
    auth_url)
TypeError: client() takes at most 1 argument (5 given)
Command 'instack-install-undercloud' returned non-zero exit status 1

This led me to the instack-undercloud project, and I found where we created the mistralclient instance. A bit of digging around in the git history then led me to this change in python-mistralclient. It refactored the client and changed the method signature in a backwards incompatible way.
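
If you need to do similar archaeology, something like the following should work. The clone URL and the file path are what I would expect for python-mistralclient, so treat them as assumptions and adjust if the layout differs.

# Clone python-mistralclient and walk the history of the client module,
# with patches, to spot backwards incompatible signature changes.
git clone https://git.openstack.org/openstack/python-mistralclient
cd python-mistralclient
git log -p --follow -- mistralclient/api/client.py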

Interestingly, the only reason this failed is that the Mistral patch had a Depends-On on a mistralclient patch in its commit message. This caused TripleO CI to build a new package for mistralclient at the same time and test against mistralclient master; otherwise it would have tested with the latest released mistralclient.

If this hadn't happened, the problem wouldn't have been noticed until a new mistralclient release was made and TripleO picked up that new version. At that point the TripleO gate would be broken. This is a good argument for having TripleO gate against mistralclient too.

What failures at the different stages might mean

During "Building mistral"

In TripleO CI, an rpm is built for mistral with the patch under review. Typically this can fail if there is a change that impacts packaging - like a new requirement. If that is happening, the new requirement needs to be added to the rpm spec. It's quite easy to do, but if there are any issues, ping me and I can help. The spec is on GitHub (linked below) and uses Gerrit for code reviews, so the process should be familiar.

https://github.com/rdo-packages/mistral-distgit

For example, patch 374813 failed at this stage because it added a dependency on jinja2; the failure went away once I added jinja2 to the rpm spec.

During "tripleo.sh -- Overcloud create started."

At this stage we have actually started to deploy OpenStack. The output can be very noisy once it gets going and Heat is doing most of the work. However, there will be various messages like "Started Mistral Workflow..." - so look for those; if there are failures here, it could well be a legitimate regression. Unfortunately you probably won't see much information at this point, but you will see the execution ID, which can help direct you to the correct part of the Mistral logs.
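
As a rough sketch, you can pull the workflow messages out of console.html and then chase the execution ID into the Mistral logs downloaded earlier; the execution ID placeholder below is obviously hypothetical.

# List the workflow messages (and their execution IDs) from the console output.
grep -n 'Started Mistral Workflow' console.html

# Then search the downloaded Mistral logs for a particular execution ID.
grep -rn '<execution-id>' mistral-logs/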

No Luck?

If all else fails, you can jump into #tripleo on freenode and ask for help. Feel free to try pinging me (I'm d0ugal) and, if I'm around, I'll help. As other cases pop up I'll try and cover them in additional posts. Since gating Mistral with TripleO is new, I don't have many useful examples yet.

