Monthly Archives: November 2015

Cinder Volume attachment Issue

OpenStack is flawed by design: operations are not fully transactional, which can leave the state in different components out of sync. One example is frequent volume attachment and detachment, especially when network glitches make the message queue (MQ) unstable; strange errors then start to show up. This is because multiple components are involved (cinder, nova, and compute), and each of them maintains its own state. There is no master. Once a message is delayed or lost, the components can easily fall out of sync. There is no transaction manager coordinating the operation. In general, an operation that spans multiple components requires a global transaction manager to guarantee consistency. This could also be achieved with two-phase commit (2PC) built into the framework.
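To illustrate what 2PC would buy here, below is a minimal in-memory sketch in Python. It is not how OpenStack works and not a real transaction manager: the Participant class, the two_phase_commit function, and the "attached"/"detached" states are all hypothetical names made up for this example, and a lost or delayed message is simply modelled as a "no" vote in the prepare phase.

```python
class Participant:
    """Hypothetical component (a stand-in for cinder, nova, ...) with local state."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "detached"  # committed, visible state
        self.pending = None      # staged change, not yet visible

    def prepare(self, new_state):
        # Phase 1: stage the change and vote yes/no.
        # A lost or delayed message is modelled as a "no" vote.
        if self.fail_on_prepare:
            return False
        self.pending = new_state
        return True

    def commit(self):
        # Phase 2 (success): make the staged change visible.
        self.state = self.pending
        self.pending = None

    def rollback(self):
        # Phase 2 (failure): discard the staged change.
        self.pending = None


def two_phase_commit(participants, new_state):
    """Apply new_state to every participant, or to none of them."""
    prepared = []
    for p in participants:
        if p.prepare(new_state):
            prepared.append(p)
        else:
            # One "no" vote aborts the whole operation.
            for q in prepared:
                q.rollback()
            return False
    for p in participants:
        p.commit()
    return True
```

With two healthy participants the call returns True and both end up "attached"; if any participant votes no, all of them stay "detached", so they never disagree about the volume state.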

Network connection reset

Recently we have been seeing many connection resets and did some root-cause analysis.

The resets mainly hit long-lived connections, such as pooled database connections, MQ connections, and RPC reply connections.

In the end, all of them were fixed by raising the idle timeout in the VIP/LB and firewall profiles.

The timeout was too short for connections that may have no activity for longer than the timeout period.
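A quick way to reason about this is to compare the idle timeout of every hop on the path with the longest quiet period a connection can have. The Python sketch below does just that; the function name, the hop names, and all numbers are made up for illustration and are not taken from our environment.

```python
def hops_that_will_reset(idle_timeouts, max_idle_seconds):
    """Return the hops whose idle timeout is not longer than the longest quiet period.

    idle_timeouts: dict mapping a hop name to its idle timeout in seconds.
    max_idle_seconds: longest time a connection may stay silent.
    """
    return sorted(
        name
        for name, timeout in idle_timeouts.items()
        if timeout <= max_idle_seconds
    )


# Hypothetical example: a pooled connection that can sit idle for 30 minutes.
path = {"vip": 300, "firewall": 3600}
print(hops_that_will_reset(path, 1800))
```

Any hop listed in the result needs a higher timeout, or the connection needs some traffic before that timeout expires.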

 

RabbitMQ rescue

The RabbitMQ cluster was badly partitioned, so I tried to rebuild it.

First of all, I had trouble stopping and starting the server.

If start/stop hangs:

  1. killall -u rabbitmq -q
  2. Back up the existing rabbitmq.config to rabbitmq.org (/etc/rabbitmq/rabbitmq.config).
  3. Remove all other cluster members from rabbitmq.config and keep only the current host.
  4. rm -rf /var/lib/rabbitmq/mnesia (Verify the path first: check /etc/rabbitmq/rabbitmq-env.conf to see where RABBITMQ_MNESIA_BASE points. Also pay attention to the permissions on this directory. The owner has to be rabbitmq; otherwise RabbitMQ will fail to start, because it runs as user rabbitmq and cannot create directories and files without sufficient permission.)
  5. service rabbitmq-server restart

Now the server should start without issue. Next, let’s clean up further.

  1. rabbitmqctl stop_app
  2. rabbitmqctl force_reset
  3. rabbitmqctl start_app
  4. rabbitmqctl stop

There should be no errors.

Now restore /etc/rabbitmq/rabbitmq.config from the backup rabbitmq.org and start RabbitMQ again.

  1. service rabbitmq-server start

Now all nodes should be running.

Assume node001 is the master and start the app on node001:

  1. rabbitmqctl start_app

 

On all other nodes:

  1. rabbitmqctl stop_app
  2. rabbitmqctl join_cluster rabbit@node001
  3. rabbitmqctl start_app

 

Check cluster status on all nodes:

  1. rabbitmqctl cluster_status

There should be no more partitions, and all nodes should be running:

Cluster status of node rabbit@phx04rmqa001 …
[{nodes,[{disc,[rabbit@node001,rabbit@node002, rabbit@node003]}]},
{running_nodes,[rabbit@node001,rabbit@node002, rabbit@node003]},
{partitions,[]}]

The partitions list should be empty: [].
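If you want to script this check, the Python sketch below looks for a non-empty partitions term in the text printed by rabbitmqctl cluster_status. It assumes the Erlang-term output format shown above (other RabbitMQ versions may print something different), and has_partitions is a hypothetical helper name, not part of any RabbitMQ tooling.

```python
import re

# Matches the partitions term at all, and the empty form {partitions,[]}.
ANY_PARTITIONS_TERM = re.compile(r"\{partitions,")
EMPTY_PARTITIONS_TERM = re.compile(r"\{partitions,\s*\[\s*\]\}")


def has_partitions(status_text):
    """Return True if cluster_status output reports at least one partition.

    Raises ValueError when the output has no partitions term at all,
    so an unexpected format is not mistaken for a healthy cluster.
    """
    if not ANY_PARTITIONS_TERM.search(status_text):
        raise ValueError("no partitions term found in cluster_status output")
    return EMPTY_PARTITIONS_TERM.search(status_text) is None
```

The text could come from something like subprocess.check_output(["rabbitmqctl", "cluster_status"]) decoded to a string, run on each node in turn.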

 

Don’t forget to enable HA queues, etc.

rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Another option is the cluster_partition_handling setting in the configuration, which changes how partition recovery works (the default is ignore):

  • pause_minority
  • {pause_if_all_down, [nodes], ignore | autoheal}
  • autoheal

Combined with a load balancer, things can behave strangely. I highly recommend not putting a VIP in front of the MQ servers.

Another commonly seen issue is RabbitMQ becoming unresponsive because messages flood particular queues that have no consumer. It may stop responding completely or delay message delivery. A workaround is to put a size limit on the queue so that RabbitMQ does not exhaust memory:

rabbitmqctl set_policy POLICY_NAME "QUEUE_NAME" '{"max-length":100}' --apply-to queues
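Quoting the JSON definition by hand is easy to get wrong, especially when the command is copied from a web page that turned the quotes into typographic ones. A small Python sketch like the one below builds the argument list instead and lets json.dumps produce the definition. The helper name set_policy_command and the example policy and queue names are hypothetical; the function only builds the command and does not run it.

```python
import json


def set_policy_command(name, pattern, definition, apply_to="queues"):
    """Build the argument list for `rabbitmqctl set_policy`.

    definition is a plain dict; json.dumps turns it into the JSON string
    rabbitmqctl expects, so no manual quoting is needed.
    """
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", apply_to,
        name, pattern, json.dumps(definition),
    ]


# Hypothetical policy capping one queue at 100 messages.
cmd = set_policy_command("limit-notifications", "^notifications$", {"max-length": 100})
print(cmd)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a node where rabbitmqctl is available; passing a list rather than a shell string means the shell never sees the quotes.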

 

 

Good luck