
Sail Internet review

As one of the very first customers to get Sail wireless internet service installed, I am seeing much faster upload speeds, around 200 Mbit/s, with download around 300 Mbit/s — still a lot faster than the Comcast service I currently have, especially considering the cost and the lack of a data cap.

It took about an hour for the installer to get the antenna and pole mounted on the rooftop. The Ethernet cable runs from an existing vent into the attic, and the Sail technician helped me wire it into my central panel, where I have a hub for the Ethernet cables going to the other floors and rooms.

Switching from Comcast cable to Sail is painless: simply unplug the cable from the cable modem and plug in the new Ethernet cable from the PoE adapter, which supplies power to the antenna on the rooftop.

I am not sure whether there is any management interface/UI for end users to check wireless status such as signal strength.

The major drawback is network latency, which is much higher: greater than 20 ms, compared to Comcast cable, which is usually below 10 ms. It may not be the best option for online gamers or other latency-sensitive applications. Personally, download and upload speeds are my priority, so it is no big deal.

Overall, the installation experience was very smooth; I highly recommend Sail if it is available in your location.

I will see how things go as more people adopt the Sail wireless service; I assume the bandwidth is shared. So far so good.

Hackintosh 8700k z370n rx580 10.13.6

Hardware list

Motherboard : Gigabyte Z370N Wifi ITX

CPU:  Intel i7-8700K

CPU Fan: Le Grand Macho RT

Memory: CORSAIR – Vengeance LPX 32GB (2PK x 16GB) 2.6 GHz DDR4 (certified compatible with  z370n)

WIFI+Bluetooth board:  Broadcom BCM94352Z DW1560 6XRYC 802.11 AC 867 Mbps Bluetooth 4.0

Graphics : Sapphire NITRO+ Radeon RX 580 8GD5

PSU: CORSAIR RMx Series RM550x

Case: Fractal Design Define Nano S Black Silent Mini ITX Mini Tower

Keyboard: Apple Magic Keyboard & Magic Trackpad 2

Disk: XPG SX8200 PCIe NVMe Gen3x4 M.2 2280 480GB SSD (ASX8200NP-480GT-C) w/ Black XPG Heatsink

OS:  High Sierra 10.13.6

Replace the Wi-Fi module on the motherboard with the DW1560.

Installed 10.13.6 without the RX 580, following the guide at https://hackintosher.com/builds/gigabyte-z370n-wifi-itx-hackintosh-guide-4k-htpc-build/ (system definition iMac18,1, with internal graphics).

After a successful install, change the system definition to iMac18,3 before shutting down.

Quoting the guide:

“If using a dedicated Nvidia or AMD graphics card change ig-platform-id to 0x59120003 and use iMac 18,2/iMac 18,3”

Disable internal graphics and change the initial display output to PCIe slot 1 in the BIOS.

Shut down.

(Do not try to boot into the OS before everything completes. It will not work.)

Install the RX 580 card.

Connect the monitor via HDMI or DP; both worked fine for me.

Boot up.

The RX 580 worked out of the box, no patches applied. (Make sure the Wi-Fi antenna is installed; otherwise the connection will be very flaky even at close range.)

Before ordering parts, please make sure:

  1. The case is tall enough to hold the CPU fan.
  2. Either the 8700 or the 8700K is fine, depending on how much cheaper the 8700 is…
  3. There is no need for a standalone graphics card if you are not gaming… I am not sure how much it helps for video editing.

Enable live migration using local disk with Nova

Instructions:

  • Enable key-based authentication on the destination hypervisor, allowing root login from the source using a key.
  • Modify the nova configuration to pass extra arguments to libvirt.
  • Restart only nova-compute.
  • Launch the live migration from the nova CLI.

Check /etc/nova/nova.conf on the main controller (or, if cells are used, /etc/nova/nova.conf on the cell controllers) to make sure the UUID is used as the VM name; otherwise nova will complain that there is no VM named 'instance-xxxx' whenever the primary key changes in the nova database. Make sure instance_name_template uses the UUID in the [DEFAULT] section:

Force using the UUID as the instance name

#/etc/nova/nova.conf

instance_name_template=%(uuid)s

# Requires restarting nova-conductor on the main controllers or the cell controllers, depending on where the change is made.
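
For illustration, here is roughly how the two templates render (a plain Python sketch; the default template, instance-%08x, is derived from the integer database primary key, while %(uuid)s is stable even if that id changes):

# Python sketch of how the templates render (illustration only):
'instance-%08x' % 42
# -> 'instance-0000002a'   (default template, tied to the integer DB id)
'%(uuid)s' % {'uuid': 'a0a4f38d-ee65-4126-a90f-f4dcbb12d7e6'}
# -> 'a0a4f38d-ee65-4126-a90f-f4dcbb12d7e6'   (survives DB id changes)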

Verify that both the short name and the FQDN of the destination are resolvable; otherwise add them to the local hosts file /etc/hosts.
In case the hostname cannot be properly resolved
#/etc/hosts

192.168.1.100  test1 test1.localdomain.com
Generate a key pair for user root on the source HV.
Generate a key pair for user root
# Enable key-based root access on the destination.
# This is done on the source.
# Generate the keys.
ssh-keygen -t rsa

Copy the public key to the remote destination.
Transfer the public key to the destination
# Copy the key to the destination.
ssh-copy-id root@test1

Verify that we can ssh to the destination by its short name as root, to make sure key-based authentication is working and, at the same time, to accept and persist the destination host key on the source.
Verify key based authentication and accept host key on destination
ssh root@test1

Modify /etc/nova/nova.conf to enable live migration over SSH tunneling and block migration using local drives.
# /etc/nova/nova.conf
[DEFAULT]

live_migration_uri=qemu+ssh://%s/system
live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_NON_SHARED_INC
block_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_NON_SHARED_INC

Restart nova-compute

Restart nova-compute service
service nova-compute restart

Now, migration through the nova CLI should work.

Nova CLI
nova live-migration --block-migrate a0a4f38d-ee65-4126-a90f-f4dcbb12d7e6 hv2
The VM state will change to migrating, and back to active again when the process completes.
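
The same migration can also be triggered from Python via python-novaclient; a minimal sketch, assuming admin credentials and a v2 Keystone endpoint (the username, password, tenant, and URL below are placeholders — adjust for your deployment):

from novaclient import client

# Hypothetical credentials/endpoint; substitute your own.
nova = client.Client('2', 'admin', 'ADMIN_PASSWORD', 'admin',
                     'http://controller:5000/v2.0')
server = nova.servers.get('a0a4f38d-ee65-4126-a90f-f4dcbb12d7e6')
# block_migration=True corresponds to --block-migrate on the CLI.
server.live_migrate(host='hv2', block_migration=True, disk_over_commit=False)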
Concerns:

Migration across cells is not supported.

The Havana version of nova-compute has a bug dealing with mounts via nbd, which requires a code change:
Code fix in the Havana version of nova-compute
nova/virt/disk/mount/nbd.py
# Added self.partition = 1 at line 72:

if not os.path.exists('/sys/block/nbd0'):
    LOG.error(_('nbd module not loaded'))
    self.error = _('nbd unavailable: module not loaded')
    return None
devices = self._detect_nbd_devices()
random.shuffle(devices)
device = self._find_unused(devices)
if not device:
    # really want to log this info, not raise
    self.error = _('No free nbd devices')
    return None
self.partition = 1  # the added line
return os.path.join('/dev', device)
...

# then restart nova-compute

Quick qmailtoaster migration to Postfix

Assume you are running example.com on qmailtoaster.

  1. Install the latest iRedMail, which significantly simplifies the whole installation and configuration process.
  2. Dump the vpopmail database and import it into the new DB.
  3. Copy /home/vpopmail/domains/* to /var/vmail1.
  4. chown -R vmail:vmail /var/vmail1
  5. Run the SQL below to move all the accounts into the Postfix vmail.mailbox table.
  6. Done.

-- substring(r.pw_dir,24) strips the leading '/home/vpopmail/domains/' prefix (23 characters);
-- adjust the domain, source table name, and storage paths for your site.
INSERT INTO vmail.mailbox ( username      , password                               , name      ,language  , storagebasedirectory   , storagenode   , maildir                , quota , domain           , transport , department , rank     , employeeid , isadmin , isglobaladmin , enablesmtp , enablesmtpsecured , enablepop3 , enablepop3secured , enableimap , enableimapsecured , enabledeliver , enablelda , enablemanagesieve , enablemanagesievesecured , enablesieve , enablesievesecured , enableinternal , enabledoveadm , `enablelib-storage` , `enableindexer-worker` , enablelmtp , enabledsync , enablesogo  , lastlogindate       , lastloginipv4 , lastloginprotocol , settings                  , passwordlastchange  , created , modified, expired               , active , local_part )
SELECT                    concat(r.pw_name,'@example.com') , r.pw_passwd , r.pw_gecos, 'en_US'  , '/var/vmail'           , 'vmail1'      , substring(r.pw_dir,24) ,     0 ,'example.com' , ''        , ''         , 'normal' , ''       ,       0 ,             0 ,          1 ,                 1 ,          1 ,                 1 ,          1 ,                 1 ,             1 ,         1 ,                 1 ,                        1 ,           1 ,                  1 ,              1 ,             1 ,                 1 ,                    1 ,          1 ,           1 ,          1  , '1970-01-01 01:01:01' ,           0 , ''                , '' , '1970-01-01 01:01:01' , now()   , now()   , '9999-12-31 00:00:00' ,      1 , r.pw_name
FROM vpopmail.example_com r;

nova suspend issue

nova suspend kept ending in an error state.

Looking at the libvirt log, I saw this error: qemuMigrationUpdateJobStatus:946 : operation failed: domain save job: unexpectedly failed

The save images were being generated in the following directory:

/var/lib/libvirt/qemu/save

/var is mounted on a separate partition with limited space, so we ran out of space on /var again…

The fix was to replace the save directory with a soft link pointing to a directory on a partition with enough space.

Quickly detect TCP disconnection

An application may block on TCP socket I/O after the connection has been dropped. Without a separate thread monitoring an application-level heartbeat, and without TCP's native keepalive enabled and tuned, it can take up to 2 hours for the application to detect the disconnection.

TCP keepalive is an optional feature and has to be explicitly enabled when creating a new socket. Tune the following using sysctl, and persist the changes in /etc/sysctl.conf as well.

net.ipv4.tcp_keepalive_time=10
net.ipv4.tcp_keepalive_probes=2
net.ipv4.tcp_keepalive_intvl=3

The above means a keepalive probe is sent after 10 seconds of idle time. If no response is received for a probe, another probe is sent every 3 seconds. If 2 probes go unanswered, the connection is regarded as dropped and the blocking I/O is signaled. The application thus gets a chance to return from the blocking I/O with an error or exception and re-establish a new connection.
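
For reference, a minimal Python sketch of enabling keepalive on a single socket; the setsockopt values mirror the sysctl settings above but apply only to this socket (the TCP_KEEP* constants are Linux-specific, and the address and port are made up):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Keepalive is off by default; enable it for this socket.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Per-socket overrides of the system-wide sysctl defaults (Linux only).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)  # idle seconds before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # seconds between unanswered probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 2)    # missed probes before the connection is dropped
sock.connect(('192.168.1.100', 5672))  # example address/port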

Further tuning can be done on TCP_USER_TIMEOUT and the FIN-WAIT, FIN-WAIT-2, and CLOSE-WAIT timers; for more details please read the following pages:

  • Patching RMQ issue
  • "RabbitMQ Best Practices"
  • TCP_USER_TIMEOUT
  • TCP keepalive overview

Cinder Volume attachment Issue

OpenStack is flawed by design: operations are not fully transactional, which can leave state in the different components out of sync. One example is frequent volume attachment/detachment, especially when network glitches make the MQ unstable; strange errors then start showing up. This is because multiple components are involved — cinder, nova, and the compute node — with state maintained in each of them, and there is no master. Once a message gets delayed or lost, they can easily fall out of sync, and there is no transaction manager coordinating the operation. In general, an operation across multiple components requires a global transaction manager to guarantee consistency. This could also be achieved using 2PC built into the framework, as sketched below.
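
To make the idea concrete, here is a toy two-phase-commit coordinator sketch in Python. This is purely illustrative, not OpenStack code; the Participant interface and method names are made up:

# Toy 2PC sketch (illustrative only).
class Participant(object):
    """One component (e.g. cinder, nova, compute) taking part in the operation."""
    def prepare(self, txn_id):
        # Vote yes only if the local state change is guaranteed to commit later.
        raise NotImplementedError
    def commit(self, txn_id):
        raise NotImplementedError
    def rollback(self, txn_id):
        raise NotImplementedError

def two_phase_commit(participants, txn_id):
    prepared = []
    # Phase 1: ask every participant to prepare; any failure aborts all.
    try:
        for p in participants:
            p.prepare(txn_id)
            prepared.append(p)
    except Exception:
        for p in prepared:
            p.rollback(txn_id)
        return False
    # Phase 2: everyone voted yes, so commit everywhere.
    for p in participants:
        p.commit(txn_id)
    return True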

Network connection reset

Recently we have been seeing many connection reset issues and did some root-cause analysis.

This mainly happened with long-lived connections: pooled database connections, MQ connections, and RPC reply connections.

All of them were finally fixed by raising the idle timeouts in the VIP/LB and firewall profiles.

The timeouts were too short for connections that may have no activity during that period.

RabbitMQ rescue

The RabbitMQ cluster showed bad partitioning, so I tried to rebuild it.

First of all, I had issues with stopping and starting.

If start/stop hangs:

  1. killall -u rabbitmq -q
  2. Back up the existing rabbitmq.config (/etc/rabbitmq/rabbitmq.config) to rabbitmq.org.
  3. Remove all other cluster members from rabbitmq.config and keep only the current host.
  4. rm -rf /var/lib/rabbitmq/mnesia (verify this path first: check /etc/rabbitmq/rabbitmq-env.conf to see where RABBITMQ_MNESIA_BASE points. Another thing to pay attention to is the permissions on this directory: the owner has to be rabbitmq, otherwise RabbitMQ will fail to start, because it runs as user rabbitmq and cannot create directories and files without sufficient permission.)
  5. service rabbitmq-server restart

Now the server should start without issue. Let's do further cleanup.

  1. rabbitmqctl stop_app
  2. rabbitmqctl force_reset
  3. rabbitmqctl start_app
  4. rabbitmqctl stop

There should be no errors.

Now restore /etc/rabbitmq/rabbitmq.config from the rabbitmq.org backup and start RabbitMQ again.

  1. service rabbitmq-server start

Now we should have all nodes running.

Assuming node001 is the master, start the service on node001:

  1. rabbitmqctl start_app

On all other nodes:

  1. rabbitmqctl stop_app
  2. rabbitmqctl join_cluster rabbit@node001
  3. rabbitmqctl start_app

Check cluster status on all nodes:

  1. rabbitmqctl cluster_status

There should be no more partitions, and all nodes should be running:

Cluster status of node rabbit@node001 ...
[{nodes,[{disc,[rabbit@node001,rabbit@node002,rabbit@node003]}]},
 {running_nodes,[rabbit@node001,rabbit@node002,rabbit@node003]},
 {partitions,[]}]

partitions should be an empty list [].

Don't forget to re-enable the HA queue policy, etc.:

rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Another option is cluster_partition_handling in the configuration, which changes how partition recovery works (the default is ignore; see the example after this list):

  • pause_minority
  • {pause_if_all_down, [nodes], ignore | autoheal}
  • autoheal
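
For example, a minimal /etc/rabbitmq/rabbitmq.config entry in the classic Erlang-term format (pause_minority chosen here only for illustration) might look like:

[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].

pause_minority is generally the safer choice for clusters of three or more nodes, since the minority side pauses instead of continuing with stale state.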

Combined with a load balancer, things can behave strangely. I highly recommend not fronting the MQ servers with a VIP.

Another commonly seen issue is RabbitMQ no longer responding because messages have flooded particular queues that have no consumer. It may completely stop responding or delay message delivery. A workaround is to put a size limit on the queue to force RMQ not to exhaust memory (by default, once the limit is reached the oldest messages are dropped from the head of the queue):

rabbitmqctl set_policy POLICY_NAME "QUEUE_NAME" '{"max-length":100}' --apply-to queues

Good luck

Optimize to extreme

Application layers:

  • I/O- or CPU-intensive; traffic and CPU usage patterns; how the patterns change over time
  • service dependencies and the strength of the connections between services
  • internal optimization: efficient buffer utilization, reducing buffer memory copies between user and kernel space, async I/O where possible

IaaS layers:

  • characteristics of the devices: SSD vs. spinning HD, network bandwidth/QoS
  • capacities: availability of CPU cores, memory, disk space, and network

Smart scheduler:

  • scheduling based on the knowledge provided by both layers
  • dynamically consolidating/scaling based on current load