
vMotion Troubleshooting

Today we look at the vMotion process and how to troubleshoot it. The focus is on vMotion failures and their causes. We also touch on related areas, such as how live migrations behave under low network bandwidth and a few of their failure causes.

Let's have a look at how you can troubleshoot and prevent vMotion failures in your environment.

Common Failure Causes

Network Configurations

  • vMotion network connectivity issues: ESXi hosts lose connectivity, cannot ping each other, or time out after 20 seconds.

  • Misconfiguration of which networks are vMotion-enabled on the ESXi hosts.
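The connectivity points above can be checked from the ESXi shell with vmkping. The following is a minimal sketch, not a definitive procedure: the function names are made up for illustration, and the vmknic, peer address and MTU are placeholders you would replace with your own values. It sends pings with the "don't fragment" flag and a payload sized to the MTU, so an MTU mismatch along the path shows up as a failed ping.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one frame at a given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
vmkping_payload() {
  echo $(( $1 - 28 ))
}

# Ping a peer's vMotion address through a chosen vmknic with
# "don't fragment" (-d) set and a full-size payload (-s).
# Arguments (all placeholders): vmknic, peer IP, MTU (default 1500).
check_vmotion_path() {
  vmk="$1"; peer="$2"; mtu="${3:-1500}"
  vmkping -I "$vmk" -d -s "$(vmkping_payload "$mtu")" -c 3 "$peer"
}
```

For example, `check_vmotion_path vmk3 192.168.200.21 9000` would test a jumbo-frame path, assuming vmk3 is a vMotion-enabled interface and 9000 is the intended MTU. If the hosts use the dedicated vMotion TCP/IP stack, vmkping also needs `-S vmotion`. To see which services a vmknic is tagged for, `esxcli network ip interface tag get -i vmk3` should list its tags.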

Storage

  • Datastore unreachable or experiencing an APD (All Paths Down) condition

  • I/O timeouts of 20 seconds or more

Overcommitted cluster resources, which can cause

  • Failure to allocate memory on the target host

  • Swapping that takes so long that the vMotion times out

So, before you start a vMotion, cross-check the points above: network connectivity, MTU mismatches, and the vMotion network configuration. If required, check the logs at the five levels below:

  • VPXD (vCenter Server): vpxd.log

  • VPXA (host agent): vpxa.log

  • Hostd (host): hostd.log

  • VMX (virtual machine): vmware.log

  • VMkernel (host): vmkernel.log

Identify the Operation ID (opID) from logs. That opID maps to a Migration ID. Using these two identifiers, we can follow a vMotion process looking at the corresponding log files.

Open a bash shell on the vCenter Server appliance and search the VPXD logs for the migration task to get its operation ID. The output exposes the log entry that contains the operation ID.
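The post does not show the exact command, so the following is only a sketch under stated assumptions: that vpxd logs the start of a migration task on a line containing both "relocate" and "BEGIN", and that the line carries an "opID=<value>" field in the same format as the hostd entries shown later in this post. The function name `vmotion_opids` is hypothetical.

```shell
#!/bin/sh
# Print the opID of every relocate task start found in the given
# vpxd log files.
# Assumption: task starts are logged with the words "relocate" and
# "BEGIN" and include an "opID=<value>" field that ends at a space
# or a closing bracket.
vmotion_opids() {
  grep -h 'relocate' "$@" | grep 'BEGIN' |
    sed -n 's/.*opID=\([^] ]*\).*/\1/p' | sort -u
}
```

On the appliance this could be called as `vmotion_opids /var/log/vmware/vpxd/vpxd*.log`. If several migrations were started, compare the timestamps of the matching entries to pick the right opID.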

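Now that you have the Operation ID (opID), use it to find the Migration ID, which is recorded in the host's hostd.log file.

Open hostd.log on the ESXi host and search for that opID. The opID in the command is an example, so substitute your own. The sample output that follows was captured on the source host during a different migration, so its opID differs: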

grep jzlgfw8g-11824-auto-94h-h5:70003561-20 /var/log/hostd.log | grep -i migrate

2021-03-28T13:26:01.680Z warning hostd[2099261] [Originator@6876 sub=PropertyCollector opID=esxui-a0d-0216 user=root] Session 52cc37fe-a19c-73f7-fca5-81875b84e225 has issued 2 concurrent blocking calls to the property collector. This is almost certainly an error in the logic of the client application.
2021-03-28T13:26:08.048Z info hostd[2098608] [Originator@6876 sub=Vcsvc.VMotion opID=kmt1muqa-1372-auto-125-h5:70000832-e6-01-76-0219 user=vpxuser:VSPHERE.LOCAL\Administrator] PrepareSourceEx [6005829444379731029], VM = '4'
2021-03-28T13:26:08.048Z info hostd[2098608] [Originator@6876 sub=Vcsvc.VMotionSrc.6005829444379731029 opID=kmt1muqa-1372-auto-125-h5:70000832-e6-01-76-0219 user=vpxuser:VSPHERE.LOCAL\Administrator] VMotionEntry: migrateType = 1

The Migration ID is the number in brackets after PrepareSourceEx, here 6005829444379731029. Now that we have the Migration ID, we can use it to extract information about this specific live migration from the vmkernel log files.
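The opID-to-Migration-ID lookup can be scripted. This is a rough sketch rather than an official method: both function names are made up for illustration, and the pattern relies on the "sub=Vcsvc.VMotionSrc.<id>" tag seen in the hostd sample. Treating a destination host's tag as having the same shape is an assumption.

```shell
#!/bin/sh
# Print the Migration ID(s) that hostd logged for one opID.
# Arguments: opID (or a prefix of it), hostd log path (optional).
# Assumption: the ID appears as "sub=Vcsvc.VMotion<Role>.<digits>".
migration_ids() {
  opid="$1"; log="${2:-/var/log/hostd.log}"
  grep -F "$opid" "$log" |
    sed -n 's/.*sub=Vcsvc\.VMotion[A-Za-z]*\.\([0-9][0-9]*\).*/\1/p' |
    sort -u
}

# Print the vmkernel log entries that mention a Migration ID.
# Arguments: Migration ID, vmkernel log path (optional).
vmkernel_entries() {
  grep -F "$1" "${2:-/var/log/vmkernel.log}"
}
```

With the sample data, `migration_ids kmt1muqa-1372-auto-125-h5:70000832` would print 6005829444379731029, and `vmkernel_entries 6005829444379731029` would then list the matching vmkernel lines. Not every vmkernel entry for a migration carries the ID (the "Migration start received" line in the sample does not), so also read the lines around the matches.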

2021-03-28T13:26:14.737Z cpu3:2107282)Migrate: vm 2107283: 3885: Setting VMOTION info: Source ts = 6005829444379731029, src ip = <192.168.200.63> dest ip = <192.168.200.21> Dest wid = 2101857 using SHARED swap, encrypted
2021-03-28T13:26:14.738Z cpu3:2107282)Hbr: 3561: Migration start received (worldID=2107283) (migrateType=1) (event=0) (isSource=1) (sharedConfig=1)
2021-03-28T13:26:14.738Z cpu2:2107428)MigrateNet: 1751: 6005829444379731029 S: Successfully bound connection to vmknic vmk3 - '192.168.200.63'
2021-03-28T13:26:14.739Z cpu3:2098167)MigrateNet: vm 2098167: 3263: Accepted connection from <192.168.200.21>
2021-03-28T13:26:14.739Z cpu3:2098167)MigrateNet: vm 2098167: 3351: dataSocket 0x430a32118f40 receive buffer size is 563272
2021-03-28T13:26:14.739Z cpu3:2098167)Migrate: 358: Remote machine is ESX 6.5 or newer.
2021-03-28T13:26:14.739Z cpu2:2107428)MigrateNet: 1751: 6005829444379731029 S: Successfully bound connection to vmknic vmk3 - '192.168.200.63'
2021-03-28T13:26:14.740Z cpu2:2107428)VMotionUtil: 5199: 6005829444379731029 S: Stream connection 1 added.
2021-03-28T13:26:14.740Z cpu2:2107428)MigrateNet: 1751: 6005829444379731029 S: Successfully bound connection to vmknic vmk4 - '192.168.200.164'
2021-03-28T13:26:14.750Z cpu2:2107428)VMotionUtil: 5199: 6005829444379731029 S: Stream connection 2 added.
2021-03-28T13:26:14.758Z cpu2:2107428)VMotion: 7441: 6005829444379731029 S: Detected 7Ms round-trip latency to remote host.

2021-03-28T13:26:18.251Z cpu5:2107283)VMotion: 5277: 6005829444379731029 S: Stopping pre-copy: only 9 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~205.501 MB/s, 155635100% t2d)
2021-03-28T13:26:18.323Z cpu5:2107291)VSCSI: 6602: handle 8192(vscsi0:0):Destroying Device for world 2107283 (pendCom 0)
2021-03-28T13:26:18.324Z cpu5:2107291)VSCSI: 6602: handle 8193(vscsi0:1):Destroying Device for world 2107283 (pendCom 0)
2021-03-28T13:26:18.324Z cpu5:2107291)NetPort: 1580: disabled port 0x2000009
2021-03-28T13:26:18.373Z cpu5:2107428)VMotionSend: 5095: 6005829444379731029 S: Sent all modified pages to destination (network bandwidth ~41.093 MB/s)
2021-03-28T13:26:18.473Z cpu5:2107282)Net: 3712: disconnected client from port 0x2000009
2021-03-28T13:26:18.473Z cpu5:2107282)Hbr: 3655: Migration end received (worldID=2107283) (migrateType=1) (event=1) (isSource=1) (sharedConfig=1)
2021-03-28T13:26:18.474Z cpu0:2107425)VMotionUtil: 7560: 6005829444379731029 S: Socket 0x430a3208a840 sendSocket pending: 976212/976220 snd 0 rcv
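When a migration is slow, or runs over a low-bandwidth link, the bandwidth figures in these entries are worth pulling out. The sketch below assumes the "network bandwidth ~<n> MB/s" wording seen in the sample entries above, and the function name `vmotion_bandwidth` is made up for illustration. It prints the timestamp and the reported MB/s for each such entry of one migration.

```shell
#!/bin/sh
# Print "<timestamp> <MB/s>" for every bandwidth figure that the
# vmkernel log reports for one migration.
# Arguments: Migration ID, vmkernel log path (optional).
# Assumption: entries start with the timestamp and use the
# "network bandwidth ~<n> MB/s" wording shown in the sample.
vmotion_bandwidth() {
  grep -F "$1" "${2:-/var/log/vmkernel.log}" |
    sed -n 's/^\([^ ]*\) .*network bandwidth ~\([0-9.]*\) MB\/s.*/\1 \2/p'
}
```

For the sample migration, `vmotion_bandwidth 6005829444379731029` would report 205.501 MB/s when pre-copy stops and 41.093 MB/s once all modified pages were sent.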

The vMotion operation completed successfully, and the logs reflect that. The log entries include a lot of detailed information that takes you through the process. If something fails, these logs are a good place to look for a possible cause.

For more information or help, feel free to contact or email me. Happy to assist further.

