Yesterday I ran into a problem in an Active Directory environment: a DC stopped authenticating users.
The first problem was easy to find and is typical: the time on this DC was several years behind. I know, such a large time shift is not really common, but the symptoms were clear, so this was fixed very quickly.
Event:
Log Name: System
Source: Microsoft-Windows-Security-Kerberos
Date: 16.12.2013 00:59:36
Event ID: 5
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: <DOMAIN CONTROLLER>
Description:
The Kerberos client received a KRB_AP_ERR_TKT_NYV error from the server <DOMAIN CONTROLLER>$. This indicates that the ticket presented to that server is not yet valid (due to a discrepancy between ticket and server time. Contact your system administrator to make sure the client and server times are synchronized, and that the time for the Key Distribution Center Service (KDC) in realm <DOMAIN> is synchronized with the KDC in the client realm.
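To make the error a bit more concrete, here is a minimal Python sketch of the check behind KRB_AP_ERR_TKT_NYV: a ticket is rejected as "not yet valid" when its start time lies further in the future than the allowed clock skew, as seen from the server's clock. The sketch assumes the default Active Directory tolerance of 5 minutes. The function name and both timestamps are made up for illustration, since the exact clock value on the DC was not recorded.

```python
from datetime import datetime, timedelta

# Assumption: default Kerberos policy ("Maximum tolerance for computer
# clock synchronization") of 5 minutes, not changed in this domain.
MAX_SKEW = timedelta(minutes=5)

def ticket_not_yet_valid(ticket_start, server_now, max_skew=MAX_SKEW):
    """Return True if the server would see the ticket as coming from the future."""
    return ticket_start > server_now + max_skew

# Hypothetical values: a ticket issued at the real time, checked by a
# server whose clock is several years behind.
ticket_start = datetime(2013, 12, 16, 0, 59, 36)
server_clock_behind = datetime(2009, 1, 18, 0, 59, 36)
server_clock_ok = datetime(2013, 12, 16, 1, 1, 0)

print(ticket_not_yet_valid(ticket_start, server_clock_behind))  # True
print(ticket_not_yet_valid(ticket_start, server_clock_ok))      # False
```

With a shift of years, the ticket is far outside any sensible tolerance, which is why authentication against this DC failed completely.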
But I still didn’t know the root cause of this problem. After some thinking and research, I found out that the ESXi host the DC was running on was also in the past, with exactly the same time shift.
I didn’t really care that the ESXi host was behind, but since I had unchecked the option to sync the time with the guest (the DC in this case) via VMware Tools, I couldn’t understand why the guest still picked up the host’s time.
Then I found the information that explained the whole problem. Even when the time sync flag on a VM is unchecked, the VM’s “hardware” clock is still set to the ESXi host time whenever the VM starts.
I thought the problem was solved, but today another event popped up:
Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Date: 16.12.2013 22:57:37
Event ID: 2042
Task Category: Replication
Level: Error
Keywords: Classic
User: ANONYMOUS LOGON
Computer:
Description:
It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.
The reason that replication is not allowed to continue is that the two DCs may contain lingering objects. Objects that have been deleted and garbage collected from an Active Directory Domain Services partition but still exist in the writable partitions of other DCs in the same domain, or read-only partitions of global catalog servers in other domains in the forest are known as "lingering objects". If the local destination DC was allowed to replicate with the source DC, these potential lingering object would be recreated in the local Active Directory Domain Services database.
Time of last successful replication: 2009-01-18 22:46:36
Invocation ID of source directory server:
Name of source directory server: ._msdcs.
Tombstone lifetime (days): 180
The replication operation has failed.
User Action:
The action plan to recover from this error can be found at http://support.microsoft.com/?id=314282.
If both the source and destination DCs are Windows Server 2003 DCs, then install the support tools included on the installation CD. To see which objects would be deleted without actually performing the deletion run "repadmin /removelingeringobjects /ADVISORY_MODE". The eventlogs on the source DC will enumerate all lingering objects. To remove lingering objects from a source domain controller run "repadmin /removelingeringobjects ".
If either source or destination DC is a Windows 2000 Server DC, then more information on how to remove lingering objects on the source DC can be found at http://support.microsoft.com/?id=314282 or from your Microsoft support personnel.
If you need Active Directory Domain Services replication to function immediately at all costs and don't have time to remove lingering objects, enable replication by setting the following registry key to a non-zero value:
Registry Key: HKLM\System\CurrentControlSet\Services\NTDS\Parameters\Allow Replication With Divergent and Corrupt Partner
Replication errors between DCs sharing a common partition can prevent user and compter acounts, trust relationships, their passwords, security groups, security group memberships and other Active Directory Domain Services configuration data to vary between DCs, affecting the ability to log on, find objects of interest and perform other critical operations. These inconsistencies are resolved once replication errors are resolved. DCs that fail to inbound replicate deleted objects within tombstone lifetime number of days will remain inconsistent until lingering objects are manually removed by an administrator from each local DC.
Additionally, replication may continue to be blocked after this registry key is set, depending on whether lingering objects are located immediately.
Alternate User Action:
Force demote or reinstall the DC(s) that were disconnected.
We can imagine what the cause was, because the date of the last successful replication in the message speaks for itself.
The event description contains almost all the information needed to fix this (see also this page).
Because I knew there weren’t any lingering objects to clean up, I used the hard method with the registry key.
If it does not already exist, create a DWORD value “Allow Replication With Divergent and Corrupt Partner” under the key “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters” and set it to “1”.
After a few manual “Replicate Now” runs in “Active Directory Sites and Services” and a few minutes of waiting, replication between the DCs succeeded again.
To check the replication, the following command can be used on the affected DCs:
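As a quick sanity check, the gap can be worked out from the values in the event itself. This is just a small Python sketch; the variable names are mine, while the dates and the tombstone lifetime are copied from the event text.

```python
from datetime import datetime

# Values copied from the event 2042 text.
last_replication = datetime(2009, 1, 18, 22, 46, 36)
event_logged = datetime(2013, 12, 16, 22, 57, 37)
tombstone_lifetime_days = 180

# Whole days between the recorded last replication and the event.
gap_days = (event_logged - last_replication).days

print(gap_days)                            # 1793
print(gap_days > tombstone_lifetime_days)  # True
```

So from the DC's point of view, almost five years had passed since the last replication, roughly ten times the tombstone lifetime, even though in reality only the clock had jumped.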
repadmin /showrepl
Important: Do not forget to set the value back to “0” once replication works again.
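The registry change and its reversal can also be scripted. The following is only a sketch of how I would do it, not an official procedure: it builds the matching `reg add` commands in Python and, by default, only prints them. The function names and the `dry_run` switch are my own invention; the key and value name are the ones from the event text. Running it for real (with `dry_run=False`) is assumed to happen in an elevated prompt on the affected DC.

```python
import subprocess

# Key and value name as given in the event description.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Allow Replication With Divergent and Corrupt Partner"

def reg_add_args(data):
    """Build the argument list for `reg add` that sets the DWORD value to `data`."""
    return ["reg", "add", KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", str(data), "/f"]

def set_divergent_replication(enabled, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    args = reg_add_args(1 if enabled else 0)
    # Quote arguments containing spaces so the printed line can be pasted.
    print(" ".join(f'"{a}"' if " " in a else a for a in args))
    if not dry_run:
        subprocess.run(args, check=True)
    return args

set_divergent_replication(True)   # before forcing the replication
set_divergent_replication(False)  # afterwards, back to the safe default
```

Keeping both steps in one place makes it harder to forget the second one.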