d3m0n
Hi All
Over the last few weeks, we have had an issue whereby one of our servers
(it's a DC, which hosts apps and also a fileserver) exhibits the following
symptoms, seemingly at random:
- The application client on workstations will stop responding, eventually
giving an error relating to it not being able to access files on a UNC path
(\\server\appshare\system\blahblahetc) which is mapped to the apps drive, or
an error stating that another user has exclusive access to a certain file.
- The filesharing drive (which is mapped to a folder hosted on a completely
different logical partition) also stops responding and crashes explorer if
you try to access it from a workstation.
- The apps drive also exhibits the same behaviour.
- If you log onto the server by RDC and try to access a mapped drive from
that session, the result is the same - Explorer crashes.
- If you log into an RDC session and go through My Computer and drill down
to the actual shared folder itself, the same thing happens - Explorer crashes.
- However, if you log onto the actual console session on the server and
browse My Computer, drilling down to the folder, access seems to be ok (it
has only crashed explorer once when trying this method).
I've run counter logs on memory, disk (physical and logical) and the network
interface, with nothing obviously abnormal. I've enabled auditing on the server and
can't see anything in the security log to suggest that there are access
failures for any files on the server. The rest of the event logs show no
errors relating to anything like this.
I've also moved the user data onto another server, just in case this was a
factor, but the issue is still present, and the data is being hosted fine on
the other server, with no problems.
The only way to resolve the issue temporarily is to reboot the server. As
stated, the issue seems to randomly occur.
The one thing I've noticed, when going through services and restarting them
to see if any were at fault, is that the Server service hangs when I try
to stop and restart it, and I have to reboot the server anyway to get it
running again. This doesn't produce anything in the event logs either.
I'm fairly certain that it is somehow related to the Server service, as
every time it occurs, the Server service will hang if you try and restart it
(Netlogon, DFS and Computer Browser all stop ok). When the server is running
normally, the Server service can be stopped and restarted with no issue. But
as I said, none of this is flagging up anything in any event logs, and
nothing untoward shows up on Perfmon. So I'm really not sure from here how I
can go about monitoring this Server service to see what it's doing prior to
crashing.
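For what it's worth, this is the sort of polling I have in mind for watching the service. It is only a rough sketch and I haven't proven it catches anything: it assumes Python 3.7+ is available on the server and that "lanmanserver" is the short name of the Server service, and the function names, log file name and intervals are just placeholders I made up. It calls "sc query" on a timer and writes a timestamped line whenever the reported state changes (or when the query itself doesn't return in time).

```python
import re
import subprocess
import time
from datetime import datetime

# Assumption: "lanmanserver" is the short name of the Server service.
SERVICE = "lanmanserver"

# `sc query` prints a line such as "STATE : 4  RUNNING".
STATE_RE = re.compile(r"STATE\s*:\s*(\d+)\s+(\w+)")


def parse_state(sc_output):
    """Return the state name (e.g. 'RUNNING') from `sc query` output, or None."""
    match = STATE_RE.search(sc_output)
    return match.group(2) if match else None


def query_state(service=SERVICE, timeout=15):
    """Run `sc query` and return the state name.

    Returns 'NO_RESPONSE' if the command does not finish within `timeout`
    seconds, and 'UNKNOWN' if no STATE line could be found in its output.
    """
    try:
        result = subprocess.run(
            ["sc", "query", service],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "NO_RESPONSE"
    return parse_state(result.stdout) or "UNKNOWN"


def watch(interval=30, logfile="server_service.log"):
    """Poll forever, appending a timestamped line whenever the state changes."""
    last = None
    while True:
        state = query_state()
        if state != last:
            with open(logfile, "a") as log:
                log.write(f"{datetime.now().isoformat()} {state}\n")
            last = state
        time.sleep(interval)
```

Calling watch() from a console session would leave a log of state changes that can be lined up against the times the shares stop responding. It only shows what the Service Control Manager reports, so it may well say RUNNING right up until the hang.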
I have read that it could be a low-level TCP/IP stack issue with the
onboard Broadcom Gigabit NIC (it's a Dell PowerEdge server btw), so I have
replaced it with a dedicated Broadcom server NIC. That appears to have made
the issue occur less frequently (nothing for 7 days, then 1 crash, then
nothing for the last 9 days until today, when it happened twice in 2 hours),
but it is obviously still there.
Anyone got any ideas please?
Thanks in advance