Jump to content
Mumble forums

Murmur Randomly Disconnecting Users


nacey
 Share

Recommended Posts

I'm running Murmur from the 1.2.3 installer on a Windows Server 2008 RC2 box. I have it set to run as a service using ANYSRV. The setup was relatively painless and overall I am very impressed with the Mumble / Murmur combination.


However, I am having some issues with the stability of connections to the server. We typically have around ~10-15 on the server at peak times. The server machine is a VPS. I'm not certain of the exact CPU spec but I'm not seeing any issues with CPU load and it has 1GB of RAM which generally sits around two thirds used. It has a 100/100 network connection and I once had everyone on the server hold down the Push-to-Talk key and the network usage barely hit 5%. The only other thing that this machine does is host a basic website and forums for our WoW guild. In short, the server should be plenty good enough for our usage.


What is happening is that randomly (and I literally cannot find any pattern to it at all), a person or persons will get disconnected from the server, with the message "The remote host closed the connection". They reconnect automatically a few seconds later. This isn't accompanied by any kind of lag of degradation of voice quality, and everyone else on the server is able to carry on using it throughout. It will typically disconnect two or three people at the same time, but sometimes it is only one and occasionally I've seen it do six or seven people. This is happening maybe once every 20mins or so, but it's not regular or predictable. It will be different people each time, it affects everyone fairly evenly, and I'm really at a loss as to what is causing it.


When I check the server logs, I see that the server is constantly giving timeout errors, approximately every fifteen seconds. Here's a five minute sample from 06:00 this morning (when I doubt anyone was even on the server):

<W>2010-12-16 06:00:14.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:00:14.937 1 => <2:(-1)> Timeout
<W>2010-12-16 06:00:30.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:00:30.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:00:45.953 1 => <1:(-1)> Timeout
<W>2010-12-16 06:00:45.953 1 => <2:(-1)> Timeout
<W>2010-12-16 06:01:01.453 1 => <1:(-1)> Timeout
<W>2010-12-16 06:01:01.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:01:16.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:01:16.937 1 => <2:(-1)> Timeout
<W>2010-12-16 06:01:32.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:01:32.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:01:47.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:01:47.953 1 => <2:(-1)> Timeout
<W>2010-12-16 06:02:03.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:02:03.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:02:18.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:02:18.953 1 => <2:(-1)> Timeout
<W>2010-12-16 06:02:34.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:02:34.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:02:49.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:02:49.937 1 => <2:(-1)> Timeout
<W>2010-12-16 06:03:05.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:03:05.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:03:20.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:03:20.968 1 => <2:(-1)> Timeout
<W>2010-12-16 06:03:36.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:03:36.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:03:51.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:03:51.953 1 => <2:(-1)> Timeout
<W>2010-12-16 06:04:07.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:04:07.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:04:22.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:04:22.937 1 => <2:(-1)> Timeout
<W>2010-12-16 06:04:38.437 1 => <1:(-1)> Timeout
<W>2010-12-16 06:04:38.453 1 => <2:(-1)> Timeout
<W>2010-12-16 06:04:53.937 1 => <1:(-1)> Timeout
<W>2010-12-16 06:04:53.937 1 => <2:(-1)> Timeout

This is constant and not affected by the number of active users or anything else. I don't know if this is connected with the disconnects or not.


When someone is disconnected I see this in the log:

<W>2010-12-15 23:34:09.140 1 => <41:tellezara(-1)> Connection closed: The remote host closed the connection [1]

 

I don't really know where to start with this. It seems like it must be a server issue because it's affecting everyone but I can't see anything wrong on the server. I've tried totally uninstalling/reinstalling Murmur but that didn't help at all. This was happening on 1.2.2 and now 1.2.3 is the same.


Any ideas?

Link to comment
Share on other sites

Thanks for the quick reply. Forgot to mention that, I had a few users (and myself) try TCP mode, it made no difference.


Can you think of a way that I can confirm whether it is a network issue? I will put in a ticket about it with my host.


What about the timeout issue? It's looks like it's happening every 15.5 seconds exactly, doesn't seem like a network issue to me. What would cause that error?

Link to comment
Share on other sites

  • Moderators

The timeout log spam is probably a bug. I still don't know how to reproduce it and never saw it on my Linux server. I'm working on a patch which could fix it but I don't want to commit it for 1.2.3 because I don't want to risk breaking other things.

Link to comment
Share on other sites

The timeout spam is not happening when people are disconnected, as you can see from the log extract above, those entries are showing up every 15.5 seconds. the disconnections are random and far less frequent, but still often enough to be annoying, every 20mins or so.


If you have a patch/build that you'd like testing then I'd be happy to help, let me know.


My host says there are not issues with the connection. I don't know what else to try, the disconnects just seem so arbitrary.

Link to comment
Share on other sites

I could really do with some ideas for how I can troubleshoot this.


Ventrilo was slow, inefficient, laggy, expensive, sends packets to arbitrary IPs in Brazil or somewhere, and generally I never want to see it again. But the problem is: it was stable. My guild are getting fed up with this and wants Vent back because they can't stand being disconnected randomly for no reason.

Link to comment
Share on other sites

No, Vent doesn't sell server licenses to individuals, you have to be a host/reseller. Well, you can get a free four-user license or something but that's useless. So our Vent was hosted completely separately.


I did come up with one idea. I'm running Mumble on the server itself and have left it connected to Murmur (i.e. localhost). So now we'll wait, and if that gets disconnected then we know it's a software issue. If it goes a couple of days with only disconnecting other people then we can pretty much say that it is a network issue.


Early indications suggest that it is a network issue. I will poke my host again. Any other ideas welcome though.


Edit: Also the timeout spam seems to be gone since I updated to the most recent RC build. So either it was a bug in the previous version or it was just some corruption on my side. Either way that's good, makes the logs a lot easier to check!

Link to comment
Share on other sites

As far as I know, this patch has not been applied yet.


There is something strange about the timeout spamming in your log. They are related to two users that have no username and are the first two to be connected to the server. I do not know were these two users come from, so lets explain the lines 1=> <1:(-1) and 1=> <2:(-1) to you. First number (1) means server number, the second number (1,2) means the uiSession, this is normally followed username which is empty here and the (-1) is the Id of the user meaning here it is an unregistered user. Maybe you now have a better idea what could have been going on.


BTW, were you able of registering to the murmur server? Someone on the IRC channel indicated that it could be a authentication problem, caused by an external program used for authentication.

Computer specs: AMD FX-8320, 8GB DDR3-SDRAM, AMD Radeon HD 7950, Asus Xonar D1, Windows 7 Ultimate 64bit/Debian Jessie AMD64.

Link to comment
Share on other sites

Well the local client was just kicked along with 4-5 other people so we can rule out any networking issues. There's definitely something going on software-side.


I can register fine with the server, most of my users are registered. I'm not using any kind of server manager, just have Murmur running and people registered themselves through Mumble.


When the mass disconnect happens, it isn't at a point where someone is connecting or anything like that. It's just totally random as far as I can tell.


The timeout spam has stopped so I assume whatever was causing that is fixed. It seemed to stop when I updated to the most recent RC build (moving from RC-1-4 to RC-1-8). The disconnections also seems less frequent, it's more like once every 4-5 hours now, so it's a lot better. Still pretty odd though, and annoying if it comes at a bad time, would like to resolve it.


But not sure what to check next :/ .


Here's the log from when the local client (ServerTest) was kicked (except I masked the IP endings):

 

<W>2010-12-20 20:42:13.787 1 => <17:Xinglon(23)> Timeout
<W>2010-12-20 20:42:13.795 1 => <18:Fyria(3)> Timeout
<W>2010-12-20 20:42:13.804 1 => <2:Tom(13)> Timeout
<W>2010-12-20 20:42:13.849 1 => <3:ServerTest(27)> Timeout
<W>2010-12-20 20:42:13.875 1 => <8:Hashmal(-1)> Timeout
<W>2010-12-20 20:42:13.890 1 => <11:Johnz(-1)> Timeout
<W>2010-12-20 20:42:13.908 1 => <12:Borat(19)> Timeout
<W>2010-12-20 20:42:13.922 1 => <15:Sharp(5)> Timeout
<W>2010-12-20 20:42:13.943 1 => <17:Xinglon(23)> Connection closed:  [-1]
<W>2010-12-20 20:42:13.982 1 => <18:Fyria(3)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.000 1 => <2:Tom(13)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.010 1 => <3:ServerTest(27)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.024 1 => <8:Hashmal(-1)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.034 1 => <11:Johnz(-1)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.061 1 => <12:Borat(19)> Connection closed:  [-1]
<W>2010-12-20 20:42:14.079 1 => <15:Sharp(5)> Connection closed:  [-1]
<W>2010-12-20 20:42:23.995 1 => <29:(-1)> New connection: 93.97.x.x:65064
<W>2010-12-20 20:42:24.023 1 => <30:(-1)> New connection: 127.0.0.1:49708
<W>2010-12-20 20:42:24.024 Connection: Failed to add flow to QOS
<W>2010-12-20 20:42:24.083 1 => <30:(-1)> Client version 1.2.3 (Win: 1.2.3-rc1-8-g979bf30)
<W>2010-12-20 20:42:24.112 1 => <30:ServerTest(27)> Authenticated
<W>2010-12-20 20:42:24.121 1 => <29:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:24.152 1 => <29:Tom(13)> Authenticated
<W>2010-12-20 20:42:24.165 1 => <31:(-1)> New connection: 90.200.x.x:53012
<W>2010-12-20 20:42:24.258 1 => <32:(-1)> New connection: 90.210.x.x:59631
<W>2010-12-20 20:42:24.316 1 => <31:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:24.340 1 => <31:Borat(19)> Authenticated
<W>2010-12-20 20:42:24.377 1 => <33:(-1)> New connection: 86.10.x.x:60605
<W>2010-12-20 20:42:24.426 1 => <32:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:24.439 1 => <32:Hashmal(-1)> Authenticated
<W>2010-12-20 20:42:24.697 1 => <33:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:24.734 1 => <33:Fyria(3)> Authenticated
<W>2010-12-20 20:42:25.012 1 => <34:(-1)> New connection: 86.10.x.x:54143
<W>2010-12-20 20:42:25.242 1 => <35:(-1)> New connection: 82.4.x.x:50423
<W>2010-12-20 20:42:25.505 1 => <35:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:25.547 1 => <35:Xinglon(23)> Authenticated
<W>2010-12-20 20:42:25.579 1 => <32:Hashmal(-1)> Moved Hashmal:32(-1) to Animal Haus[5:2]
<W>2010-12-20 20:42:25.685 1 => <34:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:25.732 1 => <34:Sharp(5)> Authenticated
<W>2010-12-20 20:42:47.847 1 => <36:(-1)> New connection: 93.92.x.x:50874
<W>2010-12-20 20:42:48.002 1 => <36:(-1)> Client version 1.2.2 (Win: 1.2.2)
<W>2010-12-20 20:42:48.013 1 => <36:Johnz(-1)> Authenticated
<W>2010-12-20 20:42:48.753 1 => <36:Johnz(-1)> Moved Johnz:36(-1) to Putin's Vodka Gazebo[4:2]
<W>2010-12-20 20:42:52.172 1 => <35:Xinglon(23)> Moved Xinglon:35(23) to Putin's Vodka Gazebo[4:2]
<W>2010-12-20 20:43:57.752 1 => <35:Xinglon(23)> Moved Xinglon:35(23) to Animal Haus[5:2]

Link to comment
Share on other sites

I think I've found the cause, but don't have a fix yet. Event Viewer shows entries like this:

 

The system time has changed to ‎2010‎-‎12‎-‎20T19:39:29.500000000Z from ‎2010‎-‎12‎-‎20T19:36:41.500000000Z.

This is happeneing every couple of hours, and the disconnects are occuring within one second of the updated time. I think what is happening is that the system clock is jumping, Murmur is thinking that it hasn't had a response from certain users (presumably those in the middle of a packet exhcnage or whatever) for 90 seconds, and hence it's kicking those users.


On face value this makes complete sense to me. Would others agree?

Link to comment
Share on other sites

You have found the problem.


When you have overclocked your computer the clock drift can be explained by that. If this is not the case you could change the interval that your computer checks his time on the internet.

Computer specs: AMD FX-8320, 8GB DDR3-SDRAM, AMD Radeon HD 7950, Asus Xonar D1, Windows 7 Ultimate 64bit/Debian Jessie AMD64.

Link to comment
Share on other sites

  • Administrators

Hahaha,

yes.

Check your system for updating time via time timeservers and disable it.

As that's quite a jump (3 mins) if its regular, you'd have to check if clock runs totally off then.

Maybe it's also some other script/app changing the system time?

Link to comment
Share on other sites

Well I'm glad to have finally nailed the cause at least! It's one of those things that is so obvious once you see it and yet it's hard to make the link in the first place becuase it's not something that you immediately suspect.


My server is a VPS, time sync issues are not uncommon in virtual environments but this seems unusually excessive (it's about 2.5% slow).


I have disabled the Windows time update thing and set up proper NTP as a temporary fix, though that may not actually help since I think it might still skip by afew seconds. I have put a ticket in with my host to get the underlying issue resolved (which presumably lies on the host machine).

Link to comment
Share on other sites

I think my host has resolved the time issue so I'm going to stick with the current build for a couple of days just to be sure that there was nothing else going on.


But your fast update skills are really appreciate, it's great to see in a project like this. And at this point I am happy that this is either resolved, or very close to it.


$10 donation coming your way, enjoy the beer :) .

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

 Share

×
×
  • Create New...