Wednesday, June 22, 2011

Mobicents Media Server 2.1.0.B2 under the hood

The development of Media server on top of standard Java was started 3 years ago and all that time we were trying to avoid implementation of real time scheduler and reach suitable results using Java threads model only. There was several reasons to stick to this way. In addition, for our justification we can say we don't need hard real time system and can agree on a system with small miss rate value and jitter. So it was subject for try... At the early stage when server was running not big set of mostly equivalent tasks the round robin thread pool model was working more or less but with each release server became more and more complex and gain the problem of scheduler more and more as well.

Version 2.1.0.BETA1 was a final and agonized try to make this system working. The development of next version started from the deep analysis of scheduling algorithms and now ended with full server refactoring.

Generally, media server is a system where time critical processes must be executed under timeline limitations which is usually decribed as term deadline. Task must be completed or its execution must start before the deadline. Thus we need first of all to implement task scheduler which will allow us execute time critical tasks.

The global task of real time scheduling is quite important for computer industry so it is well studied in many works. As result, different algorithms of scheduling were designed as well but unfortunately none of such algorithms is available in standard Java by default. Of cource there is an extension called RTSJ (JSR-1) but it is not available for all OSes and it is expensive. Defintely, all known scheduling algorithms are optimal if preemtive what cause high level of CPU control and thus dependency from OS capabilities. So not even any OS can be suitable for implementing real time scheduler. In this sense, implementation of crossplatform such scheduler using Java does not seems real at all. No need to try it even. Instead, let's look is it possible to adopt one of the known algorithm for the media processing task exactly and implement it on Java.

The media processing supposes loses. The encoded and then decoded signal never matches to the original signal but senses of humans are not able to recognize the difference. It means that we can make deadline softer and try to minimize the mistake. This is assumption quite extends the limits.

From the other side, the Java virtual machine actually deattaches us from the real CPUs what allowes us to consider it mostly like uniprocessor system.

These facts bring us to the idea to choose the scheduler with EDF policy as a starting point. We will assign priority to the task and put tasks into the priority queue while JVM will execute one task after another without preemption.

The EDF scheduling is feasible if

where $c_{i}$ is worst case execution time and $T_{i}$ is period of i-task. At this stage let's note that media server runs big number of tasks of same type: IO, transcoding, relay, etc. This makes the total analysis more simple and allows to find the strategy of feasible scheduling for all types of media tasks. However since we can not implement preemtion it is impossible to guarantee that such scheduling will be feasible all the time. And it is obvious that error in terms of deadline will be proportional to the following value

where c - is task execution time and N - number of independant threads (or CPUs behind jvm). It is easy to see that the error is reduced with execution time and with increasing number of processors. Or in numbers for 1 millisecond worst case execution time we will have jitter about

 Number of processors jitter,ms 2 0,5 4 0,25 8 0,125 16 0,064

As we can see 4 CPUs gives us already perfect number even without preemption and this estimation perfectly matches to our test results. Awesome! However it is not the end of the story yet. There is still underwater stone - dispersion.

Let's run same tasks a lot of times and check the execution time. At the end of test we will have set of different values grouped around some middle value. It is possible to draw the distribution of the execution time as following diagram

Where $\mu$ is probalistic mean for execution time and dispersion is $\sigma$.

The dispersion is explained by the several facts:
- jvm activity like JIT compiling
- CPU sharing
- Garbage collection
- Object resizing

Let's ignore what is a reason of dispersion and try to understad how is it dangerous for us? Since execution time is random value with some distribution then each member of the equation (1) $c_{i}/T_{i}$ will be random value as well with same distribution amd then following to Tchebyshev's Central Limit Theorem (CLT) the sum of n such varibales will tend to normal distribution with probabilistic mean n*$\mu$ and dispersion $\sigma^{2}$. So dispersion growth like square! (beeeeeeeeeeeeep).

The meaning of above theorem can be demostrated on the following example. Imagine that 10 guys shooting to own target. To be simple lets count that all have aproximately same result: 3 bullets into the midle and other distributed in round with 5" radius. Now the same guys shooting to one taget. The theorem predicts as the 30 bulltes will reach center and remaining will be distributed in the round with 25" already (aproximately). But what if target has radius 10" only? In first case 100% of bullets were in target while in second case 50% of bullets did not fit into target at all.

In case of media server the equation (1) gives the "size" of the target. Or in normal language it means that the probability to exceed the available CPU utilization and hit to not feasible scheduler caused by dispersion growth very quickly and we have to pay extreme attention to reasons why it happens and find ways how to minimize the dispersion.

As mentioned above there are several reasons of dispersion of execution time for any task. Some of them are completely out of our management capabilities and some can be and must be fixed. For others we can propose recommendations like run media server on a dedicated machine or use another way to guarantee the enough amount computitional resources.

For the current release most of the methods were refactored and tested during many hours individualy on subject of dispersion caused by Java itself and GC and for worst case execution time as well. New memory model was introduced to reduce the impact from GC, new text utility classes which alllowes now to improve protocol message handling in handred of thouthands times... It was a long time job but with positive result: Mobicents media server demonstartes strong and stable transmission statistics even compared with native implementations.

In this post I was trying to explain the "global" idea of scheduling used by Mobicents Media server, roadblocks and ways of solving problems. In the upcoming post I will subsequently show test results, stream statistic compared to well known open source products, explain memory model, etc. So stay tuned.

1. Thank you for the in-depth review of the Mobicents Media server refactoring. The dispersion prediction over time is eye-opening.

Looking forward to the follow up post.

Question related to formula (1) Sum (ci/Ti) <= 1
Maybe I misunderstand, but is it the case that ci is the worst case execution time for task i? And Ti is the time slot available for task i to execute? If so, then it is intuitive that the goal is for ci to always fit in Ti (ci <= Ti).

Also, Ti is usually the same for any i, let's say 1ms.

Now, if ci is close to Ti or even half Ti, then it is easy to see how Sum(ci/Ti) can exceed 1 even for n=2 or 3.

Can you help me understand what I am missing. Thank you.

2. Sorry, just hit to your question.... You did not miss anything. Absolutely right that if ci close to Ti, then only a few task feasible to execute with deadline constraints. Weird? Not actualy, it is just a life. The weird thing that if Ci is not deterministic and has a wide dispersion then relying on worst case execution time gives us too pessimistic estimation. Let's demostrate it on example: Ti = 1ms, and Ci is random with average value 10 microseconds and worst case execution time 1ms. How many such tasks we can schedule? 100 or 1? or may be 50?