- Reducing the amount of time required to create "heavyweight" objects (that might involve executing some time consuming tasks).
- Reducing the amount and rate of dynamic memory allocations, which in turn reduces GC latency in future collections.
Allocation in JVMs was not always so fast -- early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice -- because it was, along with everything else in early JVMs -- and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.)
A common, simple pattern for implementing an object pool is to create a single pool instance that is shared across all of the application. To achieve thread-safety, you would usually find a single, global lock around the Allocate and Free methods.
It's obvious that this type of design could introduce major concurrency bottlenecks. The more objects we attempt to pool, the greater the chance that several threads will try to acquire the pool's lock at the same time. And since we only maintain a single, global pool, contention around that lock is bound to appear, effectively ruining our application's scalability.
To demonstrate the issue, I've written a small benchmark that uses all of the available processors to allocate and free a constant number of objects (each thread gets an equal share of the objects to pool). Logically speaking, the more processors we use, the faster we should be able to finish allocating and freeing the constant number of objects. However, as the results show, we actually experience a slowdown that gets worse as we add more processors.
The results aren't surprising, since they are easily explained by the massive amount of contention around our single lock. (The time axis in the chart is expressed in milliseconds.)
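The benchmark code itself isn't listed here, so the following is only a minimal sketch of the kind of harness described above. The names `LockedPool`, `PoolBenchmark` and `Run`, and the decision to split a fixed number of allocate/free pairs evenly across the worker threads, are my own assumptions for illustration and not the original test.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

// Hypothetical stand-in for a pool guarded by one global lock.
public class LockedPool
{
    private readonly Stack<object> m_items = new Stack<object>();
    private readonly object m_lock = new object();

    public object Allocate()
    {
        lock (m_lock)
        {
            return m_items.Count > 0 ? m_items.Pop() : new object();
        }
    }

    public void Free(object obj)
    {
        lock (m_lock)
        {
            m_items.Push(obj);
        }
    }

    public int Count
    {
        get { lock (m_lock) { return m_items.Count; } }
    }
}

public static class PoolBenchmark
{
    // Splits totalOps allocate/free pairs evenly across threadCount threads
    // and returns the elapsed wall-clock time in milliseconds.
    public static long Run(LockedPool pool, int threadCount, int totalOps)
    {
        int opsPerThread = totalOps / threadCount;
        var threads = new Thread[threadCount];
        using (var start = new ManualResetEventSlim(false))
        {
            for (int t = 0; t < threadCount; t++)
            {
                threads[t] = new Thread(() =>
                {
                    start.Wait(); // hold every worker until the clock starts
                    for (int i = 0; i < opsPerThread; i++)
                    {
                        object obj = pool.Allocate();
                        pool.Free(obj);
                    }
                });
                threads[t].Start();
            }

            Stopwatch watch = Stopwatch.StartNew();
            start.Set();
            foreach (Thread thread in threads)
                thread.Join();
            watch.Stop();
            return watch.ElapsedMilliseconds;
        }
    }
}
```

Calling `Run` once for each thread count from 1 up to `Environment.ProcessorCount`, with the same `totalOps` every time, yields one timing per processor count, which is the kind of series the chart reports.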
The first implementation of the pool that was used in the test:
(Note that the code samples in this post are purely meant to demonstrate the conceptual differences between the pools.)
using System;
using System.Collections.Generic;

// holds a dictionary that makes a pool-per-type correlation
public class SimpleMainPool
{
    private Dictionary<Type, ISubPool> m_main;

    // to make things simpler, the dictionary isn't modified
    // after the first initialization
    public SimpleMainPool(Type[] pooledTypes)
    {
        m_main = new Dictionary<Type, ISubPool>();
        foreach (Type curType in pooledTypes)
        {
            // first implementation: a SimplePool per type
            // (swap in a different ISubPool to test another design)
            m_main.Add(curType, new SimplePool(curType));
        }
    }

    public object Allocate(Type type)
    {
        ISubPool sub = m_main[type];
        return sub.Allocate();
    }

    public void Free(object obj)
    {
        ISubPool sub = m_main[obj.GetType()];
        sub.Free(obj);
    }
}

// our simple thread-safe pool
class SimplePool : ISubPool
{
    private const int PRIME = 50;
    private Type m_type;
    private Stack<object> m_sharedPool;

    public SimplePool(Type type)
    {
        m_sharedPool = new Stack<object>(PRIME);
        m_type = type;
        for (int i = 0; i < PRIME; i++)
            m_sharedPool.Push(Activator.CreateInstance(m_type));
    }

    public object Allocate()
    {
        lock (m_sharedPool)
        {
            // re-prime the pool once it runs dry
            if (m_sharedPool.Count == 0)
            {
                for (int i = 0; i < PRIME; i++)
                    m_sharedPool.Push(Activator.CreateInstance(m_type));
            }
            return m_sharedPool.Pop();
        }
    }

    public void Free(object obj)
    {
        lock (m_sharedPool)
        {
            m_sharedPool.Push(obj);
        }
    }
}

interface ISubPool
{
    object Allocate();
    void Free(object obj);
}
As in all things related to concurrency, if you don't have locality, then you've got sharing, and once you have sharing, you will probably end up with contention that is bound to harm your application's performance, wasting valuable CPU cycles.
So if we'd like to improve our scalability, our goal is clear: reduce the amount of shared data. For example, if pools weren't shared across different threads, we wouldn't have to worry about synchronizing them and could avoid the contention altogether. A simple way to achieve this is to use thread-local storage (TLS) to allocate an independent pool for every thread. On the one hand, this gives us perfect scalability because no state is shared; on the other hand, this kind of implementation could lead to excessive memory usage. For instance, if a single instance of our pool (including all of its pre-allocated objects) weighs about 10 MB, then on a machine with 16 processors running one pooled thread per processor, we could find ourselves dedicating no less than 160 MB to our thread-local pools, even though it's not likely that every single thread needs all the types of objects that were allocated in its local pool.
For example, if we're parallelizing some algorithm using 3 threads, where thread 1 needs objects of type A, thread 2 needs objects of type B and thread 3 needs objects of type C, then it makes no sense for every one of those threads to hold a pool that contains objects of all three types.
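As a rough sketch of the thread-local approach described above, the pool below keeps one independent stack per thread through `ThreadLocal<T>`, so neither `Allocate` nor `Free` takes a lock. The class name `ThreadLocalPool`, the `primeCount` parameter and the factory delegate are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical pool that gives every thread its own private stack.
public class ThreadLocalPool<T> where T : class
{
    private readonly Func<T> m_factory;
    private readonly ThreadLocal<Stack<T>> m_local;

    public ThreadLocalPool(Func<T> factory, int primeCount)
    {
        m_factory = factory;
        // The initializer runs lazily, once per thread, on first access.
        m_local = new ThreadLocal<Stack<T>>(() =>
        {
            var stack = new Stack<T>(primeCount);
            for (int i = 0; i < primeCount; i++)
                stack.Push(factory());
            return stack;
        });
    }

    // No lock: the stack is only ever touched by the current thread.
    public T Allocate()
    {
        Stack<T> stack = m_local.Value;
        return stack.Count > 0 ? stack.Pop() : m_factory();
    }

    public void Free(T obj)
    {
        m_local.Value.Push(obj);
    }

    // Number of objects currently pooled for the calling thread.
    public int LocalCount
    {
        get { return m_local.Value.Count; }
    }
}
```

Every thread that touches the pool pays for its own `primeCount` objects, whether or not it ends up using them, which is exactly the memory cost discussed above.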
A possible solution for this problem is to use a pool hierarchy, where every time a thread attempts to create an object, it turns to its "closest" pool instance. If that pool doesn't contain available instances of the requested object, the thread continues up the hierarchy until it reaches a pool that does. Once the thread finishes using the object, it returns it to a pool that is located "closer" to that thread. This way we are able to maintain a level of locality between a thread and the objects it uses.
Instead of getting confused with unclear and too complex hierarchies, I'll demonstrate the concept using a flat hierarchy that offers a single "global" pool that is shared across all threads, and another local pool for every thread.
Basically, the idea is that the only place synchronization is involved is in the shared pool. So in the optimal scenario, each local pool will eventually hold only the amount of objects required to keep the thread from accessing the shared pool.
Every time a thread needs to create an object, it will first check its local pool. Since this pool only serves the requesting thread, we won't have to deal with synchronization here. Only when we've run out of objects will we move on to the shared pool and transfer N more instances of the requested object to the local pool. It could be wise to transfer more objects than the thread initially asked for in order to avoid future accesses to the shared pool. Also, in order to cap the amount of memory we dedicate to each thread, we could decide that each local pool can hold a maximum of X objects. Once we've exceeded that number, every time a thread wants to free an object, it will return it to the shared pool instead of its local pool (of course, this could cause some contention, depending on the implementation details [e.g. the pool may buffer object returns], but it's entirely up to the developer to perform this kind of fine-tuning [memory usage vs. scalability]).
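One possible reading of the "buffer object returns" idea is sketched below: once the local pool is full, freed objects are collected in a small per-thread buffer and handed to the shared pool in a single batch, so the lock is taken once per batch rather than once per object. This is only an illustration under my own assumptions; the class name `BufferedReturnPool` and the constants `LOCAL_MAX` and `RETURN_BATCH` (and their values) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class BufferedReturnPool
{
    private const int LOCAL_MAX = 4;     // assumed cap on each local pool
    private const int RETURN_BATCH = 3;  // assumed batch size for returns

    private readonly Stack<object> m_shared = new Stack<object>();
    private readonly ThreadLocal<Stack<object>> m_local =
        new ThreadLocal<Stack<object>>(() => new Stack<object>());
    private readonly ThreadLocal<List<object>> m_returns =
        new ThreadLocal<List<object>>(() => new List<object>(RETURN_BATCH));

    public object Allocate()
    {
        Stack<object> local = m_local.Value;
        if (local.Count > 0)
            return local.Pop();

        // Reuse an object still waiting in this thread's return buffer.
        List<object> buffer = m_returns.Value;
        if (buffer.Count > 0)
        {
            object pending = buffer[buffer.Count - 1];
            buffer.RemoveAt(buffer.Count - 1);
            return pending;
        }

        lock (m_shared)
        {
            if (m_shared.Count > 0)
                return m_shared.Pop();
        }
        return new object();
    }

    public void Free(object obj)
    {
        Stack<object> local = m_local.Value;
        if (local.Count < LOCAL_MAX)
        {
            local.Push(obj); // common case: no synchronization
            return;
        }

        List<object> buffer = m_returns.Value;
        buffer.Add(obj);
        if (buffer.Count < RETURN_BATCH)
            return;

        // Local pool and buffer are both full: one lock per batch.
        lock (m_shared)
        {
            foreach (object item in buffer)
                m_shared.Push(item);
        }
        buffer.Clear();
    }

    public int SharedCount
    {
        get { lock (m_shared) { return m_shared.Count; } }
    }
}
```

The trade-off is that up to `RETURN_BATCH - 1` objects per thread sit in a buffer where other threads can't reach them, so a larger batch means less contention but more idle memory.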
To demonstrate the concept, I came up with this simplistic pool implementation:
using System;
using System.Collections.Generic;
using System.Threading;

class SemiLocalPool : ISubPool
{
    private const int SHARED_PRIME = 50;
    private const int LOCAL_PRIME = 20;
    private const int LOCAL_MAX = 1000;

    // One local stack per thread *per pool instance*. A [ThreadStatic] static
    // field would be shared by every SemiLocalPool on the same thread, which
    // mixes objects of different types as soon as more than one type is pooled.
    private readonly ThreadLocal<Stack<object>> m_localPool =
        new ThreadLocal<Stack<object>>(() => new Stack<object>(LOCAL_PRIME));

    private Type m_type;
    private Stack<object> m_sharedPool;

    public SemiLocalPool(Type type)
    {
        m_sharedPool = new Stack<object>(SHARED_PRIME);
        m_type = type;
        for (int i = 0; i < SHARED_PRIME; i++)
            m_sharedPool.Push(Activator.CreateInstance(m_type));
    }

    public object Allocate()
    {
        Stack<object> localPool = m_localPool.Value;

        // first, try to allocate from the local pool
        if (localPool.Count > 0)
            return localPool.Pop();

        int allocated = 0;
        lock (m_sharedPool)
        {
            // pass objects from shared to local pool
            for (; m_sharedPool.Count > 0 && allocated < LOCAL_PRIME; allocated++)
                localPool.Push(m_sharedPool.Pop());

            // prime shared pool
            if (m_sharedPool.Count == 0)
            {
                for (int i = 0; i < SHARED_PRIME; i++)
                {
                    // bad practice: holding the lock while executing external code
                    m_sharedPool.Push(Activator.CreateInstance(m_type));
                }
            }
        }

        // if the shared pool didn't contain enough elements, prime the remaining items
        for (; allocated < LOCAL_PRIME; allocated++)
            localPool.Push(Activator.CreateInstance(m_type));

        // the local pool now holds LOCAL_PRIME objects; hand one of them out
        // instead of creating yet another instance
        return localPool.Pop();
    }

    public void Free(object obj)
    {
        Stack<object> localPool = m_localPool.Value;

        // first return to local pool
        if (localPool.Count < LOCAL_MAX)
        {
            localPool.Push(obj);
            return;
        }

        // only after reaching LOCAL_MAX push back to the shared pool
        lock (m_sharedPool)
        {
            m_sharedPool.Push(obj);
        }
    }
}
The scalability difference between the two implementations is closely related to each thread's pool usage pattern and to the values given to LOCAL_MAX, LOCAL_PRIME, etc. If we reach a situation where there are always enough objects in the local pool, then we should enjoy perfect scalability.
For the purpose of the demonstration, here are the results of the previous benchmark, now using the new pool implementation (aside from exceeding the predefined values at the beginning of the benchmark, the benchmark exhibits the optimal usage pattern [accessing only the local pool after a while]).
One problematic characteristic of this type of design is its reliance on thread affinity. While in some scenarios it could actually benefit us, in others it could make the Semi-Local Pool irrelevant.
If every thread in our application is affinitized to a certain section of the code (one that allocates a constant set of objects), then using this design could be optimal, since we dedicate each local pool to a managed thread. We actually assume that the thread will always attempt to allocate objects from a specific, constant set of objects.
However, if the threads don't comply with this assumption, then it's only a matter of time until each local pool holds the entire set of pooled objects in the application (which will of course lead to high memory usage).
To handle such scenarios better, we could add another hierarchy level that separates the shared pools according to different sections of the code. That is, threads that are currently executing code from a network module, for example, will access Pool X, while threads that are currently executing some algorithm will access Pool Y. This way we achieve object locality not by relying on thread affinity, but on "category affinity" (each section of the code uses a certain set of objects that is relevant to it). When a thread wants to allocate an object, it tells the pool which area of the code it's currently executing, so that it receives the appropriate "category pool". It's likely that this pool already contains the type of objects the current thread will request, since they were already allocated by other threads that previously executed the same code section.
And some code to illustrate the concept:
using System;
using System.Collections.Generic;

// maps a category name (a section of the code) to its own main pool
public class CategorizedMainPool
{
    private Dictionary<string, SimpleMainPool> m_main;

    public CategorizedMainPool(Tuple<string, Type[]>[] pooledCategories)
    {
        m_main = new Dictionary<string, SimpleMainPool>();
        foreach (Tuple<string, Type[]> curCategory in pooledCategories)
        {
            SimpleMainPool curSub = new SimpleMainPool(curCategory.Item2);
            m_main.Add(curCategory.Item1, curSub);
        }
    }

    public object Allocate(string category, Type type)
    {
        SimpleMainPool sub = m_main[category];
        return sub.Allocate(type);
    }

    public void Free(string category, object obj)
    {
        // return the object to the pool of the category it was taken from
        SimpleMainPool sub = m_main[category];
        sub.Free(obj);
    }
}
I'm wondering if you've compared your semi-local pool implementation to a simple object pool algorithm using ConcurrentBag? From MSDN's documentation it looks like ConcurrentBag is extremely similar, using threadstatic containers and stealing from another thread's container if its own one is empty.