tcmalloc vs. jemalloc

Q:

I have an application which allocates lots of memory and I am considering using a better memory allocation mechanism than malloc.

My main options are jemalloc and tcmalloc. Are there any benefits to using one over the other?

There is a good comparison between some mechanisms (including the author’s proprietary mechanism — lockless) in http://locklessinc.com/benchmarks.shtml and it mentions some pros and cons of each of them.

Both allocators are actively developed and constantly improving. Does anyone have insight into, or experience with, the relative performance of the two?

A:
If I remember correctly, the main difference was with multi-threaded projects.

Both libraries try to reduce contention on memory acquisition by having threads take memory from different caches, but they use different strategies:

jemalloc (used by Facebook) maintains a cache per thread
tcmalloc (from Google) maintains a pool of caches, and threads develop a “natural” affinity for a cache, but may change
This led, once again if I remember correctly, to an important difference in terms of thread management.

jemalloc is faster if threads are static, for example when using thread pools
tcmalloc is faster when threads are frequently created and destroyed
There is also the problem that, since jemalloc spins up new caches to accommodate new thread IDs, a sudden spike in the number of threads will leave you with (mostly) empty caches in the subsequent calm phase.

As a result, I would recommend tcmalloc in the general case, and reserve jemalloc for specific use cases (little variation in the number of threads during the lifetime of the application).
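To check the thread-management claim against your own workload, the two threading patterns can be reproduced in a small harness and timed under each allocator. The following is only a minimal sketch: the function names (`do_task`, `run_with_fixed_pool`, `run_with_thread_churn`) and the 64-byte allocation size are illustrative assumptions, and the code says nothing about how either allocator works internally.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// One unit of work: a small allocation that is touched and then freed.
// (Hypothetical workload; substitute something closer to your application.)
static void do_task() {
    void* p = std::malloc(64);
    assert(p != nullptr);
    *static_cast<volatile char*>(p) = 1;  // touch the memory so it is really used
    std::free(p);
}

// Static pattern: a fixed set of long-lived threads shares all tasks.
std::size_t run_with_fixed_pool(unsigned num_threads, std::size_t num_tasks) {
    std::atomic<std::size_t> next{0};
    std::atomic<std::size_t> done{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&] {
            // fetch_add hands each task index to exactly one worker.
            while (next.fetch_add(1) < num_tasks) {
                do_task();
                done.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return done.load();
}

// Churn pattern: a short-lived thread is created and destroyed for every task.
std::size_t run_with_thread_churn(std::size_t num_tasks) {
    std::atomic<std::size_t> done{0};
    for (std::size_t i = 0; i < num_tasks; ++i) {
        std::thread th([&] {
            do_task();
            done.fetch_add(1);
        });
        th.join();
    }
    return done.load();
}
```

Calling each function from a timed `main` once per allocator gives a rough comparison of the two patterns; real applications will of course mix them.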

A:

I have recently considered tcmalloc for a project at work. This is what I observed:

Greatly improved performance under heavy use of malloc in a multithreaded setting. I used it with a tool at work and performance improved almost twofold. The reason is that in this tool a few threads were allocating small objects in a critical loop. With glibc, performance suffers because of, I think, lock contention between malloc/free calls in different threads.
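A workload of this kind can be approximated with a small micro-benchmark. This is a sketch under assumptions, not the tool described above: the names `alloc_worker` and `run_alloc_benchmark`, the batch size of 64, and the 16 to 64 byte object sizes are all made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Each worker repeatedly allocates and frees a batch of small objects,
// imitating a hot loop that hammers the allocator.
static void alloc_worker(std::size_t iterations, std::size_t* completed) {
    constexpr std::size_t kBatch = 64;  // illustrative batch size
    void* batch[kBatch];
    std::size_t done = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        for (std::size_t j = 0; j < kBatch; ++j) {
            batch[j] = std::malloc(16 + (j % 4) * 16);  // 16, 32, 48 or 64 bytes
            assert(batch[j] != nullptr);
            // Touch the memory so the allocation is actually used.
            *static_cast<volatile char*>(batch[j]) = 1;
        }
        for (std::size_t j = 0; j < kBatch; ++j) {
            std::free(batch[j]);
        }
        done += kBatch;
    }
    *completed = done;
}

struct BenchResult {
    double seconds;           // wall-clock time for all threads to finish
    std::size_t allocations;  // total malloc/free pairs performed
};

BenchResult run_alloc_benchmark(unsigned num_threads, std::size_t iterations) {
    std::vector<std::size_t> counts(num_threads, 0);
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back(alloc_worker, iterations, &counts[t]);
    }
    for (auto& th : threads) th.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return {elapsed.count(), total};
}
```

Running the same binary once with the default allocator and once with an alternative one (linked in, or preloaded as a shared library; the library name and path depend on your installation) gives a first impression of the contention effect. A synthetic loop like this tends to exaggerate differences, so treat it as a hint rather than a verdict.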

Unfortunately, tcmalloc increases the memory footprint. The tool I mentioned above consumed two to three times more memory (as measured by the maximum resident set size). The increased footprint is a no-go for us, since we are actually looking for ways to reduce it.
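The maximum resident set size can also be read from inside the process with the POSIX `getrusage` call. A minimal sketch follows; the helper name `peak_rss` is an assumption, and note that the unit of `ru_maxrss` differs by platform (kilobytes on Linux, bytes on macOS).

```cpp
#include <cassert>
#include <sys/resource.h>

// Returns the peak resident set size of the calling process as reported by
// getrusage(2). The unit is platform-dependent: kilobytes on Linux, bytes
// on macOS.
long peak_rss() {
    struct rusage usage;
    int rc = getrusage(RUSAGE_SELF, &usage);
    assert(rc == 0);
    (void)rc;  // avoid an unused-variable warning when asserts are disabled
    return usage.ru_maxrss;
}
```

Logging this value at the end of a representative run, once per allocator, makes the footprint comparison repeatable.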

In the end I decided not to use tcmalloc and instead to optimize the application code directly: this means removing the allocations from the inner loops to avoid the malloc/free lock contention. (For the curious, I did this using a form of compression rather than memory pools.)
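The general idea of moving allocations out of an inner loop can be illustrated with a reusable scratch buffer. Note that this is a different and simpler technique than the compression approach mentioned above, whose details are not given here; the function names and the doubling "work" are purely hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allocating version: a new scratch vector is created for every record,
// so every iteration goes through the allocator at least once.
long process_allocating(const std::vector<std::vector<int>>& records) {
    long total = 0;
    for (const auto& rec : records) {
        std::vector<int> scratch;  // fresh heap allocation per iteration
        for (int v : rec) scratch.push_back(v * 2);
        for (int v : scratch) total += v;
    }
    return total;
}

// Hoisted version: one scratch buffer lives outside the loop and is reused.
long process_reusing(const std::vector<std::vector<int>>& records) {
    long total = 0;
    std::vector<int> scratch;  // allocated lazily, then reused
    for (const auto& rec : records) {
        scratch.clear();  // drops the elements but keeps the capacity
        for (int v : rec) scratch.push_back(v * 2);
        assert(scratch.size() == rec.size());
        for (int v : scratch) total += v;
    }
    return total;
}
```

Both functions return the same result; the second only reallocates when a record is larger than any seen before, which takes most malloc/free calls out of the hot path.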

The lesson is that you should carefully measure your application with typical workloads. If you can afford the additional memory usage, tcmalloc could be great for you. If not, tcmalloc is still useful for seeing what you would gain by avoiding frequent calls to the memory allocator across threads.
