The segment tree implementation in JavaScript presented in my last post is already able to process large number of intervals. A practical use of this algorithm is for example to analyze data of gene sequences. I wanted to find out if it is possible to push the performance for this task one step further.
There is still enough room for improvement in the JavaScript version like for example changing to typed arrays or using non-recursive implementations for building and querying the segment tree. But I saw the possibility to parallelize the problem of building and traversing the tree and distribute the workload on multiple cores. For this, and also to minimize memory usage another language is required.
First I thought about using the new concurrency features of C++11 but having enjoyed programming in JavaScript for a while now I was looking for a programming paradigm that is closer to JS.
Soon I was excited about trying out Go for this task as it seemed to be a promising candidate:
There is still enough room for improvement in the JavaScript version like for example changing to typed arrays or using non-recursive implementations for building and querying the segment tree. But I saw the possibility to parallelize the problem of building and traversing the tree and distribute the workload on multiple cores. For this, and also to minimize memory usage another language is required.
First I thought about using the new concurrency features of C++11 but having enjoyed programming in JavaScript for a while now I was looking for a programming paradigm that is closer to JS.
Soon I was excited about trying out Go for this task as it seemed to be a promising candidate:
- Being designed as a system programming language with static types it should be possible to model the data structures for the segment tree in a memory efficient way
- With lightweight type system, garbage collection and fast compilation it feels more like a dynamic language like JavaScript
- According to the FAQ: "By its design, Go proposes an approach for the construction of system software on multicore machines"
First steps in Golang land
It's fun to learn Go. The documentation resources provided at http://golang.org/ are excellent. I found it especially helpful to read through the slides of a three day workshop about Go (Day1, Day2, Day3). The slides are not any more part of the current documentation, probably because they are not updated for current Go version 1. But I think the fundamental concepts have not changed, so it's still worth reading them.
After this, the Tour of Go provides a good means to try things out and do some exercises. The specification is even readable and on golang-nuts group the core developers of the language are very active.
The first step was to create a single threaded version of the segment tree without the concurrency features of Go. The result can be found here: stree.go
When we measure the time to build the tree and to query a range we get the following results:
To build the tree Go needs only about a quarter of the time compared to JavaScript. Interestingly, the query method in Go is only around 25-30% faster. The query used here is on the complete number range, which is the worst case query for the segment tree as all nodes have to be evaluated and a lot of intervals are found which then have to be reduced to non-duplicates.
All tests were carried out on the following machine:
We see only a 7% resp. 15% improvement when using goroutines. This is less than I expected. How can this be explained?
Rob Pike explains in these slides what he sees as the difference:
After this, the Tour of Go provides a good means to try things out and do some exercises. The specification is even readable and on golang-nuts group the core developers of the language are very active.
Translate from JavaScript to Go
When we measure the time to build the tree and to query a range we get the following results:
To build the tree Go needs only about a quarter of the time compared to JavaScript. Interestingly, the query method in Go is only around 25-30% faster. The query used here is on the complete number range, which is the worst case query for the segment tree as all nodes have to be evaluated and a lot of intervals are found which then have to be reduced to non-duplicates.
All tests were carried out on the following machine:
Go 1, Node.js (v0.6.11), Linux (3.1.9 x86_64), P8600 (Core2Duo@2.4GHz)
Running on multiple cores
Go offers two main language features for concurrency: goroutines and channels.
Goroutines can be used to solve the problem of building a segment tree in parallel: one could start multiple goroutines where each one independently creates a branch of the tree. Same concept applies to the querying algorithm: we can search with multiple goroutines in separate branches of the tree for overlapping intervals and later merge these results. So far the theory, here is the implementation: mtree.go
The following shows the result of the measurement.
We see only a 7% resp. 15% improvement when using goroutines. This is less than I expected. How can this be explained?
"Concurrency is not Parallelism"
Concurrency provides a way to structure a solution to solve a problem that may (but not necessarily) be parallelizable.Go provides language features that allows to elegantly describe a problem in a concurrent way. This is the precondition before it can be executed in parallel. But if the actual Go program can utilize parallelism and also achieve a performance benefit depends on a number of factors:
- the nature of the problem: can it be divided in independent sub tasks?
- the concurrency overhead: the usage of these language features comes with a certain overhead, communication is required to synchronize tasks. The benefit of parallelism must outweigh this overhead
- the code quality: there are always multiple ways to structure concurrent solutions. Experience with the segment tree showed that there is a high variance how different solutions performed. More on this later.
- the runtime / implementation details: when a function is called with a keyword 'go' it does not start immediately, but is scheduled for later execution. The scheduling will therefore have an impact on the parallel execution of goroutines. Also the internal memory model will influence the capability of concurrent goroutines accessing the same data structure.
Experiences when building a segment tree in parallel
I want to analyze if building a segment tree is a suitable problem for parallelism and share my experiences with implementing it in Go. As stated above, building a tree is parallelizable: multiple threads could deal with building the multiple branches of the tree. But is it also a "good" problem? I don't think so. An ideal case is given here: Effective Go, an expensive operation without memory allocation distributed on multiple cores. Building a segment tree is the opposite: the calculation is very simple, only comparing two numbers per new node of the tree, but with constant memory allocation for each node. These allocations can probably not be executed in parallel and therefore serialize the goroutines. There is an answer from Ross Cox in the discussion here that leads to this conclusion.
When we start multiple goroutines to build our tree we somehow need to know when all goroutines are finished and the tree is complete. In a wider sense: communication is required between concurrent running threads. Golang propagates the paradigm share by communicating:
Do not communicate by sharing memory; instead, share memory by communicating.
Go channels are a means to implement this paradigm. Once used to 'thinking in channels' I found them a safe and simple toolkit for the often counter-intuitive problems of parallel programming. But Go also supports mutexes to lock shared access to variables.
For the segment tree it was required to use these features in a sensible manner: as the problem as such is not optimal, the gain from parallelism could be easily eaten up by the overhead of communication and synchronization of goroutines. For every function in mtree.go I wrote at least three different versions, with different usage of channels and goroutines. For example you could first start a number of goroutines and supply them with tasks via channels, or start the goroutine in place with the task in a direct function parameter. These slight modifications on the structure had often a deep impact on the overall performance. Often the performance was worse than with the single threaded algorithm.
Optimizations in this area, which is of course a corner case, can only be done (apart from an empirical approach) with knowledge of the underlying runtime or implementation details. For example the question if write access to a single field of an array will block the whole data structure for other goroutines, cannot be answered from the Go specification.
Another area of the runtime is the scheduler. It's a question of granularity how many sub tasks you create from your original problem and therefore how many goroutines should be started in parallel. On the Core2Duo test machine the best performance for building the tree was achieved with 64 parallel goroutines, but for querying the tree the optimum was with only 4 goroutines (given a segment tree with 100000 intervals). To understand this behavior again a deeper understanding would be required: how are goroutines scheduled? When are they preempted? I don't want to go into more details here, but this discussion should help to get a basic understanding, also of the current limitations of the scheduler.
The question "Why does using GOMAXPROCS > 1 sometimes make my program slower?" belongs to this context as well.
The question "Why does using GOMAXPROCS > 1 sometimes make my program slower?" belongs to this context as well.
Conclusion
It's hard not to like Go. It feels carefully crafted in all parts. I especially liked the lightweight type system that does not need classes but still allows object oriented design. When looking at Go from a Java perspective the simplifications and the reduction to the essential seems like a step in the right direction. Coming from JavaScript I appreciated more structure and the strict compiler. Interfaces feel a bit weird in the beginning but I think one get used to. For developers who think that parallel algorithms are often non-intuitive, Go channels provide a means to structure these problems elegantly.
Although Go currently develops into a general purpose language it was designed for system programming and server applications. In this area where networking is involved, latency and huge number of multiple clients Go can show its strengths.
The segment tree presented here does not fall in this category. It might be more representative for scientific calculations. Is Go usable in this area as well? I think the interesting question is not so much what a programmer with a profound knowledge of the runtime and underlying implementation can squeeze out of the language. But more what the average developer can do in little time. Given a suitable problem Go has everything to express it in a concurrent way and it will benefit from parallelism. With the to be expected advancement of the Go runtime the scheduling should be optimized for long run calculations. If the problem is not so suitable like the here presented segment tree (and this might be obvious only on second look) it should be appropriate to implement first the single threaded version and optimize for parallelism only if required.
Although Go currently develops into a general purpose language it was designed for system programming and server applications. In this area where networking is involved, latency and huge number of multiple clients Go can show its strengths.
The segment tree presented here does not fall in this category. It might be more representative for scientific calculations. Is Go usable in this area as well? I think the interesting question is not so much what a programmer with a profound knowledge of the runtime and underlying implementation can squeeze out of the language. But more what the average developer can do in little time. Given a suitable problem Go has everything to express it in a concurrent way and it will benefit from parallelism. With the to be expected advancement of the Go runtime the scheduling should be optimized for long run calculations. If the problem is not so suitable like the here presented segment tree (and this might be obvious only on second look) it should be appropriate to implement first the single threaded version and optimize for parallelism only if required.