This is an excellent article covering some of the implementation details of how Intel hyper-threading works.
Also, pop_count should probably be an unsigned integer, since overflow of a signed int is undefined behavior. As written, the problem is more severe than simple wraparound.
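A minimal sketch of what the unsigned version might look like (the function body is illustrative, since the original code isn't shown here):

```cpp
#include <cstdint>

// Both the value and the counter are unsigned, so incrementing the
// counter can never trigger signed-overflow undefined behavior.
unsigned pop_count(uint32_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;  // clears the lowest set bit (Kernighan's trick)
        ++count;
    }
    return count;
}
```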
Clarification on the definition of execution-driven simulation: an execution-driven simulator will actually run a program in a simulated environment.
Clarification. I'd say dynamic assignment is simply a scheme where decisions about how to assign work to workers are made based on program status as the program runs (rather than a priori). It turns out that such a strategy is most effective where the number or size of tasks is unpredictable, since if they were predictable the assignment could have been determined up front.
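A minimal sketch of the idea (names and the toy workload are my own): workers repeatedly claim the next unclaimed task index from a shared atomic counter, so faster workers naturally take on more tasks at runtime rather than by a fixed up-front split.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Dynamic assignment: tasks are handed out one at a time as workers
// become free, instead of being partitioned a priori.
std::vector<long> run_dynamic(int num_tasks, int num_workers) {
    std::vector<long> result(num_tasks, 0);
    std::atomic<int> next_task{0};
    auto worker = [&] {
        while (true) {
            int t = next_task.fetch_add(1);        // claim a task at runtime
            if (t >= num_tasks) break;
            result[t] = static_cast<long>(t) * t;  // stand-in for real work
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < num_workers; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return result;
}
```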
@aakaskr: The solution certainly enables a high degree of concurrency when operating on the data structure. Whether the solution is in fact better than a coarse-grained locking solution depends on many factors, such as the number of contending threads and the overhead of manipulating the fine-grained locks. For example, in the transactional memory lecture, there is a slide showing an example where performance degrades when using fine-grained locks.
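One common fine-grained scheme, sketched here with illustrative names (this is not the lecture's code): a fixed-size hash set with one mutex per bucket, so operations on different buckets never contend on a single global lock.

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// "Lock striping": each bucket has its own mutex, so two threads
// touching different buckets proceed fully in parallel.
class StripedSet {
    static const std::size_t kBuckets = 16;
    std::list<int> buckets[kBuckets];
    std::mutex locks[kBuckets];
public:
    bool insert(int v) {
        std::size_t b = static_cast<std::size_t>(v) % kBuckets;
        std::lock_guard<std::mutex> g(locks[b]);  // lock only this bucket
        for (int x : buckets[b]) if (x == v) return false;
        buckets[b].push_back(v);
        return true;
    }
    bool contains(int v) {
        std::size_t b = static_cast<std::size_t>(v) % kBuckets;
        std::lock_guard<std::mutex> g(locks[b]);
        for (int x : buckets[b]) if (x == v) return true;
        return false;
    }
};
```

Whether 16 finer locks beat one coarse lock depends, as noted above, on contention and on the per-operation cost of the extra locking.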
@ToBeContinued. You're comparing inter-chip communication via QPI to inter-core communication on the same chip via the T2 crossbar. That's not a fair comparison. A more fair comparison would be to compare the crossbar's bandwidth to the ring bus bandwidth on a modern Intel CPU.
@ypk: Could you clarify what you meant? Executing the same instruction for all CUDA threads in a warp on SIMD ALUs enables parallel execution, but does not hide latency. The processor however interleaves execution of many warps to hide large latency operations performed by any one warp.
They already do! Most modern CPUs have special purpose logic for audio/video playback, data compression, 3D graphics, etc. However, the degree of specialization beyond these widely used tasks is less than in mobile systems since the main CPU is certainly powerful enough to handle most jobs.
This code suffers from poor locality and the performance can be improved using a segmented scan. In a segmented scan, we break up the input array into small contiguous blocks and perform a scan of each of them. Though it performs more work, it can lead to better performance due to better utilization of the cache.
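A sketch of the blocked approach described above (the helper name and block-size parameter are my own): scan each contiguous block independently, which keeps the working set in cache, then add each block's running offset to turn the per-block scans into a full inclusive scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Segmented (blocked) inclusive prefix sum.
std::vector<int> segmented_scan(const std::vector<int>& in, std::size_t block) {
    std::vector<int> out(in.size());
    // Phase 1: independent inclusive scan of each block.
    for (std::size_t start = 0; start < in.size(); start += block) {
        std::size_t end = std::min(start + block, in.size());
        int acc = 0;
        for (std::size_t i = start; i < end; ++i) {
            acc += in[i];
            out[i] = acc;
        }
    }
    // Phase 2: propagate each block's cumulative total into later blocks.
    int offset = 0;
    for (std::size_t start = 0; start < in.size(); start += block) {
        std::size_t end = std::min(start + block, in.size());
        for (std::size_t i = start; i < end; ++i) out[i] += offset;
        offset = out[end - 1];
    }
    return out;
}
```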
All networks that we have considered except for crossbar are blocking.
Disregarding the efficiency computation, I've read about large multi-core chips that monitor core utilization and power down cores that are not needed. This is essentially a form of load balancing, and I wonder whether it could be put to use in GPUs, where there is definitely a large number of cores, not all of which may be doing work at any given time.
Further, to improve GPU utilization, new NVIDIA GPUs have the ability to run more than a single kernel at once. This can make better use of the GPU, since a single kernel no longer ties up the whole device.
Because a tree with n vertices has (n-1) edges, the cost for all links being one unit is O(n).
The second result you can get from the recurrence C(n) = 2C(n/2) + O(n), because this is a balanced tree; the recurrence solves to O(n log n).
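Unrolling the recurrence (a standard expansion, written here with an arbitrary constant $c$ standing in for the $O(n)$ term) shows where that bound comes from:

```latex
\begin{align*}
C(n) &= 2\,C(n/2) + cn \\
     &= 4\,C(n/4) + 2cn \\
     &= 2^k\,C(n/2^k) + k\,cn \\
     &= n\,C(1) + cn\log_2 n && \text{taking } k = \log_2 n \\
     &= O(n \log n).
\end{align*}
```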
In this implementation each thread uses locks to ensure it has exclusive access to its prev and cur nodes. This is a much better solution than maintaining exclusive access to the entire linked list for a single thread.
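A sketch of that hand-over-hand ("lock coupling") traversal, assuming a singly linked list where every node carries its own mutex (the node layout and function are illustrative, not the lecture's code). The key invariant: a thread acquires the lock on `cur` before releasing the lock on `prev`.

```cpp
#include <mutex>

struct Node {
    int value;
    Node* next;
    std::mutex m;  // per-node lock for fine-grained traversal
};

// Returns true if `target` is present, holding at most two node locks
// at a time and never releasing prev before cur is locked.
bool contains(Node* head, int target) {
    if (!head) return false;
    Node* prev = head;
    prev->m.lock();
    if (prev->value == target) { prev->m.unlock(); return true; }
    Node* cur = prev->next;    // read next while holding prev's lock
    while (cur) {
        cur->m.lock();         // acquire next before releasing previous
        prev->m.unlock();
        if (cur->value == target) { cur->m.unlock(); return true; }
        prev = cur;
        cur = cur->next;
    }
    prev->m.unlock();
    return false;
}
```

Because all threads lock nodes in list order, no deadlock is possible, yet threads in different parts of the list proceed concurrently.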
Could future energy-efficient, general-purpose home computers also have specialized processors for specific tasks?
Transferring data between different chips is challenging; we need to take care of the communication overhead.
Even when we use a distributed work queue, a good schedule still matters, so that the overhead of redistributing work stays low.
I'm not sure how this could be better than a ticket lock implementation. Under high contention, you would expect that with a ticket lock, cached values keep getting updated as long as more contenders for the lock arrive. Since threads will need to wait no matter what, the cache coherence traffic can happen in the background to hide this latency. That, plus the fact that you need to allocate room for an array, seems like a waste when ticket locks work so simply.
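For reference, a minimal sketch of the ticket lock being compared against (the struct and demo function are illustrative): each arriving thread takes the next ticket, then spins until `now_serving` reaches it. Only two shared words are needed, versus one array slot per thread for an array-based queue lock.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
    void lock() {
        unsigned my = next_ticket.fetch_add(1);   // take a ticket
        while (now_serving.load(std::memory_order_acquire) != my) {
            // spin; all waiters read the same line until the holder writes it
        }
    }
    void unlock() {
        now_serving.fetch_add(1, std::memory_order_release);
    }
};

// Small demo: increment a shared counter under the lock.
long locked_count(int threads, int iters) {
    TicketLock lk;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```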
Interleaving warp execution makes it possible for the GPU to hide latency.
An advantage of this design is that each node is directly connected to its $n$ nearest neighbors, but a disadvantage is that the machine needs exactly $2^n$ processors for the configuration to work. Machines don't typically come in exactly these sizes, so using this configuration might require more or fewer CPUs than you would otherwise choose.
Still, even with great performance gains, you lose a whole lot of the die to just the interconnect.
@fkc9001: It doesn't look like you need to know who actually did the writes and reads. If a conflict is detected, then the read/write bits on all cached data are cleared, and I assume that the threads whose transactions conflict are dealt with in software.
This might change from policy to policy. I'm not sure how lazy versioning and eager versioning would differ in this example. Since all writes and reads are logged as bits, undoing a log amounts to clearing those bits. Maybe it's a difference in cache coherence?
Isn't locking the parent node necessary though? In order to lock the child node, you have to have access to the parent data structure. Therefore the parent node must be locked so that you can safely lock the child and move on.
Does eager versioning allow other processors to see the values set during this transaction? For instance, if CPU 1 initiates a transaction and writes to A, then CPU 2 reads A, will it receive the old or new value of A?
Also, keeping logs in the cache can hurt the performance of the rest of the system. If your transactions are large enough to fill up cache lines, you may need to spill to memory in order to both preserve the log and allow other threads to continue their work.
Optimistic detection does have the advantage that someone is guaranteed to be able to commit changes. When there is high contention, though, it's best to use a lock, since there will always be work done by some thread that ends up being rolled back.
Locks are useful for protecting read and write safety. By locking an array or a pointer, the user can guarantee that only one thread can access the critical section at a time. One must be careful when using locks, as an improper locking order can cause a deadlock, after which no progress can be made.
Barriers are useful when there are multiple highly parallelizable tasks that occur sequentially. All of the threads can be used on the first task; then a barrier is put in place to make sure that every thread reaches that point; then all the threads can be used for the second task as well. A barrier is necessary in some cases because the user needs to know that their critical data, perhaps an array, will no longer be modified.
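One way the two-phase pattern above might be coded (a minimal sketch; real code would likely use `pthread_barrier_t` or C++20 `std::barrier`): a reusable barrier built from a mutex and a condition variable, plus a small demo where each thread fills one slot in phase 1 and only sums the slots in phase 2 after both writes are done.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class Barrier {
    std::mutex m;
    std::condition_variable cv;
    const int count;
    int waiting = 0;
    int generation = 0;  // lets the barrier be reused across phases
public:
    explicit Barrier(int n) : count(n) {}
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m);
        int gen = generation;
        if (++waiting == count) {   // last thread to arrive
            waiting = 0;
            ++generation;           // release the whole group
            cv.notify_all();
        } else {
            cv.wait(lk, [&] { return gen != generation; });
        }
    }
};

// Demo: without the barrier, a thread could read the other slot
// before it was written.
int two_phase_demo() {
    Barrier b(2);
    int a[2] = {0, 0}, sum[2] = {0, 0};
    auto job = [&](int id) {
        a[id] = id + 1;         // phase 1: produce
        b.arrive_and_wait();    // wait for both phase-1 writes
        sum[id] = a[0] + a[1];  // phase 2: consume both slots
    };
    std::thread t0(job, 0), t1(job, 1);
    t0.join();
    t1.join();
    return sum[0] + sum[1];
}
```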
In the second diagram it is the user's job to control what lands in a cache line. The cache line itself stays the same, but instead of holding [(0,0),(0,1),(0,2),(0,3)], it should hold [(0,0),(0,1),(1,0),(1,1)]. Therefore, the user must store their values in the array in a clever order to take advantage of spatial locality within the cache line.
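A sketch of the index remapping that produces the blocked layout above (the function name and the assumption of a 4-element line holding one 2x2 tile are my own): each 2x2 tile of the matrix is stored contiguously, so [(0,0),(0,1),(1,0),(1,1)] occupy the first four slots.

```cpp
#include <cstddef>

// Map (row, col) of a width-wide matrix (width even) to a layout where
// every 2x2 tile is contiguous in memory.
std::size_t blocked_index(std::size_t row, std::size_t col, std::size_t width) {
    std::size_t tile_row = row / 2, tile_col = col / 2;
    std::size_t tiles_per_row = width / 2;
    std::size_t tile_id = tile_row * tiles_per_row + tile_col;
    std::size_t within = (row % 2) * 2 + (col % 2);  // position inside tile
    return tile_id * 4 + within;
}
```

With `width = 4`, the first cache line holds exactly (0,0), (0,1), (1,0), (1,1), whereas row-major order would put (0,2) and (0,3) there instead.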
Is consistency between the master and slave databases a bottleneck of any sort? I assume that since there is mostly static content on a webpage, then keeping databases consistent is not difficult and can be done without much interruption to site availability.
Does masking actually degrade performance? I'm not sure I see why this would necessarily be the case.
As shown by this slide, there really isn't any consensus with respect to which implementation of transactional memory is best. This is because the answer to that question is very workload dependent. For example, the contention around the transaction plays a big role in whether an optimistic or pessimistic approach is better. I imagine that this makes it extremely difficult to implement any sort of transactional memory that can be used by a programmer at a high level and still give peak performance in most programs.
The reason you do not want random writes is that flash handles them poorly. Everything stored in a node is written as a sequential log to avoid these issues.
I'm a little confused here. How does the cache keep track of which operating system thread is running and making a transaction? In other words, how does the cache know whose read/write bits are being set? These caches are per-core/per-processor, so I'm not sure how you would be able to save all of the read/write bits per cache line on every context switch.
Unfortunately, the issues we've discussed in previous lectures that arise with parallelism are only amplified in the context of minimizing energy usage. In order to handle the same workload with wimpy nodes, you need a bigger cluster. Now, for example, there is a lower tolerance for error in the load balancing because a single node gets overwhelmed more easily. Any amount of communication between nodes becomes a more likely cause of a bottleneck. Since, in the context of a webserver, there's really no acceptable relaxation in the required latency, the software needs to be smarter.
There is an optimization for caching where the critical word is brought into a cache line first, then the rest of the data is loaded some time later. I wonder if there is a version of this where the bus traffic would happen to each processor a bit at a time, rather than in one large chunk.
Would prefetching be considered an improvement that decreases the cache miss rate? I think this would fall under the compiler's job, but it still might be useful. Also, having cache lines that are too small can lead to more misses. For instance, if you are iterating through an array of integers and your cache line only has room for 2 contiguous integers, you incur a cache miss on every other integer.
SIMD would be even slower when doing divergent execution, since it needs to run both branch paths and mask the results.
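A scalar emulation of what masking does (illustrative only; real code would use intrinsics or a compiler like ISPC): every lane evaluates both sides of the branch, and a per-lane mask blends the results, which is exactly why divergent execution pays for both paths.

```cpp
#include <cstddef>
#include <vector>

// Emulates per-lane masked execution of `out = (x < 0) ? -x : x`.
std::vector<int> masked_abs(const std::vector<int>& x) {
    std::vector<int> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        bool mask = x[i] < 0;     // per-lane condition
        int if_true = -x[i];      // "then" path, computed for every lane
        int if_false = x[i];      // "else" path, also computed for every lane
        out[i] = mask ? if_true : if_false;  // blend by mask
    }
    return out;
}
```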
^ Actually, busy waiting won't increase bus load, because all of the spinning processors would be in the S state anyway until there is a write.
I asked this in lecture but I'm still curious: if Intel scales up this arrangement to 8 cores, would it be worth the silicon to have two buses that go in opposite directions? (e.g. one set of buses to communicate clockwise, and another set to go counterclockwise)
I think that part of the answer to some of the questions raised above is because the type of parallel hardware we've been discussing is both relatively new and constantly changing.
As far as completeness and productivity go, this was something that could be reasonably well addressed 10 years ago (and even further in the past) when mainstream computing consisted of single-core processors. Complete programming languages have existed "forever" and even languages like Python have been around since 1991 (source: Wikipedia).
On the other hand, there hasn't been the same level of interest in high performance computing in the parallel world for as long. For example, CUDA has only been around since 2007 (source: Wikipedia). So, perhaps with more time we will be able to produce a language that masters all three categories. Until we can do that, it looks like one of the three needs to be sacrificed.
In my opinion, the biggest conflict comes between productivity and high performance. The less effort the programmer puts into making a program high-performance, the more of the work has to be made up by the compiler/interpreter. DSLs are useful because they mean the compiler/interpreter only needs to focus on optimizing a small subset of tasks.
ISPC tasks are more abstract: they do not specify how each task is assigned to processor cores, only telling the compiler which parts can run in parallel. Pthreads are more concrete: the programmer needs to specify how work is mapped to threads.
Synchronization and communication costs add up even when each individual operation has low overhead. Therefore, static assignment is preferred in most situations.
Since GPUs devote more silicon to arithmetic logic than to control and instruction supply, running an instruction on the GPU is most cost-effective when the same instruction can run on multiple data.
Remember that in divergent execution, SIMD can actually be slower than regular serial execution if the compiler compiles for SIMD at a higher width than the hardware can handle. We saw this in Assignment 1. For example, let's say the vector width of the hardware is 2 and the compiler is using a width of 4. If the first 3 elements take 1 cycle and the last one takes 8, then we will be using 9 cycles on the first two elements and 9 cycles on the last two elements, because for the purposes of divergence, the width is 4, even if the hardware can only do 2 elements at a time. So, it takes 18 cycles this way. But if we did it sequentially, it would only take 11.
The power wall matters for everything from mobile phones to supercomputers. For supercomputers, heat generated is heat that needs to be removed via air-conditioning systems, which have their own power cost. Mobile phones must avoid building up heat, because no one wants to hold a burning brick in their hand or next to their ear.
Amusingly enough, the Xbox 360's 10 MB of eDRAM isn't enough to hold a full 1280x720 image. Here's a post that's rather critical of the 360's eDRAM.
Overall, this seems to reflect the issues we've seen related to the increasing use of heterogeneous hardware. One of these concerns is that it makes software that takes full advantage of the hardware more difficult to write. Another is that balancing the hardware (in this case, the choice of 10 MB) is a very important decision. This is similar to the issues discussed in (lecture 22, slide 28).
An example of the first point here is solving WSP (or anything else) using OpenMP on each individual machine, and then using MPI within a cluster to communicate between machines. OpenMP is strictly a shared address space model whereas MPI is strictly a message passing model, but they can be used together as a hybrid model.