@elemental03 If you are asking about Amazon EC2, I think the answer is yes. EC2 has a mechanism called Auto-Scaling Group, which would automatically add or shut down servers behind the Elastic Load Balancer based on our customized rules.
@woojoonk I don't think it is reasonable to use coarse locks because it prevents the balanced tree algorithm from achieving any kind of speedup regardless of how many processors we use.
On a single shared bus, even though we use a tree based combining strategy, we can not parallelize the operations in different subtrees because of the contention of the bus, so there is not benefit using this strategy. It's even worse since the tree based combining strategy is much more complicated.
@chaihf, why is the latency almost O(1)?
Still not sure what rearrangeable non-blocking means. Can anyone clarify this concept?
There seem to be a variety of ways to implement the "neutral processor", and implementations have different advantages depending on number of particles and processors. A related paper: http://iopscience.iop.org/1742-6596/16/1/041/pdf/1742-6596_16_1_041.pdf
I don't see why it would be harder for the compiler to see when to sync all the threads if we do not leave the main thread idle. I think whether case 2 is bad depends on the task.
@elemental03 It is a logical unit of storage, but hardware-dependent. Could be something like a cache. Usually of medium size. http://en.wikipedia.org/wiki/Memory_bank
Is it possible to sort edges by value (or some key in general)? Does sort by source vertex id have other impacts? One thing I can think of is that destination vertices can be loaded faster but that does not seem to apply to sparse graph.
@yrkumar I think important aspect to consider is the compiler. On the one hand, the compiler may apply more optimization as scala is functional and therefore side-effect free. On the other hand, this may add more (possibly unnecessary) constraints and generated less efficient code.
power analysis of phones are complicated as the same code can have different effects across different devices. Apps that behave well individually can have problems when run in parallel. There are also non-technical challenges. For example market study suggested a large amount of power consumed by apps were used in powering 3rd party ads.
source: http://sites.ieee.org/isgt/files/2013/03/Stavrou6C.pdf
Another practical implementation of memory compression for GPUs is for the frame buffer. Intel GPUs support "framebuffer compression." The observation is that parts of the screen have a contiguous color, and that most regions of the screen do not update very often. To take advantage of this, framebuffer compression uses run-length encoding to compress each row in the framebuffer (this is the same compression scheme GIFs use). As rows on the screen change, the compressed version of the modified row is not an accurate representation of the actual row, so the GPU uses the uncompressed copy of the row to draw that row on the screen. Otherwise, the GPU uses the compressed version, saving bandwidth, and thus energy. When too many rows have changed, the GPU recompresses the entire framebuffer and all compressesed rows are valid again.
There is research into new types of RAM such as MRAM, which is magnetic RAM and is non-volatile, so no refresh is required and no power is needed to retain the contents. This has implications for instant hibernation and power management.
I think that it is important to note that the ssd and hard drive data have different scales for the Y-axis. Otherwise it looks like that the hard drive is faster than the SSD, but the SSD is actually about twice as fast as the hard drive.
In certain cases when not too many threads are competing for resources, using a global lock data structure is faster than using a fine-grained lock data structure because a fine-grained lock data structure has more overhead.
I was wondering if there is any exception to the above rule? because it makes sense that this rule would hold for all applicances. All handheld devices, laptops etc
Just wanted to note that threads steal from random cores, not nearby ones.
The reason why is explained in this faq on the main site for Cilk. They also answer some other interesting queries one might have such as why they use a dequeue rather than a FIFO queue :
I'm not sure if this was mentioned in the lecture but what are banks exactly?
The below is a pdf to a paper by the Stanford developers of Green-Marl, explaining the DSL fully. The optimizations, the type of data structures and the implementation fully.
There are many ways to even further optimize the performance in openGL. This article goes into some of the basics :
For those who use Macbook Pro 2011 and and other Macbook with multiple GPUs, the following article will be useful when programming in openGL. It tells openGL developers all they need to know in order to get their openGL code to render properly on multiple GPUs:
Besides the synchronized HashMap, Java also has a ConcurrentHashMap class that is another example of thread-safe hashmap implementation. Both maps are thread-safe implementations of the Map interface. ConcurrentHashMap, however, is implemented for higher throughput in cases where high concurrency is expected.
The following article has a good explanation of the implementation of ConcurrentHashMap, talking further about how it provides better concurrency while still being thread-safe:
ConcurrentHashMap implementation
This article compares the performance of Hashmap, synchronized hashmap and concurrent hashmap:
A split-transaction bus means that when a CPU places a memory request on the bus, that CPU immediately releases the bus, so that other entities can use the bus while the memory request is pending, and this is why a split-transaction bus is non-atomic. When the memory request is complete, the memory module involved will then acquire the bus, place the result on the bus (the read value in the case of a read request, an acknowledgment in the case of a write request), and in most implementations, also place on the bus the ID number of the CPU that had made the request. The latter (the CPU with that ID) will see that the memory’s response is for it and accept it.
Hand-over-hand locking means that we can't unlock(node) until we lock(node->next). To see why this is needed, suppose we unlock(5) prior to lock(10). There is now a point at which 5 and 10 are both free, which enables someone else to successfully delete(10), before we are able to traverse the list and delete 11.
I think declarative code would be more portable and much easier to write. SQL is an example of a declarative language.
Wraparound links are one way in which engineering complexity can be traded off for better performance (i.e., smaller average hop distance).
There have been proposals for more complicated networks based on the torus. One example is the iBT network (research paper, picture). iBT adds more links to the torus, further shortening diameter and average hop distance. On the other hand, it is almost certainly more costly to build in practice.
It's also discussed briefly in the lecture on heterogeneous and specialized processing.
According to this (old?) arstechnica article, increasing the pin count not only increases the cost of manufacturing memory, but also the cost of any connectors and also the motherboard connected to the memory. So increasing the pin count will increase the cost of several parts of the system.
It turns out that D.E. Shaw developed a machine called Anton specifically designed for computing molecular dynamics simulations. It uses the NT Method (and ASICs!) and is described here
Apple also includes the M7 coprocessor in order to collect data from the sensors, including while the phone is asleep, so that it can have data available to applications when the phone wakes up again without using the full amount of power (and battery) needed to run the full processor.
Line 21 should be int arrived = ++(b->arrive_counter)
.
The ABA problem is when the assumption is made that nothing was done because consecutive reads gave the same value. However, another thread could have modified the data, performed a computation, and then changed it back, so something did happen. We saw this problem when deleting elements from a linked list. A nice description can be found on the wikipedia page.
I believe lazy execution is waiting to evaluate an expression until the value is needed, and also tries to do sharing of values for repeated expressions as much as possible.
The stealing idea is similar to that of keeping a queue at the master level of assignment 4. Since we had I/O intensive work as well as compute intensive work,the idea of giving work to other thread somewhat resembles the homework problem.
I think the last bullet point is saying that there's a balance between breaking work up into many small pieces so that we can divide it more evenly, and keeping relatively large pieces of work so that we don't have to manage too many pieces while still balancing the workload.
@arjunh, I think the easiest way would be to simply abort one of them and restart while we let the other proceed to completion.
@nrchu I think the idea is that the write buffer flushing is done in hardware, so we can check it ahead of time and know it will never fail. Eager versioning has a fault tolerance problem not because its log keeping infrastructure is more complicated, but because it has to run untrusted code (i.e. your code), and if your code fails then it needs to go fix memory.
Optimization for certain work seems to be the reason for the relative high speed in camera.
@sluck As far as my research into quantum computation has gone, parallelism would not effect it very much, but distributed computation would. The largest factor against quantum computers right now is noise, which can render a computation useless. This noise can be created from literally anything getting into the computer. Therefore, these machines would likely need to be separated quite a bit, meaning that a distributed system of them is much more likely to exist.
However, when you can search an unsorted list in $O(\sqrt{n})$, anything is possible =P