What's Going On
kayvonf liked benchoi's comment on slide_035 of Heterogeneous Parallelism and Hardware Specialization ()

tianyih liked arjunh's comment on slide_007 of Synchronization ()

tianyih liked azeyara's comment on slide_021 of Heterogeneous Parallelism and Hardware Specialization ()

tianyih commented on slide_037 of Scaling a Web site ()

@elemental03 If you are asking about Amazon EC2, I think the answer is yes. EC2 has a mechanism called Auto-Scaling Group, which would automatically add or shut down servers behind the Elastic Load Balancer based on our customized rules.


tianyih liked DunkMaster's comment on slide_035 of Scaling a Web site ()

tianyih liked tomshen's comment on slide_035 of Transactional Memory ()

tianyih commented on slide_013 of Transactional Memory ()

@woojoonk I don't think it is reasonable to use coarse locks because it prevents the balanced tree algorithm from achieving any kind of speedup regardless of how many processors we use.


tianyih commented on slide_027 of Synchronization ()

On a single shared bus, even though we use a tree based combining strategy, we can not parallelize the operations in different subtrees because of the contention of the bus, so there is not benefit using this strategy. It's even worse since the tree based combining strategy is much more complicated.


tianyih commented on slide_026 of Interconnection Networks ()

@chaihf, why is the latency almost O(1)?


tianyih commented on slide_013 of Interconnection Networks ()

Still not sure what rearrangeable non-blocking means. Can anyone clarify this concept?


tianyih liked elemental03's comment on slide_012 of Interconnection Networks ()

crs commented on slide_012 of NT Method + Course Review ()

There seem to be a variety of ways to implement the "neutral processor", and implementations have different advantages depending on number of particles and processors. A related paper: http://iopscience.iop.org/1742-6596/16/1/041/pdf/1742-6596_16_1_041.pdf


I don't see why it would be harder for the compiler to see when to sync all the threads if we do not leave the main thread idle. I think whether case 2 is bad depends on the task.


crs commented on slide_011 of Addressing the Memory Wall ()

@elemental03 It is a logical unit of storage, but hardware-dependent. Could be something like a cache. Usually of medium size. http://en.wikipedia.org/wiki/Memory_bank


Is it possible to sort edges by value (or some key in general)? Does sort by source vertex id have other impacts? One thing I can think of is that destination vertices can be loaded faster but that does not seem to apply to sparse graph.


@yrkumar I think important aspect to consider is the compiler. On the one hand, the compiler may apply more optimization as scala is functional and therefore side-effect free. On the other hand, this may add more (possibly unnecessary) constraints and generated less efficient code.


power analysis of phones are complicated as the same code can have different effects across different devices. Apps that behave well individually can have problems when run in parallel. There are also non-technical challenges. For example market study suggested a large amount of power consumed by apps were used in powering 3rd party ads.

source: http://sites.ieee.org/isgt/files/2013/03/Stavrou6C.pdf


Q_Q commented on slide_031 of Addressing the Memory Wall ()

Another practical implementation of memory compression for GPUs is for the frame buffer. Intel GPUs support "framebuffer compression." The observation is that parts of the screen have a contiguous color, and that most regions of the screen do not update very often. To take advantage of this, framebuffer compression uses run-length encoding to compress each row in the framebuffer (this is the same compression scheme GIFs use). As rows on the screen change, the compressed version of the modified row is not an accurate representation of the actual row, so the GPU uses the uncompressed copy of the row to draw that row on the screen. Otherwise, the GPU uses the compressed version, saving bandwidth, and thus energy. When too many rows have changed, the GPU recompresses the entire framebuffer and all compressesed rows are valid again.


Q_Q commented on slide_005 of Addressing the Memory Wall ()

There is research into new types of RAM such as MRAM, which is magnetic RAM and is non-volatile, so no refresh is required and no power is needed to retain the contents. This has implications for instant hibernation and power management.


I think that it is important to note that the ssd and hard drive data have different scales for the Y-axis. Otherwise it looks like that the hard drive is faster than the SSD, but the SSD is actually about twice as fast as the hard drive.


In certain cases when not too many threads are competing for resources, using a global lock data structure is faster than using a fine-grained lock data structure because a fine-grained lock data structure has more overhead.


elemental03 commented on slide_027 of NT Method + Course Review ()

I was wondering if there is any exception to the above rule? because it makes sense that this rule would hold for all applicances. All handheld devices, laptops etc


elemental03 commented on slide_043 of Scheduling Fork-Join Parallelism ()

Just wanted to note that threads steal from random cores, not nearby ones.

The reason why is explained in this faq on the main site for Cilk. They also answer some other interesting queries one might have such as why they use a dequeue rather than a FIFO queue :

Cilk FAQ


elemental03 commented on slide_011 of Addressing the Memory Wall ()

I'm not sure if this was mentioned in the lecture but what are banks exactly?


elemental03 commented on slide_025 of Domain-Specific Programming on Graphs ()

The below is a pdf to a paper by the Stanford developers of Green-Marl, explaining the DSL fully. The optimizations, the type of data structures and the implementation fully.

Green-Marl: A DSL for Easy and Efficient Graph Analysis


elemental03 commented on slide_057 of Domain-Specific Programming Systems ()

There are many ways to even further optimize the performance in openGL. This article goes into some of the basics :

Basics of optimizing performance in openGL


For those who use Macbook Pro 2011 and and other Macbook with multiple GPUs, the following article will be useful when programming in openGL. It tells openGL developers all they need to know in order to get their openGL code to render properly on multiple GPUs:

Rendering openGL properly with multiple GPUS


elemental03 commented on slide_011 of Transactional Memory ()

Besides the synchronized HashMap, Java also has a ConcurrentHashMap class that is another example of thread-safe hashmap implementation. Both maps are thread-safe implementations of the Map interface. ConcurrentHashMap, however, is implemented for higher throughput in cases where high concurrency is expected.

The following article has a good explanation of the implementation of ConcurrentHashMap, talking further about how it provides better concurrency while still being thread-safe:

ConcurrentHashMap implementation

This article compares the performance of Hashmap, synchronized hashmap and concurrent hashmap:

Comparisions


A split-transaction bus means that when a CPU places a memory request on the bus, that CPU immediately releases the bus, so that other entities can use the bus while the memory request is pending, and this is why a split-transaction bus is non-atomic. When the memory request is complete, the memory module involved will then acquire the bus, place the result on the bus (the read value in the case of a read request, an acknowledgment in the case of a write request), and in most implementations, also place on the bus the ID number of the CPU that had made the request. The latter (the CPU with that ID) will see that the memory’s response is for it and accept it.


Hand-over-hand locking means that we can't unlock(node) until we lock(node->next). To see why this is needed, suppose we unlock(5) prior to lock(10). There is now a point at which 5 and 10 are both free, which enables someone else to successfully delete(10), before we are able to traverse the list and delete 11.


tcz commented on slide_007 of Transactional Memory ()

I think declarative code would be more portable and much easier to write. SQL is an example of a declarative language.


jon commented on slide_022 of Interconnection Networks ()

Wraparound links are one way in which engineering complexity can be traded off for better performance (i.e., smaller average hop distance).

There have been proposals for more complicated networks based on the torus. One example is the iBT network (research paper, picture). iBT adds more links to the torus, further shortening diameter and average hop distance. On the other hand, it is almost certainly more costly to build in practice.


kayvonf commented on slide_004 of NT Method + Course Review ()

It's also discussed briefly in the lecture on heterogeneous and specialized processing.


sluck commented on slide_012 of Addressing the Memory Wall ()

According to this (old?) arstechnica article, increasing the pin count not only increases the cost of manufacturing memory, but also the cost of any connectors and also the motherboard connected to the memory. So increasing the pin count will increase the cost of several parts of the system.


sluck commented on slide_004 of NT Method + Course Review ()

It turns out that D.E. Shaw developed a machine called Anton specifically designed for computing molecular dynamics simulations. It uses the NT Method (and ASICs!) and is described here


Apple also includes the M7 coprocessor in order to collect data from the sensors, including while the phone is asleep, so that it can have data available to applications when the phone wakes up again without using the full amount of power (and battery) needed to run the full processor.


Yuyang liked kayvonf's comment on slide_047 of Transactional Memory ()

wcrichto liked Yuyang's comment on slide_035 of A Basic Snooping-Based Multi-processor ()

ron commented on slide_024 of Synchronization ()

Line 21 should be int arrived = ++(b->arrive_counter).


Yuyang liked arjunh's comment on slide_028 of Interconnection Networks ()

mchoquet liked retterermoore's comment on slide_035 of Interconnection Networks ()

jhhardin commented on slide_018 of NT Method + Course Review ()

The ABA problem is when the assumption is made that nothing was done because consecutive reads gave the same value. However, another thread could have modified the data, performed a computation, and then changed it back, so something did happen. We saw this problem when deleting elements from a linked list. A nice description can be found on the wikipedia page.


I believe lazy execution is waiting to evaluate an expression until the value is needed, and also tries to do sharing of values for repeated expressions as much as possible.


woojoonk commented on slide_018 of Scheduling Fork-Join Parallelism ()

The stealing idea is similar to that of keeping a queue at the master level of assignment 4. Since we had I/O intensive work as well as compute intensive work,the idea of giving work to other thread somewhat resembles the homework problem.


jhhardin commented on slide_010 of Scheduling Fork-Join Parallelism ()

I think the last bullet point is saying that there's a balance between breaking work up into many small pieces so that we can divide it more evenly, and keeping relatively large pieces of work so that we don't have to manage too many pieces while still balancing the workload.


rokhinip commented on slide_047 of Transactional Memory ()

@arjunh, I think the easiest way would be to simply abort one of them and restart while we let the other proceed to completion.



mchoquet commented on slide_044 of Transactional Memory ()

@nrchu I think the idea is that the write buffer flushing is done in hardware, so we can check it ahead of time and know it will never fail. Eager versioning has a fault tolerance problem not because its log keeping infrastructure is more complicated, but because it has to run untrusted code (i.e. your code), and if your code fails then it needs to go fix memory.


Optimization for certain work seems to be the reason for the relative high speed in camera.


pinkertonpg commented on slide_018 of Fine-Grained Sync and Lock-Free Structures ()

@woojoonk I think you mean decrease the throughput.