What's Going On
dasteere commented on slide_029 of Scaling a Web Site ()

Wouldn't the traffic of many websites follow similar trends, and therefore need more resources at roughly the same times? If so, Amazon now has the problem of holding all of these unused resources instead of the individual websites each holding extra capacity. It seems like shifting the problem to somebody else, but maybe it is better to have that problem centralized instead of everybody worrying about it individually.


Do ASICs also use relaxed memory consistency for use cases that don't need strict ordering guarantees? That seems like a good way to improve efficiency for very specific applications at the hardware level.


dasteere commented on slide_024 of Addressing the Memory Wall ()

What is the process for reading one 64-byte line if it is not in DRAM? Will any request that goes to disk be cached in DRAM, or is there some policy that determines what stays in DRAM?


dasteere commented on slide_073 of GPU Architecture and CUDA Programming ()

If kernel calls are asynchronous would it be possible to schedule a large kernel call and then start work on the CPU at the same time? This would result in both the CPU and the GPU being utilized simultaneously which seems like a great option.
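
Yes, in CUDA the kernel launch returns immediately and a later synchronization call waits for the device. A plain C++ sketch of that launch-then-overlap pattern, using `std::async` as a stand-in for the asynchronous kernel launch (all names here are hypothetical, not CUDA API calls):

```cpp
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for the kernel: sum the input on the "device".
long long gpu_work(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

long long overlap(const std::vector<int>& gpu_data, const std::vector<int>& cpu_data) {
    // Launch the "kernel" asynchronously (like kernel<<<...>>>(...) returning immediately).
    std::future<long long> pending =
        std::async(std::launch::async, gpu_work, std::cref(gpu_data));
    // The CPU does useful work while the "kernel" runs.
    long long cpu_sum = std::accumulate(cpu_data.begin(), cpu_data.end(), 0LL);
    // Wait for the device result (like cudaDeviceSynchronize()).
    return cpu_sum + pending.get();
}
```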


username commented on slide_035 of Memory Consistency (+ Course Review) ()

Remembering the freeway example: throughput is the width of the freeway and latency is the speed limit. If our program has only one thing running (one car on the freeway), increasing the bandwidth (widening the freeway) won't really help; we would want to decrease the latency (raise the speed limit). However, if we have a lot of processes running in parallel (a ton of cars on the freeway), we are usually better off increasing the bandwidth: making the freeway wider reduces congestion, whereas raising the speed limit helps but is less effective if there simply isn't enough room on the freeway for all the cars.


username commented on slide_025 of Domain-Specific Programming Systems ()

1) Remembering Amdahl's law, we know that even a small amount of sequential computation can severely limit the speedup of an otherwise parallel program.

2) From 213 Cache Lab we know that having relevant data in the cache avoids the latency of going all the way to main memory on each load.

3) Sometimes the overhead of avoiding synchronization or implementing fine-grained locking can actually be greater than the cost of having an atomic section in the code. Thus, trying the simplest solution first is a good approach.
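
Point 1 can be made concrete with Amdahl's formula, speedup = 1 / ((1 - p) + p/n), where p is the parallelizable fraction and n the processor count. A small sketch (the helper name is my own):

```cpp
// Speedup predicted by Amdahl's law for a program whose fraction p is
// parallelizable, run on n processors. As n grows, speedup is bounded
// by 1 / (1 - p), no matter how many processors we add.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

Even with 90% of the work parallelizable, 10 processors give only about a 5.3x speedup.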


rrp123 commented on slide_029 of Parallel Deep Network Training ()

This is very inefficient since we need to send the updated params to everyone, every time any node is done.


It's generally useful to have domain knowledge as well.


If one processor fails, why does that entire computation get invalidated? Why doesn't the work from the processor that failed simply get redistributed to the rest of the processors?


themj commented on slide_023 of Addressing the Memory Wall ()

In this case, once you load one (row, column), you get a 64-byte cache line, which is the desired result in most cases. The same effect can be reached by interleaving the bytes more coarsely (up to 8 bytes on each DRAM chip).


themj commented on slide_023 of Parallel Deep Network Training ()

Generally, you check if the loss is too high by computing the difference between the current solution and the desired solution. If this difference is above a predetermined threshold, then the loss is considered too high.
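
A minimal sketch of that check, with a hypothetical squared-error loss and threshold (the names and the choice of loss function are illustrative, not from any particular framework):

```cpp
// Hypothetical loss: squared distance between current and desired output.
double loss(double current, double desired) {
    double d = current - desired;
    return d * d;
}

// The loss is "too high" if it exceeds a predetermined threshold.
bool loss_too_high(double current, double desired, double threshold) {
    return loss(current, desired) > threshold;
}
```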


For both A and B, you are loading in an entire vector of size SIMD_WIDTH allowing you to take advantage of SIMD-level parallelism. Why do you get SIMD_WIDTH*SIMD_WIDTH parallelism for C?


@jedi I think that busy waiting would still draw power, since the core stays active while spinning. However, since each spin iteration does so little work, I doubt that it would drain the battery all that quickly.


@williamx I believe functional languages would lean more toward the C/C++ side. The most common functional languages (such as SML/Haskell/OCaml) are extremely powerful and you can develop nearly anything with them, but their learning curves are extremely steep.


themj commented on slide_006 of Transactional Memory ()

How does the underlying system know what the optimal way of guaranteeing atomicity is?


A non-CS example of the ABA problem is if you're driving and stopped at a red light. Then, you turn to talk to a friend and then later turn back to see that the light is still red. You think that the light hasn't changed but, while you were turned, the light actually turned green and back to red.
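
The traffic-light story maps directly onto compare-and-swap: a CAS compares only values, so it cannot see intermediate changes. A small demonstration (encoding red as 0 and green as 1 is my own choice):

```cpp
#include <atomic>

// Demonstrates the ABA hazard: a CAS cannot tell that the value changed
// from A to B and back to A while we were "looking away".
bool aba_cas_succeeds() {
    std::atomic<int> light{0};      // 0 = red (value A)
    int observed = light.load();    // driver sees red, then turns away
    light.store(1);                 // light turns green (B)...
    light.store(0);                 // ...and back to red (A)
    // The CAS succeeds even though the light changed twice in between.
    return light.compare_exchange_strong(observed, 0);
}
```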


themj commented on slide_033 of Implementing Synchronization ()

Why do atomic operations perform better/faster than locks? Is it because they are specialized to specific types of operations?
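
For comparison, here is the same shared counter incremented both ways: the atomic version compiles down to a single hardware read-modify-write per increment, while the lock version pays a full acquire and release around each plain add (function names are illustrative):

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// One hardware atomic per increment (e.g. a lock-prefixed add on x86).
long long count_with_atomics(int threads, int iters) {
    std::atomic<long long> counter{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    return counter.load();
}

// A lock acquire + release wrapped around each plain increment.
long long count_with_mutex(int threads, int iters) {
    long long counter = 0;
    std::mutex m;
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> g(m);
                ++counter;
            }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```

Both produce the same final count; the atomic version is cheaper precisely because it is specialized to one small operation rather than protecting an arbitrary critical section.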


Most current systems lie in the first circled region, so we can generally optimize database throughput for that regime. For high-performance computing and supercomputers with many more cores, it is important to consider database performance in the middle regions of the graph.


The three strings that can be printed are: "", "Hello", and "World", since the writes propagate immediately.


The deadlock in this situation could have been avoided by prioritizing some requests over others. In this case, the BusRdX request would be prioritized over the BusRd request, so the processor would either service the incoming BusRd request or override it with its own BusRdX request, and the other processor could then resend its BusRd request.


themj commented on slide_042 of Directory-Based Cache Coherence ()

How does P0 know that the information it is receiving is accurate, since it is now receiving responses from a different processor?


themj commented on slide_017 of Snooping-Based Cache Coherence ()

Shared caches can have some benefits, especially if all of the processors are accessing the same data. Then one processor can pre-fetch some lines and all of the processors benefit.


For assignment 2, it would have been useful to scale down, since that would have kept the circle-to-grid ratio the same. Since that ratio was where the main parallelism came from, scaling down wouldn't have changed the dynamics of the problem.


Materializing the RDD means that all the data from that RDD must be stored in memory before the next RDD can proceed with its work.


username commented on slide_057 of Parallel Programming Abstractions ()

Even though we don't need locks in the message passing model, we need to be careful that we don't change the value after sending it if it's an asynchronous send, and that we don't immediately try reading the value on an asynchronous receive.


The time it takes to recalculate the lost data depends on the number of steps in the pipeline.


The Winograd algorithm is based on the premise that a number can be uniquely reconstructed from its remainders with respect to a given set of moduli, assuming the moduli are relatively prime (the Chinese Remainder Theorem).
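
A brute-force sketch of that unique reconstruction for two relatively prime moduli (the helper name is hypothetical; real implementations compute this directly rather than by search):

```cpp
// Reconstruct x (0 <= x < m1*m2) from its remainders modulo two
// relatively prime moduli, by exhaustive search. The Chinese Remainder
// Theorem guarantees exactly one such x exists in that range.
int crt_reconstruct(int r1, int m1, int r2, int m2) {
    for (int x = 0; x < m1 * m2; ++x)
        if (x % m1 == r1 && x % m2 == r2)
            return x;   // unique because m1 and m2 are coprime
    return -1;          // unreachable for valid coprime inputs
}
```

For example, the unique number below 15 with remainder 2 mod 3 and remainder 3 mod 5 is 8.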


rrp123 commented on slide_020 of Addressing the Memory Wall ()

This is a very inefficient use of the 64-bit bus, since at any point in time we are only getting 8 bits of information out of the 64 possible bits. If we were trying to fill a 64-byte cache line, it would take 64 cycles instead of the 8 cycles we would get if we were using all 8 DRAM chips.
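
A quick check of that arithmetic (hypothetical helper; assumes every bus cycle delivers `bits_per_cycle` useful bits):

```cpp
// Cycles needed to fill a cache line of line_bytes bytes when each bus
// cycle carries bits_per_cycle useful bits.
int cycles_to_fill(int line_bytes, int bits_per_cycle) {
    return line_bytes * 8 / bits_per_cycle;
}
```

64 bytes at 8 useful bits per cycle takes 64 cycles; using the full 64-bit bus takes 8.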


In this case, we will need all of RDD A to be materialized in memory, since RDD B needs all of RDD A to be built.


Always try the simplest approach first: using an "atomic" statement for critical sections. It may very well be the case that fine-grained locking or lock-free data structures carry excess overhead, leading to worse performance than a simpler locking algorithm.


username commented on slide_010 of Efficiently Evaluating Deep Networks ()

This type of convolution is similar to a convolution with a signal of length 9 in signal processing; however, this process can be made faster by taking an FFT of each signal, multiplying the two transforms (instead of convolving), and then taking the inverse FFT. This tends to be the better option when the signals are large enough that the savings overcome the overhead of the forward and inverse FFTs.
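
A sketch of that transform-then-multiply route, using a plain O(n^2) DFT in place of a real FFT just to show the two routes agree (all names are illustrative; a real implementation would call an FFT library):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Textbook discrete Fourier transform (O(n^2)); invert=true gives the inverse.
std::vector<cd> dft(const std::vector<cd>& a, bool invert) {
    const double pi = std::acos(-1.0);
    size_t n = a.size();
    std::vector<cd> out(n);
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j) {
            double ang = 2.0 * pi * k * j / n * (invert ? 1.0 : -1.0);
            out[k] += a[j] * cd(std::cos(ang), std::sin(ang));
        }
    if (invert)
        for (auto& x : out) x /= static_cast<double>(n);
    return out;
}

// Convolution via the convolution theorem: transform, multiply pointwise, invert.
std::vector<double> conv(const std::vector<double>& a, const std::vector<double>& b) {
    size_t n = a.size() + b.size() - 1;         // zero-pad to the full output length
    std::vector<cd> fa(n), fb(n);
    for (size_t i = 0; i < a.size(); ++i) fa[i] = a[i];
    for (size_t i = 0; i < b.size(); ++i) fb[i] = b[i];
    fa = dft(fa, false);
    fb = dft(fb, false);
    for (size_t i = 0; i < n; ++i) fa[i] *= fb[i];  // multiply instead of convolving
    fa = dft(fa, true);
    std::vector<double> res(n);
    for (size_t i = 0; i < n; ++i) res[i] = fa[i].real();
    return res;
}
```

With an O(n log n) FFT in place of `dft`, this beats direct convolution once the signals are large enough.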


username commented on slide_012 of Addressing the Memory Wall ()

Because the capacitors have to be constantly charged and discharged each time the DRAM is read from or written to, they tend to degrade after repeated use, especially if the DRAM is cheaply fabricated.


Implementation-wise, how does work stealing interact with local variables? For example, if the threads have local stacks, do the threads that steal work steal the stack as well?


NVIDIA's newest architecture adds in support for 4x4 matrix multiply-add to enhance these computations.

NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator Announced


Titan is a hybrid-architecture Cray XK7 system with a theoretical peak performance exceeding 27,000 trillion calculations per second (27 petaflops).

For more information about Titan: https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/


pht commented on slide_030 of Addressing the Memory Wall ()

Embedded DRAM can be optimized for low latency applications such as program, data, or cache memory in embedded microprocessor or DSP chips. With appropriate memory architecture and circuit design, GHz speeds are possible with on-chip DRAM.


There have historically been different mindsets between people working on computationally intensive programs and those working on data-intensive programs. As a result, their approaches to scaling programs to large systems, and to parallelism and efficient computing, have leveraged different aspects of the system. If these two kinds of programs could be dealt with in a unified way, we could see more progress in each individual part of the system.


What causes the second inflection point around 250 or so cores?


kayvonf commented on slide_010 of Course Wrap Up + Presentation Tips ()

:-)


@kayvonf yes I agree with you completely. Sorry if my comment was unclear.

I just wished to emphasize the importance of being able to communicate the "good substance" so that others may understand just how good it is.


kayvonf commented on slide_010 of Course Wrap Up + Presentation Tips ()

@dyzz. I want to be absolutely clear here that there is no substitute for good substance.

However, when you have done work with good substance or have a very good idea you want to see your team implement in a future job, this is the time when it's probably the most important to have the skills to communicate that substance well. The best ideas will benefit others, lead to better systems, etc. and we don't want the good ideas to lose out to other ideas that might not have as good of technical merit, but are communicated well and thus trick others into thinking they are the best ideas.

In other words, good computer architecture often involves good communication.


fxffx commented on slide_027 of Implementing Synchronization ()

Test-and-test-and-set reduces traffic because each processor only needs to check its local cache to see whether the value has changed, and only issues a bus transaction (the actual test-and-set) when it observes that the lock has been released.
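
A sketch of a test-and-test-and-set lock along those lines (illustrative, not a production lock): the inner loop spins on a plain load that hits the local cache, and only the `exchange` generates coherence traffic.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct TTASLock {
    std::atomic<bool> held{false};
    void lock() {
        while (true) {
            // "Test": spin locally until the lock looks free (no bus traffic
            // while the cached copy stays valid).
            while (held.load(std::memory_order_relaxed)) { }
            // "Test-and-set": only now attempt the expensive atomic exchange.
            if (!held.exchange(true, std::memory_order_acquire))
                return;  // won the race
        }
    }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Several threads increment a plain counter under the lock.
int ttas_demo(int threads, int iters) {
    TTASLock lk;
    int counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { lk.lock(); ++counter; lk.unlock(); }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```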


fxffx commented on slide_008 of Transactional Memory ()

Using locks guarantees that only one thread will modify the stack at a time. Lock-free code does not prevent multiple threads from attempting to modify the stack at once, but if they do, only one will succeed and the others will start over.
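
A sketch of that retry pattern in the style of a Treiber stack (illustrative; concurrent pops would additionally need safe memory reclamation to avoid the ABA problem):

```cpp
#include <atomic>

struct Node { int value; Node* next; };

struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(int v) {
        Node* n = new Node{v, head.load()};
        // If another thread changed head between our load and the CAS, the
        // CAS fails, reloads head into n->next, and we simply try again.
        while (!head.compare_exchange_weak(n->next, n)) { }
    }

    bool pop(int* out) {
        Node* n = head.load();
        // Same retry loop: a failed CAS refreshes n with the current head.
        while (n && !head.compare_exchange_weak(n, n->next)) { }
        if (!n) return false;
        *out = n->value;
        delete n;   // safe here only without concurrent pops (reclamation caveat)
        return true;
    }
};
```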


The way you can sell your work often determines how it is perceived by others. The substance of your project can be amazing, but a lackluster sell can leave the audience bored and confused.


Intermediary images in deep neural networks sometimes are "recognizable" to the human eye, but more often than not it is actually very difficult to tell what the neural net is "looking" at and yet they are able to make accurate predictions based on those images. We really have a long way to go in understanding how these, and by some extension, the brain works.


fxffx commented on slide_048 of Transactional Memory ()

In case 4, the two transactions keep aborting each other because writers always have priority. The two transactions never stall; they just keep restarting, resulting in livelock.


dyzz commented on slide_009 of Addressing the Memory Wall ()

Although this code produces tangible performance benefits, it is only acceptable in absolutely performance critical applications. For an application where this kind of calculation is not a bottleneck or done very often, this code becomes overkill as it is much more difficult to maintain and much more prone to error.


fxffx commented on slide_045 of Transactional Memory ()

Eager versioning is optimistic about the transaction: it expects the transaction to commit most of the time, so it writes to memory in place and keeps an undo log in case of an abort. Lazy versioning is pessimistic: it buffers writes and only applies them to memory on commit.


dyzz commented on slide_011 of Parallel Deep Network Training ()

@holard it is true that you are only guaranteed to converge to a global optimum with a convex function; however, it has been shown in practice that neural nets converging to a local optimum is often "good enough".