What's Going On
Lotusword commented on slide_030 of Directory-Based Cache Coherence

Also one of the flat schemes.


Lotusword commented on slide_029 of Directory-Based Cache Coherence

Also one of the hierarchical schemes.


Lotusword commented on slide_010 of Directory-Based Cache Coherence

There are also other definitions. Dirty node: the node that has a copy of the block in its cache in modified (dirty) state. Exclusive node: the node that has a copy of the block in its cache in an exclusive state. Owner node: the node that currently holds the valid copy of the block and must supply the data when needed.
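
As a rough sketch (class and field names are illustrative, not from the lecture), a directory entry tracking these roles might look like:

    # Hypothetical sketch of a directory entry tracking the roles above.
    class DirectoryEntry:
        def __init__(self, num_nodes):
            self.sharers = [False] * num_nodes  # presence bit per node
            self.dirty = False                  # block modified in some cache?
            self.owner = None                   # node holding the valid copy

        def supplier(self):
            # When the block is dirty, the owner node must supply the data;
            # otherwise the home node's memory holds a valid copy.
            return self.owner if self.dirty else "home memory"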


Lotusword commented on slide_012 of Interconnection Networks

We also have: the Cray T3D network (torus), the IBM SP-1 and SP-2 networks, the Scalable Coherent Interface, the SGI Origin network, and the Myricom network.


Lotusword commented on slide_055 of Parallel Programming Abstractions

In OpenCL, threads (work-items) in different work-groups cannot communicate with each other.



Lotusword commented on slide_043 of A Modern Multi-Core Processor

3 ways to reduce latency: 1) Reduce the access time to each level of the storage hierarchy 2) Reduce the likelihood that data accesses will incur high latency 3) Reduce the frequency of potential high-latency events in the application
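
As a hypothetical illustration of point 2: locality-friendly loop orderings reduce the chance that an access has to go all the way out to memory. A sketch of a blocked (tiled) traversal, with the block size as an assumed tuning parameter:

    # Hypothetical sketch: blocked (tiled) matrix transpose. Visiting both
    # arrays in cache-sized tiles reuses data while it is likely resident,
    # reducing the likelihood of high-latency accesses (point 2 above).
    def transpose_blocked(a, n, block=64):
        b = [[0] * n for _ in range(n)]
        for ii in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        b[j][i] = a[i][j]
        return b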



Lotusword commented on slide_027 of Interconnection Networks

The average distance is N/2 with unidirectional links, and N/3 with bidirectional links. The bisection width is one.


Lotusword commented on slide_023 of Interconnection Networks

The average distance is roughly (2/3)N. Removal of a single link partitions the network, so the bisection width is one.
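
Claims like these can be sanity-checked by brute force. A hypothetical sketch (BFS over an explicit adjacency list; the ring example at the end is an assumption for illustration):

    from collections import deque

    # Hypothetical helper: average shortest-path distance over all
    # ordered pairs of connected nodes, computed by BFS from each source.
    def avg_distance(adj):
        total, pairs = 0, 0
        for s in range(len(adj)):
            dist = {s: 0}
            q = deque([s])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            total += sum(dist.values())
            pairs += len(dist) - 1
        return total / pairs

    # Example: a unidirectional ring on N nodes.
    N = 64
    ring = [[(i + 1) % N] for i in range(N)]
    print(avg_distance(ring))  # (1 + 2 + ... + (N-1)) / (N-1) = N/2 = 32.0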


Lotusword commented on slide_025 of Directory-Based Cache Coherence

The full bit vector scheme is a flat, memory-based directory scheme. Flat schemes can be divided into two classes: memory-based schemes and cache-based schemes.


Lotusword commented on slide_014 of Parallel Programming Case Studies

If the center of mass of the cell is far enough away from the body, the entire subtree under that cell is approximated by a single body at the center of mass of the cell, and the force this center of mass exerts on the body is computed. If, however, the center of mass is not far enough away, the cell must be "opened" and each of its subcells visited.
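
A minimal sketch of that traversal, assuming the usual opening criterion (compare cell size s to distance d against a threshold theta); all field names are illustrative:

    import math

    G = 6.674e-11  # gravitational constant

    # Hypothetical 2D Barnes-Hut force walk. cell.size, cell.mass,
    # cell.com_x/com_y, cell.is_leaf, and cell.children are assumed names.
    def force_on(body, cell, theta=0.5):
        dx = cell.com_x - body.x
        dy = cell.com_y - body.y
        d = math.sqrt(dx * dx + dy * dy)
        if cell.is_leaf or cell.size / d < theta:
            # Far enough away: approximate the whole subtree by a single
            # body at the cell's center of mass.
            f = G * body.mass * cell.mass / (d * d)
            return (f * dx / d, f * dy / d)
        # Too close: "open" the cell and visit each subcell.
        fx = fy = 0.0
        for child in cell.children:
            cfx, cfy = force_on(body, child, theta)
            fx += cfx
            fy += cfy
        return (fx, fy)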


Lotusword commented on slide_030 of Parallel Deep Network Training

Can worker nodes communicate data with each other?


Lotusword commented on slide_075 of Course Wrap Up + Presentation Tips

Thanks Kayvon!


sbyeon commented on slide_040 of Scaling a Web Site

How many levels of CDN can we realistically have without having noticeable sync delays?


Best class ever! Thanks Kayvon!


bdebebe commented on slide_024 of Parallel Deep Network Training

@maxdecmeridius One way is obviously to just try, but that might not always be possible given time constraints and resources. At that point, the best thing to do is approximate, based on educated estimates of the parameters.


bdebebe commented on slide_038 of Efficiently Evaluating Deep Networks

A recent CMU student, George Hotz, is building a self-driving car using DNNs. With this increased computing power, his research is becoming reality, as you can see in this video: https://www.youtube.com/watch?v=YjTnYBaQQpw


@mperron It's very reliant on money, as @ojd mentioned. It's a cynical way to look at it, but that type of research and development will only happen if there is a push in the right direction by people with the money to make it happen. It seems, though, that lately that push is happening more often.


bdebebe commented on slide_004 of Addressing the Memory Wall

@grarawr The dominant factor in performance cost isn't as easy to pin down as it is for energy cost, where data movement is clearly dominant, but data movement is definitely one of the leading performance costs we learned about in this class.


Working on our final project, we have been running into this exact issue: the performance improvements don't always outweigh the overhead involved in setting up a multithreaded program, especially in artificial intelligence. It is a sobering thought when thinking about how applicable parallelization actually is in our everyday lives.


bdebebe commented on slide_026 of Domain-Specific Programming Systems

This analysis is a good model for how to reason about parallelizing any sequential code in general.


I thought the particular bit below was very interesting, considering some of the later ideas we learned in the class about energy tradeoffs:

"To achieve long battery life when playing video, mobile devices must decode the video in hardware; decoding it in software uses too much power. Many of the chips used in modern mobile devices contain a decoder called H.264, an industry standard that is used in every Blu-ray DVD player and has been adopted by Apple, Google (YouTube), Vimeo, Netflix and many other companies.

Although Flash has recently added support for H.264, the video on almost all Flash websites currently requires an older generation decoder that is not implemented in mobile chips and must be run in software. The difference is striking: on an iPhone, for example, H.264 videos play for up to 10 hours, while videos decoded in software play for less than 5 hours before the battery is fully drained.

When websites re-encode their videos using H.264, they can offer them without using Flash at all. They play perfectly in browsers like Apple's Safari and Google's Chrome without any plugins whatsoever, and look great on iPhones, iPods and iPads."


Yes, but as long as you are close, does it really matter? As we saw later in the class, in graphics speed is sometimes more important than accuracy. I guess what I am saying is that a near-infinite number of materials may exist in the world, but you can get close to all of them with a much smaller set of materials.


tdecker commented on slide_029 of Parallel Programming Basics

Does this type of technique scale when you have a more complex graph? This example could be made into a bipartite graph. What if the graph required at minimum 3 or 4 colors?
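
It does generalize: greedily color the graph, then update one color class at a time, with all vertices of a class in parallel. A graph needing 3 or 4 colors simply needs 3 or 4 phases. A hypothetical sketch:

    # Hypothetical sketch: greedy coloring, then one parallel phase per color.
    # Vertices of the same color share no edges, so a whole color class can
    # be updated in parallel; more colors just means more (smaller) phases.
    def greedy_color(adj):
        color = {}
        for v in range(len(adj)):
            used = {color[u] for u in adj[v] if u in color}
            color[v] = next(c for c in range(len(adj) + 1) if c not in used)
        return color

    def color_phases(adj):
        color = greedy_color(adj)
        k = max(color.values()) + 1
        return [[v for v in sorted(color) if color[v] == c] for c in range(k)]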


bdebebe commented on slide_015 of Transactional Memory

@jaguar This was especially the case in assignment 3. The contention with even just 1 atomic operation completely slowed everything down.


tdecker commented on slide_014 of Parallel Programming Abstractions

I think a lot of these optimizations are dependent on the situation. This is why we can get such amazing speedups for specific application code.


@aeu The above point touched on most of it, but I also think it's worth noting that the discussion not only deals with contention, but overhead tradeoffs as well.


From my point of view, the "early out test" and "early in test" here may first check whether the four corner points of a chunk are within the triangle; if none of them are, the other points within the chunk can be ignored.

Besides, one way to determine the "right side" of each line of the triangle is to use the left-out vertex and the equation of the line: the "right side" of each line is the side containing the vertex not on that line, e.g., use P2 to determine the "right side" of line P0P1.
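
A hedged sketch of both tests, using standard edge functions (names and conventions are assumptions, not from the lecture):

    # Hypothetical sketch. edge() is the standard edge function: positive on
    # one side of the directed edge p -> q, negative on the other
    # (counter-clockwise winding assumed).
    def edge(p, q, x, y):
        return (q[0] - p[0]) * (y - p[1]) - (q[1] - p[1]) * (x - p[0])

    def classify_chunk(tri, corners):
        p0, p1, p2 = tri
        edges = [(p0, p1), (p1, p2), (p2, p0)]
        # Early out: all four corners outside the same edge means the whole
        # chunk lies outside the triangle and can be skipped.
        if any(all(edge(p, q, x, y) < 0 for (x, y) in corners) for (p, q) in edges):
            return "early out"
        # Early in: all corners inside every edge means, by convexity, that
        # every sample in the chunk is covered; no per-sample tests needed.
        if all(edge(p, q, x, y) >= 0 for (p, q) in edges for (x, y) in corners):
            return "early in"
        return "test each sample individually"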


The main difference between the assignment and this lecture is in the depth test and depth/color write. Here only the color of the closest fragment is written, for higher efficiency.


Kayvon, this discussion made me smile. Thank you so much for showing that you understand the culture here and really care about the well-being of your students.

I might add one thing: there is a lot of pressure, especially at CMU, to find what you are "most interested" in or most passionate about. Unfortunately, I've seen a lot of people crumble under that pressure, because the truth is: 1) There are, pretty much, an infinite number of things you can do. 2) You will only ever try a tiny subset of those things. 3) Things that seem cool and exciting at first will become harder the more you work on them.

It's ok to never find the thing that you would be "most" passionate about. From a pure probability standpoint, the chance of that happening is very small. (Like, maybe I could've been a professional mogul skier, but I never tried.) If you are doing something that you care about and want to contribute to, then you are already on the right track. And actually, you won't even care about it 100% of the time. Every time I see a seg fault I feel the urge to pound expletives into my laptop. But if something you do gives you those really satisfying moments when you're like "OMG I did it!" then it's worth it.


Weaknesses: Fine-grained: poor cache locality; not all values in a cache line are used. Coarse-grained: possibly uneven distribution of work.
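
A sketch of the two assignments (hypothetical; interleaved = fine-grained, blocked = coarse-grained):

    # Hypothetical sketch: which loop iterations thread t (of T) receives.
    def interleaved(t, T, n):
        # Fine-grained: adjacent iterations go to different threads, so each
        # thread's accesses are strided and cache lines are poorly utilized.
        return list(range(t, n, T))

    def blocked(t, T, n):
        # Coarse-grained: one contiguous chunk per thread (good locality),
        # but work may be unevenly distributed if iteration costs vary.
        chunk = (n + T - 1) // T
        return list(range(t * chunk, min((t + 1) * chunk, n)))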


c0d3r commented on slide_035 of Interconnection Networks

Could someone provide an example of a non-blocking multi-stage logarithmic topology?


Typically, if the center of a pixel falls inside a certain triangle, then that triangle "covers" the pixel. In the example above, triangles 3 and 4 cover this pixel.


I found this neat article on the NVIDIA DGX-1: http://www.nextplatform.com/2016/04/06/dgx-1-nvidias-deep-learning-system-newbies/ Something interesting I saw was that it burns 3,200 watts, which puts things in perspective.


I've been really fortunate to work within the Project Olympus framework, and I can say firsthand that it is one of the most awesome and criminally underused resources on CMU's campus. Whether or not you have an idea, the folks at Project Olympus are really helpful in guiding you and providing all the resources you need, or don't even know you need. The best part is that getting involved is as simple as sending an email or walking into their facilities on Henry Street (off Craig).



captainFlint commented on slide_016 of Parallelizing the 3D Graphics Pipeline

To compute all the red pixels, the naive algorithm is to just test each pixel on screen for whether it is inside the triangle or not. A slight optimization is to use the bounding box of the triangle and test only pixels in that box (like we did for circles in assignment 2). However, in the case of triangles, a triangle may cover very few pixels and yet its bounding box might be the entire screen.
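
In code, the bounding-box version might look like the following sketch (hypothetical; inside() stands in for any point-in-triangle predicate):

    # Hypothetical sketch: iterate only over the triangle's screen bounding
    # box, sampling at pixel centers as in the coverage discussion above.
    def rasterize_bbox(tri, inside, width, height):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        x0, x1 = max(0, int(min(xs))), min(width - 1, int(max(xs)))
        y0, y1 = max(0, int(min(ys))), min(height - 1, int(max(ys)))
        covered = []
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                if inside(tri, x + 0.5, y + 0.5):  # sample at pixel center
                    covered.append((x, y))
        return covered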


Anton was used to observe folding events for 12 structurally diverse proteins: see "How Fast-Folding Proteins Fold." They used typical simulation conditions and gathered between 1 us and 1 ms of simulation time for each protein. For reference, with an integration timestep of ~2.5 fs and trajectory snapshots recorded every 200 ps, a typical researcher gathers 10-100 ns of simulation time, which usually takes between 1 and 2 weeks of real time on a decent GPU. They don't seem to mention how long it took to record ~8 ms of total simulation time, but no doubt it was much less time than it would take anyone else (as in, it would probably be infeasible for anyone else), and they observed an impressive ~400 folding/unfolding events.
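
A quick back-of-the-envelope using the comment's own numbers:

    # Rough arithmetic from the figures above (values from the comment):
    total_sim = 8e-3        # ~8 ms of total simulated time
    dt = 2.5e-15            # ~2.5 fs integration timestep
    steps = total_sim / dt
    print(f"{steps:.1e} integration steps")  # ~3.2e+12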


jsunseri commented on slide_015 of Domain-Specific Programming on Graphs

@momoda Looks like there are two versions of the algorithm (see this source for an explanation), and this slide uses version 1 (the original published version) while the previous slide uses version 2. Basically the one on the previous slide gives an actual probability of reaching a page after clicking on many links; all the PageRanks form a probability distribution over all websites. The one on this slide is the expected number of times someone would reach a website if they restarted the search procedure as many times as there are websites.
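
Concretely, the two versions differ only in how the damping term is normalized. A hypothetical power-iteration sketch showing both:

    # Hypothetical sketch. Version 1 (this slide): values sum to N
    # (expected visit counts). Version 2 (previous slide): values sum to 1
    # (a probability distribution). Assumes every linked page appears as a
    # key in out_links; dangling nodes are ignored for brevity.
    def pagerank(out_links, d=0.85, iters=50, version=1):
        n = len(out_links)
        base = (1 - d) if version == 1 else (1 - d) / n
        pr = {v: (1.0 if version == 1 else 1.0 / n) for v in out_links}
        for _ in range(iters):
            nxt = {v: base for v in out_links}
            for v, outs in out_links.items():
                for u in outs:
                    nxt[u] += d * pr[v] / len(outs)
            pr = nxt
        return pr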


@stride16 I think the point was that if you expect performance scaling based on the number of nodes/CPUs/etc. and then some of those nodes fail, you get significantly worse scaling than expected even if only a few of them fail.


Notably, this paper from 1991 reports a lock-free multiprocessor OS kernel that requires a double-word CAS and claims performance benefits from being lock-free.


jsunseri commented on slide_019 of Domain-Specific Programming Systems

I don't think any of the molecular simulation "languages" listed in the first comment are actually languages. They're just software packages; you use them as command-line programs, and although they typically have an "input file" format, that input isn't really sophisticated enough to call it a language - it's just a set of input parameter specifications for the simulation, like how big a time step to take. A legitimate domain-specific language in that category is NAB (Nucleic Acid Builder). I was getting a memory access error in a program for free energy calculations that called some other programs under the hood, and when I started trying to hunt down the problem I discovered it was somewhere inside about 5,000 lines of NAB, opening with a comment written in the early 90s by someone at Sun (who had evidently written the program) that said "much of what follows is untested and poorly commented." That was the only comment.


jsunseri commented on slide_037 of Parallel Deep Network Training

To distinguish the last case from this one: in the last few slides, we discussed how parameter values could be sharded across multiple servers to reduce the adverse effects of contention on latency - all workers must communicate with the parameter servers, both to update their own parameter values and to update the parameter values on the servers. Those communication costs grow if there are more parameters to send back and forth, but we can reduce individual transfer times (and effectively pipeline the transferring of the full set of parameters) by having multiple parameter servers with disjoint subsets of the parameters stored on each. Now we consider what happens if the number of parameters becomes so great that it is impractical for a worker node to work on all of them at once - each worker now may work on a subset of the parameters at a time. This would also reduce contention for the parameter servers, since only a subset of the worker nodes would need to communicate with each one.
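
A toy sketch of that sharding (hypothetical names; parameters are hashed across servers, and each worker exchanges only the keys it needs):

    # Hypothetical sketch: parameters sharded across servers by key hash.
    class ShardedParameterServers:
        def __init__(self, num_servers):
            self.shards = [{} for _ in range(num_servers)]

        def _shard(self, key):
            return self.shards[hash(key) % len(self.shards)]

        def push(self, key, grad, lr=0.01):
            # A gradient update touches only the server owning this key;
            # transfers to different shards proceed independently (pipelined).
            shard = self._shard(key)
            shard[key] = shard.get(key, 0.0) - lr * grad

        def pull(self, key):
            return self._shard(key).get(key, 0.0)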


jsunseri commented on slide_017 of Efficiently Evaluating Deep Networks

Interesting link - CNNs trained for object detection can also be used to generate those objects in images composed of random noise (preprocessed with a generic prior to make the noise have local correlation the way natural images do). The post points out that this kind of image generation is useful for identifying deficiencies in the training set after a round of training (e.g. the neural net thought that images of dumbbells always had to be accompanied by a flexing bicep). There are some novel research applications too, but I won't discuss them further to avoid getting scooped :P


@MaxFlowMinCut I think the definition for crossProduct given above is actually a Cartesian product, which you access in Spark by using cartesian(). The cross product as we commonly use it (in physics, for example) is not defined for an arbitrary number of dimensions (see Wedge Product for the thing that is defined in an arbitrary number of dimensions), so an implementation would probably assume three-dimensional vectors.
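
The distinction in code (Spark's cartesian() corresponds to the first; the sketch below is illustrative, not Spark's implementation):

    from itertools import product

    # Cartesian product: every pairing of elements from two collections.
    # This is what Spark's cartesian() computes on two RDDs.
    pairs = list(product([1, 2], ["a", "b"]))
    # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

    # Geometric cross product: defined only for 3-dimensional vectors.
    def cross(u, v):
        return (u[1] * v[2] - u[2] * v[1],
                u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0])

    print(cross((1, 0, 0), (0, 1, 0)))  # (0, 0, 1)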