What's Going On
llcoolj commented on slide_066 of Parallel Deep Neural Networks ()

However, this is difficult to distribute ^


It seems (from the internet) that, in general, Cilk allows for better performance than OpenMP? Is that true?


llcoolj commented on slide_057 of Domain-Specific Programming Systems ()

Google actually uses Halide for its image processing on Android phones.


llcoolj commented on slide_028 of Addressing the Memory Wall ()

"In the long-term, the scaling of the memory wall may take place once 3D memory devices and/or optical interconnects are commercialized. Failing that, expect to see much larger and smarter cache hierarchies." -hpcwire.com


adilets commented on slide_037 of Addressing the Memory Wall ()

Does this mean that offsets corresponding to different bases will always have a different byte size?


@LazyKiller, you can imagine perhaps having to store a bunch of information and variables for a user who is going through an e-commerce platform, which isn't necessarily information that the programmer wants to commit to a database yet because it's not complete.


@yeezus98, stateful servers mean that state needs to persist throughout a client's session. This means that only the servers holding the state relating to a request will be able to service it. As the overhead of storing state across all the servers is too high, you can imagine that only one or two of them will have it. That's how it limits load-balancing options.
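For intuition, here's a minimal sketch (hypothetical names) of the "sticky" routing this forces: the balancer has to hash the session to a fixed server rather than pick the least-loaded one, because no other server has that client's state.

    #include <functional>
    #include <string>
    #include <vector>

    struct Server { std::string host; };

    // With stateful servers, the balancer cannot pick the least-loaded
    // server; it must route to whichever server holds the session state.
    const Server& route(const std::vector<Server>& servers,
                        const std::string& session_id) {
        size_t i = std::hash<std::string>{}(session_id) % servers.size();
        return servers[i];  // any other server could not service the request
    }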


cloudhary commented on slide_065 of Parallel Deep Neural Networks ()

Minor typo here, I believe: it should read "fine-grain" instead of "find".


tommywow commented on slide_017 of Domain-Specific Programming Systems ()

I think OpenGL can be considered a DSL. It speeds up graphics rasterization with hardware acceleration, usually using the GPU. It accomplishes a very specific task.


Some other ASIC examples on mobile devices include many image processing units, such as those for face detection and other feature detection.


Since the iPhone doesn't have an ASIC for this, Siri cannot listen all the time, because doing so would drain the battery.


llcoolj commented on slide_049 of Transactional Memory ()

Do we prefer optimistic to pessimistic because it can't stay stalled indefinitely?


tommywow commented on slide_047 of Transactional Memory ()

Stalling keeps the previous work from being scrapped and restarted. It saves some time and computation.


For the conservative approach, if the send-ready request fails at first, the sender will resend the ready request. If the receive-ready request succeeds the next time, couldn't the data finally sent to the destination differ from what the sender originally wanted to send, since there seems to be no buffer in this approach?


cloudhary commented on slide_001 of Domain-specific programming on graphs ()

@Tengjiao, what would you consider as an example of a computation that can be abstracted as a graph model but doesn't otherwise naturally express itself as "graph data" in your sense of the word?


cloudhary commented on slide_057 of Domain-Specific Programming Systems ()

I'm really in awe of the results presented on this slide. It's definitely got me thinking whether work can be done more efficiently in other domain-specific languages rather than by writing parallel code myself. Of course, that wouldn't get me a decent score on the project.


wxt commented on slide_012 of Implementing Synchronization ()

So a block would be a programmer-defined scheduling command, where the code forces a context switch? And during busy waiting, the processor itself will decide when to interrupt spinning?
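For reference, the two options look roughly like this in C++ (a sketch; blocking here uses std::condition_variable as one way the code can ask the OS to deschedule a thread):

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    // Busy waiting: the thread keeps running; the OS decides when to
    // preempt it.
    void spin_wait() {
        while (!ready.load()) { /* burn cycles */ }
    }

    // Blocking: the code itself asks the scheduler to deschedule this
    // thread until notified, i.e., a programmer-requested context switch.
    void block_wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(); });
    }

    void make_ready() {
        ready.store(true);
        std::lock_guard<std::mutex> lk(m);  // notify under the lock to
        cv.notify_all();                    // avoid a lost wakeup
    }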


Yes, because it needs the prior node to know what points to the next one.


LazyKiller commented on slide_042 of Interconnection Networks ()

In this case, is there still only one flow that can be transmitting?


LazyKiller commented on slide_041 of Interconnection Networks ()

@hanzhoul I think it happens when the link has already started passing packets.


So, when one thread is doing hand-over-hand, it needs to hold locks on 2 nodes (head & head.next) at once, right?
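Right; the invariant is that the successor's lock is acquired before the predecessor's is released, so at most two locks are held at a time. A minimal sketch, assuming a per-node mutex:

    #include <mutex>

    struct Node {
        int value;
        Node* next;
        std::mutex m;
    };

    // Hand-over-hand (lock coupling) traversal: lock the current node,
    // then its successor, and only then release the current node, so no
    // other thread can slip in between the two.
    bool contains(Node* head, int target) {
        if (!head) return false;
        head->m.lock();
        Node* cur = head;
        while (cur->value != target) {
            Node* next = cur->next;
            if (!next) { cur->m.unlock(); return false; }
            next->m.lock();    // acquire successor before releasing predecessor
            cur->m.unlock();
            cur = next;
        }
        cur->m.unlock();
        return true;
    }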


hanzhoul commented on slide_029 of Implementing Synchronization ()

I have the same doubt. Why do we only use one barrier for all of these synchronizations?


According to lecture 4 slide 55, it means send() and recv() don't return until send() receives acknowledgement from recv().
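A toy rendezvous illustrating those semantics (my sketch, assuming a bufferless single-slot channel): send() does not return until recv() has taken the value.

    #include <condition_variable>
    #include <mutex>

    // Toy synchronous channel: no buffering, so send() blocks until a
    // matching recv() has consumed the value.
    class SyncChannel {
        std::mutex m;
        std::condition_variable cv;
        int value = 0;
        bool full = false;  // a value is waiting to be received
    public:
        void send(int v) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !full; });  // wait for a free slot
            value = v; full = true;
            cv.notify_all();
            cv.wait(lk, [&] { return !full; });  // wait for recv's acknowledgement
        }
        int recv() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return full; });
            int v = value; full = false;         // taking the value acknowledges it
            cv.notify_all();
            return v;
        }
    };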


What ongoing efforts or research are being made to make FPGA programming easier? I can imagine an "FPGA compiler" that translates higher-level code into a form that can be read by an FPGA, but what kinds of challenges would be associated with such a project?


bpr commented on slide_024 of Implementing Synchronization ()

@hanzhoul, yes, without contention, you want to execute the test-and-set as soon as possible. If you have to test first, then that requires an additional coherence request.
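A sketch of both locks with std::atomic, to make the difference concrete (the extra read in test-and-test-and-set is that additional coherence request in the uncontended case):

    #include <atomic>

    std::atomic<bool> locked{false};

    // Test-and-set: one atomic exchange. Uncontended, this is a single
    // read-for-ownership coherence transaction.
    void tas_lock() {
        while (locked.exchange(true)) { /* spin */ }
    }

    // Test-and-test-and-set: spin on an ordinary read first, so contended
    // waiters spin in their own caches; but uncontended, the extra read
    // pulls the line in shared state before the exchange upgrades it.
    void ttas_lock() {
        while (true) {
            while (locked.load()) { /* spin locally */ }
            if (!locked.exchange(true)) return;
        }
    }

    void unlock() { locked.store(false); }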


What does 'matching receive and source data sent' mean?


hanzhoul commented on slide_024 of Implementing Synchronization ()

In an uncontended situation, is test-and-set with back-off faster than test-and-test-and-set?


@BBB, the requests are from different clients, so we cannot control and aggregate them.


Split_Personality_Computer commented on slide_012 of Directory-Based Cache Coherence ()

@gogogo It looks like memory is split up between the processors, so P0 goes to read memory that originally resided in P1, but it is P1's job to tell P0 that P2 actually has the most up-to-date version of what it is looking for (because P2 must have read & changed the value from P1).

So P0 requests the most up-to-date location from P1, because P1's directory has a list of who has what.
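A toy model of that directory entry (hypothetical types; a real directory keeps this per cache line): the home node stores only bookkeeping, not the data, and redirects requests to the current owner.

    #include <vector>

    // Toy directory entry for one cache line at its home node (P1 here):
    // it records which processor, if any, holds the line modified.
    struct DirEntry {
        int owner = -1;            // -1 means memory is up to date
        std::vector<int> sharers;  // processors with read-only copies
    };

    // P0's read request goes to the home node; the directory does not hold
    // the data itself, it just names who should supply it (P2 here).
    int who_supplies(const DirEntry& e) {
        return e.owner >= 0 ? e.owner : -1;  // -1: home memory supplies
    }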


hanzhoul commented on slide_041 of Interconnection Networks ()

Does the head-of-line blocking problem only occur when one link is reserving a route and has not started passing packets yet?


cuiwei commented on slide_025 of Addressing the Memory Wall ()

eDRAM does not act solely as a level 4 cache because, as the diagram shows, other devices can access eDRAM without going through the LLC. A more accurate description is that eDRAM serves as a faster buffer in front of DDR. And since every memory request, whether served by eDRAM or DDR, goes through the eDRAM controller, the architecture allows the eDRAM to stay coherent with DDR and therefore remain transparent to applications (and thus to programmers).


bpr commented on slide_035 of Snooping-Based Cache Coherence ()

@MangoSister, "there is no arbitration step and the caches assume the O/F will handle supplying". So if no O (or no F) is present, then memory supplies the data.

While the most recent cache to issue a BusRd is in F, that does not guarantee that it will not evict the line before the next BusRd for that line.


MangoSister commented on slide_035 of Snooping-Based Cache Coherence ()

@bpr: (2) Yes, I understand that in MOESI the cache in the O state is supposed to supply the data. However, because O only comes from a previous M, a possible situation is that there are multiple S copies but no O. In that case, who should supply the data? (I think this won't happen in MESIF, though, because as you mentioned, the most recent cache to request a read receives the line in F, so there must be one F in the same case.)


Split_Personality_Computer commented on slide_049 of Snooping-Based Cache Coherence ()

@taoy1 I think this is the case where you'd want to mark x volatile since another core can modify it. I know some GPU languages have different data structures for read-only versus read-write; potentially read-only data structures are not marked volatile while read-write data structures are marked volatile to force everyone to communicate through the L2.


bpr commented on slide_033 of Snooping-Based Cache Coherence ()

@SPC, the S state will not flush the data, so any transition to S must flush. On a BusRdX, the question is whether the implementation supports a cache-to-cache transfer. If so, then the cache holding the line in M, just transfers the line in response to the BusRdX, otherwise, memory must supply the data.


mpcomplete commented on slide_020 of Domain-specific programming on graphs ()

Are there any optimisations that can be done on the basis of whether the graph is sparse or dense?


bpr commented on slide_035 of Snooping-Based Cache Coherence ()

@MangoSister: 1) Yes, either. Since the S/F states are clean, the data must be written back to memory. It may also be possible for the new cache to get the data from memory (or snoop it from the bus) at that time.

2) With MESI, the design is to either have memory supply the data in the case with multiple S states, or the caches in S have to arbitrate who supplies the data. Usually if the design has O / F, then there is no arbitration step and the caches assume the O/F will handle supplying.
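As a toy summary of that rule (my simplification; it ignores E-state cache-to-cache transfers and in-flight races):

    #include <vector>

    enum class State { M, O, E, S, I, F };

    // Toy rule for "who supplies the line on a BusRd": a cache in M/O/F
    // (the designated responder) supplies it; otherwise memory does, so
    // multiple S copies never have to arbitrate among themselves.
    bool cache_supplies(const std::vector<State>& copies) {
        for (State s : copies)
            if (s == State::M || s == State::O || s == State::F)
                return true;
        return false;  // only S/E/I copies -> memory supplies (toy model)
    }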


To delete every node before our Head node.


Split_Personality_Computer commented on slide_033 of Snooping-Based Cache Coherence ()

Can someone explain the purpose of flushing the data when another processor reads? I thought the point of these processor states is that once we've loaded data from main memory, we no longer have to communicate with main memory while we're processing that data on one of our cores.


MangoSister commented on slide_035 of Snooping-Based Cache Coherence ()

I have several questions regarding the responsibility of transferring data:

1. In MESI/MESIF, say only one cache holds the data in the M state, and then another processor requests the data (either BusRd or BusRdX), so the old cache must flush the data to memory. Does the new cache get the data from memory after the old cache finishes flushing, does it get the data from the old cache before flushing, or do the flush and the supply to the new cache happen simultaneously?

2. It seems that MOESI still does not solve the problem when there are multiple caches in the S state but none in M. Who should supply the data when there are only caches in the S state (no M and thus no O)?


bpr commented on slide_019 of Interconnection Networks ()

@caiqifang, this question is addressed in the discussion on the previous slide.


@carnegieigenrac, as stated in lecture, it was not clear from the code how exactly the tree, hypercube, and hierarchical barriers differ. Effectively they are different k-nary tree designs. The takeaway was that the OpenMP runtime has some choice without requiring the user to properly implement appropriate barriers for the underlying system.
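For concreteness, here is a minimal centralized sense-reversing barrier (my sketch, not the OpenMP implementation); the tree, hypercube, and hierarchical variants effectively replace the single counter with a k-nary tree of such counters to reduce contention:

    #include <atomic>

    // Centralized sense-reversing barrier: one counter, one flag, and it
    // is reusable across episodes because the release flag flips each
    // time. Tree-style barriers distribute the counter across a k-nary
    // tree so each atomic sees only k arrivals instead of all n.
    class Barrier {
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int n;
    public:
        explicit Barrier(int n) : count(n), n(n) {}
        void wait() {
            bool my_sense = !sense.load();
            if (count.fetch_sub(1) == 1) {  // last thread to arrive
                count.store(n);             // reset for the next episode
                sense.store(my_sense);      // release everyone
            } else {
                while (sense.load() != my_sense) { /* spin */ }
            }
        }
    };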