Homework 2
Due Date: Tuesday, September 28
Total Points: 100 pts
You must work on this assignment with one other person. You should
work together on all parts of the assignment, but submit only one set
of solutions. If you each work on only part of it, you each learn only part
of the material. Please be sure to write both names on the submitted
solutions.
Note: Copying material from Wikipedia, other online sources, or
any other source will not be tolerated. This form of plagiarism has
occurred in the past, and penalties for violating the Duke Community
Standard will be severe.
Keep all your answers short!
ILP and Pipelining (Appendix A, Chapters 2 and 3) (35 points)
- (5 pts) H&P 2.2
- (5 pts) H&P 2.7
- (5 pts) Write a summary (1 page or less) of the EPIC paper by Schlansker and Rau.
What were the contributions of the paper? What were its strengths and weaknesses?
How do you think the paper could have been improved?
- (5 pts) How does static scheduling work?
What are the advantages of using static scheduling?
What are its problems, and what solutions could address them?
Describe both hardware and software aspects.
Cite all sources you use to answer these questions.
- (10 pts) Research the pipelines of two modern microprocessors (other than the Intel Pentium4).
For each, write a couple of paragraphs describing its pipeline.
How deep are they? How wide? Why do you think they have different (or similar) pipelines?
What are the respective advantages and disadvantages of each design?
Cite all sources you use to answer these questions.
- (5 pts) Intel has developed a prototype chip, called Polaris, that has 80 in-order cores on it.
Why do you think Intel's architects might be leaning towards Polaris-like chips as opposed to Pentium4-like chips?
What are Polaris's advantages and disadvantages (with respect to a Pentium4-like chip)?
Consider at least the following issues: impact of CMOS technology trends, software, power, and design complexity.
Cite all sources you use to answer these questions.
Superscalar and Dynamic ILP (Chapter 2) (25 points)
- (5 pts) H&P 2.5
- (5 pts) Assume the processor chip is 1 cm by 1 cm and the clock is 2 GHz.
How many cycles does it take for an ideal signal (i.e., a signal traveling at the speed of light)
to travel the farthest possible distance on chip? Assume that you can route a signal diagonally.
Now compare this result to the situation back when chips were the same size but the clock was 100 MHz.
Why do you think this trend affects microarchitects?
- (10 pts) How many bypass paths are there in a 6-stage (F1, F2, D, X, M, W) pipeline that is 4 instructions wide?
How many wires and multiplexors (and what size muxes) does this bypass network require, compared to a processor without bypassing?
- (5 pts) What are the advantages of operand forwarding (also called operand/data bypassing)?
What do you think were the major issues Pentium4 architects faced w.r.t. forwarding? What are the major challenges
architects face in designing next-generation microprocessors w.r.t. forwarding? Justify your answers.
Cite all sources you use to answer these questions.
Pipelining and Hazards with SimpleScalar (40 points)
Start with the sim-safe simulator.
The main loop of the simulator, sim_main(), executes each instruction in order and increments the cycle counter by one.
Note that sim-safe does NOT model the timing of the execution; it only models the functional effects of each instruction.
To model timing, you'll have to modify sim-safe.c to count how many cycles have elapsed during each iteration of sim_main().
Run all experiments with the three benchmarks (anagram, gcc and go).
- Performance:
Assume your processor is a 4-wide, in-order superscalar (i.e., it can execute at most 4 instructions per cycle).
Ignoring data dependencies and assuming no hazards of any kind, what is its performance (i.e., how many cycles does each benchmark take to run)?
- Data hazards:
Now assume that the processor cannot execute data-dependent instructions in the same cycle.
For example, if an instruction writes to register 2, then no subsequent instruction (in program order) that
reads register 2 can execute in the same cycle (it must wait until the next cycle). How does this affect performance?
Note that this question is independent of the pipeline length.
- Structural hazards:
Now assume that the L1 data cache has only one port, so the processor can execute at most one memory operation (load or store) per cycle.
How does this affect its performance?
- Control hazards:
Now further assume that the processor has a 9-stage pipeline.
The result of a conditional branch (i.e., taken or not-taken) is computed in stage 7.
The processor statically predicts all conditional branches as not taken and continues fetching
from the instruction after the branch (the fall-through instruction). If the branch is indeed not taken,
then there is no penalty. If the branch is taken, then all instructions after the branch are squashed and
fetching resumes at the branch target. How does this affect performance?
Update!
We will be using Duke Blackboard for submission.
Submit: You will submit the version of sim-safe.c that models all three hazards described above (data, structural, and control).
Make sure your code changes are properly commented.
When you're ready to submit:
- Rename your file to HW2_NetId1_NetId2_sim-safe.c where NetId1 and NetId2 are the NetIds for both homework partners (e.g. HW2_ab34_xy16_sim-safe.c).
- Go to the Duke Blackboard course page -> Tools -> Digital Dropbox -> Send File.
- Under Name, paste the filename (HW2_NetId1_NetId2_sim-safe).
- Choose the file and click Submit.