Joint UIUC/UMD Parallel Algorithms/Programming Course
David Padua, University of Illinois at Urbana-Champaign
Motivation 1/4
Programmers of today’s parallel machines must overcome three
productivity busters, beyond just identifying operations that
can be executed in parallel:
(i) apply the often difficult four-step programming-for-locality
recipe: decomposition, assignment, orchestration, and
mapping [CS99]
(ii) reason about concurrency in threads; e.g., race conditions
(iii) for machines such as GPUs, which fall behind on serial (or
Motivation 2/4: Commodity computer systems
If you want your program to run significantly faster … you’re going to have to parallelize it
Parallelism: only game in town
But, where are the players?
“The Trouble with Multicore: Chipmakers are busy designing
microprocessors that most programmers can't handle”—D. Patterson, IEEE Spectrum 7/2010
• Only heroic programmers can exploit the vast parallelism in current
machines – Report by CSTB, U.S. National Academies 2011
• An education agenda must: (i) recognize this reality, (ii) adapt to it,
Motivation 3/4: Technical Objectives
• Parallel computing exists to provide speedups over serial computing
• With its emerging democratization, the general body of CS students &
graduates must be capable of achieving good speedups
What is at stake?
A general-purpose computer that can be programmed effectively by too few programmers, or that requires excessive learning: application SW development costs more, weakening the market potential of not only the computer.
Traditionally, economists look to the manufacturing sector for bettering the recovery prospects of the economy. Software production is the quintessential 21st-century mode of manufacturing. These prospects
Motivation 4/4: Possible Roles for Education
• Facilitator. Prepare & train students and the workforce
for a future dominated by parallelism.
• Testbed. Experiment with vertical approaches and
refine them to identify the most cost-effective ways
for achieving speedups.
• Benchmark. Given a vertical approach, identify the
developmental stage at which it can be taught.
Rationale: Ease of learning/teaching is a necessary
(though not sufficient) condition for ease of programming
The joint inter-university course
• UIUC: Parallel Programming for Science and Engineering, Prof: DP
• UMD: Parallel Algorithms, Prof: UV
• Student population: upper-division undergrads and graduate
students. Diverse majors and backgrounds
• ~1/2 of the fall 2010 sessions were held jointly by videoconferencing.
Objectives
1. Demonstrate the logistical and educational feasibility of a real-time co-taught course.
Outcome: Overall success, with minimal glitches. Also helped alert students that success on material taught by the other professor is just as important.
Joint sessions
• DP taught OpenMP programming and provided parallel-architecture knowledge
• UV taught parallel (PRAM) algorithms; ~20 minutes of XMTC programming
• 3 joint programming assignments
Non-shared sessions
• UIUC: mostly MPI. Submitted more OpenMP programming assignments
• UMD: more parallel algorithms. Dry homework on design & analysis of
parallel algorithms. Submitted a more demanding XMTC programming assignment
JC: Anonymous questionnaire filled out by the students. Accessed by DP and UV only after all grades were posted, per IRB guidelines
Rank approaches for achieving (hard) speedups
Breadth-first-search (BFS) example
• 42 students in the fall 2010 joint UIUC/UMD course
- <1x speedups using OpenMP on an 8-processor SMP
- 7x-25x speedups on the 64-processor XMT FPGA prototype
Questionnaire: All students but one ranked XMTC ahead of OpenMP for
achieving speedups
Parallel Random-Access Machine/Model
PRAM: n synchronous processors, all having unit-time access to a shared memory.
Reactions: “You’ve got to be kidding, this is way:
- Too easy
- Too difficult”
Immediate Concurrent Execution
‘Work-Depth framework’ [SV82], adopted in parallel-algorithms texts [J92, KKT01]. Example: pairwise parallel summation. 1st round for 8 elements, in parallel: 1st+2nd, 3rd+4th, 5th+6th, 7th+8th
ICE is the basis for architecture specs: feasible for many-cores.
[Workflow figure: programmer’s workflow from algorithms to programming, via a rudimentary yet stable compiler]
PRAM-On-Chip HW Prototypes
• 64-core, 75 MHz FPGA prototype of XMT [SPAA98..CF08]
• Toolchain: compiler + simulator [HIPS’11]
• 128-core interconnection network, IBM 90nm: 9mm×5mm, 400 MHz [HotI07]
• FPGA design → ASIC: IBM 90nm, 10mm×10mm, 150 MHz
Architecture scales to 1000+ cores on-chip
Has the study of PRAM algorithms
helped XMT programming?
• Majority of UIUC students: No
• UMD students: Strong yes, reinforced by written explanations
Discussion
The exposure of UIUC students to PRAM algorithms and XMT programming was much more limited, and their understanding of this material was not challenged by analytic homework or exams.
For the same programming challenges, the performance of UIUC and UMD students was similar.
More Issues/lessons
• Recall the titles of the courses at UIUC/UMD: should we
use class time only for algorithms, or also for
programming?
Algorithms: high level of abstraction; allows covering more
advanced problems. Note: understanding was tested only for
UMD students.
• Made do with already-assigned courses. Next time: a more
homogeneous population, e.g., a CS grad class. If interested
in taking part, please let us know
• General lesson: IRB requires pre-submission of all
Conclusion
For parallelism to succeed serial computing in the
mainstream, the first experience of students must:
- demonstrate solid hard speedups
- be trauma-free
Beyond education: objective rankings of approaches
for achieving hard speedups provide a clue for
Course homepages
agora.cs.illinois.edu/display/cs420fa10/Home and
www.umiacs.umd.edu/users/vishkin/TEACHING/enee459p-f10.html
For a summary of the PRAM/XMT education approach:
www.umiacs.umd.edu/users/vishkin/XMT/PPOPPCPATH2011.pdf
Includes teaching experience extending from middle school to
graduate courses; course material [class notes,
programming assignments, video presentations of a full-day
tutorial and a full-semester graduate course]; a software
toolchain (compiler and cycle-accurate simulator, HIPS
5/20) available for free download; and the XMT hardware
How I teach parallel algorithms at
different developmental stages
• Graduate: In class, the same PRAM algorithms course as in prior decades,
plus complexity-style dry HW. <20 minutes of XMTC programming. 6 programming assignments with hard-speedup objectives, including parallel graph connectivity and XMT performance tuning
• Upper-division undergraduate: Less dry HW. Less programming. Still
demand hard speedups
• Freshmen/HS [SIGCSE’10]: Minimal/no dry HW. Same problems as in the
freshman serial programming course
Understanding of parallel algorithms needs to be reinforced & validated by
programming; otherwise most students will get very little from it
What about architecture education?
• Parallel architectures that make parallel thinking easier are badly needed
• In the happy days of serial computing, the stored-program + program-counter
model formed a wall between architecture and algorithms, so algorithms got low priority from architects. Not now!
• A trigger for XMT: brilliant incompetence at CSE@UMD.
ECE faculty never teach undergrad algorithms courses, so one can be an algorithms researcher and teach architecture courses … XMT
Reality: Few regularly teach both architecture and (grad) algorithms courses, not to mention parallel algorithms.
But why rely on accidents?! Teach the next generation of architecture students to master both, so that they can be better architects
• “Very different thought styles are used for one and the same problem
more often than are very closely related ones”—1935, Ludwik Fleck (‘the Turing’ of the sociology of science)