
ipdps11EduParTalk.ppt 443KB Jun 23 2011 01:06:34 PM

(1)

Joint UIUC/UMD Parallel Algorithms/Programming Course

David Padua, University of Illinois at Urbana-Champaign

(2)

Motivation 1/4

Programmers of today’s parallel machines must overcome 3 productivity busters, beyond just identifying operations that can be executed in parallel:

(i) impose the often difficult 4-step programming-for-locality recipe: decomposition, assignment, orchestration, and mapping [CS99]

(ii) reason about concurrency in threads; e.g., race conditions

(iii) for machines such as GPU, that fall behind on serial (or

(3)

Motivation 2/4: Commodity computer systems

If you want your program to run significantly faster … you’re going to have to parallelize it

Parallelism: only game in town

But, where are the players?

“The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle” (D. Patterson, IEEE Spectrum 7/2010)

Only heroic programmers can exploit the vast parallelism in current machines – Report by CSTB, U.S. National Academies 2011

An education agenda must: (i) recognize this reality, (ii) adapt to it,

(4)

Motivation 3/4: Technical Objectives

Parallel computing exists for providing speedups over serial computing

Its emerging democratization → the general body of CS students & graduates must be capable of achieving good speedups

What is at stake?

A general-purpose computer that can be programmed effectively by too few programmers, or requires excessive learning → application SW development costs more, weakening market potential of not only the computer:

Traditionally, economists look to the manufacturing sector for bettering the recovery prospects of the economy. Software production is the quintessential 21st century mode of manufacturing. These prospects

(5)

Motivation 4/4: Possible Roles for Education

Facilitator. Prepare & train students and the workforce

for a future dominated by parallelism.

Testbed. Experiment with vertical approaches and

refine them to identify the most cost-effective ways

for achieving speedups.

Benchmark. Given a vertical approach, identify the

developmental stage at which it can be taught.

Rationale: Ease of learning/teaching is a necessary

(though not sufficient) condition for

ease-of-programming

(6)

The joint inter-university course

UIUC: Parallel Programming for Science and Engineering, Prof: DP

UMD: Parallel Algorithms, Prof: UV

Student population: upper-division undergraduates and graduate students. Diverse majors and backgrounds.

About half of the fall 2010 sessions were held jointly by videoconferencing.

Objectives

1. Demonstrate logistical and educational feasibility of a real-time co-taught course.

Outcome: Overall success. Minimal glitches. Helped to alert students that success on material taught by the other professor is just as important.

(7)

Joint sessions

DP taught OpenMP programming. Provided parallel architecture knowledge

UV taught parallel (PRAM) algorithms. ~20 minutes of XMTC programming

3 joint programming assignments

Non-shared sessions

UIUC: mostly MPI. Submitted more OpenMP programming assignments

UMD: More parallel algorithms. Dry homework on design & analysis of parallel algorithms. Submitted a more demanding XMTC programming assignment

JC: Anonymous questionnaire filled out by the students. Accessed by DP and UV only after all grades were posted, per IRB guidelines

(8)

Rank approaches for achieving (hard) speedups

Breadth-first-search (BFS) example

42 students in fall 2010 joint UIUC/UMD course

- <1X speedups using OpenMP on 8-processor SMP

- 7x-25x speedups on 64-processor XMT FPGA prototype

Questionnaire: All students but one ranked XMTC ahead of OpenMP for achieving speedups
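The slides do not show the BFS formulation used in the assignment. As a rough illustration only, the sketch below assumes the common level-synchronous formulation, in which all vertices of the current frontier could be expanded concurrently. The function name `bfs_levels` and the sample graph are hypothetical, and the loop over the frontier is executed serially here.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS sketch: returns {vertex: distance from source}."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # On a parallel machine, every vertex in the frontier (and each of
        # its edges) could be handled concurrently. This loop simulates
        # that one parallel step serially.
        for u in frontier:
            for v in adj.get(u, ()):
                # In a truly concurrent version this check-and-set is where
                # two threads may race to claim the same vertex.
                if v not in dist:
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist


# Hypothetical 5-vertex graph given as adjacency lists.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(graph, 0))
```

The number of iterations of the outer loop is bounded by the graph's depth from the source, while the available parallelism per iteration is the frontier size. That may be one reason BFS speedups depend so heavily on how cheaply a platform can start many small parallel steps.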

(9)

Parallel Random-Access Machine/Model

PRAM:

n synchronous processors all having unit time access to a shared memory.

Reactions: You've got to be kidding, this is way:

- Too easy

- Too difficult:

(10)

Immediate Concurrent Execution

‘Work-Depth framework’ [SV82], adopted in parallel algorithms texts [J92, KKT01].

Example: Pairwise parallel summation. 1st round for 8 elements: in parallel, 1st+2nd, 3rd+4th, 5th+6th, 7th+8th
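The pairwise summation example can be sketched as a serial simulation in which each loop iteration stands for one parallel round. This is an illustrative sketch, not XMTC or course material; the function name `pairwise_sum` is hypothetical. For n elements it performs n-1 additions (the work) in ceil(log2 n) rounds (the depth).

```python
def pairwise_sum(values):
    """Return (total, rounds, additions) for round-by-round pairwise summation."""
    level = list(values)
    rounds = 0
    additions = 0
    while len(level) > 1:
        # All additions within one round are independent of each other,
        # so a PRAM could execute them in a single parallel step.
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        additions += len(pairs)
        if len(level) % 2 == 1:
            pairs.append(level[-1])  # an odd leftover element carries over
        level = pairs
        rounds += 1
    total = level[0] if level else 0
    return total, rounds, additions


# 8 elements: round 1 adds 1st+2nd, 3rd+4th, 5th+6th, 7th+8th.
print(pairwise_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3, 7)
```

With 8 inputs the rounds perform 4, 2, and 1 additions, so the depth is 3 while the total work stays at 7 additions, the same as a serial loop.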

ICE basis for architecture specs:

(11)

Feasible for many-cores

Algorithms

Programming

Programmer’s workflow

Rudimentary yet stable compiler

PRAM-On-Chip HW Prototypes

- 64-core, 75 MHz FPGA of XMT [SPAA98..CF08]

- Toolchain: compiler + simulator [HIPS’11]

- 128-core interconnection network, IBM 90nm: 9mm x 5mm, 400 MHz [HotI07]

- FPGA design → ASIC, IBM 90nm: 10mm x 10mm, 150 MHz

- Architecture scales to 1000+ cores on-chip

(12)

Has the study of PRAM algorithms helped XMT programming?

Majority of UIUC students: No

UMD students: Strong Yes, backed by written explanation

Discussion

Exposure of UIUC students to PRAM algorithms and XMT programming was much more limited. Their understanding of this material was not challenged by analytic homework or exams.

For the same programming challenges, performance of UIUC and UMD students was similar.

(13)

More Issues/lessons

Recall the titles of the courses at UIUC/UMD: Should we use class time only for algorithms or also for programming?

Algorithms: high level of abstraction. Allows covering more advanced problems. Note: Understanding tested only for UMD students.

Made do with already assigned courses. Next time: more homogeneous population; e.g., CS grad class. If interested in taking part, please let us know

General lesson: IRB requires pre-submission of all

(14)

Conclusion

For parallelism to succeed serial computing in the mainstream, the first experience of students has got to:

- demonstrate solid hard speedups

- be trauma-free

Beyond education: Objective rankings of approaches for achieving hard speedups provide a clue for

(15)

Course homepages

agora.cs.illinois.edu/display/cs420fa10/Home and

www.umiacs.umd.edu/users/vishkin/TEACHING/enee459p-f10.html

For a summary of the PRAM/XMT education approach:

www.umiacs.umd.edu/users/vishkin/XMT/PPOPPCPATH2011.pdf

Includes teaching experience extending from middle school to graduate courses, course material [class notes, programming assignments, video presentations of a full-day tutorial and a full-semester graduate course], a software toolchain (compiler and cycle-accurate simulator, HIPS 5/20) available for free download, and the XMT hardware

(16)

How I teach parallel algorithms at different developmental stages

Graduate: In class, same PRAM algorithms course as in prior decades and complexity-style dry HW. <20 minutes of XMTC programming. 6 programming assignments with hard-speedup target objectives. These include parallel graph connectivity and XMT performance tuning

Upper division undergraduate: Less dry HW. Less programming. Still demand hard speedups

Freshmen/HS [SIGCSE’10]: Minimal/no dry HW. Same problems as in freshmen serial programming course

→ Understanding of parallel algorithms needs to be enforced & validated by programming, or otherwise most students will get very little from it

(17)

What about architecture education?

Badly needed: parallel architectures that make parallel thinking easier

In the happy days of serial computing, stored-program + program counter → wall between arch and alg → algs low priority. Not now!

A trigger for XMT: brilliant incompetence of CSE@UMD.

ECE faculty never teach undergrad alg courses. Can be alg researcher and teach arch courses … → XMT

Reality: Few regularly teach arch and (grad) alg courses, not to say par algs

But why rely on accidents?! Teach next-generation arch students to master both, so that they can be better architects

“Very different thought styles are used for one and the same problem more often than are very closely related ones” (Ludwik Fleck, 1935; ‘the Turing’ of Sociology of Science)
