January 27, Thursday
12:00 – 13:30

Task Superscalar Multiprocessors
Computer Science seminar
Lecturer : Yoav Etsion
Lecturer homepage : http://www.cs.huji.ac.il/~etsman/
Affiliation : Barcelona Supercomputing Center
Location : 202/37
Host : Dr. Danny Hendler
Parallel programming is notoriously difficult and is still considered an artisan's job. Recently, the shift towards on-chip parallelism has brought this issue to the forefront. Commonly referred to as the Programmability Wall, this problem has already motivated the development of simplified parallel programming models, most notably task-based models. In this talk, I will present Task Superscalar Multiprocessors, a conceptual multiprocessor organization that operates by dynamically uncovering task-level parallelism in a sequential stream of tasks. Task superscalar multiprocessors target an emerging class of task-based dataflow programming models, and thus enable programmers to exploit manycore systems effectively while simplifying the programming model. The key component in the design is the Task Superscalar Pipeline, an abstraction of instruction-level out-of-order pipelines that operates at the task level and can be embedded into any manycore fabric to manage cores as functional units. Just as out-of-order pipelines dynamically uncover parallelism in a sequential instruction stream and drive multiple functional units, the task superscalar pipeline uncovers task-level parallelism in a stream of tasks generated by a sequential thread. Using intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. I will describe the design of the task superscalar pipeline and discuss how it tackles the scalability limitations of instruction-level out-of-order pipelines. Finally, I will present simulation results demonstrating that the design can sustain a decode rate faster than one task every 60ns and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7MB of on-chip eDRAM storage.
This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated multiprocessor with 256 cores.
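
To make the idea of detecting inter-task dependencies from input/output annotations concrete, here is a minimal software sketch in Python. It is only an illustration and not the hardware design presented in the talk: the function names (build_dependencies, wavefronts), the task names, and the data names are all hypothetical, and the sketch conservatively orders every read-after-write, write-after-write, and write-after-read pair instead of modeling whatever renaming the real pipeline may perform.

```python
from collections import defaultdict

def build_dependencies(tasks):
    """tasks: list of (name, inputs, outputs) in sequential program order.

    Returns a dict mapping each task index to the set of earlier task
    indices it must wait for. All names here are hypothetical.
    """
    last_writer = {}            # datum -> index of the most recent writer
    readers = defaultdict(set)  # datum -> readers since the last write
    deps = {}
    for i, (_name, inputs, outputs) in enumerate(tasks):
        wait_for = set()
        for x in inputs:
            if x in last_writer:
                wait_for.add(last_writer[x])   # read-after-write
        for x in outputs:
            if x in last_writer:
                wait_for.add(last_writer[x])   # write-after-write
            wait_for |= readers[x]             # write-after-read
        deps[i] = wait_for
        # Record this task's accesses only after its own dependencies are known.
        for x in inputs:
            readers[x].add(i)
        for x in outputs:
            last_writer[x] = i
            readers[x] = set()
    return deps

def wavefronts(tasks, deps):
    """Group task names into waves; tasks in one wave are mutually independent."""
    level = {}
    for i in range(len(tasks)):
        # A task runs one wave after the latest task it depends on.
        level[i] = 1 + max((level[j] for j in deps[i]), default=-1)
    waves = defaultdict(list)
    for i, (name, _inputs, _outputs) in enumerate(tasks):
        waves[level[i]].append(name)
    return [waves[k] for k in sorted(waves)]

# A sequential stream of annotated tasks (hypothetical example).
tasks = [
    ("init_a",  [],         ["a"]),
    ("init_b",  [],         ["b"]),
    ("scale_a", ["a"],      ["c"]),
    ("scale_b", ["b"],      ["d"]),
    ("combine", ["c", "d"], ["e"]),
]

deps = build_dependencies(tasks)
waves = wavefronts(tasks, deps)
for n, wave in enumerate(waves):
    print(f"wave {n}: {wave}")
```

Although the five tasks are issued by a single sequential thread, the annotations alone show that the two init tasks are independent, as are the two scale tasks, so each pair could be dispatched to different cores at the same time.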