As described in section 2.3.2, the barrier implementation of a flow dependence arc that leads to read requests requires two barriers: one to enforce the flow dependence itself (i.e. prevent early requests), and one to enforce the subsequent anti dependence (i.e. prevent late requests). Using the methods presented until this point, the implementation of a reader unknown communication pattern cannot be optimized further.
Consider the data parallel code in Figure 2.12(a)
and its corresponding synchronization graph in Figure 2.12(b).
The flow dependence between s1 and s2 would be implemented by a barrier
between the two statements, and the anti dependence between s2 and
s1 would be implemented by a barrier before s1,
since the anti dependence
is carried across the iterations of the sequential loop.
The barrier for enforcing the anti dependence
can be avoided by using the standard
technique for eliminating anti-dependence: make a new copy
of variable
for every iteration.
The barrier for enforcing
the flow dependence can be
avoided if each request can be associated with the
iteration in which it is generated; subsequently if it arrives early
at the destination, it can be stored in a temporary buffer and serviced
only after the corresponding copy is defined. It follows that
a temporary array of K elements may be used to avoid
any barrier synchronizations for K iterations.
So, within these K iterations,
threads adaptively synchronize only with those threads
from which they need data. After K iterations have been executed,
the copies can be reused as discussed subsequently.
The concept of variables which queue requests until definition is
essentially that of I-structures [ANP89] or
definition variables [Cha91]. Figure 2.12(c) shows
the use of definition variables in implementing the code of
Figure 2.12(a). Each definition variable consists
of three fields: a value, a defined/undefined bit, and a pointer
to a list of early requests. In the implementation shown,
a barrier is executed only once in every K iterations. Hence,
each thread needs to have an array of K definition variables.
At statement s1, after a is defined, the corresponding definition
variable is defined as well. All requests to it must be replied to
as well, and that function is performed by the routine service.
Subsequently each thread places a request to the thread from which
it needs data. We assume a third value for the read/write bit
(no longer just a bit) which
identifies the request as being a definition variable request.
The function
essentially does just that, while
servicing incoming requests. Note that this function, as well as
the function poll, must be modified to enqueue early requests
for definition variables. After K iterations, the definition
variables need to be reused, so a barrier is executed to synchronize
all threads, and subsequently all definition
variables are initialized to the undefined state by the
routine
.
The code for this and other routines is straightforward
and is omitted.
Figure 2.12: Eliminating barriers for reader unknown patterns