From: Patrick Moran <pmoran@nas.nasa.gov>
Date: Tue Sep 2, 2003 4:29:46 PM US/Pacific
To: John Shalf <jshalf@lbl.gov>
Cc: diva@lbl.gov
Subject: Re: DiVA Survey (Please return by Sept 10!)
Reply-To: patrick.j.moran@nasa.gov

On Wednesday 27 August 2003 03:33 pm, you wrote:
John,
Here's my couple cents' worth:
=============The Survey=========================
Please answer the attached survey with as much or as little verbosity
as you please and return it to me by September 10. The survey has 3
mandatory sections and 4 voluntary (bonus) sections. The sections are
as follows:

Mandatory:
1) Data Structures
2) Execution Model
3) Parallelism and Load-Balancing

Voluntary:
4) Graphics and Rendering
5) Presentation
6) Basic Deployment and Development Environment Issues
7) Collaboration
We will spend this workshop focusing on
the first 3 sections, but I
think we will derive some
useful/motivating information from any
answers to questions in the voluntary
sections.
I'll post my answers to this survey on the diva mailing list very soon.
You can post your answers publicly if you
want to, but I am happy to
regurgitate your answers as
"anonymous contributors" if it will enable
you to be more candid in your evaluation
of available technologies.
1) Data Structures/Representations/Management==================
The center of every successful modular
visualization architecture has
been a flexible core set of data
structures for representing data that
is important to the targeted application
domain. Before we can begin
working on algorithms, we must come to
some agreement on common methods
(either data structures or
accessors/method calls) for
exchanging data
between components of our vis framework.
There are two potentially disparate
motivations for defining the data
representation requirements. In the coarse-grained case, we need to
define standards for exchanging data
between components in this
framework (interoperability). In the fine-grained case, we want to
define some canonical data structures that can be used within a
component -- one developed specifically for this framework. These two
use-cases may drive different sets of requirements and implementation
issues.
* Do you feel both of these use cases are equally important or should
we focus exclusively on one or the other?
I think both cases are important, but
agreeing upon the fine-grained access
will be harder.
* Do you feel the requirements for each of these use-cases are aligned
or will they involve two separate
development tracks? For instance,
using "accessors" (method calls
that provide abstract access to
essentially opaque data structures) will
likely work fine for the
coarse-grained data exchanges between
components, but will lead to
inefficiencies if used to implement
algorithms within a particular
component.
* As you answer the "implementation and requirements" questions below,
please try to identify where
coarse-grained and fine-grained use cases
will affect the implementation
requirements.
I think the focus should be on interfaces
rather than data structures. I
would advocate this approach not just
because it's the standard
"object-oriented" way, but because
it's the one we followed with FEL,
and now FM, and it has been a big win for
us. It's a significant benefit
not having to maintain different versions of
the same visualization
technique, each dedicated to a different
method for producing the
data (i.e., different data structures). So, for example, we use the same
visualization code in both in-core and
out-of-core cases. Assuming up
front that an interface-based approach would be too slow is, in my
humble opinion, classic premature
optimization.
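To make that concrete, here is a minimal sketch of what an
accessor-style interface might look like in C++. These are toy classes
of my own for illustration, not the actual FM API; the point is just
that fieldMax() is written once against the interface and works the
same no matter where the values actually come from.

    // Toy sketch (not the actual FM classes): an accessor-style field
    // interface that keeps the underlying storage opaque.
    #include <cstddef>
    #include <vector>

    class ScalarField {
    public:
        virtual ~ScalarField() {}
        virtual std::size_t numNodes() const = 0;          // mesh size
        virtual double value(std::size_t node) const = 0;  // data accessor
    };

    // One possible implementation: a plain in-core array.
    class InCoreField : public ScalarField {
    public:
        explicit InCoreField(const std::vector<double>& d) : data_(d) {}
        std::size_t numNodes() const { return data_.size(); }
        double value(std::size_t node) const { return data_[node]; }
    private:
        std::vector<double> data_;
    };

    // A technique coded once against the interface (assumes a non-empty
    // field); an out-of-core or derived-field implementation of
    // ScalarField would work here unchanged.
    double fieldMax(const ScalarField& f) {
        double m = f.value(0);
        for (std::size_t i = 1; i < f.numNodes(); ++i)
            if (f.value(i) > m) m = f.value(i);
        return m;
    }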
What are the requirements for the data representations that must be
supported by a common infrastructure? We will start by answering Pat's
questions about representation requirements and follow up with
personal experiences involving particular domain scientists'
requirements.
Must: support for structured data
Structured data support is a must.
Must/Want: support for multi-block data?
Multi-block is a must.
Must/Want: support for various unstructured data representations?
(which ones?)
We have unstructured data, mostly based on
tetrahedral or prismatic meshes.
We need support for at least those
types. I do not think we could
simply
graft unstructured data support on top of
our structured data structures.
Must/Want: support for adaptive grid standards?
Please be specific
about which adaptive grid methods you are
referring to. Restricted
block-structured AMR (aligned grids),
general block-structured AMR
(rotated grids), hierarchical unstructured
AMR, or non-hierarchical
adaptive structured/unstructured meshes.
Adaptive grid support is a "want"
for us currently, probably eventually
a "must". The local favorite is CART3D, which
consists of hierarchical
regular grids. The messy part is that CART3D also supports having
more-or-less arbitrary shapes in the domain,
e.g., an aircraft fuselage.
Handling the shape description and all the
"cut cell" intersections
I expect will be a pain.
Must/Want: "vertex-centered" data, "cell-centered" data? other-centered?
Most of the data we see is still
vertex-centered. FM supports other
associations, but we haven't used them much
so far.
Must: support time-varying data, sequenced, streamed data?
Support for time-varying data is a must.
Must/Want: higher-order elements?
Occasionally people ask about it, but we
haven't found it to be a "must".
Must/Want: expression of material interface boundaries and other
special treatment of boundary conditions?
We don't see this so much. A "want", but not a must.
* For commonly understood datatypes like structured and unstructured,
please focus on any features that are commonly overlooked in typical
implementations. For example, data-centering is often overlooked in
structured data representations in vis systems, and FEM researchers
commonly criticize vis people for co-mingling geometry with topology
for unstructured grid representations. Few data structures provide
proper treatment of boundary conditions or material interfaces. Please
describe your personal experience on these matters.
One thing left out of the items above is
support for some sort of "blanking"
mechanism, i.e., a means to indicate that
the data at some nodes are not
valid.
That's a must for us. For
instance, with Earth science data we see
the use of some special value to indicate
"no data" locations.
* Please describe data representation requirements for novel data
representations such as bioinformatics and
terrestrial sensor datasets.
In particular, how should we handle more abstract data that is
typically given the moniker
"information visualization".
"Field Model" draws the line only
trying to represent fields and the meshes
that the fields are based on. I not really familiar enough with other
types
of data to know what
interfaces/data-structures would be best.
We haven't
see a lot of demand for those types of data
as of yet. A low-priority
"want".
What do you consider the most
elegant/comprehensive implementation for
data representations that you believe
could form the basis for a
comprehensive visualization framework?
* For instance, AVS uses entirely different data structures for
structured, unstructured, and geometry data. VTK uses class inheritance
to express the similarities between related structures. Ensight treats
unstructured data and geometry nearly interchangeably. OpenDX uses more
vector-bundle-like constructs to provide a more unified view of
disparate data structures. FM uses data-accessors (essentially keeping
the data structures opaque).
Well, as you'd expect, as the primary author
of Field Model (FM) I think it's
the most elegant/comprehensive of the
lot. It handles structured and
unstructured data. It handles non-vertex-centered data. I think it
should be able to handle adaptive data,
though it hasn't actually been
put to the test yet. And of course every adaptive mesh
scheme is a little
different. I think it could handle boundary condition needs, though
that's
not something we see much of.
* Are there any of the requirements above that are not covered by the
structure you propose?
Out-of-core? Derived fields? Analytic meshes (e.g., regular meshes)?
Differential operators? Interpolation methods?
* This should focus on the elegance/usefulness of the core
design-pattern employed by the implementation rather than a
point-by-point description of the implementation!
I think if we could reasonably cover the (preliminary) requirements above,
that would be a good first step. I agree with Randy that whatever we
come up with will have to be able to
"adapt" over time as our understanding
moves forward.
* Is there information or characteristics of particular file format
standards that must percolate up into the
specific implementation of
the in-memory data structures?
In FM we tried hard to keep file-format-specific stuff out of the core
model.
Instead, there are additional modules built
on top of FM that handle
the file-format-specific stuff, like I/O and
derived fields specific to
a particular format. Currently we have PLOT3D, FITS, and
HDFEOS4
modules that are pretty well filled out, and
other modules that are
mostly skeletons at this point.
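Schematically the layering looks something like the sketch below. The
names here are made up for illustration, not the real FM or PLOT3D
module code; the point is that the core model knows nothing about
files, while a format module supplies concrete fields (plus the
format-specific I/O and derived fields) that implement the core
interface.

    // Toy sketch of the layering: core model vs. format-specific module.
    #include <cstddef>
    #include <string>

    // --- core model: no file-format knowledge (same toy interface as in
    // --- the earlier sketch) --------------------------------------------
    class ScalarField {
    public:
        virtual ~ScalarField() {}
        virtual std::size_t numNodes() const = 0;
        virtual double value(std::size_t node) const = 0;
    };

    // --- hypothetical format module built on top of the core ------------
    class Plot3dScalarField : public ScalarField {
    public:
        explicit Plot3dScalarField(const std::string& path) : path_(path) {
            // A real module would open the file here, or defer the I/O
            // and read lazily from value(); this stub just records it.
        }
        std::size_t numNodes() const { return 0; }       // stub
        double value(std::size_t) const { return 0.0; }  // stub
    private:
        std::string path_;
    };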
We should also be careful not to assume that
analyzing the data starts
with "read the data from a file into
memory, ...". Don't forget
out-of-core,
analysis concurrent with simulation, among
others.
One area where the file-format-specific
issues creep in is with metadata.
Most file formats have some sort of metadata
storage support, some much
more elaborate than others. Applications need to get at this
metadata,
possibly through the data model, possibly
some other way. I don't have
the answer here, but it's something to keep
in mind.
For the purpose of this survey, "data
analysis" is defined broadly as
all non-visual data processing done
*after* the simulation code has
finished and *before* "visual
analysis".
* Is there a clear dividing line between "data analysis" and "visual
analysis" requirements?
Your definition excludes concurrent analysis
and steering from
"visualization". Is this intentional? I don't think there's a clear dividing
line here.
* Can we (should we) incorporate data analysis functionality into this
framework, or is it just focused on visual analysis?
I think you would also want to include
feature detection techniques. For
large data analysis in particular, we don't
want to assume that the scientist
will want to do the analysis by visually
scanning through all the data.
* What kinds of data analysis typically need to be done in your field?
Please give examples and describe how these functions are currently
implemented.
Around here there is interest in
vector-field topology feature detection
techniques, for instance, vortex-core
detection.
* How do we incorporate powerful data analysis functionality into the
framework?
Carefully :-)? By striving not to make a closed system.
2) Execution Model=======================
It will be necessary for us to agree on common execution semantics
for our components. Otherwise, we might have compatible data
structures but incompatible execution requirements. Execution
semantics is akin to the function of a protocol in the context of
network serialization of data structures. The motivating questions are
as follows:
* How is the execution model affected by the kinds of
algorithms/system-behaviors we want to implement?
In general I see choices where at one end of
the spectrum we have
simple analysis techniques where most of the
control responsibilities
are handled from the outside. At the other end we could have more
elaborate techniques that may handle load
balancing, memory
management, thread management, and so
on. Techniques towards
the latter end of the spectrum will
inevitably be intertwined more
with the execution model.
* How then will a given execution model affect data structure
implementations?
Well, there are always thread-safety issues.
* How will the execution model be translated into execution semantics
on the component level? For example, will we need to implement special
control-ports on our components to implement particular execution
models, or will the semantics be implicit in the way we structure the
method calls between components?
Not sure.
What kinds of execution models should be supported by the distributed
visualization architecture?
* View-dependent algorithms? (These were typically quite difficult to
implement for dataflow visualization environments like AVS5).
Not used heavily here, but would be
interesting. A "want".
* Out-of-core algorithms?
A "must" for us.
* Progressive update and hierarchical/multiresolution algorithms?
A "want".
* Procedural execution from a single thread of control (i.e., using a
commandline language like IDL to interactively control a dynamic or
large parallel back-end)?
Scripting support is a "must".
* Dataflow execution models? What is the firing method that should be
employed for a dataflow pipeline? Do you need a central executive like
AVS/OpenDX, a completely distributed firing mechanism like that of
VTK, or some sort of abstraction that allows the modules to be used
with either executive paradigm?
Preferably a design that does not lock us in
to one execution model.
* Support for novel data layouts like space-filling curves?
Not a pressing need here, as of yet.
* Are there special considerations for collaborative applications?
* What else?
Distributed control? Fault tolerance?
How will the execution model affect our
implementation of data
structures?
* how do you decompose a data structure such that it is amenable to
streaming in small chunks?
Are we assuming streaming is a requirement?
How do you handle visualization algorithms
where the access patterns
are not known a priori? The predominant example: streamlines
and streaklines.
Note the access patterns can be in both
space and time. How do you avoid
having each analysis technique need to know
about each possible data
structure in order to negotiate a streaming
protocol? How do you add another
data structure in the future without having
to go through all the analysis
techniques and put another case in their
streaming negotiation code?
In FM the fine-grained data access
("accessors") is via a standard
interface. The evaluation is all lazy. This design means more
function calls, but it frees the analysis
techniques from having to know
access patterns a priori and negotiate with
the data objects. In FM
the data access methods are virtual
functions. We find the overhead
not to be a problem, even with relatively
large data. In fact, the overhead
is less an issue with large data because the
data are less likely to be
served up from a big array buffer in memory
(think out-of-core, remote
out-of-core, time series, analytic meshes,
derived fields, differential-
operator fields, transformed objects, etc.,
etc.).
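A rough illustration of the idea, using toy classes rather than the
actual FM hierarchy: an analytic field and a derived field both sit
behind the same virtual accessor interface and produce values only
when asked, so an analysis technique never knows (or cares) whether a
number came from a big array, from disk, or from a formula.

    // Toy sketch of lazy, accessor-based evaluation behind one interface
    // (same toy interface as in the earlier sketches).
    #include <cstddef>
    #include <cmath>

    class ScalarField {
    public:
        virtual ~ScalarField() {}
        virtual std::size_t numNodes() const = 0;
        virtual double value(std::size_t node) const = 0;
    };

    // An analytic field: nothing stored, everything computed on demand.
    class SineField : public ScalarField {
    public:
        explicit SineField(std::size_t n) : n_(n) {}
        std::size_t numNodes() const { return n_; }
        double value(std::size_t i) const { return std::sin(0.01 * i); }
    private:
        std::size_t n_;
    };

    // A derived field: wraps another field, evaluates lazily.
    class ScaledField : public ScalarField {
    public:
        ScaledField(const ScalarField& base, double s) : base_(base), s_(s) {}
        std::size_t numNodes() const { return base_.numNodes(); }
        double value(std::size_t i) const { return s_ * base_.value(i); }
    private:
        const ScalarField& base_;
        double s_;
    };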
The same access-through-an-interface
approach could be done without
virtual functions, in order to squeeze out a
little more performance, though
I'm not convinced it would be worth it. To start with you'd probably end up
doing a lot more C++ templating. Eliminating the virtual functions would
make it harder to compose things at
run-time, though you might be able
to employ run-time compilation techniques a
la SCIRun 2.
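The templated flavor would look roughly like the sketch below (again,
illustrative only): the technique is parameterized on the field type,
so the accessor calls can be inlined, but field/technique combinations
get fixed at compile time instead of composed at run time.

    // Toy sketch of the non-virtual, template-based alternative.
    #include <cstddef>
    #include <vector>

    // A concrete field with inlineable, non-virtual accessors.
    class ArrayField {
    public:
        explicit ArrayField(const std::vector<double>& d) : data_(d) {}
        std::size_t numNodes() const { return data_.size(); }
        double value(std::size_t i) const { return data_[i]; }
    private:
        std::vector<double> data_;
    };

    // The technique needs no common base class, just the same accessor
    // "concept" (numNodes/value) on whatever Field type it is handed.
    template <class Field>
    double fieldMax(const Field& f) {
        double m = f.value(0);
        for (std::size_t i = 1; i < f.numNodes(); ++i)
            if (f.value(i) > m) m = f.value(i);
        return m;
    }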
* how do you represent temporal dependencies in that model?
In FM, data access arguments have a time value; the field interface is
the same for both static and time-varying data.
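For illustration (these are not the actual FM signatures), one way to
fold time in is to have every value request carry a time, with a
static field simply ignoring it:

    // Toy sketch: a single time-aware accessor interface serves both
    // static and time-varying data.
    #include <cstddef>

    class TimeScalarField {
    public:
        virtual ~TimeScalarField() {}
        virtual std::size_t numNodes() const = 0;
        virtual double value(std::size_t node, double time) const = 0;
    };

    // Static data presented through the same time-aware interface.
    class ConstantInTimeField : public TimeScalarField {
    public:
        ConstantInTimeField(std::size_t n, double v) : n_(n), v_(v) {}
        std::size_t numNodes() const { return n_; }
        double value(std::size_t, double /* time ignored */) const { return v_; }
    private:
        std::size_t n_;
        double v_;
    };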
* how do you minimize recomputation in order to regenerate data for
view-dependent algorithms?
Caching? I don't have a lot of experience with view-dependent
algorithms.
What are the execution semantics necessary
to implement these execution
models?
* how does a component know when to compute new data? (what is the
firing rule?)
* does coordination of the component execution require a central
executive, or can it be implemented using only rules that are local to
a particular component?
* how elegantly can execution models be supported by the proposed
execution semantics? Are there some things, like loops or
back-propagation of information, that are difficult to implement using
a particular execution semantics?
The execution models we have used have kept
the control model in
each analysis technique pretty simple,
relying on an external executive.
The one big exception is with
multi-threading. We've
experimented with
more elaborate parallelism and
load-balancing techniques, motivated in
part by latency hiding desires.
How will security considerations affect
the execution model?
More libraries to link to? More latency in network communication?
3) Parallelism and Load-Balancing=================
Thus far, managing parallelism in visualization systems has been
tedious and difficult at best. Part of this is due to a lack of
powerful abstractions for managing data-parallelism, load-balancing,
and component control.
Please describe the kinds of parallel
execution models that must be
supported by a visualization component
architecture.
* data-parallel/dataflow pipelines?
* master/slave work-queues?
* streaming update for management of pipeline parallelism?
* chunking mechanisms where the number of chunks may be different from
the number of CPUs employed to process those chunks?
We're pretty open here. Mostly straightforward work-queues.
* how should one manage parallelism for interactive scripting
languages that have a single thread of control? (e.g., I'm using a
commandline language like IDL that interactively drives an arbitrarily
large set of parallel resources. How can I make the parallel back-end
available to a single-threaded interactive thread of control?)
I've used Python to control multiple execution threads. The (C++) data
objects are thread-safe, and the minimal provisions for thread safety
in Python haven't been too much of a problem.
Please describe your vision of what kinds
of software support /
programming design patterns are needed to
better support parallelism
and load balancing.
* What programming model should be employed to express parallelism?
(UPC, MPI, SMP/OpenMP, custom sockets?)
* Can you give some examples of frameworks or design patterns that you
consider very promising for support of parallelism and load balancing?
(e.g., PNNL Global Arrays or Sandia's Zoltan)
http://www.cs.sandia.gov/Zoltan/
http://www.emsl.pnl.gov/docs/global/ga.html
* Should we use novel software abstractions for expressing parallelism,
or should the implementation of parallelism simply be an opaque
property of the component? (i.e., should there be an abstract messaging
layer or not?)
* How does the NxM work fit into all of this? Is it sufficiently
differentiated from Zoltan's capabilities?
I don't have a strong opinion here. I'm not familiar with Zoltan et al.
Our experience with parallelism tends to be
more shared-memory than
distributed memory.
===============End of Mandatory Section (the rest is voluntary)=============
Laziness prevails here, so I'll stop for now
:-).
Pat