From: Patrick Moran <firstname.lastname@example.org>
Date: Tue Sep 2, 2003 4:29:46 PM US/Pacific
To: John Shalf <email@example.com>
Subject: Re: DiVA Survey (Please return by Sept 10!)
On Wednesday 27 August 2003 03:33 pm, you wrote:
Here's my couple of cents' worth:
Please answer the attached survey with as much or as little verbosity
as you please and return it to me by September 10. The survey has 3
mandatory sections and 4 voluntary (bonus) sections. The sections are
1) Data Structures
2) Execution Model
3) Parallelism and Load-Balancing
4) Graphics and Rendering
6) Basic Deployment and Development Environment Issues
We will spend this workshop focusing on the first 3 sections, but I
think we will derive some useful/motivating information from any
answers to questions in the voluntary sections.
I'll post my answers to this survey on the diva mailing list very soon.
You can post your answers publicly if you want to, but I am happy to
regurgitate your answers as "anonymous contributors" if it will enable
you to be more candid in your evaluation of available technologies.
1) Data Structures/Representations/Management==================
The center of every successful modular visualization architecture has
been a flexible core set of data structures for representing data that
is important to the targeted application domain. Before we can begin
working on algorithms, we must come to some agreement on common methods
(either data structures or accessors/method calls) for exchanging data
between components of our vis framework.
There are two potentially disparate motivations for defining the data
representation requirements. In the coarse-grained case, we need to
define standards for exchanging data between components in this
framework (interoperability). In the fine-grained case, we want to
define some canonical data structures that can be used within a
component -- one developed specifically for this framework. These two
use-cases may drive different sets of requirements and implementation
* Do you feel both of these use cases are equally important or should
we focus exclusively on one or the other?
I think both cases are important, but agreeing upon the fine-grained access
will be harder.
* Do you feel the requirements for each of these use-cases are aligned
or will they involve two separate development tracks? For instance,
using "accessors" (method calls that provide abstract access to
essentially opaque data structures) will likely work fine for the
coarse-grained data exchanges between components, but will lead to
inefficiencies if used to implement algorithms within a particular
* As you answer the "implementation and requirements" questions below,
please try to identify where coarse-grained and fine-grained use cases
will affect the implementation requirements.
I think the focus should be on interfaces rather than data structures. I
would advocate this approach not just because it's the standard
"object-oriented" way, but because it's the one we followed with FEL,
and now FM, and it has been a big win for us. It's a significant benefit
not having to maintain different versions of the same visualization
technique, each dedicated to a different method for producing the
data (i.e., different data structures). So, for example, we use the same
visualization code in both in-core and out-of-core cases. Assuming up
front that an interface-based approach would be too slow is, in my
humble opinion, classic premature optimization.
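
To make the interface-over-data-structure point concrete, here is a minimal
Python sketch. It is not FM's API (FM is C++); the names Field, InCoreField,
OutOfCoreField and field_max are hypothetical, and the file layout (raw
little-endian doubles) is an assumption chosen only for illustration. The
point is that one analysis routine serves both the in-core and the
out-of-core case.

```python
import os
import struct
import tempfile
from abc import ABC, abstractmethod


class Field(ABC):
    """Hypothetical accessor interface: algorithms see only these methods."""

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def value(self, i: int) -> float: ...


class InCoreField(Field):
    """Values held in a Python list in memory."""

    def __init__(self, values):
        self._values = list(values)

    def size(self):
        return len(self._values)

    def value(self, i):
        return self._values[i]


class OutOfCoreField(Field):
    """Values read on demand from a file of little-endian doubles."""

    def __init__(self, path):
        self._path = path
        self._n = os.path.getsize(path) // 8

    def size(self):
        return self._n

    def value(self, i):
        with open(self._path, "rb") as f:
            f.seek(8 * i)
            return struct.unpack("<d", f.read(8))[0]


def field_max(field: Field) -> float:
    """One analysis routine, written once against the interface."""
    return max(field.value(i) for i in range(field.size()))


data = [0.5, 3.25, -1.0, 2.0]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<4d", *data))
in_core_max = field_max(InCoreField(data))
out_of_core_max = field_max(OutOfCoreField(tmp.name))
os.remove(tmp.name)
```

The per-value file open in OutOfCoreField is deliberately naive; a real
out-of-core layer would page or cache blocks behind the same interface.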
What are the requirements for the data representations that must be
supported by a common infrastructure? We will start by answering Pat's
questions about representation requirements and follow up with
personal experiences involving particular domain scientist's
Must: support for structured data
Structured data support is a must.
Must/Want: support for multi-block data?
Multi-block is a must.
Must/Want: support for various unstructured data representations?
We have unstructured data, mostly based on tetrahedral or prismatic meshes.
We need support for at least those types. I do not think we could simply
graft unstructured data support on top of our structured data structures.
Must/Want: support for adaptive grid standards? Please be specific
about which adaptive grid methods you are referring to. Restricted
block-structured AMR (aligned grids), general block-structured AMR
(rotated grids), hierarchical unstructured AMR, or non-hierarchical
adaptive structured/unstructured meshes.
Adaptive grid support is a "want" for us currently, probably eventually
a "must". The local favorite is CART3D, which consists of hierarchical
regular grids. The messy part is that CART3D also supports having
more-or-less arbitrary shapes in the domain, e.g., an aircraft fuselage.
Handling the shape description and all the "cut cell" intersections
I expect will be a pain.
Must/Want: "vertex-centered" data, "cell-centered" data?
Most of the data we see is still vertex-centered. FM supports other
associations, but we haven't used them much so far.
Must: support time-varying data, sequenced, streamed data?
Support for time-varying data is a must.
Must/Want: higher-order elements?
Occasionally people ask about it, but we haven't found it to be a "must".
Must/Want: Expression of material interface boundaries and other
special-treatment of boundary conditions.
We don't see this so much. "Want", but not must.
* For commonly understood datatypes like structured and unstructured,
please focus on any features that are commonly overlooked in typical
implementations. For example, often data-centering is overlooked in
structured data representations in vis systems and FEM researchers
commonly criticize vis people for co-mingling geometry with topology
for unstructured grid representations. Few datastructures provide
proper treatment of boundary conditions or material interfaces. Please
describe your personal experience on these matters.
One thing left out of the items above is support for some sort of "blanking"
mechanism, i.e., a means to indicate that the data at some nodes are not
valid. That's a must for us. For instance, with Earth science data we see
the use of some special value to indicate "no data" locations.
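
A rough sketch of what a blanking check might look like, in Python. The
sentinel value -9999.0 and the helper names is_valid and valid_mean are
assumptions for illustration; real datasets define their own "no data"
value, and this says nothing about how FM implements blanking.

```python
import math

NO_DATA = -9999.0  # assumed sentinel; each dataset defines its own


def is_valid(value, sentinel=NO_DATA):
    """A node is treated as blanked if it holds the sentinel or NaN."""
    return not (value == sentinel or math.isnan(value))


def valid_mean(values, sentinel=NO_DATA):
    """Mean over non-blanked nodes; None if every node is blanked."""
    good = [v for v in values if is_valid(v, sentinel)]
    return sum(good) / len(good) if good else None


samples = [280.0, NO_DATA, 290.0, float("nan"), 300.0]
mean_temp = valid_mean(samples)  # averages only the three valid nodes
```

Without the validity check, the sentinel would be averaged in as if it were
a measurement, which is the failure a blanking mechanism is meant to prevent.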
* Please describe data representation requirements for novel data
representations such as bioinformatics and terrestrial sensor datasets.
In particular, how should we handle more abstract data that is
typically given the moniker "information visualization".
"Field Model" draws the line only trying to represent fields and the meshes
that the fields are based on. I'm not really familiar enough with other types
of data to know what interfaces/data-structures would be best. We haven't
seen a lot of demand for those types of data as of yet. A low-priority "want".
What do you consider the most elegant/comprehensive implementation for
data representations that you believe could form the basis for a
comprehensive visualization framework?
* For instance, AVS uses entirely different datastructures for
structured, unstructured and geometry data. VTK uses class inheritance
to express the similarities between related structures. Ensight treats
unstructured data and geometry nearly interchangeably. OpenDX uses more
vector-bundle-like constructs to provide a more unified view of
disparate data structures. FM uses data-accessors (essentially keeping
the data structures opaque).
Well, as you'd expect, as the primary author of Field Model (FM) I think it's
the most elegant/comprehensive of the lot. It handles structured and
unstructured data. It handles non-vertex-centered data. I think it
should be able to handle adaptive data, though it hasn't actually been
put to the test yet. And of course every adaptive mesh scheme is a little
different. I think it could handle boundary condition needs, though that's
not something we see much of.
* Are there any of the requirements above that are not covered by the
structure you propose?
Out-of-core? Derived fields? Analytic meshes (e.g., regular meshes)?
Differential operators? Interpolation methods?
* This should focus on the elegance/usefulness of the core
design-pattern employed by the implementation rather than a
point-by-point description of the implementation!
I think if we could reasonably cover the (preliminary) requirements above,
that would be a good first step. I agree with Randy that whatever we
come up with will have to be able to "adapt" over time as our understanding
* Is there information or characteristics of particular file format
standards that must percolate up into the specific implementation of
the in-memory data structures?
In FM we tried hard to keep file-format-specific stuff out of the core model.
Instead, there are additional modules built on top of FM that handle
the file-format-specific stuff, like I/O and derived fields specific to
a particular format. Currently we have PLOT3D, FITS, and HDFEOS4
modules that are pretty well filled out, and other modules that are
mostly skeletons at this point.
We should also be careful not to assume that analyzing the data starts
with "read the data from a file into memory, ...". Don't forget out-of-core,
analysis concurrent with simulation, among others.
One area where the file-format-specific issues creep in is with metadata.
Most file formats have some sort of metadata storage support, some much
more elaborate than others. Applications need to get at this metadata,
possibly through the data model, possibly some other way. I don't have
the answer here, but it's something to keep in mind.
For the purpose of this survey, "data analysis" is defined broadly as
all non-visual data processing done *after* the simulation code has
finished and *before* "visual analysis".
* Is there a clear dividing line between "data analysis" and "visual
Your definition excludes concurrent analysis and steering from
"visualization". Is this intentional? I don't think there's a clear dividing
* Can we (should we) incorporate data analysis functionality into this
framework, or is it just focused on visual analysis.
I think you would also want to include feature detection techniques. For
large data analysis in particular, we don't want to assume that the scientist
will want to do the analysis by visually scanning through all the data.
* What kinds of data analysis typically need to be done in your
field? Please give examples and how these functions are currently
Around here there is interest in vector-field topology feature detection
techniques, for instance, vortex-core detection.
* How do we incorporate powerful data analysis functionality into the
Carefully :-)? By striving not to make a closed system.
2) Execution Model=======================
It will be necessary for us to agree on a common execution semantics
for our components. Otherwise, we might have compatible data
structures but incompatible execution requirements. Execution
semantics is akin to the function of protocol in the context of network
serialization of data structures. The motivating questions are as
* How is the execution model affected by the kinds of
algorithms/system-behaviors we want to implement.
In general I see choices where at one end of the spectrum we have
simple analysis techniques where most of the control responsibilities
are handled from the outside. At the other end we could have more
elaborate techniques that may handle load balancing, memory
management, thread management, and so on. Techniques towards
the latter end of the spectrum will inevitably be intertwined more
with the execution model.
* How then will a given execution model affect data structure
Well, there are always thread-safety issues.
* How will the execution model be translated into execution semantics
on the component level. For example will we need to implement special
control-ports on our components to implement particular execution
models or will the semantics be implicit in the way we structure the
method calls between components.
What kinds of execution models should be supported by the distributed
* View dependent algorithms? (These were typically quite difficult to
implement for dataflow visualization environments like AVS5).
Not used heavily here, but would be interesting. A "want".
* Out-of-core algorithms
A "must" for us.
* Progressive update and hierarchical/multiresolution algorithms?
* Procedural execution from a single thread of control (ie. using a
commandline language like IDL to interactively control a dynamic or
large parallel back-end)
Scripting support is a "must".
* Dataflow execution models? What is the firing method that should be
employed for a dataflow pipeline? Do you need a central executive like
AVS/OpenDX, a completely distributed firing mechanism like that of
VTK, or some sort of abstraction that allows the modules to be used
with either executive paradigm?
Preferably a design that does not lock us in to one execution model.
* Support for novel data layouts like space-filling curves?
Not a pressing need here, as of yet.
* Are there special considerations for collaborative applications?
* What else?
Distributed control? Fault tolerance?
How will the execution model affect our implementation of data
* how do you decompose a data structure such that it is amenable to
streaming in small chunks?
Are we assuming streaming is a requirement?
How do you handle visualization algorithms where the access patterns
are not known a priori? The predominant example: streamlines and streaklines.
Note the access patterns can be in both space and time. How do you avoid
having each analysis technique need to know about each possible data
structure in order to negotiate a streaming protocol? How do you add another
data structure in the future without having to go through all the analysis
techniques and put another case in their streaming negotiation code?
In FM the fine-grained data access ("accessors") is via a standard
interface. The evaluation is all lazy. This design means more
function calls, but it frees the analysis techniques from having to know
access patterns a priori and negotiate with the data objects. In FM
the data access methods are virtual functions. We find the overhead
not to be a problem, even with relatively large data. In fact, the overhead
is less an issue with large data because the data are less likely to be
served up from a big array buffer in memory (think out-of-core, remote
out-of-core, time series, analytic meshes, derived fields, differential-
operator fields, transformed objects, etc., etc.).
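
As an illustration of lazy, accessor-based evaluation with an access pattern
that is not known in advance, here is a small Python sketch of a streamline
tracer. The class names, the at(x, y, t) signature and the forward-Euler
integrator are all assumptions made for the example; FM's actual interface
is C++ and differs. The field is analytic, so no array exists anywhere, and
the counting wrapper shows that only the points the streamline visits are
ever evaluated.

```python
class VectorField:
    """Hypothetical accessor interface: velocity at a point and time."""

    def at(self, x, y, t=0.0):
        raise NotImplementedError


class RotationField(VectorField):
    """Analytic field v = (-y, x); nothing is stored in an array."""

    def at(self, x, y, t=0.0):
        return (-y, x)


class CountingField(VectorField):
    """Wrapper that records how many point evaluations were requested."""

    def __init__(self, inner):
        self.inner = inner
        self.calls = 0

    def at(self, x, y, t=0.0):
        self.calls += 1
        return self.inner.at(x, y, t)


def streamline(field, start, step, n_steps):
    """Forward-Euler streamline; each step asks only for the point it needs."""
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        vx, vy = field.at(x, y)
        x, y = x + step * vx, y + step * vy
        path.append((x, y))
    return path


field = CountingField(RotationField())
path = streamline(field, start=(1.0, 0.0), step=0.5, n_steps=3)
```

The tracer never negotiates with the data object about which region it will
need; swapping RotationField for an out-of-core or derived field would not
change streamline at all.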
The same access-through-an-interface approach could be done without
virtual functions, in order to squeeze out a little more performance, though
I'm not convinced it would be worth it. To start with you'd probably end up
doing a lot more C++ templating. Eliminating the virtual functions would
make it harder to compose things at run-time, though you might be able
to employ run-time compilation techniques a la SCIRun 2.
* how do you represent temporal dependencies in that model?
In FM, data access arguments have a time value; the field interface is
the same for both static and time-varying data.
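
A minimal Python sketch of the idea, under stated assumptions: the class
names, the value(i, t) signature and the linear interpolation between stored
time steps are hypothetical choices for illustration, not FM's design. The
caller uses the same call for a static field and a time series.

```python
from bisect import bisect_right


class StaticField:
    """Static data: the time argument is accepted and ignored."""

    def __init__(self, values):
        self._values = list(values)

    def value(self, i, t=0.0):
        return self._values[i]


class TimeSeriesField:
    """Time-varying data: linear interpolation between stored snapshots."""

    def __init__(self, times, snapshots):
        self._times = list(times)
        self._snapshots = [list(s) for s in snapshots]

    def value(self, i, t=0.0):
        ts = self._times
        if t <= ts[0]:
            return self._snapshots[0][i]
        if t >= ts[-1]:
            return self._snapshots[-1][i]
        k = bisect_right(ts, t) - 1  # ts[k] <= t < ts[k + 1]
        w = (t - ts[k]) / (ts[k + 1] - ts[k])
        a, b = self._snapshots[k][i], self._snapshots[k + 1][i]
        return (1.0 - w) * a + w * b


def sample(field, i, t):
    """Analysis code: identical for static and time-varying fields."""
    return field.value(i, t)


static = StaticField([0.0, 10.0])
series = TimeSeriesField([0.0, 1.0, 2.0],
                         [[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
static_sample = sample(static, 1, 0.5)
series_sample = sample(series, 1, 0.5)
```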
* how do you minimize recomputation in order to regenerate data for
Caching? I don't have a lot of experience with view-dependent algorithms.
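
For what it's worth, the simplest form of caching is memoizing a result on
the parameters that produced it. The Python sketch below uses the standard
functools.lru_cache; the function cells_above and its arguments are
hypothetical stand-ins for an expensive derived result, and whether this
kind of keying is adequate for view-dependent algorithms is left open.

```python
from functools import lru_cache

evaluations = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=64)
def cells_above(timestep: int, threshold: float) -> int:
    """Stand-in for an expensive derived result (hypothetical)."""
    global evaluations
    evaluations += 1
    samples = [((i * 31 + timestep * 7) % 100) / 100.0 for i in range(10_000)]
    return sum(1 for s in samples if s > threshold)


first = cells_above(3, 0.5)
second = cells_above(3, 0.5)  # same arguments: served from the cache
other = cells_above(4, 0.5)   # new arguments: computed
```

The bounded maxsize matters for large data: an unbounded cache of derived
results can easily outgrow the memory it was meant to save.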
What are the execution semantics necessary to implement these execution
* how does a component know when to compute new data? (what is the
* does coordination of the component execution require a central
executive or can it be implemented using only rules that are local to a
* how elegantly can execution models be supported by the proposed
execution semantics? Are there some things, like loops or
back-propagation of information that are difficult to implement using a
particular execution semantics?
The execution models we have used have kept the control model in
each analysis technique pretty simple, relying on an external executive.
The one big exception is with multi-threading. We've experimented with
more elaborate parallelism and load-balancing techniques, motivated in
part by latency hiding desires.
How will security considerations affect the execution model?
More libraries to link to? More latency in network communication?
3) Parallelism and load-balancing=================
Thus far, managing parallelism in visualization systems has been
tedious and difficult at best. Part of this is a lack of powerful
abstractions for managing data-parallelism, load-balancing and
Please describe the kinds of parallel execution models that must be
supported by a visualization component architecture.
* data-parallel/dataflow pipelines?
* master/slave work-queues?
* streaming update for management of pipeline parallelism?
* chunking mechanisms where the number of chunks may be different from
the number of CPUs employed to process those chunks?
We're pretty open here. Mostly straight-forward work-queues.
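
A straightforward work-queue in which the number of chunks differs from the
number of workers might look like the Python sketch below. It uses the
standard concurrent.futures thread pool; the chunk size, worker count and
helper names are arbitrary assumptions for the example, not a description of
our own code.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(n_items, chunk_size):
    """Split [0, n_items) into half-open index ranges of at most chunk_size."""
    return [(s, min(s + chunk_size, n_items))
            for s in range(0, n_items, chunk_size)]


def process_chunk(data, bounds):
    """Per-chunk work: here, just the maximum over the chunk."""
    lo, hi = bounds
    return max(data[lo:hi])


data = [(i * 37) % 101 for i in range(1000)]
chunks = make_chunks(len(data), chunk_size=64)  # 16 chunks

# Three workers pull chunks from the pool's internal queue as they finish.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(lambda b: process_chunk(data, b), chunks))

global_max = max(partial)
```

Having more chunks than workers is what gives the queue room to balance
load when some chunks take longer than others.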
* how should one manage parallelism for interactive scripting
languages that have a single thread of control? (eg. I'm using a
commandline language like IDL that interactively drives an arbitrarily
large set of parallel resources. How can I make the parallel back-end
available to a single-threaded interactive thread of control?)
I've used Python to control multiple execution threads. The (C++)
data objects are thread safe, and the minimal provisions for thread-safe
objects in Python haven't been too much of a problem.
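
Roughly, the pattern is a single controlling script that starts worker
threads against a shared, thread-safe data object and then joins them. The
Python sketch below is only an illustration: SharedCounts is a hypothetical
pure-Python stand-in for a thread-safe data object, guarded by a standard
threading.Lock, whereas in our case the thread safety lives in the C++
objects.

```python
import threading


class SharedCounts:
    """Stand-in for a thread-safe data object; a lock guards all updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def add(self, key, n=1):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + n

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


def worker(counts, values):
    """Each worker classifies its own block and updates the shared object."""
    for v in values:
        counts.add("even" if v % 2 == 0 else "odd")


counts = SharedCounts()
blocks = [range(k * 250, (k + 1) * 250) for k in range(4)]
threads = [threading.Thread(target=worker, args=(counts, b)) for b in blocks]
for th in threads:
    th.start()
for th in threads:
    th.join()
result = counts.snapshot()
```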
Please describe your vision of what kinds of software support /
programming design patterns are needed to better support parallelism
and load balancing.
* What programming model should be employed to express parallelism.
(UPC, MPI, SMP/OpenMP, custom sockets?)
* Can you give some examples of frameworks or design patterns that you
consider very promising for support of parallelism and load balancing.
(ie. PNNL Global Arrays or Sandia's Zoltan)
* Should we use novel software abstractions for expressing parallelism
or should the implementation of parallelism simply be an opaque
property of the component? (ie. should there be an abstract messaging
layer or not)
* How does the NxM work fit in to all of this? Is it sufficiently
differentiated from Zoltan's capabilities?
I don't have a strong opinion here. I'm not familiar with Zoltan et al.
Our experience with parallelism tends to be more shared-memory than
===============End of Mandatory Section (the rest is
Laziness prevails here, so I'll stop for now :-).