From: James Kohl <kohlja@ornl.gov>
Date: Wed Sep 10, 2003 3:32:44 PM US/Pacific
To: John Shalf <jshalf@lbl.gov>
Cc: diva@lbl.gov
Subject: Re: DiVA Survey (Please return by Sept 10!)
Reply-To: kohlja@ornl.gov
O.K., here goes...
wahoo... :)
=============The Survey=========================
1) Data Structures/Representations/Management==================
The center of every successful modular visualization architecture has been a flexible core set of data structures for representing data that is important to the targeted application domain. Before we can begin working on algorithms, we must come to some agreement on common methods (either data structures or accessors/method calls) for exchanging data between components of our vis framework.
There are two potentially disparate motivations for defining the data representation requirements. In the coarse-grained case, we need to define standards for exchanging data between components in this framework (interoperability). In the fine-grained case, we want to define some canonical data structures that can be used within a component -- one developed specifically for this framework. These two use-cases may drive different sets of requirements and implementation issues.
*
Do you feel both of these use cases are equally important or should we focus exclusively on one or the other?
I think for now we need to exclusively focus on exchanging data between components, rather than any fine-grained generalized data objects... The first-order entry into any component development is to "wrap up what ya got". The "rip things apart" phase comes after you can glue all the coarse-grained pieces together reliably...
*
Do you feel the requirements for each of these use-cases are aligned or will they involve two separate development tracks?
Two separate development tracks. Definitely. There are different driving design forces and they can be developed (somewhat) independently (I hope).
For instance, using "accessors" (method calls that provide abstract access to essentially opaque data structures) will likely work fine for the coarse-grained data exchanges between components, but will lead to inefficiencies if used to implement algorithms within a particular component.
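Just to make the "accessor" contrast concrete, here's a rough sketch of what I have in mind -- all names made up, not any existing API:

// Coarse-grained accessor interface: the caller never sees the underlying
// layout, it just asks for what it needs through method calls.
class FieldAccessor {
public:
    virtual ~FieldAccessor() {}
    virtual int numPoints() const = 0;
    virtual int numComponents() const = 0;
    // Copy one tuple of values out of the (opaque) internal storage.
    virtual void getTuple(int pointId, double* values) const = 0;
};

// Fine-grained code *inside* a component would rather grab the raw array
// once and loop over it directly -- a virtual call per point is too slow.
double sumFirstComponent(const double* raw, int numPoints, int stride) {
    double sum = 0.0;
    for (int i = 0; i < numPoints; ++i)
        sum += raw[i * stride];
    return sum;
}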
*
As you answer the "implementation and requirements" questions below, please try to identify where coarse-grained and fine-grained use cases will affect the implementation requirements.
What are the requirements for the data representations that must be supported by a common infrastructure? We will start by answering Pat's questions about representation requirements and follow up with personal experiences involving particular domain scientists' requirements.
Must: support for structured data
Must.
Must/Want: support for multi-block data?
Must.
Must/Want: support for various unstructured data representations? (which ones?)
Must/Want: support for adaptive grid standards? Please be specific about which adaptive grid methods you are referring to: restricted block-structured AMR (aligned grids), general block-structured AMR (rotated grids), hierarchical unstructured AMR, or non-hierarchical adaptive structured/unstructured meshes.
Must/Want: "vertex-centered" data, "cell-centered" data? other-centered?
All of these should be "Wants", to the extent that they require more sophisticated handling, or are less well-known in terms of generalizing the interfaces.
For example, the AMR folks have been trying to get together and define a standard API, and have been as yet unsuccessful. Who are we to attempt this where they have failed...?
So to clarify, if we *really* understand (or think we do) a particular data representation/organization, or even a specific subset of a general representation type, then by all means let's whittle an API into our stuff. Otherwise, leave it alone for someone else to do, or do as strictly needed.
Must: support time-varying data, sequenced, streamed data?
MUST.
Must/Want: higher-order elements?
Must/Want: expression of material interface boundaries and other special treatment of boundary conditions.
Wants, see above...
*
For commonly understood datatypes like structured and unstructured, please focus on any features that are commonly overlooked in typical implementations. For example, data-centering is often overlooked in structured data representations in vis systems, and FEM researchers commonly criticize vis people for co-mingling geometry with topology in unstructured grid representations. Few data structures provide proper treatment of boundary conditions or material interfaces. Please describe your personal experience on these matters.
*
Please describe data representation requirements for novel data representations such as bioinformatics and terrestrial sensor datasets. In particular, how should we handle more abstract data that is typically given the moniker "information visualization"?
I don't think we should "pee in this pool" either, yet. Are any of us experts in this kind of viz? Let's stick with what we collectively know best and make that work before we try to tackle a related-but-fundamentally-different domain.
What do you consider the most elegant/comprehensive implementation for data representations that you believe could form the basis for a comprehensive visualization framework?
Sounds like the "Holy Grail" to me... If anything even remotely close to this already existed, we'd all be using it already... (Unless of course it's the dreaded NIH syndrome...)
*
For instance, AVS uses entirely different data structures for structured, unstructured and geometry data. VTK uses class inheritance to express the similarities between related structures. Ensight treats unstructured data and geometry nearly interchangeably. OpenDX uses more vector-bundle-like constructs to provide a more unified view of disparate data structures. FM uses data-accessors (essentially keeping the data structures opaque).
*
Are there any of the requirements above that are not covered by the structure you propose?
*
This should focus on the elegance/usefulness of the core design-pattern employed by the implementation rather than a point-by-point description of the implementation!
*
Is there information or characteristics of particular file format standards that must percolate up into the specific implementation of the in-memory data structures?
I dunno, but what does HDF5 or NetCDF include? We should definitely be able to handle various meta-data...
Otherwise, our viz framework should be able to read in all sorts of file-based data as input, converting it seamlessly into our "Holy Data Grail" format for all the components to use and pass around. But the data shouldn't be identifiable as having once been HDF or NetCDF, etc... (i.e. it's important to read the data format, but not to use it internally)
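Something along these lines is what I'm picturing -- a reader per file format, all producing the same internal dataset object. (Just a sketch with invented names; the HDF5 calls are only indicated in comments, not worked out.)

// Format-specific readers hide the file format behind one interface.
// "VizDataset" stands in for whatever our "Holy Data Grail" in-memory
// representation turns out to be.
class VizDataset;  // opaque internal representation

class DataReader {
public:
    virtual ~DataReader() {}
    virtual VizDataset* read(const char* filename) = 0;
};

class Hdf5Reader : public DataReader {
public:
    VizDataset* read(const char* filename) {
        // ... open the file (H5Fopen), walk its datasets, copy arrays and
        // meta-data into a new VizDataset, close the file ...
        return 0;  // placeholder for the sketch
    }
};
// Downstream components only ever see a VizDataset -- nothing in it
// should say "this used to be HDF5" or "this used to be NetCDF".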
For the purpose of this survey, "data analysis" is defined broadly as all non-visual data processing done *after* the simulation code has finished and *before* "visual analysis".
*
Is there a clear dividing line between "data analysis" and "visual analysis" requirements?
NO. There shouldn't be - these operations are tightly coupled, or even symbiotic, and *should* all be incorporated into the same framework, indistinguishable from each other.
*
Can we (should we) incorporate data analysis functionality into this framework, or is it just focused on visual analysis?
YES.
*
What kinds of data analysis typically need to be done in your field?
Simple sampling, basic statistical averages/deviations, principal component analysis (PCA, or EOF for climate folks), other dimension reduction.
Please give examples and how these functions are currently implemented.
C/C++ code... mostly slow serial... :-Q
*
How do we incorporate powerful data analysis functionality into the framework?
As components (duh)... :-)
We should define some "standard" APIs for the desired analysis functions, and then either wrap existing codes as components or shoehorn in existing component implementations from systems like ASPECT.
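For example, wrapping an existing statistics routine behind one of those "standard" analysis APIs might look roughly like this (hypothetical interface, not actual CCA or ASPECT syntax):

#include <cmath>
#include <cstddef>
#include <vector>

// A "standard" analysis port that any statistics component would provide.
class StatisticsPort {
public:
    virtual ~StatisticsPort() {}
    virtual double mean(const std::vector<double>& data) = 0;
    virtual double stddev(const std::vector<double>& data) = 0;
};

// Wrapper around existing (slow, serial :-) C/C++ code.
class LegacyStatsComponent : public StatisticsPort {
public:
    double mean(const std::vector<double>& data) {
        double sum = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];
        return data.empty() ? 0.0 : sum / data.size();
    }
    double stddev(const std::vector<double>& data) {
        double m = mean(data), acc = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            acc += (data[i] - m) * (data[i] - m);
        return data.empty() ? 0.0 : std::sqrt(acc / data.size());
    }
};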
2) Execution Model=======================
It will be necessary for us to agree on common execution semantics for our components. Otherwise, we might have compatible data structures but incompatible execution requirements. Execution semantics is akin to the function of protocol in the context of network serialization of data structures. The motivating questions are as follows:
*
How is the execution model affected by the kinds of algorithms/system-behaviors we want to implement?
Directly. There are probably a few main exec models we want to cover. I don't think the list is *that* long...
As such, we should anticipate building several distinct framework environments that each exclusively support a given exec model. Then the trick is to "glue" these individual frameworks together so they can interoperate (exchange data and invoke each others' component methods) and be arbitrarily "bridged" together to form complex higher-level pipelines or other local/remote topologies.
*
How then will a given execution model affect data structure implementations?
I don't think it should affect the data structure impls at all, per se. Clearly, the access patterns will be different for various execution models, but this shouldn't change the data impl. Perhaps a better question is how to indicate the expected access pattern to allow a given data impl to optimize or properly prefetch/cache the accesses...
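Something as simple as an access-pattern hint might do -- a sketch with made-up names:

// Hint the framework (or a downstream component) could pass to a data
// object so it can prefetch/cache appropriately -- the data layout and
// the data API themselves stay the same.
enum AccessPattern {
    ACCESS_SEQUENTIAL,  // streaming front-to-back
    ACCESS_RANDOM,      // view-dependent / unpredictable
    ACCESS_BLOCKED      // chunk-at-a-time, out-of-core style
};

class DataObject {
public:
    DataObject() : hint_(ACCESS_SEQUENTIAL) {}
    void setAccessHint(AccessPattern hint) { hint_ = hint; }
    // ...read methods would consult hint_ to pick a prefetch/cache policy...
private:
    AccessPattern hint_;
};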
*
How will the execution model be translated into execution semantics on the component level? For example, will we need to implement special control-ports on our components to implement particular execution models, or will the semantics be implicit in the way we structure the method calls between components?
Components should be "dumb" and let other components or the framework invoke them as needed for a given execution model. The framework dictates the control flow, not the component. The API shouldn't change.
If you want multi-threaded components, then the framework had better support that, and the API for the component should take the possibility into account.
What kinds of execution models should be supported by the distributed visualization architecture?
*
View-dependent algorithms? (These were typically quite difficult to implement for dataflow visualization environments like AVS5.)
Want.
*
Out-of-core algorithms?
Must. This is a necessary evil of "big data". You need some killer caching infrastructure throughout the pipeline (e.g. like VizCache).
*
Progressive update and hierarchical/multiresolution algorithms?
Must.
*
Procedural execution from a single thread of control (i.e. using a commandline language like IDL to interactively control a dynamic or large parallel back-end)?
This is not an execution model, it is a command/control interface issue. You should be able to have a GUI, programmatic control, or scripting to dictate interactive control (or "steering" as they call it... :-). The internal software organization shouldn't change, just the interface to the outside (or inside) world...
*
Dataflow execution models?
Must.
What is the firing method that should be employed for a dataflow pipeline? Do you need a central executive like AVS/OpenDX, a completely distributed firing mechanism like that of VTK, or some sort of abstraction that allows the modules to be used with either executive paradigm?
This should be an implementation issue in the "dataflow framework", and should not affect the component-level APIs.
*
Support for novel data layouts like space-filling curves?
Must. But this isn't an execution model either. It's a data structure or algorithmic detail...
*
Are there special considerations for collaborative applications?
Surely. The interoperability of distinct framework implementations ties in with this... but the components shouldn't be aware that they are being run collaboratively/remotely... definitely a framework issue.
*
What else?
Yeah right.
How will the execution model affect our implementation of data structures?
It shouldn't. The execution model should be kept independent of the data structures as much as possible. If you want to build higher-level APIs for specific data access patterns, that's fine, but keep the underlying data consistent where possible.
*
how do you decompose a data structure such that it is amenable to streaming in small chunks?
This sounds a lot like distributed data decompositions. I suspect that given a desired block/cycle size, you can organize/decompose data in all sorts of useful ways, depending on the expected access pattern.
In conjunction with this, you could also reorganize static datasets into filesystem databases, with appropriate naming conventions or perhaps a special protocol for lining up the data blob files in the desired order for streaming (in either time or space along any axis). Meta-data in the files might be handy here, too, if it's indexed efficiently for fast lookup/searching/selection.
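For a plain structured grid, the mechanical part is pretty simple -- e.g. Z-slab chunks of a requested size (just a sketch):

#include <vector>

// One streamable chunk of an NX x NY x NZ structured grid, described as a
// contiguous range of Z-slabs (assuming Z is the slowest-varying index).
struct Chunk { int zStart, zCount; };

std::vector<Chunk> decomposeIntoSlabs(int nz, int slabsPerChunk) {
    std::vector<Chunk> chunks;
    for (int z = 0; z < nz; z += slabsPerChunk) {
        Chunk c;
        c.zStart = z;
        c.zCount = (z + slabsPerChunk <= nz) ? slabsPerChunk : nz - z;
        chunks.push_back(c);
    }
    return chunks;
}
// A streaming pipeline then requests chunk i, processes it, and discards
// it before asking for chunk i+1 -- same idea for time steps.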
*
how do you represent temporal dependencies in that model?
Meta-data, or file naming conventions...
*
how do you minimize recomputation in order to regenerate data for view-dependent algorithms?
No clue.
What are the execution semantics necessary to implement these execution models?
*
how does a component know when to compute new data? (what is the firing rule?)
There are really only 2 possibilities I can see - either a component is directly invoked by another component or the framework, or else a method must be triggered by some sort of dataflow dependency or stream-based event mechanism.
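Either way the component itself can stay "dumb" -- the same entry point, just different callers. A rough sketch (invented names):

// The component only knows how to compute when told to.
class Component {
public:
    virtual ~Component() {}
    virtual void execute() = 0;  // the single entry point
};

// Case 1: direct invocation by the framework or another component.
void runOnce(Component& c) { c.execute(); }

// Case 2: dataflow/event-triggered firing -- framework-side machinery
// counts "input ready" events and then invokes the same entry point.
class FiringRule {
public:
    FiringRule(Component& c, int inputsNeeded)
        : component_(c), needed_(inputsNeeded), ready_(0) {}
    void inputArrived() {
        if (++ready_ >= needed_) { ready_ = 0; component_.execute(); }
    }
private:
    Component& component_;
    int needed_, ready_;
};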
*
does coordination of the component execution require a central executive or can it be implemented using only rules that are local to a particular component?
This is a framework implementation detail. No. No. Bad Dog. The component doesn't know what's outside of it (in the rest of the framework, or the outside world). It only gets invoked, one way or another.
*
how elegantly can execution models be supported by the proposed execution semantics? Are there some things, like loops or back-propagation of information, that are difficult to implement using a particular execution semantics?
We need to keep the different execution models separate, as implementation details of individual frameworks. This separates the concerns here.
How will security considerations affect the execution model?
Ha ha ha ha...
They won't right away, except in collaboration scenarios. Think "One MPI Per Framework" and do things the old-fashioned way locally, then do the "glue" for inter-framework connectivity with proper authentication only as needed. (No worse than Globus... :-)
3) Parallelism and load-balancing=================
Thus far, managing parallelism in visualization systems has been tedious and difficult at best. Part of this is a lack of powerful abstractions for managing data-parallelism, load-balancing and component control.
Please describe the kinds of parallel execution models that must be supported by a visualization component architecture.
*
data-parallel/dataflow pipelines?
Must.
*
master/slave work-queues?
Must.
*
streaming update for management of pipeline parallelism?
Must.
*
chunking mechanisms where the number of chunks may be different from the number of CPUs employed to process those chunks?
This sounds the same as master/slave to me, as in "bag of tasks"...
*
how should one manage parallelism for interactive scripting languages that have a single thread of control? (e.g. I'm using a commandline language like IDL that interactively drives an arbitrarily large set of parallel resources. How can I make the parallel back-end available to a single-threaded interactive thread of control?)
Broadcast, Baby... Either you blast the commands out to everyone SIMD style (unlikely) or else you talk to the Rank 0 task and the command gets forwarded on a fast internal network.
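i.e. roughly this shape, in plain MPI (nothing framework-specific here):

#include <mpi.h>
#include <cstring>

// Rank 0 owns the interactive front-end (IDL prompt, GUI, whatever) and
// forwards each command to the parallel back-end with a broadcast.
// Only rank 0's "userCommand" argument matters; the rest just listen.
void forwardCommand(const char* userCommand) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char cmd[256] = {0};
    if (rank == 0)
        std::strncpy(cmd, userCommand, sizeof(cmd) - 1);

    // Everyone receives the same command string...
    MPI_Bcast(cmd, (int)sizeof(cmd), MPI_CHAR, 0, MPI_COMM_WORLD);

    // ...and executes it against its own piece of the data.
    // executeLocally(cmd);
}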
Please describe your vision of what kinds of software support / programming design patterns are needed to better support parallelism and load balancing.
*
What programming model should be employed to express parallelism? (UPC, MPI, SMP/OpenMP, custom sockets?)
All but UPC will be necessary for various functionality.
*
Can you give some examples of frameworks or design patterns that you consider very promising for support of parallelism and load balancing? (i.e. PNNL Global Arrays or Sandia's Zoltan)
http://www.cs.sandia.gov/Zoltan/
http://www.emsl.pnl.gov/docs/global/ga.html
Nope, that covers my list of hopefuls.
*
Should we use novel software abstractions for expressing parallelism or should the implementation of parallelism simply be an opaque property of the component? (i.e. should there be an abstract messaging layer or not?)
It's not our job to develop "novel" parallelism abstractions. We should just use existing abstractions like what the CCA is developing.
*
How does the MxN work fit into all of this? Is it sufficiently differentiated from Zoltan's capabilities?
I don't know what Zoltan can do specifically, but MxN is designed for basic "parallel data redistribution". This means it is good for doing big parallel-to-parallel data movement/transformations between two disparate parallel frameworks, or between two parallel components in the same framework with different data decompositions. MxN is also good for "self-transpose" or other types of local data reorganization within a given (parallel) component.
MxN doesn't do interpolation in space or time (yet, and probably not for a while), and it won't wash your car (but it won't drink your beer either... :-).
If you need something fancier, or if you don't really need any data reorganization between the source and destination of a transfer, then MxN *isn't* for you...
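(Not the actual MxN interface, which has its own spec -- but the bookkeeping it automates is basically overlap computation between two block decompositions, something like:)

#include <vector>

// Contiguous block of a 1-D global array owned by one parallel task.
struct Block { int owner, start, count; };

// Plain block decomposition of n elements over p tasks.
std::vector<Block> blockDecompose(int n, int p) {
    std::vector<Block> blocks;
    for (int r = 0; r < p; ++r) {
        Block b;
        b.owner = r;
        b.start = (int)((long long)n * r / p);
        b.count = (int)((long long)n * (r + 1) / p) - b.start;
        blocks.push_back(b);
    }
    return blocks;
}

// For each (source, destination) block pair, the overlap of their index
// ranges is what actually has to move in an M-to-N redistribution.
int overlap(const Block& src, const Block& dst, int& overlapStart) {
    int lo = src.start > dst.start ? src.start : dst.start;
    int hiSrc = src.start + src.count, hiDst = dst.start + dst.count;
    int hi = hiSrc < hiDst ? hiSrc : hiDst;
    overlapStart = lo;
    return hi > lo ? hi - lo : 0;
}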
===============End of Mandatory Section (the rest is voluntary)=============
4) Graphics and Rendering=================
What do you use for converting geometry and data into images (the rendering-engine)? Please comment on any/all of the following.
*
Should we build modules around declarative/streaming methods for rendering geometry like OpenGL, Chromium and DirectX, or should we move to higher-level representations for graphics offered by scene graphs? What are the pitfalls of building our component architecture around scene graphs?
*
What about Postscript, PDF and other scale-free output methods for publication quality graphics? Are pixmaps sufficient?
In a distributed environment, we need to create a rendering subsystem that can flexibly switch between drawing to a client application by sending images, sending geometry, or sending geometry fragments (image-based rendering). How do we do that?
I would think this could be achieved by a sophisticated data communication protocol - one that encodes the type of data in the stream, say, using XML or some such thingy.
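E.g. a self-describing header on the wire -- whether it's spelled as XML or a packed struct is a detail. (Field names invented.)

// What kind of payload is coming down the rendering stream, so the client
// knows whether it is receiving pixels, polygons, or IBR fragments.
enum PayloadType {
    PAYLOAD_IMAGE = 1,        // rendered pixels
    PAYLOAD_GEOMETRY = 2,     // triangles/vertices for local rendering
    PAYLOAD_IBR_FRAGMENT = 3  // color+depth fragments for image-based rendering
};

struct StreamHeader {
    int  payloadType;   // one of PayloadType
    int  frameNumber;   // which frame this belongs to
    long payloadBytes;  // how much data follows the header
};
// The receiver switches on payloadType and hands the payload to the right
// decoder -- the same connection can change modes from frame to frame.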
*
Please describe some rendering models that you would like to see supported (i.e. view-dependent update, progressive update) and how they would adjust dynamically to changing objective functions (optimize for fastest framerate, or fastest update on geometry change, or varying workloads and resource constraints).
*
Are there any good examples of such a system?
What is the role of non-polygonal methods for rendering (i.e. shaders)?
*
Are you using any of the latest gaming features of commodity cards in your visualization systems today?
*
Do you see this changing in the future? (how?)
5) Presentation=========================
It will be necessary to separate the visualization back-end from the presentation interface. For instance, you may want to have the same back-end driven by entirely different control-panels/GUIs and displayed on different display devices (a CAVE vs. a desktop machine). Such separation is also useful when you want to provide different implementations of the user-interface depending on the targeted user community. For instance, visualization experts might desire a dataflow-like interface for composing visualization workflows, whereas a scientist might desire a domain-specific dashboard-like interface that implements a specific workflow. Both users should be able to share the same back-end components and implementation even though the user interface differs considerably.
How do different presentation devices affect the component model?
Not. The display device only affects the resolution or bandwidth required. This could be parameterized in the component invocation APIs, but should not otherwise change an individual component.
If you want a "multiplexer" to share a massive data stream with a powerwall and a PDA, then the "multiplexer component" implementation handles that...
*
Do different display devices require completely different user interface paradigms?
No. Different GUIs should all map to some common framework command/control interface. The same functions will ultimately get executed, just from buttons with different labels or appl-specific short-cuts... The UIs should all be independent, but talk the same protocol to the framework.
If so, then we must define a clear separation between the GUI description and the components performing the back-end computations. If not, then is there a common language to describe user interfaces that can be used across platforms?
Yuk.
*
Do different display modalities require completely different component/algorithm implementations for the back-end compute engine? (what do we do about that??)
Algorithm maybe, component no. This could fall into the venue of the different execution-model-specific frameworks and/or their bridging... I dunno.
What presentation modalities do you feel are important and which do you consider the most important?
*
Desktop graphics (native applications on Windows, on Macs)
MUST.
*
Graphics access via Virtual Machines like Java?
Ha ha ha ha...
*
CAVEs, Immersadesks, and other VR devices
Must.
*
Ultra-high-res/Tiled display devices?
Must.
*
Web-based applications?
Probably a good idea. Someone always asks for this... :-Q
What abstractions do you think should be employed to separate the presentation interface from the back-end compute engine?
Some sort of general protocol descriptor, like XML...? Nuthin fancy.
*
Should we be using CCA to define the communication between GUI and compute engine, or should we be using software infrastructure that was designed specifically for that space? (i.e. WSDL, OGSA, or CORBA?)
The CCA doesn't do such communication per se. Messaging between or in/out of frameworks is always "out of band" relative to CCA port invocations. If the specific framework impl wants to shove out data on some wire, then it's hidden below the API level...
I would think that WSDL/SOAP would be O.K. for low-bandwidth uses.
*
How do such control interfaces work with parallel applications? Should the parallel application have a single process that manages the control interface and broadcasts to all nodes, or should the control interface treat all application processes within a given component as peers?
I vote for the "single process that manages the control interface and broadcasts to all nodes" (or the variation above, where one of the parallel tasks forwards to the rest internally :-). The latter is not scalable.
BTW, you can't have "application processes within a... component". What does that even mean? Usually, an application "process" consists of a collection of one or more components that have been composed with some specific connectivity...
6) Basic Deployment/Development Environment Issues============
One of the goals of the distributed visualization architecture is seamless operation on the Grid -- distributed/heterogeneous collections of machines. However, it is quite difficult to realize such a vision without some consideration of deployment/portability issues. This question also touches on issues related to the development environment and what kinds of development methods should be supported.
What languages do you use for core vis algorithms and frameworks?
C/C++.
*
for the numerically intensive parts of vis algorithms?
C/C++/Fortran/F90...?
*
for the glue that connects your vis algorithms together into an application?
C/C++.
*
How aggressively do you use language-specific features like C++ templates?
RUN AWAYYYY!!! These are not consistent across o.s./arch/compiler yet. Maybe someday...
*
Is Fortran important to you? Is it important that a framework support it seamlessly?
Fortran is crucial for many application scientists. It is not directly useful for the tools I build. But if you ever want to integrate application code components directly into a viz framework, then you better not preclude this... (or Babel...)
*
Do you see other languages becoming important for visualization? (i.e. Python, UPC, or even BASIC?)
Nope.
What platforms are used for data analysis/visualization?
*
What do you and your target users depend on to display results? (i.e. Windows, Linux, SGI, Sun, etc.)
All of the above (not so much Sun any more...).
*
What kinds of presentation devices are employed (desktops, portables, handhelds, CAVEs, Access Grids, WebPages/Collaboratories) and what is their relative importance to active users?
All but handhelds are important: mostly desktops, then CAVEs/hi-res and AG, in decreasing order.
*
What is the relative importance of these various presentation methods from a research standpoint?
CAVEs/hi-res and AG are worthwhile research areas. The rest can be woven in or incorporated more easily.
*
Do you see other up-and-coming visualization platforms in the future?
Yes, but I haven't figured out where exactly to stick the chip behind my ear for the virtual holodeck equipment... :)
Tell us how you deal with the issue of versioning and library dependencies for software deployment.
CVS.
*
For source code distributions, do you bundle builds of all related libraries with each software release? (i.e. bundle HDF5 and FLTK source with each release)
No, but we provide web links or separate copies of dependent distributions next to our software on the web site...
Too ugly to include everything in one big bundle, and not as efficient as letting the user download just what they need. (As long as everything you need is centrally located or accessible...)
*
What methods are employed to support platform-independent builds (cmake, imake, autoconf)? What are the benefits and problems with this approach?
Mostly autoconf so far. My student thinks automake and libtool are "cool" but we haven't used them yet...
*
For binaries, have you had issues with different versions of libraries? (i.e. GLIBC problems on Linux and different JVM implementations/versions for Java) Can you tell us about any sophisticated packaging methods that address some of these problems (RPM need not apply).
Just say no. Open Source is the way to go, with a small set of "common" binaries just for yuks. Most times the binaries won't work with the specific run-time libs anyway...
*
How do you handle multiplatform builds?
Autoconf, shared source tree, with arch-specific subdirs for object files, libs and executables.
How do you (or would you) provide abstractions that hide the locality of various components of your visualization/data analysis application?
I would use "proxy" components that use out-of-band communication to forward invocations and data to the actual component implementation.
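Roughly like this -- the proxy has the identical interface as the real thing, and the wire protocol underneath is the out-of-band part (sketch only, invented names):

#include <vector>

// The interface the "real" component implements...
class RenderPort {
public:
    virtual ~RenderPort() {}
    virtual void render(const std::vector<float>& vertices) = 0;
};

// ...and a proxy with the same interface that lives where the caller is,
// serializes the arguments, and ships them to the remote instance.
class RenderPortProxy : public RenderPort {
public:
    void render(const std::vector<float>& vertices) {
        // serialize(vertices) and send it over whatever out-of-band channel
        // the frameworks agreed on (socket, Grid service, ...), then have
        // the remote side invoke the real RenderPort.
    }
};
// The caller can't tell (and shouldn't care) whether it was handed the
// real component or the proxy.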
*
Does anyone have ample experience with CORBA, OGSA, DCOM, .NET, RPC? Please comment on advantages/problems of these technologies.
Nope.
*
Do web/grid services come into play here?
Yuk, I hope not.
7) Collaboration ==========================
If you are interested in "collaborative applications", please define the term "collaborative". Perhaps provide examples of collaborative application paradigms.
"Collaborative" is 2 or more geographically remote teams, sharing one common viz environment, with shared control and full telepresence. (Note: by this definition, "collaborative" does not yet exist... :-)
Is collaboration a feature that exists at an application level, or are there key requirements for collaborative applications that necessitate component-level support?
Collaboration should exist *above* the application level, either outside the specific framework or as part of the framework "bridging" technology.
*
Should collaborative infrastructure be incorporated as a core feature of every component?
NO. Let the framework proxy to collaborative capabilities...
*
Can any conceivable collaborative requirement be satisfied using a separate set of modules that specifically manage distribution of events and data in collaborative applications?
I dunno, I doubt it.
*
How is the collaborative application presented? Does the application only need to be collaborative sometimes?
Yes, collaboration should be flexible and on demand as needed - like dialing out on the speakerphone while in the middle of a meeting...
*
Where does performance come into play? Does the visualization system or underlying libraries need to be performance-aware? (i.e. I'm doing a given task and I need a framerate of X for it to be useful using my current compute resources)
There likely will need to be "hooks" to specify performance requirements, like "quality of service". This should perhaps be incorporated as part of the individual component APIs, or at least metered by the frameworks...
It would be wise to specify the frame rate requirement, perhaps interactively depending on the venue... e.g. in interactive collaboration scenarios you'd rather drop some frames consistently than stall completely or in bursts...
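For instance, nothing more exotic than a little QoS structure hung on the component API (invented names, just to show the shape of the hook):

// Per-component performance "hook": the framework (or collaboration layer)
// states its target, and the component adapts -- e.g. by dropping level of
// detail rather than dropping out entirely.
struct QualityOfService {
    double targetFramesPerSecond;  // e.g. 10.0 for an interactive session
    bool   preferConsistentRate;   // steady-but-lower beats bursty
};

class PerformanceAwareComponent {
public:
    PerformanceAwareComponent() {
        qos_.targetFramesPerSecond = 0.0;
        qos_.preferConsistentRate = true;
    }
    void setQualityOfService(const QualityOfService& qos) { qos_ = qos; }
    // execute() would consult qos_ when choosing LOD, sampling rate, etc.
private:
    QualityOfService qos_;
};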
Or network aware? (i.e. the system is starving for data and must respond by adding an alternate stream or redeploying the pipeline)
This sounds like futureware to me - an intelligent network protocol layer... beyond our scope for sure!
Are these considerations implemented at the component level, framework level, or are they entirely out-of-scope for our consideration?
These issues should be dealt with mostly at the framework level, if at all. I think they're mostly out-of-scope for the first incarnation...
WHEW! DONE! :-D
Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeembo