From: James Kohl <kohlja@ornl.gov>
Date: Wed Sep 10, 2003 3:32:44 PM US/Pacific
To: John Shalf <jshalf@lbl.gov>
Cc: diva@lbl.gov
Subject: Re: DiVA Survey (Please return by Sept 10!)
Reply-To: kohlja@ornl.gov
O.K., here goes...
wahoo... :)
=============The Survey=========================
1) Data Structures/Representations/Management==================
The center of every successful modular visualization architecture has
been a flexible core set of data structures for representing data that
is important to the targeted application domain. Before we can begin
working on algorithms, we must come to some agreement on common methods
(either data structures or accessors/method calls) for exchanging data
between components of our vis framework.
There are two potentially disparate motivations for defining the data
representation requirements. In the coarse-grained case, we need to
define standards for exchanging data between components in this
framework (interoperability). In the fine-grained case, we want to
define some canonical data structures that can be used within a
component -- one developed specifically for this framework. These two
use-cases may drive different sets of requirements and implementation
issues.
* Do you feel both of these use cases are equally important or should
we focus exclusively on one or the other?
I think for now we need to exclusively focus on exchanging data between
components, rather than any fine-grained generalized data objects...
The first order entry into any component development is to "wrap up
what ya got". The "rip things apart" phase comes after you can glue
all the coarse-grained pieces together reliably...
* Do you feel the requirements for each of these use-cases are aligned
or will they involve two separate development tracks?
Two separate development tracks. Definitely. There are different driving
design forces and they can be developed (somewhat) independently
(I hope).
For instance, using "accessors" (method calls that provide abstract
access to essentially opaque data structures) will likely work fine for
the coarse-grained data exchanges between components, but will lead to
inefficiencies if used to implement algorithms within a particular
component.
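A minimal Python sketch of the accessor trade-off described above. All
names here (FieldAccessor, ListBackedField, sum_via_accessor,
sum_direct) are hypothetical and not taken from any existing framework;
the point is only to show where the per-element indirection comes from.

```python
from typing import Protocol, Sequence


class FieldAccessor(Protocol):
    """Hypothetical coarse-grained interface: callers never see the storage."""

    def size(self) -> int: ...
    def value_at(self, index: int) -> float: ...


class ListBackedField:
    """One possible opaque implementation; the list is an internal detail."""

    def __init__(self, values: Sequence[float]) -> None:
        self._values = list(values)

    def size(self) -> int:
        return len(self._values)

    def value_at(self, index: int) -> float:
        return self._values[index]


def sum_via_accessor(field: FieldAccessor) -> float:
    # Works for any component that implements the interface, but pays
    # one method call per element.
    return sum(field.value_at(i) for i in range(field.size()))


def sum_direct(values: Sequence[float]) -> float:
    # What an algorithm inside a component would do with its own
    # concrete storage: no per-element indirection.
    return sum(values)


field = ListBackedField([1.0, 2.0, 3.5])
total_exchange = sum_via_accessor(field)
total_internal = sum_direct([1.0, 2.0, 3.5])
```

Both calls give the same total; the difference is only in how much
machinery sits between the algorithm and the numbers.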
* As you answer the "implementation and requirements" questions below,
please try to identify where coarse-grained and fine-grained use
cases will affect the implementation requirements.
What are the requirements for the data representations that must be
supported by a common infrastructure? We will start by answering Pat's
questions about representation requirements and follow up with
personal experiences involving particular domain scientists'
requirements.
Must: support for structured data
Must.
Must/Want: support for multi-block data?
Must.
Must/Want: support for various unstructured data representations?
(which ones?)
Must/Want: support for adaptive grid standards? Please be specific
about which adaptive grid methods you are referring to. Restricted
block-structured AMR (aligned grids), general block-structured AMR
(rotated grids), hierarchical unstructured AMR, or non-hierarchical
adaptive structured/unstructured meshes.
Must/Want: "vertex-centered" data, "cell-centered" data?
other-centered?
All of these should be "Wants", to the extent that they require more
sophisticated handling, or are less well-known in terms of generalizing
the interfaces.
For example, the AMR folks have been trying to get together and define
a standard API, and have been as yet unsuccessful. Who are we to attempt
this where they have failed...?
So to clarify, if we *really* understand (or think we do) a particular
data representation/organization, or even a specific subset of a general
representation type, then by all means let's whittle an API into our
stuff. Otherwise, leave it alone for someone else to do, or do as
strictly needed.
Must: support time-varying data, sequenced, streamed data?
MUST.
Must/Want: higher-order elements?
Must/Want: Expression of material interface boundaries and other
special-treatment of boundary conditions.
Wants, see above...
* For commonly understood datatypes like structured and unstructured,
please focus on any features that are commonly overlooked in typical
implementations. For example, often data-centering is overlooked in
structured data representations in vis systems and FEM researchers
commonly criticize vis people for co-mingling geometry with topology
for unstructured grid representations. Few datastructures provide
proper treatment of boundary conditions or material interfaces. Please
describe your personal experience on these matters.
* Please describe data representation requirements for novel data
representations such as bioinformatics and terrestrial sensor datasets.
In particular, how should we handle more abstract data that is
typically given the moniker "information visualization".
I don't think we should "pee in this pool" either yet. Are any of us
experts in this kind of viz? Let's stick with what we collectively know
best and make that work before we try to tackle a
related-but-fundamentally-different-domain.
What do you consider the most elegant/comprehensive implementation for
data representations that you believe could form the basis for a
comprehensive visualization framework?
Sounds like the "Holy Grail" to me... If anything even remotely close to
this already existed, we'd all be using it already...
(Unless of course it's the dreaded NIH syndrome...)
* For instance, AVS uses entirely different datastructures for
structured, unstructured and geometry data. VTK uses class inheritance
to express the similarities between related structures. Ensight treats
unstructured data and geometry nearly interchangeably. OpenDX uses more
vector-bundle-like constructs to provide a more unified view of
disparate data structures. FM uses data-accessors (essentially keeping
the data structures opaque).
* Are there any of the requirements above that are not covered by the
structure you propose?
* This should focus on the elegance/usefulness of the core
design-pattern employed by the implementation rather than a
point-by-point description of the implementation!
* Are there information or characteristics of particular file format
standards that must percolate up into the specific implementation of
the in-memory data structures?
I dunno, but what does HDF5 or NetCDF include? We should definitely be
able to handle various meta-data...
Otherwise, our viz framework should be able to read in all sorts of
file-based data as input, converting it seamlessly into our "Holy Data
Grail" format for all the components to use and pass around. But the
data shouldn't be identifiable as having once been HDF or NetCDF, etc...
(i.e. it's important to read the data format, but not to use it
internally)
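As a rough illustration of "read the format, but don't use it
internally", here is a small Python sketch. The Dataset class and the
two toy readers are hypothetical; they use CSV and JSON text only
because the standard library can parse them, and they stand in for
whatever real HDF5 or NetCDF readers a framework would wrap.

```python
import csv
import io
import json
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical neutral in-memory form; no trace of the source format."""

    values: list
    metadata: dict = field(default_factory=dict)


def read_csv_text(text: str) -> Dataset:
    # Toy reader: first row holds "key=value" metadata cells,
    # the remaining rows hold numbers.
    rows = list(csv.reader(io.StringIO(text)))
    meta = dict(cell.split("=", 1) for cell in rows[0])
    values = [float(cell) for row in rows[1:] for cell in row]
    return Dataset(values=values, metadata=meta)


def read_json_text(text: str) -> Dataset:
    # Toy reader for a {"metadata": {...}, "values": [...]} document.
    doc = json.loads(text)
    return Dataset(values=[float(v) for v in doc["values"]],
                   metadata=dict(doc["metadata"]))


from_csv = read_csv_text("units=K,name=temperature\n1.0,2.0\n3.0\n")
from_json = read_json_text(
    '{"metadata": {"units": "K", "name": "temperature"}, "values": [1, 2, 3]}'
)
```

Once loaded, the two objects compare equal, so downstream components
cannot tell which file format the data came from, while the meta-data
still travels with the values.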
For the purpose of this survey, "data analysis" is defined broadly as
all non-visual data processing done *after* the simulation code has
finished and *before* "visual analysis".
* Is there a clear dividing line between "data analysis" and "visual
analysis" requirements?
NO. There shouldn't be - these operations are tightly coupled, or even
symbiotic, and *should* all be incorporated into the same framework,
indistinguishable from each other.
* Can we (should we) incorporate data analysis functionality into this
framework, or is it just focused on visual analysis.
YES.
* What kinds of data analysis typically need to be done in your field?
Simple sampling, basic statistical averages/deviations, principal
component analysis (PCA, or EOF for climate folks), other dimension
reduction.
Please give examples and how these functions are currently implemented.
C/C++ code... mostly slow serial... :-Q
* How do we incorporate powerful data analysis functionality into the
framework?
As components (duh)... :-)
We should define some "standard" APIs for the desired analysis
functions, and then either wrap existing codes as components or
shoehorn in existing component implementations from systems like ASPECT.
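One way the "standard API plus wrapped existing code" idea could look,
sketched in Python. The AnalysisComponent class and its analyze()
method are assumptions made up for this example (they are not the
ASPECT API); the wrapped routines are the standard library's
statistics.mean and statistics.pstdev.

```python
import statistics
from typing import Callable, Dict, Sequence


class AnalysisComponent:
    """Hypothetical standard API: every analysis exposes analyze(values)."""

    def __init__(self, name: str,
                 func: Callable[[Sequence[float]], float]) -> None:
        self.name = name
        self._func = func  # the wrapped, pre-existing routine

    def analyze(self, values: Sequence[float]) -> Dict[str, float]:
        return {self.name: self._func(values)}


# Existing functions become components without being rewritten.
mean_component = AnalysisComponent("mean", statistics.mean)
stdev_component = AnalysisComponent("stdev", statistics.pstdev)

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
report = {}
for component in (mean_component, stdev_component):
    report.update(component.analyze(samples))
```

The framework only ever calls analyze(), so a slow serial routine could
later be swapped for a parallel one behind the same interface.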
2) Execution Model=======================
It will be necessary for us to agree on a common execution semantics
for our components. Otherwise, we might have compatible data
structures but incompatible execution requirements. Execution
semantics is akin to the function of protocol in the context of network
serialization of data structures. The motivating questions are as
follows:
* How is the execution model affected by the kinds of
algorithms/system-behaviors we want to implement.
Directly. There are probably a few main exec models we want to cover.
I don't think the list is *that* long...
As such, we should anticipate building several distinct framework
environments that each exclusively support a given exec model. Then
the trick is to "glue" these individual frameworks together so they can
interoperate (exchange data and invoke each others' component methods)
and be arbitrarily "bridged" together to form complex higher-level
pipelines or other local/remote topologies.
* How then will a given execution model affect data structure
implementations?
I don't think it should affect the data structure impls at all, per se.
Clearly, the access patterns will be different for various execution
models, but this shouldn't change the data impl. Perhaps a better
question is how to indicate the expected access pattern to allow a given
data impl to optimize or properly prefetch/cache the accesses...
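A small Python sketch of what "indicating the expected access pattern"
might mean in practice. The BlockStore class, its advise() method, and
the AccessPattern values are all hypothetical; the stored data is the
same regardless of the hint, and only the prefetch behavior changes.

```python
from enum import Enum


class AccessPattern(Enum):
    SEQUENTIAL = "sequential"
    RANDOM = "random"


class BlockStore:
    """Hypothetical data object; the stored data is the same for every hint."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self._cache = {}
        self._pattern = AccessPattern.RANDOM
        self.loads = []  # records the order of simulated slow loads

    def advise(self, pattern: AccessPattern) -> None:
        self._pattern = pattern

    def _load(self, index):
        # Load a block into the cache if it exists and is not cached yet.
        if index not in self._cache and 0 <= index < len(self._blocks):
            self.loads.append(index)
            self._cache[index] = self._blocks[index]

    def get(self, index):
        self._load(index)
        if self._pattern is AccessPattern.SEQUENTIAL:
            self._load(index + 1)  # prefetch the likely next block
        return self._cache[index]


store = BlockStore(["b0", "b1", "b2"])
store.advise(AccessPattern.SEQUENTIAL)
first = store.get(0)
```

With the sequential hint, asking for block 0 also pulls in block 1;
without it, only block 0 would be loaded.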
* How will the execution model be translated into execution semantics
on the component level? For example, will we need to implement special
control-ports on our components to implement particular execution
models or will the semantics be implicit in the way we structure the
method calls between components?
Components should be "dumb" and let other components or the framework
invoke them as needed for a given execution model. The framework
dictates the control flow, not the component. The API shouldn't change.
If you want multi-threaded components, then the framework better support
that, and the API for the component should take the possibility into
account.
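The "dumb component" idea can be sketched in a few lines of Python. The
component classes and the two driver functions below are hypothetical
stand-ins for two different frameworks; what matters is that the
components expose one unchanged entry point and contain no control
logic of their own.

```python
class Doubler:
    """Hypothetical "dumb" component: one entry point, no control logic."""

    def execute(self, data):
        return [2 * x for x in data]


class AddOne:
    def execute(self, data):
        return [x + 1 for x in data]


def run_pipeline(components, data):
    # Framework A: push the whole dataset through each stage in turn.
    for component in components:
        data = component.execute(data)
    return data


def run_streaming(components, data, chunk_size):
    # Framework B: same components, but driven one chunk at a time.
    out = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for component in components:
            chunk = component.execute(chunk)
        out.extend(chunk)
    return out


stages = [Doubler(), AddOne()]
batch_result = run_pipeline(stages, [1, 2, 3, 4, 5])
stream_result = run_streaming(stages, [1, 2, 3, 4, 5], chunk_size=2)
```

Both drivers produce the same output from the same components; only the
framework's control flow differs.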
What kinds of execution models should be supported by the distributed
visualization architecture?
* View dependent algorithms? (These were typically quite difficult to
implement for dataflow visualization environments like AVS5).
Want.
* Out-of-core algorithms
Must. This is a necessary evil of "big data". You need some killer
caching infrastructure throughout the pipeline (e.g. like VizCache).
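For a sense of the simplest building block of such caching, here is a
tiny least-recently-used block cache in Python. This is a generic LRU
sketch, not a description of how VizCache works; BlockCache and the
loader callback are names invented for the example.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache; `loader` stands in for a slow disk or network read."""

    def __init__(self, loader, capacity):
        self._loader = loader
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, block_id):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as most recently used
            return self._blocks[block_id]
        self.misses += 1
        block = self._loader(block_id)
        self._blocks[block_id] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return block


cache = BlockCache(loader=lambda i: [i] * 4, capacity=2)
for block_id in (0, 1, 0, 2, 1):
    cache.get(block_id)
```

An out-of-core stage would place something like this between itself and
the storage layer so that only a bounded number of blocks is resident
at any time.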
* Progressive update and hierarchical/multiresolution algorithms?
Must.
* Procedural execution from a single thread of control (i.e. using a
command-line language like IDL to interactively control a dynamic or
large parallel back-end)
This is not an execution model, it is a command/control interface issue.
You should be able to have a GUI, programmatic control, or scripting to
dictate interactive control (or "steering" as they call it... :-). The
internal software organization shouldn't change, just the interface to
the outside (or inside) world...
* Dataflow execution models?
Must.
What is the firing method that should be employed for a dataflow
pipeline? Do you need a central executive like AVS/OpenDX, or a
completely distributed firing mechanism like that of VTK, or some sort
of abstraction that allows the modules to be used with either executive
paradigm?
This should be an implementation issue in the "dataflow framework", and
should not affect the component-level APIs.
* Support for novel data layouts like space-filling curves?
Must. But this isn't an execution model either. It's a data structure
or algorithmic detail...
* Are there special considerations for collaborative applications?
Surely. The interoperability of distinct framework implementations
ties in with this... but the components shouldn't be aware that they
are being run collaboratively/remotely... definitely a framework issue.
* What else?
Yeah right.
How will the execution model affect our implementation of data
structures?
It shouldn't. The execution model should be kept independent of the
data structures as much as possible.
If you want to build higher-level APIs for specific data access patterns
that's fine, but keep the underlying data consistent where possible.
* how do you decompose a data structure such that it is amenable to
streaming in small chunks?
This sounds a lot like distributed data decompositions. I suspect that
given a desired block/cycle size, you can organize/decompose data in all
sorts of useful ways, depending on the expected access pattern.
In conjunction with this, you could also reorganize static datasets
into filesystem databases, with appropriate naming conventions or
perhaps a special protocol for lining up the data blob files in the
desired order for streaming (in either time or space along any axis).
Meta-data in the files might be handy here, too, if it's indexed
efficiently for fast lookup/searching/selection.
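A minimal Python sketch of block decomposition for streaming, assuming
the data is a simple list of rows and the expected access pattern is
row-by-row. The function names (stream_blocks, streamed_max) are
invented for the example; a real decomposition could split along any
axis or in time.

```python
def stream_blocks(grid, block_rows):
    """Yield (first_row, block) pairs; `grid` is a list of rows.

    The decomposition here is along the first axis only, chosen to
    match a row-by-row access pattern.
    """
    for first_row in range(0, len(grid), block_rows):
        yield first_row, grid[first_row:first_row + block_rows]


def streamed_max(grid, block_rows):
    # A consumer that only ever looks at one block at a time.
    best = float("-inf")
    for _, block in stream_blocks(grid, block_rows):
        best = max(best, max(max(row) for row in block))
    return best


grid = [[r * 10 + c for c in range(4)] for r in range(5)]
offsets = [first for first, _ in stream_blocks(grid, block_rows=2)]
peak = streamed_max(grid, block_rows=2)
```

The block size is a free parameter, so the same consumer works whether
the data arrives in two-row or two-thousand-row pieces.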
* how do you represent temporal dependencies in that model?
Meta-data, or file naming conventions...
* how do you minimize recomputation in order to regenerate data for
view-dependent algorithms.
No clue.
What are the execution semantics necessary to implement these execution
models?
* how does a component know when to compute new data? (what is the
firing rule)
There are really only 2 possibilities I can see - either a component is
directly invoked by another component or the framework, or else a method
must be triggered by some sort of dataflow dependency or stream-based
event mechanism.
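The two firing possibilities can be shown side by side in a short
Python sketch. The Node class below is hypothetical: invoke() models
direct invocation, and notify() models a dependency-driven trigger that
fires once all named inputs have arrived.

```python
class Node:
    """Hypothetical dataflow node that fires when all inputs have arrived."""

    def __init__(self, name, func, inputs, log):
        self.name = name
        self.func = func
        self.inputs = inputs      # names of upstream values it depends on
        self.log = log            # shared list recording firing order
        self._received = {}
        self.listeners = []       # downstream nodes to notify

    def invoke(self, *args):
        # Possibility 1: direct invocation by the framework or a peer.
        self.log.append(self.name)
        result = self.func(*args)
        for node in self.listeners:
            node.notify(self.name, result)
        return result

    def notify(self, source, value):
        # Possibility 2: a dependency-driven trigger.
        self._received[source] = value
        if all(name in self._received for name in self.inputs):
            args = [self._received[name] for name in self.inputs]
            self._received = {}
            self.invoke(*args)


log = []
results = []
a = Node("a", lambda: 2, [], log)
b = Node("b", lambda: 3, [], log)
total = Node("sum", lambda x, y: results.append(x + y), ["a", "b"], log)
a.listeners.append(total)
b.listeners.append(total)
a.invoke()
b.invoke()
```

Here the two source nodes are invoked directly, while the "sum" node
is never called by the driver and fires only because both of its
inputs arrived.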
* does coordination of the component execution require a central
executive or can it be implemented using only rules that are local to a
particular component.
This is a framework implementation detail. No. No. Bad Dog.
The component doesn't know what's outside of it (in the rest of the
framework, or the outside world). It only gets invoked, one way or
another.
* how elegantly can execution models be supported by the proposed
execution semantics? Are there some things, like loops or
back-propagation of information that are difficult to implement using a
particular execution semantics?
We need to keep the different execution models separate, as
implementation details of individual frameworks. This separates the
concerns here.
How will security considerations affect the execution model?
Ha ha ha ha...
They won't right away, except in collaboration scenarios.
Think "One MPI Per Framework" and do things the old fashioned way
locally, then do the "glue" for inter-framework connectivity with
proper authentication only as needed. (No worse than Globus... :-)
3) Parallelism and load-balancing=================
Thus far, managing parallelism in visualization systems has been
tedious and difficult at best. Part of this is a lack of powerful
abstractions for managing data-parallelism, load-balancing and
component control.
Please describe the kinds of parallel execution models that must be
supported by a visualization component architecture.
* data-parallel/dataflow pipelines?
Must.
* master/slave work-queues?
Must.
* streaming update for management of pipeline parallelism?
Must.
* chunking mechanisms where the number of chunks may be different from
the number of CPUs employed to process those chunks?
This sounds the same as master/slave to me, as in "bag of tasks"...
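A "bag of tasks" with more chunks than workers is easy to sketch with
Python's standard concurrent.futures module. The function names and the
sum-of-squares workload are placeholders chosen for the example; a
thread pool is used only to keep the sketch self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    # Stand-in for real per-chunk work.
    return sum(x * x for x in chunk)


def run_bag_of_tasks(data, chunk_size, workers):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks are queued and picked up by whichever worker is free;
        # map() returns the results in chunk order.
        partials = list(pool.map(process_chunk, chunks))
    return len(chunks), sum(partials)


num_chunks, total = run_bag_of_tasks(list(range(10)), chunk_size=3, workers=2)
```

Ten values in chunks of three give four chunks for two workers, so the
chunk count and the worker count are chosen independently.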