From: James Kohl <kohlja@ornl.gov>

Date: Wed Sep 10, 2003  3:32:44 PM US/Pacific

To: John Shalf <jshalf@lbl.gov>

Cc: diva@lbl.gov

Subject: Re: DiVA Survey (Please return by Sept 10!)

Reply-To: kohlja@ornl.gov

 

O.K., here goes...  wahoo...  :)

 

=============The Survey=========================

 

1) Data Structures/Representations/Management==================

The center of every successful modular visualization architecture has

been a flexible core set of data structures for representing data that

is important to the targeted application domain.  Before we can begin

working on algorithms, we must come to some agreement on common methods

(either data structures or accessors/method  calls) for exchanging data

between components of our vis framework.

 

There are two potentially disparate motivations for defining the data

representation requirements.  In the coarse-grained case, we need to

define standards for exchanging data between components in this

framework (interoperability).  In the fine-grained case, we want to

define some canonical data structures that can be used within a

component -- one developed specifically for this framework.  These two

use-cases may drive different sets of requirements and implementation

issues.

      * Do you feel both of these use cases are equally important or

      should we focus exclusively on one or the other?

 

I think for now we need to exclusively focus on exchanging data between

components, rather than any fine-grained generalized data objects...

 

The first order entry into any component development is to "wrap up

what ya got".  The "rip things apart" phase comes after you can glue

all the coarse-grained pieces together reliably...

 

      * Do you feel the requirements for each of these use-cases are

      aligned or will they involve two separate development tracks?

 

Two separate development tracks.  Definitely.  There are different driving

design forces and they can be developed (somewhat) independently (I hope).

 

For instance, using "accessors" (method calls that provide abstract

access to essentially opaque data structures) will likely work fine for

the coarse-grained data exchanges between components, but will lead to

inefficiencies if used to implement algorithms within a particular

component.
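
As a rough illustration of the accessor idea (a sketch only; FieldAccessor,
ArrayField and total are hypothetical names, not part of any existing
framework): the consumer sees only method calls, so the producer can store
the data however it likes.  The price is one call per value, which is why
this suits exchange between components better than inner loops.

```python
from abc import ABC, abstractmethod


class FieldAccessor(ABC):
    """Hypothetical opaque view of a scalar field on a structured 2-D grid."""

    @abstractmethod
    def dims(self):
        """Return the grid dimensions as a tuple (nx, ny)."""

    @abstractmethod
    def value(self, i, j):
        """Return the scalar at grid index (i, j)."""


class ArrayField(FieldAccessor):
    """One possible backing store: a flat row-major Python list."""

    def __init__(self, nx, ny, data):
        assert len(data) == nx * ny
        self._nx, self._ny, self._data = nx, ny, list(data)

    def dims(self):
        return (self._nx, self._ny)

    def value(self, i, j):
        return self._data[j * self._nx + i]


def total(field):
    """Consumer code: works with any FieldAccessor, at one call per value."""
    nx, ny = field.dims()
    return sum(field.value(i, j) for j in range(ny) for i in range(nx))


field = ArrayField(3, 2, [1, 2, 3, 4, 5, 6])
print(total(field))
```

Inside a component, the same sum would normally run straight over the
underlying array, skipping the per-value call overhead.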

      * As you answer the "implementation and requirements" questions

      below, please try to identify where coarse-grained and fine-grained use

cases will affect the implementation requirements.

 

What are requirements for the data representations that must be

supported by a common infrastructure?  We will start by answering Pat's

questions about representation requirements and follow up with

personal experiences involving particular domain scientists'

requirements.

      Must: support for structured data

Must.

 

      Must/Want: support for multi-block data?

Must.

 

      Must/Want: support for various unstructured data representations?

(which ones?)

      Must/Want: support for adaptive grid standards?  Please be specific

about which adaptive grid methods you are referring to.  Restricted

block-structured AMR (aligned grids), general block-structured AMR

(rotated grids), hierarchical unstructured AMR, or non-hierarchical

adaptive structured/unstructured meshes.

      Must/Want: "vertex-centered" data, "cell-centered" data?

other-centered?

 

All of these should be "Wants", to the extent that they require more

sophisticated handling, or are less well-known in terms of generalizing

the interfaces.

 

For example, the AMR folks have been trying to get together and define

a standard API, and have been as yet unsuccessful.  Who are we to attempt

this where they have failed...?

 

So to clarify, if we *really* understand (or think we do) a particular

data representation/organization, or even a specific subset of a general

representation type, then by all means let's whittle an API into our stuff.

Otherwise, leave it alone for someone else to do, or do as strictly needed.

 

      Must: support time-varying data, sequenced, streamed data?

MUST.

 

      Must/Want: higher-order elements?

      Must/Want: Expression of material interface boundaries and other

special-treatment of boundary conditions.

Wants, see above...

 

      * For commonly understood datatypes like structured and

unstructured, please focus on any features that are commonly overlooked in

typical implementations.  For example, often data-centering is overlooked

in structured data representations in vis systems and FEM researchers

commonly criticize vis people for co-mingling geometry with topology

for unstructured grid representations.  Few datastructures provide

proper treatment of boundary conditions or material interfaces.  Please

describe your personal experience on these matters.

      * Please describe data representation requirements for novel data

representations such as bioinformatics and terrestrial sensor datasets.

 In particular, how should we handle more abstract data that is

typically given the moniker "information visualization".

 

I don't think we should "pee in this pool" either yet.  Are any of us

experts in this kind of viz?  Let's stick with what we collectively know

best and make that work before we try to tackle a related-but-fundamentally-

different-domain.

 

What do you consider the most elegant/comprehensive implementation for

data representations that you believe could form the basis for a

comprehensive visualization framework?

 

Sounds like the "Holy Grail" to me...  If anything even remotely close to

this already existed, we'd all be using it already...

 

(Unless of course it's the dreaded NIH syndrome...)

 

      * For instance, AVS uses entirely different datastructures for

structured, unstructured and geometry data.  VTK uses class inheritance

to express the similarities between related structures.  Ensight treats

unstructured data and geometry nearly interchangeably.  OpenDX uses more

vector-bundle-like constructs to provide a more unified view of

disparate data structures.  FM uses data-accessors (essentially keeping

the data structures opaque).

      * Are there any of the requirements above that are not covered by

      the structure you propose?

      * This should focus on the elegance/usefulness of the core

design-pattern employed by the implementation rather than a

point-by-point description of the implementation!

      * Is there information or characteristics of particular file format

standards that must percolate up into the specific implementation of

the in-memory data structures?

 

I dunno, but what does HDF5 or NetCDF include?  We should definitely be

able to handle various meta-data...

 

Otherwise, our viz framework should be able to read in all sorts of

file-based data as input, converting it seamlessly into our "Holy Data

Grail" format for all the components to use and pass around.  But the

data shouldn't be identifiable as having once been HDF or NetCDF, etc...

(i.e. it's important to read the data format, but not to use it internally)

 

For the purpose of this survey, "data analysis" is defined broadly as

all non-visual data processing done *after* the simulation code has

finished and *before* "visual analysis".

      * Is there a clear dividing line between "data analysis" and "visual

analysis" requirements?

 

NO.  There shouldn't be - these operations are tightly coupled, or even

symbiotic, and *should* all be incorporated into the same framework,

indistinguishable from each other.

 

      * Can we (should we) incorporate data analysis functionality into

      this framework, or is it just focused on visual analysis?

 

YES.

 

      * What kinds of data analysis typically need to be done in your

field?

 

Simple sampling, basic statistical averages/deviations, principal component

analysis (PCA, or EOF for climate folks), other dimension reduction.

 

Please give examples and how these functions are currently

implemented.

 

C/C++ code...  mostly slow serial...  :-Q

 

      * How do we incorporate powerful data analysis functionality into

      the framework?

 

As components (duh)...  :-)

 

We should define some "standard" APIs for the desired analysis functions,

and then either wrap existing codes as components or shoehorn in existing

component implementations from systems like ASPECT.
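
A minimal sketch of what a "standard" analysis API might look like,
assuming a single analyze() method that takes a list of samples and
returns a dict.  AnalysisComponent, MeanDeviation and Subsample are
made-up names for illustration; the wrapped routines come from Python's
standard statistics module.

```python
import statistics


class AnalysisComponent:
    """Hypothetical common interface: take a list of samples, return a dict."""

    name = "base"

    def analyze(self, samples):
        raise NotImplementedError


class MeanDeviation(AnalysisComponent):
    """Wraps existing library routines (here the stdlib statistics module)."""

    name = "mean_deviation"

    def analyze(self, samples):
        return {
            "mean": statistics.mean(samples),
            "stdev": statistics.pstdev(samples),  # population std deviation
        }


class Subsample(AnalysisComponent):
    """Simple sampling: keep every `stride`-th value."""

    name = "subsample"

    def __init__(self, stride):
        self.stride = stride

    def analyze(self, samples):
        return {"samples": samples[:: self.stride]}


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
for component in (MeanDeviation(), Subsample(4)):
    print(component.name, component.analyze(data))
```

The caller only ever touches analyze(), so a slow serial implementation
could later be swapped for a parallel one without changing the callers.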

 

2) Execution Model=======================

It will be necessary for us to agree on a common execution semantics

for our components.  Otherwise, we might have compatible data

structures but incompatible execution requirements.  Execution

semantics is akin to the function of protocol in the context of network

serialization of data structures.  The motivating questions are as

follows:

      * How is the execution model affected by the kinds of

algorithms/system-behaviors we want to implement?

 

Directly.  There are probably a few main exec models we want to cover.

I don't think the list is *that* long...

 

As such, we should anticipate building several distinct framework

environments that each exclusively support a given exec model.  Then

the trick is to "glue" these individual frameworks together so they can

interoperate (exchange data and invoke each others' component methods)

and be arbitrarily "bridged" together to form complex higher-level

pipelines or other local/remote topologies.

 

      * How then will a given execution model affect data structure

implementations?

 

I don't think it should affect the data structure impls at all, per se.

 

Clearly, the access patterns will be different for various execution models,

but this shouldn't change the data impl.  Perhaps a better question is

how to indicate the expected access pattern to allow a given data impl

to optimize or properly prefetch/cache the accesses...
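
One way to picture the "access pattern hint" idea, as a sketch under
assumptions: BlockStore, AccessHint and the readahead parameter are
hypothetical, and the cache here never evicts anything.  The stored data
is identical under either hint; only the fetch strategy changes.

```python
from enum import Enum


class AccessHint(Enum):
    SEQUENTIAL = "sequential"   # streaming front to back
    RANDOM = "random"           # no useful ordering


class BlockStore:
    """Hypothetical block store; the stored data is the same under any hint."""

    def __init__(self, blocks, readahead=2):
        self._blocks = blocks      # stands in for slow storage
        self._cache = {}
        self._readahead = readahead
        self._hint = AccessHint.RANDOM
        self.loads = 0             # counts trips to "storage"

    def set_hint(self, hint):
        self._hint = hint

    def _load(self, first, count):
        last = min(first + count, len(self._blocks))
        for k in range(first, last):
            self._cache[k] = self._blocks[k]
        self.loads += 1

    def get(self, index):
        if index not in self._cache:
            extra = self._readahead if self._hint is AccessHint.SEQUENTIAL else 0
            self._load(index, 1 + extra)
        return self._cache[index]


store = BlockStore(list("abcdef"))
store.set_hint(AccessHint.SEQUENTIAL)
print([store.get(i) for i in range(6)], store.loads)
```

With the sequential hint the six blocks arrive in two trips to storage;
without it, the same reads take six.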

 

      * How will the execution model be translated into execution

semantics on the component level.  For example will we need to implement

special control-ports on our components to implement particular execution

models or will the semantics be implicit in the way we structure the

method calls between components.

 

Components should be "dumb" and let other components or the framework invoke

them as needed for a given execution model.  The framework dictates the

control flow, not the component.  The API shouldn't change.
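
To make the "dumb component" point concrete, here is a small sketch (all
names hypothetical): the component exposes one execute() method and
nothing else, while two different drivers, standing in for two
frameworks, impose different control flow on the same unchanged
components.

```python
class Component:
    """Hypothetical component: one method, no knowledge of who calls it."""

    def __init__(self, func):
        self._func = func

    def execute(self, value):
        return self._func(value)


def run_batch(pipeline, inputs):
    """One execution model: push each input through the whole pipeline."""
    results = []
    for value in inputs:
        for component in pipeline:
            value = component.execute(value)
        results.append(value)
    return results


def run_lazy(pipeline, inputs):
    """Another model: demand-driven, computing a result only when asked."""
    for value in inputs:
        for component in pipeline:
            value = component.execute(value)
        yield value


pipeline = [Component(lambda x: x * 2), Component(lambda x: x + 1)]
print(run_batch(pipeline, [1, 2, 3]))
lazy = run_lazy(pipeline, [1, 2, 3])
print(next(lazy))
```

Neither driver required any change to Component, which is the property
being argued for.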

 

If you want multi-threaded components, then the framework better support

that, and the API for the component should take the possibility into account.

 

What kinds of execution models should be supported by the distributed

visualization architecture

      * View dependent algorithms? (These were typically quite difficult

      to implement for dataflow visualization environments like AVS5).

Want.

 

      * Out-of-core algorithms

 

Must.  This is a necessary evil of "big data".  You need some killer

caching infrastructure throughout the pipeline (e.g. like VizCache).

 

      * Progressive update and hierarchical/multiresolution algorithms?

Must.

 

      * Procedural execution from a single thread of control (i.e. using a

command-line language like IDL to interactively control a dynamic or

large parallel back-end)

 

This is not an execution model, it is a command/control interface issue.

 

You should be able to have a GUI, programmatic control, or scripting to

dictate interactive control (or "steering" as they call it... :-).  The

internal software organization shouldn't change, just the interface to

the outside (or inside) world...

 

      * Dataflow execution models?

Must.

 

What is the firing method that should

be employed for a dataflow pipeline?  Do you need a central executive like

AVS/OpenDX, or a completely distributed firing mechanism like that of

VTK, or some sort of abstraction that allows the modules to be used

with either executive paradigm?

 

This should be an implementation issue in the "dataflow framework", and

should not affect the component-level APIs.

 

      * Support for novel data layouts like space-filling curves?

 

Must.  But this isn't an execution model either.  It's a data structure

or algorithmic detail...

 

      * Are there special considerations for collaborative applications?

 

Surely.  The interoperability of distinct framework implementations

ties in with this...  but the components shouldn't be aware that they

are being run collaboratively/remotely...  definitely a framework issue.

 

      * What else?

Yeah right.

 

How will the execution model affect our implementation of data

structures?

 

It shouldn't.  The execution model should be kept independent of the

data structures as much as possible.

 

If you want to build higher-level APIs for specific data access patterns

that's fine, but keep the underlying data consistent where possible.

 

      * how do you decompose a data structure such that it is amenable to

streaming in small chunks?

 

This sounds a lot like distributed data decompositions.  I suspect that

given a desired block/cycle size, you can organize/decompose data in all

sorts of useful ways, depending on the expected access pattern.

 

In conjunction with this, you could also reorganize static datasets

into filesystem databases, with appropriate naming conventions or

perhaps a special protocol for lining up the data blob files in the

desired order for streaming (in either time or space along any axis).

Meta-data in the files might be handy here, too, if it's indexed

efficiently for fast lookup/searching/selection.
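
A rough sketch of the decomposition and naming-convention idea, assuming a
regular 3-D grid and a fixed block size.  The decompose() helper and the
filename layout are invented for illustration; the only real point is
that zero-padded names sort into the intended streaming order (time
first, then z, y, x).

```python
def decompose(shape, block):
    """Split a 3-D index space into blocks of at most `block` cells per axis.

    Returns a list of (start, stop) tuples, where start and stop are
    (x, y, z) index triples and stop is exclusive.
    """
    chunks = []
    for z in range(0, shape[2], block[2]):
        for y in range(0, shape[1], block[1]):
            for x in range(0, shape[0], block[0]):
                start = (x, y, z)
                stop = tuple(min(s + b, n) for s, b, n in zip(start, block, shape))
                chunks.append((start, stop))
    return chunks


def chunk_filename(step, start):
    """Hypothetical naming convention: zero-padded so names sort in stream order."""
    return "t{:05d}_z{:04d}_y{:04d}_x{:04d}.blob".format(
        step, start[2], start[1], start[0]
    )


chunks = decompose((5, 4, 2), (4, 4, 1))
names = sorted(chunk_filename(7, start) for start, _ in chunks)
print(len(chunks))
print(names[0])
```

Blocks on the upper edge of the grid come out smaller than the requested
size rather than being padded; a different access pattern would simply
use a different block shape or a different field order in the name.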

 

      * how do you represent temporal dependencies in that model?

 

Meta-data, or file naming conventions...

 

      * how do you minimize recomputation in order to regenerate data for

view-dependent algorithms.

 

No clue.

 

What are the execution semantics necessary to implement these execution

models?

      * how does a component know when to compute new data? (what is the

firing rule)

 

There are really only 2 possibilities I can see - either a component is

directly invoked by another component or the framework, or else a method

must be triggered by some sort of dataflow dependency or stream-based

event mechanism.
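
The second possibility can be sketched as follows, assuming the simplest
firing rule (fire once every input port has received a value).
DataflowNode and its port names are hypothetical.  The first possibility
needs no machinery at all: it is just a direct method call.

```python
class DataflowNode:
    """Hypothetical node that fires once every named input has arrived."""

    def __init__(self, name, inputs, func):
        self.name = name
        self._pending = {port: None for port in inputs}
        self._ready = set()
        self._func = func
        self.listeners = []    # (node, port) pairs fed by this node's output

    def receive(self, port, value):
        self._pending[port] = value
        self._ready.add(port)
        if len(self._ready) == len(self._pending):   # the firing rule
            self._ready.clear()
            self.fire()

    def fire(self):
        result = self._func(**self._pending)
        for node, port in self.listeners:
            node.receive(port, result)
        return result


log = []
add = DataflowNode("add", ["a", "b"], lambda a, b: a + b)
show = DataflowNode("show", ["x"], lambda x: log.append(x))
add.listeners.append((show, "x"))

add.receive("a", 2)      # nothing fires yet: "b" is still missing
add.receive("b", 3)      # both inputs present, so "add" fires, then "show"
print(log)
```

The node's function never learns whether it was triggered by arriving
data or called directly, which keeps the firing rule out of the
component itself.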

 

      * does coordination of the component execution require a central

executive or can it be implemented using only rules that are local to a

particular component.

 

This is a framework implementation detail.  No.  No.  Bad Dog.

 

The component doesn't know what's outside of it (in the rest of the

framework, or the outside world).  It only gets invoked, one way or

another.

 

      * how elegantly can execution models be supported by the proposed

execution semantics?  Are there some things, like loops or

back-propagation of information that are difficult to implement using a

particular execution semantics?

 

We need to keep the different execution models separate, as implementation

details of individual frameworks.  This separates the concerns here.

 

How will security considerations affect the execution model?

 

Ha ha ha ha...

 

They won't right away, except in collaboration scenarios.

 

Think "One MPI Per Framework" and do things the old fashioned way

locally, then do the "glue" for inter-framework connectivity with

proper authentication only as needed.  (No worse than Globus... :-)

 

3) Parallelism and load-balancing=================

Thus far, managing parallelism in visualization systems has been

tedious and difficult at best.  Part of this is a lack of powerful

abstractions for managing data-parallelism, load-balancing and

component control.

 

Please describe the kinds of parallel execution models that must be

supported by a visualization component architecture.

      * data-parallel/dataflow pipelines?

Must.

 

      * master/slave work-queues?

Must.

 

      * streaming update for management of pipeline parallelism?

Must.

 

      * chunking mechanisms where the number of chunks may be different

      from the number of CPU's employed to process those chunks?

 

This sounds the same as master/slave to me, as in "bag of tasks"...
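
A small "bag of tasks" sketch using Python's standard
concurrent.futures module, with ten chunks and three workers.  The
process_chunk() function is a stand-in for real per-chunk work, and
threads are used only to show the scheduling pattern (in CPython they do
not give CPU parallelism for pure-Python work).

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Stand-in for real per-chunk work: sum the values in the chunk."""
    return sum(chunk)


def run_bag_of_tasks(chunks, workers):
    """Idle workers pull the next chunk, so chunk count need not match workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunks))   # results in chunk order


data = list(range(100))
chunks = [data[i:i + 10] for i in range(0, len(data), 10)]   # 10 chunks
partial = run_bag_of_tasks(chunks, workers=3)                # 3 workers
print(len(partial), sum(partial))
```

Because each worker takes a new chunk as soon as it finishes the last
one, uneven chunk costs balance out without any explicit mapping of
chunks to CPUs.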