From: James Kohl <kohlja@ornl.gov>
Date: Wed Sep 10, 2003 3:32:44 PM US/Pacific
To: John Shalf <jshalf@lbl.gov>
Cc: diva@lbl.gov
Subject: Re: DiVA Survey (Please return by Sept 10!)
Reply-To: kohlja@ornl.gov
O.K., here goes...
wahoo... :)
=============The Survey=========================
1) Data Structures/Representations/Management==================
The center of every successful modular visualization architecture has been a flexible core set of data structures for representing data that is important to the targeted application domain. Before we can begin working on algorithms, we must come to some agreement on common methods (either data structures or accessors/method calls) for exchanging data between components of our vis framework.
There are two potentially disparate motivations for defining the data representation requirements. In the coarse-grained case, we need to define standards for exchanging data between components in this framework (interoperability). In the fine-grained case, we want to define some canonical data structures that can be used within a component -- one developed specifically for this framework. These two use-cases may drive different sets of requirements and implementation issues.
*
Do you feel both of these use cases are equally important or should we focus exclusively on one or the other?
I think for now we need to exclusively focus on exchanging data between components, rather than any fine-grained generalized data objects... The first-order entry into any component development is to "wrap up what ya got". The "rip things apart" phase comes after you can glue all the coarse-grained pieces together reliably...
*
Do you feel the requirements for each of these use-cases are aligned or will they involve two separate development tracks?
Two separate development tracks. Definitely. There are different driving design forces and they can be developed (somewhat) independently (I hope).
For instance, using "accessors" (method calls that provide abstract access to essentially opaque data structures) will likely work fine for the coarse-grained data exchanges between components, but will lead to inefficiencies if used to implement algorithms within a particular component.
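Just to make the "accessor" contrast concrete, here's a rough sketch of what I have in mind -- all names made up, not any existing API:

// Coarse-grained accessor interface: the caller never sees the underlying
// layout, it just asks for what it needs through method calls.
class FieldAccessor {
public:
    virtual ~FieldAccessor() {}
    virtual int numPoints() const = 0;
    virtual int numComponents() const = 0;
    // Copy one tuple of values out of the (opaque) internal storage.
    virtual void getTuple(int pointId, double* values) const = 0;
};

// Fine-grained code *inside* a component would rather grab the raw array
// once and loop over it directly -- a virtual call per point is too slow.
double sumFirstComponent(const double* raw, int numPoints, int stride) {
    double sum = 0.0;
    for (int i = 0; i < numPoints; ++i)
        sum += raw[i * stride];
    return sum;
}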
*
As you answer the "implementation and requirements" questions below, please try to identify where coarse-grained and fine-grained use cases will affect the implementation requirements.
What are the requirements for the data representations that must be supported by a common infrastructure? We will start by answering Pat's questions about representation requirements and follow up with personal experiences involving particular domain scientists' requirements.
Must: support for structured data
Must.
Must/Want: support for multi-block data?
Must.
Must/Want: support for various unstructured data representations? (which ones?)
Must/Want: support for adaptive grid standards? Please be specific about which adaptive grid methods you are referring to: restricted block-structured AMR (aligned grids), general block-structured AMR (rotated grids), hierarchical unstructured AMR, or non-hierarchical adaptive structured/unstructured meshes.
Must/Want: "vertex-centered" data, "cell-centered" data? other-centered?
All of these should be "Wants", to the extent that they require more sophisticated handling, or are less well-known in terms of generalizing the interfaces.
For example, the AMR folks have been trying to get together and define a standard API, and have been as yet unsuccessful. Who are we to attempt this where they have failed...?
So to clarify, if we *really* understand (or think we do) a particular data representation/organization, or even a specific subset of a general representation type, then by all means let's whittle an API into our stuff. Otherwise, leave it alone for someone else to do, or do as strictly needed.
Must: support time-varying data, sequenced, streamed data?
MUST.
Must/Want: higher-order elements?
Must/Want: expression of material interface boundaries and other special treatment of boundary conditions.
Wants, see above...
*
For commonly understood datatypes like structured and unstructured, please focus on any features that are commonly overlooked in typical implementations. For example, data-centering is often overlooked in structured data representations in vis systems, and FEM researchers commonly criticize vis people for co-mingling geometry with topology in unstructured grid representations. Few data structures provide proper treatment of boundary conditions or material interfaces. Please describe your personal experience on these matters.
*
Please describe data representation requirements for novel data representations such as bioinformatics and terrestrial sensor datasets. In particular, how should we handle more abstract data that is typically given the moniker "information visualization"?
I don't think we should "pee in this pool" either, yet. Are any of us experts in this kind of viz? Let's stick with what we collectively know best and make that work before we try to tackle a related-but-fundamentally-different domain.
What do you consider the most elegant/comprehensive implementation for data representations that you believe could form the basis for a comprehensive visualization framework?
Sounds like the "Holy Grail" to me... If anything even remotely close to this already existed, we'd all be using it already... (Unless of course it's the dreaded NIH syndrome...)
*
For instance, AVS uses entirely different data structures for structured, unstructured and geometry data. VTK uses class inheritance to express the similarities between related structures. Ensight treats unstructured data and geometry nearly interchangeably. OpenDX uses more vector-bundle-like constructs to provide a more unified view of disparate data structures. FM uses data-accessors (essentially keeping the data structures opaque).
*
Are there any of the requirements above that are not covered by the structure you propose?
*
This should focus on the elegance/usefulness of the core design-pattern employed by the implementation rather than a point-by-point description of the implementation!
*
Is there information or characteristics of particular file format standards that must percolate up into the specific implementation of the in-memory data structures?
I dunno, but what does HDF5 or NetCDF include? We should definitely be able to handle various meta-data...
Otherwise, our viz framework should be able to read in all sorts of file-based data as input, converting it seamlessly into our "Holy Data Grail" format for all the components to use and pass around. But the data shouldn't be identifiable as having once been HDF or NetCDF, etc... (i.e. it's important to read the data format, but not to use it internally)
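Something along these lines is what I'm picturing -- a reader per file format, all producing the same internal dataset object. (Just a sketch with invented names; the HDF5 calls are only indicated in comments, not worked out.)

// Format-specific readers hide the file format behind one interface.
// "VizDataset" stands in for whatever our "Holy Data Grail" in-memory
// representation turns out to be.
class VizDataset;  // opaque internal representation

class DataReader {
public:
    virtual ~DataReader() {}
    virtual VizDataset* read(const char* filename) = 0;
};

class Hdf5Reader : public DataReader {
public:
    VizDataset* read(const char* filename) {
        // ... open the file (H5Fopen), walk its datasets, copy arrays and
        // meta-data into a new VizDataset, close the file ...
        return 0;  // placeholder for the sketch
    }
};
// Downstream components only ever see a VizDataset -- nothing in it
// should say "this used to be HDF5" or "this used to be NetCDF".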
For the purpose of this survey, "data analysis" is defined broadly as all non-visual data processing done *after* the simulation code has finished and *before* "visual analysis".
*
Is there a clear dividing line between "data analysis" and "visual analysis" requirements?
NO. There shouldn't be - these operations are tightly coupled, or even symbiotic, and *should* all be incorporated into the same framework, indistinguishable from each other.
*
Can we (should we) incorporate data analysis functionality into this framework, or is it just focused on visual analysis?
YES.
*
What kinds of data analysis typically need to be done in your field?
Simple sampling, basic statistical averages/deviations, principal component analysis (PCA, or EOF for climate folks), other dimension reduction.
Please give examples and how these functions are currently implemented.
C/C++ code... mostly slow serial... :-Q
*
How do we incorporate powerful data analysis functionality into the framework?
As components (duh)... :-)
We should define some "standard" APIs for the desired analysis functions, and then either wrap existing codes as components or shoehorn in existing component implementations from systems like ASPECT.
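For example, wrapping an existing statistics routine behind one of those "standard" analysis APIs might look roughly like this (hypothetical interface, not actual CCA or ASPECT syntax):

#include <cmath>
#include <cstddef>
#include <vector>

// A "standard" analysis port that any statistics component would provide.
class StatisticsPort {
public:
    virtual ~StatisticsPort() {}
    virtual double mean(const std::vector<double>& data) = 0;
    virtual double stddev(const std::vector<double>& data) = 0;
};

// Wrapper around existing (slow, serial :-) C/C++ code.
class LegacyStatsComponent : public StatisticsPort {
public:
    double mean(const std::vector<double>& data) {
        double sum = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i) sum += data[i];
        return data.empty() ? 0.0 : sum / data.size();
    }
    double stddev(const std::vector<double>& data) {
        double m = mean(data), acc = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            acc += (data[i] - m) * (data[i] - m);
        return data.empty() ? 0.0 : std::sqrt(acc / data.size());
    }
};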
2) Execution Model=======================
It will be necessary for us to agree on common execution semantics for our components. Otherwise, we might have compatible data structures but incompatible execution requirements. Execution semantics is akin to the function of protocol in the context of network serialization of data structures. The motivating questions are as follows:
*
How is the execution model affected by the kinds of algorithms/system-behaviors we want to implement?
Directly. There are probably a few main exec models we want to cover. I don't think the list is *that* long...
As such, we should anticipate building several distinct framework environments that each exclusively support a given exec model. Then the trick is to "glue" these individual frameworks together so they can interoperate (exchange data and invoke each others' component methods) and be arbitrarily "bridged" together to form complex higher-level pipelines or other local/remote topologies.
*
How then will a given execution model affect data structure implementations?
I don't think it should affect the data structure impls at all, per se. Clearly, the access patterns will be different for various execution models, but this shouldn't change the data impl. Perhaps a better question is how to indicate the expected access pattern to allow a given data impl to optimize or properly prefetch/cache the accesses...
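Something as simple as an access-pattern hint might do -- a sketch with made-up names:

// Hint the framework (or a downstream component) could pass to a data
// object so it can prefetch/cache appropriately -- the data layout and
// the data API themselves stay the same.
enum AccessPattern {
    ACCESS_SEQUENTIAL,  // streaming front-to-back
    ACCESS_RANDOM,      // view-dependent / unpredictable
    ACCESS_BLOCKED      // chunk-at-a-time, out-of-core style
};

class DataObject {
public:
    DataObject() : hint_(ACCESS_SEQUENTIAL) {}
    void setAccessHint(AccessPattern hint) { hint_ = hint; }
    // ...read methods would consult hint_ to pick a prefetch/cache policy...
private:
    AccessPattern hint_;
};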
*
How will the execution model be translated into execution semantics on the component level? For example, will we need to implement special control-ports on our components to implement particular execution models, or will the semantics be implicit in the way we structure the method calls between components?
Components should be "dumb" and let other components or the framework invoke them as needed for a given execution model. The framework dictates the control flow, not the component. The API shouldn't change.
If you want multi-threaded components, then the framework had better support that, and the API for the component should take the possibility into account.
What kinds of execution models should be supported by the distributed visualization architecture?
*
View-dependent algorithms? (These were typically quite difficult to implement for dataflow visualization environments like AVS5.)
Want.
*
Out-of-core algorithms?
Must. This is a necessary evil of "big data". You need some killer caching infrastructure throughout the pipeline (e.g. like VizCache).
*
Progressive update and hierarchical/multiresolution algorithms?
Must.
*
Procedural execution from a single thread of control (i.e. using a commandline language like IDL to interactively control a dynamic or large parallel back-end)?
This is not an execution model, it is a command/control interface issue. You should be able to have a GUI, programmatic control, or scripting to dictate interactive control (or "steering" as they call it... :-). The internal software organization shouldn't change, just the interface to the outside (or inside) world...
*
Dataflow execution models?
Must.
What is the firing method that should be employed for a dataflow pipeline? Do you need a central executive like AVS/OpenDX, a completely distributed firing mechanism like that of VTK, or some sort of abstraction that allows the modules to be used with either executive paradigm?
This should be an implementation issue in the "dataflow framework", and should not affect the component-level APIs.
*
Support for novel data layouts like space-filling curves?
Must. But this isn't an execution model either. It's a data structure or algorithmic detail...
*
Are there special considerations for collaborative applications?
Surely. The interoperability of distinct framework implementations ties in with this... but the components shouldn't be aware that they are being run collaboratively/remotely... definitely a framework issue.
*
What else?
Yeah right.
How will the execution model affect our implementation of data structures?
It shouldn't. The execution model should be kept independent of the data structures as much as possible. If you want to build higher-level APIs for specific data access patterns, that's fine, but keep the underlying data consistent where possible.
*
how do you decompose a data structure such that it is amenable to streaming in small chunks?
This sounds a lot like distributed data decompositions. I suspect that given a desired block/cycle size, you can organize/decompose data in all sorts of useful ways, depending on the expected access pattern.
In conjunction with this, you could also reorganize static datasets into filesystem databases, with appropriate naming conventions or perhaps a special protocol for lining up the data blob files in the desired order for streaming (in either time or space along any axis). Meta-data in the files might be handy here, too, if it's indexed efficiently for fast lookup/searching/selection.
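For a plain structured grid, the mechanical part is pretty simple -- e.g. Z-slab chunks of a requested size (just a sketch):

#include <vector>

// One streamable chunk of an NX x NY x NZ structured grid, described as a
// contiguous range of Z-slabs (assuming Z is the slowest-varying index).
struct Chunk { int zStart, zCount; };

std::vector<Chunk> decomposeIntoSlabs(int nz, int slabsPerChunk) {
    std::vector<Chunk> chunks;
    for (int z = 0; z < nz; z += slabsPerChunk) {
        Chunk c;
        c.zStart = z;
        c.zCount = (z + slabsPerChunk <= nz) ? slabsPerChunk : nz - z;
        chunks.push_back(c);
    }
    return chunks;
}
// A streaming pipeline then requests chunk i, processes it, and discards
// it before asking for chunk i+1 -- same idea for time steps.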
*
how do you represent temporal dependencies in that model?
Meta-data, or file naming conventions...
*
how do you minimize recomputation in order to regenerate data for view-dependent algorithms?
No clue.
What are the execution semantics necessary to implement these execution models?
*
how does a component know when to compute new data? (what is the firing rule?)
There are really only 2 possibilities I can see - either a component is directly invoked by another component or the framework, or else a method must be triggered by some sort of dataflow dependency or stream-based event mechanism.
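Either way the component itself can stay "dumb" -- the same entry point, just different callers. A rough sketch (invented names):

// The component only knows how to compute when told to.
class Component {
public:
    virtual ~Component() {}
    virtual void execute() = 0;  // the single entry point
};

// Case 1: direct invocation by the framework or another component.
void runOnce(Component& c) { c.execute(); }

// Case 2: dataflow/event-triggered firing -- framework-side machinery
// counts "input ready" events and then invokes the same entry point.
class FiringRule {
public:
    FiringRule(Component& c, int inputsNeeded)
        : component_(c), needed_(inputsNeeded), ready_(0) {}
    void inputArrived() {
        if (++ready_ >= needed_) { ready_ = 0; component_.execute(); }
    }
private:
    Component& component_;
    int needed_, ready_;
};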
*
does coordination of the component execution require a central executive or can it be implemented using only rules that are local to a particular component?
This is a framework implementation detail. No. No. Bad Dog. The component doesn't know what's outside of it (in the rest of the framework, or the outside world). It only gets invoked, one way or another.
*
how elegantly can execution models be supported by the proposed execution semantics? Are there some things, like loops or back-propagation of information, that are difficult to implement using a particular execution semantics?
We need to keep the different execution models separate, as implementation details of individual frameworks. This separates the concerns here.
How will security considerations affect the execution model?
Ha ha ha ha...
They won't right away, except in collaboration scenarios. Think "One MPI Per Framework" and do things the old-fashioned way locally, then do the "glue" for inter-framework connectivity with proper authentication only as needed. (No worse than Globus... :-)
3) Parallelism and load-balancing=================
Thus far, managing parallelism in visualization systems has been tedious and difficult at best. Part of this is a lack of powerful abstractions for managing data-parallelism, load-balancing and component control.
Please describe the kinds of parallel execution models that must be supported by a visualization component architecture.
*
data-parallel/dataflow pipelines?
Must.
*
master/slave work-queues?
Must.
*
streaming update for management of pipeline parallelism?
Must.
*
chunking mechanisms where the number of chunks may be different from the number of CPUs employed to process those chunks?
This sounds the same as master/slave to me, as in "bag of tasks"...
*
how should one manage parallelism for interactive scripting languages that have a single thread of control? (e.g. I'm using a commandline language like IDL that interactively drives an arbitrarily large set of parallel resources. How can I make the parallel back-end available to a single-threaded interactive thread of control?)
Broadcast, Baby... Either you blast the commands out to everyone SIMD style (unlikely) or else you talk to the Rank 0 task and the command gets forwarded on a fast internal network.
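i.e. roughly this shape, in plain MPI (nothing framework-specific here):

#include <mpi.h>
#include <cstring>

// Rank 0 owns the interactive front-end (IDL prompt, GUI, whatever) and
// forwards each command to the parallel back-end with a broadcast.
// Only rank 0's "userCommand" argument matters; the rest just listen.
void forwardCommand(const char* userCommand) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char cmd[256] = {0};
    if (rank == 0)
        std::strncpy(cmd, userCommand, sizeof(cmd) - 1);

    // Everyone receives the same command string...
    MPI_Bcast(cmd, (int)sizeof(cmd), MPI_CHAR, 0, MPI_COMM_WORLD);

    // ...and executes it against its own piece of the data.
    // executeLocally(cmd);
}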
Please describe your vision of what kinds of software support / programming design patterns are needed to better support parallelism and load balancing.
*
What programming model should be employed to express parallelism? (UPC, MPI, SMP/OpenMP, custom sockets?)
All but UPC will be necessary for various functionality.
*
Can you give some examples of frameworks or design patterns that you consider very promising for support of parallelism and load balancing? (i.e. PNNL Global Arrays or Sandia's Zoltan)
http://www.cs.sandia.gov/Zoltan/
http://www.emsl.pnl.gov/docs/global/ga.html
Nope, that covers my list of hopefuls.
*
Should we use novel software abstractions for expressing parallelism or should the implementation of parallelism simply be an opaque property of the component? (i.e. should there be an abstract messaging layer or not?)
It's not our job to develop "novel" parallelism abstractions. We should just use existing abstractions like what the CCA is developing.
*
How does the MxN work fit into all of this? Is it sufficiently differentiated from Zoltan's capabilities?
I don't know what Zoltan can do specifically, but MxN is designed for basic "parallel data redistribution". This means it is good for doing big parallel-to-parallel data movement/transformations between two disparate parallel frameworks, or between two parallel components in the same framework with different data decompositions. MxN is also good for "self-transpose" or other types of local data reorganization within a given (parallel) component.
MxN doesn't do interpolation in space or time (yet, and probably not for a while), and it won't wash your car (but it won't drink your beer either... :-).
If you need something fancier, or if you don't really need any data reorganization between the source and destination of a transfer, then MxN *isn't* for you...
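(Not the actual MxN interface, which has its own spec -- but the bookkeeping it automates is basically overlap computation between two block decompositions, something like:)

#include <vector>

// Contiguous block of a 1-D global array owned by one parallel task.
struct Block { int owner, start, count; };

// Plain block decomposition of n elements over p tasks.
std::vector<Block> blockDecompose(int n, int p) {
    std::vector<Block> blocks;
    for (int r = 0; r < p; ++r) {
        Block b;
        b.owner = r;
        b.start = (int)((long long)n * r / p);
        b.count = (int)((long long)n * (r + 1) / p) - b.start;
        blocks.push_back(b);
    }
    return blocks;
}

// For each (source, destination) block pair, the overlap of their index
// ranges is what actually has to move in an M-to-N redistribution.
int overlap(const Block& src, const Block& dst, int& overlapStart) {
    int lo = src.start > dst.start ? src.start : dst.start;
    int hiSrc = src.start + src.count, hiDst = dst.start + dst.count;
    int hi = hiSrc < hiDst ? hiSrc : hiDst;
    overlapStart = lo;
    return hi > lo ? hi - lo : 0;
}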
===============End of Mandatory Section (the rest is voluntary)=============
4) Graphics and Rendering=================
What do you use for converting geometry and data into images (the rendering-engine)? Please comment on any/all of the following.
*
Should we build modules around declarative/streaming methods for rendering geometry like OpenGL, Chromium and DirectX, or should we move to higher-level representations for graphics offered by scene graphs? What are the pitfalls of building our component architecture around scene graphs?
*
What about Postscript, PDF and other scale-free output methods for publication quality graphics? Are pixmaps sufficient?
In a distributed environment, we need to create a rendering subsystem that can flexibly switch between drawing to a client application by sending images, sending geometry, or sending geometry fragments (image-based rendering). How do we do that?
I would think this could be achieved by a sophisticated data communication protocol - one that encodes the type of data in the stream, say, using XML or some such thingy.
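E.g. a self-describing header on the wire -- whether it's spelled as XML or a packed struct is a detail. (Field names invented.)

// What kind of payload is coming down the rendering stream, so the client
// knows whether it is receiving pixels, polygons, or IBR fragments.
enum PayloadType {
    PAYLOAD_IMAGE = 1,        // rendered pixels
    PAYLOAD_GEOMETRY = 2,     // triangles/vertices for local rendering
    PAYLOAD_IBR_FRAGMENT = 3  // color+depth fragments for image-based rendering
};

struct StreamHeader {
    int  payloadType;   // one of PayloadType
    int  frameNumber;   // which frame this belongs to
    long payloadBytes;  // how much data follows the header
};
// The receiver switches on payloadType and hands the payload to the right
// decoder -- the same connection can change modes from frame to frame.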
*
Please describe some rendering models that you would like to see supported (i.e. view-dependent update, progressive update) and how they would adjust dynamically to changing objective functions (optimize for fastest framerate, or fastest update on geometry change, or varying workloads and resource constraints).
*
Are there any good examples of such a system?
What is the role of non-polygonal methods for rendering (i.e. shaders)?
*
Are you using any of the latest gaming features of commodity cards in your visualization systems today?
*
Do you see this changing in the future? (how?)
5) Presentation=========================
It will be necessary to separate the visualization back-end from the presentation interface. For instance, you may want to have the same back-end driven by entirely different control-panels/GUIs and displayed on different display devices (a CAVE vs. a desktop machine). Such separation is also useful when you want to provide different implementations of the user-interface depending on the targeted user community. For instance, visualization experts might desire a dataflow-like interface for composing visualization workflows, whereas a scientist might desire a domain-specific dashboard-like interface that implements a specific workflow. Both users should be able to share the same back-end components and implementation even though the user interface differs considerably.
How do different presentation devices affect the component model?
Not. The display device only affects the resolution or bandwidth required. This could be parameterized in the component invocation APIs, but should not otherwise change an individual component.
If you want a "multiplexer" to share a massive data stream with a powerwall and a PDA, then the "multiplexer component" implementation handles that...
*
Do different display devices require completely different user interface paradigms?
No. Different GUIs should all map to some common framework command/control interface. The same functions will ultimately get executed, just from buttons with different labels or appl-specific short-cuts... The UIs should all be independent, but talk the same protocol to the framework.
If so, then we must define a clear separation between the GUI description and the components performing the back-end computations. If not, then is there a common language to describe user interfaces that can be used across platforms?
Yuk.
*
Do different display modalities require completely different component/algorithm implementations for the back-end compute engine? (what do we do about that??)
Algorithm maybe, component no. This could fall into the venue of the different execution-model-specific frameworks and/or their bridging... I dunno.
What presentation modalities do you feel are important and which do you consider the most important?
*
Desktop graphics (native applications on Windows, on Macs)
MUST.
*
Graphics access via Virtual Machines like Java?
Ha ha ha ha...
*
CAVEs, Immersadesks, and other VR devices
Must.
*
Ultra-high-res/Tiled display devices?
Must.
*
Web-based applications?
Probably a good idea. Someone always asks for this... :-Q
What abstractions do you think should be employed to separate the presentation interface from the back-end compute engine?
Some sort of general protocol descriptor, like XML...? Nuthin fancy.
*
Should we be using CCA to define the communication between GUI and compute engine, or should we be using software infrastructure that was designed specifically for that space? (i.e. WSDL, OGSA, or CORBA?)
The CCA doesn't do such communication per se. Messaging between or in/out of frameworks is always "out of band" relative to CCA port invocations. If the specific framework impl wants to shove out data on some wire, then it's hidden below the API level...
I would think that WSDL/SOAP would be O.K. for low-bandwidth uses.
*
How do such control interfaces work with parallel applications? Should the parallel application have a single process that manages the control interface and broadcasts to all nodes, or should the control interface treat all application processes within a given component as peers?
I vote for the "single process that manages the control interface and broadcasts to all nodes" (or the variation above, where one of the parallel tasks forwards to the rest internally :-). The latter is not scalable.
BTW, you can't have "application processes within a... component". What does that even mean? Usually, an application "process" consists of a collection of one or more components that have been composed with some specific connectivity...
6) Basic Deployment/Development Environment Issues============
One of the goals of the distributed visualization architecture is seamless operation on the Grid -- distributed/heterogeneous collections of machines. However, it is quite difficult to realize such a vision without some consideration of deployment/portability issues. This question also touches on issues related to the development environment and what kinds of development methods should be supported.
What languages do you use for core vis algorithms and frameworks?
C/C++.
*
for the numerically intensive parts of vis algorithms?
C/C++/Fortran/F90...?
*
for the glue that connects your vis algorithms together into an application?
C/C++.
*
How aggressively do you use language-specific features like C++ templates?
RUN AWAYYYY!!! These are not consistent across o.s./arch/compiler yet. Maybe someday...
*
Is Fortran important to you? Is it important that a framework support it seamlessly?
Fortran is crucial for many application scientists. It is not directly useful for the tools I build. But if you ever want to integrate application code components directly into a viz framework, then you better not preclude this... (or Babel...)
*
Do you see other languages becoming important for visualization? (i.e. Python, UPC, or even BASIC?)
Nope.
What platforms are used for data analysis/visualization?
*
What do you and your target users depend on to display results? (i.e. Windows, Linux, SGI, Sun, etc.)
All of the above (not so much Sun any more...).
*
What kinds of presentation devices are employed (desktops, portables, handhelds, CAVEs, Access Grids, WebPages/Collaboratories) and what is their relative importance to active users?
All but handhelds are important: mostly desktops, then CAVEs/hi-res and AG, in decreasing order.
*
What is the relative importance of these various presentation methods from a research standpoint?
CAVEs/hi-res and AG are worthwhile research areas. The rest can be woven in or incorporated more easily.
*
Do you see other up-and-coming visualization platforms in the future?
Yes, but I haven't figured out where exactly to stick the chip behind my ear for the virtual holodeck equipment... :)
Tell us how you deal with the issue of versioning and library dependencies for software deployment.
CVS.
*
For source code distributions, do you bundle builds of all related libraries with each software release? (i.e. bundle HDF5 and FLTK source with each release)
No, but we provide web links or separate copies of dependent distributions next to our software on the web site...
Too ugly to include everything in one big bundle, and not as efficient as letting the user download just what they need. (As long as everything you need is centrally located or accessible...)
*
What methods are employed to support platform-independent builds (cmake, imake, autoconf)? What are the benefits and problems with this approach?
Mostly autoconf so far. My student thinks automake and libtool are "cool" but we haven't used them yet...
*
For binaries, have you had issues with different versions of libraries? (i.e. GLIBC problems on Linux and different JVM implementations/versions for Java) Can you tell us about any sophisticated packaging methods that address some of these problems (RPM need not apply).
Just say no. Open Source is the way to go, with a small set of "common" binaries just for yuks. Most times the binaries won't work with the specific run-time libs anyway...
*
How do you handle multiplatform builds?
Autoconf, shared source tree, with arch-specific subdirs for object files, libs and executables.
How do you (or would you) provide abstractions that hide the locality of various components of your visualization/data analysis application?
I would use "proxy" components that use out-of-band communication to forward invocations and data to the actual component implementation.
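Roughly like this -- the proxy has the identical interface as the real thing, and the wire protocol underneath is the out-of-band part (sketch only, invented names):

#include <vector>

// The interface the "real" component implements...
class RenderPort {
public:
    virtual ~RenderPort() {}
    virtual void render(const std::vector<float>& vertices) = 0;
};

// ...and a proxy with the same interface that lives where the caller is,
// serializes the arguments, and ships them to the remote instance.
class RenderPortProxy : public RenderPort {
public:
    void render(const std::vector<float>& vertices) {
        // serialize(vertices) and send it over whatever out-of-band channel
        // the frameworks agreed on (socket, Grid service, ...), then have
        // the remote side invoke the real RenderPort.
    }
};
// The caller can't tell (and shouldn't care) whether it was handed the
// real component or the proxy.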
*
Does anyone have ample experience with CORBA, OGSA, DCOM, .NET, RPC? Please comment on advantages/problems of these technologies.
Nope.
*
Do web/grid services come into play here?
Yuk, I hope not.
7) Collaboration ==========================
If you are interested in "collaborative applications", please define the term "collaborative". Perhaps provide examples of collaborative application paradigms.
"Collaborative" is 2 or more geographically remote teams, sharing one common viz environment, with shared control and full telepresence. (Note: by this definition, "collaborative" does not yet exist... :-)
Is collaboration a feature that exists at an application level, or are there key requirements for collaborative applications that necessitate component-level support?
Collaboration should exist *above* the application level, either outside the specific framework or as part of the framework "bridging" technology.
*
Should collaborative infrastructure be incorporated as a core feature of every component?
NO. Let the framework proxy to collaborative capabilities...
*
Can any conceivable collaborative requirement be satisfied using a separate set of modules that specifically manage distribution of events and data in collaborative applications?
I dunno, I doubt it.
*
How is the collaborative application presented? Does the application only need to be collaborative sometimes?
Yes, collaboration should be flexible and on demand as needed - like dialing out on the speakerphone while in the middle of a meeting...
*
Where does performance come into play? Does the visualization system or underlying libraries need to be performance-aware? (i.e. I'm doing a given task and I need a framerate of X for it to be useful using my current compute resources)
There likely will need to be "hooks" to specify performance requirements, like "quality of service". This should perhaps be incorporated as part of the individual component APIs, or at least metered by the frameworks...
It would be wise to specify the frame rate requirement, perhaps interactively depending on the venue... e.g. in interactive collaboration scenarios you'd rather drop some frames consistently than stall completely or in bursts...
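For instance, nothing more exotic than a little QoS structure hung on the component API (invented names, just to show the shape of the hook):

// Per-component performance "hook": the framework (or collaboration layer)
// states its target, and the component adapts -- e.g. by dropping level of
// detail rather than dropping out entirely.
struct QualityOfService {
    double targetFramesPerSecond;  // e.g. 10.0 for an interactive session
    bool   preferConsistentRate;   // steady-but-lower beats bursty
};

class PerformanceAwareComponent {
public:
    PerformanceAwareComponent() {
        qos_.targetFramesPerSecond = 0.0;
        qos_.preferConsistentRate = true;
    }
    void setQualityOfService(const QualityOfService& qos) { qos_ = qos; }
    // execute() would consult qos_ when choosing LOD, sampling rate, etc.
private:
    QualityOfService qos_;
};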
Or network aware? (i.e. the system is starving for data and must respond by adding an alternate stream or redeploying the pipeline)
This sounds like futureware to me - an intelligent network protocol layer... beyond our scope for sure!
Are these considerations implemented at the component level, framework level, or are they entirely out-of-scope for our consideration?
These issues should be dealt with mostly at the framework level, if at all. I think they're mostly out-of-scope for the first incarnation...
WHEW! DONE! :-D
Jeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeembo