From: John Shalf <jshalf@lbl.gov>

Date: Wed Sep 10, 2003  11:53:29 AM US/Pacific

To: diva@lbl.gov

Subject: Re: DiVA Survey (Please return by Sept 10!)

 

 

OK,

here are my responses to the mandatory portion of survey.

I'll send the voluntary section separately.

 

On Wednesday, August 27, 2003, at 03:33 PM, John Shalf wrote:

=============The Survey=========================

Please answer the attached survey with as much or as little verbosity as you please and return it to me by September 10.  The survey has 3 mandatory sections and 4 voluntary (bonus) sections.  The sections are as follows;

Mandatory;

         1) Data Structures

         2) Execution Model

         3) Parallelism and Load-Balancing

Voluntary;

         4) Graphics and Rendering

         5) Presentation

         6) Basic Deployment and Development Environment Issues

         7) Collaboration

We will spend this workshop focusing on the first 3 sections, but I think we will derive some useful/motivating information from any answers to questions in the voluntary sections.

 

I'll post my answers to this survey on diva mailing list very soon.  You can post your answers publicly if you want to, but I am happy to regurgitate your answers as "anonymous contributors" if it will enable you to be more candid in your evaluation of available technologies.

 

1) Data Structures/Representations/Management==================

The center of every successful modular visualization architecture has been a flexible core set of data structures for representing data that is important to the targeted application domain.  Before we can begin working on algorithms, we must come to some agreement on common methods (either data structures or accessors/method  calls) for exchanging data between components of our vis framework.

 

There are two potentially disparate motivations for defining the data representation requirements.  In the coarse-grained case, we need to define standards for exchanging data between components in this framework (interoperability).  In the fine-grained case, we want to define some canonical data structures that can be used within a component -- one developed specifically for this framework.  These two use-cases may drive different sets of requirements and implementation issues.

         * Do you feel both of these use cases are equally important or should we focus exclusively on one or the other?

 

While I am very interested in design patterns, data structures, and services that could make the design of the interior of parallel/distributed components easier, it is clear that the interfaces between components are the central focus of this project.  So the definition of inter-component data exchanges is preeminent.

 

         * Do you feel the requirements for each of these use-cases are aligned or will they involve two separate development tracks?  For instance, using "accessors" (method calls that provide abstract access to essentially opaque data structures) will likely work fine for the coarse-grained data exchanges between components, but will lead to inefficiencies if used to implement algorithms within a particular component.

 

Given the focus on inter-component data exchange, I think accessors provide the most straightforward paradigm for data exchange.  The arguments to the data access methods can involve elemental data types rather than composite data structures (eg. we use scalars and arrays of basic machine data types rather than hierarchical structures).  Therefore we should look closely at FM's API organization as well as the accessors employed by SCIRun V1 (before they employed dynamic compilation).

 

The accessor method works well for abstracting component location, but requires potentially redundant copying of data for components in the same memory space.  It may be necessary to use reference counting in order to reduce the need to recopy data arrays between co-located components, but I'd really like to avoid making ref counting a mandatory requirement if we can avoid it.  (does anyone know how to avoid redundant data copying between opaque components without employing reference counting?)
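
As a minimal sketch of the accessor idea, assuming a hypothetical StructuredGridSource component (none of these names come from FM or SCIRun), the interface below passes only scalars, tuples, and flat arrays of machine types.  It offers both a copying accessor and a read-only "borrow" accessor for co-located consumers.

```python
from array import array


class StructuredGridSource:
    """Hypothetical component that keeps its internal storage opaque."""

    def __init__(self, dims, values):
        self._dims = tuple(dims)            # internal detail, never exposed directly
        self._values = array("d", values)   # flat array of doubles

    def get_rank(self):
        # Elemental return type: a single int.
        return len(self._dims)

    def get_dims(self):
        # Elemental return type: a tuple of ints.
        return self._dims

    def copy_values(self, out):
        # Copying accessor: always safe, and would also work across address spaces.
        out[:] = self._values
        return out

    def borrow_values(self):
        # Read-only view for consumers in the same memory space; no copy is made.
        return memoryview(self._values).toreadonly()
```

Note that the borrowed view still relies on the Python runtime to keep the underlying buffer alive, so this only illustrates the two access styles; it does not answer the question of how to avoid reference counting altogether.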

 

What are the requirements for the data representations that must be supported by a common infrastructure?  We will start by answering Pat's questions about representation requirements and follow up with personal experiences involving particular domain scientists' requirements.

         Must: support for structured data

Must

         Must/Want: support for multi-block data?

Must

         Must/Want: support for various unstructured data representations? (which ones?)

Cell based initially.  Arbitrary connectivity eventually, but not mandatory.

 

         Must/Want: support for adaptive grid standards?  Please be specific about which adaptive grid methods you are referring to.  Restricted block-structured AMR (aligned grids), general block-structured AMR (rotated grids), hierarchical unstructured AMR, or non-hierarchical adaptive structured/unstructured meshes.

 

If we can define the data models rigorously for the individual grid types (ie. structured and unstructured data), then adaptive grid standards really revolve around an infrastructure for indexing data items.  We normally think of indexing datasets by time and by data species.  However, we need to have more general indexing methods that can be used to support concepts of spatial and temporal relationships.  Support for pervasive indexing structures is also important for supporting other visualization features like K-d trees, octrees, and other such methods that are used to accelerate graphics algorithms.  We really should consider how to pass such representations down the data analysis pipeline in a uniform manner because they are used so commonly.
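
To make the "general indexing" idea a little more concrete, here is a rough sketch, assuming a hypothetical Patch record and PatchIndex class (all field names are illustrative).  It treats time step, refinement level, species, and spatial extent as interchangeable query keys over a flat collection of grid patches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """One grid patch; all field names here are hypothetical."""
    time_step: int
    level: int      # refinement level, 0 = coarsest
    lo: tuple       # lower corner in physical coordinates
    hi: tuple       # upper corner in physical coordinates
    species: str    # e.g. "density"


class PatchIndex:
    def __init__(self):
        self._patches = []

    def add(self, patch):
        self._patches.append(patch)

    def query(self, time_step=None, level=None, species=None, point=None):
        """Return patches matching every criterion that is not None."""
        hits = []
        for p in self._patches:
            if time_step is not None and p.time_step != time_step:
                continue
            if level is not None and p.level != level:
                continue
            if species is not None and p.species != species:
                continue
            if point is not None and not all(
                l <= x <= h for l, x, h in zip(p.lo, point, p.hi)
            ):
                continue
            hits.append(p)
        return hits
```

The linear scan is only for brevity.  An octree or K-d tree could sit behind the same query call, which is the reason for keeping the index behind an interface.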

 

         Must/Want: "vertex-centered" data, "cell-centered" data? other-centered?

Must understand all centering (particularly for structured grids where vis systems are typically lax in storing/representing this information).
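
A small illustration of why centering has to be stored explicitly: on a structured grid with the same vertex dimensions, vertex-centered and cell-centered fields have different numbers of values.  The function name and the string tags below are assumptions made for this sketch.

```python
from math import prod


def value_count(vertex_dims, centering):
    """Number of data values on a structured grid with the given vertex dims."""
    if centering == "vertex":
        return prod(vertex_dims)
    if centering == "cell":
        # One cell sits between each pair of neighboring vertices along an axis.
        return prod(n - 1 for n in vertex_dims)
    raise ValueError(f"unknown centering: {centering!r}")
```

For a grid with 4 x 3 x 2 vertices this gives 24 vertex-centered values but only 6 cell-centered values, so a consumer that guesses the centering will misplace the data.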

 

         Must: support time-varying data, sequenced, streamed data?

Yes to all.  However, the concept of streamed data must be defined in more detail.  This is where the execution paradigm is going to affect the data structures.

 

         Must/Want: higher-order elements?

Not yet.

 

         Must/Want: Expression of material interface boundaries and other special-treatment of boundary conditions.

 

Yes, we must treat ghost zones specially or parallel vis algorithms will create significant artifacts.  I'm not sure what is required for combined air-ocean models.

 

         * For commonly understood datatypes like structured and unstructured, please focus on any features that are commonly overlooked in typical implementations.  For example, often data-centering is overlooked in structured data representations in vis systems and FEM researchers commonly criticize vis people for co-mingling geometry with topology for unstructured grid representations.  Few datastructures provide proper treatment of boundary conditions or material interfaces.  Please describe your personal experience on these matters.

 

There is little support for non-cartesian coordinate systems in typical data structures.  We will need to have a discussion of how to support coordinate projections/conversions in a comprehensive manner.  This will be very important for applications relating to the National Virtual Observatory.

 

         * Please describe data representation requirements for novel data representations such as bioinformatics and terrestrial sensor datasets.  In particular, how should we handle more abstract data that is typically given the moniker "information visualization".

 

I simply don't know enough about this field to comment.

 

What do you consider the most elegant/comprehensive implementation for data representations that you believe could form the basis for a comprehensive visualization framework?

         * For instance, AVS uses entirely different datastructures for structured, unstructured and geometry data.  VTK uses class inheritance to express the similarities between related structures.  Ensight treats unstructured data and geometry nearly interchangeably.  OpenDX uses more vector-bundle-like constructs to provide a more unified view of disparate data structures.  FM uses data-accessors (essentially keeping the data structures opaque).

 

Since I'm already on record as saying that opaque data accessors are essential for this project, it is clear that FM offers the most compelling implementation that satisfies this requirement.

 

         * Are there any of the requirements above that are not covered by the structure you propose?

 

We need to be able to express a wider variety of data layout conversions and have some design pattern that reduces the need to recopy data arrays for local components.  The FM model also needs to have additional API support for hierarchical indices to accelerate access to subsections of arrays or domains.

 

         * Is there information or characteristics of particular file format standards that must percolate up into the specific implementation of the in-memory data structures?

 

I hope not.

 

For the purpose of this survey, "data analysis" is defined broadly as all non-visual data processing done *after* the simulation code has finished and *before* "visual analysis".

         * Is there a clear dividing line between "data analysis" and "visual analysis" requirements?

 

There shouldn't be.  However, people at the SRM workshop left me with the impression that they felt data analysis had been essentially abandoned by the vis community in favor of "visual analysis" methods.  We need to undo this.

 

         * Can we (should we) incorporate data analysis functionality into this framework, or is it just focused on visual analysis.

 

Vis is bullshit without seamless integration with flexible data analysis methods.  The most flexible methods available are text-based.  The failure to integrate more powerful data analysis features into contemporary 3D vis tools has been a serious problem.

 

         * What kinds of data analysis typically needs to be done in your field?  Please give examples and how these functions are currently implemented.

 

This question is targeted at vis folks that have been focused on a particular scientific domain.  For general use, I think of IDL as being one of the most popular/powerful data analysis languages.  Python has become increasingly important -- especially with the Livermore numerical extensions and the PyGlobus software.  However, use of these scripting/data analysis languages has not made the transition to parallel/distributed-memory environments (except in a sort of data-parallel batch mode).

 

         * How do we incorporate powerful data analysis functionality into the framework?

 

I'm very interested in work that Nagiza Samatarova has proposed for a parallel implementation of the R statistics language.  The traditional approach for parallelizing scripting languages is to run them in a sort of MIMD mode of Nprocs identical scripts operating on different chunks of the same dataset.  This makes it difficult to have a commandline/interactive scripting environment.  I think Nagiza is proposing to have an interactive commandline environment that transparently manipulates distributed actions on the back-end.

 

There is similar work in progress on parallel matlab at UC Berkeley.  Does anyone know of such an effort for Python?  (most of the parallel python hacks I know of are essentially MIMD which is not very useful).

 

2) Execution Model=======================

It will be necessary for us to agree on common execution semantics for our components.  Otherwise, we might have compatible data structures but incompatible execution requirements.  Execution semantics is akin to the function of protocol in the context of network serialization of data structures.  The motivating questions are as follows;

         * How is the execution model affected by the kinds of algorithms/system-behaviors we want to implement.

         * How then will a given execution model affect data structure implementations

 

There will need to be some way to support declarative, data-driven, and demand-driven execution semantics.  By declarative semantics, I mean support for environments that want to be in control of when the component "executes" or interactive scripting environments that wish to use the components much like subroutines.  This is separate from the demands of very interactive use-cases like view-dependent algorithms where the execution semantics must be more automatic (or at least hidden from the developer who is composing the components into an application).  I think this is potentially relevant to data model discussions because the automatic execution semantics often impose some additional requirements on the data structures to hand off tokens to one another.  There are also issues involved with managing concurrent access to data.  For instance, a demand-driven system of the kind required by progressive-update or view-dependent algorithms will need to manage the interaction between the arrival of new data and asynchronous requests from the viewer to recompute existing data as the geometry is rotated.

 

         * How will the execution model be translated into execution semantics on the component level.  For example will we need to implement special control-ports on our components to implement particular execution models or will the semantics be implicit in the way we structure the method calls between components.

 

I'm going to propose that we go after the declarative semantics first (no automatic execution of components) with hopes that you can wrap components that declare such an execution model with your own automatic execution semantics (whether it be a central executive or a distributed one).  This follows the paradigm that was employed for tools such as VisIt that wrapped each of the pieces of the VTK execution pipeline so that it could impose its own execution semantics on the pipeline rather than depending on the exec semantics that were predefined by VTK.  DiVA should follow this model, but start with the simplest possible execution model so that it doesn't need to be deconstructed if it fails to meet the application developer's needs (as was the case with VisIt).
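
The wrapping idea can be sketched roughly as follows, assuming a hypothetical baseline component (Scale) whose only execution rule is an explicit execute() call, and a hypothetical DemandDriven wrapper that layers pull-style semantics on top of it.  This is not how VisIt or VTK implement it; it only shows that the baseline component needs no knowledge of the wrapper.

```python
class Scale:
    """Baseline component: it only runs when execute() is called explicitly."""

    def __init__(self, factor):
        self.factor = factor
        self.input = None
        self.output = None

    def execute(self):
        self.output = [self.factor * v for v in self.input]


class DemandDriven:
    """Hypothetical wrapper: re-executes only when output is requested and stale."""

    def __init__(self, component, upstream=None):
        self.component = component
        self.upstream = upstream        # another DemandDriven wrapper, or None
        self.version = 0                # incremented on every execution
        self.dirty = True
        self._seen_upstream = None      # upstream version we last consumed

    def invalidate(self):
        self.dirty = True

    def get_output(self):
        if self.upstream is not None:
            data = self.upstream.get_output()
            if self.upstream.version != self._seen_upstream:
                self.component.input = data
                self._seen_upstream = self.upstream.version
                self.dirty = True
        if self.dirty:
            self.component.execute()
            self.version += 1
            self.dirty = False
        return self.component.output
```

A script can still call Scale.execute() directly, which is the declarative case; the wrapper is optional machinery owned by whoever composes the pipeline.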

 

We should have at least some discussion to ensure that the *baseline* declarative execution semantics imposes the fewest requirements for component development but can be wrapped in a very consistent/uniform/simple manner to support any of our planned pipeline execution scenarios.  This is an exercise in making things as simple as possible, but thinking ahead far enough about long-term goals to ensure that the baseline is "future proof" to some degree.

 

What kinds of execution models should be supported by the distributed visualization architecture

         * View dependent algorithms? (These were typically quite difficult to implement for dataflow visualization environments like AVS5).       

 

              Must be supported, but not as a baseline exec model.

 

         * Out-of-core algorithms

              Same deal.  We must work out what kinds of attributes are required of the data structures/data model to represent temporal decomposition of a dataset.  We should not encode the execution semantics as part of this (it should be outside of the component), but we must ensure that the data interfaces between components are capable of representing this kind of data decomposition/use-case.

 

         * Progressive update and hierarchical/multiresolution algorithms?

 

Likewise, we should separate the execution semantics necessary to implement this from the requirements imposed on the data representation.  Data models in existing production data analysis/visualization systems often do not provide an explicit representation for such things as multiresolution hierarchies.  We have LevelOfDetail switches, but that seems to be only a weak form of representation for these hierarchical relationships and limits the effectiveness of algorithms that depend on this method of data representation.  Those requirements should not be co-mingled with the actual execution semantics for such components (it's just the execution interface).

 

         * Procedural execution from a single thread of control (ie. using a commandline language like IDL to interactively control a dynamic or large parallel back-end)

 

This should be our primary initial target.  I do not have a good understanding of how best to support this, but it's clear that we must ensure that a commandline/interactive scripting language is supported.  Current data parallel scripting interfaces assume data-parallel, batch-mode execution of the scripting interpreters (this is a bad thing).

 

         * Dataflow execution models?  What is the firing method that should be employed for a dataflow pipeline?  Do you need a central executive like AVS/OpenDX or, completely distributed firing mechanism like that of VTK, or some sort of abstraction that allows the modules to be used with either executive paradigm?

 

This can probably be achieved by wrapping components that have explicit/declarative execution semantics.  It's an open question as to whether these execution models are a function of the component or the framework that is used to compose the components into an application though.

 

         * Support for novel data layouts like space-filling curves?

 

I don't understand enough about such techniques to know how to approach this.  However, it does point out that it is essential that we hand off data via accessors that keep the internal data structures opaque, rather than handing off complex data structures directly.

 

         * Are there special considerations for collaborative applications?

         * What else?

 

Ugh.  I'm also hoping that collaborative applications only impose requirements for wrapping baseline components rather than imposing internal requirements on the interfaces that exchange data between the components.  So I hope we can have "accessors" or "multiplexor/demultiplexor" objects that connect to essentially non-collaboration-aware components in order to support such things.  Otherwise, I'm a bit daunted by the requirements imposed.

 

How will the execution model affect our implementation of data structures?

         * how do you decompose a data structure such that it is amenable to streaming in small chunks?

 

The recent SDM workshop pointed out that chunking/streaming interfaces are going to be essential for any data analysis system that deals with large data, but there was very little agreement on how the chunking should be expressed.  The chunking also potentially involves end-to-end requirements of the components that are assembled in a pipeline as you must somehow support uniformity in the passage of chunks through the system (ie. the decision you make about the size of one chunk will impose requirements for all other dependent streaming interfaces in the system).  We will need to walk through at least one use-case for chunking/streaming to get an idea of what the constraints are here.  It may be too tough an issue to tackle in this first meeting though.
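
As one possible starting point for that use-case walk-through, here is a minimal sketch of a chunked pipeline built from Python generators.  The stage names (read_chunks, threshold, running_max) are made up for illustration.  The chunk size chosen at the source flows unchanged through every downstream stage, which is the end-to-end coupling described above.

```python
def read_chunks(values, chunk_size):
    """Source: yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def threshold(chunks, cutoff):
    """Filter stage: handles one chunk at a time and preserves chunk boundaries."""
    for chunk in chunks:
        yield [v for v in chunk if v >= cutoff]


def running_max(chunks):
    """Sink: keeps constant state no matter how the data was chunked."""
    best = None
    for chunk in chunks:
        for v in chunk:
            if best is None or v > best:
                best = v
    return best
```

With data = [3, 9, 1, 7, 4, 8, 2], calling running_max(threshold(read_chunks(data, 3), 5)) never holds more than one three-item chunk in any stage.  Stages that need neighboring values (stencils, ghost zones) would not fit this simple form, which is part of what makes the general problem hard.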

 

         * how do you represent temporal dependencies in that model?

 

Each item in a data structure needs to have some method of referring to dependencies, both spatial (ie. interior boundaries caused by domain decomposition) and temporal.  It's important to make these dependencies explicit in the data structures to provide a framework with the necessary information to organize parallelism in both the pipeline and data-parallel directions.  The implementation details of how to do so are not well formulated and perhaps out-of-scope for our discussions.  So this is a desired *requirement* that doesn't have a concrete implementation or design pattern involved.

 

         * how do you minimize recomputation in order to regenerate data for view-dependent algorithms.

 

I don't know.  I'm hoping someone else responding to this survey has some ideas on this.  I'm uncertain how it will affect our data model requirements.

 

What are the execution semantics necessary to implement these execution models?

         * how does a component know when to compute new data? (what is the firing rule)

 

For declarative semantics, the firing rule is an explicit method call that is invoked externally.  Hopefully such objects can be *wrapped* to encode semantics that are more automatic (ie. the module itself decides when to fire depending on input conditions), but initially it should be explicit.
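
A rough sketch of that layering, assuming a hypothetical two-input Add component and a hypothetical DataDriven wrapper: the component itself only knows the explicit execute() call, while the wrapper encodes one possible automatic firing rule (fire whenever every input port holds a value).

```python
class Add:
    """Baseline component with an explicit, externally invoked firing rule."""

    def __init__(self):
        self.inputs = {"a": None, "b": None}
        self.output = None

    def execute(self):
        self.output = self.inputs["a"] + self.inputs["b"]


class DataDriven:
    """Hypothetical wrapper: fires the component once every input port is filled."""

    def __init__(self, component):
        self.component = component
        self.fired = 0      # number of times the wrapped component has executed

    def put(self, port, value):
        self.component.inputs[port] = value
        if all(v is not None for v in self.component.inputs.values()):
            self.component.execute()
            self.fired += 1
        return self.component.output
```

Other firing rules (fire only on a complete new set of inputs, fire on a timer) would be different wrappers around the same unchanged component.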

 

         * does coordination of the component execution require a central executive or can it be implemented using only rules that are local to a particular component.

 

It can eventually be implemented using local semantics, but initially, we should design for explicit external control.

 

         * how elegantly can execution models be supported by the proposed execution semantics?  Are there some things, like loops or back-propagation of information that are difficult to implement using a particular execution semantics?

 

It's all futureware at this point.  We want to first come up with clear rules for baseline component execution and then can come up with some higher level / automatic execution semantics that can be implemented by *wrapping* such components.  The "wrapper" would then take responsibility for imposing higher-level automatic semantics.

 

How will security considerations affect the execution model?

 

I don't know.  Please somebody tell me if this is going to be an issue.  I don't have a handle on the *requirements* for security.  But I do know that simply using a secure method to *launch* a component is considered insufficient by security people who would also require that connections between components be explicitly authenticated as well.  Most vis systems assume secure launching (via SSH or GRAM) is sufficient.  The question is perhaps whether security and authorization are a framework issue or a component issue.  I am hoping that it is the former (the role of the framework that is used to compose the components).

 

3) Parallelism and load-balancing=================

Thus far, managing parallelism in visualization systems has been tedious and difficult at best.  Part of this is a lack of powerful abstractions for managing data-parallelism, load-balancing and component control.

 

If we are going to address inter-component data transfers to the exclusion of data structures/models internal to the component, then much of this section is moot.  The only question is how to properly represent data-parallel-to-data-parallel transfers and also the semantics for expressing temporal/pipeline parallelism and streaming semantics.  Load-balancing becomes an issue that is out-of-scope because it is effectively something that is inside of components (and we don't want to look inside of the components).

 

Please describe the kinds of parallel execution models that must be supported by a visualization component architecture.

         * data-parallel/dataflow pipelines?

 

Must

 

         * master/slave work-queues?

 

Maybe: If we want to support progressive update or heterogeneous execution environments.

 

         * streaming update for management of pipeline parallelism?

 

Must.

 

         * chunking mechanisms where the number of chunks may be different from the number of CPU's employed to process those chunks?

 

Absolutely.  Of course, this would possibly be implemented as a master/slave work-queue, but there are other methods.
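
For illustration only, the work-queue variant can be sketched with Python's standard thread pool, where threads stand in for whatever processes or nodes a real system would use.  The function name process_chunks is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunks(chunks, worker, n_workers):
    """Farm out any number of chunks to a fixed pool of workers.

    Results come back in chunk order, independent of which worker ran them.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, chunks))
```

Here five chunks can be handled by two workers, e.g. process_chunks(chunks, sum, n_workers=2); idle workers simply pick up the next pending chunk, so uneven chunk costs balance out without the caller assigning chunks to CPUs.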

 

         * how should one manage parallelism for interactive scripting languages that have a single thread of control?  (eg. I'm using a commandline language like IDL that interactively drives an arbitrarily large set of parallel resources.  How can I make the parallel back-end available to a single-threaded interactive thread of control?)

 

I think this is very important and a growing field of inquiry for data analysis environments.  Whatever agreements we come up with, I want to make sure that things like parallel R are not left out in these considerations.

 

Please describe your vision of what kinds of software support / programming design patterns are needed to better support parallelism and load balancing.

         * What programming model should be employed to express parallelism.  (UPC, MPI, SMP/OpenMP, custom sockets?)

 

If we are working just on the outside of components, this question should be moot.  We must make sure the API is not affected by these choices though.

 

         * Can you give some examples of frameworks or design patterns that you consider very promising for support of parallelism and load balancing.  (ie. PNNL Global Arrays or Sandia's Zoltan)

                       http://www.cs.sandia.gov/Zoltan/

                       http://www.emsl.pnl.gov/docs/global/ga.html

 

Also out of scope.  This would be something employed within a component, but if we are restricting discussions to what happens on the interface between components, then this is also a moot point.  At minimum, it will be important to ensure that such options will not be precluded by our component interfaces.

 

         * Should we use novel software abstractions for expressing parallelism or should the implementation of parallelism simply be an opaque property of the component? (ie. should there be an abstract messaging layer or not)

 

Yes.

 

         * How does the NxM work fit in to all of this?  Is it sufficiently differentiated from Zoltan's capabilities?

 

I need a more concrete understanding of MxN.  I understand what it is supposed to do, but I'm not entirely sure what requirements it would impose on any given component interface implementation.  It seems like something our component data interfaces should support, but perhaps such redistribution could be hidden inside of an MxN component?  So should this kind of redistribution be supported by the inter-component interface or should there be components that explicitly effect such data redistributions?  Jim... Help!
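
To ground the question a little, here is a minimal sketch of the bookkeeping behind a redistribution, under the simplifying assumption of a 1D array that is block-decomposed across M senders and N receivers.  The helper names block_ranges and transfer_schedule are hypothetical, and this is not a description of any existing MxN implementation.

```python
def block_ranges(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous [start, stop) blocks."""
    base, extra = divmod(n_items, n_parts)
    ranges, start = [], 0
    for rank in range(n_parts):
        # The first `extra` ranks each take one additional item.
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def transfer_schedule(n_items, m, n):
    """List (sender, receiver, start, stop) for every overlapping block pair."""
    schedule = []
    for s, (s0, s1) in enumerate(block_ranges(n_items, m)):
        for r, (r0, r1) in enumerate(block_ranges(n_items, n)):
            lo, hi = max(s0, r0), min(s1, r1)
            if lo < hi:
                schedule.append((s, r, lo, hi))
    return schedule
```

Whether a schedule like this is computed by the inter-component interface or inside a dedicated redistribution component is exactly the open design question; either way, each side has to expose its decomposition through the interface.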

 

 

===============End of Mandatory Section (the rest is voluntary)=============

 

4) Graphics and Rendering=================

What do you use for converting geometry and data into images (the rendering-engine).  Please comment on any/all of the following.

         * Should we build modules around declarative/streaming methods for rendering geometry like OpenGL, Chromium and DirectX or should we move to higher-level representations for graphics offered by scene graphs?

 

This all depends on the scope of the framework.  A-priori, you can consider the rendering method separable and render this question moot.  However, this will make it quite difficult to provide very sophisticated support for progressive update, image-based-methods, and view-dependent algorithms because the rendering engine becomes intimately involved in such methods.  I'm concerned that this is where the component model might break down a bit.  Certainly the rendering components of traditional component-like systems like AVS or NAG Explorer are the most heavy-weight and complex components of the entire environment. Often, the implementation of the rendering component would impose certain requirements on components that had to interact with it closely (particularly in the case of NAG/Iris Explorer where you were really directly exposed to the fact that the renderer was built atop of OpenInventor).

 

So, we probably cannot take on the issue of renderers quite yet, but we are eventually going to need to define a big "component box" around OpenGL/Chromium/DirectX.  That box is going to have to be carefully  built so as to keep from precluding any important functionality that each of those rendering engines can offer.  Again, I wonder if we would need to consider scene graphs if only to offer a persistent datastructure to hand-off to such an opaque rendering engine.  This isn't necessarily a good thing.

 

What are the pitfalls of building our component architecture around scene graphs?

 

It will add greatly to the complexity of this system.  It also may get in the way of novel rendering methods like Image-based methods.

 

         * What about Postscript, PDF and other scale-free output methods for publication quality graphics?  Are pixmaps sufficient?

 

Pixmaps are insufficient.  Our data analysis infrastructure has been moving rapidly away from scale-free methods and rapidly towards pixel-based methods.  I don't know how to stop this slide or if we are poised to address this issue as we look at this component model.

 

In a distributed environment, we need to create a rendering subsystem that can flexibly switch between drawing to a client application by sending images, sending geometry, or sending geometry fragments (image-based rendering)?  How do we do that?

         * Please describe some rendering models that you would like to see supported (ie. view-dependent update, progressive update) and how they would adjust dynamically to changing objective functions (optimize for fastest framerate, or fastest update on geometry change, or varying workloads and resource constraints).

 

I see this as the role for the framework.  It also points to the need to have performance models and performance monitoring built in to every component so that the framework has sufficient information to make effective pipeline deployment decisions in response to performance constraints.  It also points to the fact that at some level in this component architecture, component placement decisions must be entirely abstract (but such a capability is futureware).

 

So in the short-term it's important to design components with effective interfaces for collecting performance data and representing either analytic or historical-based models of that data.  This is a necessary baseline to get to the point that a framework could use such data to make intelligent deployment/configuration decisions for a distributed visualization system.
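
One very small sketch of such an interface, assuming a hypothetical Timed wrapper around any callable component: it records a timing history for each run and exposes a trivial historical model (the mean of past runs) that a framework could query.

```python
import time


class Timed:
    """Hypothetical wrapper that records how long each execute() call takes."""

    def __init__(self, component, clock=time.perf_counter):
        self.component = component
        self.clock = clock      # injectable so the timing source can be replaced
        self.history = []       # elapsed seconds, one entry per run

    def execute(self, *args):
        start = self.clock()
        result = self.component(*args)
        self.history.append(self.clock() - start)
        return result

    def predicted_cost(self):
        """Historical model: mean of past runs, or None if never run."""
        if not self.history:
            return None
        return sum(self.history) / len(self.history)
```

An analytic model (cost as a function of input size, for example) could be exposed through the same predicted_cost() call, so the framework would not need to know which kind of model a component provides.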

 

         * Are there any good examples of such a system?

 

No.

 

What is the role of non-polygonal methods for rendering (ie. shaders)?

         * Are you using any of the latest gaming features of commodity cards in your visualization systems today?

         * Do you see this changing in the future? (how?)

 

I'd like to know if anyone is using shader hardware.  I don't know much about it myself, but it points out that we need to plan for non-polygon-based visualization methods.  It's not clear to me how to approach this yet.

 

5) Presentation=========================

It will be necessary to separate the visualization back-end from the presentation interface.  For instance, you may want to have the same back-end driven by entirely different control-panels/GUIs and displayed in different display devices (a CAVE vs. a desktop machine).   Such separation is also useful when you want to provide different implementations of the user-interface depending on the targeted user community.  For instance, visualization experts might desire a dataflow-like interface for composing visualization workflows whereas a scientist might desire a domain-specific dash-board like interface that implements a specific workflow.  Both users should be able to share the same back-end components and implementation even though the user interface differs considerably.

 

How do different presentation devices affect the component model?

         * Do different display devices require completely different user interface paradigms?  If so, then we must define a clear separation between the GUI description and the components performing the back-end computations.  If not, then is there a common language to describe user interfaces that can be used across platforms?

 

Systems that attempt to use the same GUI paradigm across different presentation media have always been terrible in my opinion.  I strongly believe that each presentation medium requires a GUI design that is specific to that particular medium.  This imposes a strong requirement that our compute pipeline for a given component architecture be strictly separated from the GUI that controls the parameters and presents the visual output of that pipeline.  OGSA/WSDL has been proposed as one way to define that interface, but it is extremely complex to use.  One could use CCA to represent the GUI handles, but that might be equally complex.  Others have simply customized ways to use XML descriptions of their external GUI interface handles for their components.  The latter seems much simpler to deal with, but is it general enough?
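
To give a feel for the simpler XML approach, here is a sketch that assumes a made-up description format (the component and param element names and their attributes are invented for this example, not taken from any existing system).  A GUI for any presentation medium could read the same description and build its own widgets from it.

```python
import xml.etree.ElementTree as ET

# Hypothetical description of the externally controllable parameters
# of one back-end component.
DESCRIPTION = """
<component name="isosurface">
  <param name="isovalue" type="float" min="0.0" max="1.0" default="0.5"/>
  <param name="smooth" type="bool" default="false"/>
</component>
"""


def read_params(xml_text):
    """Return {param name: attribute dict} for one component description."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): dict(p.attrib) for p in root.findall("param")}
```

A desktop GUI might map the float parameter to a slider while a CAVE interface maps it to a wand gesture.  Whether a flat parameter list like this is general enough (for instance, for parameters whose valid range depends on the data) is the open question.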

 

         * Do different display modalities require completely different component/algorithm implementations for the back-end compute engine?  (what do we do about that??)