Support geophysics view backed by integer data

Description

Content:

  • The "nodata value" problem
    * The OpenGIS specification
    * The Geotools extension to OGC specification
    * The problem with the Geotools extension
    * Possible solutions

------------------------------------------------------------------ The "nodata value" problem
------------------------------------------------------------------ Rasters are regular grid (like matrix), but some grid elements may have missing value. If values are stored as floating points, the Java's Float.NaN or Double.NaN values fit nicely. But if values are stored as integers, then some special values (e.g. Integer.MIN_VALUE) must be used for marking cell with missing value.

------------------------------------------------------------------ The OpenGIS specification
------------------------------------------------------------------ GridCoverage can stores value of arbitrary type (floats, integers, etc.). Values can be extracted for an arbitrary (x,y) geographic coordinates using the GridCoverage.evaluate(x,y) method. This method performs no special processing for "nodata" value; the special value is returned as-is.

SampleDimension contains information about each image's band, including the list of special values to be used as "nodata" values. This list is returned by SampleDimension.getNodataValue().

Processing of "nodata" value is up to the user. The user have to get the value with GridCoverage.evaluate(x,y), then compare it with the list of SampleDimension.getNodataValue(). This OpenGIS behavior is the desired one for many application.

------------------------------------------------------------------ The Geotools extension to OGC specification
------------------------------------------------------------------ For computation purpose, checking for "nodata" values put an overhead on the user side. Furthermore, many numerical routines has no special knowledge of missing data. For example, Java Advanced Imaging (JAI) has no notion of "nodata" pixel (the problem is usually handle with a mask, but the user still have to creates the mask). The easiest way is to stores the "nodata" values as Float.NaN or Double.NaN. Obviously, this is possible only if the underlying storage use floating points values. Using NaN for "nodata" values has many advantages:

  • It is the most standard "nodata" value a numerical routine could
    check.

  • If the numerical routine was not designed for handling NaN, it
    will returns NaN rather than a random number computed from values
    that shouldn't have been included in the computation. For example,
    applying a convolution with JAI on images with Float.NaN values
    work perfectly well.

We want to leverage the rich set of JAI operators. Many JAI operations are implemented by native code using MMX (on Intel) or VIS (on UltraSparc) microprocessor instructions for maximum speed. We can't beat that speed. But JAI has no notion of "nodata values". However, it can work with Float.NaN. (Note: users can use JAI with a mask or a ROI (Region Of Interest). However, he still have to manage the mask. For example with a convolution, some pixels valid in the source image may become invalid in the destination image if a neighbor value was "nodata"). For each image, Geotools manage two views:

  • A "geophysics" view in which all values are "real world" floating
    point numbers (for example an altitude in meters). "Nodata" values
    maps to NaN and nothing else. This view is suitable for
    computation purpose.

  • A "packed" view in which all values are integers with some special
    values for "nodata". This view is suitable for storage (actually
    most DataStore will read data in this form) and for rendering the
    image on a display device.

This "geophysics" view is a Geotools's addition. It doesn't conflict with OGC spec (all GridCoverage and SampleDimension methods behave in the expected way for "packed" view). It aims to make user's life easier: no need to check for "nodata" values when working with "geophysics" view, the user is sure to always get NaN for them.

------------------------------------------------------------------ The problem with the Geotools extension
------------------------------------------------------------------ The "geophysics" view is used for computation, with NaN for "nodata" values. The packed view is currently used BOTH for integer data with special "nodata" values, and for displaying to the screen. In theory, this dual usage should be possible: In J2SE, Raster can hold any values, its really the ColorModel which supplies the color from those values. In practice, the only fast ColorModel are the one provided with J2SE because they are handle in some special way by native code (DirectX, OpenGL, video hardward acceleration, etc.). Which means, to get decent speed, we need to convert data to unsigned 8 or 16 bits integers and use an IndexColorModel, or to 32 bits and use a DirectColorModel. All other custom ColorModel will paint very slowly.

=======================================
==== The problem in one sentence: ====
=======================================
What if a DataStore is loading data in a packed form, but this packed view is not compatible with a standard J2SE's ColorModel?

------------------------------------------------------------------ Possible solutions
------------------------------------------------------------------

  • Use a custom ColorModel and accept it even if it is a slow one.
    Possibly add an optimization in Renderer in order to handle this
    problem at the rendering level.

  • "corrupt" the geophysics view with non-NaN "nodata" values.
    Note: in my opinion, this solution would go against the reason
    why geophysics view were introduced in the first place: make
    computation as straight-forward as possible. The user would no
    longer be able to use it without checking for
    SampleDimension.getNodataValue().

  • Split the "packed" view in two categories: "packed" and
    "viewable". This solution would add more complexity to
    the already complex API (three views instead of two:
    "geophysics", "packed" and "viewable").

  • Convert the data into geophysics values at loading time
    (work done by DataStore). This is the solution currently
    suggested for the current GCS-Coverage implementation.

Environment

None

Activity

Show:
codehaus
April 10, 2015, 3:05 PM

CodeHaus Comment From: aaime - Time: Tue, 9 Dec 2003 00:58:29 -0600
---------------------
Some considerations:

  • mmx are used only on windows (see the JAI readme file), and only on integer data. Why? Because the mmx register can hold only array of integers. If you want to do vector operations on floats, you have to use SSE (which are specific to Intel processors by the way).

  • using special color models does not solve fully the problem, but it's a solution we may use. There is an assumption that is still problematic: the real world data is the geophisics version in floating point format. What if the real data is kept in integer format? The geophisics version in this case does not exists. There must be a way to recognize this, something in the API, that tells you not to look for the fp version of the data which would be meaningless and/or slower to use when the real data value is kept in integer format.

codehaus
April 10, 2015, 3:05 PM

CodeHaus Comment From: desruisseaux - Time: Tue, 9 Dec 2003 04:32:01 -0600
---------------------
There is a subtle distinction to point out: Actually, "geophysics" view doesn't force a floating-point format. It force the usage of Float.NaN or Double.NaN for "nodata" values. Obviously, forcing usage of NaN is equivalent to forcing floating-point format (since NaN doesn't exist for integer type), except if the raster has no missing value. In this particular case, the users is totally free to create a "geophysics" view in any integer format if he wants.

As for the problematic assumption (the real world data is the geophysics version in floating point format), it is not that problematic. We do not requires that data are stored in this format. We require that they can be expressed in this format. "Geophysics" and "packed" are just two views of the same data; the user pickup whatever view best matches his need and usually doesn't need to care about which one is the storage format. Actually, the vast majority of real data are kept in integer format. It is not a problem and work well as long as the integer format is compatible with a standard J2SE ColorModel. Problem begin when it is not compatible (e.g. contains negative values).

codehaus
April 10, 2015, 3:05 PM

CodeHaus Comment From: aaime - Time: Wed, 10 Dec 2003 01:05:25 -0600
---------------------
I've written a small program to compare the performance of simple raster computations between integer and floats. I simply create

two rasters, fill in some null values, and sum them keeping into

consideration the null values. For floats, no null value comparison is needed, for integers, I've written a small methods that checks againts an array of null values (that contains only Integer/Short.MIN_VALUE). The results are... suprising, but not that much if I think about it better.

Times are in milliseconds to perform only the sum, input WritableRaster creation time is not considered, the sum is performed 5 times and then the time averaged, the fixed null version uses a

single null values instead of an array of values (thus avoiding

a function call too, which is performed to mimic the SampleDimension.isNull() method needed to asses wheter a value

is null or not), inputs are two 1000x1000 matrix.

VM started with no options:

Sum duration, bytes: 137

Sum duration, bytes, fixed null: 118 *

Sum duration, shorts: 162

Sum duration, shorts, fixed null: 144

Sum duration, ints: 169

Sum duration, ints, fixed null: 152

Sum duration, floats: 133 <-- !!

Sum duration, doubles: 174

VM started with -server:

Sum duration, bytes: 132

Sum duration, bytes, fixed null: 120

Sum duration, shorts: 120

Sum duration, shorts, fixed null: 118

Sum duration, ints: 137

Sum duration, ints, fixed null: 132

Sum duration, floats: 102 <-- !!

Sum duration, doubles: 131 <-- !!

In the end, we can conclude that I was wrong: the cost of

checking for null values is more than the cost of a floating

point operation in most of the cases, and always if we

consider that to stick with opengis we need also to support

multiple null values.

In the end, it's better to always use floats for computation

since they are faster.

 

I'll attach the source file so that other people can confirm the

results.

codehaus
April 10, 2015, 3:05 PM

CodeHaus Comment From: aaime - Time: Wed, 10 Dec 2003 01:06:26 -0600
---------------------
Benchmark sources

codehaus
April 10, 2015, 3:05 PM

CodeHaus Comment From: mbedward - Time: Sat, 22 May 2010 07:12:28 -0500
---------------------
closing old issue

Assignee

Unassigned

Reporter

codehaus

Triage

None

Components

Affects versions

Priority

Medium
Configure