Defer bounds calculation during import

Description

Some GeoServer users have observed that generating the Import Context can be very slow for large databases (e.g. 2-3 minutes for a database containing a table with 4.5 million rows).

After examining the Importer code, I believe this is caused by generating the resource bounds as the tasks are created during context generation (a calculation whose cost depends on the number of rows).

Moving bounds generation to the do__Import() step reduces the time to generate the context at the cost of a longer import step. The net time to do an import appears unchanged.
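The idea can be illustrated with a minimal sketch. This is not the Importer's actual code: `Task`, `generateContext`, `doImport` and the `boundsScan` supplier are hypothetical stand-ins, and the only point is that the expensive scan runs when a task is imported rather than when it is created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class DeferredBoundsSketch {

    /** Hypothetical stand-in for an import task; not the Importer's own task class. */
    static final class Task {
        final String table;
        final Supplier<double[]> boundsScan; // expensive: cost grows with the row count
        double[] bounds;                      // stays null until the import step runs

        Task(String table, Supplier<double[]> boundsScan) {
            this.table = table;
            this.boundsScan = boundsScan;
        }
    }

    /** Context generation: one task per table, no bounds scan. */
    static List<Task> generateContext(List<String> tables, Supplier<double[]> scan) {
        List<Task> tasks = new ArrayList<>();
        for (String table : tables) {
            tasks.add(new Task(table, scan));
        }
        return tasks;
    }

    /** Import step: bounds are computed here, only for the task being imported. */
    static void doImport(Task task) {
        if (task.bounds == null) {
            task.bounds = task.boundsScan.get();
        }
        // ... the actual data import would follow here
    }

    public static void main(String[] args) {
        int[] scans = {0}; // counts how often the expensive scan runs
        Supplier<double[]> scan = () -> {
            scans[0]++;
            return new double[] {-180, -90, 180, 90};
        };

        List<Task> tasks = generateContext(List.of("roads", "parcels", "points"), scan);
        System.out.println("scans after context generation: " + scans[0]);

        doImport(tasks.get(0));
        System.out.println("scans after importing one table: " + scans[0]);
    }
}
```

With three tables in the context and one imported, the scan runs once instead of three times, which is where the savings for partial imports would come from.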

This provides notable time savings for users who only want to import one or two tables from a large database, and should not affect users who are importing everything.

Environment

None

Activity

Torben Barsballe
October 1, 2015, 8:22 PM
Edited

Adding this as a comment as it doesn't really affect the ticket and would add unnecessary complexity to the ticket description, but is still relevant:

In theory, this change introduces a minor workflow change to the importer:

If you import a file that has a bounds problem, you will get a task with a NO_BOUNDS state.
Normally, this would be recognized during the generate context step, similar to the NO_CRS state.
After this change, you would only get to the NO_BOUNDS state after running the import for the first time. This would require you to update the bounds and re-run the import.
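The changed workflow can be sketched as a small state machine. This is an illustration only: the `State` enum, `Task` class and `runImport` method below are hypothetical and much simpler than the Importer's own classes; the sketch just shows that NO_BOUNDS would surface on the first import run and clear once bounds are supplied and the import is re-run.

```java
import java.util.Optional;

public class NoBoundsWorkflowSketch {

    /** Hypothetical, reduced set of task states for illustration. */
    enum State { READY, NO_BOUNDS, COMPLETE }

    static final class Task {
        State state = State.READY;
        double[] bounds; // may be set manually by the user
    }

    /**
     * Import step with deferred bounds calculation.
     * 'computed' stands in for the result of the bounds calculation;
     * it is empty when bounds cannot be derived from the data.
     */
    static void runImport(Task task, Optional<double[]> computed) {
        if (task.bounds == null) {
            if (!computed.isPresent()) {
                task.state = State.NO_BOUNDS; // only discovered at import time
                return;
            }
            task.bounds = computed.get();
        }
        task.state = State.COMPLETE;
    }

    public static void main(String[] args) {
        Task task = new Task();

        // First run: bounds cannot be calculated, so the task ends up in NO_BOUNDS.
        runImport(task, Optional.empty());
        System.out.println(task.state); // NO_BOUNDS

        // The user updates the bounds and re-runs the import.
        task.bounds = new double[] {-180, -90, 180, 90};
        runImport(task, Optional.empty());
        System.out.println(task.state); // COMPLETE
    }
}
```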

In practice, this 'never' comes into play because the bounds can 'always' be calculated from the data. (Notably, the Importer UI has no way of handling a NO_BOUNDS state, and the NO_BOUNDS state is never tested.)

Torben Barsballe
October 2, 2015, 6:57 PM
Edited

Pull request: https://github.com/geoserver/geoserver/pull/1247

I have done some testing, and for more than 1 million rows, the bounds calculation has a notable effect on the import time, with:

Where n is the number of rows.

Test results (summary):

Importing 1 table with ~45,000,000 rows (single geometry column)

Current importer:
Generate Context: 1:45 m
Do Import: 0:02 m

After change:
Generate Context: 0:04 m
Do Import: 1:10 m

Importing 1 table ~4,000,000 rows (multiple geometry columns)

Current importer:
Generate Context: 0:13 m
Do Import: 0:01 m

After change:
Generate Context: 0:04 m
Do Import: 0:11 m

Note: It was observed that multiple geometry columns had a negligible effect on import times.
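For comparing net times, the totals can be worked out from the figures above. The sketch below assumes the reported values are in minutes:seconds (an interpretation of the "m" suffix); the class and method names are made up for this example.

```java
public class ImportTimingTotals {

    /** Parses a "m:ss" string into seconds (assumes the figures are minutes:seconds). */
    static int seconds(String mmss) {
        String[] parts = mmss.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    /** Net time in seconds: context generation plus import. */
    static int total(String generateContext, String doImport) {
        return seconds(generateContext) + seconds(doImport);
    }

    public static void main(String[] args) {
        // First test (single geometry column)
        System.out.println("current: " + total("1:45", "0:02") + " s"); // 107 s
        System.out.println("changed: " + total("0:04", "1:10") + " s"); // 74 s

        // Second test (multiple geometry columns)
        System.out.println("current: " + total("0:13", "0:01") + " s"); // 14 s
        System.out.println("changed: " + total("0:04", "0:11") + " s"); // 15 s
    }
}
```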

Fixed

Assignee

Torben Barsballe

Reporter

Torben Barsballe

Triage

None

Fix versions

Affects versions

None

Components

Priority

Medium