GeoServer monitor plugin - scaling troubles

Description

None

Environment

When a request is done, it is sent to a post processor thread. We found this thread unable to keep up with incoming requests.

 

Specifically, ~100 requests/second come in, but each failed reverse DNS lookup takes multiple seconds, so pending tasks pile up faster than they are processed. After a day of runtime, I found ~300K PostProcessTask objects on the heap. This has several interesting consequences.

1. Memory leaks, of course. Since each PostProcessTask holds the HttpServletRequest/Response, and these are big objects, this quickly adds up to multiple gigabytes.

2. After an HttpServletRequest/Response has been serviced, Tomcat reuses some of its backing buffers for new requests. Because the post processor still holds on to these objects, fragments of unrelated requests can turn up in post-processing.

3. The post-process executor service is hardcoded to 2 threads, which is simply not enough. This should be configurable.

4. ReverseDNSPostProcessor has a reverseLookupCache, but AFAIK it is only read, never written. Apart from that, it overlaps with the JDK cache and has no TTL, so it seems useless. I propose replacing it with a TTL-based cache that stores only the lookup FAILURES, as these are not cached by the JDK.
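
The proposal in point 4 could be sketched roughly as follows. This is only an illustration, not the fix that went into GeoServer: the class NegativeLookupCache, its method names, and the injectable clock are hypothetical. It assumes the lookup goes through InetAddress.getCanonicalHostName(), which returns the textual IP address when no host name can be resolved.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: remembers only FAILED reverse lookups, each for a fixed TTL. */
final class NegativeLookupCache {
    // Maps an IP string to the clock value (in nanoseconds) at which the failure entry expires.
    private final ConcurrentMap<String, Long> failedUntil = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final LongSupplier nanoClock; // injectable so expiry can be tested without sleeping

    NegativeLookupCache(Duration ttl) {
        this(ttl, System::nanoTime);
    }

    NegativeLookupCache(Duration ttl, LongSupplier nanoClock) {
        this.ttlNanos = ttl.toNanos();
        this.nanoClock = nanoClock;
    }

    /** True if a lookup for this IP failed less than one TTL ago. */
    boolean isKnownFailure(String ip) {
        Long expiry = failedUntil.get(ip);
        if (expiry == null) {
            return false;
        }
        if (nanoClock.getAsLong() - expiry >= 0) {
            failedUntil.remove(ip, expiry); // entry expired, allow a fresh attempt
            return false;
        }
        return true;
    }

    void recordFailure(String ip) {
        failedUntil.put(ip, nanoClock.getAsLong() + ttlNanos);
    }

    /** Returns the host name, or null if the lookup failed now or within the last TTL. */
    String reverseLookup(String ip) {
        if (isKnownFailure(ip)) {
            return null; // skip the slow lookup entirely
        }
        try {
            InetAddress address = InetAddress.getByName(ip); // no DNS query for an IP literal
            String host = address.getCanonicalHostName();
            if (host.equals(address.getHostAddress())) {
                recordFailure(ip); // the JDK fell back to the textual IP: treat as failure
                return null;
            }
            return host;
        } catch (UnknownHostException e) {
            recordFailure(ip);
            return null;
        }
    }
}
```

Successful lookups are deliberately not stored here. The map in this sketch is unbounded, so a real implementation would also need a size limit or periodic eviction.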

As a workaround, I disabled reverse DNS lookups in monitoring.properties.

Activity

Andrea Aime February 8, 2023 at 8:58 AM

Agreed on having a configurable thread pool for post-processing.
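
For illustration only, a configurable pool might look something like the sketch below. The class PostProcessorPool and the property key postProcessorThreads are hypothetical and are not existing GeoServer names. The bounded queue with a discard-oldest policy is an extra assumption made here to cap memory use; it means the oldest pending tasks are dropped under sustained overload.

```java
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a post-processing pool whose size is read from configuration. */
final class PostProcessorPool {
    /** Hypothetical key; NOT an existing monitoring.properties setting. */
    static final String THREADS_PROPERTY = "postProcessorThreads";
    /** Falls back to the currently hardcoded value of 2 threads. */
    static final int DEFAULT_THREADS = 2;

    private PostProcessorPool() {}

    static ThreadPoolExecutor create(Properties props, int queueCapacity) {
        int threads = parseThreads(props.getProperty(THREADS_PROPERTY), DEFAULT_THREADS);
        return new ThreadPoolExecutor(
                threads,
                threads,
                0L,
                TimeUnit.MILLISECONDS,
                // Bounded queue: pending tasks can no longer grow without limit.
                new ArrayBlockingQueue<>(queueCapacity),
                // When the queue is full, drop the oldest pending task instead of blocking.
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /** Parses a positive thread count, returning the fallback for missing or invalid values. */
    static int parseThreads(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            int parsed = Integer.parseInt(raw.trim());
            return parsed > 0 ? parsed : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```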

About 2: that happens in many other places in GeoServer and can't be helped; code working asynchronously needs to avoid touching HttpServletRequest/Response objects. But the ReverseDNSPostProcessor seems to be working off cached information in the RequestData, so is this point relevant to the discussion? I'd recommend you stick to the minimum relevant information: we are not paid to follow up on these tickets and volunteer time is very scarce, so be mindful of it.

Agreed on the reverse lookup cache: it should be written to, and should probably be more sophisticated, e.g. caching failed lookups too and having a maximum cache time.

Fixed

Created February 7, 2023 at 1:08 PM
Updated August 18, 2023 at 12:27 PM
Resolved August 18, 2023 at 12:27 PM
