Operations/Minutes/2026-06-11
OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 11 June 2026, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Participants
- Paul Norman (OWG volunteer, OSMF contractor)
- Tom Hughes (OWG volunteer)
- Grant Slater (OWG, OSMF Senior Site Reliability Engineer)
- Héctor Ochoa Ortiz (OSMF Board)
Minutes by Dorothea Kazazi.
New action items from this meeting
- Operations Working Group (OWG) to follow up on chat about 1) changing the 509 error code to 409 for the website, 2) relaxing the 509 limit if it hits people unintentionally, so as to not alarm on tiny POP error percentages. [Topic: Reducing Fastly alerts]
- Grant to increase the number of workers for Apache across the tile servers. [Topic: Upgrading workers for Apache]
Reportage
Retiring the HOT style from osm.org
Related to action item: 2026-05-28 OPS to ask OSM France how they would view us retiring the HOT style from osm.org, since it is not under active development. [AOB: Tracestrack topo and HOT layers]
Assigned to Paul.
Upgrading Naga
Related to action item: 2026-05-28 Ops to upgrade Naga. [AOB: OS upgrades].
Done. Naga had MediaWiki instances on it, which had not been expected to be present.
Large downloads from QGIS
Related to action item: 2026-03-19 Grant to research what triggers a large download from QGIS. [Topic: QGIS Tiles usage] # 2026-05-14 Not much progress; unable to find the new usage header in latest versions.
QGIS started backporting the feature for their next release.
Issue to be parked.
404 tiles
Related to action item: 2026-03-19 Paul to overhaul how we're doing the 404 tiles. [Topic: Tiles]
Action item to be replaced by a ticket.
Mailman conversion
Related to action item: 2026-03-05 Grant to do a dry run for the Mailman conversion, probably on Rhaegal in Croatia. [Topic: Upgrades: Machines on Ubuntu 22.04]
Grant has started it and has validated all the mailing lists, except two which had minor corruptions.
OTRS
OTRS (an issue/ticket tracker used by several Working Groups of the OpenStreetMap Foundation) uses a Perl module, and the version of that module in Debian has been upgraded to a newer version that is incompatible with OTRS. Tom had to manually hack one of the scripts.
Upgrades
Grant upgraded the two unused Scaleway machines in France from Ubuntu 22.04 to Debian 12 via an undisclosed method. Grant uninstalled all the kernel packages.
Suggestions
- Upgrade the database servers in a similar way.
- Objection.
QGIS Tile server
The QGIS board is happy to fund a new tile server costing EUR 10,000. A draft invoice has been raised by OSMF, and Grant has looked at different options at https://www.serverschmiede.com/en/.
Server location
Preference: Dublin over Amsterdam, as Grant plans to visit Dublin.
Power
Amsterdam:
- 3.3 kW used of 3.5 kW
- 200 watts free.
- Extra power was purchased once in the past.
Dublin:
- 3.6 kW used of 4 kW
- 400 watts free.
Additional power can be requested, if needed.
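The free-power figures above follow directly from the quoted usage and capacity. A minimal Python sketch of that arithmetic, working in whole watts to avoid floating-point rounding; the function and variable names are illustrative only:

```python
def free_watts(capacity_w: int, used_w: int) -> int:
    """Return the unused power budget in watts."""
    if used_w > capacity_w:
        raise ValueError("usage exceeds capacity")
    return capacity_w - used_w


# Figures quoted in these minutes, converted from kW to W.
sites = {
    "Amsterdam": {"capacity_w": 3500, "used_w": 3300},
    "Dublin": {"capacity_w": 4000, "used_w": 3600},
}

headroom = {name: free_watts(**figures) for name, figures in sites.items()}
print(headroom)  # {'Amsterdam': 200, 'Dublin': 400}
```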
Redundancy
Concern: We should have pushed for 2 machines instead of 1, as we have redundancy requirements and they're currently using more than one machine.
- QGIS immediately responded to the OWG enquiry.
- Grant worded his email so that we could ask for a second server in the future.
Grant was looking at Gen10+, 32 cores each, as they have good prices and CPUs. Unclear how NVMe plugs in.
Prices
- One 8TB disk currently costs EUR 4,000.
- Two 4TB drives currently cost more than what OWG paid for two 8TB drives in the past. We would need two 4TB, for redundancy.
- Grant's current server pricing ~ EUR 11,000.
- The OWG might have to spend some of its budget (~ EUR 1,000) in order to get a full Scalable Gen3 system.
Suggestion
Scalable Gen3 server, with
- two 4TB or 8TB drives.
- pick-up and return warranty - cost: EUR 300.
- the advanced license pack.
800 watt PSUs instead of 500 watt, as:
- the 500 watt ones are at the edge of what the system can handle and
- we had issues on the 500 watts where the machine just dies.
Proposal: 2x Intel Xeon Gold 6338N SRKJ9 32C.
Based on the Culebre tile server (see further below), we might need 128 GB extra, which would be around EUR 1,100. We need some cache and it is better to have a system which lasts longer.
Other points mentioned during discussion
- Both 800 and 500 watt PSUs are platinum for Gen10+, with very high efficiency (~ >92%).
- The 800 watt platinum PSU is probably more efficient than the 1,000 watt titanium one at the range we are running.
- The 6338N is a newer revision of the 6338 at the same price; it clocks the CPU higher and the RAM slightly lower. The power usage is significantly lower as well.
Decision
Discuss options outside the meeting.
Mailman
Preparing to migrate mailing lists to Mailman 3. Intending to use PostgreSQL as the Mailman 3 database backend, rather than the default SQLite. The Debian package can manage schema upgrades automatically without requiring custom Chef logic, similar to what was done with OTRS and the DB config package.
Increasing workers for Apache
Issue: The QGIS tile server was unable to handle the volume of incoming connections under heavy load, including being unable to return 404 responses. Once one machine stopped accepting connections, Fastly routed the traffic to other machines, which also became overloaded.
Grant increased the number of Apache workers on the QGIS tile server and mentioned that it handled the increased numbers well.
Suggestion: Upgrade the workers on other machines.
Other points mentioned during discussion
- The OWG encountered similar issues during the last round of upgrades. Tom had to restart some machines multiple times, as they were getting overwhelmed with connections.
- Fastly has likely fixed the problem of failover causing another overload, through better load distribution.
Action item
Grant to increase the number of workers for Apache across the tile servers.
Culebre tile server and apps memory usage
https://hardware.openstreetmap.org/servers/culebre.openstreetmap.org/
Issue: 175 GB in apps appears higher than expected.
- The machine was up for 33 days and GBs may have gone up a bit with Mapnik 4, but there's no sign of it leaking.
- The machine was using 143 GB before Mapnik 4.
Other points mentioned during discussion
- Mapnik does a lot of internal caching.
- The difference in memory usage between servers is likely related to the number of rendering cores or threads, as Mapnik caches data per core or thread.
- Renderd is using 64% of memory on Culebre.
- CPU utilisation on the servers is over 90%, with CPU pressure up to 6%.
Reducing Fastly alerts
Issue: Excessive alert noise from Fastly e.g. due to 5xx errors generated by cgi-map in certain situations. Might be related to one of the rate limits that people have been complaining about.
- Any 5xx error is considered to be a server error.
- Rate limiting errors should be 4xx.
509 error code
Used by Apache and cPanel to indicate the web hosting client has exceeded the bandwidth allocation. Sent directly by cgi-map.
Is one of the oldest limits we have:
- initially created to deter scraping by setting a maximum rate at which you could fetch data from the API.
- initially only applied to the map call and later extended to element calls.
- set around 10 years ago.
Triggered by exceeding the maximum amount of data that you can fetch from the API per second:
- e.g. by people using editors to download the entirety of every relation that touches an area.
- in some cases, by areas that just have a lot of big relations.
- by people who are intentionally trying to abuse the API by downloading all the GPS traces.
Is unrelated to whether an account is new or not.
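As an illustration of a data-volume limit of this kind, here is a minimal token-bucket sketch in Python. It is not cgi-map's actual implementation; the class name, the byte allowances, and the explicit `now` argument are all assumptions made for the example.

```python
class ByteRateLimiter:
    """Token bucket measured in bytes: refills at `rate` bytes/second, capped at `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full allowance
        self.last = 0.0      # timestamp of the previous call, in seconds

    def allow(self, nbytes: int, now: float) -> bool:
        """Return True if `nbytes` may be served at time `now`, False if over the limit."""
        # Refill for the time elapsed since the last call, without exceeding the cap.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical allowance: 1,000 bytes/second with a 5,000 byte burst.
limiter = ByteRateLimiter(rate=1000, burst=5000)
first = limiter.allow(4000, now=0.0)   # served from the initial burst
second = limiter.allow(4000, now=1.0)  # only 2,000 bytes available: rejected
third = limiter.allow(4000, now=3.0)   # refilled to 4,000 bytes: served
```

Passing `now` explicitly keeps the example deterministic; a real service would more likely read a monotonic clock. A rejected request here is where a rate-limit status code would be returned.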
Suggestions
- Exclude codes from the alarm.
- Change 509 to 409.
- Relax the 509 limit, if it hits people unintentionally.
- Set a traffic threshold, and trigger alarms on numbers, rather than (or in addition to) a percentage.
Other points mentioned during discussion
- Setting thresholds on machines where there is a very low count of requests is hard, because a small amount can cause a very big apparent error rate.
- 503: internal server error.
- 502: proxy error.
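The suggestions on excluding codes and on combining an absolute threshold with a percentage can be sketched as follows. The thresholds, the function name, and the set of excluded codes are hypothetical and do not reflect the actual Fastly alert configuration.

```python
from collections import Counter

# Assumption: rate-limit responses are not counted as server faults.
EXCLUDED_CODES = {509}


def should_alert(status_counts: Counter,
                 min_errors: int = 50,
                 max_error_rate: float = 0.05) -> bool:
    """Alert only when 5xx errors are both numerous and a large share of traffic."""
    total = sum(status_counts.values())
    if total == 0:
        return False
    errors = sum(
        count
        for code, count in status_counts.items()
        if 500 <= code <= 599 and code not in EXCLUDED_CODES
    )
    return errors >= min_errors and errors / total >= max_error_rate


# A quiet POP: 2 errors in 10 requests is a 20% rate but too few requests to matter.
quiet_pop = should_alert(Counter({200: 8, 503: 2}))
# A busy POP: 1,000 errors in 10,000 requests passes both thresholds.
busy_pop = should_alert(Counter({200: 9000, 503: 1000}))
# Rate-limited traffic only: excluded codes never trigger the alert.
rate_limited = should_alert(Counter({200: 900, 509: 100}))
```

Requiring both conditions addresses the point above about low-traffic machines, where a handful of failures produces a very large apparent error rate.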
Action item
OWG to follow up on chat about 1) changing the 509 error code to 409 for the website, 2) relaxing the 509 limit if it hits people unintentionally, so as to not alarm on tiny POP error percentages.
Topic raised by Héctor Ochoa Ortiz (Board member)
Six out of seven board members had a face-to-face meeting this last weekend in Madrid and they discussed some topics related to the OWG.
Hiring a second system admin
- Craig Allan (Chairperson) has been working on a draft job description for hiring a second system admin, to be shared with OWG by the next meeting.
- Plan to have a contract by the end of 2026.
- The different Working Groups could have the freedom of setting up their own AI policy.
- Try to get AI to attribute OSM correctly.
Harry's Mastodon post was mentioned by OWG as a low hanging fruit.
Next step: Héctor to send an email with details to the OWG, as he had connection issues during the meeting.
Action items
- 2026-05-28 Minh to follow up on DWG non-disclosure agreements, then IP access for moderators in website repo. [DWG request for user info for moderators]
- 2026-05-28 Paul to ask OSM France how they would view us retiring the HOT style from osm.org, since it is not under active development. [AOB: Tracestrack topo and HOT layers]
- 2026-05-28 Ops to upgrade Naga. [AOB: OS upgrades] Done.
- 2026-05-14 Grant to check with Fastly team via Slack about 100% ngwaf traffic
- 2026-05-14 Paul to do PR setting referer policy header on embed
- 2026-03-19 Paul to create a breakdown of QGIS tile traffic statistics for different zoom levels. [Topic: QGIS Tiles usage]
- 2026-03-19 Grant to research what triggers a large download from QGIS. [Topic: QGIS Tiles usage] # 2026-06-11 Action item to be parked.
- 2026-03-19 Paul to overhaul how we're doing the 404 tiles. [Topic: QGIS Tiles usage] To be replaced by ticket.
- 2026-03-19 Paul to look into the typo on tile block message 403r [Topic: Typo on tile block message 403r?]
- 2026-03-19 Paul and Grant to run some time limited experiments during non peak hours to test catching anonymous/fake-ua scrapers. Genuine Google Bot etc will continue to be permitted. [Topic: Fastly Client Challenges] # 2026-05-14 Grant to chat to Tom about what CA + Signed Certs are required. Grant to read the AWS documentation on requirements.
- 2026-03-05 Grant to do a dry run for the Mailman conversion, probably on Rhaegal in Croatia. [Topic: Upgrades: Machines on Ubuntu 22.04]
- 2026-01-22 Tom to draft follow up question on pgbackrest local backup required or can /JUST/ S3 be used. [Topic: Credativ consultancy on OSM.org Postgres database update]
- 2025-10-16 Grant and Paul to set up a meeting about AWS Identity and Access Management Roles Anywhere https://docs.aws.amazon.com/rolesanywhere/latest/userguide/introduction.html. [Topic: AWS CA cert]
- 2025-10-16 Grant to create a PR regarding refactoring some stuff: make kitchen handle selecting cookbook-OS combos, and only run tests on cookbooks that have changed. [Topic: Reworking of Test Kitchen methods for defining which jobs run on Test Kitchen GitHub actions]
- 2025-10-16 Grant to create a PR about adding logic to Chef for retrying failed initial creation of Let's Encrypt certificates [Topic: Add logic to Chef for retrying failed initial creation of Let's Encrypt certificates]
Action items that have been stricken-through are completed, removed, or have been moved to GitHub tickets.