Operations/Minutes/2025-10-16
OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 16 October 2025, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Participants
- Tom Hughes (Operations Working Group volunteer)
- Grant Slater (Senior Site Reliability Engineer, OWG)
- Paul Norman (Operations Working Group volunteer and OSMF contractor)
Minutes by Dorothea Kazazi.
New action items from this meeting
- Grant and Paul to set up a meeting about AWS Identity and Access Management Roles Anywhere. [Topic: AWS CA cert]
- Grant to create a PR refactoring how it is decided which Test Kitchen jobs run. [Topic: Reworking of Test Kitchen methods for defining which jobs run on Test Kitchen GitHub actions]
- Grant to create a PR about adding logic to Chef for retrying failed initial creation of Let's Encrypt certificates. [Topic: Add logic to Chef for retrying failed initial creation of Let's Encrypt certificates]
- Paul to ping Grant on the "Repurpose or return old tile servers" ticket. [Topic: #575 rhaegal - Repurpose or return old tile servers]
Reportage
AWS CA cert
Related to action item 2025-10-02 Grant to discuss with Paul Norman and flesh out his suggestion and determine the practicalities (e.g. key revocation). [Topic: AWS CA cert]
Action item: Grant and Paul to set up a meeting about AWS Identity and Access Management Roles Anywhere https://docs.aws.amazon.com/rolesanywhere/latest/userguide/introduction.html.
Serving vector tile styles
Related to action item 2025-10-02 Grant to follow up with Paul. [Topic: Serving vector tile styles]
Discussion about serving a style #1263.
Upgrade to Postgres 17
Related to action items
- 2025-10-02 Tom to do a check on Saturday. [Topic: Upgrade to Postgres 17].
- 2025-10-02 Grant to upgrade the baseboard manager controller before the PG17 upgrade.
Done.
Related to action item 2025-09-18 Paul to look at potential issues related to the collation of indexes - Debian Postgres upgrade. [Topic: OSM DB upgrade to Postgres 17]
Paul was going to check whether the C.UTF-8 collation could change between versions. Technically, it can change.
Suggestion: Use ASCII, as we don't use indexes for LIKE matching, only for exact string matching.
Need to rebuild indexes anyway.
Gen10 Nominatim purchase (USA)
Related to action item 2025-10-02 Grant to go ahead with the purchase of the Gen10 (second-hand) server for Nominatim in the US. [Topic: Gen10 Nominatim purchase (USA)]
- OSU are happy and have provided the shipping address.
- The purchase is covered by the budget, but Grant needs to first talk with Roland Olbricht (OSMF Treasurer).
Reworking of Test Kitchen methods for defining which jobs run on Test Kitchen GitHub actions
Goals
- To not have to run all the tests, all the time.
- Run just the tests that are required, to give feedback as quickly as possible, in less than 20 minutes.
Current status
- We have a job which defines which jobs should run, and then it actions them.
- Running all the tests currently takes 20-50 minutes.
On test failures
- Not all test failures are genuine failures.
- Test failures have become less frequent since we got rid of the DNSSEC-related parts.
- It is best to check the actions page https://github.com/openstreetmap/chef/actions, rather than the commit page.
Steps
- Have a job which manages which jobs run.
- Work out a hash for our tests, record whether the hash has passed a test, and then use it as a decider whether to run the same hash again.
- No need for hashing. We simply need to know which roles / cookbooks affect each test. Then, we can tell from the commit which cookbook / roles have been touched and which tests depend on that.
 
Suggestions
- Move to role-based testing, instead of cookbook-based testing. We do have a couple of role-based tests, but only a few.
- Create a base role, for a bare machine without any services.
Other points mentioned during discussion
- It's currently difficult to find from the changed files which tests are relevant.
- Running tests are cancelled if another commit comes in before they have finished. Cancelled tests appear with grayed icons on the actions page.
Action item: Grant to create a PR refactoring how it is decided which Test Kitchen jobs run.
Clearing out old Chef PRs
See https://github.com/openstreetmap/chef/pulls
#784 Add Debian 13 to tests + #659 Add Ubuntu 24.04 support
- Add Debian 13 to tests https://github.com/openstreetmap/chef/pull/784
- Add Ubuntu 24.04 support https://github.com/openstreetmap/chef/pull/659
Suggestion: Merge #784 and #659, as they are overlapping.
It is probably not a lot of work. Could run on DB server.
Decision: Close and replace #659.
#654 Enable use of podman to run tests
https://github.com/openstreetmap/chef/pull/654
Decision: Close PR as superseded by #736 (Allow podman to run kitchen tests).
#568 Add Translate extension to Wiki
https://github.com/openstreetmap/chef/pull/568
Action item: Grant to ask Minh on https://github.com/openstreetmap/chef/pull/568
#520 dev: user accounts no-longer required
https://github.com/openstreetmap/chef/pull/520
Grant to rebase.
#684 Restrict frontend access to CDNs, monitoring and admins
Blocked by Fastly.
Add logic to Chef for retrying failed initial creation of Let's Encrypt certificates
- Issue: If certificates are not successfully created on the first try on ACME, an admin has to log in and rerun the request.
- Suggestion: Add logic to Chef, so that every Chef run (roughly every 30 minutes) tries again to create the certificate.
- Limits: 5 retries an hour.
Action item: Grant to create a PR about adding logic to Chef for retrying failed initial creation of Let's Encrypt certificates.
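
The retry idea could look roughly like the sketch below. It is written in Python purely for illustration (the real change would be in a Chef cookbook), and every name in it is hypothetical: should_attempt, run_once and the request_certificate callable are not part of Chef or of any ACME client. It assumes the limit of 5 an hour is counted over a sliding one-hour window.

```python
import time

MAX_ATTEMPTS_PER_HOUR = 5  # the limit noted in the minutes
WINDOW_SECONDS = 3600      # assumption: the limit applies to a sliding one-hour window


def should_attempt(previous_attempts, now=None):
    """Return True if another certificate request fits within the hourly limit.

    previous_attempts is a list of Unix timestamps of earlier requests.
    """
    now = time.time() if now is None else now
    recent = [t for t in previous_attempts if now - t < WINDOW_SECONDS]
    return len(recent) < MAX_ATTEMPTS_PER_HOUR


def run_once(cert_exists, previous_attempts, request_certificate, now=None):
    """One simulated Chef run: request the certificate only if missing and allowed.

    request_certificate is a hypothetical callable returning True on success.
    Returns the updated list of attempt timestamps and whether a certificate exists.
    """
    now = time.time() if now is None else now
    if cert_exists or not should_attempt(previous_attempts, now):
        return previous_attempts, cert_exists
    attempts = previous_attempts + [now]
    return attempts, bool(request_certificate())
```

At one Chef run roughly every 30 minutes this makes about two attempts an hour, so the limit of 5 would only matter if runs were triggered more often, for example manually.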
#1282 Switch backups to systemd timer triggered
https://github.com/openstreetmap/operations/issues/1282
Goal: Know about job failures.
Suggestion: Use flock - it creates a filesystem-based lock.
"It locks a specified file or directory, which is created (assuming appropriate permissions), if it does not already exist."
Issue: We want to block other backups from running at the same time.
Shared link for helper script: https://gist.github.com/PhrozenByte/4418f5cde6bb687b064ace7a256abefe
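
To illustrate the locking behaviour, here is a minimal sketch using Python's fcntl.flock, which wraps the same flock system call that the flock command-line tool uses. It is not the linked helper script and not the planned systemd setup. The exclusive_lock helper and the lock path are hypothetical, and the sketch assumes a Linux or other Unix system.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive, non-blocking flock on lock_path for the duration of the block.

    The lock file is created if it does not already exist.
    Raises BlockingIOError if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A backup job would wrap its work in `with exclusive_lock("/run/lock/backup.lock"):` (the path is an example only). A second backup started while the first is still running would then fail immediately with BlockingIOError rather than run concurrently, and that failure is something a job monitor can report.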
#575 rhaegal - Repurpose or return old tile servers
https://github.com/openstreetmap/operations/issues/575
Someone has told us he can get us better hardware, as the hardware we have is too old. He can get a faster machine, but not a faster disk.
Suggestion: Shut down rhaegal, if we're not doing anything with it.
Action item: Paul to ping Grant on the "Repurpose or return old tile servers" ticket.
Angor - South African server
The not-for-profit organisation which is hosting angor https://hardware.openstreetmap.org/servers/angor.openstreetmap.org/ will be replaced by another organisation.
The person who gave us the server is now at the new organisation and told us that he can now give us better hardware. He had mentioned that he will come back to Grant, but hasn't yet.
#1299 Create S3 bucket for dev postgres wal
https://github.com/openstreetmap/operations/issues/1299
Aim: Create an S3 bucket for testing PostgreSQL WAL archiving on the dev server.
This is blocking the database maintenance.
#1284 OpenTelemetry Investigation
https://github.com/openstreetmap/operations/issues/1284
- Ian Dees has OpenTelemetry set up.
- OpenTelemetry can be used as an intermediary, receiving data over network connections and passing it on to other places.
- Could use Prometheus for storage - Grafana Alloy https://github.com/grafana/alloy could be used as well.
Suggestion
- Look into the hardware requirements for storage.
Other points mentioned during discussion
- People who use commercial solutions such as New Relic sometimes end up paying more for them than for their entire cloud hosting.
Grant pinged Ian Dees during the meeting.
Action items
- 2025-10-02 Grant to discuss with Paul Norman and flesh out his suggestion and determine the practicalities (e.g. key revocation). [Topic: AWS CA cert]
- 2025-10-02 Grant to follow up with Paul. [Topic: Serving vector tile styles]
- 2025-10-02 Tom to do a check on Saturday. [Topic: Upgrade to Postgres 17]
- 2025-10-02 Grant to go ahead with the purchase of the Gen10 (second-hand) server for Nominatim in the US. [Topic: Gen10 Nominatim purchase (USA)]
- 2025-10-02 Grant to upgrade the baseboard manager controller before the PG17 upgrade. [Topic: Upgrade to Postgres 17]
- 2025-09-18 Paul to look at potential issues related to the collation of indexes - Debian Postgres upgrade. [Topic: OSM DB upgrade to Postgres 17]
- 2025-07-24 Grant to set up a test for OWG's review [Topic: Switching www.osm.org to Fastly frontend]
- 2025-07-24 Grant to do the Mailman 2 to 3 conversion [Topic: Mailing lists] - https://github.com/openstreetmap/operations/issues/1264
- DONE first part, see the agenda: 2025-06-12 Tom still to run OSMDBT test. OPS then to plan a maintenance window for the OSM.org postgres database update. [Topic: OSM.org postgres database]
- 2025-05-01 Grant to follow up with Australian hosting again. Progress: we need to form an academic justification, and then we should get something. [Topic: OSUOSL funding / issues]
- 2025-05-01 Grant to see if other University offers are still available and what hardware would be required. [Topic: OSUOSL funding / issues]
- 2025-03-20 Grant to follow up with the South African contact about the potential hardware donation from a mobile network. [Topic: New offers of Servers Australia and South Africa]
- 2025-03-20 Grant to run an SQL query to identify more email providers used by spammers. [Topic: Spam] Update 2025-05-01: Grant has created a small list of disposable email providers.
Action items that have been stricken-through are completed, removed, or have been moved to GitHub tickets.