Operations/Minutes/2025-10-02
OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 2 October 2025, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Participants
- Tom Hughes (Operations Working Group volunteer)
- Grant Slater (OSMF Senior Site Reliability Engineer, OWG)
Apologies
- Paul Norman (Operations Working Group volunteer and OSMF contractor)
Minutes by Dorothea Kazazi, including some notes from Grant.
New action items from this meeting
- Grant to discuss with Paul Norman and flesh out his suggestion and determine the practicalities (e.g. key revocation). [Topic: AWS CA cert]
- Grant to follow up with Paul. [Topic: Serving vector tile styles]
- Tom to do a check on Saturday. [Topic: Upgrade to Postgres 17]
- Grant to go ahead with the purchase of the Gen10 (second-hand) server for Nominatim in the US. [Topic: Gen10 Nominatim purchase (USA)]
- Grant to upgrade the baseboard management controller (BMC) before the PG17 upgrade. [Topic: Upgrade to Postgres 17]
Reportage
snap-02 BIOS upgrade
Related to action item 2025-09-18 Grant to test the BIOS upgrade against snap-02. Whenever suitable. [Topic: OSM DB upgrade to Postgres 17]
Done. See https://github.com/openstreetmap/operations/issues/1289. It was relatively easy, although the BIOS upgrade took a bit longer than expected.
Grant changed two settings on snap-02, one of them for power management (CPU scaling). The CPU will now scale much higher when a lower number of cores are in use.
South Africa potential hardware donation
Related to action item 2025-03-20 Grant to follow-up with the South African contact about the potential hardware donation from a mobile network. [Topic: New offers of Servers Australia and South Africa] #2025-09-18 parked
Grant recently emailed his contact in South Africa, who is checking whether we can get better hardware in a more modern data center.
AWS CA cert
We have a few machines that have backups synced across. Grant wants backups copied directly to S3, rather than sent to the backup server, which would then send them to S3. This way, we would remove one layer of potential failure (e.g. if the storage server is out of space). We want the copies of backups to be independent.
This will increase the number of keys we have.
We have keys for:
- Planet publishing
- Planet dumping
- Backups
- Tilelog processing, run by Paul Norman
- Rails
It is recommended to give each service its own key.
On AWS
You give each machine a role in AWS, which allows it to "sudo" into another role (e.g. rails/images/backups/log processing). The roles have all the access credentials that they need. A machine's role can only switch to another role for which it has been given permission.
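As an illustration of the role switching described above, the following minimal Python sketch builds the kind of IAM trust policy that would let a machine's role assume a service role. The account ID and role name are hypothetical and do not reflect OSMF's actual AWS configuration.

```python
import json


def build_trust_policy(machine_role_arns):
    """Return an IAM trust policy allowing the given roles to assume a service role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the listed machine roles may switch into this role.
                "Principal": {"AWS": sorted(machine_role_arns)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical ARN, for illustration only.
policy = build_trust_policy(["arn:aws:iam::123456789012:role/machine-example"])
print(json.dumps(policy, indent=2))
```

A policy of this shape would be attached to the service role (e.g. backups), so that a machine's role has to be listed explicitly before it can switch.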
On self-hosted systems
We can't do this method with self-hosted systems, but we could use Certificate Authority (CA) certificates: if the machine presents itself with a signed certificate, we could trust the CA.
Wondering whether we could leverage the private key to do a Certificate Signing Request (CSR) against an internal CA and produce a certificate. The secret key would then never leave the individual server. E.g. Chef would run and create the CSR, and the CSR would be copied to e.g. the ACNE system.
A CSR is not needed in this case. The Chef keys could be completely ignored and we could build our own system. We would need a central system, where we would create a self-signed certificate to act as our CA. Each machine would generate its own key pair and upload the public key to the server, which would sign it with the CA and send it back to the client.
If we want to simplify the process, we could extract the public key for each client and possibly sign those public keys using our own CA certificate, to create a certificate for that client. The Chef server has the public keys, which can be extracted with knife.
We would need to submit the root certificate to Amazon. Rails probably supports CA-based authentication to Amazon.
On sudoing to another role vs current practice
Advantage of proposed method
- When a server gets added to the pool, it automatically gets an access key to AWS.
- We would need to go to Terraform and tell AWS about what the server can do.
- Create roles in Chef.
Disadvantage of proposed method
- Creation of extra steps when we add a new machine (run Chef, run OpenTofu).
Suggestion
- Taking the roles from Chef and injecting them into OpenTofu.
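One way the suggestion could work is sketched below in Python: take a mapping of nodes to Chef roles and emit a JSON variables structure that OpenTofu could consume. The node names, role names and the `machine_roles` variable are all hypothetical, and how the roles would actually be read out of Chef is left open.

```python
import json


def roles_to_tfvars(node_roles, aws_roles):
    """Map each node to the AWS-relevant Chef roles it holds.

    node_roles: dict of node name -> list of Chef role names.
    aws_roles: set of Chef role names that should grant AWS access.
    """
    allowed = {
        node: sorted(set(roles) & aws_roles)
        for node, roles in node_roles.items()
    }
    # Drop nodes that hold no AWS-relevant role.
    return {"machine_roles": {n: r for n, r in sorted(allowed.items()) if r}}


# Hypothetical node and role names, for illustration only.
nodes = {
    "server-a": ["base", "backups"],
    "server-b": ["base", "rails", "backups"],
    "server-c": ["base"],
}
tfvars = roles_to_tfvars(nodes, {"backups", "rails"})
print(json.dumps(tfvars, indent=2))
```

The output could be written to a `*.auto.tfvars.json` file, which OpenTofu loads automatically, so adding a machine in Chef would not need a separate manual edit on the OpenTofu side.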
Other points mentioned during discussion
- Sudoing to another role gives credentials (such as an id and secret) which we could extract e.g. every 12 hours.
On storing keys
- Chef.pem is the private key.
- The Fastly configuration that we have now has no keys, as Fastly already has a single public identity with Amazon that its customers work with. This case is a bit different from ours, as arbitrary customers use the single identity.
Other points mentioned during discussion
- Grant can now share the Fastly OpenTofu code, as it no longer contains any secrets.
On private CA managers
- Easy-RSA https://github.com/OpenVPN/easy-rsa
- Not needed, as Chef has the necessary resources.
- Could use LetsEncrypt https://letsencrypt.org/
Action item: Grant to discuss with Paul Norman about the proposed AWS CA cert method, flesh out his suggestion and determine the practicalities (e.g. key revocation).
AWS Setup automation
Paul is setting up the log processing. He uses some AWS services (Athena and Glue) which we're moving to a new account. Paul would like to have automation on AWS, like we do with Fastly.
There are some bugs related to OpenTofu, where some resources are forgotten and recreated.
Grant is trying to work out how to grant the permissions.
Serving vector tile styles
Follow-up to conversation in 2025-08-07 meeting and making a call on [#1263](https://github.com/openstreetmap/operations/issues/1263).
Action item: Grant to follow up with Paul.
Gen10 Nominatim purchase (USA)
Action item: Grant to go ahead with the purchase of the Gen10 (second-hand) server for Nominatim in the US.
AWS email
AWS are changing some S3 operations.
Previously, objects that failed to replicate would still be deleted if a delete policy or lifecycle rule was set. Now they will not be deleted.
Other points mentioned during discussion
- Grant has a script which checks whether any objects have missed replication, and replicates those that have.
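The kind of check described above can be sketched as follows. This is not Grant's script, only a minimal Python illustration: it assumes a list of object records has already been fetched, and the record shape (`key`, `replication_status`) is hypothetical. S3 reports a per-object replication status, with FAILED among its values.

```python
def objects_needing_replication(objects):
    """Return the keys of objects whose replication status is FAILED."""
    return [
        obj["key"]
        for obj in objects
        if obj.get("replication_status") == "FAILED"
    ]


# Hypothetical object records, for illustration only.
records = [
    {"key": "backups/a.tar", "replication_status": "COMPLETED"},
    {"key": "backups/b.tar", "replication_status": "FAILED"},
    {"key": "backups/c.tar", "replication_status": "PENDING"},
]
failed = objects_needing_replication(records)
print(failed)
```

A non-empty result is the condition that a replication alert could fire on.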
Action item: Grant to do a commit to alert manager, so that we get notified if there are failures in replication.
Upgrade to Postgres 17
Grant to update the baseboard management controller (BMC) beforehand, which he has already done for snap-02. The CPU load will spike briefly.
Action items
- Tom to do a check on Saturday.
- Grant to upgrade the baseboard management controller (BMC) before the PG17 upgrade.
Action items
- 2025-09-18 Done: Grant to test the BIOS upgrade against snap-02. Whenever suitable. [Topic: OSM DB upgrade to Postgres 17]
- 2025-09-18 Done: Grant to send an announcement about the upgrade tomorrow. [Topic: OSM DB upgrade to Postgres 17]
- 2025-09-18 Paul to look at potential issues related to the collation of indexes - Debian Postgres upgrade. [Topic: OSM DB upgrade to Postgres 17]
- 2025-09-18 Done: Grant to get Gen10 quotes for Nominatim upgrade. [Topic: Hardware upgrade - Nominatim]
- 2025-09-18 Done: Grant to ask Sarah if she will be happy with a Gen10. [Topic: Hardware upgrade - Nominatim]
- 2025-08-07 Done: Grant to 1) create AWS account + S3 buckets, 2) start from what we log for raster tiles, and 3) set the logging compression to zstd. [Topic: Vector Tile Logging]
- 2025-07-24 Grant to set-up a test for OWG's review [Topic: Switching www.osm.org to Fastly frontend]
- 2025-07-24 Grant to do the Mailman 2 to 3 conversion [Topic: Mailing lists] - https://github.com/openstreetmap/operations/issues/1264
- DONE first part, see the agenda: 2025-06-12 Tom still to run OSMDBT test. OPS then to plan a maintenance window for the OSM.org postgres database update. [Topic: OSM.org postgres database]
- 2025-05-01 Progress, we need to form academic justification and then we should get something: Grant to follow-up with Australian hosting again. [Topic: OSUOSL funding / issues]
- 2025-05-01 Grant to see if other University offers are still available and what hardware would be required. [Topic: OSUOSL funding / issues]
- 2025-03-20 Grant to follow-up with the South African contact about the potential hardware donation from a mobile network. [Topic: New offers of Servers Australia and South Africa]
- 2025-03-20 Grant to run an SQL query to identify more email providers used by spammers. [Topic: Spam] #2025-05-01 Grant has created a small list of disposable email providers.
- To be removed, see "reportage" 2024-09-19 Grant to create an IP blocklist script. [Topic: Cloudflare keep enabled Reportage] - Discussion during 2024-07-25 OPS to make a reasonable evaluation whether to go with Cloudflare, Fastly or none. #2025-09-18 To be removed, as we will use Fastly.
Action items that have been stricken-through are completed, removed, or have been moved to GitHub tickets.