Operations/Minutes/2025-06-26
OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 26 June 2025, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Participants
- Tom Hughes (Operations Working Group volunteer)
- Grant Slater (Senior Site Reliability Engineer, OWG)
- Paul Norman (Operations Working Group volunteer and OSMF contractor)
- Jochen Topf
Minutes by Dorothea Kazazi, including some notes from Grant.
New action items from this meeting
- Tom to document what he did to manually get osmdbt to continue. [Topic: osmdbt/Postgres replication breakage - osm replication diff outage]
- Jochen to look at osmdbt and the filtering in Fakelog. [Topic: osmdbt/Postgres replication breakage - osm replication diff outage]
osmdbt/Postgres replication breakage - osm replication diff outage
Issue:
OpenStreetMap planet diffs stopped updating on 2025-06-25. The database was fine; the processing stopped without generating any incorrect output. There were multiple alarms (tile replication / Nominatim / failed service on Norbert), some of which went off very quickly.
There seemed to be an unexpected issue with PostgreSQL and how osmdbt pulls the changes from the database.
Timeline
- The pg_dump backup started early on Monday (2025-06-23) and finished late on Wednesday evening (2025-06-25), ending approximately an hour before the osmdbt failure.
- 2025-06-25 21:31/21:32 the outage started. PostgreSQL log excerpts:
- 2025-06-25 21:33:00 GMT STATEMENT: SELECT * FROM pg_logical_slot_peek_changes($1, NULL, NULL);
- 2025-06-25 21:33:14 GMT LOG: checkpoint starting: time
- 2025-06-25 21:33:25 GMT ERROR: invalid memory alloc request size 1243650064
Also:
- 2025-06-25 20:36 Spike6 was queuing processes; people might briefly have received an error message.
- 2025-06-25 20:00 UTC until ~22:30 UTC: the number of active Rails queries against Postgres spiked from virtually nothing to >100, peaking around 20:30 UTC.
Recovery was a manual run of osmdbt with Fakelog; the output was manually diffed and duplicates were removed.
We are unsure exactly what triggered the PostgreSQL issue.
Other observations
- Connections: There was a spike in the connection count.
- Transactions: There was a local peak in transactions per second. That is when the buffer caches started showing increased hits and index scans started climbing to several million per second, reaching ~40 million index scans per second an hour later.
- GPX scans: There were up to 40 million GPX scans per second. Most likely somebody was scraping the GPX history page on the website, one page (with 20 GPX records) at a time. The timestamps align with the increases on the other dashboards (see the monitoring sketch after this list).
- There does not seem to have been an increase in incoming requests, but there was an increase in request errors.
- Tom is not convinced this could be the cause.
- Notes: There were some large notes queries. One of them ran for over an hour and was retried. Grant reported that on the channel.
- Scraping: Spike02 showed heavy scraping on 2025-06-25 using fake browsers - around 5 requests per second, e.g. someone requesting all tags in Germany.
- There was no significant increase in processes last night.
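The figures above come from dashboards; as a minimal sketch, the standard PostgreSQL statistics views that such dashboards can be built on look like the following (no assumptions about the actual Rails schema; per-second rates are obtained by sampling the counters over time):

    -- Cumulative per-table scan counters; sampling them at intervals gives scans/second.
    SELECT relname, idx_scan, seq_scan
      FROM pg_stat_user_tables
     ORDER BY idx_scan DESC
     LIMIT 10;

    -- Per-index counters, useful for narrowing a spike down to e.g. the GPX tables.
    SELECT relname, indexrelname, idx_scan
      FROM pg_stat_user_indexes
     ORDER BY idx_scan DESC
     LIMIT 10;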
What we're not seeing
- A query suddenly slowing down - rather, we're seeing an error from deep in the internals.
Potential causes
- High API usage could have triggered something.
- A PostgreSQL 15 bug. We plan a PostgreSQL 17 upgrade soon, as it has different compression and will make some of our processing faster.
- Post-pg_dump cleanup triggered something.
- Related to the Postgres autovacuum.
- Autovacuum usually affects the small tables.
- More parallel connections, as normal connections were staying open longer.
- More incoming connections, which caused additional Rails daemons to spin up, up to the limit (see the sketch after this list).
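A minimal sketch, assuming only the standard pg_stat_activity view (the one-hour threshold is an arbitrary example), of how the connection-count and long-transaction hypotheses above could be checked:

    -- Connection counts per state (active, idle, idle in transaction, ...).
    SELECT state, count(*) AS connections
      FROM pg_stat_activity
     GROUP BY state;

    -- Transactions that have been open for a long time.
    SELECT pid, usename, state, now() - xact_start AS xact_age
      FROM pg_stat_activity
     WHERE xact_start IS NOT NULL
       AND now() - xact_start > interval '1 hour'
     ORDER BY xact_age DESC;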
On logical replication plugin
We have a logical replication plugin, written by Jochen Topf, installed in our Postgres. When osmdbt asks for logical replication changes using that plugin, Postgres parses the binary WAL logs, makes calls to the functions in the plugin, and effectively returns the logical replication data. It seems that PostgreSQL's built-in 1 GB allocation limit was exceeded, causing the function call to fail.
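For illustration, a hedged sketch of the kind of call osmdbt makes and of how the slot state can be inspected; the slot name 'osmdbt' is an assumption, not something confirmed in these minutes:

    -- Peek at pending changes without consuming them (matches the statement in the log above).
    SELECT * FROM pg_logical_slot_peek_changes('osmdbt', NULL, NULL);

    -- Inspect the slot itself: which plugin it uses and how far it has been confirmed.
    SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn
      FROM pg_replication_slots;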
On Postgres version and data compression
- pg_dump uses gzip compression by default.
- zstd is significantly faster, even single-threaded. It also supports parallelised compression, but pg_dump might not use that; pg_dump can do its own parallelisation per table, at least in the restore process.
- zlib is not known to be fast.
- Grant found some benchmarks and we could halve the time needed for a dump.
On complications that Tom ran into
- Issue: Fakelog's different sort order added complexity.
- Issue: Fakelog produces different output (1st column).
Suggestions
- Have Fakelog take a file with existing data and automatically exclude anything which was in that file from its output.
- Remove the transaction ID and the commit records from the start of the lines. This information hasn't been used in the years that osmdbt has been running, and the removal would make the files smaller.
Comments on suggestions
- Both suggestions seem doable to Jochen. Could make one of them optional.
- Implementing the changes would mean rolling out a new version of osmdbt, which would also require time from OPS.
- Changing the output format would involve changing the plugin, so that will add significant complications in terms of deploying it.
Other points mentioned during discussion
- 3 days is a long time for a transaction to be open.
- Tom found a related discussion thread about the same error message.
Alternative replication
mmd found a way which doesn't rely on this special plugin, but uses a generic plugin that comes with Postgres and outputs the data in JSON format. The code is in a branch of osmdbt. (Jochen did not have the time to look at it closely.)
pgoutput is what the built-in logical replication between servers uses.
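As a hedged illustration only (the slot name is an assumption; the actual approach is in mmd's osmdbt branch), a logical slot using a built-in plugin can be created with a standard function call:

    -- Create a slot decoded by the built-in pgoutput plugin.
    SELECT pg_create_logical_replication_slot('osmdbt_pgoutput', 'pgoutput');

Consuming a pgoutput slot needs a publication and either a replication connection or the binary peek functions, which is part of what would need testing before adopting this route.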
Decisions
- To not change the special plugin at the moment and only make changes to osmdbt - unless any additional issues occur.
- Revisit mmd's solution in the future - would require some testing.
Action items
- Tom to document what he did to manually get osmdbt to continue.
- Tricky to ensure that we don't miss anything or have duplicates. The backup system that we used looks for changes based on date, whereas normally you look at them based on the order in which transactions are committed (see the sketch after this list).
- Jochen to look at osmdbt and the filtering in Fakelog.
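A hypothetical illustration of the date-based fallback mentioned above (the table and column names are assumptions): selecting rows by timestamp can miss or duplicate edits relative to the commit-ordered replication stream, because commit order and timestamp order are not the same.

    -- Rows with a recent timestamp; a long-running transaction committed later can carry
    -- earlier timestamps, so a purely date-based window may miss or double-count changes.
    SELECT node_id, version, changeset_id, "timestamp"
      FROM nodes
     WHERE "timestamp" >= '2025-06-25 21:30:00+00'
     ORDER BY "timestamp";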
In the future we might move away from the PostgreSQL osmdbt plugin and instead follow up https://github.com/openstreetmap/osmdbt/issues/38, which looks good.
Horntail
Issue:
Horntail has had recurring soft lockups over the last few months, and they seem to be CPU-related. There was also a CPU error reported by the baseboard management controller.
About the server
- Is the secondary web server for planet.openstreetmap.org (Supermicro SYS-120U-TNR, Ubuntu 22.04).
- Is out of warranty and is running the latest BIOS.
- A second-hand replacement CPU for the same model (Scalable 3) would cost approximately GBP 500, as it is fairly recent.
- Needs space for the generation of the planet and the temporary state for the backups.
Potential causes
CPU power fluctuations could be caused e.g. by a hardware problem or a BIOS option. The machine is mostly idle, so it might be due to power saving.
Twin server Norbert
- Identical hardware, but shows no similar issues, even though it gets heavier usage.
- Needs to be rebooted after the planet dump finishes, due to 900+ days uptime.
Grant is running CPU burn-in tests to reproduce the issue.
Plan
- Short term: Reboot it and load the optimal defaults from BIOS.
- Long-term: Might need to get a new machine if the issues continue.
On planet dumps
Current status
- The files on planet.osm are the ones that exist physically on that server; it then just redirects you to S3, from where the files are downloaded.
- We tend to keep the old files.
- Grant is manually cleaning some of the back-ups, due to space limits.
Long term
- Grant wants to see if we can remove the intermediate step of local storage of planet files by redirecting to S3 lookups - effectively making planet.osm just a skin on S3 calls.
Action items reviewed at the beginning of the meeting
- 2025-06-12 Tom to look into plausibility of OSM.org Postgres upgrade: Tom will do a dry-run on a disconnected promoted slave to test the upgrade. The secondary will need to be re-synced after the upgrade. Need to confirm the downstream effect on planet-dump-ng. [Topic: OSM.org Postgres database]
- 2025-06-12 OPS to plan a maintenance window for the OSM.org Postgres database update. [Topic: OSM.org Postgres database]
- 2025-06-12 Grant to reply to Hector (board) and i) ask whether the email was a board request, ii) suggest leaving the technical implementation to the OWG, and iii) ask the board for guidance on how to modify the policy to cover this case. [Topic: Board question about adding a Wikimedia Italia fundraising banner on osm.org] - Done
- 2025-05-01 Grant to follow up with Australian hosting again. [Topic: OSUOSL funding / issues]
- 2025-05-01 Grant to see if other University offers are still available and what hardware would be required. [Topic: OSUOSL funding / issues]
- 2025-03-20 Grant to negotiate with HE.net if we can get better cost from them as a fallback link (which he had proposed), to allow budget spend elsewhere. [Topic: HE.net]
- 2025-03-20 Grant to run an SQL query to identify more email providers used by spammers. [Topic: Spam]
- 2025-03-06 Grant to present a draft budget at the next meeting.
- 2024-09-19 Grant to create an IP blocklist script. [Topic: Cloudflare keep enabled] - Discussion during 2024-07-25: OPS to make a reasonable evaluation whether to go with Cloudflare, Fastly or none.
Action items that have been struck through are completed, removed, or have been moved to GitHub tickets.