OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 5 October 2023, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Some topics have been reordered.
- Tom Hughes (OWG)
- Grant Slater (OWG)
- Paul Norman (OWG)
- Guillaume Rischard (OSMF board)
- Mateusz Konieczny (OSMF board)
- Sahil Dhiman
Minutes by Dorothea Kazazi, including notes by Grant.
New action items from this meeting
- 2023-10-05 Tom to put the ping connector on the Prometheus machine in Oregon. [Topic: Ridley shutting off]
- 2023-10-05 Paul to tag Grant on the ticket for moving OSMF emails out of Google. [AOB topic: Moving OSMF emails out of Google]
Action item: 2023-09-21 Grant to create a table on the cache headers that we send
- Tom started the ticket.
- Grant intends to add more detail about the zoom levels: how many of the tiles get viewed per zoom level, the percentage that we generate, and how many are regenerated.
- Suggestion: also contain all of the relevant cache headers.
- We set two cache-headers: expiry and last-modified.
Action item: 2023-09-07 Paul to create a GitHub template for the new Attribution repository
https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
Paul sent an email with some concerns and made it in the form of a circular.
[Topic: WordPress security] - Shared, but additional work required
- All users on all the older sections are disabled.
- As 2FA is enabled, if they're not a current user, they can't log in.
- Pending: just trimming the blog users down.
Supposed security issue
Topic added by Mateusz Konieczny (Board)
Not an operations issue. Vulnerabilities in OpenStreetMap operated servers and services should be reported by email to email@example.com.
Enable replication diff redirect to S3 (eu-central-1) ?
- planet files: Grant enabled redirection.
- replication diffs: redirection not enabled yet but can be done, as there is no versioning/issues.
- Paul Norman (pnorman) checked in 2018, when we redirected HTTP to HTTPS, whether Osmosis follows redirects. He added it to Trac, to allow updates.
- potential for some usage of curl.
- Osmosis version reported.
On whether looking at the logs will provide information about the old clients with problematic redirects
- Logs tell you the user agent, and whether that user agent follows redirects.
- Can provide information on whether Osmosis is very old or people are using curl.
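A rough sketch of the kind of log check described above, in Python. The log layout (combined-log-style lines with the user agent as the last quoted field), the user-agent prefixes, and the function names are assumptions for illustration, not the actual OSMF log format or tooling.

```python
import re
from collections import Counter

# Assumption: the user agent is the last double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def client_family(user_agent: str) -> str:
    """Map a raw user-agent string to a coarse client family."""
    ua = user_agent.lower()
    if ua.startswith("curl/"):
        return "curl"
    if "osmosis" in ua:
        return "osmosis"
    return "other"


def count_client_families(log_lines):
    """Count requests per client family across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[client_family(match.group(1))] += 1
    return counts
```

Run over a sample of requests for the replication diffs, the counts would give a first impression of how many clients are curl or Osmosis before any redirect is enabled.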
Other points mentioned during discussion
- There are clients that would not follow redirects, but those are clients that have not been updated in a long time and that we have no way to reach.
- The default behavior of curl is to not automatically follow redirects, which can lead to issues.
- Communicate about the redirects.
- Grant to do a quick pause, get data, and discuss this on IRC.
Communication Working Group eager for Operations Working Group "news" stories
- Talked with Communication Working Group (CWG) members and they want to publish a blog-post on the planet to S3 migration.
- Gave Amazon Web Services (AWS) a heads up that we will talk about them and the option for them to get involved.
- AWS might take about a week to respond.
- Reached out to Andy, on whether he wants to blog about his bootstrap work.
- Created blank reports for future dates.
- Blog posts on the Rails port work.
Other points mentioned during discussion
- Post to generate interest in our activities and see if there's anyone else who can help out.
- Grant sends to the foundation a weekly report on what he's been working on and has been adding that to the OWG reports - will do a pull request soon.
Outage Yesterday Discussion
- Permissions issue with Database.
- Chef can now manage database permissions, but would not automatically catch new tables; investigation needed. There are multiple database users, and a script that fixes permissions by restoring known-good permissions would not know how to set permissions for new tables. We should move detection of this earlier.
- Tom fixed the permissions by additions to Chef.
- Planet diff seems to have access to more tables - review needed.
- We cannot easily add a test to catch the permissions issue.
- Tom runs migrations as the Rails user, but the database is owned by the OpenStreetMap user. The Rails user had no permissions on those tables.
- Migrations carry permissions from the database by default. So they got all permissions for the OpenStreetMap user.
- Migrations don't manage the permissions model at all.
- Add a self identity check in the migration script, checking whether it is running as the user that owns the database.
- It wouldn't help.
- Tom to check whether Postgres granted permissions to the OSM user that owns the database when the new object was created, even though it was created by the Rails user.
- Having a repair script run after.
- It wouldn't help, because the permissions of a table which is missing would be unknown.
- The most it could do is alert someone.
- Move the detection earlier.
- We could probably have something which looks for any tables with suspicious permissions.
- The website code, which has a migration script, assumes that you're just using one database user, but on the production database we have different users with different permissions.
- Production operations issue that is outside the scope of the website repository where the migration lives.
- OSM user - owns the database.
- Rails user
- CGI map user
- planet dump and planet diff
- backup user
- plus a few additional ones
Other point mentioned during discussion
- Need to grant permissions when we create new tables.
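The idea of looking for tables with suspicious permissions could be sketched roughly as below. This is an illustration under stated assumptions, not existing OSMF tooling: the function name, the role names and the table names are hypothetical, and the grant rows are assumed to come from something like the Postgres `information_schema.role_table_grants` view. In line with the discussion above, it can only alert on a new table; it cannot know what the correct permissions should be.

```python
def find_suspicious_tables(tables, grants, baseline, required_roles):
    """Return {table: [reasons]} for tables whose permissions look wrong.

    tables: iterable of all table names in the database
    grants: iterable of (table, role, privilege) rows
    baseline: set of table names with known-good permissions
    required_roles: roles expected to hold at least one privilege on every table
    """
    # Collect which roles hold any privilege on each table.
    roles_by_table = {}
    for table, role, _privilege in grants:
        roles_by_table.setdefault(table, set()).add(role)

    report = {}
    for table in sorted(tables):
        reasons = []
        if table not in baseline:
            # New table: the right permissions are unknown, so only flag it.
            reasons.append("not in known-good baseline")
        for role in required_roles:
            if role not in roles_by_table.get(table, set()):
                reasons.append(f"no privileges for role {role}")
        if reasons:
            report[table] = reasons
    return report
```

A check like this could run right after a migration, so that a missing grant is noticed before users hit it.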
- HE was performing maintenance in London, linking AMS and DUB. Severe packet loss between those links and outside access that comes in via that route.
- Paul Norman raised an email ticket and then called. HE identified the maintenance issue, and an ETA was requested. HE partially resolved the issue (BGP update?), but the fix caused traffic from AWS (others?) to be dropped 100%. This was reported to HE and fixed 20 minutes later.
- 5% of the Fastly nodes were unreachable.
- It captured a lot of traffic, e.g. from Vancouver to Amsterdam and from Eastern Europe to Dublin, which go through that path.
- AWS took substantially longer: the rendering server to Amsterdam took 20-25 minutes.
- When their maintenance finished there was a spike of packet loss again, but much shorter in time.
- A flapping link can take down both links.
- HE.net said there was a low-light link in Amsterdam.
- Suggestion: Guillaume or Paul to stop over at Amsterdam on their way to or from State of the Map EU 2023 and swap the modules around on the patch cable.
- Grant has not mailed the power supplies to AMS yet, as Equinix has a policy to only keep things for 5 days for collection. Anything held for more than 5 days would need to be handled by remote hands, at additional cost.
Any other business
- Issue: if we change the file name, potentially we break some downloaders of the file, if it's not in the same format that they were expecting.
- There's no official S3 rename function. The only way you can rename is to copy the file to the new name and delete the old name.
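The copy-then-delete approach described above can be sketched as follows. This is a minimal illustration, assuming a client object that exposes `copy_object` and `delete_object` with the keyword arguments used by the boto3 S3 client; the function name and the bucket and key names are hypothetical. A single `copy_object` call is limited to objects of up to 5 GB, so larger files would need a multipart copy instead.

```python
def rename_object(client, bucket, old_key, new_key):
    """Emulate an S3 rename: copy to the new key, then delete the old key."""
    if old_key == new_key:
        # Copying onto itself and then deleting would lose the object.
        return False
    client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Delete only after the copy call has returned without raising.
    client.delete_object(Bucket=bucket, Key=old_key)
    return True
```

Between the copy and the delete both names exist, and after the delete any downloader still requesting the old name gets an error, which is the breakage risk raised above.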
Decision: keep things as they are.
120 issues in operations tracker
Grant has broken some tickets into multiple ones (e.g. the S3 one was split into 7).
University College London
- Blocked by Grant wanting to move the South African aerial imagery, which runs on Draco (G7).
- Low priority.
- Should be a higher priority as it is blocking the UCL move and we have two servers with memory issues there.
- Ridley (Site gateway, Foundation related sites) and Draco are more difficult to maintain and their power efficiency ratio is not comparable to modern machines.
Suggestion: Ridley (G6) migration (it's on the ticket).
- MediaWiki and WordPress are now fine.
- CiviCRM is okay with the version of PHP, but there was potentially an incompatible plugin.
- The staging website for supporting.osm.org broke and Grant used a backported version of PHP 7.
Ridley shutting off
- It will be migrated.
- When we had the HE.net outage, it was good having one of the site gateways on a different site. Helped with monitoring.
Action item: Tom to put the ping connector on the Prometheus machine in Oregon.
Moving OSMF emails out of Google
- Guillaume will reach out to Microsoft to see if the foundation can get a sponsored Office 365 license.
- Grant to try a paid version of Mailbox.org to see if the functionality actually works for what the foundation requires.
Action item: Paul to tag Grant on the ticket for moving OSMF emails out of Google.
Questions by NorthCrab
Where do conversations take place between Operations meetings
Who to get in touch with about OSM tech questions
- Grant can help.
- Please be mindful of the questions you ask and how long you expect people to take to respond to them.
- If you need data, it helps to supply queries that will provide the data you need.
Meeting adjourned 56 minutes after start.
Open Operations Tickets
Review open tickets: what needs policy and what needs someone to help.
- 2023-09-21 Grant to create a table on the cache headers that we send. [Topic: Surrogate key patch]
- 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
- 2023-06-29 Grant to put Martijn's policy for addition of OSM editors to the osm.org menu out for feedback. [Topic: Draft policy by Martijn van Exel]
- 2023-05-18 Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]
- 2023-05-04 [WordPress] Grant to share list of WordPress users with Dorothea and their response to keeping an account. [Topic: WordPress security] # Shared, but additional work required