Operations/Minutes/2023-11-02
OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 2 November 2023, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
Participants
- Tom Hughes (OWG)
- Paul Norman (OWG)
- Grant Slater (OWG)
- Guillaume Rischard (OSMF board)
- NorthCrab
Minutes by Dorothea Kazazi.
Absent
New action items from this meeting
- Paul Norman to update the pull request https://github.com/openstreetmap/openstreetmap-website/pull/4126 [Topic: OPNVKarte]
- Dorothea Kazazi to create accounts on the OSMF NextCloud server for the OWG members and add Grant as an admin. [Topic: AMS data center visit]
Reportage
GitHub template for missing attribution
- 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
At what point do we have someone from the OSMF looking to make sure that the action is reasonable?
- We decided that we needed to have some OSMF oversight before blocking sites that allegedly use OSMF tiles without attribution.
- Mateusz Konieczny and Guillaume Rischard are taking care of it.
On the GitHub repository for domains that use OSMF tiles and lack attribution
- Grant Slater has worked on the automation required to create the export list from the tile attribution repository, of the domains that should be blocked from receiving OSMF tiles.
- The process needs some documentation.
- If someone with the right privileges adds the "accepted" tag to a ticket on that repository, the domain remains on the list of domains blocked from receiving OSMF tiles. Once the ticket is closed the domain is removed from the list.
Suggestion: Automatically add the list to the Fastly dictionary.
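The rule described above for the "accepted" tag can be sketched as follows. This is a minimal illustration only, not Grant's automation: the Ticket structure, its field names and the example domains are assumptions, and only the label name "accepted" comes from the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical, simplified view of a ticket in the tile-attribution repository."""
    domain: str
    state: str  # "open" or "closed"
    labels: set = field(default_factory=set)


def blocked_domains(tickets):
    """Return the sorted domains whose ticket is still open and carries the 'accepted' label."""
    return sorted(
        {t.domain for t in tickets if t.state == "open" and "accepted" in t.labels}
    )


# Example data (made up for illustration).
tickets = [
    Ticket("example.org", "open", {"accepted"}),
    Ticket("example.net", "open"),                 # reported, not yet accepted
    Ticket("example.com", "closed", {"accepted"}), # resolved, so no longer blocked
]
print(blocked_domains(tickets))
```

A list produced this way is the kind of output that could then be loaded into the Fastly dictionary, as suggested.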
OPNVKarte featured layer
- Had an SSL certificate issue which was fixed.
- Concern: there have been repeated issues, and we are unable to get a response from the single person who supports the featured layer.
- Getting 404 errors on some of the non-cached tiles.
Suggestions
- Remove OPNVKarte.
- Grant offered to rebase.
Other point mentioned during discussion: Might not have time to rebase.
Action item: Paul Norman to update the pull request https://github.com/openstreetmap/openstreetmap-website/pull/4126
HE Network
The problematic link is up.
Dublin outage
- The genuine downtime we had was minimal: we had 1 hour with 20-30% packet loss.
- Our packet loss figure does not capture the full outage, as the link was out for parts of Europe.
- It depends where the traffic gets handed to HE.
- The outage impacted traffic from Eastern Europe going to Dublin.
- Various outages in the month.
- They clearly had other capacity, but it wasn't sufficient.
On refund
- We missed out on a refund for the Dublin outage because we didn't raise the issue with them.
- You can get a refund if:
- you raise the issue and they don't fix it within a two-hour period.
- there is a full internet outage, but the refund is still complicated. An email might be needed to claim it.
- The document doesn't make clear if there's a double dip option in a network outage scenario.
Suggestions
- Explicitly say if both sites are out, preferably by separate tickets.
- Paul to call sales person and ask what they plan to do to prevent future outages if a cable is lost.
- US case: a power issue in one of their own data centers, where they didn't have enough back-up power capacity.
- Different ISPs would be ideal, long-term.
- Anycast: announce the same /24 from both data centers, with traffic reaching one or the other depending on the ASN the user is coming from.
- We wouldn't want to do that because each data center is going to be using its own database and then we would get a replication problem with databases.
- The link would be the problem and how you run that tunnel - we can't run that tunnel across 4G.
On BGP
- BGP tunnel - the tunnel would have gone down.
- BGP, own subnet, 2 ISPs: we could change how we announce it.
- We don't have the capacity to take on the additional workload of BGP.
- Once BGP is set up it works, and we can get help from people like Clement.
Other points mentioned during discussion
- Both links will be end of life next year.
- Resiliency in AMS is better.
Tom disconnected 21'.
osm2pgsql-replication
25' Tom rejoined.
Related: https://switch2osm.org/serving-tiles/updating-as-people-edit-osm2pgsql-replication/
Paul did a setup with osm2pgsql for a client and it is easier than what we have.
On complaints
- We've had complaints about the algorithm we use.
- Most have been because of our version of the expirer. People have changed a relation and they're expecting a change.
- Recent case: Name of an island (relation), related to recent vandalism.
On expiring tile algorithm
- Custom, Ruby implementation, stored in Chef.
- It looks at the location of the nodes and the way nodes that are touched.
- It only takes an action once per invocation.
- The script will have to read a list of tiles, but it doesn't de-duplicate the list.
- It would de-duplicate tiles, but probably not meta tiles.
- It gives tile numbers which "renderd expire" (which marks the tiles as dirty) can read.
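The node-based approach and the de-duplication question can be illustrated with a short sketch. It is written in Python rather than Ruby and is not the expiry script stored in Chef: the function names are hypothetical, the conversion is the standard slippy-map tile formula, and the 8x8 meta tile size is an assumption.

```python
import math

METATILE = 8  # assumption: meta tiles cover 8x8 tiles


def lonlat_to_tile(lon, lat, zoom):
    """Standard slippy-map conversion from lon/lat (degrees) to tile x/y."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that coordinates on the map edge stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def dirty_metatiles(nodes, zoom):
    """Map touched nodes to tiles, then de-duplicate down to meta tile origins."""
    tiles = {lonlat_to_tile(lon, lat, zoom) for lon, lat in nodes}
    return sorted({(x - x % METATILE, y - y % METATILE) for x, y in tiles})


# Two nearby nodes fall in the same tile, and so in the same meta tile.
print(dirty_metatiles([(0.001, 0.001), (0.002, 0.002)], 12))
```

De-duplicating at meta tile level matters because neighbouring tiles in the same meta tile are rendered together, so listing each of them separately adds no extra work for the renderer but makes the list longer.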
Suggestions
- Have a PR to make it sleep for less than 1 minute between retries.
- We ignore relations completely and only handle nodes and ways. Changing a relation, whether its tags or its members, does not dirty any tiles. A way that spans a tile without having a node in it does not dirty that tile; only the tiles where the way has nodes are dirtied.
- We shouldn't push expiries all the way to the edge.
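The gap described above for ways that cross a tile without having a node in it can be shown with a small comparison. This is a sketch under assumptions, not the current expirer and not osm2pgsql's algorithm: the function names are hypothetical, and the segment version simply samples points along each segment, so it can miss a tile that is only clipped at a corner.

```python
import math


def tile_xy(lon, lat, zoom):
    """Fractional slippy-map tile coordinates for a lon/lat in degrees."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    return x, y


def tiles_for_nodes(way, zoom):
    """Tiles that contain at least one node of the way (node-only expiry)."""
    return {(int(x), int(y)) for x, y in (tile_xy(lon, lat, zoom) for lon, lat in way)}


def tiles_for_segments(way, zoom, steps_per_tile=4):
    """Approximate the tiles crossed by each segment by sampling along it."""
    tiles = set()
    for a, b in zip(way, way[1:]):
        ax, ay = tile_xy(a[0], a[1], zoom)
        bx, by = tile_xy(b[0], b[1], zoom)
        steps = max(1, int(max(abs(bx - ax), abs(by - ay)) * steps_per_tile) + 1)
        for i in range(steps + 1):
            t = i / steps
            tiles.add((int(ax + (bx - ax) * t), int(ay + (by - ay) * t)))
    return tiles


# A hypothetical two-node way, one degree of longitude long, at zoom 12.
way = [(0.0, 10.0), (1.0, 10.0)]
print(len(tiles_for_nodes(way, 12)), len(tiles_for_segments(way, 12)))
```

In this example the node-only approach dirties just the two end tiles, while the way actually crosses a whole row of tiles between them.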
Other points mentioned during discussion
- Servers are more reliable about updates, so we notice cases where they're not updating.
- There's no point moving to osm2pgsql-replication if we don't also move the dirty algorithm.
- With our load levels the issue is not how many tiles are being requested from the back end, it's how many renders it has to do.
Action item: Paul to sketch out the different components we would need to implement and how they relate.
Fastly purges
- Grant tested a Fastly purge script and it was very slow.
- Can't hit the 100,000 requests per hour limit sequentially, because their API responds so slowly.
- The limit is per account, so probably across multiple distributions.
- We can assume anything in the level one cache is also in the level two.
Suggestion: use fire headers.
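For scale, the rate limit mentioned above works out as follows. The 100,000 requests per hour figure is from the discussion. The 200 ms response time is a made-up value for illustration, not a measurement from Grant's test.

```python
MS_PER_HOUR = 3_600_000
LIMIT_PER_HOUR = 100_000  # per-account limit mentioned in the discussion


def latency_budget_ms(limit_per_hour=LIMIT_PER_HOUR):
    """Average time per request needed to reach the limit with sequential calls."""
    return MS_PER_HOUR / limit_per_hour


def sequential_purges_per_hour(latency_ms):
    """How many purges fit in an hour if each one waits for the previous response."""
    return MS_PER_HOUR // latency_ms


print(latency_budget_ms())              # milliseconds available per request
print(sequential_purges_per_hour(200))  # hypothetical 200 ms per API call
```

Sequential calls would have to average 36 ms each to reach the limit, so with slower responses the limit only becomes relevant if purges are sent in parallel or batched.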
Any other business
AMS data center visit
- Paul to visit the AMS data center on Thursday.
- Delivery expected soon, probably tomorrow - past customs.
- Grant had some trouble finding the correct tax code for customs.
Suggestion: shut down Karm (read-only database mirror for www.osm.org) completely, to switch both of the PSUs: going from 1600 watts to 1000 watts.
Action item: Dorothea to create accounts on the OSMF NextCloud server for the OWG members and add Grant as an admin.
Delivery of hard disks for Prometheus
Arrived, according to Lance.
- Suggestion: Go for RAID 6.
- Going from RAID 5 to RAID 6, we would have to add a disk.
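The need for an extra disk follows from the parity overhead: RAID 5 gives up one disk's worth of capacity and RAID 6 gives up two. A minimal sketch of the arithmetic, where the disk count and size are hypothetical and not the actual Prometheus configuration:

```python
def usable_tb(disks, disk_tb, parity_disks):
    """Usable capacity of a parity RAID array with equal-sized disks."""
    if disks <= parity_disks:
        raise ValueError("not enough disks for this RAID level")
    return (disks - parity_disks) * disk_tb


RAID5, RAID6 = 1, 2  # number of disks' worth of capacity used for parity

# Hypothetical array of 4 TB disks.
print(usable_tb(4, 4, RAID5))  # RAID 5 on four disks
print(usable_tb(4, 4, RAID6))  # RAID 6 on the same four disks
print(usable_tb(5, 4, RAID6))  # RAID 6 after adding a fifth disk
```

With the same number of disks RAID 6 has one disk less of usable space, so keeping the capacity unchanged means adding a disk.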
Question by NorthCrab about LVMs
- We used to use LVMs but it's painful, because we have hardware RAID on many of the machines.
- Used HP machines come with RAID as standard.
- We tend to allocate 100% of the storage.
- The hardware RAID controller used to be a pre-installed component, but is optional on Gen9 and Gen10.
- Every disk we get from now on is probably going to be NVMe.
Restore archive planet files to S3
Operations ticket 967: https://github.com/openstreetmap/operations/issues/967
Ongoing.
Grant's script currently doing:
- planet PBF files from 2021.
- full history from 2020.
- planet OSMs from 2022.
Other points mentioned during discussion
- It takes nearly two days to restore a file, but it restores a full year batch in one go.
- It may finish around the end of this week.
Open Ops Tickets
Review open tickets: what needs policy and what needs someone to help with.
- https://github.com/openstreetmap/operations/issues
- https://github.com/orgs/openstreetmap/projects/1
- https://github.com/orgs/openstreetmap/projects/1/views/2?filterQuery=-is%3Aclosed
Action items
- 2023-10-19 Grant to open an issue on the S3 script. [Topic: Proceed with moving replication Diffs to AWS? https://github.com/openstreetmap/operations/issues/971]
- 2023-09-21 Grant to create a table on the cache headers that we send. [Topic: Surrogate key patch]
- 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
- 2023-06-29 Grant to put Martijn van Exel's policy for addition of OSM editors to the osm.org menu out for feedback. [Topic: Draft policy by Martijn van Exel]
- 2023-05-18 Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]
- 2023-05-04 [WordPress] Grant to share list of WordPress users with Dorothea and their response to keeping an account. [Topic: WordPress security] - Shared, but additional work required