OpenStreetMap Foundation, Operations Meeting - Draft minutes
These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.
Thursday 19 October 2023, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co
- Tom Hughes (OWG)
- Paul Norman (OWG)
- Grant Slater (OWG)
- Guillaume Rischard (OSMF board)
- Sahil Dhiman
Minutes by Dorothea Kazazi.
New action item from this meeting
- Grant to open an issue on the S3 script. [Topic: Proceed with moving replication Diffs to AWS? https://github.com/openstreetmap/operations/issues/971]
2023-10-05 Paul to tag Grant on the ticket for moving OSMF emails out of Google
From AOB topic: Moving OSMF emails out of Google.
- Grant suggested to the board that we switch the OSMF email service from Google to Mailbox.org, which also provides video conferencing features.
2023-09-07 Paul to create a GitHub template for the new repository which will be only for cases of missing attribution from sites using our tiles
From topic: (With LWG) Issue template/checklist for blocking sites without attribution. New repository: https://github.com/openstreetmap/tile-attribution/
- Decision to cancel the action, as OPS was not confident in blocking sites without confirmation from OSMF that they lacked proper attribution and that blocking was the right course of action.
- Paul to send an email regarding this matter.
From topic: Draft policy by Martijn.
- The existing version contains numerous draft notes.
- Plan to release it as it is, allowing individuals to provide feedback and comments.
Timing of issues: ~10 days ago and on the 17th of October.
- Experienced 70% packet loss that was primarily load-dependent (correlated with latency patterns) and remained constant.
- TCP traffic appeared to be less impacted compared to ICMP.
- IPv6 exhibited lower packet loss than IPv4.
- Traffic was more affected going into Amsterdam compared to Dublin.
- Dublin primarily handles traffic from North America and West Africa.
- Amsterdam handles traffic from a majority of Europe, a significant portion of Africa, and Asia.
- Traffic between the two sites was affected.
- Traffic between Amsterdam and JANET experienced a greater impact compared to the traffic between Amsterdam and Dublin.
On impact on end-users
- Only one complaint on the OSM US Slack channel, where a user mentioned experiencing slower service.
- There might be some issues in parts of Africa, but it's worth noting that the network links in that region are generally of lower quality.
Concern: They didn't respond until Paul reached out to them directly.
On services in AMS that could be moved and served from DUB
- Run a tunnel between Slough and Dublin, with IPv6 over it.
- Not possible.
- Remove Amsterdam from the load.
- The majority of the traffic directed at Amsterdam originates from other European regions, rather than the UK.
Other point mentioned: There have been undersea cable outages in Africa in the last few weeks.
Decision: nothing to do short term except press on the SLA violation.
- It has been raised with the sales contact.
Suggested changes for the future:
- More responsibility for us to patch.
The Operations Working Group (OWG) has granted approval for Paul's travel to Amsterdam, where one of our data centers is. The expenses for this visit are equivalent to less than one hour of remote-hands support.
- Fiber link is degrading: signal loss is 0.9 dB/week.
- If the situation worsens, we may need to engage remote hands assistance, which can be a cumbersome process.
- There are additional tasks that require attention.
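As a rough illustration of why the degradation rate matters, the sketch below estimates how many weeks remain before a link runs out of optical margin at a steady 0.9 dB/week. The remaining margin is not stated in these minutes, so the 4.5 dB figure in the usage note is purely hypothetical, as is the function name.

```python
def weeks_until_exhausted(margin_db: float, loss_db_per_week: float = 0.9) -> float:
    """Estimate weeks until the remaining optical margin is used up.

    margin_db is a hypothetical input (not given in the minutes);
    0.9 dB/week is the degradation rate reported at the meeting.
    Assumes the loss keeps growing linearly, which may not hold in practice.
    """
    if margin_db < 0:
        raise ValueError("margin_db must not be negative")
    if loss_db_per_week <= 0:
        raise ValueError("loss_db_per_week must be positive")
    return margin_db / loss_db_per_week
```

For example, `weeks_until_exhausted(4.5)` would give roughly five weeks under these assumptions; the real figure depends on the actual link budget of the installed modules.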
Other points mentioned
- Grant to ship the power supplies and potentially some second-hand spare modules to Amsterdam, each costing GBP 12 including shipping.
- There are spare cables available on-site that we could use for replacement.
- The modules currently present are 10G.
- There is a module cleaner and a fiber cleaner at Amsterdam.
Proceed with moving replication Diffs to AWS?
Plan: move replication diffs to AWS, including state.txt
- Move to EU bucket - which has no delay.
- Accessing state.txt will redirect to S3.
- If you attempt to access a future diff file that doesn't exist, it will redirect to S3 and generate a 404 error from S3.
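A minimal client-side sketch of the behaviour described above, assuming the move keeps the existing public replication layout (a nine-digit, zero-padded sequence number split into three directory levels). The function names and the "apply" / "wait" / "error" labels are illustrative only and are not taken from any OSM tool.

```python
def diff_path(sequence: int) -> str:
    """Map a replication sequence number to its relative .osc.gz path."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence must fit in nine digits")
    digits = f"{sequence:09d}"
    return f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}.osc.gz"


def next_action(final_status: int) -> str:
    """Decide what a consumer does with the final HTTP status after redirects.

    A 404 for a sequence number ahead of the published state is treated
    as "not published yet" rather than as a hard failure.
    """
    if final_status == 200:
        return "apply"
    if final_status == 404:
        return "wait"
    return "error"
```

A consumer written this way would simply retry later when it asks for a diff that has not been published yet, so the 404 coming from S3 instead of the original server should make no practical difference to it.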
Highest priority: Eliminate the 404 error while files are still in the process of uploading.
Replication delay: Typically around 5 minutes during planet uploads, but has occasionally extended up to 13 minutes.
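One way to avoid a 404 while files are still uploading is to control the publishing order: copy data files to the redirect target first, expose them on the redirecting origin second, and publish the pointer file (such as state.txt) last. The sketch below illustrates that ordering with in-memory stand-ins. It is not the actual upload script; `InMemoryStore`, `publish` and the file names in the usage note are hypothetical.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for a bucket or web root; not a real S3 client."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def has(self, key: str) -> bool:
        return key in self.objects


def publish(files: Dict[str, bytes], pointer_key: str,
            target: InMemoryStore, origin: InMemoryStore) -> List[str]:
    """Publish files so that nothing is visible before its redirect target exists.

    Returns the order of operations as "store:key" strings.
    """
    log: List[str] = []
    data_keys = sorted(k for k in files if k != pointer_key)
    # 1. Data files reach the redirect target before anything points at them.
    for key in data_keys:
        target.put(key, files[key])
        log.append(f"target:{key}")
    # 2. Only then are they exposed on the origin that issues redirects.
    for key in data_keys:
        origin.put(key, files[key])
        log.append(f"origin:{key}")
    # 3. The pointer file is published last, target first.
    if pointer_key in files:
        target.put(pointer_key, files[pointer_key])
        log.append(f"target:{pointer_key}")
        origin.put(pointer_key, files[pointer_key])
        log.append(f"origin:{pointer_key}")
    return log
```

Calling `publish({"000/000/001.osc.gz": b"...", "state.txt": b"..."}, "state.txt", target, origin)` with two empty stores would record the diff on the target before the origin, and state.txt after both, so a client following state.txt should never be redirected to a file that is not there yet.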
Other points mentioned
- Grant reviewed the top 70 user agents for two days of traffic.
- Minimal disruptions anticipated from the move.
- Last week's redirects had issues because files were copied to planet.osm before Amazon S3.
- Suggestion: Grant to fix the script before planet goes live tomorrow evening.
- Guillaume submitted a PR and NorthCrab a patch on it, for enabling multiple uploads in parallel.
- The most popular file is actually published last.
- Full planet.osm
- Full planet.pbf
- Changeset discussions
Action item: Grant to open an issue on the S3 script.
Any Other Business
OWG 2024 budget
Paul needs to talk to the board - the accountant was absent.
- Clearly distinguish between OPEX and CAPEX and obtain information from the accountant regarding ongoing depreciation.
- Grant to send to Paul the depreciation sheet that he acquired from the accountant (depreciation: 5 year duration).
- Need more details about some very old hard drives.
On servers and drives
- Draco: 5 disks with over 11 years of power-on time that are still spinning.
- Ironbelly (site gateway): Replaced almost all spinning-rust hard-drives. It's now at the end of its operational life.
- No database servers with spinning rust - We've been using SSDs for approximately the last two years.
- We have other drives with spinning rust.
- Shenron (Mailing lists server, OSQA server for help.openstreetmap.org) and Dulcy (Nominatim geocoding server) have some spinning rust drives.
On the proposal to get a general-purpose server for Oregon
- Get a general purpose server for Oregon as Stormfly may not provide sufficient capacity for retaining Prometheus data.
- We have four HP 960:
- two that have been on for 2.5 years that have 40-50% of their lifespan remaining and
- two that have been on for ~ 5.0 years that have 75-77% of their lifespan remaining. Additionally, they have more up-to-date firmware compared to the first pair.
- There are available slots to accommodate four additional disks, specifically SSDs (not NVMes).
- Capacity: it is estimated that it will take us until April or May to reach full utilisation.
- There was a reset of Prometheus database around March 2023.
Other points mentioned during discussion
- This year we didn't spend nearly as much money as we were hoping to.
- This was achieved by saving on AWS costs and postponing the acquisition of a new database server.
- 18 months is the current data retention time on Prometheus.
- Write off very old hard drives in the next financial year.
- Be budget-conscious next year.
- Make efficient use of our existing resources, as we have a significant amount of unused capacity.
- Work on containerising.
- Acquire larger capacity disks.
Open Ops Tickets
Review open tickets: what needs policy and what needs someone to help with.
- 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
- 2023-08-24 Tom to see how trace simplification can be done. [Topic: Large scale GPX uploads] -> To be made into an issue
- 2023-08-24 Paul to email MapTiler. [Topic: MapTiler featured layer]
- 2023-08-24 Paul to open a ticket to accept GitHub and Wikimedia emails. [Topic: Validating user emails]
- 2023-06-29 Grant to put Martijn van Exel's policy for addition of OSM editors to the osm.org menu out for feedback. [Topic: Draft policy by Martijn van Exel]
- 2023-05-18 Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]
- 2023-05-04 [WordPress] Grant to share list of WordPress users with Dorothea and their response to keeping an account. [Topic: WordPress security] - Shared, but additional work required
2023-08-24 Paul to work on creating a FAQ in order to reduce incoming communications. -> To be turned into a ticket