Operations/Minutes/2024-07-25

From OpenStreetMap Foundation

OpenStreetMap Foundation, Operations Meeting - Draft minutes

These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.

Thursday 25 July 2024, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co

Participants

Minutes by Dorothea Kazazi and notes by Grant (beginning of sections).

Not present


New action items from this meeting

  • Grant to move the requirements regarding deployment of new production services that were mentioned during the 25 July meeting to the ops site in a "Policy" communication format. [Topic: Deployment of new production services]
  • Grant to determine the Cloudflare API call to block IPs, in order to deal with scrapers [Topic: Cloudflare keep enabled?]
  • OPS to make a reasonable evaluation whether to go with Cloudflare, Fastly or none. [Topic: Cloudflare keep enabled?]
  • Paul to follow up with Copernicus and see if we can get rendering servers from them. [Topic: State of the Map Europe 2024]

Deployment of new production services

Policy on reacting to requests to deploy new services in production without sufficient notice and on running of new unmanaged services.

  • Agreed launch schedule ahead of time, including Operations team.
  • Ops require a 7-day window to "production" deploy; complex sites may require additional time.
  • Communicate to the Ops Group, not individuals.

Topic related to the OSM 20 birthday website. Grant was asked to productionise it at very short notice. Suggestion to create a policy regarding requests to deploy new services in production and on running of new unmanaged services.

On this case and the machine hosting the OSM 20 birthday website

  • Grant had been asking when he could productionise the OSM 20 birthday website. On Monday he was told that people were still working on it and that they wanted the website to be deployed the next day.
  • Plugins were added to the birthday website until the time it went live.
  • The OSM 20 birthday website is currently hosted on fume (HPE ProLiant DL360 Gen10, Debian 12), which is intended as the location for community.openstreetmap.org
  • Assumption of no long-term plans to use it, after the OSM 20 birthday.
  • The plugins aren't being updated and WordPress isn't being updated because it's disabled from Chef, as they have a lot of unmanaged changes that still need to be moved back into Chef.
  • They might not have thought of the time required and might not understand the technical implications, like the website needing back-up.
  • There might be a problem with communication.

Suggestion regarding this case: Move the birthday website a couple of weeks after the birthday.

Policy suggestions
Set the right expectations from the start:

  • Schedule needed to be known in advance.
  • Give at least one week's notice for deploying new services in production, sometimes more depending on the service, instead of a couple of hours.

Points to be turned into a ticket and a policy

  • Agreed launch schedule ahead of time, including Operations team.
  • Ops require a 7-day window to “production” deploy, complex sites may require additional time.
  • Communicate to the Ops Group, not individuals.

Other points mentioned during discussion

  • There was also disagreement by OPS about the website being done in WordPress. OPS now have a ticket about a policy regarding WordPress websites.

Action item

Grant to move the requirements regarding deployment of new production services that were mentioned during the meeting to the ops site in a "Policy" communication format.

Post-meeting additions: https://github.com/openstreetmap/operations/issues/1125


Proposed CPU upgrades for odin / ysera

https://github.com/openstreetmap/operations/issues/1105#issuecomment-2227267807

  • odin - Tile server, Supermicro SYS-1029P-WTRT, Ubuntu 22.04
  • ysera - Tile server, Supermicro SYS-1029P-WTRT, Ubuntu 22.04


Xeon Gold 5120 -> Xeon Gold 6148 (40% improvement?)
https://www.cpubenchmark.net/compare/3154vs3176/Intel-Xeon-Gold-5120-vs-Intel-Xeon-Gold-6148
Xeon Gold 6148 is ~£140 each

"2nd Gen Intel® Xeon® Scalable Processors and Intel® Xeon® Scalable Processors
Dual Socket Socket P (LGA-3647) supported, CPU TDP supports Up to 205W TDP, Dual UPI up to 10.4 GT/s"

General agreement on the upgrade. Paul and Grant will finalise which CPU to upgrade to. We will upgrade ysera.


Cloudflare keep enabled?

Practical Issue:

  • Site deployments are more difficult: deploys need to be synced and also require a cache purge.
  • Scrapers are not being blocked at the moment because their access is masked by Cloudflare.

We should evaluate the benefits and negative points of the different options.

---

We have Cloudflare enabled - it's doing proxying. Performance: the website loads really fast, API feels sluggish.

  • Cloudflare might tell us at some point that the free plan is not suited to us, due to the data volume.
  • Fastly is also an option, as it also has DDOS protection.

Practical Issues

  • Site deployments are more difficult: deploys need to be synced and also require a manual cache purge. We need to ensure that all 3 front-ends switch at the same time.
  • Scrapers are not being blocked at the moment because their access is masked by Cloudflare. That might explain why the API seems a bit more sluggish and the difference in response size that we're seeing in CGI map.

On suggestion of sticky or semi-sticky mapping from client IP to machine

  • Probably won't work.
  • Issue: We previously relied on the fact that browsers would typically load all the assets from the same machine that they loaded the page from. They're unlikely to use a different IP address to load the subsidiary assets, whereas Cloudflare doesn't know that. So if somebody loads the page and then requests an asset, the two requests can go to different back ends. The asset request gets a 404 because that back end hasn't deployed yet, and Cloudflare caches that 404 for a while.
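
For illustration only, the sticky mapping idea discussed above can be sketched as a deterministic hash from client IP to back end. This is a minimal sketch, not something OPS runs: the back-end names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

# Hypothetical front-end names, used only for this illustration.
BACKENDS = ["frontend-a", "frontend-b", "frontend-c"]


def pick_backend(client_ip: str, backends=BACKENDS) -> str:
    """Map a client IP to one back end deterministically (sticky mapping)."""
    # Hash the IP so that the same client always lands on the same machine.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # The page request and the follow-up asset request from one client
    # resolve to the same back end.
    print(pick_backend("192.0.2.10"))
    print(pick_backend("192.0.2.10"))
```

As noted in the discussion, this only helps if whatever sits in front of the back ends applies the mapping per client. A proxy that picks a back end independently for each request gets no benefit from it.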

Related to Cloudflare cache rules
We have limited ability to control Cloudflare settings, as we're using the free tier. We have 10 cache rules and are currently using one.

  • Cloudflare is caching 404s.
  • Rules can be on the following request parameters: cookie, referrer, SSL, URL, user agent, request headers, cookie value, file extension.
  • We're interested in response parameters. There is a Cloudflare setting "browser cache TTL", to "determine the length of time Cloudflare instructs a visitor's browser to cache files"; however, it does not affect how long Cloudflare caches the 404.
  • Tiered caching is enabled on Cloudflare: cache queries are routed via the Cloudflare server closest to our servers, which becomes the primary cache server.
    • Not worth it with tiles.
    • Might work for DDOS attack.

Related to Cloudflare IP access rules

  • They don't disclose a limit on them.
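
As a starting point for the action item on blocking IPs, the sketch below builds (but does not send) a request to create an IP Access Rule. The endpoint path and payload shape follow our understanding of Cloudflare's v4 API and should be checked against the current Cloudflare documentation before use. The zone ID, token, IP address and note are placeholders, and `build_block_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_block_request(zone_id: str, api_token: str, ip: str, note: str = ""):
    """Build a POST request that would create a 'block' IP Access Rule.

    Assumption: the zone-level endpoint is
    /zones/{zone_id}/firewall/access_rules/rules (verify before use).
    """
    payload = {
        "mode": "block",
        "configuration": {"target": "ip", "value": ip},
        "notes": note,
    }
    return urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/firewall/access_rules/rules",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent to Cloudflare here.
    req = build_block_request("ZONE_ID", "API_TOKEN", "192.0.2.1", "scraper")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request would be a separate step (for example with `urllib.request.urlopen(req)`), once the call has been confirmed and a token with suitable permissions is available.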

Related to Cloudflare block rules

  • Cloudflare has a reasonable number of block rules.

Cloudflare

  • They don't seem to impose access restrictions on particular countries.
  • Privacy concern raised by some community members, also raised for Fastly.

On having Cloudflare or a CDN being in front of everything at the moment

  • In principle happy, ignoring scraper and deployment issues.

Other points mentioned during discussion

  • We need a CDN in front, for attacks.
  • Could use Fastly, as it also has DDOS protection, and have a pinned or semi-pinned mapping between backend and client IP.

Related to scrapers

  • People are not supposed to use generic user agents for requests - OPS has a policy.
  • Suggestion: Figure out if there are any obvious candidates for blocking scrapers and manually add them.
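
One way to look for obvious candidates is to count requests per client IP among those sent with a generic user agent. The sketch below is a hedged illustration. The list of generic prefixes and the request threshold are assumptions, not the OPS policy, and the input is assumed to be (client IP, user agent) pairs already extracted from logs with the real client IP recovered.

```python
from collections import Counter

# Hypothetical examples of generic user agents; not an official list.
GENERIC_AGENT_PREFIXES = (
    "python-requests",
    "curl/",
    "go-http-client",
    "java/",
    "wget/",
)


def is_generic_agent(user_agent: str) -> bool:
    """Return True for empty or library-default user agents."""
    ua = user_agent.strip().lower()
    return ua in ("", "-") or ua.startswith(GENERIC_AGENT_PREFIXES)


def blocking_candidates(requests, min_requests=1000):
    """Return (ip, count) pairs for heavy users of generic user agents.

    `requests` is an iterable of (client_ip, user_agent) tuples.
    Results are ordered from most to fewest requests.
    """
    counts = Counter(ip for ip, ua in requests if is_generic_agent(ua))
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]


if __name__ == "__main__":
    sample = (
        [("192.0.2.1", "curl/8.0")] * 3
        + [("192.0.2.2", "JOSM/1.5")] * 5
        + [("192.0.2.3", "")]
    )
    print(blocking_candidates(sample, min_requests=2))
```

The output is only a shortlist for manual review. A high request count with a generic user agent does not by itself show that a client is a scraper.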

On performance issues

  • Performance issues started on the night before the DDOS attack, when we did the RAM changes on the server.
    • An index might need rebuilding, or it is necessary to restart the daemons, to flush out dead connections (recommended by mmd).
    • One of the tables does seem to be doing a lot more index scans.
    • Ability to turn it on all the time?

Action items

  • Grant to determine the Cloudflare API call to block IPs, in order to deal with scrapers.
  • OPS to make a reasonable evaluation whether to go with Cloudflare, Fastly or none.

Any other business

State of the Map Europe 2024

Paul attended the State of the Map Europe 2024 conference and talked with Copernicus. They were a sponsor of the conference, they are located in Warsaw and Frankfurt, and use our standard tile layer. They also have a cloud provider (CloudFerro).

Action item

Paul to follow up with Copernicus and see if we can get rendering servers from them.

Upgrade CPUs in Ironbelly

Related to ticket - openstreetmap/operations Upgrade CPUs in ironbelly (Issue #1124)

Ok with the upgrade.


Action items reviewed at the beginning of the meeting

  • [2024-07-11] Grant to find equivalent storage options for Karm and Eddie [Topic: DB server drives]
  • [2024-07-11] OPS to consider whether to add extra filters. [Topic: DDOS]
  • [2024-06-27] Grant to send an announcement about the OAuth status on Monday. [Topic: OAuth]
  • [2024-06-27] Paul to see if we're still appropriately balanced on the CDN and then OPS to decide on upgrading the RAM of Odin. [Topic: rhaegel usage?]
  • [2024-06-27] OPS to do capacity planning for tile.openstreetmap.org [Topic: rhaegel usage?]
  • [2024-05-30] OPS to add the SDRP requirement to the Editor Policy draft and see what feedback we receive. [Topic: Editor Policy] # On the 2024-06-13 agenda
  • [2024-05-02] OPS to revisit the OpenMapTiles application. # 2024-06-13 They haven't responded to the questions. Paul to email them again.
  • [2023-05-18] Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]

Action items that have been stricken-through are either completed, or have been moved to GitHub tickets.