Working Group Minutes/EWG 2013-04-01
Attendees
| IRC nick | Real name |
|---|---|
| apmon | Kai Krueger |
| Firefishy | Grant Slater |
| iandees | Ian Dees |
| tmcw | Tom MacWright |
| zere | Matt Amos |
Summary
- Long lines rendering issue
  - pnorman reported an issue where very long (world-spanning) lines were created in the osm2pgsql rendering database, causing strange visual artefacts and long rendering times.
  - There was discussion of which component was likely to exhibit the problem, and of possible work-arounds.
  - ACTION: zere to try to reproduce the long-way issue, find out which component is affected, and file a ticket as appropriate.
- API outages
  - The question was raised of how editors can find out about API outages other than by simply getting failed API requests.
  - ACTION: zere to add something to the capabilities call to make this information available to editors.
  - Discussion of whether the pingdom monitoring was fine-grained enough to cover all the components of the API.
IRC Log
17:04:27 <zere> welcome, everyone. hope the new time hasn't been confusing
17:05:04 <zere> minutes of the last meeting: http://www.osmfoundation.org/wiki/Working_Group_Minutes/EWG_2013-03-25
17:05:13 <zere> please let me know if anything needs changing
17:06:06 <zere> pnorman mentioned an issue last meeting which we didn't have time to properly discuss, so let's start with that today.
17:06:51 <zere> the issue was that a large way was created spanning pretty much the whole world, and it had a bunch of unfortunate downstream effects in osm2pgsql, mapnik and friends
17:09:33 <zere> apmon, you're the most knowledgable about that part of the stack - are there specific things we should be breaking out of this, or is it a straightforward bug we should file a ticket against (in osm2pgsql / renderd / mapnik)?
17:10:29 <apmon> that part is most likely a mapnik issue (which I know relatively little about)
17:11:38 <apmon> I haven't looked at the issue in any detail though
17:11:45 <zere> all i could think of for mapnik would be that a large way would cause precision issues when agg scaled to fixed point numbers.
17:11:53 <apmon> I do think it has happened in the past though as well
17:12:04 <zere> but i thought osm2pgsql chopped up long ways to avoid that sort of thing?
17:12:39 <apmon> I'd have to check if it does (and if that is still working)
17:15:30 <apmon> It does look like there is code in osm2pgsql that splits long geometries
17:15:40 <apmon> I'll need to investigate that further then
17:16:03 <zere> ok. worth filing a ticket so it doesn't get lost, or not?
17:16:31 <apmon> probably yes. But I guess it isn't clear yet which component exactly it applies to
17:16:49 <apmon> probably most helpful would be if someone can reproduce the issue in a small local extract
17:17:17 <zere> indeed. do we have any volunteers for that?
17:17:30 * zere looks in pnorman's general direction
17:17:32 <apmon> e.g. see if one can reproduce it my taking a liechtenstein extract, then moving one node in a way to australia and see what happens
17:18:19 <zere> not sure you'd even need the liechtenstein extract. i suspect it would be enough to start with an empty db, then add a single long way in a diff.
17:18:50 <zere> i supposed, having just said it was easy, that means i should volunteer ;-)
17:19:08 <apmon> :-)
17:19:18 <zere> #action zere to try to reproduce long way issue and find out what component is affected and file a ticket as appropriate
17:19:37 <zere> i'm guessing the easter weekend means we're underpopulated here...
17:19:48 <zere> but does anyone have anything else they'd like to discuss?
17:20:36 <iandees> anything we should share about the outage?
17:22:14 <zere> from a software perspective, not so much... my understanding is that it was a hardware failure and Firefishy went to heroic lengths to fix it on a holiday sunday.
17:22:51 <zere> is there anything we should share in addition to what's been shared already?
17:23:34 <apmon> I guess a question is can something be done to reduce the probability of Firefishy needing to go out to the DC on a holiday sunday
17:23:49 <apmon> although I suspect that is more of an issue for the OWG than EWG
17:24:12 * iandees nods.
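
The minimal reproduction zere volunteers for above could look something like the following sketch: a tiny hand-written .osm file containing a single two-node way that spans most of the globe, imported into an empty rendering database. The ids, coordinates, tags and the osm2pgsql invocation mentioned in the comment are illustrative assumptions, not a tested recipe.

```python
# Hypothetical minimal test case for the long-way issue: one way whose two
# nodes sit on opposite sides of the world. Import into an empty database
# with something like "osm2pgsql --create -d gis longway.osm" and render the
# area to see which component (osm2pgsql splitting, renderd, mapnik)
# misbehaves. Ids, coordinates and tags are placeholders.
OSM_FILE = """<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="longway-testcase">
  <node id="1" lat="51.5" lon="-0.1" version="1"/>
  <node id="2" lat="-33.9" lon="151.2" version="1"/>
  <way id="1" version="1">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>
"""

with open("longway.osm", "w") as handle:
    handle.write(OSM_FILE)
```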
17:24:20 <iandees> I was thinking mostly guidance for communication
17:24:28 <iandees> but i didn't scroll far enough down on the blog post
17:24:36 <iandees> which is fine
17:24:55 <iandees> (http://blog.osmfoundation.org/2013/03/31/database-maintenance/ is what i'm referring to)
17:24:57 <apmon> Perhaps one question for EWG, is how did the editors handle the API outage
17:24:58 <zere> the plan is there for an extra server capable of being failed-over-to. it's just that reality pre-empted the plan somewhat. should have moved quicker, with the benefit of hindsight
17:25:27 <apmon> Did many people loose a lot of data because they did edits and then couldn't upload them?
17:26:15 <apmon> Particularly the period of read-only rather than offline might have caused issues to josm and non osm.org potlatch users
17:26:46 <iandees> also, is there any software work that can be done to automatically fail over to "read-only" mode so the website is at least responsive?
17:26:48 <apmon> all editors should probably have a good warning system if the api is read-only
17:27:58 <zere> yes, i'm not sure what potlatch users could do about it. with josm there's always the option of saving the change for later, but i don't think potlatch or iD has that, do they?
17:28:22 <zere> in an ideal world they shouldn't have to, of course...
17:29:02 <apmon> my guess would be a good proportion of non expert JOSM users might not know how to deal with a read-only API either
17:29:34 <tmcw> zere: iD stores changes in localStorage so they can be recovered
17:29:44 <tmcw> but doesn't have much in terms of conflict resolution
17:30:00 <tmcw> is there a call to check if the api is read-only?
17:30:11 <zere> tmcw: cool.
17:32:11 <apmon> I think someone said the api/capabilities call doesn't provide that info
17:32:30 <zere> "writing" API calls will return service unavailable if the api is read only
17:32:45 <zere> but not capabilities - and that information isn't available there
17:32:57 <zere> i'll quickly add that...
17:33:58 <tmcw> please update https://github.com/systemed/iD/issues/1224 when there's some info
17:34:07 <iandees> zere: capabilities will return a 200 when others are returning 500's (thus pingdom requests a node)
17:35:18 <zere> i think capabilities was also returning 500. it did for me at times. i think because something in there references an activerecord class which wasn't loaded
17:35:40 <apmon> looking at the log of OSM-HealthCheck, it looks like some calls returned 500 whereas others returned service unavailable during offline mode
17:36:31 <iandees> i was thinking about other downtimes, not this most recent one
17:36:39 <zere> i wonder if database unavailability is something that's unit-testable within rails...
17:37:57 <apmon> Not sure if database unavailability is testable, but the db-offline and db-readonly mode probably should be
17:38:57 <zere> indeed, but i think any tests for db-offline mode when the database isn't really unavailable are giving us false passes.
17:39:48 <zere> iandees: i think i've missed your point, then... could you elaborate, please?
17:39:57 <apmon> you might be able to set a broken db config, if you can change the config on a per test case
17:41:02 <iandees> zere: there were times when people were complaining about map calls or individual node/way endpoints not responding but pingdom didn't alert because it was only sending alerts for when capabilities was failing
17:42:50 <zere> yeah, should be possible to cover more of the API with the pingdom tests. i'll ask Firefishy.
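
For reference, a sketch of the sort of check an editor could make once the capabilities call carries a read-only flag, as zere offers to add above. The <status> element and its attribute names are assumptions for illustration; at the time of the meeting this information was not yet exposed by the API.

```python
# Illustrative only: poll /api/capabilities and look for a read-only flag.
# The "./api/status" element and its "api" attribute are assumed names, not
# what the API returned at the time of this meeting.
import urllib.request
import xml.etree.ElementTree as ET

CAPABILITIES_URL = "https://api.openstreetmap.org/api/capabilities"

def api_status():
    with urllib.request.urlopen(CAPABILITIES_URL, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # e.g. <status database="online" api="readonly" gpx="online"/>
    status = root.find("./api/status")
    if status is None:
        return "unknown"   # server does not expose the flag
    return status.get("api", "unknown")

if __name__ == "__main__":
    state = api_status()
    if state == "readonly":
        print("API is read-only: keep edits saved locally and retry the upload later")
    else:
        print("API state:", state)
```

An editor could run a check of this kind before an upload and warn the user, rather than letting the upload fail with a service-unavailable error.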
17:43:09 <iandees> right now it's doing capabilities and a node
17:43:15 <iandees> by id
17:43:27 <apmon> the online / offline / readonly status is in application.yml. So if you can set that in a unit test case you should be able to set the config settings in database.yml to simulate a non existing db server
17:45:52 <Firefishy> We are running out of pingdom checks, but yes. Pingdom can do lookups and parse results for text eg to pass.
17:46:34 <zere> ah, we only have a certain number of endpoints before we need to upgrade to a more expensive account?
17:46:42 <Firefishy> Yip.
17:46:48 <Firefishy> Although I can likely trim some.
17:48:35 <apmon> how expensive is a more expensive account?
17:49:13 <apmon> if it would be helpful, then presumably that is something that osmf could pay for?
17:49:13 <Firefishy> 15 extra checks are $9.38/month
17:49:20 <Firefishy> Yip
17:50:20 <apmon> That is amazingly expensive for what it sounds like it is doing, but still affordable it seems
17:53:05 <zere> i dunno, just over $100/year doesn't sound excessive, given the complexities of pinging these URLs from a bunch of servers all over the world and making sure that it's reporting real failure rather than intermediate network flakiness.
17:53:27 <zere> i mean, on top of the several hundred that the existing package costs ;-)
17:54:01 <zere> but then we're only looking at doing something /full or /history and the map call to cover all the bits of software, right?
17:55:09 <zere> speaking of which, i had a random question - is C++11 acceptable nowadays? if i ported cgimap over to require that, do you think it would cause problems?
17:57:13 <apmon> as cgimap isn't directly general purpose software, I think as long as it works on the osm servers it is more or less acceptable
17:57:28 <Firefishy> zere: It would be good if the /api/capabilities call report API readonly status
17:58:12 <zere> yup, i'll get right on it
17:59:02 <zere> anything else anyone wanted to discuss?
18:06:08 <zere> thanks to everyone for coming. hope to see you next week - same time: 5pm UTC.
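
To illustrate the wider monitoring coverage discussed above (the map call, /full and /history in addition to capabilities and a node by id), a rough external health-check sketch follows. The object ids, the bbox and the endpoint selection are placeholders, not the actual pingdom configuration.

```python
# Rough sketch of a broader API health check than "capabilities plus one node
# by id": hit a handful of representative read endpoints and report failures.
# Ids and the bbox are placeholders, not a real monitoring configuration.
import urllib.request

API = "https://api.openstreetmap.org/api/0.6"

CHECKS = {
    "capabilities": "https://api.openstreetmap.org/api/capabilities",
    "node by id":   API + "/node/1",
    "way full":     API + "/way/1/full",
    "node history": API + "/node/1/history",
    "map call":     API + "/map?bbox=-0.01,51.50,0.00,51.51",
}

def run_checks():
    failures = {}
    for name, url in CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                if resp.status != 200:
                    failures[name] = "HTTP %d" % resp.status
        except Exception as exc:   # HTTP 4xx/5xx, timeouts, DNS errors
            failures[name] = str(exc)
    return failures

if __name__ == "__main__":
    for name, error in run_checks().items():
        print("FAIL %-12s %s" % (name, error))
```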