We have Horde set up with Workflows and a triage channel. It works pretty well: most issues are handled by the HashedIssueHandler or ScopedIssueHandler, and end up getting grouped together in our Slack triage channel.
When we get a Systemic Issue, a message also gets posted to the triage channel. It is not grouped, and our triage channel occasionally gets spammed by nearly identical systemic issues.
Is there a way to route systemic issues to a different triage channel, or filter them so that Horde admins can see them, but make them less noisy for people looking at code/content issues?
Steps to Reproduce
We have Horde set up with workflows and a triage channel, and get XGE-based systemic issues. The systemic issues spam the triage channel.
Hey there Steve,
Thanks for your question - it’s an interesting one as I haven’t spent too much time in this system.
Of note: a related question regarding fingerprinting has come up in the [Content removed], which describes some of the uniqueness checks and their cadence. I only mention this because it relays some relevant information on how we fingerprint issues, and how they can be grouped into a single Slack message (and updated across streams if a similar fingerprint was detected). I’m also trying to understand whether there are N systemic issues being created (their fingerprints are different, and as a result they’re being sent far and wide because they’re perceived as different), or whether something else is amiss and we aren’t grouping them as we should be (details later in reply).
Assuming that all the fingerprints are unique, and you’re getting a valid explosion of XGE errors (some type of system outage, etc.), there doesn’t appear to be anything that would allow you to explicitly suppress these particular issues from being sent out via the SlackNotificationSink, or to send them to a different triage channel. One relevant piece of extra configuration we use is the “triage type aliases”.
As an example in your workflows:
"triageTypeAliases": { "Systemic": "SOME_ALIAS" },
This will specifically tag the alias for the Systemic event IDs. We can see this referenced in SlackNotificationSink.cs. Now, with some minor divergence in the code we *should* be able to redirect to a special triage channel just for systemic issues, but to my earlier point, no such built-in feature appears to exist.
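For context, here is a minimal sketch of where that setting might sit inside a workflow definition. It assumes the usual `workflows` list with `id` and `triageChannel` fields; the id, channel, and alias values are placeholders rather than anything from a real config:

```json
{
  "workflows": [
    {
      "id": "example-workflow",
      "triageChannel": "EXAMPLE_TRIAGE_CHANNEL",
      "triageTypeAliases": { "Systemic": "SOME_ALIAS" }
    }
  ]
}
```

Note that this only changes who gets tagged on systemic issues; the message would still be posted to the same `triageChannel` as everything else in the workflow.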
For reference, XGE events are not derived via our structured logging system, but instead through XoreaxEventMatcher. Thinking aloud, the issue key seems to be stream-template-nodename, so I can see where a single job that goes wide over N nodes may precipitate many unique systemic fingerprints. I wonder if we could make some type of change here to allow for better grouping in later consumption? If we are in this scenario where we think there should not be so many unique fingerprints, can you provide some of the server log with “unique fingerprints”? I’d be curious what comes up here. For reference, the documentation around fingerprint matching logic is here.
[mention removed] - in case he has any other info here, or I’ve missed something.
Kind regards,
Julian
Re: grouping systemic issues - I’m not sure this would help much. The XGE issue is sporadic, so the issues get closed automatically when the next build runs, and even if they were grouped, they would open again the next time they appear.
The triageTypeAlias sounds promising for locating where I could make mods to try and change either the slack channel, or try to change the workflow handling the issue. I’ll take a look.
Hey Steve,
Thanks for the feedback. Re: grouping - yeah, sounds good. I didn’t think it would help, but we don’t use XGE too much internally, so I wasn’t sure if perhaps the fingerprinting wasn’t working effectively.
Keep me up to date on how it fares for ya - could be the start of an interesting user story!
Kind regards,
Julian