Event Driven Network Automation with StackStorm & WebEx ChatOps

The bulk of network automation I tend to see is what I call ‘proactive’ automation, where a task or process has been broken down into its component parts and coded up in a language or framework, ready to be run at the behest of the network administrator/engineer. Examples of this approach could be firewall policy extraction and conversion for a migration to a new platform, or even something as ambitious as the automated provisioning of a new VPN onto a provider’s MPLS infrastructure when they onboard a customer.

Coming from a Network Operations background, I still like to think of ways Network Automation can be leveraged in response to faults and outages, and to reduce the time to recovery. Let’s be honest, Murphy is real and things will break or go down when you least want them to. With this in mind, I started to look at what is available in the space of ‘event’ driven network automation and, to be honest, there isn’t much. I assume a lot of large organisations (FAANG) write their own bespoke internal tooling for this. However, the two main frameworks I came across were SaltStack and StackStorm.

Both seem like great projects. I decided to look into StackStorm, and it’s a great framework: essentially an IFTTT (if this, then that) engine.

Take the topology below: the peering between r1 and r2 is my SUPER-important OSPF adjacency. Wouldn’t it be nice to be notified about any issues on this as soon as possible?

Super Important OSPF Adjacency

In the world of remote work, people are communicating more than ever over their IM clients, so why not have notifications of outages and network issues pushed directly into the IM client along with some basic triage? If a peering goes down, it could be an issue with the interface itself, so let’s have those details as soon as possible without having to log on to the device.

I created a simple bot in Cisco’s Webex Teams in a mock ‘Network Operations’ room.

The rough and ready Python script below interacts with the bot and posts a message in the room. It performs some simple parsing of the syslog message passed to it by StackStorm, logs on to the device and grabs the interface details associated with the downed adjacency.

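import json
import re
import sys

import requests
from netmiko import ConnectHandler

# Arguments passed in by StackStorm: the device address, then the raw syslog message
device = sys.argv[1]
syslogmessage = sys.argv[2]

# Pull the interface name out of the syslog text ("... on <interface> from ...")
# and strip the surrounding spaces captured by the pattern
interface = re.search('on(.+)from', syslogmessage).group(1).strip()
print(interface)

dev = {
    'device_type': 'cisco_ios',
    'host': device,
    'username': 'admin',
    'password': 'password123'
}

# Log on to the device and collect the interface details
net_connect = ConnectHandler(**dev)
interface_output = net_connect.send_command(
    'show interface {}'.format(interface))
net_connect.disconnect()

# Bot token and room ID are redacted; supply your own values here
bot_token = 'REDACTED'
room_id = 'REDACTED'
message = 'OSPF Peering Failure Detected On Device {} \n {}'.format(
    device, syslogmessage)
# No trailing slash, so the path below does not produce a double slash
api_url = 'https://webexapis.com'

headers = {'Authorization': 'Bearer {}'.format(bot_token),
           'content-type': 'application/json'}
payload = {
    "roomId": room_id,
    "markdown": message + '\n' + interface_output
}

# Post the notification into the Webex room
post_message = requests.post(
    url=api_url + '/v1/messages', headers=headers, data=json.dumps(payload))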
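The interface parsing in the script above relies on the syslog message containing "on <interface> from". As a minimal sketch of that step in isolation, the function below uses a slightly stricter pattern and returns None instead of raising when nothing matches. The sample message is an assumption, written in the usual Cisco IOS adjacency-change style, and is not taken from the lab; `extract_interface` is a name made up for this illustration.

```python
import re

# Hypothetical sample, written in the usual Cisco IOS adjacency-change style.
SAMPLE = ("%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on Ethernet0/1 "
          "from FULL to DOWN, Neighbor Down: Interface down or detached")

# Capture the single whitespace-free token sitting between "on" and "from".
IFACE_RE = re.compile(r"\bon\s+(\S+)\s+from\b")


def extract_interface(syslog_message):
    """Return the interface name found in the message, or None if absent."""
    match = IFACE_RE.search(syslog_message)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_interface(SAMPLE))
```

Returning None lets the caller decide what to do with a message it cannot parse, rather than failing with an AttributeError in the middle of a workflow run.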

How are we going to trigger the above script? This is where StackStorm comes in. The extensibility of StackStorm is really appealing: with lots of community-written ‘packs’ available, we can just leverage one of these. I have used the syslog listener “Ghost2Logger” (https://github.com/StackStorm-Exchange/stackstorm-ghost2logger).

StackStorm (ST2) works on the concepts of triggers, rules and actions. I have created a rule in ST2 for our critical peering that uses the ghost2logger pack. We will look at the rule’s trigger and action in a moment.

The rule will trigger our workflow, which in turn runs our pre-written ChatOps script and a basic “remediation script” that bounces the interface associated with the peering.
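The remediation script itself is not shown in this post. A minimal sketch of what bouncing an interface with Netmiko could look like follows, assuming the same cisco_ios connection parameters as the notification script. The function names `bounce_commands` and `bounce_interface` are made up for this illustration and are not the author's code.

```python
def bounce_commands(interface):
    """Config lines that shut the interface and bring it straight back up."""
    return ['interface {}'.format(interface), 'shutdown', 'no shutdown']


def bounce_interface(device_params, interface):
    """Push the bounce commands to the device and return the session output."""
    # Imported here so the command builder can be used without Netmiko installed.
    from netmiko import ConnectHandler

    conn = ConnectHandler(**device_params)
    try:
        # send_config_set enters config mode and sends each line in order.
        return conn.send_config_set(bounce_commands(interface))
    finally:
        conn.disconnect()


if __name__ == "__main__":
    print(bounce_commands('Ethernet0/1'))
```

Keeping the command list separate from the connection handling makes it easy to check what would be sent before pointing the script at a real router.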

This is the metadata file (below) associated with our OSPF remediation workflow. It defines the pack the workflow belongs to and the entry_point that points to the actual workflow tasks.

craig@ubuntu:/opt/stackstorm/packs/examples/actions$ cat ospfremediationsmeta.yaml 
---
name: ospf-remediations-seq
pack: examples
description: ospf-workflow
runner_type: orquesta
entry_point: workflows/ospfremediations.yaml
enabled: true
parameters:
  notify_script:
    required: true
    type: string

  remediate_script:
    required: true
    type: string

  cwd:
    required: true
    type: string

The actual workflow definition (below) receives three inputs: the name of the “notify script”, the name of our “remediate script” and the working directory from which to run these scripts.

craig@ubuntu:/opt/stackstorm/packs/examples/actions/workflows$ cat ospfremediations.yaml 
input:
  - notify_script
  - remediate_script
  - cwd

tasks:
  task1:
    action: core.local cmd=<% ctx(cwd) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

        do: task2

  task2:
    action: core.local cmd=<% ctx(notify_script) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>
        do: task3
  task3:
    action: core.local cmd=<% ctx(remediate_script) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>
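The workflow above is a plain sequence: each task only hands over to the next one when it succeeds. As a rough Python analogy of that control flow (not how Orquesta is implemented), the sketch below runs shell commands in order and stops at the first failure. The function name `run_sequence` and the example commands are made up for this illustration.

```python
import subprocess


def run_sequence(commands):
    """Run shell commands in order; stop at the first non-zero exit code."""
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        results.append({"cmd": cmd, "rc": proc.returncode,
                        "stdout": proc.stdout, "stderr": proc.stderr})
        if proc.returncode != 0:
            # Mirrors "when: succeeded()": later tasks never start.
            break
    return results


if __name__ == "__main__":
    for step in run_sequence(["echo notify", "echo remediate"]):
        print(step["cmd"], "->", step["rc"])
```

In practice this means a failed notification step would prevent the remediation step from running, which is worth bearing in mind when ordering the tasks.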

The rule’s trigger, action and workflow inputs are defined in the ST2 WebUI. You can also edit these in the CLI, but I find the GUI OK for rapid testing.

The ghost2logger.pattern_match trigger gives us three values to match on: trigger.host, trigger.message and trigger.pattern. Leveraging regex, I have matched OSPF neighbor down events.
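The exact regex used in the rule is not shown here, so the pattern below is only an assumption about what a neighbor-down match could look like. It is written against the usual Cisco IOS adjacency-change message format, and both sample messages are illustrative rather than captured from the lab.

```python
import re

# Hypothetical pattern: an OSPF adjacency change that ends in the DOWN state.
OSPF_DOWN_RE = re.compile(r"%OSPF-5-ADJCHG:.*\bto DOWN\b")

# Illustrative messages in the usual IOS format.
DOWN_MSG = ("%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on Ethernet0/1 "
            "from FULL to DOWN, Neighbor Down: Interface down or detached")
UP_MSG = ("%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on Ethernet0/1 "
          "from LOADING to FULL, Loading Done")


def is_ospf_neighbor_down(message):
    """True when the message reports an adjacency moving to DOWN."""
    return OSPF_DOWN_RE.search(message) is not None


if __name__ == "__main__":
    print(is_ospf_neighbor_down(DOWN_MSG), is_ospf_neighbor_down(UP_MSG))
```

Matching on the final state rather than on the mnemonic alone keeps the rule from firing when the adjacency comes back up, which generates the same ADJCHG message type.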

Based on this, we can feed the values into our script arguments with standard Jinja2 syntax ({{trigger.host}} and {{trigger.message}}).
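One thing to watch when passing trigger values as script arguments: the syslog message contains spaces, so it has to reach the script as a single argument for sys.argv[2] to hold the whole message. How the rule here quotes its values is not shown, so the sketch below only illustrates the idea with Python's shlex module. The script path and the `build_command` helper are hypothetical.

```python
import shlex

# Illustrative message in the usual Cisco IOS format.
SAMPLE = ("%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on Ethernet0/1 "
          "from FULL to DOWN, Neighbor Down: Interface down or detached")


def build_command(script_path, host, message):
    """Build a shell command line with every argument safely quoted."""
    parts = ("python3", script_path, host, message)
    return " ".join(shlex.quote(part) for part in parts)


if __name__ == "__main__":
    # /opt/scripts/notify.py is a made-up path for this example.
    cmd = build_command("/opt/scripts/notify.py", "192.168.137.65", SAMPLE)
    print(cmd)
    # Splitting the way a shell would shows the message survives as one token.
    print(len(shlex.split(cmd)))
```

Without the quoting, the shell would split the message on whitespace and the script would only see its first word as the second argument.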

Now that the workflow and rule are finished, let’s test it!

It’s a normal day in the NOC, all quiet in the WebEx chat….

Suddenly we get a WebEx Teams notification that there has been an OSPF peering failure on device 192.168.137.65, along with the output of show interface Eth0/1!

If we look closer at the output, we can see that the peering interface has been shut down! That explains why the peering has gone down.

Based on our automated ST2 workflow logic, a remediation script should have run as part of this workflow to bounce the interface… let’s check on the router to see if it has sorted everything out for us.

Awesome, the peering is back up and we didn’t even need to log in to the router manually.

Let’s check the StackStorm run history to see if the workflow ran as expected.

We can see that ospf-remediations-seq ran by issuing: sudo st2 execution list

We can get more details about the workflow by checking out its execution-id – we can see the workflow and all 3 associated tasks succeeded!

Pretty Sweet!

StackStorm seems like a really useful framework and I will definitely be checking it out and seeing what else it has to offer!

scripts/workflow files: https://github.com/thecraigus/stackstorm

One thought on “Event Driven Network Automation with StackStorm & WebEx ChatOps”

  1. Hey Craig. Really appreciate your time and effort in writing this article.

    Like you mentioned, I tried to do some research about auto remediation based on syslog, and there really is none… There are no ‘start’ to ‘finish’ tutorials out there. But you’ll always hear people throw out buzz words saying it can be done via ‘ansible’ but never show the missing pieces of the puzzle. So I’m grateful for this article that you have put together.

    I do have some questions because I’m trying to set this up now in my network… I’m at the part where you use chatops to output the ‘show interface’ stats in the channel… Is there a way I can integrate this with Slack instead?

    May I also know how your rule is configured in the GUI? Like what did you put for the trigger, criteria and action….? That would be so helpful for me if you could..

    Thanks so much!
