Using Oracle Cloud Infrastructure (OCI) Free Tier to deploy an HA containerized web application

Firstly, I'm not a front-end developer – nor do I aspire to be. This project started out as an exercise to get more familiar with Go's net/http package. Creating a janky web app (with probably the worst UI known to humankind) was just a means to an end (see the evidence below). I found an awesome API (https://the-one-api.dev) that allowed me to indulge my Lord of the Rings obsession whilst exploring Go's net/http package…

Once I’d created the app, I decided to package it as a container and just kind of archive it. Before I knew it I was experimenting with ways to host/deploy it. Obviously, the cloud was a no-brainer, and what better cloud platform than OCI?

OCI’s free tier gives access to some great always free resources (including 2 free AMD compute instances) for you to lab, study, and experiment with. I highly recommend checking it out. https://www.oracle.com/uk/cloud/free/

If you are interested in the Go source code for my web app, it's in the below repository along with the Dockerfile:

https://github.com/thecraigus/tolkienwebapp

The actual docker image is over at:

https://hub.docker.com/repository/docker/clcartlidge/tolkienweb/general

The Terraform configuration file manages the application lifecycle, provisioning and destroying the resources as required. Below is a breakdown of the elements of this stack. Again, all are available on the OCI free tier.

  • 2 x AMD micro VM compute instances (running Oracle Linux + Docker), spread across 2 fault domains
  • 1 x Load Balancer
  • 1 x Load Balancer Listener
  • 1 x Backend Set
  • 1 x VCN
  • 2 x Subnets (one web tier, one load-balancer tier)

The Terraform configuration is available at:

https://github.com/thecraigus/tolkien-oci-terraform/blob/master/tolkien-portal-ha.tf

The OCI Terraform provider is very well documented and there are lots of pre-written examples you can use to get yourself up and running in no time.

Time to run terraform:

Plan:

Apply:

The remote-exec provisioner allows us to bootstrap the compute instances to install Docker, pull the image from Docker Hub, and start the application!

Let's check out our newly created resources!

Great, our 2 free compute instances are up and running.

And so is our OCI Load Balancer.

Let’s browse and give our app a road test.

Hit search

Thanks for reading guys.

Happy Labbing!

Creating Terraform Providers from YANG Data With JTAF (Juniper Terraform Automation Framework)

Starting out on my NetDevOps journey I didn't really have much love for IaC frameworks *cough Ansible cough* – the whole concept of trying to implement code logic in a domain-specific language didn't sit right with me. Ansible just felt like unnecessary abstraction: any attempt at implementing advanced logic and the wheels quickly fell off, not to mention the debugging hell. For IaC I much prefer something like Nornir, where plays can be written in an actual Turing-complete programming language.

However, for all its faults, I've learned to like Ansible – and by the same token, I've learned to really like Terraform. Where Terraform provides an advantage is in its full configuration lifecycle, which really lets us put the concept of immutable infrastructure into practice (yeah… so can NETCONF with a full config replace, I suppose). And if you are already using Terraform to manage your public/private cloud presence, it can be helpful to use the same toolset for your traditional network infra as well as your cloud resources.

A few vendors have produced officially maintained Terraform providers for various products. However, one of the issues is that network vendors often have a large portfolio of different platforms and associated software versions – creating and maintaining Terraform providers for each of these would be an absolutely massive task. The clever bods over at Juniper Networks realized this and have created a tool that allows customers to create their own 'lean/targeted' Terraform providers based on the features/platforms/software versions they are running.

JTAF uses the publicly available Juniper YANG models to create Terraform providers. As the project is pretty new, there are a limited number of examples around, so I thought I'd go through my experience of creating a Terraform provider for Juniper vSRX 19.2R1 to manage security policy.

Pre-Reqs

There are 2 application pre-requisites before we can use JTAF:

Please ensure Golang and Pyang are installed and operational on your machine before proceeding

Download JTAF/YANG

The first place to start is by downloading the JTAF tool itself.

Clone the repo (https://github.com/Juniper/junos-terraform) to your local machine as below:

Also download the Juniper YANG model repository (https://github.com/Juniper/yang.git) – these YANG models are the baseline that our custom provider will be built upon.

Environment Setup

Create a JTAF config file in the TOML markup format; this instructs JTAF where the YANG models, xpath file, and output provider directory live. Give your provider a name – ours will be called "vsrxsec".

Please ensure that the output directories exist as JTAF will look for these as part of the provider creation process

craig@ubuntu:~/Desktop/jtaf-exampes$ cat config.toml 
yangDir = "/home/craig/Desktop/jtaf-exampes/yangfiles"
providerDir = "/home/craig/Desktop/jtaf-exampes/terraform_providers"
xpathPath = "/home/craig/Desktop/jtaf-exampes/xpath_sample.xml"
fileType = "text"
providerName = "vsrxsec"

Identify Required YANG Models

In order to build our provider, we need to identify the YANG models that relate to the resources we want to create. The way I do this is by looking at the Juniper CLI output in XML and deducing the hierarchy from there – I'm sure there are much more elegant solutions, but that was the way I did it 🙂

Looking at the below, in order to create the address-book we are under the security hierarchy in the structure.

Identify the YANG models from the downloaded YANG repository and copy them over to the yangDir path that is denoted in your config.toml file.

One thing I did notice is that there is a dependency on the 'common' YANG models for your chosen software version, so ensure these are copied to the yangDir path along with the targeted YANG models. In the interest of eliminating any dependency issues, I am copying all of the junos-es (SRX) 19.2 YANG models into our yangDir folder.

Process YANG

Now the YANG files have been identified and copied to the required working directory, we can use the first half of the JTAF tool to process these yang modules into their counterpart YIN (an XML representation of YANG) and an XPath reference file.

Navigate to the cmd/processYANG directory of the cloned JTAF repo and build the go file with go build. This will present an executable binary that we can use to process the identified YANG files.

Run the process-YANG binary with the -config flag set to the location of the config.toml file; this allows JTAF to find the identified YANG models. Now play the waiting game!… This will take a bit of time – go and have a shower, a hot bath, cook dinner, grab a 20-minute power nap… whatever takes your fancy 🙂

Identify Xpaths

After the process-YANG stage of JTAF has run, there should be equivalent .yin and xpath files for each of the YANG modules in the yangDir directory indicated in the config.toml file.

.yin files are simply xml representations of the YANG data

The xpath files will help us in identifying the resource capabilities that we want to build out in our provider.

As we are looking to manage security policy, let's identify the xpaths that we need to feed into JTAF to ensure it builds out the provider correctly. As shown above, the best way I have found to do this is to actually build out some configuration in Junos first and then look at the XML representation of it to deduce the hierarchy (don't worry, you don't have to commit it – just rollback 0!).

One of the 'guidelines' for JTAF is to build using the smallest possible unit of concern – so instead of identifying the top-level '/security' xpath, let's try to be as granular as possible in identifying the xpaths we need.

Taking the below, to find the xpath for the ip-prefix address book we need to identify the security/address-book/address/ip-prefix xpath string

Let's open junos-es-conf-security@2019-01-01_xpath.txt and validate the xpaths we require; we can see that security/address-book/address/ip-prefix is a valid xpath expression.

We can continue the same logic to identify all the required xpaths we need as shown below and use these to build out our terraform provider.

  • create address-book prefix (/security/address-book/address/ip-prefix)
  • match policy source (/security/policies/policy/policy/match/source-address)
  • match policy dest (/security/policies/policy/policy/match/destination-address)
  • match policy application (/security/policies/policy/policy/match/application)
  • apply then permit (/security/policies/policy/policy/then/permit)

Modify the xpath file that is referenced in the config.toml file with the identified xpaths and save.

Process Providers

Now we have all the pieces in place to generate our terraform provider!

Navigate to the junos-terraform/cmd/processProviders dir and build the go file using go build and then run the binary with the config flag set to the location of the config.toml file.

This part doesn't take long at all in comparison to processing the YANG data earlier.

Now all the component pieces have been generated to actually build the terraform provider from the source. Navigate to the terraform_providers output directory that was specified in the config.toml file and build the provider as below:

Great, we have compiled the provider binary – we just now need to add it into the terraform plugins directory so that we can reference it in our HCL!

Using the provider

I suppose one of the things that JTAF doesn't generate is the documentation required to actually use the provider. However, this is relatively trivial to figure out: by looking at the Golang source files for the provider, you can deduce the required inputs if you know how to read a Golang struct.

I have written some HCL that leverages our new provider (on my GitHub for reference) – let's run through the Terraform lifecycle with it.

Running terraform init pulls in our new provider with no issues.

Running terraform plan we can see the execution plan of what resources will be created.

Running terraform apply creates our resources

Let's check the configuration on our vSRX – we can see the security policy has been created successfully!

Note on Mutability…

If we need to change any of the configuration that is managed via Terraform on our devices, we need to taint the commit resource for that device using 'terraform taint …..' to ensure the commit resource is recreated.

Conclusion

Well, that's it – creating a custom Terraform provider using JTAF! I hope you found this somewhat useful; if you want to check out the HCL I used, it's all on my GitHub (https://github.com/thecraigus/jtaf-terraform-labbing).

I also encourage you to check out Chris Russell's JTAF blog! https://nifry.com/2022/02/18/junipers-terraform-automation-framework-jtaf/

Provisioning MPLS L3 VPN’s w/async Python + RestConf

Introduction/Ramblings

https://github.com/thecraigus/mpls-auto-provision/tree/master

When initially getting into code and marrying it with network engineering, I wasn't overly obsessed with speed of execution – my code ran (most of the time) and that was that. Whatever I scripted tended to be faster than manually typing it out box-by-box anyway, and any attempt at micro-optimization tended to be the death of progress for me. I would obsess over timing a function and rewriting it rather than making significant progress on overall functionality – should I do a standard loop or a list comprehension!?

However, there comes a time when looking at code execution and efficiency makes sense when looking at the big picture. Parallel execution of configuration changes against network devices typically fits this use case.

There are multiple schools of thought when it comes to this in the Python world: one is threading and another is asyncio.

For whatever reason, in my studies and general labbing I have always gravitated down the path of writing async code vs threads. I'm unsure why – I think I just found async code a little easier to read/write and understand conceptually in the beginning – so I guess I ran with this approach. A great breakdown of the differences and use cases is outlined here (http://masnun.rocks/2016/10/06/async-python-the-different-forms-of-concurrency/).

I recently wanted to put together a small proof of concept in my lab for provisioning MPLS L3 VPNs with async RestConf API calls. Hopefully it gives an outline of writing async code in a network engineering context and maybe inspires you to start coding/swap to async (if you don't already) 🙂

Lab Overview

In our hypothetical scenario, we administer an MPLS L3 VPN provider infrastructure; whenever we onboard a new customer we have to manually configure each PE node with the new customer VRF, BGP configuration, and service interface southbound to the access network.

With the projected growth of the company, this process needs to be streamlined, error-free, fast, and able to ensure configuration consistency across the network. ('Fast' is all relative, I suppose, but in a code context we should look to eliminate the use of blocking functions and try to do as much as we can at the same time.)

In terms of hardware, the MPLS backbone network consists of 4 CSR1000v routers (3 acting as PE nodes) with the RestConf API exposed. The access network is administered by a different BU and there is currently no initiative to automate this functionality.

Code Overview

As we are using RestConf, we need to choose an HTTP library for working with the API. requests seems like an obvious choice; however, as requests does not support asyncio (and fast is one of our requirements) we will be using aiohttp.

The async and await keywords are of particular importance when working with asyncio. Prefixing a function with the async keyword essentially transforms it into a 'coroutine' and lets Python know that it should be run asynchronously inside the event loop, while the await keyword hands control back to the event loop and can be used to ensure we don't block execution of our code. From the function, we return the awaitable object.

In the context of the below – we are essentially saying “don’t wait for the API response, see if there is anything else to do”
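
To make that concrete, here is a rough sketch of what one of those coroutines can look like – this isn't the exact code from the repo, and the RESTCONF URL and payload are just placeholders:

import aiohttp

async def push_config(session: aiohttp.ClientSession, host: str, payload: dict) -> int:
    """PUT a RESTCONF payload to one PE node without blocking the event loop."""
    # Illustrative URL only - the real resource path depends on the YANG model being targeted.
    url = f"https://{host}/restconf/data/Cisco-IOS-XE-native:native/vrf"
    headers = {"Content-Type": "application/yang-data+json",
               "Accept": "application/yang-data+json"}
    # 'await' hands control back to the event loop while the router is working,
    # so the same coroutine can be in flight against the other PE nodes.
    async with session.put(url, json=payload, headers=headers, ssl=False) as response:
        return response.status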

The above logic is repeated throughout the code; there are specific class methods that create the configuration on the target devices (VRF, MP-BGP, and service interfaces).

If we look at the main function, it is again defined as a coroutine. We instantiate a class for each of our PE nodes, and by using asyncio.create_task() we can package together tasks that can be executed concurrently.

I broke the create-VRF, update-MP-BGP, and provision-service-interface steps out into three separate tasks with an async sleep between them. If I ran them all within the same task I hit some dependency issues / issues with the NETCONF datastore syncing – I'm assuming this isn't the most elegant way to do it, but it works!

We are passing our main coroutine into the event loop and instructing it to run until complete.

(Please place passwords in environment vars)
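
Pulling it all together, the overall shape is roughly the sketch below – a hedged outline rather than the repo code, reusing the push_config coroutine from the earlier sketch with placeholder hosts and payloads:

import asyncio
import os
import aiohttp

async def main():
    password = os.environ.get("PE_PASSWORD", "admin")        # credentials from the environment, not the source
    pe_hosts = ("10.0.0.1", "10.0.0.2", "10.0.0.3")
    vrf_payload = {"Cisco-IOS-XE-native:vrf": {}}            # placeholder body

    auth = aiohttp.BasicAuth("admin", password)
    async with aiohttp.ClientSession(auth=auth) as session:
        # Stage 1: push the VRF to every PE concurrently.
        vrf_tasks = [asyncio.create_task(push_config(session, host, vrf_payload)) for host in pe_hosts]
        await asyncio.gather(*vrf_tasks)

        # Short pause before the dependent BGP/interface stages (the same pattern is repeated for each stage).
        await asyncio.sleep(5)

# Hand the main coroutine to the event loop and run it to completion
# (asyncio.run(main()) is the more modern equivalent).
asyncio.get_event_loop().run_until_complete(main())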

Code Run

Let's onboard a new customer!

In my config.ini file I have the details of a new customer that we want to create an L3 VPN for. The customer information would normally be in a backend database or presented through an API – however, for a proof of concept this will do.

Other Burger chains are available – however, BK is king for a reason 🙂
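
For completeness, reading that file is just a few lines of configparser – the section and key names below are hypothetical and will differ from the actual file in the repo:

import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Hypothetical layout - adjust to match the real config.ini.
customer = config["new_customer"]
print(customer["name"], customer["vrf_name"], customer["route_distinguisher"])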

Let's quickly check our PE nodes – we only have 3 VPNs configured thus far.

Run the code! – seemed to go without a hitch

Let’s check the PE nodes! – Looks like our new L3 VPN has been created as expected and prefixes are being exchanged.

Conclusion

There it is, async restconf calls! As always, all code is available on my github.

I appreciate you can’t see the speed in a writeup but trust me it’s substantially quicker than using a blocking library such as requests 🙂

All code below:-

https://github.com/thecraigus/mpls-auto-provision/tree/master

Unit Testing Network Infrastructure w/ pyATS

There’s nothing better than a greenfield deployment. The infrastructure has been deployed to the ‘Gold Standard’ in terms of design practices and all required optimisations have been put in place from a network engineering perspective to ensure the best user and application experience.

In reality, these infrastructures don't live in an isolated environment; moves, adds, and changes often end up causing drift from the initial design. Verbose documentation that outlines the configuration of the environment quickly becomes obsolete as the network adapts to its new requirements.

From an operational perspective, the optimal operating state of the infrastructure becomes lost and abstracted through over-reliance on high-level monitoring tools – NOC teams often resort to 'blob-watching' (dots going from green to red…) as opposed to ensuring the network is operating at its 'Gold Standard'. What even is the intended operating state of the infrastructure, and how can we quickly determine drift? Cross-referencing that initial design document against CLI show commands? Forget about it.

pyATS

This is where pyATS comes in. pyATS is a Python-based testing framework geared towards network infrastructure. (There are a number of abstractions and additional modules in the pyATS ecosystem, such as Genie parsers and a CLI wrapper; this write-up will focus on the core Python library.) Developed and maintained by Cisco internal engineering, it's now (and has been for a while) available for the outside world to use. And yes, it supports multiple vendors! Just check out the Unicon docs (https://pubhub.devnetcloud.com/media/unicon/docs/user_guide/supported_platforms.html) – Unicon is the underlying connection library that pyATS uses.

Take the below topology:

I have a number of unit tests that validate the optimal operating state of the infrastructure.

  • Both R1 and R2 should have 1 OSPF peering – no less, no more.
  • The next hop for R2's loopback from R1 should be via 10.250.10.2
  • The next hop for R1's loopback from R2 should be via 10.250.10.1
  • NXOS-A should have 2 eBGP peers in an 'Established' state.

Let's code it up!

pyATS requires the infrastructure that you are testing against to be defined in a YAML testbed file. In my case, this is my_testbed.yaml:

testbed:
    name: my_topology
    credentials:
        default:
            username: craig
            password: Redacted!
        enable:
            password: Redacted!

devices:
    R1: 
        os: ios
        type: ios
        connections:
            vty:
                protocol: ssh
                ip: 192.168.137.35
    R2:
        os: ios
        type: ios
        connections:
            vty:
                protocol: ssh
                ip: 192.168.137.36
    NXOS-A:
        os: nxos
        type: switch
        connections:
            rest:
                class: rest.connector.Rest
                ip: 192.168.137.37
                credentials:
                    rest:
                        username: craig
                        password: Redacted!

For R1 and R2 we are using the SSH connector within unicon, however for NXOS-A I decided to leverage the REST API for something a bit different.

In order to connect to device APIs with pyATS/unicon, you will have to install the rest connector package (https://developer.cisco.com/docs/rest-connector/)

The code below is our actual 'test script'. In this script we code up the test requirements articulated above, which capture our infrastructure's operational gold standard.

Tests are decorated with the aetest.test decorator; we can write individual tests, or loop over infrastructure components that share a common test case with the aetest.loop decorator.

from pyats import aetest
import re

class CommonSetup(aetest.CommonSetup):

    @aetest.subsection
    def check_topology(self,
                       testbed):
        ios1 = testbed.devices['R1']
        ios2 = testbed.devices['R2']
        nxos_a = testbed.devices['NXOS-A']

        self.parent.parameters.update(ios1 = ios1, ios2 = ios2,nxos_a = nxos_a)


    @aetest.subsection
    def establish_connections(self, steps, ios1, ios2, nxos_a):
        with steps.start('Connecting to %s' % ios1.name):
            ios1.connect()

        with steps.start('Connecting to %s' % ios2.name):
            ios2.connect()

        with steps.start('Connecting to %s' % nxos_a.name):
            nxos_a.connect(via='rest')

@aetest.loop(device = ('ios1', 'ios2'))
class CommonOspfValidation(aetest.Testcase):
    @aetest.test
    def NeighborCount(self,device):
        try:
            result = self.parameters[device].execute('show ip ospf neighbor summary')
        except:
            pass

        else:
            neighborcount = re.search(r"FULL.+(\d)",result).group(1)
            assert  int(neighborcount) == 1

class R1_ValidateEgressTransit(aetest.Testcase):
    @aetest.test
    def R2NextHop(self,ios1):
        try:
            result = ios1.execute('show ip route ospf')
        except:
            pass

        else:
            nextHop = re.search("2.2.2.2.+via (\d+.\d+.\d+.\d+)",result).group(1)
            assert  nextHop == '10.250.10.2'

class R2_ValidateEgressTransit(aetest.Testcase):
    @aetest.test
    def R1NextHop(self,ios2):
        try:
            result = ios2.execute('show ip route ospf')
        except:
            pass

        else:
            nextHop = re.search("1.1.1.1.+via (\d+.\d+.\d+.\d+)",result).group(1)
            assert  nextHop == '10.250.10.1'

class NXOS_A_Unit_Tests(aetest.Testcase):
    @aetest.test
    def peer_r1(self,nxos_a):
        try:
            result = nxos_a.rest.get('/api/mo/sys/bgp/inst/dom-default/peer-[10.250.100.2]/ent-[10.250.100.2].json')
        except:
            pass

        else:
            operState = (result['imdata'][0]['bgpPeerEntry']['attributes']['operSt'])
            if operState != 'established':
                self.failed('peer not established')

    @aetest.test
    def peer_r2(self,nxos_a):
        try:
            result = nxos_a.rest.get('/api/mo/sys/bgp/inst/dom-default/peer-[10.250.100.6]/ent-[10.250.100.6].json')
        except:
            pass

        else:
            operState = (result['imdata'][0]['bgpPeerEntry']['attributes']['operSt'])
            if operState != 'established':
                self.failed('peer not established')
class CommonCleanup(aetest.CommonCleanup):

    @aetest.subsection
    def disconnect(self, steps, ios1, ios2):
        with steps.start('Disconnecting from %s' % ios1.name):
            ios1.disconnect()

        with steps.start('Disconnecting from %s' % ios2.name):
            ios2.disconnect()

if __name__ == '__main__':
    import argparse
    from pyats.topology import loader

    parser = argparse.ArgumentParser()
    parser.add_argument('--testbed', dest = 'testbed',
                        type = loader.load)

    args, unknown = parser.parse_known_args()

    # hand the loaded testbed to aetest and kick off the test script
    aetest.main(**vars(args))

The script follows a Setup/Test/Cleanup structure, with the unit tests defined in the Test section – this is where the test logic lives. Let's examine a single unit test from our script. The below is the unit test for NXOS-A's peering to R1.

class NXOS_A_Unit_Tests(aetest.Testcase):
    @aetest.test
    def peer_r1(self,nxos_a):
        try:
            result = nxos_a.rest.get('/api/mo/sys/bgp/inst/dom-default/peer-[10.250.100.2]/ent-[10.250.100.2].json')
        except:
            pass

        else:
            operState = (result['imdata'][0]['bgpPeerEntry']['attributes']['operSt'])
            if operState != 'established':
                self.failed('peer not established')

  1. We connect to the API of NXOS-A and perform a GET request to retrieve the status of the peer 10.250.100.2.
  2. We examine the result of the API request; if the state is not 'established', we fail the unit test.

We run our test script and specify the testbed YAML file at runtime with the --testbed flag.

The results of our unit tests are below with the status of each one in a nice tree format. It looks like all our unit tests have passed!

Let's break something and run these tests again – we'll make a configuration change to break the peering from NXOS-A to R1 and re-run the tests.

We can see that the unit tests have now failed, specifically on the NXOS_A unit tests for peer R1. If we look at the log output we can see our failure reason.

Summary

Awesome! Again, this was a simplistic example, but as with anything in the coding arena, the only limit is your imagination.

All code available on my github

Model-Driven Streaming Telemetry with TIG Stack (IOS-XE)

In the SDN/NetDevOps era, it would be unfair to leave network monitoring behind. Monitoring and general network ‘observability’ is going through just as much of a transformation as the configuration management of the devices themselves.

SNMP, although highly structured by design, can be a bit of a management nightmare for the uninitiated. Trawling through endless OID structures, dealing with unsupported MIBs, and battling with protocol inefficiencies are just a few of the drawbacks one faces when using SNMP to target specific data.

With the advent of YANG data modeling, we can use essentially the same data modeling schema that we use for declaratively defining our configuration intent as well as specifying the data we want to retrieve for network observability purposes – Goodbye to the dark art of OID trawling! One data modeling language to rule them all.

So YANG can help us easily identify the data we want – but how about visualising it? This is where the TIG stack comes in. TIG is made up of three open-source components; without getting too heavy, their general purpose is outlined below.

  • Telegraf – The Data Collection Piece
  • InfluxDB – The Data Storage Piece
  • Grafana – The Data Visualisation Piece

Getting a TIG stack instance up and running for testing/labbing is pretty straightforward – we can use docker-compose to quickly stand up our stack (docker-compose file on my GitHub).

Let's ensure the Telegraf configuration is set to receive our streaming telemetry data before standing up our environment.

In the telegraf.conf file we are leveraging the built-in cisco_telemetry_mdt plugin. Let's use gRPC as the transport because it's edgy and cool – why not? We'll set Telegraf to listen on port 57000 for incoming gRPC messages from our hosts.

After running our docker-compose up, we can see that all 3 elements of the TIG stack are up and running and exposing the ports specified.

Great, our monitoring stack is up! Let's get some data into it.

I am using containerlab with vrnetlab-based images to run a couple of CSR1000vs as 'container-wrapped' VMs, but it doesn't really matter what you use, as long as the routers have reachability to the Telegraf container on port 57000.

As we are using YANG, we need to identify the YANG models that the platform actually supports. We can use something like pygnmi or gnmic for this; however, I decided to dust off an old custom Python script that I had kicking about (just to see if it still worked…). These capabilities are documented, but let's actively look anyway.

Running our capabilities script we can see that the YANG model ‘Cisco-IOS-XE-bgp-oper’ is supported, along with countless others. Let’s focus on this model for now in order to gain insight into our BGP operations.
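
If you don't have an old script kicking about, a pygnmi capabilities call returns the same information – a minimal sketch, assuming gNMI is enabled on the CSR, using placeholder connection details (the exact field names in the returned dict may vary between pygnmi versions):

from pygnmi.client import gNMIclient

# Placeholder target/credentials - match these to your router's gNMI configuration.
with gNMIclient(target=("172.20.20.15", 57400), username="admin",
                password="admin", insecure=True) as gc:
    caps = gc.capabilities()
    for model in caps.get("supported_models", []):
        if "bgp-oper" in model["name"]:
            print(model["name"], model["version"])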

As we are working with YANG, I highly recommend using a tool such as pyang to help visualise the model – git clone the following repo to get all YANG models locally (https://github.com/YangModels/yang)

Let’s browse to the dir where our model resides and run pyang to visualise the tree.

Looking at the above structure, we can deduce the following xpath expression to see the number of installed BGP prefixes in our BGP table:

/bgp-ios-xe-oper:bgp-state-data/neighbors/neighbor/installed-prefixes

From the above, it may be obvious where the expression following the colon (:) comes from, but what about the part before the colon? Well, we can find this at the top of the module itself – dig into the model and look for the top-level prefix as below:

Now that we have the full xpath and know how to construct it, we can configure our router to stream telemetry over gRPC.

The configuration on IOS-XE is relatively self-explanatory – we set the data encoding to key-value Google protobuf (kvgpb) for use with gRPC, specify the xpath expression as deduced from our YANG model, and set the stream to yang-push as this is 'streaming' telemetry leveraging YANG. We also set the update policy to 'periodic' with an interval of 500 hundredths of a second (i.e. every 5 seconds) – we could change this to 'on-change' to only send telemetry data when the value itself changes.

telemetry ietf subscription 20
 encoding encode-kvgpb
 filter xpath /bgp-ios-xe-oper:bgp-state-data/neighbors/neighbor/installed-prefixes
 source-address 172.20.20.15
 stream yang-push
 update-policy periodic 500
 receiver ip address 172.20.20.8 57000 protocol grpc-tcp

We can validate the configuration for each xpath expression by looking at 'show telemetry ietf subscription <id> detail' – looks like our config is valid, so let's build out a Grafana dashboard!

In Grafana we can create a new dashboard and add a panel to visualise the data we are receiving from Telegraf and storing in InfluxDB. The query editor makes it simple – with a few clicks you can cobble together a SQL-esque statement that returns what you require.

We have 5 prefixes installed in the BGP table so far!

I'll advertise some more from our peer and we should see them reflected in the panel we have just built.

We can see that the number of prefixes installed in the BGP table has increased from 5 to 8!

Let's add another subscription to see the uptime of the peer, along with another panel in our dashboard.

telemetry ietf subscription 30
 encoding encode-kvgpb
 filter xpath /bgp-ios-xe-oper:bgp-state-data/neighbors/neighbor/up-time
 source-address 172.20.20.15
 stream yang-push
 update-policy periodic 500
 receiver ip address 172.20.20.8 57000 protocol grpc-tcp

We are leveraging the same YANG model, just a different xPath this time.

Our dashboard configuration looks a little different this time because the value returned is a string and not an int (the value types are shown in the tree returned by pyang). The strings are displayed in a table format, and I have chosen to only show the last row in the table.

We can see our peering has been up for 1 week and 4 days

We can display the panels on the same dashboard as below, allowing us to easily create some useful NOC views from any YANG model! Pretty sweet?

So that’s been visualising streaming telemetry data in TIG stack with gRPC, from IOS-XE!

Thanks for reading! As Always, code/files on my github!
https://github.com/thecraigus/TIGBlog/blob/main/telegraf.conf

Creating a Cisco SDWAN Chatbot with Azure App Services + Python

In a previous blog post I confessed my affinity for 'ChatOps', and how instant messaging clients can be used to help network operations teams (using a framework like StackStorm and event-driven automation).

I wanted to explore this idea a bit more with another post, this time focusing on how the network administrator might actively request data, as opposed to it being posted based on a failure event/trigger generated by something like a syslog message or other telemetry data.

I decided to prototype a quick Cisco SDWAN chatbot. In the vast majority of cases the Cisco SDWAN vManage controller is an internet-facing device, making it the perfect fit for a use case such as this. With other 'on-prem' SDN controllers there might be a bit more work required, such as VPNs back to on-prem infrastructure (depending on where the bot is hosted), but nothing outside the realm of possibility.

Below highlights a quick overview of the prototype solution workflow.

  1. A request is made in the Webex Teams client
  2. The Webex service makes an HTTP callback to our bot endpoint (hosted in Azure)
  3. Our bot parses the request and queries the vManage controller for whatever data was requested
  4. Our bot posts the data back to the originating room within Webex Teams

I already had a pre-existing bot in my Webex Teams channel from my previous blog post; I just needed to request the access token again. Creating a bot is a really simple and well-documented process over at https://developer.webex.com

The webhook is essentially an HTTP request that is configured to trigger upon certain conditions. Creating the webhook is typically done through the API; however, to save writing code it's quick to do this via the Webex Teams API explorer and fill in the variables. We want the HTTP callback to trigger when 'messages' are 'created' – messages being the resource and created being the event. The target URL is where our SDWAN chatbot will be hosted.
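
For reference, the same webhook could also be created in a couple of lines with the webexteamssdk library – the target URL below is a placeholder for the Azure App Service endpoint we build later:

from webexteamssdk import WebexTeamsAPI

api = WebexTeamsAPI()   # reads the bot token from the environment
api.webhooks.create(
    name="sdwan-chatbot",
    targetUrl="https://<your-app-service>.azurewebsites.net/events",
    resource="messages",   # the resource...
    event="created",       # ...and the event we want the callback for
)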

The actual bot endpoint will be hosted on the Microsoft Azure App Services PaaS platform. As Azure is the cloud platform I am most familiar with, I chose it by default, but as long as the bot is reachable to/from the general internet it shouldn't be an issue.

I am linking the App Service to my GitHub repo as the deployment option – so whenever I commit changes to the repo, the deployment happens automagically! Pretty cool, right? Can anyone say CI/CD?

The runtime stack has been chosen as Python 3.8 as that is what our bot is written in, but again it could be whatever language you are familiar with.

The actual bot code is below. I am leveraging the Flask framework, but the concepts still apply with whatever Python web framework you are using (Flask, Django, FastAPI). This code is NOT production-grade – please do not bake credentials into your production code; this is just for demo purposes.

from flask import Flask, request
import requests
from webexteamssdk import WebexTeamsAPI, Webhook
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


username = 'devnetuser'
password = 'RG!_Yw919_83'
vManage = 'sandbox-sdwan-1.cisco.com'
authurl = 'https://{}/j_security_check'.format(vManage)
authbody = {'j_username': f'{username}', 'j_password': f'{password}'}
url = f'https://{vManage}/dataservice/'
token_url = url + 'client/token'

app = Flask(__name__)
api = WebexTeamsAPI()


@app.route('/events', methods=['GET', 'POST'])
def webex_teams_webhook_events():
    if request.method == 'GET':
        return ("""
                   <html>
                       <head>
                           <title>SDWAN Chatbot!</title>
                       </head>
                   <body>
                   <p>
                   <h1>SDWAN Chatbot</h1>
                   <h2>The App is running!</h2> 
                   </p>
                   </body>
                   </html>
                """)

    elif request.method == 'POST':
        json_data = request.json
        print('\n webhook data \n')
        print(json_data)

        webhook_data = Webhook(json_data)
        room = api.rooms.get(webhook_data.data.roomId)
        message = api.messages.get(webhook_data.data.id)


        bot = api.people.me()

        if message.personId == bot.id:
            return 'OK'

        else:
            if "sdwan controller status" in message.text:
                viptela = requests.session()
                viptela.post(url=authurl, data=authbody, verify=False)
                login_token = viptela.get(url=token_url, verify=False)
                viptela.headers['X-XSRF-TOKEN'] = login_token.content
                getStatus = viptela.get(
                    url=f"https://{vManage}:443/dataservice/device/monitor", verify=False).json()

                deviceStatus = []
                for device in getStatus['data']:
                    deviceStatus.append(
                        device['host-name']+f' Status: {device["status"]}')
                api.messages.create(room.id, text=str(deviceStatus))
            return 'OK'


if __name__ == '__main__':
    app.run()

I am also using the Cisco SDWAN sandbox as the vManage endpoint. One shortcoming of this is that there are no cEdge/vEdge devices associated with the controller, so I was limited in what data I could pull back – essentially just the controllers themselves.

Without doing a line by line rundown of the code, what we are essentially looking for is the string ‘sdwan controller status’ in the message that was sent to the bot in our webhook. If we see this string then we perform an API call to vManage requesting the status of the controllers and then we post it back into the originating room.

To abstract a lot of the functionality, we are using the webexteamssdk Webhook and WebexTeamsAPI classes. One thing to be mindful of is that the API client requires an access token to be available at runtime; I have set this as an environment variable in Azure App Services as below:
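
In other words, the WebexTeamsAPI() call in the bot code falls back to that environment variable – passing the token explicitly would be equivalent:

import os
from webexteamssdk import WebexTeamsAPI

# Equivalent to relying on the WEBEX_TEAMS_ACCESS_TOKEN App Services application setting.
api = WebexTeamsAPI(access_token=os.environ["WEBEX_TEAMS_ACCESS_TOKEN"])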

To test that the bot endpoint is working, we can trigger the "get" endpoint by browsing to the URL.

In order to get the bot to act on a message, we have to mention the bot in the Webex Teams chat using @NetDevOps.

We can ask the bot the status of the controllers by specifying a message that contains the string ‘sdwan controller status’

Awesome!

We could take the concept further by requesting specific device information, individual device configurations, and even alarms and events!

Thanks for reading!

All code on my github

https://github.com/thecraigus/sdwanchatbot

Exploring gNMI with Arista cEOS + YANG

In network programmability we abandoned the CLI in favor of the API – but what transport protocol do we use? In my journey I have typically utilised REST, NETCONF, RESTCONF, and even a bit of CLI scraping over an SSH connection when needed. One of the more recent additions to the network programmability party is gNMI.

gNMI is developed on top of Google's open-source gRPC framework. It leverages Google's protobuf messaging format and HTTP/2 to achieve speeds 7-10 times faster than a REST transaction.

The following RPCs are supported as per the gNMI specification:

  • Capabilities – used to gain an understanding of the target's capabilities
  • Get – used to retrieve snapshots of the data on the target
  • Set – used to modify data on the target
  • Subscribe – used to control subscriptions to the target, typically used in streaming telemetry

gNMI is implemented by a number of network vendors across various platforms; I will be looking at Arista's implementation in their containerized EOS platform.

The following configuration enables the gNMI interface over the default Arista port of 6030

In order to gain a list of the capabilities of the target and its supported data models, we can use the gnmic command-line tool directly from bash; we use the insecure flag to indicate that the target is not using TLS transport (as this is a lab).

In the output we can see that the platform supports lots of data models, mainly in the native Arista and OpenConfig formats.

I have chosen to work with the OpenConfig models. As this demo focuses on working with an ACL, I have chosen the OpenConfig ACL model; however, the model you use will depend on whatever your use case is.

One of the things I have learnt about working with YANG is that in its raw format it isn't easy on the eye – determining the path to a resource can be very time-consuming without the correct tooling. I recommend using YANG Catalog or pyang to get a better feel for the actual model and the path required.

Arista makes their platforms' supported YANG models available online, and these work a treat with pyang.

pyang is a really useful tool for visualizing YANG models. Navigate to the dir of the YANG model in question (in our case /openconfig/public/release/models/acl) and issue the command pyang -f tree <model> to view the model in a tree format.

from the above format, we are able to deduce the correct path to our resource by following the hierarchy provided by pyang.

In order to view the source address of sequence number 10 in our ACL ‘test123’ the path would be as follows:

/acl/acl-sets/acl-set[name=test123]/acl-entries/acl-entry[sequence-id=10]/ipv4/config/source-address

Now that we know the path, we could use the gnmic command-line tool again to pull the data back; however, let's use the Python library pygnmi to create a simple script that returns the data we are interested in.

We instantiate a session of gNMIclient by feeding it the details relevant to our target.

We connect to our target by issuing the .connect() method.

Once connected, we can issue a 'get' RPC (as defined in the gNMI specification); in this RPC we pass the path of the resource we are interested in, as deduced from the chosen YANG model.
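
A minimal sketch of those steps with pygnmi (the host, port and credentials are placeholders for the lab cEOS instance):

import json
from pygnmi.client import gNMIclient

gc = gNMIclient(target=("192.168.0.10", 6030), username="admin",
                password="admin", insecure=True)
gc.connect()

# The path deduced from the OpenConfig ACL model above.
path = "/acl/acl-sets/acl-set[name=test123]/acl-entries/acl-entry[sequence-id=10]/ipv4/config/source-address"
result = gc.get(path=[path], encoding="json_ietf")

# json.dumps just makes the returned notification easier on the eye.
print(json.dumps(result, indent=2))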

We get the below data returned from the request (I have leveraged the json standard library to make the original output a bit more readable)

In order to modify the device configuration, we can use the 'set' RPC defined in the gNMI specification to 'replace' or 'update' the configuration; the differences between the two operations are defined in the specification.

In the above script, we replace the existing source address with our new source as defined in the variable newsource; we then fetch the source address with a get RPC to validate the change.
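
Continuing the session from the sketch above, one way to express that with pygnmi is a replace against the config container followed by another get – newsource here is just an example prefix:

newsource = "10.20.30.0/24"
config_path = "/acl/acl-sets/acl-set[name=test123]/acl-entries/acl-entry[sequence-id=10]/ipv4/config"

# 'replace' swaps the contents of the config container for the supplied values.
gc.set(replace=[(config_path, {"source-address": newsource})], encoding="json_ietf")

# Read the leaf back to validate the change, then tidy up.
print(gc.get(path=[config_path + "/source-address"], encoding="json_ietf"))
gc.close()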

We can see the output as below!

Conclusion

gNMI is (relatively) new, fast, and for the most part – pretty well documented, from what I have seen so far. YANG takes a bit of getting used to but with the correct tooling, it isn’t impossible.

Supposedly where gNMI really shines is in streaming telemetry with its subscribe RPC – this is something I want to look into in a future post, along with other network observability technologies.

Key tools for working with gNMI:

  • gnmic
  • pygnmi (if you are a python dev)
  • pyang (for visualising those yang models)
  • vendor documentation (Thanks Arista! – I can't comment on other vendors' comprehensiveness as of yet…)

Event Driven Network Automation with StackStorm & WebEx ChatOps

The bulk of network automation I tend to see is what I call 'proactive' automation, where a task or process has been broken down into its component parts and coded up in a language or framework, ready to be run at the behest of the network administrator/engineer. Examples of this approach could be firewall policy extraction and conversion for a migration to a new platform, or even something as ambitious as the automated provisioning of a new VPN onto a provider's MPLS infrastructure when they onboard a customer.

Initially coming from a network operations background, I still like to think of ways network automation can be leveraged in response to faults and outages, and to aid time to recovery – let's be honest, Murphy is real and things will break/go down when you least want them to. With this in mind, I started to look at what is available in the space of event-driven network automation and, to be honest, there isn't much; I assume a lot of large organisations (FAANG) write their own bespoke internal tooling for this. However, the main two frameworks I came across were SaltStack and StackStorm.

Both seem like great projects. I decided to look into StackStorm and it's a great framework – it's essentially an IFTTT engine.

Take the topology below: the peering between R1 and R2 is my SUPER-important OSPF adjacency – wouldn't it be nice to be notified about any issues on it as soon as possible?

Super Important OSPF Adjacency

In the world of remote work, people are communicating more than ever over their IM clients – so why not have notifications of outages and network issues pushed directly into the IM client along with some basic triage? If a peering goes down, it could be an issue with the interface itself – let's have those details as soon as possible without having to log onto the device.

I created a simple bot in Cisco’s Webex Teams in a mock ‘Network Operations’ Room

The rough-and-ready Python script below interacts with the bot and posts a message in the room; it performs some simple parsing of the syslog message that is presented to it via StackStorm and grabs the interface details associated with the downed adjacency.

import requests
import json
from netmiko import ConnectHandler
import sys
import re

syslogmessage = sys.argv[2]

interface = re.search('on(.+)from', syslogmessage).group(1)
print(interface)
device = sys.argv[1]

dev = {
    'device_type': 'cisco_ios',
    'host': device,
    'username': 'admin',
    'password': 'password123'
}

net_connect = ConnectHandler(**dev)

interface_output = net_connect.send_command(
    'show interface {}'.format(interface))


bot_token = 'REDACTED'   # bot access token (kept out of the repo)
roomm_id = 'REDACTED'    # Webex Teams room ID
message = 'OSPF Peering Failure Detected On Device {} \n {}' .format(
    sys.argv[1], sys.argv[2])
api_url = 'https://webexapis.com/'

headers = {'Authorization': 'Bearer {}'.format(
    bot_token), 'content-type': 'application/json'}
payload = {
    "roomId": roomm_id,
    "markdown": message+'\n'+interface_output
}


post_message = requests.post(
    url=api_url+'/v1/messages', headers=headers, data=json.dumps(payload))

How are we going to trigger the above script? This is where StackStorm comes in. The extensibility of StackStorm is really appealing – with lots of community-written 'packs' available, we can just leverage one of these. I have used the syslog listener "Ghost2Logger" (https://github.com/StackStorm-Exchange/stackstorm-ghost2logger).

StackStorm (ST2) works on the concepts of triggers, rules, and actions. I have created a rule in ST2 for our critical peering that uses the ghost2logger pack – we will look at the trigger actions in a moment.

The rule will trigger our workflow, in turn triggering our pre-written ChatOps script and a basic “remediation script” that bounces the interface associated with the peering.

The metadata file (below) associated with our OSPF remediation workflow defines the pack the workflow belongs to and the entry_point that details the actual workflow tasks.

craig@ubuntu:/opt/stackstorm/packs/examples/actions$ cat ospfremediationsmeta.yaml 
---
name: ospf-remediations-seq
pack: examples
description: ospf-workflow
runner_type: orquesta
entry_point: workflows/ospfremediations.yaml
enabled: true
parameters:
  notify_script:
    required: true
    type: string

  remediate_script:
    required: true
    type: string

  cwd:
    required: true
    type: string

The actual workflow definition (below) receives 3 inputs: the name of the "notify" script, the name of our "remediate" script, and the working directory from which to run these scripts.

craig@ubuntu:/opt/stackstorm/packs/examples/actions/workflows$ cat ospfremediations.yaml 
input:
  - notify_script
  - remediate_script
  - cwd

tasks:
  task1:
    action: core.local cmd=<% ctx(cwd) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

        do: task2

  task2:
    action: core.local cmd=<% ctx(notify_script) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>
        do: task3
  task3:
    action: core.local cmd=<% ctx(remediate_script) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>

The rule trigger actions and workflow inputs are defined in the ST2 WebUI – you can also edit these in the CLI but I find the GUI ok for rapid testing

The ghost2logger.pattern_match trigger gives us 3 values – trigger.host, trigger.message, and trigger.pattern – and we can match based on these. Leveraging regex, I have matched OSPF neighbor-down events.

Based on this, we can feed the values into our script arguments with standard Jinja2 ({{trigger.host}} / {{trigger.message}}) syntax.

Now the workflow and rule are finished – let's test it!

It's a normal day in the NOC, all quiet in the WebEx chat….

Suddenly we get a WebEx Teams notification that there has been an OSPF peering failure on device 192.168.137.65! Along with the output of show interface Eth0/1

If we look closer at the output we can see that the peering interface has been shut down! That explains why the peering has gone down.

Based on our automated ST2 workflow logic, we should have had a remediation script run as part of this workflow to bounce the interface… let's check the router to see if it's sorted everything out for us.

Awesome, the peering is back up and we didn’t even need to log manually into the router.

Let's check the StackStorm run history to see if the workflow ran as expected.

We can see that ospf-remediations-seq ran by issuing: sudo st2 execution list

We can get more details about the workflow by checking out its execution-id – we can see the workflow and all 3 associated tasks succeeded!

Pretty Sweet!

StackStorm seems like a really useful framework and I will definitely be checking it out and seeing what else it has to offer!

scripts/workflow files: https://github.com/thecraigus/stackstorm

VXLAN BGP-EVPN with Cumulus + NXOS

In one of my previous blogs I outlined the basic configuration required for a simple VXLAN deployment between 2 Cisco Nexus 9000v switches. The overall aim of extending Layer 2 across a Layer 3 backbone was achieved; however, as is the default behavior of VXLAN with no control-plane mechanism, the solution still relied on flood-and-learn behavior to propagate MAC addresses across the fabric. This write-up aims to outline the additional constructs and configuration required to bring some pseudo-intelligence into the solution by leveraging EVPN with MP-BGP to control the distribution of this traffic and increase the scalability of the solution.

I was also curious to get to grips with Cumulus Linux and test the interoperability of the BGP-EVPN solution between disparate vendors. The lab is built around 2 NXOSv 9ks in the spine layer, and 2 NXOSv 9ks along with a Cumulus VX in the leaf layer.

Figure A: Lab High level Overview

Lab Devices Outlined Below:

  • Cisco NX-OSv 9000 9300v 9.3.3 (Spine A & B, Leaf A & B)
  • Cumulus VX 4.3.0 (Leaf C)
  • Cisco IOU-L3 (For all end hosts)

Device Loopbacks:

  • Spine A – Loopback 0 (2.2.2.1/32) and Loopback 12 (12.12.12.12/32 – Anycast RP)
  • Spine B – Loopback 0 (2.2.2.2/32) and Loopback 12 (12.12.12.12/32 – Anycast RP)
  • Leaf A – Loopback 0 (1.1.1.1/32)
  • Leaf B – Loopback 0 (1.1.1.2/32)
  • Leaf C – Loopback 0 (1.1.1.3/32)

The lab will provide multi-tenancy across the fabric, with each customer having 2 segments within their respective tenancy. Full symmetric inter-VNI routing will be performed for each customer within their tenancy.

See below for the high level breakdown of the customer segmentation and VNI allocation

  1. Tenant/Customer 10
    • VNI90010 – (192.168.10.0/24 – mcast 239.0.0.10)
    • VNI90020 – (192.168.20.0/24 – mcast 239.0.0.20)
    • VNI10010 – Layer-3 VNI for Symmetric Inter-VNI routing
  2. Tenant/Customer 20
    • VNI92010 – (172.16.10.0/24 – mcast 239.0.20.10)
    • VNI92020 – (172.16.20.0/24 mcast 239.0.20.20)
    • VNI20010 – Layer-3 VNI for Symmetric Inter-VNI routing

Technical Overview

Physical

The physical topology will leverage the Clos / spine-and-leaf architecture that has become synonymous with these technologies – predictable round-trip times, along with scope to leverage full ECMP when forwarding across the fabric, are the two big driving factors behind this physical topology.

OSPF

OSPF will be deployed as a single-area backbone to provide end-to-end reachability across the underlay between VTEPs (leafs). There is nothing particularly spectacular about the OSPF configuration – just ensure all loopbacks and transit networks are advertised into OSPF.

On the Spine nodes I did ensure that the OSPF RID was manually set to Lo0 as opposed to the default behaviour of the anycast RP Lo12 address – eliminating any possible duplicate RID issues.

NXOS – Leaf A – OSPF Configuration/Validation
Cumulus – Leaf C – OSPF Configuration/Validation

All leafs have OSPF peerings with all spines as expected.

Multicast

My original choice was to configure bidirectional PIM, as all VTEPs will be multicast senders and receivers. However, during the configuration I found out that Cumulus does not support PIM BiDir, so I opted for PIM ASM (sparse mode) instead.

Both spines will be participating in an anycast RP group of 12.12.12.12, so there is no need to manually configure MSDP between the spine switches; this synchronisation of sources is taken care of by the NXOS anycast RP group feature.

All spines and leafs will be configured with a manual RP of the anycast address of 12.12.12.12/32.

All Transit and Loopback interfaces will be configured with PIM Sparse.

NXOS – Spine A – Multicast Configuration
Cumulus – Leaf C – Multicast Configuration

MP-BGP

The BGP topology consists of a single autonomous system (ASN 65000), with both spine nodes performing the function of BGP route reflectors to eliminate the requirement for full-mesh iBGP peering across the fabric.

The l2vpn evpn family has been configured along with the distribution of extended communities.

NXOS-Spine-A-MP-BGP
NXOS-Leaf-A-MP-BGP
Cumulus-Leaf-C-MP-BGP

VXLAN/EVPN

The local host-side encapsulation is VLAN-based; these VLANs are mapped to specific VXLAN segments, which are in turn mapped to a specific multicast group (as detailed above).

A Layer-3 VNI has been configured to allow symmetric inter-VNI routing – this removes the requirement for all VNI segments to be present on all VTEPs.

The VLAN segments have been tied to their specific VNIs.

NXOS – Leaf A – VLAN/VNI

The NVE interface performs the VXLAN encap/decap on NXOS – we have also enabled control-plane learning via BGP over the NVE interface with the "host-reachability protocol bgp" command. Each VNI is allocated its dedicated multicast group.

NXOS – Leaf A – NVE Configuration

The SVIs have been placed into their appropriate VRFs depending on the respective tenancy – the anycast gateway feature has been configured to present the same L2 address at each SVI on each VTEP, which is especially useful for virtual machine mobility within the fabric.

The L3 VNIs have been configured with the ip forward command; this enables routing functionality without the need for an IP address.

NXOS – Leaf A – SVI’s

The Cumulus configuration isn’t a million miles away from the NXOS configuration in terms of syntax for vxlan – In all honesty I actually prefer it.

Cumulus – Leaf C – VXLAN

Cumulus and NXOS generate their RDs in the same manner, so there is no requirement to deviate from the automatic route import/export scheme. As with the VXLAN configuration, this only needs to be configured on the VTEPs – the spines are essentially acting as multicast/MP-BGP/OSPF forwarders.

NXOS – Leaf A – EVPN
Cumulus – Leaf C – EVPN

Verification

We can see the EVPN Type 2 MAC/IP routes by issuing show bgp l2vpn evpn on Leaf-A – we see the intra-VNI routes advertised with no IP address and the inter-VNI routes advertised with an IP address.

NXOS – Leaf A – EVPN Routes

The Cumulus show command is almost identical to the cisco NXOS equivalent – makes troubleshooting nice and easy between platforms – net show bgp evpn route

Configs

https://github.com/thecraigus/nxoscumulusevpn

GNS3 OnDemand TestBed

GNS3 is not the only show in town nowadays – relative newcomers like EVE-NG and CML (RIP VIRL) have given GNS3 stiff competition and seem to be gaining ground; whenever I see a screenshot of someone's virtual lab it tends to be in EVE-NG. I have yet to make the switch and, in all honesty, I've not really heard a compelling argument to move from GNS3 – it does everything I need it to do for my studies. Can someone change my mind?

One of the things I recently discovered about GNS3 was its public-facing API – the ability to programmatically stand up test infrastructures for general labbing, study, or integration into a CI/CD pipeline (something I want to cover in another blog) is a feature I had to explore. The API is on version 2 now, so I may be late to the party, but this write-up will give a broad overview of my dealings with it.

I am running the GNS3 VM locally; however, these concepts still apply if you are running on a remote server.

The API has great documentation available at: https://gns3-server.readthedocs.io/en/latest/#

All code available on GitHub: https://github.com/thecraigus/gns3ondemand

Credentials

The GNS3 API is a RESTful API and is authenticated by default. It accepts basic authentication – the username and password are buried away in the gns3_server.conf file as below, so open the file and make a note of these.

Creating a Project (createPoject.py)

Each GNS3 project has a name and a unique project ID – I created a dictionary with these values in it, plus a dictionary for storing the credentials.

A simple POST request to the /projects API endpoint allowed me to create a project workspace.
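
A hedged sketch of that call – the server address and credentials are placeholders, so swap in the values from your gns3_server.conf:

import requests

gns3_url = "http://192.168.137.50:3080/v2"     # placeholder GNS3 server
auth = ("admin", "supersecret")                 # from gns3_server.conf

# POST /v2/projects creates the project and returns its project_id.
response = requests.post(f"{gns3_url}/projects", json={"name": "ondemand_testbed"}, auth=auth)
project_id = response.json()["project_id"]
print(response.status_code, project_id)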

Adding Nodes (addNodes.py)

The /nodes API operates in much the same way as the /projects endpoint, albeit it expects a rather weighty JSON document per node. In retrospect I should really have authored these JSON documents from values in YAML files and a Jinja2 template, rather than having multiple verbose JSON documents embedded in the script itself. Maybe this is something I will revisit later.

Adding Links and Starting Nodes (startTestbed.py)

Each link in GNS3 is represented by a JSON array of 2 nodes and their corresponding adapter/port numbers. The nodes are referenced by their node ID, which can be obtained with a GET request to the /nodes API.
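
A rough sketch of those two steps (wiring a link and then booting everything), continuing with the placeholder server details from the earlier snippet:

import requests

gns3_url = "http://192.168.137.50:3080/v2"
auth = ("admin", "supersecret")
project_id = "REPLACE-WITH-PROJECT-ID"   # returned when the project was created

# Node IDs come from a GET on the project's /nodes endpoint.
nodes = requests.get(f"{gns3_url}/projects/{project_id}/nodes", auth=auth).json()

# A link is just a pair of node/adapter/port references.
link = {"nodes": [
    {"node_id": nodes[0]["node_id"], "adapter_number": 0, "port_number": 0},
    {"node_id": nodes[1]["node_id"], "adapter_number": 0, "port_number": 0},
]}
requests.post(f"{gns3_url}/projects/{project_id}/links", json=link, auth=auth)

# One call starts every node in the project.
requests.post(f"{gns3_url}/projects/{project_id}/nodes/start", auth=auth)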

Running the startTestbed.py script will spin up the topology.

If we open up the newly created project we can see that our testbed topology has been created programmatically.

Summary

The ability to stand up test infrastructures on demand is something that can be leveraged in personal study, labbing, and testing as part of a CI/CD pipeline. The infrastructure deployed in this walkthrough has a blank default configuration; we could, however, take this a step further and perform some bootstrapping via an automated console session (using something like telnetlib) to get the infrastructure to a state that mirrors your production environment, ready to test your changes.