When I first joined the Cisco IT team in 2018 after graduating from university, I was eager to see how a large, global company like Cisco can efficiently develop and deliver solutions to its customers. I quickly discovered that while Cisco is big and complex, it operates very much like an agile startup, fueled by the self-motivated innovation and ongoing collaboration of its 74,000-strong workforce.
One standout example of the entrepreneurial spirit at Cisco is the evolution of Cisco TraceLogger from a log collection and management solution to a connected, scalable, and distributed customer experience solution. Cisco IT and the Cisco Customer Experience (CX) team, using a cross-functional, collaborative development model, created a better technical support experience for customers and partners who use Cisco Collaboration applications. These two teams have brought together tools, systems, and data to help reduce engineering hours and deliver better service.
The story of this partnership is outlined in this recent white paper. But I wanted to underscore a few things about the initiative that impressed me personally. Topping the list is the fact that Cisco TraceLogger was built by operational engineers at Cisco during their free time. They used their innovative drive to turn a mundane log-collecting task into a fully fledged, modern, industry-standard software application that will help Cisco customers as well as generate revenue for Cisco.
Cisco IT as “Customer Zero”
Cisco IT and CX’s teamwork in developing Cisco TraceLogger is, in many ways, a natural extension of their ongoing collaboration. Cisco uses its own products internally, which Cisco IT manages. The team is well-versed in operating—and often, breaking—these products once they are deployed in Cisco’s large IT environment. The CX team, meanwhile, works directly with customers to troubleshoot and identify gaps in Cisco’s product suite. And Cisco IT and CX both work with the product engineering team at Cisco to provide feedback and ensure issues such as bugs, feature requests, and architecture failures are resolved in future releases.
However, because the engineering team has a large backlog of features to deliver and go to market with before competitors do, it can’t always prioritize requests from Cisco IT or CX. Fortunately, there is another option: Cisco IT provides the perfect environment to test any solutions that CX develops by acting as Customer Zero. And as Cisco IT obtains more software-focused skills, the team is able to contribute directly to product development by CX—as well as other divisions within Cisco. That, in turn, helps to accelerate the development and delivery of new solutions, like Cisco TraceLogger.
Cisco TraceLogger is a solution that Cisco IT desperately needed, which is a key reason the team jumped at the opportunity to help the CX team evolve it. Cisco IT supports video calling for all of Cisco through the infrastructure we have set up on Cisco’s internal network. If a call goes wrong, our team must diagnose and resolve the problem as quickly as possible. (It’s important to note here that Cisco customers are likely to face the same problems as Cisco IT in supporting the video calling service in their own environment.)
Our team found the diagnostic tools for on-premises video calling solutions to be lacking. For example, there’s no way to know what route a call took. The call setup process could have hopped through several Cisco Unified Communications Manager (CUCM) instances across different geographic regions. To piece this information together, an engineer has to pull logs manually from each individual call manager, and then go through the log files line by line to assemble the call setup flow. It’s a long, mundane, and error-prone process.
Often, Cisco IT must send the log files to a team within CX for more troubleshooting help. Log files can be tens of gigabytes in size, which makes transferring them an arduous task. (In some cases, we have had to mail hard drives.) This inefficiency has led to our team spending up to a month trying to diagnose some call issues.
Finally, we asked our colleagues in CX if they knew of a better way. That’s when they told us about their in-house initiative, TraceLogger, which can automatically pull down logs from CUCMs and piece together calls using an algorithm. This automated process takes minutes and significantly shortens case resolution times.
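To make the idea of piecing calls together more concrete, here is a minimal sketch in Python. It is not TraceLogger's actual algorithm, which this post does not describe. It assumes a made-up log format of `<ISO timestamp>|<call id>|<message>` and hypothetical node names, groups lines from every call manager by call ID, and orders them by time so the route a call took becomes visible.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    node: str      # hypothetical: name of the call manager that wrote the line
    call_id: str   # hypothetical: identifier shared by every line of one call
    message: str


def parse_line(node: str, line: str) -> LogEntry:
    """Parse one line in the assumed '<ISO timestamp>|<call id>|<message>' format."""
    stamp, call_id, message = line.strip().split("|", 2)
    return LogEntry(datetime.fromisoformat(stamp), node, call_id, message)


def assemble_call_flows(logs_by_node: dict[str, list[str]]) -> dict[str, list[LogEntry]]:
    """Group entries from all nodes by call ID, then sort each group by time."""
    flows: dict[str, list[LogEntry]] = defaultdict(list)
    for node, lines in logs_by_node.items():
        for line in lines:
            if line.strip():  # skip blank lines
                entry = parse_line(node, line)
                flows[entry.call_id].append(entry)
    for entries in flows.values():
        entries.sort(key=lambda entry: entry.timestamp)
    return dict(flows)


def route(flow: list[LogEntry]) -> list[str]:
    """Return the call managers a call touched, in order of first appearance."""
    seen: list[str] = []
    for entry in flow:
        if entry.node not in seen:
            seen.append(entry.node)
    return seen
```

Given logs from two call managers, `route(assemble_call_flows(logs)["call-1"])` would list the nodes in the order the call reached them, which is the information an engineer otherwise assembles by hand.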
Cisco IT offered to help accelerate the project, and we were able to provide new features and make TraceLogger more stable. Our team also deployed TraceLogger in our production environment to see how it would scale and find any bugs and flaws before customers would. Here’s a look at some of the high-level technical changes we made:
Improvement #1: A distributed architecture
Cisco TraceLogger has a microservice architecture composed of individual components that work together to perform a task. It’s a modern approach to designing software. In fact, TraceLogger is built with modern technologies such as the ELK Stack for data ingestion, storage, and searching; Docker for containerization; Flask for the web service; and a custom multi-processing log-parsing library built with Python. Learning these skills is proof of the commitment and drive of Cisco IT engineers, who, again, did this on their own time.
TraceLogger was designed to be installed as a single instance to support, at most, the processing of 15,000 log files per hour. That’s more than enough for most customers; however, Cisco needs multiple instances due to the size of its IT environment. Another challenge: Cisco’s video infrastructure is spread around the world. Cisco IT alerted the CX team to these challenges to ensure TraceLogger could collect logs from different places around the world. That capability saves Cisco IT time and money, and it will do the same for customers who buy the solution. It’s also a prime example of how Cisco IT creates value by serving as Customer Zero.
In the new production version of TraceLogger, Cisco IT and CX decided to split the solution into two services: TraceLogger Central and TraceLogger Edge. TraceLogger Edge can be deployed geographically near the video infrastructure and contains all the microservices associated with pulling and processing logs, limiting much of the heavy computational work to the scope of a region. That means any impact on network traffic through the transfer of log files is minimal. Meanwhile, the TraceLogger Edge node sends only lightweight, summarized information that is useful to the TraceLogger Central instance. This approach can scale to as many TraceLogger Edge nodes as a customer requires, so the service works for customers of any size.
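The Edge and Central split can be illustrated with a small Python sketch. This is only an illustration under assumptions: the class and function names, the `status` field, and the shape of the summary are all hypothetical and not taken from TraceLogger. The point is that the heavy per-call records stay in the region, and only a few numbers travel to the central instance.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSummary:
    """Hypothetical lightweight payload an edge node sends to the central instance."""
    region: str
    total_calls: int
    failed_calls: int


def summarize_region(region: str, call_records: list[dict]) -> EdgeSummary:
    """Runs on the edge: reduce full call records (assumed to carry a 'status' key)
    to a handful of counters so the raw data never leaves the region."""
    failed = sum(1 for record in call_records if record["status"] != "ok")
    return EdgeSummary(region, len(call_records), failed)


class CentralView:
    """Runs centrally: keeps only the latest summary per region."""

    def __init__(self) -> None:
        self._latest: dict[str, EdgeSummary] = {}

    def ingest(self, summary: EdgeSummary) -> None:
        self._latest[summary.region] = summary

    def total_calls(self) -> int:
        return sum(summary.total_calls for summary in self._latest.values())

    def most_active_region(self) -> str | None:
        if not self._latest:
            return None
        busiest = max(self._latest.values(), key=lambda summary: summary.total_calls)
        return busiest.region
```

Adding a region in this sketch means adding one more edge node that calls `summarize_region` and sends the result to `ingest`; the central side does not change.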
Improvement #2: Stronger security posture
Cisco IT also put TraceLogger through our security workflows in our production environment, which allowed us to benchmark the product’s security and catch any vulnerabilities before our customers would. A small part of that process involves running scans against the source code as well as the running instance. Through that process, our team found several holes in the original TraceLogger source code. We deployed fixes for the vulnerabilities and documented and reported those issues to the CX team so they could track them and ensure that they were addressed in Cisco TraceLogger. The real beneficiaries here are Cisco customers, who will be receiving a more secure product because of these efforts.
Improvement #3: API-centric design
When Cisco IT deployed TraceLogger into region-specific locations, we noticed that it was difficult to search across those locations because there were no standard application programming interfaces (APIs) to talk to the TraceLogger instances. So, one of the key features of Cisco TraceLogger that our team pushed for was ensuring that all functionality be available as an API so that we, and other customers, can build custom tools on top of it.
We didn’t stop there, though. To prove how valuable APIs can be, we built Cisco Webex Teams bots, which we use to diagnose the health of all TraceLogger instances in the Cisco IT environment. We can show management metrics such as the number of calls per day, the most active regions for calls, and more. Ensuring that Cisco TraceLogger is API-centric allows a world of future opportunities for Cisco IT—and Cisco customers—to build on top of TraceLogger’s rich data.
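As a rough illustration of what a standard API makes possible, the Python sketch below searches several region-specific instances with one call. The endpoint path `/api/calls`, the `q` parameter, and the response shape (a JSON list of call objects) are assumptions made up for this example; they are not the real TraceLogger API. The fetch function is injectable so the fan-out logic can be tried without a network.

```python
from __future__ import annotations

import json
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch(base_url: str, query: str) -> list[dict]:
    """Query one instance over HTTP. The URL layout here is hypothetical."""
    url = f"{base_url}/api/calls?{urlencode({'q': query})}"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def search_all_regions(
    instances: dict[str, str],
    query: str,
    fetch: Callable[[str, str], list[dict]] = http_fetch,
) -> tuple[list[dict], dict[str, str]]:
    """Run the same query against every region and merge the results.

    Returns the matching calls (each tagged with its region) and a map of
    region to error message for instances that could not be reached.
    """
    results: list[dict] = []
    errors: dict[str, str] = {}
    for region, base_url in instances.items():
        try:
            for call in fetch(base_url, query):
                results.append({**call, "region": region})
        except OSError as exc:  # network failures, including urllib's URLError
            errors[region] = str(exc)
    return results, errors
```

The `errors` map doubles as a simple health signal: a bot could report any region that appears in it as unreachable.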
An invitation to share knowledge at Cisco
When CX decided to make Cisco TraceLogger available to all Cisco customers, Cisco IT was there to provide architecture recommendations and help ready the project for a production release. I was personally invited by CX to spend a month with the Cisco TraceLogger team to help accelerate progress. Using what I learned at Cisco IT, I helped implement some of the technical changes that our team thought would help TraceLogger scale to any customer’s needs, especially large financial institutions that have scaling needs similar to Cisco’s.
The CX team also asked me to go to Bangalore to train and teach their newly hired engineers about the Cisco TraceLogger architecture and work with them to ensure shipment to customers is on track. These opportunities to teach (and continue learning) would not be possible without the support from the excellent professionals I work with every day at Cisco. I can truly say, speaking as someone who was not sure what to expect joining such a large company after graduation, that startup-like innovation is alive and thriving at Cisco.