
Some time back, we wrote about the need for #consistentAI to accelerate enterprise digital transformation. To summarize: 1) we discussed some trends at the time, such as the widespread adoption of Kubernetes, the need to manage the AI lifecycle with consistent tools, and the need to bring data scientists and IT together; and 2) we articulated why Cisco contributed to Kubeflow, a core set of applications needed to develop, build, train, and deploy models efficiently on Kubernetes. This allows our customers to better manage their AI/ML lifecycles, e.g., making it easy to train models, tune models, and deploy ML workloads anywhere.

Fast forward to 2020: Kubeflow 1.0 is ready. Through the perseverance and hard work of some talented individuals, and close collaboration across several organizations, together we have achieved a pivotal milestone for the community.

In this article we would like to take a step back, celebrate the success, and discuss some of the steps needed to take the project to the next level. The Kubeflow project has several components, and we describe our journey through the lens of the contributions we made to a few of them. This focus was deliberate, guided by the gaps we saw and by our customers' needs.

Our Journey

Kubebench

Challenge: When we joined the community, it was early days and many of the current components did not yet exist. A big gap was the lack of a benchmarking tool for Kubeflow, which was critical for performance testing.

Solution: We got to know the community better and learned how to work together effectively. This allowed us to agree on how users could run complex ML workloads on Kubeflow, and Kubebench was born. Over time, Kubebench will allow users to run complex ML benchmarks, including industry-standard workloads like MLPerf.

Operators

Challenge: The TFJob and PyTorchJob operators were at an early stage and not production ready. There was a need to unify the different operators, reduce code duplication, and so on.

Solution: We first built a common/shared library and refactored the PyTorch and TensorFlow operators. The benefit is a common operator implementation that can support all the well-known frameworks, e.g., TensorFlow, PyTorch, MXNet, MPI, and XGBoost, thus improving the core of Kubeflow.
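
To give a flavor of what this looks like in practice, here is a minimal sketch of submitting a distributed TFJob through the operator's custom resource using the Kubernetes Python client. The image, namespace, and replica counts are illustrative placeholders rather than values from our deployments, and the same replica-spec structure carries over to the PyTorch, MXNet, MPI, and XGBoost operators.

from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

# Hypothetical two-worker TFJob; the container is named "tensorflow" per the
# operator's convention so it can inject the TF_CONFIG environment variable.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-dist", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            "image": "example.com/mnist-train:latest",  # placeholder image
                            "command": ["python", "/opt/train.py"],
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)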

Katib

Challenge: Our next big push was to address the community's need for a general hyperparameter tuner that works across the different operators.

Solution: We revamped the Katib API and made it simpler to use. We also added a new neural architecture search capability to Katib. This will help users speed up their data science exploration by improving models without having to hand-tune them.
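
As an illustration (not an excerpt from the Katib documentation), here is a minimal sketch of a Katib Experiment that random-searches a learning rate. The exact API version (v1alpha3 vs. v1beta1) and the trial template that wraps the actual training job depend on the Kubeflow release, so the trial template is elided; the experiment name, metric, and ranges are hypothetical.

from kubernetes import client, config

config.load_kube_config()

# Hypothetical Experiment: maximize validation accuracy over a learning-rate range.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-lr-search", "namespace": "kubeflow"},
    "spec": {
        "objective": {
            "type": "maximize",
            "goal": 0.99,
            "objectiveMetricName": "accuracy",
        },
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{
            "name": "lr",
            "parameterType": "double",
            "feasibleSpace": {"min": "0.001", "max": "0.1"},
        }],
        # "trialTemplate": {...}  # per-trial job template, omitted here
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace="kubeflow",
    plural="experiments", body=experiment,
)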

On-premise

Challenge: A big burning issue in the community has been (and still is) taming the on-premise and hybrid user experience of Kubeflow. We identified this as a big area for the community a year ago, and we committed to a north star where the customer/user has a smooth experience as she moves back and forth between cloud providers and on-premise infrastructure.

Solution: We enabled an on-premise file storage solution with NFS or local disk provisioning, contributed an authentication and authorization solution to enable a multi-user experience in Kubeflow, and volunteered to co-own the Kubeflow releases. This has led to a much better on-premise user experience, on par with any cloud-based deployment.
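
For context, the sketch below shows what wiring NFS-backed storage into a Kubeflow namespace can look like with the Kubernetes Python client. The server address, export path, and sizes are placeholders for an on-premise NFS export, not values from our solution.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# NFS-backed PersistentVolume (placeholder server and export path).
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="kubeflow-nfs-pv"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "50Gi"},
        access_modes=["ReadWriteMany"],
        nfs=client.V1NFSVolumeSource(server="10.0.0.10", path="/exports/kubeflow"),
    ),
)
core.create_persistent_volume(body=pv)

# Claim that notebooks and pipeline steps can mount.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="kubeflow-nfs-pvc"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
        storage_class_name="",  # bind statically to the PV above, not a dynamic class
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="kubeflow", body=pvc)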

Community Service

Challenge: Understanding the needs of customers and users is important for open source communities, and technical product managers play an important role in gluing the different sub-teams into one big team. This is even more important when folks from different organizations have different priorities.

Solution: We decided to contribute to the shared PM effort. We helped create the user surveys that defined priorities for the different capabilities. We also helped form the Technical Advisory Council (TAC), where we transcended our own organizations' priorities to provide a longer-term technical vision for the project and work with different teams to drive it forward. The impact has been better application guidelines, a more robust codebase, better technical documentation, and more.

Evangelism

Challenge: Our customers and internal teams were looking for MLOps solutions like Kubeflow, and they needed some assurance that Kubeflow could help with their AI transformation.

Solution: We presented several introductory sessions at Cisco Live and other internal events. We also presented at KubeCon Asia, KubeCon North America, and KubeCon Europe. Additionally, we published Kubeflow-based work in peer-reviewed academic venues such as ICML, NeurIPS, and USENIX OpML.

The Road Ahead

Cisco has been part of the Kubeflow journey since the project's formative stage. Kubeflow 1.0 is a great milestone on a long journey to simplify ML lifecycle management (or MLOps) and accelerate the digital transformation of the enterprise in a consistent fashion. We have achieved the initial stability we desired for the base MLOps features, e.g., pipelines, training, serving, notebooks, hyperparameter tuning, and solutions for cross-cloud pipelines.

This was a significant investment. It has taken several organizations and a lot of precious resources to get here. For example, Google has put in an incredible amount of work over a few years to create capabilities like Fairing, Metadata, and Pipelines that have changed the user experience. Organizations like JPMC and USBank have enriched the on-prem community with their feedback. Shell has contributed to Spark integration, and Arrikto has contributed to on-premise efforts and MiniKF.

We are very excited about the future of Kubeflow. We would like to see the community get stronger and more diverse, and we invite more individuals and organizations to join. It would be great for organizations to contribute to performance analysis for large-scale workloads, and we would like to call upon the research community to participate as well. The current Kubeflow features are the day-0 capabilities needed for consistent MLOps. We would like to see better interoperability with other MLOps tools and platforms like MLflow. We have a long way to go and many more capabilities to build (e.g., ML fairness, drift detection, debugging, CI/CD, etc.).

Acknowledgements

We congratulate the entire Kubeflow community for this great milestone and would like to take this opportunity to thank all contributors for their hard work and for helping cross many hurdles, big and small, throughout the lifetime of the project.

Finally, we would also like to give a shout-out to the key engineers from Cisco whose contributions made Cisco one of the top three contributing organizations by many metrics: @Elvira for PM, Johnu for operators and Katib, Xinyuan for Kubebench, Andrey for Katib, Adhita for the central UI and on-premises work, Krishna for auth and on-premise stability, Ramdoot for validating Kubeflow and providing feedback by internally developing and testing several ML applications on it, and Amit for evangelism and training. Apart from this core set of engineers, we are deeply indebted to a large number of colleagues, including interns, who have helped us develop code and demos, make public presentations, brainstorm on important features, and relentlessly evangelize the project, both internally and externally.