Data Science Knowledge Management with a Community of Practice

The reorganization of companies from a collection of knowledge-oriented functional teams into a constellation of product-focused cross-functional teams dramatically reduced the time required to develop and release new products. Indeed, it is common today for data-driven products to be created by teams comprising data scientists, data engineers, ML ops specialists, and analytics translators. But these gains in delivery velocity come at the expense of splintering the knowledge of the organization’s functional communities, and most companies continue to struggle with knowledge management in the cross-functional setting.

This post describes the nature of the knowledge management challenge; argues why we need to reconsider the traditional, IT-based approach to it; and proposes a social alternative based on the concept of a community of practice.

A community divided

I want to start by illustrating the situation I encountered at a recent client, which I’m sure sounds familiar to many.

This organization’s data scientists (and engineers) were split into three isolated teams; the only moment in the week when members of the three teams would mix, aside from occasionally walking to the canteen together, was a 15-minute logistics-focused all-hands meeting on Monday morning. Each team had its own projects, and consequently its own stakeholders in the larger organization, but sometimes there was overlap in the product owners they’d collaborate with and frequently there was overlap in the projects’ data sources. Additionally, onboarding was very team-focused: a blinders-on approach aimed at getting new hires productive on their project as quickly as possible, at the expense of general organizational knowledge and relationships with their fellow data scientists.


What were the consequences? For one, technical knowledge, such as good Python programming practices, the nuances of working with Delta Lake, or what the spectrum of visualization libraries looks like, remained siloed within each team, even though multiple teams used the same tools and techniques. Moreover, domain knowledge, such as the meanings of various categorical abbreviations found in common datasets, how to treat certain columns’ outliers, or who in the organization knew the most about a particular process, was also siloed. The result was frequent "reinventing the wheel" scenarios; and because this technical and domain knowledge wasn’t spread and reinforced in other teams, when a senior data scientist left the organization, that knowledge often left with her.

Even if there had been greater contact between teams, the inconsistency in data science practice, for instance in differing code styles, repository structures, and tooling, made it harder for data scientists to share their knowledge with colleagues in other teams, to move between teams, and to onboard new hires.

The challenge of knowledge management

The above describes an organization struggling with knowledge management: identifying the knowledge that enables the organization’s data scientists to perform their job effectively, keeping that knowledge up to date, and spreading it across the community even though the members of that community are distributed across the organization. Many companies would respond to this challenge by attempting to encode the above knowledge in an internal wiki, such as Confluence, and this organization was no different. But such static knowledge stores typically become knowledge graveyards: community members lack incentives to update their pages as conditions evolve. And even if a community member should dare delve into its likely-outdated pages, a wiki’s discovery functionality is often weak: search is limited to keyword scans, and announcements of new content are typically shunted automatically to a mailbox that’s never read.

The most effective way of storing, updating, and sharing knowledge in an organization looks very different from a wiki.

According to the excellent Cultivating Communities of Practice by E. Wenger, R. McDermott, and W. M. Snyder (“CCoP”), the principal reason that these wiki-like knowledge management approaches fail is that information is frequently confused with knowledge. What distinguishes the two? CCoP points to four characteristics of knowledge:

Knowledge lives in the human act of knowing.

Sure, you can read a blogged case study to learn a bit about doing better data science. But it is far more effective to put what you read into practice, developing a feel for the conditions under which it applies and how to implement it at your particular organization. Even better is getting feedback from a more experienced practitioner as you go along.

Knowledge is tacit as well as explicit.

In the same way that a trained neural network is far more effective at distinguishing between images of cats and dogs than an arbitrarily long cascade of if-else statements, the craft of doing data science cannot be reduced to an instruction manual. Call it taste, or feel, or instinct, but there are some aspects of the job which cannot easily be put into words.

Knowledge is social as well as individual.

No single data scientist in your organization is a master of all aspects of the job. Combine this with the impossibility, noted above, of writing everything about the job down with perfect fidelity, and we see that the strength of an organization’s data science knowledge necessarily arises from data scientists sharing their perspectives with one another and sometimes butting heads.

Knowledge is dynamic.

Never mind how rapidly the palette of data science techniques evolves; the tooling alone changes at a pace that makes it pointless to evaluate what works best once and never revisit that choice. An organization’s data science community needs to be constantly undergoing a process of discovery, evaluation, and integration, and for this reason a wiki will only ever be as useful as it is up to date.

What is knowledge management? Source: https://www.linkedin.com/pulse/evolving-online-education-learning-together-ankit-mittal

As a consequence of the four characteristics above, the most effective way of storing, updating, and sharing knowledge in an organization looks very different from a wiki. Stewarding knowledge must involve people who interact both regularly and frequently, acting as a "living repository" for the organization’s knowledge. They must engage voluntarily, on the basis of collegial rather than authoritative relationships. And there can be written guidance, but only insofar as it supplements conversation, coaching, and storytelling.

How we moved beyond the wiki

A community of practice is an approach for managing knowledge which acknowledges the differences between information and knowledge. According to the authors of CCoP, it is:

A group of people who share a concern, a set of problems, or a passion about a topic, and who deepen their knowledge and expertise in this area by interacting on an ongoing basis…[A] unique combination of three fundamental elements: a domain of knowledge, which defines a set of issues; a community of people who care about this domain; and the shared practice that they are developing to be effective in their domain.

In the case of my client, because the number of data scientists in the organization was small (fewer than fifteen), we could satisfy this definition simply by gathering all data scientists together to talk shop. (As the role of data science in the organization grows, it could make sense to split off child communities with narrower domains like marketing or more specialized practices like deep learning.) We implemented this by booking a conference room where the data scientists from all three teams could meet for an hour a week before heading down together for lunch. We dot-voted on weekly topics at the beginning of every quarter, leaving some weeks free so that a team stuck on its project could ask for an impromptu community brainstorming session. Members volunteered to present on topics or to coordinate "guest lectures" from a neighboring domain, and they were free to organize the session as they saw fit. Sometimes when the idea was to hack on a new library or data set, we’d eat lunch in the conference room and extend the session an extra hour, and once we organized a day-long experimentation session.

The community of practice has a coordinator role, whose responsibility is to facilitate community meetings. While she can run the dot-voting sessions, maintain a barebones "meeting notes" wiki, or book the conference room, the coordinator is expressly not responsible for generating session topic ideas, creating presentations, or tracking attendance. (That said, researching why some members don’t regularly participate, and, if appropriate, addressing those reasons with the community so that attendance-encouraging changes can be made, could be a task the coordinator takes up.) In the case of my client, I, rather than the Lead Data Scientist, volunteered to act as the community’s first coordinator, in order to help assure its members that neither participation nor the focus of any particular session was something demanded by management.

Results, expected and unexpected

So what did this data science community of practice deliver in its first six months? Of course, plenty of knowledge-sharing concerning statistical techniques, visualization, data set oddities, model monitoring, etc. Additionally, a much greater sense of camaraderie among the data scientists, fostered by goals that spanned projects and by simply having a venue to sit in a room together and talk shop. But a few unexpected things came out of it, as well.


For one, the community conducted a way-of-working workshop with the goal of standardizing "80%" of how all teams practice data science. The policies that resulted led to faster data product development, lowered barriers to providing input in other teams’ code reviews and brainstorming sessions, and facilitated both hiring and inter-team mobility. (More on why and how we conducted this workshop in a future post!)

Second, a new relationship-based onboarding process. This new process not only ensures new hires have access to all data science tooling as quickly as possible, but also gives a broad overview of the domain-specific aspects of doing data science at the organization through a combination of presentations and informational coffee sessions with domain experts in the business. Moreover, it ensures the new hire feels like she’s joining a community of data scientists and not just a project team: she is assigned an onboarding buddy to guide her through the process, and she has coffee with data scientists from all teams, during which she learns both about other teams’ projects and about the backgrounds of their members.

Third, an internal data science library created by the more engineering-minded members of the community. This actively-maintained library contains frequently-used PySpark-based feature engineering functions and combined data set views; along with eliminating considerable duplicated technical effort across teams, it gave this group the opportunity to develop their software engineering capabilities.
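To give a flavor of what such a shared library can contain, here is a minimal, hypothetical sketch in PySpark; the function and column names (add_recency_features, customer_orders_view, event_date, customer_id, order_id) are illustrative assumptions and not taken from the client’s actual codebase.

```python
# Hypothetical sketch of a shared feature-engineering helper and a combined
# data set view, of the kind an internal data science library might collect.
from pyspark.sql import DataFrame, functions as F


def add_recency_features(df: DataFrame, event_date_col: str = "event_date") -> DataFrame:
    """Add a days-since-event column computed relative to today's date."""
    return df.withColumn(
        "days_since_event",
        F.datediff(F.current_date(), F.col(event_date_col)),
    )


def customer_orders_view(customers: DataFrame, orders: DataFrame) -> DataFrame:
    """Combined view joining customers to their orders, deduplicated on order_id."""
    return (
        customers
        .join(orders, on="customer_id", how="left")
        .dropDuplicates(["order_id"])
    )
```

Centralizing even small helpers like these means every team computes the same feature the same way, and fixes or improvements land in one place.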

Lastly, the data engineers of the organization were inspired to form a community of practice of their own, as well.

Acknowledgments

Special thanks to Steven Nooijen for introducing GoDataDriven to the concept of a community of practice and for daring to be first to implement one at one of our clients. Read more about the Randstad case study. In addition to Steven, thanks also to my colleagues Arjan van den Heuvel and Julian de Ruiter for providing feedback on this blog post.

Postscript

It bears noting that, on the surface, this structure very much resembles that of a guild or chapter from the Spotify Agile model; however, communities of practice differ from both. Guilds are more open, spanning multiple roles across all of the organization’s tribes, and as a consequence have a less focused domain and encompass a larger range of practices. Chapters are headed by a line manager who is responsible for setting salaries and designing development paths for members, and who is thus accountable for the success of the chapter. Chapters therefore put a particular emphasis on practice over domain knowledge and forgo the engagement and innovation that come from removing reporting relationships. That said, a community of practice can act as a stepping stone to a constellation of chapters as the organization grows and matures.
