Chapter 1. Scoping: Why Before How
Most people start working with data from exactly the wrong end. They begin with a data set, then apply their favorite tools and techniques to it. The result is narrow questions and shallow arguments. Starting with data, without first doing a lot of thinking, without having any structure, is a short road to simple questions and unsurprising results. We don’t want unsurprising—we want knowledge.
As professionals working with data, our domain of expertise has to be the full problem, not merely the columns to combine, transformations to apply, and models to fit. Picking the right techniques has to be secondary to asking the right questions. We have to be proficient in both to make a difference.
To walk the path of creating things of lasting value, we have to understand elements as diverse as the needs of the people we’re working with, the shape that the work will take, the structure of the arguments we make, and the process of what happens after we “finish.” To make that possible, we need to give ourselves space to think. When we have space to think, we can attend to the problem of why and so what before we get tripped up in how. Otherwise, we are likely to spend our time doing the wrong things.
This can be surprisingly challenging. The secret is to have structure that you can think through, rather than working in a vacuum. Structure keeps us from doing the first things to cross our minds. Structure gives us room to think through all the aspects of a problem.
People have been creating structures to make thinking about problems easier for thousands of years. We don’t need to invent these things from scratch. We can adapt ideas from other disciplines as diverse as philosophy, design, English composition, and the social sciences to make professional data work as valuable as possible. Other parts of the tree of knowledge have much to teach us.
Let us start at the beginning. Our first place to find structure is in creating the scope for a data problem. A scope is the outline of a story about why we are working on a problem (and about how we expect that story to end).
In professional settings, the work we do is part of a larger goal, and so there are other people who will be affected by the project or are working on it directly as part of a team. A good scope gives us both a firm grasp on the outlines of the problem we are facing and a way to communicate with the other people involved.
A task worth scoping could be slated to take anywhere from a few hours with one person to months or years with a large team. Even the briefest of projects benefit from some time spent thinking up front.
A project scope has four parts: the context of the project; the needs that the project is trying to meet; the vision of what success might look like; and finally what the outcome will be, in terms of how the organization will adopt the results and how its effects will be measured down the line. When a problem is well-scoped, we will be able to easily converse about or write out our thoughts on each. Those thoughts will mature as we progress in a project, but they have to start somewhere. Any scope will evolve over time; no battle plan survives contact with opposing forces.
A mnemonic for these four areas is CoNVO: context, need, vision, outcome. We should be able to hold a conversation with an intelligent stranger about the project, and afterward he should understand, at a high level, why and how we accomplished what we accomplished. Hence, CoNVO.
All stories have a structure, and a project scope is no different. Like any story, our scope will have exposition (the context), some conflict (the need), a resolution (the vision), and hopefully a happily-ever-after (the outcome). Practicing telling stories is excellent practice for scoping data problems.
We will examine each part of the scoping process in detail before looking at a fully worked-out example. In subsequent chapters, we will explore other aspects of getting a good data project going, and then we will look carefully at the structures for thinking that make asking good questions much easier.
Writing down and refining our CoNVO is crucial to getting it straight. Clear writing is a sign of clear thinking. After we have done the thinking that we need to do, it is worthwhile to concisely write down each of these parts for a new problem. At least say them out loud to someone else. Having to clarify our thoughts down to a few sentences per part is extremely helpful. Once we have them clear (or at least know what is still unclear), we can go out and acquire data, clarify our understanding, start the technical work, clarify our understanding, gradually converge on something smart and useful, and…clarify our understanding. Data science is an iterative process.
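One lightweight way to keep a written CoNVO honest is to store it as a small record and check which parts are still empty. The sketch below is only an illustration: the `Scope` class, its `unclear_parts` method, and the draft text are hypothetical, not something this chapter prescribes.

```python
from dataclasses import dataclass, fields


@dataclass
class Scope:
    """A written CoNVO: a few sentences for each of the four parts."""
    context: str = ""
    need: str = ""
    vision: str = ""
    outcome: str = ""

    def unclear_parts(self):
        """Return the names of the parts that have not been written down yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft for an imagined news organization.
draft = Scope(
    context="A news organization that earns money from ads and subscriptions.",
    need="There is no agreed way to define an engaged reader.",
)

print(draft.unclear_parts())  # ['vision', 'outcome']
```

A list of empty parts is no substitute for the thinking itself, but it makes it obvious which conversations still need to happen before the technical work starts.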
Every project has a context, the defining frame that is apart from the particular problems we are interested in solving. Who are the people with an interest in the results of this project? What are they generally trying to achieve? What work, generally, is the project going to be furthering?
Here are some examples of contexts, very loosely based on real organizations, distilled down into a few sentences:
- This nonprofit organization reunites families that have been separated by conflict. It collects information from refugees in host countries. It visits refugee camps and works with informal networks in host countries further from conflicts. It has built a tool for helping refugees find each other. The decision makers on the project are the CEO and CTO.
- This department in a large company handles marketing for a shoe manufacturer with a large online presence. The department’s goal is to convince new customers to try its shoes and to convince existing customers to return again. The final decision maker is the VP of Marketing.
- This news organization produces stories and editorials for a wide audience. It makes money through advertising and through premium subscriptions to its content. The main decision maker for this project is the head of online business.
- This advocacy organization specializes in ferreting out and publicizing corruption in politics. It is a small operation, with several staff members who serve multiple roles. They are working with a software development team to improve their technology for tracking evidence of corrupt politicians.
Contexts emerge from understanding who we are working with and why they are doing what they are doing. We learn the context from talking to people, and continuing to talk to them until we understand what their long-term goals are. The context sets the overall tone for the project, and guides the choices we make about what to pursue. It provides the background that makes the rest of the decisions make sense. The work we do should further the mission espoused in the context. If it does not, we should at least be aware of that.
New contexts emerge with new partners, employers, or supervisors, or as an organization’s mission shifts over time. A freelancer often has to understand a new context with every project. It is important to be able to clearly articulate the long-term goals of the people we are looking to aid, even when embedded within an organization.
Sometimes the context for a project is simply our own curiosity and hunger for understanding. In moderation (or as art), there’s no problem with that. Yet if we treat every situation only as a chance to satisfy our own interests, we will soon find that we have passed up opportunities to provide value to others.
The context provides a project with larger goals and helps to keep us on track. Contexts include larger relevant details, like deadlines, that will help us to prioritize our work.
Everyone faces challenges: things that, were they to be fixed or understood, would advance the goals they want to reach. What are the specific needs that could be fixed by intelligently using data? These needs should be presented in terms that are meaningful to the organization. If our method will be to build a model, the need is not to build a model. The need is to solve the problem that having the model will solve.
Correctly identifying needs is tough. The opening stages of a data project are a design process; we can draw on techniques developed by designers to make it easier. Like a graphic designer or architect, a data professional is often presented with a vague brief to generate a certain spreadsheet or build a tool to accomplish some task. Something has been discussed, and perhaps a definite problem has even been articulated. But even if we are handed a definite problem, we are remiss to believe that our work in defining it ends there. As in all design processes, we need to keep an open mind. The needs we identify at the outset and the needs we ultimately try to meet are often not the same.
If working with data begins as a design process, what are we designing? We are designing the steps to create knowledge. A need that can be met with data is fundamentally about knowledge, fundamentally about understanding some part of how the world works. Data fills a hole that can only be filled with better intelligence. When we correctly explain a need, we are clearly laying out what it is that could be improved by better knowledge. What will this spreadsheet teach us? What will the tool let us know? What will we be able to do after making this graph that we could not do before?
Data science is the application of math and computers to solve problems that stem from a lack of knowledge, constrained by the small number of people with any interest in the answers. In the sciences writ large, questions of what matters within the field are set in conferences, by long social processes, and through slow maturation. In a professional setting, we have no such help. We have to determine for ourselves which questions are the important ones to answer.
It is instructive to compare data science needs to needs from other related disciplines. When success is judged not by knowledge but by uptime or performance, the task is software engineering. When the task is judged by minimizing classification error or regret, without regard to how the results inform a larger discussion, the task is applied machine learning. When results are judged by the risk of legal action or issues of compliance, the task is one of risk management. These are each valuable and worthwhile tasks, and they require similar steps of scoping to get right, but they are not problems of data science.
Consider some descriptions of some fairly common needs, all ones that I have seen in practice. Each of these is much condensed from how they began their life:
- The managers want to expand operations to a new location. Which one is likely to be most profitable?
- Our customers leave our website too quickly, often after only reading one article. We don’t understand who they are, where they are from, or when they leave, and we have no framework for experimenting with new ideas to retain them.
- We want to decide between two competing vendors. Which is better for us?
- Is this email campaign effective at raising revenue?
- We want to place our ads in a smart way. What should we be optimizing? What is the best choice, given those criteria?
And here are some famous ones from within the data world:
- We want to sell more goods to pregnant women. How do we identify them from their shopping habits?
- We want to reduce the amount of illegal grease dumping in the sewers. Where might we look to find the perpetrators?
Needs will rarely start out as clear as these. It is incumbent upon us to ask questions, listen, and brainstorm until we can articulate them clearly and they can be articulated clearly back to us. Again, writing is a big help here. By writing down what we think the need is, we will usually see flaws in our own reasoning. We are generally better at criticizing than we are at making things, but when we criticize our own work, it helps us create things that make more sense.
Like designers, we discover needs largely by listening to people, trying to condense what we understand, and bringing our ideas back to people again. Some partners and decision makers will be able to articulate what their needs are. More likely they will be able to tell us stories about what they care about, what they are working on, and where they are getting stuck. They will give us places to start. Sometimes those we talk with are too close to their task to see what is possible. We need to listen to what they are saying, and it is our job to go beyond listening and actively ask questions until we can clearly articulate what needs to be understood, why, and by whom.
Often the information we need to understand in order to refine a need is a detailed understanding of how some process happens. It could be anything from how a widget gets manufactured to how a student decides to drop out of school to how a CEO decides when to end a contract. Walking through that process one step at a time is a great tactic for figuring out how to refine a need. Drawing diagrams and making lists make this investigation clearer. When we can break things down into smaller parts, it becomes easier to figure out where the most pressing problems are. It can turn out that the thing we were originally worried about was actually a red herring or impossible to measure, or that three problems we were concerned about actually boiled down to one.
When possible, a well-framed need relates directly back to some particular action that depends on having good intelligence. A good need informs an action rather than simply informing. Rather than saying, “The manager wants to know where users drop out on the way to buying something,” consider saying, “The manager wants more users to finish their purchases. How do we encourage that?” Answering the first question is a component of doing the second, but the action-oriented formulation opens up more possibilities, such as testing new designs and performing user experience interviews to gather more data.
If it is not helpful to phrase something in terms of an action, it should at least be related to some larger strategic question. For example, understanding how users of a product are migrating from desktop to mobile versions of a website is useful for informing the product strategy, even if there is no obvious action to take afterward. Needs should always be specified in words that are important to the organization, even if they’re only questions.
Until we can clearly articulate the needs we are trying to meet, and until we understand how meeting those specific needs will help the organization achieve its larger goals, we don’t know why we’re doing what we’re hoping to do. Without that part of a scope, our data work is mostly going to be fluff and only occasionally worthwhile.
Continuing from the longer examples, here are some needs that those organizations might have:
- The nonprofit that reunites families does not have a good way to measure its success. It is prohibitively expensive to follow up with every individual to see if they have contacted their families. By knowing when individuals are doing well or poorly, the nonprofit will be able to judge the effectiveness of changes to its strategy.
- The marketing department at the shoe company does not have a smart way of selecting cities to advertise to. Right now it is selecting its targets based on intuition, but it thinks there is a better way. With a better way of selecting cities, the department expects sales will go up.
- The media organization does not know the right way to define an engaged reader. The standard web metric of unique daily users doesn’t really capture what it means to be a reader of an online newspaper. When it comes to optimizing revenue, growth, and promoting subscriptions, 30 different people visiting on 30 different days means something very different from 1 person visiting for 30 days in a row. What is the right way to measure engagement that respects these goals?
- The anti-corruption advocacy group does not have a good way to automatically collect and collate media mentions of politicians. With an automated system for collecting media attention, it will spend less time and money keeping up with the news and more time writing it.
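The media organization's point about unique daily users can be made concrete with a toy visit log. This is a sketch with made-up data: the function names `daily_unique_counts` and `days_active_per_reader` are assumptions for illustration, and days active is just one candidate measure, not a recommended engagement metric.

```python
from collections import defaultdict


def daily_unique_counts(visits):
    """Map each day to the number of distinct readers seen that day."""
    readers_by_day = defaultdict(set)
    for reader, day in visits:
        readers_by_day[day].add(reader)
    return {day: len(readers) for day, readers in readers_by_day.items()}


def days_active_per_reader(visits):
    """Map each reader to the number of distinct days they visited."""
    days_by_reader = defaultdict(set)
    for reader, day in visits:
        days_by_reader[reader].add(day)
    return {reader: len(days) for reader, days in days_by_reader.items()}


# Made-up logs of (reader_id, day) pairs covering 30 days.
one_loyal_reader = [("reader-0", day) for day in range(1, 31)]
thirty_drive_bys = [(f"reader-{day}", day) for day in range(1, 31)]

# Unique daily users cannot tell the two logs apart: one reader every day.
assert daily_unique_counts(one_loyal_reader) == daily_unique_counts(thirty_drive_bys)

# Days active per reader can: 30 days for one reader versus 1 day for each of 30.
print(max(days_active_per_reader(one_loyal_reader).values()))  # 30
print(max(days_active_per_reader(thirty_drive_bys).values()))  # 1
```

The two logs look identical under the standard metric and very different under the alternative, which is exactly the gap the stated need describes.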
Note that the need is never something like, “the decision makers are lacking in a dashboard,” or predictive model, or ranking, or what have you. These are potential solutions, not needs. Nobody except a car driver needs a dashboard. The need is not for the dashboard or model, but for something that actually matters in words that decision makers can usefully think about.
This is a point that bears repeating. A data science need is a problem that can be solved with knowledge, not a lack of a particular tool. Tools are used to accomplish things; by themselves, they have no value except as academic exercises. So if someone comes to you and says that her company needs a dashboard, you need to dig deeper. Usually what the company needs is to understand how it is performing so it can make tactical adjustments. A dashboard may be one way of accomplishing that, but so is a weekly email or an alert system, both of which are more likely to be incorporated into someone’s workflow.
Similarly, if someone comes to you and tells you that his business needs a predictive model, you need to dig deeper. What is this for? Is it to change something that he doesn’t like? To make accurate predictions to get ahead of a trend? To automate a process? Or does the business need to generalize to a new case that’s unlike any seen in order to inform a decision? These are all different needs, requiring different approaches. A predictive model is only a small part of that.
Before we can start to acquire data, perform transformations, test ideas, and so on, we need some vision of where we are going and what it might look like to achieve our goal.
The vision is a glimpse of what it will look like to meet the need with data. It could consist of a mockup describing the intended results, or a sketch of the argument that we’re going to make, or some particular questions that narrowly focus our aims.
Someone who is handed a data set and has not first thought about the context and needs of the organization will usually start and end with a narrow vision. It is rarely a good idea to start with data and go looking for things to do. That leads to stumbling on good ideas, mostly by accident.
Having a good vision is the part of scoping that is most dependent on experience. The ideas we will be able to come up with will mostly be variations on things that we have seen before. It is tremendously useful to acquire a good mental library of examples by reading widely and experimenting with new ideas. We can expand our library by talking to people about the problems they’ve solved, reading books on data science or reading classics (like Edward Tufte and Richard Feynman), following blogs, attending conferences and meetups, and experimenting with new ideas all the time.
There is no shortcut to gaining experience, but there is a fast way to learn from your mistakes, and that is to try to make as many of them as you can. Especially if you are just getting started, creating things in quantity is more important than creating things of quality. There is a saying in the world of Go (the East Asian board game): lose your first fifty games of Go as quickly as possible.
The two main tactics we have available to us for refining our vision are mockups and argument sketches.
A mockup is a low-detail idealization of what the final result of all the work might look like. Mockups can take the form of a few sentences reporting the outcome of an analysis, a simplified graph that illustrates a relationship between variables, or a user interface sketch that captures how people might use a tool. A mockup primes our imagination and starts the wheels turning about what we need to assemble to meet the need. Mockups, in one form or another, are the single most useful tool for creating focused, useful data work (see Figure 1-1).
Mockups can also come in the form of sentences:
Keep in mind that a mockup is not the actual answer we expect to arrive at. Instead, a mockup is an example of the kind of result we would expect, an illustration of the form that results might take. Whether we are designing a tool or pulling data together, concrete knowledge of what we are aiming at is incredibly valuable.
Without a mockup, it’s easy to get lost in abstraction, or to be unsure what we are actually aiming toward. We risk missing our goals completely while the ground slowly shifts beneath our feet. Mockups also make it much easier to focus in on what is important, because mockups are shareable. We can pass our few sentences, idealized graphs, or user interface sketches off to other people to solicit their opinion in a way that diving straight into source code and spreadsheets can never do.
A mockup shows what we should expect to take away from a project. In contrast, an argument sketch tells us roughly what we need to do to be convincing at all. It is a loose outline of the statements that will make our work relevant and correct. While they are both collections of sentences, mockups and argument sketches serve very different purposes. Mockups give a flavor of the finished product, while argument sketches give us a sense of the logic behind the solution.
For example, if we want to know whether women and men are equally interested in flexible time arrangements, there are a few parts to making a convincing case. First, we need to have a good definition of who the women and men are that we are talking about. Second, we need to decide if we are interested in subjective measurement (like a survey), if we are interested in objective measurement (like the number of applications for a given job), or if we want to run an experiment. For the experiment, we could post the same job description but only show postings with flexible time to half of the people who visit a job site. There are certain reasons to find each of these compelling, ranging from the theory of survey design to mathematical rules for the design of experiments.
Thinking concretely about the argument made by a project is a valuable tool for orienting ourselves. Chapter 3 goes into greater depth about what the parts of an argument are and how they relate to working with data. Arguments occur both in a project and around the project, informing both its content and its rationale.
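The experiment option could be sketched as follows: split visitors into two groups at random, show the flexible-time posting to only one group, and compare application rates. Everything here is hypothetical, including the function names, the visitor IDs, and the application counts. The two-proportion z statistic is one common way to compare the groups, not a method the text specifies.

```python
import math
import random


def assign_groups(visitor_ids, seed=0):
    """Randomly split visitors in half: 'flexible' sees the flexible-time posting."""
    shuffled = list(visitor_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {"flexible": shuffled[:half], "standard": shuffled[half:]}


def two_proportion_z(applied_a, shown_a, applied_b, shown_b):
    """z statistic for the difference between two application rates."""
    rate_a = applied_a / shown_a
    rate_b = applied_b / shown_b
    pooled = (applied_a + applied_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    return (rate_a - rate_b) / std_err


groups = assign_groups(range(1000))
print(len(groups["flexible"]), len(groups["standard"]))  # 500 500

# Placeholder counts, not real results: 60 of 500 applied versus 40 of 500.
z = two_proportion_z(60, 500, 40, 500)
print(round(z, 2))
```

To speak to the original question, the same comparison would be run separately for the women and the men defined in the first step, which is why that definition has to come before any data is collected.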
Pairing written mockups and written argument sketches is a concise way to get our understanding across, though sometimes one is more appropriate than the other. Continuing again with the longer examples:
- Example 1
- Vision: The nonprofit that is trying to measure its successes will get an email of key performance indicators on a regular basis. The email will consist of graphs and automatically generated text.
- Mockup: After making a change to our marketing, we hit an enrollment goal this week that we’ve never hit before, but it isn’t being reflected in the success measures.
- Argument sketch: The nonprofit is doing well (or poorly) because it has high (or low) values for key performance indicators. After seeing the key performance indicators, the reader will have a good sense of the state of the nonprofit’s activities and will be able to adjust accordingly.
- Example 2
Here are several ideas for the marketing department looking to target new cities, depending on the details of the context:
- Idea 1
- Vision: The marketing department that wants to improve its targeting will get a report that ranks cities by their predicted value to the company.
- Mockup: Austin, Texas, would provide a 20% return on investment per month. New York City would provide an 11% return on investment per month.
- Argument sketch: The department should focus on city X, because it is most likely to bring in high value. The definition of high value that we’re planning to use is substantiated for the following reasons….
- Idea 2
- Vision: The marketing department will get some software that implements a targeting model, which chooses a city to place advertisements in. Advertisements will be targeted automatically based on the model, through existing advertising interfaces.
- Mockup: 48,524 advertisements were placed today in 14 cities. 70% of them were in emerging markets.
- Argument sketch: Advertisements should be placed proportional to their future value. The department should feel confident that this automatic selector will be accurate without being watched.
- Idea 3
- Vision: The marketing department will get a spreadsheet that can be dropped into the existing workflow. It will fill in some characteristics of a city and the spreadsheet will indicate what the estimated value would be.
- Mockup: By inputting gender and age skew and performance results for 20 cities, an estimated return on investment is placed next to each potential new market. Austin, Texas, is a good place to target based on age and gender skew, performance in similar cities, and its total market size.
- Argument sketch: The department should focus on city X, because it is most likely to bring in high value. The definition of high value that we’re planning to use is substantiated for the following reasons….
- Example 3
- Vision: The media organization trying to define user engagement will get a report outlining why a particular user engagement metric is the ideal one, with supporting examples; models that connect that metric to revenue, growth, and subscriptions; and a comparison against other metrics.
- Mockup: Users who score highly on engagement metric A are more likely to be readers at one, three, and six months than users who score highly on engagement metrics B or C. Engagement metric A is also more correlated with lifetime value than the other metrics.
- Argument sketch: The media organization should use this particular engagement metric going forward because it is predictive of other valuable outcomes.
- Example 4
- Vision: The developers working on the corruption project will get a piece of software that takes in feeds of media sources and rates the chances that a particular politician is being talked about. The staff will set a list of names and affiliations to watch for. The results will be fed into a database, which will feed a dashboard and email alert system.
- Mockup: A typical alert is that politician X, who was identified based on campaign contributions as a target to watch, has suddenly shown up on 10 news talk shows.
- Argument sketch: We have correctly kept tabs on politicians of interest, and so the people running the anti-corruption project can trust this service to do the work of following names for them.
In mocking up the outcome and laying out the argument, we are able to understand what success could look like. The final result may differ radically from what we set out to do. Regardless, having a rough understanding at the outset of a project is important. It is also okay to have several potential threads at this point and be open to trying each, such as with the marketing department example. They may end up complementing each other.
The most useful part of making mockups or fragments of arguments is that they let us work backward to fill in what we actually need to do. If we’re looking to send an email of key performance indicators, we’d better come up with some to put into the email. If we’re writing a report outlining why one engagement metric is the best and tying it to a user valuation model, we need to come up with an engagement metric and find or develop a user valuation model. The pieces start to fall into place.
At the end of everything, the finished work will often be fairly simple. Because of all of the work done in thinking about context and need, generating questions, and thinking about outcomes, our work will be the right kind of simple. Simple results are the most likely to get used.
They will not always be simple, of course. Having room to flesh out complicated ideas is part of the point of thinking so much at the outset. When our work is complicated, we will benefit even more from having thought through some of the parts first.
When we’re having trouble articulating a vision, it is helpful to start getting something down on paper or out loud to prime our brains. Drawing pretend graphs, talking through examples, making flow diagrams on whiteboards, and so on, are all good ways to get the juices flowing.
We need to understand how the work will actually make it back to the rest of the organization and what will happen once it is there. How will it be used? How will it be integrated into the organization? Who will own its integration? Who will use it? In the end, how will its success be measured?
If we don’t understand the intended use of what we produce, it is easy to get lost in the weeds and end up making something that nobody will want or use. What’s the purpose of all this work if it does nobody any good?
The outcome is distinct from the vision; the vision is focused on what form the work will take at the end, while the outcome is focused on what will happen when we are “done.” Here are the outcomes for each of the examples we’ve been looking at so far:
- The metrics email for the nonprofit needs to be set up, verified, and tweaked. Sysadmins at the nonprofit need to be briefed on how to keep the email system running. The CTO and CEO need to be trained on how to read the metrics emails; the training will consist of a document written to explain them.
- The marketing team needs to be trained in using the model (or software) in order to have it guide their decisions, and the success of the model needs to be gauged in its effect on sales. If the result ends up being a report instead, it will be delivered to the VP of Marketing, who will decide based on the recommendations of the report which cities will be targeted and relay the instructions to his staff. To make sure everything is clear, there will be a follow-up meeting two weeks and then two months after the delivery.
- The report going to the media organization about engagement metrics will go to the head of online business. If she signs off on its findings, the selected user engagement metric will be incorporated by the business analysts into the performance measures across the entire organization. Funding for existing and future initiatives will be based in part on how they affect the new engagement metric. A follow-up study will be conducted in six months to verify that the new metric is successfully predicting revenue.
- The media mention finder needs to be integrated with the existing mention database. The staff needs to be trained to use the dashboard. The IT person needs to be informed of the existence of the tool and taught how to maintain it. Periodic updates to the system will be needed in order to keep it correctly parsing new sources, as bugs are uncovered. The developers who are doing the integration will be in charge of that. Three months after the delivery, we will follow up to check on how well the system is working.
Figuring out what the right outcomes are boils down to three things. First, who will have to handle this next? Someone else is likely to have to interpret or implement or act on our work. Who are they, what are their requirements, and what do we need to do differently from our initial ideas to address their concerns?
Second, who or what will handle keeping this work relevant, if anyone? Do we need to turn our work into a piece of software that runs repeatedly? Will we have to return in a few months? More often than not, analyses get re-run, even if they are architected to be run once.
Third, what do we hope will change after we have finished the work? Note again that “having a model” is not a suitable change; what, in terms that matter to the partners, will have changed? How will we verify that this has happened?
Thinking through the outcome before embarking on a project, along with knowing the context, identifying the right needs, and honing our vision, improves the chance that we will do something that actually gets used.
Seeing the Big Picture
Tying everything together, we can see that each of these parts forms a coherent narrative about what we might accomplish by working with data to solve this problem.
First, let’s see what it would look like to sketch out a problem without much structured thinking:
We will create a logistic regression of web log data using SAS to find patterns in reader behavior. We will predict the probability that someone comes back after visiting the site once.
Compare this to a well-thought-out scope:
This media organization produces news for a wide audience. It makes money through advertising and premium subscriptions to its content. The person who asked for some advice is the head of online business.
This organization does not know the right way to define an engaged reader. The standard web metric of unique daily users doesn’t really capture what it means to be a reader of an online newspaper. When it comes to optimizing revenue, growth, and promoting subscriptions, 30 different people visiting on 30 different days means something very different from 1 person visiting for 30 days in a row. What is the right way to measure engagement that respects these goals?
When this project is finished, the head of online business will get a report outlining why a particular user engagement metric is the ideal one, with supporting examples; models that connect that metric to revenue, growth, and subscriptions; and a comparison against other metrics.
If she signs off on its findings, the selected user engagement metric will be incorporated into the performance measures across the entire organization. Institutional support and funding for existing and future initiatives will be based in part on how they affect the new engagement metric. A follow-up study will be conducted in six months to verify that the new metric is successfully predicting revenue, growth, and subscription rates.
A good story about a project and a good scope of a project are hard to tell apart.
It is clear that at the outset, we do not actually know what the right metric will be or even what tools we will use. Focusing on the math or the software at the expense of the context, need, vision, and outcome means wasted time and energy.