Posts tagged ‘Metrics’

Lightweight Experiments for Process Improvement

[This post is a recap on the second talk I gave at XP2010. This was the big one, the experience report talk, one of 15 experience reports published at XP2010. You can download the slides (pdf) or the full paper (pdf) from this website or from XP2010.org.]

Process improvement is important for nearly all teams but it can sometimes be difficult for a team to know what is working, what isn’t working, and what techniques or methods to try when attempting to improve. Performing a scientific experiment is one way to help overcome these problems but, as academic research has shown us, while experimentation can yield interesting results, running an experiment is time consuming, expensive, and requires some serious thinking and control to pull off. From a practitioner’s standpoint this means that experimentation is a non-starter.

Of course, that’s only if you run experiments like an academic.

Banner from the XP2010 conference in front of the hotel.

Back Story

Just over a year ago, my MSE studio team at Carnegie Mellon had a problem. We had decided we would use Extreme Programming for the construction phase of our project but some team members had doubts concerning pair programming. We had decided that we would use some kind of peer review, having already seen the many benefits of inspection when reviewing other artifacts. The dispute arose over whether pair programming would give similar enough results. Also, not all team members had experience with pair programming but everyone on the team knew and enjoyed solo programming.

The number one concern was whether pair programming would allow us to meet our very strict deadline. We had just over three months to complete the construction phase of the project. According to our threshold of success this meant implementing all “must have” requirements with a minimum level of quality. Did we really have time to waste by having two people working on the same code at the same time? Wouldn’t working independently and inspecting code on an as needed basis allow us to get more work done faster?

At the time it just so happened that I was taking a reading class with Mary Shaw and in that class we discussed some research findings that might help settle this debate. Research from Laurie Williams, Ward Cunningham, Barry Boehm, and many others showed that pair programming requires more effort (although never double the effort) but is faster than programming alone (pdf). Also pair programming creates code of about the same quality as coding alone with inspection (pdf). Of course, the research may not apply to us since Square Root is closer to a professional team working on a large project with a real client, not undergrads working on short term toy projects.

After an iteration where some teammates used pair programming and others refused, we decided to try an experiment to see which practice actually worked better. The original idea was that we might be able to validate some of the research but decided instead that it was more important just to resolve our own internal conflicts and figure out which processes worked better.

Conducting a Lightweight Experiment

With the scientific method as our guide we planned and executed a lightweight experiment which pitted programming alone against pair programming. The results were amazing (and you can find the raw data in our project archive). In conducting the experiment we used a set of novel techniques which I think can be useful in conducting other lightweight experiments. There’s more background in the experience report so I’m only putting the meaty stuff in this post.

Focus narrowly on a single question – The essential key to keeping an experiment light is to only tackle one thing at a time. In this case we focused on comparing and contrasting a single technique, pair programming, rather than multiple techniques or an entire process (such as XP vs. TSP).

Divide work, not teams – If I were comparing pair programming to programming alone in an academic setting, I would put together two teams of about the same experience and have them each build their own version of the same software, one team using pair programming, the other programming alone. In a business setting this is a complete waste and few companies can afford to have two teams duplicating effort. By dividing work instead of teams you may lose some control over variables in the experiment but in most cases isolating more variables doesn’t add any further clarity to helping answer the narrowly focused question. To divide work successfully you need to have some way of estimating work units for division. We used use case points as shown in the figure depicting our modified planning game.

Steps in modified planning game for dividing work into experiment groups
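As a rough illustration of dividing work rather than teams, here is a minimal sketch that splits an iteration's features into two groups of roughly equal estimated size. It is not the modified planning game itself; the greedy strategy, the `Feature` type, the `divide_work` function, and the backlog values are all assumptions made up for the example.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    points: int  # estimated size, e.g. use case points


def divide_work(features: list[Feature]) -> tuple[list[Feature], list[Feature]]:
    """Greedily split features into two groups with similar total points."""
    paired: list[Feature] = []
    solo: list[Feature] = []
    paired_total = solo_total = 0
    # Place the largest features first so the smaller ones can even things out.
    for feature in sorted(features, key=lambda f: f.points, reverse=True):
        if paired_total <= solo_total:
            paired.append(feature)
            paired_total += feature.points
        else:
            solo.append(feature)
            solo_total += feature.points
    return paired, solo


# Hypothetical iteration backlog; names and point values are made up.
backlog = [
    Feature("login", 8),
    Feature("search", 5),
    Feature("export", 5),
    Feature("reports", 3),
    Feature("settings", 2),
]
paired, solo = divide_work(backlog)
print("paired:", [f.name for f in paired], sum(f.points for f in paired))
print("solo:  ", [f.name for f in solo], sum(f.points for f in solo))
```

A greedy split like this will not always be perfectly balanced, but for a lightweight experiment "about half" is close enough.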

Continue making releases – Since we still needed to make a comparison, rather than dividing into teams and duplicating effort we divided the features that were released each iteration. In this way we built about half the features released during an iteration using each technique. Working on about half the features using pair programming meant that at least some features were being built by individuals. At the time this was a risk reduction decision to make sure that if pair programming completely failed we’d still have something to ship at the end of the iteration. Explicitly managing risks is the only way to know if the lightweight experiment may cause problems for making releases. Also, we had a strictly defined cut-off for stopping the experiment if it ever stopped us from shipping to our client.

Use the data you have – In almost all cases we were able to get the data we needed to evaluate our hypothesis from our current process. When we couldn’t, we only had to make minor modifications to our data collection practices, for example adding a check box to our SharePoint server for indicating whether a task was paired or individual.

One of the more interesting things we did was to create a “tally sheet” for collecting pair programming issue detection statistics in real time, as the issues were discovered. Given the near instantaneous code-inspect-fix cycle when programming in pairs, this was the only way to collect similar data for comparing pair programming to inspection.

Example of a real time tally sheet used for tracking issues discovered while pair programming.

Statistical significance is overrated – The whole point of running a lightweight experiment is to collect just enough data to help you make a better decision or validate your gut feeling. This technique is not meant for uncovering universal truths or proving something to the rest of the world. In exchange for keeping the experiment light, the results will only apply to your team. Over the course of an iteration or two, 4-6 weeks, you’ll only get enough data to start to see trends. In our case the results were not statistically significant using individual T-tests but that didn’t matter. The most important thing is that we had data that could be used for comparison, data that everyone felt good about and that helped us gain clarity into what we did and how well it worked.
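For readers who want to run the same kind of check on their own data, the sketch below computes a t statistic for two small samples using only the standard library. The post only says individual t-tests were used; choosing Welch's variant here is an assumption, and the sample values are hypothetical, not the Square Root team's data.

```python
import math
from statistics import mean, variance


def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical hours per unit of work; NOT the team's actual measurements.
paired_hours = [4.0, 5.0, 6.0, 5.0]
solo_hours = [6.0, 7.0, 5.0, 8.0]

t_stat = welch_t(paired_hours, solo_hours)
print(f"mean paired={mean(paired_hours):.2f}, "
      f"mean solo={mean(solo_hours):.2f}, t={t_stat:.2f}")
```

With only a handful of data points per group, a statistic like this will rarely clear a conventional significance threshold, yet the difference in means can still be enough to start a useful conversation.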

Retrospectives get immediate value – The whole reason the experiment is light is to reduce cost and decrease the lag time to providing value to the team. Just to give you a little perspective, it took us 6 weeks to run the experiment and we had enough data and casual observations to make a decision during the retrospective when the analyzed data was shared. That event occurred in early August of 2009. This experience report required almost nine full months of gestation from the paper proposal to the talk I gave at the conference. The gestation period on “universal truth” research can be even longer. We, as practitioners, don’t have to wait for those universal truths to be born to get value from research. By running your own quick and dirty, lightweight experiments, you can get results in a timely fashion that you know will apply to your team because your team was the subject of the experiment. It’s all about closing the gaps between research and practice and taking the information you need now instead of waiting for academic research to catch up.

Overall Conclusions

For the Square Root team it turned out that pair programming was faster, cheaper, and produced code that had more predictable albeit slightly worse quality. The more important lesson is that we discovered a technique, lightweight experimentation, for learning other interesting things about our team and about software engineering in general.

My paper and this blog post were all about trying to describe the technique, using our experiment as an example. I think it would be awesome if teams around the world conducted lightweight experiments on a variety of topics. If enough folks share what they learn, we might start to see trends emerge across teams that could lead to universal truths, validate research, or at least discover some great rules of thumb.

What else might make for a great experiment? Anything you’ve got a question about on your team!

  • What is the clearer way to write requirements, user stories or use cases?
  • Which estimation technique is more accurate of X and Y?
  • Can we skip unit testing if we use inspection (looking at quality, knowledge sharing)?
  • Is UML a better design notation than the one we made up as a team?
  • What else…?

If you do a lightweight experiment, let me know! Share what you learn as a blog post or whitepaper. Let others know what you’ve learned! Even if the specific results only apply to your team and the way you’ve executed your project, your experiences help form a baseline, a sort of shared understanding for how software development works, how some of these practices work. And there’s so much about software engineering that we have yet to learn.

Acknowledgements

This paper was my first experience report and it was an awesome journey. Naturally a lot of folks helped me along the way and I would like to take a moment to make sure they know that I appreciate their influences and support. The Square Root team: Marco Len, Yi-Ru Liao, Abin Shahab, and especially my fellow experiment co-champion Sneader Sequeira for having the guts to go along with this idea in the first place. Some of the faculty at Carnegie Mellon: Dave Root and John Robert (my studio mentors) for bringing up the idea of writing a paper, and Jonathan Aldrich for helping review my proposal. Artem Marchenko was my XP2010 paper shepherd after the proposal was accepted, and the quality of each draft only improved because of his inputs. A group of my fellow employees at Net Health Systems sat through an early draft of the presentation I gave and shared valuable feedback for improving it. And finally I thank Marie, my wife, who was with me from start to finish and read more drafts and sat through more practice talks than anyone else. She’s probably as much an expert on this subject by now as I.

A Final Aside

I wrote the initial draft of this paper as my final reflection paper for my Master of Software Engineering degree (pdf). That draft has a very different tone, approach, conclusion, and direction than what I eventually published for XP2010. This is half due to there not being a hard page limit but also I had a lot more time to think about what was really important when writing for XP2010. There’s some interesting information, mostly in the lessons learned, that might prove interesting to those who are interested. You should check out my Square Root teammates’ reflection papers as well since they are all interesting and well written.

The Reality of Risk Exposure

Over the past few weeks I’ve been thinking a lot about risk exposure in the context of managing projects. Exposure is a technique used almost universally when managing risks, yet as I’ve already discussed, exposure can cause major problems because it’s a precise number based on mostly made-up information. At the same time, exposure is used widely and successfully – otherwise there wouldn’t be as much literature throughout the web telling you to calculate risk exposure.

This begs the question: is risk exposure really as meaningless as I’ve made it out to be? I’ve collected some data that helps answer this question.

Data Collection and Context

Risk management is one of the basic subjects covered in the Managing Software Development course, one of the five core courses students of the Carnegie Mellon Master of Software Engineering program take in completing their degree. Students learn about the continuous risk management paradigm from the Software Engineering Institute. Two of the cornerstones of this technique are threshold of success and condition-consequence based risk statements.

Having ready access to risk management experts at the SEI, nearly every team conducts a facilitated small team risk evaluation workshop in which risks are collected with the help of a taxonomy-based questionnaire (pdf), analyzed, and prioritized using group multi-voting. The basic workshop has been conducted the same way for close to a decade and many teams have put their risk data collected during the workshop in the MSE’s project archive.

I’ve gathered data from these small team risk evaluation workshops for 9 MSE Studio teams, a total of 164 identified, analyzed, prioritized risks.

What’s in the Data?

During a risk evaluation workshop, teams identify risks using their threshold of success as a guide. Once identified, risks are briefly analyzed and assigned an impact, probability, and time frame value based on a rough average from the team members’ initial gut feeling on the risk. These values are assigned simply so when a manager asks to see the probability, for example, there is a value to give him. Each of impact, probability, and time frame can only be one of 3-4 values. The idea is that by decreasing the precision we can increase the accuracy. Values are assigned based on a rubric. For the purposes of calculating a risk exposure I assigned each of the analysis categories a number. Time frame is not used in calculating exposure.

Impact

  • Catastrophic – The team will be unable to meet threshold of success. (numeric value 4)
  • Critical – The team can only meet the threshold of success with significant additional effort and stress. (numeric value 3)
  • Marginal – The team can meet the threshold of success with minimal extra effort. (numeric value 2)
  • Negligible – There is no real impact on achieving the threshold of success or little increase in effort. (numeric value 1)

Probability

  • High – Chance of becoming a problem is above about 80%. (numeric value .8)
  • Medium – Chance of becoming a problem is about 50/50. (numeric value .5)
  • Low – Chance of becoming a problem is below about 20%. (numeric value .2)

Time Frame

  • Short – May occur in about a month or less.
  • Medium – May occur in 1 to 3 months.
  • Long – May occur in more than 3 months.
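Using the numeric values listed for impact and probability, the exposure calculation can be sketched in a few lines. The function and dictionary names are my own, and the three example risks are hypothetical rather than drawn from the workshop data.

```python
# Numeric values as listed in the rubric above.
IMPACT = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}
PROBABILITY = {"high": 0.8, "medium": 0.5, "low": 0.2}


def exposure(impact: str, probability: str) -> float:
    """Risk exposure = impact value x probability value; time frame is ignored."""
    return IMPACT[impact] * PROBABILITY[probability]


# Hypothetical risks for illustration; not taken from the workshop data.
risks = [
    ("client unavailable for reviews", "critical", "medium"),
    ("build server fails", "marginal", "low"),
    ("key requirement misunderstood", "catastrophic", "high"),
]
ranked = sorted(risks, key=lambda r: exposure(r[1], r[2]), reverse=True)
for name, impact, probability in ranked:
    print(f"{exposure(impact, probability):.1f}  {name}")
```

With 4 impact values and 3 probability values there are only 12 possible combinations, so many risks end up sharing the same exposure number.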

Instead of relying on the results from the analysis, teams perform 3 to 4 rounds of multi-voting. The final multi-voting rank is shown. Not all teams ranked all risks since teams generally only deal with the top few risks, usually less than 10. This idea is captured in the priority. A risk is either a high priority, meaning the team is actively addressing it, or a low priority meaning the team is aware of it but it was not ranked high enough to deal with yet. Teams might choose different strategies for determining priority. The two most popular are to only examine the top X or to rely on consensus derived from how the risks clustered as a result of multi-voting. Usually there is strong team consensus for the top 4 to 5 risks and weak consensus after this.

Analysis and Discussion

My hypothesis is that teams’ rankings will generally match exposure, meaning that risks that are ranked highly will also have a high exposure. As the data shows, this is generally the case. On average nearly every team’s high priority risks were also the ones with the highest exposure.

Graph showing Teams' Average Risk Exposure by Priority.

Examining the risks’ rank and exposure tells a similar story but not convincingly. There is a relatively weak negative correlation (correlation coefficient of -0.22) between exposure and team assigned rank. Basically the best that can be said is that there is a general downward trend in exposure as the rank increases but there is enough variation that I can’t really say anything for certain.

Graph showing risk data for all teams.
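If you want to check the relationship between rank and exposure in your own risk list, a correlation coefficient is easy to compute by hand. The sketch below assumes Pearson's r, since the post does not say which coefficient was used, and the (rank, exposure) pairs are hypothetical values chosen only to show the mechanics.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical (rank, exposure) pairs; rank 1 is the top-voted risk.
ranks = [1, 2, 3, 4, 5, 6]
exposures = [3.2, 1.5, 2.4, 1.0, 1.6, 0.4]
r = pearson(ranks, exposures)
print(f"correlation between rank and exposure: {r:.2f}")
```

A negative value means exposure tends to fall as the rank number grows, which is the direction the hypothesis predicts; how close it sits to -1 tells you how tightly the two agree.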

I have two possible explanations for this. First, traditional risk exposure does not take into account time frame while teams evaluating risks in this data set do. So, all things equal from an exposure perspective, a long term risk might be ranked very low while a short term risk will be ranked much higher. If this were the case, we’d see more short-term risks assigned high ranks than long-term risks and this is indeed the case. In fact, the majority of risks identified are short-term risks with nearly three times more short-term risks being identified than long term risks. Mid-term risks are, unsurprisingly, in the middle. A better exposure number might be had by taking into account risks’ time frame values.

Graph showing the count of risks per time frame by rank

The second possible explanation I have is that 3 – 4 buckets isn’t sufficient to allow for enough variation to form a strong correlation between rank and exposure. Indeed this is one of the greatest differences between this data set and traditional risk exposure calculations in which impact might take on nearly any number and exposure is usually a percentage from 10 – 100%. That said there still is a general trend which shows that most of the time, multi-vote ranking very roughly corresponds to exposure.

There is one more catch about this data and it’s a subtle but important one. Values for probability, impact, and time frame were determined as a team using a sort of rough average approach where team members vote and the approximate averages are rounded to the nearest bucket. Since all the values and rankings were determined through a group effort, it would make sense that they should roughly correspond.

Conclusions

As it turns out, risk exposure is a rough and somewhat accurate indicator for relative risk priority, at least when calculating exposure or rank using group-driven techniques. Teams relying only on exposure are likely to rank some risks higher than they otherwise might. Part of this is due to exclusion of the concept of time from traditional exposure, part of it might be differences of opinion within the group as far as impact or probability are concerned.

Having talked with other MSE alumni, I mostly agree with them that the most important thing about risk management is bringing up concerns and talking about them. Delphi multi-voting is an easy way to encourage conversation since differences of opinion are addressed as part of the multi-voting process. No matter what technique you use, exposure (with time somehow included), multi-voting, or some combination, do not reduce risk management to simple numbers. It’s really all about communication. Encourage this communication using whatever techniques work for your team.

Raw data used for analysis in CSV format.

A Closer Look at Risk Burndown

I like the idea of the risk burndown chart. Burndown is an effective and satisfying visual indicator of progress and it’s relatively easy to calculate to boot. But does looking at a project’s risks through the lens of a burndown chart make sense?

I see several problems with thinking about risk in this way.

Numbers can be Misleading

The first key to effective risk management is to value accuracy over precision. This means that it’s better to be right in your predictions than it is to be spot on correct. Remember, risk is about assessing your likelihood for project success. It doesn’t matter if you miss your threshold of success by a little or a lot; either way you still fail the project!

Pop quiz. Say there are two risks in your project. There’s a 25% probability that Risk A will become a problem while Risk B only has a 20% probability. For now, assume the impact is the same for both risks. Which risk is a greater threat to the project?

That one’s easy. Risk A is a greater threat because, impacts aside, Risk A’s probability of turning into a problem is 5 percentage points higher. Ok. What if I told you that I made up probabilities based on my gut feelings so I could easily rank risks? Now which risk is a greater threat to the project?

The real question I’m asking you is this. Are you willing to bet the success of your project on those numbers? Because if my best guess, gut feeling probabilities are off by more than 5%, the project could be in serious trouble depending on the risks’ impacts.

I know, I know. That was a trick question. Nobody on your team would make up numbers on one of your software projects. In all fairness, nobody goes out of their way to fabricate false values. Use your logic. If you were any good at guessing the probability of future events occurring, you would not be reading this post right now. You would be a multi-millionaire, off enjoying your gambling winnings from the ponies. Too much precision gives folks too much confidence in the correctness of your assessment when the reality is that probability and impact are based on best guesses and gut feelings. Probability and impact numbers just make it easier to calculate exposure so risks can be ranked automatically. Burndown is a fairly precise metric.

Not all Risks are Created Equal

If you are monitoring project risk with a risk burndown chart, how do you know whether the right risks are being reduced? Let’s take a look at an example.  Which of these sets of risks should be addressed?

Set 1 with a total exposure of 7 days made up of the following risks:

  • Risk A has a probability of 20% and an impact of 15 days for an exposure of 3 days.
  • Risk B has a probability of 25% and an impact of 10 days for an exposure of 2.5 days.
  • Risk C has a probability of 30% and an impact of 5 days for an exposure of 1.5 days.

Or Set 2 with a total exposure of 7 days (6.7 rounded up) made up of the following risk:

  • Risk D has a probability of 95% and an impact of 7 days for an exposure of 6.7 days.

In the first set, I can mitigate 3 risks, each with very low probability of becoming problems. In the second set I mitigate only 1 risk that is almost certainly going to become a problem. Reducing the imminent risk seems to make the most sense but this choice is not reflected in a risk burndown chart. Simply reducing risk over time is not enough. You have to reduce the right risks.
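The arithmetic behind the two sets can be checked with a short sketch. The probabilities and impacts come straight from the lists above; the function names are made up for the example.

```python
# (name, probability, impact in days) as given in the two sets above.
set_1 = [("A", 0.20, 15), ("B", 0.25, 10), ("C", 0.30, 5)]
set_2 = [("D", 0.95, 7)]


def total_exposure(risks):
    """Sum of probability x impact, the total a burndown chart would track."""
    return sum(probability * impact for _, probability, impact in risks)


def most_likely(risks):
    """Highest single probability in the set, which the total hides."""
    return max(probability for _, probability, _ in risks)


for label, risks in (("Set 1", set_1), ("Set 2", set_2)):
    print(f"{label}: exposure {total_exposure(risks):.2f} days, "
          f"highest probability {most_likely(risks):.0%}")
```

The two totals come out nearly identical even though the most likely risk in one set is more than three times as probable as anything in the other, and that difference is exactly what a single summed number cannot show.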

Impact Isn’t Really About Money or Effort

The only way for a visual chart such as risk burndown to work is if we’re able to quantify risks. This is generally done with exposure. Exposure = probability x impact. Impact is a funny thing. Impact is an assessment of how much the consequence of a risk will affect the project if the risk becomes a problem. Traditionalists like to think about this from a money perspective (which makes sense since software engineers stole most of our risk management practices from the finance world, originally anyway). For small teams, effort is a better measure as in the number of person days a risk that becomes a problem will cost to fix. This is a quantifiable loss.

There’s a problem with thinking about impact in terms of days of loss. Since not all risks are created equal, not all loss is truly equal either. Some kinds of loss can’t be measured in terms of effort. It really all depends on your project’s threshold of success. Some example risks (which don’t rely on ye olde life-critical system standby) from which you might never recover if they became problems include:

  • We don’t have a reliable backup solution; might lose all of our project data. (Lost yer data? You’re up a creek, son!)
  • We don’t have backup power for our data center; data centers might go offline for more than a few hours. (How many days will it take you to get those customers back?)
  • The demo has bugs and our contract renewal is based exclusively on how much the client likes our demo; a bug might occur during the demo. (HA! HA! You don’t have a job!)

In all of these cases you would reduce the risk by working on attributes other than impact (e.g. reduce probability, eliminate the condition, extend the time frame). Enough said. When it comes to calculating exposure, each of these risks has a catastrophic impact. That’s catastrophic, short for epic failure. No amount of days can really capture the essence of complete catastrophe.  Impact works best when considered in terms of success, not days or dollars lost.

Forget Risk Burndown

I want risk burndown to make sense, but given the problems I can’t help but think of it as a meaningless metric. Sure, some risks will be reduced and some will go away by converting into problems or being overcome by events. And a chart showing this would be really neat. But you’ll also uncover new risks as the project goes on. And some risks are just not worth caring about while others deserve a lot of attention. Risk management is about identifying the things that are most likely to kill your project so you can deal with them before it becomes too expensive (or impossible).  A burndown chart doesn’t reflect any of these things directly.

Burndown masks project risks too much and gives teams a false sense of confidence. To put it another way, there’s a risk with using risk burndown:

Our new risk management strategy assumes our estimation precision is better than it is; we may not mitigate the right risks.

Exposure is a ruse. And risk burndown is a metric for showing a reduction in exposure over time. To wax poetic, perception is reality and risk burndown provides a false perception.

That said, any risk management is better than none at all.  If a risk burndown chart helps to get your team thinking about risk, then so be it.  But there are other ways (might not be as fancy) to manage risk which are easier and more effective.

Project Signaling

Van Halen may have known more about project management than most program managers. Van Halen’s legendary “No Brown M&Ms Rider” is simultaneously the greatest example of rock star excess and project signaling I’ve ever seen. As David Lee Roth puts it:

The contract rider read like a version of the Chinese Yellow Pages because there was so much equipment, and so many human beings to make it function. So just as a little test, in the technical aspect of the rider, it would say “Article 148: There will be fifteen amperage voltage sockets at twenty-foot spaces, evenly, providing nineteen amperes . . .” This kind of thing. And article number 126, in the middle of nowhere, was: “There will be no brown M&M’s in the backstage area, upon pain of forfeiture of the show, with full compensation.”

So, when I would walk backstage, if I saw a brown M&M in that bowl . . . well, line-check the entire production. Guaranteed you’re going to arrive at a technical error. They didn’t read the contract. Guaranteed you’d run into a problem. Sometimes it would threaten to just destroy the whole show. Something like, literally, life-threatening.

In economics, signals are indicators that convey specific meaning between producers and consumers. For example, when you see THX on the side of a set of speakers, you know the speakers are going to probably be of audiophile quality. The THX logo is the speaker manufacturer’s signal to you, the consumer, that these speakers are really good. To David Lee Roth and the Van Halen road crew, the presence of brown M&Ms indicated that the hosting venue had not understood all details of the contract and had very likely made a mistake in configuring the set. One mistake in this case could cause malfunctions during the show or even the death of a crew member.

As it turns out, signaling software projects isn’t that difficult. The 12 step Joel Test is a reasonable signal for software development companies. While the Joel Test is nice for getting a feel for a company before you work for them, the concept is still useful once you’ve got the job and the project is in full swing.

Ultimately signals, also known as tripwires or triggers, are really just binary metrics for uncovering potential problems your project might be facing before the problems explode in your hands. When some condition is met (the signal), you know it has specific significance and prompts certain actions to prevent a problem from occurring. Triggers are most often used with risk management but their use should not be exclusive to that practice. In fact, if you’re collecting real data, you have even more opportunities for identifying signals outside of risk management.

On past projects I’ve used signals for a variety of issues. Here are some examples.

  • During the past 3 iterations the team identified between 15 and 20 defects. I expect a similar number of defects to be detected for this iteration. If more defects are detected, there may be a disconnection in understanding between requirements, design, and implementation. If fewer defects are detected, tests may not have been as rigorously defined as they should have been.
  • A Fagan inspection completed in less than one hour with a rate of 400 LOC/hour. Since most inspections have covered only 250 LOC/hour it is likely that this inspection was not effective and the results not reliable since the inspection team sped through the code.
  • When evaluating potential open source libraries, a SourceForge project without a website shows a general lack of dedication to the project and indicates that the software is probably of poor quality or ill-maintained; the library is worth neither the time nor effort to use.
  • Tasks that have been estimated to require longer than 9 hours have probably not been thoroughly thought through.
  • No risks have been identified for this project or risks have not been updated for several iterations. This implies that the team doesn’t have a realistic understanding of what problems the project faces.

In each of these examples, when the signal was heard, I knew there was going to be a problem on the project.
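A few of the signals above translate naturally into simple yes-or-no checks. This is only a sketch: the thresholds (15 to 20 defects, 250 LOC/hour, 9 hours) come from the examples in the list, while the function names and the sample inputs are hypothetical.

```python
def defect_count_signal(recent_counts: list[int], current: int) -> bool:
    """Trip when this iteration's defect count falls outside the recent range."""
    return not (min(recent_counts) <= current <= max(recent_counts))


def inspection_rate_signal(loc: int, hours: float, usual_rate: float = 250) -> bool:
    """Trip when an inspection ran faster than the team's usual LOC/hour."""
    return loc / hours > usual_rate


def task_size_signal(estimated_hours: float, limit: float = 9) -> bool:
    """Trip when a task estimate exceeds the limit and likely needs breaking down."""
    return estimated_hours > limit


# Hypothetical readings from one iteration.
tripped = {
    "defects": defect_count_signal([15, 18, 20], current=27),
    "inspection": inspection_rate_signal(loc=400, hours=1.0),
    "task size": task_size_signal(estimated_hours=6),
}
for name, is_tripped in tripped.items():
    print(f"{name}: {'TRIPPED' if is_tripped else 'ok'}")
```

Each check returns a plain boolean, so the only thing left for the team to decide is what to do when one of them trips.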

Work with your team to establish signals for your project. The best part is that once you’ve decided on the signals for your team, when triggers are tripped you can throw a Van Halen sized rock star fit in your cubicle! Well, try to resist throwing your monitor out the window anyway.

Binary is a Metric Too

Software developers are, in their heart of hearts, dataphiles – people who are absolutely in love with data. When was the last time you had a passionate discussion about frame rates, hardware benchmarks, gadget specs, sports statistics, dungeons and dragons, the merits of high def…the list goes on. Face it, you love data. You love comparing things using data. You don’t feel comfortable making decisions without a comprehensive comparison of data.

Why then do most software developers treat software development differently?

Tom DeMarco recently brought his own famous quote into question (pdf), musing that not only is it possible to control what you can’t measure, but the most important stuff you need to control on a software project is impossible to measure. Once again, DeMarco is wrong (in my opinion anyway). To prove his point DeMarco pointed at Wikipedia, something extremely valuable that was built without the use of metrics or formal control. This is a romanticized view of Wikipedia.

Wikipedia is one of the most controlled projects on the planet

On the surface, Wikipedia is the Wild West of online content. Not only can anyone edit any page, but content from Wikipedia is widely proliferated in the media and (sadly) school reports. Wikipedia is the single greatest success of user generated content in the history of mankind (“The Internet,” as the medium, doesn’t count). What started with a dozen humble articles has evolved into the most comprehensive encyclopedia ever created and includes everything from the fundamentals of science to the definitive source on Babylon 5.

What folks seem to forget is that even in the Wild West, there were laws and there were lawmen. Though we love to think romantically about such brigands and gunslingers as Jesse James, Billy the Kid, and Butch Cassidy, most stories about these historic figures are greatly exaggerated. So too is the case with Wikipedia.

Let’s take a closer look at the Wikipedia entry for Billy the Kid. This article belongs to a number of internal WikiProjects, visible from the top of the article’s talk page. The WikiProject Biography is not unlike most projects in Wikipedia. There are defined processes for assessing articles and conducting peer reviews. There are rubrics defined for assessing the quality of articles within the project. People even take on specific roles and responsibilities within the project. The collection of processes and information serves as the main means of coordination for contributors and helps the group control articles within the scope of the project.

The WikiProject Biography even collects metrics on articles which it then uses to make decisions concerning the articles under the project. The metrics are derived from quantifiable data and help control the project.

As it turns out, Wikipedia is not the lawless territory of the internet it has been made out to be.

You can measure the immeasurable

Wikipedia works because people were able to figure out ways to measure things that usually can’t be measured. The fundamental principle that many people overlook is that binary is a metric too. Yes or no questions can be just as effective a measure as any complex metric. Did everyone fill out their task data today? Yes or no. Did the estimate match the actual? Yes or no. Did the test pass? Yes or no. Is the project done? Yes or no. Have we identified risks? Yes or no. Has this risk become a problem? Yes or no.

At the heart of every complicated metric is really a series of yes or no, binary questions. When considering whether the project is done, you have to define done. One way of defining done is in terms of a checklist. Is feature 1 done? Is feature 2 done? Defining done for a feature could be as simple as checking whether all the tests have passed for the feature, again a binary measure.
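As a rough illustration of the checklist idea, here is a minimal Python sketch. The `Feature` class, its field names, and the rule that a feature with no tests counts as not done are all assumptions made for the example, not something taken from a real project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """A hypothetical feature whose only data is a list of pass/fail test results."""
    name: str
    test_results: List[bool] = field(default_factory=list)

    def is_done(self) -> bool:
        # Binary measure: done only if there is at least one test and all pass.
        # Treating an untested feature as "not done" is an assumption.
        return bool(self.test_results) and all(self.test_results)


def project_is_done(features: List[Feature]) -> bool:
    """The project-level question is just the feature-level answers combined."""
    return bool(features) and all(f.is_done() for f in features)


def percent_done(features: List[Feature]) -> float:
    """A 'complicated' metric built from nothing but yes/no answers."""
    if not features:
        return 0.0
    return 100.0 * sum(f.is_done() for f in features) / len(features)


features = [
    Feature("login", [True, True]),
    Feature("search", [True, False]),
    Feature("export", []),
]
print(project_is_done(features))
print(round(percent_done(features), 1))
```

The percentage at the end is nothing more than a count of yes answers, which is the point: the richer number falls out of the binary ones.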

For more subjective assessments, you can rely on observation-based, experience-defined rubrics. Does the team get along with one another? In the simplest form, this could be a binary metric (Am I friends with everyone on the team?) but it could also be more complicated, relying on gut feelings and a guiding rubric (“we never hang out together and don’t trust one another” might indicate low harmony while “we hang out often and feel comfortable sharing personal stories” could indicate high harmony). Teachers use rubrics and experience to judge subjective assignments every day. The difference is that they slap a grade on it and send it home as a report card.
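A rubric like the one above can also be written down so that everyone rates against the same descriptions. The sketch below is only one way to do it: the `Harmony` levels, the wording of the middle level, and the choice to summarize the team with the median rating are all assumptions for illustration.

```python
from enum import IntEnum
from statistics import median_low


class Harmony(IntEnum):
    LOW = 1
    MEDIUM = 2  # hypothetical middle level, not part of the rubric quoted above
    HIGH = 3


# Descriptors each team member compares their own observations against.
RUBRIC = {
    Harmony.LOW: "we never hang out together and don't trust one another",
    Harmony.MEDIUM: "we get along at work but rarely talk outside of it",  # assumed wording
    Harmony.HIGH: "we hang out often and feel comfortable sharing personal stories",
}


def team_harmony(ratings):
    """Summarize individual ratings as the (low) median rubric level."""
    if not ratings:
        raise ValueError("need at least one rating")
    return Harmony(median_low(ratings))


ratings = [Harmony.HIGH, Harmony.LOW, Harmony.HIGH]
level = team_harmony(ratings)
print(level.name, "-", RUBRIC[level])
```

The median is used here so that a single very low or very high rating does not swing the result; a team might just as reasonably look at the lowest rating instead.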

While DeMarco is correct that many of the most critical things in a project are the most difficult to measure, it is possible to create measurements if you feel it is important enough to do so. How would you assess whether you have a good architecture that solves the problem at hand? Rubrics might play a part but so too might binary gates based on quality attribute scenarios or intricate observations concerning design trends over time. If you think hard enough, you’ll find that it’s extremely easy to find measuring points for nearly every aspect of a software project.

Whatever you do, don’t become a mindless, data-driven robot

I love data and I know you do too. While it’s tempting to inject data collection and derive metrics for every aspect of a project (because it’s fun and informative!), don’t. Collecting data and calculating metrics can be expensive. Not so expensive that you shouldn’t do it, but expensive enough that you shouldn’t do it for everything. I like to compare using metrics to eating out at restaurants. Once or twice a week isn’t that big a deal, but it’s not something you should do every day if you’re trying to watch your budget.

DeMarco is right about one thing: control is not the end-all-be-all of software engineering. Consider carefully: What are the most risky parts of my project? What are the parts of my project that even require control? What are the parts in which I need more insight or want to improve? Strategically develop metrics for these areas and don’t worry about measuring the rest. Trust me, the world won’t end. If you don’t know what you’re doing, start with a simple binary measure. And above all, if something isn’t working, change it.

Software Craftsmanship: Engineering by Coincidence

I was extremely disappointed to read a recent article on Coding Horror reflecting on an IEEE editorial written by Tom DeMarco. If you have not already, please read Tom DeMarco’s article now. It’s only two pages and it’s well written.

With all due respect, Tom DeMarco is wrong.

And Jeff Atwood made things worse.

According to Atwood’s interpretation of DeMarco, since we can’t control software projects, there is no sense in trying to engineer software.

What DeMarco seems to be saying — and, at least, what I am definitely saying — is that control is ultimately illusory on software development projects. If you want to move your project forward, the only reliable way to do that is to cultivate a deep sense of software craftsmanship and professionalism around it.

Atwood’s conclusion simply is not supported by DeMarco’s article. DeMarco made two points in his piece.

  1. We don’t have as much control over software as we think we do — even when we can measure the software on which we work.
  2. We should be focusing more on the upfront “conception” activities than the areas that currently receive the most attention, construction.

I interpret “conception” activities to be things like requirements, architecture, and design: details that ultimately help you figure out whether it makes sense to build the thing you think you want to build. By framing DeMarco’s argument as “craftsmanship” vs. “engineering,” Atwood misses the whole point and reopens the tired art-or-engineering debate.

Overlooked by Atwood, DeMarco never questioned the idea that software should be engineered.

I’m gradually coming to the conclusion that software engineering is an idea whose time has come and gone. I still believe it makes excellent sense to engineer software. But that isn’t exactly what software engineering has come to mean. The term encompasses a specific set of disciplines including defined process, inspections and walkthroughs, requirements engineering, traceability matrices, metrics, precise quality control, rigorous planning and tracking, and coding and documentation standards. All these strive for consistency of practice and predictability.

DeMarco is really saying that the engineering part of software engineering has become overshadowed by a collection of best practices for building software. In my mind this isn’t necessarily a bad thing. All it means is that what has become known as “software engineering” is different than the original definition intended by the NATO Conference on Software Engineering.

But by discounting current software engineering practices, DeMarco dismisses the real engineering that went into advancing the field to where it is today.

DeMarco seems to imply that what we really want software engineering to be (the application of systematic, disciplined, quantifiable approaches to the development of software) and what software engineering has become cannot coexist. Essentially, to reach a state where metrics and measures, quantifiable approaches, are used correctly and consistently by the software development community we must stop using the term “engineering” to describe the current set of practices.

This is backwards thinking.

Engineering is more than something you do; it’s also a way of thinking about problems and solutions. Reaching the point in software engineering where we are today required systematic, disciplined, and quantifiable thinking. Over time, the results of this thinking have been codified into the set of best practices that most developers now take for granted.

For example, we know that there is a 100x or more difference in cost between defects discovered later in the software lifecycle and those discovered earlier. We know that certain practices can effectively remove defects at different costs and at different times throughout the lifecycle (for example, inspection vs. prototyping vs. unit testing vs. system testing). We also know that historical data is an excellent indicator of future performance on software projects.

Systematic, disciplined, quantifiable thinking was required to make these discoveries.

Because of these codified best practices, it is not always necessary to conduct experiments on a project to trust that they are working. I know unit testing combined with regular system integrations will flush certain defects from my software before those defects become a problem during system testing. I know that statistical analysis of collected task tracking data will help me better predict how long future tasks of a similar size will take. It doesn’t matter whether I completely understand the engineering behind the practice or whether I simply follow the process or use the tool. The benefits will be the same.

Does that make me less of an engineer? I don’t think so.

Using best practices codified as processes, methods, or tools on a software project means you are engineering software whether you like it or not. With many of these practices, the control mechanisms are already built in so you don’t realize that you’re already controlling your project. As DeMarco points out, it simply isn’t necessary that every engineering detail be painstakingly scrutinized for a project to be successful. For many projects, the essence of the project is sufficient to overcome the accidents encountered when engineering by coincidence.

But engineering by coincidence doesn’t make you a software craftsman. To prove it, I’m calling Jeff Atwood out. Jeff, I dare you and the Stack Overflow team to take the PSP Challenge. Take a course on the Personal Software Process and honestly give it a try. Use actual software engineering for a few weeks, then tell me that software engineering is dead. But don’t knock it until you’ve tried it.