An Original Algorithm to Find .NET Code Duplicate | Patrick Smacchia


: 3

The upcoming Visual Studio 2012 version comes with tooling to find code duplicate.

However, today, I’d like to talk about another code duplicate tool for .NET code. This tool comes as a NDepend Power-Tool. Power-Tools are a set of open-source tools based on NDepend.API. The source code of Power-Tools can be found in $NDependInstallPath$\ NDepend.PowerTools.SourceCode\ NDepend.PowerTools.sln.

The algorithm underlying this code duplicate Power-Tool is simple, yet powerful in practice. It consists in defining sets of methods that are using the same members, i.e calling the same methods, reading the same fields, writing the same fields. We call these sets, suspect-sets. Suspect-sets are sorted by the number of same members used.

Let’s run this Power-Tool on the NUnit v2.6.0 code base. The screenshot below shows the 3 steps of the algorithm executed, and the first suspect-set. This first suspect-set is, as often from my testing experience, the entry points of the application. Very often an application has several entry points that does almost the same initialization things. Sometime it is pure duplicate, sometime it is slightly changed. Most of the time this initialization code can be factorized in one or several parametrized method.

Notice the option o for Open callers methods decls. The declarations are opened in Visual Studio actually.

The second suspect-set is the following one…

bingo, we have an exact code duplicate!

The third suspect-set is the following one…

…by looking at the code below, this suspect-set can eventually be considered as a false positive. Personally, I still consider such code as code duplicated. These two methods implement a unique layout algorithm, that could be parametrized. So here I’d vote for a factorization.

I could continue on and on, but I invite you to download the 14-days full featured evaluation of NDepend to see what this algorithm will report on your own code.

Hopefully duplicate results found by this algorithm are pretty relevant. Its main advantage over others algorithms is that it is not disturbed by dirty copy-paste where the pasted code is then sightly modified. Another advantage is that this algorithm can be run on IL code only, source code is not required. Notice that what this current blog post doesn’t show, is that a suspect-set can have more than 2 suspect methods.

On the cons side, sometime a suspect-set can be humanly considered as a false positive. At least it is almost always possible to refactor found duplicates to factorize them into one or several parametrized methods (as explained for the 2 layout methods). The fact is that two or several methods don’t use the same set of members by coincidence.

Also this algorithm is very fast to run, from 10 to 100 times faster than the VS2012 one. Concretely it takes few seconds to be executed on a real-world large code base. This speed is the consequence of the fact that NDepend.API is optimized to browse dependencies very quickly.

Concerning the algorithm itself, it is open-source so if you are interested in browsing and tweaking it you can. This open-source + NDepend.API form makes it particularly suited to be integrated into a CI process.

The algorithm consists in 3 steps:

  1. Investigate each method (including third-party ones) to see if their callers could be considered as suspect (methods called very often like the ones from types in System.Collection.Generics or System.Xml are discarded to limit false positive).
  2. Merge suspect sets obtained from step 1).
  3. Sort suspect-sets from a weigh computed on number of members called.

Hopefully this algorithm can be optimized further, and your feedback is welcomed. Enjoy!