Posted on 1/30/2008 11:51:00 AM by Justin Etheredge
It looks like Rob Conery inadvertently followed number 3 from my previous post. So I guess I was right, that sure didn't take very long. :-)
Posted on 1/30/2008 1:44:00 AM by Justin Etheredge
Update: In case it wasn't ABUNDANTLY clear, this post is sarcastic.
1) Make promises you can't keep - I promise if you read this blog every day you'll be the best programmer ever in one year!
2) Make statements and claims that you can't possibly defend - no one is more popular than the guy who is constantly saying ridiculous crap. (Can you say Bill O'Reilly?)
3) Make inflammatory posts about other bloggers (preferably those whose blogs are already popular) - Your popularity will skyrocket once the other blogger posts his/her "that guy is a douchebag" response post.
4) Insert completely pointless and totally off-topic images into your posts - My favorite is the staircase.
5) Reference rules and statistics that are completely made up - In reality though, I've heard that only about 20% of bloggers actually do this.
6) Brag about how much money your ads bring in - Mine? Oh, I have earned 6 bucks so far this month. Take that!
7) Go out of your way to offend an entire group of people - For example "Windows users", "Republicans", "Democrats", "Grandmothers" or pretty much anyone who holds different views than yours.
8) Post 18 times a day - Sure, go ahead and make a post about what you had for breakfast, then you can make another one later about what it was like coming out! Oh, and that guy who cut you off on the way to work, you should blog about that too.
9) Ask questions with no answers - You really should ask your users questions that no one can answer but everyone thinks they can. Will doing this really make you popular?
10) Make top 10 lists - no one will ever take you seriously unless you can fit all of your opinions into bullet points.
And before anyone says anything, I am probably guilty of half the items on this list. :-) Enjoy!
Posted on 1/28/2008 11:48:28 PM by Justin Etheredge
I will refer to this from now on as the Law Of Crappy Code or simply LOCC. The LOCC is all about you, and the code you write. You know what I am getting at, you write crap code. We all do. You ever wonder why you will write something and then come back to it a year later and think "this is crap"? Well, that is because it is. Now I hope that you aren't taking offense at any of this because it isn't your fault, you are human, and humans make mistakes (lots of them). The tack to take is to look at the problem and say "how can we mitigate this?"
You may be surprised to hear that the "industry average" for flaws in code is all over the map and is very different based on who is reporting the numbers (sarcasm). I have looked all over the web and from what I can tell the industry average is somewhere between .5 and 25 errors per thousand lines of code. (Some numbers I saw went as high as 50, but that is just plain ridiculous in most modern languages. I cannot imagine introducing an error every 20 lines of code!) Now, you are going to be closer to .5 if you are writing something like a database or a web server and probably closer to 25 if you are writing general business software. But even somewhere in the middle, like 12 errors per thousand lines, still means that you are introducing a bug roughly every 83 lines of code that you produce. Is that number acceptable? Well, I hope not. In a 250,000 line program that would mean about 3,000 errors. Now some of these errors may be trivial, and the numbers on the severity of these errors just aren't out there (because that would be grossly subjective), but you can still see that the more code you have the more likely you will have bugs.
And yes, the statement that the number of bugs will increase as the number of lines of code is increased is fairly obvious; it is a bit like saying "the more you drive your car, the more likely you are to have an accident." Which is completely true, assuming that you remove all other variables. You obviously can't compare one guy who drives 500 miles per week to a guy who only drives 20 per week, but does so ridiculously drunk. So, let's just agree that even if you have very very low bug rates, if you were to double your codebase, then you would probably have a larger number of bugs. Even though you will still have very few bugs, you still have more code in which a bug can be introduced. And studies have shown that the number of errors per line of code increases as the application increases in size. Anyone who has had to work on a very large system that wasn't designed in a great way can attest to this. Can we say feature interaction?
So, how can we mitigate this issue? The obvious way is to write less code! Now, I'm not saying that you should be lazy and start programming less, I am saying that every time you add code to your application you should do so consciously. While you are adding code you should be conscious of whether you are repeating functionality, because any time that you can reduce the amount of code you are better off. Just slow down and think about what you are doing. In the software world our schedules can be crazy and in the haste to get the code written and the application out the door we can cut short the parts of software development that actually lead to good applications.
Another thing you can do is to constantly look for code to refactor in order to shorten, simplify, or completely remove it. Constantly ask yourself if a piece of code needs to be there or not. Refactoring can be troublesome though, and if you do not have sufficient tests to check your refactorings then you may end up introducing more bugs than you are avoiding.
One last thing you can do is to avoid large or complex methods and classes. You may think that bugs are distributed evenly around an application, but that just isn't true. According to Steve McConnell one study found that "Eighty percent of the errors are found in 20 percent of a project's classes or routines" and another found that "Fifty percent of the errors are found in 5 percent of a project's classes". The larger and more complex a method or routine is, the more likely it is to contain errors and the harder those errors will be to fix.
So, while introducing better development techniques such as TDD, Pair Programming, etc. in order to reduce bug counts in software is the first step, you next need to consider the amount of code you have and whether all of it is truly needed. And before you say that since TDD and other techniques reduce bug levels, adding more code is less of a concern, just remember that bugs are just as likely in test code as they are in application code.
Posted on 1/28/2008 12:26:47 AM by Justin Etheredge
In my last post I introduced you to the new HashSet class in .net 3.5. I showed you how easy it was to do a plethora of set operations using this class. Well, Dave over at Encosia saw this class and thought it might be a good fit for a program he had written that searched through variable length letter permutations to find items in a list of valid strings. He wondered if it might be faster to load up a HashSet and do an intersect operation than the current method he had of using the Dictionary class to look up valid words. So, I asked him to send me over the source so that I could do a bit of performance testing. He happily sent me the source, and so here is my performance comparison between the two.
Method 1: This method uses a Dictionary. It first opens up the word file and loads up the Dictionary with the words. It then takes the input string and generates all the permutations of it. After each permutation is generated it checks the Dictionary of valid words using the "ContainsKey" method.
Method 2: This method uses a HashSet. It starts off like Method 1 and opens the word file and loads the HashSet with the valid words. It then generates all of the permutations and loads them into a second HashSet. It then calls the "IntersectWith" method on the first HashSet passing in the second HashSet which returns the valid words.
The tests were run in a loop of 10 passes with 500 lookups per pass. Let's look at the results and then discuss:
As you can see, in the 4, 5, and 6 letter combinations the numbers look very similar. For 6 letters it took roughly 650 milliseconds to do 500 lookups. When we hit the 7 letter mark things start looking very different. The Dictionary begins to pull ahead by a good margin. The HashSet had an average time of 5529 milliseconds and the Dictionary averaged 4495 milliseconds. A difference of more than 20%! So, why the sudden difference? Well, the HashSet method goes through all of the possible permutations and loads up a giant HashSet object with all of those permutations. So, at 6 letters we have 1956 possible permutations, but since it jumps up to 13699 permutations at 7 letters, it begins to seriously affect the performance of the HashSet method.
Now, at this point you may be saying "Heyyyyy! 6 letters only has 6! (720) possible combinations!" And you would be partially correct, but we are doing all combinations including lengths of 5, 4, 3, 2, and 1. The formula for calculating the number of permutations of a single length is:
P(n, r) = n! / (n - r)!
In this formula n is the set size and r is the size of the permutation. So 6 letters is actually 6!/(6-6)! + 6!/(6-5)! + 6!/(6-4)! etc., which adds up to 720 + 720 + 360 + 120 + 30 + 6 = 1956.
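As a rough check of that sum, here is a minimal sketch that adds up n!/(n-r)! for every length r from 1 to n. It assumes all of the letters are distinct, and the class and method names (PermutationCount, CountAllLengths) are made up for this illustration; they are not from Dave's source.

```csharp
using System;

public static class PermutationCount
{
    // Total number of ordered arrangements of every length from 1 to n,
    // assuming all n letters are distinct: the sum over r of n! / (n - r)!.
    public static long CountAllLengths(int n)
    {
        long total = 0;
        long term = 1; // running value of n! / (n - r)!
        for (int r = 1; r <= n; r++)
        {
            term *= n - r + 1; // n * (n - 1) * ... * (n - r + 1)
            total += term;
        }
        return total;
    }

    public static void Main()
    {
        for (int n = 4; n <= 7; n++)
            Console.WriteLine("{0} letters: {1}", n, CountAllLengths(n));
    }
}
```

For 6 letters this produces 1956, which matches the count above. Building the running product avoids computing full factorials for each term.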
So, looking at this you might think that the choice between Dictionary and HashSet is easy if you are dealing with large sets, and you'd be wrong. :-) The issue is that the HashSet method spent most of its time loading up the second HashSet in order to do the "IntersectWith" operation. Well, the HashSet also has a "Contains" method, so we really didn't need to load up the second HashSet at all! If we swap the Dictionary in Method 1 for a HashSet and replace "ContainsKey" with "Contains", we end up with almost identical performance even at 7 characters.
So, in the end, these two classes have almost identical lookup speeds. The real question is whether you need the ability to assign a key to a piece of data (the Dictionary has a separate key and value type, the HashSet does not) or whether you need the robust set operations that are given to you by the HashSet object.
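To make the "Contains" variant concrete, here is a minimal sketch of what it might look like. It is not Dave's actual code: the WordFinder class, the FindWords method, and the tiny word list are all hypothetical stand-ins, with the word list replacing the word file.

```csharp
using System;
using System.Collections.Generic;

public static class WordFinder
{
    // Recursively builds every arrangement of the remaining letters and keeps
    // the ones found in validWords. The HashSet is probed with Contains, so no
    // second set of permutations is ever built.
    public static void FindWords(HashSet<string> validWords, List<string> results,
        string prefix, List<char> letters)
    {
        for (int i = 0; i < letters.Count; i++)
        {
            string word = prefix + letters[i];
            if (validWords.Contains(word))
                results.Add(word);

            // Recurse with a copy of the letters that omits the one just used.
            List<char> remaining = new List<char>(letters);
            remaining.RemoveAt(i);
            FindWords(validWords, results, word, remaining);
        }
    }

    public static void Main()
    {
        // Hypothetical word list standing in for the word file.
        var validWords = new HashSet<string> { "at", "cat", "act", "tac" };
        var results = new List<string>();
        FindWords(validWords, results, "", new List<char> { 'c', 'a', 't' });
        Console.WriteLine(string.Join(", ", results)); // cat, act, at, tac
    }
}
```

Each permutation is checked as soon as it is generated, which is the same shape as the Dictionary method. Note that if the input has repeated letters, the same word can be added to the results more than once.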
And finally, if you want to see the code that was generating the permutations for the HashSet and the Dictionary, here it is:
//HashSet Permutations
//_permutations is a counter field defined elsewhere in the class.
public void GeneratePermutations(HashSet<string> permutations,
    string prefix, List<char> letters)
{
    for (int i = 0; i < letters.Count; i++)
    {
        _permutations++;
        string word = prefix + letters[i];
        permutations.Add(word);
        //Recurse with a copy of the letters that omits the one just used.
        List<char> j = new List<char>(letters);
        j.RemoveAt(i);
        GeneratePermutations(permutations, word, j);
    }
}

//Dictionary Permutations
//CheckWord is defined elsewhere; per the description of Method 1 it
//checks the Dictionary of valid words with ContainsKey.
protected void GeneratePermutations(List<string> results,
    string prefix, List<char> letters)
{
    for (int i = 0; i < letters.Count; i++)
    {
        _permutations++;
        string word = prefix + letters[i];
        if (CheckWord(word))
            results.Add(word);
        //Recurse with a copy of the letters that omits the one just used.
        List<char> j = new List<char>(letters);
        j.RemoveAt(i);
        GeneratePermutations(results, word, j);
    }
}
Posted on 1/25/2008 1:35:55 AM by Justin Etheredge
In the .net 3.5 framework we now have a typesafe class specifically engineered for high performance set operations. The HashSet class is so useful when performing set operations on groups of objects that I just had to share it. If you are not familiar with set theory, I will give you a quick overview. First, a set is just a group of objects. So, here is our set:
The first operation that we will perform on the set is an intersect. An intersection is the set of items that two overlapping sets have in common. For example:
Here we have two overlapping sets, the intersection is the items that are in both sets. The next operation that we are going to look at is a union. A union is the combination of all items in two sets.
Here the Union is all items in both sets, including the items that are only in one of the sets. The next concept is a subset. A subset is a set that is comprised of items that are all in another set. For example:
By definition a set is automatically a subset of itself, since all of its items are contained in itself. So, we have something called a "proper subset", which is a subset that is not equal to the set that contains it. The set shown in the picture above would be a proper subset. This leads us to the concept of a superset, which is the opposite of a subset:
There is also a "proper superset", which is just a superset that is not equal to the set it contains. So, now that we have gotten some quick set operations out of the way, let's look at the HashSet class and how we would go about using it to do these things. First we will look at an intersection:
var stringSet1 = new HashSet<string> { "John", "Mike", "Fred" };
var stringSet2 = new HashSet<string> { "Bob", "Ted", "John" };
stringSet1.IntersectWith(stringSet2);
First of all we are using a collection initializer to set up our hashsets, and then we call IntersectWith, which modifies stringSet1 in place and leaves it with just a single item, "John". Next we will do a union:
var stringSet1 = new HashSet<string> { "John", "Mike", "Fred" };
var stringSet2 = new HashSet<string> { "Bob", "Ted", "John" };
stringSet1.UnionWith(stringSet2);
This will leave stringSet1 with "John", "Mike", "Fred", "Bob", and "Ted". Notice that "John" is not in there twice. Any duplicates are removed in a Union operation because in a true union operation two items that have the same attributes are considered to be the same. In this case the two strings are equal and therefore they are considered to be the same. Next we will look at the subset and superset methods:
var stringSet1 = new HashSet<string>
{ "John", "Mike", "Fred", "Bob", "Ted" };
var stringSet2 = new HashSet<string> { "John", "Bob" };
stringSet1.IsProperSupersetOf(stringSet2); //true
stringSet2.IsProperSubsetOf(stringSet1); //true
stringSet1.IsProperSubsetOf(stringSet2); //false
stringSet1.IsSubsetOf(stringSet1); //true
stringSet1.IsSupersetOf(stringSet1); //true
They correlate directly to the set operations and work exactly as expected. I'm not going to go through each one since we already went over the functions above. They also provide an "ExceptWith" method:
var stringSet1 = new HashSet<string>
{ "John", "Mike", "Fred", "Bob", "Ted" };
var stringSet2 = new HashSet<string> { "John", "Bob" };
stringSet1.ExceptWith(stringSet2);
This leaves stringSet1 with "Mike", "Fred", and "Ted". The method does not return a new set; it just removes all of the items in the second set from the first set. The HashSet class also provides an "Overlaps" method that tells us if a second set has any items in common with the current set. There is a "SetEquals" method that allows us to determine if the first set is equal to the second set. The last two methods that we are going to look at are "SymmetricExceptWith" and "RemoveWhere". "SymmetricExceptWith" leaves the first set with all items from both sets except the items they have in common:
var stringSet1 = new HashSet<string>
{ "John", "Mike", "Fred", "Bob", "Ted" };
var stringSet2 = new HashSet<string> { "John", "Bob", "Scott" };
stringSet1.SymmetricExceptWith(stringSet2);
This operation leaves set 1 with "Mike", "Fred", "Scott", and "Ted". "John" and "Bob" are removed since both sets share them. Next we have "RemoveWhere", which allows us to remove items that match our expression:
var stringSet1 = new HashSet<string>
{ "John", "Mike", "Fred", "Bob", "Ted" };
stringSet1.RemoveWhere(x => x.StartsWith("F"));
In this case since we are using strings I can call "StartsWith" and therefore we end up with stringSet1 containing "John", "Mike", "Bob", and "Ted".
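"Overlaps" and "SetEquals" were mentioned above without any code, so here is a minimal sketch of how they might be used. The demo class name and the third set are made up for this example; the two methods themselves are part of HashSet.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapsAndSetEqualsDemo
{
    public static void Main()
    {
        var stringSet1 = new HashSet<string> { "John", "Mike", "Fred" };
        var stringSet2 = new HashSet<string> { "Bob", "Ted", "John" };
        var stringSet3 = new HashSet<string> { "Fred", "John", "Mike" };

        Console.WriteLine(stringSet1.Overlaps(stringSet2));  // True: both contain "John"
        Console.WriteLine(stringSet1.SetEquals(stringSet2)); // False: the members differ
        Console.WriteLine(stringSet1.SetEquals(stringSet3)); // True: same members, order is irrelevant
    }
}
```

Unlike IntersectWith, UnionWith, and ExceptWith, neither of these methods changes the set it is called on.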
So, there you have it. This post ended up a little like a page out of a textbook, but I guess sometimes we just have to grin and bear it. The HashSet class is just so useful in so many ways that I just had to expose people to it. None of these operations are incredibly difficult, but now we have a built-in class that we can always count on being there. Enjoy!