Thursday, December 20, 2012

Algorithms: Design an Algorithm to Determine if there are Repeating Characters in a String

Lately, I've been practicing coding techniques by taking a look at different coding problems available for those preparing for computer science positions. The practice problem I worked on today involves both algorithm design and coding. The problem comes from Gayle Laakmann McDowell's Cracking the Coding Interview.

Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?

The  first thing I did was start thinking about possible ways to solve this. For the example, I am using C++. C and C++ recognize a string as an array of characters. If I was given this problem as part of an interview, I would ask if the string would be passed in ASCII or in another format. For this example I am assuming that all characters are ASCII.

My first algorithm is a straightforward, brute force approach. Basically, we take each character in the string and compare it to every other character in the string. We do this using for loops. This would mean a time complexity of O(n^2) because of the nested loop.

bool noRepeatingCharacters(string myString)
{
    for (int x = 0; x < myString.length(); x++)
    {
        for (int y = 0; y < myString.length(); y++)
        {
            if (myString[x] == myString[y] && x != y) return false;
        }
    }
    return true;
}
 

This code is very simple, but there are some problems. First, comparisons are repeated. Let's say we have a string of five characters, and the string is "ABCDE". On the first iteration, we will compare A to B, then to C, then to D, then to E. On the second iteration, we will compare B to A, then to C, then to D, then to D. Do you see how the comparison between A and B is repeated on the second iteration?

We can optimize this code a bit by altering the loops so that each character is only compared to every character AFTER it. This would mean that once A is compared to all the other characters, the other characters never compare themselves to A again.

bool noRepeatingCharacters(string myString)
{
    for (int x = 0; x < myString.length() - 1; x++)
    {
        for (int y = x + 1; y < myString.length(); y++)
        {
            if (myString[x] == myString[y]) return false;
        }
    }
    return true;
}


This code is a bit better than the previous code since we avoid repeated comparisons. The number of comparisons made with a string with n characters is (n - 1) + (n - 2) + (n - 3) + ... + (1). With a string of 5 characters, this would mean 10 comparisons, or 4 + 3 + 2 + 1. This algorithm works if I am not allowed to use any additional data structures.

I came up with another solution that is even faster, but it is only faster at the expense of memory. By using an additional data structure, we can speed up comparisons. How? By using a hash table. Each character can be cast to an integer in order to determine its ASCII value. We can use that integer as a hash key inside an array that serves as a hash table. The array itself can be made up of boolean values for quick comparisons.

bool noRepeatingCharacters(string myString)
{
    /*
    This is a hash table algorithm. This will speed up comparison times by 

    using a 128 slot hash table based on the ASCII encoding of each char
    in the string. The index number matches the ASCII decimal value. If a
    character is in the string, then its ASCII slot is marked true. If
    another character collides with that slot, then the function returns
    false. If no collisions are detected, the function returns true.
   
    The hash table uses more memory than the brute force algorithm, but
    comparison time is drastically improved.
    */

    bool myHashTable[256] = {false};
    int myIndex;

    for (int x = 0; x < myString.length(); x++)
    {
        myIndex = int(myString[x]);
        if (myHashTable[myIndex]) return false;
        else myHashTable[myIndex] = true;
    }
    return true;
}


This is much better than the brute force algorithm. The time complexity for comparisons becomes O(n). The memory requirements for the hash table is constant. As long as memory is not a concern, this algorithm will give us a much better comparison time.

In order to validate that both algorithms work correctly, I wrote a short testing method that will take in a user-defined string, determine if there are repeating characters in the string, and then pause to show the results. Here is the full code for the exercise:

//Cracking the Coding Interview
//Chapter 1: Problem 1
//
//Design an algorithm to determine if a string has all unique characters.
//What if you cannot use additional data structures?
//
//Solution by Brandon Theolet
//December 20, 2012

#include <iostream>
#include <string>

using namespace std;

bool noRepeatingCharacters(string myString);

void main()
{
    //Method to test the algorithm.

    string myString;
    cout << "Please enter any phrase: ";
    cin >> myString;
    cout << "\n";
    if (noRepeatingCharacters(myString)) cout << "There are no repeating characters in this string.\n\n";
    else cout << "There are repeating characters in this string.\n\n";
    system("pause");
}

bool noRepeatingCharacters(string myString)
{
    /*
    This is a hash table algorithm. This will speed up comparison times by
    using a 256 slot hash table based on the ASCII encoding of each char
    in the string. The index number matches the ASCII decimal value. If a
    character is in the string, then its ASCII slot is marked true. If
    another character collides with that slot, then the function returns
    false. If no collisions are detected, the function returns true.
   
    The hash table uses more memory than the brute force algorithm, but
    comparison time is drastically improved.
    */

    bool myHashTable[256] = {false};
    int myIndex;

    for (int x = 0; x < myString.length(); x++)
    {
        myIndex = int(myString[x]);
        if (myHashTable[myIndex]) return false;
        else myHashTable[myIndex] = true;
    }
    return true;
}

/*

bool noRepeatingCharacters(string myString)
{
    //This is a brute force implementation that uses no additional data structures.
    //Instead, two loops will iterate through the string and compare all chars
    //to every other char in the string AFTER the starting char. This is to avoid
    //repeating comparisons. If a match is found, the function returns false. If
    //the entire string is compared and no matches were found, the function returns
    //true.


    for (int x = 0; x < myString.length() - 1; x++)
    {
        for (int y = x + 1; y < myString.length(); y++)
        {
            if (myString[x] == myString[y]) return false;
        }
    }
    return true;
}

*/


Update: The author of the text proposed another solution that I had not considered with this problem. The solution involved first sorting the string. A well-implemented quicksort algorithm would sort the string in O(n log n) time on average. For comparisons of unique characters, the ordering of the string does not need to be preserved. Once the string is sorted, we can simple iterate through the sorted string, starting at the first character, and comparing this character to the character in the next index. If any neighboring characters are equal, we have non-unique characters in the string.

Another point the author made that I did not consider in my solution involved optimization. If we assume that the ASCII encoding was used to pass the string, then at most there can be 256 characters, which is one unique character for every possibly ASCII character. Thus, if we pass in a string with more than 256 characters, we can assume that at least one is repeated and then return an appropriate value from the function. This optimization would involve O(1) time, or constant time, for strings that were longer than 256 characters. Improving on this idea, I imagine we could also involve error checking where we would throw an error if we received an empty string.