About me

My name is Marko Markovic and I am a software developer, a musician and a researcher. I am intrigued by algorithms and new techniques and I want to improve my software development skills.

While working on different projects I like to document what I have learned. A new technique, a new algorithm, solution to a problem or a tool. The reason I am writing this blog is to help out other people in their programming endeavors and also I write so I won't forget.

If you are interested to find out more about me visit markomarks.com

Saturday, June 22, 2013

Spatial Data Clustering with C#

Howdy y'all, a big one from me today. Today we are dealing with spatial data clustering. There is a huge demand in these kind of techniques, especially for the clustering of geographical data. The problem is if you have, for example, a huge number of products that you want to display on the map, you would probably run into the problem of rendering those data. On one hand, the problem of speed of fetching the data, and secondly, if the pins are too close to each other it would be hard to interact with a specific pin. In both cases you would need to shrink the data that is received. The most logical solution would be grouping of the data that are geographically close enough to each other into one object that contains combined information.
For those who are not familiar with the terms like spatial data and clustering here is a short description of those terms.

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)[source: Wiki]. 

Spatial data also known as geospatial data or geographic information it is the data or information that identifies the geographic location of features and boundaries on Earth, such as natural or constructed features, oceans, and more.[source:Webopedia].

I would also like to add, that I have tried to keep things as modular as I can, so you can customize the code to your needs without many changes to the algorithm.


The Object and Data

So, what exactly do we need? We need the data of course and we will need a class object that would describe the cluster itself. Before any clustering, we will treat each entry as an cluster. An advice would be remapping the data into the cluster object before the algorithm if you can. 

Cluster - In this case imagine the cluster as a 2D square or a plane. With a simple math check you could see if the point is in or out of the imaginary square. The check would consist of asking whether the latitude is in a certain range and longitude in a different range. It is similar to dealing with an X and Y coordinates. 

Of course, the cluster can be any shape that you like, for example a circle, which requires a different math check but it is a bit more intensive on the calculations and general performance. It all depends on the data you are dealing with and your needs.

I will use the following object class to describe the cluster object.

public class SimpleCluster
{
    public int ID { get; set; }
    public List<string> NAMES { get; set; }
    public LAT_LONG LAT_LON_CENTER { get; set; }
    public List<LAT_LONG> LAT_LON_LIST { get; set; }
}

public class LAT_LONG
{
    public double LATITUDE { get; set; }
    public double LONGITUDE { get; set; }
}

The LAT_LONG class object represents the coordinates of an item. It consists of LATITUDE and LONGITUDE double type values. It is a simple class, and I will not pay much attention to it.


Properties
  1. Every cluster has to have an ID. This will be used as means of identification of the current cluster.
  2. LAT_LON_CENTER - this is the latitude and longitude coordinates of representative point of the cluster.
  3. LAT_LON_LIST - list of latitude and longitude coordinates participating in the cluster
  4. NAMES - list of product names, in order to represent some additional data that you require. This isn't crucial to the clustering algorithm, but you always need some data to tag along with the object.
Constructors
Besides the typical empty constructor I have also used two types of constructor
public SimpleCluster(int id, string name, double latitude, double longitude)
{
    ID = id;
    LAT_LON_CENTER = new LAT_LONG {
        LATITUDE = latitude,
        LONGITUDE = longitude };
    NAMES = new List<string>();
    NAMES.Add(name);
    LAT_LON_LIST = new List<LAT_LONG>();
    LAT_LON_LIST.Add(LAT_LON_CENTER);
}

public SimpleCluster(SimpleCluster old)
{
    ID = old.ID;
    LAT_LON_CENTER = new LAT_LONG {
       LATITUDE = old.LAT_LON_CENTER.LATITUDE,
       LONGITUDE = old.LAT_LON_CENTER.LONGITUDE };
    LAT_LON_LIST = new List<LAT_LONG>(old.LAT_LON_LIST );
    NAMES = new List<string>(old.NAMES);
}

The first constructor creates a fresh instance of the cluster. Something to start out with. In the parameters you will see all the data that is needed for the object on first initialization. The "awkward" part of this is adding the name and latitude and longitude in the corresponding lists. On the first run, your cluster center point is the same as the "only" coordinates in the LAT_LON_LIST list. As you remember, this is because at start, each data input entry is treated as a separate cluster. This constructor represents that behavior.

I would like to refer to the second constructor as the "copy" constructor.  While iterating through the list I came up to a problem of interfering with object data using the new keyword for an copy of the instance and I needed a quick solution for a deep copy of an object. This maybe a bit lazyish to someone, but feel free to implement this any way you want.


Algorithm

In the beginning, all the items that are received from some source (databases, files and such) are clusters on their own. You will probably repack the data in a "cluster friendly" object and deal with that list of objects.
public void ClusterTest()
{
    //LOAD DATA
    var clusterList = LoadSimpleClusterList();
    var latitudeSensitivity = 3;
    var longitutdeSensitivity = 3;
            
    //CLUSTER THE DATA
    var clusteredData = ClusterTheData(clusterList, latitudeSensitivity, longitutdeSensitivity);           
}

This small piece of code represents the generalized algorithm. The clusterList is loaded from some source, and we have the settings for the latitude and logitude sensitivity. This is the data that is needed to be passed to the ClusterTheData method which will, like the name says, cluster the data. As mentioned, you can surround that method with some kind of a loop if you need multiple clusterizations of the data.

The latitude and longitude sensitivity represents the means to calculate the outer bounds of the cluster square area. Using the sensitivity variables and the center point of the cluster, it is easy to calculate the outer bounds of the cluster.

public Dictionary<int, SimpleCluster> ClusterTheData(List<SimpleCluster> clusterList, double latitudeSensitivity, double longitutdeSensitivity)
{
    //CLUSTER DICTIONARY
    var clusterDictionary = new Dictionary<int, SimpleCluster>();

    //Add the first node to the cluster list
    if (clusterList.Count > 0)
    {
        clusterDictionary.Add(clusterList[0].ID, clusterList[0]);
    }

    //ALGORITHM
    for (int i = 1; i < clusterList.Count; i++)
    {
        SimpleCluster combinedCluster = null;
        SimpleCluster oldCluster = null;
        foreach (var clusterDict in clusterDictionary)
        {
            //Check if the current item belongs to any of the existing clusters
            if (CheckIfInCluster(clusterDict.Value, clusterList[i], latitudeSensitivity, longitutdeSensitivity))
            {
                //If it belongs to the cluster then combine them and copy the cluster to oldCluster variable;
                combinedCluster = CombineClusters(clusterDict.Value, clusterList[i]);
                oldCluster = new SimpleCluster(clusterDict.Value);
            }
        }

        //This check means that no suitable clusters were found to combine, so the current item in the list becomes a new cluster.
        if (combinedCluster == null)
        {
            //Adding new cluster to the cluster dictionary 
            clusterDictionary.Add(clusterList[i].ID, clusterList[i]);
        }
        else
        {
            //We have created a combined cluster. Now it is time to remove the old cluster from the dictionary and instead of it add a new cluster.
            clusterDictionary.Remove(oldCluster.ID);
            clusterDictionary.Add(combinedCluster.ID, combinedCluster);
        }
    }
    return clusterDictionary;
}

Technically we are going through two loops. One is going through the list of items that we received as a list of data objects and the second one is looping through the temporary dictionary list of clusters. Through each iteration of the first loop we alter the dictionary in some way. 

In its base, if the current item belongs to a cluster, join the data from the currently inspected item to the cluster, remove the old reference in the dictionary and add the combined object as the new cluster. In case there aren't any cluster that the current item belongs to, create a new item in the dictionary list. Every time a new cluster is created or the current one is altered, recalculate the current latitude and longitude center point of the cluster and set it as the new position in the coordinate system.

public bool CheckIfInCluster(SimpleCluster home, SimpleCluster imposter, double latitudeSensitivity, double longitutdeSensitivity)
{
   if ((home.LAT_LON_CENTER.LATITUDE + latitudeSensitivity) > imposter.LAT_LON_CENTER.LATITUDE
          && (home.LAT_LON_CENTER.LATITUDE - latitudeSensitivity) < imposter.LAT_LON_CENTER.LATITUDE
          && (home.LAT_LON_CENTER.LONGITUDE + longitutdeSensitivity) > imposter.LAT_LON_CENTER.LONGITUDE
          && (home.LAT_LON_CENTER.LONGITUDE - longitutdeSensitivity) < imposter.LAT_LON_CENTER.LONGITUDE
      )
   {
       return true;
   }
       return false;
}

This is the check I have used for testing if an item belongs to a cluster. In its essence, from the center point of the current cluster in the cluster dictionary list, create an rectangle area using the latitude and longitude sensitivities. The sensitivity is added left and right from the longitude, and up and down to the latitude. A fact that is worth mentioning is that this technique is used quite often in 2D games programming.

public static SimpleCluster CombineClusters(SimpleCluster home, SimpleCluster imposter)
{
    //Deep copy of the home object
    var combinedCluster = new SimpleCluster(home);
    combinedCluster.LAT_LON_LIST.AddRange(imposter.LAT_LON_LIST);
    combinedCluster.NAMES.AddRange(imposter.NAMES);

    //Combine the data of both clusters
    combinedCluster.LAT_LON_LIST.AddRange(imposter.LAT_LON_LIST);
    combinedCluster.NAMES.AddRange(imposter.NAMES);

    //Recalibrate the new center
    combinedCluster.LAT_LON_CENTER = new LAT_LONG {
        LATITUDE = ((home.LAT_LON_CENTER.LATITUDE + imposter.LAT_LON_CENTER.LATITUDE) / 2.0),
        LONGITUDE = ((home.LAT_LON_CENTER.LONGITUDE + imposter.LAT_LON_CENTER.LONGITUDE) / 2.0)
    };

    return combinedCluster;
}

As the method name suggests this is the place where combining of the clusters occurs. This idea is simple. Assign the home cluster as a combined cluster, add the data from the imposter cluster to the combined cluster. Probably some filtering of the data is required, and I am leaving that to the reader to figure that out and design it to his/hers needs. The important part is the re-calibration of the center point of the cluster. You have added new points to the cluster and because of that you have to recalculate the center point. For example, points might be on the one of the edges, and you are just making sure that the cluster is respecting that dynamically, by following the group.

Also notice that the combined cluster receives the home cluster ID. This ID (even if not unique compared to the original list) is enough for the algorithm to function, and I didn't have the need of special IDs.

I am assuming that you will need new unique IDs when dealing with the resulting list. The only important thing here is for the ID to be unique so the clustering would work. For the unique ID you could implement a function that gives the combined cluster a new ID based on your business logic. Another approach maybe assigning the IDs when looping through the resulting list after the clustering was done. 

In the end, you may have noticed that you can reiterate through the entire algorithm again with the resulting dictionary cluster list. Of course, that would mean that you are combining those clusters in a bigger cluster, so in accordance to that, you would have to change the sensitivity of the clusters, because with the fixed latitude and longitude sensitivities you would receive a result that is pretty much the same as previous one.

The end result of the algorithm gives you a list of items which you can store and edit for later or immediate use.


Complexity

The time complexity shouldn't be too bad for this algorithm. You are passing through the entire list once and through each iteration you pass through the list of cluster. In worst case scenario it would be something around n*k! where the n is the number of items in the list and k is the number of clusters. In worst case scenario it would be n*n which doesn't make much sense because it would mean that data wasn't clustered at all and we don't want that now, do we?.

Well, that would be it for today. Hope this article will help you on your future endeavors. And please, if you find any problems with this logic let me know.

Have a great weekend!
Cheers!

6 comments:

  1. Found a minor bug, in the CombineClusters method you're adding the new cluster twice leading to duplicates in the combined cluster.

    ReplyDelete
    Replies
    1. Hello there Michael, I am glad you are enjoying this article and sorry for the late response. I would say that that bug was left on purpose.

      The reason behind is data related. In that method you have to check when two points in an cluster are equal. That is the root of the problem. In my case, my data had clusters that have duplicated latitude and longitude, so I couldn't distinct them with that. The only way for me to prevent the duplication of data was to check additional data in the point (like name of the cluster, category, id and etc.). That is why I left that method the way it is. In the article I mentioned filtering of data is required. Excuse me for not being clear enough. Hope this sheds some light on the situation.
      All the best!
      Marko

      Delete
  2. Hello Marko,
    I am currently working this related topics. Summary is: The tools read the differet format of data(e.g: DTED format level 10,1,2) and save it into a class object. object is then stored into a list(it contains all 3 levels of processed data).Since list contains lots of dted level 0,1,2 data so i want to merge all 0 level into one level and all 1 level to 1 level and all 2nd level to 2 level. that means finally my list will contain only 0 level 1 level 2 level insted of containing all the small level of data.any suggestion will be appriciate.

    Hussain

    ReplyDelete
    Replies
    1. Hello there Hussain, sorry for the late response. I was on a vacation and didn't have time to respond you earlier. Let me start by saying that I don't have a vast knowledge about the DTED format, but as far as I figured the data is represented in arc seconds or in an angle. As far as I figured, you can calculate elevation values from it (in Ft, meters and etc).

      Correct me if I am wrong, but as far as I can tell is that the some points from level 0, 1, 2 overlap and that the only difference between the levels is the size of the matrix used (0 has the least, 2 has the most points). Each point has its calculated elevation value and that is it. You can use those values to cluster the dots by elevation and using that you can get the contour of the elevations on the map. The levels will only define a finesse of the contour.

      If you just want to split the data into groups, you can use all the data you receive and go through it 3 times for each level (0,1,2). Each time you are passing through the data you will set a fixed sensitivity of cluster. From all the points gathered in the cluster you will choose the point closest to the center value of the cluster and that would become a point that represents the cluster. And with that I think you will only get points of a certain level.

      Additionally you can constrain the cluster to be certain size and not to grow anymore, so all the clusters would be the same size, resembling a grid in some way.

      This is the idea that I have come up with, I haven't tried it with real data, and I am not sure that this will work.

      Hope this helps,
      let me know how it goes,

      Marko

      Delete
  3. I am not sure why we use double for latitude and longitude instead of decimal. double seems to loose precision points for large numbers

    ReplyDelete

Blog Archive