Wednesday, January 7, 2015

Data mining using machine learning and statistics on timing data in C++?

I am trying to create a tool that looks at timing data and forms clusters around this data set. I do not know ahead of time how many clusters I have. Most of my data is std::string data, but I can later parse some of it into custom enumerations and structures, so this is a moot point. I also do not want to make any assumptions about the data before processing it.


My goal is to create a cluster set with the timer values as the output set and the other values as the input set, so that patterns in the data that might otherwise be overlooked are easy to discover. I want to use the C++ standard library, including the C++11 libraries, with Boost as a last resort.


For my first step, I have sorted my data by timer value. The next step is to group the data based on a matching score and the timer values.


The matching score compares values for similarity rather than for exact equality as strings.
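To make the idea of a similarity-based score concrete, here is a minimal sketch of one possible choice, the Levenshtein edit distance between two strings. This is an assumption for illustration only: the function name `matchingScore` is hypothetical, and the actual score could just as well be something else. A lower value means the strings are more similar, which fits a "score <= limit" test.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): Levenshtein edit distance.
// Returns 0 for identical strings; larger values mean less similar strings.
std::size_t matchingScore(const std::string& a, const std::string& b)
{
    // prev[j] is the distance between the first i-1 characters of a
    // and the first j characters of b; curr is the row being filled in.
    std::vector<std::size_t> prev(b.size() + 1);
    std::vector<std::size_t> curr(b.size() + 1);
    std::iota(prev.begin(), prev.end(), std::size_t{0});

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i; // deleting i characters turns a's prefix into ""
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

With this definition, two values would count as "similar enough" when `matchingScore(x, y)` is at or below some chosen limit. It uses only the C++11 standard library.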


The code for processing the data should look something like this:



    bool recomputeClusterGroup = false;
    for (auto& data : dataset) // dataset holds the key/value pairs
    {
        bool isPlaced = false;
        for (auto& cluster : clusterGroups)
        {
            // MatchingLimit is declared earlier
            if (cluster.IsInRange(data) && cluster.MatchingScore(data) <= MatchingLimit)
            {
                isPlaced = true;
                cluster.addData(data);
                break;
            }
        }
        if (!isPlaced)
        {
            // No existing cluster accepted this item, so start a new one.
            // Assumes clusterGroups is a std::vector<Cluster>, as the
            // cluster.IsInRange(...) calls above imply.
            recomputeClusterGroup = true;
            clusterGroups.emplace_back(data);
        }
    }
    if (recomputeClusterGroup)
    {
        repartitionClusterGroups(clusterGroups); // moves data to better-fitting clusters
    }
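For reference, here is a minimal sketch of what a `Cluster` type supporting the loop above could look like. Everything beyond the names `IsInRange`, `MatchingScore` and `addData` is an assumption: the `Record` layout, the fixed timer `window` around the cluster mean, the field-by-field score, and the `buildClusters` wrapper are all hypothetical. The repartitioning step is left out because the post does not define it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record layout: one timer value plus its string fields.
struct Record
{
    double timer;
    std::vector<std::string> fields;
};

class Cluster
{
public:
    Cluster(Record seed, double window)
        : window_(window), timerSum_(seed.timer)
    {
        members_.push_back(std::move(seed));
    }

    // True when the record's timer lies within window_ of the cluster's mean timer.
    bool IsInRange(const Record& r) const
    {
        return std::fabs(r.timer - meanTimer()) <= window_;
    }

    // Assumed score: number of field positions that differ from the seed
    // record, counting missing or extra fields as differences (0 = identical).
    std::size_t MatchingScore(const Record& r) const
    {
        const std::vector<std::string>& seed = members_.front().fields;
        const std::size_t common = std::min(seed.size(), r.fields.size());
        std::size_t score = std::max(seed.size(), r.fields.size()) - common;
        for (std::size_t i = 0; i < common; ++i)
        {
            if (seed[i] != r.fields[i]) ++score;
        }
        return score;
    }

    void addData(const Record& r)
    {
        timerSum_ += r.timer;
        members_.push_back(r);
    }

    std::size_t size() const { return members_.size(); }

    double meanTimer() const
    {
        return timerSum_ / static_cast<double>(members_.size());
    }

private:
    double window_;
    double timerSum_;
    std::vector<Record> members_;
};

// Single pass over records already sorted by timer, mirroring the loop above.
std::vector<Cluster> buildClusters(const std::vector<Record>& dataset,
                                   double window, std::size_t matchingLimit)
{
    std::vector<Cluster> clusterGroups;
    for (const Record& data : dataset)
    {
        bool isPlaced = false;
        for (Cluster& cluster : clusterGroups)
        {
            if (cluster.IsInRange(data) && cluster.MatchingScore(data) <= matchingLimit)
            {
                cluster.addData(data);
                isPlaced = true;
                break;
            }
        }
        if (!isPlaced)
        {
            clusterGroups.emplace_back(data, window);
        }
    }
    return clusterGroups;
}
```

Because no cluster count is fixed up front, the number of clusters falls out of the chosen window and matching limit. The result does depend on the order of the input, which is one reason a later repartitioning pass may still be useful.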
