Friday, March 27, 2015

Reading formatted text in C++

I have been trying to read formatted text into my C++ program, and I have found a solution. However, I am not happy with the elegance and style of my code, so I am asking whether there is a better solution to my problem.


I need to load the rcv1 dataset into my program. The dataset, as you may already know, provides the vectorized content in the format <did> [<tid>:<weight>]+, where did, tid and weight denote the document id, term id and weight of the corresponding term, respectively. Each line of a file in the dataset starts with a unique document id, followed by a varying number of <tid>:<weight> pairs for that document.
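To make the format concrete, here is a minimal sketch of parsing a single line of that shape. The helper name parseDocLine and its signature are made up for illustration and are not part of my program; the sketch also assumes that pairs are separated by whitespace and that the delimiter is always a colon.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parses one line of the form "<did> <tid>:<weight> ...".
// Returns false if the line does not start with a document id.
bool parseDocLine(const std::string& line, int& docID,
                  std::map<int, double>& terms) {
    std::istringstream in(line);
    if (!(in >> docID)) return false;

    terms.clear();
    int tID;
    char delim;
    double weight;
    // Each pass consumes one "tid:weight" pair; stop at the first token
    // that does not match the pattern (assumption: ':' is the delimiter).
    while (in >> tID >> delim >> weight && delim == ':')
        terms[tID] = weight;
    return true;
}
```

Called on a line such as "2286 864:0.5 1523:0.25", this should yield document id 2286 and two term entries.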


My current solution loads the data as follows (below is a code excerpt from my program):



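while ( docCount < docPerW ) {
    // Test the read itself rather than eof(): otherwise a final line
    // without a trailing newline would be silently dropped.
    if ( !getline( docsFile, line ) ) {
        // Current file is exhausted; move on to the next one.
        docsFile.close();
        docPos = 0;

        fileID++;
        if ( fileID >= numFiles ) {
            // No files left to read.
            topsFile.close();
            tag = 1;
            break;
        }

        docsFile.clear();  // reset eof/fail flags before reopening
        docsFile.open( dataFolder + docFName +
                       to_string( fileID ) + ".dat" );
        docsFile.seekg( docPos );
        continue;
    }

    // Parse "<did> [<tid>:<weight>]+" from the line.
    istringstream docsStream( line );

    docsStream >> docDocID;
    map< int, double > docData;

    while ( docsStream >> tID >> docDelim >> tWeight )
        docData[ tID ] = tWeight;

    documents[ docDocID ] = docData;
    docCount++;
}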


Above, the file reading takes place in a master-worker setting. Each worker needs to read up to docPerW documents, as long as there are still files to read from. topsFile is another file object, from which I need to read the topics (this is not relevant to my question).


In summary, is there a better way to read the terms and their weights for a document given in such a format than

1. getting the line first,

2. then converting it to an istringstream object, and,

3. finally reading from that object until the stream ends?
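For reference, the three steps above can be written as one self-contained function. This is only a sketch under some assumptions: loadDocuments, Document and maxDocs are illustrative names that do not appear in my program, the input is any std::istream rather than the sequence of .dat files, and lines that do not start with a document id are simply skipped.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

using Document = std::map<int, double>;  // term id -> weight

// Hypothetical helper: reads at most maxDocs documents from `in`.
std::map<int, Document> loadDocuments(std::istream& in, std::size_t maxDocs) {
    std::map<int, Document> documents;
    std::string line;
    std::size_t count = 0;

    // Step 1: get the line; the loop ends once getline fails.
    while (count < maxDocs && std::getline(in, line)) {
        // Step 2: wrap the line in an istringstream.
        std::istringstream docsStream(line);
        int docID;
        if (!(docsStream >> docID)) continue;  // skip blank or malformed lines

        // Step 3: extract pairs until the stream is exhausted.
        Document docData;
        int tID;
        char delim;
        double weight;
        while (docsStream >> tID >> delim >> weight)
            docData[tID] = weight;

        documents[docID] = docData;
        ++count;
    }
    return documents;
}
```

Because the limit is checked before getline is called, a later call on the same stream continues with the next unread line, which matches the "up to docPerW documents per worker" requirement.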


Thanks for your help and suggestions.

