Saturday, January 5, 2013

Get Sentence from raw text


String getSentence(String text, Set<String> dictionary);

// text is a string without spaces, you need to insert spaces into text, so each word seperated by the space in the resulting string exists in the dictionary, return the resulting string

// getSentence("iamastudentfromwaterloo", {"from, "waterloo", "hi", "am", "yes", "i", "a", "student"}) -> "i am a student from waterloo"


Approach:
String [] tokens = {"from", "waterloo", "hi", "am", "as", "stud", "yes", "i", "a", "student"};
String text = "iamastudentfromwaterloo";


Lets create a hashtable named tokenMap which will contain all the token of tokens array.

We will also have 2 stacks, lets say indexStack to keep track of starting index of all the matched tokens in text (this will become clear below .. keep on reading :))
And another stack resultStack which will store all the tokens of the final expected sentence.

We will also have a StringBuffer sb which will keep track of substring from text which would be a candinate token (we will verify that by checking in hashtable tokenMap)

Now,
We will read one character at a time from "text"
StringBuffer sb = "i";
Our start index = 0
Lets push index onto indextStack
sb = "i" is present in the hashtable tokenMap, so now we have following
indexStack = 0 -> null
resultStack = "i" -> null

we now reset sb = "";
Next we see sb = "a"
"a" is in tokenMap so now we have following
indexStack = 1 -> 0 -> null;
resultStack = "a" -> "i" -> null

Now things get interesting as we keep on reading the characters from text and appending to sb but the string is not there in hashtable tokenMap
like sb = "mastudent" ...
So we need to decide to backtrack in such case (logic can be if sb length goes beyond max length of any of the given tokens in tokens array)

So when we backtrack we pop the stacks and reset StringBuffer sb
indexStack = 0 -> null
resultStack = "i" -> null
sb = "";

now again we start appending characters to sb till sb has a match in tokenMap HashTable

We keep repeating the above steps till we have constructed our sentence or till we decide its NOT possible to do so.

Input:

String [] tokens = {"from", "waterloo", "hi", "am", "as", "stud", "yes", "i", "a", "student"};
String text = "iamastudentfromwaterloo";

Output:

Max Token Length: 8
Sentence is: i am a student from waterloo




1 comment: