Explain Stream Tokenizer
By Ramakrishna on Dec 17, 2008 in StreamTokenizer
Stream Tokenizer :
StreamTokenizer is a direct subclass of Object and belongs to the java.io package. The StreamTokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time. The parsing process is controlled by a syntax table and a number of flags (for example eolIsSignificant and slashSlashComments) that can be set to various states. The type of each token read is reported through integer constants such as TT_WORD, TT_EOF and TT_EOL. The StreamTokenizer can recognize identifiers, numbers, quoted strings and various comment styles. Tokenizing is a basic step in compilers, interpreters and parsers.
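As a rough illustration of how a flag changes the parsing, the sketch below counts end-of-line tokens with eolIsSignificant switched off and then on. The class name TokenizerFlagsDemo, the helper countEol and the sample text are made up for this example; only the StreamTokenizer calls come from java.io.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class TokenizerFlagsDemo {
    // Hypothetical helper: counts how many TT_EOL tokens the tokenizer reports.
    static int countEol(String text, boolean eolSignificant) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        // When false (the default), line ends are skipped like any other whitespace.
        st.eolIsSignificant(eolSignificant);
        int eols = 0;
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    eols++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return eols;
    }

    public static void main(String[] args) {
        String text = "alpha beta\ngamma\n";
        System.out.println(countEol(text, false)); // prints 0
        System.out.println(countEol(text, true));  // prints 2
    }
}
```

The same text produces different token sequences depending on the flag, which is why the flags are usually set before the first call to nextToken.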
A stream can contain three types of tokens.
- Words (that is, multi-character tokens)
- Single-character tokens
- Whitespace (including C/C++/Java-style comments)
Some constants defined in StreamTokenizer, used to identify the tokens :
int TT_EOL : A constant indicating that the end of the line has been read. It is returned only if eolIsSignificant(true) has been called.
int TT_EOF : A constant indicating that the end of the stream has been read.
int TT_WORD : A constant indicating that a word token has been read.
int ttype : Not a constant but a field. After a call to the nextToken method, this field contains the type of the token just read.
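To make the role of ttype more concrete, here is a minimal sketch that labels each token of a short string. The class name TokenTypeDemo, the helper describe and the sample input are assumptions for illustration. It also uses TT_NUMBER together with the sval and nval fields, which StreamTokenizer provides alongside the constants listed above. Quoted strings and TT_EOL are not handled specially here; they would fall into the default branch.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypeDemo {
    // Hypothetical helper: returns one label per token, based on ttype.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + st.sval);   // word text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER:" + st.nval); // numeric value is in nval (a double)
                        break;
                    default:
                        // for a single-character token, ttype holds the character itself
                        out.add("CHAR:" + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [WORD:total, CHAR:=, NUMBER:42.0, CHAR:;]
        System.out.println(describe("total = 42 ;"));
    }
}
```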
Aim : To count the number of words in a file using StreamTokenizer, with whitespace as the delimiter. The file name is passed as a command-line argument.
A sample program using StreamTokenizer follows :
    import java.io.*;

    public class StreamTokenizerDemo {
        static int words = 0;

        public static void wordCount(Reader r) throws IOException
        {
            StreamTokenizer st = new StreamTokenizer(r);
            // treat character codes 33 to 255 (everything above the space) as word characters
            st.wordChars(33, 255);
            // read tokens until the end of the stream
            while (st.nextToken() != StreamTokenizer.TT_EOF)
            {
                // count only word tokens
                if (st.ttype == StreamTokenizer.TT_WORD)
                    words++;
            }
        }

        public static void main(String[] args) throws IOException
        {
            // the file name is passed as the first command-line argument
            try (FileReader fr = new FileReader(args[0]))
            {
                wordCount(fr);
            }
            System.out.println("Total words in file : " + words);
        }
    }
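One thing worth knowing about the program above: a tokenizer created with the default settings still parses numbers, so a token that starts with a digit comes back as a number and a lone "-" comes back as a single-character token. Neither is counted as a word. The sketch below compares that behaviour with a variant that calls resetSyntax() first, so that every whitespace-delimited token is a word. The class and method names are made up for this comparison, and this is only one possible way to do it.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class WordCountVariants {
    // Same setup as the sample program: default syntax plus wordChars(33, 255).
    static int countWithDefaults(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars(33, 255);
        return countWords(st);
    }

    // Variant: clear the syntax table, then define only words and whitespace.
    static int countWhitespaceDelimited(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();           // every character becomes "ordinary"
        st.wordChars(33, 255);      // everything above the space is a word character
        st.whitespaceChars(0, ' '); // codes 0 to 32 separate tokens
        return countWords(st);
    }

    // Counts the TT_WORD tokens produced by the given tokenizer.
    private static int countWords(StreamTokenizer st) {
        int words = 0;
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    words++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "3 apples - 2.5 pears";
        System.out.println(countWithDefaults(text));        // prints 2
        System.out.println(countWhitespaceDelimited(text)); // prints 5
    }
}
```

Which count is "right" depends on whether numbers should be treated as words for your purpose.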
Method signature of wordChars() :
public void wordChars(int low, int high);
Specifies that all characters c in the range low <= c <= high are word constituents. A word token consists of a word constituent followed by zero or more word constituents or number constituents.
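A small sketch of what wordChars changes in practice: by default the underscore is not a word constituent, so "user_name" is split at the underscore. After wordChars('_', '_') it is read as one word. The class WordCharsDemo and the helper firstWord are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class WordCharsDemo {
    // Hypothetical helper: returns the first token if it is a word, otherwise null.
    static String firstWord(String text, boolean underscoreIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (underscoreIsWordChar) {
            // a range with low == high marks a single character as a word constituent
            st.wordChars('_', '_');
        }
        try {
            st.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return st.ttype == StreamTokenizer.TT_WORD ? st.sval : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("user_name", false)); // prints user
        System.out.println(firstWord("user_name", true));  // prints user_name
    }
}
```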
