Handling strings in Java

K. Bretonnel Cohen's home page

In all the world, no language has better support for string-handling than Perl. However, Java is becoming increasingly popular, so it behooves the linguist to learn string-handling in Java. This page gives some techniques for performing common string-handling tasks in Java. I'll add to the page as the demands of my job dictate that I learn new techniques. :-)

This page is intended for readers who already have some familiarity with Java.

The String class versus the StringBuffer class

Irritatingly, Java has two classes with similar contents, but different functionality. These are the String and StringBuffer classes. Having these two separate classes probably allows Java to be much more efficient in its use of memory, but it's a minor pain in the rear for the developer.

The two classes differ in that instances of the String class cannot be modified in place, while instances of the StringBuffer class can be modified in place.

Since instances of the String class cannot be modified in place, calling a method such as toUpperCase() or toLowerCase() on a String returns a new String and leaves the original unchanged. This can force the use of lots of temporary variables.
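To make the point concrete, here is a minimal sketch using String.toUpperCase(), which is the JDK's name for the upper-casing method. The class name ImmutableStringDemo and the helper shout() are made up for illustration, and Locale.ENGLISH is passed only so that the result does not depend on the machine's default locale.

```java
import java.util.Locale;

public class ImmutableStringDemo {
    // Hypothetical helper: returns an upper-cased copy of the word.
    // The String passed in is never changed.
    static String shout(String word) {
        return word.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        String word = "mita";
        // The result has to be stored in a second variable,
        // because word itself cannot be modified in place.
        String upper = shout(word);
        System.out.println(word);   // the original, unchanged
        System.out.println(upper);  // the new upper-cased String
    }
}
```

If you don't need the original any more, you can reassign the same variable (word = word.toUpperCase(Locale.ENGLISH);) instead of introducing a temporary one.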

If you're building a string piece by piece, the best way to do it is with a StringBuffer and the StringBuffer.append() method. Repeatedly adding to a String is much slower, because each addition creates a whole new String object.
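Here is a small sketch of that approach. The class name StringBufferDemo and the method joinWords() are my own inventions for the example; only StringBuffer, append(), and toString() come from the JDK.

```java
public class StringBufferDemo {
    // Hypothetical helper: joins the words with single spaces,
    // appending to one StringBuffer instead of creating a new
    // String for every word.
    static String joinWords(String[] words) {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                buffer.append(' ');
            }
            buffer.append(words[i]);
        }
        // Convert to an ordinary String once, at the very end.
        return buffer.toString();
    }

    public static void main(String[] args) {
        String[] words = {"im", "ani", "le?acmi", "ma", "ani"};
        System.out.println(joinWords(words));
    }
}
```

The StringBuffer is modified in place on every append() call, and a String is only created once, by toString().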

Analyzing a string token-by-token

Tokenization in Java consists of two separate issues: the case where tokenization is on a character-by-character basis, and the case where tokenization is done on the basis of a separator character. The former must be approached algorithmically. The latter is well-supported in the Java platform, by way of the StringTokenizer class.

Analyzing a string character-by-character

You will use the String.charAt() method, which returns the character at an indexed position in the input string. For example, the following code fragment analyzes an input word character-by-character and prints out a message if the input word contains a coronal consonant:

// the next two lines show construction of a String from a literal
String input = "mita";
String coronals = "sztdSZ";
int index;
char tokenizedInput;
// the String.length() method returns the length of a String. String
// indices are zero-based, so the last valid index is length() - 1;
// that is why the loop condition uses < and not <=.
for (index = 0; index < input.length(); index++) {
    tokenizedInput = input.charAt(index);
    // String.indexOf() returns -1 if the string doesn't contain the character
    // in question. if it doesn't return -1, then you know that it
    // does contain the character in question.
    if (coronals.indexOf(tokenizedInput) != -1) {
        System.out.print("The word <");
        System.out.print(input);
        System.out.print("> contains the coronal consonant <");
        System.out.print(tokenizedInput);
        System.out.println(">.");
    }
}

This produces the output The word <mita> contains the coronal consonant <t>.

Analyzing a string word-by-word

You will use the StringTokenizer class, which lives in the java.util package. By default it splits its input on whitespace:

    // at the top of your file: import java.util.StringTokenizer;
    // make a new String object
    String input = "im ani le?acmi ma ani";
    // make a new tokenizer object. note that you pass it the
    // string that you want parsed
    StringTokenizer tokenizer = new StringTokenizer(input);
    // StringTokenizer.hasMoreTokens() returns true as long as
    // there's more data in it that hasn't yet been given to you
    while (tokenizer.hasMoreTokens()) {
        // StringTokenizer.nextToken() returns the
        // next token that the StringTokenizer is holding.
        // (of course, the first time you call it, that
        // will be the first token in the input. :-) )
        String currentToken = tokenizer.nextToken();
        // ...and now you can do whatever you like with
        // that token! (checkForCoronalConsonants() stands in
        // for a method of your own; it is not part of Java.)
        checkForCoronalConsonants(currentToken);
    }
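
StringTokenizer can also split on separator characters of your own choosing, if you pass them as a second argument to the constructor. The following is a minimal sketch of that; the class name DelimiterDemo, the method tokens(), and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {
    // Hypothetical helper: returns the tokens of line, treating
    // every character in delimiters as a separator.
    static List<String> tokens(String line, String delimiters) {
        List<String> result = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(line, delimiters);
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Both the comma and the semicolon act as separators here.
        List<String> words = tokens("im,ani;le?acmi,ma;ani", ",;");
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i));
        }
    }
}
```

Note that StringTokenizer never returns empty tokens: two separators in a row are simply skipped, so "a,,b" split on commas gives just a and b.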

Getting Perl-like string manipulation abilities in Java

If you're using a version of Java prior to 1.4...

The nice folks at Jakarta have made available their Jakarta ORO packages. Their Perl5Util package implements Perl's m// and s// operators, and the Perl split() function. You can get to the Perl5Util documentation, albeit somewhat circuitously, through this link.

If you're using Java 1.4 or later...

Life smiles on you. This version of Java comes with built-in support for regular expressions, through the java.util.regex package. Furthermore, popular functionalities can be accessed through the String class, via the matches(), replaceFirst(), replaceAll(), and split() methods.
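
Here is a rough sketch of those four String methods at work. The class name StringRegexDemo, the helper method names, and the sample words are my own, chosen to echo the coronal-consonant example earlier on this page.

```java
public class StringRegexDemo {
    // matches() succeeds only if the pattern matches the WHOLE string,
    // hence the trailing .* after the character class.
    static boolean startsWithCoronal(String word) {
        return word.matches("[sztdSZ].*");
    }

    // Like Perl's s/[sztdSZ]/C/ (first occurrence only).
    static String maskFirstCoronal(String word) {
        return word.replaceFirst("[sztdSZ]", "C");
    }

    // Like Perl's s/[sztdSZ]/C/g (every occurrence).
    static String maskAllCoronals(String word) {
        return word.replaceAll("[sztdSZ]", "C");
    }

    // Like Perl's split(/\s+/, $sentence).
    static String[] splitOnWhitespace(String sentence) {
        return sentence.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(startsWithCoronal("tami"));
        System.out.println(maskFirstCoronal("data"));
        System.out.println(maskAllCoronals("data"));
        String[] words = splitOnWhitespace("im ani  le?acmi");
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
    }
}
```

Remember that a backslash in a regular expression has to be doubled inside a Java string literal, which is why \s+ is written as "\\s+".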

A short quote from Flanagan's Java In A Nutshell, 4th edition, which I strongly suggest that you buy:

The matches(), replaceFirst(), replaceAll(), and split() methods are suitable for when you use a regular expression only once. If you want to use a regular expression for multiple matches, you should explicitly use the Pattern and Matcher classes of the java.util.regex package.
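
As a sketch of what that advice looks like in practice, the following compiles a pattern once and reuses it for several words. The class name CoronalCounter and the method countCoronals() are hypothetical; Pattern.compile(), Pattern.matcher(), and Matcher.find() are the real java.util.regex calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoronalCounter {
    // Compiled once, then reused for every word that gets checked.
    private static final Pattern CORONAL = Pattern.compile("[sztdSZ]");

    // Hypothetical helper: counts the coronal consonants in a word.
    static int countCoronals(String word) {
        Matcher matcher = CORONAL.matcher(word);
        int count = 0;
        // find() moves on to the next match each time it is called,
        // and returns false when there are no more matches.
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] words = {"mita", "data", "ani"};
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + ": " + countCoronals(words[i]));
        }
    }
}
```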

Other sources to check out
