Matthew's Development Review: Microsoft Word, Javadoc, and Perforce filenames

I've come across an interesting part of the Java documentation article "How to Write Doc Comments for the Javadoc Tool".

Quote:

Troubleshooting Curly Quotes (Microsoft Word)

Problem - A problem occurs if you are working in an editor that defaults to curly (rather than straight) single and double quotes, such as Microsoft Word on a PC -- the quotes disappear when displayed in some browsers (such as Unix Netscape). So a phrase like "the display's characteristics" becomes "the displays characteristics."

The illegal characters are the following:

* 146 - right single quote
* 147 - left double quote
* 148 - right double quote

What should be used instead is:

* 39 - straight single quote
* 34 - straight quote

Preventive Solution - The reason the "illegal" quotes occurred was that a default Word option is "Change 'Straight Quotes' to 'Smart Quotes'". If you turn this off, you get the appropriate straight quotes when you type.

Fixing the Curly Quotes - Microsoft Word has several save options -- use "Save As Text Only" to change the quotes back to straight quotes. Be sure to use the correct option:

* "Save As Text Only With Line Breaks" - inserts a space at the end of each line, and keeps curly quotes.
* "Save As Text Only" - does not insert a space at the end of each line, and changes curly quotes to straight quotes.
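
If saving from Word is not an option, the same substitution can be done on the raw bytes. This is only a minimal sketch, assuming the text is in a single-byte Windows encoding where 146, 147 and 148 are the curly quotes listed in the quote above; the class and method names are my own, not part of Javadoc or Word.

```java
import java.nio.charset.StandardCharsets;

class StraightenQuotes {
    /** Returns a copy of the input with the three curly-quote bytes replaced by ASCII quotes. */
    static byte[] straighten(byte[] input) {
        byte[] out = input.clone();
        for (int i = 0; i < out.length; i++) {
            switch (out[i] & 0xFF) {
                case 146:               // right single quote -> 39 (')
                    out[i] = 39;
                    break;
                case 147:               // left double quote  -> 34 (")
                case 148:               // right double quote -> 34 (")
                    out[i] = 34;
                    break;
                default:                // 145 is not in the quoted list, so it is left alone
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u0092 encodes to byte 146 in ISO-8859-1, standing in for Word's curly apostrophe.
        byte[] sample = "the display\u0092s characteristics".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(straighten(sample), StandardCharsets.US_ASCII));
        // prints: the display's characteristics
    }
}
```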

We've encountered this issue before when developers copy and paste documentation from design documents into code. The troubling part is when the text is already in an SCM and has to be corrected in every version of every file that contains the illegal character. I ran into this with Perforce: I had to go back through several checkpoints (> 2GB) and fixed them by writing a regex that replaces each illegal character with the correct one. We even encountered character codes 96 (base 16) "–" and FB (base 16) "û", pasted from Microsoft Word documents into filenames as well as into the files themselves. This caused real problems when processing our maintenance jobs with Perforce.

I just thought I'd share the following work I've done. Hopefully someone out there can utilize it.

Regex used to process checkpoints:

Convert 96 (base 16) "–" to 2D (base 16) "-" (dash). Example: //depot/project/some– filename.cat
Remove FB (base 16) "û" entirely (rather than converting it to 20 (base 16) " " space). Example: //depot/project/docs/some û other doc.doc@ 1 65539

# Find results and print
perl -n -e '/^(.*)([\xfb])(.*)$/ && print "$1$2$3\n"' checkpoint.1 > correction.1/checkpoint.1.FB
perl -n -e '/^(.*)([\x96])(.*)$/ && print "$1$2$3\n"' checkpoint.1 > correction.1/checkpoint.1.96

perl -n -e '/^(.*)([\xfb])(.*)$/ && print "$1$2$3\n"' checkpoint.2 > checkpoint.2.FB
perl -n -e '/^(.*)([\x96])(.*)$/ && print "$1$2$3\n"' checkpoint.2 > checkpoint.2.96
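
For anyone without Perl at hand, the "find and print" step can be sketched in Java as well. This is a rough equivalent under my own assumptions: the checkpoint is read as ISO-8859-1 so that every byte maps to exactly one char, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BadByteFinder {
    /** Returns the lines that contain the given byte value (0-255). */
    static List<String> linesContaining(Stream<String> lines, int badByte) {
        char bad = (char) badByte;
        return lines.filter(line -> line.indexOf(bad) >= 0)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample lines. For a real file, the stream could come from
        // Files.lines(Paths.get("checkpoint.1"), StandardCharsets.ISO_8859_1).
        Stream<String> sample = Stream.of(
                "//depot/project/docs/some \u00fb other doc.doc",
                "//depot/project/clean.doc");
        linesContaining(sample, 0xFB).forEach(System.out::println);
    }
}
```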

# Replace all
perl -pe 's/\xfb//g' checkpoint.1 > checkpoint.1.fb_removed
perl -pe 's/\x96/\x2d/g' checkpoint.1.fb_removed > checkpoint.1.96_and_fb_removed

p4d -jr /path/to/perforce/corrected/checkpoint/checkpoint.1.96_and_fb_removed
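
The two "replace all" substitutions can also be sketched in Java. Like the Perl one-liners, this works byte by byte, so it assumes a single-byte encoding where FB and 96 only ever appear as the bad characters. The class and method names are hypothetical, and the sample line is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class CheckpointCleaner {
    /** Cleans the first len bytes of buf in place and returns how many bytes were kept. */
    static int cleanChunk(byte[] buf, int len) {
        int kept = 0;
        for (int i = 0; i < len; i++) {
            int b = buf[i] & 0xFF;
            if (b == 0xFB) {
                continue;                                   // same as s/\xfb//g
            }
            buf[kept++] = (b == 0x96) ? (byte) 0x2D : buf[i]; // same as s/\x96/\x2d/g
        }
        return kept;
    }

    /** Copies in to out in fixed-size chunks so a multi-gigabyte file never sits in memory. */
    static void clean(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, cleanChunk(buf, n));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "//depot/project/some\u0096 \u00fbfilename.cat\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        clean(new ByteArrayInputStream(sample), out);
        System.out.print(out.toString("ISO-8859-1"));
        // prints: //depot/project/some- filename.cat
    }
}
```

For a real checkpoint, the two streams would be a FileInputStream and a FileOutputStream (ideally buffered) instead of the in-memory ones used here.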

Regex to help parse Perforce specs and logs:

.*@\(//.*\)@.*

\1 db file
\2 workspace spec
\3 perforce spec

\(//.*\)@+.*@+.*

\(//.*\)\#.*

Turn depot paths from the Perforce maintenance log into mv commands that rename the lowercase filename (\L\2) to its original case (\e\2):

//\(.*\)/\(.*\)\#[0-9].*
\1 directory
\2 filename
mv /path/to/perforce/\1/\L\2 /path/to/perforce/\1/\e\2
filenameMap.put("/path/to/perforce/\1/\e\2", "/path/to/perforce/\1/\L\2");

To also capture the file type (for example text or binary+l):
//\(.*\)/\(.*\)\#[0-9].*(\(.*\)).*
\1 directory
\2 filename
\3 type
filenameList.add(new FileRenameGroup("/path/to/perforce/\1/\e\2", "/path/to/perforce/\1/\L\2", "\3"));
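
The directory/filename/type pattern above can be tried out with java.util.regex. This is a sketch, not the editor macro I actually ran: the sample log line is invented to fit the pattern, the "/path/to/perforce" root is the placeholder used above, and DepotLineParser is a hypothetical name.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DepotLineParser {
    // Groups: 1 = directory, 2 = filename, 3 = type inside literal parentheses.
    private static final Pattern LINE =
            Pattern.compile("^//(.*)/(.*)#[0-9].*\\((.*)\\).*$");

    /** Returns {originalCasePath, lowercasePath, type}, or null if the line does not match. */
    static String[] parse(String line, String root) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String dir = m.group(1);
        String name = m.group(2);
        return new String[] {
            root + "/" + dir + "/" + name,                            // like \e\2
            root + "/" + dir + "/" + name.toLowerCase(Locale.ROOT),   // like \L\2
            m.group(3)
        };
    }

    public static void main(String[] args) {
        // Hypothetical line; real maintenance log lines may look different.
        String[] r = parse("//depot/project/Some File.doc#3 (binary+l)", "/path/to/perforce");
        System.out.println(String.join(" | ", r));
    }
}
```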

A Java utility program for renaming bad files to correct filenames based on the regex above (as posted, it only checks and counts which files exist):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class PerforcemaintRenameFiles {
    public static void main(String[] args) {
        // Key: path with the correct case; value: path with the wrong lowercase name.
        // Placeholder entry; real entries come from the filenameMap.put(...) regex output above.
        Map<String, String> filenameMap = new HashMap<>();
        filenameMap.put(
                "/absolute/path/to/filename/containing/bad/character/file.bad,filename",
                "/absolute/path/to/filename/containing/bad/character/file.correct.filename");

        int filenameCorrectionsCount = 0;
        int filenameTotal = 0;
        int filenameLowercaseCount = 0;

        for (Map.Entry<String, String> entry : filenameMap.entrySet()) {
            String filenameCorrectCase = entry.getKey();
            String filenameWrongLowercase = entry.getValue();

            // Both names are checked on disk with a ",d" suffix appended.
            File correctCaseFile = new File(filenameCorrectCase + ",d");
            File wrongLowercaseFile = new File(filenameWrongLowercase + ",d");

            if (correctCaseFile.exists()) {
                filenameCorrectionsCount++;
            }

            if (wrongLowercaseFile.exists()) {
                filenameLowercaseCount++;
            } else {
                System.out.println("File: " + wrongLowercaseFile.getAbsolutePath() + " does not exist.");
            }
            filenameTotal++;
        }

        // Use the platform line separator instead of the reversed "\n\r".
        String newline = System.lineSeparator();
        System.out.println("There were " + filenameTotal + " total files." + newline
                + filenameCorrectionsCount + " correct files exist." + newline
                + filenameLowercaseCount + " lower case files exist.");
    }
}
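
The listing above only reports which files exist. A minimal sketch of the rename step itself could look like the following, assuming the same map layout (key = correct name, value = wrong name) and the same ",d" suffix. RenameSketch and renameAll are hypothetical names, and the demo works in a throwaway temp directory rather than a real Perforce root.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class RenameSketch {
    /** Moves each existing wrong path to its correct path and returns how many were moved. */
    static int renameAll(Map<String, String> correctToWrong, String suffix) {
        int moved = 0;
        for (Map.Entry<String, String> e : correctToWrong.entrySet()) {
            Path correct = Paths.get(e.getKey() + suffix);
            Path wrong = Paths.get(e.getValue() + suffix);
            // Skip entries that are already fixed or whose source is missing.
            if (Files.exists(wrong) && !Files.exists(correct)) {
                try {
                    Files.move(wrong, correct);
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Files.createFile(dir.resolve("old-name.doc,d"));

        Map<String, String> map = new LinkedHashMap<>();
        map.put(dir.resolve("new-name.doc").toString(), dir.resolve("old-name.doc").toString());

        System.out.println(renameAll(map, ",d") + " file(s) renamed");
    }
}
```

I would run this against a copy of the depot files first, since a case-only rename behaves differently on case-insensitive file systems.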

Monday, August 18, 2008
