How To Open Unicode Data Files

While some programs open Unicode and ASCII data files equally well, some statistical packages and programming languages require a little extra effort to work with Unicode. Here are some solutions if you have a Unicode dataset you need to work with.

Convert Unicode to ASCII in Windows

Naturally, you can only convert Unicode characters that have ASCII equivalents. Most data files in English use only characters available in both encodings, so conversion is often an option. There are at least two options available to convert Unicode files to ASCII files in Windows.
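
Before converting, it can help to check whether a file really contains only characters that ASCII can represent. The following is a minimal Python sketch of such a check, not part of the original instructions. It assumes the file is UTF-16 encoded, and the function name non_ascii_characters is made up for illustration.

```python
import os
import tempfile


def non_ascii_characters(path, encoding="utf-16"):
    """Return the set of characters in the file that fall outside ASCII."""
    found = set()
    # Read line by line so that large files are not loaded into memory at once.
    with open(path, "r", encoding=encoding, newline="") as src:
        for line in src:
            found.update(ch for ch in line if ord(ch) > 127)
    return found


if __name__ == "__main__":
    # Build a small throwaway UTF-16 file so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "ExampleUnicode.csv")
        with open(sample, "w", encoding="utf-16", newline="") as out:
            out.write("Name,Age,EyeColor\nAndrea,71,Green\n")
        leftovers = non_ascii_characters(sample)
        print("safe to convert" if not leftovers else sorted(leftovers))
```

An empty result means every character has an ASCII equivalent, so a plain conversion should lose nothing.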

WordPad

  1. Open the file with WordPad.
  2. Go to File -> Save As. In the drop-down menu just below the file name field, change the file type from Unicode Text Document to Text Document.
  3. Enter the file name you want, remembering to specify the suffix you want, such as .csv. The default is .txt.

The WordPad option is convenient, but may not work for very large files and requires a lot of pointing and clicking. The following command line option solves those problems.

TYPE command

From the Windows command line, you can convert a Unicode-encoded file to an ASCII-encoded file using the TYPE command.
  1. Click Start, click Run, type "cmd", and click OK.
  2. Use the TYPE command as follows:
TYPE "path and file of unicode file" > "path and file of ascii file to create"

Here's an example:
TYPE "C:\Documents and Settings\Robert\Desktop\ExampleUnicode.csv" > "C:\Documents and Settings\Robert\Desktop\ExampleASCII.csv"

If you wish to avoid specifying the path, you can opt to cd into the appropriate folder first.

The TYPE command essentially reads the input file and prints it to the output file. In the process, it converts the input to the format it would use if it were printing to the screen. If you leave off the "> out file" part, the command prints directly to the screen. Note that this trick may not work for UTF-8 files. UTF-8 is backwards compatible with ASCII, so a UTF-8 file that contains only ASCII characters is already a valid ASCII file and needs no conversion.
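
If Python is available, the same read-and-rewrite idea can be sketched in a few lines. This is an illustrative alternative, not what TYPE does internally. It assumes the input is UTF-16 with a byte order mark, and the function name utf16_to_ascii is hypothetical.

```python
import os
import tempfile


def utf16_to_ascii(src_path, dst_path, errors="strict"):
    """Copy a UTF-16 text file to an ASCII file, one line at a time.

    errors="strict" raises UnicodeEncodeError on a non-ASCII character;
    errors="replace" writes "?" in its place instead.
    """
    # newline="" leaves the original line endings untouched on both sides.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors=errors, newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    # Self-contained demonstration using a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        unicode_file = os.path.join(tmp, "ExampleUnicode.csv")
        ascii_file = os.path.join(tmp, "ExampleASCII.csv")
        with open(unicode_file, "w", encoding="utf-16", newline="") as out:
            out.write("Name,Age,EyeColor\r\nAndrea,71,Green\r\n")
        utf16_to_ascii(unicode_file, ascii_file)
        print(os.path.getsize(unicode_file), "bytes in,",
              os.path.getsize(ascii_file), "bytes out")
```

Because the file is processed line by line, this approach should also cope with files that are too large for WordPad. With the default errors="strict", a non-ASCII character stops the conversion and may leave a partial output file behind.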

Convert Unicode to ASCII in Linux

iconv command

From a Linux terminal, you can convert a file from almost any encoding using the iconv command.
  • To convert a Unicode file to ASCII, use the iconv command as follows:
       iconv -f UTF-16 -t ASCII//TRANSLIT//IGNORE ExampleUnicode.csv > ExampleASCII.csv

    • This command tells iconv that the input file is UTF-16 encoded and that you want ASCII output, with characters that have no exact match replaced by approximations where possible and removed from the output otherwise.
  • A list of the character encodings known to iconv can be displayed with the following command:
       iconv -l
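
Where iconv is not available, a rough approximation of the "approximate if possible, otherwise drop" behaviour can be sketched in Python. This is not iconv's transliteration table: it uses Unicode NFKD decomposition to strip accents, so results may differ from iconv for some characters. The function names below are hypothetical.

```python
import unicodedata


def to_ascii_approx(text):
    """Approximate text in ASCII: strip accents where possible, drop the rest."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" drops the combining marks and anything else non-ASCII.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def convert_file(src_path, dst_path, src_encoding="utf-16"):
    """Write an ASCII approximation of src_path to dst_path, line by line."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(to_ascii_approx(line))


if __name__ == "__main__":
    print(to_ascii_approx("Zo\u00eb"))  # the accented e becomes a plain e
```

Characters with no decomposition into an ASCII base letter are silently removed, so it is worth comparing row and field counts before and after conversion.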
    

Open the Unicode Data File Directly in R

(This section needs creating. If you know how to do this, please create this section.)

(See also this May 2008 R-help thread.)

Open the Unicode Data File Directly in Stata

(This section needs creating. If you know how to do this, please create this section.)

Open the Unicode Data File Directly in SAS

(This section needs creating. If you know how to do this, please create this section.)

Sample Data to Play With

The sample files contain the following dataset, delimited by commas.

Name Age EyeColor
Andrea 71 Green
Bobby 72 Hazel
Charles 73 Brown
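
If the sample files are not to hand, equivalent files can be generated locally. The sketch below is one way to do that in Python. The file names ExampleUnicode.csv and ExampleASCII.csv are borrowed from the examples earlier on this page, the helper name write_sample is made up, and the generated files are not guaranteed to be byte-for-byte identical to the original attachments.

```python
import csv

ROWS = [
    ["Name", "Age", "EyeColor"],
    ["Andrea", "71", "Green"],
    ["Bobby", "72", "Hazel"],
    ["Charles", "73", "Brown"],
]


def write_sample(path, encoding):
    """Write the sample dataset as a comma-delimited file in the given encoding."""
    # The csv module expects newline="" and writes its own line endings.
    with open(path, "w", encoding=encoding, newline="") as out:
        csv.writer(out).writerows(ROWS)


if __name__ == "__main__":
    write_sample("ExampleUnicode.csv", "utf-16")  # UTF-16 with a byte order mark
    write_sample("ExampleASCII.csv", "ascii")
```

Running the script writes both files to the current folder, giving a matched pair for trying out the conversion methods above.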

Topic revision: r4 - 16 Apr 2009, CharlesDupont
 
