2014/11/26

[R] Part 7: Reading Files

Reading Files in R

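Before reading or writing files, you need to know where the working directory is, because that is where R looks for a file given by a relative name. Use getwd() to display the working directory and setwd() to change it. The most basic reading function is read.table(); read.csv() and read.delim() are both thin wrappers that call read.table() underneath. Use ? to look up a function's documentation; typing a function's name without parentheses prints its source.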
> getwd()
[1] "C:/Users/******/Documents"
> ?read.csv
> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
    fill = TRUE, comment.char = "", ...) 
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x0000000007870d78>
<environment: namespace:utils>
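As a minimal sketch of how the working directory affects file lookup, the snippet below switches to a temporary directory, writes a small file there, and reads it back by its bare file name. The file name wd-demo.csv and its contents are made up for illustration; tempdir() stands in for whatever data folder you actually use.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Switch to a temporary directory (a stand-in for your own data folder).
setwd(tempdir())

# Write a tiny, hypothetical file here and read it back by file name only.
writeLines(c("id,name", "1,a", "2,b"), "wd-demo.csv")
demo <- read.csv("wd-demo.csv")
print(demo)

# Clean up and return to the original working directory.
file.remove("wd-demo.csv")
setwd(old_wd)
```

Restoring the original directory at the end keeps later relative paths in the same session pointing where you expect.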
read.csv
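A CSV (Comma-Separated Values) file is a format that separates data values with commas; read it with read.csv(). The file contents below are excerpted from http://www.andrewpatton.com/countrylist.html
If the first line of the file gives the column names, it is called the header. If the file has no header and the first line is already data, pass header=FALSE to the function, for example read.csv("fileName.csv", header=FALSE).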
> data <- read.csv("example-csv.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## header
Sort Order,Common Name,Formal Name,Type,Sub Type,Sovereignty,Capital,ISO 4217 Currency Code,ISO 4217 Currency Name,ITU-T Telephone Code,ISO 3166-1 2 Letter Code,ISO 3166-1 3 Letter Code,ISO 3166-1 Number,IANA Country Code TLD
## data
263,Tristan da Cunha,,Proto Dependency,Dependency of Saint Helena,United Kingdom,Edinburgh,SHP,Pound,290,TA,TAA,,
264,Antarctica,,Disputed Territory,,Undetermined,,,,,AQ,ATA,10,.aq
...
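To illustrate the header=FALSE case described above, here is a small sketch that reads the same headerless file both ways. The two records are a shortened, hypothetical version of the country list, written to a temporary file so the example is self-contained; the column names passed to col.names are also invented for the example.

```r
# Hypothetical two-line file with no header row, written to a temp file.
path <- tempfile(fileext = ".csv")
writeLines(c("263,Tristan da Cunha,TA", "264,Antarctica,AQ"), path)

# With the default header = TRUE, the first record is consumed as column names,
# so only one data row remains.
with_header <- read.csv(path)
nrow(with_header)

# With header = FALSE, both records are kept and R names the columns V1, V2, V3.
no_header <- read.csv(path, header = FALSE)
nrow(no_header)
names(no_header)

# col.names is passed through to read.table() and supplies your own names.
named <- read.csv(path, header = FALSE,
                  col.names = c("sort_order", "common_name", "iso2"))
names(named)
```

Forgetting header=FALSE does not raise an error; it silently turns the first record into column names, so checking nrow() after reading is a cheap safeguard.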
read.csv2
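The main difference between read.csv and read.csv2 is the file format they expect. In many locales a comma is used as the decimal mark in numbers (for example 0,01), so using the comma as the field separator as well would split such numbers apart. read.csv2 therefore expects a semicolon as the field separator and a comma as the decimal mark, whereas read.csv expects a comma as the separator and . as the decimal mark, as in 0.01 and 19546.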
> data <- read.csv2("example-csv2.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## data
...
263;Tristan da Cunha;;Proto Dependency;Dependency of Saint Helena;United Kingdom;Edinburgh;SHP;Pound;290;TA;TAA;;
264;Antarctica;;Disputed Territory;;Undetermined;;;;;AQ;ATA;10;.aq
265;Kosovo;;Disputed Territory;;Administrated by the UN;Pristina;CSD and EUR;Dinar and Euro;381;CS;SCG;891;.cs and .yu
...
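The country list above contains no decimal numbers, so the decimal-mark behaviour of read.csv2 is not visible in it. The sketch below uses a small invented file (the item and price columns are hypothetical) to show that read.csv2 parses 0,01 as a number, and that it gives the same result as calling read.table() with sep=";" and dec="," spelled out.

```r
# Hypothetical semicolon-separated file that uses a comma as the decimal mark.
path <- tempfile(fileext = ".csv")
writeLines(c("item;price", "apple;0,01", "pear;1,5"), path)

# read.csv2 defaults to sep = ";" and dec = ",", so price is read as numeric.
eu <- read.csv2(path)
is.numeric(eu$price)
eu$price

# The same call written out against read.table(), the function underneath.
same <- read.table(path, header = TRUE, sep = ";", dec = ",")
all.equal(eu, same)
```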
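read.delim / read.delim2
Both functions treat the tab character (\t) as the field separator, so when you open such a file the values appear separated by wide gaps. As with read.csv and read.csv2, the difference between read.delim and read.delim2 is the decimal mark: read.delim expects . and read.delim2 expects a comma.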
> data <- read.delim("example-tab.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## data
...
264 Antarctica  Disputed Territory  Undetermined     AQ ATA 10 .aq
265 Kosovo  Disputed Territory  Administrated by the UN Pristina CSD and EUR Dinar and Euro 381 CS SCG 891 .cs and .yu
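Because tabs are hard to see in a printed listing, the following sketch writes the \t characters explicitly. The files and the share column are invented for illustration; the point is only that read.delim splits on tabs and reads 1.25 as a number, while read.delim2 does the same for a file that writes the value as 0,5.

```r
# Hypothetical tab-separated file; "\t" makes the separators explicit.
path <- tempfile(fileext = ".txt")
writeLines(c("code\tname\tshare",
             "AQ\tAntarctica\t0.5",
             "CS\tKosovo\t1.25"), path)

# read.delim splits fields on tabs and expects "." as the decimal mark.
tab <- read.delim(path)
str(tab)

# A second hypothetical file that uses a comma as the decimal mark.
path2 <- tempfile(fileext = ".txt")
writeLines(c("code\tshare", "AQ\t0,5"), path2)

# read.delim2 also splits on tabs but expects "," as the decimal mark.
tab2 <- read.delim2(path2)
is.numeric(tab2$share)
```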
