[R Notes] Part 7: Reading Files

Reading files in R

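Before reading or writing files, you need to know where the working directory is, so that you know where to put the files you want to read and where written files will end up. getwd() returns the current working directory, and setwd() changes it. The most basic reader is read.table(); both read.csv() and read.delim() are thin wrappers that call read.table() underneath. Put a '?' in front of a function name to look up how to use it.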
> getwd()
[1] "C:/Users/******/Documents"
> ?read.csv
> read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
    fill = TRUE, comment.char = "", ...) 
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)
<bytecode: 0x0000000007870d78>
<environment: namespace:utils>
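As a quick check of the wrapper relationship shown in the printed source above, here is a minimal sketch. The sample rows and the temporary file are made up for illustration; only read.csv() and read.table() come from the post.

```r
# Write a tiny CSV to a temporary file (hypothetical sample data)
path <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,Antarctica", "2,Kosovo"), path)

# read.csv() with its defaults
a <- read.csv(path)

# The same call spelled out with read.table(), using the defaults
# that appear in the body of read.csv()
b <- read.table(path, header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "")

# Both calls should produce the same data frame
identical(a, b)
```

Using tempfile() keeps the example independent of the working directory; for a real file you would either place it in the directory reported by getwd() or pass a full path.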

read.csv
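A CSV (Comma-Separated Values) file is a format that separates data fields with commas. In R, read.csv() reads it.
The file contents below are taken from http://www.andrewpatton.com/countrylist.html
If the first line of the file holds the column names, it is called the header. If the file has no header and the first line is already data, pass header=FALSE to the function, as in read.csv("fileName.csv", header=FALSE).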
> data <- read.csv("example-csv.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## the file as opened in Notepad++
## header
Sort Order,Common Name,Formal Name,Type,Sub Type,Sovereignty,Capital,ISO 4217 Currency Code,ISO 4217 Currency Name,ITU-T Telephone Code,ISO 3166-1 2 Letter Code,ISO 3166-1 3 Letter Code,ISO 3166-1 Number,IANA Country Code TLD
## data
263,Tristan da Cunha,,Proto Dependency,Dependency of Saint Helena,United Kingdom,Edinburgh,SHP,Pound,290,TA,TAA,,
264,Antarctica,,Disputed Territory,,Undetermined,,,,,AQ,ATA,10,.aq
...
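To see what header=FALSE changes, here is a small sketch with a headerless file. The two rows are borrowed from the data above, but the file itself is a hypothetical temporary file, not the post's example-csv.csv.

```r
# A headerless CSV: the first line is already data (hypothetical file)
path <- tempfile(fileext = ".csv")
writeLines(c("263,Tristan da Cunha", "264,Antarctica"), path)

# Default header = TRUE: the first data row is consumed as column names
with_header <- read.csv(path)

# header = FALSE: every line is data, columns are named V1, V2, ...
no_header <- read.csv(path, header = FALSE)

nrow(with_header)   # one row fewer than the file has lines
nrow(no_header)     # one row per line
names(no_header)
```

If you forget header=FALSE on such a file, no error is raised; the first record simply disappears into the column names, so it is worth checking nrow() against the line count.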

read.csv2
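The main difference between read.csv and read.csv2 is the format they expect. In some locales the comma is used as the decimal mark, so 0.01 is written as 0,01. If commas also separated the fields, such numbers would be split apart, so read.csv2() takes the semicolon as the field separator and the comma as the decimal mark. Note that this means read.csv2() reads 19,546 as 19.546, not as nineteen thousand. read.csv() uses '.' as the decimal mark, as in 0.01 and 19546.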
> data <- read.csv2("example-csv2.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## the file as opened in Notepad++
...
263;Tristan da Cunha;;Proto Dependency;Dependency of Saint Helena;United Kingdom;Edinburgh;SHP;Pound;290;TA;TAA;;
264;Antarctica;;Disputed Territory;;Undetermined;;;;;AQ;ATA;10;.aq
265;Kosovo;;Disputed Territory;;Administrated by the UN;Pristina;CSD and EUR;Dinar and Euro;381;CS;SCG;891;.cs and .yu
...
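The country list has no decimal numbers, so the effect of the decimal mark is not visible in it. The following sketch uses a made-up two-column file (the names and values are assumptions for illustration) to show how read.csv2() treats commas inside numbers.

```r
# Semicolon-separated file with decimal commas (hypothetical sample data)
path <- tempfile(fileext = ".csv")
writeLines(c("name;value", "a;0,01", "b;19,546"), path)

# read.csv2() uses sep = ";" and dec = ","
d2 <- read.csv2(path)
d2$value            # numeric; the comma is treated as the decimal mark

# The same result with read.csv() and explicit arguments
d1 <- read.csv(path, sep = ";", dec = ",")
identical(d1, d2)
```

The second call is only there to show that the two functions differ in their defaults for sep and dec and nothing else.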

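read.delim
read.delim and read.delim2 differ from read.csv in the separator: both expect a tab (\t) between fields, so when you open the file the data appear separated by wide gaps. The difference between read.delim and read.delim2 is again the decimal mark: '.' for read.delim and ',' for read.delim2.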
> data <- read.delim("example-tab.csv")
> head(data)
  Sort.Order                                        Common.Name Formal.Name
1        263                                   Tristan da Cunha          NA
2        264                                         Antarctica          NA
3        265                                             Kosovo          NA
4        266 Palestinian Territories (Gaza Strip and West Bank)          NA
5        267                                     Western Sahara          NA
6        268                     Australian Antarctic Territory          NA
...

## the file as opened in Notepad++
...
264 Antarctica  Disputed Territory  Undetermined     AQ ATA 10 .aq
265 Kosovo  Disputed Territory  Administrated by the UN Pristina CSD and EUR Dinar and Euro 381 CS SCG 891 .cs and .yu
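Since tabs are hard to see in an editor, a small sketch that writes them explicitly may help. The column names and the second file with decimal commas are assumptions made up for this example.

```r
# Tab-separated file; "\t" marks each field boundary (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("code\tname\tcapital",
             "264\tAntarctica\t",
             "265\tKosovo\tPristina"), path)

d <- read.delim(path)      # sep = "\t", dec = "."
dim(d)
d$capital                  # the empty field stays an empty string

# read.delim2() is the same reader with dec = ","
path2 <- tempfile(fileext = ".txt")
writeLines(c("name\tvalue", "a\t0,5"), path2)
d2 <- read.delim2(path2)
d2$value                   # numeric, comma read as the decimal mark
```

An empty field in a text column is read as an empty string, whereas a column that is empty in every row (like Formal.Name in the output above) comes back as NA.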
