
Using pyhdfs in Python to connect to and operate on HDFS: a pyhdfs usage guide (with code and results)


Writing this up takes effort, so please credit the source when reposting. Thank you!

Official HDFS documentation:

1. The HdfsClient class

The HdfsClient class is central to pyhdfs: it connects to the HDFS NameNode and supports querying, reading, and writing files on HDFS.

In[1]:

import pyhdfs

class pyhdfs.HdfsClient(hosts=u'localhost', randomize_hosts=True, user_name=None, timeout=20, max_tries=2, retry_delay=5, requests_session=None, requests_kwargs=None)

Parameters:

  • hosts: NameNode address. The example here separates the IP address and port with a comma, e.g. hosts="45.91.43.237,9000"; note that the pyhdfs documentation expects "host:port" strings, with commas separating multiple hosts and the port defaulting to the WebHDFS port 50070 (which matches the get_active_namenode() output below). Multiple hosts can be passed as a list, e.g. ["47.95.45.254,9000", "47.95.45.235,9000"]
  • randomize_hosts: randomize the order in which hosts are tried; defaults to True
  • user_name: user name to connect to the Hadoop cluster as
  • timeout: seconds to wait for each NameNode connection; defaults to 20
  • max_tries: number of connection attempts per NameNode; defaults to 2
  • retry_delay: seconds to wait after a failed connection attempt before trying the next NameNode; defaults to 5
  • requests_session: requests session to use for the HTTP calls to HDFS; defaults to None
  • requests_kwargs: additional keyword arguments passed through to the requests library

In[0]:

# Example: connect to the NameNode
client = pyhdfs.HdfsClient(hosts="45.91.43.237,9000", user_name="hadoop")
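
When the cluster has multiple NameNodes (for example an HA setup), a list of hosts can be passed instead. A minimal sketch, assuming two hypothetical NameNodes nn1 and nn2 exposing WebHDFS on the default port 50070:

# A hedged sketch: "nn1" and "nn2" are hypothetical hostnames
client = pyhdfs.HdfsClient(
    hosts=["nn1:50070", "nn2:50070"],  # "host:port" strings, per the pyhdfs docs
    user_name="hadoop",
    timeout=20,      # seconds to wait per connection attempt
    max_tries=2,     # connection attempts per NameNode
    retry_delay=5,   # seconds between retries
)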

2. Get the user's home directory

get_home_directory(**kwargs)

In[2]:

# Return this user's home directory
print client.get_home_directory()

Out[2]:

/user/hadoop

Note: before connecting, add the cluster's IP-to-hostname mappings to your local hosts file, otherwise the calls will fail (WebHDFS redirects reads and writes to DataNodes by hostname, so the client must be able to resolve those hostnames).

The detailed fix is described here:
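
For reference, the mapping is just an entry in the local hosts file. A minimal sketch, where "master" is a hypothetical hostname for the NameNode:

# /etc/hosts on Linux/macOS (C:\Windows\System32\drivers\etc\hosts on Windows)
45.91.43.237    master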

3. Get the active NameNode

get_active_namenode(max_staleness=None)

In[3]:

# Print the currently active NameNode
print client.get_active_namenode()

Out[3]:

45.91.43.237:50070

4. List all files in a directory

listdir(path, **kwargs)

In[5]:

# List all files in the given directory
print client.listdir("/user/hadoop")

Out[5]:

[u'.password', u'.sparkStaging', u'QuasiMonteCarlo_1525339502176_165201397', u'QuasiMonteCarlo_1525340283182_484907947', u'QuasiMonteCarlo_1525340542994_724956601', u'QuasiMonteCarlo_1525428514052_1305531458', u'QuasiMonteCarlo_1525428870962_320046470', u'QuasiMonteCarlo_1525429827638_1734729002', u'QuasiMonteCarlo_1525430442752_1819520486', u'QuasiMonteCarlo_1525430754280_1904667948', u'QuasiMonteCarlo_1525431222757_1446112904', u'QuasiMonteCarlo_1525431511572_67243213', u'QuasiMonteCarlo_1525437383596_1909178162', u'_sqoop', u'ceshi', u'exercise1.txt', u'exercise1map.py', u'exercise1reduce.py', u'speech_text.txt']

5. Open a file on a remote node, returning an HTTPResponse object

open(path, **kwargs)

In[14]:

# Open a file on a remote node; returns an HTTPResponse object
response = client.open("/user/hadoop/speech_text.txt")
# Read the file contents
response.read()

Out[14]:

'Fellow-citizens, being fully invested with that high office to which the partiality of my countrymen has called me, I now take an affectionate leave of you. You will bear with you to your homes the remembrance of the pledge I have this day given to discharge all the high duties of my exalted station according to the best of my ability, and I shall enter upon their performance with entire confidence in the support of a just and generous people.\n hello! there\'s a new message!\n\n hello! there\'s a new message!\n'
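
Because open() returns a file-like HTTP response, a large file can also be streamed in chunks instead of read into memory at once. A minimal sketch (the 64 KB chunk size is an arbitrary choice):

response = client.open("/user/hadoop/speech_text.txt")
while True:
    chunk = response.read(64 * 1024)  # read up to 64 KB per iteration
    if not chunk:                     # an empty result signals end of file
        break
    # ... process the chunk here ...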

6. Upload a local file to the cluster

copy_from_local(localsrc, dest, **kwargs)

7. Copy a file from the cluster to the local machine

copy_to_local(src, localdest, **kwargs)

In[15]:

# Directory listing before uploading the local file
print "Before copy_from_local"
print client.listdir("/user/hadoop")
# Upload the local file to the cluster
client.copy_from_local("D:/Jupyter notebook/ipynb_materials/src/test.csv", "/user/hadoop/test.csv")
# Directory listing after the upload
print "After copy_from_local"
print client.listdir("/user/hadoop")

Out[15]:

Before copy_from_local
[u'.password', u'.sparkStaging', u'QuasiMonteCarlo_1525339502176_165201397', u'QuasiMonteCarlo_1525340283182_484907947', u'QuasiMonteCarlo_1525340542994_724956601', u'QuasiMonteCarlo_1525428514052_1305531458', u'QuasiMonteCarlo_1525428870962_320046470', u'QuasiMonteCarlo_1525429827638_1734729002', u'QuasiMonteCarlo_1525430442752_1819520486', u'QuasiMonteCarlo_1525430754280_1904667948', u'QuasiMonteCarlo_1525431222757_1446112904', u'QuasiMonteCarlo_1525431511572_67243213', u'QuasiMonteCarlo_1525437383596_1909178162', u'_sqoop', u'ceshi', u'exercise1.txt', u'exercise1map.py', u'exercise1reduce.py', u'speech_text.txt']
After copy_from_local
[u'.password', u'.sparkStaging', u'QuasiMonteCarlo_1525339502176_165201397', u'QuasiMonteCarlo_1525340283182_484907947', u'QuasiMonteCarlo_1525340542994_724956601', u'QuasiMonteCarlo_1525428514052_1305531458', u'QuasiMonteCarlo_1525428870962_320046470', u'QuasiMonteCarlo_1525429827638_1734729002', u'QuasiMonteCarlo_1525430442752_1819520486', u'QuasiMonteCarlo_1525430754280_1904667948', u'QuasiMonteCarlo_1525431222757_1446112904', u'QuasiMonteCarlo_1525431511572_67243213', u'QuasiMonteCarlo_1525437383596_1909178162', u'_sqoop', u'ceshi', u'exercise1.txt', u'exercise1map.py', u'exercise1reduce.py', u'speech_text.txt', u'test.csv']
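
copy_to_local() works the same way in the opposite direction. A minimal sketch (the local destination path is an arbitrary example):

# Download the file we just uploaded back to the local machine
client.copy_to_local("/user/hadoop/test.csv",
                     "D:/Jupyter notebook/ipynb_materials/src/test_copy.csv")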

8. Append text to an existing file

append(path, data, **kwargs)

9. Concatenate files

concat(target, sources, **kwargs)

In[16]:

# Append text to an existing file
# First, look at the file's current contents
response = client.open("/user/hadoop/test.csv")
response.read()

Out[16]:

'n,n+2,n*2\r\r\n0,2,0\r\r\n1,3,2\r\r\n2,4,4\r\r\n3,5,6\r\r\n4,6,8\r\r\n5,7,10\r\r\n6,8,12\r\r\n7,9,14\r\r\n8,10,16\r\r\n9,11,18\r\r\n'

In[17]:

# Append a string with append()
client.append("/user/hadoop/test.csv", "0,2,0\r\r\n")
# Look at the contents again
response = client.open("/user/hadoop/test.csv")
response.read()

Out[17]:

'n,n+2,n*2\r\r\n0,2,0\r\r\n1,3,2\r\r\n2,4,4\r\r\n3,5,6\r\r\n4,6,8\r\r\n5,7,10\r\r\n6,8,12\r\r\n7,9,14\r\r\n8,10,16\r\r\n9,11,18\r\r\n0,2,0\r\r\n'
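
concat() is not demonstrated in this session. A minimal sketch with hypothetical paths, assuming part1.csv and part2.csv already exist on the cluster; note that HDFS appends the sources onto the target and then deletes the source files:

# Hypothetical paths: append part1.csv and part2.csv onto target.csv
client.concat("/user/hadoop/target.csv",
              ["/user/hadoop/part1.csv", "/user/hadoop/part2.csv"])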

10. Create a new directory

mkdirs(path, **kwargs)

In[20]:

# Before creating a directory, list the files under the current path
client.listdir("/user/hadoop/")

Out[20]:

[u'.password',
 u'.sparkStaging',
 u'QuasiMonteCarlo_1525339502176_165201397',
 u'QuasiMonteCarlo_1525340283182_484907947',
 u'QuasiMonteCarlo_1525340542994_724956601',
 u'QuasiMonteCarlo_1525428514052_1305531458',
 u'QuasiMonteCarlo_1525428870962_320046470',
 u'QuasiMonteCarlo_1525429827638_1734729002',
 u'QuasiMonteCarlo_1525430442752_1819520486',
 u'QuasiMonteCarlo_1525430754280_1904667948',
 u'QuasiMonteCarlo_1525431222757_1446112904',
 u'QuasiMonteCarlo_1525431511572_67243213',
 u'QuasiMonteCarlo_1525437383596_1909178162',
 u'_sqoop',
 u'ceshi',
 u'exercise1.txt',
 u'exercise1map.py',
 u'exercise1reduce.py',
 u'speech_text.txt',
 u'test.csv']

In[22]:

# Create the new directory
client.mkdirs("/user/hadoop/data")

Out[22]:

True

In[23]:

# List the current path again:
# a new 'data' directory has appeared
client.listdir("/user/hadoop/")

Out[23]:

[u'.password',
 u'.sparkStaging',
 u'QuasiMonteCarlo_1525339502176_165201397',
 u'QuasiMonteCarlo_1525340283182_484907947',
 u'QuasiMonteCarlo_1525340542994_724956601',
 u'QuasiMonteCarlo_1525428514052_1305531458',
 u'QuasiMonteCarlo_1525428870962_320046470',
 u'QuasiMonteCarlo_1525429827638_1734729002',
 u'QuasiMonteCarlo_1525430442752_1819520486',
 u'QuasiMonteCarlo_1525430754280_1904667948',
 u'QuasiMonteCarlo_1525431222757_1446112904',
 u'QuasiMonteCarlo_1525431511572_67243213',
 u'QuasiMonteCarlo_1525437383596_1909178162',
 u'_sqoop',
 u'ceshi',
 u'data',
 u'exercise1.txt',
 u'exercise1map.py',
 u'exercise1reduce.py',
 u'speech_text.txt',
 u'test.csv']

11. Check whether a path exists

exists(path, **kwargs)

In[29]:

# Check whether the file exists
client.exists("/user/hadoop/test.csv")

Out[29]:

True

12. Get a content summary for a path

get_content_summary(path, **kwargs)

In[28]:

# Get a content summary for the path
client.get_content_summary("/user/hadoop")

Out[28]:

ContentSummary(spaceQuota=-1, length=268497153, directoryCount=34, spaceConsumed=805491459, quota=-1, fileCount=98)
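
As a sanity check on this output: spaceConsumed equals length times the replication factor (268497153 × 3 = 805491459), and a quota or spaceQuota of -1 means no quota is set on the path.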

13. Get a file's checksum

get_file_checksum(path, **kwargs)

In[27]:

# Get the file's checksum
client.get_file_checksum("/user/hadoop/test.csv")

Out[27]:

FileChecksum(length=28, bytes=u'0000020000000000000000009b79c1de3fbc34132510593a6073ecf500000000', algorithm=u'MD5-of-0MD5-of-512CRC32C')

14. Get the status of a path (works for both directories and files)

list_status(path, **kwargs)

In[24]:

# Get the status of every entry under the path
client.list_status("/user/hadoop")

Out[24]:

[FileStatus(group=u'supergroup', permission=u'400', blockSize=134217728, accessTime=1532665989204L, pathSuffix=u'.password', modificationTime=1517972575373L, replication=3, length=4, childrenNum=0, owner=u'hadoop', storagePolicy=0, type=u'FILE', fileId=17768),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'.sparkStaging', modificationTime=1521528004629L, replication=0, length=0, childrenNum=4, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=26735),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525339502176_165201397', modificationTime=1525339503697L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=28309),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525340283182_484907947', modificationTime=1525341538004L, replication=0, length=0, childrenNum=2, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=28326),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525340542994_724956601', modificationTime=1525341600823L, replication=0, length=0, childrenNum=2, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=28343),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525428514052_1305531458', modificationTime=1525428515590L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=29623),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525428870962_320046470', modificationTime=1525428872502L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=29641),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525429827638_1734729002', modificationTime=1525429829220L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=29909),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525430442752_1819520486', modificationTime=1525430444346L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=29926),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525430754280_1904667948', modificationTime=1525430755899L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=29936),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525431222757_1446112904', modificationTime=1525431224390L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=30072),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525431511572_67243213', modificationTime=1525431513121L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=30089),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'QuasiMonteCarlo_1525437383596_1909178162', modificationTime=1525437385222L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=30099),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'_sqoop', modificationTime=1517981304673L, replication=0, length=0, childrenNum=1, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=18255),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'ceshi', modificationTime=1517977450123L, replication=0, length=0, childrenNum=2, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=17847),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=0, accessTime=0, pathSuffix=u'data', modificationTime=1532943534037L, replication=0, length=0, childrenNum=0, owner=u'hadoop', storagePolicy=0, type=u'DIRECTORY', fileId=34289),
 FileStatus(group=u'supergroup', permission=u'644', blockSize=134217728, accessTime=1529049559630L, pathSuffix=u'exercise1.txt', modificationTime=1529049559773L, replication=3, length=109, childrenNum=0, owner=u'jovyan', storagePolicy=0, type=u'FILE', fileId=33021),
 FileStatus(group=u'supergroup', permission=u'644', blockSize=134217728, accessTime=1529049596083L, pathSuffix=u'exercise1map.py', modificationTime=1529049596226L, replication=3, length=1063, childrenNum=0, owner=u'jovyan', storagePolicy=0, type=u'FILE', fileId=33022),
 FileStatus(group=u'supergroup', permission=u'644', blockSize=134217728, accessTime=1529049638764L, pathSuffix=u'exercise1reduce.py', modificationTime=1529049638904L, replication=3, length=456, childrenNum=0, owner=u'jovyan', storagePolicy=0, type=u'FILE', fileId=33023),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=134217728, accessTime=1532939667735L, pathSuffix=u'speech_text.txt', modificationTime=1532940839913L, replication=3, length=49827, childrenNum=0, owner=u'hadoop', storagePolicy=0, type=u'FILE', fileId=34287),
 FileStatus(group=u'supergroup', permission=u'755', blockSize=134217728, accessTime=1532943080708L, pathSuffix=u'test.csv', modificationTime=1532943291036L, replication=3, length=107, childrenNum=0, owner=u'hadoop', storagePolicy=0, type=u'FILE', fileId=34288)]

In[25]:

# Get the status of a single file
client.list_status("/user/hadoop/test.csv")

Out[25]:

[FileStatus(group=u'supergroup', permission=u'755', blockSize=134217728, accessTime=1532943080708L, pathSuffix=u'', modificationTime=1532943291036L, replication=3, length=107, childrenNum=0, owner=u'hadoop', storagePolicy=0, type=u'FILE', fileId=34288)]
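
The returned FileStatus objects expose their fields as attributes (in current pyhdfs versions they are dictionaries with attribute access), so individual fields are easy to pull out. A minimal sketch:

# Print name, type, and size for each entry under the directory
for status in client.list_status("/user/hadoop"):
    print status.pathSuffix, status.type, status.length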

That covers the most commonly used pyhdfs operations. Thanks for reading.

Follow the WeChat public account for more material.


