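A Foot Soldier's Roundup of Common Python Operations

At work I often use Python for automation and scripting. I am not a Python specialist, just a foot soldier, and these are the common operations I have collected recently.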

1. Searching a list

for x in items:
    if meets_condition(x):  # meets_condition stands in for your own test
        # do some action
        pass
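When only the first match is needed, the loop can be shortened. The following is a minimal sketch; the helper name find_first and the sample numbers are made up for illustration.

```python
def find_first(items, predicate, default=None):
    """Return the first element for which predicate(element) is true."""
    return next((x for x in items if predicate(x)), default)

numbers = [3, 8, 15, 22]
first_big = find_first(numbers, lambda n: n > 10)    # first element above 10
all_even = [n for n in numbers if n % 2 == 0]        # every match, as a list
missing = find_first(numbers, lambda n: n > 100)     # no match, so the default
```

The generator inside next() stops at the first hit, so the rest of the list is never scanned.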

2. Converting a string to a number

s = "99"

int(s)
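int() raises ValueError when the text is not a valid integer, which matters when the string comes from a file or from user input. A small sketch of a forgiving wrapper; the name to_int and the default parameter are assumptions for this example.

```python
def to_int(text, default=None):
    """Convert text to int, returning default when the conversion fails."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return default

good = to_int("99")
padded = to_int("  42\n")     # int() ignores surrounding whitespace
bad = to_int("abc")
decimal = to_int("9.5", default=0)
```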

3. Generating a random number in Python

import random

random.randint(1, 50)  # random integer between 1 and 50, both ends included

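4. Creating a directory

import os

os.mkdir('temp')  # creates one directory; its parent must already exist

Creating nested subdirectories

os.makedirs('temp/1/1/')  # also creates any missing intermediate directories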

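5. Checking whether a directory exists

os.path.exists(path)  # True for files as well as directories; os.path.isdir(path) is true only for directories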
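Items 4 and 5 are usually combined into "create the directory unless it is already there". A sketch under that assumption, run in a throwaway temp directory; ensure_dir is a hypothetical helper name.

```python
import os
import tempfile

def ensure_dir(path):
    """Create path (and missing parents) if needed; return True if it was created."""
    if os.path.isdir(path):
        return False
    os.makedirs(path, exist_ok=True)
    return True

base = tempfile.mkdtemp()
target = os.path.join(base, "temp", "1", "1")
created_first = ensure_dir(target)     # directory did not exist yet
created_second = ensure_dir(target)    # already there, nothing to do
```

exist_ok=True keeps os.makedirs from raising if something else creates the directory between the check and the call.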

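6. Enumerating a directory

os.walk(dir) visits the whole directory tree recursively.

import os

def enumDirs(dir):
    abs_files = []
    for root, dirs, files in os.walk(dir):
        for file in files:
            s = os.path.join(root, file)
            print(f'abs file={s}')
            abs_files.append(s)
    return abs_files

Alternatively:

os.listdir(dir)  # returns only the files and folders directly inside dir, without recursion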
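The same recursive listing can be written with pathlib. This is a sketch, not a replacement for the version above; list_files and the tiny demo tree are invented for the example.

```python
from pathlib import Path
import tempfile

def list_files(top):
    """Return sorted absolute paths of all regular files under top, recursively."""
    return sorted(str(p.resolve()) for p in Path(top).rglob("*") if p.is_file())

# Build a small throwaway tree to demonstrate.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.txt").write_text("a")
(root / "sub" / "b.txt").write_text("b")

found = list_files(root)
names = [Path(f).name for f in found]
```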

7. Getting full paths from listdir

for item in os.listdir(dir):
    absPath = os.path.join(dir, item)
    # absPath is ready to use (it is absolute as long as dir itself is an absolute path)

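8. Deleting a directory together with its subdirectories and files

import os
import shutil

os.removedirs(dir)  # only works when the directories contain no files

shutil.rmtree(dir)  # removes the whole tree, contents included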
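The difference between the two calls is easy to see on a non-empty directory. A sketch in a temporary location; the directory and file names are arbitrary.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
tree = os.path.join(base, "temp")
os.makedirs(os.path.join(tree, "1", "1"))
with open(os.path.join(tree, "1", "note.txt"), "w", encoding="utf-8") as f:
    f.write("not empty")

try:
    os.removedirs(tree)       # refuses: the directory still has content
    removedirs_failed = False
except OSError:
    removedirs_failed = True

shutil.rmtree(tree)           # removes the directory and everything below it
still_there = os.path.exists(tree)
shutil.rmtree(base)           # clean up the temporary parent as well
```

Because shutil.rmtree deletes without asking, it is worth double-checking the path before calling it.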

9. Checking whether a path is a file

os.path.isfile(filename)

10. Generating a random number

import random

random.randint(0, 9)  # random integer between 0 and 9, both ends included

11. Reading and writing text files

Reading a file line by line

with open(filename, 'r', encoding='utf-8') as file_to_read:
    while True:
        line = file_to_read.readline()
        if not line:
            break
        process_line(line)  # process_line is your own handler

Writing a file

with open(filename, 'w', encoding='utf-8') as file_to_write:
    file_to_write.write('this is a line')
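A file object can also be iterated directly, which avoids the while/readline/break pattern. A short round-trip sketch using a temporary file; the file name and contents are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as fh:
    fh.write("first line\n")
    fh.write("second line\n")

lines = []
with open(path, "r", encoding="utf-8") as fh:
    for line in fh:                       # yields one line at a time
        lines.append(line.rstrip("\n"))   # drop the trailing newline
```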

12. Passing parameters to Python unittest

Parameters can be passed on the command line, or through environment variables.

On Windows:

SET HOST=localhost

Then, in Python:

from os import environ

print(environ['HOST'])
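Inside a test case the variable is typically read in setUp, with a fallback for when it is not set. A sketch of that idea; the class name HostTest and the value example.test are placeholders, and the assignment to os.environ simulates the SET command.

```python
import os
import unittest

class HostTest(unittest.TestCase):
    def setUp(self):
        # Fall back to localhost when HOST is not defined in the environment.
        self.host = os.environ.get("HOST", "localhost")

    def test_host_is_set(self):
        self.assertTrue(self.host)

os.environ["HOST"] = "example.test"   # stands in for: SET HOST=example.test
suite = unittest.defaultTestLoader.loadTestsFromTestCase(HostTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

environ['HOST'] raises KeyError when the variable is missing, while environ.get returns the default instead.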

13. Joining a list into a string

sep = ''  # avoid naming variables str or list, which would shadow the built-ins

items = ['123', '2222']

sep.join(items)

# output: 1232222
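join only accepts strings, so mixed lists need converting first. A brief sketch with invented values:

```python
parts = ["2021", 11, 5]
date_text = "-".join(str(p) for p in parts)   # convert each element before joining
csv_line = ",".join(["a", "b", "c"])
```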

14. Converting a date string to a datetime

import datetime
import sys

def str2date(s):
    # expected format: 2021-11-05 06:35:18.370
    try:
        return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
    except ValueError:
        print('convert str to date failed. %s' % s)
        sys.exit(-1)
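Exiting the process suits a one-off script; in reusable code it is often easier to return None and let the caller decide. A sketch of that variant; parse_timestamp is a hypothetical name and the second input is an invented bad value.

```python
from datetime import datetime

def parse_timestamp(text, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Parse text with fmt; return None instead of exiting when it does not match."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

ok = parse_timestamp("2021-11-05 06:35:18.370")
bad = parse_timestamp("05/11/2021")
```

%f accepts one to six digits, so ".370" is read as 370000 microseconds.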

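15. Timer

import threading

def func():
    print("timer cb")

timer = threading.Timer(5, func)
timer.start()  # the callback runs once, 5 seconds after start()

That is the basic usage of a timer.

Passing arguments

def func(*args, **kwargs):
    print(args, kwargs)

timer = threading.Timer(5, func, (), {"param1": 1})
timer.start()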
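A Timer is a thread, so it can be joined or cancelled. The sketch below uses a short delay so it finishes quickly; on_timer, the labels, and the delays are illustrative choices.

```python
import threading

calls = []

def on_timer(label, repeat=1):
    calls.append((label, repeat))

# Positional arguments go in args, keyword arguments in kwargs.
t = threading.Timer(0.05, on_timer, args=("job",), kwargs={"repeat": 3})
t.start()
t.join()               # wait until the callback has run

cancelled = threading.Timer(5, on_timer, args=("never",))
cancelled.start()
cancelled.cancel()     # stop the timer before it fires
cancelled.join()
```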

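16. Running a Windows cmd command

import os
import subprocess

subprocess.Popen(cmd, shell=True, stdout=None)  # starts the command without waiting for it to finish

os.system(cmd)  # runs the command and returns its exit status

Running a command and returning its output

def run_cmd(cmd):  # named run_cmd so it does not shadow the built-in exec
    output = subprocess.getoutput(cmd)
    return output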
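When both the exit code and the output are needed, subprocess.run is a common alternative. A sketch that launches the current Python interpreter so it behaves the same on Windows and elsewhere; the helper name run and the child command are assumptions.

```python
import subprocess
import sys

def run(args):
    """Run a command given as a list of arguments; return (exit code, stdout text)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stdout.strip()

code, out = run([sys.executable, "-c", "print('hello from child')"])
```

Passing the command as a list avoids shell quoting problems that come with shell=True.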

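17. Reading HTML tables in Python

import pandas as pd

simpsons = pd.read_html('xxxxx.html')  # returns a list of DataFrames, one per table found

This only picks up content inside <table> elements, not content laid out with <div>.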

18. Command-line arguments with click

Scripts for routine tasks often need command-line arguments. The getopt library would work; here I use click, which is very convenient and very powerful.

import click

@click.group()
def helper():
    pass

@click.command()
@click.option('--type')
def work(type):
    pass

helper.add_command(work)

if __name__ == "__main__":
    helper()

Official site

https://click.palletsprojects.com/en/8.1.x/

19. Defining a property

class ebook(Base):
    __tablename__ = 'book'

    @property
    def title(self):
        pass
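For a version that runs without the ORM base class, here is a plain-Python sketch of a property with a validating setter. The class Book and its validation rule are invented for illustration and are not part of the model above.

```python
class Book:
    def __init__(self, title):
        self._title = title

    @property
    def title(self):
        """Read access: book.title, with no parentheses."""
        return self._title

    @title.setter
    def title(self, value):
        if not value:
            raise ValueError("title must not be empty")
        self._title = value

book = Book("python notes")
book.title = "Python Notes"     # goes through the setter
try:
    book.title = ""
    rejected = False
except ValueError:
    rejected = True
```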

20. Reading a file in binary mode

with open("filename.zip", mode="rb") as zip_file:
    content = zip_file.read()

21. Getting a file's size

import os

statinfo = os.stat(filename)

statinfo.st_size  # size in bytes
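A quick way to check the last two items together is to write known bytes, read them back in binary mode, and compare sizes. A sketch with an arbitrary temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "blob.bin")
payload = bytes(range(16))               # 16 known bytes

with open(path, mode="wb") as fh:
    fh.write(payload)

size_from_stat = os.stat(path).st_size
size_from_getsize = os.path.getsize(path)   # shortcut for the same value

with open(path, mode="rb") as fh:
    content = fh.read()
```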

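22. Getting the class name and method name

# Note: f_back is the caller's frame, so these lines log the name of the function that
# called the one they are written in. Use sys._getframe().f_code.co_name for the
# current method's own name.
logger.info("Enter %s:%s" % (self.__class__.__name__, sys._getframe().f_back.f_code.co_name))

logger.info("Leave %s:%s" % (self.__class__.__name__, sys._getframe().f_back.f_code.co_name))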
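If the Enter/Leave lines are needed in many methods, a decorator can produce them without frame inspection. This is an alternative technique, not the approach shown above; log_calls and Worker are hypothetical names.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

def log_calls(method):
    """Log 'Enter Class:method' before the call and 'Leave Class:method' after it."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        logger.info("Enter %s:%s", self.__class__.__name__, method.__name__)
        try:
            return method(self, *args, **kwargs)
        finally:
            logger.info("Leave %s:%s", self.__class__.__name__, method.__name__)
    return wrapper

class Worker:
    @log_calls
    def run(self, n):
        return n * 2

result = Worker().run(21)
```

The finally clause ensures the Leave line is written even when the method raises.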

23. Stripping leading and trailing spaces and newlines from a string

str.strip()

24. Finding a whole word

import re

def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

a = findWholeWord('aaaa')(text)  # text is the string to search

if a:
    print('found it')
else:
    print('not found')
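If the word may contain characters that are special in regular expressions, escaping it first is safer. A sketch with made-up sample sentences; find_whole_word is a hypothetical variant of the function above.

```python
import re

def find_whole_word(word, text):
    """Case-insensitive search for word as a whole word; returns a match or None."""
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), flags=re.IGNORECASE)
    return pattern.search(text)

hit = find_whole_word("cat", "The Cat sat down")
miss = find_whole_word("cat", "concatenate")   # 'cat' is only part of a longer word
```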

25. String startswith

Checks whether a string starts with a given prefix.
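A few typical calls, as a sketch with an invented file name. startswith also accepts a tuple of prefixes, and endswith is the matching check for suffixes.

```python
filename = "test_login.py"

is_test = filename.startswith("test_")
is_source = filename.startswith(("src_", "lib_"))   # True if any prefix in the tuple matches
is_python = filename.endswith(".py")
```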

27. Using the Python logger

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)

I do not write large programs, so I simply paste this in and start logging.

28. The handy f-string

When printing logs or debugging, f-strings are often the most convenient of the many formatting methods.

s = 'helloworld'

print(f'string={s}')  # string=helloworld
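f-strings also take format specifications after a colon, which is useful in log lines. A sketch with invented values; the = form needs Python 3.8 or later.

```python
name = "disk"
used = 0.4567
count = 1234567

percent_text = f"{name}: {used:.1%} used"   # percentage with one decimal place
grouped_text = f"{count:,} bytes"           # thousands separators
debug_text = f"{count=}"                    # name and value together (Python 3.8+)
```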

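29. The handy pytest

Once there is more code, especially algorithmic functions, you either set up a separate test program or, more simply, write unit tests. Create a test_xxx.py file in the working directory, import the functions and classes you want to test, and write the test functions directly. Then run pytest -s in that directory and the tests are found and run automatically.

Note: -s lets the output of print statements show in the console.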
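As a sketch of what such a file might contain: the function clamp and the file name test_clamp.py are made up, and in practice the function would be imported from your own module rather than defined in the test file.

```python
# test_clamp.py (hypothetical file name; pytest collects files named test_*.py)

def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(value, high))

def test_clamp_inside_range():
    assert clamp(5, 0, 10) == 5

def test_clamp_below_and_above():
    assert clamp(-3, 0, 10) == 0
    assert clamp(42, 0, 10) == 10
```

pytest collects functions whose names start with test_ and reports each failing assert with the values involved.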

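Recently I have been doing some movie-related work that needs North American box office data, and the most authoritative site for that is Box Office Mojo (BOM below), so I went to take a look. Regular visitors probably know that the site was recently redesigned: the layout is completely new and it is now optimized for mobile devices (the old site only had a desktop version). The pages look much better, but a lot of data is gone. The old site let you look up almost everything; now only part of the data is available, and some of it can only be found in BOM Pro, which is a paid service. To make better use of the data without paying, the only option was to do it myself, so I wrote a Python scraper and collected many years of box office data. Below I use the daily North American box office data as the example; other box office data works the same way and needs only a few code changes.

Figure 1. Screenshot of part of the page to be scraped

The scraper is written entirely in Python, using Anaconda 2019.10 (the latest release at the time of writing, so in principle the Python libraries it bundles are the latest or close to it; the code below may therefore fail on some older installations, in which case please update). The program has two main parts: scraping and storing the data, and drawing a simple chart from it. Each is explained in turn below.

I. Scraping and storing the data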

# First import all the packages we need.
import requests
import pandas as pd
import time
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
import pylab as mpl  # used below to set a font that displays Chinese text without garbling

# URL template for the daily box office pages; the %s is replaced with the year later
urltemplate=r'https://www.boxofficemojo.com/daily/%s/?view=year' 

# Where the data is saved: an Excel workbook. The data set is small, so a database is unnecessary and Excel is enough; CSV would work as well. My path contains Chinese characters and works fine; if you run into problems, use an English-only path.
fileLoc=r'C:\BoxOffice\Box Office Mojo票房爬蟲\Daily\daily-data.xlsx'

# Request headers for the scraper, to get past the site's anti-scraping checks.
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}

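Below is the main body of the scraper. Three points need explaining. First, mode='a': as far as I know this is only available from pandas 0.25.1 onward, and older versions do not have it. Second, perhaps because of my network, the connection dropped occasionally while scraping, so requests.ConnectionError is used to handle that. Third, a small trick: pd.read_html(url) can read the tables on a page directly, but here the page is fetched with requests first and the resulting HTML is then handed to pd.read_html. This gets around the site's anti-scraping mechanism and is also faster, because letting pd.read_html fetch the page itself is very slow.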

from io import StringIO

def scraper(file, headers, urltemp, year_start, year_end):
    # mode='a' appends to an existing workbook, so the file must already exist.
    # The with block saves and closes the workbook on exit, including the early return below.
    with pd.ExcelWriter(file, engine='openpyxl', mode='a') as writer:
        for i in range(year_start, year_end+1):
            url=urltemp % i
            try:
                r=requests.get(url, headers=headers)
                if r.status_code==200:
                    # Wrap the HTML in StringIO: newer pandas versions deprecate passing raw HTML strings
                    df=pd.read_html(StringIO(r.text))[0]
                    df.to_excel(writer, sheet_name=str(i), index=False)
                    time.sleep(2)
            except requests.ConnectionError:
                print('Can not get access to the %s year daily data now' % i)
                return


scraper(fileLoc, headers, urltemplate, 1977, 2019)

The site's data only goes back to 1977, so I scraped everything from 1977 through 2019.

Figure 2. Screenshot of part of the scraped data

II. Plotting the data

# str_to_datetime strips unneeded text from the Date column; some rows carry labels such as "New Year's Eve" that have to be removed
def str_to_datetime(x):
    if len(x) > 14:
        temp=x.split('2019')
        x=temp[0]+'2019'
    return x

# str_to_num converts the "Top 10 Gross" column to numbers; after loading from Excel into pandas these values are strings and must be converted
def str_to_num(x):
    x=x.replace('$', '')
    x=x.replace(',', '')
    x=int(x)
    return x

table=pd.read_excel(fileLoc, sheet_name='2019')
data=table[['Date', 'Top 10 Gross']].copy()  # copy so the assignments below do not raise SettingWithCopyWarning
data['Date']=data['Date'].apply(str_to_datetime)
data['Top 10 Gross']=data['Top 10 Gross'].apply(str_to_num)
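To see what the two cleaning helpers do without downloading anything, they can be tried on a few hand-written strings. The sample values below are invented to resemble the described columns, not copied from the site, and the function names are renamed stand-ins.

```python
def clean_date(raw, year="2019"):
    """Cut off any holiday label that follows the year, e.g. "...2019New Year's Eve"."""
    if len(raw) > 14:
        raw = raw.split(year)[0] + year
    return raw

def money_to_int(raw):
    """Turn a string such as '$12,345,678' into the integer 12345678."""
    return int(raw.replace("$", "").replace(",", ""))

labelled_date = clean_date("Dec 31, 2019New Year's Eve")
plain_date = clean_date("Dec 30, 2019")
gross = money_to_int("$12,345,678")
```

Passing the year as a parameter is one way to reuse the helper for sheets other than 2019.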

# Set the x and y data: x is the date, y is the box office, scaled down to millions because the raw values are too large to plot comfortably
x=pd.to_datetime(data['Date'])
y=data['Top 10 Gross']/1000000

# Find the largest box office value and its position in y, then look up the same position in x to get the corresponding day
max_loc=y.idxmax()
max_y=y.max()
max_date=x.loc[max_loc]

# Set the plot parameters
mpl.rcParams['font.sans-serif']=['SimHei']  # SimHei font; only needed when the labels contain Chinese text
fig=plt.figure(figsize=(16, 6.5))

# Create the axes object
ax=fig.add_subplot(111)  # this figure contains a single chart
ax.set_ylim([0, 200]) 
plt.tick_params(labelsize=13)

# Format the x axis as dates; otherwise the ticks show converted numbers such as '796366'
ax.xaxis.set_major_formatter(mdate.DateFormatter('%Y-%m-%d'))
plt.xticks(pd.date_range(x[len(x)-1], x[0], freq='M'), rotation=90)
text=r'Highest-grossing day: %s, at %.2f hundred million USD' % (max_date.date(), max_y/100)
plt.annotate(text, xy=(max_date, max_y), fontsize=14,
             xytext=(max_date+pd.Timedelta(days=10), max_y+10),
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
             xycoords='data')
plt.ylabel('Box office (million USD)', fontdict={'size':14})
plt.plot(x, y)

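The finished chart looks like this:

Figure 3. Daily North American box office in 2019

III. Conclusion

The scraper above is fairly simple and uses no database, multithreading, or other complex techniques. The greater value lies in what can be mined from the data. Next I will use this data to analyze how the Hollywood film industry developed over the past year and will share the results then, so stay tuned.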

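Preface

Here we pick up from yesterday's Python Data Processing (Part 2) and re-implement the same data extraction in R.

Why do this? My thinking is as follows:

Many people find it hard to get a clear picture of how languages differ.
Others think each programming language stands alone and never find the connections between them.
As I see it, the ideas behind many programming languages are shared and only the implementation differs. Our job is to grasp what they have in common: however the language or the syntax changes, the core ideas stay the same.
Finally, I hope this way of explaining lets you see how closely the languages are related, so you no longer agonize over which one is better or which to learn first. Once the underlying ideas are understood, any language is quick to pick up, and your understanding of programming and of the possible implementations becomes broader.

Now to the main topic.

Getting the file content

1. Modules used

In Python scraping projects, the most commonly used module is requests.

In R, we use the rvest package to fetch web pages and parse them.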

# install "rvest" package
install.packages("rvest")
# library 
library(rvest)

2. Parsing the page

We use the same link as yesterday.

First, read_html fetches the page at the URL.

Then html_text reads the whole page content and returns it as a single string.

# Page URL
URL <- "http://rest.kegg.jp/get/cpd:C01290"
# Fetch the page at URL
res <- read_html(URL)
# Read the page text
text <- html_text(res)

3. Extracting the content

# Split the text into lines
# strsplit returns a list of length 1, so unlist turns it into a character vector
line_list <- unlist(strsplit(text, split = '\n'))
# Create an empty list to hold the extracted data
data <- list()
for (i in seq_along(line_list)) {
  line <- line_list[i]
  # Take the first 12 characters: substr(x, start, stop)
  # returns the characters of x from position start to position stop
  prefix <- substr(line, 1, 12)
  # Check whether the prefix contains word characters (letters, digits, underscore)
  if (grepl("\\w+", prefix)) {
    # Remove the padding whitespace
    key <- sub(pattern = "\\s+", replacement = "", x = prefix) 
  }
  # Take line from position 13 to the end; nchar(x) gives the length of string x
  value <- substr(line, 13, nchar(line))
  if (key == "ENTRY") {
    # Use a Perl-style regex (perl = TRUE) to split on runs of whitespace
    data$entry <- unlist(strsplit(value, split = "\\s+", perl = TRUE))[1]
  } else if (key == "NAME") {
    # Drop the last character (the trailing ";"); the final name has no ";",
    # so its last letter is dropped as well
    v <- substr(value, 1, nchar(value)-1)
    data$name <- c(data$name, v)
  } else if (key == "ENZYME") {
    v <- unlist(strsplit(value, split = "\\s+", perl = TRUE))
    data$enzyme <- c(data$enzyme, v)
  } else if (key == "DBLINKS") {
    v <- unlist(strsplit(value, ": "))
    data$dblinks[v[1]] <- v[2]
  }
}

Printing the extracted information

> data
$entry
[1] "C01290"

$name
[1] "Lactosylceramide"                                      
[2] "beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(11)-ceramide"
[3] "beta-D-Galactosyl-1,4-beta-D-glucosylceramide"         
[4] "Gal-beta1->4Glc-beta1->1'Cer"                          
[5] "LacCer"                                                
[6] "Lactosyl-N-acylsphingosine"                            
[7] "D-Galactosyl-1,4-beta-D-glucosylceramid"               

$enzyme
 [1] "2.4.1.92"  "2.4.1.206" "2.4.1.228" "2.4.1.274" "2.4.99.1"  "2.4.99.9"  "3.2.1.18"  "3.2.1.22" 
 [9] "3.2.1.23"  "3.2.1.47"  "3.2.1.52" 

$dblinks
       PubChem          ChEBI      LIPIDMAPS      LipidBank 
        "4509"        "17950" "LMSP0501AB00"      "GSG1147"

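Summary

Look closely at the code logic and you will see it is very similar to the Python version.

Several of the functions map onto Python functions. Their usage, parameters, and return values differ a little, but they serve the same purpose.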
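To make the mapping concrete, here is a rough Python sketch of the same loop: slicing plays the role of substr, str.split replaces strsplit, and a dict stands in for the R list. It is an illustration rather than the code from the earlier Python post, and the sample record is hand-built from a few of the values printed above, with the 12-character key column assumed from the R code.

```python
def field(key, value):
    """Build one record line: key padded to 12 characters, then the value."""
    return "{:<12}{}".format(key, value)

# Hand-made sample in the assumed flat-file layout; not fetched from the site.
SAMPLE = "\n".join([
    field("ENTRY", "C01290                      Compound"),
    field("NAME", "Lactosylceramide;"),
    field("", "LacCer"),
    field("ENZYME", "2.4.1.92        2.4.1.206"),
    field("DBLINKS", "PubChem: 4509"),
    field("", "ChEBI: 17950"),
    "///",
])

def parse_flat_record(text):
    data = {"name": [], "enzyme": [], "dblinks": {}}
    key = None
    for line in text.splitlines():
        prefix, value = line[:12], line[12:]      # substr(line, 1, 12) and the rest
        if prefix.strip():                        # a new key starts on this line
            key = prefix.strip()
        if key == "ENTRY":
            data["entry"] = value.split()[0]
        elif key == "NAME":
            data["name"].append(value.rstrip(";"))   # strips ';' only when present
        elif key == "ENZYME":
            data["enzyme"].extend(value.split())
        elif key == "DBLINKS":
            db, _, identifier = value.partition(": ")
            data["dblinks"][db] = identifier
    return data

record = parse_flat_record(SAMPLE)
```

Continuation lines have a blank key column, so they reuse the most recent key, exactly as the R loop does.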

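The code above relies entirely on R's built-in string functions, which are somewhat clumsy to use.

In a later post I will cover stringr, R's string manipulation package, which can greatly improve development efficiency.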
