Demystifying how HttpClient handles the character encoding of fetched URLs! - Java - JavaEye Forum

汲取者 2010-04-06

HttpClient is a small project under the Apache foundation's Jakarta Commons project. It wraps functionality for downloading content from remote addresses, and its latest version is 3.0. Project address: http://jakarta./commons/httpclient

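I recently used HttpClient while writing a spider and noticed something interesting: the pages behind some URLs are encoded as utf-8, others as gbk. HttpClient usually decodes them correctly, but for some pages it does not, and the fetched content comes back garbled. The method I call is httpMethod.getResponseBodyAsString().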

On closer analysis, I found that it decodes the body correctly whenever the HTTP response headers include a charset. For example:

HTTP/1.1 200 OK
Connection: close
Content-Type: text/html; charset=utf-8
Set-Cookie: _session_id=066875c3c0530c06c0204b96db403560; domain=; path=/
Vary: Accept-Encoding
Cache-Control: no-cache
Content-Encoding: gzip
Content-Length: 8512
Date: Fri, 16 Mar 2007 09:02:52 GMT
Server: lighttpd/1.4.13

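When the headers carry no charset, the result is garbled. The documentation shows that the encoding can be specified explicitly, for example HttpClientParams.setContentCharset("gbk"); once it is set, pages in that encoding decode correctly. This is where the problem appears: encodings differ from page to page, so a spider cannot hard-code one, and you only find out whether the encoding was right after the page has been fetched. So I read the httpclient source more carefully and found the order in which it picks an encoding: first the charset in the HTTP headers; if the headers have no charset, the contentCharset of HttpClientParams; and if no encoding was specified there either, ISO-8859-1.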

// From org.apache.commons.httpclient.HttpMethodBase (HttpClient 3.x)
/**
 * Returns the character set from the Content-Type header.
 * @param contentheader The content header.
 * @return String The character set.
 */
protected String getContentCharSet(Header contentheader) {
    LOG.trace("enter getContentCharSet( Header contentheader )");
    String charset = null;
    if (contentheader != null) {
        HeaderElement values[] = contentheader.getElements();
        // I expect only one header element to be there
        // No more. no less
        if (values.length == 1) {
            NameValuePair param = values[0].getParameterByName("charset");
            if (param != null) {
                // If I get anything "funny"
                // UnsupportedEncodingException will result
                charset = param.getValue();
            }
        }
    }
    // No charset in the header: fall back to the configured content charset
    if (charset == null) {
        charset = getParams().getContentCharset();
        if (LOG.isDebugEnabled()) {
            LOG.debug("Default charset used: " + charset);
        }
    }
    return charset;
}

// From org.apache.commons.httpclient.params.HttpMethodParams (HttpClient 3.x)
/**
 * Returns the default charset to be used for writing content body,
 * when no charset explicitly specified.
 * @return The charset
 */
public String getContentCharset() {
    String charset = (String) getParameter(HTTP_CONTENT_CHARSET);
    // Nothing configured: the hard-coded last resort is ISO-8859-1
    if (charset == null) {
        LOG.warn("Default content charset not configured, using ISO-8859-1");
        charset = "ISO-8859-1";
    }
    return charset;
}

How many people has this damned iso-8859-1 hurt (Tomcat also defaults to iso-8859-1 when processing submitted data)!
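To make the effect of that default concrete, here is a small sketch using only the JDK, not HttpClient itself. The class name MojibakeDemo and its decode helper are made up for illustration. It decodes a UTF-8 body as ISO-8859-1, which is roughly what happens when neither the header nor the params name a charset, and then shows that the mistake is reversible because ISO-8859-1 maps every byte to exactly one character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates the decoding behaviour.
class MojibakeDemo {
    /** Decodes a response body with the charset the client settled on. */
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] utf8Body = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the text survives.
        String right = decode(utf8Body, StandardCharsets.UTF_8);
        // ISO-8859-1 fallback: every byte becomes its own character.
        String wrong = decode(utf8Body, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
        System.out.println(wrong.length());         // 6, one char per UTF-8 byte

        // The wrong decode loses no bytes, so it can be undone.
        byte[] recovered = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(recovered, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

Because the ISO-8859-1 decode keeps every byte, the ASCII markup of a page stays readable even when its text is garbled.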

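After thinking it over carefully, I decided to wrap httpclient once more, along these lines:

1. Do not set a charset on HttpClientParams at first;

2. After executeMethod, check whether the HTTP headers contain a charset;

3. If a charset is present, return httpMethod.getResponseBodyAsString();

4. If no charset is present, first call httpMethod.getResponseBodyAsString() to get the html, then parse the charset out of the meta tag in the html head, e.g. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">;

5. Set the charset parsed from the meta tag as the contentCharset of HttpClientParams;

6. Call httpMethod.getResponseBodyAsString() again and return its value.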

With this approach in place, the pages fetched from those URLs are no longer garbled. Great!

Of the steps above, only the fourth is a little troublesome, but a third-party html parser can also be used to extract the charset from the meta tag!
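As a rough illustration of the whole fallback, including the meta lookup in the fourth step, here is a sketch that uses only the JDK. It is not the wrapper described above: instead of setting contentCharset and calling getResponseBodyAsString() a second time, it assumes you already hold the raw body bytes (for example from httpMethod.getResponseBody()) and the Content-Type header value. The class name CharsetSniffer, its methods, the regular expressions, and the 4096-byte search window are all assumptions made for the example. A real html parser would be more robust than the regex.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: header charset first, then <meta>, then ISO-8859-1.
class CharsetSniffer {
    // charset=... inside a Content-Type header value
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    // charset=... inside a <meta> tag (http-equiv form or <meta charset=...>)
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the Content-Type value, or null. */
    static Charset fromHeader(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? lookup(m.group(1)) : null;
    }

    /** Returns the charset declared in a <meta> tag near the top of the page, or null. */
    static Charset fromMeta(byte[] body) {
        // ISO-8859-1 keeps every byte, so the ASCII markup is readable
        // whatever the real encoding is. Only the first 4096 bytes are searched.
        int window = Math.min(body.length, 4096);
        String head = new String(body, 0, window, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? lookup(m.group(1)) : null;
    }

    /** Resolves a charset name, returning null for unknown or malformed names. */
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name
        }
    }

    /** Decodes the body using the header charset, then the meta charset, then ISO-8859-1. */
    static String decode(byte[] body, String contentType) {
        Charset charset = fromHeader(contentType);
        if (charset == null) {
            charset = fromMeta(body);
        }
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        return new String(body, charset);
    }
}
```

A call such as CharsetSniffer.decode(bodyBytes, "text/html") would then fall through to the meta tag, while a header value of "text/html; charset=utf-8" is honoured first, matching the order of the steps.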

Unless otherwise noted, all articles on this blog are original.

When reposting, please cite the source: http://netbus.
