使用httpclient抓取时,netstat 发现很多time_wait连接

http://wiki.apache.org/HttpComponents/FrequentlyAskedConnectionManagementQuestions

1. Connections in TIME_WAIT State

After running your HTTP application, you use the netstat command and detect a lot of connections in stateTIME_WAIT. Now you wonder why these connections are not cleaned up.

1.1. What is the TIME_WAIT State?

The TIME_WAIT state is a protection mechanism in TCP. The side thatcloses a socket connection orderly will keep the connection in stateTIME_WAIT for some time, typically between 1 and 4 minutes. Thishappens afterthe connection is closed. It does notindicate a cleanup problem. The TIME_WAIT state protects against lossof data and data corruption. It is there to help you. For technicaldetails, have a look at the Unix Socket FAQ, section 2.7.

1.2. Some Connections Go To TIME_WAIT, Others Not

If a connection is orderly closed by your application, it will go tothe TIME_WAIT state. If a connection is orderly closed by the server,the server keeps it in TIME_WAIT and your client doesn't. If aconnection is reset or otherwise dropped by your application in anon-orderly fashion, it will not go to TIME_WAIT.

Unfortunately, it will not always be obvious to you whether aconnection is closed orderly or not. This is because connections arepooled and kept open for re-use by default. HttpClient 3.x, HttpClient4, and also the standard Java HttpURLConnection do that for you. Mostapplications will simply execute requests, then read from the responsestream, and finally close that stream. 
Closing the response stream is notthe same thing as closing the connection! Closing the response streamreturns the connection to the pool, but it will be kept open ifpossible. This saves a lot of time if you send another request to thesame host within a few seconds, or even minutes.

Connection pools have a limited number of connections. A pool mayhave 5 connections, or 100, or maybe only 1. When you send a request toa host, and there is no open connection to that host in the pool, a newconnection needs to be opened. But if the pool is already full, an openconnection has to be closed before a new one can be opened. In thiscase, the old connection will be closed orderly and go to the TIME_WAITstate. 
When your application exits and the JVM terminates, the open connections in the pools will not be closed orderly. They are reset or cancelled, without going to TIME_WAIT. To avoid this, you should call theshutdownmethod of the connection pools your application is using beforeexiting. The standard Java HttpURLConnection has no public method toshutdown it's connection pool.

1.3. Running Out Of Ports

Some applications open and orderly close a lot of connections withina short time, for example when load-testing a server. A connection instate TIME_WAIT will prevent that port number from being re-used foranother connection. That is not an error, it is the purpose ofTIME_WAIT.

TCP is configured at the operating system level, not through Java.Your first action should be to increase the number of ephemeral portson the machine. Windows in particular has a rather low default for theephemeral ports. The PerformanceWiki has tuning tips for the common operating systems, have a look at the respective Network section. 
Only if increasing the number of ephemeral ports does not solve yourproblem, you should consider decreasing the duration of the TIME_WAITstate. You probably have to reduce the maximum lifetime of IP packets,as the duration of TIME_WAIT is typically twice that timespan to allowfor a round-trip delay. Be aware that this will affect all applications running on the machine. Don't ask us how to do it, we're not the experts for network tuning.

There are some ways to deal with the problem at the applicationlevel. One way is to send a "Connection: close" header with eachrequest. That will tell the server to close the connection, so it goesto TIME_WAIT on the other side. Of course this also disables thekeep-alive feature of connection pooling and thereby degradesperformance. If you are running load tests against a server, theuntypical behavior of your application may distort the test results.[[BR] Another way is to not orderly close connections. There is a trickto set SO_LINGER to a special value, which will cause the connection tobe reset instead of orderly closed. Note that the HttpClient API willnot support that directly, you'll have to extend or modify some classesto implement this hack. 
Yet another way is to re-use ports thatare still blocked by a connection in TIME_WAIT. You can do that byspecifying the SO_REUSEADDR option when opening a socket. Java 1.4introduced the methodSocket.setReuseAddress for this purpose. You will have to extend or modify some classes of HttpClient for this too, but at least it's not a hack.

1.4. Further Reading

Unix Socket FAQ

java.net.Socket.setReuseAddress

Discussion on the HttpClient mailing list in December 2007

PerformanceWiki

netstat command line tool 

 

 

 

http://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html#ss2.7

 

2.7 Please explain the TIME_WAIT state.

Remember that TCP guarantees all data transmitted will be delivered, if atall possible. When you close a socket, the server goes into a TIME_WAITstate, just to be really really sure that all the data has gone through. When a socket is closed, both sides agree by sending messages to eachother that they will send no more data. This, it seemed to me was goodenough, and after the handshaking is done, the socket should be closed. The problem is two-fold. First, there is no way to be sure that the last ack was communicated successfully. Second, there may be "wandering duplicates" left on the net that must be dealt with if they are delivered.

Andrew Gierth (andrewg@microlise.co.uk) helped to explain the closing sequence in the following usenet posting:

Assume that a connection is in ESTABLISHED state, and the client is aboutto do an orderly release. The client's sequence no. is Sc, and the server'sis Ss. The pipe is empty in both directions.

 

   Client                                                   Server
   ======                                                   ======
   ESTABLISHED                                              ESTABLISHED
   (client closes)
   ESTABLISHED                                              ESTABLISHED
                <CTL=FIN+ACK><SEQ=Sc><ACK=Ss> ------->>
   FIN_WAIT_1
                <<-------- <CTL=ACK><SEQ=Ss><ACK=Sc+1>
   FIN_WAIT_2                                               CLOSE_WAIT
                <<-------- <CTL=FIN+ACK><SEQ=Ss><ACK=Sc+1>  (server closes)
                                                            LAST_ACK
                <CTL=ACK>,<SEQ=Sc+1><ACK=Ss+1> ------->>
   TIME_WAIT                                                CLOSED
   (2*msl elapses...)
   CLOSED

Note: the +1 on the sequence numbers is because the FIN counts as one byte of data. (The above diagram is equivalent to fig. 13 from RFC 793).

Now consider what happens if the last of those packets is dropped in thenetwork. The client has done with the connection; it has no more data orcontrol info to send, and never will have. But the server does not knowwhether the client received all the data correctly; that's what the lastACK segment is for. Now the server may or may not care whether theclient got the data, but that is not an issue for TCP; TCP is a reliablerotocol, and must distinguish between an orderly connection closewhere all data is transferred, and a connection abortwhere data mayor may not have been lost.

So, if that last packet is dropped, the server will retransmit it (it is,after all, an unacknowledged segment) and will expect to see a suitableACK segment in reply. If the client went straight to CLOSED, the onlypossible response to that retransmit would be a RST, which would indicateto the server that data had been lost, when in fact it had not been.

(Bear in mind that the server's FIN segment may, additionally, containdata.)

DISCLAIMER: This is my interpretation of the RFCs (I have read all theTCP-related ones I could find), but I have not attempted to examineimplementation source code or trace actual connections in order toverify it. I am satisfied that the logic is correct, though.

More commentarty from Vic:

The second issue was addressed by Richard Stevens (rstevens@noao.edu,author of "Unix Network Programming", see1.5 Where can I get source code for the book [book title]?).I have put together quotes from someof his postings and email which explain this. I have brought togetherparagraphs from different postings, and have made as few changes as possible.

From Richard Stevens (rstevens@noao.edu):

If the duration of the TIME_WAIT state were just to handle TCP's full-duplex close, then the time would be much smaller, and it would be some function of the current RTO (retransmission timeout), not the MSL (the packet lifetime).

A couple of points about the TIME_WAIT state.

 

  • The end that sends the first FIN goes into the TIME_WAIT state, because thatis the end that sends the final ACK. If the other end's FIN is lost, orif the final ACK is lost, having the end that sends the first FIN maintain state about the connection guarantees that it has enough information to retransmit the final ACK.
  • Realize that TCP sequence numbers wrap around after 2**32 bytes have been transferred. Assume a connection between A.1500 (host A, port 1500) and B.2000. During the connection one segment is lost and retransmitted. But the segment is not really lost, it is held by some intermediate router and then re-injected into the network. (This is called a "wandering duplicate".) But in the time between the packet being lost & retransmitted, and then reappearing, the connection is closed (without any problems) and then another connection is established between the same host, same port (that is, A.1500 and B.2000; this is called another "incarnation" of the connection). But the sequence numbers chosen for the new incarnation just happen to overlap with the sequence number of the wandering duplicate that is about to reappear. (This is indeed possible, given the way sequence numbers are chosen for TCP connections.) Bingo, you are about to deliver the data from the wandering duplicate (the previous incarnation of the connection) to the new incarnation of the connection. To avoid this, you do not allow the same incarnation of the connection to be reestablished until the TIME_WAIT state terminates.Even the TIME_WAIT state doesn't complete solve the second problem, given what is called TIME_WAIT assassination. RFC 1337 has more details.
  • The reason that the duration of the TIME_WAIT state is 2*MSL is that the maximum amount of time a packet can wander around a network is assumed to be MSL seconds. The factor of 2 is for the round-trip. The recommended value for MSL is 120 seconds, but Berkeley-derived implementations normally use 30 seconds instead. This means a TIME_WAIT delay between 1 and 4 minutes. Solaris 2.x does indeed use the recommended MSL of 120 seconds.

A wandering duplicate is a packet that appeared to be lost and wasretransmitted. But it wasn't really lost ... some router had problems,held on to the packet for a while (order of seconds, could be a minuteif the TTL is large enough) and then re-injects the packet back intothe network. But by the time it reappears, the application that sentit originally has already retransmitted the data contained in that packet.

Because of these potential problems with TIME_WAIT assassinations, one should not avoid the TIME_WAIT state by setting the SO_LINGER option to send an RST instead of the normal TCP connection termination (FIN/ACK/FIN/ACK). The TIME_WAIT state is there for a reason; it's your friend and it's there to help you :-)

I have a long discussion of just this topic in my just-released "TCP/IPIllustrated, Volume 3". The TIME_WAIT state is indeed, one of the mostmisunderstood features of TCP.

I'm currently rewriting "Unix Network Programming" (see1.5 Where can I get source code for the book [book title]?). and will include lots more on this topic, as it is often confusing and misunderstood.

An additional note from Andrew:

Closing a socket: if SO_LINGER has not been called on a socket, thenclose() is not supposed to discard data. This is true on SVR4.2 (and,apparently, on all non-SVR4 systems) but apparently not on SVR4; theuse of eithershutdown() or SO_LINGER seems to be required toguarantee delivery of all data.

http://blog.csdn.net/liuxuejin/article/details/8552677

 

时间: 2024-10-26 05:34:55

使用httpclient抓取时,netstat 发现很多time_wait连接的相关文章

再浅谈百度抓取时出现的200 0 64现象

只有经历过网站关键词搜索排名跌宕起伏的站长才能真正明白,等待不是一种方式,结果需要努力和勤劳来弥补.笔者经历了网站改版到降权,关键词一无所有到关键词排名起色的过程,这个过程让人难熬和艰辛,如果有一天每一位站长都经历过这样的历程,我想百度会比现在弱小很多. 笔者的站在近3个月前进行一次改版,改版的目的就是为了URL标准和简单,同时也做了网站网页布局的修改,从改版后开始网站关键词一无所有,等待我的只有坚持内容更新和外链发布,直至上周网站频道关键词和长尾关键词开始进入百名,从网站改版到目前有所成就的过

玩玩小爬虫——抓取时的几个小细节

      这一篇我们聊聊在页面抓取时应该注意到的几个问题. 一:网页更新      我们知道,一般网页中的信息是不断翻新的,这也要求我们定期的去抓这些新信息,但是这个"定期"该怎么理解,也就是多长时间需要 抓一次该页面,其实这个定期也就是页面缓存时间,在页面的缓存时间内我们再次抓取该网页是没有必要的,反而给人家服务器造成压力. 就比如说我要抓取博客园首页,首先清空页面缓存, 从Last-Modified到Expires,我们可以看到,博客园的缓存时间是2分钟,而且我还能看到当前的服务

jsoup-网页抓取时,如何判断一个页面是导航页面,还是内容页面

问题描述 网页抓取时,如何判断一个页面是导航页面,还是内容页面 在做网页抓取的时候,我想先判断这个网页是导航页面(目录页面),还是内容页面 例如 http://sky.news.sina.com.cn/ 这是一个导航页面 http://sky.news.sina.com.cn/2013-10-10/094444474.html 这是一个正文页面 可以通过url进行判断我知道的,能不能通过分析页面源代码进行判断啊,比如说正文字数,主要区域链接个数等等 谢谢大家,请给点思路

utf8-使用nodejs抓取时,出现问题(详看正文)

问题描述 使用nodejs抓取时,出现问题(详看正文) newsList和newsDetail单独拿出来调试没有问题.但如下放在一起时,提示 body is not defined 经多次调试,错误节点应该是出现在readNewsDetail()中 // 转换 gbk 编码的网页内容 body2 = iconv.decode(body, 'gbk'); // 根据网页内容创建DOM操作对象 var $ = cheerio.load(body2); 这两句.因为直接 var $ = cheerio

关于数据抓取时网页编码各不相同的问题

问题描述 关于数据抓取时网页编码各不相同的问题 最近在学习数据抓取的一些技能,抓取指定数据,网页编码都是不一样的, 有没有方法写个公用的类或者对象来处理,求代码 解决方案 python 判断网页编码的方法: import urllib f = urllib.urlopen('http://outofmemory.cn/').info() print f.getparam('charset') 2 import chardet 你需要安装一下chardet第3方模块判断编码 data = urll

关于URL URLCONNECTION或httpclient抓取网页全部内容时,中文丢失问题

问题描述 比如我抓取下来的内容应该是<li>唱片公司:环球音乐</li>,结果用httpclient抓下来之后变成<li>唱片公司:环球音乐</li>用URL或者UrlConnection也一样的问题,直接右键查看网页源代码也是这样的问题...求解 问题补充:谢谢maowei009,但是我把环球音乐贴进记事本,然后用ie或者火狐打开,可以正常显示"环球音乐"四个字,求解,这是何种编码格式?在google中贴这些也能正常显示中文....头大

httpclient抓取页面返回信息不全

问题描述 第一种:第二种:做httpclient模拟登录抓取页面信息时,有时会出现抓取数据不全的现象,有的是卡在某个div就结束了.还有两种情况就是上图这样求高手解答.非常感谢 解决方案 解决方案二:不要沉啊~~~~~~~~解决方案三:不要沉啊解决方案四:该回复于2014-01-13 08:27:58被版主删除解决方案五:帮你顶一下吧我也遇到这个问题了

网页抓取时遇到相对路径怎么办啊,高手快帮帮我

问题描述 各位 遇到个问题, 谁能帮我解决一下我举个例子 现在要抓取 http://www.xxx.com/123/123/321/xxx.html 下的一篇文章,其中连图片也要一起抓所以我抓到这个页面后需要根据img 元素里的url再单独抓图片.问题来了,img给的url很多都是 像../../图片.jpg 或者 ./img/图片.jpg 等格式的相对路径,整的我没办法正常抓取,有没有什么办法 解决方案 URI base=new URI(baseURI);//基本网页URI URI abs=b

httpclient抓取网页碰到403怎么解决

问题描述 packagetools.crawler;importjava.io.IOException;importjava.io.InputStream;importjava.io.InputStreamReader;importjava.util.zip.GZIPInputStream;importorg.apache.commons.httpclient.HttpClient;importorg.apache.commons.httpclient.methods.GetMethod;pub