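ipvs at runtime
How ipvs rules are implemented: the principle
How do ipvs rules take effect? Let's start with the principle behind the implementation.
Put simply, ipvs does nothing more than rewrite packet header information to carry out the client -> virtual server -> real server scheduling. The goal of scheduling is to keep the load across the real servers close to balanced. This raises two questions: how packets are modified, and which scheduling policy is used.
Let's first look at how packets are modified. The ipvs implementation in the 2.6 kernel differs somewhat from the earlier one. To quote the author of ipvs, Mr. Zhang Wensong:
"In Linux kernels 2.0 and 2.2 we modified the TCP/IP protocol stack to intercept and rewrite/forward IP packets at the IP layer,
implementing three IP load-balancing techniques, and provided an ipvsadm program for configuring and managing virtual servers. In Linux
kernels 2.4 and 2.6 we implemented it as a NetFilter module, rewriting and further optimizing much of the code.
The current version has been released online and, according to feedback, is already fairly stable."
That makes it clear: ipvs relies on netfilter to modify packets. So a brief look at how netfilter works is worthwhile (see the figure).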
netfilter has five rule chains (hook points) in total. Each chain can hold a number of rules, and the rules are ordered (that is, they have priorities); once a rule matches and its action completes, traversal leaves the chain. The five chains are PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING. Connections on a machine fall into three cases:
- Connections entering the host from outside pass through PREROUTING -> INPUT
- Connections leaving the host pass through OUTPUT -> POSTROUTING
- Connections forwarded by the host pass through PREROUTING -> FORWARD -> POSTROUTING
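The three cases above can be written down as a small lookup. This is only an illustrative userspace sketch: the enum names and the function chains_for_flow() are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* The five chains named in the text (CH_ prefix is our own). */
enum chain { CH_PREROUTING, CH_INPUT, CH_FORWARD, CH_OUTPUT, CH_POSTROUTING };

/* Hypothetical labels for the three cases listed above. */
enum flow { FLOW_INBOUND, FLOW_OUTBOUND, FLOW_FORWARDED };

/*
 * Write the chains a packet of the given flow traverses into `path`
 * (room for 3 entries) and return how many entries were written.
 */
static size_t chains_for_flow(enum flow f, enum chain path[3])
{
    assert(path != NULL);
    switch (f) {
    case FLOW_INBOUND:
        path[0] = CH_PREROUTING;
        path[1] = CH_INPUT;
        return 2;
    case FLOW_OUTBOUND:
        path[0] = CH_OUTPUT;
        path[1] = CH_POSTROUTING;
        return 2;
    case FLOW_FORWARDED:
        path[0] = CH_PREROUTING;
        path[1] = CH_FORWARD;
        path[2] = CH_POSTROUTING;
        return 3;
    }
    return 0; /* unknown flow: nothing written */
}
```

Calling chains_for_flow(FLOW_FORWARDED, path) fills path with the three chains of the forwarded case and returns 3.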
The rules in each chain take effect when data passes through that chain (that is, the corresponding functions are called to process it). It looks simple enough: ipvs, as a netfilter module, only needs to write its rules into these chains.
But wait. If netfilter has many modules all writing rules into the same chain, won't it get messy? How is priority controlled? For this reason the rules in a chain are grouped by purpose, and each group is given an integer priority: the smaller the number, the higher the priority. Rules of the same type are ordered by position (the chain is a linked list, and the closer to the head, the higher the priority).
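The ordering rule just described (smaller integer first, ties broken by position in the list) can be sketched as a sorted linked-list insert. This is a userspace illustration only: struct hook_entry and hook_register() are hypothetical names, and the sketch assumes that an entry with an equal priority goes behind those already registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one registered hook. */
struct hook_entry {
    const char *name;          /* label, for illustration only */
    int priority;              /* smaller value = runs earlier */
    struct hook_entry *next;
};

/*
 * Insert `e` into the list `*head`, keeping ascending priority order.
 * An entry whose priority equals that of existing entries is placed
 * after them, so registration order breaks ties.
 */
static void hook_register(struct hook_entry **head, struct hook_entry *e)
{
    struct hook_entry **pos = head;

    assert(head != NULL && e != NULL);
    while (*pos != NULL && (*pos)->priority <= e->priority)
        pos = &(*pos)->next;
    e->next = *pos;
    *pos = e;
}
```

Registering entries with priorities 100, 99, and 100 in that order yields the list 99, 100 (first registered), 100 (second registered).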
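netfilter itself serves three purposes, so its rules fall into three types, represented by three tables: the filter table (filtering), the nat table (rewriting packet headers), and the mangle table (modifying packets). The ipvs module effectively adds a new ipvs table to netfilter. For more on netfilter, see reference 1.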
How ipvs rules are implemented: the process
Whenever a new connection (packet) passes through a netfilter chain, NF_HOOK() is called. It accesses a global variable, nf_hooks, which holds all of netfilter's tables (filter, nat, mangle, the additional ipvs table, and so on), together with each table's chains and the function calls registered on them. It then walks everything registered in nf_hooks for the relevant chain and calls the functions in priority order; each function matches and processes the packet according to the rules of that chain.
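The traversal can be pictured with a toy table that keeps one ordered function list per hook point. This is a simplified userspace sketch, not the kernel's NF_HOOK(): struct hook_table, table_add(), run_hooks(), and the two sample hooks are all hypothetical, and only accept and drop verdicts are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts modeled on NF_DROP / NF_ACCEPT. */
enum verdict { V_DROP = 0, V_ACCEPT = 1 };

/* Hook points, one per chain. */
enum hook_point { H_PRE_ROUTING, H_LOCAL_IN, H_FORWARD,
                  H_LOCAL_OUT, H_POST_ROUTING, H_MAX };

struct packet {
    int dport;   /* destination port */
    int seen;    /* how many counting hooks looked at the packet */
};

typedef enum verdict (*hook_fn)(struct packet *pkt);

#define MAX_FNS 8

/* Toy counterpart of nf_hooks: one ordered function list per hook point. */
struct hook_table {
    hook_fn fns[H_MAX][MAX_FNS];
    size_t  count[H_MAX];
};

/* Append fn to the list for hook point h; 0 on success, -1 if full.
 * The caller is expected to add functions in priority order. */
static int table_add(struct hook_table *t, enum hook_point h, hook_fn fn)
{
    assert(t != NULL && fn != NULL && h < H_MAX);
    if (t->count[h] == MAX_FNS)
        return -1;
    t->fns[h][t->count[h]++] = fn;
    return 0;
}

/* Call every function registered at h in order; stop at the first drop. */
static enum verdict run_hooks(const struct hook_table *t, enum hook_point h,
                              struct packet *pkt)
{
    assert(t != NULL && pkt != NULL && h < H_MAX);
    for (size_t i = 0; i < t->count[h]; i++) {
        if (t->fns[h][i](pkt) == V_DROP)
            return V_DROP;
    }
    return V_ACCEPT;
}

/* Sample hook: counts the packet and lets it through. */
static enum verdict count_hook(struct packet *pkt)
{
    pkt->seen++;
    return V_ACCEPT;
}

/* Sample hook: drops packets addressed to port 23. */
static enum verdict drop_port23(struct packet *pkt)
{
    return pkt->dport == 23 ? V_DROP : V_ACCEPT;
}
```

With count_hook, drop_port23, count_hook registered at H_LOCAL_IN, a packet to port 80 is seen by both counting hooks, while a packet to port 23 is dropped after the first one.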
Remember nf_register_hook() from the end of the previous part? It is precisely because ipvs calls ret = nf_register_hooks(ip_vs_ops, ARRAY_SIZE(ip_vs_ops)); to add entries to nf_hooks that netfilter executes the ipvs rules. Let's look at what gets added.
The contents of ip_vs_ops are:
net/ipv4/ipvs/ip_vs_core.c
    static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
        {
            .hook     = ip_vs_in,
            .owner    = THIS_MODULE,
            .pf       = PF_INET,
            .hooknum  = NF_INET_LOCAL_IN,
            .priority = 100,
        },
        {
            .hook     = ip_vs_out,
            .owner    = THIS_MODULE,
            .pf       = PF_INET,
            .hooknum  = NF_INET_FORWARD,
            .priority = 100,
        },
        {
            .hook     = ip_vs_forward_icmp,
            .owner    = THIS_MODULE,
            .pf       = PF_INET,
            .hooknum  = NF_INET_FORWARD,
            .priority = 99,
        },
        {
            .hook     = ip_vs_post_routing,
            .owner    = THIS_MODULE,
            .pf       = PF_INET,
            .hooknum  = NF_INET_POST_ROUTING,
            .priority = NF_IP_PRI_NAT_SRC-1,
        },
    };
As you can see, ipvs adds four handler functions across three chains: INPUT, FORWARD, and POSTROUTING. Let's go through them one by one.
ip_vs_in()
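ip_vs_in() sits on the INPUT chain and inspects every packet entering this host. Its job is to hand connections addressed to a vs (virtual server) over to an rs (real server) to achieve load balancing. How it schedules depends on the scheduling algorithm chosen at configuration time; how it modifies the packet depends on the VS type, of which there are three:
- VS/NAT rewrites d_addr and possibly d_port on incoming packets; the s_addr of the replies is rewritten on the way back (see ip_vs_out())
- VS/DR leaves the IP header unchanged and rewrites only the destination MAC address of the frame
- VS/TUN adds a new header in front of the original packet, also known as encapsulation
This function handles all data whose destination is this host (the director). From the skb (sk_buff) it derives the connection's protocol structure pp (ip_vs_protocol), determines which skbs match a virtual service rule svc (ip_vs_service), and finds the corresponding cp (ip_vs_conn); if none is found, it creates a new cp and adds it to the ip_vs_conn_tab table. Finally it transmits the data via cp->packet_xmit(). Along the way a number of values are updated, such as the connection state and the counters for pp, cp, and skb.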
net/ipv4/ipvs/ip_vs_core.c
    static unsigned int
    ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
             const struct net_device *in, const struct net_device *out,
             int (*okfn)(struct sk_buff *))
    {
        struct iphdr *iph;
        struct ip_vs_protocol *pp;
        struct ip_vs_conn *cp;
        int ret, restart;
        int ihl;

        /* Ignore packets not addressed to this host, loopback traffic,
         * and packets that already belong to a local socket. */
        if (unlikely(skb->pkt_type != PACKET_HOST
                     || skb->dev->flags & IFF_LOOPBACK || skb->sk)) {
            IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
                      skb->pkt_type,
                      ip_hdr(skb)->protocol,
                      NIPQUAD(ip_hdr(skb)->daddr));
            return NF_ACCEPT;
        }

        iph = ip_hdr(skb);
        if (unlikely(iph->protocol == IPPROTO_ICMP)) {
            int related, verdict = ip_vs_in_icmp(skb, &related, hooknum);

            if (related)
                return verdict;
            iph = ip_hdr(skb);
        }

        /* Look up the protocol handler; unsupported protocols pass through. */
        pp = ip_vs_proto_get(iph->protocol);
        if (unlikely(!pp))
            return NF_ACCEPT;

        ihl = iph->ihl << 2;

        /* Does the packet belong to an existing connection entry? */
        cp = pp->conn_in_get(skb, pp, iph, ihl, 0);

        /* No entry yet: let the scheduler try to create one. */
        if (unlikely(!cp)) {
            int v;

            if (!pp->conn_schedule(skb, pp, &v, &cp))
                return v;
        }

        if (unlikely(!cp)) {
            IP_VS_DBG_PKT(12, pp, skb, 0,
                          "packet continues traversal as normal");
            return NF_ACCEPT;
        }

        IP_VS_DBG_PKT(11, pp, skb, 0, "Incoming packet");

        /* The destination server is marked unavailable: drop the packet. */
        if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
            if (sysctl_ip_vs_expire_nodest_conn) {
                ip_vs_conn_expire_now(cp);
            }
            __ip_vs_conn_put(cp);
            return NF_DROP;
        }

        ip_vs_in_stats(cp, skb);
        restart = ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pp);
        if (cp->packet_xmit)
            ret = cp->packet_xmit(skb, cp, pp);
        else {
            IP_VS_DBG_RL("warning: packet_xmit is null");
            ret = NF_ACCEPT;
        }

        /* Bump the packet counter and decide whether to sync the connection. */
        atomic_inc(&cp->in_pkts);
        if ((ip_vs_sync_state & IP_VS_STATE_MASTER) &&
            (((cp->protocol != IPPROTO_TCP ||
               cp->state == IP_VS_TCP_S_ESTABLISHED) &&
              (atomic_read(&cp->in_pkts) % sysctl_ip_vs_sync_threshold[1]
               == sysctl_ip_vs_sync_threshold[0])) ||
             ((cp->protocol == IPPROTO_TCP) && (cp->old_state != cp->state) &&
              ((cp->state == IP_VS_TCP_S_FIN_WAIT) ||
               (cp->state == IP_VS_TCP_S_CLOSE)))))
            ip_vs_sync_conn(cp);
        cp->old_state = cp->state;

        ip_vs_conn_put(cp);
        return ret;
    }
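The core of ip_vs_in(), "look up an existing connection, otherwise schedule a new one", can be shown in isolation. The following is a minimal userspace sketch and not the kernel's data structures: all type and function names are hypothetical, the connection table is a small fixed array instead of a hash table, connections are keyed by client address and port only, and round-robin is assumed as the scheduling algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CONN_TAB_SIZE 16   /* toy capacity, chosen arbitrarily */

struct real_server { uint32_t addr; };

/* Toy connection entry: remembers which real server a client was given. */
struct conn {
    int      in_use;
    uint32_t caddr;                    /* client address */
    uint16_t cport;                    /* client port */
    const struct real_server *dest;    /* chosen real server */
};

struct virtual_service {
    const struct real_server *servers;
    size_t nservers;
    size_t next;                       /* round-robin cursor */
    struct conn tab[CONN_TAB_SIZE];
};

/* Find the entry for (caddr, cport), or NULL if there is none. */
static struct conn *conn_lookup(struct virtual_service *svc,
                                uint32_t caddr, uint16_t cport)
{
    for (size_t i = 0; i < CONN_TAB_SIZE; i++) {
        struct conn *cp = &svc->tab[i];
        if (cp->in_use && cp->caddr == caddr && cp->cport == cport)
            return cp;
    }
    return NULL;
}

/* Create an entry bound to the next real server in round-robin order.
 * Returns NULL if there are no servers or the table is full. */
static struct conn *conn_schedule(struct virtual_service *svc,
                                  uint32_t caddr, uint16_t cport)
{
    if (svc->nservers == 0)
        return NULL;
    for (size_t i = 0; i < CONN_TAB_SIZE; i++) {
        struct conn *cp = &svc->tab[i];
        if (!cp->in_use) {
            cp->in_use = 1;
            cp->caddr = caddr;
            cp->cport = cport;
            cp->dest = &svc->servers[svc->next];
            svc->next = (svc->next + 1) % svc->nservers;
            return cp;
        }
    }
    return NULL;
}

/* Lookup first, schedule on a miss; returns the real server or NULL. */
static const struct real_server *dispatch(struct virtual_service *svc,
                                          uint32_t caddr, uint16_t cport)
{
    struct conn *cp;

    assert(svc != NULL);
    cp = conn_lookup(svc, caddr, cport);
    if (cp == NULL)
        cp = conn_schedule(svc, caddr, cport);
    return cp != NULL ? cp->dest : NULL;
}
```

Two different clients are spread across the servers in turn, while repeated packets from the same client keep landing on the server recorded in its entry.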
ip_vs_out()
This function sits on the FORWARD chain, so every skb forwarded through this host is processed by it. In vs/nat mode, data returned by an rs on the internal network to the client is forwarded via the gateway (this host). At that point the packet's source address must be rewritten to the gateway's public IP address (cp->vaddr in the code) so that the connection can continue; otherwise the client would be unable to reach the rs (an internal address).
net/ipv4/ipvs/ip_vs_core.c
    static unsigned int
    ip_vs_out(unsigned int hooknum, struct sk_buff *skb,
              const struct net_device *in, const struct net_device *out,
              int (*okfn)(struct sk_buff *))
    {
        struct iphdr *iph;
        struct ip_vs_protocol *pp;
        struct ip_vs_conn *cp;
        int ihl;

        EnterFunction(11);

        /* Already handled by ipvs: nothing more to do. */
        if (skb->ipvs_property)
            return NF_ACCEPT;

        iph = ip_hdr(skb);
        if (unlikely(iph->protocol == IPPROTO_ICMP)) {
            int related, verdict = ip_vs_out_icmp(skb, &related);

            if (related)
                return verdict;
            iph = ip_hdr(skb);
        }

        pp = ip_vs_proto_get(iph->protocol);
        if (unlikely(!pp))
            return NF_ACCEPT;

        /* Reassemble IP fragments unless the protocol opts out. */
        if (unlikely(iph->frag_off & htons(IP_MF|IP_OFFSET) &&
                     !pp->dont_defrag)) {
            if (ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT))
                return NF_STOLEN;
            iph = ip_hdr(skb);
        }

        ihl = iph->ihl << 2;

        /* Does the packet belong to an existing connection entry? */
        cp = pp->conn_out_get(skb, pp, iph, ihl, 0);

        if (unlikely(!cp)) {
            if (sysctl_ip_vs_nat_icmp_send &&
                (pp->protocol == IPPROTO_TCP ||
                 pp->protocol == IPPROTO_UDP)) {
                __be16 _ports[2], *pptr;

                pptr = skb_header_pointer(skb, ihl,
                                          sizeof(_ports), _ports);
                if (pptr == NULL)
                    return NF_ACCEPT;
                if (ip_vs_lookup_real_service(iph->protocol,
                                              iph->saddr, pptr[0])) {
                    /* Reply from a real server without a connection entry:
                     * answer with port unreachable unless it is a TCP reset. */
                    if (iph->protocol != IPPROTO_TCP
                        || !is_tcp_reset(skb)) {
                        icmp_send(skb, ICMP_DEST_UNREACH,
                                  ICMP_PORT_UNREACH, 0);
                        return NF_DROP;
                    }
                }
            }
            IP_VS_DBG_PKT(12, pp, skb, 0,
                          "packet continues traversal as normal");
            return NF_ACCEPT;
        }

        IP_VS_DBG_PKT(11, pp, skb, 0, "Outgoing packet");

        if (!skb_make_writable(skb, ihl))
            goto drop;

        /* Source NAT: protocol-specific rewrite, then the IP source address. */
        if (pp->snat_handler && !pp->snat_handler(skb, pp, cp))
            goto drop;
        ip_hdr(skb)->saddr = cp->vaddr;
        ip_send_check(ip_hdr(skb));

        /* The source address changed, so route the packet again. */
        if (ip_route_me_harder(skb, RTN_LOCAL) != 0)
            goto drop;

        IP_VS_DBG_PKT(10, pp, skb, 0, "After SNAT");

        ip_vs_out_stats(cp, skb);
        ip_vs_set_state(cp, IP_VS_DIR_OUTPUT, skb, pp);
        ip_vs_conn_put(cp);

        skb->ipvs_property = 1;

        LeaveFunction(11);
        return NF_ACCEPT;

      drop:
        ip_vs_conn_put(cp);
        kfree_skb(skb);
        return NF_STOLEN;
    }
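The two lines "ip_hdr(skb)->saddr = cp->vaddr; ip_send_check(ip_hdr(skb));" replace the source address and then recompute the IP header checksum. A userspace sketch of that step on a raw IPv4 header follows. The function names ipv4_checksum() and snat_rewrite() are hypothetical; the sketch covers only the IP header, whereas the TCP/UDP checksum is the business of the protocol's snat_handler and is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Internet checksum over `len` bytes (len must be even): one's-complement
 * sum of the 16-bit big-endian words, then inverted.
 */
static uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    assert(hdr != NULL && len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Overwrite the source address (bytes 12..15) of an IPv4 header with
 * `vaddr` and refresh the header checksum (bytes 10..11).
 */
static void snat_rewrite(uint8_t *hdr, const uint8_t vaddr[4])
{
    size_t ihl = (size_t)(hdr[0] & 0x0f) * 4;   /* header length in bytes */
    uint16_t csum;

    assert(ihl >= 20);
    memcpy(hdr + 12, vaddr, 4);
    hdr[10] = 0;                                /* checksum field is zero */
    hdr[11] = 0;                                /* while summing */
    csum = ipv4_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

After snat_rewrite(), running ipv4_checksum() over the whole header (checksum field included) yields 0, which is how a receiver validates the header.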
ip_vs_forward_icmp()
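This function is on the same FORWARD chain as ip_vs_out() described above, but its priority is 99, numerically smaller (and therefore higher) than ip_vs_out()'s 100, so it runs first.
The function is very simple: it hands every ICMP packet passing through the FORWARD chain to ip_vs_in_icmp(). Why would data destined for this host end up on the FORWARD chain? When the local host is configured as a transparent device, forwarded tcp/udp traffic is fairly easy to redirect to INPUT, but icmp is not so simple, so some ICMP packets addressed to this host end up on the FORWARD chain (the exact reason is unclear). ip_vs_forward_icmp() therefore picks up the ICMP packets for the vs that would otherwise be missed.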
net/ipv4/ipvs/ip_vs_core.c
    static unsigned int
    ip_vs_forward_icmp(unsigned int hooknum, struct sk_buff *skb,
                       const struct net_device *in, const struct net_device *out,
                       int (*okfn)(struct sk_buff *))
    {
        int r;

        if (ip_hdr(skb)->protocol != IPPROTO_ICMP)
            return NF_ACCEPT;

        return ip_vs_in_icmp(skb, &r, hooknum);
    }
ip_vs_post_routing()
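This function's priority is NF_IP_PRI_NAT_SRC-1, so on POSTROUTING it runs just before netfilter's source NAT (NF_IP_PRI_NAT_SRC is 100, which puts this hook at 99). Note that the mangle hooks, at priority -150, still run earlier. The purpose is to keep packets already modified by ipvs from being rewritten again by netfilter: for a packet marked with ipvs_property the function returns NF_STOP, which accepts the packet and skips the remaining hooks on the chain. The code is as follows: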
net/ipv4/ipvs/ip_vs_core.c
    static unsigned int ip_vs_post_routing(unsigned int hooknum,
                                           struct sk_buff *skb,
                                           const struct net_device *in,
                                           const struct net_device *out,
                                           int (*okfn)(struct sk_buff *))
    {
        if (!skb->ipvs_property)
            return NF_ACCEPT;

        return NF_STOP;
    }
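The effect of returning NF_STOP ahead of source NAT can be illustrated with a two-entry chain. This is a userspace sketch under simplifying assumptions: the verdict enum, struct packet, and all function names are hypothetical, and the chain holds only an ipvs-like hook followed by a stand-in for netfilter's source NAT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts modeled on NF_DROP / NF_ACCEPT / NF_STOP. */
enum verdict { V_DROP, V_ACCEPT, V_STOP };

struct packet {
    int ipvs_property;   /* set when ipvs has already rewritten the packet */
    int snat_applied;    /* set by the stand-in source NAT hook */
};

typedef enum verdict (*hook_fn)(struct packet *pkt);

/* Mirrors ip_vs_post_routing(): stop the chain for ipvs-handled packets. */
static enum verdict vs_post_routing(struct packet *pkt)
{
    return pkt->ipvs_property ? V_STOP : V_ACCEPT;
}

/* Stand-in for source NAT, which would rewrite the packet. */
static enum verdict nat_src(struct packet *pkt)
{
    pkt->snat_applied = 1;
    return V_ACCEPT;
}

/* Run hooks in order: a drop ends with V_DROP, a stop accepts the
 * packet immediately without calling the remaining hooks. */
static enum verdict run_chain(const hook_fn *fns, size_t n, struct packet *pkt)
{
    assert(pkt != NULL && (fns != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        enum verdict v = fns[i](pkt);
        if (v == V_DROP)
            return V_DROP;
        if (v == V_STOP)
            break;
    }
    return V_ACCEPT;
}

/* POSTROUTING with the ipvs hook (priority 99) ahead of source NAT (100). */
static enum verdict post_routing(struct packet *pkt)
{
    static const hook_fn chain[] = { vs_post_routing, nat_src };

    return run_chain(chain, sizeof(chain) / sizeof(chain[0]), pkt);
}
```

A packet with ipvs_property set leaves post_routing() accepted and untouched by nat_src(), whereas an ordinary packet goes on to be rewritten by it.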