Discussion:
RAM and conntrack performance
Herve Eychenne
2003-10-28 15:10:32 UTC
Hi everyone,

Can someone post a state of the art summary for netfilter conntrack
(and maybe NAT) performance tweaking?
The only things I'm currently aware of are:
- modprobe ip_conntrack hashsize=$HASHSIZE
- echo $CONNTRACK_MAX > /proc/sys/net/ipv4/ip_conntrack_max

I think it would be good to end up with a small document which would
give every detail about how to choose optimal values for HASHSIZE and
CONNTRACK_MAX, and every other means of getting the best out of the
conntracking/NAT system...

Here are the things I've collected so far that would be good to have
in this little document. I also have some questions:
- CONNTRACK_MAX and HASHSIZE get default values at boot time.
By default, CONNTRACK_MAX = n * 64, where n is the RAM size in MB,
am I right?
What about HASHSIZE default value? How to read it at runtime?
What is the exact link between these 2 values?
- HASHSIZE should be an odd number, and even better: a prime number.
What happens when you set it to an even number, or a non-prime number?
Why enable people to set even and non-prime numbers at all?
- Default values are "reasonable" for a typical host, but we may
increase them on heavily loaded firewalling-only systems, right?
Which values are the "best"? I.e., can someone give a formula with
these potential parameters (if pertinent):
- total RAM size
- size of the memory that should be left for non-conntrack data in
the kernel and userspace in general (what is a reasonable value for
a firewall doing only firewalling with very few applications
running, and how to measure that at runtime?)
- number of rules, connections rate, etc.
- CONNTRACK_MAX can be modified at run time with /proc. What does it
do exactly (when shrunk, when extended)?
When you modify CONNTRACK_MAX, should you also modify HASHSIZE
accordingly? Why? How?
- Is it possible to modify HASHSIZE at runtime when ip_conntrack is
not compiled as a module? If not, shouldn't we enable this with
/proc, like CONNTRACK_MAX?
- Do any of these operations currently (or possibly, if implemented
soon) lead to some rehashing at runtime?
I suppose it would be quite slow... How long does/would it take?
How to proceed to keep current conntrack entries at runtime as much
as possible? (I suppose unloading ip_conntrack module and
reinserting it with another hashsize value clears the table...)

Please comment...

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Harald Welte
2003-11-03 08:12:40 UTC
Post by Herve Eychenne
Hi everyone,
Hi Herve!
Post by Herve Eychenne
Can someone post a state of the art summary for netfilter conntrack
(and maybe NAT) performance tweaking?
- modprobe ip_conntrack hashsize=$HASHSIZE
- echo $CONNTRACK_MAX > /proc/sys/net/ipv4/ip_conntrack_max
you shouldn't need to tweak anything else. Recent kernels have the
jenkins2b hash instead of our old one, and hash distribution should thus
be closer to optimal.
Post by Herve Eychenne
I think it would be good to end up with a small document which would
give every detail about how to choose optimal values for HASHSIZE and
CONNTRACK_MAX, and every other means of getting the best out of the
conntracking/NAT system...
I guess there hasn't been any performance testing. Ideally you'd have
as many buckets as you have conntrack entries in the system. However,
every bucket will
Post by Herve Eychenne
Here are things I've collected so far, that it would be good to have
- CONNTRACK_MAX and HASHSIZE get default values at boot time.
By default, CONNTRACK_MAX = n * 64, where n is the RAM size in MB,
am I right?
well, it's true on i386. See the algorithm below.
Post by Herve Eychenne
What about HASHSIZE default value? How to read it at runtime?
What is the exact link between these 2 values?
/* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB
 * machine has 256 buckets. >= 1GB machines have 8192 buckets. */
if (hashsize) {
        ip_conntrack_htable_size = hashsize;
} else {
        ip_conntrack_htable_size
                = (((num_physpages << PAGE_SHIFT) / 16384)
                   / sizeof(struct list_head));
        if (num_physpages > (1024 * 1024 * 1024 / PAGE_SIZE))
                ip_conntrack_htable_size = 8192;
        if (ip_conntrack_htable_size < 16)
                ip_conntrack_htable_size = 16;
}
ip_conntrack_max = 8 * ip_conntrack_htable_size;

I guess it's hard to describe the algorithm any better in written
language.
Post by Herve Eychenne
- HASHSIZE should be an odd number, and even better: a prime number.
What happens when you set it to an even number, or a non-prime number?
hash distribution will be less optimal.
Post by Herve Eychenne
Why enable people to set even and non-prime numbers at all?
because we're lazy (and it doesn't cause a malfunction)
Post by Herve Eychenne
- Default values are "reasonable" for a typical host, but we may
increase them on heavily loaded firewalling-only systems, right?
yes.
Post by Herve Eychenne
Which values are the "best"? I.e., can someone give a formula with
- total RAM size
- size of the memory that should be left for non-conntrack data in
the kernel and userspace in general (what is a reasonable value for
a firewall doing only firewalling with very few applications
running, and how to measure that at runtime?)
- number of rules, connections rate, etc.
This is not a fixed formula. If it was, we could just do it
automatically that way. In the ideal case, you have a machine _just_
doing packet filtering (i.e. almost no userspace running, at least none
that would have a growing memory consumption like proxies, ...). Then
you put a decent amount of memory into that box, and use all but 64MB
(or 128MB) for conntrack (which can easily be half a gig of ram
considering today's memory prices).

size_of_mem_available_for_ct =
ip_conntrack_max*sizeof(struct ip_conntrack) +
hashsize*sizeof(struct list_head)

struct ip_conntrack is about 300 bytes (depending on your compile-time
configuration, see the printout at module load time). struct list_head
is twice the size of a pointer on the respective arch; on i386 it's 8
bytes total.
Post by Herve Eychenne
- CONNTRACK_MAX can be modified at run time with /proc. What does it
do exactly (when shrunk, when extended)?
When you modify CONNTRACK_MAX, should you also modify HASHSIZE
accordingly? Why? How?
it increases the counter of maximum allowed conntrack entries.
yes, you should also modify the hash size, since now the average number
of conntrack entries per hash bucket is increasing
(ip_conntrack_max/hashsize in the optimal case) and thus we need to
iterate over more list entries per conntrack lookup. Having a large
hashsize is not bad at all - it will just occupy
hashsize*sizeof(struct list_head) bytes of non-swappable kernel memory,
whether you have any connections or not.
Post by Herve Eychenne
- Is it possible to modify HASHSIZE at runtime when ip_conntrack is
not compiled as a module? If not, shouldn't we enable this with
/proc, like CONNTRACK_MAX?
no. It is non-trivial to change the hash size after we have conntrack
entries in the table. It would mean we'd need to re-hash all already
existing connections.

With 2.6.x you should be able to set hashsize at boot time using the new
module parameter stuff (which I haven't yet looked into, sorry).
Post by Herve Eychenne
- Do any of these operations currently (or possibly, if implemented
soon) lead to some rehashing at runtime?
no. the hash is just initialized with some random values at the time we
receive the first packet. This is to make the hash function not
guessable from the outside (and thus less likely to be attacked).
Post by Herve Eychenne
I suppose it would be quite slow... How long does/would it take?
no idea.
Post by Herve Eychenne
How to proceed to keep current conntrack entries at runtime as much
as possible? (I suppose unloading ip_conntrack module and
reinserting it with another hashsize value clears the table...)
yes. You just don't do that. You configure your firewall, and put it
in place. You should know your network traffic beforehand and configure
it correctly.
Post by Herve Eychenne
Herve
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Herve Eychenne
2003-11-25 15:35:43 UTC
Hi!

Thank you very much for your detailed answer, Harald.
Sorry for the delay. I'm currently writing this little document, based mainly
on your answers.
Post by Harald Welte
Post by Herve Eychenne
I think it would be good to end up with a small document which would
give every detail about how to choose optimal values for HASHSIZE and
CONNTRACK_MAX, and every other means of getting the best out of the
conntracking/NAT system...
I guess there hasn't been any performance testing. Ideally you'd have
as many buckets as you have conntrack entries in the system. However,
every bucket will
Something was lost in space... Will? ;-)
Post by Harald Welte
Post by Herve Eychenne
Here are things I've collected so far, that it would be good to have
- CONNTRACK_MAX and HASHSIZE get default values at boot time.
By default, CONNTRACK_MAX = n * 64, where n is the RAM size in MB,
am I right?
well, it's true on i386.
See the algorithm below.
Post by Herve Eychenne
What about HASHSIZE default value? How to read it at runtime?
So, it cannot be read at runtime, I suppose... It would be really nice,
though... would /proc be ok?
Post by Harald Welte
Post by Herve Eychenne
What is the exact link between these 2 values?
/* Idea from tcp.c: use 1/16384 of memory. On i386: 32MB
 * machine has 256 buckets. >= 1GB machines have 8192 buckets. */
if (hashsize) {
        ip_conntrack_htable_size = hashsize;
} else {
        ip_conntrack_htable_size
                = (((num_physpages << PAGE_SHIFT) / 16384)
                   / sizeof(struct list_head));
        if (num_physpages > (1024 * 1024 * 1024 / PAGE_SIZE))
                ip_conntrack_htable_size = 8192;
We could put an "else" here.
BTW, why this hard limit of 8192? On really high-speed, heavily loaded
networks, you may well want to set a higher value...
Post by Harald Welte
        if (ip_conntrack_htable_size < 16)
                ip_conntrack_htable_size = 16;
}
ip_conntrack_max = 8 * ip_conntrack_htable_size;
I guess it's hard to describe the algorithm any better in written
language.
Post by Herve Eychenne
- HASHSIZE should be an odd number, and even better: a prime number.
What happens when you set it to an even number, or a non-prime number?
hash distribution will be less optimal.
But reading the algorithm, hashsize is never automatically set to a
prime number... but an even one. So how do you explain
that I have 4091 (which is probably a prime number, right?) buckets on
my system by default?
Post by Harald Welte
Post by Herve Eychenne
Why enable people to set even and non-prime numbers at all?
because we're lazy (and it doesn't cause a malfunction)
Laziness is the mother of all vices. ;-)
Post by Harald Welte
Post by Herve Eychenne
Which values are the "best"? I.e., can someone give a formula with
- total RAM size
- size of the memory that should be left for non-conntrack data in
the kernel and userspace in general (what is a reasonable value for
a firewall doing only firewalling with very few applications
running, and how to measure that at runtime?)
- number of rules, connections rate, etc.
This is not a fixed formula. If it was, we could just do it
automatically that way.
No, because we don't know the amount of memory potentially used by
non-conntrack data.
Post by Harald Welte
In the ideal case, you have a machine _just_
doing packet filtering (i.e. almost no userspace running, at least none
that would have a growing memory consumption like proxies, ...). Then
you put a decent amount of memory into that box, and use all but 64MB
(or 128MB) for conntrack (which can easily be half a gig of ram
considering today's memory prices).
size_of_mem_available_for_ct =
ip_conntrack_max*sizeof(struct ip_conntrack) +
hashsize*sizeof(struct list_head)
struct ip_conntrack is about 300 bytes (depending on your compile-time
configuration, see the printout at module load time). struct list_head
is twice the size of a pointer on the respective arch; on i386 it's 8
bytes total.
So on i386,
size_of_mem_available_for_ct =~
300 * ip_conntrack_max + hashsize * 8 =~
300 * ip_conntrack_max + ip_conntrack_max =~
300 * ip_conntrack_max =~
300 * RAM / 16384 =~
RAM / 55 by default
On a firewall-only machine (without proxies), this is not much, as
we could run with
ip_conntrack_max = (RAM - 128MB) / 300

So, on a firewall-only machine with 512MB and 128MB "reserved" for
non-conntrack things (which is really big already for a firewall in
console mode), we could have 40 times more conntrack entries
than the default value without any problem. Interesting.
Post by Harald Welte
Post by Herve Eychenne
- CONNTRACK_MAX can be modified at run time with /proc. What does it
do exactly (when shrunk, when extended)?
You didn't really answer my question: what happens when you set
conntrack_max to a smaller number than the number of currently stored
conntrack entries? I suppose conntrack entries are deleted? According
to which criteria?
Post by Harald Welte
Post by Herve Eychenne
When you modify CONNTRACK_MAX, should you also modify HASHSIZE
accordingly? Why? How?
it increases the counter of maximum allowed conntrack entries.
yes, you should also modify the hash size, since now the average number
of conntrack entries per hash bucket is increasing
(ip_conntrack_max/hashsize in the optimal case) and thus we need to
iterate over more list entries per conntrack lookup. Having a large
hashsize is not bad at all - it will just occupy
hashsize*sizeof(struct list_head) bytes of non-swappable kernel memory,
whether you have any connections or not.
Yes, but globally, if we have
conntrack_max = 8 * hashsize,
size_of_mem_available_for_ct =~
300 * ip_conntrack_max + hashsize * 8 =~
300 * ip_conntrack_max + ip_conntrack_max =~
300 * ip_conntrack_max

But if we take conntrack_max = hashsize,
size_of_mem_available_for_ct is still around 300 * ip_conntrack_max
(on my system, it is not 300, but exactly 292)
So I simply think that on firewall-only machines with 512MB, we should
simply use conntrack_max = hashsize without any questioning.

Oh, BTW, what happens if hashsize > conntrack_max?
And what happens exactly when the number of active sessions exceeds
conntrack max?
Post by Harald Welte
Post by Herve Eychenne
How to proceed to keep current conntrack entries at runtime as much
as possible? (I suppose unloading ip_conntrack module and
reinserting it with another hashsize value clears the table...)
yes. You just don't do that. You configure your firewall, and put it
in place. You should know your network traffic beforehand and configure
it correctly.
That's not always that simple. Suppose you're working for a company for
which availability and performance are critical... and suppose growing
network traffic forces you to increase your bandwidth by a factor of
about 10. Well, in this sort of case, you certainly want to avoid
rebooting (and losing connections) too often, believe me.
Yes, netfilter is sometimes used in these kind of companies. And
yes, I sometimes happen to do some missions for them.
And no, I can hardly give you any names. ;-)

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Harald Welte
2003-11-25 20:57:23 UTC
Post by Herve Eychenne
Hi!
Thank you very much for your detailed answer, Harald.
Sorry for the delay. I'm currently writing this little document, based mainly
on your answers.
Post by Harald Welte
Post by Herve Eychenne
I think it would be good to end up with a small document which would
give every detail about how to choose optimal values for HASHSIZE and
CONNTRACK_MAX, and every other means of getting the best out of the
conntracking/NAT system...
I guess there hasn't been any performance testing. Ideally you'd have
as many buckets as you have conntrack entries in the system. However,
every bucket will
Something was lost in space... Will? ;-)
hm. don't remember what i wanted to say. oh, yes. every bucket will
occupy some space, whether there are any connections in that bucket or
not.
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
What about HASHSIZE default value? How to read it at runtime?
So, it cannot be read at runtime, I suppose... It would be really nice,
though... would /proc be ok?
yes. It is printed at startup via syslog, however.
Post by Herve Eychenne
We could put an "else" here.
BTW, why this hard limit of 8192? On really high-speed, heavily loaded
networks, you may well want to set a higher value...
yes, and you can if you do so by hand. however, just because a system
has loads of ram, it doesn't mean it will actually do lots of
connections... there are people using computers for something else than
firewalling ;)
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
- HASHSIZE should be an odd number, and even better: a prime number.
What happens when you set it to an even number, or a non-prime number?
hash distribution will be less optimal.
But reading the algorithm, hashsize is never automatically set to a
prime number... but an even one. So how do you explain
that I have 4091 (which is probably a prime number, right?) buckets on
my system by default?
maybe you're running a different kernel?
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
- CONNTRACK_MAX can be modified at run time with /proc. What does it
do exactly (when shrunk, when extended)?
You didn't really answer my question: what happens when you set
conntrack_max to a smaller number than the number of currently stored
conntrack entries? I suppose conntrack entries are deleted? According
to which criteria?
no, there are none deleted. we just skip creating new ones until the
number has dropped below the limit. There is no special case for that,
we just check >= conntrack_max at conntrack allocation time.
Post by Herve Eychenne
But if we take conntrack_max = hashsize,
size_of_mem_available_for_ct is still around 300 * ip_conntrack_max
(on my system, it is not 300, but exactly 292)
So I simply think that on firewall-only machines with 512MB, we should
simply use conntrack_max = hashsize without any questioning.
yes. but just because your suse or redhat default packetfilter script
modprobes ip_conntrack, there is no way we can assume that this is a
firewall-only machine.
Post by Herve Eychenne
Oh, BTW, what happens if hashsize > conntrack_max?
nothing. you will waste memory by keeping empty buckets.
Post by Herve Eychenne
And what happens exactly when the number of active sessions exceeds
conntrack max?
at this time, please read the comments in the code. we try to evict old
unconfirmed conntracks.
Post by Herve Eychenne
Post by Harald Welte
yes. You just don't do that. You configure your firewall, and put it
in place. You should know your network traffic beforehand and configure
it correctly.
That's not always that simple. Suppose you're working for a company for
which availability and performance are critical... and suppose growing
network traffic forces you to increase your bandwidth by a factor of
about 10. Well, in this sort of case, you certainly want to avoid
rebooting (and losing connections) too often, believe me.
Yes, netfilter is sometimes used in these kind of companies. And
yes, I sometimes happen to do some missions for them.
And no, I can hardly give you any names. ;-)
well, patches are welcome ;)
Post by Herve Eychenne
Herve
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Herve Eychenne
2003-11-26 03:42:32 UTC
On Tue, Nov 25, 2003 at 09:57:23PM +0100, Harald Welte wrote:

Hi,
Post by Harald Welte
Post by Herve Eychenne
Thank you very much for your detailed answer, Harald.
Sorry for the delay. I'm currently writing this little document, based mainly
on your answers.
Post by Harald Welte
Post by Herve Eychenne
I think it would be good to end up with a small document which would
give every detail about how to choose optimal values for HASHSIZE and
CONNTRACK_MAX, and every other means of getting the best out of the
conntracking/NAT system...
I guess there hasn't been any performance testing. Ideally you'd have
as many buckets as you have conntrack entries in the system. However,
every bucket will
Something was lost in space... Will? ;-)
hm. don't remember what i wanted to say. oh, yes. every bucket will
occupy some space, whether there are any connections in that bucket or
not.
Yes, but that is really negligible (2 * size_of_pointer * HASHSIZE).
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
What about HASHSIZE default value? How to read it at runtime?
So, it cannot be read at runtime, I suppose... It would be really nice,
though... would /proc be ok?
yes. It is printed at startup via syslog, however.
Syslog can be enough for humans, but not for scripts...
I think you can add "make hashsize value available through /proc" to
the TODO list (whose size is unfortunately ever growing ;-)).
Post by Harald Welte
Post by Herve Eychenne
We could put an "else" here.
BTW, why this hard limit of 8192? On really high-speed, heavily loaded
networks, you may well want to set a higher value...
yes, and you can if you do so by hand. however, just because a system
has loads of ram, it doesn't mean it will actually do lots of
connections... there are people using computers for something else than
firewalling ;)
Of course. That was just a statement for a specific configuration, and
this must be decided by a human being.
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
- HASHSIZE should be an odd number, and even better: a prime number.
What happens when you set it to an even number, or a non-prime number?
hash distribution will be less optimal.
But reading the algorithm, hashsize is never automatically set to a
prime number... but an even one. So how do you explain
that I have 4091 (which is probably a prime number, right?) buckets on
my system by default?
maybe you're running a different kernel?
Debian standard kernel. Maybe they are patching netfilter? These are smart
guys! ;-)
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
- CONNTRACK_MAX can be modified at run time with /proc. What does it
do exactly (when shrunk, when extended)?
You didn't really answer my question: what happens when you set
conntrack_max to a smaller number than the number of currently stored
conntrack entries? I suppose conntrack entries are deleted? According
to which criteria?
no, there are none deleted. we just skip creating new ones until the
number has dropped below the limit. There is no special case for that,
we just check >= conntrack_max at conntrack allocation time.
Don't you think it would be good to shrink the lists immediately?
Waiting until the number has dropped below the limit can take days...
Post by Harald Welte
Post by Herve Eychenne
But if we take conntrack_max = hashsize,
size_of_mem_available_for_ct is still around 300 * ip_conntrack_max
(on my system, it is not 300, but exactly 292)
So I simply think that on firewall-only machines with 512MB, we should
simply use conntrack_max = hashsize without any questioning.
yes. but just because your suse or redhat default packetfilter script
modprobes ip_conntrack, there is no way we can assume that this is a
firewall-only machine.
Of course. Once more, I didn't propose that this should be done
automatically, I just wanted to know if someone had any objection to
that statement.
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
yes. You just don't do that. You configure your firewall, and put it
in place. You should know your network traffic beforehand and configure
it correctly.
That's not always that simple. Suppose you're working for a company for
which availability and performance are critical... and suppose growing
network traffic forces you to increase your bandwidth by a factor of
about 10. Well, in this sort of case, you certainly want to avoid
rebooting (and losing connections) too often, believe me.
Yes, netfilter is sometimes used in these kind of companies. And
yes, I sometimes happen to do some missions for them.
And no, I can hardly give you any names. ;-)
well, patches are welcome ;)
Yet another TODO++...

Oh, I nearly forgot... The little document about conntrack/NAT tuning
is located at http://www.wallfire.org/misc/netfilter_conntrack_perf.txt
for the moment.
Corrections and ideas are welcome.

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Henrik Nordstrom
2003-11-26 04:13:49 UTC
Post by Herve Eychenne
Debian standard kernel. Maybe they are patching netfilter? These are smart
guys! ;-)
Or maybe you/they have a prime set in modules.conf?

Regards
Henrik
Herve Eychenne
2003-11-27 04:56:20 UTC
Post by Henrik Nordstrom
Post by Herve Eychenne
Debian standard kernel. Maybe they are patching netfilter? These are smart
guys! ;-)
Or maybe you/they have a prime set in modules.conf?
I looked at both modules.conf and Debian kernel source (patched, but
netfilter code seems unaffected), and could find nothing that explains
why I have 4091 buckets (which is indeed a prime number, that's cool)
by default instead of 4096 (I have 512MB).

So it's a bit strange.


Debian iptables package maintainer TO'ed.

Context: I have a Debian testing (sarge) with kernel-image-2.4.22-1-686
package and 512MB of RAM. I should then logically get ip_conntrack
module initialized with 4096 buckets (size of the netfilter conntrack hash
table, that should be computed automatically by netfilter code according
to the amount of RAM), but I can read 4091 in the syslog message.
4091 is better (prime number), but I cannot understand why I get this value
instead of 4096, as nothing particular is done for the moment in
netfilter code to ensure that the computed value will be a prime number.

For further reading about the subject, you can read
http://www.wallfire.org/misc/netfilter_conntrack_perf.txt

Any ideas, Laurence?

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Willy Tarreau
2003-11-28 11:00:50 UTC
Post by Herve Eychenne
Context: I have a Debian testing (sarge) with kernel-image-2.4.22-1-686
package and 512MB of RAM. I should then logically get ip_conntrack
module initialized with 4096 buckets (size of the netfilter conntrack hash
table, that should be computed automatically by netfilter code according
to the amount of RAM), but I can read 4091 in syslog message.
4091 is better (prime number), but I cannot understand why I get this value
instead of 4096, as nothing particular is done for the moment in
netfilter code to ensure that the computed value will be a prime number.
Perhaps you have a small portion of this RAM dedicated to video RAM, so
that the amount of system RAM is slightly lower than 512 MB (e.g. 510 MB).
Then dividing this would give you something which is not a power of 2.

Cheers,
Willy
Harald Welte
2003-11-26 11:36:45 UTC
Post by Herve Eychenne
Yes, but that is really negligible (2 * size_of_pointer * HASHSIZE).
well, sizeof(void *) is 4 bytes on most archs... two times is 8. so if
you have let's say 100k buckets, that's 800k non-swappable kernel
memory...
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
So, it cannot be read at runtime, I suppose... It would be really nice,
though... would /proc be ok?
yes. It is printed at startup via syslog, however.
Syslog can be enough for humans, but not for scripts...
I think you can add "make hashsize value available through /proc" to
the TODO list (whose size is unfortunately ever growing ;-)).
i'd rather write a patch than add it to the todo list. adding and
removing that item from the list would be about the same amount of work,
i guess.
Post by Herve Eychenne
Post by Harald Welte
maybe you're running a different kernel?
Debian standard kernel. Maybe they are patching netfilter? These are smart
guys! ;-)
IIRC debian still has 2.4.18, which had a different hashing algorithm
Post by Herve Eychenne
Post by Harald Welte
no, there are none deleted. we just skip creating new ones until the
number has dropped below the limit. There is no special case for that,
we just chek >= conntrack_max at conntrack allocation time.
Don't you think it would be good to shrink the lists immediately?
Waiting until the number has dropped below the limit can take days...
well, it might be a good idea. but I somehow doubt this is a valid
scenario. And if we were to shrink the list: how do we select which
entries to evict? I'd rather wait for ctnetlink to appear in mainstream
kernels and then leave that to a userspace process.
Post by Herve Eychenne
Post by Harald Welte
yes. but just because your suse or redhat default packetfilter script
modprobes ip_conntrack, there is no way we can assume that this is a
firewall-only machine.
Of course. Once more, I didn't propose that this should be done
automatically, I just wanted to know if someone had any objection to
that statement.
ah, ok.
Post by Herve Eychenne
Herve
(°= Hervé Eychenne
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Patrick McHardy
2003-11-26 16:26:17 UTC
Post by Harald Welte
Post by Herve Eychenne
Don't you think it would be good to shrink the lists immediately?
Waiting until the number has dropped below the limit can take days...
well, it might be a good idea. but I somehow doubt this is a valid
scenario. And if we were to shrink the list: how do we select which
entries to evict? I'd rather wait for ctnetlink to appear in mainstream
kernels and then leave that to a userspace process.
PF uses "adaptive timeouts" to scale down timeouts if the table gets full.
IIRC until some threshold is reached, all entries have 100% of their normal
timeouts; from then on it's scaled down, reaching 0% for a completely full
table. I've been thinking about adding this to ip_conntrack for some time,
because I often have problems with my roommate's eDonkey overflowing the
conntrack table. I decided to experiment with it when the timeout handling
of conntrack is changed from being handled by a per-conntrack timer to a
global cleanup timer. I recall someone wanted to make a patch for this some
time ago to prevent timer storms; do you have any information on whether
anyone is currently working on this? Otherwise I might just do both..

Best regards,
Patrick
Harald Welte
2003-11-27 11:10:19 UTC
Post by Patrick McHardy
I recall someone wanted to make a patch for this some time ago to
prevent timer storms; do you have any information on whether anyone is
currently working on this? Otherwise I might just do both..
I think somebody had already written such a patch (gandalf?), however
he didn't see any significant performance difference.
Post by Patrick McHardy
Best regards,
Patrick
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Herve Eychenne
2003-11-27 03:33:52 UTC
Post by Harald Welte
Post by Herve Eychenne
Yes, but that is really negligible (2 * size_of_pointer * HASHSIZE).
well, sizeof(void *) is 4 bytes on most archs... two times is 8. So if
you have, let's say, 100k buckets, that's 800k of non-swappable kernel
memory...
Which is really not that much when you have 512 MB... (0.0015 %)
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
maybe you're running a different kernel?
Debian standard kernel. Maybe they are patching netfilter? These are smart
guys! ;-)
IIRC Debian still has 2.4.18, which had a different hashing algorithm
Standard Debian stable, maybe. But you may want to run testing, or
sid, and I run testing (sarge), so I have a 2.4.22.
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
no, there are none deleted. we just skip creating new ones until the
number has dropped below the limit. There is no special case for that,
we just check >= conntrack_max at conntrack allocation time.
Don't you think it would be good to shrink the lists immediately?
Waiting until the number has dropped below the limit can take days...
well, it might be a good idea. but I somehow doubt this is a valid
scenario. And if we would shrink the list: how do we select which
entries to evict?
That seems relatively simple to me:
- reduce the timeouts of all entries proportionally
- sort the entries by order of importance (state, timeout (time to
live), protocol (icmp ping/pong, udp, tcp); maybe unprivileged ports
matter less, etc.), then evict the "bad scores" first.
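A toy sketch of that scoring idea, assuming arbitrary weights (this is not real conntrack code; the enum, the weights, and the score formula are all invented for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Each tracked entry gets a score from its protocol and its remaining
 * time-to-live; entries with the lowest scores would be evicted first.
 * The weighting (tcp > udp > icmp, long-lived > short-lived) is an
 * arbitrary assumption for the example. */
enum proto { PROTO_ICMP, PROTO_UDP, PROTO_TCP };

struct entry {
    enum proto proto;
    unsigned int ttl;           /* seconds until the entry would expire */
};

static int score(const struct entry *e)
{
    static const int proto_weight[] = { 10, 50, 100 };  /* icmp, udp, tcp */
    return proto_weight[e->proto] + (int)e->ttl;
}

static int cmp_score(const void *a, const void *b)
{
    /* ascending order: the "bad scores" end up at the front */
    return score((const struct entry *)a) - score((const struct entry *)b);
}
```

Eviction would then be a qsort(table, n, sizeof(struct entry), cmp_score) followed by dropping entries from the front of the array until the count is below the limit.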
Post by Harald Welte
I'd rather wait for ctnetlink to appear in mainstream
kernels and then leave that to a userspace process.
Maybe such a job (probably happening while the network is stressed) is
better done in kernel space? That doesn't seem so complicated...

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Henrik Nordstrom
2003-11-27 09:56:28 UTC
Permalink
Post by Herve Eychenne
Which is really not that much when you have 512 MB... (0.0015 %)
Err... 800K of 512M is 0.15%, which is not insignificant if you do not plan
on using that memory. If each subsystem of the kernel did the same "just in
case" then there would be very little memory left for the user.
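The arithmetic behind both figures above can be written out, mirroring the 2 * size_of_pointer * HASHSIZE formula quoted earlier in the thread (the function name is invented for the example):

```c
#include <assert.h>

/* Non-swappable kernel memory used by the hash table itself: two
 * pointers per bucket, as in the 2 * size_of_pointer * HASHSIZE
 * figure quoted earlier.  ptr_size is 4 on a 32-bit arch. */
static unsigned long hash_overhead(unsigned long buckets,
                                   unsigned int ptr_size)
{
    return 2UL * ptr_size * buckets;
}
```

hash_overhead(100000, 4) gives 800000 bytes; 800000 / (512 * 1024 * 1024) is about 0.0015, i.e. roughly 0.15% of 512 MB, matching Henrik's figure.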

Regards
Henrik
Harald Welte
2003-11-30 22:25:05 UTC
Permalink
Post by Herve Eychenne
Post by Harald Welte
well, it might be a good idea. but I somehow doubt this is a valid
scenario. And if we would shrink the list: how do we select which
entries to evict?
- reduce the timeouts of all entries proportionally
- sort the entries by order of importance (state, timeout (time to
live), protocol (icmp ping/pong, udp, tcp); maybe unprivileged ports
matter less, etc.), then evict the "bad scores" first.
well, but how do you set those 'scores' or the 'importance'? somebody
running a packet filter in front of an important DNS server will care
more about UDP than somebody else with a large ftp server. And you
definitely don't want to add a sophisticated user-configurable interface
for this rare case.

I'd rather say we provide a mechanism for userspace:
1) limiting ip_conntrack_max via sysctl()
2) evicting entries via ctnetlink, based on whatever choice a userspace
program might want.
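The first half of that mechanism already exists today through /proc; a sketch (2.4.x paths, needs root; the ctnetlink step is the hypothetical part):

```shell
# 1) limit the number of tracked connections at runtime
cat /proc/sys/net/ipv4/ip_conntrack_max        # current limit
echo 16384 > /proc/sys/net/ipv4/ip_conntrack_max

# 2) (hypothetical) once ctnetlink is in mainstream kernels, a
#    userspace daemon would pick and evict individual entries
#    over netlink here, using whatever policy it wants.
```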
Post by Herve Eychenne
Post by Harald Welte
I'd rather wait for ctnetlink to appear in mainstream
kernels and then leave that to a userspace process.
Maybe such a job (probably happening while the network is stressed) is
better done in kernel space? That doesn't seem so complicated...
mh. when the network is stressed you want to add additional pressure by
reducing the number of conntracks? doesn't really sound like a
reasonable thing to me. Also, ordering and prioritizing the list would
have to be done with a WRITE_LOCK on ip_conntrack_lock... again
something that wouldn't be a wise thing if your network is stressed.
Post by Herve Eychenne
Herve
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Herve Eychenne
2003-11-27 04:14:52 UTC
Permalink
Post by Harald Welte
Post by Herve Eychenne
Post by Harald Welte
Post by Herve Eychenne
So, it cannot be read at runtime, I suppose... It would be really nice,
though... would /proc be ok?
yes. It is printed at startup via syslog, however.
Syslog can be enough for humans, but not for scripts...
I think you can add "make hashsize value available through /proc" to
the TODO list (whose size is unfortunately ever growing ;-)).
i'd rather write a patch than add it to the todo list. adding and
removing that item from the list would be about the same amount of work,
i guess.
I had a quick look at the existing code in ip_conntrack_core.c.

First I would have been happy to write a small patch, but I'm not really a
kernel guy and register_sysctl_table API seems _completely crappy_ to
me.

So I took the risk of ridiculing myself in public and wrote something anyway,
but I'm unsure about my patch. Especially unsure about the binary ID of
the ctl_table... I took NET_IP_CONNTRACK_MAX + 1 = 2090 because I could find
no occurrence of 2090 being used for sysctl in the whole kernel tree... but it
seems crappy and hazardous. Who the hell is in charge of ensuring the
uniqueness of each sysctl binary entry? Where's the list?

So I didn't even take the time to compile the attached patch, but with the help
of the gods it will hopefully work.

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Henrik Nordstrom
2003-11-27 10:09:03 UTC
Permalink
Post by Herve Eychenne
First I would have been happy to write a small patch, but I'm not really a
kernel guy and register_sysctl_table API seems _completely crappy_ to
me.
It is not that bad compared to the alternatives..
Post by Herve Eychenne
So I took the risk of ridiculing myself in public and wrote something anyway,
but I'm unsure about my patch. Especially unsure about the binary ID of
the ctl_table... I took NET_IP_CONNTRACK_MAX + 1 = 2090 because I could find
no occurrence of 2090 being used for sysctl in the whole kernel tree... but it
seems crappy and hazardous. Who the hell is in charge of ensuring the
uniqueness of each sysctl binary entry? Where's the list?
You also need to make the sysctl read-only... you cannot change the
ip_conntrack hash size while conntrack is running. If you do, there will be
a serious hazard.

Also, what kernel version did you do this in? Your source does not seem to
match either 2.4.22 or 2.6.0-test10... in the sources I see, the conntrack
sysctls are all in ip_conntrack_standalone.c, not ip_conntrack_core.c...
(looks like you only got the filename wrong... the sources seem to match
ip_conntrack_standalone.c even though the header says
ip_conntrack_core.c.new/orig)
Henrik Nordstrom
2003-11-27 10:13:34 UTC
Permalink
Post by Henrik Nordstrom
You also need to make the sysctl read-only... you cannot change the
ip_conntrack hash size while conntrack is running. If you do, there will be
a serious hazard.
Looking again I see that you did take care of this. Sorry.

Regards
Henrik
Herve Eychenne
2003-11-27 11:38:15 UTC
Permalink
Anyway, Harald had already made a patch for pom without telling anyone,
and Patrick discussed it a few hours ago.
Post by Henrik Nordstrom
Also, what kernel version did you do this in? Your source does not seem to
match either 2.4.22 or 2.6.0-test10... in the sources I see, the conntrack
sysctls are all in ip_conntrack_standalone.c, not ip_conntrack_core.c...
(looks like you only got the filename wrong... the sources seem to match
ip_conntrack_standalone.c even though the header says
ip_conntrack_core.c.new/orig)
I worked on 2.4.22 Debian kernel sources (patched), but in vanilla
kernel 2.4.22, the files you're talking about are identical to the
Debian sources (which contain only a few patches).
And I maintain that my patch is against ip_conntrack_core.c...

This is crazy, as I'm aware that Harald's patch was against
ip_conntrack_standalone.c... but I just downloaded a fresh 2.4.22
kernel from kernel.org, and I can swear that the sysctls are in
ip_conntrack_core.c (and not ip_conntrack_standalone.c).

***@comet:/usr/src/linux-2.4.22/net/ipv4/netfilter$ grep sysctl *
ip_conntrack_core.c:#include <linux/sysctl.h>
ip_conntrack_core.c:static struct ctl_table_header *ip_conntrack_sysctl_header;
ip_conntrack_core.c:unregister_sysctl_table(ip_conntrack_sysctl_header);
ip_conntrack_core.c: ip_conntrack_sysctl_header
ip_conntrack_core.c: = register_sysctl_table(ip_conntrack_root_table, 0);
ip_conntrack_core.c: if (ip_conntrack_sysctl_header == NULL) {
ip_queue.c:#include <linux/sysctl.h>
ip_queue.c:static int sysctl_maxlen = IPQ_QMAX_DEFAULT;
ip_queue.c:static struct ctl_table_header *ipq_sysctl_header;
ip_queue.c: { NET_IPQ_QMAX, NET_IPQ_QMAX_NAME, &sysctl_maxlen,
ip_queue.c: sizeof(sysctl_maxlen), 0644, NULL, proc_dointvec },
ip_queue.c: ipq_sysctl_header = register_sysctl_table(ipq_root_table, 0);
ip_queue.c: goto cleanup_sysctl;
ip_queue.c:cleanup_sysctl:
ip_queue.c: unregister_sysctl_table(ipq_sysctl_header);
ipt_REJECT.c: /* FIXME: Use sysctl number. --RR */
ipt_ULOG.c: * nlgroup now global (sysctl)
***@comet:/usr/src/linux-2.4.22/net/ipv4/netfilter$

Does someone have an explanation, before I go nuts?

Herve
--
_
(°= Hervé Eychenne
//)
v_/_ WallFire project: http://www.wallfire.org/
Henrik Nordstrom
2003-11-27 11:57:02 UTC
Permalink
Post by Herve Eychenne
I worked on 2.4.22 Debian kernel sources (patched), but in vanilla
kernel 2.4.22, the files you're talking about are identical to the
Debian sources (which contain only a few patches).
And I maintain that my patch is against ip_conntrack_core.c...
Right.. my sources have current patch-o-matic applied..
80_ip_conntrack-proc.patch moves these to ip_conntrack_standalone.c

Regards
Henrik
Harald Welte
2003-11-27 11:14:04 UTC
Permalink
Post by Herve Eychenne
I had a quick look at the existing code in ip_conntrack_core.c.
First I would have been happy to write a small patch, but I'm not really a
kernel guy and register_sysctl_table API seems _completely crappy_ to
me.
;)
Post by Herve Eychenne
So I took the risk of ridiculing myself in public and wrote something anyway,
but I'm unsure about my patch. Especially unsure about the binary ID of
the ctl_table... I took NET_IP_CONNTRACK_MAX + 1 = 2090 because I could find
no occurrence of 2090 being used for sysctl in the whole kernel tree... but it
seems crappy and hazardous. Who the hell is in charge of ensuring the
uniqueness of each sysctl binary entry? Where's the list?
the list is in include/linux/sysctl.h. And sysctl-by-numbers has been
deprecated anyway; nobody should assume that sysctl via binary number
is safe anymore...
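In practice that means scripts should use the name-based interface rather than binary IDs, e.g. (assuming the conntrack sysctl is present on the running kernel):

```shell
# name-based sysctl access is stable; binary IDs are deprecated
sysctl net.ipv4.ip_conntrack_max
# equivalent read through /proc:
cat /proc/sys/net/ipv4/ip_conntrack_max
```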
Post by Herve Eychenne
So I didn't even take the time to compile the attached patch, but with the help
of the gods it will hopefully work.
I've written up a patch that ensures it is exported only read-only,
please have a look at
http://cvs.netfilter.org/netfilter/patch-o-matic/pending/76_conntrack_bucket_sysctl.patch
Post by Herve Eychenne
Herve
--
- Harald Welte <***@netfilter.org> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie