Servperso Systems

  • Status Closed
  • Percent Complete
  • Task Type Incident
  • Category Backbone
  • Assigned To No-one
  • Operating System All
  • Severity Critical
  • Priority Very High
  • Reported Version Development
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Servperso Systems
Opened by Servperso - 11.03.2021
Last edited by Servperso - 01.07.2021

FS#2 - Instability BGP dusseldorf

On 02/27, we encountered BGP stability issues in Dusseldorf.

Closed by  Servperso
01.07.2021 12:32
Reason for closing:  Fixed
Servperso commented on 11.03.2021 17:24

For now, we have mitigated the problem by taking the following actions:
- Simplification of configuration
- POP isolation and moving the BGP architecture to "federation"
- Setting up another router temporarily in place of dus1-core
- Temporary reduction in the number of forwarders on the IPv6 stack
These actions make it possible to contain the problem while waiting for a definitive hardware upgrade.

Servperso commented on 17.03.2021 00:05

We have put the new main core router into production with a "minimal" configuration. If we do not encounter any issues, we will progressively increase it up to the full usual load.

Servperso commented on 18.03.2021 20:17

We still detect small issues, even with hardware twice as powerful.
Some IPv6 BGP sessions get "hold timer expired" errors.
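
For readers unfamiliar with this error, the following is a minimal sketch of the hold timer logic described in the BGP specification: a session is torn down when no KEEPALIVE or UPDATE arrives from the peer within the negotiated hold time. The class name, field names and the 90-second value are assumptions for illustration, not taken from our routers.

```python
from dataclasses import dataclass


@dataclass
class BgpSession:
    """Toy model of one BGP session's hold timer (illustrative only)."""
    hold_time: float         # negotiated hold time in seconds (0 disables the timer)
    last_heard: float = 0.0  # time of the last KEEPALIVE or UPDATE from the peer

    def on_message(self, now: float) -> None:
        # Any KEEPALIVE or UPDATE from the peer restarts the hold timer.
        self.last_heard = now

    def hold_timer_expired(self, now: float) -> bool:
        # Expired when nothing was heard for longer than the hold time.
        return self.hold_time > 0 and (now - self.last_heard) > self.hold_time


session = BgpSession(hold_time=90.0)  # 90 s is an assumed example value
session.on_message(now=0.0)
print(session.hold_timer_expired(now=60.0))   # False
print(session.hold_timer_expired(now=100.0))  # True
```

A router that is too busy to send or process keepalives in time can trigger this expiry on either side of the session, which would be consistent with an overload rather than a link failure.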

We continue to evaluate the possibility of moving to other router software with dampening support.

The issue remains mitigated, with no impact at customer level.

Servperso commented on 20.03.2021 02:49

We finally identified the bottleneck. On IPv6 there are many updates to the full table, and the Linux kernel has limits handling that volume of table insertions and deletions.

To fully restore the routing table, we have 2 options:
- BGP dampening (not handled by bird, so we have to validate a new routing platform).
- Insert a partial table into the kernel and rely on the default route for far routes (no impact for customers either, as we keep the full table at BGP level).
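
The partial-table option can be sketched as a filter between the BGP table and the kernel. The ticket does not say how "far" is measured, so this sketch assumes AS path length as the criterion; the `Route` type, the `kernel_routes` function and the cutoff value are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Route(NamedTuple):
    prefix: str
    as_path: Tuple[int, ...]


# Hypothetical cutoff: routes with a longer AS path are treated as "far".
MAX_AS_PATH_LEN = 3


def kernel_routes(bgp_table: List[Route], max_len: int = MAX_AS_PATH_LEN) -> List[Route]:
    """Return only the routes to install in the kernel.

    Everything else stays in the BGP table and is forwarded via the default route.
    """
    return [r for r in bgp_table if len(r.as_path) <= max_len]


# Documentation prefixes and ASNs, not real routes.
bgp_table = [
    Route("2001:db8:1::/48", (64500,)),
    Route("2001:db8:2::/48", (64500, 64501, 64502, 64503)),
]
print([r.prefix for r in kernel_routes(bgp_table)])  # ['2001:db8:1::/48']
```

The point of the design is that churn on distant prefixes never reaches the kernel, while the full table remains available at BGP level.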

We may mitigate with the first one, and then in the long run qualify new software for our core routing platform.
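
As background on dampening, here is a minimal sketch of the usual penalty model: each flap adds a fixed penalty, the penalty decays exponentially with a half-life, and a route is suppressed above one threshold and reused below another. The constants are commonly cited example values and are assumptions here, not the settings of our platform; the class and method names are hypothetical.

```python
# Assumed example parameters, for illustration only.
PENALTY_PER_FLAP = 1000.0
HALF_LIFE_S = 15 * 60      # penalty halves every 15 minutes
SUPPRESS_LIMIT = 2000.0    # suppress the route once the penalty reaches this
REUSE_LIMIT = 750.0        # re-advertise once the penalty decays below this


class DampenedRoute:
    def __init__(self) -> None:
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now: float) -> None:
        # Exponential decay of the accumulated penalty since the last update.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_S)
        self.last_update = now

    def flap(self, now: float) -> None:
        # A withdrawal or re-announcement counts as one flap.
        self._decay(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def usable(self, now: float) -> bool:
        # A suppressed route becomes usable again below the reuse limit.
        self._decay(now)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False
        return not self.suppressed


route = DampenedRoute()
for t in (0, 10, 20):            # three flaps within 20 seconds
    route.flap(now=t)
print(route.usable(now=30))      # False: the route is suppressed
print(route.usable(now=3620))    # True: the penalty has decayed after an hour
```

A suppressed route generates no further table insertions or deletions while it keeps flapping, which is why dampening reduces the update load described above.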

Servperso commented on 21.03.2021 15:18

We finally chose VyOS as an alternative to bird for the core.

We are currently qualifying that software. If the first qualification step works, we will put it in place as a replacement for CORE2-DUS-MY in 1 or 2 days.

Servperso commented on 22.03.2021 13:11

CORE2-DUS6 now runs on VyOS with IPv4 dampening enabled.
We are progressively enabling features on it and monitoring whether FRR is more suitable than bird in terms of stability.

Servperso commented on 26.03.2021 17:01

CORE2-DUS6 has run for multiple days without any issue.
It runs on the less powerful server without trouble.

The full BGP configuration has now been loaded onto it.

We are finishing the firewall part, and I think we can "qualify" VyOS as a viable solution to replace the core router software.

Servperso commented on 28.03.2021 14:10

DUS2-CORE: Firewall / ACLs are now restored
Weathermap: our weathermap is now up to date with the new network

Servperso commented on 05.04.2021 00:23

Looking glass: our looking glass now supports VyOS. The last main task is to migrate core1-dus6 too.

Servperso commented on 05.04.2021 21:27

Both routers are now ported to VyOS.
We will monitor network stability over the long run, but we now think the instability is over!

Servperso commented on 01.07.2021 12:28

No issues since the migration to VyOS. We are closing the ticket.

