What's new at lucketts.net?
2 Jan 2009: We're finally caught up after the wind storm. We took damage on 1 battery backup system, 4 backhaul links, and various customer antennas. Our new internet link didn't even blink at the wind, but the old link lost 2 of its 4 routes. Overall, the system stayed up and most customers never lost connection. About 20% did lose connection for 20 minutes, and 5% lost connection for a few hours for system-related problems (power or wind damage to APs, or backhaul connections down). A handful had local damage to radios or antennas that knocked them offline until we could get there to fix it, but I don't count them in system outage stats. These storms are good at helping us identify weak points in the network, and the silver lining is that the network is getting stronger as we find and upgrade those weak points.
Really happy about our major data links. We had serious antenna damage on a few of them, but most never lost connection. One went from a normal -55 dB signal to a -86 dB signal, but was still passing over 3 Mbps total to a group of 50 customers. Not as fast as normal, but not bad for losing 99.9% of the signal. Remember the days when the original service had 50 people sharing one T1 at a total of 1.5 Mbps and it felt fast? Our standards have increased significantly since then.
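For the geeks, here is where the 99.9% comes from. This is only a back-of-the-envelope sketch: it assumes the two readings are received power levels in decibels, and the function name is made up for illustration.

```python
def remaining_power_fraction(normal_db: float, degraded_db: float) -> float:
    """Fraction of the original received power still arriving after the drop."""
    loss_db = normal_db - degraded_db      # -55 - (-86) = 31 dB of loss
    return 10 ** (-loss_db / 10)           # every 10 dB is a factor of 10 in power


fraction_left = remaining_power_fraction(-55, -86)
percent_lost = (1 - fraction_left) * 100
print(f"{percent_lost:.2f}% of the signal power lost")
```

A 31 dB drop leaves less than a tenth of a percent of the original power, which is why "99.9%" is a fair round number.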
Right now, we think we are caught up with all storm damage, and we even resolved a few problems that existed before the storm. If you have been having trouble since the storm and we haven't fixed it yet, give us a call.
The new internet connection is now officially online, boosting our official web download speed to 3 Mbps. Non-web traffic and all uploads are still limited to 1 Mbps max, but we will be shifting everything over to the new connection as we incrementally bring it up to full speed. All downloads will be 3 Mbps at least, and our near-term goal is 5 Mbps. We'll post updates as progress is made.
The proxy servers are almost ready to go away, for those that keep track. The next internal project is to implement BGP, and that router is already built and the software is being installed. Depending on how the software install goes, we can start migrating blocks of customers off the proxies in a week or two. The proxies are the outstanding problem source remaining on the network. They served their purpose in giving customers access to data from different providers, allowing us to load balance across medium size links. They do well for normal web sites, but did cause problems for select sites, forcing regular workarounds for us. But with a large link, and BGP backup allowing direct web access, we will soon mothball them.
25 Dec: Our new official web download speed is 3 Mbps. Other protocols and
uploads are still capped at 1 Mbps. We are still improving our internet
connections, and all upload and download speeds will continue to
increase over the next few months. Merry Christmas.
3 Dec: We've now repaired all access point damage we're aware of from Monday's surprise thunderstorm. And a few customers that were individually broken are now fixed. We only have one known individual outage remaining and a few with degraded signals, and we'll be working on them through the end of the week. Sorry for the outages today on a few access points. We had to replace failing cards before they died completely, so we went from place to place ASAP, no real chance to give effective warning. 2 of today's fixes went from the planned 2 minute outage to become 20 & 40 minute outages because of second failures while fixing the first.
The new link required another tower climb today, but this time it looks like we have two working radios. The box we put on the tower to connect us to the internet has two radios built into it so that we can have a backup in case one dies, and if they are both working then we can even use them both at the same time to double effective speeds. Problem is, our vendor had a QC problem and sent us a bad batch of radios that tend to die hours after being turned on and stressed. And the ones that don't die tend to interfere with their neighbor. For the past week, that means that every time we put a new box on the tower, one of the radios in it dies within hours or days. Today was the 4th tower climb in a week to replace dead radios. But so far, so good. Both radios in the box are working fine.
Yesterday the effective throughput of the new link was 12 Mbps. Today, I ran it up to 45 Mbps for awhile to test. Since that appears to be working well, I've turned up the download speed for most customers to 3 Mbps. Here's how you get 3 Mbps downloads: use a web browser and download the file from a normal web site. All port 80 traffic (normal web site downloads) is being sent out via the new link. Https, VPNs, mail and everything else is going out the old path for now, so it is still limited to 1 Mbps. This will change as we make progress on the link.
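For the geeks, the split described above boils down to a one-line rule on the destination port. The sketch below is only an illustration of that rule, not our actual router or proxy configuration; the names and the function are invented for the example.

```python
# Hypothetical labels for the two outbound paths described in the post.
NEW_LINK = "new link (3 Mbps download cap)"
OLD_PATH = "old path (1 Mbps cap)"


def pick_path(dest_port: int) -> str:
    """Plain web traffic (port 80) uses the new link; everything else stays put."""
    return NEW_LINK if dest_port == 80 else OLD_PATH


# A few well-known ports: http, https, smtp.
for name, port in [("http", 80), ("https", 443), ("smtp", 25)]:
    print(f"{name:5s} port {port:3d} -> {pick_path(port)}")
```

This is why the same file can download at 3 Mbps from a plain http link and at 1 Mbps over https: the destination port decides which path it takes.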
Now that this phase of the new link is working, we'll continue to test and run data through it as fast as possible. We'll also start the process to route the rest of our network out via this link so that all protocols can take advantage of the higher speeds. And now that we know how well this link works and can measure actual signals to give us a baseline, we'll be buying a custom microwave link to give us even faster speeds - should place the order Thursday for delivery next week.
If anyone is a BGP expert, I'd be happy to pick their brain... Even be willing to buy some consulting time. We need to triple home our network at two different locations. And one of the providers will not pass BGP announcements for us, but they will announce our ARN for us. I'm open to suggestions. Looks like I have to get out the BGP book tonight and read it again.
23 Nov: For anyone considering Verizon internet service over us, consider this:
today in a simple small power outage Verizon failed for 1 hour. They
have gone down twice a week for the past few months, and they are not
showing any progress towards fixing the problem. We will announce when
we are no longer reliant on Verizon. It's our top priority.
21 Nov: Fewer
interruptions today as we have a handle on the problem with the
existing link and are working with the vendor to resolve it. And the
new link is hours from completion. Working both solutions, so Friday
night should have improved speeds.
There will be a few 2-3 minute web browsing interruptions between 10am and 11am.
13 Nov: Verizon lines went down today - again. That's really the only thing that drastically impacts all of our network. Web services can work around broken T1 lines, but they require manual intervention. Other stuff like mail, https, VoIP, etc are much more difficult to re-route. We did install a fix for mail a month ago, so this time -NO- mail was lost by our servers being disconnected from the internet. And with some quick work here, we were able to get outbound and inbound mail working in realtime again pretty promptly after the Verizon lines failed. We had two small glitches that impacted a few outbound emails... our reverse DNS and the SPF records did not match the backup solution. We're aware of those glitches now, so our backup solution will be better in the future.
Hopefully we will not need a backup again, as we're very close to dropping the AT&T/Verizon T1s. The cell tower work on Tuesday -almost- has the new link running. When the new link is up, and we get BGP (a geek thing) up and running we'll be able to drop a link and still keep all services online without interruption and without requiring manual intervention.
Thanks to everyone that checked our web page instead of just calling us when you saw a problem. Checking our web page is a good first step, because if there is a system wide problem we post the details as soon as possible. When working around broken stuff, having 100 phone calls an hour is rather distracting. Calls that actually get through tend to eat up 10 minutes telling us how important it is they get online now before we manage as politely as possible to get rid of them. If you're reading this, then you probably already know that each call reminding us of something we are already aware of has the net effect of delaying everyone getting back online by the time of the call, plus however long it takes us to heckle the caller after they hang up. If you're not reading this, wait a sec ............ Ok, we can go on now.
Please don't call my cell directly when our switchboard doesn't pick up. If we don't call you back promptly when you leave a message on the office line, it means we are probably really-really busy. And, if my cell is the only working phone, we are probably using it to coordinate with the providers such as AT&T and Verizon that are broken and causing the system wide outage. As mentioned in previous posts, calling my cell to bypass the queue does not put you on the list you think it does. Wait a sec ........ , OK, moving on now.
Our new provider told me yesterday that we'll have the initial new link up today. That may still happen, cross your fingers. We will get some immediate boosts to speed when that link comes up. We're planning a party.
11 Nov: So much for the within-a-week guess. We did get the connection into Lucketts powered up today, so all RF segments are in place. There are still many more steps, some giving us jumps in performance, but the most difficult steps are now complete. Additional steps:
1) finish crossover connections at the various hops back to the internet
2) complete end to end characterization, tune and test of high speed link
3) transition proxy over to new link for immediate improvement of existing architecture
4) implement BGP and routing of our ARN
5) re-provision IP addresses of customers so that they can have direct, non-proxied access to internet.
These steps all still involve other organizations, so the timeline is not under our control. But we will be doing them ASAP. It's been a good day.
12 Oct: It's like wading through mud. On Saturday we finally got the last of the new antennas and radios installed on a 400 ft cell tower. Now we have to go back and put up the cables to the radios so we can actually turn them on. With good weather, we should start testing new internet connections within a week.
In the meantime, we're seeing some performance bottlenecks. Web pages are still seeing 1.5 Mbps downloads most of the time, but non-web downloads are not doing as well. Today the non-web download speeds dropped to 400Kbps for a few hours. For the geeks: all port 80 web traffic goes through one of two fire hose size connections. Other traffic, such as mail, ftp, nntp, https, VoIP and VPN go over a garden hose size connection. For normal traffic levels the balance is about right. Today, everyone was playing games and doing https downloads, so that little garden hose connection was full.
By comparison, the new connection we're installing is a water main. It has many times the capacity of all of our current connections put together, and it is designed to just carry all traffic... no more splitting it into different pipes. We'll be turning it on at 10% capacity by week's end and that will double our current network capacity. We'll be playing with the partial new connection for awhile until it is tested out as very stable, then we'll start speeding it up to about 3x our current bandwidth and shifting traffic to it. About 60 days later, we'll be licensing the link so we can turn it up to full speed - OC3 level performance. When that happens, our internet connection will no longer be a bottleneck. We'll have to start working on the internal distribution then, upgrading internal connections as we discover which ones cannot handle 10x speeds.
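For the geeks who like to check the math, here is a rough sketch of how those multipliers hang together. It assumes "10% capacity" means 10% of the standard OC-3 line rate of 155.52 Mbps; the current-capacity figure is only implied by the "double" statement, not a measured number.

```python
OC3_MBPS = 155.52                        # standard OC-3 line rate

initial_new_link = 0.10 * OC3_MBPS       # the link turned on at 10% capacity
# If adding that much doubles total capacity, today's total must be about the same.
implied_current = initial_new_link
three_x_step = 3 * implied_current       # the "about 3x our current bandwidth" step
full_speed_ratio = OC3_MBPS / implied_current

print(f"initial new link : {initial_new_link:.1f} Mbps")
print(f"implied current  : {implied_current:.1f} Mbps")
print(f"3x step          : {three_x_step:.1f} Mbps")
print(f"full OC3 vs today: {full_speed_ratio:.0f}x")
```

Under that assumption, the full-speed link works out to roughly ten times today's capacity, which matches the 10x figure above.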
21 Sept: I really cannot top the 5 Aug entry. So in a boring recap of what's going on: Just bought $24K of licensed RF gear to install an OC3. That means the much faster internet connection we've been talking about is much closer to being real. We are a few days from installing the preliminary connections that will result in a 10x increase to our bandwidth. We'll feed as much of that out to customers as the infrastructure will support.
We are working to improve the infrastructure constantly. Today we upgraded connection to Village Green - doubled the power of the original connection and added a second connection to provide redundancy and extra bandwidth. Sorry the system ended up going off twice for 60 seconds. It was supposed to be a hot swap, but I pulled a cable loose while moving stuff around. If you are in Village Green and you had two short interruptions today, it was not your router, so you don't have to go get your sister's laptop.
We have some new webpage tools coming soon. We'll announce them as they come online.
5 Aug: I've had time to cool off, so I'll relate now what I learned this weekend about customer relations. Doing a help desk job can be difficult for a person that has other things on their plate. It really should be given to someone who can focus on the customer's immediate problem, rather than worrying about the server that is under attack while a customer is on the line asking for the 5th time in as many weeks how to change the password of their router. A good help desk person has the individual customer as their top priority.
When the owner/sys-admin/accountant/installer/network engineer is the only person in the office, they also become the help desk. And that is not ideal. When an unknown problem shuts down internet to 1500 people, and 10 phone calls a minute are coming in, the caller whose PC hasn't been able to connect to their router for the past week but didn't bother to call us is not going to be very high on the priority list if their call gets through. If they cop an attitude about it, it could be an unfortunate situation.
(The following story is completely true, except for the parts I made up.)
So this weekend, just before a pretty serious attack on our servers (nothing compromised, but we had to work to keep it that way), a customer calls about their PC not being able to reach the internet. I checked their internet connection, found that our radio worked fine and we could communicate directly to their router, so I told them that the problem was probably between their PC and router, and that they should check the cable connecting PC to router, reboot router, and reboot PC.
The next morning they call back (right in the middle of the attack on the servers) and tell me that the problem is with our service, not with their stuff. They connected their PC directly to the router by cat5, and did not screw that up. I politely give them my humble, but basically expert opinion that the problem is indeed in their part of the system, between the router they own and the PC they own. They insist the problem is with us, so I ask what error message they are getting. Maybe I can help them decipher it, as many PC error messages are rather cryptic. They report that the error message is "A Network cable is unplugged".
Well, calling up my years of people skills and network savvy, I respond back with an "oh... OK then. That explains everything. Basically what that message tells you is that a network cable is unplugged." In retrospect, this was probably the moment when the call started going downhill.
Very happy with myself for the quick, concise and obviously helpful answer to their problem, I was shocked when the customer was less than thankful for the transfer of PC wisdom. They tell me that their service being down this long is unacceptable. I explain to them that the router is their router, and that the connection between their router and their PC is their responsibility, and that if the router is functioning properly, there is nothing I could even do to help directly as a router prevents outside intrusion. "But," I explained, "we do try to offer courtesy technical support for situations like this where the customer doesn't have the technical knowledge to debug their own equipment." That downhill slide is getting steeper.
I offer up that if we were to actually come out to fix their internal problem, we would check the cable and then hook a (known good) laptop into the router to see if it gets a connection. If it does, then their PC ethernet port is fubar, otherwise the router port is fubar. Fubar is one of those technical terms that conveys what I was thinking without actually using the more technically correct terms, if you know what I mean. Trust me, being "fubar" is bad. I offered to them that if they had a 2nd PC in the house, it would have been a trivial effort to do this same test at any time to resolve their problem. Another problem solved! But no, they report "there is no other computer in the house" and that this internet service is totally unacceptable.
I was shocked... after so many useful tidbits of wisdom they were still unappreciative. So I tried to start disengaging, worried that the attack on the server may start getting ahead of me again and that this call was going nowhere. I passed on the options as I saw them: "Since debugging the router/PC connection yourself is not going to happen, your best course is to bet on a bad router and replace it." Best mostly because they could go to Best Buy themselves and not bother me any more, and a little bit because it was the cheapest likely fix. Not impressed, they said they bought the router from me and it was my responsibility to replace it.
Sidetracked by this tangential topic, I let loose another useful bit of knowledge. "It is not my responsibility to replace your router. That's what warranty cards are for, and good luck getting Netgear to replace a router with a burned out ethernet port. We have replaced routers on occasion, when a helpless customer has no clue how to proceed and we felt sorry for them, and it was cheaper than wasting any more time with them." Did I mention that I had been up a very long time, and was distracted by the ongoing attack on our servers? And I added "If we were to come out to take care of the problem with your equipment, we would probably end up replacing your router and charging you for the house call and router. Or, it may be a bad PC ethernet port, and to fix that we would have to operate on the PC or add some workaround wireless connection to the PC to make things work. All billable effort."
At this point the customer got mad for no reason. And lost their cool. And made their fatal mistake: "This is absolutely unacceptable." they said. "We won't accept service like this" they said. "Your service to us has been down for a week, and you expect us to pay to fix it" they said. "And you expect us to use a wireless work-around like my sister does" they said. "and blah blah blah....." they said but I didn't hear, go back one complaint.... "What does your sister have to do with this?" I asked. pause..... no answer.
Feeling a "Legally Blonde" moment coming on, I pursued a promising line of questioning that just presented itself. "Where is your sister" I ask. Defendant's reply: "In our basement." "And she has a wireless connection" "Yes" "and that wireless connection is to your router?" "Yes" "And does your sister have a PC making this wireless connection, or is it an imaginary wireless connection?" OK, I didn't really ask that, I'm not Joe Pesci.
Anticipating the next answer, I ask "And has your sister been connected to the internet through your router via our internet service this past week while you had no service?" The response was "Yes".
So now I feel a little better and wrap up the case with: "So let me recap: you say you have not been connected to the internet for a week and believe it is our problem. You use a router and you are getting an error message on your PC that specifically tells you that your network cable is unplugged. I told you that I can verify that the connection from the internet to your router is OK, and I suggest that the message telling you your network cable is unplugged probably means your network cable is unplugged. You say there is no second PC there to troubleshoot with, but your sister is in your basement and connected to your router and has had no problem using our service to reach the internet for the past week while you have been down. At this point, what exactly do you want me to do for you?"
Response: "Stop talking down to me. I've worked with computers for 30 years and do this for a living."
My closing arguments to the jury as we go over the cliff.... "Well, I know what your problem is now, and I don't think I have enough time to fix that. If you don't like our service, please call to schedule an uninstall."
click...and nothing but dial tone.
End of Story
This call really happened. One of our real customers called and complained about service being out. In the process of working out the problem, which was clearly theirs from the start, they apparently lied to me about what was available for them to work with. They wanted me to come out and waste my time fixing something they were responsible for, and do it at my expense. And their sister really was connected successfully via their router the entire time. While we try to offer as much complimentary support for router problems as we can, I still get really PO'ed when someone "demands" a courtesy be extended. And ironically, they are behind in their payments, so they are not even technically a customer this past month.
We have a few staff here to answer basic help desk calls during the day. Their instructions are to not let me pick up the phone, and only pass trouble calls to me after they have narrowed down the problem to our system, or at least narrowed it enough so that our system is a potential source of the problem. And by the way, we do understand the frustration when things are not working, and share it with you. We go to great lengths to fix our systems when they cause outages to our customers, and work some pretty long hours. Some folks we do better with than others, and I am sorry for anyone I've been curt to when trying to resolve problems. We do take your problem very seriously.
As for this PO'd customer, I assume they have already discontinued service, as they are way behind in billing. If they do leave us, I'll be all broken up about it, I may even tear up a little. In the end it may be a good cathartic learning experience and benefit us all. The ex-customer will be happier with another service (as long as they don't use a router), my BP may go down a little, and you may get better tech support in the future now that I've demonstrated how not to provide support.
3 Aug: It's been a very busy month, but we -may- be catching up and have a moment to breathe now and then.
What's changed recently... Matthew went off to active duty army blowing stuff up, but welcome to our new installer Beth. She's already been instrumental in helping us catch up.
New bandwidth... the perpetual "just another month until we get a huge increase in bandwidth" may be coming closer to reality. A backbone provider has a final agreement on the table with us, for an install within the next 30 days. I'll post more on this as it develops.
Saturday we had another external attack on our DNS servers, just before and just after 2pm. The attack bogged new connections down for 20 minutes each time until we blocked it. This time they did get access to a small piece of one of our servers, so this morning has been spent reviewing data logs and checking for rootkits. They only had limited access and it was fairly easy to clean out. BTW, the intrusion this time was controlled from Russia instead of China, not that that really proves anything.
[I removed the customer support call sequel. The new one was too fresh, so I'll cool down for awhile first before reposting it.]
26 June: Not today, but once in awhile, a customer calls us, irate, complaining about an outage. We hate outages, and work our rears off to mitigate any outage so that you keep as much internet access as possible. When one of our internet providers goes down, we adjust the network as quickly as possible to use the other providers to pick up the slack. Some outages are worse than others and may take hours to mitigate, but most you never even notice because we adjust the network in seconds.
But back to that irate customer that thinks the grass may be greener with other providers: today's outage from 9:30 to 11:30 was because Verizon's commercial circuits to us went down. Commercial circuits are not supposed to go down, and when they do, they are extremely high priority to repair. The Verizon problem appears to be due to them not having working backup batteries in their local distribution center, so when power goes out locally, they go offline instantly. We have working batteries and standby generators, so we can handle most power outages with no impact whatsoever. Verizon was notified of this problem in Jan 08, and we have been escalating formal complaints since then to get this fixed. Yet each time power goes out, they still go down and each time they report to us a day later that their Distribution Center had a bad battery. By the way, Verizon is the only provider we compete against in the Lucketts area.
OK, but how about the 2nd problem at 10:30 Wed night? That wasn't Verizon. We have contracted with other local businesses to provide us bandwidth. So far they have demonstrated better reliability than Verizon and are much easier to deal with. One of those providers, Roadstar, sells us significant commercial bandwidth, but last night their router that controls our service went offline 4 or 5 times, for a total of 2.5 hours. We can adjust to that as soon as we see it by shifting to a different local provider, but the break in their link does cause interruptions to currently running downloads. I am not criticizing Roadstar for this in any way. Their performance has been outstanding. But even a good provider can have network problems.
The lesson here is, at least yesterday, the problems came from that other side of the hill and messed up our grass.
25 June: Verizon bit us again. 2.5 hours down without notice. It appears that Verizon does not have backup batteries for our commercial circuits, as they go down whenever power goes down. We're escalating to make it more painful for them to keep letting this happen.
If you are reading this, you probably already know, but please, if you are down check our website. If you can get to www.lucketts.net (and click refresh to make sure you have a new copy), you know that the problem is not with your connection, and you may get to see a status announcement telling you what is going on.
We don't often answer phones when our service providers go down. That may sound callous or irresponsible, but far from it. I can spend 2 to 5 minutes telling each customer the same thing, while being told by them how important the connection is for their work/gaming/adult entertainment, and sometimes I get the added bonus of getting yelled at. That 2 to 5 minutes is pure, unproductive delay in getting everyone online that is still down. And if you do yell at us, you don't really get added to the top of the list you probably want added to.
If you call my cell phone directly, that is not good either. The cell is the backup line used to coordinate with Verizon, AT&T, and the other carriers to get their service back up to us ASAP. We don't let customer calls forward to it during serious outages for a reason. 2 to 5 calls a minute would make it impossible to debug with our providers.
14 Jun: Another weekend and more storms coming. Saturday night our tech support is driving an ambulance, so there may be some delays in returning calls. On the topic of phone calls, if you called us this week and we didn't respond, and you expected us to respond, please call again. We lost lots of calls with the storms & damage & power outages. If you got my cell phone voicemail that says please call 703.349.3661 to leave a message there, but left the message on my cell anyway thinking that it would be responded to faster, that doesn't happen.
All of our tech support response is keyed from our primary phone line and its voicemail. If you cannot leave a message there, then rest assured we already know of the (probably major) problem. It's hard for us to miss a full mailbox.
7 Jun: Announcement history from the recent storm:
Friday: Only 2 APs still down (Hamilton Station & Old Wheatland
Rd). Power is still off at those sites. Everything else is Up and
running, and we are wading into the individual customer issues. Lots of
routers needing rebooting, with an occasional wet radio mixed in.
Friday afternoon network status here. That's a lot better than the previous status.
7PM:
Almost everything back UP. T1s came back up when power came up to
Lucketts Rd. Still have power problem on Milltown Rd. Hamilton Station
just came partway up. St Clair AP requires replacement - will do that
at 8:30AM Friday. Only remaining problems St Clare, Wilt Store/Rt15,
Waterford, and Milltown.
Status 1:30 PM
Thursday: Power out at selected critical places, Verizon is still
offline, and everything running on batteries has pretty much run
down. In spite of that, about 80% of our customer base is connected
with good connections. The rest either have a limited connection or
their access point is actually powered down. Those with limited
connection will be fixed as we go through the list. It takes about 20
to 40 minutes per re-route, so we started with the largest access
points first, saving the smallest for last. No word from NOVAC yet. [ 1335 5 Jun ]
Status: Power is out almost everywhere. We have
deployed generators at critical nodes, but the outage is overwhelming.
Of the 5 locations where we still have power, there are only about 20
customers still online that will be able to read this. All phones are
dead, all T1s are dead. Wed Network status snapshot is here. Power estimate ranges into days. More info to follow.
3 Jun: Overheard from a recent customer service call:
Support: Hello, this is lucketts.net
Customer: Is this Lucketts.net?
Support: Yes, this is Lucketts.net.
Customer: Can you make our internet faster? It is very slow right now.
Support: Excuse me, are you one of our customers?
Customer: Yes. (provides account info)
S: What kind of trouble are you having?
C: My children are complaining about how slow the lucketts.net connection is.
S: What specifically is slow for you: web pages, or email, VPN or just everything?
C: I don't know.
S: Is your connection slow now?
C: I don't know.
S: How are you determining you have a slow connection?
C: My children keep telling me how slow it is.
S: Let's check the status of the equipment.... Well, the radio we installed is connected with a good signal, its logs show no problems, and I just ran a speed test from the radio to the internet at full speed. Your connection is fine. Have you tried rebooting your router?
C: Yes.
S: What kind of router do you have? Sometimes certain models have failure modes we are familiar with and can help you work around.
C: What's a router?
S: (Explains what a router is)... Could you try rebooting the router to see if that helps.
C: No, you send someone out to do that.
S: Mrs _, the router is your equipment, just like your computer. As a courtesy, we can offer free help and advice over the phone if you are having trouble with your equipment, but if we send a technician out to reboot your equipment there will be a service charge. If our equipment is broken, then of course we fix that for free.
C: Verizon would come out and fix it for us.
(At this point we knew she was not from around here...)
S: Let me see if I can find a problem with your connection from here.... OK, I see a connection from you to the internet right now, and it is downloading data from (a specific website).com at 1.5 Mbps.
C: See, I told you it was slow.
S: Our advertised service speed is 1 Mbps, so the connection you have right now is running at 150% of the speed you are buying from us.
C: See.
S: That's 1 and 1/2 times as fast as we told you it would be when you signed up.
C: And my children are telling me it's slow.
S: I also see that you are running limewire connections to the internet. Do you understand that if you run that at the same time you are surfing the net, it will slow down your surfing?
C: See, now you agree it is slow.
S: Limewire uses up your connection, so that little is left for you to use for web browsing. If you want your web browsing to go faster, turn off the limewire program.
C: You send someone out to turn it off for free?
S: I think in this case, we may be able to turn it off from here.
C: Finally you do something for me.
26 May: Here is the trail of Verizon problems the past 27 hours. Very painful when the core link goes down. We'll design accordingly when the T3s are installed. The main difference is that the T3s will support BGP, which will auto adjust if one goes down.
All services back up and running [ 1400 ]
Verizon says they have crews in the field now fixing the problem. [ 1030 ]
Wed morning: Between 1AM and 3AM Verizon could not get access to the Bell South facility. Between 3 and 5 they could not find a tech. Don't know what the next excuse will be. We are building new connections to bypass Verizon, but it is very slow going. Not sure what the schedule will be today. Verizon has not been very informative or helpful. [0630]
Our phones are back up. At least we now have Vonage forwarding to a copper phone line so we can answer. Sorry for being out of touch so long, but when the only working phone was my cell, I needed it to talk to AT&T and Verizon. [2300]
Verizon says all T1s in their LEC are down. Not really a comfort to know that everyone using T1s is off-line. A technician is on-site now working the issue (again) [10:20]
12 May: Here's what's going on the past few days: Steve & family went to Illinois to his Dad's surprise 70th birthday party. Dad hasn't seen the family for over 4 years in person, so this was a big deal. We couldn't announce the trip in advance, as that would have given the surprise away. As soon as they were out of town, the rains started killing links that have been running fine for over a year... particularly the link to Taylorstown/Lovettsville. Matthew and Laura worked the entire weekend to keep things running.
The network, particularly west of Furnace Mountain, suffered all weekend. Sunday was really bad. New circuits were brought up to cover the folks that were down. About 1/2 are on a new backup, but 1/2 are still pretty slow. We're working today to bring everyone else there onto the new backup link to improve speeds. As soon as the rain stops, we'll be out cutting tree limbs to fix the primary problem.
21 April: More thunderstorms causing us difficulty. With the lessons learned Sunday, we'll be out this week installing additional grounding systems at the access points to make them more robust.
Today's score:
7AM storms roll through, 5 APs go down. 10AM 5 more APs go down. 6PM 5 more APs go down and extended power outage. 11:20PM Primary internet provider degrades to less than 10%. At 2AM, all major access points are back up and all bandwidth 100% via either primary or backup providers.
14 April: Attack blocked. This was the 2nd attack of the day. The first stopped within seconds of us blocking it at 4PM. The 2nd attack brought in extra attackers and didn't stop when we blocked it, effectively tying up our hardline connection for time-sensitive protocols. AT&T only blocked one attacker on their first attempt. The network started recovering, and as AT&T blocked more attackers our connections cleared. The sad thing is, the attack was just a waste of time for everyone, even the attackers that were trying to crash the primary DNS server and gain control. That server just ignores any traffic hitting it more than a few hundred times a second, and discarded the extra 120,000 packets per second that the bad guys were sending. All they ended up doing is filling our data lines with junk, preventing us from getting out.
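For the curious, the "ignore anything past a few hundred hits a second" behavior can be pictured roughly like the Python sketch below. This is only an illustration, not the DNS server's actual code: the class name, the 300-per-second budget, the one-second counting window, and counting per source are all assumptions made up for the example.

```python
from collections import defaultdict


class PerSourceLimiter:
    """Drop packets from any source that exceeds a per-second budget."""

    def __init__(self, max_per_second=300):  # assumed threshold ("a few hundred")
        self.max_per_second = max_per_second
        self.window = None                   # the one-second window being counted
        self.counts = defaultdict(int)       # packets seen per source in this window

    def allow(self, source, now):
        """Return True if a packet from `source` at time `now` (seconds) is accepted."""
        window = int(now)
        if window != self.window:            # a new second has started: reset counters
            self.window = window
            self.counts.clear()
        self.counts[source] += 1
        return self.counts[source] <= self.max_per_second


limiter = PerSourceLimiter(max_per_second=300)

# Simulate one attacker sending 120,000 packets inside a single second.
accepted = sum(limiter.allow("attacker", now=0.5) for _ in range(120_000))
dropped = 120_000 - accepted
```

With these made-up numbers the server only ever looks at 300 of the 120,000 packets. Note what the sketch does not fix: the junk still has to travel down our data lines before it can be thrown away, which is why the lines filled up anyway.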
3 April: System Status: UP. Why this needs to be mentioned: While adjusting the new connections late tonight, we unintentionally disabled routes to about 1/5th of the internet. A few trouble calls came in that all service was down, so we went off to find something wrong that could cause a total outage to specific customers. While troubleshooting the individual problems, the hidden problem sat there. As later, more accurate calls came in telling us that only some of the internet was down, but users could reach www.lucketts.net, we realized that the problem was routing on our end and focused on the proper corrective action. But by this time we forgot where we were at in the original effort and broke things worse as we tried to fix. After a total shutdown and restore everything is back up.
If you were down, and could not reach www.lucketts.net, the problem was with your router. We have between 1 and 5 customers break their routers every night and call us. It happens, and we try to help when possible. But if you could reach www.lucketts.net and not the internet, then we broke you. Sorry. The new links we're adding that caused that initial problem are actually to increase speeds and add redundancy. Didn't work out that way tonight. The partial outage was from 9:15 to 11:15, with partial outage alternating with total outage between 11:15 and 11:40. At 11:40 we initiated a system restore that had everyone up instantly at 50% and at full speed by 11:55.
17 March: Quick note just to catch up: We're adding a new bandwidth provider this week. It should double or better our current web capability. The new provider is adding a 20 Mbps line immediately, and working on delivery of a T3. That makes 2 T3s we have on order. Our plan is to have both, from different providers, at the same time. That way we have a fully redundant connection to the internet. Currently, if the big carrier goes down, we fall back to the multiple T1s. That keeps us alive, but much too slow. With 2 T3s, one carrier dropping out will not really be noticed. And we can consider getting rid of the T1s.
As an added benefit, if we have odd routing problems with one provider like we did this week, we can just route the messed up traffic over the other provider. And as internet congestion moves around, we can adjust our connection out to whichever has the lowest latency at the moment. We'll be delaying (again) some of the smaller access point installations until we get this extra capability. Right now it is more important to increase reliability and speed than to spend two or three days putting in a new access point for 1 or 2 new customers. Sorry, but we are obligated to put first priority on our existing customers.
The panoramic pictures are pretty cool. Still waiting for the software to get a few bugs fixed by the vendor. When it's ready we can schedule appointments for anyone that wants a 360 degree portrait of their property.
13 Feb: Ice can be a real problem. As it builds up on antennas the signal gets worse, eventually getting totally blocked. Here's a screenshot from one of our monitors showing tonight's ice on one of our main links. The green area shows the bandwidth going through this particular link at any one time. The lower graph shows the RF noise background (usually -92dB) and the RF signal we are receiving (usually -62dB). As the signal starts approaching the noise, the connection gets worse. As you can see, just after 2AM this link was pretty much shut down for a few hours. Then it warmed up a little and started sending traffic through again. At 7AM this morning a 2nd wave of wet ice started forming, and at 8AM the signal was blocked again. Dry ice is better for us than wet ice. That link will be down until the ice is pretty much melted.
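To put rough numbers on "the signal approaching the noise", here is a small Python sketch of the arithmetic. The -92 and -62 figures are the usual readings quoted above; the -88 iced-over reading and the function names are made up for illustration.

```python
def margin_db(signal_db, noise_db):
    """How far the received signal sits above the noise floor, in dB."""
    return signal_db - noise_db


def power_ratio(db):
    """Convert a dB difference into a plain power ratio."""
    return 10 ** (db / 10)


NOISE_FLOOR = -92      # usual RF noise background from the graph
NORMAL_SIGNAL = -62    # usual received signal from the graph
ICED_SIGNAL = -88      # hypothetical reading with wet ice on the antenna

normal_margin = margin_db(NORMAL_SIGNAL, NOISE_FLOOR)   # 30 dB of headroom
iced_margin = margin_db(ICED_SIGNAL, NOISE_FLOOR)       # only 4 dB left

print(f"normal: {normal_margin} dB above noise "
      f"(signal is {power_ratio(normal_margin):.0f}x the noise power)")
print(f"iced:   {iced_margin} dB above noise "
      f"(signal is {power_ratio(iced_margin):.1f}x the noise power)")
```

On a normal night the signal is about 1000 times stronger than the background noise. In the made-up iced case it is only a couple of times stronger, which is about where a link stops being useful.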

We do have other links still working, although they could be affected by the ice as well. They are just a little tougher. If all the links go down, then we'll be relying on T1s to carry the load to keep the network running. At least they are not impacted by ice.
5 Feb: Quick email tutorial: How to stop SPAM
Our servers scan each incoming email and if it looks like SPAM, we add this to the subject line: [SPAM]. No, the people sending the SPAM are not being polite and pre-marking the SPAM for you. Once you download the alleged SPAM, you can then either ignore the marking and read the message anyway, or you can set your email client program (e.g. Outlook Express) to automatically delete it or move it to another folder. With this method, the [SPAM] is still downloaded to your PC where it is then deleted. If you need help setting up filters, you can use "help" in Outlook Express or you can google "outlook express filters".
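For those who like to see how such a client-side filter works, here is a minimal Python sketch of the idea. It is not what Outlook Express does internally, just an illustration: the function names and the two sample messages are made up, and the only thing taken from our setup is the [SPAM] marker in the subject line.

```python
from email import message_from_string

SPAM_TAG = "[SPAM]"  # the marker our server adds to the subject line


def is_marked_spam(raw_message):
    """True if the message's Subject header carries the [SPAM] marker."""
    subject = message_from_string(raw_message)["Subject"] or ""
    return SPAM_TAG in subject


def sort_mail(raw_messages):
    """Split already-downloaded messages into (inbox, junk) lists."""
    inbox, junk = [], []
    for raw in raw_messages:
        (junk if is_marked_spam(raw) else inbox).append(raw)
    return inbox, junk


# Two made-up messages for illustration.
sample = [
    "Subject: [SPAM] You have won!\n\nClick here.",
    "Subject: Dinner Sunday?\n\nSee you at six.",
]
inbox, junk = sort_mail(sample)
print(len(inbox), "kept,", len(junk), "moved to junk")
```

Notice that the filter only runs after the message is already on your PC, which is exactly the limitation described above.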
We offer additional solutions now. The first option all of our users have access to: Deleting all SPAM as soon as it gets to our server.
- Log into webmail from our home page.
- Select the "Filter and Forward Mail" option.
- Click on add a new filter
- Select Condition for Filter: Email classified as SPAM
- Select Action if condition is matched: Throw away
- Click on "Create"
From now on every email that our server thinks is SPAM will be deleted immediately. It will not show up in your inbox. This will catch 99% of the SPAM sent to you, and almost never delete real mail. This is perfect for a child's account. They may still see some of the stuff, but they will not be overwhelmed by it.
Additional options are available for everyone on the new email server (everyone except those with @lucketts.net addresses). You can go to the "SPAMAssassin Mail Filter" option and then Message Modification. Here you can change the score that SPAMAssassin uses to determine if an email is SPAM, you can change the [SPAM] message, and you can change the behavior of the program so that your SPAM is sent to you as an attachment to a warning email. That way you never have to see it.
A last option available is asking us to turn SPAMAssassin off for your account (only). We can do that for all accounts except @lucketts.net email.
Why do we pick on @lucketts.net addresses? We did have a new server that was to replace @lucketts.net in Dec, but our other mail server died. That diverted the new machine and we have to start over to build a new one for @lucketts.net. Once we have it up and running, then everyone will have similar options.
1 Feb: We are having serious trouble with ice on antennas. Both links to our new provider are down, so we are running entirely on backup T1s. The T1s were attacked again this AM by 2 sites in Japan at the same time. It's been a tough day. Speeds are very slow right now, averaging 150 Kbps to 300 Kbps. We're working on various solutions. Traffic such as P2P will be drastically limited to keep everyone online.
4 Jan: Our new provider came through with their link. It's currently running at 15 Mbps, replacing the old 10 Mbps link from the guys that went out of business. This provider is working on a 50 Mbps link to be installed towards the end of Jan. Hats off to the Rapid DSL guys for getting a replacement high speed link installed in three weeks.
The entire network is tuned to 2 Mbps right now to do a serious test of the new link. Web traffic ought to fly along. Non-web traffic may still be limited to 1 Mbps, and P2P is still frowned on, but even that will start relaxing when we get the 50 Mbps link at the end of the month.
2 Jan: The temporary backhaul is in but it's not up to speed yet. We're working with the provider to get it to full speed. Until then, we'll be balancing the users across the other connections. Top speeds during really busy times appear to be between 300 Kbps and 500 Kbps. We'll be balancing the network by hand for the duration to keep everyone running as fast as possible. I'll post updates when we know of any status changes.
While we are in the crunch, P2P will be very restricted. In tight bandwidth situations, the 100's of connections P2P programs use give them an unfair share of the available bandwidth. The P2P bandwidth caps will be much tighter than normal to keep everyone running with basic services. Normal downloads will not be affected by those caps, so if you need something, just download it directly. If you -have- to use P2P to get something, coordinate with us and we can set up a special rate for your download.
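Why do hundreds of connections add up to an unfair share? As a rough simplification, a congested link tends to get divided per connection, not per customer. The Python sketch below shows that arithmetic. The customer names, connection counts, and the 7 Mbps link size are all made up for the example, and real traffic does not split this neatly.

```python
def share_by_connection(connections_per_user, link_kbps):
    """Split a link evenly per connection and total it up per user (in Kbps)."""
    total_connections = sum(connections_per_user.values())
    return {
        user: link_kbps * count / total_connections
        for user, count in connections_per_user.items()
    }


# Hypothetical busy moment: two people browsing, one running P2P wide open.
users = {"web_user_a": 2, "web_user_b": 2, "p2p_user": 196}
shares = share_by_connection(users, link_kbps=7000)  # assumed 7 Mbps link

for user, kbps in shares.items():
    print(f"{user}: {kbps:.0f} Kbps")
```

In this made-up case the one P2P user ends up with 98% of the link and the two web users get 70 Kbps each. A per-customer cap on P2P traffic is one way to stop that from happening.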
29 Dec: Update on new bandwidth: A new link was installed 28 Dec, but it has growing pains. It works at about 70% the previous link's data rate of 10 Mbps. Our target was 20 Mbps. The provider is working the problem to increase at least to our previous level. This temporary link will be handling most of our web traffic for the next 3 to 4 weeks until a new 50 Mbps link is installed. Assuming we do not get the temporary link running faster than 10 Mbps, bandwidth will be a little tight in January. Those 2 Mbps downloads we were getting in Nov & Dec will not happen as often. If we do get the temporary link up to 10 Mbps or higher, we will be more comfortable and release bandwidth to the user faster than 1 Mbps more often.
Once the 50 Mbps link is installed, we'll be turning on a new class of service - 4 Mbps - or possibly higher if the network can handle it. The new service level will not cost more. We will not raise anyone's rates. But to take advantage of the faster service, most customers will need a new radio. We will be replacing all of the old radios eventually so that all customers can take advantage of the increased bandwidth. There is no charge for this upgrade, but it will take awhile to replace them all. If we upgrade 5 units per week, it will take almost 2 years to complete. We'd go faster, but each upgrade has substantial cost and we cannot absorb more than a few upgrades per week.
There is another option. If a customer is willing to pay an install charge to get the new radio sooner, we'll move them up the replacement list, placing them just below emergency replacements but above courtesy replacements. That way they can start taking advantage of the improved speed and reliability as soon as possible. With some customers paying a premium, that will offset some of our cost, we'll be able to upgrade the entire network faster. We'll announce the upgrade cost in January.
No, we will not tell you where you are on the courtesy upgrade list unless you are one of the next 5 in line. We'll call those folks when they reach the top 5 to schedule a courtesy upgrade for them, and as the upgrades are completed we'll go to the next person on the list.
Not all customers can be upgraded immediately. Those on the 900 MHz systems cannot be upgraded until we have a special replacement access point in place for them. That is probably 3 to 6 months away.
23 Dec: Here's the scoop on the new bandwidth: On 26 Dec, weather permitting, we have a new provider hooking up a microwave link into our network. The new link is targeted at 20 Mbps, replacing a 10 Mbps link we're currently using to serve web traffic. That existing link has been having routing problems, plus it has been experiencing random frequency interference. We've had to monitor constantly to catch the interference and adjust channels to avoid it. So the new link should be faster and more reliable.
Once the replacement link is in place, we'll be installing another licensed microwave link from a different location directly to our network ops center (affectionately referred to as the barn by our employees.) The new link will start at 50 Mbps. Since this link has fewer links to travel through, and it is on licensed bandwidth, it will be even more reliable. These are significant changes that will have very noticeable, immediate impact on the network.
Until then, hang in there with us. While we are watching, the network runs about 1.5 Mbps to everyone. When we turn our back, it often bogs down to under 1 Mbps. We'll continue to monitor the next two days to keep it going as fast as possible. We're all looking forward to the new links getting installed as soon as possible.
Monday and Tuesday will be lightly manned, and Wed will be the new link day, so no appointments for routine work will be scheduled until Thursday. If you are offline or abnormally slow please give us a call.
20 Dec: [Temporary steps to restore mail access deleted... no longer needed.] All email accounts are online again. Some had passwords reset so call us if you do not remember the default password we originally gave you. All mail from before that was stored on the old email server is still on the old email server. You can still get to it via the webmail page, you just have to remember Wed's email password for your account. Follow the instructions on the webmail page and normal users will have no problem getting to any old mail. If you need help give us a call.
Advanced users should be able to download the old mail directly to an email client, but if they were advanced, they would not have been storing old email on the mail server in the first place. <grin>
The replacement server is a brand new shiny server we have been preparing for a few months now. But it was not fully checked out as ready for live internet yet. Please let us know any problems so we can work on them. There are lots of new features, and it's going to take us awhile to learn how to use them well. We'll provide instructions soon.
Hosted web sites should be up. We rebuilt them all by hand this evening. If your site is not working right, or is sort of missing, please let us know and we'll find it for you.
22 Nov: Wow, it's been a long time. To catch up quickly: replaced radios, fixed wires, added reliability, new bandwidth coming, found interfering sources, looking for technical support (web, unix) on outside contracts, upgraded most of our infrastructure.
If you missed the Medal of Honor recipients, here are the links again. The Navy site is stunning.
Smith
Murphy
Dunham
17 Sep: Trying an experiment with our spam checking software. Instead of attaching a spam to a warning message, we'll just forward the spam on to you unchanged except for the subject will have [SPAM] added to it, and the headers will contain detailed information about how spammy the message is so the end user can filter/delete it themselves. The function is the same, but a few customers may actually see the content of the spam before deleting it. If you have any problems with that, let us know and we'll show you how to delete the items marked [SPAM] automatically. You can even just delete the really SPAM and leave the sort-of SPAM for review.
1 Sep: Forgot to add good news. The system has been running at 2 Mbps for most of the past week. Yesterday it was at 3 Mbps for a few hours, but was a bit strained, so we returned it to 2. If I'm in the office watching, I try to bump it back up to 3 and watch for bottlenecks so they can be fixed. Not ready to advertize as a 2 Mbps network yet, but looks like that's where we're going soon.
For future price plans, looks like we will upgrade all current customers to a 2 (TBD) Mbps plan at no charge. When our really high speed link is fully activated, we'll be offering a 5 (TBD) Mbps plan. There may be an equipment upgrade charge to switch over - but the monthly price would be the same as the base plan. I would expect the one time upgrade charge to be between $100 and $150. This is all tentative, just throwing an idea out for feedback.
31 Aug: Lost another customer today. We always hate that. This customer had experienced slowdowns 2 months ago and called us repeatedly about it. When we went out there was never a problem with the connection to us, only from his PC to his router. We moved his router out of the attic heat, and actually replaced it for him when it still acted up. It was running great when we left, and he agreed to give us a call if any problems returned just in case there was an undetected problem with our equipment. We did this all at no charge.
We didn't hear from this customer again until he called today, very upset, and cancelled service. He told us we were the reason he got a virus on one of his PCs that destroyed the hard drive, and we were why he lost three hours of work this morning when he tried to send an email and his signal to his router was lost and his connection disappeared.
Reason was lost on this customer, and it's been a long week. So we just accepted that he wished to leave. I hate that. I particularly hate that there is an ex-customer out there that thinks our service was the cause of his problems, when the description in the above paragraph pretty much characterizes where the problem was (i.e. the customer).
Now we aren't above totally messing things up once in awhile. Particularly recently between storms and the upgrades to bandwidth we've been implementing. Some of our customers are just having troubles with 1 ISP. We have three. Try tracking down a problem with all three saying it's the others' fault. :)
But sometimes it is pretty easy to isolate the problem. If we show up and you tell us your internet is dead, and we plug our laptop into the connection and it works perfectly, well odds are the problem isn't with the professionally managed gear.
There have been plenty of cases recently where we have been the cause of the problem. On Thursday while fixing an access point with a chewed up wire, we accidentally kicked another connection loose, stopping all web traffic for 20 minutes. We were just taking our time on the roof to get the access point fix done right, not noticing the big blinking red lights in our office until the calls started coming in.
This all ties together with the same request I make at least once a month. Please call us if you are having trouble. Don't let a problem fester for days or even hours. If your router is offline, we don't call you because we just don't know that. A properly functioning router is supposed to look like it's offline to everyone outside of your home. If we have a widespread system problem, we will (usually) know about it and be scrambling. But there is always the occasional dead power supply or other random failure that will require us to come fix something. Let us know ASAP if a reboot doesn't work so we can figure out what is wrong and how to fix it.
19 Aug: Removed a rant from last entry that was a little too angry. I apologize if anyone thought I was talking about them and took offense. I obviously couldn't mean you, the loyal customer reading this log.
Thank you to all the customers that called and told us how cool the new bandwidth is. Average peak download speed is 3 Mbps right now. I saw one customer downloading 4.5 Mbps from the web (not from our web server, but the real internet) when we removed all limits for awhile. Unlimited speeds are still a little too dangerous to leave turned on all the time, as a single P2P program unleashed on the network will then drag everyone down.
We have to seriously adjust the firewalls so that they can deal with these higher speeds and P2P. We'll probably set casual P2P use limits at 1 Mbps, and steady P2P use at 200 Kbps. Normal traffic will be set to allow peaks up to 3 Mbps, with heavy users limited to 1.5 Mbps during peak time. P2P is casually defined as > 25 simultaneous connections to different internet IPs. It will take awhile to get this working correctly, but it is our design goal.
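As a rough picture of that design goal, here is a Python sketch of the decision the firewall would be making. It is not our firewall configuration. The 25-connection rule and the four speed limits come from the paragraph above, but the function names are made up, and how a customer gets flagged as "steady" P2P or as a "heavy user" has not been decided, so the sketch simply takes those as yes/no inputs.

```python
P2P_CONNECTION_THRESHOLD = 25  # more than 25 distinct remote IPs counts as P2P


def looks_like_p2p(remote_ips):
    """Casual P2P test: simultaneous connections to more than 25 different IPs."""
    return len(set(remote_ips)) > P2P_CONNECTION_THRESHOLD


def rate_cap_kbps(remote_ips, steady_p2p=False, heavy_user=False, peak_time=False):
    """Pick a download cap in Kbps for one customer (flags are assumed inputs)."""
    if looks_like_p2p(remote_ips):
        return 200 if steady_p2p else 1000   # steady P2P vs casual P2P
    if heavy_user and peak_time:
        return 1500                          # heavy users during peak time
    return 3000                              # normal traffic may peak at 3 Mbps


# Hypothetical customers, using made-up private addresses.
browsing = ["10.0.0.%d" % i for i in range(5)]
file_sharing = ["10.0.1.%d" % i for i in range(40)]

print(rate_cap_kbps(browsing))                        # normal web use
print(rate_cap_kbps(file_sharing))                    # casual P2P
print(rate_cap_kbps(file_sharing, steady_p2p=True))   # steady P2P
```

Counting distinct addresses, not raw connections, keeps a web page that opens a dozen connections to one site from being mistaken for P2P.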
17 Aug: The prototype new internet connection is now an operational internet connection. (that means we're paying for it) Everyone's web traffic is going over it right now, with our other 2 connections acting solely as a backup. It feels faster to me... web pages are a lot snappier. This new link is easily expandable, and we will be able to stay ahead of customer need for bandwidth.
Last night we tried to turn on the new link, but after about 2 hours certain web pages started not working. In particular, hotmail failed. Turns out an address on the primary router that was 65.x.y.z/28 was entered as 65.x.y.z/8. Amazing things happen when you leave off the "2" - Microsoft dies. Apple would have loved to have learned this secret years ago...<grin> For you non-geeks, leaving off the 2 caused all traffic for any web site with an IP address starting with "65" to be routed away from the internet instead of to it, causing those web sites to fail but letting all others work just fine.
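For the geeks, the size of that one-character typo is easy to see with Python's standard ipaddress module. This is just an illustration: 65.1.2.3 is a made-up stand-in for the real router address (which we are not publishing), and the two test addresses are made up as well.

```python
import ipaddress

ROUTER_ADDRESS = "65.1.2.3"  # hypothetical stand-in for the real 65.x.y.z

# strict=False lets us give a host address and get the network it belongs to.
intended = ipaddress.ip_network(f"{ROUTER_ADDRESS}/28", strict=False)
typo = ipaddress.ip_network(f"{ROUTER_ADDRESS}/8", strict=False)

print(intended, "covers", intended.num_addresses, "addresses")
print(typo, "covers", typo.num_addresses, "addresses")

# Made-up addresses: one that merely starts with 65, one that does not.
starts_with_65 = ipaddress.ip_address("65.200.10.20")
elsewhere = ipaddress.ip_address("64.200.10.20")

print(starts_with_65 in intended)  # False: the /28 leaves it alone
print(starts_with_65 in typo)      # True: the /8 swallows it
print(elsewhere in typo)           # False: everything else still works
```

The /28 covers 16 addresses. The /8 covers 16,777,216, which is every address starting with 65. As far as the router was concerned, all of those web sites had suddenly moved inside our network.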
We reversed the changes to install the new link and worked on debugging the problem until about 2AM when we got distracted by the hate mail from people that couldn't reach certain web sites. It may have been wrong of me, but my sympathy meter did not peg just because Sally's house of thrills didn't get 1 or 2 additional visitors last night. You all should understand that when we are asked to see why you can't reach a web site, we can see what web site it is you are trying to reach. ;)
<deleted paragraph... 2AM rant was too over the top even for me. 19 Aug 2008>
We do want you to call us if you are having network problems, even if you have to keep calling us. This is an expensive service, and we want 100% satisfaction for the customer. Not there yet, and for some we may not be close. I believe though, our level of performance is constantly improving. When we started, 40 customers on 1 T1 with weekly random access point reboots causing 4 hours down-time on average. Now 400 customers, 65 access points, three internet connections, and probably 50% of our customers see zero downtime in an average month. Since we are trying to improve, give us that call. But please try rebooting your router first. And don't chew us out because your PC says your signal is weak. When that happens, free advice over the phone will become an offer to make a paid service call to fix your PC's connection to your router.
[Edited 19 Aug] For those that didn't understand - if your PC is showing you a message about a weak signal, it is telling you that the signal between your PC and your router is weak. That is an issue with your internal network and has nothing to do with our service. It is sort of like calling direct TV because you cannot get your VCR to work with your TV, but individually either works fine with the satellite receiver. You will never see any message about our signal strength - you have no access to that. Each customer is responsible for their internal network, we are responsible only up to the cat5 cable that plugs into your router. If we come out and the connection out of our cat5 doesn't work, we fix for free. If the connection is just fine, we'll be glad to fix your equipment for you, since you just bought 1/2 hour of our time for the house call. This fee can be waived if there are mitigating circumstances, such as exotic but simple problems the customer cannot be reasonably expected to recognize, or a cold drink offered. If it is a weekend evening and we've just been yelled at on the phone, and we get to your place and find the router and radio you told us you rebooted show uptime of 5 days, you will be charged for the service call.
Sorry - venting again and using too many hyphens and brackets. Those are poor replacements for hand gestures I guess. Did I mention that we have a new employee? Matthew started this week, and his first two days included a lightning strike, a tower climb, fixing 3 damaged access points, building 6 replacement radios and an unexpectedly complicated installation. And that was just his first 2 mornings <grin> If he comes back on Monday, looks like someone we'll want to keep around. Welcome aboard.
2 Aug: Been very hectic the past week between storms, funeral, and network upgrades. Not everything is done, but we're working on it. Thanks for all the customers that stayed with us. We only had one customer quit because we wouldn't leave my mother's bedside to come out and give him a repair. We did offer to come out the day after she died, but he considered that unacceptable. One other customer quit because we imposed an installation fee on them when they were moving to a new house that needed a complete survey/installation. She apparently thought that the installer should work for free, and that other folks paying for their installation should be bumped.
This was not a week to test the "customer is always right" policy. <grin> There are times when we don't need the pain in the rear. Usually sorry to see someone go, and try our best to make them happy. We've even been extra careful to be nice to folks that tell us we're down and we find that their computer is unplugged. But when someone says give me something for free or we'll take our business elsewhere, that crosses the line. We often are happy to offer stuff for free as a courtesy, when time permits, but draw the line at extortion.
A few customers recently have had trouble with speed tests on the internet. They often get 200 Kbps or less. Normally customers should expect to get between 700 Kbps and 1.5 Mbps. Turns out those customers are often the ones that run P2P software wide open, and the firewall limits their P2P bandwidth to allow other customers to be able to access the internet. The limits do not affect web browsing in any way. But the P2P limits do sometimes affect download tests, as those do not always use normal web traffic. If you get poor results with a speed test, try actually downloading something from the web. You may find that your web surfing is actually at full speed.
18 July: Steve's mother, Janet Treadwell, passed away on Monday this week. Her family was at her bedside with her. Thank you to all the customers that are showing great patience and working with us to delay scheduled appointments and installations until we can get caught up. We will try to make it up to you, as we do appreciate your business.
Oh yes, the prototype of the OC3 link is in place, and we have a few customers testing FIOS speed levels right now.
2 July: We fixed some more problems... blah blah blah ... enough of the problems. We will always have problems, but when we do, we'll fix them as quickly as we can. Tonight we write about the cool stuff instead. We have a firm offer on the table from AT&T for a T3. They have already delivered the router (very optimistic of them) and we just have to sign on the dotted line and we will get a very significant boost in bandwidth. Another offer is on the table for an OC3 microwave link to an un-named provider in civilized Sterling/Ashburn area. We're doing the engineering on that one now and will implement it if possible. If not, we'll do the T3. Either way we decide this week and get the ball rolling.
The first time we were working an OC3 link I announced in a few limited forums who was providing the connection to us. A few days later, the offer was withdrawn from the provider. An alternate provider was found and again we announced in a limited forum including other service providers, and again the offer was withdrawn within days. My family may have a genetic disposition towards paranoid delusions, but come on. How many telecom companies offer a product, give a quote and then withdraw the offer? Well, to prove I don't believe in conspiracy theories, we have a new OC3 provider, hands have been shook, and the engineering work is ongoing. If we (that is Lucketts.net) can make it technically work, it will happen. 150 Mbps.... If someone is reaching out to touch us, good luck figuring this one out.
We have a new Access Point on Lovettsville Rd north of Furnace Mountain. It can see both east and west of the mountain, so we may finally have that 2nd link to bridge the Lovettsville Valley with the Lucketts Valley, improving reliability greatly, and providing access to a few people that have been waiting for awhile. You reading this John?
The cell tower on Furnace Mountain has not been worked very hard yet. We'll be focusing on the Point of Rocks area in the very near future to start picking up customers there. Easy connections and an expanded customer base to help finance our soon to be much greater bandwidth from the internet.
Details are still in work, but we will be offering a 5 Mbps plan in the near future - near being once the T3 or better is installed. We will also be able to provide dedicated bandwidth connections to those that have a business need. And we will be able to support much greater P2P loads, so many of the restrictions we have on current network use will be lifted, or at least relaxed significantly. And the proxy servers can be retired, which will reduce some of the odd connection problems people have when they browse to a non-standard web site, such as a Microsoft Outlook webmail system.
Some of you may have noticed the nagging certificate warning is gone on the webmail servers. I gave up and paid for a Microsoft-approved security certificate. The cost was low compared to the confusion it was causing, so webmail is easier to get to now. The regular mail servers still use the old certificates, but we'll be transitioning them soon. Does anyone know how to strip the password out of an openssl certificate? Those new certs ask for a password twice, on the console, while rebooting the server. Makes it very hard to work on them remotely, as mail will not start up properly unless we are there to type in the password, twice. If you know how to turn that off, please email steve at lucketts.net.
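For what it's worth, that kind of console prompt usually comes from the private key file being encrypted, not from the certificate itself. Here is a minimal sketch, assuming an RSA key and using made-up file names: the `openssl rsa` command reads the encrypted key, asks for the passphrase once, and writes an unencrypted copy that a server can load at boot without prompting. The Python wrapper only builds and runs that command; the function names and paths are placeholders, not the real mail server layout.

```python
import subprocess


def strip_passphrase_cmd(encrypted_key, plain_key):
    """Build the openssl command that rewrites a key without a passphrase.

    Both file names are hypothetical placeholders.
    """
    # `openssl rsa -in X -out Y` prompts for the passphrase of X and,
    # because no cipher option is given, writes Y unencrypted.
    return ["openssl", "rsa", "-in", encrypted_key, "-out", plain_key]


def strip_passphrase(encrypted_key, plain_key):
    """Run the command; openssl itself prompts for the passphrase."""
    subprocess.run(strip_passphrase_cmd(encrypted_key, plain_key), check=True)


# Example (hypothetical paths):
# strip_passphrase("mail.key", "mail.nopass.key")
```

An unencrypted key should be readable only by the account that runs the mail server, since anyone holding the file can impersonate the server.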
19 June 7:45 PM: We knew the power outage was coming. There were plenty of batteries and 2 generators standing by at the network center. Power goes out, and we hook up the first generator. The UPSes decide they do not like the power coming from this generator, so they refuse to come off battery. Hook in the second generator, and they don't like that either. Start shutting down secondary systems at this point to conserve battery power. Run both generators and split the load trying to find which devices like which generators. None of the UPSes will come off battery. By now they are starting to beep indicating imminent failure.
We start shutting down primary but non-critical systems now, trying to at least keep basic web access up. The rack decides to have a leg failure at this point and fall over. Nothing hit the floor, but there I am holding this rack up as it is tipping forward - can't let go. Laura and I find some replacement gear and get the rack back to level. In the meantime, the main router has gone offline because its UPS ran out. Then the UPS decides it likes the generator power after all, charges back up for a few seconds and reboots the router, which promptly takes all the power again causing the UPS to shut back down. This happened a few times while we were holding things up.
Power cycling a Unix device just after it has booted and is clearing up disk errors from a previous interruption is bad. Ours probably cycled 4 or 5 times. Shifted the router to the reserve UPS and tried to revive it for 15 minutes or so - pronounced dead around 10AM. Absolutely nothing on the network will get to the internet without this device working. Those devices still powered up were shut down at 1035. For the first time in three years the network center was quiet, other than for a few choice words that may have been dropped. As I was walking out the door to get a sandwich, the power came back up, sort of with a laughing sound.
So the job of rebuilding a router able to support the network started. We had a backup available just for this situation, but it had to be configured to match the current routing. Took about an hour to get the first customers back online, with most up by 1PM. There were a few errors in the routing tables we typed in, so some folks regained access later than others.
Right now we are up with all the basic systems restored. We only have access to 1 of our backbone connections with the backup router, so we are running at 50% bandwidth. Download speeds have been limited to keep everything running. The backup router could be adjusted to handle both backbones, but it would take at least 2 hours. Fixing the main router will probably only take about 3 hours, so we're working on that. Later tonight we should be back to 100%.
There have been many phone calls today, and we have not returned most of them yet. We will try to catch up on Tuesday, but for now the focus is restoring the network to full speed. As of 8PM, all access points are running and everyone should be able to get >300Kbps speeds from the internet. P2P is heavily restricted right now. If you are really down, or really slow please give us a call, even tonight. There may be something we missed affecting your site.
Hot spare router is on the list of upgrades, but it was lower priority than fixing the link problems we've been having the past few weeks. Might have to rethink that prioritization.
9 June mid-day: Fixed the link to the 140K jpeg image. It was late. All of the APs are back online. Most were victims of nearby lightning strikes that crashed the boards so hard that the automatic reboot features were disabled. That's really not supposed to happen. The current status looks a little better: 145 K JPEG of current status. The Waterford access points were different. They were never actually down, but the signal had dropped significantly, like it was doing before we "fixed" them both. Since both access points are mounted on the same mast, we checked that first and found that the mast had slipped in its bracket and both directional antennas were pointed over 20 degrees off. Rotated the mast back to the correct direction and both access points were fine. The brackets were tightened properly, but the wind was just a little too much. So I drilled a hole through the bracket-pole combination and inserted a screw to prevent casual rotation. That should do it this time.
There are still lots of individual customers with radios that need rebooting. If you're offline, just unplug the radio from AC power for a few seconds and then plug it back in. That should wake it up. Some folks have asked why the radios often need to be power cycled like this - many comments suggest shielding the radios to protect them. While protecting the wires into and out of the radios/antenna with surge protectors is possible, shielding the antenna isn't a real option. These are radios, after all. Shielding them from the electromagnetic pulse given from a lightning bolt would also shield them from the radio waves they are designed to capture. There are some mitigation techniques, and we are implementing those as we can on our access points. To do the same with client radios would increase the install cost by about $150. We are instead switching to a new type of client radio that will usually take care of itself if it gets zapped by lightning, at least up to a point.
9 June: Done for the night. AP 3, 13, 58 back up. Bunch more to go, will get to them ASAP Saturday morning. The status board is not supposed to look like this: 140KB jpeg image . The remaining down APs look to be power supply issues. This was a bad night to commit to running on an ambulance crew.
25 May: Beware the cool new bandwidth test from the Loudoun County Broadband office. It does not work. A number of customers have hit me over the head with it asking why they are only getting 500 Kbps. When I monitor their connection, they invariably are running much faster. As a test, I tried the Loudoun speed test, then compared to other speed tests. The Loudoun test consistently gives me numbers between 400 Kbps and 800 Kbps. Other tests give over 2 Mbps at the same time of day. I've contacted the county about this, and they said they are working on the test and will be putting a new one in place very soon. And oh, by the way, sorry for not listing Lucketts.net on the list of broadband providers in Loudoun County.
Loudoun test with inaccurate report: http://www.lucketts.net/speed2.jpg
Valid test of same internet connection: http://www.lucketts.net/speed1.jpg
Lesson learned here: Be careful what bandwidth monitors you use. None are accurate all of the time, but some run by internet professionals are slightly more accurate than those run by govies. This one usually works nicely: http://www.wugnet.com/myspeed/speedtest.asp
Please don't beat us up for a bad test from Loudoun County. Wait, use a valid test and beat us up for that if it gives poor results.
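For anyone who wants to sanity check a speed test by hand, the arithmetic is just bytes moved divided by the time it took. The sketch below is an illustration only: the function names are made up, and it says nothing about how the Loudoun test or any other site actually measures. Taking the median of several runs helps keep one bad reading from skewing the result.

```python
def throughput_kbps(bytes_transferred, seconds):
    """Convert a timed transfer into kilobits per second (1 Kbps = 1000 bits/s)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_transferred * 8 / seconds / 1000


def median_kbps(samples):
    """Median rate over several (bytes, seconds) runs, to damp one-off bad readings."""
    rates = sorted(throughput_kbps(b, s) for b, s in samples)
    mid = len(rates) // 2
    if len(rates) % 2:
        return rates[mid]
    return (rates[mid - 1] + rates[mid]) / 2


# Illustrative, made-up timings: three downloads of a known file size.
runs = [(250_000, 1.0), (62_500, 1.0), (250_000, 1.0)]
print(median_kbps(runs))
```

Timing a download of a file of known size from a nearby server a few times and feeding the numbers in is a reasonable cross-check against any single web test.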
13 May: Wow, that was bad. The 1AM storm came through and only caused a few minor impacts, so I went to bed confident that the network would survive. At 3:10, a massive lightning discharge hit the area, knocking out APs 1, 3, 23, 47, 53, 54, 16, 19, and 27. A few others reset like they were supposed to and came right back up. The damage varied from outright physical damage (dead cards), to CPU lockups, to software corruption. All APs were back up by 1 or 2 PM except for AP19. Their power was still off, so we got them partially back online at 4PM and fully restored by 5:30. All of the offline access points were those nice new ones we're so happy with. Looks like we'll be re-assessing the grounding and surge suppression for those.
A few clients were also damaged, or at least knocked offline hard enough to require hard reboots. We'll be checking individual clients on Monday to catch the ones that are still sick. They don't show up on our normal monitor as a problem, so we'll have to test each by hand. This once a week weather event is getting old. I know it is for the customers.
While out fixing things today, we did get one of those nice-to-do's completed. We now have a 16 Mbps backup link to the second bandwidth provider. This will allow a more stable bandwidth supply, so we will be able to turn the speeds up above 1 Mbps more often. Time and weather permitting I'll try to get that link integrated into the network tonight.
29 April 2007 Double ethernet port failure today shut down all traffic into Taylorstown. This is very rare, and when it happened I assumed it was a simple switch failure. Turns out it was dual ethernet ports on the router and instead of a 20 minute repair it took over 2 hours to rebuild the unit. Rare double failure, don't expect that to happen again, but am looking at quicker turnarounds anyway.
18 April 2007 The network troubles of the past few days have been tough on the customers and I'd like to thank all of you for hanging in there and not beating us up too badly. The largest network impact was the failure of the 30ft mast supporting AP8, AP18, AP28, AP14-2, and AP26. That mast falling interrupted service for about 1/2 of our customers west of Furnace Mountain on Monday. Repair work on Monday night, Tuesday, and Wed morning sometimes affected the rest. Good news is we replaced a lot of old equipment with new stuff, which we had been wanting to do for a while anyway. Almost everyone is back up. There are a couple of customers that will require site visits to get them running again. We're aware of two, there may be a few more.
The AP6 problem appears to be mitigated, and maybe solved. There were many small contributions that piled up and affected a particular problem with wi-fi systems. Took just the right combination of events. By fixing enough of the small problems and adding a few bandaids, we've had the AP running again at full speed for the past 48 hours (other than a few hours Tuesday morning to see what would happen if we removed the bandaids.) Looks like we are through the worst there and can start moving forward again.
Oh yes, we also had power outages at AP10, AP20, AP24, AP40, AP35, and AP7 that lasted longer than the batteries. And the bathroom flooded, shorting out the furnace, soaking the basement drywall, and dripping into boxes of stored books and clothes. And our propane ran out, but that was OK because the furnace was already shorted out and there was no time to shower. And there were the all-nighters all weekend doing the business & personal taxes. And part of the roof blew off the barn. And one customer called 6 times on Monday because he couldn't figure out how to assign an IP to his xbox. It may have been a bit of an overreaction to charge him $180 for the tech support time. OK - we didn't really do that, but the thought was entertained. Overall, I am thankful for our relatively trivial troubles considering what other families were going through on Monday.
We will be picking up on the planned network upgrades ASAP... new AP in Village Green, new high-speed backbone connection to the Waterford AP, and installing customers onto the cell tower site. After another day or two of network cleanup after the wind storm damage.
13 April 2007 part2: OK, it happened again for 8 minutes almost exactly 1 hour later. Has the characteristics of an unannounced reboot by AT&T techs during prime time to fix something, and it didn't get fixed the first time so they tried again. Let's see what happens at 8:26PM.
You can tell when it is an AT&T problem: we have time to put status announcements on the web page. If we mess something up, we're usually in a rush to fix it and the web announcements sometimes get overlooked. By the way, if you've mastered the status monitor and look at it occasionally to see what's up, losing the AT&T router will usually produce a serious amount of red on the chart. That's actually just the status program having a fit when the main connection drops. It would be very rare to have external web links go down at the same time as an internal router.
We did solve one mystery today - the LN08 reboots that affect Taylorstown and Lovettsville. One of our customers called this AM and told us he thinks he is causing our router to reboot. As it is highly unlikely that a customer could cause one of our routers to reboot, and just as unlikely that a customer could in any way determine he is causing our routers to reboot, we pretty much assumed it was just another misconfigured home router. Then I remembered that this customer is one of the few that actually knows IT stuff, and one of our routers did reboot unexpectedly just a few minutes before the call, so I asked "why do you think you're rebooting our routers?" He gave a solid explanation and got our attention.
Turns out he has a VoIP setup that actually causes one of our routers to crash when he turns it on. Every time. His Vonage connection works just fine, but the custom VoIP his work provides causes a problem. We'll work out a solution for this, but it will wait until the weekend so we stop knocking all the VPN customers offline during the workday. This does explain the fair number of LN08 reboots last week as he was trying to figure out the problem, and the 3 or 4 this AM as we verified the problem.
13 April 2007 - AT&T had server problems again. Yesterday it was the DNS servers, today the main router. This is not what the link status display should look like:
2 Apr 2007 - Ok, power was installed to the cell tower today. Everything takes longer than it should. We'll start detailed surveys and performance testing of the gear. It would be good to test out the site completely before putting customers on it.
AP6 issues continue, but are getting better. We sent an email to everyone affected by the problems explaining in detail what is going on. Today we upgraded much of the electrical system supporting the AP, and it is handling the interference better than before. We also tested with a duplicate set of equipment and still saw the underlying problem. There is a radio transmitter sending corrupted data to the AP. We are cruising the neighborhood looking for the source, but it is a very slow process. We'll also be changing out client equipment to stuff that is much tougher.
30 Mar 2007 - Power is supposed to be installed at the cell tower today. If it was, we'll be turning on the equipment over the weekend. Lots of surveys to do, and then lots of installs.
The access point on Limestone School Rd has been having more and more problems recently. We've fixed many of the individual issues, but there has still been a generic issue getting worse each day. Today we found a coax cable that had too sharp a bend, causing signal attenuation. We replaced the cable and now everyone has gained signal strength and better response. This AP should now be all better.
The Stumptown link has been experiencing interference every evening between 4 and 8 PM. It affects a few customers on that AP directly, but one of our data links flows through that site so a big chunk of our bandwidth disappears when the link fails. I'll check things like coax cable this weekend to make sure it is not hardware related.
I've mentioned Pre-N routers before, but it's probably a good idea to bring it up again. Pre-N routers are not compatible with normal wi-fi. They will jam out every normal 2.4 GHz wi-fi radio within approx 100 yards, maybe more, maybe less. Some folks think that's a problem for the other guy, or that since there are no other houses within 100 yards it will not be a problem. The radios we normally use to deliver service to your home are high powered 2.4 GHz radios with very sensitive antennas. A signal that a laptop would pick up at 100 yards max will be visible to our radios at 3 miles. Your Pre-N router will jam out not just the radio on your roof that connects to Lucketts.net, but it will also jam out the radios servicing your neighbors up to a mile away.
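To put a rough number on the 100 yards versus 3 miles comparison, here is a back-of-the-envelope sketch using the standard free-space path loss formula. It assumes clear line of sight with no trees or terrain in the way, and the 2437 MHz frequency (wi-fi channel 6) is picked only for illustration. Under those assumptions, going from 100 yards to 3 miles costs roughly 34 dB of extra path loss, which is the kind of gap that high-gain antennas and more sensitive receivers have to make up.

```python
import math

YARDS_PER_MILE = 1760
METERS_PER_YARD = 0.9144


def fspl_db(distance_m, freq_mhz):
    """Free-space path loss in dB for a distance in meters and a frequency in MHz."""
    distance_km = distance_m / 1000
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44


def extra_loss_db(near_m, far_m):
    """Additional free-space loss going from near_m to far_m (frequency cancels out)."""
    return 20 * math.log10(far_m / near_m)


laptop_range_m = 100 * METERS_PER_YARD                 # 100 yards
radio_range_m = 3 * YARDS_PER_MILE * METERS_PER_YARD   # 3 miles
freq_mhz = 2437  # assumed: wi-fi channel 6, for illustration only

print(round(fspl_db(laptop_range_m, freq_mhz), 1), "dB at 100 yards")
print(round(fspl_db(radio_range_m, freq_mhz), 1), "dB at 3 miles")
print(round(extra_loss_db(laptop_range_m, radio_range_m), 1), "dB difference")
```

Real paths through trees and over hills lose more than this, so treat the figures as a floor rather than a prediction.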
If you have a Pre-N router and would like to stop the jamming, turn off the special features. Set the mode to b/g only, turn off the double channels. Set your radio to a normal wi-fi channel not near the one we use to send data to you. Multi-path antennas and range enhancement are OK. We are looking at using new radios that will work in the vicinity of Pre-N routers, but they cost a lot. If a user in a neighborhood insists on using Pre-N, we will at considerable expense change out the radios of the neighbors as needed to prevent interference. The other drawback to the new radios is that they in turn jam out Pre-N up to a mile away. Sort of irony there.
19 Mar 2007– AP8 more fixed now.<grin> We've shifted more customers off that access point to avoid the issue. The ones remaining have a better, but not great, connection. All can get faster than 1 Mbps. Our goal is 3 to 4 Mbps peak speeds. Not everyone can be shifted off, so we need to find the source of the problem.
The cell tower was powered up today long enough to run some signal checks. Point of Rocks now has good signal. Next we check the western side of Maryland. The electrician is supposed to come out on Tuesday and install power lines. Then we need county and power company.
17 Mar 2007- The problem with Access Point (AP) 8 affecting 20 of the Lovettsville area customers is mostly fixed. We shifted the customers most affected to other access points where possible, and we also found and replaced a bad coax cable high off the ground. Those changes allow everyone to connect normally again. But there is still random interference popping up that throws bad data onto the network, causing the radios and the access point to work their CPUs much harder, sometimes overloading them and requiring a reboot. The access point has locked up twice in the past 4 days, once Thursday overnight and once on Friday afternoon, requiring manual intervention from us. This is not a normal or acceptable situation, and we are trying to find the source.
Most likely source is a customer radio, but it could also be interference from a neighbor of one of our customers. We will replace access point equipment again to make sure the source is not the new AP we installed while trying to fix the problem. On the customer side, a radio problem could be caused by a number of issues. A power hit could have scrambled the radio or the customer's router, causing either to send out bad data. A simple reboot of both would fix this. A customer router may also be running an old version of the router software. Please check your routers to make sure they have the most current firmware loaded. You can find the newest firmware for your router at the web site of the manufacturer, with instructions on how to update it. Also, if anyone has a Linksys WGR54G-v5, please let us know. Those are network killers. They do not work properly on a shared connection without special care and setup.
Another possibility is that a customer or customer neighbor installed one of those new Pre-N wireless routers. Those routers work very well for the owner, but jam all wi-fi for other people within 100 yards. All Pre-N units are supposed to politely share radio channels with existing wi-fi, but they all have buggy code that turns off the polite function, and they just run over the existing wi-fi instead. This is supposed to be fixed when the N-standard is finalized. In the meantime, if you have a Pre-N router in your house, turning off the N features and running it as normal wi-fi will help a lot. You don't really lose much, as dropping your top internal speed from 108 Mbps to 54 Mbps has little effect if your internet connection is down. And when everything works together correctly, the internet connection maxes out at 3 Mbps for most of our customers. Normal wi-fi at 54 Mbps is already much faster than that in the home, and the Pre-N will not make it faster. When we start supplying 30 to 40 Mbps to the home is when the Pre-N stuff will start being useful.
On the topic of much higher speeds: Lucketts.net will be shifting business models soon to directly compete with Fiber. We will match or beat fiber cost and capability in our service areas where fiber is available. We will roll out those same capabilities to areas without fiber as soon as we can. We are finalizing purchase of an OC3, with upgrade capability to OC3x2 or OC3x4. (not a typo - OC3 = 145 Mbps) With this additional bandwidth, all existing customers will be increased to 3 or maybe 4 Mbps peak speeds. We will offer higher speed tiers of service, details still TBD. We will probably offer new tiers of service at 5 Mbps and 20 Mbps, with dedicated links available up to 50 Mbps. Pricing will not go up for any existing customer, and in many cases it will cost less if they stay on the basic plan. Upgrade pricing has not been determined yet.
Furnace Mountain Cell tower: We installed the equipment on the cell tower and did some quick test runs, and it all works. We are waiting for the final AC power installation so that we can run the tower around the clock. Right now we have to bring out batteries and/or generators to do testing. Prelim AC wiring install date is this Monday or Tuesday, with county OK as soon as possible after that.
6 Mar 2007–
Many customers on Access Point 8 are still having difficulties. They will have a good connection for 5 to 30 minutes, then the connection
degrades to become almost unusable for a few minutes. Most likely causes are equipment failure at the access point from water intrusion
or bad power; equipment failure at a customer site that feeds bad data into the access point causing it to lock up, or external
RF interference.
We have replaced the access point electronics, and replaced cabling on Monday. On Tuesday we will replace the actual antenna, winds
permitting. If the problems continue we will spend the rest of the day working with individual customers to either find a broken radio,
or survey the area around their house for external interference, or get them a new connection.
Cell tower news: The equipment is up on the cell tower, but we still need power. The electrician will install AC power ASAP, and then we'll
do a complete system checkout and start hooking up customers waiting for service - after we solve the access point 8 problem.
25 Feb 2007–
Today was a tough day on the network. One of the proxy servers died last night just after we shut the office down. The
machine still responded to the monitor program so no page was sent, but 1/2 the network lost normal browsing. The server was
rebuilt by late afternoon and is working again. While it was down, individual problems were not worked as quickly as our desired
standard. We'll be returning a backlog of calls on Monday to make sure everyone is online.
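A machine that answers the monitor while its proxy service is dead is a classic gap between "host is up" and "service is up". As a rough sketch only (the host name and port in the example are hypothetical, and this is not a description of our actual monitor program), a check can try to open a TCP connection to the port the service listens on instead of just seeing whether the box responds.

```python
import socket


def service_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example with made-up values: page someone when the proxy port stops answering.
# if not service_is_up("proxy1.example.net", 3128):
#     send_page("proxy1 is not accepting connections")  # send_page is hypothetical
```

A port check is still shallow. Fetching a known page through the proxy and checking the reply would catch even more failure modes.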
The cell tower install is scheduled for Tuesday for those waiting for that capability to come online. We will not be doing
any installs or normal upgrades until later in the week. Once the tower is active, we will get started on the backlog of installs for the folks that have
been waiting.
5 Feb 2007–
Someone mentioned in passing that it is getting very cold out. Turns out that the new access points we installed last year are
wonderful, but only as long as the outside temperature is above zero. Last night and probably again tonight, the temperatures will
dip far enough to really mess up those access points.
The failure modes are different on each one, ranging from reboot, to lockup, to disconnecting everyone, to jamming neighboring
access points. Sunday night was pretty grim as 5 AP went down within 30 minutes of each other, for no apparent reason other than the
cold.
Those AP will likely have trouble again tonight. We will reboot and reset as fast as we see the problems come up, but if you are
down in the AM, please give us a call. Don't be insulted though if you get an answering machine. We still have calls from this AM that
have not been returned yet. Sorry. But calling will alert us to a problem we might have missed.
As for a fix, rather than just treating symptoms, we have new access points ready to replace the old ones. The new APs are good
down to -50C, so they should cover us for future cold snaps. We will start replacing last year's models on Tuesday, so hang in there
with us.
If you have problems, give it a reboot once. If that doesn't work, please let us know so we can, well, reboot the AP. Even
we follow that advice sometimes.
14 Jan 2007–
Busy start to the new year so far. Here's a quick status to catch up: Cell tower lease is signed and equipment ordered, so
we will be providing service to areas around Point of Rocks in Feb. New access points have been put up in Evans Pond and south of
Waterford. Staff here has been on vacation for much of Jan, so we're still trying
to catch up with last year's to do list.
FIOS is starting to show up in some places. Our plan to deal with FIOS is to improve our delivery of bandwidth and match their
price. If someone really wants to go with the big company, we understand. But we are hoping many will stay with us if we can
offer comparable service for the same price and give better customer support. Lots of ifs there, but we have been preparing the
backbone network to support very high speed all last year. We're pretty much there now. Still remaining is procurement of lots of bandwidth,
and locking down the design of the future customer equipment that goes into your houses to receive our 5 to 15 Mbps connections. So if
you get the "FIOS is now available at your address" letter, please give us a call before deciding to change. We should be able to match their price,
and very soon be able to match or beat their performance. Hopefully we're already better with customer service.
Old 06 news is archived here: Old News for 2006
Old 05 news is archived here: Old News for 2005
Older 2004 news is archived here: Old News for 2004