SoB: Week 3

Overview

I worked on the goals I outlined last week. While I didn’t achieve all of them, I completed satisfactory work on most. The majority of my time was devoted to reviewing potential bugs from this week and last, writing new tests, and identifying improvements for the existing tests—much like the previous week.

Goals accomplished

Investigated the potential bug in handle_peer_error_or_warning(): As I suspected, this wasn’t an actual bug but rather a mistake in the setup of my harness. You’re supposed to call status_setup_async() before using any functions from common/read_peer_message.h, as it initializes the necessary global state. I’ve fixed that issue, but there are still some errors in the harness.
Reviewed the initial_channel test: This was one of the easier tests to review. During the review, I noticed that it fuzz-tests only one of the three functions in its target file, leaving initial_channel_tx() and channel_update_funding() untested. I added a test for channel_update_funding(), which causes the harness to crash. Based on my initial investigation, it appears to be another improbable bug. Regardless, I’ll discuss it with Matt next week before deciding on the next steps.
Reviewed fuzz-hsm_encryption: This was another easy test to review. I updated it by removing hard-coded values and making a few other minor improvements. I plan to push the changes in the near future.
Reviewed the cryptomsg test: This test fuzzes the cryptomsg_decrypt_body() and cryptomsg_decrypt_header() functions in common/cryptomsg.h. It already covers its target files thoroughly, and I believe there is little room for hidden bugs, so this test is marked as reviewed.
Added a wire test for wireaddr: Wire functions in common/wireaddr.{c, h} consume untrusted input. Since no fuzz-tests existed for these functions, I added a round-trip test for the wire operations of struct wireaddr, similar to the other wire fuzzers. I’ll push the changes upstream soon.
Added a wire test for wireaddr_internal: This test closely mirrors the one I created for wireaddr, primarily because the two are very similar in structure and behavior. To my surprise, this simple test led to the discovery of an actual bug in the address parsing code. The bug is an out-of-bounds error triggered when attempting to read a DNS address of length 255 from the wire. I discussed the issue with Matt and we agreed on an approach for the fix.

Full details of the bug and its fix are available in PR #8325, and the fuzz test can be found in PR #8324.
Added a stateful test for gossip routing: While examining the plugins/ code during my investigation of the potential bug in amount_msat_sub_fee(), I came across two tests—run-route-calc and run-route-overlong—that simulate a sequence of operations: adding a channel, updating a channel, and finding a route between nodes. I thought this pattern could work well as a stateful fuzz test, with each fuzzer input byte representing one of these operations, and proceeded to implement it.

Writing the initial test took much longer than expected—about a day for the first version and another day to resolve the initial bugs. However, the resulting test still fails, and the failure appears to be caused by a genuine bug in route calculation. I’ve done some investigation—which again, took longer than anticipated—and believe it is a real issue, though I can’t say for certain yet. I plan to write a high-level test to simulate triggering this bug through external input, after which we can determine the next steps.

Next week’s goals

I plan to put my current strategy—incrementally reviewing existing tests and working toward writing new ones—on hold for the next week. Many of the tests I’ve worked on are currently blocked by various errors, so I believe it’s better to focus on resolving those issues first before moving on to working on new tests. I’ll still probably review a test or two, but I plan on focusing mainly on resolving the outstanding issues. With this being said, here are some of my immediate goals for the next week:

Fix the handle_peer_error_or_warning() target: i fixed the NULL dereference error in this target this week, but it led to instances where the harness calls exit() in certain code paths. I’ve tried to find a way to intercept these calls, but haven’t had any success. I’ll need to discuss this with Matt in our next meeting on Thursday before deciding whether it’s worth spending more time on this target.
Fix other BOLT #1 message handler targets: The potential bug I encountered in handle_peer_error_or_warning() also affected some other BOLT #1 handlers I was working on—specifically the ping-pong and peer_init handlers. I fixed the issue in both, but that led to new issues emerging. I’ll work on resolving them fully in the coming week.
Investigate the potential bug in the calculate_our_funding(): I tried investigating around this issue but that proved to be inconclusive so I took the matters to Matt. He told me he’d look into it but since he’s just gotten back from vacation and has been dealing with some family issues, he couldn’t find the time to get into it. Until Matt finds the time to look into this, I’ve decided to give this issue another go in the upcoming week. Hopefully I find something conclusive.
Investigate the potential bug in route calculation: This issue has already taken up a significant portion of my time this week, so I plan to give it another day or two at most to see if I can find a way to trigger the bug from a higher level. If that doesn’t yield results, I’ll bring it to Matt’s attention.
Investigate the potential bug in initial_channel_tx(): In addition to the improbable bug in channel_update_funding() mentioned in the Goals Accomplished section, another failure occurred after introducing a test for initial_channel_tx(). Since this bug was discovered at the very end of last week, I haven’t had a chance to investigate it yet, but I plan to do so in the upcoming week.
Review other tests: Only four non-wire tests remain to be reviewed, so if I can spare some time from addressing the current breakages, I’ll aim to complete the review of these tests. That would wrap up a significant portion of the improving existing tests section of my project.

Challenges

I initially thought the work on BOLT #1 message handlers was complete, but after the NULL dereference error turned out to be a false positive, I had to revisit each handler. Unfortunately, all of them are now blocked by separate issues, even after applying the fix. I’ll tackle them one at a time and aim to complete the remaining work on these handlers in the upcoming week.

Apart from that, the biggest roadblocks right now are the crashes occurring across various tests. On top of this, I discovered that my Git stash was corrupted, which contained important progress I had made on some bug investigations—so I’ll have to start over for several of them. One key lesson I’ve taken from this is the importance of addressing errors and crashes as early as possible. Starting next week, I’ll make it a point to resolve these issues as soon as they arise. Haste makes waste after all :)

I uncovered my first real bug in CLN this week, which was quite exciting. Hopefully, the fix I’ve pushed gets merged soon and encourages the maintainers to take fuzz testing more seriously—especially since every other improvement I’ve made to the fuzz testing suite so far has largely been overlooked. In any case, I’m optimistic about finding a few more bugs—hopefully more serious ones—through my investigations in the coming week.

Till next time,

Chandra.

SoB: Week 2

SoB: Week 4