Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POA-155] Capture client/server timeouts as well as Agent parsing errors #245

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

shreys7
Copy link
Member

@shreys7 shreys7 commented Nov 10, 2023

JIRA: https://postmanlabs.atlassian.net/browse/POA-155
Confluence: https://postmanlabs.atlassian.net/l/cp/EtJHwV3u

Changes

In this PR, we have added two sets of changes

  • Capture client and server timeouts
    • We have used tcpStream's Accept function for this purpose. This function gets called on every packet captured from AssembleWithContext function of tcpAssembly utility.
    • The stream holds both the flows of the connection, one for the client and one for the server. When we observer a FIN packet for a particular flow, we check if there's data present on the other side of the flow.
    • If there was no packet seen on the other side, this would indicate that one of the sides closed the connection before even receiving any date from the other side, indicating a timeout.
  • Capture agent parsing errors
    • When the Agent is not able to decode or parse the request/response body, it only used to log the error.
    • With this PR, we will now also capture the same in the method metadata error field of the witness

TODO

@shreys7 shreys7 marked this pull request as draft November 10, 2023 06:58
learn/parse_http.go Outdated Show resolved Hide resolved
pcap/stream.go Outdated Show resolved Hide resolved
pcap/stream.go Outdated Show resolved Hide resolved
pcap/stream.go Outdated Show resolved Hide resolved
pcap/stream.go Outdated Show resolved Hide resolved
trace/backend_collector.go Outdated Show resolved Hide resolved
trace/backend_collector.go Outdated Show resolved Hide resolved
trace/backend_collector.go Outdated Show resolved Hide resolved
trace/backend_collector.go Outdated Show resolved Hide resolved
@@ -314,6 +333,32 @@ func (c *tcpStream) Accept(tcp *layers.TCP, _ gopacket.CaptureInfo, dir reassemb
}
}

// One of the flows initiated a connection close request
if tcp.FIN {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it is important to handle RST too, or at least leave a FIXME about abnormal termination.

If the server crashes hard (a kernel failure) or there is a load balancer problem directing traffic to the wrong server, then the client will get a RST back instead of a FIN. Of course, we might not be still around to see the traffic in that case.

The server getting a RST is a bit more obscure.

The comments here do not make it clear how we handle the FIN-ACK from the side that did not initiate the shutdown.

@shreys7 shreys7 marked this pull request as ready for review November 15, 2023 19:42
Copy link
Contributor

@mgritter mgritter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many unit tests appear to be failing, and I'm not yet convinced that the core logic is working yet. Do you have any test or debugging output that shows otherwise?

firstPacketSeen bool

// Indicates who initiated this flow (client or server)
initiator *tcpFlowInitiator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using go-utils.optional here instead

Comment on lines +96 to +98
func (f *tcpFlow) TcpFlowInitiator() tcpFlowInitiator {
return *f.initiator
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the reasons to worry about using *T instead of optional[T], the dereference here is unchecked.

(And might be the nil pointer derefs you are getting in unit tests? I didn't investigate.)

// and no packets have yet arrived on the reverse flow
// this would indicate a timeout on the current side
if currFlow.TcpFlowInitiator() == TCP_FLOW_INITIATOR_CLIENT {
c.outChan <- currFlow.toPNT(ac.GetCaptureInfo().Timestamp, ac.GetCaptureInfo().Timestamp, akinet.ClientShutdowntMetadata{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Shutdownt" => "Shutdown"?

Comment on lines +365 to +367
// The current flow is the one that initiated the connection close request
// and no packets have yet arrived on the reverse flow
// this would indicate a timeout on the current side
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really thing we need to do the design work here of showing the different scenarios and listing out the possibilities to be sure we're covering these correctly.

Client shutdown:

client -> SYN
               SYN-ACK <- server
client -> SYN
client -> HTTP request  
               ACK   <- server
              ...
client -> FIN
               FIN-ACK <- server

Are we called on every TCP frame, or only those with payloads? If every TCP frame, then wouldn't revFlow.FirstPacketSeen() be true? We probably need to set it only when there are bytes in the stream.

Server shutdown:

as above until
               ...
               FIN <- server 
client -> FIN-ACK

It seems to me that currFlow.TcpFlowInitiator() is not set, because we have not selected a parser.

These are subtle enough that we need unit tests for this code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants