Distributed Tracing with OpenTelemetry and Jaeger
Recently I began working on distributed tracing and tried to implement a simple hello world application to better understand how tracing works. Before writing the program, though, I had to find out what tracing is and how to set up the parts needed to make it work.
TL;DR: I’m going to share how to set up tracing using the OTEL agent and collector alongside Jaeger, plus a full sample HelloWorld application instrumented with the OpenTelemetry SDK to export traces either to the otel-agent or directly to the Jaeger agent, and finally we will check the traces in the Jaeger UI. Looking for code? Check this repo!
This is quite a new subject, and as I was new to tracing myself, I ran into several questions and issues, in some cases because of a lack of documentation. After spending some time I could find the answers or the causes of such behaviors, so I thought it would be a good idea to share this information in this article, and I hope you find it helpful. So let’s begin:
- Are you new to distributed tracing?
- OpenTelemetry and Jaeger services setup
- _ How to deploy components
- __ Production
- __ Test
- Sample HelloWorld Application
- General Suggestions
1- Are you new to distributed tracing?
If you are new to this subject, my suggestion is to first try to implement a sample HelloWorld application yourself, in whichever language you prefer, and search Google step by step for the missing parts; you will find plenty of useful resources. Among them, I found the Jaeger and OpenTelemetry videos on YouTube very interesting and informative. There are more on the CNCF YouTube channel, and the videos by other vendors that provide tracing based on OpenTelemetry are also useful. Additionally, I read these two books:
1- Mastering Distributed Tracing by Yuri Shkuro, who is the creator of Jaeger. I found this book very interesting; if you want a better overview of tracing concepts, and of Jaeger more specifically, it is a great starting point. The text is easy to follow, and overall I suggest you read it if you are completely new to the distributed tracing world. Another nice thing is that most of the examples are provided in several languages side by side, such as Java, Python, and Go. In my opinion, the only downside is that the book was published in 2019 and all the examples are based on OpenTracing, which has since been deprecated: it merged with OpenCensus to form OpenTelemetry. So while the examples give you good insights, keep in mind that you will need to implement them with the OpenTelemetry SDK, and I had quite a hard time finding out how everything works with the OpenTelemetry SDK. :)
2- Distributed Tracing in Practice, published by O’Reilly. This one was published in 2020 and has more information about OpenTelemetry; a variety of tracing concepts are defined here as well. The examples are mostly based on Java and Node.js, and some of them still use OpenTracing instead of OpenTelemetry. I found this one complementary to the first book: some concepts are discussed in more depth or presented better, which makes them easier to understand. However, I still found the content available on the web more interesting, so my suggestion is to read this book if you have time and really want to go in depth on specific details of tracing.
Besides these books, I would like to highlight again that the material available on the web comes in much handier in some cases. The good thing is that there are currently several companies providing tracing as a service, mostly based on OpenTelemetry or supporting it, so even on their websites you can find nice articles and more examples of how everything works.
From here on I assume you have the basic knowledge and are looking for something a bit more serious and practical, so let’s continue.
2- OpenTelemetry and Jaeger services setup
To be honest, I think this is the first place where someone writes about how to use the otel-collector and the Jaeger collector together to get a smooth pipeline for exporting traces from the client to Jaeger. There were some examples of how to use the two together, but they looked like a magic box to me, and it wasn’t clear what the setup would look like in production or in more detail.
One of the main issues was that the example scenarios set up the otel-collector and agents but in the end also used the Jaeger all-in-one image, which by default runs the Jaeger agent and collector as well. At that point I was like, wtf… why do we have both? :D Later I found that this approach is great for testing, but things are different in production.
OK, long story short, the figure below shows what the proper pipeline looks like.
Let me quickly describe the components in the figure above. We have three main parts: our program, which we need to instrument either manually or with auto-instrumentation to create traces; the OpenTelemetry agent/collector; and the Jaeger services. Something to take into account is that we currently need both collectors, the one from otel and the one from Jaeger. Why? Because the Jaeger collector doesn’t support OTLP yet, and on the other side we cannot use the otel-collector to store traces in our DB, so we need a pipeline like the one in the figure.
The otel-agent and otel-collector use exactly the same image, and we could omit the agent and send traces straight from our programs to the otel-collector. However, as we might have lots of traces, that would put too much pressure on the collector: its queues would fill up very quickly, and some traces would be lost or dropped. So the suggestion is to run the otel-agent as close as possible to the source of the traces (our programs) and have it forward them to the collector.
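To make the "queues fill up and traces get dropped" point concrete, here is a minimal Go sketch of a bounded, non-blocking queue. It is not the collector's actual code: the `Span` and `BoundedQueue` types and the queue size are hypothetical, and the sketch only shows what happens to incoming spans once a fixed-size buffer is full.

```go
package main

import "fmt"

// Span is a stand-in for a finished span; real spans carry far more data.
type Span struct{ Name string }

// BoundedQueue is a hypothetical fixed-size buffer. When it is full,
// new spans are dropped instead of blocking the caller.
type BoundedQueue struct {
	ch      chan Span
	Dropped int
}

// NewBoundedQueue creates a queue that can hold at most size spans.
func NewBoundedQueue(size int) *BoundedQueue {
	return &BoundedQueue{ch: make(chan Span, size)}
}

// Offer enqueues a span without blocking and reports whether it was accepted.
func (q *BoundedQueue) Offer(s Span) bool {
	select {
	case q.ch <- s:
		return true
	default:
		q.Dropped++
		return false
	}
}

// Len returns the number of spans currently buffered.
func (q *BoundedQueue) Len() int { return len(q.ch) }

func main() {
	q := NewBoundedQueue(3)
	for i := 0; i < 5; i++ {
		q.Offer(Span{Name: fmt.Sprintf("span-%d", i)})
	}
	fmt.Printf("buffered=%d dropped=%d\n", q.Len(), q.Dropped)
}
```

With nothing draining the queue, only the first three of the five spans are kept. An agent next to each program spreads this buffering over many small queues instead of one shared queue on the collector.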
In my own setup, I deployed the otel-agent as a DaemonSet, and the otel-collector and Jaeger collector as Deployments with 2 replicas.
On the Jaeger side, we need all the components except the jaeger-agent, since we rely on the otel-agent. You can use a different DB for storing the traces/spans, but the suggestion is to use Elasticsearch (ES). To smooth out the flow you can also set up Kafka or other extra components, which you can read more about on the official Jaeger website; the ones shown here are the essential ones. Another thing that wasn’t clear to me in the early stages was what the Spark job is and what role it plays. I found out by not deploying it :) and then checking what broke. Basically, by deploying this Spark component you are deploying a CronJob which by default is set to run every night at around 23:49. It loads all the traces/spans stored in the DB for that specific day and processes them to find the relations between them and build the trace tree. This tree is used to show the overall relationships between services in the Dependency Graph of the Jaeger dashboard. So yes, if you don’t deploy this Spark job, you won’t have any dependency graph!
Please give the Spark job enough memory, because it needs to load a full day of traces into memory to be able to process them. How much depends heavily on the amount of traces you produce per day, but the suggestion is at least 4 to 8 GB of RAM.
Update 29th Nov 2021:
Thanks to a comment from Mr. Yuri Shkuro, I’ve learned that the above sentence about “loading a full day of traces in Memory” is not very accurate. Here is the more precise explanation he gave:
“because Spark breaks traces in batches which are then reduced to the representation needed for the dependency graph, which is tiny compared to the full size of raw traces”.
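The idea of reducing batches of spans to a tiny dependency representation can be sketched in a few lines of Go. This is only an illustration and not how the Jaeger Spark job is implemented: the `Span`, `Edge`, `DependencyCounts`, and `Merge` names are hypothetical, and the sketch assumes each batch contains whole traces.

```go
package main

import (
	"fmt"
	"sort"
)

// Span holds only the fields needed to derive service dependencies.
type Span struct {
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// Edge is one caller -> callee link in the dependency graph.
type Edge struct{ Parent, Child string }

// DependencyCounts reduces one batch of spans to cross-service call counts.
func DependencyCounts(spans []Span) map[Edge]int {
	serviceOf := make(map[string]string, len(spans))
	for _, s := range spans {
		serviceOf[s.SpanID] = s.Service
	}
	counts := make(map[Edge]int)
	for _, s := range spans {
		parent, ok := serviceOf[s.ParentID]
		if !ok || parent == s.Service {
			continue // root span, unknown parent, or a call inside one service
		}
		counts[Edge{parent, s.Service}]++
	}
	return counts
}

// Merge adds the counts of one batch into a running total, so only the
// small edge map has to be kept between batches.
func Merge(total, batch map[Edge]int) {
	for e, n := range batch {
		total[e] += n
	}
}

func main() {
	batches := [][]Span{
		{{"1", "", "main"}, {"2", "1", "queryyer"}, {"3", "1", "formatter"}},
		{{"7", "", "main"}, {"8", "7", "queryyer"}, {"9", "8", "queryyer"}},
	}
	total := make(map[Edge]int)
	for _, batch := range batches {
		Merge(total, DependencyCounts(batch))
	}
	edges := make([]Edge, 0, len(total))
	for e := range total {
		edges = append(edges, e)
	}
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].Parent != edges[j].Parent {
			return edges[i].Parent < edges[j].Parent
		}
		return edges[i].Child < edges[j].Child
	})
	for _, e := range edges {
		fmt.Printf("%s -> %s: %d\n", e.Parent, e.Child, total[e])
	}
}
```

The raw spans of a batch can be discarded as soon as its counts are merged, which is why the reduced representation stays small compared to the traces themselves.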
How to deploy components
There are two different ways to deploy the components in the figure above. One is a quick way for testing, which lets us see the traces from the sample HelloWorld application that I’m going to present later in this article; the other is the production way. I will talk about production first and then the quick way.
- Production
This part depends heavily on your network architecture and services, and can differ case by case. In my case, we had plenty of nodes and services running side by side on EKS. We also had an existing Elasticsearch cluster, so I used that one to store the Jaeger data.
Jaeger
The official Jaeger website has good documentation on how to bring up the services, and the suggested way is to use their operator. However, I found the Helm chart much easier to use. This might just be personal preference, so if you are OK with the Operator, just use that. By the way, here you can find an example of the values file I used to customize the default values of the Jaeger Helm chart. In it, I just enabled the necessary components and connected Jaeger to an existing ES provided by AWS.
OpenTelemetry agent and collector
Here you basically need to decide between two main architectures: deploying the otel collector as a DaemonSet or as a sidecar to your application. Each way has its own pros and cons; to get a better idea, I suggest you read this article, which explains more about it. What comes next depends on your architecture. For instance, in the case of Kubernetes, you first need to create a config file for each of the collector and the agent you are going to use, and then, for example, create a Deployment with a Service for the collector. You can find more about the configuration and how to set up these configs on the OpenTelemetry website, and afterward you can also check out my sample here.
- Test
To set up a test environment, I deployed the otel agent and collector and used the all-in-one image of Jaeger, which makes things easier for testing purposes. Keep in mind that when using this image alongside the otel ones, we have both a Jaeger agent and an otel agent; where the traces are exported to, that is, which endpoint is used, depends on the configuration we choose when instrumenting our code. The configuration files of the agent and collector are available in this repository.
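As a rough illustration of how the instrumented program can decide where its traces go, here is a small Go sketch that picks an exporter kind and endpoint from the environment. The environment variable names `TRACE_EXPORTER` and `TRACE_ENDPOINT` and the `ChooseTarget` helper are hypothetical and not taken from the repository; the default ports are the usual ones for OTLP over gRPC (4317) and the Jaeger agent's compact Thrift UDP port (6831).

```go
package main

import (
	"fmt"
	"os"
)

// ExporterTarget describes where the instrumented program sends its spans.
type ExporterTarget struct {
	Kind     string // "otlp" or "jaeger"
	Endpoint string // host:port of the agent
}

// ChooseTarget reads two hypothetical settings, TRACE_EXPORTER and
// TRACE_ENDPOINT, and falls back to common local defaults.
func ChooseTarget(getenv func(string) string) (ExporterTarget, error) {
	kind := getenv("TRACE_EXPORTER")
	if kind == "" {
		kind = "otlp"
	}
	var defaultEndpoint string
	switch kind {
	case "otlp":
		defaultEndpoint = "localhost:4317" // OTLP over gRPC (otel-agent)
	case "jaeger":
		defaultEndpoint = "localhost:6831" // Jaeger agent, compact Thrift over UDP
	default:
		return ExporterTarget{}, fmt.Errorf("unknown exporter %q", kind)
	}
	endpoint := getenv("TRACE_ENDPOINT")
	if endpoint == "" {
		endpoint = defaultEndpoint
	}
	return ExporterTarget{Kind: kind, Endpoint: endpoint}, nil
}

func main() {
	target, err := ChooseTarget(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("exporting via %s to %s\n", target.Kind, target.Endpoint)
}
```

In a real program the returned target would be handed to the matching exporter constructor of the OpenTelemetry SDK; that part is left out here because the exporter APIs have changed between SDK releases.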
So for testing purposes, I implemented a sample HelloWorld application (an advanced one :) ) in Golang. I chose Go because, first, I found less documentation about it than about Python or, let’s say, Java, and also because in Go no magic can happen, haha: everything has to be coded explicitly, so it might be easier to find out what’s happening. I would also like to mention in advance that all the source code is available in full in my GitHub repository.
3- The Sample HelloWorld Application
I built this example on top of one proposed in the Mastering Distributed Tracing book; the original code is available here. However, I changed the structure quite a bit and redeveloped everything with the new implementation.
Structure
To be able to see the propagation of the trace and how it works across different services, I added a new service for that purpose. Overall, we now have three services:
Main: The main service is the main server and handles requests on port 8080. So as a user, we can run:
curl http://localhost:8080/sayHello/name
where “name” can be anything, but there are several default names defined in the DB, and using those will get you more interesting answers. For example, we can run curl http://localhost:8080/sayHello/trace
Then the main server takes the name and forwards it to queryyer (I chose the name queryyer to match formatter :) ).
Queryyer: This service gets the name from the main server and checks MySQL to see whether any information exists for that name. If so, it sends the information back; otherwise it just returns the name itself.
Formatter: After retrieving the information from the queryyer, the main server sends it to Formatter, which just puts it in the right order, title->name->description, and sends it back to the main server. Finally, the main server returns the response.
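The request flow between the three services can be sketched with the Go standard library alone. This is a simplified stand-in, not the code from the repository: it has no tracing, an in-memory map replaces MySQL, the `trace` row is made up, and the `|`-separated wire format, the greeting text, and the `SayHello` helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// people stands in for the MySQL table; the row below is made up for this sketch.
var people = map[string][2]string{
	"trace": {"Dr.", "Follows a request across services."},
}

// get performs a GET request and returns the response body as a string.
func get(rawURL string) (string, error) {
	resp, err := http.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// queryyerHandler returns "title|name|description" for known names,
// and just the name otherwise.
func queryyerHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("name")
	if row, ok := people[name]; ok {
		fmt.Fprintf(w, "%s|%s|%s", row[0], name, row[1])
		return
	}
	fmt.Fprint(w, name)
}

// formatterHandler puts the fields in the order title -> name -> description.
func formatterHandler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	out := "Hello, " + strings.TrimSpace(q.Get("title")+" "+q.Get("name")) + "!"
	if d := q.Get("description"); d != "" {
		out += " " + d
	}
	fmt.Fprint(w, out)
}

// mainHandler serves /sayHello/<name> by calling queryyer, then formatter.
func mainHandler(queryyerURL, formatterURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		name := strings.TrimPrefix(r.URL.Path, "/sayHello/")
		info, err := get(queryyerURL + "/?name=" + url.QueryEscape(name))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		params := url.Values{"name": {info}}
		if f := strings.SplitN(info, "|", 3); len(f) == 3 {
			params = url.Values{"title": {f[0]}, "name": {f[1]}, "description": {f[2]}}
		}
		out, err := get(formatterURL + "/?" + params.Encode())
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprint(w, out)
	}
}

// SayHello starts the three services on ephemeral local ports and sends
// one request to the main server, like the curl command does.
func SayHello(name string) (string, error) {
	q := httptest.NewServer(http.HandlerFunc(queryyerHandler))
	defer q.Close()
	f := httptest.NewServer(http.HandlerFunc(formatterHandler))
	defer f.Close()
	m := httptest.NewServer(mainHandler(q.URL, f.URL))
	defer m.Close()
	return get(m.URL + "/sayHello/" + url.PathEscape(name))
}

func main() {
	for _, name := range []string{"trace", "bob"} {
		out, err := SayHello(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

Each of the two outgoing calls in `mainHandler` is a hop where a real, instrumented version would have to pass the trace context along, which is what makes this small layout useful for watching propagation.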
I tried to comment the code as much as possible to make it more understandable, and I also tried to implement different features. The first is that you can choose between using OTLP to send traces to the OTEL agent, or using the Jaeger exporter to export traces directly to your Jaeger agent. The second is that I implemented baggage propagation and tried to pass extra info as baggage to another service.
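To give a feeling for what baggage propagation does on the wire, here is a minimal Go sketch that writes key/value pairs into the W3C `baggage` HTTP header on the caller side and reads them back on the callee side. It is a hand-rolled illustration using only the standard library; in an instrumented program the OpenTelemetry SDK's propagators take care of this. The `InjectBaggage`, `ExtractBaggage`, and `RoundTrip` helpers and the sample keys are hypothetical, and the sketch ignores optional member properties.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strings"
)

// InjectBaggage writes the items into the W3C "baggage" header as
// comma-separated key=value members with percent-encoded values.
func InjectBaggage(h http.Header, items map[string]string) {
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys) // keep the header value deterministic
	members := make([]string, 0, len(keys))
	for _, k := range keys {
		members = append(members, k+"="+url.PathEscape(items[k]))
	}
	h.Set("baggage", strings.Join(members, ","))
}

// ExtractBaggage parses the "baggage" header back into a map.
// Malformed members are skipped.
func ExtractBaggage(h http.Header) map[string]string {
	out := make(map[string]string)
	for _, member := range strings.Split(h.Get("baggage"), ",") {
		// Drop optional ";property" metadata, which this sketch ignores.
		member = strings.SplitN(member, ";", 2)[0]
		kv := strings.SplitN(strings.TrimSpace(member), "=", 2)
		if len(kv) != 2 || kv[0] == "" {
			continue
		}
		value, err := url.PathUnescape(strings.TrimSpace(kv[1]))
		if err != nil {
			continue
		}
		out[strings.TrimSpace(kv[0])] = value
	}
	return out
}

// RoundTrip injects items into a fresh header and extracts them again,
// returning the raw header value and the decoded baggage.
func RoundTrip(items map[string]string) (string, map[string]string) {
	h := http.Header{}
	InjectBaggage(h, items)
	return h.Get("baggage"), ExtractBaggage(h)
}

func main() {
	raw, decoded := RoundTrip(map[string]string{
		"client": "main-server",
		"note":   "hello, world",
	})
	fmt.Println("baggage header:", raw)
	fmt.Println("decoded note:", decoded["note"])
}
```

Because baggage travels in a plain header next to the trace context, every service on the path can read it, so it is best kept small and free of sensitive data.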
Result
Here is a screenshot of the Jaeger dashboard after running the curl command once. We can see that the trace passed successfully through all of our services.
4- General Suggestions
While I was developing this application, I faced several issues, some of which took me days to solve. Here are my strategies and suggestions in case you face any issues as well:
1- Check the release notes! When a command is not recognized in your development environment, or you follow an example from the web but do not get the same result, make sure that the specific command still exists and is in use. As both Jaeger and OpenTelemetry are still under active development, each release might deprecate some previous procedure, so try to check the latest changes regularly with each release.
2- Join the OpenTelemetry and Jaeger Slack channels! These Slack channels are very active and you will get an answer quite fast. My suggestion is to search the channels for your issue before asking a question; several times I found that my issue had already been discussed by others. This avoids flooding the channels with duplicate messages.
Conclusion
I really wanted to write more, especially about the code I used, but that would make this article quite lengthy, so I will try to write another one soon. Still, you can check the repo; all the code you need is available there. I will try to write more articles as I gain more experience with tracing, and keep in mind that I’m still new to this world myself! I would be glad to hear your opinion, so don’t hesitate to suggest any improvements. Cheers!