Let's start this week's Streaming Audio with a question. Do you remember being a beginner? Do you remember first coming into this industry and thinking, "Oh, my God, there is so much to learn." Maybe you're still feeling that. I know I do sometimes.
I was on vacation last week. I was on a boat, and I found myself wondering, "Do sailors ever get used to the size of the ocean?" I mean, they must do just like we get used to the size of our field. But it's interesting sometimes to step back and try and see it afresh, see it with new eyes and wonder, "What do you think would matter if you saw this industry again from scratch? What would you explore first? How would you navigate your way through our ocean?"
Well, I had a chance to get some of that fresh perspective firsthand, because we have an internship program here at Confluent. I thought I'd ask one of our interns to be brave and come and tell us about his experience diving into this world for the first time. What did he learn? What did he need to learn? What did he do? What caught his interest?
One thing he did, which I thought was really interesting, was a natural language processing application for Reddit threads, but he also had some surprising takes about balancing technical skills with soft skills and business skills. He's got a really nice perspective. If you're new to this industry, I think you're going to hear this episode and realize we're all in the same boat at the end of the day. If you're a veteran, well, it might leave you better equipped to deal with some new hires and see things from their perspective.
Before we get started, this podcast is brought to you by Confluent Developer, which is our education site for Kafka. More about that at the end, but for now I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
My guest today is Shufan Liu. Welcome to the show, Shufan.
Hi. Thank you. Thanks for having me.
It's good to have you here. We have got you here because, let me see if I've got this right, you are studying at the University of Pennsylvania.
Yes, for master's.
Yeah, you're doing your master's. You've come across from a business degree to a computer science master's.
That is correct.
Let's start there. Why did you do that?
I wasn't confident that I was smart enough to handle technical stuff when I first went into college, but I didn't want to say goodbye to it completely, so I picked up a couple of computer science classes during the course of my undergrad. From those courses, I felt, "Oh, that is my stuff," so I decided to study more in a master's program. I applied to one of the technical computer science programs. Luckily, I got in, and then I happily started to pursue my career in technology.
As part of that, you've been doing an internship with us at Confluent, right?
Now I was trying to think back to my days when I first went from university into industry, and I remember being kind of overwhelmed by what an ocean of stuff there was to learn.
Well, I can assure you that feeling hasn't gone away. That's the same with me.
What's the most important stuff you've been picking up?
Well, trying to find the balance between reaching out and asking people questions, and actually diving into the documentation and trying to solve the problem myself. That's a hard balance to find, but it's something important, I think, at the start of a career.
That's interesting, because I would've thought a lot of people would answer that question with something technical, but you've gone straight to the really hard stuff of [inaudible 00:04:09] science, which is balancing when to talk to people.
Well, that kind of includes my struggle with technical stuff, because mostly what I ask about is technical stuff, whether I'm reading the documentation or reaching out to people and bothering them.
Well, one of the jobs of an intern is to make sure they're grabbing information from everybody they can.
Oh, well, I hope nobody is bored of me yet, but I ask all my questions.
But that's good. That's good. One of the reasons I wanted to get you in talking to us is that whole fresh perspective thing, right? A lot of people listening to this podcast will have been in the industry for quite a while, maybe have used Kafka for quite a while, and we've simply forgotten what it's like to see this with fresh eyes.
I'm going to ask you some Kafka questions, but tell me what's it like coming into a technical company for the first time?
Everybody is so relaxed. Everybody has a very chill feeling doing their work, but they always get things done. That's something very impressive that I discovered in tech companies.
You think we seem relaxed on the surface?
Well, yes, but I know for sure that everybody's working really hard to get things done because those things are really hard to get done.
Yeah. We're just as much as anyone beating our heads against the computers and then trying to smile afterwards. But you've come in slightly... I mean my angle, I came in straight as a computer science major, I think is the American term, into a programming job. You wanted to go into DevX, DevRel, like the marriage between programming and talking to people.
Why choose that?
Well, part of the reason is because I am new to the industry. I want to experience as much as possible to discover where my interest is. Software engineering seemed like a very heavy job to do in my first year, for my first summer internship. A lot of my cohort friends got internships in software engineering, but I wanted to try something different. I wanted to try a job that helped prepare me for a software engineering career, but also gave me a perspective on what the industry looks like in the bigger picture.
Working in DevX or DevRel, I have the opportunity to work with the marketing team and the product management team, and I can learn the demands of our customers, that is, developers. I can try to learn what they're thinking and practice the way I communicate with them. I think that's going to be beneficial for me in my career.
Yeah, that's a very smart perspective. I think a lot of us focus on getting really good at programming, and then wake up one day and realize we actually have to talk to the rest of the company for it to be worth anything.
Well, I still envy them for having good programming skills. I need to practice that. That's something I really want as well.
Yeah. Well, the great thing about computers is they'll hold you to account. It's the human stuff you have to kind of mercurially judge your way through, right?
What's it been like being an intern? What have you learned? Tell me some things.
Well, I think the first thing is: what is DevX? What is a Developer Advocate? What do we do? Coming into Confluent, I had no idea what a developer advocate is until I met Danica, who introduced a whole different world to me. I realized why it's important to a company, especially to a developer-facing company like Confluent. Confluent, with the whole DevX team, spends a lot of time educating the developer community. It's our job to maintain that activity and try to help them with the best knowledge we can.
Which parts of that did you get involved in and which parts did you enjoy?
I enjoy writing blogs and thinking as the audience. What do they want to hear from me? What do they want to know from Confluent? And trying to bring my best knowledge to them. That's some really good experience. I think it's a unique experience that I wouldn't have gotten if I were a software engineering intern.
Yeah, it's something I think I mostly learned from user interface design, but that empathy for the user is such a vital skill.
I agree, yes. Empathy.
I kind of feel your business degree sneaking in here, too.
The whole holistic, what's actually going to benefit the business perspective. You think that's true?
Well, now that you mention it, that's probably intuitive. I didn't realize it, but it just blends well with my work.
Yeah. What was the blog post you wrote about which I'm sure we will link to in the show notes?
My internship kind of divided into two parts. The first part was working with Danica, where I tried to extend her data pipeline. I wrote my first blog post on Kafka from a rookie's perspective. I specifically described the process of building a data pipeline with Apache Kafka, and how to extend the data pipeline using Cluster Linking on Confluent Cloud.
Right, you went straight into cluster linking.
That's biting off a big topic.
Well, thanks to Danica. We did pretty much everything together, and she's helpful. I kept asking her questions instead of diving into the docs. That's my cheat code.
Well, we could get Danica in the room for a follow-up podcast reviewing you, but I don't think we'll do that. We'll skip over that.
She'll work it out with you.
I know you two have been working very closely, and she is great to work with. Let's just say that for the record.
Everyone in DevX team is great to work with.
Oh, you're too kind. Was there a particular reason or something that interested you about getting involved in the Kafka side of things?
I didn't know what Kafka was before coming into Confluent, let's say before my interview with Confluent. I knew Confluent was a great company to work for, but I really didn't know what its product was. But when the interview opportunity showed up in front of me, I started to learn what Kafka is and what Confluent is doing. I started to see the business value inside Confluent: why it's so important to developers, why it's so important to businesses, especially those with demand for fault-tolerant, highly elastic message queue stuff.
Yeah. I wonder, there must be a lot of people listening to this thinking, "Okay, well they've got some maybe junior developers coming in or someone of about your experience." What is it that we should be teaching them that matters about this? From what you've learned, what would you say to someone with a similar level of experience to yourself? What are the things to grasp mentally from it all?
Well, I think the Confluent Developer Kafka 101 course is really helpful. Starting with it was good for me. Thanks, Tim Berglund. I had a very holistic picture of what Kafka is before coming in. I think the best way to get to know Kafka is to start trying to build a project with Kafka and see what Kafka can do within the project.
That's what I did for my second project. Once I got familiar with building a data pipeline using Kafka, I started to build my own project, with the purpose of educating, as our developer advocates do. Well, "educate" seems too much of a selfish word. It seems like too big a word-
No, no, not at all. You've learned some things, you're sharing them. That's all education is.
Well, thank you. I'm not sure if I'm qualified to say I'm educating people, as I'm a fresh rookie here.
Okay. Let's move on from the humility. Tell me what you built.
The second project is building a microservice architecture application with Kafka and that's kind of inspired by Dave Klein's microservice pizza application and-
Oh, yeah. I've seen that one.
I was thinking maybe I could build something more interesting than a pizza. Sorry, Dave Klein. That was a very good project, but I was thinking maybe I could do something a little bit more complicated than that, so I decided to use a microservice application to build sentiment analysis on Reddit. My application prompts the user for a request: which subreddit they want to analyze, over a specific time range. With my microservice architecture, the application first pulls the Reddit threads from the Reddit API; the data then flows through Apache Kafka topics and microservices that apply a sentiment score to each of the subreddit threads, calculate the average at the end, and figure out a way to display it to the user.
Okay. So you are splitting out the idea of gathering a large amount of data.
Then somehow processing that, in a probably quite time-consuming way, with sentiment analysis.
Yeah, it processes pretty fast.
Okay. But it's something you wanted to separate out from the gathering phase?
What do you mean the gathering phase?
The phase where you gather the data from Reddit. You've got that as a separate thing to let's process this and analyze it.
Yes, they are separate microservices.
And then from there it dumps the analysis output and you think about displaying it separately.
Yes. Those are from different microservice application and they communicate through Apache Kafka topics.
That's a juicy enough template application for this kind of thing. It's something we'll all end up doing in one shape or form.
If you're dealing with Kafka in anger, right?
I have to ask this, did you do that because you thought it was the right solution to the problem or because you wanted to exercise the system and get a feel for it or both?
I wanted to build a data pipeline from scratch, completely myself, because the first project was built upon Danica's project. For the second project, I really wanted to do something that I enjoy doing. I enjoyed extending Danica's project, but I wanted to have something of my own and to think about something interesting. From what I'd learned at university, sentiment analysis came to mind, and I started to think from there: how do I integrate a Kafka application with sentiment analysis? I talked to a lot of people, and here's a problem I can solve with Apache Kafka and with microservices.
It's a good one, I think. The idea of: what is the sentiment behind a Reddit thread or a subreddit? That's the word I'm looking for.
I think we can probably all think of a couple of sub-Reddits where we know the sentiment just from the topic, right? It's a very polarized place sometimes.
The worst sub-Reddit is politics. I think that's not [inaudible 00:18:20].
I think regardless of people's politics, they probably agree that the politics sub-Reddit is pretty spicy. Let's use that word.
That's a good word. That's a good word choice. On the other hand, I think one of the benefits of my application is that it provides a way to quantify sentiment over a time range. When you go over social media, you feel like, "Ah, I can feel the bad vibe. I can feel the positive vibe," looking through the threads, but it's hard to quantify and compare those sentiments. Thanks to Kafka and microservices, my application actually provides a pretty good way to summarize and quantify them, and maybe for users to compare them.
Yeah. That's what gets really interesting, something like that, where you are automating, comparing lots of different threads or lots of different sub-Reddits and getting a quantitative balance between them, I would think.
Another example is analyzing the same subreddit over different times. For example, there are all kinds of sports subreddits. The sentiment when a team is on a winning streak is probably different from the sentiment when the team is on a losing streak. I have one example shown in a blog post that is coming out soon, so stay tuned.
Okay. We'll link to that in the show notes. I'm not really a sports person, but I would've thought at the start of each season, all the supporters are really optimistic, and you've got a varying-length window for the team to actually do well before they all start trashing the team. Are there any patterns like that?
Well, before the season starts, there's a bunch of pre-season games. If your team performed well in those pre-season games, the sentiment is probably higher, but that really depends on each team.
Is sports the sentiment that you personally wanted to get to?
That's your hobby topic?
Yes. I really want to use that as an example to test if my sentiment application works, and luckily it worked just the way I thought.
Test in what way?
Whether the sentiment is accurate, because I can choose a time period when my team is doing great and quantify the sentiment, then compare it to a period of time when the team is doing badly. You can see the negative score is significantly higher when the team is doing badly.
So you're testing it against your experience and intuitions about the team's performance.
I see. Did you find any other interesting patterns from the data?
Well, it really depends on the subreddit, right? Another interesting thing is maybe... Do you play video games?
Sometimes. Probably more than I should.
Well, there are some video games that really get their fans hyped up before release. Unfortunately, sometimes when the game gets released, the fans start to realize it's not what the marketing team promised it to be, and the sentiment drops. That's another example of how people react on social media.
Oh, yeah. Do you know, that makes me think: change in sentiment over time. I would love to see this project run against a few NFT projects. I'm not blanket covering all of them, but some of them are hype trains that build and build and crash.
Well, I mean, there are blockchain subreddits, and there are some stock-investing subreddits. It's really interesting to see their reactions over time. Right.
Yeah. I wonder if it ever leads. I mean, if you could use that change in sentiment on Reddit to analyze where you think the market's going to go.
I doubt it, but I think that's a good application.
I think you've come up with a good project because it sparks questions that you then want to use the tool to answer. Right?
Well, I... Go ahead.
You've written this up and you're going to publish the source code, right?
Okay. So maybe you should talk us through it. It's going to be there for people to try out and try their own predictions. Talk to us a bit more about how it's built. What language did you use? What are the services actually coded as, that kind of thing?
I used Python for each of the microservices, and there are, I think, four microservices. The first one gets user input for each request. Each request consists of a subreddit name, a start date, and an end date. After that microservice, the request is appended as a message onto a Kafka topic, and then another microservice called the API Poller consumes that message from the previous topic and pulls all the subreddit threads accordingly. Once that happens, it appends all the threads to another topic, which is consumed by the sentiment analysis service. The sentiment analysis microservice just appends a sentiment score to each of the threads.
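Shufan's description maps neatly onto four stages connected by topics. Here's a compressed, self-contained sketch of that shape, with in-memory `queue.Queue` objects standing in for the Kafka topics and the Reddit API and scorer stubbed out. All names here (`accept_request`, `api_poller`, and so on) are illustrative, not from his actual code.

```python
import queue

# In-memory stand-ins for the Kafka topics between the microservices.
requests_topic, threads_topic, scored_topic = (queue.Queue() for _ in range(3))

def accept_request(request_id, subreddit, start, end):
    """Microservice 1: publish the user's request, keyed by request ID."""
    requests_topic.put({"request_id": request_id, "subreddit": subreddit,
                        "start": start, "end": end})

def api_poller(fetch):
    """Microservice 2: pull matching threads (the Reddit API is injected as `fetch`)."""
    req = requests_topic.get()
    for text in fetch(req["subreddit"], req["start"], req["end"]):
        threads_topic.put({"request_id": req["request_id"], "text": text})

def sentiment_service(score):
    """Microservice 3: attach a sentiment score to each thread."""
    while not threads_topic.empty():
        msg = threads_topic.get()
        scored_topic.put({**msg, "score": score(msg["text"])})

def display():
    """Microservice 4: average the scores per request and report."""
    totals = {}
    while not scored_topic.empty():
        msg = scored_topic.get()
        n, s = totals.get(msg["request_id"], (0, 0.0))
        totals[msg["request_id"]] = (n + 1, s + msg["score"])
    return {rid: s / n for rid, (n, s) in totals.items()}

# Wire it together with stubbed Reddit data and a trivial scorer.
accept_request(1, "nba", "2022-06-01", "2022-06-30")
api_poller(lambda sub, start, end: ["great win tonight", "awful refs, bad loss"])
sentiment_service(lambda text: 1.0 if "great" in text else -1.0)
averages = display()
print(averages)  # {1: 0.0}
```

In the real system each stage is a separate process and the queues are durable, partitioned Kafka topics, which is what lets each stage scale and fail independently.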
How does that work? Let's just slow down on that one a bit. How do you calculate a sentiment?
I just use one very convenient Python package called NLTK. It's three lines of Python code. You can check it out in my repo.
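The transcript doesn't include the snippet itself, and NLTK's analyzer (typically VADER's `SentimentIntensityAnalyzer`) does far more, so treat the following as a toy illustration of lexicon-based scoring with an invented mini-lexicon, not Shufan's code:

```python
# Toy lexicon-based sentiment scorer. NLTK's VADER works on the same
# principle, but with a much richer lexicon plus rules for negation,
# punctuation, and intensifiers.
LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0,
           "bad": -1.0, "awful": -1.5, "hate": -2.0}

def sentiment(text):
    """Average the lexicon scores of the words in `text`; 0.0 if none match."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment("I love this great game"))  # 1.75
print(sentiment("what an awful season"))    # -1.5
```

Because the scorer is a pure function of the message text, it drops cleanly into a consumer loop: read a thread, attach `sentiment(thread)`, produce the enriched record downstream.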
Okay. So you get sentiment score?
Mm-hmm. The sentiment score applies to each of the subreddit threads. I use [inaudible 00:25:09] to calculate the average from the streaming data. After that, I create another table in [inaudible 00:25:18]; that table also corresponds to another Kafka topic, which I consume from my last application, called Display. That is to-
A good name.
Yeah. Very, very good. That is to consume the Kafka KTable from [inaudible 00:25:39] and display the results to our user.
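Whatever fills the [inaudible] gaps here (the KTable mention suggests ksqlDB or Kafka Streams), the aggregation Shufan describes maintains a running average per key, updated as each scored thread arrives. A minimal plain-Python sketch of that table semantics, with invented message shapes:

```python
# KTable-style state: the latest running average per request ID.
# Each incoming scored thread updates the row for its key.
table = {}  # request_id -> (count, mean)

def update(request_id, score):
    """Fold one scored thread into the running average for its request."""
    count, mean = table.get(request_id, (0, 0.0))
    count += 1
    mean += (score - mean) / count  # incremental mean: no stored sum needed
    table[request_id] = (count, mean)
    return mean

# Simulate a stream of (request_id, sentiment_score) messages.
for rid, score in [(1, 0.8), (1, -0.2), (2, 0.5)]:
    update(rid, score)

print(table)
```

With the three sample messages above, request 1 ends with a running mean of about 0.3 and request 2 with exactly 0.5. The downstream Display service only ever reads the latest row per key, which is exactly the table-over-changelog view a KTable gives you.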
Right. I can see how that pipeline puts together. I also, I think, see how that would scale quite well. Have you load tested it at all?
I have not, but I built this application to make sure that it can scale well with a microservice architecture and Apache Kafka, which is built to be fault tolerant and highly elastic.
Yeah, I can see that. I can see how the architecture would lend itself towards that kind of scale, which I think probably most of our listeners will be able to see, I would hope. But there's finite time in the day and definitely finite time when you're an intern. You probably haven't load tested that hugely in anger.
Well, maybe I should try to do that and put it in my blog, because the purpose of this is to show people that microservices and Apache Kafka are such a good combination for making applications that are very easy to scale horizontally.
What's the primary key on all those partitions? What's the key? Is it the subreddit?
For the first topic, the user input, the key is the request ID. Each request has a unique ID, and we use that for aggregation later on, in case of [inaudible 00:27:27].
All right. So you could shard it by user request.
You could scale it per user request. That makes sense.
Each request is actually unique in this application.
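Keying by request ID is what makes that scaling story work: Kafka routes every message with the same key to the same partition, so one consumer instance sees all the threads for a given request and can aggregate them locally. Kafka's default partitioners hash the key modulo the partition count (murmur2 in the Java client); this sketch uses `zlib.crc32` with a made-up partition count just to show the idea:

```python
import zlib

NUM_PARTITIONS = 6  # hypothetical partition count for illustration

def partition_for(request_id: str) -> int:
    """Route a message to a partition by hashing its key (a stand-in for
    Kafka's default key-hashing partitioner)."""
    return zlib.crc32(request_id.encode()) % NUM_PARTITIONS

# Every message for a given request lands on the same partition...
assert partition_for("req-42") == partition_for("req-42")
# ...while different requests spread across the available partitions.
print({rid: partition_for(rid) for rid in ["req-1", "req-2", "req-3"]})
```

Had the key been the subreddit instead, one very popular subreddit could hot-spot a single partition, which is one reason a unique request ID is a reasonable keying choice here.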
Okay. I like this midterm project. This feels very tasty to me.
Did you enjoy working on it?
I do. I bumped into a lot of technical questions. As I said, it's a good way to practice finding the balance between reading docs and bothering people. I always choose the easy way: bothering people. Everyone on the DevX team is being bothered by me constantly.
What do you think are the most important things you learn from an internship in the industry?
Well, the first thing I learned is what DevX is, right? I think the most important stuff is learning Kafka and Confluent Cloud, and trying to figure out the use cases for Kafka. I think at the end of the day, only the technical stuff really, really matters in career growth. I mean, there's a lot of practice time for soft skills, but really the most important thing to take away from an internship is the technical stuff that I learned. That is the top.
That's good. I think the heart of what we do in DevX has to be development. It has to be software engineering.
One thing that I like about the DevX team is that you have to constantly learn new stuff and try to pick up new technology really fast. One thing that I really enjoy about being a developer advocate is that I have the opportunity to get exposed to technologies that are new and interesting, and to keep thinking about how I can integrate those technologies with Apache Kafka.
Yeah. That's something I enjoy about it. You get to be very technical, but you get to be imaginative and thinking, "How can I fit these ideas together and explain them and use them in interesting ways? How can I cheat and get paid to build things that I find interesting?"
I don't know that's something that we can discuss publicly, but yes, I agree with you.
I think it serves both sides, to be honest. I think you are most interesting talking about the things that you find interesting. I think you've chosen a great project in that it's something you find inherently interesting and useful. And so would I actually. I think I might have to check out your code and play around with it.
There are some synthesizer sub-Reddits I'd like to analyze.
Let me know, and we can do it together.
Oh, yeah, absolutely, in the future. I want to wrap up by asking you a couple of questions that you may not be able to answer 100% honestly, but I'm going to ask them anyway and see how unguarded you're feeling.
Would you work in DevX again when you finish your master's, and would you work for Confluent again? Would you use Kafka again? All three of those. You can be as honest as you like about any of them.
I would definitely work for DevX team and Confluent. It's been such a great experience. I enjoy talking to everyone in this team, and everyone is so supportive and helpful to me. I think this is a great place for me to grow in terms of career and in terms of learning new stuff.
I probably will write more code as a software developer, but I definitely will learn more of the tech stack with the DevX team. In terms of Apache Kafka, I don't see why not. It's the best message queue that [inaudible 00:32:00]. That's a bad word to describe Apache Kafka. It's not a message queue, but it's like a message queue. It's a great tool.
It's kind of a super set.
Yes. That's what I used it for in the microservices. I used it as a message queue, but it can do a lot more than being a message queue.
Yeah, I put you on the spot there. Thank you for fielding those questions. I'll say from my side, it's been really great working with you. I wish you could stick around longer.
Maybe we'll see you on the podcast sometime in the future when you're analyzing the sentiments of sports teams in a way that... You'll probably be cast in the sequel to Moneyball one day. That's my prediction.
Did you see that film? That's a really good film.
I did not, but-
Oh, you've got to see that film. It's all about data and numerical analysis of sports. Love it.
That's my dream job. Oh, my God.
Cool. On that, Shufan, great talking to you. Thanks for joining us on Streaming Audio.
I'll catch you soon.
Well, by the time you hear this, Shufan is going to be back in academia. Will he be writing his thesis on real time sentiment analysis? Will he be arguing about basketball online? I don't know. Both, I hope. Whatever it is, good luck, Shufan. It was great having you around here, and I wish you all the best.
To make the most of that knowledge, you're going to need a Kafka cluster. So you can try spinning one out at confluent.cloud, which is our Kafka cloud service. You can sign up and have a cluster running reliably in minutes. If you add the code PODCAST100 to your account, you'll get some extra free credit to run with.
Meanwhile, if you've enjoyed this episode, then do click like and subscribe and the rating buttons and all those good things. It helps people that like this kind of information to find us. It also helps us know which episodes you want to hear more of, which topics you want us to explore in more detail. If you want to get in touch with me directly, as always, my Twitter handle is in the show notes.
With that, it just reminds me to thank Shufan Liu for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.
How do you analyze Reddit sentiment with Apache Kafka® and microservices? Bringing the fresh perspective of someone who is both new to Kafka and the industry, Shufan Liu, nascent Developer Advocate at Confluent, discusses projects he has worked on during his summer internship—a Cluster Linking extension to a conceptual data pipeline project, and a microservice-based Reddit sentiment-analysis project. Shufan demonstrates that it’s possible to quickly get up to speed with the tools in the Kafka ecosystem and to start building something productive early on in your journey.
Shufan's Cluster Linking project extends a demo by Danica Fine (Senior Developer Advocate, Confluent) that uses a Kafka-based data pipeline to address the challenge of automatic houseplant watering. He discusses his contribution to the project and shares details in his blog—Data Enrichment in Existing Data Pipelines Using Confluent Cloud.
The second project Shufan presents is a sentiment analysis system that gathers data from a given subreddit, then assigns the data a sentiment score. He points out that its results would be hard to duplicate manually by simply reading through a subreddit—you really need the assistance of AI. The project consists of four microservices: one that collects the user's request (a subreddit name plus a start and end date), an API poller that fetches the matching Reddit threads, a sentiment-analysis service that scores each thread, and a display service that averages and presents the results.
Interesting subreddits that Shufan has analyzed for sentiment include gaming forums before and after key releases; crypto and stock trading forums at various meaningful points in time; and sports-related forums both before the season and several games into it.
If there's something you want to know about Apache Kafka, Confluent, or event streaming, please send us an email with your question and we'll hope to answer it on the next episode of Ask Confluent.