Getting It Out There

Today, the preprint for my thesis finally goes live on arXiv. The paper is a condensed version of the original, hundred-or-so page manuscript that now sits in hardback on my shelf among my prized collection of textbooks and monographs. My initiation, so to speak, into the world of academia. My adviser and examiners had the idea of putting up a preprint while we send the paper out for review in some journals.

While I hardly think I have any “regular” readers to speak of, I’m nevertheless apologizing to the casual visitor who may have gotten used to me writing about books and travel, as I am now about to enter mathematical territory. But I promise to be as accessible as possible. I just figured I’d like to celebrate this occasion with a short note about my thesis, and maybe wax optimistic about my future plans.

In loving memory of my old laptop, Caecilia, on which much of the coding work for the thesis was done.

In the summer of 2017, I graduated with my bachelor’s degree in Statistics from the University of the Philippines. Around this time, the idea of pursuing graduate studies was already floating around in my head, but I hadn’t yet fully decided on it. After all, my idea of graduate school life was, to say the least, less than ideal: broke, feeling left behind as friends start getting bonus paychecks and taking up expensive hobbies, working tirelessly on a project that may end up being totally irrelevant to the field at large. More importantly, I wasn’t exactly adept at thinking up research problems. Up to that point, in whatever mildly successful research project I’d been a part of, the idea had come from a helpful tip from a professor, or from the innovative question of someone else on the team. In other words, innovation wasn’t my forte. If I entered grad school, I was sure I’d get eaten up by the time thesis work came on the horizon.

Then I came upon two important milestones. The first was that I started working as a quality control statistician for a major industry player in August, two months after my graduation. I soon realized that the rigor and structure that I’d built my life around no longer applied in the corporate world. Being the subject matter expert in statistics and data analysis in the division where I got assigned, I ended up having nobody at work who could teach me more about my field. Which isn’t to say I learned nothing from the people at work. I have stayed in the same company for four years and counting because the division is staffed by brilliant people who have opened for me the fields of quality management, process capability, supply chain management, project management, and finance. But regarding statistics, I quickly realized I was going to be practically the end of the learning journey: I could teach others, but I was going to have to teach myself if I wanted to move forward.

Not to mention that the experience of working in the industry is much more open-ended than the classroom setting that I had gotten used to at university. In a statistics class there is always a right answer. There is always a golden formula, a well-factored equation, a QED, that merits full points. At work, there isn’t a lot of that. Sure, you turn up the wrong estimates for your boss and you’re bound to get a late-night call on MS Teams. But you could end up applying the wrong model, or using the wrong objective function, and no one’s gonna know. Most likely, no one knows. And unless the figures you’re plugging into your formulas are other people’s money, there’s very little room for your error to snowball into a crisis. Again, speaking in terms of my work setting, there was nobody in the division who was going to read my work and spot my error if ever I made one.

Like any sane person who finds his entire structure for living getting obliterated, I panicked. In the same month that I was accepted by the company, I went back to university and applied for a master’s degree. As an added bonus, my company was open to subsidizing my matriculation fees.

My attitude towards research hadn’t changed a bit. I still did not believe in my ability to put out original work. I didn’t even have an idea of what original work was supposed to mean. But I put that aside for the moment, welcoming the return of rigor, and class schedules, and exams, and the grading system, into a life that I feared was rapidly coming apart within the upside-down expectations of the nine-to-five world. I was back at university, I was back home.

The second milestone happened in two separate events: the first being a conference organized in the same year as our graduation for research by both faculty and students, the second a larger academic conference held in Japan. Some of our student papers (again, ideas courtesy of my more innovative group mates) were well enough regarded to be invited to such conferences. I got to present in the university conference, while in Japan I was mostly just there for the ride, as the co-author who doesn’t want to miss out on the opportunity to travel for (mostly) free. These two conferences gave me a taste of the real academia, of brushing shoulders with giants and amateurs alike, all nosing into obscure texts and formulas, in the hopes of coming up with something that would at least open a small peephole, if not entirely a door or window, into new knowledge. And for the first time, I felt intrigued.

Goofy guy on the left is me. Going right, there’s Isabella Benabaye, Patricia Donato, and Sean Escalante (presenter). The full paper, “Data Visualization for the New Age,” is available in the proceedings of the 2018 International Conference on Teaching Statistics (ICOTS).

I still don’t see myself as an innovator in any regard, but I’ve long stopped thinking of it as the point. The arXiv collection alone contains (according to their home page) 1.9 million articles, and in that great ocean our fifteen-page article (co-authored in this version by my brilliant adviser, Dr. Erniel Barrios of the UP School of Statistics) is hardly a drop, but it is there. I still don’t have any world-changing ideas, but I do have a few questions I want to answer, and maybe someday one of those questions will lead to someone else’s great idea.

This particular question is a product of my competing interests in statistical modeling and the craft of writing. While my thesis is localized to the customer service industries, at its heart it is my attempt to marry these two great interests of mine in order to answer the question: How do we understand texts? How does the mind process written words, language, into abstract ideas and concepts? The entire field of Topic Modeling has been preoccupied with exactly that question, attempting to represent, using various mathematical and algorithmic tools, the interpretation of written language into higher-level concepts, or as we call them, “Topics”.

In order to model language using statistical tools, I first had to convert writing into a format that plays well with these tools. For this, I applied what is known as the “Vector Space Model,” more commonly referred to in the Natural Language Processing (NLP) field as the “Bag of Words” approach. In one of my examples in the preprint, I demonstrate how the following paragraph from F. Scott Fitzgerald’s The Great Gatsby

About half way between West Egg and New York the motor-road hastily joins the railroad and runs beside it for a quarter of a mile, so as to shrink away from a certain desolate area of land. This is a valley of ashes—a fantastic farm where ashes grow like wheat into ridges and hills and grotesque gardens where ashes take the forms of houses and chimneys and rising smoke and finally, with a transcendent effort, of men who move dimly and already crumbling through the powdery air. Occasionally a line of grey cars crawls along an invisible track, gives out a ghastly creak and comes to rest, and immediately the ash-grey men swarm up with leaden spades and stir up an impenetrable cloud which screens their obscure operations from your sight.
The Great Gatsby, by F. Scott Fitzgerald

is processed into a vector, where each component represents a score for one of the words used in the document.

Portion of Figure 1 from Dayta and Barrios (2021). Available on arXiv.
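
For readers who like to see things in code, here is a minimal sketch of the idea in Python. It assumes the simplest possible scoring, raw word counts, which may not be the scoring used in the paper. The tokenizer and the function names are my own illustration, not taken from the thesis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# A fragment of the Gatsby passage above, used as a one-sentence "document".
sentence = "This is a valley of ashes, a fantastic farm where ashes grow like wheat"

# The vocabulary fixes which word each component of the vector refers to.
vocabulary = sorted(set(tokenize(sentence)))
vector = bag_of_words(sentence, vocabulary)

for word, count in zip(vocabulary, vector):
    print(f"{word}: {count}")
```

Word order is thrown away entirely, which is where the “bag” in the name comes from: all that survives is how often each word appears.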

When documents are processed this way, it is then simply a matter of stacking up each vector to form a matrix that represents the collection of all documents in a library. In the thesis, this library or “corpus” is taken to be a set of feedback (complaints, commendations, inquiries, etc.) sent in by customers of a consumer-facing brand, but really it can be anything from a collection of users’ posts on social media sites like Facebook, Twitter, or Reddit, to an author’s bibliography (say, the collected stories and novels of F. Scott Fitzgerald). The paper then goes into detail on how to process these corpora and extract their topics, extending some methods that have been put into use in the past.
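
The stacking step can be sketched in a few lines of Python as well. This is only an illustration under my own assumptions: the three feedback messages are made up, the scores are raw counts, and the function name is hypothetical rather than anything from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Stack one count vector per document over a shared vocabulary."""
    token_lists = [tokenize(doc) for doc in documents]
    # Every document is scored against the same, corpus-wide vocabulary.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical customer feedback, invented for this example.
corpus = [
    "The delivery was late and the box was damaged",
    "Great service and fast delivery",
    "Is the product available in a larger size",
]

vocabulary, matrix = document_term_matrix(corpus)

for row in matrix:
    print(row)
```

Each row is one document and each column is one word, so a word that never appears in a given message simply gets a zero in that row.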

I don’t want to write too much on the technicalities of the method. For one, I never really planned this blog to be an academic one, but also the paper already covers that in more detail than I could hope to write down legibly on a WordPress blog.

Now if you ask me if I think the idea is innovative, I’d say it raises some points worth discussing about the usefulness of topic modeling procedures. Will it lead to someone else’s great idea? I guess we’ll find out. But right now, all I’m really thinking about is how proud I am to finally announce that this work is (for the most part) complete, and how much I’m looking forward to doing more academic work in the future.

The paper, “Semiparametric Latent Topic Modeling on Consumer Generated Corpora,” is now live as a preprint on the arXiv repository. The paper is currently undergoing review with a major journal for data mining and statistics.

2 responses

Love You For 10,000 Years – Dominic Dayta

January 3, 2022

[…] have been going so well, I’m almost afraid of losing balance. This year was also the year I finally closed my two-year ordeal with my master’s thesis. I graduated with my master’s degree in August and, following the theme for my 2021, then followed […]

“SemiparTM” now published in the Annals of Data Science – Dominic Dayta

January 30, 2025

[…] four years ago now, I wrote on this blog about the completion of my master’s thesis, which we put up on the arXiv in a compact, paper form soon after its approval by my thesis […]
