Named Entity Recognition is a natural language processing technique that can help extract dates of interest from a PDF document.
Data is important. Period. While you can find ready-made datasets tailored to a specific data science task, there will come a point where these datasets become unoriginal and overused. Recently I have been working a lot with PDFs in Python, which, to be honest, is not the easiest of tasks at first. PDF documents are worth the effort because they are widely available on the internet and can add a ton of new data to your dataset. You can check out previous work I have done with PDFs below. I have since cleaned up and changed some of the code I used, which shows that your code is ALWAYS evolving!
If you are currently working on an NLP project, finding dates of interest can help you spot trends or patterns in your data. When did specific events happen? Is there a pattern in the time between events? What important dates appear in the text? There are many questions you could, and should, be asking yourself about the dates within a corpus.
The main libraries you will need for this code are PyPDF2, spaCy, and re (part of Python's standard library). First, we will create our PDF parsing class. The first function we want to create is a PDF reader, which lets us read a PDF into Python.
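Here is a minimal sketch of what such a class could look like. It assumes the PdfReader API from PyPDF2 2.x or later, and the clean_text helper is my own illustrative addition showing one plausible use of re (whitespace cleanup); treat both as assumptions rather than the definitive implementation.

```python
import re


class PdfParser:
    """Reads a PDF file and returns its text as one string."""

    def __init__(self, file_path):
        self.file_path = file_path

    @staticmethod
    def clean_text(raw_text):
        # Hypothetical helper: collapse newlines, tabs, and repeated
        # spaces into single spaces.
        return re.sub(r"\s+", " ", raw_text).strip()

    def pdf_reader(self):
        # Imported here so the class can be defined even when PyPDF2
        # is not installed. Assumes PyPDF2 >= 2.0 (PdfReader API).
        from PyPDF2 import PdfReader

        reader = PdfReader(self.file_path)
        # extract_text() may return an empty result for image-only pages.
        pages = [page.extract_text() or "" for page in reader.pages]
        return self.clean_text(" ".join(pages))
```

Keeping the text cleanup in its own small method makes it easy to swap in different rules if your PDFs have other quirks, such as hyphenated line breaks.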
I created a quick sample PDF to show you the power of this parsing class. The PDF contained the following text:
My name is Ben and I started working on data science spring of 2020. Today’s date is August 31, 2022.This class can help find dates of interest. For example, maybe you want to know more about an event that occurred in early 2014. This can help find those dates
To read the PDF into Python, simply pass the file path of your PDF to the PdfParser class and call its reader method.
pdf = PdfParser('your file path here')
text = pdf.pdf_reader()
Now that our PDF is processed, we can create a function that extracts the dates.
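Here is a minimal sketch of such a function, assuming spaCy's standard pipeline API. The name date_extractor and the optional nlp argument are illustrative assumptions; the argument simply lets you reuse a model you have already loaded.

```python
def date_extractor(text, nlp=None):
    """Return the unique DATE entities found in text as a list."""
    if nlp is None:
        # Imported lazily; requires spaCy and the en_core_web_lg model.
        import spacy

        nlp = spacy.load("en_core_web_lg")

    doc = nlp(text)
    # A set drops duplicates, e.g. the same year mentioned twice.
    dates = {ent.text for ent in doc.ents if ent.label_ == "DATE"}
    # Sets are unordered, so the order of the returned list can vary.
    return list(dates)


# Usage, with text being the string returned by pdf.pdf_reader():
# dates = date_extractor(text)
# print(dates)
```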
This function uses Named Entity Recognition to find any entities within a corpus labeled as DATE. For this to work, you will need to download the 'en_core_web_lg' model from spaCy, for example by running python -m spacy download en_core_web_lg. Within the function, the dates are collected in a set so that duplicates are dropped (maybe we find the same year twice). You could change this to a list if you care about the frequency of each date within a body of text. Next, let's run the function on the text of our PDF.
Running this code produced the following output:
['early 2014', 'Today', 'August 31, 2022', '2020']
Boom! It works! Just like that, we can convert a PDF into a Python-readable format and find the dates of interest!
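As mentioned above, swapping the set for a list keeps duplicates, which lets you count how often each date shows up. Here is a minimal sketch using collections.Counter from the standard library; the function name count_dates and the sample list are hypothetical and are not output from the sample PDF.

```python
from collections import Counter


def count_dates(date_mentions):
    """Count how often each date string occurs in a list of mentions."""
    # Strip stray whitespace so "2020 " and "2020" count as the same date.
    return Counter(mention.strip() for mention in date_mentions)


# Hypothetical list of DATE mentions (duplicates kept, unlike the set version).
mentions = ["2020", "early 2014", "2020", "August 31, 2022"]
counts = count_dates(mentions)
print(counts.most_common(1))  # [('2020', 2)]
```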
Today we looked at not only how to read a PDF into Python but also how to extract the dates of interest from that PDF. While finding the sentiment of the text is important in NLP, knowing when an event occurred, or when people are consistently making negative comments about a company, can be insightful. For example, the function can be used in business analytics when analyzing consumer reviews. What if all of the negative reviews a company receives come in during the summertime? Maybe this is due to the company staffing fewer people because of an expected decrease in consumer demand (many people vacation in the summer). After further investigation, the company could attribute these reviews to staffing rather than to a faulty product. Since creating this function, I am constantly looking for timeline patterns in my analysis, and it has definitely helped me discover latent patterns. Let me know how this works for you!
And here is the full code! I hope this can help in your future projects!