Humans have been writing things down for thousands of years.
Over that time, our brains have gained a tremendous amount of experience in understanding natural language. When we read something written on a piece of paper or in a blog post on the internet, we understand what it means in the real world. We feel the emotions that the text elicits, and we often visualise how what it describes would look in real life.
Natural Language Processing (NLP) is a sub-field of Artificial Intelligence focused on enabling computers to understand and process human languages, bringing them closer to a human-level understanding of language. The process of reading and understanding language is far more complex than it seems at first glance. Many things go into truly understanding what a piece of text means in the real world.
In the context of the Information Society, many documents need to be made public every day. Since some of these documents may contain confidential information about private entities, measures should be taken prior to their publication to avoid revealing sensitive data or disclosing individuals’ identities.
Document sanitization pursues precisely this goal: removing sensitive information from a text (information that could reveal private details or the identities of the entities referred to in the document) so that the text can be distributed to a broader audience. In the past, sanitization has usually been tackled manually by governments.
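As a rough illustration of what automated sanitization involves, here is a minimal Python sketch that masks two kinds of identifiers using regular expressions. The pattern set, the `sanitize` function name and the placeholder labels are assumptions made for this example and are not taken from any existing tool. Real sanitization needs much more than pattern matching, since names, addresses and context-dependent details cannot be caught this way.

```python
import re

# Hypothetical, deliberately simple patterns chosen for illustration only.
# They will miss many real-world formats and may over-match in others.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def sanitize(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `sanitize("Contact ana@example.com or +351 912 345 678.")` returns `Contact [EMAIL] or [PHONE].`, while a person's name in the same sentence would pass through untouched. That gap is why approaches based on language understanding are needed.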
The challenge under the Datasense umbrella is to deliver the best possible NLP results for the Portuguese language in identifying personal and sensitive information, covering identification, relation extraction and human feedback.