As I am very interested in information extraction from textual data from a research perspective, I have tried out the services of AlchemyAPI.
AlchemyAPI is known as one of the best natural language processing services and offers a number of APIs for entity extraction, sentiment analysis, keyword extraction, concept tagging, relation extraction, text categorization, author extraction, language detection, text extraction, microformats parsing, feed detection and linked data support. I am mostly interested in entity extraction and relation extraction, as they are the key subtasks of information extraction.
The company offers AlchemyAPI as a service, as on-premise services and as custom-built services. They also offer 1,000 requests per day in the free tier (this is what I use).
To access the services, you first need to register for an access key and sign an agreement that you will not misuse the services. The services can then be accessed using prepared SDKs (currently available for Node.js, Python, Java, Android, Perl, C# and PHP) or directly using HTTP requests, as the API is very straightforward.
Entity extraction should normally consist of named entity recognition and coreference resolution. Their service does not return coreference clusters, but it performs entity disambiguation and connects entities to known linked data sources such as DBpedia, Freebase, YAGO and OpenCyc. An example of extracted data in the XML format:
I have also tried some tools that are available from research groups, but AlchemyAPI seems to work with very high precision on real-life data. In the field of NLP, AlchemyAPI is also one of the best and most comprehensive suites.
The conference was held in Spišská Nová Ves, Slovakia, and I decided to go there by car.
On 2nd November I went to Budapest, where I met some friends and did some sightseeing. On Monday (4th November) Bojan came by train and then we drove to Slovakia. As we had set the GPS to use the shortest path, we drove through Drožnjava, where the road is not in very good condition and there was also thick fog. We were moved from the Metropol hotel to the Renesance hotel as there were some accommodation problems. We were also joined by a friend from Ukraine in the same room.
In Spišská everything was closed at 9 o’clock and there was nothing to do in the evenings. Otherwise the city is nice and it also has some shopping centres, e.g. Tesco, Madaras, …. One afternoon I went there to buy some shoes and clothes.
The conference social events were really calm and ended early, at around 9pm. On Tuesday we went to see Kežmarok castle, which is very nice and has a lot of collections from the 15th century until today. There I also tried the Slovak national spirit, Borovička.
The conference programme was oriented more towards general informatics and programming languages. One of the keynote speakers was Prof. Dr. Andreas Bollin from the University of Klagenfurt, whose talk was titled “Evolution before Birth? – A Closer Look on Software Deterioration”. He presented some ideas about formal models and future directions. The second keynote, on the definitions of languages from a mathematical point of view, was given by Prof. Dr. Zoltán Fülöp. I also remember a few interesting presentations, especially the one on website fragment processing: Isomorphic mapping of DOM trees for Cluster-Based Page Segmentation. The idea was to represent the webpage as a tree of HTML elements and to eliminate redundancy when encoding the structure. Another talk was about deciding on the fly how much data to send to the clients in order to reduce the amount of server-side processing: Benchmark-based Optimization of Computational Capacity Distribution in a Client-server Web Application. There are frameworks that can benchmark clients, so that when a request is sent to the server, the server knows the capabilities of the client, e.g. its processing power. The third interesting talk was about the performance evaluation of Micro instances at Amazon EC2: Performance of a Java Web Application Running on Amazon EC2 Micro Instance.
Our paper was presented by Bojan Furlan and I recorded the presentation, which is available below:
PredictionIO (http://prediction.io/) is an open source machine learning (ML) server. Its goal is to make personalization and recommendation algorithms more accessible to programmers without ML knowledge. It includes a recommendation engine and a similarity engine, which can be instantiated, configured and evaluated via a web-based GUI.
Due to the limited number of integrated ML methods, I do not think this product should be called a “machine learning server” yet. As I was curious how the system works, I tested it. In this post I review how to install and use the server.
First we need to install the server and its dependencies. I was using Mac OS X Mavericks (10.9, GM):
We need to install MongoDB (http://www.mongodb.org). At the time of writing, version 2.4.6 was available. To run the database, we need to create a db folder and run the service.
git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
git checkout r1.1.0
cd ..
git clone https://github.com/PredictionIO/PredictionIO.git
2. Run the server
After we have packaged the distribution, we can run the server from dist/target/PredictionIO-<version>. First we need to run the setup script ./bin/setup.sh and then start the server with ./bin/start-all.sh.
The server is accessible only to registered users, who can be added using the command ./bin/users. After that, we can log in to the server via the default port: http://localhost:9000/.
Later, if we see the message “This feature will be available soon.”, we need to run the setup script again and restart the server.
3. Write an example application
First, we create an application. The result of this step is an App Key, which is used in our script. Then we create an engine; we chose the recommendation engine. We need to define item types and some basic recommendation parameters. Afterwards we select a recommendation algorithm and set its parameters.
The main idea is to have a set of users and a set of different items to predict new items for new or existing users.
Second, we need to populate the database via our program, and then we can call functions to get new predictions. We published our sample code on GitHub (https://github.com/szitnik/prediction-io-Test). The key idea was to have 4 users and their friendships (modelled as view actions) and to predict new possible friendships.
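The friendship idea can be tried out without the server. The snippet below is only a minimal stand-in in plain Python, not PredictionIO's API and not the code from the repository: the user names, the `views` data and the scoring by number of shared friends are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical data: each pair (a, b) means "user a viewed user b".
views = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4")]


def build_friends(pairs):
    """Treat every view action as a symmetric friendship link."""
    friends = defaultdict(set)
    for a, b in pairs:
        friends[a].add(b)
        friends[b].add(a)
    return friends


def suggest(user, friends):
    """Rank non-friends of `user` by the number of friends they share."""
    scores = defaultdict(int)
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                scores[candidate] += 1
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda c: (-scores[c], c))


friends = build_friends(views)
print(suggest("u1", friends))
print(suggest("u4", friends))
```

A real engine would of course use a proper recommendation algorithm; counting shared friends is just the simplest thing that produces a ranking from view actions.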
After we inserted the data, the system calculated all possibilities and stored them in the MongoDB database:
On the 10th of August I left Sofia by train, heading towards Plovdiv. I stayed there for two nights. On the first day, Didka’s friend Gergana showed me most of the city, which was very kind of her. On the second day, I visited some other sights and two other hills.
On the morning of 12th August, more specifically at 5:30, I went to Plovdiv South Bus Station and luckily got the last free seat on the bus to Sunny Beach. There I checked in to hostel 415, a very nice hostel with a pool :P. A few hours later I made a new friend, with whom I visited Nessebar. On the second day I went to some of the clubs for which Sunny Beach is known. On the third day I was mostly reading a book that a French friend from the hostel had recommended, and in the evening I went swimming.
At 6:30 on the morning of 15th August I left Sunny Beach and made my last stop in Bulgaria, in Varna. This day was also the official day of Varna (formerly also known as Stalin), so there were some festive events. First I checked into a hostel and then joined the Free Varna Tour. After the tour I wanted to visit the historical museum, which has an exhibition of gold or something very interesting, but it was unfortunately closed. So I went to the “Cathedral Sveto Uspenie Bogorodichno” instead. Lastly, I needed to visit the biggest mall in Varna to buy some Menthas 🙂 – I got them in three different shops on the way back to the hostel. In the evening some friends from the hostel and I went to the beach, where we had a great time. At midnight there was a really huge and nice fireworks display, people were launching little hot air balloons …
Video of the festivity in front of the Cathedral:
I left Varna and Bulgaria on 16th August. Thanks to Didka, her friend Gergana and lots of people in the hostels, I had a great time. On my next Bulgarian trip I must visit Veliko Tarnovo (a former capital of Bulgaria) and Panagyurishte. I also did not manage to find “Jajca po Panagyurski” in Sunny Beach or Varna, so I will try to make them at home.
After the conference, there were two days of workshops. I had registered for the BioNLP workshop, where Marinka and I had won the Gene Regulation Network Shared Task (GRN ST).
Throughout the first day of the workshop, researchers presented general work in the BioNLP field. I met the BioNLP ST organizers Claire Nedellec, Robert Bossy and Zorana Ratkovic, who work at INRA, France. Claire wrote a book chapter about the joint extraction of entities and relationships from textual data using ontologies. I was also interested in whether they had continued this work, and found out that Zorana recently published a paper that also incorporates coreference resolution (NER->COREF->REL), but in a pipeline manner.
On the first day I also visited a co-located workshop because there were some presentations on extensions of NLP frameworks (e.g. U-Compare, UIMA, GATE) and a showcase of how to make UIMA SPARQL-interoperable.
In the late afternoon and in the evening I continued working on my presentation.
The second day was devoted only to the ST presentations and posters. This year’s ST was the third one, and it is going to be continued in 2015 with similar tasks. All the tasks were related to some kind of text mining on biological data, for example knowledge base construction, relation extraction, event detection, … Most of the proposed systems depend heavily on the syntactic structure of sentences or on rules. A well-known system, TEES, participated in almost all of the shared tasks and also achieved some of the best results. Furthermore, as TEES has been in development since 2009, some other competitors also used it as a framework to develop their own techniques.
During the poster session I met Martin Krallinger, the organizer of the BioCreative IV CHEMDNER challenge. He was enthusiastic about the BioNLP STs and told me that around 70 competitors have already applied for CHEMDNER. The task of CHEMDNER is to automatically detect chemical compounds in text. The participants will have the option to publish a 2-4 page technical report on their system in the conference proceedings and will also be co-authors of the joint paper in BMC Bioinformatics. Moreover, that journal will also have a special issue in which the best systems and systems with an interesting methodology will be published.
I gave my talk at 5pm, just before the last talk at the workshop. My slides:
If you do not see the presentation, you can download it from http://zitnik.si/temp/BioNLP2013_presentation_MarinkaSlavko.pdf.
After all the sessions there was a discussion about further work in the BioNLP ST. The proposal was to publish the source code for all the participating systems, to improve annotations, continue with existing tasks and propose new ones. There will also be a special issue in BMC Bioinformatics for this year’s BioNLP or at least a thematic series within the BMC journal.
In the evening I met Vasilena and Pavlina (students from the Bulgarian association of PhD students) and we walked around the city, visited the protests and lastly went to a birthday party in a park. Today (Saturday) I will continue my trip to Plovdiv, then to Sunny Beach, and on Friday I will go home from Varna.
My participation in the protests for Bulgarian rights :):
If you do not see the movie, you can download it from http://zitnik.si/temp/acl2013_5.mp4.
This morning’s keynote was given by Dr. Chantel Prat, Assistant Professor in the Department of Psychology at the University of Washington. She focuses mostly on cognitive science. I think the most interesting part of her presentation was a comparison of monolinguals and bilinguals at problem solving. Her findings show that bilinguals perform better at solving novel tasks. But when both groups of test subjects were tested on already known tasks, the performance of the bilinguals remained the same, while the monolinguals improved and achieved the same results as the bilinguals.
During the coffee break I visited the Maluuba stand (http://www.maluuba.com/). They are developing an application similar to Siri and Google Now. They began developing their product at the same time as the big players, but there are only 25 of them, and their product will be integrated into mobile phone systems, TVs, etc. Their focus at ACL is to find new people who would do research for them. I also said hi to the Google people. I had already spoken to them on Monday and got those nice red glasses :), but today I solved their simple “Research quiz” and got a Google bottle – now I obviously need to go running tomorrow morning, as I have the full equipment with me. Btw, did you know that Google researchers produce more than 300 scientific publications per year?
In the first session I attended A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations by Angeliki Lazaridou, Ivan Titov and Caroline Sporleder, Joint Inference for Fine-grained Opinion Extraction by Bishan Yang and Claire Cardie, and Linguistic Models for Analyzing and Detecting Biased Language by Marta Recasens, Cristian Danescu-Niculescu-Mizil and Dan Jurafsky.
After lunch there was the ACL Business Meeting, where some facts about the conference were presented. At this year’s ACL there were 987 registrations. There is only one researcher from Slovenia who is an ACL member – I suppose this is Tomaž Erjavec or Darja Fišer. Afterwards, there were 15 talks about ACL organization, changes, funding, events, similar conferences, journals, etc. Most of the talks were reports on or presentations of similar conferences in which ACL people also cooperate – all of them were already on my personal conference list. For NAACL they pointed out the problem of financing; this year they published all talks on the internet (http://techtalks.tv/naacl/2013) for free and replaced the USB proceedings with an iOS/Android application, which was a good decision. Then there was a report on EMNLP 2013 and an overview of EMNLP 2014 in Qatar. The others that were presented were IJCNLP 2013/2014, EACL 2014, Coling 2014, ACL-IJCNLP 2015 and IOLING (International Olympiad for Linguistics for secondary school students). If you would like to co-organize or host ACL 2016, you are invited to contact firstname.lastname@example.org.
The ACL also controls two journals. The new journal is Transactions of the Association for Computational Linguistics (http://www.transacl.org/), which has a submission deadline on the 1st of every month and a review period of 3 weeks. A paper can be accepted, accepted with changes (the author needs to resubmit the paper within two months), rejected with changes (the author can resubmit after 3 to 6 months) or rejected. Another novelty is that these papers can be presented at NAACL or ACL conferences in the form of a paper, poster or talk.
The second journal they own is Computational Linguistics, which is one of the top journals in the field and also has a high SCI impact factor. They said that it takes about 2 months to receive the first decision on a submitted paper.
Then I attended Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews by Kang Liu, Liheng Xu and Jun Zhao, Mining Opinion Words and Opinion Targets in a Two-Stage Framework by Liheng Xu, Kang Liu, Siwei Lai, Yubo Chen and Jun Zhao, and Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning by Song Feng, Jun Seok Kang, Polina Kuznetsova and Yejin Choi.
After the afternoon coffee break I attended Recognizing Identical Events with Graph Kernels by Goran Glavaš and Jan Snajder, and Automatic Term Ambiguity Detection by Tyler Baldwin, Yunyao Li and Bogdan Alexe.
The presentation was given by Ms. Li, who works at Disney. She presented their simple workflow, TAD, which is able to determine whether a specific term is ambiguous. Their main problem is detecting opinions about movies on Twitter. For example, the name Skyfall 007 is not ambiguous, because if we search Twitter with this query, almost all of the results will be about the movie. On the other hand, the title of the movie Brave is very ambiguous, as lots of results have no connection to the movie, e.g. “He is so brave, that …”.
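The Skyfall vs. Brave intuition can be illustrated with a toy check. This is not the TAD workflow from the talk, only a sketch of the underlying idea: the cue words, the 0.8 threshold and the sample tweets are all made up for this example.

```python
# Hypothetical cue words that signal a tweet is about a film.
MOVIE_CUES = {"movie", "film", "cinema", "trailer", "watched"}


def on_topic_ratio(tweets):
    """Fraction of tweets containing at least one movie cue word."""
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if MOVIE_CUES & set(t.lower().split()))
    return hits / len(tweets)


def is_ambiguous(tweets, threshold=0.8):
    """Call a term ambiguous when too few of its search results are on topic."""
    return on_topic_ratio(tweets) < threshold


# Made-up search results for the two queries.
skyfall = ["Skyfall 007 is a great movie", "watched Skyfall 007 at the cinema"]
brave = ["He is so brave", "Brave movie night", "be brave today"]
print(is_ambiguous(skyfall), is_ambiguous(brave))
```

With these invented samples the first query comes out as unambiguous and the second as ambiguous, which mirrors the example from the talk.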
I also attended Towards Accurate Distant Supervision for Relational Facts Extraction by Xingxing Zhang, Jianwen Zhang, Junyu Zeng, Jun Yan, Zheng Chen and Zhifang Sui, and Sequence Labeling for Determining Opinions in Online Forums by Kazi Hasan and Vincent Ng.
They propose a system for opinion mining and present the problem as sequence labeling. They try to improve the classification of ideological debates. Their baseline systems are: (Baseline 1) one SVM classifier per domain, where each training instance corresponds to a post that can be positive or negative, with the feature types Basic (unigrams), Sentiment (dependencies, sentiment word counts) and Argument (words in a post); (Baseline 2) Baseline 1 with an added author constraint. Further, they propose a system with two constraints: (1) the Ideology Constraint (IC), applicable to the same author across different domains, with the motivation that an anti-abortion person is likely to be an anti-Obama person; (2) the User Interaction Constraint (UC), which captures regularities between interactions: a sequence of posts, for which they use CRFs. They evaluated on 4 datasets (abortion support, gay rights, Obama support and marijuana legalization) and showed significant improvements over the baseline systems.
After this last session, the Lifetime Achievement Award was given to Jerry Hobbs, who then gave a very interesting talk about the history and “future” of NLP. We may have some useful systems “in the next 10 years”. With this event, ACL 2013 finished, and now the two days of workshops will begin.
If you do not see the movie, you can download it from http://zitnik.si/temp/acl2013_4.mp4.
For dinner I took the Metro (btw., there is a really nice and clean Metro in Sofia, incomparable to those in some “famous” EU cities) to get to City Center Sofia. There I went to KFC and had a cheeseburger at McDonald’s. If I define a cheeseburger as one unit that a person can eat and compare prices between Slovenia and Bulgaria, I can conclude that there are no significant differences. In Slovenia, 1 cheeseburger is 1 EUR and in Sofia 1 cheeseburger is 1.99 BGN (approx. 1 EUR).
In the morning there was a keynote given by Lars Rasmussen, who got his PhD at the University of Edinburgh and then worked at his own start-up, which was bought by Google. At Google he worked on Google Wave, and now he is employed at Facebook, where he is working on Facebook graph search with a natural language interface. In his keynote he presented the new Facebook NLP graph search with some interesting queries, e.g. “Photos of my friends who knit”, “Photos of my friends from national parks”, “Restaurants in Sofia by locals”, etc. He also pointed out some problems in how Facebook employees and the public understood the system. For example, how would you select people who knit? Their idea was to support the obvious query “People who like Knitting”, but the public came up with queries like “People who knit”, “knitters”, etc. He also presented the development of the system since its beginning 2 years ago and gave a high-level architecture overview. Next to the obvious NLP parts, they also introduced a “de-sillyfication” method that runs before further NLP processing. Interestingly, some engineers from his team were also there at the end of the talk to answer more technical questions.
In the first session I attended the following talks: A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation by Zhenhua Tian, Hengheng Xiang, Ziqi Liu and Qinghua Zheng, ImpAr: A Deterministic Algorithm for Implicit Semantic Role Labelling by Egoitz Laparra and German Rigau, and Cross-lingual Transfer of Semantic Role Labeling Models by Mikhail Kozhevnikov and Ivan Titov.
During the lunch break I met Pavlina Ivanova and her friend from the association of doctoral candidates in Bulgaria. We had a really nice talk, and tomorrow evening we are planning to walk around the center of Sofia.
In the afternoon I attended:
Argument Inference from Relevant Event Mentions in Chinese Argument Extraction by Peifeng Li, Qiaoming Zhu and Guodong Zhou
Fine-grained Semantic Typing of Emerging Entities by Ndapandula Nakashole, Tomasz Tylenda and Gerhard Weikum
The talk was about how to detect emerging entities that will become popular. For out-of-KB entity detection, they focused on noun phrases, which can represent: a class or general concept rather than an entity; an entity already known within a KB; a new name for an old entity in the KB; or a new entity, unknown in the KB. They use PATTY, which has a collection of 300,000 synsets, e.g. the PATTY phrases <musician> released <album>, <music band> released <album>, <company> released <product>. Using PATTY, they propose a probabilistic weight model P(t1, t2 | p) = P(t1, t2, p) / P(p), where p is a pattern of the form <*> p <*> and t1, t2 are the types of its two arguments. If one of the entities is already known in the KB, the denominator can also be P(p, t2).
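To make the weight model concrete, here is a small sketch that estimates a conditional probability of the form P(t1, t2, p) / P(p) from raw counts, plus the variant with P(p, t2) in the denominator. The observations are invented for illustration and are not real PATTY data, and the function names are my own.

```python
from collections import Counter

# Hypothetical observations of (type1, pattern, type2); not real PATTY data.
observations = [
    ("musician", "released", "album"),
    ("musician", "released", "album"),
    ("music band", "released", "album"),
    ("company", "released", "product"),
]

triple_counts = Counter(observations)
pattern_counts = Counter(p for _, p, _ in observations)
pattern_t2_counts = Counter((p, t2) for _, p, t2 in observations)


def p_types_given_pattern(t1, t2, p):
    """P(t1, t2 | p) = P(t1, t2, p) / P(p), estimated from counts."""
    if pattern_counts[p] == 0:
        return 0.0
    return triple_counts[(t1, p, t2)] / pattern_counts[p]


def p_t1_given_pattern_t2(t1, t2, p):
    """P(t1 | p, t2) = P(t1, t2, p) / P(p, t2): the second type is known."""
    if pattern_t2_counts[(p, t2)] == 0:
        return 0.0
    return triple_counts[(t1, p, t2)] / pattern_t2_counts[(p, t2)]


print(p_types_given_pattern("musician", "album", "released"))
print(p_t1_given_pattern_t2("musician", "album", "released"))
```

Because both numerator and denominator are divided by the same total number of observations, the ratio of counts equals the ratio of probabilities. Knowing the second type narrows the denominator, so the estimate for the unknown type becomes sharper.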
Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction by Barbara Plank and Alessandro Moschitti
Their task is to find binary relations of the ACE-2004 type (newspaper and broadcast news). What happens when you change the domain? They propose a term generalization approach and a general syntactic structure. They crawled a pivot corpus from the WWW. Their idea is to use a standard syntactic tree kernel, i.e. the similarity between two trees is the number of subtrees they have in common. The issue here is that there are similar syntactic structures whose leaves differ, e.g. “mother of two” vs. “governor from Texas” have similar subtrees. So they tried to employ a semantic syntactic tree kernel, which allows soft matches between terminal nodes. How is this good for domain adaptation? They focus on two types of semantic similarity: (1) Brown word clusters, for which they induced 1k clusters from the ukWaC corpus (Baroni et al.), and (2) (I forgot what) :). Their system workflow looks as follows: Raw Text -> (Charniak) Parser -> Parse Trees with entities -> Tree Kernel based SVMs -> Multi-class classification. They also tested within the ACE datasets, training on one and testing on another (the thing I wanted to ask :):), because I did the same thing for coreference resolution across different corpora), and as expected, the results were lower by about 10%. Interestingly, there is almost no related work on DA (domain adaptation) for RE (I think it is the same for coreference resolution).
During the coffee break I talked to the Baidu people, who say that Baidu is the largest search engine in the world (maybe, but I am not completely sure about that…). I also found out that “Baidu” is a term from a Chinese poem and means something like: “You search for something and it suddenly appears”. Baidu is also this year’s biggest ACL sponsor, and the organizers said we should look a little at the sponsor pages, so you as a reader should also gaze a little at the following photos:
In the last session I attended Smatch: an Evaluation Metric for Semantic Feature Structures by Shu Cai and Kevin Knight, Variable Bit Quantisation for LSH by Sean Moran, Victor Lavrenko and Miles Osborne, Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora by Dhouha Bouamor, Nasredine Semmar and Pierre Zweigenbaum, and lastly The Effects of Lexical Resource Quality on Preference Violation Detection by Jesse Dunietz, Lori Levin and Jaime Carbonell.
In the evening there was a banquet (aka dinner) at the Sheraton Hotel. I socialized a bit more and found out that there are a lot of people from industry who are just observing the conference, some of them with no papers at all. I got to know people from Google (they have quite a lot of papers), Intel, Nuance (Siri uses their speech recognition if you did not know, very successful company), Sony, Microsoft, Nice Systems, …
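To see why differing leaves are a problem, here is a sketch of a hard-matching tree kernel in the style of the Collins and Duffy subset-tree kernel. It is my own illustration and not the semantic kernel from the talk: it only counts fragments whose grammar rules match exactly, and the two parse trees and their tags are hand-written assumptions.

```python
def label(node):
    """A node is either a word (str) or a tuple (label, child, child, ...)."""
    return node if isinstance(node, str) else node[0]


def children(node):
    return [] if isinstance(node, str) else list(node[1:])


def production(node):
    """Grammar rule at a node, e.g. ('PP', ('IN', 'NP'))."""
    return (label(node), tuple(label(c) for c in children(node)))


def internal_nodes(tree):
    """Yield every non-leaf node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in children(tree):
        yield from internal_nodes(child)


def common_fragments(n1, n2, decay):
    """Weighted number of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(children(n1), children(n2)):
        if isinstance(c1, str) or isinstance(c2, str):
            continue  # words carry no further structure
        score *= 1.0 + common_fragments(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared fragments over all pairs of internal nodes."""
    return sum(common_fragments(a, b, decay)
               for a in internal_nodes(t1) for b in internal_nodes(t2))


# Hand-written toy parses of the two phrases.
mother = ("NP", ("NN", "mother"), ("PP", ("IN", "of"), ("NP", ("CD", "two"))))
governor = ("NP", ("NN", "governor"),
            ("PP", ("IN", "from"), ("NP", ("NNP", "Texas"))))

print(tree_kernel(mother, governor), tree_kernel(mother, mother))
```

The two phrases share only the upper, purely syntactic fragments, so their similarity is far below the similarity of a tree with itself. A kernel with soft matches at the terminal nodes, as described in the talk, would let related words contribute as well.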
If you cannot see the videos, you can download them from URLs http://zitnik.si/temp/acl2013_1.mp4, http://zitnik.si/temp/acl2013_2.mp4 and http://zitnik.si/temp/acl2013_3.mp4.
Today was the official conference opening. Obviously, this year’s ACL is one of the biggest conferences. There were almost 1000 papers submitted, with an acceptance rate of 26%. During the conference there will also be presentations of journal papers from the new Transactions of ACL.
The keynote was given by Prof. Dr. Rolf Harald Baayen, who is a pioneer in empirical linguistic research. He talked about understanding language by observing where human eyes focus when reading English compounds. For example, what do handbag, worker, etc. mean, from the perspective of teaching a computer program to understand their notions? Mostly these words do not have a direct meaning in the text, and this is a problem.
During the first session I attended the talk Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms given by Elijah Mayfield, David Adamson and Carolyn Penstein Rosé.
They talked about processing chats. Interestingly, they found that the best way to get the important meaning, or the best extractions, is to remove everything before a sentence that ends with an exclamation mark. They also mentioned a general IE tool named LightSide.
The next talk was Decentralized Entity-Level Modeling for Coreference Resolution by Greg Durrett, David Hall and Dan Klein.
They proposed a new architecture with classic entity-level features. Their approach is decentralized, as each mention has a cloud of semantic properties, which makes it possible to maintain the tractability of a pairwise system. Furthermore, they separate properties and mentions into two separate models and connect them via factors. The resulting model is non-convex, but they could still perform standard training and inference using belief propagation. They tested their system on the CoNLL 2011 ST dataset in three different settings. The first used baseline features, the second standard entity features (i.e. gender, animacy, NE tags) and the third was enriched with semantic features. Their system gained 1% in accuracy over a baseline system in the first setting, but was worse or equal in the other two settings.
During the “Student Lunch” I heard an interesting idea, attributed to an important person by a person that I would also rather not mention: “IR is grep” 🙂 The IR people were obviously insulted, but on a very basic level, it is true :):):)
In the second session I attended A Computational Approach to Politeness with Application to Social Factors by Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec and Christopher Potts.
The first slide started with a picture of two dogs, one of them saying: “I only sniffed his ass to be polite”. They focused on detecting and measuring politeness. They use data from Wikipedia – 35k (4.5k annotated) requests – actions and StackExchange – 373k (6.5k annotated) requests. They had 5 annotators annotate the dataset and made it public. They also showed some interesting ideas about how a sentence should be formed to sound polite. Lastly, the most interesting thing they presented was how politeness changes for political candidates. Before elections, the people who go on to win are mostly more polite than the others. After the elections, the politeness of the winners drops and the “losers” become more polite.
The second talk I attended was Modeling Thesis Clarity in Student Essays by Isaac Persing and Vincent Ng.
After the coffee break I listened to the following talks:
Exploiting Topic-based Twitter Sentiment for Stock Prediction by Jianfeng Si, Arjun Mukherjee, Bing Liu, Qing Li, Huayi Li and Xiaotie Deng
They crawled Twitter for company hashtags and predicted whether a specific stock would rise or fall.
Learning Entity Representation for Entity Disambiguation by Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Houfeng Wang and Longkai Zhang
They try to link entities to an ontology by directly optimizing similarity using a two-stage approach.
Natural Language Models for Predicting Programming Comments by Dana Movshovitz-Attias and William Cohen
They proposed a model that suggests word completions when writing source code comments. All the data they used came from the Lucene library and StackOverflow posts that use the word Java. Their results show that prediction is better when using more data and a bag-of-words approach. In addition to the basic experiments, they also measured how good the prediction is somewhere in the middle of a software project’s development.
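As a rough idea of what comment autocompletion looks like, here is a toy bigram counter that suggests the next word of a comment. This is an assumption for illustration only, not the models from the paper, and the training comments are made up.

```python
from collections import Counter, defaultdict


def train_bigrams(comments):
    """Count which word follows which across a list of comment strings."""
    following = defaultdict(Counter)
    for comment in comments:
        words = comment.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def complete(following, prev_word, prefix="", k=3):
    """Suggest up to k next words after prev_word that start with prefix."""
    candidates = following.get(prev_word.lower(), Counter())
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked if w.startswith(prefix)][:k]


# Made-up training comments.
comments = [
    "returns the index of the document",
    "returns the number of hits",
    "closes the index writer",
]
model = train_bigrams(comments)
print(complete(model, "the", prefix="i"))
```

Training on more comments simply adds more counts, which fits the observation from the talk that more data gives better predictions.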
Paraphrasing Adaptation for Web Search Ranking by Chenguang Wang, Nan Duan, Ming Zhou and Ming Zhang
They presented how to adapt a paraphrasing technique to web search from three aspects: a search-oriented paraphrasing model, an NDCG-based parameter optimization algorithm and an enhanced ranking model leveraging augmented features computed on paraphrases of original queries. They also showed that search performance can be significantly improved, by up to 3% in NDCG.
In the evening, two poster sessions were organized, which lasted until 9pm. There were really a lot of posters and demos. ARGO (http://argo.nactem.ac.uk/) is especially interesting – IOBIE should also go this way. I also attach some interesting images:
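For readers who have not met NDCG before, here is a short sketch of how the metric is computed. It uses the common variant with linear gain and a log2 rank discount (other variants exist), and the relevance grades below are invented, not numbers from the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Invented relevance grades of the top results before and after re-ranking.
baseline = [1, 3, 0, 2]
reranked = [3, 1, 2, 0]
print(ndcg(baseline), ndcg(reranked))
```

A ranking that puts the most relevant results first scores closer to 1, so a gain in NDCG means the better documents moved up the list.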