Masato Hagiwara

I'm a Senior AI Researcher at Earth Species Project working on decoding non-human communication with AI/ML technologies. Currently I'm focusing on building foundation models (AVES) and benchmarks (BEANS) for non-human animals. I call this field (non-human) Animal Language Processing, or ALP.

Formerly, I was a Machine Learning Engineer / Researcher at Duolingo. I love language and machine learning, and help people connect the two. I speak Chinese, Japanese, and English fluently, and am learning Korean and Lojban. I led the launch of the Japanese, Korean, and Chinese courses on Duolingo. My research projects have appeared on TechCrunch and Quartz.

You can find my resume here.

News

  • Mar. 2024: Our paper (Project MOSLA) was accepted at LREC-COLING 2024 and nominated as one of the Best Paper Candidates!
  • Feb. 2024: Our paper (ISPA) was accepted at the XAI-SA Workshop co-located with ICASSP 2024.
  • Feb. 2023: Two papers (AVES and BEANS) my colleagues at Earth Species Project and I co-authored were accepted at ICASSP 2023.
  • Nov. 2021: I'm joining Earth Species Project as a Senior AI Researcher. I'm thrilled to work on decoding non-human communication with AI/ML technologies!
  • Aug. 2021: I'll be giving an invited talk on "Machine Learning for Language Learning" hosted by Waseda University. See the official announcement and the talk slides for more info.
  • Jul. 2021: I'm working on a book about Japanese NLP with Paul O'Leary McCann. We'll be covering everything from tokenization/morphological analysis up to recent neural methods and BERT. See the official website for more info.
  • Apr. 2021: I'm happy to announce GrammarTagger, a neural multilingual grammar profiler, and EXPATS, a toolkit for explainable automated text scoring!
  • Apr. 2020: I'm now working with Mirai Translate, a Japan-based startup offering human-level machine translation services, and ACTNext, ACT's research and development unit for educational research.
  • Dec. 2019: We launched GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors. The paper was accepted to appear at LREC 2020.
  • Nov. 2019: I'm presenting our ultra fine-grained NER system at TAC KBP 2019, which ranked #2 among 9 strong competitors in the EDL track (joint work with Studio Ousia)!
  • Aug. 2019: Our paper, "TEASPN: Framework and Protocol for Integrated Writing Assistance Environments", was accepted to appear at EMNLP 2019 (system demonstration)!
  • Jul. 2019: My book "Real-World Natural Language Processing" is available via MEAP, Manning Early Access Program. Feedback is welcome!


Projects

  • ALP (Animal Language Processing) — For the past few years, I've been the lead researcher on many projects on the application of AI for animal communications, including:
    • AVES, a self-supervised, transformer-based audio representation model for encoding animal vocalizations ("BERT for animals").
    • BEANS, a collection of bioacoustics tasks and public datasets, specifically designed to measure the performance of machine learning algorithms in the field of bioacoustics.
    • ISPA, a precise, concise, and interpretable system designed for transcribing animal sounds into text ("IPA for animals").

  • NLP — I'm the main researcher and developer of many ML/NLP open source projects and datasets, including:
    • TEASPN, a protocol and a framework for integrated writing environments
    • Rakuten MA, a morphological analyzer for Chinese and Japanese written entirely in JavaScript
    • NanigoNet, a language detector for code-mixed input supporting 150 human languages and 19 programming languages
    • GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors
    • Open Language Profiles, a platform for sharing open linguistic resources for language education

  • Duolingo - I built and worked on research for Duolingo, the most popular language learning app in the world, and Duolingo English Test, an affordable and accessible English certification test developed by Duolingo.

  • Music - In my free time, I create music and play jazz.


Experience

  • Nov. 2021 - Present: Senior AI Research Scientist - Earth Species Project (Berkeley, CA)
    • Developed AVES, a self-supervised foundation model for animal sounds ("BERT for animals")
    • Built BEANS, a benchmark for bioacoustics ML models (“GLUE for animals”)

  • Feb. 2019 - Present: Owner & Independent NLP/ML Engineer and Researcher - Octanove Labs LLC (Seattle, WA)
    • Worked as a consultant for early-to-mid stage startups in the US/Japan on their ML strategies
    • Worked on QA and NER with Studio Ousia (ranked #2 at TAC KBP 2019 fine-grained NER track)
    • Built educational research and open-source projects with RIKEN (TEASPN, NanigoNet, and GitHub Typo Corpus)
    • Built a free, Web-based AllenNLP course in collaboration with Matt Gardner at Allen Institute for AI

  • Feb. 2015 - Feb. 2019: Senior Machine Learning Engineer / Researcher - Duolingo, Inc. (Pittsburgh, PA)
    • Built automatic grading technologies for Duolingo English Test using neural networks
    • Led data creation and analysis for various research projects, including user behavior analysis and second language acquisition modeling (SLAM) shared task
    • Led the content creation of Chinese, Japanese, and Korean from English courses

  • Oct. 2010 - Feb. 2015: Lead Scientist - Rakuten Institute of Technology (New York, NY)
    • Developed machine transliteration (NLP2011 paper award) and machine translation algorithms for the largest Japanese e-commerce website (Rakuten)
    • Built a Chinese/Japanese word segmentation / morphological analyzer (RakutenMA)
    • Developed a writing support system for English as a Second Language (ESL) learners

  • Apr. 2008 - Jul. 2008: Research Intern - Microsoft Research (Redmond, WA; Mentor: Hisami Suzuki)
    • Built a state-of-the-art method for Japanese query alteration for spelling correction and spelling/transliteration normalization
    • Implemented the system using Visual C#, SQL Server, and Ruby, working with tens of gigabytes of query logs; the system was integrated into Microsoft Live Search
    • Published a research paper on the query alteration algorithm at NAACL 2009 and at the 3rd NLP Symposium for Young Researchers (Outstanding Presentation Award)

  • Aug. 2005 - Sep. 2005: Intern (Software Engineer), Google Inc. (Mountain View, CA; Mentors: Dekang Lin and Jun Wu)
    • Improved Japanese query suggestion, which is currently used as the basis for the query suggestions shown at the top and bottom of the Google search results
    • Ran knowledge extraction algorithms on the distributed computation infrastructure (MapReduce and Google's large network clusters)


Education

  • Apr. 2006 - Mar. 2009: Ph.D., Information Engineering,
    • Graduate School of Information Science, Nagoya University, Japan.
    • Doctoral Thesis: "Modeling and Selection of Context for Better Synonym Acquisition"

  • Apr. 2004 - Mar. 2006: Master's Degree, Information Engineering,
    • Graduate School of Information Science, Nagoya University, Japan
    • Skipped a year of undergraduate study due to excellent academic performance. Overall GPA: 3.8
    • Master's Thesis: "Utilization of Probabilistic Latent Semantics for Automatic Thesaurus Construction"

  • Apr. 2001 - Mar. 2004: Information Engineering Course, School of Engineering,
    • Nagoya University, Japan. Computer Science GPA: 3.9

Awards & Professional Activities

  • Invited talk on “Education and AllenNLP” at AllenNLP Summit, 2019.
  • Co-organizer of the Workshop for Natural Language Processing Open Source Software (NLP-OSS), co-located with ACL 2018.
  • Invited keynote at the Optimizing Human Learning workshop co-located with ITS 2018 (Montréal, Canada, June 2018).
  • Invited talk at CUNY NLP Seminar (hosted by Prof. Heng Ji), “Word Segmentation and Transliteration in Chinese and Japanese”, April 2013. [slides]
  • 2011 Field Innovation Award from the Japanese Society for Artificial Intelligence: ANPI_NLP: Safety Information Confirmation Support using Natural Language Processing for The 2011 Tohoku Earthquake.
  • Paper Award at NLP2011 “Latent Class Transliteration based on Source Language Origins” (the largest Japanese NLP academic conference)
  • Best Paper Award at NLP2009 “Semantic Category Extraction from Unsegmented Text using Graph Kernels” (the largest Japanese NLP academic conference, chosen among 235 papers)
  • Paper Award at the 3rd NLP Symposium for Young Researchers. Presentation: “A Unified Approach to Japanese Query Alteration based on Semantic Similarity”


Publications

Books

Journal Papers

  • Burr Settles, Geoffrey T. LaFlair, Masato Hagiwara: Machine Learning–Driven Language Assessment. Transactions of the Association for Computational Linguistics, Vol. 8, pp. 247–263, 2020.
  • Masato Hagiwara, Koji Murakami, Graham Neubig, Yuichiroh Matsubayashi: Robust NLP for Real-world Data : 7. ANPI_NLP - Mining Safety Information after Disasters Using Natural Language Processing-. Information Processing Society of Japan Magazine. Vol. 53, No. 3, pp. 241-248, 2012.
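  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Incremental Lexical Knowledge Acquisition from Unsegmented Text Using Graph Kernels (in Japanese). Journal of the Japanese Society for Artificial Intelligence, Vol. 26, No. 3, pp. 440-450, 2011.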
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Supervised Synonym Acquisition Using Distributional Features and Syntactic Patterns. Journal of Natural Language Processing, Vol. 16, Num. 2, pp. 59-83, 2009.
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. A Comparative Study on Effective Context Selection for Distributional Similarity. Journal of Natural Language Processing, Vol. 5, Num. 5, pp. 119-150, 2008.
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Effective Use of Indirect Dependency for Distributional Similarity. Journal of Natural Language Processing, Vol. 15, Num. 4, pp. 19-42, 2008.
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. New Frontiers in Artificial Intelligence: JSAI 2008 Conference and Workshops, Revised Selected papers, Lecture Notes in Computer Science, Vol. 5447, pp. 213-227, 2009.

Conference Papers (Selected)

  • Masato Hagiwara, Joshua Tanner. Project MOSLA: Recording Every Moment of Second Language Acquisition. LREC-COLING 2024 (Best Paper Candidate) [paper].
  • Masato Hagiwara, Marius Miron, Jen-Yu Liu. ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds. XAI-SA Workshop @ ICASSP 2024. [paper].
  • Masato Hagiwara. AVES: Animal Vocalization Encoder based on Self-Supervision. ICASSP 2023 [paper].
  • Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, Katie Zacarian. BEANS: The Benchmark of Animal Sounds. ICASSP 2023 [paper].
  • Yoshinari Fujinuma, Masato Hagiwara. Semi-Supervised Joint Estimation of Word and Document Readability. TextGraphs-15, 2021 [paper].
  • Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Ana Brassard, Masato Hagiwara, Jun Suzuki and Kentaro Inui. Diamonds in the Rough: Generating Fluent Sentences from Early-stage Drafts. INLG 2019 [paper].
  • Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki and Kentaro Inui. TEASPN: Framework and Protocol for Integrated Writing Assistance Environments. EMNLP (system demonstrations), 2019. [paper]
  • Burr Settles, Chris Brust, Erin Gustafson, Masato Hagiwara, Nitin Madnani. Second Language Acquisition Modeling. BEA 2018, 2018. [paper]
  • Ayah Zirikly, Masato Hagiwara. Cross-lingual Transfer of Named Entity Recognizers without Parallel Corpora. ACL 2015, pp. 390-396, 2015. [paper]
  • Masato Hagiwara, Satoshi Sekine. Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning. COLING 2014 system demonstration, pp. 39-43, 2014. [paper]
  • Haibo Li, Masato Hagiwara, Qi Li, Heng Ji. Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese, LREC 2014, pp.2532-2536, 2014. [paper]
  • Masato Hagiwara, Satoshi Sekine. Accurate Word Segmentation using Transliteration and Language Model Projection, ACL 2013, pp. 183-189. [paper]
  • Masato Hagiwara, Soh Masuko. KooSHO: Japanese Text Input Environment based on Aerial Hand Writing. The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), demo session, pp. 24-27. 2013. [paper]
  • Yuta Hayashibe, Masato Hagiwara, Satoshi Sekine. phloat : Integrated Writing Environment for ESL learners, Second Workshop on Advances in Text Input Methods (WTIM 2012), pp.57-72, 2012. [paper] [slides]
  • Masato Hagiwara, Satoshi Sekine. Latent Semantic Transliteration using Dirichlet Mixture. NEWS 2012 (the 4th Named Entities Workshop), pp. 30-37, 2012. [paper]
  • Graham Neubig, Yuichiroh Matsubayashi, Masato Hagiwara, Koji Murakami. Safety Information Mining — What can NLP do in a disaster —, Proc. of IJCNLP 2011. [paper]
  • Masato Hagiwara and Satoshi Sekine. Latent Class Transliteration based on Source Language Origins. Proc. of ACL-HLT 2011, pp. 53-57, 2011. [paper]
  • Masato Hagiwara and Hisami Suzuki. Japanese Query Alteration Based on Lexical Semantic Similarity. Proc. of NAACL HLT 2009, pp. 191-199, 2009. [paper]
  • Nobuyuki Shimizu, Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama and Hiroshi Nakagawa. Metric learning for synonym acquisition. Proc. of COLING 2008, pp. 793-800, 2008. [paper]
  • Masato Hagiwara. A Supervised Learning Approach to Automatic Synonym Identification based on Distributional Features. Proc. of ACL 2008 Student Research Workshop, pp. 1-6, 2008. [paper] [link]
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Context Feature Selection for Distributional Similarity. Proc. of IJCNLP 2008, pp. 553-560, 2008. [paper] [link]
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Selection of Effective Contextual Information for Automatic Synonym Acquisition. Proc. of COLING/ACL 2006, pp. 353-360, 2006. [paper] [link]
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. PLSI Utilization for Automatic Thesaurus Construction. Proc. of IJCNLP 2005, pp. 334-345, 2005. [paper]


Press


Blog

In English

In Japanese