Masato Hagiwara

I'm a Research Lead at Earth Species Project working on decoding non-human communication with AI/ML technologies. Currently I'm focusing on building foundation models (NatureLM-audio, AVES) and benchmarks (BEANS) for non-human animals. I call this field (non-human) Animal Language Processing, or ALP.

Formerly, I was a Machine Learning Engineer / Researcher at Duolingo. I love language and machine learning, and I help people connect the two. I speak Chinese, Japanese, and English fluently, and am learning Korean and Lojban. I led the launch of the Japanese, Korean, and Chinese courses on Duolingo. My research projects have been featured in TechCrunch and Quartz.

You can find my resume here.

News

Jan. 2025: 🎉 Exciting news! Our paper on NatureLM-audio, the first large audio-language model for understanding animal sounds, has been accepted at ICLR 2025! 🐦🔊
Nov. 2024: We announced NatureLM-audio, the first large audio-language model tailored for understanding animal sounds. I'm proud to have worked as one of the lead researchers on this project.
July 2024: I gave an invited talk titled "Intro to Animal Language Processing" at NLP Colloquium (a Japanese online NLP workshop).
Mar. 2024: Our paper (Project MOSLA) was accepted at LREC-COLING 2024 and nominated as one of the Best Paper Candidates!
Feb. 2024: Our paper (ISPA) was accepted at the XAI-SA Workshop co-located with ICASSP 2024.
Feb. 2023: Two papers (AVES and BEANS) that my colleagues at Earth Species Project and I co-authored were accepted at ICASSP 2023.
Nov. 2021: I'm joining Earth Species Project as a Senior AI Researcher. I'm thrilled to work on decoding non-human communication with AI/ML technologies!
Aug. 2021: I'll be giving an invited talk on "Machine Learning for Language Learning" hosted by Waseda University. See the official announcement and the talk slides for more info.
Jul. 2021: I'm working on a book about Japanese NLP with Paul O'Leary McCann. We'll be covering everything from tokenization/morphological analysis up to recent neural methods and BERT. See the official website for more info.
Apr. 2021: I'm happy to announce GrammarTagger, a neural multilingual grammar profiler, and EXPATS, a toolkit for explainable automated text scoring!
Apr. 2020: I'm now working with Mirai Translate, a Japan-based startup offering human-level machine translation services, and ACTNext, ACT's research and development unit for educational research.
Dec. 2019: We launched GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors. The paper was accepted to appear at LREC 2020.
Nov. 2019: I'm presenting our ultra fine-grained NER system at TAC KBP 2019, which ranked #2 among 9 strong competitors in the EDL track (joint work with Studio Ousia)!
Aug. 2019: Our paper, "TEASPN: Framework and Protocol for Integrated Writing Assistance Environments", has been accepted to appear at EMNLP 2019 (system demonstration)!
Jul. 2019: My book "Real-World Natural Language Processing" is available via MEAP, Manning Early Access Program. Feedback is welcome!

Projects

ALP (Animal Language Processing) — For the past few years, I've been the lead researcher on many projects on the application of AI for animal communications, including:
- NatureLM-audio, the first large audio-language model tailored for understanding animal sounds.
- AVES and BirdAVES, self-supervised, transformer-based audio representation models for encoding animal vocalizations ("BERT for animals").
- BEANS, a collection of bioacoustics tasks and public datasets, specifically designed to measure the performance of machine learning algorithms in the field of bioacoustics.
- ISPA, a precise, concise, and interpretable system designed for transcribing animal sounds into text ("IPA for animals").

NLP — I'm the main researcher and developer of many ML/NLP open source projects and datasets, including:
- TEASPN, a protocol and a framework for integrated writing environments
- Rakuten MA, a morphological analyzer for Chinese and Japanese written entirely in JavaScript
- NanigoNet, a language detector for code-mixed input supporting 150 human languages and 19 programming languages
- GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors
- Open Language Profiles, a platform for sharing open linguistic resources for language education

Education — I love teaching NLP to the world. Books/courses I wrote include:
- Real-World Natural Language Processing with Manning Publications
- Introduction to Japanese Natural Language Processing with Paul O'Leary McCann
- Official AllenNLP Course with Matt Gardner and the AllenNLP team

I'm also a co-author or a co-translator of the following Japanese books:
- Natural Language Processing: Basics and Technology (Shoeisha)
- Natural Language Processing in Python (O'Reilly)
- Machine Learning for Hackers (O'Reilly)

Duolingo - I built and worked on research for Duolingo, the most popular language learning app in the world, and Duolingo English Test, an affordable and accessible English certification test developed by Duolingo.
Music - In my free time, I create music and play jazz.

Experience

Nov. 2021 - Present: Research Lead - Earth Species Project (Berkeley, CA)
- Led the development of NatureLM-audio, the first audio LLM for understanding animal sounds.
- Developed AVES, a self-supervised foundation model for animal sounds ("BERT for animals")
- Built BEANS, a benchmark for bioacoustics ML models (“GLUE for animals”)

Feb. 2019 - Present: Owner & Independent NLP/ML Engineer and Researcher - Octanove Labs LLC (Seattle, WA)
- Worked as a consultant for early-to-mid stage startups in the US/Japan on their ML strategies
- Worked on QA and NER with Studio Ousia (ranked #2 at TAC KBP 2019 fine-grained NER track)
- Built educational research and open-source projects with RIKEN (TEASPN, NanigoNet, and GitHub Typo Corpus)
- Built a free, Web-based AllenNLP course in collaboration with Matt Gardner at Allen Institute for AI

Feb. 2015 - Feb. 2019: Senior Machine Learning Engineer / Researcher - Duolingo, Inc. (Pittsburgh, PA)
- Built automatic grading technologies for Duolingo English Test using neural networks
- Led data creation and analysis for various research projects, including user behavior analysis and second language acquisition modeling (SLAM) shared task
- Led content creation for the Chinese, Japanese, and Korean courses for English speakers

Oct. 2010 - Feb. 2015: Lead Scientist - Rakuten Institute of Technology (New York, NY)
- Developed machine transliteration (NLP2011 paper award) and machine translation algorithms for the largest Japanese e-commerce website (Rakuten)
- Built a Chinese/Japanese word segmentation / morphological analyzer (RakutenMA)
- Developed a writing support system for English as a Second Language (ESL) learners

Apr. 2009 - Sep. 2010: Research and Development Engineer - Baidu Japan, Inc. (Shanghai / Beijing / Tokyo)
- Led NLP data initiatives including Unnatural language processing contest and Baidu mobile corpus and timed corpus
- Improved the ranking and page analysis algorithms including spam detection and emoticon search for Baidu mobile search
- Worked as a consultant on various NLP projects including Japanese Input Method BaiduType

Apr. 2008 - Jul. 2008: Research Intern - Microsoft Research (Redmond, WA; Mentor: Hisami Suzuki)
- Built a state-of-the-art method for Japanese query alteration for spelling correction and spelling/transliteration normalization
- Implemented the system using Visual C#, SQL Server, and Ruby on tens of gigabytes of query logs; the system was integrated into Microsoft Live Search
- Published a research paper on the query alteration algorithm at NAACL 2009 and at the 3rd NLP Symposium for Young Researchers (Outstanding Presentation Award)

Aug. 2005 - Sep. 2005: Intern (Software Engineer), Google Inc. (Mountain View, CA; Mentors: Dekang Lin and Jun Wu)
- Improved Japanese query suggestion, which is currently used as the basis for the query suggestion shown at the top and bottom of the Google search result
- Ran knowledge extraction algorithms on Google's distributed computation infrastructure (MapReduce and large network clusters)

Education

Apr. 2006 - Mar. 2009: Ph.D., Information Engineering,
- Graduate School of Information Science, Nagoya University, Japan.
- Doctoral Thesis: "Modeling and Selection of Context for Better Synonym Acquisition"

Apr. 2004 - Mar. 2006: Master's Degree, Information Engineering,
- Graduate School of Information Science, Nagoya University, Japan
- Skipped a year of undergraduate study due to excellent academic performance. Overall GPA: 3.8
- Master's Thesis: "Utilization of Probabilistic Latent Semantics for Automatic Thesaurus Construction"

Apr. 2001 - Mar. 2004: Information Engineering Course, School of Engineering,
- Nagoya University, Japan. Computer Science GPA: 3.9

Awards & Professional Activities

Invited talk on “Education and AllenNLP” at AllenNLP Summit, 2019.
Co-organizer of the Workshop for Natural Language Processing Open Source Software (NLP-OSS), co-located with ACL 2018.
Invited keynote at the Optimizing Human Learning workshop co-located with ITS 2018 (Montréal, Canada, June 2018).
Invited talk at CUNY NLP Seminar (hosted by Prof. Heng Ji), April 2013. Title: Word Segmentation and Transliteration in Chinese and Japanese. [slides]
2011 Field Innovation Award from the Japanese Society for Artificial Intelligence: ANPI_NLP: Safety Information Confirmation Support using Natural Language Processing for The 2011 Tohoku Earthquake.
Paper Award at NLP2011 “Latent Class Transliteration based on Source Language Origins” (the largest Japanese NLP academic conference)
Best Paper Award at NLP2009 “Semantic Category Extraction from Unsegmented Text using Graph Kernels” (the largest Japanese NLP academic conference, chosen among 235 papers)
Paper Award at the 3rd NLP Symposium for Young Researchers. Presentation: “A Unified Approach to Japanese Query Alteration based on Semantic Similarity”

Publications

Books

Masato Hagiwara, Real-World Natural Language Processing, To be published by Manning Publications, 2019.
Yoh Okuno, Graham Neubig, Masato Hagiwara, Mamoru Komachi. Natural Language Processing: Basics and Technology (in Japanese). Shoeisha, 2016.
Drew Conway, John Myles White, 萩原正人 (Masato Hagiwara), 奥野陽 (Yoh Okuno), 水野貴明 (Takaaki Mizuno), 木下哲也 (Tetsuya Kinoshita) (translation). 入門機械学習 (Machine Learning for Hackers). O'Reilly Japan, 2012. O'Reilly Japan - 入門機械学習
Steven Bird, Ewan Klein, Edward Loper. 萩原正人 (Masato Hagiwara), 中山敬広 (Takahiro Nakayama), 水野貴明(Takaaki Mizuno) (translation). 入門自然言語処理 (Natural Language Processing with Python). O'Reilly Japan, 2010. O'Reilly Japan - 入門自然言語処理

Journal Papers

Burr Settles, Geoffrey T. LaFlair, Masato Hagiwara: Machine Learning–Driven Language Assessment. Transactions of the Association for Computational Linguistics, Vol. 8, pp. 247–263, 2020.
Masato Hagiwara, Koji Murakami, Graham Neubig, Yuichiroh Matsubayashi: Robust NLP for Real-world Data : 7. ANPI_NLP - Mining Safety Information after Disasters Using Natural Language Processing-. Information Processing Society of Japan Magazine. Vol. 53, No. 3, pp. 241-248, 2012.
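Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama: Incremental Lexical Knowledge Acquisition from Unsegmented Text Using Graph Kernels (グラフカーネルを用いた非分かち書き文からの漸次的語彙知識獲得, in Japanese). Journal of the Japanese Society for Artificial Intelligence (人工知能学会誌), Vol. 26, No. 3, pp. 440-450, 2011.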
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Supervised Synonym Acquisition Using Distributional Features and Syntactic Patterns. Journal of Natural Language Processing, Vol. 16, Num. 2, pp. 59-83, 2009.
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. A Comparative Study on Effective Context Selection for Distributional Similarity. Journal of Natural Language Processing, Vol. 15, Num. 5, pp. 119-150, 2008.
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Effective Use of Indirect Dependency for Distributional Similarity. Journal of Natural Language Processing, Vol. 15, Num. 4, pp. 19-42, 2008.
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. New Frontiers in Artificial Intelligence: JSAI 2008 Conference and Workshops, Revised Selected papers, Lecture Notes in Computer Science, Vol. 5447, pp. 213-227, 2009.

Conference Papers (Selected)

Masato Hagiwara, Joshua Tanner. Project MOSLA: Recording Every Moment of Second Language Acquisition. LREC-COLING 2024 (Best Paper Candidate) [paper].
Masato Hagiwara, Marius Miron, Jen-Yu Liu. ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds. XAI-SA Workshop @ ICASSP 2024. [paper].
Masato Hagiwara. AVES: Animal Vocalization Encoder based on Self-Supervision. ICASSP 2023 [paper].
Masato Hagiwara, Benjamin Hoffman, Jen-Yu Liu, Maddie Cusimano, Felix Effenberger, Katie Zacarian. BEANS: The Benchmark of Animal Sounds. ICASSP 2023 [paper].
Yoshinari Fujinuma, Masato Hagiwara. Semi-Supervised Joint Estimation of Word and Document Readability. TextGraphs-15, 2021 [paper].
Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Ana Brassard, Masato Hagiwara, Jun Suzuki and Kentaro Inui. Diamonds in the Rough: Generating Fluent Sentences from Early-stage Drafts. INLG 2019 [paper].
Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki and Kentaro Inui. TEASPN: Framework and Protocol for Integrated Writing Assistance Environments. EMNLP (system demonstrations), 2019. [paper]
Burr Settles, Chris Brust, Erin Gustafson, Masato Hagiwara, Nitin Madnani. Second Language Acquisition Modeling. BEA 2018, 2018. [paper]
Ayah Zirikly, Masato Hagiwara. Cross-lingual Transfer of Named Entity Recognizers without Parallel Corpora. ACL 2015, pp. 390-396, 2015. [paper]
Masato Hagiwara, Satoshi Sekine. Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning. COLING 2014 system demonstration, pp. 39-43, 2014. [paper]
Haibo Li, Masato Hagiwara, Qi Li, Heng Ji. Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese, LREC 2014, pp. 2532-2536, 2014. [paper]
Masato Hagiwara, Satoshi Sekine. Accurate Word Segmentation using Transliteration and Language Model Projection, ACL 2013, pp. 183-189, 2013. [paper]
Masato Hagiwara, Soh Masuko. KooSHO: Japanese Text Input Environment based on Aerial Hand Writing. The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT 2013), demo session, pp. 24-27. 2013. [paper]
Yuta Hayashibe, Masato Hagiwara, Satoshi Sekine. phloat : Integrated Writing Environment for ESL learners, Second Workshop on Advances in Text Input Methods (WTIM 2012), pp.57-72, 2012. [paper] [slides]
Masato Hagiwara, Satoshi Sekine. Latent Semantic Transliteration using Dirichlet Mixture. NEWS 2012 (the 4th Named Entities Workshop), pp. 30-37, 2012. [paper]
Graham Neubig, Yuichiroh Matsubayashi, Masato Hagiwara, Koji Murakami. Safety Information Mining — What can NLP do in a disaster —, Proc. of IJCNLP 2011. [paper]
Masato Hagiwara and Satoshi Sekine. Latent Class Transliteration based on Source Language Origins. Proc. of ACL-HLT 2011, pp. 53-57, 2011. [paper]
Masato Hagiwara and Hisami Suzuki. Japanese Query Alteration Based on Lexical Semantic Similarity. Proc. of NAACL HLT 2009, pp. 191-199, 2009. [paper]
Nobuyuki Shimizu, Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama and Hiroshi Nakagawa. Metric learning for synonym acquisition. Proc. of COLING 2008, pp. 793-800, 2008. [paper]
Masato Hagiwara. A Supervised Learning Approach to Automatic Synonym Identification based on Distributional Features. Proc. of ACL 2008 Student Research Workshop, pp. 1-6, 2008. [paper] [link]
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Context Feature Selection for Distributional Similarity. Proc. of IJCNLP 2008, pp. 553-560, 2008. [paper] [link]
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Selection of Effective Contextual Information for Automatic Synonym Acquisition. Proc. of COLING/ACL 2006, pp. 353 - 360, 2006. [paper] [link]
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. PLSI Utilization for Automatic Thesaurus Construction. Proc. of IJCNLP 2005, pp. 334 - 345, 2005. [paper]

Press

In English
- The best time of day to learn a new language, according to Duolingo data (Feb. 2018, Quartz)
- 3 habits of successful language learners (Mar. 2017, TechCrunch)
In Japanese
- How I work - Masato Hagiwara at Duolingo (Jan. 2016, Lifehacker.jp)
- Why you shouldn't study at weekends - Data reveal three common traits of successful language learners (Dec. 2016, TechCrunch Japan)
- Difference between successful and unsuccessful language learners, according to a researcher at Duolingo (Dec. 2016, Lifehacker.jp)
- Aptitude doesn't matter for language learning - Interview with Masato Hagiwara, a Japanese software engineer at Duolingo (Aug. 2015, Lifehacker.jp)
- Humans still learning languages in 30 years? (Aug. 2015, Lifehacker.jp)
- Free language learning app Duolingo raises $45 million from Google Capital (June 2015, Nikkei Computer)
- Process Emojis as 'words' - Emojis not used as defined (July 2010, INTERNET Watch)
- Process Emojis as 'words' - algorithm to distinguish 'beers' from 'parties' (July 2010, INTERNET Watch)
- Character encoding experts turn Baidu Emoji search episodes into an academic paper (Mar. 2010, INTERNET Watch)