The audio sounds like the TTS synthesizers created by Taiwanese universities about 10 years ago.
There’s the Facebook one, which is based on neural networks.
The interface for this one is voice to voice, instead of just text to voice. You have to manually select the model for Taigi to English or English to Taigi. There are 2 models for each direction: the unity models use a Mandarin middle layer, and the s2ut models don’t.
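
To make the manual selection described above concrete, here is a rough sketch of the four choices as a lookup. It only models the picking logic; it does not load or run anything. The language labels ("taigi", "english"), the `ModelChoice` type, and the `list_models` and `pick_model` helpers are all made up for illustration and are not real model identifiers or part of any library.

```python
from typing import List, NamedTuple


class ModelChoice(NamedTuple):
    """One selectable model (hypothetical record, for illustration only)."""
    source: str
    target: str
    family: str                # "unity" or "s2ut", as named in the post
    uses_mandarin_layer: bool  # True if the model goes through Mandarin


# Assumption: family names and the Mandarin property are taken from the post.
FAMILIES = {"unity": True, "s2ut": False}

# The two supported directions; these labels are placeholders.
DIRECTIONS = {("taigi", "english"), ("english", "taigi")}


def list_models(source: str, target: str) -> List[ModelChoice]:
    """Return the two models available for one translation direction."""
    if (source, target) not in DIRECTIONS:
        raise ValueError(f"unsupported direction: {source} -> {target}")
    return [
        ModelChoice(source, target, family, via_mandarin)
        for family, via_mandarin in FAMILIES.items()
    ]


def pick_model(source: str, target: str, allow_mandarin_layer: bool) -> ModelChoice:
    """Pick the model for a direction based on whether a Mandarin layer is OK."""
    for choice in list_models(source, target):
        if choice.uses_mandarin_layer == allow_mandarin_layer:
            return choice
    raise LookupError("no matching model")


if __name__ == "__main__":
    for model in list_models("taigi", "english"):
        print(model)
    print(pick_model("english", "taigi", allow_mandarin_layer=False))
```

The point is just that the choice has two independent parts: the direction, and whether you want the Mandarin step in the middle.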