GLUE and SuperGLUE
The General Language Understanding Evaluation (GLUE) benchmark consists of 9 English sentence-understanding tasks used to gauge how well an NLP model performs on general language tasks. The 9 tasks were chosen to vary in goal, training data volume, and language style/genre, and to be difficult. The assumption is that a model that does well on this diverse set of tasks will also do well on language tasks of general interest.
SuperGLUE was created later for the same purpose. It follows the same framework but differs in three key ways: 1) the tasks are harder, 2) the task formats are more varied and complex (going beyond sentence- and sentence-pair classification), and 3) human performance baselines are included.
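As a concrete illustration, the sketch below loads one task from each benchmark with the Hugging Face `datasets` library. This is just one common way to access the data; the `"glue"` and `"super_glue"` dataset names and the `"mrpc"`/`"boolq"` task configs follow that library's conventions, and exact availability may vary by library version.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library is installed
# (pip install datasets). Task names follow that library's config conventions.
from datasets import load_dataset

# MRPC (paraphrase detection) is one of the 9 GLUE tasks.
glue_mrpc = load_dataset("glue", "mrpc")
# Each example pairs two sentences with a binary paraphrase label:
# {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
print(glue_mrpc["train"][0])

# BoolQ (yes/no question answering over a passage) is a SuperGLUE task,
# illustrating the more complex formats beyond sentence-pair classification.
superglue_boolq = load_dataset("super_glue", "boolq")
# {'question': ..., 'passage': ..., 'label': ..., 'idx': ...}
print(superglue_boolq["train"][0])
```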