STATE-OF-THE-PRACTICE IN EVALUATION OF QUALITY OF EXPERIENCE IN REAL-LIFE ENVIRONMENTS



INTRODUCTION
In the last 5 years we have witnessed a rapid expansion of different services which can now be used on many different devices, platforms and networks. A few years ago a mobile phone and a portable computer were able to satisfy almost all communication needs of a typical user. Nowadays, the functionality of smartphones is far beyond that of older mobile phones. Concurrently, tablets have occupied a market segment which did not exist in the past, and computer producers are making lighter portable computers (e.g. nettops) which combine the best features of desktop computers, tablets and smartphones. All of these devices are connected to the Internet, which provides fertile ground for the development of new markets offering services such as office in the cloud, web storage, media sharing and others.
On their path from origin to destination, IP packets are liable to traverse several network sections belonging to (sometimes) different network and Internet service providers, which results in network delay and packet loss, common anomalies in today's packet-switched networks such as the Internet. These anomalies may degrade service quality, which has an adverse effect on user satisfaction. As elaborated in [1], the key question in IP-based networks is how to achieve a certain level of Quality of Service (QoS) and, at the same time, keep the related costs acceptable for all subjects on the telecommunication market (end users, network and service providers).
Since different applications react differently to network anomalies, it is useful to categorize them. In [2] six application classes are defined based on their network performance demands, which are expressed by the maximum values of the following QoS parameters: delay, delay variation (jitter) and packet loss. In general, there are two mechanisms which may be used to satisfy the QoS demands of specific applications. The first is the Integrated Services model (IntServ) described in [3] and the second is the Differentiated Services model (DiffServ) described in [4]. However, these models are only partially implemented in certain network segments. Hence, the Internet still provides only the Best Effort type of service, with no quality guarantees whatsoever.
In this environment, where a spectrum of devices and services coexists with no real quality guarantees from the network, it is important to understand how users perceive service quality, especially when it is degraded. This information is crucial for network and service provision, because efficient resource provisioning and QoS management are among the paramount factors of success on a competitive market filled with fickle customers. However, providing enough resources to a specific user does not automatically increase their level of satisfaction, because from their perspective it is irrelevant how a specific service is delivered or what the structure, design and/or performance of the network and its segments is. What matters most is the service quality on the application layer [5]. This fact was also stressed by Bai et al. in [6], who concluded that it is perceptual quality, rather than network-perceived service quality, that determines the success or failure of a network. Similarly, Digital Subscriber Line (DSL) Forum experts stated that subscribers are indifferent to how quality is achieved. Their sole interest is whether or not the service meets their expectations [7].
In the light of these statements, it is also noteworthy that Ghinea et al. found that the video quality perceived by viewers depends not only on picture and audio quality, but also on the viewer's age group and the type of content being tested [8]. Similar results were presented by Dick et al. in [9]. While experimenting with real-time network games on a sample of 8 players, they showed that under the same network conditions (delay and delay variation) the quality of the game play experience varies between players. Conversely, the authors of [10] showed that players will continue to use the application even when the values of QoS parameters are considerably worse than the defined QoS demands of the specific application. This is because they do not want to end their session. In [11] it is emphasized that the results of a video quality analysis may depend on the viewer's experience and expectations. We also refer the reader to the work of Kemp [12], who studied the impact of the price of a service on user satisfaction, as well as to [13,14,15], where the impact of content quality and diversity was analyzed.
In addition to these findings, another emerging problem is highlighted: the ranking of network parameters by their importance. For instance, Ghinea et al. concluded in [13] that some QoS parameters affect user perception more than others. As an example they described the case of video quality evaluation: the viewer will hardly notice that one of the 25 frames per second in a video clip was missing during playback, but will most likely detect any type of lip synchronization problem immediately. In [16] it is reported that users are more forgiving if the picture is delayed in relation to the sound than in the opposite situation. Finally, we mention the work of Menkovski et al., who concluded that it is highly inefficient to assign resources to a user if they will not notice the improvement [17].
Since it is evident that many quantitative and qualitative parameters are important in the evaluation of service quality, and that an analysis of QoS parameters does not cover all important elements, another term was defined: Quality of Experience (QoE). The QoE concept provides a holistic view of the evaluation of service quality and imposes certain requirements on the evaluation methods. One of these indispensable requirements is the need to move the experiments out of artificial environments (laboratories) and into more natural surroundings. Hence, in this paper we provide a review of the state-of-the-practice in subjective evaluation of QoE in real-life environments, where the services are actually used. The remainder of this paper is structured as follows. Section 2 briefly describes the QoE concept and the changes which it introduced. Two main types of methods for evaluation of service quality are discussed in Section 3, while Section 4 is devoted solely to subjective evaluation of QoE in real-life environments. Section 5 brings conclusions as well as the outline of our future research in this field.

Definitions of Quality of Experience
After more than a decade of testing the quality of various services, the understanding of this process has profoundly changed and broadened. This is why we start by providing a short background of the QoE concept. In 2006 Lopez et al. defined QoE as "an extension of the traditional QoS in the sense that QoE provides information about the delivered service from an end-user point of view" [18]. Soldani et al. in [19] stated that "QoE is how a user perceives the usability of a service when in use - how satisfied he/she is with a service in terms of, e.g., usability, accessibility, retainability and integrity." The International Telecommunication Union (ITU) mentions QoE for the first time in [20], where QoE is defined as "the overall acceptability of an application or service, as perceived subjectively by the end-user." Two additional notes were given to that definition. The first one stresses that QoE includes the complete end-to-end system effects (client, terminal, network, services infrastructure, etc.). The second one is more user-oriented and emphasizes that overall acceptability may be influenced by user expectations and context.
The European Telecommunications Standards Institute defined QoE as "user perceived experience of what is being presented by a communication service or application user interface" [21]. According to the DSL Forum, QoE is "the overall performance of a system from the point of view of the users. QoE is a measure of end-to-end performance at the services level from the user perspective and an indication of how well the system meets the user's needs" [7]. Similar to the QoE definition of the DSL Forum experts, Joskowicz et al. in [22] defined QoE as "overall performance of a system, from the user perspective." In 2010 Möller provided a more hedonistic definition of QoE: "Degree of delight of the user of a service. In the context of communication services, it is influenced by content, network, device, application, user expectations and goals, and context of use" [23]. To our knowledge, the latest step towards defining QoE was made in 2012 by the group of experts of the European Network on Quality of Experience in Multimedia Systems and Services. They accepted and expanded Möller's definition by concluding that "Quality of Experience is the degree of delight or annoyance of the user of an application or service. It results from the fulfilment of his or her expectations with respect to the utility and/or enjoyment of the application or service in the light of the user's personality and current state" [24].
It is evident that QoE is not uniformly defined, because the understanding of it is still evolving. However, various authors and research organizations have emphasized that QoE focuses on the user's (subjective) perception of the quality of a service/application, in contrast to a mere objective evaluation of QoS parameters. The definitions stress the importance of end-to-end service quality, since typical users are not concerned with the performance achieved in specific network segments. It could be said that QoS provides an insight into network-level service quality, while QoE gives information about the quality on the service level. Some authors call this service level the new pseudo-level and claim that it represents an expansion of the application layer that includes user perception (e.g. Nieblas et al. in [25]).
Figure 1 depicts the scope of QoE evaluation. This representation corresponds to the aforementioned QoE definitions, because it implies that a much broader context has to be taken into account. Achieving the desired level of QoE for a specific service still undoubtedly requires a certain level of network performance in the access and core network (which can be measured), but QoS parameters are no longer the only measure of success. Different psychological factors also affect the user's QoE, such as the user's previous experience with the service and their expectations, internal state (condition, feelings) and other parameters which have to be investigated through subjective tests (usually via surveys).

Changes in the technological environment
Even though the quality of many different services had been tested on many different devices and networks long before the advent of the QoE concept, it can be argued fairly convincingly that the term quality was properly understood only after the introduction of QoE. Primarily, the QoE concept promotes a layered view of service quality, but this time the layers are not confined by, for example, the classical views of the OSI model. The concept encourages a holistic and above all interdisciplinary evaluation approach [26]. To this end, we quote Kilkki, one of the leading experts in this field: "QoE is everything that matters." Due to its innovativeness, it is clear that QoE also caused certain changes in the technological environment. Some of them are:
- As explained in [27], QoE demands should be the primary target of future DiffServ mechanisms.
- The network capacity should be designed to accommodate QoE demands, but this does not necessarily mean that QoE demands are stricter than QoS demands.
- New capacity allocation techniques and admission control mechanisms should be developed in mobile networks, where performance is strongly affected by the radio conditions experienced by the users (see the work of Larté in [28] and [29]).
- A revision of pricing policies is needed, as indicated in [30].
- New objective and subjective methods for evaluation of QoE should be defined in order to take into account the factors influencing the user's QoE.
To cope with these changes, a series of interdisciplinary meetings was launched. The idea is to bring experts and practitioners in this field together with social scientists, psychologists, marketing experts etc., and to discuss issues and challenges within the QoE domain. The meetings are often held as part of scientific conferences, forums, workshops and similar events. The first of these meetings was held in Leibniz in 2009. As Fiedler, Kilkki and Reichl report in [31], during this meeting it was stressed that QoE had introduced a considerable number of changes in the technological environment and that considerable effort needed to be invested in defining QoE and methods for QoE evaluation, as well as in investigating its impact on the market, economy and standardization.

METHODS OF QoE EVALUATION
Depending on the type of parameters which are measured, there are two kinds of methods which can be used in the evaluation of service quality: objective and subjective. Parameters that are measurable (e.g. with instruments) and to which a performance value is assigned quantitatively may be classified as objective parameters, while subjective or qualitative parameters are those which can be expressed using human judgment and understanding [32].
Over the years numerous studies have developed a wide spectrum of evaluation methods. Some authors, like Perkis et al. in [33], emphasize that this variety has caused a problem: by using the same evaluation methodology on the same type of service, different authors get different results. This problem was identified earlier by the authors of [17], who stressed that a more holistic evaluation approach was needed. In this respect, the work of Kunze et al. in [34] can be helpful. After surveying telecommunication experts, they ranked 14 selection criteria by their importance (Figure 2). The criteria may be used when choosing the appropriate evaluation method. The criteria which were not indicated by the experts (Other) were: Validity, Reliability, Objectivity, Generalizability, Representativeness, Results, Consistency/Credibility, Thoroughness, Robustness and Fairness.

Objective methods
The main goal of every objective evaluation method is to provide, to some extent, results similar to those of subjective tests with actual users of the service. This is simply because subjective tests are often highly demanding in terms of required resources. There are three main categories of objective methods, depending on the availability of the original, non-processed signal (e.g. audio or video): a) Full-Reference methods; b) No-Reference methods; c) Reduced-Reference methods. Furthermore, five types of objective models exist today: a) media-layer models; b) packet-layer models; c) bitstream-layer models; d) hybrid models; e) planning models. Media-layer models use actual media signals as their input, packet-layer models use only information from the header of IP packets, while bitstream-layer models take not only the encoded bitstream information, but also the packet header information as their input. A hybrid model is a combination of the previously mentioned models. It employs as much information as possible to predict QoE. The input for planning models includes the quality planning parameters of networks or terminals. Such models can be applied to network planning and terminal/application design [22].
In our view, hybrid models that use quantitative (objective) and qualitative (subjective) inputs are particularly interesting for QoE evaluation. Nowadays, considerable effort is being invested into the development of such models (see e.g. [17,35,36,37,38]).

Subjective methods
By accepting the definition of a speech quality test given by Jekosch in [39], ITU-T in [40] divides subjective evaluation methods into two categories: analytical and utilitarian. Analytical methods are used when the goal is to test the user's perception of quality on a full set of quality characteristics, while utilitarian methods are used when it is necessary to test only one quality characteristic or the service quality as a whole. Subjective methods will always be the most precise approach to any quality evaluation, and therefore also to the evaluation of QoE, because subjective tests are the only available tool for collecting data about user expectations, opinions, perception and experience of the service. Furthermore, subjective methods are complementary to objective methods because, as mentioned earlier, hybrid objective models can also use the results of subjective tests in order to derive the QoE score more accurately. This is why objective methods will never be able to fully replace subjective tests.
In general, subjective methods require building a panel of human observers. During subjective testing the observers use the service/application, after which they fill in a questionnaire about the service quality. The most commonly used measure of the level of subjective quality is the Mean Opinion Score (MOS).
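As an illustration of how a panel's ratings are condensed into a MOS, the following minimal Python sketch computes the arithmetic mean of a set of ratings together with an approximate 95% confidence interval. The 1 to 5 rating scale matches the scale used in the studies reviewed later in this paper; the function name, the example ratings and the use of a normal approximation for the interval are assumptions made for illustration only and are not taken from any of the cited works.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a sequence of individual opinion scores on a 1-5 scale
    (assumed scale; other scales would need different bounds).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")

    # MOS is simply the arithmetic mean of the individual scores.
    mos = statistics.fmean(ratings)

    if len(ratings) < 2:
        # A confidence interval cannot be estimated from a single rating.
        return mos, float("nan")

    # Normal approximation (z = 1.96); for small panels a Student-t
    # quantile would be the more cautious choice.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from a panel of five observers.
panel_ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(panel_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean is useful when comparing results from small panels, since two MOS values that differ by less than their intervals may not reflect a real difference in perceived quality.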
Since QoE evaluation methods evolved from QoS evaluation methods, the subjective testing of QoE often takes place in a laboratory. Usually, the procedures of these tests are rigorously defined by different recommendations and standards of international organizations. However, services are not used in such artificial environments. Since QoE includes such a wide range of factors, we believe that the most accurate evaluation of QoE can only happen in real-life environments.

EVALUATION OF QoE IN REAL-LIFE
It might be surprising to discover that rather few subjective evaluations of QoE have been conducted in real-life environments. This was also indicated in 2012 by Van den Broeck et al. in [41]. There are two main reasons: a) it is simpler to conduct subjective tests in the laboratory; b) due to very well defined and detailed procedures, results obtained from subjective tests conducted in the laboratory can be easily compared to each other.
To our knowledge, one of the first real-life subjective tests of QoE was conducted by Reichl et al. in [42]. The service in focus was mobile multimedia streaming. They installed two cameras and a WiFi transmitter on a woman's hat. The woman wore the hat during her everyday routine (leisure, shopping etc.), as well as when she was using the aforementioned application (e.g. while waiting at a bus station). The cameras recorded her facial expressions and the signal was transmitted to a nearby operator equipped with storage devices. Later, the stored video could be analysed in order to determine the degree of the woman's enjoyment, frustration, boredom, etc. The authors call their unusual approach LiLiPUT (Lightweight Lab Equipment for Portable User Testing in Telecommunications). Given that this method is highly impractical, it can be said that the primary objective of their research was to encourage future real-life experiments.
The same type of application was analysed in [43] by Jumisko-Pyykkö et al., but with more test subjects. There were two groups of test subjects. The subjects in the first group used the application at a train station, in a bus and in a coffee shop. During multimedia streaming sessions the quality of 60-second video clips was degraded by packet loss rates of 1.7%, 6.9%, 13.8% and 20.7%. The length of the time interval in which the loss occurred varied between 1, 4, 8 and 12 seconds. The second group of test subjects viewed the same video clips under the same network conditions, but in a controlled environment. The results showed that the first group of users did not notice as many impairments as the second group, i.e. the QoE score of the first group of users was higher. In their conclusions the authors confirmed the earlier findings of Kaikkonen et al. from [44]: results of laboratory tests can suggest that a specific service or application has higher QoS demands than is actually the case. Hence, Kaikkonen et al. raise the question of the usability of such experiments (results). Finally, regarding the research of Jumisko-Pyykkö et al., it is stressed that the evaluation of QoE using short video clips (60 seconds) does not quite match real-life quality perception, i.e. in real-life tests it is necessary to increase the duration of the test sequences, as indicated in [45].
Staelens et al. in [46] and [47] report on the results of the evaluation of QoE of full-length movies, whose quality was degraded in several segments by introducing packet loss and using video coded at lower bit rates. After receiving copies of a DVD video, the test subjects were asked simply to watch the movie as they would normally do (e.g. in the comfort of their home) and to evaluate its quality afterwards by completing a questionnaire (the users did not know the topic of the questionnaire prior to watching). Concurrently, the tests were conducted in a controlled environment with a second group of test subjects. The obtained results differed substantially. In general, the first group rated the quality of the DVD movie higher than the second group. In addition to the authors' findings, it has to be stressed that human short-term memory is one of the factors which surely affected the results. This conclusion was drawn based on the results of Jelassi et al. in [48], who tested the QoE of a VoIP service. It should also be stressed that even though the tests were conducted in a natural environment, where users normally consume this kind of video content, the impact of the social context and of user habits was not analysed (the importance of these factors was highlighted in [49,50]).
In [51] the authors developed a mobile application capable of monitoring device activity and QoS parameters during the day, and distributed it to 30 test subjects. Moreover, at least three times a day the application activated a short questionnaire on the user's device. The users were asked to quickly rate, on a scale from 1 to 5, the quality of the mobile service which they had just finished using. Once a week, an interview with the test subjects was conducted in order to collect data about the social context in which specific services were used (whether they were alone or not) and the user's physical condition (e.g. whether they were driving, walking or sitting while using the service). Among the several conclusions drawn by these authors, the following two are pointed out:
1. Users who use a specific type of application on their computers rate the mobile version of that application lower, i.e. their previous experience significantly affects their QoE of mobile applications.
2. Based on the users' routines it was evident that different sets of applications were used in the morning, in the evening, in the car and outside the office.
The authors conclude that the user rating is influenced by the user's environment and the importance of the mobile application to the task at hand.
Strohmeier et al. in [52] subjectively evaluated the QoE of users while they were watching 3D video content in a coffee shop. The authors claim that this is the environment in which this type of content is normally consumed, although we cannot completely agree with that claim. The quality of the content was not degraded in any way. Contrary to the conclusions reported in [43,44,46,47], Strohmeier et al. state that there were no real differences between the results of real-life and laboratory testing.
The QoE of web browsing sessions was evaluated by Ataeian et al. in [53]. The authors designed an add-on application which was installed in the web browsers of 35 test subjects. During browsing sessions, the application recorded the response time of web sites and collected quality scores from the subjects for each response time experienced (more than 1,000 data pairs were obtained). Based on the collected data, the authors developed fuzzy membership functions for different fuzzy sets. Each response time was assigned to one or more fuzzy sets (five fuzzy sets were defined, from MOS 1 to MOS 5).
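To make the idea of assigning one response time to several fuzzy sets more concrete, the sketch below maps a response time to degrees of membership in five fuzzy sets labelled MOS 1 to MOS 5. It is only an illustration under stated assumptions: the trapezoidal shape and every breakpoint (in seconds) are hypothetical values chosen for readability, whereas the membership functions in [53] were derived from the collected data and are not reproduced here.

```python
import math


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints in seconds (assumed, NOT taken from [53]).
# MOS 5 covers the fastest responses, MOS 1 the slowest ones.
FUZZY_SETS = {
    5: (0.0, 0.0, 1.0, 2.0),
    4: (1.0, 2.0, 3.0, 4.0),
    3: (3.0, 4.0, 6.0, 8.0),
    2: (6.0, 8.0, 10.0, 12.0),
    1: (10.0, 12.0, math.inf, math.inf),
}


def memberships(response_time_s):
    """Return {MOS level: degree} for every fuzzy set with non-zero membership."""
    if response_time_s < 0:
        raise ValueError("response time cannot be negative")
    degrees = {}
    for level, params in FUZZY_SETS.items():
        degree = trapezoid(response_time_s, *params)
        if degree > 0.0:
            degrees[level] = degree
    return degrees


# A response time between two plateaus belongs partly to both neighbouring sets.
print(memberships(1.5))
print(memberships(20.0))
```

With these assumed breakpoints, a response time that falls on a slope between two plateaus belongs partly to both neighbouring sets, which is the sense in which a single measurement can be assigned to more than one fuzzy set.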
Finally, in [41] the quality of multimedia streaming of Koksijde City Council meetings was analysed. In total, 42 test subjects were asked to watch live streams of the meetings from their homes and rate their quality. Since this was a specific type of content, the users stressed the importance of audio quality, so the authors devoted their attention only to this aspect of the service. However, the authors were unable to connect the QoS parameters experienced by the users with the QoE scores, since they did not have information about the network performance during the multimedia streaming sessions.

CONCLUSION AND FUTURE WORK
The QoE concept brought a new perspective to the evaluation of service quality and defined new requirements in terms of data collection, because the analysis of QoE has to include a set of concatenated objective and subjective parameters. Perhaps the biggest change in evaluation methodology is the fact that quality must be assessed in real-life conditions and not only in laboratories. As indicated by several authors, the differences between the results obtained in these two environments are substantial, which raises the question of the usability of specific methodologies. In general, users in real-life environments are more forgiving of quality impairments. In addition, real-life experiments can give an insight into the social context in which the services are used, the user's condition (physical and psychological) etc.
From the presented review it is noticeable that the evaluation of multimedia content in real-life environments has gained momentum over the last 2-3 years. This could have been expected, simply because Internet traffic today contains a significant amount of compressed video traffic. The availability of broadband access in fixed and mobile networks, together with the development of various devices (smartphones, tablets, smart TVs, etc.), has certainly affected the popularity of multimedia streaming services.
However, it was also noticed that the majority of subjective QoE evaluation methods still rely on laboratory tests. As a result, we did not come across any objective evaluation model which uses the results of subjective tests conducted in real-life environments in its inference system (e.g. for training a neural network or for defining the rules of a machine learning or fuzzy-based inference system). Therefore, in our future research we will try to develop such a model, which could be used for the evaluation of QoE of a multimedia streaming service. We plan to create an emulated network environment and stream one-hour HD videos between two computers connected in a peer-to-peer connection. The emulated environment will enable full control over the QoS parameters. During streaming, we plan to degrade the video quality by introducing various packet loss rates. The degraded video will also be stored in HD format. Several segments of the degraded video (of different lengths) will be inserted into the original video. These videos will be recorded on DVD and distributed to test subjects, who will watch and rate them at home. Their video quality scores, together with the known values of the packet loss rate, the length of the time interval in which the loss occurred and the number of those intervals in one video, will be used to develop a hybrid objective model whose inference system will be based on fuzzy logic. The questionnaire will also contain questions about the social context, the user's condition, their previous experience and expectations, as well as the user's perception of the magnitude of quality degradation, etc. Fuzzy logic is chosen for two reasons: a) QoE parameters are fuzzy in nature; b) a fuzzy inference system operates with linguistic variables, which is convenient when conclusions have to be drawn based on a combination of objective and subjective input parameters.

Figure 2 - Criteria for choosing the evaluation method (based on the work of Kunze et al.)