Thesis Topics

General Topic Areas

We offer thesis topics in the areas of mixed models, joint modelling, quantile regression, and mixture density (networks). In all cases, the methods are developed further both in a Bayesian framework and through statistical learning approaches. Feel free to contact us if you would like to work on a topic from these areas!

Further thesis topics in the field of statistics are offered by the Chair of Statistics.

Bachelor's Theses

After the economic crisis of 2008, there were many protests worldwide against the political and economic measures taken, and voting behavior changed as well. Based on a European data set, the influence of the protests and of further economic and sociological factors on election results is to be analyzed. The method to be used is Dirichlet regression.


Contact: elisabeth.bergherr@uni-goettingen.de

This thesis is based on a comprehensive data set containing pre-match and post-match variables from the top five European football leagues over recent years. The main goal is to quantify the explanatory power of a model's post-match variables. The variables to be predicted are the number of goals scored and the match outcome (win, draw, or loss). The proposed models are generalized linear mixed models. Given the high correlation between the variables, the research question allows for at least two separate theses using the following methods:
Approach I: The first approach uses model-based gradient boosting for mixed models, which performs variable selection and estimates the effects simultaneously.
Approach II: The second approach uses data engineering techniques to integrate dimension reduction (e.g., PCA, canonical correlation) and variable selection before the mixed models are estimated by maximizing their penalized likelihood.
Before applying the above methods, the thesis will include a brief overview of previous studies on football prediction and an exploratory data analysis to gain deeper insights into football prediction.

Proposals for your own research question using this data set are also welcome.

Contact: lars.knieper@uni-goettingen.de

Categorical regression models such as multinomial, ordered logit, and sequential models deal with non-binary categorical response variables. Variables coded in this way, and the research questions associated with them, are common in social science surveys. The bachelor's thesis examines the application of these model types in a specific area of the social sciences (e.g., election results, agreement with social issues, (life) satisfaction). After a description of the methods and a brief literature review, at least one model type is applied to a real data set (e.g., ALLBUS, SOEP, Eurobarometer, ...) and the results are interpreted.

The substantive focus of the thesis is guided by the interests of the applicant. The thesis can be written in German or English.
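As a small illustration of one of the model types named above, the following sketch computes the category probabilities of a cumulative (ordered) logit model. It assumes the common parametrization P(Y <= k) = logistic(theta_k - eta); the function names, the thresholds, and the value of the linear predictor are hypothetical and chosen only for illustration.

```python
import math


def logistic(z):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(eta, thresholds):
    """Category probabilities of a cumulative logit model.

    Assumes P(Y <= k) = logistic(theta_k - eta) with increasing thresholds
    theta_k. The probability of each category is the difference between
    adjacent cumulative probabilities.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be increasing")
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical example: four ordered satisfaction categories.
probs = ordered_logit_probs(eta=0.5, thresholds=[-1.0, 0.0, 1.5])
print([round(p, 3) for p in probs])
```

With three thresholds the model has four categories, and a larger linear predictor shifts probability mass towards the higher categories.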

Contact: sophie.potts@uni-goettingen.de

This bachelor's thesis is offered in cooperation with Destatis (the Federal Statistical Office of Germany). Destatis conducts annual household budget surveys (EVS and LWR), in which household members have to report their expenditures for at least one month. To reduce the burden on participants, a new feature of the smartphone app used is intended to let participants take a photo of their receipt instead of entering each expenditure item manually. The thesis addresses how expenditure items from receipts can be successfully assigned to COICOP ("Classification of Individual Consumption by Purpose"). One way to achieve this is string matching of expenditure items against already classified consumer price data (e.g., using the method of Kaufman & Klevs 2022).
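To give a rough idea of what such a string matching step could look like, here is a minimal sketch based on the similarity ratio from Python's standard library. It is not the adaptive method of Kaufman & Klevs (2022); the reference items, the category labels (which are not official COICOP codes), the function names, and the threshold `min_score` are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: already classified item descriptions mapped to
# illustrative category labels (not official COICOP codes).
CLASSIFIED_ITEMS = {
    "whole grain bread 500g": "bread and cereals",
    "whole milk 1l": "milk, cheese and eggs",
    "ground coffee 250g": "coffee, tea and cocoa",
}


def normalize(text):
    """Lowercase the text and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def best_match(receipt_item, classified=CLASSIFIED_ITEMS, min_score=0.6):
    """Return (category, matched_reference, score), or None below min_score."""
    query = normalize(receipt_item)
    best = None
    for reference, category in classified.items():
        score = SequenceMatcher(None, query, normalize(reference)).ratio()
        if best is None or score > best[2]:
            best = (category, reference, score)
    if best is not None and best[2] >= min_score:
        return best
    return None


# A receipt line with a typo and different spacing still finds its category.
result = best_match("Whole Grain Bred 500 g")
print(result)
```

Items without a sufficiently similar reference entry return `None` and would need manual classification or a different method.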



Kaufman, A. R., & Klevs, A. (2022). Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field. Political Analysis, 30(4), 590-596.

Contact: sophie.potts@uni-goettingen.de
Contact: lars.knieper@uni-goettingen.de


Master's Theses

Model-based component-wise gradient boosting is a popular tool for data-driven variable selection in regression models. To improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed (e.g., probing, stability selection, deselection). This thesis gives an overview of the modifications and compares their performance for Generalized Additive Models with regard to variable selection and prediction accuracy, based on an extensive simulation study and an application to (common R or real-world) data sets.

A special focus can be set on the different types of base-learners (linear, splines, tree-based, spatial).
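For orientation, the basic algorithm can be sketched in a few lines. The example below is a minimal illustration that assumes squared-error loss, simple linear base-learners on centered covariates, and a fixed step length; it is not the implementation of any particular package, and the function name, the simulated data, and all parameter values are hypothetical.

```python
import random


def componentwise_l2_boost(X, y, n_iter=50, step=0.1):
    """Component-wise L2 boosting with one linear base-learner per column.

    X: list of rows (covariates assumed centered), y: list of responses.
    Returns the intercept and the coefficients. Coefficients that stay
    exactly 0.0 belong to covariates that were never selected.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coef = [0.0] * p
    residuals = [yi - intercept for yi in y]
    columns = [[row[j] for row in X] for j in range(p)]
    sum_sq = [sum(v * v for v in col) for col in columns]

    for _ in range(n_iter):
        best_j, best_beta, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(columns):
            # Least-squares fit of the current residuals on covariate j alone.
            beta = sum(v * r for v, r in zip(col, residuals)) / sum_sq[j]
            rss = sum((r - beta * v) ** 2 for v, r in zip(col, residuals))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        # Update only the best-fitting component, damped by the step length.
        coef[best_j] += step * best_beta
        residuals = [r - step * best_beta * v
                     for r, v in zip(residuals, columns[best_j])]
    return intercept, coef


# Simulated data: only the first two of five covariates are informative.
rng = random.Random(1)
n, p = 200, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
means = [sum(row[j] for row in X) / n for j in range(p)]
X = [[row[j] - means[j] for j in range(p)] for row in X]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0.0, 0.5) for row in X]

intercept, coef = componentwise_l2_boost(X, y, n_iter=50, step=0.1)
print([round(c, 2) for c in coef])
```

Because only one component is updated per iteration, stopping early leaves the coefficients of rarely or never selected covariates at or near zero, which is the source of the intrinsic variable selection.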

Contact: sophie.potts@uni-goettingen.de

Component-wise gradient boosting methods are known for their good performance in case of correlated covariates, since in each iteration the model is updated only using a small fixed step length. When boosting GAMLSS, using adaptive step lengths can result in more balanced submodels for the different distributional parameters. The aim of this thesis is to investigate the performance of an adaptive step length approach compared to an approach using fixed step lengths in a setting with correlated covariates. This is done with an extensive simulation study.


Contact: alexandra.daub@uni-goettingen.de

When estimating GAMLSS, variable selection is often an issue. Due to their intrinsic variable selection, component-wise boosting approaches can provide a remedy, where naturally the order of the updates plays an important role. This order of updates can depend on the base-learner selection criterion. There are two natural criteria for the selection of the base-learner update in GAMLSS: the inner and the outer loss (see for example Thomas et al. (2018), Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates), which can, however, in some cases yield different base-learner updates. The aim of this thesis is to investigate how this selection criterion affects the variable selection as well as the estimated coefficients in the overall model, considering, among other things, different types of covariates. This is done with an extensive simulation study.


Contact: alexandra.daub@uni-goettingen.de

The application of regression models such as GLMs or GAMs is commonly based on distributional assumptions. As a consequence, outliers that violate these assumptions have the potential to heavily dominate the results of a model. The aim of this thesis is to implement different penalization strategies for distributional regression models via the framework of artificial neural networks.


Contact: tbs.hepp@fau.de

Mixture regression models are used in scenarios with unobserved heterogeneity, i.e. the (assumed) presence of different latent classes in the data. Depending on the number of latent classes and the corresponding (distributional) regression models, this quickly results in many unknown parameters to be modeled via separate prediction functions. Given their identifiability, component-wise boosting algorithms are generally able to estimate the unknown coefficients of this model framework, but the optimal specification of the algorithm remains unclear. Therefore, this thesis will investigate and evaluate different initialization and updating strategies inside the boosting algorithm.


Contact: tbs.hepp@fau.de

Numerical approximation methods allow us to find solutions to complex problems that generally cannot be solved analytically. Such problems commonly arise in spatial models, where covariates are introduced to explain a response variable and we also need to account for how they vary over space. This model structure makes an analytical solution difficult. For this reason, the Laplace approximation has gained popularity as a fast and simple way to approximate the solution, especially for spatial models. However, it is known that this method can fail for small numbers of observations and under other specific circumstances. In such cases, we could consider another approximation method, the “saddlepoint approximation”, and evaluate its behavior in spatial models.
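The idea behind the Laplace approximation can be illustrated on a one-dimensional integral, far simpler than a spatial model. The sketch below approximates the integral of exp(h(x)) by exp(h(x_hat)) * sqrt(2 * pi / -h''(x_hat)), where x_hat maximizes h, and applies it to n! written as the integral of x^n * exp(-x) over the positive reals. The function names are hypothetical, and the role of n here is only a loose analogy to sample size.

```python
import math


def laplace_log_integral(h, h_second, x_hat):
    """Log of the Laplace approximation to the integral of exp(h(x)) dx.

    x_hat must be the maximizer of h, with h_second(x_hat) < 0.
    """
    curvature = -h_second(x_hat)
    if curvature <= 0:
        raise ValueError("h must be concave at x_hat")
    return h(x_hat) + 0.5 * math.log(2.0 * math.pi / curvature)


def relative_error(n):
    """Relative error of the Laplace approximation to n! = Gamma(n + 1)."""
    def h(x):
        return n * math.log(x) - x      # log of the integrand x^n * exp(-x)

    def h_second(x):
        return -n / x ** 2              # second derivative of h

    # h is maximized at x = n, since h'(x) = n / x - 1.
    approx_log = laplace_log_integral(h, h_second, x_hat=float(n))
    exact_log = math.lgamma(n + 1)
    return abs(math.exp(approx_log - exact_log) - 1.0)


for n in (1, 5, 50):
    print(n, relative_error(n))
```

In this toy case the approximation reduces to Stirling's formula, and its relative error shrinks as n grows, so it is least accurate when the integrand is far from a Gaussian shape.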


Contact: j.cavieres.g@gmail.com

Template Model Builder (TMB) is frequentist statistical software that enables the estimation of parameters for non-linear models through Laplace approximation and automatic differentiation. This software offers computational efficiency, as the Laplace approximation method provides a close approximation to the true solution, while automatic differentiation automatically computes first and second derivatives. In contrast, Stan is probabilistic software specifically designed for Bayesian inference. It employs the Hamiltonian Monte Carlo algorithm as the estimation method to derive posterior distributions for model parameters. By using the “tmbstan” library and applying prior distributions to the parameters, we can easily transform a model developed in TMB into a Bayesian framework, so we can work directly with all the features of a Stan object.


Contact: j.cavieres.g@gmail.com

In geo-additive models, spatial confounding might be observed when a covariate is correlated with the spatial field. This phenomenon occurs when the spatial field estimates mask the correlated fixed effects' estimate, thereby inducing bias. Recently, Dupont et al.'s "Spatial+" gained attention by proposing a straightforward approach, which can be used for generalized models as well. In accordance with previous research (restricted spatial regression, geo-additive structural equation model), the focus has been on continuous confounded covariates. The thesis aims to concentrate on binary and categorical confounded covariates and to compare different approaches to spatial confounding. This includes giving an overview of existing methods, suggesting approaches in the manner of Spatial+, and accompanying these with an extensive simulation study.


Contact: lars.knieper@uni-goettingen.de

There is a large body of literature on the relationship between marital satisfaction (satisfaction with the relationship) and divorce. Some studies focus on the mediating effect of marital satisfaction, e.g., for the relationship between (perceived) inequity in the division of household labor and divorce. However, many studies do not take advantage of the richness of panel data, i.e., they use only two time points or focus on only one of the two outcomes (marital satisfaction, risk of divorce). The class of joint models for longitudinal and time-to-event data overcomes this problem and is also able to deal with the endogeneity of marital satisfaction. Endogeneity arises from the fact that the trajectory of marital satisfaction over the course of the marriage is unlikely to be independent of the event outcome (divorce/no divorce). This thesis implements a joint model to disentangle the relationships between marital happiness, divorce and (perceived) inequity in the division of household labor. The pairfam dataset is used to exploit the richness of longitudinal data. As the structure of the data set allows a couple data approach, this may be included as well. Some prior knowledge of time-to-event analysis may be helpful but is not mandatory.


Contact: sophie.potts@uni-goettingen.de

Are there distinct groups of newlyweds' satisfaction trajectories? How are they characterized? Lavner & Bradbury (2010) found latent groups of satisfaction trajectories among newlyweds, but how they differ with respect to divorce was examined using rates rather than time-to-event analysis models. The class of joint latent class models for longitudinal and time-to-event data overcomes this problem and is also able to deal with the endogeneity of marital satisfaction. The endogeneity arises from the fact that the trajectory of marital satisfaction throughout the marriage is unlikely to be independent of the event outcome (divorce/no divorce). This thesis implements a joint latent class mixed model for longitudinal and time-to-event data to enrich the findings of Lavner & Bradbury (2010). For this purpose, the pairfam dataset can be used. Some prior knowledge of time-to-event analysis is recommended.


Lavner, J. A., & Bradbury, T. N. (2010). Patterns of change in marital satisfaction over the newlywed years. Journal of Marriage and Family, 72(5), 1171-1187.

Contact: sophie.potts@uni-goettingen.de