PSD estimation computes a power spectrum from finite, noisy data. The periodogram (|DFT|²/N) is simple, but its variance does not shrink as the record gets longer; the Welch method averages the periodograms of shorter segments to reduce variance at the cost of frequency resolution. Parametric methods assume a signal model and achieve higher resolution from shorter records, but fail if the model is misspecified. All methods trade off bias, variance, and resolution.
Your prerequisites give you two essential building blocks: the autocorrelation function / PSD relationship (the Wiener-Khinchin theorem, which tells you the PSD is the Fourier transform of the autocorrelation) and the DFT/FFT as a practical tool for computing spectra from finite data. PSD estimation is the problem of combining these to get a useful, accurate power spectrum from a real measurement — which is always finite in length and contaminated with noise. The theory and the reality turn out to be frustratingly different, and understanding why produces the key insight of this topic.
The naive approach is the periodogram: take your N-point data record x[n], compute its DFT X[k], and form the estimate Ŝ(k) = |X[k]|²/N. This follows directly from the definition of the PSD, and it is asymptotically unbiased (more data → less bias). The problem is variance. A classical result says that for a wide-sense stationary Gaussian process, the variance of the periodogram at each frequency (away from DC and Nyquist) does not decrease as N increases; it stays approximately equal to the squared true PSD value: Var[Ŝ(k)] ≈ S²(k). In other words, the standard deviation of each estimate is about as large as the quantity being estimated. A longer record gives you more frequency bins, not better ones, so the periodogram is an inconsistent estimator. No matter how much data you collect, adjacent periodogram bins fluctuate wildly, and in practice a raw periodogram looks jagged and is nearly useless for identifying spectral features with confidence.
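To see the variance problem numerically, here is a minimal sketch that assumes unit-variance white Gaussian noise (true PSD equal to 1 at every frequency under the |X[k]|²/N scaling). The helper names `periodogram` and `bin_variance`, the trial count, and the chosen bin are illustrative assumptions, not part of any library.

```python
import numpy as np

def periodogram(x):
    """Raw periodogram |X[k]|^2 / N of a real record (one-sided bins)."""
    n = len(x)
    return np.abs(np.fft.rfft(x)) ** 2 / n

def bin_variance(n, trials=2000, seed=0):
    """Empirical variance of one interior periodogram bin over many records."""
    rng = np.random.default_rng(seed)
    k = n // 4  # an interior bin, away from DC and Nyquist
    values = [periodogram(rng.standard_normal(n))[k] for _ in range(trials)]
    return float(np.var(values))

# The true PSD is 1, so theory predicts a variance near 1 for both lengths.
for n in (64, 1024):
    print(f"N = {n:5d}  variance of S_hat at one bin = {bin_variance(n):.2f}")
```

If the result above holds, both printed variances should land near 1 even though the second record is sixteen times longer: the extra data went into more bins, not into more reliable ones.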
The Welch method solves this with a simple but powerful idea: trade frequency resolution for variance reduction by averaging. Divide the N-point record into K overlapping segments of length M (typically with 50% overlap), multiply each segment by a window function, compute a periodogram for each windowed segment, then average the K periodograms. Averaging K independent estimates would reduce variance by a factor of K; overlapping segments share data and are therefore correlated, so the actual reduction is somewhat smaller, but the spectrum is still smoothed substantially. The cost is frequency resolution: each segment of length M produces a frequency grid with spacing Δf = f_s/M, and the window's main lobe widens the effective resolution further. Shorter segments → more averages → lower variance, but coarser frequency resolution, more smoothing bias, and less ability to distinguish closely spaced spectral features. Longer segments → finer resolution but fewer averages → higher variance. This is the fundamental bias-variance-resolution tradeoff, and Welch parameter selection (segment length, overlap, window function) is an engineering judgment call based on which matters more for the application.
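The segment-window-average procedure can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not a reference one: the function name `welch_psd`, the Hann window, and the normalization by the window energy (chosen so that a rectangular window reduces to |X[k]|²/N) are all choices made for this example.

```python
import numpy as np

def welch_psd(x, seg_len, overlap=0.5):
    """Average the windowed periodograms of overlapping segments of x."""
    step = max(1, int(seg_len * (1 - overlap)))  # hop between segment starts
    window = np.hanning(seg_len)                 # assumed window choice
    scale = np.sum(window ** 2)                  # equals seg_len for a rectangular window
    starts = range(0, len(x) - seg_len + 1, step)
    segs = [np.abs(np.fft.rfft(window * x[s:s + seg_len])) ** 2 / scale
            for s in starts]
    return np.mean(segs, axis=0), len(segs)

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)                    # white noise, true PSD = 1

raw = np.abs(np.fft.rfft(x)) ** 2 / len(x)       # single full-length periodogram
psd, n_seg = welch_psd(x, seg_len=256)

print("segments averaged:", n_seg)
print("bin-to-bin spread, raw periodogram:", np.std(raw[1:-1]))
print("bin-to-bin spread, Welch estimate: ", np.std(psd[1:-1]))
```

The price shows up in the output length: the raw periodogram has 2049 one-sided bins, while the Welch estimate has only 129, so the frequency spacing went from f_s/4096 to f_s/256. In practice you would likely reach for a library routine such as `scipy.signal.welch` rather than hand-rolling this.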
Parametric methods, most commonly AR (autoregressive) spectral estimation, take a different approach: instead of averaging periodograms, assume the signal was generated by a specific model (e.g., white noise passed through an AR filter) and estimate the model parameters from the data. Given good model parameters, you can evaluate the implied PSD analytically on an arbitrarily fine frequency grid, even from short data records; how well that curve matches the truth depends on how accurately the parameters were estimated, not on the record length through Δf. The Burg and Yule-Walker methods are standard AR parameter estimators. The enormous advantage is super-resolution: you can resolve two closely spaced spectral peaks that the Welch method would smear together. The enormous risk is model misspecification: if the true signal doesn't follow an AR model (or you choose the wrong AR order), the estimated spectrum can show spurious peaks or miss real features entirely. Parametric methods are powerful for narrowband or line-spectrum signals (vibration analysis, radar) but fragile for broadband or poorly characterized sources. Choosing between Welch and parametric estimation requires knowing something about the signal you're measuring.
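As a rough illustration of the parametric route, the sketch below uses the Yule-Walker equations (not Burg) with the biased autocorrelation estimate. The function names `yule_walker` and `ar_psd`, the sign convention x[n] = Σ a[k] x[n−k] + e[n], and the synthetic AR(2) test signal with a resonance near 0.2 cycles/sample are all assumptions made for this example.

```python
import numpy as np

def yule_walker(x, order):
    """Estimate AR coefficients and driving-noise variance via Yule-Walker."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])   # model: x[n] = sum_k a[k] x[n-k] + e[n]
    sigma2 = r[0] - np.dot(a, r[1:])
    return a, sigma2

def ar_psd(a, sigma2, freqs):
    """AR model PSD sigma2 / |1 - sum_k a[k] e^{-j 2 pi f k}|^2, f in cycles/sample."""
    k = np.arange(1, len(a) + 1)
    denom = 1 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ a
    return sigma2 / np.abs(denom) ** 2

# Synthetic AR(2) process with poles at radius 0.95 and angle 2*pi*0.2.
rng = np.random.default_rng(0)
true_a = np.array([2 * 0.95 * np.cos(2 * np.pi * 0.2), -0.95 ** 2])
e = rng.standard_normal(612)
x = np.zeros(612)
for n in range(2, 612):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + e[n]
x = x[100:]                         # drop the start-up transient

a_hat, s2_hat = yule_walker(x, order=2)
freqs = np.linspace(0, 0.5, 1025)   # grid can be as fine as you like
psd = ar_psd(a_hat, s2_hat, freqs)
f_peak = freqs[np.argmax(psd)]
print("estimated coefficients:", a_hat)
print("spectral peak at f =", f_peak, "cycles/sample")
```

Note that the frequency grid here is independent of the record length; only two coefficients and a noise variance were estimated from the data. This example works because the data really were generated by an AR(2) model of the assumed order, which is exactly the condition that fails under model misspecification.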