기록하는 이유

Let's look at optimization algorithms that can speed up model training.

1. Mini-Batch

So far we have used batch gradient descent, which computes the gradient over the entire data set $X$ (shape $n_x \times m$).

Batch gradient descent has a problem: as $m$ grows, each gradient computation gets slower.

Mini-batch gradient descent splits the data set into mini-batches and computes the gradient on one mini-batch at a time, which makes each gradient computation faster.

ex) Splitting a data set with $m = 5{,}000{,}000$ into mini-batches of size 1,000 gives 5,000 mini-batches:

$X^{\{1\}}, X^{\{2\}}, \dots, X^{\{5,000\}}$ (each $n_x \times 1{,}000$)

$Y^{\{1\}}, Y^{\{2\}}, \dots, Y^{\{5,000\}}$ (each $1 \times 1{,}000$)

Mini-batch gradient descent runs one gradient descent step on each mini-batch $t$ ($X^{\{t\}}, Y^{\{t\}}$).

One gradient descent step on one mini-batch is 1 iteration, and one full pass over the training set is 1 epoch. With 5,000 mini-batches, 1 epoch consists of 5,000 iterations.
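The split into mini-batches and the number of updates in one pass over the data can be sketched in plain Python. This is a minimal sketch, not the course's implementation: the helper names `make_mini_batches` and `updates_per_pass` are hypothetical, the batches hold example indices instead of actual columns of $X$ and $Y$, and shuffling before the split is an added assumption.

```python
import math
import random


def make_mini_batches(num_examples, batch_size, seed=0):
    """Return a list of index lists, one per mini-batch (hypothetical helper)."""
    indices = list(range(num_examples))
    # Assumption: shuffle once so each mini-batch is a random sample.
    random.Random(seed).shuffle(indices)
    return [indices[start:start + batch_size]
            for start in range(0, num_examples, batch_size)]


def updates_per_pass(num_examples, batch_size):
    """Number of mini-batch updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)


batches = make_mini_batches(num_examples=10, batch_size=4)
print([len(b) for b in batches])           # [4, 4, 2]
print(updates_per_pass(5_000_000, 1_000))  # 5000
```

When the mini-batch size does not divide $m$ evenly, the last mini-batch is simply smaller, as in the 10-example case.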

Using a single example $x^{(t)}$ as the mini-batch $X^{\{t\}}$ (mini-batch size = 1) is called stochastic gradient descent.

Stochastic gradient descent has two drawbacks: its updates are noisy, and it loses the speed-up that vectorization provides.

Running mini-batch gradient descent with a suitable mini-batch size speeds up training.

Guidelines for choosing a suitable mini-batch size:

- If the training set is small, use batch gradient descent. (ex. m ≤ 2,000)

- Set the mini-batch size to a power of 2. (ex. 64, 128, 256, 512)

- Make sure one mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory.

2. Momentum / RMSprop / Adam

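First, let's look at the exponentially weighted average, the concept used by Momentum, RMSprop, and Adam.

Suppose we collect London's daily temperature for one year and plot it on a chart.

Taking a weighted average over the previous days' values reveals the trend of the London temperature.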

The exponentially weighted average is defined as:

$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$

With $\beta = 0.9$, expanding $v$ shows that the weight on older values shrinks exponentially:

$v_{100} = 0.1\,\theta_{100} + 0.1(0.9)\,\theta_{99} + 0.1(0.9)^{2}\,\theta_{98} + \dots + 0.1(0.9)^{99}\,\theta_{1}$

Because older temperatures count exponentially less, $v_t$ can be read as roughly the average temperature over the last $\frac{1}{1-\beta}$ days.
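The recurrence for $v_t$ is short enough to write out directly. The sketch below is only illustrative: the function names and the constant 10-degree temperature series are made up, and $v_0 = 0$ is assumed as the starting value.

```python
def exp_weighted_average(values, beta=0.9):
    """Return the running averages v_1..v_T, starting from v_0 = 0."""
    v = 0.0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


def effective_window(beta):
    """Approximate number of recent points the average spans: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


temps = [10.0] * 100  # hypothetical data: a constant 10 degrees every day
v = exp_weighted_average(temps, beta=0.9)

print(round(effective_window(0.9)))  # 10
print(0.9 ** 10)                     # about 0.35, i.e. roughly 1/e
print(v[0], v[-1])                   # v[0] is near 1.0, v[-1] is near 10.0
```

The first value is far below the true temperature of 10 because the average starts from 0, while after 100 days it has caught up.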

Because $v_0 = 0$, the early values of the exponentially weighted average tend to be smaller than the actual values.

Bias correction gives more accurate values in this early phase:

$v_t^{corrected} = \frac{v_t}{1 - \beta^{t}}$
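As a quick check of bias correction, the sketch below divides each running average by $1 - \beta^t$. The function name and the constant input series are hypothetical. For a constant series the corrected value should match the input from the very first step.

```python
def exp_weighted_average_corrected(values, beta=0.9):
    """Exponentially weighted average with bias correction v_t / (1 - beta**t)."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))
    return corrected


temps = [10.0, 10.0, 10.0]  # hypothetical constant temperatures
corrected = exp_weighted_average_corrected(temps)
print(corrected)  # every entry is approximately 10.0
```

As $t$ grows, $\beta^t$ approaches 0, so the correction only matters in the early steps.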

2.1 Momentum

Momentum applies the exponentially weighted average to gradient descent.

Momentum reduces the oscillation that appears while mini-batch gradient descent minimizes the cost.

$v_{dW} = \beta v_{dW} + (1 - \beta)\,dW$

$v_{db} = \beta v_{db} + (1 - \beta)\,db$

$W = W - \alpha v_{dW}$

$b = b - \alpha v_{db}$
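The update rule can be tried on a toy problem. The sketch below is an illustration under stated assumptions, not the course's code: the cost $J(w) = 0.5\,(w_0^2 + 25 w_1^2)$ is made up, a plain list `w` stands in for the parameters $W$ and $b$, and the exact gradient is used instead of a mini-batch gradient.

```python
def grad(w):
    """Gradient of the hypothetical cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def momentum_descent(w, alpha=0.05, beta=0.9, steps=500):
    """Gradient descent with momentum: step along the averaged gradient v."""
    v = [0.0 for _ in w]
    for _ in range(steps):
        dw = grad(w)
        # v = beta * v + (1 - beta) * dW
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, dw)]
        # W = W - alpha * v
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w


w_final = momentum_descent([10.0, 1.0])
print(w_final)  # both coordinates should end up very close to 0
```

The second coordinate has 25 times the curvature of the first, which is the kind of direction where plain gradient descent tends to oscillate.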

2.2 RMSprop

RMSprop stands for root mean square propagation. It divides each gradient by the root mean square (RMS) of its recent values, so directions with a large RMS take smaller steps and directions with a small RMS take relatively larger steps.

(A large RMS means large oscillation and a small RMS means small oscillation, so damping the directions that oscillate most speeds up cost optimization.)

$s_{dW} = \beta_{2} s_{dW} + (1 - \beta_{2})\,dW^{2}$

$s_{db} = \beta_{2} s_{db} + (1 - \beta_{2})\,db^{2}$

$W = W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$

$b = b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$

(The squares $dW^{2}$ and $db^{2}$ are element-wise.)
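A minimal sketch of the RMSprop update on a toy problem follows. The cost $J(w) = 0.5\,(w_0^2 + 25 w_1^2)$, the list `w` standing in for $W$ and $b$, and the hyperparameter values are all assumptions chosen for illustration.

```python
import math


def grad(w):
    """Gradient of the hypothetical cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def rmsprop_descent(w, alpha=0.01, beta2=0.9, eps=1e-8, steps=1000):
    """RMSprop: scale each gradient by the root of its running mean square."""
    s = [0.0 for _ in w]
    for _ in range(steps):
        dw = grad(w)
        # s = beta2 * s + (1 - beta2) * dW**2  (element-wise square)
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, dw)]
        # W = W - alpha * dW / (sqrt(s) + eps)
        w = [wi - alpha * gi / (math.sqrt(si) + eps)
             for wi, gi, si in zip(w, dw, s)]
    return w


w_final = rmsprop_descent([1.0, 1.0])
print(w_final)  # both coordinates should end up near 0
```

Because each gradient is divided by its own running RMS, the two coordinates take steps of similar size even though their curvatures differ by a factor of 25. With a constant $\alpha$ the result keeps jittering near the optimum at a scale comparable to $\alpha$ instead of settling exactly.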

2.3 Adam

Adam combines Momentum and RMSprop.

The recommended values are $\beta_{1} = 0.9$, $\beta_{2} = 0.999$, and $\epsilon = 10^{-8}$.

$v_{dW} = \beta_{1} v_{dW} + (1 - \beta_{1})\,dW$ , $v_{db} = \beta_{1} v_{db} + (1 - \beta_{1})\,db$ ← Momentum

$s_{dW} = \beta_{2} s_{dW} + (1 - \beta_{2})\,dW^{2}$ , $s_{db} = \beta_{2} s_{db} + (1 - \beta_{2})\,db^{2}$ ← RMSprop

$v_{dW}^{corrected} = \frac{v_{dW}}{1 - \beta_{1}^{t}}$ , $v_{db}^{corrected} = \frac{v_{db}}{1 - \beta_{1}^{t}}$ , $s_{dW}^{corrected} = \frac{s_{dW}}{1 - \beta_{2}^{t}}$ , $s_{db}^{corrected} = \frac{s_{db}}{1 - \beta_{2}^{t}}$ ← Bias correction

$W = W - \alpha \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$ , $b = b - \alpha \frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$ ← Adam
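The four steps (Momentum, RMSprop, bias correction, update) can be put together in a few lines. This is a sketch under assumptions, not a reference implementation: the cost $J(w) = 0.5\,(w_0^2 + 25 w_1^2)$, the learning rate `alpha=0.01`, and the list `w` standing in for $W$ and $b$ are made up for illustration. Only the $\beta_1$, $\beta_2$, $\epsilon$ defaults come from the text.

```python
import math


def grad(w):
    """Gradient of the hypothetical cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def adam_descent(w, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: momentum average v, squared-gradient average s, bias correction."""
    v = [0.0 for _ in w]
    s = [0.0 for _ in w]
    for t in range(1, steps + 1):
        dw = grad(w)
        # Momentum term
        v = [beta1 * vi + (1 - beta1) * gi for vi, gi in zip(v, dw)]
        # RMSprop term (element-wise square)
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, dw)]
        # Bias correction
        v_hat = [vi / (1 - beta1 ** t) for vi in v]
        s_hat = [si / (1 - beta2 ** t) for si in s]
        # Parameter update
        w = [wi - alpha * vh / (math.sqrt(sh) + eps)
             for wi, vh, sh in zip(w, v_hat, s_hat)]
    return w


w_final = adam_descent([1.0, 1.0])
print(w_final)  # both coordinates should end up near 0
```

The step counter `t` starts at 1 so that the bias-correction denominators are never zero.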

3. Learning rate decay

With mini-batch gradient descent, the parameters may keep wandering around the optimum without converging.

Decreasing the learning rate $\alpha$ as the epochs progress lets them converge to the optimum.

$\alpha = \frac{1}{1 + \text{decayRate} \times \text{epochNum}}\,\alpha_{0}$
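The decay formula is easy to tabulate. In the sketch below the function name and the sample values `alpha0=0.2` and `decay_rate=1.0` are hypothetical, picked only to make the shrinking learning rate visible.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1.0 + decay_rate * epoch_num)


schedule = [decayed_learning_rate(alpha0=0.2, decay_rate=1.0, epoch_num=epoch)
            for epoch in range(1, 5)]
print(schedule)  # roughly 0.1, 0.067, 0.05, 0.04
```

A larger `decay_rate` shrinks $\alpha$ faster. At `epoch_num = 0` the formula returns $\alpha_0$ unchanged.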

It may seem that an optimization algorithm will end up converging to a local optimum, but in an actual neural network the probability of that is very low: most points where the gradient is 0 are saddle points.
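To make the idea of a saddle point concrete, here is a tiny two-dimensional example. The function $f(x, y) = x^2 - y^2$ is a standard textbook saddle chosen for illustration and does not come from the course material.

```python
def f(x, y):
    """A simple saddle surface: curves up along x and down along y."""
    return x * x - y * y


def grad_f(x, y):
    """Gradient of f."""
    return (2 * x, -2 * y)


gx, gy = grad_f(0.0, 0.0)
print(gx == 0 and gy == 0)          # True: the gradient vanishes at the origin
print(f(0.1, 0.0) > f(0.0, 0.0))    # True: the cost rises along x
print(f(0.0, 0.1) < f(0.0, 0.0))    # True: the cost falls along y
```

The gradient is 0 at the origin, yet the point is not a minimum because the cost still decreases along $y$.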

https://www.coursera.org/learn/deep-neural-network/home/week/2
