Posts in the 'Data Engineering & DataOps' category (Page 4)


List: Data Engineering & DataOps (35)

without haste but without rest

Kafka resources worth reading

- Sending 1 TB of data every 96 seconds with Kafka: https://www.confluent.io/blog/scaling-kafka-to-10-gb-per-second-in-confluent-cloud/
  Scaling Kafka to 10+ GB/Second in Confluent Cloud. Behind the magic: how Confluent Cloud scales Kafka to over 10 gigabytes per second in seven clicks, with zero downtime. (www.confluent.io)
- Kafka benchmark: https://www.confluent.io/blog/kafka-fastest-messaging-system/#:~:text=Throughput%3A%20Kafka%..

Data Engineering & DataOps/Kafka 2021. 5. 21. 10:54

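Setting up Kafka quickly in containers with the Confluent cp-all-in-one image

0. Summary
Use the Docker Compose file that Confluent provides to stand up a Kafka broker quickly. This assumes you know how to use Docker and Docker Compose. The advantage is that Kafka is usable right away, and because the setup is based on a Docker Compose file, port numbers, options, and so on are easy to change. It provides a Kafka cluster, ksqlDB, Control Center, ZooKeeper, and more. Note that the file is intended for development, and production use is prohibited. (There is a separate community version, but that file has no Control Center.)
1. Download the cp-all-in-one file
confluentinc/cp-all-in-one docker-compose.yml files for cp-all-in-one , cp-all-in-one-community, ..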

Data Engineering & DataOps/Kafka 2021. 5. 13. 15:01

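Handling duplicate Kafka messages with Python

※ This code is hard to use when data keeps arriving in real time with no idle gaps. It applies the memoization idea from dynamic programming. The data I currently collect has a queue structure that is refreshed every hour, and each refresh delivers the existing data plus the newly added data rather than only the new data. So an empty dictionary is declared before the while loop starts, and that data structure is used for the duplicate check.

    if __name__ == "__main__":
        memoization_dict = {}
        while True:
            # Check for duplicates and extract new records
            res_list = []
            for raw in raw_list:  # raw_list is the refreshed data that was received.
                key = raw["serial"]
                try:
                    if memoization_dict[key]:
                        continue
                except:
                    res_..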
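The duplicate check described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the post's full code (which is truncated in this excerpt): the helper name `extract_new_records` and the sample batches are hypothetical, and only the `serial` key and the dictionary declared before the loop come from the excerpt.

```python
def extract_new_records(raw_list, seen):
    """Return the records whose "serial" is not yet in `seen`, and mark them as seen."""
    fresh = []
    for raw in raw_list:
        key = raw["serial"]
        if key in seen:
            # Already delivered in an earlier refresh, so skip it.
            continue
        seen[key] = True
        fresh.append(raw)
    return fresh


# Hypothetical batches: the second refresh repeats the first and appends one record.
seen = {}
batch_1 = [{"serial": "a1", "value": 10}, {"serial": "a2", "value": 20}]
batch_2 = batch_1 + [{"serial": "a3", "value": 30}]

first = extract_new_records(batch_1, seen)
second = extract_new_records(batch_2, seen)
print([r["serial"] for r in first])   # ['a1', 'a2']
print([r["serial"] for r in second])  # ['a3']
```

Using `key in seen` avoids the try/except lookup. The dictionary only grows in this sketch, which is one reason the approach suits periodic refreshes better than an uninterrupted stream.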

Data Engineering & DataOps 2021. 4. 29. 13:19

Airflow on Docker

Running Airflow in Docker — Airflow Documentation airflow.apache.org

Update History
2022.03.04 Apple Silicon - Airflow 2.0.2 raised an error, so the version was changed to 2.2.4.

1. Installation

    mkdir Airflow-Demo
    # step 1. Download the yaml file
    curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.4/docker-compose.yaml'
    # step 2. Apply the default settings
    docker compose up airflow-init
    # step 3. Run Docker Compose
    docker compose up -d
    # step 4. Check that it is running
    docker co..

Data Engineering & DataOps/Airflow 2021. 4. 23. 10:40

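HDFS NameNode fails to start on txid - "We expected txid..." issue

While changing Hadoop settings, I formatted the NameNode while the cluster was still running. After I ran stop-all.sh, the NameNode and the DataNodes were out of sync and would not start. I found an error log similar to the following: "java.io.IOException: There appears to be a gap in the edit log. We expected txid 266, but got txid 2672". I searched extensively but found no suitable fix. Deleting every file under "hadoop-data", the directory for the Hadoop NameNode and DataNodes, got the cluster running again. The data is wiped along with it, so a backup is essential.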

Data Engineering & DataOps/Hadoop 2021. 3. 2. 10:57

Hadoop web UI permission problem - dr.who permission denied

Add the property below to the core-site.xml file. For "your-hadoop-user-name", enter the name of the user that runs Hadoop. After adding the setting, restart Hadoop.

    name: hadoop.http.staticuser.user
    value: your-hadoop-user-name
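Hadoop configuration files store each setting as a `<property>` element with `<name>` and `<value>` children inside `<configuration>`. The sketch below renders the property from this post in that shape, so it can be pasted into core-site.xml. The helper name `static_user_property` is hypothetical, and the user name is the post's own placeholder.

```python
import xml.etree.ElementTree as ET


def static_user_property(user_name):
    """Build the <property> element that sets hadoop.http.staticuser.user."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hadoop.http.staticuser.user"
    ET.SubElement(prop, "value").text = user_name
    return prop


# Replace the placeholder with the user that runs Hadoop.
snippet = ET.tostring(static_user_property("your-hadoop-user-name"), encoding="unicode")
print(snippet)
```

The printed element goes inside the existing `<configuration>` block of core-site.xml.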

Data Engineering & DataOps/Hadoop 2021. 2. 25. 11:13

Troubleshooting "no module name airflow" in Airflow

Quick start — Airflow Documentation airflow.apache.org

Running the Docker Compose file from the Airflow quick start as-is prints the log 'No Module name 'airflow' and the airflow-init image fails to run. On Linux, the quick start includes a step that sets permissions as follows.

    mkdir ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Check AIRFLOW_UID and AIRFLOW_GID with the command below. If AIRFLOW_UID is not 50000, change it to 50000 and run again. (Leaving the group unchanged does not affect execution.) ca..
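The fix described above amounts to rewriting one line of the .env file. A minimal sketch of that rewrite, assuming the .env layout produced by the quick start command; the helper name `force_airflow_uid` and the sample UID 1000 are hypothetical.

```python
def force_airflow_uid(env_text, uid="50000"):
    """Return .env text with AIRFLOW_UID set to `uid`, leaving other lines untouched."""
    lines = []
    found = False
    for line in env_text.splitlines():
        if line.startswith("AIRFLOW_UID="):
            line = f"AIRFLOW_UID={uid}"
            found = True
        lines.append(line)
    if not found:
        # No AIRFLOW_UID entry yet, so append one.
        lines.append(f"AIRFLOW_UID={uid}")
    return "\n".join(lines) + "\n"


# Hypothetical .env content written by the quick start step on a host whose UID is 1000.
original = "AIRFLOW_UID=1000\nAIRFLOW_GID=0\n"
updated = force_airflow_uid(original)
print(updated, end="")
```

AIRFLOW_GID is left as-is, matching the post's note that the group does not need to change.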

Data Engineering & DataOps/Airflow 2021. 2. 22. 16:36

[HDFS] Copying HDFS files to AWS S3

Reference - https://medium.com/dataseries/copy-hadoop-data-hive-to-s3-bucket-d1ffb59279c8
Copy Hadoop Data — Hive to S3 Bucket. WHAT IS S3: S3 stands for “Simple Storage Service” and is offered by Amazon Web Services. It provides a simple to use file object storage… (medium.com)
A short article that summarizes the process.

Reference - https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_admin_distcp_data_cluster_migrate.html Copyi..

Data Engineering & DataOps/Hadoop 2020. 8. 6. 12:06
